How much information is in a tweet?
There are a lot of snarky answers to that. Let's talk mathematically, where information is measured in bits: How many bits can be expressed in a tweet?
It's kind of fun to try to figure out how to cram in the most information. Our current in-house record is 4.2 kilobits (525 bytes) per tweet, but this can definitely be bested. Twitter's 140-character limit has been under assault for some time, but nobody has decisively anointed a winner.
To that end, announcing the 1st International Longest Tweet Contest, along the lines of the International Obfuscated C Code Contest. The goal: fit the most bits of information into a tweet. There will be glorious fame and a T-shirt for the winner. More on that later. But first, a dialog:
Alyssa P. Hacker: Not quite -- despite its SMS roots, Twitter supports Unicode, and the company says its 140-character limit is based on Unicode characters, not ASCII. So it's a lot more than 980 bits.
Ben: Ok, Unicode is a 16-bit character set, so the answer is 140 × 16 = 2,240 bits.
Alyssa: Well, since that Java documentation was written, Unicode has expanded to cover more of the world's languages. These days there are 1,112,064 possible Unicode characters (officially "Unicode scalar values") that can be represented, including in the UTF-8 encoding that Twitter and most of the Internet uses. That makes Unicode about 20.1 bits per character, not 16. (Since log2 1,112,064 ≈ 20.1.)
Ben: Ok, if each character can take on one of 1,112,064 possible values, we can use that to figure out how many total different tweets there can ever be. And from that, we'll know how many bits can be encoded into a tweet. But how?
Alyssa: It's easy! We just calculate the total number of different tweets that can ever be received. The capacity of a tweet, in bits, is just the base-2 logarithm of that number. According to my calculator here, 1,112,064 to the 140th power is about 28.7 quattuordecillion googol googol googol googol googol googol googol googol. That's the number of distinct 140-character messages that can be sent. Plus we have to add in the 139-character messages and 138-character messages, etc. Taking the log, I get 2,811 bits per tweet.
Ben: We'd get almost the same answer if we just multiplied 20.1 bits per character times 140 characters. Ok, 2.8 kilobits per tweet.
Alyssa: I just discovered a problem. There aren't that many distinct tweets! For example, I can't tell the difference between the single character '<' and the four characters '<'. I also can't send a tweet that contains nothing but some null characters. So we have to deflate our number a little bit. It's tricky because we don't know Twitter's exact limitations; it's a black box.
Ben: But the answer will still be roughly 2.8 kilobits. Can we do better?
Alyssa: By golly, I think we can! Turns out Unicode is basically identical to an international standard, called the Universal Multi-Octet Coded Character Set, known as UCS or ISO/IEC 10646. But the two standards weren't always the same -- in the 90s, there was a lot of disagreement between the computer companies who backed Unicode and proposed a simple 16-bit character set, and the standards mavens behind UCS, who wanted to accommodate more languages than 65,536 characters would allow. The two sides eventually compromised on the current 20.1-bit character set, but some historical differences still linger between the two "official" specifications -- including in the definition of the UTF-8 encoding that Twitter uses. Check this out:
UTF-8 according to Unicode
UTF-8 according to ISO/IEC 10646
Alyssa: Although Unicode's UTF-8 can only encode the 1,112,064 Unicode scalar values, UCS's version of UTF-8 is a lot bigger -- it goes up to 31 bits!
Ben: So, when Twitter says they use UTF-8, which version are they using? Unicode, or UCS?
Alyssa: Hmm, that's a good question. Let's write a program to test. I'll send Twitter a really big UCS character in UTF-8, represented in octal like this. This is way outside the bounds of well-formed Unicode.
$ perl -Mbytes -we 'print pack "U", (2**31 - 5)' | od -t o1 0000000 375 277 277 277 277 273
Ben: Hey, that kind of worked! When we fetch it back using Twitter's JSON API, we get this JSON fragment:
Alyssa: It's the same thing we sent! The character (really an unassigned code position) is represented in UTF-8, in octal with backslashes in ASCII. That means Twitter "characters" are actually 31 bits, not 20.1 bits. The capacity of a tweet is actually a lot bigger than the 2.8 kilobits we calculated earlier.
Ben: Should we worry that the text only shows up in the JSON API, not XML or HTML, and these high characters only last a few days on Twitter's site before vanishing?
Alyssa: Nope. As long as we had this exchange in our conversation, people on the Internet won't complain about those issues.
Using Alyssa's insight that Twitter actually supports 31-bit UCS characters (not just Unicode), we can come up with a simple program to send 4.2-kilobit tweets using only code positions that aren't in Unicode. That way there's no ambiguity between these crazy code positions, which Twitter represents as backslashed octal, and legitimate Unicode, which Twitter sends literally. These tweets have a text field that's a whopping 3,360 bytes long -- but in ASCII octal UTF-8, so they only represent 525 bytes of information in the end.
Here's an example of how to use the programs and verify that they can encode at least one arbitrary 4,200-bit message:
$ dd if=/dev/urandom of=random.bits bs=525 count=1 1+0 records in 1+0 records out 525 bytes (525 B) copied, 0.0161392 s, 32.5 kB/s $ md5sum random.bits a7da09e59d1b6807e78aac7004e6ba41 random.bits $ ./megatwit-send < random.bits 11037181699 $ ./megatwit-get 11037181699 | md5sum a7da09e59d1b6807e78aac7004e6ba41 -
But this is just the start -- we're not the only ones interested in really long tweets. (Structures, strictures...) Others have found an apparent loophole in how Twitter handles some URLs, allowing them to send tweets with a text field up to 532 bytes long (how much information can be coded this way isn't clear). Maxitweet has come up with a clever way to milk the most out of 140 Unicode characters without requiring a decoder program, at least at its lower compression levels. There are definitely even better ways to cram as many bits as possible into a tweet.
So here is the challenge for the 1st International Longest Tweet Contest:
That's it. Happy tweeting!