By Ksplice Post Importer on Mar 24, 2010
How much information is in a tweet?
There are a lot of snarky answers to that. Let's talk mathematically, where information is measured in bits: How many bits can be expressed in a tweet?
It's kind of fun to try to figure out how to cram in the most information. Our current in-house record is 4.2 kilobits (525 bytes) per tweet, but this can definitely be bested. Twitter's 140-character limit has been under assault for some time, but nobody has decisively anointed a winner.
To that end, we're announcing the 1st International Longest Tweet Contest, along the lines of the International Obfuscated C Code Contest. The goal: fit the most bits of information into a tweet. There will be glorious fame and a T-shirt for the winner. More on that later. But first, a dialog:
Ben Bitdiddle: How big can a tweet get? SMS is limited to 160 characters of 7-bit ASCII, or 1,120 bits. Since Twitter based their limit on SMS but with 20 characters reserved for your name, a tweet is probably limited to 140 × 7 = 980 bits.
Alyssa P. Hacker: Not quite -- despite its SMS roots, Twitter supports Unicode, and the company says its 140-character limit is based on Unicode characters, not ASCII. So it's a lot more than 980 bits.
Ben: Ok, Unicode is a 16-bit character set, so the answer is 140 × 16 = 2,240 bits.
Alyssa: Well, since that Java documentation was written, Unicode has expanded to cover more of the world's languages. These days there are 1,112,064 possible Unicode characters (officially "Unicode scalar values") that can be represented, including in the UTF-8 encoding that Twitter and most of the Internet use. That makes Unicode about 20.1 bits per character, not 16. (Since log₂ 1,112,064 ≈ 20.1.)
Ben: Ok, if each character can take on one of 1,112,064 possible values, we can use that to figure out how many total different tweets there can ever be. And from that, we'll know how many bits can be encoded into a tweet. But how?
Alyssa: It's easy! We just calculate the total number of different tweets that can ever be received. The capacity of a tweet, in bits, is just the base-2 logarithm of that number. According to my calculator here, 1,112,064 to the 140th power is about 28.7 quattuordecillion googol googol googol googol googol googol googol googol. That's the number of distinct 140-character messages that can be sent. Plus we have to add in the 139-character messages and 138-character messages, etc. Taking the log, I get 2,811 bits per tweet.
Ben: We'd get almost the same answer if we just multiplied 20.1 bits per character times 140 characters. Ok, 2.8 kilobits per tweet.
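Alyssa's arithmetic is easy to check with arbitrary-precision integers. The following is a minimal sketch, not anything from the original post: the function and constant names are made up for illustration, and it assumes (as the dialog does at this point) that every Unicode scalar value is usable in every position and that a tweet holds 1 to 140 characters.

```python
import math

# Unicode scalar values: code points 0..0x10FFFF minus the 2,048 surrogates.
SCALAR_VALUES = 0x110000 - 0x800  # 1,112,064
MAX_CHARS = 140                   # Twitter's limit at the time of the post


def tweet_capacity_bits(alphabet_size: int, max_chars: int) -> float:
    """Return log2 of the number of distinct messages of length 1..max_chars."""
    total = sum(alphabet_size ** k for k in range(1, max_chars + 1))
    # math.log2 accepts arbitrarily large Python ints without overflowing.
    return math.log2(total)


bits_per_char = math.log2(SCALAR_VALUES)
capacity = tweet_capacity_bits(SCALAR_VALUES, MAX_CHARS)

print(f"{SCALAR_VALUES:,} scalar values")
print(f"{bits_per_char:.2f} bits per character")
print(f"{capacity:.2f} bits per tweet")
```

The shorter messages barely move the total: the sum is dominated by the 140-character term, which is why multiplying bits per character by 140 lands between 2,811 and 2,812 bits as well.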
Alyssa: I just discovered a problem. There aren't that many distinct tweets! For example, I can't tell the difference between the single character '<' and the four characters '&lt;'. I also can't send a tweet that contains nothing but some null characters. So we have to deflate our number a little bit. It's tricky because we don't know Twitter's exact limitations; it's a black box.
Ben: But the answer will still be roughly 2.8 kilobits. Can we do better?
Alyssa: By golly, I think we can! Turns out Unicode is basically identical to an international standard, called the Universal Multi-Octet Coded Character Set, known as UCS or ISO/IEC 10646. But the two standards weren't always the same -- in the 90s, there was a lot of disagreement between the computer companies who backed Unicode and proposed a simple 16-bit character set, and the standards mavens behind UCS, who wanted to accommodate more languages than 65,536 characters would allow. The two sides eventually compromised on the current 20.1-bit character set, but some historical differences still linger between the two "official" specifications -- including in the definition of the UTF-8 encoding that Twitter uses. Check this out:
UTF-8 according to Unicode
UTF-8 according to ISO/IEC 10646
Alyssa: Although Unicode's UTF-8 can only encode the 1,112,064 Unicode scalar values, UCS's version of UTF-8 is a lot bigger -- it goes up to 31 bits!
Ben: So, when Twitter says they use UTF-8, which version are they using? Unicode, or UCS?
Alyssa: Hmm, that's a good question. Let's write a program to test. I'll send Twitter a really big UCS character in UTF-8, represented in octal like this. This is way outside the bounds of well-formed Unicode.
$ perl -Mbytes -we 'print pack "U", (2**31 - 5)' | od -t o1
0000000 375 277 277 277 277 273
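Those six octal bytes can be reproduced by hand. Here is a short sketch of the original 31-bit UTF-8 scheme (1 to 6 bytes per code position), written in Python for illustration; `ucs_utf8_encode` is a hypothetical helper name, not part of any library or of the programs in this post. Python's built-in `str.encode("utf-8")` follows the Unicode definition and will not produce these longer sequences, which is why the encoder is written out.

```python
def ucs_utf8_encode(code: int) -> bytes:
    """Encode a code position using the original 31-bit UTF-8 scheme."""
    if not 0 <= code < 2**31:
        raise ValueError("code position must fit in 31 bits")
    if code < 0x80:
        return bytes([code])  # plain ASCII, one byte

    # (exclusive upper limit, total bytes, lead-byte prefix)
    for limit, length, prefix in (
        (0x800, 2, 0xC0),
        (0x10000, 3, 0xE0),
        (0x200000, 4, 0xF0),
        (0x4000000, 5, 0xF8),
        (0x80000000, 6, 0xFC),
    ):
        if code < limit:
            break

    # Peel off six bits at a time for the continuation bytes (10xxxxxx).
    tail = []
    for _ in range(length - 1):
        tail.append(0x80 | (code & 0x3F))
        code >>= 6
    # Whatever is left goes into the lead byte.
    return bytes([prefix | code] + tail[::-1])


encoded = ucs_utf8_encode(2**31 - 5)
print(" ".join(f"{b:03o}" for b in encoded))  # 375 277 277 277 277 273
```

For code positions that are valid Unicode scalar values, the sketch agrees with the standard encoder, for example `ucs_utf8_encode(0x20AC) == "€".encode("utf-8")`.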
Ben: Hey, that kind of worked! When we fetch it back using Twitter's JSON API, we get this JSON fragment:
Alyssa: It's the same thing we sent! The character (really an unassigned code position) is represented in UTF-8, in octal with backslashes in ASCII. That means Twitter "characters" are actually 31 bits, not 20.1 bits. The capacity of a tweet is actually a lot bigger than the 2.8 kilobits we calculated earlier.
Ben: Should we worry that the text only shows up in the JSON API, not XML or HTML, and these high characters only last a few days on Twitter's site before vanishing?
Alyssa: Nope. As long as we had this exchange in our conversation, people on the Internet won't complain about those issues.
Using Alyssa's insight that Twitter actually supports 31-bit UCS characters (not just Unicode), we can come up with a simple program to send 4.2-kilobit tweets using only code positions that aren't in Unicode. That way there's no ambiguity between these crazy code positions, which Twitter represents as backslashed octal, and legitimate Unicode, which Twitter sends literally. These tweets have a text field that's a whopping 3,360 bytes long -- but in ASCII octal UTF-8, so they only represent 525 bytes of information in the end.
- The sender program reads the first 525 bytes of standard input or a file, slices it into 30-bit chunks, and sends it to Twitter using 31-bit UCS code positions. The high bit of each character is set to 1 to avoid ever sending a valid Unicode UTF-8 string, which Twitter might treat ambiguously. It outputs the ID of the tweet it posted. You'll have to fill in your own username and password to test it.
- The receiver program does the opposite -- give it an ID on the command line, and it will retrieve and decode a "megatwit" encoded by the sender.
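The packing step the two programs perform can be sketched without touching Twitter at all. The code below is an illustrative reconstruction from the description above, not the actual sender or receiver: all names are hypothetical, short inputs are zero-padded to 525 bytes (an assumption; the post does not say how the real programs handle them), and the backslashed-octal rendering is modeled only on how the post describes Twitter's output.

```python
import os

CHUNK_BITS = 30
CHARS = 140
PAYLOAD_BYTES = CHARS * CHUNK_BITS // 8  # 4,200 bits = 525 bytes
HIGH_BIT = 1 << 30                       # forces every code position >= 2**30
MASK = (1 << CHUNK_BITS) - 1


def to_code_positions(data: bytes) -> list:
    """Slice up to 525 bytes into 140 code positions of 31 bits each."""
    if len(data) > PAYLOAD_BYTES:
        raise ValueError("at most 525 bytes fit in one tweet")
    padded = data.ljust(PAYLOAD_BYTES, b"\0")  # assumption: zero padding
    number = int.from_bytes(padded, "big")
    shifts = range(CHUNK_BITS * (CHARS - 1), -1, -CHUNK_BITS)
    return [HIGH_BIT | ((number >> s) & MASK) for s in shifts]


def from_code_positions(codes: list) -> bytes:
    """Inverse of to_code_positions; always returns 525 bytes."""
    number = 0
    for code in codes:
        number = (number << CHUNK_BITS) | (code & MASK)
    return number.to_bytes(PAYLOAD_BYTES, "big")


def six_byte_utf8(code: int) -> bytes:
    """31-bit UTF-8 for code positions >= 2**26, which always take 6 bytes."""
    lead = 0xFC | (code >> 30)
    tail = [0x80 | ((code >> s) & 0x3F) for s in (24, 18, 12, 6, 0)]
    return bytes([lead] + tail)


def escape_octal(codes: list) -> str:
    """Render each UTF-8 byte as a backslash plus three octal digits."""
    return "".join(f"\\{b:03o}" for c in codes for b in six_byte_utf8(c))


payload = os.urandom(PAYLOAD_BYTES)
codes = to_code_positions(payload)
text = escape_octal(codes)
print(len(codes), "characters,", len(text), "escaped bytes")
print("round trip ok:", from_code_positions(codes) == payload)
```

Each character becomes 6 UTF-8 bytes, and each byte becomes 4 ASCII characters once escaped, which is where the 3,360-byte text field comes from (140 × 6 × 4). Because the high bit is always set, only 30 of the 31 bits per character carry data.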
Here's an example of how to use the programs and verify that they can encode at least one arbitrary 4,200-bit message:
$ dd if=/dev/urandom of=random.bits bs=525 count=1
1+0 records in
1+0 records out
525 bytes (525 B) copied, 0.0161392 s, 32.5 kB/s
$ md5sum random.bits
a7da09e59d1b6807e78aac7004e6ba41 random.bits
$ ./megatwit-send < random.bits
11037181699
$ ./megatwit-get 11037181699 | md5sum
a7da09e59d1b6807e78aac7004e6ba41 -
But this is just the start -- we're not the only ones interested in really long tweets. (Structures, strictures...) Others have found an apparent loophole in how Twitter handles some URLs, allowing them to send tweets with a text field up to 532 bytes long (how much information can be coded this way isn't clear). Maxitweet has come up with a clever way to milk the most out of 140 Unicode characters without requiring a decoder program, at least at its lower compression levels. There are definitely even better ways to cram as many bits as possible into a tweet.
So here is the challenge for the 1st International Longest Tweet Contest:
1. Come up with a strategy for encoding arbitrary binary data into a single tweet, along the lines of the sample programs described here.
2. Implement a sender and receiver for the strategy in a computer programming language of your choice. We recommend you choose a language likely to be runnable by a wide variety of readers. At your option, you may provide a Web site or other public implementation for others to test your coding scheme with arbitrary binary input.
3. Write a description, in English, for how your coding scheme works. Explain how many bits per tweet it achieves, and justify your calculation. The explanation must be convincing because it is intractable to prove conclusively that a certain capacity is achievable, even if a program successfully sends and receives many test cases.
4. Send your entry, consisting of #2 and #3, to email@example.com_blog_us_grp@oracle.com before Sunday, April 11, 2010, 23h59 UTC.
5. Based on the English explanations of the coding schemes (#3), we'll select finalists from among the entrants. In mid-April, we'll post the finalists' submissions on the blog. In the spirit of Twitter, we'll let the community assess, criticize, test, and pick apart the finalists' entries.
6. Based on the community's feedback, we'll choose at least one winner. This will generally be the person whose scheme for achieving the highest information encoded per tweet is judged most convincing.
7. The winner will receive notoriety and fame on this blog, and a smart-looking Ksplice T-shirt in the size of your choice. You will be known the world over as the Reigning Champion of the International Longest Tweet Contest until there is another contest. Legally we can't promise this, but there's a chance Stephen Colbert will have you on as a result of winning this prestigious contest.
8. By entering the contest, you are giving Ksplice permission to post your entry if you are selected as a finalist. The contest is void where prohibited. The rules are subject to change at our discretion. Employees of Ksplice aren't eligible because they already have T-shirts. Judging will be done by Ksplice in its sole discretion.
That's it. Happy tweeting!