Harnessing CAPTCHAs to Read Books
By woodjr on May 25, 2007
Network World has a piece about an interesting project called reCAPTCHA. The project started when Carnegie Mellon professor Luis von Ahn realized that people are collectively spending over 150,000 hours every day solving the visual CAPTCHA puzzles that certain sites (such as Yahoo) require before letting you perform some action (such as registering or posting). His solution won't save you any time, but it will put that time to good use: digitizing the world's books.
Instead of just making up an arbitrary visual puzzle that is useful only for determining whether you're a human, reCAPTCHA uses images of text from not-yet-digitized books as its puzzles. It still serves the primary purpose of determining whether you're a human, but has the secondary benefit of getting people to transcribe text that present OCR systems can't handle.
Clever, isn't it? But wait, you say... If the puzzle uses text which isn't yet digitized, how does the system know whether you've answered correctly or not? Don't worry, they thought of that. Per the project's own description:
But if a computer can't read such a CAPTCHA, how does the system know the correct answer to the puzzle? Here's how: Each new word that cannot be read correctly by OCR is given to a user in conjunction with another word for which the answer is already known. The user is then asked to read both words. If they solve the one for which the answer is known, the system assumes their answer is correct for the new one. The system then gives the new image to a number of other people to determine, with higher confidence, whether the original answer was correct.
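To make the two-word trick concrete, here is a minimal Python sketch of the logic as I understand it from that description. It is not the project's actual code: the class name `ReCaptchaSketch`, the `votes_needed` threshold, and the lowercase/strip normalization are all my own assumptions for illustration.

```python
from collections import Counter


def normalize(text):
    # Assumed normalization: ignore case and surrounding whitespace.
    return text.strip().lower()


class ReCaptchaSketch:
    """Hypothetical sketch of pairing a known word with an unknown one."""

    def __init__(self, votes_needed=3):
        # Assumed threshold: how many matching answers we want before
        # trusting a transcription of an unknown word.
        self.votes_needed = votes_needed
        self.votes = {}     # unknown image id -> Counter of answers
        self.resolved = {}  # unknown image id -> accepted transcription

    def submit(self, control_answer, control_truth, unknown_id, unknown_answer):
        """Return True if the user passes the human check."""
        # The known (control) word decides whether this is a human.
        if normalize(control_answer) != normalize(control_truth):
            return False

        # The user got the known word right, so count their reading of
        # the unknown word as one vote.
        if unknown_id not in self.resolved:
            tally = self.votes.setdefault(unknown_id, Counter())
            tally[normalize(unknown_answer)] += 1
            word, count = tally.most_common(1)[0]
            if count >= self.votes_needed:
                self.resolved[unknown_id] = word
        return True


service = ReCaptchaSketch(votes_needed=2)
service.submit("morning", "morning", "img1", "upon")   # first vote for "upon"
service.submit("wrong", "morning", "img1", "upan")     # fails the check, no vote
service.submit("Morning ", "morning", "img1", "Upon")  # second vote, word resolved
print(service.resolved)
```

The point of the sketch is that a wrong answer on the known word discards the whole submission, so only people who have already shown they can read distorted text get a vote on the unknown word.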
Pretty cool stuff. Want to learn more? You could start by looking into the field of human computation. Or, for an example of a not-so-positive application, there is a well-known method of defeating one site's CAPTCHAs by recycling them in some other popular setting (such as a free porn site), where unsuspecting visitors solve them.