Friday May 25, 2007

Harnessing CAPTCHAs to Read Books

Network World has a piece about an interesting project called reCAPTCHA. The project started when Carnegie Mellon professor Luis von Ahn realized that people are collectively wasting over 150,000 hours of effort every day solving those visual CAPTCHA puzzles that certain sites (such as Yahoo) require before allowing you to perform some action (such as registering or posting). His solution won't save you any time, but it will put that time to good use: digitizing the world's books.

Instead of just making up an arbitrary visual puzzle which is useful only for determining whether you're a human, reCAPTCHA uses images of text from not-yet-digitized books as its puzzles. It still serves the primary purpose of determining whether you're a human, but has the secondary benefit of identifying text which is hard for present OCR systems to handle.

Clever, isn't it? But wait, you say... If the puzzle uses text which isn't yet digitized, how does the system know whether you've answered correctly or not? Don't worry, they thought of that. Per the project's own description:

But if a computer can't read such a CAPTCHA, how does the system know the correct answer to the puzzle? Here's how: Each new word that cannot be read correctly by OCR is given to a user in conjunction with another word for which the answer is already known. The user is then asked to read both words. If they solve the one for which the answer is known, the system assumes their answer is correct for the new one. The system then gives the new image to a number of other people to determine, with higher confidence, whether the original answer was correct.
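In other words, the known word acts as the test and the unknown word piggybacks on it, with multiple independent answers providing the confidence. Here's a minimal sketch of that pairing-and-voting idea (the names and the agreement threshold are my own assumptions for illustration, not reCAPTCHA's actual implementation):

    import random

    # Control words whose transcriptions are already verified.
    known_words = {"img_001": "fountain", "img_002": "harbor"}
    # Collected human answers for words OCR couldn't read: image id -> answers.
    unknown_votes = {}

    def make_challenge(unknown_image):
        """Pair an unsolved word image with a control word whose answer is known."""
        control_image = random.choice(list(known_words))
        return control_image, unknown_image

    def record_answer(control_image, control_answer, unknown_image, unknown_answer):
        """Accept the unknown answer only if the control word was solved correctly."""
        if control_answer.strip().lower() != known_words[control_image]:
            return False  # failed the known word, so the unknown answer isn't trusted
        unknown_votes.setdefault(unknown_image, []).append(unknown_answer.strip().lower())
        return True

    def transcription(unknown_image, required_agreement=3):
        """Promote an answer once enough independent humans agree on it."""
        answers = unknown_votes.get(unknown_image, [])
        for candidate in set(answers):
            if answers.count(candidate) >= required_agreement:
                return candidate
        return None  # not enough agreement yet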

Pretty cool stuff. Want to learn more? You could start by looking into the field of human computation. Or for an example of a not-so-positive application, there is a well-known method to defeat one site's CAPTCHAs by recycling them in some other popular setting (such as a free porn site).

Monday Jan 22, 2007

Wikipedia Decides Its Outgoing Links Can't Be Trusted?

I find this sad. By adding the rel="nofollow" attribute to the outgoing links in all articles, the Wikipedia seems to be wavering in its trust of volunteers. Yes, link spam is a problem. And with its combination of high visibility and open authoring, the Wikipedia is a prime target. But why not deal with this problem the same way it deals with other inaccurate and abusive content? Count on the volunteer base to detect and correct issues quickly (and give the administrators tools to lock certain articles which are repeated targets).

Until yesterday, that's exactly how the English-language Wikipedia dealt with link spam. But now the project has raised a white flag and conceded that its volunteers and tools aren't adequate to police the situation. Instead, the equivalent of martial law has been declared and everyone suffers.

The Wikipedia is the closest thing we have to a collective and collaborative voice in describing our world. When an external URL is referenced in a Wikipedia article, it must pass the editorial "litmus test" of all Wikipedians watching that article (who will presumably have high interest and expertise in the subject). With the blanket inclusion of the nofollow attribute on these links, search engines such as Google will no longer use these links as part of their determination of which URLs are most important. So we end up with slightly poorer search results and one less way to register our "votes" for improving them. Sad.
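For concreteness, the change amounts to tacking rel="nofollow" onto every external link the wiki renders, which tells search engines not to count the link when computing rankings. Here's a rough sketch of that rewriting step (a Python illustration of my own; the real MediaWiki code works differently):

    import re

    def add_nofollow(html):
        """Add rel="nofollow" to any anchor tag that doesn't already carry it."""
        def tag_anchor(match):
            tag = match.group(0)
            if "rel=" in tag:
                return tag
            return tag[:-1] + ' rel="nofollow">'
        return re.sub(r"<a\b[^>]*>", tag_anchor, html)

    print(add_nofollow('<a href="http://example.com/source">A cited source</a>'))
    # -> <a href="http://example.com/source" rel="nofollow">A cited source</a>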

On the bright side, the original announcement does note that "better heuristic and manual flagging tools for URLs would of course be super." Presumably, this means that when such tools are made available, the blanket application of nofollow will be removed. Let's hope that happens. Soon.
