By ultan o'broin on Jan 01, 2011
Fascinating article in the UK Guardian newspaper called "Can Google break the computer language barrier?" In the article, Andreas Zollman, who works on Google Translate, comments that the quality of Google Translate's output relative to the amount of data required to create that output is clearly now falling foul of the law of diminishing returns. He says:
"Each doubling of the amount of translated data input led to about a 0.5% improvement in the quality of the output," he suggests, but the doublings are not infinite. "We are now at this limit where there isn't that much more data in the world that we can use," he admits. "So now it is much more important again to add on different approaches and rules-based models."
The Translation Guy has a further discussion on this, called "Google Translate is Finished". He says:
"And there aren't that many doublings left, if any. I can't say how much text Google has assimilated into their machine translation databases, but it's been reported that they have scanned about 11% of all printed content ever published. So double that, and double it again, and once more, shoveling all that into the translation hopper, and pretty soon you get the sum of all human knowledge, which means a whopping 1.5% improvement in the quality of the engines when everything has been analyzed. That's what we've got to look forward to, at best, since Google spiders regularly surf the Web, which in its vastness dwarfs all previously published content. So to all intents and purposes, the statistical machine translation tools of Google are done. Outstanding job, Googlers. Thanks."
Surprisingly, all this analysis hasn't raised that much comment from the fans of machine translation (MT), or its detractors either for that matter. Perhaps, it's the season of goodwill? What is clear to me, however, of course is that Google Translate isn't really finished (in any sense of the word). I am sure Google will investigate and come up with new rule-based translation models to enhance what they have already and that will also scale effectively where others didn't. So too, will they harness human input and guidance, which really is the way to go in training MT in the right quality direction.
But that aside, what does it say about the quality of the data that is being used for statistical machine translation in the first place? From the Guardian article it's clear that a huge human-translated corpus drove the gains for Google Translate and now what's left is the dregs of badly translated and poorly created source materials that just can't deliver quality translations. There's a message about information quality there, surely.
In the enterprise applications space, where we have some control over content this whole debate reinforces the relationship between information quality at source and translation efficiency, regardless of the technology used to do the translation. But as more automation comes to the fore, that information quality is even more critical if you want anything approaching a scalable solution. This is important for user experience professionals. Issues like user generated content translation, multilingual personalization, and scalable language quality are central to a superior global UX; it's a competitive issue we cannot ignore.