An Oracle blog about translation

Where Next for Google Translate? And What of Information Quality?

Fascinating article in the UK Guardian newspaper called "Can Google break the computer language barrier?"
In the article, Andreas Zollmann, who works on Google Translate, comments that the quality of Google Translate's output, relative to the amount of data required to create that output, is now clearly falling foul of the law of diminishing returns. He says:

"Each doubling of the amount of translated data input led to about a 0.5% improvement in the quality of the output," he suggests, but the doublings are not infinite. "We are now at this limit where there isn't that much more data in the world that we can use," he admits. "So now it is much more important again to add on different approaches and rules-based models."

The Translation Guy has a further discussion on this, called "Google Translate is Finished". He says: 

"And there aren't that many doublings left, if any. I can't say how much text Google has assimilated into their machine translation databases, but it's been reported that they have scanned about 11% of all printed content ever published. So double that, and double it again, and once more, shoveling all that into the translation hopper, and pretty soon you get the sum of all human knowledge, which means a whopping 1.5% improvement in the quality of the engines when everything has been analyzed. That's what we've got to look forward to, at best, since Google spiders regularly surf the Web, which in its vastness dwarfs all previously published content. So to all intents and purposes, the statistical machine translation tools of Google are done. Outstanding job, Googlers. Thanks."
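The arithmetic behind that 1.5% figure is easy to check. Here is a minimal sketch, assuming (as the quote does) that each doubling adds a flat 0.5 percentage points and that the reported 11% share is the starting point. The function names are mine, purely for illustration.

```python
import math

# Figures taken from the two quotes above; treating the gain as a flat,
# additive 0.5 points per doubling is an assumption that mirrors the
# Translation Guy's back-of-envelope sum.
GAIN_PER_DOUBLING = 0.5   # percentage points of quality per doubling
SCANNED_SHARE = 0.11      # share of printed content reportedly scanned


def doublings_left(current_share: float) -> int:
    """Whole doublings of the data that fit before passing 100% of content."""
    return math.floor(math.log2(1.0 / current_share))


def remaining_gain(current_share: float, gain_per_doubling: float) -> float:
    """Total quality gain still available if every remaining doubling is used."""
    return doublings_left(current_share) * gain_per_doubling


if __name__ == "__main__":
    print(doublings_left(SCANNED_SHARE))                      # 11% -> 22% -> 44% -> 88%
    print(remaining_gain(SCANNED_SHARE, GAIN_PER_DOUBLING))
```

Starting from 11%, three doublings reach 88% and a fourth would overshoot 100%, so three doublings at 0.5 points each is where the 1.5% comes from.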

Surprisingly, all this analysis hasn't raised much comment from the fans of machine translation (MT), or from its detractors for that matter. Perhaps it's the season of goodwill?
What is clear to me, however, is that Google Translate isn't really finished (in any sense of the word). I am sure Google will investigate and come up with new rule-based translation models to enhance what they already have, models that will also scale effectively where others didn't. So too will they harness human input and guidance, which really is the way to go in training MT in the right quality direction.

But that aside, what does it say about the quality of the data that is being used for statistical machine translation in the first place? From the Guardian article it's clear that a huge human-translated corpus drove the gains for Google Translate and now what's left is the dregs of badly translated and poorly created source materials that just can't deliver quality translations. There's a message about information quality there, surely.

In the enterprise applications space, where we have some control over content, this whole debate reinforces the relationship between information quality at source and translation efficiency, regardless of the technology used to do the translation. And as more automation comes to the fore, that information quality becomes even more critical if you want anything approaching a scalable solution.
This is important for user experience professionals. Issues like user-generated content translation, multilingual personalization, and scalable language quality are central to a superior global UX; it's a competitive issue we cannot ignore.

Comments ( 4 )
  • Vadim Berman Tuesday, January 4, 2011
    "Reaching a glass ceiling" is not equivalent to "finished", that is correct. The glass ceiling, however, appears quite low for anything but Tier 1 languages.
    Tinkering with Google Translate shows that this is not just plain vanilla statistical MT; it appears that they try to align the results with monolingual corpora. An interesting move, although in many cases it improves the way the results look but deviates from the original meaning, and won't work with anything which does not rely on publicly available information.
    Many people have been saying for a while that pure statistical methods are not sufficient. The "more data" mantra apparently works only to a certain extent, and it requires more than 7 billion monkeys with 7 billion typewriters to produce enough data. Maybe this is the reason why the mainstream enterprise MT is still rule-based?
  • Ultan Wednesday, January 5, 2011
    Excellent point about not all languages being at the same level of quality.
  • Matt Train Wednesday, January 5, 2011
    I work for a translation agency. We test Google Translate's output every six months between our five most ordered language pairs and have our in-house native speakers revise the work up to what we regard as the professional standard.
    So far the tests have pretty much all shown that the time required to re-draft, and in some cases completely re-translate, the work negates any advantages gained in terms of time and cost in the translation stage.
    If we lower our expectations though, there may be cases for the use of machine translation. In combination with Translation Memory and Glossaries, a customised machine translation engine may get the technology to a point where a human editor can get the text to standard within a usual editing timeframe.
    That would represent a real step forward. It is our educated professional hunch that over the next five to ten years, driven by a global explosion of digital information and content, machine translation will play a big part in localisation.
    But translators will always be in work. Just with shinier, cleverer tools that help them do a lot more in less time, on the go, in small chunks.
  • Kirti Tuesday, January 25, 2011
    I am providing an SMT practitioner comment on this in my blog:
    I think we have seen enough examples where data quality beats data volume (within some reasonable bounds of course) that it is clear that there is value in paying much more attention to the data quality used in training and also in processing the data quality of any source material.
    Also I think the reason that there possibly are more RbMT systems in enterprise use is that these systems have been around for 40+ years in some commercial form, while SMT systems have barely been available for 5 years in a commercial form.
    I also think that getting linguists much more involved in building MT engines will drive quality improvements with small amounts of data.