Following its Factory Tour back in May, Google has been dribbling out details of the NIST Machine Translation Evaluation. MT software engineer Franz Och commented on the results in the Google corporate blog earlier this week, leading CNET to pick up the story. Google clearly won this MT evaluation, but while looking at the NIST documents, we were intrigued by the absence of some companies, by the evaluation process itself, and by how the participants arrived at their results.
First off, whenever we see “Arabic,” “English,” and “machine translation” in the same story, we expect to read about Language Weaver, but it was absent from the NIST results. Another big player in MT is SDL, but its translation server was also missing in action. To clear up these mysteries we spoke to CEO Bryce Benjamin at Language Weaver and Jay Marciano, director of MT development at SDL.
SDL’s absence is easily explained — its MT server doesn’t support Arabic to English or Chinese to English, so its results couldn’t be evaluated. However, Language Weaver — the self-proclaimed champion of commercial MT for Arabic — couldn’t use that excuse.
Benjamin told us that participation is voluntary, adding that “historically the NIST evaluation for MT has been mainly research institutions and particularly funded institutions that have participated. Because they’re research systems, there’s no commercial constraints on how they go about doing the translations. The research groups work without constraints on the amount of computing power or time applied.”
So what is this no-holds-barred kind of MT computing test? It is Big Blue's BLEU (Bilingual Evaluation Understudy; "blue" fits IBM's cute naming convention for showcase systems, like its chess computers). BLEU is a method for automatically evaluating MT output, focused on "performance," which its creators define by the principle that "the closer a machine translation is to a professional human translation, the better it is." It computes "translation closeness" using the word error rate algorithms from speech recognition, modified to handle several possible reference translations plus variation in word choice and order. In short, there is no computing system performance component per se: the focus is on output quality at any price rather than on practicality in a normal commercial computing environment.
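The core arithmetic behind BLEU is simple enough to sketch. The following is a minimal, illustrative sentence-level version in Python, not the official NIST scoring script (which scores whole corpora and differs in detail): it takes the geometric mean of clipped n-gram precisions against one or more reference translations, then applies a brevity penalty to discourage overly short output.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """Counts of all contiguous n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def bleu(candidate, references, max_n=4):
    """Illustrative sentence-level BLEU: geometric mean of clipped
    n-gram precisions (n = 1..max_n) times a brevity penalty."""
    cand = candidate.split()
    refs = [r.split() for r in references]
    log_prec_sum = 0.0
    for n in range(1, max_n + 1):
        cand_counts = ngrams(cand, n)
        # Clip each candidate n-gram count by its maximum count
        # in any single reference translation.
        max_ref = Counter()
        for ref in refs:
            for gram, count in ngrams(ref, n).items():
                max_ref[gram] = max(max_ref[gram], count)
        clipped = sum(min(c, max_ref[g]) for g, c in cand_counts.items())
        total = sum(cand_counts.values())
        if total == 0 or clipped == 0:
            return 0.0  # no overlap at this order: score collapses to zero
        log_prec_sum += math.log(clipped / total)
    # Brevity penalty: punish candidates shorter than the closest reference.
    ref_len = min((len(r) for r in refs),
                  key=lambda rl: (abs(rl - len(cand)), rl))
    bp = 1.0 if len(cand) >= ref_len else math.exp(1 - ref_len / max(len(cand), 1))
    return bp * math.exp(log_prec_sum / max_n)
```

Clipping is what keeps a system from gaming the metric by repeating a high-frequency reference word; the brevity penalty does the same for translations that say very little, very precisely.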
Google reportedly set 1,000 computers to work on the task, gobbling up 40,000 hours of computing time for its win. Commercially available systems such as Language Weaver, SDL, and Systran target the more common uniprocessor model running on a notebook, desktop, or single server, a setup that would take less than ten minutes to translate the 100 articles in the NIST test.
While Google does appear to have clobbered all comers with its effort, that's not the end of the story. Benjamin told us that DARPA analyzed the translation results using a different metric than the one NIST used. In that evaluation, Language Weaver's intellectual progenitor, an MT research platform from the University of Southern California's Information Sciences Institute (USC/ISI), was judged to be "the best." Language Weaver maintains a strong working relationship with its ISI creators.
So whose MT is best? The BLEU tests provide one metric for evaluation and DARPA offers another. As we have seen in other industries, sooner or later potential MT buyers will come to expect suppliers to benchmark their products against their competitors'. Over time, vendors will start adding features to beat the tests, as we saw in the 1980s SQL database industry with a succession of benchmarks ranging from the Transaction Processing Performance Council to NIST. Given Google's success in the NIST test and in propagating the story, we expect Language Weaver to show up for next year's evaluation ready to race. Meanwhile, we expect all the MT players to redouble their efforts so they can beat the champ.
How should buyers react to the NIST results? With caution. Like any enterprise technology solution, you won’t know how any machine translation server performs for your company until you try it out with your content, your workflow, and your user base. Meanwhile, heady competition, even if driven by an incomplete or flawed benchmark, will improve the state of the art.