Developers of machine translation (MT) technology have become increasingly active in announcing major architectural enhancements that promise to deliver substantial gains in output quality and decreases in training costs. Journalists in mainstream media have picked up the MT industry's neural networking theme and predicted universal communication. But these news reports focus on press releases, marketing claims, and self-evaluations of MT products by their developers.
An objective assessment of MT is long overdue. In 2005, CSA Research analyzed that year's NIST shoot-out, won by Google. We outlined the need for standardized metrics to benchmark products, and predicted that "vendors will start adding features to beat the tests – as we saw in the 1980s SQL database industry." Since then, the NIST comparison has fallen by the wayside, MT developers publish their own results only when they beat competitors, and gaming of MT metrics such as BLEU has become common practice.
In response to such issues, translation software developer Lilt announced the creation of Lilt Labs, "a collaborative effort between computational linguists, scientists, and language professionals designed to address a growing and painful problem in the language space" – the lack of independent and objective information on quality and performance. CSA Research spoke with co-founder Spence Green, who told us that his company created Lilt Labs as a forum for publishing research, evaluations, and insights about computer-aided translation (CAT) and machine translation (MT). The first post at the site describes a quantitative assessment of output quality from MT, evaluating engines from Google, Microsoft, SDL, SYSTRAN, and Lilt itself. We discussed three concerns with Green:
- Reference translations may not reflect an ideal. The human reference translations used in shared tasks and shoot-outs like those hosted by the WMT conference are often quite bad – and consequently easily bested by MT engines. Thus, developers evaluate their results against – and develop features that support – defective standards of reference. Lilt Labs will provide a forum for sharing better references, discussing testing methodologies, and posting results. Green said that Lilt will publish its assessment scripts so that practitioners can attempt to reproduce the results. This approach should result in greater transparency.
- Reference-based methods are imprecise. Developers frequently brag about increases of one or two BLEU points, but these numbers fall within the inherent imprecision of these methods and generally do not correspond to meaningful improvements in real-world use. Compounding this problem, most evaluations use a single human reference, so they measure how similar MT output is to one particular human translation and may penalize good translations that do not happen to resemble it. In response, Green paraphrased Winston Churchill's comment about democracy, noting that BLEU is the worst metric, except for all the others. He also stated that years of BLEU data provide a solid baseline for comparison moving forward.
- ISVs often release only the good news. Independent software vendors (ISVs) trumpet results that benefit them, but ignore negative results. They all acknowledge that BLEU scores don't mean much, but prospective buyers expect them, and so MT companies pick and choose the most favorable results. Furthermore, MT researchers often find that when they present specific negative examples at conferences, the developers of those systems tweak them to address those particular cases without delivering systematic fixes. Despite these limitations, developers have yet to deliver better alternatives. Green hopes that Lilt Labs will gain traction and become the go-to forum for objective information about the technology.
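The single-reference concern above can be illustrated with a small sketch. It computes clipped ("modified") n-gram precision, the matching step at the core of BLEU, for one MT output against one reference and then against two. This is not full BLEU (it omits the brevity penalty and the geometric mean across n-gram orders), and the sentences and the function name `clipped_precision` are invented for illustration; they are not taken from the Lilt Labs assessment scripts.

```python
from collections import Counter


def ngrams(tokens, n):
    """Count the n-grams in a list of tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def clipped_precision(hypothesis, references, n):
    """Share of hypothesis n-grams that also occur in at least one reference.

    Each hypothesis n-gram count is clipped to the highest count seen in any
    single reference, as in BLEU's modified precision.
    """
    hyp_counts = ngrams(hypothesis.split(), n)
    if not hyp_counts:
        return 0.0
    max_ref = Counter()
    for ref in references:
        for gram, count in ngrams(ref.split(), n).items():
            max_ref[gram] = max(max_ref[gram], count)
    matched = sum(min(count, max_ref[gram]) for gram, count in hyp_counts.items())
    return matched / sum(hyp_counts.values())


# Hypothetical example sentences (assumptions, not real evaluation data).
hypothesis = "the cat is sitting on the mat"
ref_a = "the cat sat on the mat"
ref_b = "a cat is sitting on the mat"

for n in (1, 2):
    single = clipped_precision(hypothesis, [ref_a], n)
    multi = clipped_precision(hypothesis, [ref_a, ref_b], n)
    print(f"{n}-gram precision: single reference {single:.2f}, two references {multi:.2f}")
```

In this toy case the same acceptable output scores noticeably lower against one reference than against two, purely because of wording choices in the reference. Real evaluations work on whole test sets, but the sensitivity to the choice and number of references is the same in kind.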
But for all that's good in Lilt Labs, there's still no vendor-neutral "MT Labs" comparable to PC Labs that conducts systematic and unbiased testing. Green acknowledged that an independent lab would be better, but no objective third party has stepped up to the task. Given that fact, Lilt is trying something different with its Labs. It set up its own open repository of test results, has begun developing tests that are arguably less biased than what any MT vendor would do, and is publishing the results. While it has two dogs in the fight – Phrasal MT and its own MT enhancements in the Lilt translation tool – it can also work with a variety of commercial MT solutions.
The idea behind Lilt Labs reflects the founding ethos of Lilt, which grew out of an academic heritage that encourages open disclosure and replication of results. To the company's credit, it was not afraid to show in the just-published assessment that Google's neural MT software outperformed its own solution by a slight margin. Such a candid view is almost unheard of from commercial vendors, who worry that if they live by BLEU scores, they will die by them. Lilt Labs does share some of the limitations of previous evaluations, such as the use of reference-based quality measures, assessment of a limited set of language pairs, and the current lack of formal peer review for Lilt Labs postings. Despite these issues, sharing the results of tests run by a third party is an important step toward transparency in the fast-developing world of machine translation.
The bottom line is that creating a clearinghouse for MT evaluation is a good move. Lilt Labs could bring scientific rigor, impartiality, and objective peer review to a field that currently lacks all three. If employees of other developers are willing to contribute and candidly share results, it will help buyers and developers understand the true state of the field rather than the rosy picture that ISVs like to paint. To succeed, Lilt will need to convince others that openness is the best route in the long run and get them to contribute, so that Lilt Labs can become more than a mouthpiece for one player. And if it gains traction, some independent association or academic institution might take it under its wing.