Common Sense Advisory Blogs
More, Better Data Improves Statistical MT Results
Asia Online yesterday released a "Study on the Impact of Data Consolidation and Sharing for Statistical Machine Translation." We touched base with company VP Kirti Vashee to discuss the report about optimizing automated translation technologies.
The study is a detailed analysis of "the effect of combining data from multiple companies for the purpose of building statistical machine translation (SMT) engines." This kind of information is critical in determining the value of sharing multilingual assets such as translation memories (TMs) and terminology bases among companies in the same field, a proposed benefit of the TAUS Data Association (TDA) effort.
In June 2009, we wrote of our interest in finding out what TDA members will learn when they begin incorporating shared data into their own TMs, what benefits MT developers and users will derive from this bigger pool of data, and whether this effort will have enough payback for its members. Asia Online's study is the first to address those concerns.
Three TAUS members in the same market sector offered their TMs for analysis. Vashee told us that Asia Online created 29 discrete SMT engines by combining the TMs in various ways, then compared the output of every configuration using the BLEU and F-Measure metrics. It found that SMT output quality depends on the quality of the training data, and issued several recommendations:
- Use clean, normalized translation memories. SMT engines built with raw TMs are of lower quality than those configured with clean, structured data. If you choose to use lesser-quality TMs, you will need much more data to achieve similar quality. This will be bad news for some prospective SMT users. Our recent MT research contended that many organizations will find that their TMs are not up to snuff -- these manually created memories often carve into stone the aggregated work of lots of people of random capabilities, passed back and forth among LSPs over the years with little oversight or management.
- Spend time on terminology clean-up. Without highly consistent use of terms across the shared data, you will not achieve the potential of your normalized translation memories. This is a good practice regardless -- improving terminology management is a critical element of effective global content management, with or without machine translation.
- Fill in gaps with good TM data. By adding clean, quality-assured datasets from other companies or from repositories such as the one the TDA proposes, you can fill in lexical gaps and improve the output. This is the promise of shared TM repositories, provided that the contents are of the necessary quality.
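To make the first recommendation concrete, here is a minimal sketch of what "clean, normalized" might mean in practice. It is not Asia Online's procedure, which the study summary does not describe. The function names, the tag-stripping pattern, and the length-ratio threshold are all assumptions chosen for illustration, and a real TM would typically arrive as TMX rather than as a list of Python tuples.

```python
import re
import unicodedata


def normalize_segment(text):
    """Apply Unicode NFC normalization, strip inline tags, collapse whitespace."""
    text = unicodedata.normalize("NFC", text)
    text = re.sub(r"<[^>]+>", " ", text)  # drop inline markup such as <b> or <ph/>
    return re.sub(r"\s+", " ", text).strip()


def clean_tm(pairs, max_length_ratio=3.0):
    """Return normalized, de-duplicated (source, target) pairs.

    Heuristics (illustrative assumptions, not taken from the study):
    - drop pairs where either side is empty after normalization
    - drop untranslated pairs (source identical to target)
    - drop pairs whose character lengths differ by more than max_length_ratio
    - drop case-insensitive duplicates, keeping the first occurrence
    """
    seen = set()
    cleaned = []
    for source, target in pairs:
        source = normalize_segment(source)
        target = normalize_segment(target)
        if not source or not target:
            continue
        if source == target:
            continue
        ratio = max(len(source), len(target)) / min(len(source), len(target))
        if ratio > max_length_ratio:
            continue
        key = (source.lower(), target.lower())
        if key in seen:
            continue
        seen.add(key)
        cleaned.append((source, target))
    return cleaned


if __name__ == "__main__":
    raw_tm = [
        ("Press the  <b>Start</b> button.", "Appuyez sur le bouton Démarrer."),
        ("Press the Start button.", "Appuyez sur le bouton Démarrer."),  # duplicate
        ("Cancel", ""),                                                  # empty target
        ("OK", "OK"),                                                    # untranslated
    ]
    for source, target in clean_tm(raw_tm):
        print(source, "=>", target)
```

Filters like these are deliberately blunt. The untranslated-pair rule, for instance, would also discard product names that are legitimately identical in both languages, so any real clean-up pass needs review by someone who knows the content.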
Volumes being equal, good data begets better engines and bad data results in lower-quality output. However, the study did show that massive volumes of moderately bad input could result in a better engine than one built with a tiny amount of clean data. These findings give more meat to the continuing discussion about how much data an SMT engine requires to be truly useful.
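For readers unfamiliar with how such engine comparisons are scored, the sketch below shows one common form of F-Measure: the harmonic mean of token-level precision and recall against a reference translation. This is a generic formulation and an assumption on our part, since the summary does not say which variant the study used. The engine names in the usage example are hypothetical.

```python
from collections import Counter


def f_measure(candidate, reference, beta=1.0):
    """Token-level F-measure of a candidate translation against one reference.

    Matches are clipped: a token counts at most as often as it appears
    in the reference.
    """
    cand_counts = Counter(candidate.lower().split())
    ref_counts = Counter(reference.lower().split())
    overlap = sum((cand_counts & ref_counts).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand_counts.values())
    recall = overlap / sum(ref_counts.values())
    beta_sq = beta * beta
    return (1 + beta_sq) * precision * recall / (beta_sq * precision + recall)


def rank_engines(outputs, references):
    """Rank engines by mean segment-level F-measure, best first.

    outputs maps an engine name to its list of translations, aligned
    one-to-one with the list of reference translations.
    """
    scores = {}
    for name, translations in outputs.items():
        total = sum(f_measure(c, r) for c, r in zip(translations, references))
        scores[name] = total / len(references)
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    references = ["press the start button to begin"]
    outputs = {  # hypothetical engines and outputs, for illustration only
        "engine_raw_tm": ["push start button"],
        "engine_clean_tm": ["press the start button to start"],
    }
    for name, score in rank_engines(outputs, references):
        print(f"{name}: {score:.3f}")
```

A word-overlap score like this ignores word order, which is one reason studies usually report it alongside an n-gram metric such as BLEU.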
In our report on the business case for machine translation, we said that buyers and vendors alike would benefit from practical benchmarks, conducted by an independent and objective body -- something like the database industry’s Transaction Processing Performance Council (TPC), a trusted body that defines and delivers transaction processing and database management assessments based on test suites that reflect real-world use cases. The TPC analog could be LISA or TAUS working in conjunction with a standards body such as NIST. For now, this new study from one of the industry's suppliers will help to advance the discussion.
Keywords: Machine translation, Terminology management, Translation memory, Translation technologies