SMT & Data Quality
Understanding Data Quality Issues
Kirti Vashee, kirti.vashee@asiaonline.net

Making Machine Translation Work for You
Customization is Key to Quality

SMT utilizes existing linguistic resources to create customer-specific and domain-focused systems, including:
  • All legacy TM, cleaned and normalized
  • Dictionaries and glossaries
  • Old versions of manuals
  • Examples of high-quality monolingual data

[Diagram: Reference, Monolingual, and Bilingual data feed a Customized Translation System]

A Custom Engine is Only as Good as the Data Used
The more clean, high-quality, in-domain data a custom engine is built with, the higher the quality of the translation output.

5 Rules For Creating “GREAT” Custom Engines
Golden Rule: A smaller amount of clean, high-quality data is better than a larger amount of low-quality data.

Benefits of clean data:
  • Fewer post-edits required on translation output
  • Faster engine maturity

Data quality issues:
  • Variable markers
  • Custom tags, HTML, XML, Rich Text, etc.
  • Telegraphic style text (e.g. “pilot crash lands plane” vs. “the pilot crash landed the plane”)
  • Poor quality translations
  • Misaligned segments
  • Misclassified content (out of domain)
  • Mixed language content

Building From a Foundation
Language Pair Foundation Data + Domain Foundation Data + Client/Custom Data = Custom Engine

What is Foundation Data and its Purpose?
  • Foundation data is a base from which to build a custom engine.
  • Foundation data is not sufficient on its own to deliver a high-quality engine. Custom data is required.
  • Foundation data reduces the amount of data a client needs to provide, lowering the barriers to entry.
  • Asia Online has prepared data for and trained hundreds of foundation engines using foundation data only. These foundation engines:
      • Are not intended as production release engines
      • Are in no way indicative of the quality that a custom engine in the same language pair would deliver
      • Are intended to verify the process and any language-specific handling that is required
      • Will not typically be high quality, as they have not been normalized or focused on a specific purpose
      • Consist mainly of bilingual data
      • Contain limited monolingual data. Monolingual data is a key part of customization, and every client has a different desired grammatical style.
Add your custom data to foundation data to get quality.

Data Used to Build a Custom Engine

1. Bilingual Source and Target Language
  Pre-aligned:
  • Dictionaries & glossaries
  • Translation memories (TMX, XLIFF, CSV, etc.)
  Non-aligned:
  • HTML, XML, MS Word, plain text, etc.
  Minimum: 20,000 segments
  Recommended minimum: 100,000+ segments
  Ideal: 500,000+ segments (the more in-domain text, the better)

2. Monolingual Target Language
  • Documents in target language
  • URLs of similar style and grammar in target language
  Minimum: 500MB of plain text after cleaning
  Recommended minimum: 1GB+ of plain text after cleaning
  Ideal: 3-4GB+ of plain text after cleaning

3. Tuning and Test Data
  • “Gold standard” quality translations

  • Examples of what you want the output to look like
  • Guides the engine's optimization strategy
  • Blind test data to evaluate translation quality and quality improvement
  3,000-6,000 segments (can be extracted from existing TMs)
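
Extracting a blind test set from existing TMs can be sketched as follows. This is a minimal illustration, not Asia Online's procedure: the function name `split_blind_test`, the default of 3,000 segments (the lower bound quoted above), and the fixed random seed are all assumptions made for the example. Exact duplicate pairs are collapsed first so that a held-out pair cannot also sit in the training portion.

```python
import random


def split_blind_test(segments, test_size=3000, seed=0):
    """Hold out a blind test set from a list of (source, target) pairs.

    Hypothetical helper: exact duplicate pairs are removed first so that
    no held-out pair also appears in the training portion.
    """
    unique = list(dict.fromkeys(segments))  # de-duplicate, keep first-seen order
    if test_size >= len(unique):
        raise ValueError("not enough unique segments for the requested test set")
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    test_idx = set(rng.sample(range(len(unique)), test_size))
    train = [pair for i, pair in enumerate(unique) if i not in test_idx]
    test = [pair for i, pair in enumerate(unique) if i in test_idx]
    return train, test


if __name__ == "__main__":
    pairs = [(f"source {i}", f"target {i}") for i in range(10)]
    train, test = split_blind_test(pairs, test_size=3)
    print(len(train), len(test))
```

A stricter split would also drop training pairs whose source text matches a test source with a different target; that is left out here to keep the sketch short.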

How SMT Works: Monolingual and Bilingual Data
[Diagram: bilingual and monolingual text sources feed two processing paths]
  • Sources: translated archives, translation memories, dictionaries / glossaries, Internet, user input, human translation
  • Monolingual data (clean, generate): grammar and style, vocabulary choice
  • Bilingual data (clean, align sentences, train custom engine): vocabulary and terminology, word and phrase patterns

Quality Data Makes A Difference
  • Clean and Consistent Data: A statistical engine learns from data in the training corpus. Language Studio Pro™ contains many tools to help ensure that the data is scrubbed clean prior to training.
  • Controlled Data: Fewer translation options for the same source segment, and “clean” translations, lead to better foundation patterns.
  • Common Data: Higher data volume in the same subject area reinforces statistical relationships. Slight variations of the same information add robustness to the systems.
  • Current Data: Ensure that the most current TM is used in the training data. Outdated high-frequency TM can have an undue negative impact on the translation output and should be normalized to current style.
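
The "Controlled Data" point can be checked mechanically by listing source segments that carry more than one distinct translation. The sketch below is only an illustration under stated assumptions: the helper names and the `max_variants` threshold are hypothetical, and sources are compared after trimming whitespace and lowercasing, which is a simplification.

```python
from collections import defaultdict


def translation_variants(pairs):
    """Map each normalized source segment to the set of distinct targets seen."""
    variants = defaultdict(set)
    for source, target in pairs:
        # Assumption: case and surrounding whitespace do not distinguish sources.
        variants[source.strip().lower()].add(target.strip())
    return variants


def inconsistent_sources(pairs, max_variants=1):
    """Return sources with more than `max_variants` distinct translations."""
    return {
        source: sorted(targets)
        for source, targets in translation_variants(pairs).items()
        if len(targets) > max_variants
    }


if __name__ == "__main__":
    tm = [
        ("Save file", "Datei speichern"),
        ("save file ", "Datei sichern"),
        ("Cancel", "Abbrechen"),
    ]
    for source, targets in inconsistent_sources(tm).items():
        print(source, "->", targets)
```

Segments flagged this way are candidates for review, not automatic deletion: some variation is legitimate, and which variant is current is a terminology decision.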

Data Focus Produces Quality

Recommended:
  • Mixing a focused set of bilingual domain data together. Different sources are OK.
  • Providing large enough monolingual data to support grammar structures

Not recommended:
  • Mixing a wide variety of bilingual domain data together
  • Mistaking somewhat related content for content in the same domain. E.g. anti-virus is more in the security domain than the IT domain.
  • Providing insufficient monolingual data to support grammar structures. The more variety in bilingual data, the more monolingual data will be required.

Example of translated output influenced by anti-virus text (security domain) mixed into the IT domain:
  EN: Protect your documents
  PL: Chroni komputer przed dokumentów (Protects your computer from documents)

With “Clean Data” Correction is Possible

With clean data:
  • Typically about 10-20 examples are needed for each clean word or phrase.
  • Each correction has statistical relevance and its impact can be clearly seen.
  • Corrections usually involve adding data to fill gaps. Far less correction of actual errors.
  • Clean data means the cause of errors can be understood and corrected.
  • Concordance is used to create unbiased examples/phrases and ensure scope is covered.

With dirty data:
  • Large volumes of dirty data prohibit manual correction.
  • Individual corrections would not be statistically relevant.
  • Manual corrections would compete against 1,000s of bad examples. It is impractical to create enough examples manually.
  • Understanding the cause of errors is difficult.
  • Dirty data slows training and overall processing time, and requires more resources to process excess data.
  • The only solution is to acquire more dirty data and hope the problem is fixed, but it may get worse or cause new errors.
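
Why a handful of corrections matters in clean data but not in dirty data can be seen with simple arithmetic. The sketch assumes that the engine's preference for a translation tracks its relative frequency in the training data, which is the usual starting point for phrase-based SMT but only one of several factors a real system weighs. The counts are hypothetical: 15 corrective examples (within the 10-20 range above) against 5 competing examples in a clean corpus, versus 1,000 competing examples in a dirty one.

```python
def relative_frequency(correct_count, competing_count):
    """Share of training observations that support the desired translation."""
    total = correct_count + competing_count
    return correct_count / total if total else 0.0


# Hypothetical counts for one phrase.
clean_share = relative_frequency(correct_count=15, competing_count=5)
dirty_share = relative_frequency(correct_count=15, competing_count=1000)

if __name__ == "__main__":
    print(f"clean corpus: {clean_share:.1%} of examples support the correction")
    print(f"dirty corpus: {dirty_share:.1%} of examples support the correction")
```

Under these assumed counts the same 15 examples make up three quarters of the evidence in the clean corpus and under 2% in the dirty one.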

Understanding “Clean Data” for SMT
Good translation memories are not always good for SMT.
  • Consistent terminology: terms and use of terms should be consistent. “The best DB is”, “The best database is”, and “The best RDBMS is” are three variants of the same phrase.
  • Multiple examples of use of terms in training data and language model
  • High quality translations and language model
  • Split at single sentences and phrases
  • Formatting removed
  • Consistent and minimal variables: fewer variables or no variables is better, and industry standardization helps further. Variable forms found in TMs include %1%, $VAR1$, {AGENT_SH}, %1, $1, and <1>.
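
One way to reduce the variety of variable forms is to map them all to a single placeholder before training. The sketch below is an assumption-laden illustration: the regular expression covers only the forms quoted in the slide, and the placeholder `__VAR__` is a hypothetical token, not one defined by any particular toolkit.

```python
import re

# Covers only the variable forms shown in the slide; real TMs will need more.
VARIABLE_PATTERN = re.compile(
    r"""
      %\d+%?                 # percent style, with or without a closing percent sign
    | \$[A-Z][A-Z0-9_]*\$    # named variable wrapped in dollar signs
    | \{[A-Z][A-Z0-9_]*\}    # named variable wrapped in braces
    | \$\d+                  # dollar sign followed by a number
    | <\d+>                  # number wrapped in angle brackets
    """,
    re.VERBOSE,
)


def normalize_variables(text, placeholder="__VAR__"):
    """Replace every recognized variable marker with one placeholder token."""
    return VARIABLE_PATTERN.sub(placeholder, text)


if __name__ == "__main__":
    print(normalize_variables("Agent {AGENT_SH} moved %1 files to $1"))
```

Replacing the markers with one token means the engine sees a single, frequent symbol instead of many rare ones; restoring the original markers after translation is a separate step not shown here.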

Tokenization has to be adapted to handle variables. Formatting such as XML, SGML, \r\n, HTML, and RTF markup is removed. Many translation memories have multiple sentences or are partial phrases. Data cleaning utilities normalize and standardize data prior to consolidation to provide maximum leverage.

The Importance of Clean Data
  • A recent study for TAUS proves conclusively that sharing clean data provides leverage.
  • A smaller amount of clean data can produce better results than datasets even 2X larger.
  • Consistent terminology matters and provides real leverage.
  • Data optimized for TM tools can be “dirty data” for SMT.
See http://www.asiaonline.net/resources/reportID4523.aspx for the full study.
[Chart: BLEU score (axis from 20 to 60) for clean vs. dirty data, comparing Asia Online, Google, and Systran]

Training Data: Volume vs. Quality
* Data optimized for TM tools may often not be suitable for SMT

Relative BLEU score comparisons
  • The datasets that were cleaner at the outset produced better results and tend to benefit and improve consistently from Asia Online’s light cleaning efforts.
  • Dataset A had less data but still produced better results than Dataset B, which had twice the data volume.
  • “Dirty” and noisy data has unpredictable results and is much harder to correct and improve.
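
For readers unfamiliar with the metric, BLEU compares engine output against reference translations using clipped n-gram precision and a brevity penalty. The following is a simplified sketch (whitespace tokenization, one reference per segment, no smoothing) and is not the scoring setup used in the study, which the slides do not describe; its scores are not comparable to published figures.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Count the n-grams of length n in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def corpus_bleu(hypotheses, references, max_n=4):
    """Simplified corpus-level BLEU (0-100), one reference per hypothesis."""
    matches = [0] * max_n
    totals = [0] * max_n
    hyp_len = ref_len = 0
    for hyp, ref in zip(hypotheses, references):
        hyp_tokens, ref_tokens = hyp.split(), ref.split()
        hyp_len += len(hyp_tokens)
        ref_len += len(ref_tokens)
        for n in range(1, max_n + 1):
            hyp_counts = ngrams(hyp_tokens, n)
            ref_counts = ngrams(ref_tokens, n)
            # Clipping: an n-gram is credited at most as often as the reference has it.
            matches[n - 1] += sum((hyp_counts & ref_counts).values())
            totals[n - 1] += sum(hyp_counts.values())
    if hyp_len == 0 or min(matches) == 0:
        return 0.0  # without smoothing, any zero precision gives a zero score
    log_precision = sum(math.log(m / t) for m, t in zip(matches, totals)) / max_n
    brevity_penalty = 1.0 if hyp_len > ref_len else math.exp(1 - ref_len / hyp_len)
    return 100 * brevity_penalty * math.exp(log_precision)


if __name__ == "__main__":
    reference = ["the pilot crash landed the plane on the runway"]
    output = ["the pilot crash landed the plane on a road"]
    print(round(corpus_bleu(output, reference), 1))
```

Because the score depends on tokenization and on the reference set, BLEU is best read as a relative measure between engines scored on the same test data, which is how the comparison above uses it.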

Key Observations
  • Consolidating clean data results in better quality SMT systems.
  • Some TM tool optimized data may be considered dirty for SMT.
  • Data cleaning is a critical and necessary step for high quality SMT engines.
  • Consistent terminology produces significant benefits in SMT.
  • Normalization of formatting and terminology will boost SMT engine quality.
  • Introducing known dirty data can reduce SMT engine quality.
  • Smaller amounts of clean data can outperform systems built with as much as 2X dirty data.
  • Systems built with clean data and consistent terminology tend to perform better and improve faster.
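
A first-pass cleaning step of the kind these observations call for might look like the sketch below. It is an illustration only, not the Language Studio Pro™ tooling: the function names are hypothetical, the tag pattern handles only simple angle-bracket markup, and the length-ratio threshold of 3.0 is an assumed value that would need tuning per language pair.

```python
import re

TAG_PATTERN = re.compile(r"<[^>]+>")  # simple angle-bracket markup only


def strip_tags(text):
    """Remove inline tags and collapse the whitespace they leave behind."""
    return re.sub(r"\s+", " ", TAG_PATTERN.sub(" ", text)).strip()


def looks_misaligned(source, target, max_ratio=3.0):
    """Flag pairs that are empty or whose word counts differ implausibly."""
    source_len, target_len = len(source.split()), len(target.split())
    if source_len == 0 or target_len == 0:
        return True
    return max(source_len, target_len) / min(source_len, target_len) > max_ratio


def clean_pairs(pairs, max_ratio=3.0):
    """Strip formatting from each pair and drop likely misalignments."""
    kept = []
    for source, target in pairs:
        source, target = strip_tags(source), strip_tags(target)
        if not looks_misaligned(source, target, max_ratio):
            kept.append((source, target))
    return kept


if __name__ == "__main__":
    tm = [
        ("<b>Save</b> the file", "Datei <i>speichern</i>"),
        ("Open", "Dies ist ein ganz anderer langer Satz ohne Bezug"),
    ]
    print(clean_pairs(tm))
```

Word-count ratios assume whitespace-delimited languages; for languages written without spaces a character-based ratio would be a more reasonable stand-in.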