SMT & Data Quality

1. SMT & Data Quality: Understanding Data Quality Issues. Kirti Vashee, kirti.vashee@asiaonline.net

2. Making Machine Translation Work for You: Customization is Key to Quality. SMT utilizes existing linguistic resources to create customer-specific and domain-focused systems, including: all legacy TM (cleaned and normalized); dictionaries and glossaries; old versions of manuals; examples of high-quality monolingual data. [Diagram labels: Reference, Monolingual, Bilingual, Customized Translation System]

3. A Custom Engine is Only as Good as the Data Used. The more clean, high-quality, in-domain data a custom engine is built with, the higher the quality of the translation output. Golden Rule (5 Rules For Creating “GREAT” Custom Engines): a smaller amount of clean, high-quality data is better than a larger amount of low-quality data.

4. Fewer post edits required on translation output

5. Faster engine maturity

6. Variable markers

7. Custom tags, HTML, XML, Rich Text etc.

8. Telegraphic style text (e.g. “pilot crash lands plane” vs. “the pilot crash landed the plane”)

9. Poor quality translations

10. Misaligned segments

11. Misclassified content (out of domain)

12. Mixed language content

Building From a Foundation: Language Pair Foundation Data + Domain Foundation Data + Client/Custom Data = Custom Engine

13. What is Foundation Data and its Purpose? Foundation data is a base from which to build a custom engine. It is not sufficient on its own to deliver a high-quality engine; custom data is required. Foundation data reduces the amount of data a client needs to provide, lowering the barriers to entry. Asia Online has prepared data for and trained hundreds of foundation engines using foundation data only. These foundation engines are not intended as production release engines. A foundation engine is in no way indicative of the quality that a custom engine in the same language pair would deliver. Foundation engines are intended to verify the process and any language-specific handling that is required. They will not typically be high quality, as they have not been normalized or focused on a specific purpose. They consist mainly of bilingual data, with limited monolingual data. Monolingual data is a key part of customization, and every client has a different desired grammatical style. Add your custom data to foundation data to get quality.

14. Data Used to Build a Custom Engine. (1) Bilingual data, in the source and target language. Pre-aligned: dictionaries and glossaries; translation memories (TMX, XLIFF, CSV, etc.). Non-aligned: HTML, XML, MS Word, plain text, etc. Minimum: 20,000 segments. Recommended minimum: 100,000+ segments. Ideal: 500,000+ segments; the more in-domain text, the better.

15. (2) Monolingual data, in the target language. Documents in the target language; URLs of content with similar style and grammar in the target language. Minimum: 500MB of plain text after cleaning. Recommended minimum: 1GB+ of plain text after cleaning. Ideal: 3-4GB+ of plain text after cleaning.

16. (3) Tuning and test data. “Gold standard” quality translations.

17. Examples of what you want the output to look like

18. Guides the engine's optimization strategy

19. Blind test data to evaluate translation quality and quality improvement. 3,000-6,000 segments (can be extracted from existing TMs).
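The volume guidelines and the blind test set described above can be sketched in a few lines of Python. This is only an illustration, not Asia Online's tooling: the function names, the tier labels, and the random hold-out strategy are assumptions, and only the segment thresholds come from the slide.

```python
import random

# Segment-count tiers for bilingual data, taken from the slide above.
BILINGUAL_TIERS = [
    (500_000, "ideal"),
    (100_000, "recommended"),
    (20_000, "minimum"),
]


def bilingual_tier(segment_count):
    """Return the highest tier the segment count reaches, or 'below minimum'."""
    for threshold, name in BILINGUAL_TIERS:
        if segment_count >= threshold:
            return name
    return "below minimum"


def hold_out_test_set(segments, test_size=3000, seed=0):
    """Split (source, target) pairs into training data and a blind test set.

    The held-out pairs are removed from the training data, so the engine is
    not evaluated on segments it was trained on (assumed hold-out strategy).
    """
    if test_size >= len(segments):
        raise ValueError("test_size must be smaller than the number of segments")
    shuffled = list(segments)
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps the split repeatable
    return shuffled[test_size:], shuffled[:test_size]
```

For example, `bilingual_tier(150_000)` returns `"recommended"`, and `hold_out_test_set(tm_pairs, test_size=3000)` returns a training list and a 3,000-segment blind test list, where `tm_pairs` is a hypothetical list of segment pairs read from an existing TM.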

20. How SMT Works: Monolingual and Bilingual Data. [Diagram: bilingual and monolingual text sources (translated archives, translation memories, dictionaries/glossaries, Internet, user input, human translation) are cleaned. Monolingual data is generated and supplies grammar and style and vocabulary choice. Bilingual data is sentence-aligned and supplies vocabulary and terminology and word and phrase patterns. Both are used to train the custom engine.]

23. Quality Data Makes A Difference. Clean and Consistent Data: A statistical engine learns from data in the training corpus. Language Studio Pro™ contains many tools to help ensure that the data is scrubbed clean prior to training. Controlled Data: Fewer translation options for the same source segment, and “clean” translations, lead to better foundation patterns. Common Data: Higher data volume in the same subject area reinforces statistical relationships. Slight variations of the same information add robustness to the system. Current Data: Ensure that the most current TM is used in the training data. Outdated high-frequency TM can have an undue negative impact on the translation output and should be normalized to the current style.
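The “controlled data” point, fewer translation options for the same source segment, can be checked mechanically. The sketch below is an assumption about how such a check might look, not a Language Studio Pro™ feature: the function name and the light normalization (case and surrounding whitespace) are illustrative choices.

```python
from collections import defaultdict


def inconsistent_sources(pairs, max_variants=1):
    """Return source segments that have more than max_variants distinct targets.

    pairs is an iterable of (source, target) strings. The result maps each
    offending source segment to a sorted list of its competing translations.
    """
    variants = defaultdict(set)
    for source, target in pairs:
        # Light normalization (an assumption): ignore case and outer whitespace
        # on the source side so trivially different keys are grouped together.
        variants[source.strip().lower()].add(target.strip())
    return {s: sorted(t) for s, t in variants.items() if len(t) > max_variants}
```

Run over a TM, this lists the segments a reviewer would need to reconcile to a single preferred translation before training. With the made-up pairs `("Save file", "Datei speichern")` and `("Save file", "Datei sichern")`, the source “save file” is reported with both targets.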

24. Data Focus Produces Quality. Recommended: mixing a focused set of bilingual domain data together (different sources are OK); providing enough monolingual data to support grammar structures.

25. Not recommended: mixing a wide variety of bilingual domain data together.

26. Do not mistake somewhat related content for content in the same domain, e.g. anti-virus is more in the security domain than the IT domain. Also not recommended: providing insufficient monolingual data to support grammar structures. The more variety in the bilingual data, the more monolingual data will be required. Example of translated output influenced by anti-virus text (security domain) mixed into the IT domain: EN “Protect your documents” became PL “Chroni komputer przed dokumentów” (“Protects your computer from documents”).

27. With “Clean Data” Correction is Possible. With clean data: Typically about 10-20 examples are needed for each clean word or phrase. Each correction has statistical relevance, and its impact can be clearly seen. Corrections usually involve adding data to fill gaps, with far less correction of actual errors. Clean data means the cause of errors can be understood and corrected. Concordance is used to create unbiased examples and phrases and to ensure the scope is covered. With dirty data: Large volumes of dirty data prohibit manual correction. Individual corrections would not be statistically relevant. Manual corrections would compete against thousands of bad examples, and it is impractical to create enough examples manually. Understanding the cause of errors is difficult. Dirty data slows training and overall processing time and requires more resources to process the excess data. The only solution is to acquire more dirty data and hope the problem is fixed, but it may get worse or cause new errors.
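Why 10-20 corrections carry weight in clean data but vanish in dirty data can be illustrated with simple relative frequencies. This is a deliberate simplification: a real SMT engine combines several models, and the counts of competing examples used below (5 and 2,000) are made-up illustrations rather than figures from the slides.

```python
def corrected_share(corrections, competing_examples):
    """Relative frequency of the corrected translation among all examples.

    corrections: number of added examples that use the desired translation.
    competing_examples: number of existing examples that use another one.
    """
    return corrections / (corrections + competing_examples)


# Clean corpus: 15 corrections against 5 competing examples.
clean = corrected_share(15, 5)       # 0.75, the correction dominates

# Dirty corpus: the same 15 corrections against 2,000 bad examples.
dirty = corrected_share(15, 2000)    # below 0.01, the correction is drowned out
```

Under this simplified view, the same manual effort moves the preferred translation from a minority to a clear majority in the clean corpus, while in the dirty corpus it barely registers.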

28. Understanding “Clean Data” for SMT. Good translation memories are not always good for SMT. Clean data has: consistent terminology; consistent and minimal variables; formatting removed; multiple examples of the use of terms in the training data and language model; high-quality translations and language model; segments split at single sentences and phrases. Example of inconsistent terminology: “The best DB is”, “The best database is”, “The best RDBMS is”. Terms and use of terms should be consistent.

29. Industry standardization helps further. Examples of variable formats: %1%, $VAR1$, {AGENT_SH}, %1, $1, <1>. Fewer variables, or no variables, is better.
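One way to make variables “consistent and minimal” is to map every placeholder style to a single token before training. The sketch below is a minimal illustration under stated assumptions: the regular expressions are guesses at the placeholder styles shown on the slide, and the `__VAR__` token and both function names are hypothetical.

```python
import re

# Patterns for the placeholder styles listed on the slide. The exact set a
# real TM uses is an assumption; extend or narrow it per data source.
VARIABLE_PATTERNS = [
    r"%\d+%",            # %1%
    r"\$[A-Z0-9_]+\$",   # $VAR1$
    r"\{[A-Z0-9_]+\}",   # {AGENT_SH}
    r"%\d+",             # %1
    r"\$\d+",            # $1
    r"<\d+>",            # <1>
]
# Longer forms come first so that "%1%" is not matched as "%1" plus a stray "%".
VARIABLE_RE = re.compile("|".join(VARIABLE_PATTERNS))


def normalize_variables(segment, token="__VAR__"):
    """Replace every recognized placeholder with one consistent token."""
    return VARIABLE_RE.sub(token, segment)


def variables_match(source, target):
    """Check that source and target contain the same number of placeholders."""
    return len(VARIABLE_RE.findall(source)) == len(VARIABLE_RE.findall(target))
```

For example, `normalize_variables("Copy %1 to {AGENT_SH}")` gives `"Copy __VAR__ to __VAR__"`, and `variables_match` can flag segment pairs where a placeholder was dropped in translation.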

30. Tokenization has to be adapted to handle variables. Formatting to remove: XML, SGML, \r\n, HTML, RTF. Many translation memories have multiple sentences or are partial phrases. Data cleaning utilities normalize and standardize data prior to consolidation to provide maximum leverage. The Importance of Clean Data: A recent study for TAUS proves conclusively that sharing clean data provides leverage. A smaller amount of clean data can produce better results than datasets even 2X larger. Consistent terminology matters and provides real leverage. Data optimized for TM tools can be “dirty data” for SMT. See http://www.asiaonline.net/resources/reportID4523.aspx for the full study. [Chart: BLEU score (axis from 20 to 60) for clean vs. dirty data; series: Asia Online, Google, Systran]

31. Training Data: Volume vs. Quality. *Data optimized for TM tools may often not be suitable for SMT.

32. Training Data Analysis

33. Relative BLEU score comparisons. The datasets that were cleaner at the outset produced better results and tended to benefit and improve consistently from Asia Online’s light cleaning efforts. Dataset A had less data but still produced better results than Dataset B, which had twice the data volume. “Dirty” and noisy data has unpredictable results and is much harder to correct and improve.
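For readers unfamiliar with the metric behind these comparisons, here is a minimal sentence-level BLEU sketch in plain Python. It assumes a single reference, whitespace tokenization, and no smoothing, and it returns a value between 0 and 1 (reports commonly scale this to 0-100). It is not the scoring setup used in the study, which is not described in the slides; production evaluations normally compute BLEU over a whole test corpus with an established tool.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Count the n-grams of length n in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def bleu(candidate, reference, max_n=4):
    """Unsmoothed single-reference BLEU for one sentence, in the range 0..1."""
    cand, ref = candidate.split(), reference.split()
    log_precisions = []
    for n in range(1, max_n + 1):
        cand_counts, ref_counts = ngrams(cand, n), ngrams(ref, n)
        # Clipped matches: a candidate n-gram counts at most as often as it
        # appears in the reference.
        overlap = sum(min(c, ref_counts[g]) for g, c in cand_counts.items())
        total = sum(cand_counts.values())
        if total == 0 or overlap == 0:
            return 0.0  # without smoothing, any empty n-gram level gives 0
        log_precisions.append(math.log(overlap / total))
    # Brevity penalty: penalize candidates shorter than the reference.
    bp = 1.0 if len(cand) >= len(ref) else math.exp(1 - len(ref) / len(cand))
    return bp * math.exp(sum(log_precisions) / max_n)
```

An output identical to the reference scores 1.0, one sharing no words with it scores 0.0, and partial matches fall in between, which is what makes relative comparisons between engines trained on different datasets possible.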

34. Key Observations. Consolidating clean data results in better quality SMT systems. Some data optimized for TM tools may be considered dirty for SMT. Data cleaning is a critical and necessary step for high-quality SMT engines. Consistent terminology produces significant benefits in SMT. Normalization of formatting and terminology will boost SMT engine quality. Introducing known dirty data can reduce SMT engine quality. Systems built with smaller amounts of clean data can outperform systems built with as much as twice the amount of dirty data. Systems built with clean data and consistent terminology tend to perform better and improve faster.
