Build Your Own
 Statistical Machine
Translation Engines
            Ruben de la Fuente
About Me

• 4-year degree in translation
• Worked as translator for 10+ years
• Working full time in MT for the past
  year
Agenda


•   Quick comparison with RbMT
•   Fundamentals of SMT
•   Requirements and preparation
•   Using DoMY
Disclaimer



• I’m not saying SMT is better
• I’m not saying SMT is right for you
Statistical Machine Translation

Computer learns to translate through
statistical analysis of alignment in
bilingual corpora
Rule-based Machine Translation

User Dictionaries + Grammar and
translation rules
SMT: Pros and Cons
Pros              Cons

Quick to build    Unpredictable
Cheap             Quick improvements not easy
Fluent
Features of an SMT system

• Translation Model: table of source and
  target phrases, each pair with a
  probability score (accuracy)
• Language Model: list of n-word sequences
  (n-grams) in the target language, each
  with a probability score (fluency)
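A toy sketch of how the two models combine when ranking candidate translations. The phrase table, the bigram probabilities and the function names are all invented for illustration; a real engine such as Moses also weights the models and handles reordering, which this leaves out.

```python
import math

# Hypothetical translation model: (source phrase, target phrase) -> probability.
phrase_table = {
    ("haga clic", "click"): 0.8,
    ("haga clic", "do click"): 0.2,
    ("en enviar", "send"): 0.6,
    ("en enviar", "on send"): 0.4,
}

# Hypothetical bigram language model for the target language.
bigram_lm = {
    ("<s>", "click"): 0.5,
    ("click", "send"): 0.4,
    ("<s>", "do"): 0.1,
    ("do", "click"): 0.2,
    ("click", "on"): 0.3,
    ("on", "send"): 0.1,
    ("send", "</s>"): 0.9,
}

def lm_log_prob(sentence, floor=1e-6):
    """Fluency: sum of log bigram probabilities; unseen bigrams get a small floor."""
    tokens = ["<s>"] + sentence.split() + ["</s>"]
    return sum(math.log(bigram_lm.get(pair, floor)) for pair in zip(tokens, tokens[1:]))

def score(phrase_pairs):
    """Accuracy (translation model) plus fluency (language model), in log space."""
    tm_log_prob = sum(math.log(phrase_table[pair]) for pair in phrase_pairs)
    target = " ".join(target_phrase for _, target_phrase in phrase_pairs)
    return tm_log_prob + lm_log_prob(target)

candidates = [
    [("haga clic", "click"), ("en enviar", "send")],
    [("haga clic", "do click"), ("en enviar", "on send")],
]
best = max(candidates, key=score)
best_target = " ".join(target_phrase for _, target_phrase in best)
print(best_target)  # click send
```

The first candidate wins on both counts: its phrase pairs are more probable and its word sequence is more fluent according to the bigram table.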
Language and Translation Models
• LM (fluency)     • TM (accuracy)
Tokenization and recasing
Tokenization: breaking up text into meaningful units (tokens)
Recasing: lowercase all words

File > file
file? > file ?
file. > file .
File! > file !
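A minimal sketch of these two steps, assuming a simple regular expression is good enough for the examples on the slide. The `tokenize` function is illustrative only and is not the tokenizer that Moses or DoMY uses.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and split punctuation off as separate tokens."""
    return re.findall(r"\w+|[^\w\s]", text.lower())

raw = "File file? file. File!"

# Without normalization, every variant is counted as a different token.
print(Counter(raw.split()))

# After tokenization and lowercasing, all four variants count as "file".
print(Counter(tokenize(raw)))
```

Counting the same word under one form is what keeps the probability estimates from being split across spelling variants.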
Requirements: Computing


• PC with 4 GB RAM needed
• Ubuntu 10.04 64-bit OS
• Virtual Machine OK
Requirements: size

MS Translator Hub recommends at least
10k segments
I have gotten good results with 100-200k
segments
That is roughly a corpus of over 1 million words
Publicly Available Corpora

• Opus (ECB, EMA, OpenOffice)
• Acquis Communautaire
• Europarl
• Hansard
• Multilingual websites: Bitextor
Bitextor is Cunning

www.mywebsite.com/en/overview.html
www.mywebsite.com/es/overview.html
<title>My source text</title>
<title>My target text</title>
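The URL pattern above suggests one simple heuristic for finding candidate page pairs: two URLs that differ only in their language folder are probably translations of each other. The sketch below illustrates that idea only. It is not Bitextor's actual algorithm, and `pair_by_url` and the `en`/`es` language codes are assumptions made for the example.

```python
import re
from collections import defaultdict

# Assumption: the language is encoded as a path folder, /en/ or /es/.
LANG_SEGMENT = re.compile(r"/(en|es)/")

def pair_by_url(urls):
    """Group URLs that are identical once the language folder is masked."""
    groups = defaultdict(dict)
    for url in urls:
        match = LANG_SEGMENT.search(url)
        if match:
            key = LANG_SEGMENT.sub("/{lang}/", url, count=1)
            groups[key][match.group(1)] = url
    return [(g["en"], g["es"]) for g in groups.values() if "en" in g and "es" in g]

urls = [
    "www.mywebsite.com/en/overview.html",
    "www.mywebsite.com/es/overview.html",
    "www.mywebsite.com/en/contact.html",  # no Spanish counterpart, so no pair
]
print(pair_by_url(urls))
```

Paired pages would still need their content aligned segment by segment before they can serve as training data.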
Requirements: relevance


Data needs to be in-domain
Requirements: quality

Garbage in, garbage out
Diagnose your TMs with automated QA
checks (e.g. glossary adherence, length)
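A rough sketch of what two such checks could look like. This is not how CheckMate implements them; the function names, the `max_ratio` threshold and the one-entry glossary are assumptions chosen for illustration.

```python
def length_ratio_ok(source, target, max_ratio=2.0):
    """Flag segment pairs whose word counts differ too much (threshold is an assumption)."""
    s_len, t_len = len(source.split()), len(target.split())
    if s_len == 0 or t_len == 0:
        return False
    return max(s_len, t_len) / min(s_len, t_len) <= max_ratio

def glossary_violations(source, target, glossary):
    """Return glossary entries whose source term appears but whose target term does not."""
    src, tgt = source.lower(), target.lower()
    return [
        (s_term, t_term)
        for s_term, t_term in glossary.items()
        if s_term.lower() in src and t_term.lower() not in tgt
    ]

glossary = {"Send": "Enviar"}  # hypothetical glossary entry
print(glossary_violations("Click Send", "Haga clic en Mandar", glossary))
print(length_ratio_ok("Send", "Haga clic en el botón Enviar"))
```

The glossary check uses plain substring matching, so it would also match "send" inside "sender"; a production check would match on token boundaries.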
CheckMate: General
CheckMate: Length
CheckMate: Terminology
Remove Repetitions
Remove Markup

Markup brings noise to the learning
process
Click <strong>Send</strong>
Haga clic en <strong>Enviar</strong>
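One possible way to strip inline tags like these before training, assuming a regular expression is enough for simple, well-formed markup. The `strip_markup` helper is hypothetical and not part of DoMY.

```python
import re
from html import unescape

TAG = re.compile(r"<[^>]+>")

def strip_markup(segment):
    """Remove inline tags, decode HTML entities and normalize whitespace."""
    text = unescape(TAG.sub("", segment))
    return " ".join(text.split())

print(strip_markup("Click <strong>Send</strong>"))           # Click Send
print(strip_markup("Haga clic en <strong>Enviar</strong>"))  # Haga clic en Enviar
```

Tags are removed without inserting a space, so a tag that stands between two words with no surrounding whitespace (for example a line break tag) would join them; such cases need separate handling.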
Do-Moses-Yourself (DoMY)

Moses: state-of-the-art, widely used
open source SMT toolkit
DoMY: extension of Moses making
installation and configuration easier
Online SMT Portals
letsmt.eu
smartmate.co

Cons: NDA-compliance, availability, speed
DoMY (Basics)

Graphs: import-tmx, clean-LM/TM, build
LM/TM, train, translate.
Ini files: configuration (language pairs,
paths for input and output).
Folder structure: always include
superdomain, domain and subdomain
Folder structure
corpus           graphs
Run from terminal
Edit ini            Command line
Running from GUI
Graphs
Graph        Function                                         Input             Output
Import-tmx   Extract data from tmx files                      Raw               Corpora/sa
Clean-tm     Clean data                                       Corpora/sa        Corpora/ready
Build-lm     Prepares training set for LM                     Corpora/ready     builds
Build-tm     Prepares training set for TM                     Corpora/ready     builds
Train        Trains MT engine                                 Builds            engines
Translate    Translates input files and produces tmx output   Translations/in   Translations/out
Tips for settings

LM: 7-gram
TM: 9-gram
Aligner: Berkeley for distant languages
Troubleshoot

Error message in terminal
Log file in graph folder
DoMT QA
Is Your Engine Good?

A set is excluded from training to be used
for evaluation (598 segments)
With a BLEU score of 0.5 or higher (on a
0-1 scale), the engine is likely to
perform well
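For reference, a simplified corpus-level BLEU can be sketched as follows: clipped n-gram precision for n = 1 to 4, combined by geometric mean and multiplied by a brevity penalty. It assumes whitespace-tokenized text, a single reference per segment and no smoothing, so it is not the evaluation script that DoMY runs. It returns a value between 0 and 1, the scale on which a threshold like 0.5 makes sense.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """Count all n-grams of length n in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def corpus_bleu(hypotheses, references, max_n=4):
    """Simplified BLEU: one reference per hypothesis, uniform weights, no smoothing."""
    matches = [0] * max_n
    totals = [0] * max_n
    hyp_len = ref_len = 0
    for hyp, ref in zip(hypotheses, references):
        hyp_tokens, ref_tokens = hyp.split(), ref.split()
        hyp_len += len(hyp_tokens)
        ref_len += len(ref_tokens)
        for n in range(1, max_n + 1):
            hyp_counts = ngrams(hyp_tokens, n)
            ref_counts = ngrams(ref_tokens, n)
            # Clipped counts: an n-gram is credited at most as often as it occurs in the reference.
            matches[n - 1] += sum(min(count, ref_counts[gram]) for gram, count in hyp_counts.items())
            totals[n - 1] += sum(hyp_counts.values())
    if min(matches) == 0:
        return 0.0
    log_precision = sum(math.log(m / t) for m, t in zip(matches, totals)) / max_n
    brevity_penalty = 1.0 if hyp_len > ref_len else math.exp(1 - ref_len / hyp_len)
    return brevity_penalty * math.exp(log_precision)

machine_output = ["click the send button to continue"]
human_reference = ["click the send button to continue"]
print(corpus_bleu(machine_output, human_reference))
```

The held-out evaluation set plays the role of `human_reference` here: the engine translates the source side and its output is compared with the human translations that were kept out of training.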
Keep Improving

Retrain the engine periodically as more
translation data becomes available
Gather feedback on what needs to be
improved
Statistical PE

• Keep a corpus of raw vs. PE
• Treat them as separate language pairs
• Run them through DoMY
• Create raw vs. PE engine
• 2 engines: source > target, raw > PE
Questions?
Speak now…
Or reach me at:
www.facebook.com/xlation
www.wordbonds.es
@rubendelafuente
http://www.linkedin.com/in/rubendelafuente

Build your own statistical engines

Editor's Notes

  • #11: Why? SMT is based on probability, calculated as the number of occurrences of a given token divided by the total number of tokens. Case and punctuation can disrupt the calculation.
  • #14: To get good results with SMT, you need at least around 10,000 segments
  • #21: Using Olifant from Okapi Framework
  • #29: Clean data: remove empty sentences and sentences that are too long or too short
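
A minimal sketch of the kind of filtering note #29 describes. The `clean_corpus` function and its `min_tokens` and `max_tokens` thresholds are assumptions for illustration; the limits that the DoMY clean graph actually applies are not given in these slides.

```python
def clean_corpus(pairs, min_tokens=1, max_tokens=80):
    """Keep only segment pairs where both sides are non-empty and within the length limits."""
    kept = []
    for source, target in pairs:
        s_len, t_len = len(source.split()), len(target.split())
        if min(s_len, t_len) < min_tokens:
            continue  # empty or too short on at least one side
        if max(s_len, t_len) > max_tokens:
            continue  # too long on at least one side
        kept.append((source, target))
    return kept

pairs = [
    ("Click Send", "Haga clic en Enviar"),
    ("", "Vacío"),                         # empty source side
    ("word " * 100, "palabra " * 100),     # too long
]
print(clean_corpus(pairs))
```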