Statistical Machine Translation for Language Localisation

By Y. Achchuthan 2010/SP/007
Supervised by Mr. K. Sarveswaran
Department of Computer Science, University of Jaffna.

Outline
• Motivation
• Introduction
• Problem Definition
• Methodology
• Architecture Overview & Experimental Setup
• Result
• Discussions
• Conclusion
• Deliverable
• References
• Demo

Motivation
• Statistical Machine Translation (SMT)
• Localisation

Introduction
• Localisation has become an inevitable part of software development.
• Machine Translation systems include Rule-based Machine Translation and Statistical Machine Translation (SMT).
• Several frameworks have been implemented to carry out Machine Translation.
• SMT has a set of defined phases: corpus preparation, language modelling, training, testing and evaluation.

Problem Definition
Study whether Statistical Machine Translation can be used for
Language localisation of software.

Existing Efforts
• Morphological Processing for English-Tamil Statistical Machine Translation
• Applies suffix-separation rules to both languages and evaluates the impact of this pre-processing on the translation quality of phrase-based and hierarchical models, in terms of BLEU score and a small manual evaluation.

Overview
• Corpus Preparation
• Language Modelling
• Word Alignment
• Decoding
• Evaluation

Step 1: Corpus Preparation
• Data Collection
• Data are collected from the language resource files of different open source projects.
• An online Tamil corpus published by Loganathan Ramasamy and Ondrej Bojar is also used.

Source                        Sentences (No. of phrases)
Mozilla Firefox               4,568
Mozilla OS                    3,465
Drupal                        4,544
Moodle                        4,355
Squirrel Mail                 1,116
Tamil Glossary                2,567
Joomla                        4,358
EnTam v2.0 (non technical)    169,871
Table 1: Collected parallel data from the Internet

• Tokenization:
Spaces are inserted between words and punctuation.
Example (before):
smart search: manage search filters
smart search: search filters - new/edit
joomla update
private messages: inbox
private messages: read
private messages: write
Example (after):
smart search : manage search filters
smart search : search filters - new / edit
joomla update
private messages : inbox
private messages : read
private messages : write
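
The tokenization step above can be sketched in a few lines of Python. This is a minimal illustration, not the tokenizer used in the project: the function name `tokenize` and the set of punctuation characters are assumptions chosen to reproduce the example.

```python
import re

def tokenize(line: str) -> str:
    """Insert spaces around punctuation so each mark becomes its own token."""
    # Assumed punctuation set; a real tokenizer handles many more cases.
    spaced = re.sub(r"([:/,;!?()])", r" \1 ", line)
    # Collapse the runs of whitespace created by the substitution.
    return " ".join(spaced.split())

print(tokenize("smart search: search filters - new/edit"))
```

The hyphen is left out of the punctuation set on purpose, so that hyphenated words are not split.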

• True-casing:
Words in each sentence are converted to their most probable casing.
Example (each entry gives a word form and its count out of all occurrences of that word):
எந்த (40/40)
இதத (34/34)
சரியான (26/26)
அதைவடிவம் (1/1)
தட்டச்சியது (2/2)
பியூகெ-பூட்டியில் (1/1)
ந ாக்கும் (1/1)
ெட்டதைக்ெ (1/1)
தனித்த (4/4)
இதைப்பில் (1/1)
ொரைங்ெளால் (2/2)
கசாடுக்ெில் (2/2)
அறிக்தெதய (9/9)
அதைக்ெப்பட்ட (13/13)
preceding (2/2)
system (125/125)
project (20/20)
submit (2/3) / Submit (1/3)
electronic (1/1)
sector (2/2)
earlier (7/7)
threaded (2/2)
super (3/4) / Super (1/4)
registering (2/2)
wait (15/15)
p3p (8/8)
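
A true-casing model of this kind can be sketched as a table of casing counts. The sketch below is a simplified illustration with hypothetical function names; it recases every token, whereas a production true-caser may restrict itself to sentence-initial words.

```python
from collections import Counter, defaultdict

def train_truecaser(sentences):
    """Count the surface forms of each word and keep the most frequent one."""
    counts = defaultdict(Counter)
    for sentence in sentences:
        for token in sentence.split():
            counts[token.lower()][token] += 1
    return {low: forms.most_common(1)[0][0] for low, forms in counts.items()}

def truecase(sentence, model):
    """Replace each token by its most probable casing; unknown tokens are kept."""
    return " ".join(model.get(tok.lower(), tok) for tok in sentence.split())

model = train_truecaser(["submit the form", "submit again", "Submit now"])
print(truecase("Submit the form", model))
```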

• Cleaning:
Long sentences and empty sentences are removed because they can cause
problems in the training pipeline; obviously misaligned sentences are
removed as well.
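
The cleaning rule can be illustrated as a filter over sentence pairs. This is a sketch under stated assumptions: the thresholds `max_len=80` and `max_ratio=9.0` are illustrative values, not the ones used in the project, and a large length ratio is used as a stand-in for "obviously misaligned".

```python
def clean_corpus(pairs, max_len=80, max_ratio=9.0):
    """Drop empty, overlong, and badly length-mismatched sentence pairs."""
    kept = []
    for src, tgt in pairs:
        n_src, n_tgt = len(src.split()), len(tgt.split())
        if n_src == 0 or n_tgt == 0:
            continue  # empty side
        if n_src > max_len or n_tgt > max_len:
            continue  # too long for the training pipeline
        if max(n_src, n_tgt) / min(n_src, n_tgt) > max_ratio:
            continue  # length ratio suggests misalignment
        kept.append((src, tgt))
    return kept
```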

Step 2: Language Modeling
• A Language Model (LM) is used to improve the translation result.
• It is built from text in the target language.
• A language model toolkit estimates n-gram probabilities from a given text corpus.
• IRSTLM and KenLM are used to build the LM.
Example:
ngram 1= 13346
ngram 2= 35419
ngram 3= 11607
ngram 4= 6390
1-grams:
-4.575466 ஏதுவான -0.10647591
-3.7375624 கபாத்தாதனக் -0.369015
-3.2596145 ொட்டுெிறது -1.0157927
-3.8978152 ெட்டுதரதயத் -0.27033526
-4.154526 நதர்ந்கதடுக்ெ -0.10647591
-3.8978152 தங்ெதள -0.12376224
-3.7375624 அனுைதிக்கும் -0.42978552
-4.154526 நைல்நதான்று -0.10647591
-5.135497 சாளரத்ததக் -0.10647591
-5.135497 படங்ெதளச் -0.10647591
2-grams:
-0.97480524 உருக்கள் எண்ணிக்கக -0.0629627
-1.1356568 ககோப்பகங்கள் எண்ணிக்கக -0.10245394
-1.6087823 பதிப்புகள் எண்ணிக்கக -0.10245394
-0.96094394 வகைபட எண்ணிக்கக -0.10245394
-1.2593822 வகைபடங்கள் எண்ணிக்கக -0.10245394
-0.96094394 நிைல்கள் எண்ணிக்கக -0.10245394
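
The listing above follows the ARPA layout: each entry holds a log10 probability, the n-gram itself, and an optional back-off weight. A minimal sketch of how such entries can be read and used to score a bigram with back-off is given below. The helper names and the toy entries are hypothetical, and the sketch assumes every queried word appears among the unigrams.

```python
def parse_arpa_entry(line, order):
    """Split one ARPA entry into (words, (log10 probability, back-off weight))."""
    fields = line.split()
    logprob = float(fields[0])
    words = tuple(fields[1:1 + order])
    # The highest-order n-grams carry no back-off weight; treat it as 0.
    backoff = float(fields[1 + order]) if len(fields) > 1 + order else 0.0
    return words, (logprob, backoff)

def load_entries(lines, order):
    return dict(parse_arpa_entry(line, order) for line in lines)

def bigram_logprob(w1, w2, unigrams, bigrams):
    """log10 p(w2 | w1): use the bigram if listed, else back off to the unigram."""
    if (w1, w2) in bigrams:
        return bigrams[(w1, w2)][0]
    backoff = unigrams[(w1,)][1] if (w1,) in unigrams else 0.0
    return backoff + unigrams[(w2,)][0]

# Toy entries for illustration only.
unigrams = load_entries(["-1.0 a -0.5", "-2.0 b -0.5"], order=1)
bigrams = load_entries(["-0.25 a b"], order=2)
```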

Step 3: Word Alignment
• Phrase extraction and scoring
• Most current phrase-based SMT systems rely on the IBM Models (specifically Model 4) for word alignment. The most popular implementation is GIZA++.
• The algorithm is run in both directions, source to target and target to source.
Example: Word Alignment
# Sentence pair (364) source length 2 target length 3 alignment score : 0.00613603
central control unit
NULL ({ }) தையக் ({ 1 }) ெட்டுப்பாட்டெம் ({ 2 3 })
data declaration
NULL ({ }) தரவுப் ({ 1 }) பிரெடனம் ({ 2 })
data import
NULL ({ }) தரவு ({ 1 }) இறக்குைதி ({ 2 })
Example: Phrase table
cache controller ||| விதரநவெ ெட்டுப்பாட்டெம் ||| 1 0.1875 1 0.0582878 ||| 0-0 1-0 1-1 ||| 1 1 1 |||
center ||| தையம் ||| 0.625 0.625 0.769231 0.555556 ||| 0-0 ||| 16 13 10 ||| |||
central control unit ||| தையக் ெட்டுப்பாட்டெம் ||| 1 0.0390625 1 0.0136171 ||| 0-0 0-1 1-1 2-1 ||| 1 1 1 |||
central control ||| தையக் ெட்டுப்பாட்டு ||| 1 0.75 1 0.0375 ||| 0-0 1-1 ||| 1 1 1 |||
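
Each phrase-table line is a set of fields separated by `|||`: source phrase, target phrase, scores, word alignment and counts. A minimal parsing sketch follows; the function name and the dictionary keys are assumptions for illustration, and the target phrase in the usage line is a placeholder rather than real Tamil.

```python
def parse_phrase_table_line(line):
    """Split one phrase-table line into its fields."""
    fields = [field.strip() for field in line.split("|||")]
    return {
        "source": fields[0],
        "target": fields[1],
        # Four translation scores as floats.
        "scores": [float(x) for x in fields[2].split()],
        # Alignment links as (source index, target index) pairs.
        "alignment": [tuple(int(i) for i in link.split("-"))
                      for link in fields[3].split()],
    }

entry = parse_phrase_table_line(
    "central control ||| X Y ||| 1 0.75 1 0.0375 ||| 0-0 1-1 ||| 1 1 1 |||")
print(entry["scores"], entry["alignment"])
```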

Step 4: Decoding
• Find the translation of a sentence that has the maximum probability
• Probabilistic model for phrase-based translation:
$$e_{best} = \operatorname{argmax}_{e} \prod_{i=1}^{I} \phi(f_i \mid e_i)\; d(start_i - end_{i-1} - 1)\; p_{LM}(e)$$
• Components
• Phrase translation: picking phrase $f_i$ to be translated as phrase $e_i$
• look up score $\phi(f_i \mid e_i)$ from the phrase translation table
• Reordering: the previous phrase ended in $end_{i-1}$, the current phrase starts at $start_i$
• compute $d(start_i - end_{i-1} - 1)$
• Language model: for an n-gram model, need to keep track of the last $n - 1$ words
• compute score $p_{LM}(w_i \mid w_{i-(n-1)}, \ldots, w_{i-1})$ for added words $w_i$
• The Moses toolkit is used for the decoding process.
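
To make the three components concrete, the sketch below scores one candidate translation in log space. It is an illustration only, not the Moses decoder: the function name is hypothetical, the distortion model $d(x) = \alpha^{|x|}$ with `alpha=0.5` is an assumed choice, and the language model score is passed in as a precomputed natural-log value.

```python
import math

def hypothesis_log_score(phrase_pairs, lm_logprob, alpha=0.5):
    """Log of prod_i phi(f_i|e_i) * d(start_i - end_{i-1} - 1) * p_LM(e).

    phrase_pairs: (phi, start, end) per phrase, in target order, where start
    and end are 1-based source positions covered by the phrase.
    """
    total = lm_logprob
    prev_end = 0  # end_0 = 0, so a phrase starting at position 1 costs nothing
    for phi, start, end in phrase_pairs:
        total += math.log(phi)                    # phrase translation score
        distance = start - prev_end - 1
        total += abs(distance) * math.log(alpha)  # assumed d(x) = alpha^|x|
        prev_end = end
    return total

monotone = hypothesis_log_score([(0.5, 1, 2), (0.25, 3, 3)], lm_logprob=0.0)
reordered = hypothesis_log_score([(0.25, 3, 3), (0.5, 1, 2)], lm_logprob=0.0)
```

With identical phrase and language model scores, the monotone order scores higher because it incurs no reordering cost.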

Step 5: Evaluation
• Automatic evaluation
BLEU (BiLingual Evaluation Understudy) is an algorithm for evaluating the quality of text
which has been machine-translated from one natural language to another.
$$BLEU = \min\left(1, \frac{\text{output length}}{\text{reference length}}\right) \left(\prod_{i=1}^{4} precision_i\right)^{\frac{1}{4}}$$
• Human evaluation
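
The BLEU formula given for automatic evaluation can be written out directly. The sketch below scores a single sentence against a single reference with clipped n-gram precision; the function names are hypothetical, and evaluation tools normally compute the score over a whole test set rather than one sentence.

```python
from collections import Counter

def ngrams(tokens, n):
    """Count the n-grams of a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def bleu(output, reference, max_n=4):
    """min(1, output length / reference length) * (prod precision_i)^(1/max_n)."""
    out, ref = output.split(), reference.split()
    if not out or not ref:
        return 0.0
    product = 1.0
    for n in range(1, max_n + 1):
        out_counts, ref_counts = ngrams(out, n), ngrams(ref, n)
        total = sum(out_counts.values())
        if total == 0:
            return 0.0  # output shorter than n tokens
        # Clip each n-gram count by its count in the reference.
        matched = sum(min(c, ref_counts[g]) for g, c in out_counts.items())
        product *= matched / total
    brevity = min(1.0, len(out) / len(ref))
    return brevity * product ** (1.0 / max_n)
```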

Architecture Overview
[Figure: architecture diagram. Components shown: Parallel Corpus; Language Modeling using IRSTLM & KenLM; Word Alignment using GIZA++; Phrase Extraction; Phrase Table; Language Model; Decoder (using Moses toolkit); SMT Server; Web Server with Web Service; input .po File and output Translated .po.]
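
The .po round trip in the architecture can be sketched as follows. This is a simplified illustration, not the project's web service: `translate` is a hypothetical callable standing in for a request to the SMT server, only single-line `msgid`/`msgstr` entries are handled, and quote escaping is ignored.

```python
import re

MSGID = re.compile(r'^msgid "(.*)"$')

def translate_po(po_text, translate):
    """Fill every empty msgstr with the translation of the preceding msgid."""
    out, pending = [], None
    for line in po_text.splitlines():
        match = MSGID.match(line)
        if match:
            pending = match.group(1)
            out.append(line)
        elif line == 'msgstr ""' and pending:
            out.append('msgstr "%s"' % translate(pending))
            pending = None
        else:
            # Header entry (empty msgid), comments, and filled entries pass through.
            out.append(line)
    return "\n".join(out)
```

Passing the translator in as a callable keeps the file handling independent of how the SMT server is reached.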

Result
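[Figure: bar chart (axis 0 to 70) comparing 2-gram, 3-gram and 4-gram language models built with IRSTLM and KenLM. Values as extracted, in order:
14.43, 25.87, 55.06, 12.74, 3.04, 5.56
13.67, 27.72, 56.44, 13.42, 2.61, 6.08
13.57, 28.28, 56.73, 13.64, 2.63, 6.11]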

Discussion
• Unavailability of parallel data
• Variations in collected parallel data
• BLEU scoring is optimized for generic domain

Conclusion
• Localisation can be done using SMT; the results can be improved by collecting more parallel data.
• SMT output is better for a specific domain than for the generic domain.
• Compared to IRSTLM, KenLM performs better.

Deliverable
• Dissertation
• An online interface for Tamil language localisation using SMT
• A web service for Tamil language localisation
• A research article

Future Work
• Test Factored Translation Models
• Study the Evaluation method and word alignment algorithm
• Improve the SMT performance

Selected References
• Zdenek Žabokrtský, Loganathan Ramasamy, Ondrej Bojar. "Morphological Processing for English-Tamil Statistical Machine Translation." 24th International Conference on Computational Linguistics.
• Sripirakas, S.; Weerasinghe, A.R.; Herath, D.L. "Statistical machine translation of systems for Sinhala - Tamil." Advances in ICT for Emerging Regions (ICTer), 2010 International Conference on, pp. 62-68, Sept. 29 2010-Oct. 1 2010.
• Germann, Ulrich. "Building a statistical machine translation system from scratch: how much bang for the buck can we expect?" Proceedings of the Workshop on Data-driven Methods in Machine Translation, Volume 14. Association for Computational Linguistics, 2001.
• Philipp Koehn, Hieu Hoang, Alexandra Birch, Chris Callison-Burch, Marcello Federico, Nicola Bertoldi, Brooke Cowan, Wade Shen, Christine Moran, Richard Zens, Chris Dyer, Ondrej Bojar, Alexandra Constantin, Evan Herbst. "Moses: Open Source Toolkit for Statistical Machine Translation." Annual Meeting of the Association for Computational Linguistics (ACL), demonstration session, Prague, Czech Republic, June 2007.

DEMO
URL: 10.20.10.211/smt
URL: 10.20.10.125/smt
