Statistical Machine Translation for Language Localisation
By Y. Achchuthan 2010/SP/007
Supervised by Mr. K. Sarveswaran
Department of Computer Science, University of Jaffna.
Outline
• Motivation
• Introduction
• Problem Definition
• Methodology
• Architecture Overview & Experimental Setup
• Result
• Discussions
• Conclusion
• Deliverable
• References
• Demo
Motivation
[Diagram: Statistical Machine Translation (SMT) and Localisation]
Introduction
• Localisation has become an inevitable part of software development.
• Machine Translation systems include Rule-based Machine Translation and Statistical Machine Translation (SMT).
• Several frameworks have been implemented to carry out Machine Translation.
• SMT has a set of defined phases: Corpus Preparation, Language Modelling, Training, Testing and Evaluation.
Problem Definition
Study whether Statistical Machine Translation can be used for language localisation of software.
Existing Efforts
• Morphological Processing for English-Tamil Statistical Machine Translation
  • Applies suffix-separation rules to both languages and evaluates the impact of this pre-processing on the translation quality of the phrase-based and hierarchical models, in terms of BLEU score and a small manual evaluation.
Methodology
Overview
Corpus Preparation → Language Modelling → Word Alignment → Decoding → Evaluation
Step 1: Corpus Preparation [1/4]
• Data Collection
  • Data are collected from the language resource files of different open source projects.
  • An online Tamil corpus published by Loganathan Ramasamy and Ondrej Bojar is also used.

Source                        Sentences (no. of phrases)
Mozilla Firefox               4,568
Mozilla OS                    3,465
Drupal                        4,544
Moodle                        4,355
Squirrel Mail                 1,116
Tamil Glossary                2,567
Joomla                        4,358
EnTam v2.0 (non technical)    169,871

Table 1: Collected parallel data from the Internet
Step 1: Corpus Preparation [2/4]
• Tokenization:
This means that spaces have to be inserted between words and punctuation.
Example:
Before tokenization:
smart search: manage search filters
smart search: search filters - new/edit
joomla update
private messages: inbox
private messages: read
private messages: write
After tokenization:
smart search : manage search filters
smart search : search filters - new / edit
joomla update
private messages : inbox
private messages : read
private messages : write
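The tokenization step above can be illustrated with a minimal Python sketch. It is not the tokenizer used in the project: the function name `tokenize` and the small set of punctuation marks are assumptions chosen only to reproduce the example, and a real tokenizer handles many more cases.

```python
import re

# Assumption: only these punctuation marks are split off. A production
# tokenizer covers far more cases (abbreviations, URLs, numbers, ...).
_PUNCT = re.compile(r"([:/,;!?()])")


def tokenize(line: str) -> str:
    """Insert spaces around punctuation and collapse repeated whitespace."""
    spaced = _PUNCT.sub(r" \1 ", line)
    return " ".join(spaced.split())


if __name__ == "__main__":
    for text in ["smart search: search filters - new/edit",
                 "private messages: inbox"]:
        print(tokenize(text))
```

Splitting on whitespace afterwards keeps already separated tokens, such as the hyphen in the example, unchanged.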
Step 1: Corpus Preparation [3/4]
• True-casing:
Words in each sentence are converted to their most probable casing.
Example:
எந்த (40/40)
இதத (34/34)
சரியான (26/26)
அதைவடிவம் (1/1)
தட்டச்சியது (2/2)
பியூகெ-பூட்டியில் (1/1)
ந ாக்கும் (1/1)
ெட்டதைக்ெ (1/1)
தனித்த (4/4)
இதைப்பில் (1/1)
ொரைங்ெளால் (2/2)
கசாடுக்ெில் (2/2)
அறிக்தெதய (9/9)
அதைக்ெப்பட்ட (13/13)
preceding (2/2)
system (125/125)
project (20/20)
submit (2/3) / Submit (1/3)
electronic (1/1)
sector (2/2)
earlier (7/7)
threaded (2/2)
super (3/4) Super (1)
registering (2/2)
wait (15/15)
p3p (8/8)
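The counts in the example above (for instance `submit` seen twice out of three occurrences) suggest how a true-casing model can be built. The following is a simplified sketch under stated assumptions: the names `train_truecaser` and `truecase` are hypothetical, and unlike a full truecaser it does not treat sentence-initial words specially.

```python
from collections import Counter, defaultdict


def train_truecaser(sentences):
    """Map each lowercased word to its most frequent surface form."""
    counts = defaultdict(Counter)
    for sentence in sentences:
        for token in sentence.split():
            counts[token.lower()][token] += 1
    # Keep only the most probable casing of every word.
    return {word: forms.most_common(1)[0][0] for word, forms in counts.items()}


def truecase(sentence, model):
    """Replace every known token by its most probable casing."""
    return " ".join(model.get(tok.lower(), tok) for tok in sentence.split())


if __name__ == "__main__":
    corpus = ["please submit the form", "Submit", "you can submit now"]
    model = train_truecaser(corpus)
    print(truecase("Submit the form", model))
```

Words that never occur in the training corpus are passed through unchanged.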
Step 1: Corpus Preparation [4/4]
• Cleaning:
Long sentences and empty sentences are removed because they can cause
problems in the training pipeline; obviously misaligned sentence pairs are
removed as well.
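A cleaning filter of this kind could look like the sketch below. The thresholds `max_len` and `max_ratio` are illustrative assumptions, not the values used in the project, and the length ratio is only a rough proxy for "obviously misaligned".

```python
def clean_corpus(pairs, max_len=80, max_ratio=9.0):
    """Drop empty, overlong and badly length-mismatched sentence pairs.

    pairs: iterable of (source_sentence, target_sentence) strings.
    max_len and max_ratio are assumed thresholds for illustration.
    """
    kept = []
    for src, tgt in pairs:
        n_src, n_tgt = len(src.split()), len(tgt.split())
        if n_src == 0 or n_tgt == 0:
            continue  # empty sentence on either side
        if n_src > max_len or n_tgt > max_len:
            continue  # too long for the training pipeline
        if max(n_src, n_tgt) / min(n_src, n_tgt) > max_ratio:
            continue  # lengths differ so much that the pair is likely misaligned
        kept.append((src, tgt))
    return kept


if __name__ == "__main__":
    sample = [("joomla update", "x y"), ("", "x"), ("a", "x " * 20)]
    print(clean_corpus(sample))
```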
Step 2: Language Modeling
• The Language Model (LM) is used to improve the translation result
• It is built from text in the target language
• The Language Model toolkit estimates n-gram probabilities from a given text corpus
• IRSTLM and KenLM are used to build the LM
Example:
ngram 1= 13346
ngram 2= 35419
ngram 3= 11607
ngram 4= 6390
1-grams:
-4.575466 ஏதுவான -0.10647591
-3.7375624 கபாத்தாதனக் -0.369015
-3.2596145 ொட்டுெிறது -1.0157927
-3.8978152 ெட்டுதரதயத் -0.27033526
-4.154526 நதர்ந்கதடுக்ெ -0.10647591
-3.8978152 தங்ெதள -0.12376224
-3.7375624 அனுைதிக்கும் -0.42978552
-4.154526 நைல்நதான்று -0.10647591
-5.135497 சாளரத்ததக் -0.10647591
-5.135497 படங்ெதளச் -0.10647591
2-grams:
-0.97480524 உருக்கள் எண்ணிக்கக -0.0629627
-1.1356568 ககோப்பகங்கள் எண்ணிக்கக -0.10245394
-1.6087823 பதிப்புகள் எண்ணிக்கக -0.10245394
-0.96094394 வகைபட எண்ணிக்கக -0.10245394
-1.2593822 வகைபடங்கள் எண்ணிக்கக -0.10245394
-0.96094394 நிைல்கள் எண்ணிக்கக -0.10245394
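In the listing above, the first column of each entry is a base-10 log probability and the trailing column is a base-10 log back-off weight. The sketch below shows how such log probabilities arise for bigrams from raw counts. It is a deliberately simplified maximum-likelihood estimate with no smoothing or back-off, so it is not what IRSTLM or KenLM compute; the function name `train_bigram_lm` is hypothetical.

```python
import math
from collections import Counter


def train_bigram_lm(sentences):
    """Return log10 P(w2 | w1) for every bigram seen in the corpus.

    Unsmoothed maximum-likelihood estimate: count(w1 w2) / count(w1).
    """
    histories, bigrams = Counter(), Counter()
    for sentence in sentences:
        tokens = ["<s>"] + sentence.split() + ["</s>"]
        histories.update(tokens[:-1])            # every token that has a successor
        bigrams.update(zip(tokens, tokens[1:]))  # adjacent token pairs
    return {bg: math.log10(c / histories[bg[0]]) for bg, c in bigrams.items()}


if __name__ == "__main__":
    lm = train_bigram_lm(["a b", "a c"])
    for bigram, logprob in sorted(lm.items()):
        print(f"{logprob:.6f}\t{' '.join(bigram)}")
```

Real toolkits add smoothing so that unseen n-grams still receive a non-zero probability, which is what the back-off weights in the listing are for.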
Step 3: Word Alignment
• Phrase extraction and scoring
• Most current phrase-based SMT systems rely on the IBM Models (specifically Model 4) for word alignment. The most popular implementation is GIZA++.
• The alignment algorithm is run in both directions: source to target and target to source.
Example: Word Alignment
# Sentence pair (364) source length 2 target length 3 alignment score : 0.00613603
central control unit
NULL ({ }) தையக் ({ 1 }) ெட்டுப்பாட்டெம் ({ 2 3 })
# Sentence pair (445) source length 2 target length 2 alignment score : 0.295143
data declaration
NULL ({ }) தரவுப் ({ 1 }) பிரெடனம் ({ 2 })
# Sentence pair (474) source length 2 target length 2 alignment score : 0.151245
data import
NULL ({ }) தரவு ({ 1 }) இறக்குைதி ({ 2 })
Example: Phrase table
cache controller ||| விதரநவெ ெட்டுப்பாட்டெம் ||| 1 0.1875 1 0.0582878 ||| 0-0 1-0 1-1 ||| 1 1 1 |||
center ||| தையம் ||| 0.625 0.625 0.769231 0.555556 ||| 0-0 ||| 16 13 10 |||
central control unit ||| தையக் ெட்டுப்பாட்டெம் ||| 1 0.0390625 1 0.0136171 ||| 0-0 0-1 1-1 2-1 ||| 1 1 1 |||
central control ||| தையக் ெட்டுப்பாட்டு ||| 1 0.75 1 0.0375 ||| 0-0 1-1 ||| 1 1 1 |||
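Running the aligner in both directions yields two alignments that have to be combined before phrases are extracted. The sketch below shows the two simplest combinations, intersection and union. The function name `symmetrize` and the data layout are assumptions; practical systems usually apply a heuristic that lies between the two (grow-diag-final is a common choice), which is not implemented here.

```python
def symmetrize(src_to_tgt, tgt_to_src):
    """Combine two directional word alignments.

    src_to_tgt: set of (src_index, tgt_index) links from the forward run.
    tgt_to_src: set of (tgt_index, src_index) links from the reverse run.
    Returns (intersection, union), both as sets of (src_index, tgt_index).
    """
    forward = set(src_to_tgt)
    # Flip the reverse run so both sets use the same (src, tgt) orientation.
    backward = {(s, t) for (t, s) in tgt_to_src}
    return forward & backward, forward | backward


if __name__ == "__main__":
    forward = {(0, 0), (1, 1), (2, 1)}
    reverse = {(0, 0), (1, 1)}
    high_precision, high_recall = symmetrize(forward, reverse)
    print(sorted(high_precision))
    print(sorted(high_recall))
```

The intersection keeps only links both runs agree on (high precision), while the union keeps every link proposed by either run (high recall).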
Step 4: Decoding
• Find the translation of a sentence that has the maximum probability
• Probabilistic model for phrase-based translation:
\[ e_{best} = \operatorname{argmax}_{e} \prod_{i=1}^{I} \phi(f_i \mid e_i) \, d(start_i - end_{i-1} - 1) \, p_{LM}(e) \]
• Components
  • Phrase translation: picking phrase $f_i$ to be translated as a phrase $e_i$
    • look up score $\phi(f_i \mid e_i)$ from the phrase translation table
  • Reordering: the previous phrase ended in $end_{i-1}$, the current phrase starts at $start_i$
    • compute $d(start_i - end_{i-1} - 1)$
  • Language model: for an n-gram model, need to keep track of the last $n - 1$ words
    • compute score $p_{LM}(w_i \mid w_{i-(n-1)}, \ldots, w_{i-1})$ for added words $w_i$
• The Moses toolkit is used for the decoding process
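To make the model concrete, the sketch below scores a single, already chosen segmentation of a sentence. It is an illustration under assumptions rather than the Moses decoder: the exponential form of the distortion function, the value `alpha = 0.5` and the function names are all hypothetical, and a real decoder searches over many hypotheses instead of scoring one.

```python
import math


def distortion(start_i, end_prev, alpha=0.5):
    """d(start_i - end_{i-1} - 1), assumed here to decay as alpha ** |x|."""
    return alpha ** abs(start_i - end_prev - 1)


def hypothesis_log_score(phrases, lm_prob, alpha=0.5):
    """Log of prod_i phi_i * d(start_i - end_{i-1} - 1), times p_LM(e).

    phrases: list of (phi, start, end) in target order, where start and end
             are 1-based source positions covered by the phrase.
    lm_prob: language model probability of the full output sentence.
    """
    log_score = math.log(lm_prob)
    end_prev = 0  # nothing has been translated before the first phrase
    for phi, start, end in phrases:
        log_score += math.log(phi) + math.log(distortion(start, end_prev, alpha))
        end_prev = end
    return log_score


if __name__ == "__main__":
    monotone = [(0.5, 1, 2), (0.25, 3, 3)]
    reordered = [(0.5, 3, 3), (0.25, 1, 2)]
    print(hypothesis_log_score(monotone, lm_prob=0.1))
    print(hypothesis_log_score(reordered, lm_prob=0.1))
```

With the same phrase and language model scores, the monotone order scores higher because its distortion terms are all equal to 1.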
Step 5: Evaluation
• Automatic evaluation
BLEU (BiLingual Evaluation Understudy) is an algorithm for evaluating the quality of text
that has been machine-translated from one natural language to another.
\[ \mathrm{BLEU} = \min\left(1, \frac{\text{output-length}}{\text{reference-length}}\right) \left( \prod_{i=1}^{4} \text{precision}_i \right)^{\frac{1}{4}} \]
• Human evaluation
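The BLEU formula above can be computed directly for one output sentence and one reference, as in the following sketch. It assumes whitespace tokenization, a single reference and no smoothing, so it may differ from the scores reported by standard evaluation scripts; the function names are hypothetical.

```python
from collections import Counter


def ngrams(tokens, n):
    """Count all n-grams of order n in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def bleu(output, reference, max_n=4):
    """min(1, output-length / reference-length) * (prod precision_i) ** (1/4)."""
    out, ref = output.split(), reference.split()
    if not out or not ref:
        return 0.0
    product = 1.0
    for n in range(1, max_n + 1):
        out_counts, ref_counts = ngrams(out, n), ngrams(ref, n)
        total = sum(out_counts.values())
        if total == 0:
            return 0.0  # output is shorter than n tokens
        # Clipped matches: an n-gram counts at most as often as in the reference.
        matched = sum(min(c, ref_counts[g]) for g, c in out_counts.items())
        product *= matched / total
    brevity = min(1.0, len(out) / len(ref))
    return brevity * product ** (1.0 / max_n)


if __name__ == "__main__":
    print(bleu("a b c d", "a b c d e"))
```

Because the precisions are multiplied, a single n-gram order with no match drives the unsmoothed score to zero, which is one reason BLEU is normally reported over a whole test set rather than per sentence.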
Architecture Overview
[Architecture diagram. Web Server side: .po File, Web Service, Translated .po. SMT Server side: Parallel Corpus; Word Alignment using GIZA++ and Phrase Extraction, producing the Phrase Table; Language Modeling using IRSTLM & KenLM, producing the Language Model; Decoder using the Moses toolkit.]
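The path from a .po file to a translated .po file can be sketched as follows. This is a hypothetical illustration, not the project's web service: `translate_po` and the `translate` callback are assumed names, the callback stands in for a request to the SMT server, and only single-line `msgid`/`msgstr` entries without escaped quotes are handled.

```python
import re


def translate_po(po_text, translate):
    """Fill empty msgstr entries with translate(msgid).

    translate: any callable mapping a source string to its translation;
               here it stands in for a call to the SMT web service.
    """
    out, msgid = [], None
    for line in po_text.splitlines():
        match = re.fullmatch(r'msgid "(.*)"', line)
        if match:
            msgid = match.group(1)
        elif line == 'msgstr ""' and msgid:
            # The header entry has an empty msgid and is left untouched.
            line = f'msgstr "{translate(msgid)}"'
        out.append(line)
    return "\n".join(out)


if __name__ == "__main__":
    sample = 'msgid ""\nmsgstr ""\n\nmsgid "joomla update"\nmsgstr ""'
    # str.upper is a placeholder for the real translation call.
    print(translate_po(sample, str.upper))
```

Entries that already have a translation are kept as they are, so the sketch only fills in what is missing.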
Result
[Bar chart: axis from 0 to 70; series: 2-gram, 3-gram, 4-gram; legend: IRSTLM, KenLM. Data labels in extraction order:]
14.43 25.87 55.06 12.74 3.04 5.56
13.67 27.72 56.44 13.42 2.61 6.08
13.57 28.28 56.73 13.64 2.63 6.11
Discussion
• Unavailability of parallel data
• Variations in collected parallel data
• BLEU scoring is optimized for generic domain
Conclusion
• Localisation can be done using SMT. However, it can be improved if more parallel data can be collected.
• SMT output is better for a specific domain than for the generic domain.
• Compared to IRSTLM, KenLM performs better.
Deliverable
• Dissertation
• An online interface for Tamil language localization using SMT
• A web service for Tamil language localization
• A research article
Future Work
• Test Factored Translation Models
• Study the Evaluation method and word alignment algorithm
• Improve the SMT performance
Selected References
• Zdenek Žabokrtský, Loganathan Ramasamy, Ondrej Bojar. "Morphological Processing for English-Tamil Statistical Machine Translation." 24th International Conference on Computational Linguistics.
• Sripirakas, S.; Weerasinghe, A.R.; Herath, D.L. "Statistical machine translation of systems for Sinhala - Tamil." Advances in ICT for Emerging Regions (ICTer), 2010 International Conference on, pp. 62-68, Sept. 29 2010-Oct. 1 2010.
• Germann, Ulrich. "Building a statistical machine translation system from scratch: how much bang for the buck can we expect?" Proceedings of the workshop on Data-driven methods in machine translation, Volume 14. Association for Computational Linguistics, 2001.
• Philipp Koehn, Hieu Hoang, Alexandra Birch, Chris Callison-Burch, Marcello Federico, Nicola Bertoldi, Brooke Cowan, Wade Shen, Christine Moran, Richard Zens, Chris Dyer, Ondrej Bojar, Alexandra Constantin, Evan Herbst. "Moses: Open Source Toolkit for Statistical Machine Translation." Annual Meeting of the Association for Computational Linguistics (ACL), demonstration session, Prague, Czech Republic, June 2007.
DEMO
URL: 10.20.10.211/smt
URL: 10.20.10.125/smt


Editor's Notes

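  • #20: நீங்கள் இந்த பதிவை தொகுக்க அனுமதிக்கப்படவில்லை . இந்த பதிவை வௌியிடும் உரிமை உங்களிடம் இல்லை . (English: "You are not permitted to edit this post. You do not have the right to publish this post.")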