Latent Class Transliteration
based on Source Language Origin
Masato Hagiwara & Satoshi Sekine
Rakuten Institute of Technology, New York
ACL-HLT 2011, June 21
Objective
• Transliteration
– Phonetic translation between languages with different
writing systems
e.g., flextime / フレックスタイム furekkusutaimu
– Useful for machine translation, spelling variation etc.
• Transliteration models
– Phonetic-based re-writing models
(Knight and Graehl 1998)
– Spelling-based supervised models
(Brill and Moore 2000)
Spelling-based model (Brill and Moore 2000)
• Edit distance
– Substitution, insertion, and deletion each have cost 1
• Alpha-beta model
– Generalization of edit distance to string-to-string substitutions α→β
– Substitution probability: P(α→β)
– Transliteration probability: maximum re-writing probability over all possible partitions
e.g., flextime → furekkusutaimu
P(flextime→furekkusutaimu)
= P(^f→fu)*P(le→re)*P(x→kkusu)*P(ti→tai)*P(me$→mu)
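The maximum over all partitions can be found with dynamic programming over prefix positions of both strings. The following is a minimal sketch, not the authors' implementation: the function name `best_rewrite_prob`, the `max_len` limit on substring length, and the probabilities in `toy_sub_prob` are all assumptions made up for illustration. The boundary markers `^` and `$` are attached to the source side only, as in the example on the slide.

```python
from functools import lru_cache


def best_rewrite_prob(source, target, sub_prob, max_len=5):
    """Maximum product of P(alpha -> beta) over all joint partitions.

    sub_prob maps (alpha, beta) string pairs to probabilities
    (hypothetical table; unseen pairs count as probability 0).
    """
    s = "^" + source + "$"  # boundary markers on the source side
    t = target

    @lru_cache(maxsize=None)
    def best(i, j):
        # Best score for rewriting s[i:] into t[j:].
        if i == len(s) and j == len(t):
            return 1.0
        result = 0.0
        # alpha is non-empty so every step consumes source characters;
        # beta may be empty (a deletion).
        for a in range(i + 1, min(len(s), i + max_len) + 1):
            for b in range(j, min(len(t), j + max_len) + 1):
                p = sub_prob.get((s[i:a], t[j:b]), 0.0)
                if p > 0.0:
                    result = max(result, p * best(a, b))
        return result

    return best(0, 0)


# Made-up probabilities for illustration only (not taken from the paper).
toy_sub_prob = {
    ("^f", "fu"): 0.5,
    ("le", "re"): 0.4,
    ("x", "kkusu"): 0.5,
    ("ti", "tai"): 0.5,
    ("me$", "mu"): 0.5,
}
score = best_rewrite_prob("flextime", "furekkusutaimu", toy_sub_prob)
```

With this toy table only one partition has non-zero probability, so `score` is simply the product of the five substitution probabilities. A practical implementation would probably work in log space to avoid underflow on long strings.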
Multiple Language Origins
• Single models cannot deal with multiple origins
– piaget / ピアジェ piaje (French origin): P(get→ジェ je) ?
– target / ターゲット tāgetto (English origin): P(get→ゲット getto) ?
• Class transliteration model (Li et al. 2007)
– Language detection + switching multiple models
– piaget → French origin → French model
– target → English origin → English model
Issues with the Class Transliteration Model
• Requires training sets tagged with language origins
– Rare especially for proper nouns
• Language origins ≠ transliteration models
– e.g., spaghetti / スパゲティ supageti
Italian origin but can be found in English dictionaries
– e.g., Carl Laemmle / カール・レムリ kāru remuri
German immigrant but listed as an “American” film producer
→ An English transliteration model doesn’t work
Model source language origins as latent classes
Latent Class Transliteration Model
• Proposing “latent class transliteration model”
– Models the “source language origins” as latent classes
– “latent classes” correspond to sets of words with similar
transliteration characteristics
– Trained via the EM algorithm from transliteration pairs
[Figure: class transliteration model (explicit language detection; classes: language, gender) vs. latent class transliteration model (proposed; latent class distribution; class: latent class). s: source, t: target]
Model Training via the EM Algorithm
E step
Log likelihood
M step
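The equations on this slide did not survive extraction. As a hedged sketch only, the standard EM updates for a mixture of per-class alpha-beta models would look as follows. All notation here is assumed rather than taken from the slide: N training pairs (s_n, t_n), K latent classes z with prior π_z, responsibilities γ_{nz}, and f_n(α→β) as the number of times the substitution α→β is used in the segmentation of pair n. The authors' exact formulation may differ, for example in how the substitution probabilities are normalized.

```latex
% Model: mixture over latent classes
P(t \mid s) = \sum_{z=1}^{K} \pi_z \, P(t \mid s, z)

% Log likelihood of the training pairs
\mathcal{L} = \sum_{n=1}^{N} \log \sum_{z=1}^{K} \pi_z \, P(t_n \mid s_n, z)

% E step: responsibility of class z for pair n
\gamma_{nz} = \frac{\pi_z \, P(t_n \mid s_n, z)}
                   {\sum_{z'=1}^{K} \pi_{z'} \, P(t_n \mid s_n, z')}

% M step: re-estimate class priors and per-class substitution probabilities
\pi_z = \frac{1}{N} \sum_{n=1}^{N} \gamma_{nz}
\qquad
P(\alpha \to \beta \mid z) =
  \frac{\sum_{n} \gamma_{nz} \, f_n(\alpha \to \beta)}
       {\sum_{n} \sum_{\beta'} \gamma_{nz} \, f_n(\alpha \to \beta')}
```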
Iterative Learning via EM Algorithm
• Training pairs
– piaget → piaje
– target → taagetto
– …
• E step: transliteration probability based on the αβ model
– Segmented pairs, for each latent class Lx, Ly, Lz
– p/i/a/get→p/i/a/je
– t/ar/get→t/aa/getto
– …
• M step: update the transliteration model of each latent class Lx, Ly, Lz
– e.g., Σγ*f(get$→je)
– P(^p→p), P(ar→aa), P(get$→je), P(get$→getto), …
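One pass of the E step and M step can be sketched in Python as below. This is an illustrative simplification and not the authors' code: it assumes each training pair is already segmented into (α, β) substitutions, whereas the slide recomputes segmentations with the αβ model. The function names `pair_prob` and `em_step`, the per-α normalization, and the initial probabilities are all assumptions.

```python
from collections import defaultdict


def pair_prob(segments, sub_prob_z):
    """Product of substitution probabilities for one segmented pair."""
    p = 1.0
    for seg in segments:
        p *= sub_prob_z.get(seg, 0.0)
    return p


def em_step(pairs, priors, sub_probs):
    """One EM iteration over pre-segmented pairs (a simplifying assumption).

    pairs:     list of segmentations, each a list of (alpha, beta) tuples
    priors:    list with one prior per latent class
    sub_probs: list with one dict {(alpha, beta): prob} per latent class
    """
    num_classes = len(priors)

    # E step: responsibility gamma of each class for each pair.
    gammas = []
    for segments in pairs:
        joint = [priors[z] * pair_prob(segments, sub_probs[z])
                 for z in range(num_classes)]
        total = sum(joint)
        if total > 0.0:
            gammas.append([j / total for j in joint])
        else:  # no class explains the pair: fall back to uniform
            gammas.append([1.0 / num_classes] * num_classes)

    # M step: class priors, then gamma-weighted substitution counts
    # (the "sum of gamma * f" update), normalized per alpha.
    new_priors = [sum(g[z] for g in gammas) / len(pairs)
                  for z in range(num_classes)]
    new_sub_probs = []
    for z in range(num_classes):
        counts = defaultdict(float)
        alpha_totals = defaultdict(float)
        for segments, g in zip(pairs, gammas):
            for alpha, beta in segments:
                counts[(alpha, beta)] += g[z]
                alpha_totals[alpha] += g[z]
        new_sub_probs.append({
            (alpha, beta): c / alpha_totals[alpha]
            for (alpha, beta), c in counts.items()
            if alpha_totals[alpha] > 0.0
        })
    return gammas, new_priors, new_sub_probs


# Toy setup with two latent classes; all numbers are made up.
pairs = [
    [("^p", "p"), ("i", "i"), ("a", "a"), ("get$", "je")],
    [("^t", "t"), ("ar", "aa"), ("get$", "getto")],
]
shared = {("^p", "p"): 1.0, ("i", "i"): 1.0, ("a", "a"): 1.0,
          ("^t", "t"): 1.0, ("ar", "aa"): 1.0}
init_sub_probs = [
    {**shared, ("get$", "je"): 0.9, ("get$", "getto"): 0.1},
    {**shared, ("get$", "je"): 0.1, ("get$", "getto"): 0.9},
]
gammas, new_priors, new_sub_probs = em_step(pairs, [0.5, 0.5], init_sub_probs)
```

In this toy run the first class takes most of the responsibility for piaget → piaje and the second for target → taagetto, which mirrors the intuition that latent classes group words with similar transliteration behavior.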
Experiments
• Estimate correct transliteration for
foreign proper nouns
– Rank the candidates based on probability
Top-10 Mean Reciprocal Rank (MRR)
• Datasets
– Dataset 1: Western person name list
(6,718; de+en+fr)
– Dataset 2: Western proper nouns from Wikipedia
(11,323; +it+es)
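Top-10 MRR averages, over all test items, the reciprocal of the rank at which the correct transliteration appears among the first ten candidates, counting 0 when it does not appear. A minimal sketch follows; the function name `mean_reciprocal_rank` and the toy candidate lists are assumptions for illustration.

```python
def mean_reciprocal_rank(ranked_candidates, gold, k=10):
    """Top-k MRR: mean of 1/rank of the correct answer, 0 if not in the top k.

    ranked_candidates: one list of candidates per test item, best first
    gold:              the correct transliteration for each test item
    """
    total = 0.0
    for candidates, answer in zip(ranked_candidates, gold):
        for rank, candidate in enumerate(candidates[:k], start=1):
            if candidate == answer:
                total += 1.0 / rank
                break
    return total / len(gold)


# Toy example: correct answer at rank 1, at rank 2, and missing.
toy_ranked = [["piaje", "piagetto"], ["tajetto", "taagetto"], ["harura"]]
toy_gold = ["piaje", "taagetto", "iruda"]
toy_mrr = mean_reciprocal_rank(toy_ranked, toy_gold)
```

Here the three items contribute 1, 1/2, and 0, so the toy MRR is 0.5.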
Compared Models
• Alpha-beta method (AB)
• Class transliteration method (SOFT)
• Class transliteration method (HARD)
• Latent class transliteration method
(LATENT – PROPOSED)
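The slides do not spell out how HARD and SOFT differ. One plausible reading, assumed here and not confirmed by the deck, is that HARD uses only the model of the single most probable detected class, while SOFT weights every class model by the detector's posterior. The sketch below illustrates that reading; `hard_score`, `soft_score`, and every number are hypothetical.

```python
def hard_score(source, target, class_posterior, class_models):
    """HARD (assumed): score with the model of the most probable class only."""
    best_class = max(class_posterior, key=class_posterior.get)
    return class_models[best_class](source, target)


def soft_score(source, target, class_posterior, class_models):
    """SOFT (assumed): posterior-weighted sum over all class models."""
    return sum(p * class_models[c](source, target)
               for c, p in class_posterior.items())


# Made-up example: the language detector wrongly favors English for "piaget".
toy_posterior = {"fr": 0.3, "en": 0.7}
toy_models = {
    "fr": lambda s, t: {("piaget", "piaje"): 0.6}.get((s, t), 0.0),
    "en": lambda s, t: {("piaget", "piaje"): 0.01}.get((s, t), 0.0),
}
hard = hard_score("piaget", "piaje", toy_posterior, toy_models)
soft = soft_score("piaget", "piaje", toy_posterior, toy_models)
```

Under this reading, a detection error costs HARD the whole score, while SOFT still recovers some probability mass from the correct class model.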
Results
Performance measure: Top-10 mean reciprocal rank (MRR)

Model                                         | Dataset 1 | Dataset 2
Alpha-beta method (AB)                        | 94.8      | 90.9
Class transliteration method (HARD)           | 90.3      | 89.8
Class transliteration method (SOFT)           | 95.7      | 92.4
Latent class transliteration method (LATENT,  | 95.8      | 92.4
proposed)                                     |           |

• Higher performance of LATENT vs SOFT/HARD
• Performance can be higher depending on the number of latent classes
• Low class detection precision (77.4%)
Error Analysis

Example | SOFT/HARD | LATENT (Proposed)
Felix/フェリックス ferikkusu [en] | ☓ フィリス firisu | ✓
Read/リード riido [en] | ☓ レアード reādo | ✓
Caen/カーン kān [fr] | ☓ シャーン shān | ✓
Laemmle/レムリ remuri [en] | ☓ リアム riamu | ✓
Xavier/ザビア zabia [en] | ☓ ガブリエル gaburieru | ✓
Hilda/イルダ iruda [en] | ☓ ハルラ harura | ☓ ハルラ harura
Conclusion
• Proposed the “latent class
transliteration model”
– Models source language origins as latent classes
– Model estimation from transliterated pairs via the
EM algorithm
– Comparable results vs. models with explicit
language origins
• Future work
– Sources other than Western languages
– Targets other than Japanese
