Embracing Diversity: Searching over multiple languages
Tommaso Teofili
Suneel Marthi
June 12, 2017
Berlin Buzzwords, Berlin, Germany
1
$WhoAreWe
Tommaso Teofili
@tteofili
Software Engineer, Adobe Systems
Member of Apache Software Foundation
PMC Chair, Apache Lucene
Committer and PMC on Apache Joshua, Apache OpenNLP, Apache Jackrabbit
Suneel Marthi
@suneelmarthi
Principal Software Engineer, Office of Technology, Red Hat
Member of Apache Software Foundation
Committer and PMC on Apache Mahout, Apache OpenNLP, Apache Streams
2
Agenda
What is Multi-Lingual Search?
Why Multi-Lingual Search?
What is Statistical Machine Translation?
Overview of Apache Joshua
Dataflow Pipeline
Demo
3
What is Multi-Lingual Search?
4
Searching
over content written in different languages
with users speaking different languages
both
Parallel corpora
Translating queries
Translating documents
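The query-translation option listed above can be pictured with the toy sketch below. It is only an illustration, not the pipeline from the talk: the three documents, the `QUERY_DICTIONARY` lookup table and all function names are made up, and a real system would call a translation decoder rather than a word dictionary.

```python
from collections import defaultdict

# Hypothetical mini-collection: (doc id, language, text).
DOCS = [
    ("d1", "en", "the building is high"),
    ("d2", "de", "das gebäude ist hoch"),
    ("d3", "it", "l'edificio è alto"),
]

# Hypothetical word-level dictionary standing in for a real translator.
QUERY_DICTIONARY = {
    ("building", "de"): "gebäude",
    ("building", "it"): "l'edificio",
}


def build_index(docs):
    """Map each lowercased token to the set of doc ids that contain it."""
    index = defaultdict(set)
    for doc_id, _lang, text in docs:
        for token in text.lower().split():
            index[token].add(doc_id)
    return index


def translate_query(term, target_langs):
    """Return the original term plus every known translation of it."""
    variants = [term]
    for lang in target_langs:
        translated = QUERY_DICTIONARY.get((term, lang))
        if translated is not None:
            variants.append(translated)
    return variants


def multilingual_search(index, term, target_langs):
    """Search with the original term and its translations, then merge hits."""
    hits = set()
    for variant in translate_query(term.lower(), target_langs):
        hits |= index.get(variant, set())
    return sorted(hits)


index = build_index(DOCS)
print(multilingual_search(index, "building", ["de", "it"]))
```

Translating documents instead would move the translation step into `build_index`, trading more work at indexing time for a simpler query path.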
5
Why Multi-Lingual Search?
6
Embracing diversity
Most online tech content is in English
Wikipedia dumps:
en: 62 GB
de: 17 GB
it: 10 GB
A good number of non-English-speaking users
A lot of search queries are composed in English
Preferable to retrieve search results in the user's native language
… or even to consolidate all results in one language
7
UC1 — tech domain, native first
8
UC2 — native only ?
9
What is Machine Translation?
10
Generate translations from statistical models trained on bilingual corpora.
Translation is driven by a probability distribution $p(e \mid f)$
$e$ = string in the target language (English)
$f$ = string in the source language (Spanish)
$\tilde{e} = \arg\max_e p(e \mid f) = \arg\max_e p(f \mid e)\, p(e)$ (by Bayes' rule; $p(f)$ is constant for a given input)
$\tilde{e}$ = best translation, the one with the highest probability
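The argmax above can be sketched in a few lines. The candidate strings and both probability tables are invented for illustration: `language_model[e]` plays the role of p(e) and `translation_model[e]` the role of p(f|e) for one fixed, hypothetical Spanish input f = "la casa".

```python
# Hypothetical scores for a single source string f = "la casa".
language_model = {"the house": 0.6, "house the": 0.1, "the home": 0.3}     # p(e)
translation_model = {"the house": 0.5, "house the": 0.5, "the home": 0.2}  # p(f|e)


def best_translation(candidates, p_e, p_f_given_e):
    """Return the candidate e that maximizes p(f|e) * p(e)."""
    return max(candidates, key=lambda e: p_f_given_e[e] * p_e[e])


candidates = list(language_model)
print(best_translation(candidates, language_model, translation_model))
```

In this toy, "house the" is as good a word-for-word match as "the house", and it is the language model term p(e) that penalizes the bad word order.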
11
Word-based Translation
12
How to translate a word → look it up in a dictionary
Gebäude — building, house, tower.
Multiple translations
some more frequent than others
for instance: house and building most common
13
Look at a parallel corpus
(German text along with English translation)
Translation of Gebäude    Count           Probability
house                     5.28 billion    0.51
building                  4.16 billion    0.402
tower                     9.28 million    0.09
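A table like this is typically obtained by relative frequency: count how often each translation occurs in the parallel corpus and divide by the total. The sketch below shows only that arithmetic, with small made-up counts rather than the figures from the slide.

```python
def translation_probabilities(counts):
    """Turn raw translation counts into relative-frequency estimates."""
    total = sum(counts.values())
    return {word: count / total for word, count in counts.items()}


# Hypothetical counts for translations of one source word.
toy_counts = {"house": 50, "building": 40, "tower": 10}
probabilities = translation_probabilities(toy_counts)

for word, probability in probabilities.items():
    print(f"{word:10s} {probability:.2f}")
```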
14
Alignment
In a parallel text (or when we translate), we align
words in one language with the words in the other
Das   Gebäude    ist   hoch
 ↓       ↓        ↓     ↓
the   building   is    high
Word positions are numbered 1 to 4
15
Alignment Function
Define the Alignment with an Alignment Function
Map each English target word at position i to a German source word at position j with a function a : i → j
Example
a : {1 → 1, 2 → 2, 3 → 3, 4 → 4}
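As a small illustration, the alignment function can be stored as a dictionary from target positions to source positions. The sketch keeps the 1-based numbering of the slide; the helper name `aligned_pairs` is made up.

```python
german = ["Das", "Gebäude", "ist", "hoch"]
english = ["the", "building", "is", "high"]

# a : i -> j, English position i to German position j (1-based, as on the slide).
a = {1: 1, 2: 2, 3: 3, 4: 4}


def aligned_pairs(target, source, alignment):
    """List (target word, source word) pairs in target order."""
    return [(target[i - 1], source[j - 1]) for i, j in sorted(alignment.items())]


print(aligned_pairs(english, german, a))
```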
16
One-to-Many Translation
A source word could translate into multiple target words
Das    ist   ein       Hochhaus
 ↓      ↓     ↓      ↙    ↓    ↘
This   is    a    high  rise  building
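With the same dictionary representation, a one-to-many translation simply means several target positions point at one source position. The alignment below is one reading of the diagram (an assumption), and `fertility` is a hypothetical helper that counts how many target words each source word produces.

```python
from collections import Counter

german = ["Das", "ist", "ein", "Hochhaus"]
english = ["This", "is", "a", "high", "rise", "building"]

# Assumed alignment: "high", "rise" and "building" all map to "Hochhaus".
a = {1: 1, 2: 2, 3: 3, 4: 4, 5: 4, 6: 4}


def fertility(source, alignment):
    """Return (source word, number of aligned target words) in source order."""
    counts = Counter(alignment.values())
    return [(word, counts.get(pos, 0)) for pos, word in enumerate(source, start=1)]


print(fertility(german, a))
```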
17
Phrase-based Translation
18
Word-Based vs. Phrase-Based Models
Word-Based Models translate words as atomic units
Phrase-Based Models translate phrases as atomic units
Advantages:
many-to-many translation can handle non-compositional phrases
use of local context in translation
the more data, the longer phrases can be learned
“Standard Model”, used by Google Translate and others
19
Phrase-Based Model
Berlin ist ein herausragendes Kunst- und Kulturzentrum .
↓ ↓ ↓ ↓ ↓ ↓
Berlin is an outstanding art and cultural center .
Foreign input is segmented into phrases
Each phrase is translated into English
Phrases are reordered
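The three steps above (segment, translate, reorder) can be sketched as follows. The phrase table and its segmentation of the Berlin sentence are assumptions made for the example, and greedy longest-match segmentation is a simplification: a real phrase-based decoder scores many segmentations and orderings.

```python
# Hypothetical phrase table: lowercased source phrase -> target phrase.
PHRASE_TABLE = {
    ("berlin",): "Berlin",
    ("ist",): "is",
    ("ein", "herausragendes"): "an outstanding",
    ("kunst-", "und", "kulturzentrum"): "art and cultural center",
    (".",): ".",
}


def segment(tokens, table, max_len=3):
    """Greedy longest-match segmentation of the input into known phrases."""
    phrases = []
    i = 0
    while i < len(tokens):
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            candidate = tuple(t.lower() for t in tokens[i:i + n])
            if candidate in table:
                phrases.append(candidate)
                i += n
                break
        else:
            raise KeyError(f"no phrase covers {tokens[i]!r}")
    return phrases


def translate(tokens, table, order=None):
    """Segment, translate each phrase, then emit phrases in the given order."""
    translated = [table[p] for p in segment(tokens, table)]
    if order is None:
        order = range(len(translated))  # monotone: keep source order
    return " ".join(translated[k] for k in order)


sentence = "Berlin ist ein herausragendes Kunst- und Kulturzentrum .".split()
print(translate(sentence, PHRASE_TABLE))
```

This sentence happens to translate monotonically, so the default order is enough; passing an explicit `order` list shows where reordering would plug in.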
20
Decoding
21
We have a mathematical model for translation: $p(e \mid f)$
Task of decoding: find the translation $e_{\text{best}}$ with the highest probability
$e_{\text{best}} = \arg\max_e p(e \mid f)$
Two types of error
the most probable translation is bad → fix the model
search does not find the most probable translation → fix the search
22
Translation Process
Translate this query from German into English
er trinkt ja noch nichts
er
↓
he
Pick an input phrase, translate
23
Translation Process
Translate this query from German into English
er trinkt ja noch nichts
er   ja noch nichts
↓          ↓
he   does not yet
Pick an input phrase, translate
24
Translation Process
Translate this query from German into English
er trinkt ja noch nichts
er   trinkt   ja noch nichts
er → he, ja noch nichts → does not yet, trinkt → drink
he   does not yet   drink
Pick an input phrase, translate
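The pick-and-translate loop on these slides can be sketched as a tiny beam-search decoder. Everything numeric here is invented: the phrase probabilities, the `LIKELY_BIGRAMS` stand-in for a language model and the 0.1 penalty are assumptions chosen so the example is easy to trace, and this is not how Apache Joshua scores hypotheses.

```python
# Source: er(0) trinkt(1) ja(2) noch(3) nichts(4)
# Hypothetical phrase options: source span -> [(target phrase, probability)].
PHRASES = {
    (0, 1): [("he", 0.9)],
    (1, 2): [("drink", 0.8)],
    (2, 5): [("does not yet", 0.7)],
}

# Toy language model: these word pairs are "fluent", all others are penalized.
LIKELY_BIGRAMS = {("<s>", "he"), ("he", "does"), ("yet", "drink")}


def decode(n_words, phrases, beam_size=5):
    """Extend hypotheses one phrase at a time until every word is covered."""
    beam = [(1.0, frozenset(), ("<s>",))]  # (score, covered positions, output)
    complete = []
    while beam:
        expanded = []
        for score, covered, output in beam:
            if len(covered) == n_words:
                complete.append((score, output))
                continue
            for (start, end), options in phrases.items():
                span = set(range(start, end))
                if span & covered:
                    continue  # each source word is translated exactly once
                for text, prob in options:
                    words = tuple(text.split())
                    lm = 1.0 if (output[-1], words[0]) in LIKELY_BIGRAMS else 0.1
                    expanded.append((score * prob * lm, covered | span, output + words))
        # Keep only the best few partial translations (the beam).
        beam = sorted(expanded, key=lambda h: h[0], reverse=True)[:beam_size]
    best_score, best_output = max(complete, key=lambda h: h[0])
    return " ".join(best_output[1:])


print(decode(5, PHRASES))
```

Because phrases may be picked out of source order, the decoder can place "drink" last even though "trinkt" is the second source word.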
25
Apache Joshua
26
Statistical Machine Translation Decoder for phrase-based and hierarchical machine translation
Written in Java
Provides 64 language packs for machine translation
https://cwiki.apache.org/confluence/display/JOSHUA/Language+P
Project initiated by Johns Hopkins University and University of Pennsylvania
Presently incubating at the Apache Software Foundation
Used extensively by Amazon.com, NASA JPL
https://cwiki.apache.org/confluence/display/JOSHUA
@ApacheJoshua
27
Flows
28
References
Apache Joshua: https://cwiki.apache.org/confluence/display/JOSHUA
Apache OpenNLP: https://opennlp.apache.org
GitHub: https://github.com/smarthi/BBuzz-multilang-search
Slides: https://smarthi.github.io/bbuzz17-embracing-diversity-searching-over-multiple-languages/#/
29
Credits
Joern Kottmann: PMC Chair, Apache OpenNLP
Matt Post: PMC Chair, Apache Joshua
Bruno P. Kinoshita: Committer on Apache OpenNLP, committer and PMC on Apache Commons and Apache Jena
30
Questions?
31
