How to expand your NLP solution to new
languages using transfer learning
Lena Shakurova
shakurova@textkernel.nl
Beata Nyari, Chao Li, Mihai Rotaru
2019-05-12
What this talk is about
You have an NLP solution for several languages
You want to support more languages
No training data, a lot of raw data
How to expand your solution to new languages using
transfer learning?
CV parsing
Textkernel and CV Parsing
1. Matching people and jobs
2. CV parsing - core of
Textkernel software
3. We solve CV parsing in
three stages
CV Parsing
Section segmentation
Personal section
Experience section
Education section
CV Parsing
Experience 1
Experience 2
Item segmentation
CV Parsing
Organisation
University
Job title
Degree
Job title
Organisation
Location
Location
Phrase extraction
Today’s
presentation
CV Parsing
• Formulated as sequence labeling
(similar to NER)
• BiLSTM + CRF with pre-trained word
embeddings
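As a rough illustration of the sequence-labeling formulation, the sketch below groups BIO tags (the B-/I-/O scheme used in the tagging example of this talk) back into labelled phrases. The function name `bio_to_spans` and the toy input are assumptions made for illustration, not part of the Textkernel system.

```python
def bio_to_spans(tokens, tags):
    """Group BIO tags into (label, text) phrases."""
    spans = []
    current_label, current_tokens = None, []
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            # A B- tag always opens a new phrase; close any open one first.
            if current_label is not None:
                spans.append((current_label, " ".join(current_tokens)))
            current_label, current_tokens = tag[2:], [token]
        elif tag.startswith("I-") and current_label == tag[2:]:
            # An I- tag continues the phrase that is currently open.
            current_tokens.append(token)
        else:
            # "O", or an I- tag that does not continue the open phrase.
            if current_label is not None:
                spans.append((current_label, " ".join(current_tokens)))
            current_label, current_tokens = None, []
    if current_label is not None:
        spans.append((current_label, " ".join(current_tokens)))
    return spans


# Toy input for illustration only.
tokens = ["CV", "Fiona", "Lee", "City", "Paris", "Phone", "0965..."]
tags = ["O", "B-Name", "I-Name", "O", "B-Location", "O", "B-Phone"]
spans = bio_to_spans(tokens, tags)
```

Here `spans` holds one (label, text) pair per extracted phrase, which is the form a downstream CV field would consume.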
Bidirectional LSTM sequence labeling
Huang et al. (2015) Bidirectional LSTM-CRF Models for Sequence Tagging
PyData London 2018: RNN sequence labeling for document parsing in
TensorFlow
Diagram: Words -> Embeddings -> Bi-LSTM layers -> CRF -> Output
Words: CV Fiona Lee City Paris Phone 0965...
Output: O B-Name I-Name O B-Location O B-Phone
CRF transition scores: A0,1 A1,2 ... Ai-1,i Ai,i+1 ... An-1,n
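To make the role of the CRF layer concrete, here is a minimal sketch of Viterbi decoding over per-token tag scores plus a transition matrix A. It is written in plain Python under stated assumptions: the scores and the three-tag set are toy values, and `viterbi` is a name chosen for illustration rather than the code used in the talk.

```python
def viterbi(emissions, transitions):
    """Return the highest-scoring sequence of tag indices.

    emissions[t][j]: score of tag j at position t (e.g. from the Bi-LSTM).
    transitions[i][j]: score of moving from tag i to tag j (the CRF matrix A).
    """
    n_tags = len(transitions)
    score = list(emissions[0])
    backpointers = []
    for t in range(1, len(emissions)):
        new_score, pointers = [], []
        for j in range(n_tags):
            # Best previous tag for ending in tag j at position t.
            best_i = max(range(n_tags), key=lambda i: score[i] + transitions[i][j])
            new_score.append(score[best_i] + transitions[best_i][j] + emissions[t][j])
            pointers.append(best_i)
        score = new_score
        backpointers.append(pointers)
    # Follow the backpointers from the best final tag.
    best = max(range(n_tags), key=lambda j: score[j])
    path = [best]
    for pointers in reversed(backpointers):
        best = pointers[best]
        path.append(best)
    path.reverse()
    return path


tag_names = ["O", "B-Name", "I-Name"]
# Toy scores for the tokens "CV", "Fiona", "Lee" (assumed values).
emissions = [[2.0, 0.0, 0.0], [0.0, 2.0, 0.5], [1.0, 0.0, 0.8]]
# Rows are the previous tag, columns the next tag; O -> I-Name is penalised.
transitions = [[0.0, 0.0, -10.0], [0.0, -1.0, 1.0], [0.0, -1.0, 0.5]]
decoded = [tag_names[i] for i in viterbi(emissions, transitions)]
```

In this toy case the last token scores slightly higher for O than for I-Name on its own, but the transition score from B-Name to I-Name tips the decoded path to O, B-Name, I-Name.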
Our starting point
Issue
• 20 languages
• Separate models for separate languages
• New languages (100+) lack labeled data
Proposed solution: multilingual model
• Implement models for new languages as fast as possible
• Improve performance on low-resource languages
(using transfer learning and cross-lingual embeddings)
Transfer learning with cross-lingual embeddings
Cross-lingual embeddings
• semantically similar
words in the same
language nearby
• translationally
equivalent words in
different languages
nearby
Pre-trained
embeddings
MUSE
• 30 languages in a shared space
• already give good results
Open source
alignment code
Bilingual:
- Vecmap
Multilingual:
- Multilingual Fasttext
- UMWE
- CCA: github.com/gallantlab/pyrcca
(used in our research)
Canonical correlation analysis (CCA)
• Train monolingual word embeddings
• Learn the transformation matrices
using bilingual dictionary
• Map the monolingual spaces into one
shared semantic space in such a way
that translation pairs are maximally
correlated
Faruqui, M., & Dyer, C. (2014)
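The mapping step can be written out as follows. This is a sketch in a column-vector convention, which is an assumption since the slides do not fix one: x and y are the English and German monolingual vectors of a dictionary pair, and v_k, w_k are the k-th rows of the transformation matrices V and W.

```latex
\begin{aligned}
(v_k, w_k) &= \arg\max_{v,\,w}\ \operatorname{corr}\big(v^{\top} x,\ w^{\top} y\big)
  \quad \text{over dictionary pairs } (x_i, y_i),\\
&\qquad \text{with each new component uncorrelated with the previous ones},\\
x^{*} &= V x, \qquad y^{*} = W y .
\end{aligned}
```

After this step the transformed vectors of a translation pair are comparable directly, because both live in the shared space.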
Diagram: the English space Σ and the German space Ω (each with its bilingual dictionary subset) are mapped by the transformation matrices V (English) and W (German) to the transformed English space Σ* and the transformed German space Ω*, which form the shared space.
Canonical correlation analysis (CCA)
• Ω* and Σ* lie in the same space
• Ω* can be projected into the English
embedding space Σ using the inverse
of V:
$\Omega^{**} = V^{-1}\,\Omega^{*}$
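A toy numeric sketch of this projection follows, in plain Python. The 2x2 matrices V and W are hypothetical stand-ins (in practice they would come from CCA), and the sketch assumes V is square and invertible, which the formula itself requires.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a column vector."""
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]


def inverse_2x2(matrix):
    """Invert a 2x2 matrix; raises if it is singular."""
    (a, b), (c, d) = matrix
    det = a * d - b * c
    if det == 0:
        raise ValueError("matrix is not invertible")
    return [[d / det, -b / det], [-c / det, a / det]]


# Hypothetical transformation matrices (assumed values, not learned).
V = [[2.0, 0.0], [0.0, 4.0]]  # English -> shared space
W = [[0.0, 2.0], [4.0, 0.0]]  # German -> shared space

german_vector = [1.0, 3.0]                  # a toy vector from the German space
shared = matvec(W, german_vector)           # its image in the shared space
projected = matvec(inverse_2x2(V), shared)  # the same word in the English space
```

Applying V to `projected` gives back `shared`, which is the sense in which the projected German vector can be looked up alongside the original English embeddings.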
Cross-lingual embeddings
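Diagram: English words (developer, manager, engineer) and German words (entwickler, leiter, ingenieur) in one space.
Bilingual dictionary: developer - entwickler, manager - leiter, engineer - ingenieur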
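One way to sanity-check a shared space is a nearest-neighbour lookup across languages. The sketch below uses cosine similarity on made-up 3-dimensional vectors; the numbers are assumptions chosen so that translation pairs lie close together, not real embeddings.

```python
import math


def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def nearest(query, candidates):
    """Return the candidate word whose vector is most similar to the query."""
    return max(candidates, key=lambda word: cosine(query, candidates[word]))


# Toy vectors in an assumed shared space.
english = {
    "developer": [0.9, 0.1, 0.0],
    "manager": [0.1, 0.9, 0.0],
    "engineer": [0.7, 0.0, 0.7],
}
german = {
    "entwickler": [0.8, 0.2, 0.1],
    "leiter": [0.2, 0.8, 0.0],
    "ingenieur": [0.6, 0.1, 0.8],
}

# For each German word, find the closest English word.
translations = {word: nearest(vector, english) for word, vector in german.items()}
```

With well-aligned embeddings, most dictionary words should retrieve their translation this way; words outside the dictionary are the more interesting check.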
Zero-shot parsing
• Training: English train data + English embeddings -> trained model
• Testing: projected German embeddings -> parsed German CV
Joint training
• Training: English train data + German train data, English embeddings + projected German embeddings -> trained model
• Testing: projected German embeddings -> parsed German CV
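The two setups can be sketched with a deliberately tiny stand-in model. To keep the example short, a nearest-centroid classifier replaces the BiLSTM + CRF used in the talk, and all words and vectors are toy assumptions. The point is only where each embedding table is used.

```python
def centroid(vectors):
    """Average a list of equal-length vectors."""
    return [sum(dim) / len(vectors) for dim in zip(*vectors)]


def squared_distance(u, v):
    return sum((a - b) ** 2 for a, b in zip(u, v))


def train(labelled_tokens, embeddings):
    """Stand-in model: one centroid per label."""
    by_label = {}
    for token, label in labelled_tokens:
        by_label.setdefault(label, []).append(embeddings[token])
    return {label: centroid(vecs) for label, vecs in by_label.items()}


def predict(model, tokens, embeddings):
    """Assign each token the label of the closest centroid."""
    return [
        min(model, key=lambda label: squared_distance(model[label], embeddings[t]))
        for t in tokens
    ]


# Toy embeddings in an assumed shared space.
english_embeddings = {
    "developer": [0.9, 0.1], "engineer": [0.8, 0.2],
    "google": [0.1, 0.9], "ibm": [0.2, 0.8],
}
projected_german_embeddings = {"entwickler": [0.85, 0.2], "siemens": [0.15, 0.9]}

english_train = [
    ("developer", "JobTitle"), ("engineer", "JobTitle"),
    ("google", "Organisation"), ("ibm", "Organisation"),
]

# Zero-shot: train on English only, test with projected German embeddings.
zero_shot_model = train(english_train, english_embeddings)
zero_shot = predict(zero_shot_model, ["entwickler", "siemens"], projected_german_embeddings)

# Joint training: add German labelled data, looked up in the projected table.
german_train = [("siemens", "Organisation")]
shared_embeddings = {**english_embeddings, **projected_german_embeddings}
joint_model = train(english_train + german_train, shared_embeddings)
joint = predict(joint_model, ["entwickler"], projected_german_embeddings)
```

Because both languages share one space, the model never needs to know which language a token came from; only the lookup table changes between training and testing.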
Experiments
Tweaking bilingual dictionary and experiments with no German data
Experimental setup
Task:
• Parse German
• Extract job title and
organisation
Embeddings:
• Trained on domain data
• Word2vec
• CCA
Does transfer learning work for us?
How does the bilingual dictionary influence downstream
performance?
Experimental setup
Training data:
• Zero shot: 3700 English
• Joint training: 3700 English + 200 German
• Joint training: 3700 English + 500 German
• Joint training: 3700 English + 1300 German
Does transfer learning work?
Monolingual
• More German data -> better
performance
Cross-lingual
• Zero-shot parsing works
• Gain from transfer learning
• The more data we have, the
smaller the cross-lingual gain
Chart labels: 75.8 (zero-shot parsing), +4.1, +0.2
How to construct a bilingual dictionary?
1. Use ready bilingual dictionaries:
○ Internet Dictionary Project (IDP)
○ MUSE
■ 110 bilingual dictionaries
■ Created for the development and evaluation of cross-lingual word embeddings
2. Construct your own:
• Use domain data
• Choose English words (frequency, size, filtering)
• Translate into German (Google Translate, Yandex Translate)
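The "construct your own" route can be sketched as below. This is a minimal illustration under stated assumptions: `build_dictionary` and its filtering rules are hypothetical, and the `translate` argument is a stub standing in for a call to a translation service, whose real API is not shown here.

```python
from collections import Counter


def build_dictionary(corpus_tokens, translate, size, min_length=2):
    """Pick the most frequent domain words and pair them with translations.

    translate: a callable returning the translation of a word, or None.
    """
    counts = Counter(token.lower() for token in corpus_tokens)
    pairs = []
    for word, _ in counts.most_common():
        if len(pairs) == size:
            break
        # Filtering (assumed rules): skip very short or non-alphabetic tokens.
        if len(word) < min_length or not word.isalpha():
            continue
        translation = translate(word)
        if translation is None:
            continue
        pairs.append((word, translation))
    return pairs


# Stub translator for illustration; a real setup would query a translation service.
toy_translations = {"developer": "entwickler", "manager": "leiter", "engineer": "ingenieur"}
corpus = "Developer Manager developer 2019 engineer developer manager at".split()
dictionary = build_dictionary(corpus, toy_translations.get, size=2)
```

The `size` argument corresponds to the dictionary size explored in the experiments, and the frequency ordering comes from the domain corpus itself.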
Source of data: IDP vs. MUSE vs. CV
Using a bilingual dictionary built from domain data boosts performance
Zero-shot parsing: IDP 61.5, MUSE 72.1, CV vocabulary 75.8
Joint training: IDP 79.5, MUSE 80.4, CV vocabulary 81.1
Frequency: top vs. less frequent words
Using a bilingual dictionary with top-frequency words boosts
performance
Zero-shot parsing: less frequent 65.6, top frequent 75.7
Joint training: less frequent 80.1, top frequent 80.7
Size of bilingual dictionary: 1k vs. 5k vs. 10k
A bigger bilingual dictionary boosts performance
Zero-shot parsing: 1k 70.4, 5k 76.3, 10k 76.6
Joint training: 1k 80.0, 5k 81.1, 10k 81.4
Bilingual dictionary: what did we learn?
Best practices for constructing a bilingual dictionary:
1. Domain words
2. Frequent words
3. Of size 5k or 10k
The less training data you have available, the more attention you
need to pay to the bilingual dictionary.
Examples
Words from bilingual dictionary
Top 20 closest words
Words outside of bilingual dictionary
Job titles
Top 20 closest words
Locations
Header words
English names
Persoenliche in heldout set
Persoenliche in bilingual dictionary
Other languages?
Dutch to English
Slavic languages
Dutch to English
• Zero-shot parsing works
• Gain from transfer
learning
• The more data we have,
the smaller the cross-lingual gain
Chart labels (Dutch): 79.1 (zero-shot parsing), +6.3, +1.3
Slavic languages: on Russian
Czech: 84.3 (zero-shot parsing), +2.3, +0.1
Polish: 81.9 (zero-shot parsing), +1.4, +0.5
What did we learn?
Summary
• Transfer learning works
• Pretty good results on zero shot
• Cross-lingual gain reduces as we add more data from target
language
• The quality of the bilingual dictionary affects the end task
performance
• The less training data you have available, the more attention you
need to pay to the bilingual dictionary
• Use top 5k most frequent words in your domain corpora
