How to expand your NLP solution to new languages using transfer learning
Lena Shakurova
shakurova@textkernel.nl
Beata Nyari, Chao Li, Mihai Rotaru
2019-05-12

What this talk is about
You have an NLP solution for several languages
You want to support more languages
No training data, a lot of raw data
How to expand your solution to new languages using transfer learning?

Textkernel and CV Parsing
1. Matching people and jobs
2. CV parsing: the core of Textkernel software
3. We solve CV parsing in three stages

CV Parsing
Section segmentation: personal section, experience section, education section

CV Parsing
Item segmentation: Experience 1, Experience 2

CV Parsing
Phrase extraction (today's presentation): organisation, university, job title, degree, location

CV Parsing
• Formulated as a sequence labeling task (similar to NER)
• BiLSTM + CRF with pre-trained word embeddings
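To make the sequence labeling formulation concrete, here is a minimal sketch of how per-token BIO tags can be turned back into extracted phrases. The function name `bio_to_spans` is hypothetical, and the word and tag sequences are one reading of the tagging example shown in the architecture figure of this talk; this is not the Textkernel implementation.

```python
def bio_to_spans(words, tags):
    """Group tokens tagged B-X / I-X into (label, phrase) pairs."""
    spans = []
    label, phrase = None, []
    for word, tag in zip(words, tags):
        if tag.startswith("B-"):
            # A B- tag always opens a new phrase; close the previous one first.
            if label is not None:
                spans.append((label, " ".join(phrase)))
            label, phrase = tag[2:], [word]
        elif tag.startswith("I-") and label == tag[2:]:
            # An I- tag continues the phrase that is currently open.
            phrase.append(word)
        else:
            # An O tag (or an inconsistent I- tag) closes any open phrase.
            if label is not None:
                spans.append((label, " ".join(phrase)))
            label, phrase = None, []
    if label is not None:
        spans.append((label, " ".join(phrase)))
    return spans


words = ["CV", "Fiona", "Lee", "City", "Paris", "Phone", "0965..."]
tags = ["O", "B-Name", "I-Name", "O", "B-Location", "O", "B-Phone"]
spans = bio_to_spans(words, tags)
print(spans)
```

The model only has to predict one tag per token; phrase boundaries follow from the B-/I- prefixes.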

Bidirectional LSTM sequence labeling
Huang et al. (2015), Bidirectional LSTM-CRF Models for Sequence Tagging
PyData London 2018: RNN sequence labeling for document parsing in TensorFlow

Architecture (bottom to top): Words -> Embeddings -> Bi-LSTM layers -> CRF -> Output
Words: CV Fiona Lee City Paris Phone 0965...
Output: O B-Name I-Name O B-Location O B-Phone
CRF transition scores: A0,1 A1,2 ... Ai-1,i Ai,i+1 ... An-1,n
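The CRF layer combines per-token scores from the Bi-LSTM with transition scores between adjacent tags (the A terms in the figure). A minimal sketch of Viterbi decoding is given below, assuming the emission scores are already computed; the numbers are invented for illustration and the function name `viterbi` is not taken from the talk.

```python
def viterbi(emissions, transitions):
    """Return the highest-scoring tag index sequence.

    emissions[i][t]   : score of tag t at position i (from the Bi-LSTM)
    transitions[p][t] : score of moving from tag p to tag t (CRF matrix A)
    """
    n_tags = len(transitions)
    score = list(emissions[0])
    backpointers = []
    for emit in emissions[1:]:
        new_score, pointers = [], []
        for cur in range(n_tags):
            best_prev = max(range(n_tags),
                            key=lambda p: score[p] + transitions[p][cur])
            new_score.append(score[best_prev] + transitions[best_prev][cur] + emit[cur])
            pointers.append(best_prev)
        score = new_score
        backpointers.append(pointers)
    # Follow the back-pointers from the best final tag.
    path = [max(range(n_tags), key=lambda t: score[t])]
    for pointers in reversed(backpointers):
        path.append(pointers[path[-1]])
    path.reverse()
    return path


tag_names = ["O", "B-Name", "I-Name"]
# Illustrative scores for the tokens "CV", "Fiona", "Lee".
emissions = [[2.0, 0.0, 0.0],
             [0.0, 1.0, 2.0],   # taken alone, "Fiona" would be tagged I-Name
             [1.0, 0.0, 1.2]]
# O -> I-Name is heavily penalised, B-Name -> I-Name is rewarded.
transitions = [[0.0, 0.0, -10.0],
               [0.0, 0.0, 1.0],
               [0.0, 0.0, 0.0]]
decoded = [tag_names[t] for t in viterbi(emissions, transitions)]
print(decoded)
```

Picking the best tag per token independently would give O, I-Name, I-Name here; the transition scores steer decoding to a valid O, B-Name, I-Name sequence.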

Issue
• 20 languages
• Separate models for separate languages
• New languages (100+) lack labeled data
Proposed solution: multilingual model
• Implement models for new languages as fast as possible
• Improve performance on low-resource languages (using transfer learning and cross-lingual embeddings)
Transfer learning with cross-lingual embeddings

Cross-lingual embeddings
• Semantically similar words in the same language are nearby
• Translationally equivalent words in different languages are nearby

Pre-trained embeddings
MUSE
• 30 languages in a shared space
• already give good results
Open source alignment code
Bilingual:
- Vecmap
Multilingual:
- Multilingual Fasttext
- UMWE
In our research we used:
- CCA: github.com/gallantlab/pyrcca

Canonical correlation analysis (CCA)
• Train monolingual word embeddings
• Learn the transformation matrices using a bilingual dictionary
• Map the monolingual spaces into one shared semantic space in such a way that translation pairs are maximally correlated
Faruqui, M., & Dyer, C. (2014)
Diagram: Σ (English space; rows for bilingual dictionary words) -> V (transformation matrix, English) -> Σ* (transformed English space)
Diagram: Ω (German space; rows for bilingual dictionary words) -> W (transformation matrix, German) -> Ω* (transformed German space)
Σ* and Ω* lie in the shared space
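The three CCA steps can be sketched with plain NumPy. This is a minimal illustration under stated assumptions, not the pyrcca API and not the exact code used in the research: word vectors are rows, the dictionary pairs are already aligned row by row, `fit_cca` and `inv_sqrt` are hypothetical helper names, and the "German" vectors are synthetic (a rotated, slightly noisy copy of the "English" ones).

```python
import numpy as np


def inv_sqrt(cov):
    """Inverse square root of a symmetric positive definite matrix."""
    vals, vecs = np.linalg.eigh(cov)
    return vecs @ np.diag(1.0 / np.sqrt(vals)) @ vecs.T


def fit_cca(X, Y, k, reg=1e-8):
    """Learn V and W so that columns of (X - mx) @ V and (Y - my) @ W
    are maximally correlated. Rows of X and Y are translation pairs."""
    mx, my = X.mean(axis=0), Y.mean(axis=0)
    Xc, Yc = X - mx, Y - my
    n = X.shape[0]
    cxx = Xc.T @ Xc / (n - 1) + reg * np.eye(X.shape[1])
    cyy = Yc.T @ Yc / (n - 1) + reg * np.eye(Y.shape[1])
    cxy = Xc.T @ Yc / (n - 1)
    kx, ky = inv_sqrt(cxx), inv_sqrt(cyy)
    # SVD of the whitened cross-covariance gives the canonical directions.
    U, corr, Vt = np.linalg.svd(kx @ cxy @ ky)
    V = kx @ U[:, :k]       # transformation matrix for the first language
    W = ky @ Vt.T[:, :k]    # transformation matrix for the second language
    return V, W, corr[:k], mx, my


rng = np.random.default_rng(0)
english = rng.normal(size=(200, 5))                 # stand-in for Σ (dictionary rows)
rotation = np.linalg.qr(rng.normal(size=(5, 5)))[0]
german = english @ rotation + 0.01 * rng.normal(size=(200, 5))  # stand-in for Ω

V, W, corr, mean_en, mean_de = fit_cca(english, german, k=5)
english_star = (english - mean_en) @ V              # Σ*
german_star = (german - mean_de) @ W                # Ω*
print(np.round(corr, 3))
```

In practice V and W are fitted on the dictionary rows only and then applied to the full vocabularies of both languages.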

Canonical correlation analysis (CCA)
• Ω* and Σ* lie in the same space
• Ω* can be projected into the English embedding space Σ using the inverse of V:
Ω** = V^{-1} Ω*
Diagram: Σ -> V -> Σ* and Ω -> W -> Ω*; Σ* and Ω* form the shared space
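A small numerical sketch of the back-projection follows. It assumes a row-vector convention (Σ* = Σ V), in which the slide's formula reads Ω** = Ω* V^{-1}; the matrix V and the vectors are random stand-ins, `project_to_english` is a hypothetical name, and the German vectors are idealised so that each one coincides with its English translation in the shared space.

```python
import numpy as np

rng = np.random.default_rng(0)
V = rng.normal(size=(4, 4))            # stand-in for the English transformation matrix
english = rng.normal(size=(3, 4))      # three English word vectors (rows of Σ)
english_star = english @ V             # Σ* = Σ V
german_star = english_star.copy()      # idealised Ω*: translations coincide with Σ*


def project_to_english(shared_vectors, V):
    """Map shared-space row vectors back into the English space.

    Solves X @ V = shared_vectors for X, which equals shared_vectors @ V^{-1}
    without forming the inverse explicitly.
    """
    return np.linalg.solve(V.T, shared_vectors.T).T


german_in_english = project_to_english(german_star, V)   # Ω**
print(np.allclose(german_in_english, english))
```

After this step the projected German vectors can be fed to a model that was trained on the original English embeddings.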

Cross-lingual embeddings
Bilingual dictionary (English - German):
developer - entwickler
manager - leiter
engineer - ingenieur
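Once both languages live in one space, a translation candidate can be found by nearest-neighbour search. The sketch below uses cosine similarity on invented 3-dimensional vectors; the numbers are purely illustrative and `nearest_english` is a hypothetical helper.

```python
import math


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


# Toy vectors in a shared space (made up for illustration).
english = {
    "developer": [0.90, 0.10, 0.00],
    "manager":   [0.10, 0.90, 0.00],
    "engineer":  [0.60, 0.10, 0.70],
}
german = {
    "entwickler": [0.88, 0.12, 0.05],
    "leiter":     [0.15, 0.85, 0.02],
    "ingenieur":  [0.55, 0.15, 0.72],
}


def nearest_english(german_word):
    """Return the English word whose vector is closest to the German word."""
    vec = german[german_word]
    return max(english, key=lambda w: cosine(vec, english[w]))


matches = {word: nearest_english(word) for word in german}
print(matches)
```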

Zero-shot parsing
Training: English train data + English embeddings -> Trained model
Testing: Projected German embeddings -> Trained model -> Parsed German CV

Joint training
Training: English train data + German train data, English embeddings + Projected German embeddings -> Trained model
Testing: Projected German embeddings -> Trained model -> Parsed German CV
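The difference between the two regimes can be illustrated with a deliberately tiny stand-in model. The sketch below swaps the BiLSTM + CRF for a nearest-centroid classifier over single word vectors, so it only shows the data flow: in zero-shot parsing the model sees English examples only, in joint training it also sees some German examples, and in both cases German words are looked up in the projected embeddings. All vectors, words and labels are invented.

```python
import math
from collections import defaultdict


def train_centroids(labeled_words, embeddings):
    """Average the vectors of the training words for each label."""
    grouped = defaultdict(list)
    for word, label in labeled_words:
        grouped[label].append(embeddings[word])
    return {label: [sum(col) / len(vecs) for col in zip(*vecs)]
            for label, vecs in grouped.items()}


def predict(word, embeddings, centroids):
    """Assign the label of the closest centroid."""
    vec = embeddings[word]
    return min(centroids, key=lambda label: math.dist(vec, centroids[label]))


english_emb = {"developer": [0.9, 0.1], "engineer": [0.8, 0.2],
               "london": [0.1, 0.9], "paris": [0.2, 0.8]}
# German vectors after projection into the English space (toy values).
german_emb = {"entwickler": [0.85, 0.2], "ingenieur": [0.75, 0.25],
              "berlin": [0.15, 0.85], "muenchen": [0.2, 0.9]}

english_train = [("developer", "JobTitle"), ("engineer", "JobTitle"),
                 ("london", "Location"), ("paris", "Location")]
german_train = [("ingenieur", "JobTitle"), ("berlin", "Location")]

# Zero-shot: train on English only, test on German.
zero_shot_model = train_centroids(english_train, english_emb)
zero_shot_pred = [predict(w, german_emb, zero_shot_model)
                  for w in ("entwickler", "muenchen")]

# Joint training: English plus the available German examples.
shared_emb = {**english_emb, **german_emb}
joint_model = train_centroids(english_train + german_train, shared_emb)
joint_pred = [predict(w, german_emb, joint_model)
              for w in ("entwickler", "muenchen")]
print(zero_shot_pred, joint_pred)
```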

Experiments
Tweaking the bilingual dictionary and experiments with no German data

Experimental setup
Task:
• Parse German CVs
• Extract job title and organisation
Embeddings:
• Trained on domain data
• Word2vec
• CCA
Questions:
• Does transfer learning work for us?
• How does the bilingual dictionary influence downstream performance?

Experimental setup
Setting          English   German
Zero shot        3700      none
Joint training   3700      200
Joint training   3700      500
Joint training   3700      1300

Does transfer learning work?
Monolingual
• More German data -> better performance
Cross-lingual
• Zero-shot parsing works
• Gain from transfer learning
• The more data we have, the smaller the cross-lingual gain
Chart labels: zero-shot parsing 75.8; cross-lingual gains +4.1, +0.2

How to construct a bilingual dictionary?
1. Use ready bilingual dictionaries:
○ Internet Dictionary Project (IDP)
○ MUSE
■ 110 bilingual dictionaries
■ Created for the development and evaluation of cross-lingual word embeddings
2. Construct your own:
• Use domain data
• Choose English words (frequency, size, filtering)
• Translate into German (Google Translate, Yandex Translate)
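Constructing your own dictionary can be sketched as follows: count words in the domain corpus, keep the most frequent ones, translate them, and filter the result. The helper names, the tokenisation and the filtering rules (minimum word length, single-word translations only) are assumptions made for illustration. The `translate` argument is a placeholder for whatever translation service is used; here it is replaced by a small lookup table so that the example runs offline.

```python
import re
from collections import Counter


def top_frequent_words(documents, size, min_length=2):
    """Return the `size` most frequent words in the domain corpus."""
    counts = Counter()
    for doc in documents:
        counts.update(w for w in re.findall(r"[a-z]+", doc.lower())
                      if len(w) >= min_length)
    return [word for word, _ in counts.most_common(size)]


def build_dictionary(words, translate):
    """Translate each word and keep only usable single-word translations."""
    pairs = []
    for word in words:
        translation = translate(word)
        if translation and " " not in translation:
            pairs.append((word, translation.lower()))
    return pairs


# Toy English domain corpus and a stand-in translator.
cvs = ["developer manager", "developer manager engineer", "developer"]
fake_translations = {"developer": "Entwickler", "manager": "Leiter"}

frequent = top_frequent_words(cvs, size=3)
dictionary = build_dictionary(frequent, fake_translations.get)
print(frequent)
print(dictionary)
```

Words without a usable translation ("engineer" in this toy run) are simply dropped from the dictionary.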

Source of data: IDP vs. MUSE vs. CV
Using a bilingual dictionary built from domain data boosts performance
Dictionary source   Zero-shot parsing   Joint training
IDP                 61.5                79.5
MUSE                72.1                80.4
CV vocabulary       75.8                81.1

Frequency: top vs. less frequent words
Using a bilingual dictionary with top-frequency words boosts performance
Dictionary words    Zero-shot parsing   Joint training
Less frequent       65.6                80.1
Top frequent        75.7                80.7

Size of bilingual dictionary: 1k vs. 5k vs. 10k
A bigger bilingual dictionary boosts performance
Dictionary size     Zero-shot parsing   Joint training
1k                  70.4                80.0
5k                  76.3                81.1
10k                 76.6                81.4

Bilingual dictionary: what did we learn?
Best practices for constructing a bilingual dictionary:
1. Domain words
2. Frequent words
3. Of size 5k or 10k
The less training data you have available, the more attention you need to pay to the bilingual dictionary.

Words from bilingual dictionary
Top 20 closest words

Words outside of bilingual dictionary
Job titles
Top 20 closest words
Locations

Header words
English names
Persoenliche in heldout set
Persoenliche in bilingual dictionary

Other languages?
Dutch to English
Slavic languages

Dutch to English
• Zero-shot parsing works
• Gain from transfer learning
• The more data we have, the smaller the cross-lingual gain
Chart labels (Dutch): zero-shot parsing 79.1; cross-lingual gains +6.3, +1.3

Slavic languages: on Russian
Chart labels (Czech): zero-shot parsing 84.3; cross-lingual gains +2.3, +0.1
Chart labels (Polish): zero-shot parsing 81.9; cross-lingual gains +1.4, +0.5

Summary
• Transfer learning works
• Pretty good results on zero-shot parsing
• The cross-lingual gain shrinks as we add more data from the target language
• The quality of the bilingual dictionary affects end-task performance
• The less training data you have available, the more attention you need to pay to the bilingual dictionary
• Use the top 5k most frequent words in your domain corpora