Challenges in Transfer
Learning in NLP
MADRID NLP MEETUP
29th May 2019
Lara Olmos Camarena
Olmos Camarena, Lara (2019). “Challenges in Transfer Learning in NLP”
‘Except as otherwise noted, this material is licensed under the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International.’
Contents
Introduction
• Motivation
• Evolution
• First definitions, challenges, frontiers
• Formal definition
Word embeddings
• Word embedding: typical definition
• Pre-trained word embedding
• Conceptual architecture
• Training word embeddings
• Evaluation of word embeddings
Next steps!
• Why? Word embedding limitations and new directions
• Universal Language Model Fine-tuning
• Transformer and BERT
Appendix
• Frameworks related
• NLP&DL: Education
• More references and papers
Introduction
Motivation
Natural Language Processing challenges
• Word enrichment and training meaningful representations require a lot of effort.
• Enterprise tasks deal with specific, domain data of different sizes (small datasets are especially difficult!).
• Text annotation is time-consuming.
Machine Learning models
• Suffer with vector space models (VSM) built from tokens (bag of words), because they are high-dimensional and sparse.
• Loss of sequential and hierarchical information.
• Loss of relations between words and semantics.
Deep Learning and Transfer Learning!
• Transferring information from one machine learning task to another. No need to train from scratch, which saves effort and time.
• “Transfer learning might involve transferring knowledge from the solution of a simpler task to a more complex one, or involve transferring knowledge from a task where there is more data to one where there is less data. Most machine learning systems solve a single task. Transfer learning is a baby step towards artificial intelligence in which a single program can solve multiple tasks.”
Transfer learning definition from: https://developers.google.com/machine-learning/glossary/
More on Transfer Learning! https://towardsdatascience.com/a-comprehensive-hands-on-guide-to-transfer-learning-with-real-world-applications-in-deep-learning-212bf3b2f27a
Evolution
 Language modelling, often with neural network methods, to solve this problem: low-dimensional vectors (embeddings) for the selected textual units: words, phrases or documents.
 Word embeddings obtained as the output of different models and algorithms:
 Random initialization: uniform, Xavier
 Predictive models: word2vec (skip-gram, CBOW)
 Count-based or co-occurrence models: GloVe
 Deep neural network models with different blocks trained: CNN, RNN, LSTM, biLSTM
 Recent and more complex ones: subword-information FastText, biLSTM-based ELMo
 New perspectives in transfer learning in NLP:
 Use pre-trained word embeddings to initialize word vectors in embedding layers or other tasks
 Use pre-trained language models to directly represent text: Prodigy (spaCy), zero-shot learning for classifiers
(Parallel Dots), Universal Sentence Encoder (Google), ULMFiT, BERT and the OpenAI Transformer,
to use later in more complex or task-specific models/systems (see the spaCy sketch below).
NEW!
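As a hedged illustration of the “use pre-trained language models to directly represent text” idea, here is a minimal Python sketch with spaCy; the model name `en_core_web_md` and the example sentences are assumptions, not taken from the slides.

```python
# Minimal sketch (not from the slides): representing text directly with
# pre-trained vectors in spaCy. Assumes `en_core_web_md` has been installed
# via `python -m spacy download en_core_web_md`.
import spacy

nlp = spacy.load("en_core_web_md")

doc = nlp("The branch approved the client's loan")
print(doc.vector.shape)              # pooled (averaged) word vectors for the whole text
print(doc[4].text, doc[4].has_vector)

# Similarity between two texts via their pooled pre-trained vectors.
other = nlp("The office accepted the customer's credit request")
print(doc.similarity(other))
```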
First definitions, challenges, frontiers
Initial context: Neural network models in NLP, catastrophic forgetting
and frontiers
• A review of the recent history of natural language processing - Sebastian Ruder
• Neural language models (2001), multi-task learning (2008), Word embeddings (2013), seq2seq (2014),
• 2018-2019: pretrained language models
• Fighting with catastrophic forgetting in NLP - Matthew Honnibal (2017)
• Frontiers of natural language processing
Transfer learning in NLP divulgation
• Rodríguez, Horacio. Embeddings, Deep Learning, Distributional Semantics en el Procesamiento de la
lengua, ... ¿más que marketing?
• NLP’s ImageNet moment
• ULMFiT, ELMo, and the OpenAI transformer mark one key paradigm shift: going from just initializing the
first layer of our models to pretraining the entire model with hierarchical representations. If learning
word vectors is like only learning edges, these approaches are like learning the full hierarchy of
features, from edges to shapes to high-level semantic concepts.
• Effective transfer learning for NLP – Madison May – Indico
• Transfer learning with language models – Sebastian Ruder
• T3chFest – Transfer learning for NLP and Computer Vision - Pablo Vargas Ibarra, Manuel López Sheriff –
Kernel Analytics
Formal definition
 Neural Transfer Learning for Natural Language Processing – Sebastian Ruder
 Previous work found: “Transfer learning for Speech and Language Processing”
https://arxiv.org/pdf/1511.06066.pdf
Word embeddings
Word embedding: typical definitions
Word embedding is the collective name for a set of language modeling and feature learning techniques in natural language
processing (NLP) where words or phrases from the vocabulary are mapped to vectors of real numbers. Conceptually it involves a
mathematical embedding from a space with one dimension per word to a continuous vector space with a much lower dimension.
Methods to generate this mapping include neural networks, dimensionality reduction on the word co-occurrence
matrix, probabilistic models, explainable knowledge base method, and explicit representation in terms of the context in which
words appear.
Word and phrase embeddings, when used as the underlying input representation, have been shown to boost the performance in
NLP tasks such as syntactic parsing and sentiment analysis.
(Ref.: Wikipedia)
A word embedding is a learned representation from text where words that have the same meaning have a similar
representation.
(Ref.: Machine Learning Mastery)
The Distributional Hypothesis is that words that occur in the same contexts tend to have similar meanings (Harris, 1954). The
underlying idea that "a word is characterized by the company it keeps" was popularized by Firth (1957), and it is implicit in
Weaver's (1955) discussion of word sense disambiguation (originally written as a memorandum, in 1949). The Distributional
Hypothesis is the basis for Statistical Semantics. Although the Distributional Hypothesis originated in Linguistics, it is now
receiving attention in Cognitive Science (McDonald and Ramscar, 2001). The origin and theoretical basis of the Distributional
Hypothesis is discussed by Sahlgren (2008).
(Ref: ACL Wiki)
Pre-trained word embedding
 A pre-trained word embedding helps capture vocabulary from a corpus (e.g. public or web corpora) that is not seen in the task-specific data or under limited training time, and that belongs to general common language.
 The representation allows establishing linear relations and capturing similarities between words from the syntax and contexts seen in the language and domain where the word embedding was trained. Be careful with bias!
 If pre-trained on public and external corpora, some words will not be recognized (the unknown-word problem) when dealing with jargon and slang! Orthographic errors also show up! (See the sketch below.)
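A minimal sketch of the unknown-word problem with a pre-trained embedding, using gensim's KeyedVectors; the file path, the misspelling and the domain term below are illustrative assumptions.

```python
# Minimal sketch: loading a pre-trained word2vec-format embedding with gensim
# (the path "pretrained-vectors.bin" is an assumption, e.g. the Google News vectors).
from gensim.models import KeyedVectors

wv = KeyedVectors.load_word2vec_format("pretrained-vectors.bin", binary=True)

# Linear relations / similarities captured from the pre-training corpus.
print(wv.most_similar("client", topn=3))

# The unknown-word problem: jargon, slang or misspellings are simply missing.
for word in ["client", "clientt", "hipoteca-flex"]:   # illustrative tokens
    if word in wv:
        print(word, "-> known, vector size", wv[word].shape[0])
    else:
        print(word, "-> OOV: needs a fallback (random init, subword model, ...)")
```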
Conceptual Architecture
(common offline/training and online/predict/execution time)
[Slide diagram]
Text entry processed (sentence, short text, long text)
→ Mapping / look-up table: vocabulary → word vectors, e.g. client → [0.29, 0.53, -1.1, 0.23, …, 0.43], branch → [0.39, 0.49, -1.2, 0.42, …, 0.98]
→ Representation obtained with the word embedding, e.g. concatenation as input: [ [0.29, 0.53, -1.1, 0.23, …, 0.43], …, [0.39, 0.49, -1.2, 0.42, …, 0.98] ]
→ Neural network model / unsupervised model / classifiers (…)
→ Response for the specific task (evaluation is applied at each stage)
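The look-up table plus task-specific model above can be sketched with a Keras embedding layer initialized from pre-trained vectors; the tiny vocabulary, the dimensions and the random matrix below are placeholders standing in for a real pre-trained embedding.

```python
# Minimal sketch (placeholder data): an embedding layer as the look-up table,
# initialized from a pre-trained matrix, feeding a task-specific neural model.
import numpy as np
from tensorflow.keras import initializers, layers, models

vocab = {"<pad>": 0, "<unk>": 1, "client": 2, "branch": 3}
embedding_dim = 50
# In practice these rows come from the pre-trained embedding (word2vec, GloVe, ...).
embedding_matrix = np.random.uniform(-0.05, 0.05, (len(vocab), embedding_dim))

model = models.Sequential([
    layers.Embedding(input_dim=len(vocab),
                     output_dim=embedding_dim,
                     embeddings_initializer=initializers.Constant(embedding_matrix),
                     trainable=True),          # or False to keep the vectors fixed
    layers.Bidirectional(layers.LSTM(32)),
    layers.Dense(1, activation="sigmoid"),     # task-specific head, e.g. a classifier
])

# A processed text entry becomes a sequence of vocabulary indices -> word vectors.
x = np.array([[vocab["client"], vocab["branch"], vocab["<pad>"], vocab["<pad>"]]])
print(model(x).shape)   # (1, 1): the response for the specific task
```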
Training word embeddings
 Different word embeddings trained for benchmarking – from the perspective of pre-trained initialization of the embedding layer of our neural network model
 Initialization with random distributions: uniform, Xavier
 Trained with word2vec
 Trained with GloVe
 Trained with TensorFlow with NCE (noise-contrastive estimation)
 Using different corpora for training:
 public corpora (example projects like Billion Words)
 force-context phrases created for supervised synonyms
 domain-specific document corpus
 real user/industry text dataset
GloVe example
Trained with force-context phrases built with the vocabulary
Vocabulary size = 856, window size = 3, vector size = 50
word2vec example
Corpus: force-context phrases, vocabulary size: 856 words
Algorithm: word2vec, word vector size: 50
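A minimal gensim sketch of how the word2vec example above could be trained (gensim 3.x argument names); the tiny corpus is a stand-in for the force-context phrases, and the synonym words are illustrative.

```python
# Minimal sketch (gensim 3.x API): training word2vec on force-context phrases,
# with the slide's hyperparameters (window = 3, vector size = 50).
from gensim.models import Word2Vec

sentences = [
    ["client", "customer", "account", "holder"],   # illustrative force-context phrase
    ["branch", "office", "agency"],
    # ... phrases built from the supervised synonym vocabulary (856 words on the slide)
]

model = Word2Vec(sentences,
                 size=50,       # word vector size (renamed vector_size in gensim 4.x)
                 window=3,
                 min_count=1,
                 sg=1)          # 1 = skip-gram, 0 = CBOW

print(len(model.wv.vocab))                      # vocabulary size
print(model.wv.most_similar("client", topn=3))  # nearest words in the trained space
```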
Evaluation of word embeddings
 Intrinsic proofs: discovering internal word synonyms. Synonyms
trained, are still synonyms after embedding? Which embedding
improves similarity for our task?
 Cualitative evaluation: word kneighbours clusters visualization (t-SNE and PCA),
Brown clustering
 Cuantitative evaluation: similarity test with cosine function and thresold using
 Extrinsic proofs: testing Word embedding improvement when
changing initialization for fix in specific task model.
 Dataset, domain and task dependent
 NLP processing pipeline must match the input sentence and word embedding
construction (stopword removal, normalization, stemming...)
Enso Benchmark - https://github.com/IndicoDataSolutions/EnsoMORE?
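A minimal sketch of the intrinsic checks above: a cosine-similarity test against a threshold plus a t-SNE projection of nearest neighbours. The word pairs, the 0.6 threshold and the trained `model` (the gensim Word2Vec from the previous sketch) are assumptions.

```python
# Minimal sketch: intrinsic evaluation of a trained gensim word2vec model.
import numpy as np
from sklearn.manifold import TSNE

synonym_pairs = [("client", "customer"), ("branch", "office")]  # illustrative test set
threshold = 0.6                                                  # assumed cut-off

for a, b in synonym_pairs:
    sim = model.wv.similarity(a, b)          # cosine similarity
    verdict = "still synonyms" if sim >= threshold else "drifted apart"
    print(a, b, round(float(sim), 3), verdict)

# Qualitative view: 2D t-SNE projection of a word and its nearest neighbours
# (these coordinates are what the t-SNE/PCA cluster plots visualize).
words = ["client"] + [w for w, _ in model.wv.most_similar("client", topn=10)]
vectors = np.array([model.wv[w] for w in words])
coords = TSNE(n_components=2, perplexity=5, random_state=0).fit_transform(vectors)
for w, (x, y) in zip(words, coords):
    print(f"{w}: ({x:.2f}, {y:.2f})")
```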
Next steps!
Why? Word embedding limitations
 Handling out-of-vocabulary or unknown words
 Solved with subword-level embeddings (character and n-gram based): FastText embeddings (see the sketch after this list)
 Unique representation per word: problems with polysemy and word senses!
 Solved with ELMo: deep contextualized word representations
 Word embedding domain adaptation
 Lack of studies on bias and on the vector spaces themselves, also in evaluation
 Research transition from feature-based approaches to fine-tuning complex pre-trained Deep Learning architectures for the principal NLP tasks: classification, inference, question answering, natural language generation…
 Started in language modelling: sentence embeddings like spaCy's, Universal Sentence Encoder (DAN and Transformer)
 State of the art: ULMFiT, Transformer, BERT, OpenAI GPT…
 Challenges with Deep Learning difficulties: explainability of neural network models, the representation knowledge they learn, and how to use and adapt them ethically in real business problems!
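A minimal gensim FastText sketch of the subword fix for out-of-vocabulary words mentioned in the list above (gensim 3.x API); the toy corpus and the misspelling are illustrative.

```python
# Minimal sketch: character n-gram (subword) embeddings build vectors
# even for words never seen during training.
from gensim.models import FastText

sentences = [["the", "client", "opened", "an", "account"],
             ["the", "customer", "visited", "the", "branch"]]

ft = FastText(sentences, size=50, window=3, min_count=1, min_n=3, max_n=5)

print("clientt" in ft.wv.vocab)               # False: the misspelling was never seen...
print(ft.wv["clientt"][:5])                   # ...but a vector is composed from its n-grams
print(ft.wv.similarity("client", "clientt"))  # and it lands close to "client"
```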
New directions and challenges
Universal Language Model Fine-tuning
Neural Transfer Learning for Natural Language Processing – Sebastian Ruder
Howard, Jeremy, and Sebastian Ruder. "Universal language model fine-tuning for text classification." arXiv preprint
arXiv:1801.06146 (2018).
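A heavily hedged sketch of the ULMFiT recipe (fine-tune the pre-trained language model, then train the classifier with gradual unfreezing) using the fastai v1 text API as documented at the time; the CSV path and column layout are assumptions, and exact signatures should be checked against the fastai docs.

```python
# Sketch of the ULMFiT stages with fastai v1 (assumed data: data/texts.csv with label,text columns).
from fastai.text import *

path = Path("data")
data_lm = TextLMDataBunch.from_csv(path, "texts.csv")                 # language-model data
data_clas = TextClasDataBunch.from_csv(path, "texts.csv",
                                       vocab=data_lm.train_ds.vocab, bs=32)

# 1) Fine-tune the pre-trained AWD-LSTM language model on the target corpus.
learn = language_model_learner(data_lm, AWD_LSTM, drop_mult=0.5)
learn.fit_one_cycle(1, 1e-2)
learn.unfreeze()
learn.fit_one_cycle(1, 1e-3)
learn.save_encoder("ft_enc")

# 2) Train the classifier on top of the fine-tuned encoder, unfreezing gradually.
learn = text_classifier_learner(data_clas, AWD_LSTM, drop_mult=0.5)
learn.load_encoder("ft_enc")
learn.fit_one_cycle(1, 1e-2)
learn.freeze_to(-2)
learn.fit_one_cycle(1, slice(5e-3 / 2.0, 5e-3))
learn.unfreeze()
learn.fit_one_cycle(1, slice(2e-3 / 100, 2e-3))
```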
Transformer
https://ai.google/research/pubs/pub46201
Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., ... & Polosukhin, I. (2017). Attention is all you need. In Advances in neural information processing systems (pp. 5998-6008).
https://arxiv.org/pdf/1706.03762.pdf
BERT
Devlin, J., Chang, M. W., Lee, K., & Toutanova, K. (2018). Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.
https://arxiv.org/pdf/1810.04805.pdf
TO START: http://jalammar.github.io/illustrated-bert/
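A minimal sketch of extracting contextual representations from a pre-trained BERT, using today's Hugging Face transformers library; the library choice and the model name are assumptions not mentioned on the slide.

```python
# Minimal sketch: contextual token representations from pre-trained BERT
# (Hugging Face transformers, PyTorch backend).
import torch
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased")

inputs = tokenizer("The client opened an account at the branch", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# One contextual vector per WordPiece token: the same word gets different vectors
# in different contexts, unlike a static word embedding.
print(outputs.last_hidden_state.shape)   # (1, num_tokens, 768)
```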
Conclusions
Thank you for your
attention!
Any questions?
Appendix
Frameworks related
 Gensim: https://radimrehurek.com/gensim/models/word2vec.html
 Glovepy: https://pypi.org/project/glovepy/
 SpaCy: https://spacy.io/
 FastText – Facebook Research: https://github.com/facebookresearch/fastText
 Universal Sentence Encoder - Tensorflow:
https://tfhub.dev/google/universal-sentence-encoder/2
 AllenNLP - ELMo - http://aclweb.org/anthology/N18-1202
 https://allennlp.org/elmo
 Enso - https://github.com/IndicoDataSolutions/Enso
 BERT - https://github.com/google-research/bert
NLP&DL: Education
 Stanford University - Christopher Manning – Natural Language Processing with
Deep Learning
 https://www.youtube.com/watch?v=OQQ-W_63UgQ&list=PL3FW7Lu3i5Jsnh1rnUwq_TcylNr7EkRe6
 http://cs224d.stanford.edu/
 http://web.stanford.edu/class/cs224n/
 University of Oxford - Deep Natural Language Processing
 https://github.com/oxford-cs-deepnlp-2017/lectures
 Bar-Ilan University - Yoav Goldberg - Senior Lecturer, Computer Science Department NLP Lab - http://u.cs.biu.ac.il/~nlp/
 Book: Neural Network Methods for Natural Language Processing
 Universitat Politècnica de Catalunya – Horacio Rodríguez - Embeddings working
notes - http://www.cs.upc.edu/~horacio/docencia.html
http://www.lsi.upc.edu/~ageno/anlp/embeddings.pdf
More references and papers
 Academic material:
 Ruder, Sebastian. On word embeddings. http://ruder.io/word-embeddings-2017/, http://ruder.io/word-embeddings-1/
 Jurafsky, Daniel, & Martin, James H. (2017). Speech and Language Processing. Draft 3rd Edition. Chapters 15-16: Vector Semantics and Semantics with Dense Vectors
 Goldberg, Yoav (2017). Neural Network Methods for Natural Language Processing (Book)
 Word embedding papers:
 Mikolov, T., Chen, K., Corrado, G., & Dean, J. (2013). Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781.
 Mikolov, Tomas, et al. "Distributed representations of words and phrases and their compositionality." Advances in neural information processing systems. 2013.
 Le, Quoc, and Mikolov, Tomas. "Distributed representations of sentences and documents." International conference on machine learning. 2014.
 Pennington, Jeffrey, Richard Socher, and Christopher Manning. "GloVe: Global vectors for word representation." Proceedings of the 2014 conference on empirical methods in
natural language processing (EMNLP). 2014.
 Mnih, Andriy, and Koray Kavukcuoglu. "Learning word embeddings efficiently with noise-contrastive estimation." Advances in neural information processing systems. 2013.
 Schnabel, Tobias and Labutov, Igor (2015). Evaluation methods for unsupervised word embeddings.
 Wang, Yanshan and Liu, Sijia (2018). A Comparison of Word Embeddings for the Biomedical Natural Language Processing. arXiv:1802.00400v3
 Next steps!
 http://jalammar.github.io/illustrated-bert/
 Howard, Jeremy, and Sebastian Ruder. "Universal language model fine-tuning for text classification." arXiv preprint arXiv:1801.06146 (2018).
 Devlin, J., Chang, M. W., Lee, K., & Toutanova, K. (2018). Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.