Introduction to
word embeddings
Pavel Kalaidin
@facultyofwonder
Moscow Data Fest, September 12, 2015
Introduction to word embeddings with Python
distributional hypothesis
лойс
good stuff, лойс
лойс for the song
on principle, I won't give a лойс
mutual лойсы
лойс if you agree
What is the meaning of лойс?
кек
кек, or what?
кек)))))))
well, you're a кек
What is the meaning of кек?
vectorial representations
of words
simple and flexible
platform for
understanding text and
probably not messing up
one-hot encoding?
1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
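A minimal sketch of why one-hot vectors are a weak representation, assuming a four-word toy vocabulary (the words and the `one_hot` and `dot` helpers are illustrative, not from the talk):

```python
# Toy vocabulary; the words are illustrative, not taken from the talk's corpus.
vocab = ["hotel", "motel", "song", "like"]
index = {word: i for i, word in enumerate(vocab)}

def one_hot(word):
    """Return a |V|-dimensional vector with a single 1 at the word's index."""
    vec = [0] * len(vocab)
    vec[index[word]] = 1
    return vec

def dot(u, v):
    """Plain dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

# Any two different words are orthogonal, so the encoding carries
# no notion of similarity between "hotel" and "motel".
hotel_motel = dot(one_hot("hotel"), one_hot("motel"))
hotel_hotel = dot(one_hot("hotel"), one_hot("hotel"))
```

The vector length grows with the vocabulary, and every pair of distinct words looks equally unrelated.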
co-occurrence matrix
recall: word-document co-occurrence
matrix for LSA
credits: [x]
from entire document to
window (length 5-10)
still seems suboptimal ->
big, sparse, etc.
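Window-based counting can be sketched as follows. The three-sentence corpus, the function name `cooccurrence_counts` and the window size of 1 are assumptions made to keep the example small; the slides suggest windows of 5-10.

```python
from collections import defaultdict

def cooccurrence_counts(sentences, window=2):
    """Count how often each ordered pair of words appears within `window` tokens."""
    counts = defaultdict(int)
    for tokens in sentences:
        for i, word in enumerate(tokens):
            lo = max(0, i - window)
            hi = min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:  # a word is not its own context
                    counts[(word, tokens[j])] += 1
    return counts

# Illustrative corpus, already tokenized.
corpus = [
    "i like deep learning".split(),
    "i like nlp".split(),
    "i enjoy flying".split(),
]
counts = cooccurrence_counts(corpus, window=1)
```

Only nonzero cells are stored here: with 7 distinct words the full matrix has 49 cells, most of them empty, which is the sparsity problem the slide points at.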
lower dimensions, we
want dense vectors
(say, 25-1000)
How?
matrix factorization?
SVD of co-occurrence
matrix
lots of memory?
idea: directly learn low-dimensional vectors
here comes word2vec
Distributed Representations of Words and Phrases and their Compositionality, Mikolov et al: [paper]
idea: instead of capturing co-occurrence counts,
predict surrounding words
Two models:
C-BOW
predicting the word given its context
skip-gram
predicting the context given a word
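The difference between the two models shows up in how training examples are cut from a sentence. This is a rough sketch of that step only (no network, no training); the function names and the window size are assumptions for illustration.

```python
def skipgram_pairs(tokens, window=2):
    """(center, context) pairs: the center word is used to predict each neighbour."""
    pairs = []
    for i, center in enumerate(tokens):
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs

def cbow_examples(tokens, window=2):
    """(context_words, target) examples: the neighbours jointly predict the center."""
    examples = []
    for i, target in enumerate(tokens):
        context = [tokens[j]
                   for j in range(max(0, i - window), min(len(tokens), i + window + 1))
                   if j != i]
        if context:
            examples.append((context, target))
    return examples

tokens = "the quick brown fox".split()
sg = skipgram_pairs(tokens, window=1)   # one example per (word, neighbour)
cb = cbow_examples(tokens, window=1)    # one example per word position
```

Skip-gram yields one example per word-neighbour pair, CBOW one per position, so CBOW sees fewer (but richer) examples for the same text.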
Explained in great detail here, so we'll skip it for now.
Also see: word2vec Parameter Learning Explained, Rong [paper]
CBOW: several times faster to train than skip-gram,
slightly better accuracy for frequent words
Skip-gram: works well with a small amount of training
data, represents rare words and phrases well
Examples?
$W_{\text{woman}} - W_{\text{man}} = W_{\text{queen}} - W_{\text{king}}$
classic example
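The analogy can be reproduced on a toy scale. The 2-d vectors below are hand-made for illustration (real embeddings are learned and have many more dimensions), and `analogy` is a hypothetical helper, not the gensim API.

```python
import math

# Hand-made 2-d vectors (axes loosely "royalty" and "gender"); illustrative only.
vectors = {
    "king":  [1.0,  1.0],
    "queen": [1.0, -1.0],
    "man":   [0.0,  1.0],
    "woman": [0.0, -1.0],
    "apple": [-1.0, 0.1],
}

def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def analogy(a, b, c):
    """Return the word closest to vec(b) - vec(a) + vec(c), excluding a, b and c."""
    target = [vb - va + vc for va, vb, vc in zip(vectors[a], vectors[b], vectors[c])]
    candidates = [w for w in vectors if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(vectors[w], target))

answer = analogy("man", "king", "woman")  # man : king :: woman : ?
```

Excluding the three query words from the candidates matters in practice: otherwise the nearest neighbour is often one of the inputs.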
<censored example>
word2vec Explained: Deriving Mikolov et al.'s Negative-Sampling
Word-Embedding Method, Goldberg and Levy, 2014 [arxiv]
all done with gensim:
github.com/piskvorky/gensim/
...failing to take advantage of
the vast amount of repetition
in the data
so back to co-occurrences
GloVe for Global Vectors
Pennington et al, 2014: nlp.stanford.edu/pubs/glove.pdf
Ratios seem to cancel noise
The gist: model ratios with
vectors
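A small sketch of what "ratios" means here, in the spirit of the ice/steam example from the GloVe paper. The counts are made up for illustration and are not the paper's numbers.

```python
# Made-up co-occurrence counts X[word][context]; the numbers are illustrative only.
X = {
    "ice":   {"solid": 80, "gas": 4,  "water": 300, "fashion": 2},
    "steam": {"solid": 5,  "gas": 90, "water": 280, "fashion": 2},
}

def cond_prob(word, context):
    """P(context | word) = X[word][context] / sum_k X[word][k]."""
    return X[word][context] / sum(X[word].values())

# Ratio P(k | ice) / P(k | steam) for every probe word k.
ratios = {k: cond_prob("ice", k) / cond_prob("steam", k) for k in X["ice"]}
```

Probe words related to only one of the two targets give ratios far from 1 (large for "solid", small for "gas"), while words related to both or neither ("water", "fashion") give ratios near 1, so the uninformative contexts cancel out.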
The model
Preserving
linearity
Preventing mixing
dimensions
Restoring
symmetry, part 1
recall:
Restoring symmetry, part 2
Least squares problem it is now
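The formulas for these steps were on slide images and are not in the text. Following the derivation in the GloVe paper (Pennington et al, 2014), the steps named above most likely correspond to the chain below, where $X_{ik}$ is the co-occurrence count, $X_i = \sum_k X_{ik}$, $P_{ik} = X_{ik} / X_i$, $w$ and $\tilde{w}$ are word and context vectors, and $b$, $\tilde{b}$ are biases. This is a reconstruction, not a copy of the slides.

```latex
\begin{align*}
% The model
F(w_i, w_j, \tilde{w}_k) &= \frac{P_{ik}}{P_{jk}} \\
% Preserving linearity
F(w_i - w_j, \tilde{w}_k) &= \frac{P_{ik}}{P_{jk}} \\
% Preventing mixing dimensions
F\big((w_i - w_j)^\top \tilde{w}_k\big) &= \frac{P_{ik}}{P_{jk}} \\
% Restoring symmetry, part 1
F\big((w_i - w_j)^\top \tilde{w}_k\big) &= \frac{F(w_i^\top \tilde{w}_k)}{F(w_j^\top \tilde{w}_k)},
  \qquad F = \exp \\
w_i^\top \tilde{w}_k &= \log P_{ik} = \log X_{ik} - \log X_i \\
% Restoring symmetry, part 2
w_i^\top \tilde{w}_k + b_i + \tilde{b}_k &= \log X_{ik} \\
% Least squares problem
J &= \sum_{i,j=1}^{V} f(X_{ij}) \big( w_i^\top \tilde{w}_j + b_i + \tilde{b}_j - \log X_{ij} \big)^2 \\
f(x) &= \begin{cases} (x / x_{\max})^{\alpha} & \text{if } x < x_{\max} \\ 1 & \text{otherwise} \end{cases}
\end{align*}
```

The paper uses $x_{\max} = 100$ and $\alpha = 3/4$ for the weighting function $f$.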
SGD->AdaGrad
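A rough sketch of one AdaGrad step on a single co-occurrence count, assuming the GloVe-style weighted least-squares loss 0.5 * f(x) * (w·w~ + b + b~ - log x)^2 (the factor 0.5 is a convenience so that the gradient has no factor 2). The function names, the `state` layout and the toy values are assumptions for illustration; this is not the glove-python code.

```python
import math

def weight(x, x_max=100.0, alpha=0.75):
    """GloVe weighting f(x): down-weights rare pairs, caps frequent ones at 1."""
    return (x / x_max) ** alpha if x < x_max else 1.0

def adagrad_step(w, w_ctx, b, b_ctx, x, state, lr=0.05):
    """One AdaGrad update for a single nonzero count x (vectors updated in place).

    `state` holds running sums of squared gradients, initialised to 1.0.
    Returns the loss before the update and the two updated biases.
    """
    diff = sum(a * c for a, c in zip(w, w_ctx)) + b + b_ctx - math.log(x)
    fdiff = weight(x) * diff
    for d in range(len(w)):
        # Gradients use the values from before this step.
        g_w, g_c = fdiff * w_ctx[d], fdiff * w[d]
        w[d] -= lr * g_w / math.sqrt(state["w"][d])
        w_ctx[d] -= lr * g_c / math.sqrt(state["c"][d])
        state["w"][d] += g_w ** 2
        state["c"][d] += g_c ** 2
    b_new = b - lr * fdiff / math.sqrt(state["b"])
    b_ctx_new = b_ctx - lr * fdiff / math.sqrt(state["bc"])
    state["b"] += fdiff ** 2
    state["bc"] += fdiff ** 2
    return 0.5 * weight(x) * diff ** 2, b_new, b_ctx_new

# Toy run: fit one word/context pair to a single count of 20.
w, c = [0.1, -0.2], [0.3, 0.1]
b = b_c = 0.0
state = {"w": [1.0, 1.0], "c": [1.0, 1.0], "b": 1.0, "bc": 1.0}
losses = []
for _ in range(200):
    loss, b, b_c = adagrad_step(w, c, b, b_c, x=20.0, state=state)
    losses.append(loss)
```

AdaGrad divides each parameter's step by the root of its accumulated squared gradients, so frequently updated parameters take smaller steps over time.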
ok, Python code
glove-python:
github.com/maciejkula/glove-python
two sets of vectors
input and context + bias
average/sum/drop
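The three options for merging the two vector sets can be sketched like this. `combine` and its `mode` names are hypothetical, chosen to mirror the slide's wording; the GloVe paper itself sums the two sets.

```python
def combine(word_vec, context_vec, mode="sum"):
    """Merge the two learned vectors for one word into a single embedding."""
    if mode == "sum":
        return [w + c for w, c in zip(word_vec, context_vec)]
    if mode == "average":
        return [(w + c) / 2 for w, c in zip(word_vec, context_vec)]
    if mode == "drop":  # keep only the word ("input") vector
        return list(word_vec)
    raise ValueError("unknown mode: %s" % mode)

summed = combine([1.0, 2.0], [3.0, 4.0], mode="sum")
averaged = combine([1.0, 2.0], [3.0, 4.0], mode="average")
dropped = combine([1.0, 2.0], [3.0, 4.0], mode="drop")
```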
complexity $|V|^2$
complexity $|C|^{0.8}$
Evaluation: it works
#spb
#gatchina
#msk
#kyiv
#minsk
#helsinki
Compared to word2vec
#spb
#gatchina
#msk
#kyiv
#minsk
#helsinki
t-SNE:
github.com/oreillymedia/t-SNE-tutorial
seaborn:
stanford.edu/~mwaskom/software/seaborn/
Abusing models
music playlists:
github.com/mattdennewitz/playlist-to-vec
deep walk:
DeepWalk: Online Learning of Social
Representations [link]
user interests
Paragraph vectors: cs.stanford.edu/~quocle/paragraph_vector.pdf
predicting hashtags
interesting read: #TAGSPACE: Semantic
Embeddings from Hashtags [link]
RusVectōrēs: distributional semantic
models for Russian: ling.go.mail.ru/dsm/en/
corpus matters
building block for
bigger models
╰(*´︶`*)╯
</slides>
