CS447: Natural Language Processing
http://courses.engr.illinois.edu/cs447
Julia Hockenmaier
juliahmr@illinois.edu
3324 Siebel Center
Lecture 8:
Vector Semantics and
Word Embeddings
Today’s lecture
The Distributional Hypothesis
From words to sparse vectors
that capture distributional similarities
From words to dense vectors
via word embeddings
How do we represent words?
As atomic symbols?
[e.g. as in a traditional n-gram language model, or
when we use them as explicit features in a classifier]
As very high-dimensional one-hot vectors?
[e.g. as in a naive neural language model]
As very high-dimensional sparse vectors?
[to capture so-called distributional similarities]
As lower-dimensional dense vectors?
[“word embeddings” — important prerequisite for neural NLP]
What should word representations
capture?
Vector representations of words were originally
motivated by attempts to capture lexical semantics
(the meaning of words) so that words that have
similar meanings have similar representations
These representations may also capture some
morphological or syntactic properties of words
(parts of speech, inflections, stems etc.).
Why do we care about word similarity?
Question answering:
Q: “How tall is Mt. Everest?”
Candidate A: “The official height of Mount Everest is
29029 feet”
“tall” is similar to “height”
Why do we care about word similarity?
Plagiarism detection
Different approaches to lexical semantics
Lexicographic tradition:
-Use lexicons, thesauri, ontologies
-Assume words have discrete word senses:
bank1 = financial institution; bank2 = river bank, etc.
-May capture explicit relations between word (senses):
“dog” is a “mammal”, etc.
Distributional tradition:
-Map words to (sparse) vectors that capture corpus statistics
-Contemporary variant: use neural nets to learn dense vector
“embeddings” from very large corpora
(this is a prerequisite for most neural approaches to NLP)
-If each word type is mapped to a single vector, this ignores
the fact that words have multiple senses or parts-of-speech
The distributional
hypothesis
The Distributional Hypothesis
Zellig Harris (1954):
“oculist and eye-doctor … occur in almost the same
environments”
“If A and B have almost identical environments we say that
they are synonyms.”
John R. Firth 1957:
You shall know a word by the company it keeps.
The contexts in which a word appears tell us a lot about what it means.
Words that appear in similar contexts have similar meanings.
Why do we care about word contexts?
What is tezgüino?
A bottle of tezgüino is on the table.
Everybody likes tezgüino.
Tezgüino makes you drunk.
We make tezgüino out of corn.
(Lin, 1998; Nida, 1975)
The contexts in which a word appears tell us a lot about what it means.
Two ways NLP uses context for semantics
Distributional similarities (vector-space semantics):
Use the set of contexts in which words (= word types)
appear to measure their similarity
Assumption: Words that appear in similar contexts (tea, coffee)
have similar meanings.
Word sense disambiguation (future lecture)
Use the context of a particular occurrence of a word
(token) to identify which sense it has.
Assumption: If a word has multiple distinct senses
(e.g. plant: factory or green plant), each sense will appear in
different contexts.
Word similarities as
vector distances
Distributional Similarities
Measure the semantic similarity of words
in terms of the similarity of the contexts
in which the words appear
Represent words as vectors such that
— each vector element (dimension)
corresponds to a different context
— the vector for any particular word captures
how strongly it is associated with each context
Compute the semantic similarity of words
as the similarity of their vectors.
Distributional similarities
Distributional similarities use the set of contexts
in which words appear to measure their similarity.
They represent each word w as a vector w = (w_1, …, w_N) ∈ R^N
in an N-dimensional vector space.
-Each dimension corresponds to a particular context c_n
-Each element w_n of w captures the degree to which
the word w is associated with the context c_n.
-w_n depends on the co-occurrence counts of w and c_n
The similarity of words w and u is given by the
similarity of their vectors w and u
Documents as contexts
Let’s assume our corpus consists of a (large) number
of documents (articles, plays, novels, etc.)
In that case, we can define the contexts of a word as
the sets of documents in which it appears.
Conversely, we can represent each document as the
(multi)set of words which appear in it.
-Intuition: Documents are similar to each other if they contain
the same words.
-This is useful for information retrieval, e.g. to compute the
similarity between a query (also a document) and any
document in the collection to be searched.
Term-Document Matrix
A Term-Document Matrix is a 2D table:
-Each cell contains the frequency (count) of the term (word) t
in document d: tft,d
-Each column is a vector of counts over words, representing a
document
-Each row is a vector of counts over documents, representing
a word
              As You Like It   Twelfth Night   Julius Caesar   Henry V
battle               1                1               8           15
soldier              2                2              12           36
fool                37               58               1            5
clown                6              117               0            0
Term-Document Matrix
Two documents are similar if their vectors are similar
Two words are similar if their vectors are similar
              As You Like It   Twelfth Night   Julius Caesar   Henry V
battle               1                1               8           15
soldier              2                2              12           36
fool                37               58               1            5
clown                6              117               0            0
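To make the row/column reading concrete, here is a minimal numpy sketch (the variable names are illustrative, not from the lecture) that pulls word and document vectors out of the counts above:

    import numpy as np

    # Term-document counts from the table above (rows = words, columns = plays).
    words = ["battle", "soldier", "fool", "clown"]
    docs  = ["As You Like It", "Twelfth Night", "Julius Caesar", "Henry V"]
    M = np.array([[ 1,   1,  8, 15],
                  [ 2,   2, 12, 36],
                  [37,  58,  1,  5],
                  [ 6, 117,  0,  0]], dtype=float)

    doc_vec  = M[:, docs.index("Julius Caesar")]   # a document = column of counts over words
    word_vec = M[words.index("fool"), :]           # a word = row of counts over documents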
What is a ‘context’?
There are many different definitions of context
that yield different kinds of similarities:
Contexts defined by nearby words:
How often does w appear near the word drink?
Near = “drink appears within a window of ±k words of w”,
or “drink appears in the same document/sentence as w”
This yields fairly broad thematic similarities.
Contexts defined by grammatical relations:
How often is (the noun) w used as the subject (object)
of the verb drink? (Requires a parser).
This gives more fine-grained similarities.
Using nearby words as contexts
-Decide on a fixed vocabulary of N context words c1..cN
Context words should occur frequently enough in your corpus that you get
reliable co-occurrence counts, but you should ignore words that are too
common (‘stop words’: a, the, on, in, and, or, is, have, etc.)
-Define what ‘nearby’ means
For example: w appears near c if c appears within ±5 words of w
-Get co-occurrence counts of words w and contexts c
-Define how to transform co-occurrence counts
of words w and contexts c into vector elements wn
For example: compute (positive) PMI of words and contexts
-Define how to compute the similarity of word vectors
For example: use the cosine of their angles.
Defining and counting co-occurrence
Defining co-occurrences:
-Within a fixed window: vi occurs within ±n words of w
-Within the same sentence: requires sentence boundaries
-By grammatical relations:
vi occurs as a subject/object/modifier/… of verb w
(requires parsing - and separate features for each relation)
Counting co-occurrences:
-fi as binary features (1,0): w does/does not occur with vi
-fi as frequencies: w occurs n times with vi
-fi as probabilities:
e.g. fi is the probability that vi is the subject of w.
Getting co-occurrence counts
Co-occurrence as a binary feature:
Does word w ever appear in the context c? (1 = yes/0 = no)
Co-occurrence as a frequency count:
How often does word w appear in the context c? (0…n times)
Typically: 10K-100K dimensions (contexts), very sparse vectors
Binary co-occurrence features:

             arts   boil   data   function   large   sugar   water
apricot        0      1      0        0         1       1       1
pineapple      0      1      0        0         1       1       1
digital        0      0      1        1         1       0       0
information    0      0      1        1         1       0       0

Co-occurrence frequency counts:

             arts   boil   data   function   large   sugar   water
apricot        0      1      0        0         5       2       7
pineapple      0      2      0        0        10       8       5
digital        0      0     31        8        20       0       0
information    0      0     35       23         5       0       0
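As a concrete illustration of how such counts could be collected, here is a small pure-Python sketch; the toy corpus and the ±2 window are assumptions for the example:

    from collections import defaultdict

    def cooccurrence_counts(sentences, window=2):
        """Count how often each word w occurs with each context word c
        within +/- `window` positions, across a list of tokenized sentences."""
        counts = defaultdict(lambda: defaultdict(int))
        for tokens in sentences:
            for i, w in enumerate(tokens):
                lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
                for j in range(lo, hi):
                    if j != i:
                        counts[w][tokens[j]] += 1
        return counts

    corpus = [["a", "bottle", "of", "tezguino", "is", "on", "the", "table"],
              ["we", "make", "tezguino", "out", "of", "corn"]]
    print(cooccurrence_counts(corpus, window=2)["tezguino"])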
Counts vs PMI
Sometimes, low co-occurrence counts are very informative,
and high co-occurrence counts are not:
-Any word is going to have relatively high co-occurrence
counts with very common contexts (e.g. “it”, “anything”, “is”,
etc.), but this won’t tell us much about what that word means.
-We need to identify when co-occurrence counts are more
likely than we would expect by chance.
We can use pointwise mutual information (PMI)
values instead of raw frequency counts:
PMI(w, c) = log [ p(w, c) / (p(w) p(c)) ]

But this requires us to define p(w, c), p(w), and p(c).
Word-Word Matrix
Context: ± 7 words
Resulting word-word matrix:
f(w, c) = how often does word w appear in context c:
“information” appeared six times in the context of “data”
             aardvark   computer   data   pinch   result   sugar   …
apricot          0          0        0      1        0       1
pineapple        0          0        0      1        0       1
digital          0          2        1      0        1       0
information      0          1        6      0        4       0
Computing the joint and marginal probabilities p(w, context) and p(w) from these counts:

             computer   data   pinch   result   sugar     p(w)
apricot        0.00     0.00    0.05    0.00     0.05     0.11
pineapple      0.00     0.00    0.05    0.00     0.05     0.11
digital        0.11     0.05    0.00    0.05     0.00     0.21
information    0.05     0.32    0.00    0.21     0.00     0.58
p(context)     0.16     0.37    0.11    0.26     0.11

p_ij = f_ij / ( ∑_{i=1..W} ∑_{j=1..C} f_ij )
p(w_i) = ( ∑_{j=1..C} f_ij ) / N
p(c_j) = ( ∑_{i=1..W} f_ij ) / N

p(w=information, c=data) = 6/19 = .32
p(w=information) = 11/19 = .58
p(c=data) = 7/19 = .37
Computing PMI of w and c:
Using a fixed window of ± k words
N: How many tokens does the corpus contain?
f(w) ≤ N: How often does w occur?
f(w, c) ≤ f(w): How often does w occur with c in its window?
f(c) = ∑wf(w, c) ≤ N: How many tokens have c in their window?
p(w) = f(w)/N
p(c) = f(c)/N
p(w, c) = f(w, c)/N
PMI(w, c) = log [ p(w, c) / (p(w) p(c)) ]
Computing PMI of w and c:
w and c in the same sentence
N: How many sentences does the corpus contain?
f(w) ≤ N: How many sentences contain w?
f(w, c) ≤ f(w): How many sentences contain w and c?
f(c) ≤ N: How many sentences contain c?
p(w) = f(w)/N
p(c) = f(c)/N
p(w, c) = f(w, c)/N
PMI(w, c) = log [ p(w, c) / (p(w) p(c)) ]
Positive Pointwise Mutual Information
PMI is negative when words co-occur less than
expected by chance.
This is unreliable without huge corpora:
With P(w1) ≈ P(w2) ≈ 10^-6, we can’t estimate whether P(w1, w2)
is significantly different from 10^-12
We often just use positive PMI values,
and replace all PMI values < 0 with 0:
Positive Pointwise Mutual Information (PPMI):
PPMI(w,c) = PMI(w,c) if PMI(w,c) > 0
= 0 if PMI(w,c) ≤ 0
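A compact numpy sketch of the whole counts-to-PPMI step (the matrix F reuses the word-word counts from the earlier slide, without the all-zero aardvark column; the eps constant is an assumption to avoid log(0)):

    import numpy as np

    def ppmi(F, eps=1e-12):
        """Turn a word-by-context count matrix F into a PPMI matrix.
        PMI(w,c) = log p(w,c) / (p(w) p(c)); negative values are clipped to 0."""
        N = F.sum()
        p_wc = F / N
        p_w = F.sum(axis=1, keepdims=True) / N     # row marginals p(w)
        p_c = F.sum(axis=0, keepdims=True) / N     # column marginals p(c)
        pmi = np.log((p_wc + eps) / (p_w @ p_c + eps))
        return np.maximum(pmi, 0.0)

    # Counts from the word-word matrix slide (columns: computer, data, pinch, result, sugar).
    F = np.array([[0, 0, 1, 0, 1],     # apricot
                  [0, 0, 1, 0, 1],     # pineapple
                  [2, 1, 0, 1, 0],     # digital
                  [1, 6, 0, 4, 0]],    # information
                 dtype=float)
    print(ppmi(F).round(2))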
PMI and smoothing
PMI is biased towards infrequent events:
If P(w, c) = P(w) = P(c), then PMI(w,c) = log(1/P(w))
So PMI(w, c) is larger for rare words w with low P(w).
Simple remedy: Add-k smoothing of P(w, c), P(w), P(c)
pushes all PMI values towards zero.
Add-k smoothing affects low-probability events more,
and will therefore reduce the bias of PMI towards
infrequent events.
(Pantel & Turney 2010)
Vector similarity as vector distances
In distributional models, every word is a point in n-dimensional
space.
How do we measure the similarity between two points/vectors?
In general:
•Manhattan distance (taxicab distance, L1 norm)
•Euclidean distance (L2 norm)
dist_L1(x, y) = ∑_{i=1..N} |x_i − y_i|

dist_L2(x, y) = sqrt( ∑_{i=1..N} (x_i − y_i)² )

[Figure: the L1 (Manhattan) and L2 (Euclidean) distances between two points X and Y.]
Dot product as similarity
If the vectors consist of simple binary features (0,1),
we can use the dot product as similarity metric:
The dot product is a bad metric if the vector elements
are arbitrary features: it prefers long vectors
-If one x_i is very large (and y_i nonzero), sim(x, y) gets very large
-If the number of nonzero x_i and y_i is very large, sim(x, y) gets very large
-Both can happen with frequent words.
sim_dotprod(x, y) = ∑_{i=1..N} x_i y_i

length of x: |x| = sqrt( ∑_{i=1..N} x_i² )
Vector similarity: Cosine
One way to define the similarity of two vectors
is to use the cosine of their angle.
The cosine of two vectors is their dot product,
divided by the product of their lengths:
sim(w, u) = 1: w and u point in the same direction
sim(w, u) = 0: w and u are orthogonal
sim(w, u) = −1: w and u point in the opposite direction
sim_cos(x, y) = ( ∑_{i=1..N} x_i y_i ) / ( sqrt(∑_{i=1..N} x_i²) · sqrt(∑_{i=1..N} y_i²) ) = (x · y) / (|x| |y|)
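A short numpy sketch of cosine similarity, using made-up context-count vectors in the spirit of the earlier tables:

    import numpy as np

    def cosine(x, y):
        """Cosine similarity: dot product divided by the product of the lengths."""
        return x @ y / (np.linalg.norm(x) * np.linalg.norm(y))

    apricot   = np.array([0., 0., 1., 0., 1.])
    pineapple = np.array([0., 0., 1., 0., 1.])
    digital   = np.array([2., 1., 0., 1., 0.])

    print(cosine(apricot, pineapple))   # 1.0: identical direction
    print(cosine(apricot, digital))     # 0.0: no shared contexts (orthogonal)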
Word Embeddings
Words as input to neural models
We typically think of words as atomic symbols,
but neural nets require input in vector form.
Naive solution: one-hot encoding (dim(x) = |V| )
“a” = (1,0,0,…0), “aardvark” = (0,1,0,…,0), ….
Very high-dimensional, very sparse vectors (most elements 0)
No ability to generalize across similar words
Still requires a lot of parameters.
How do we obtain low-dimensional, dense vectors?
Low-dimensional => our models need far fewer parameters
Dense => lots of elements are non-zero
We also want words that are similar to have similar vectors
Vector representations of words
“Traditional” distributional similarity approaches
represent words as sparse vectors
-Each dimension represents one specific context
-Vector entries are based on word-context co-occurrence
statistics (counts or PMI values)
Alternative, dense vector representations:
-We can use Singular Value Decomposition to turn these
sparse vectors into dense vectors (Latent Semantic Analysis)
-We can also use neural models to explicitly learn a dense
vector representation (embedding) (word2vec, Glove, etc.)
Sparse vectors = most entries are zero
Dense vectors = most entries are non-zero
(Static) Word Embeddings
A (static) word embedding is a function that maps
each word type to a single vector
— these vectors are typically dense and have much
lower dimensionality than the size of the vocabulary
— this mapping function typically ignores that the
same string of letters may have different senses
(dining table vs. a table of contents) or parts of
speech (to table a motion vs. a table)
— this mapping function typically assumes a fixed
size vocabulary (so an UNK token is still required)
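A tiny sketch of what a static embedding lookup amounts to; the vocabulary, dimensionality, and <UNK> handling here are illustrative assumptions:

    import numpy as np

    vocab = ["<UNK>", "the", "table", "motion"]          # fixed vocabulary
    word2id = {w: i for i, w in enumerate(vocab)}
    E = np.random.randn(len(vocab), 50)                  # one 50-dim vector per word type

    def embed(word):
        # Every occurrence of "table" gets the same vector, regardless of sense or POS;
        # out-of-vocabulary words all map to the <UNK> vector.
        return E[word2id.get(word, word2id["<UNK>"])]

    print(embed("table").shape, np.allclose(embed("xyzzy"), embed("<UNK>")))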
Word2Vec Embeddings
Main idea:
Use a binary classifier to predict which words appear in
the context of (i.e. near) a target word.
The parameters of that classifier provide a dense vector
representation of the target word (embedding)
Words that appear in similar contexts (that have high
distributional similarity) will have very similar vector
representations.
These models can be trained on large amounts of raw
text (and pre-trained embeddings can be downloaded)
Word2Vec (Mikolov et al. 2013)
The first really influential dense word embeddings
Two ways to think about Word2Vec:
— a simplification of neural language models
— a binary logistic regression classifier
Variants of Word2Vec
— Two different context representations: CBOW or Skip-Gram
— Two different optimization objectives:
Negative sampling (NS) or hierarchical softmax
Skip-Gram with negative sampling
Train a binary classifier that decides whether a target
word t appears in the context of other words c1..k
— Context: the set of k words near (surrounding) t
— Treat the target word t and any word that actually appears
in its context in a real corpus as positive examples
— Treat the target word t and randomly sampled words
that don’t appear in its context as negative examples
— Train a binary logistic regression classifier to distinguish
these cases
— The weights of this classifier depend on the similarity of t
and the words in c1..k
Use the weights of this classifier as embeddings for t
Skip-Gram Goal
Given a tuple (t,c) = target, context
(apricot, jam)
(apricot, aardvark)
Return the probability that c is a real
context word:
P( D = + | t, c)
P( D = − | t, c) = 1 − P(D = + | t, c)
Skip-Gram Training data
Training sentence:
... lemon, a tablespoon of apricot jam a pinch ...
(c1 = tablespoon, c2 = of, t = apricot, c3 = jam, c4 = a)
Training data: input/output pairs centering on apricot
Assume a +/- 2 word window
Positive examples:
(apricot, tablespoon), (apricot, of), (apricot, jam), (apricot, a)
For each positive example, create k negative examples,
using noise words:
(apricot, aardvark), (apricot, puddle)…
Word2Vec: Negative Sampling
Training data: D+ ∪ D-
D+ = actual examples from training data
Where do we get D- from?
Lots of options.
Word2Vec: for each good pair (w, c), sample k words and add
each w_i as a negative example (w_i, c) to D'
(D' is k times as large as D)
Words can be sampled according to corpus frequency,
or according to a smoothed variant where freq'(w) = freq(w)^0.75
(This gives more weight to rare words)
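Putting the last two points together, here is a sketch of how positive and negative (target, context) pairs could be generated; the 0.75 exponent follows the slide, while the toy sentence, window size, and sampling code are illustrative:

    import random
    from collections import Counter

    def training_pairs(tokens, window=2, k=2, rng=random.Random(0)):
        """Positive (t, c) pairs from a +/- `window` context, plus k sampled
        negatives per positive, drawn from the unigram distribution raised to 0.75."""
        freq = Counter(tokens)
        words = list(freq)
        weights = [freq[w] ** 0.75 for w in words]       # smoothed noise distribution
        pos, neg = [], []
        for i, t in enumerate(tokens):
            lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j == i:
                    continue
                pos.append((t, tokens[j]))
                neg.extend((t, c) for c in rng.choices(words, weights, k=k))
        return pos, neg

    sent = "lemon a tablespoon of apricot jam a pinch".split()
    pos, neg = training_pairs(sent)
    print(pos[:4], neg[:4])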
Word2Vec: Negative Sampling
Training objective:
Maximize log-likelihood of training data D+ ∪ D-:
L(Θ, D, D′) = ∑_{(w,c) ∈ D} log P(D = 1 | w, c)  +  ∑_{(w,c) ∈ D′} log P(D = 0 | w, c)
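Written out as code, the objective is just a sum of log-sigmoid scores over the two sets of pairs. This numpy sketch assumes separate target and context embedding tables and the score s(w, c) = t_w · c_c used on the following slides:

    import numpy as np

    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    def log_likelihood(T, C, pos_pairs, neg_pairs):
        """Sum of log P(D=1|w,c) over positive pairs and log P(D=0|w,c)
        over negative pairs, with s(w, c) = T[w] . C[c]."""
        ll = 0.0
        for w, c in pos_pairs:
            ll += np.log(sigmoid(T[w] @ C[c]))
        for w, c in neg_pairs:
            ll += np.log(1.0 - sigmoid(T[w] @ C[c]))
        return ll

    d = 4
    rng = np.random.default_rng(0)
    T = {w: rng.normal(size=d) for w in ["apricot", "jam", "aardvark"]}
    C = {w: rng.normal(size=d) for w in ["apricot", "jam", "aardvark"]}
    print(log_likelihood(T, C, [("apricot", "jam")], [("apricot", "aardvark")]))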
The Skip-Gram classifier
Use logistic regression to predict whether the pair (t, c) (target
word t and a context word c), is a positive or negative example:
Assume that t and c are represented as vectors,
so that their dot product t · c captures their similarity.
To capture the entire context window c_1..c_k, assume the words in
c_1:k are independent (multiply) and take the log:
To turn the dot product into a probability, we use the logistic (sigmoid) function:

σ(x) = 1 / (1 + e^(−x))

The probability that word c is a real context word for target word t is then:

P(+ | t, c) = 1 / (1 + e^(−t·c))

The sigmoid just returns a number between 0 and 1, so to make it a probability we need the two possible events (c being a context word, and c not being a context word) to sum to 1. The probability that c is not a real context word for t is therefore:

P(− | t, c) = 1 − P(+ | t, c) = e^(−t·c) / (1 + e^(−t·c))

This gives the probability for one context word, but we need to take account of the multiple context words in the window. Skip-gram makes the strong but very useful simplifying assumption that all context words are independent, so we can just multiply their probabilities:

P(+ | t, c_1:k) = ∏_{i=1..k} 1 / (1 + e^(−t·c_i))

log P(+ | t, c_1:k) = ∑_{i=1..k} log [ 1 / (1 + e^(−t·c_i)) ]

In summary, skip-gram trains a probabilistic classifier that, given a target word t and its context window of k words c_1:k, assigns a probability based on how similar this context window is to the target word. The probability is based on applying the logistic (sigmoid) function to the dot product of the embeddings of the target word with each context word.
Where do we get vectors t, c from?
Iterative approach:
Assume an initial set of vectors, and then adjust them
during training to maximize the probability of the
training examples.
How to compute p(+ | t, c)?
Intuition:
Words are likely to appear near similar words
Model similarity with dot-product!
Similarity(t,c) ∝ t · c
Problem:
Dot product is not a probability!
(Neither is cosine)
Turning the dot product into a
probability
The sigmoid lies between 0 and 1:
σ(x) = 1 / (1 + exp(−x))

P(+ | t, c) = 1 / (1 + exp(−t · c))

P(− | t, c) = 1 − 1 / (1 + exp(−t · c)) = exp(−t · c) / (1 + exp(−t · c))
Word2Vec: Negative Sampling
Distinguish “good” (correct) word-context pairs (D=1),
from “bad” ones (D=0)
Probabilistic objective:
P( D = 1 | t, c ) defined by sigmoid:
P( D = 0 | t, c ) = 1 − P( D = 1 | t, c )
P( D = 1 | t, c ) should be high when (t, c) ∈ D+, and low when
(t,c) ∈ D-
P(D = 1 | w, c) = 1 / (1 + exp(−s(w, c)))
Summary: How to learn word2vec (skip-gram)
embeddings
For a vocabulary of size V: Start with V random 300-
dimensional vectors as initial embeddings
Train a logistic regression classifier to distinguish words
that co-occur in corpus from those that don’t
Pairs of words that co-occur are positive examples
Pairs of words that don't co-occur are negative examples
Train the classifier to distinguish these by slowly adjusting
all the embeddings to improve the classifier performance
Throw away the classifier code and keep the embeddings.
Evaluating embeddings
Compare to human scores on word
similarity-type tasks:
WordSim-353 (Finkelstein et al., 2002)
SimLex-999 (Hill et al., 2015)
Stanford Contextual Word Similarity (SCWS) dataset
(Huang et al., 2012)
TOEFL dataset: Levied is closest in meaning to: imposed,
believed, requested, correlated
Properties of embeddings
Similarity depends on window size C
C = ±2 The nearest words to Hogwarts:
Sunnydale
Evernight
C = ±5 The nearest words to Hogwarts:
Dumbledore
Malfoy
half-blood
Analogy: Embeddings capture
relational meaning!
vector(‘king’) - vector(‘man’) + vector(‘woman’) =
vector(‘queen’)
vector(‘Paris’) - vector(‘France’) + vector(‘Italy’) =
vector(‘Rome’)
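A sketch of how such analogy queries are typically answered (nearest cosine neighbor to king − man + woman, excluding the query words); the function name and the idea of plugging in pre-trained vectors are illustrative:

    import numpy as np

    def analogy(a, b, c, emb):
        """Return the word whose vector is closest (by cosine) to emb[b] - emb[a] + emb[c],
        excluding the three query words themselves."""
        target = emb[b] - emb[a] + emb[c]
        target = target / np.linalg.norm(target)
        best, best_sim = None, -np.inf
        for w, v in emb.items():
            if w in (a, b, c):
                continue
            sim = (v / np.linalg.norm(v)) @ target
            if sim > best_sim:
                best, best_sim = w, sim
        return best

    # Usage, with real pre-trained vectors loaded into the dict `emb`:
    #   analogy("man", "king", "woman", emb)   ->   ideally "queen"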
Using Word
Embeddings
Using pre-trained embeddings
Assume you have pre-trained embeddings E.
How do you use them in your model?
-Option 1: Adapt E during training
Disadvantage: only words in training data will be affected.
-Option 2: Keep E fixed, but add another hidden layer that is
learned for your task
-Option 3: Learn a matrix T ∈ R^(dim(emb)×dim(emb)) and use the rows
of E' = ET (this adapts all embeddings, not specific words)
-Option 4: Keep E fixed, but learn a matrix Δ ∈ R^(|V|×dim(emb)) and
use E' = E + Δ or E' = ET + Δ (this learns to adapt specific
words)
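As one possible realization of Option 4 (PyTorch is an assumption here, not something the lecture prescribes), the pre-trained matrix can be frozen while a per-word delta is learned:

    import torch
    import torch.nn as nn

    class AdaptedEmbedding(nn.Module):
        """Option 4: keep the pre-trained matrix E fixed and learn a per-word delta,
        so the effective embedding is E' = E + Delta."""
        def __init__(self, pretrained):                 # pretrained: |V| x d tensor
            super().__init__()
            self.E = nn.Embedding.from_pretrained(pretrained, freeze=True)
            self.delta = nn.Embedding(pretrained.size(0), pretrained.size(1))
            nn.init.zeros_(self.delta.weight)           # start at E' = E
        def forward(self, word_ids):
            return self.E(word_ids) + self.delta(word_ids)

    emb = AdaptedEmbedding(torch.randn(10000, 300))
    print(emb(torch.tensor([1, 2, 3])).shape)           # torch.Size([3, 300])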
More on embeddings
Embeddings aren’t just for words!
You can take any discrete input feature (with a fixed number of
K outcomes, e.g. POS tags, etc.) and learn an embedding
matrix for that feature.
Where do we get the input embeddings from?
We can learn the embedding matrix during training.
Initialization matters: use random weights, but in a special range
(e.g. [−1/(2d), +1/(2d)] for d-dimensional embeddings), or use
Xavier initialization
We can also use pre-trained embeddings
LM-based embeddings are useful for many NLP tasks
Dense embeddings you can
download!
Word2vec (Mikolov et al.)
https://code.google.com/archive/p/word2vec/
Fasttext http://www.fasttext.cc/
Glove (Pennington, Socher, Manning)
http://nlp.stanford.edu/projects/glove/