[EMNLP] What is GloVe? Part II
An introduction to unsupervised learning of word embeddings from
co-occurrence matrices.
Brendan Whitaker
May 25, 2018 · 5 min read
In this article, we’ll discuss one of the newer methods of creating vector space models
of word semantics, more commonly known as word embeddings. The original paper by
J. Pennington, R. Socher, and C. Manning is available here:
http://www.aclweb.org/anthology/D14-1162. This method combines elements from
the two main word embedding models that existed when GloVe, short for “Global
Vectors [for word representation],” was proposed: global matrix factorization and local
context window methods. In Part I, we compared these two approaches. Now we’ll
explain the GloVe embedding generation algorithm and how it improves on these
previous methods.
. . .
Co-occurrence probabilities.
Recall from Part I of this series that term-term frequency matrices encode how
often terms appear in the context of one another by enumerating each unique token in
the corpus along both axes of a large 2-dimensional matrix. Performing matrix
factorization gives us a low-rank approximation of the data contained in
the original matrix. However, as we’ll explain in a moment, the authors of GloVe
discovered via empirical methods that instead of learning the raw co-occurrence
probabilities, it may make more sense to learn ratios of these co-occurrence
probabilities, which seem to better discriminate subtleties in term-term relevance.
To illustrate this, we borrow an example from their paper: suppose we wish to study
the relationship between two words, i = ice and j = steam. We’ll do this by examining
the co-occurrence probabilities of these words with various “probe” words. We define
the co-occurrence probability of an arbitrary word i with an arbitrary word j to be the
probability that word j appears in the context of word i. This is represented by the
equation and definitions below.
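$$P_{ij} = P(j \mid i) = \frac{X_{ij}}{X_i}, \qquad X_i = \sum_k X_{ik}$$

Here X_{ij} denotes the number of times word j occurs in the context of word i.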
Note that X_i is the number of times any word appears in the context of word i; that
is, the sum over all words k of the number of times word k occurs in the
context of word i. So if we choose a probe word k = solid, which is closely related to i =
ice but not to j = steam, we expect the ratio P_{ik}/P_{jk} of co-occurrence
probabilities to be large, since solid should, in theory, appear in the context of ice more
often than it appears in the context of steam (ice is a solid and steam is
not). Conversely, for a choice of k = gas, we would expect the same ratio to be small,
since steam is more closely related to gas than ice is. Then we also have words like
water, which are closely related to both ice and steam, but not more to one than the
other. And we also have words like fashion, which are not closely related to either of the
words in question. For both water and fashion, we expect the ratio to be close to 1,
since there shouldn’t be any bias toward either ice or steam.
Now it is important to note that since we are trying to determine information about the
relationship between the words ice and steam, water doesn’t give us a lot of useful
information. For discriminative purposes, it doesn’t give us a good idea of how “far
apart” steam is from ice, and the information that steam, ice, and water are all related
is already captured in the discriminative information between ice and water, and
steam and water. Words that don’t help us distinguish between i and j are referred to
as noise points, and it is the use of the ratio of co-occurrence probabilities that helps
filter out these noise points. This is well illustrated by the real data for these example
words, which we reproduce here from Table 1 of the GloVe paper.
[Table 1 of the GloVe paper: co-occurrence probabilities of ice and steam with the probe words solid, gas, water, and fashion, along with their ratios.]
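If you want to play with these quantities yourself, here is a minimal sketch (not the GloVe reference implementation; the function names and toy corpus are just for illustration) that builds a co-occurrence matrix with a symmetric context window and computes a probability ratio:

```python
from collections import defaultdict

def cooccurrence_counts(tokens, window=5):
    """Count X_ij: how often word j appears within `window` tokens of word i."""
    X = defaultdict(lambda: defaultdict(float))
    for pos, word in enumerate(tokens):
        lo = max(0, pos - window)
        hi = min(len(tokens), pos + window + 1)
        for ctx_pos in range(lo, hi):
            if ctx_pos != pos:
                X[word][tokens[ctx_pos]] += 1.0
    return X

def cooccurrence_prob(X, i, k):
    """P_ik = P(k | i) = X_ik / X_i, where X_i sums the counts over all context words of i."""
    X_i = sum(X[i].values())
    return X[i][k] / X_i if X_i else 0.0

# Toy usage; a real corpus would be orders of magnitude larger.
tokens = "ice is a solid and steam is a gas but both are forms of water".split()
X = cooccurrence_counts(tokens, window=3)
ratio = cooccurrence_prob(X, "ice", "solid") / cooccurrence_prob(X, "steam", "solid")
print(f"P(solid | ice) / P(solid | steam) = {ratio:.2f}")
```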
So let’s take a step back for a moment and recall the overall structure of the problem.
We want to take data from the corpus in the form of global statistics and learn a
function that gives us information about the relationship between any two words in
said corpus, given only the words themselves. Now the authors have discovered that
ratios of co-occurrence probabilities are a good source of this information, so it would
be nice if our function mapped the two words being compared, together with a
context word, to the space of co-occurrence probability ratios. So let the function our
model is learning be given by F. A naive interpretation of the desired model is given by
the authors as:
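$$F(w_i, w_j, \tilde{w}_k) = \frac{P_{ik}}{P_{jk}}$$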
Note that the w’s are real-valued word vectors. Now since we want to encode
information about the ratios between two words, the authors suggest using vector
differences as inputs to our function. Then we have the following:
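$$F(w_i - w_j, \tilde{w}_k) = \frac{P_{ik}}{P_{jk}}$$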
Now we’re getting closer to something that could work, but the final function for the
GloVe model will be considerably more complex to accurately reflect certain desirable
symmetries, since the distinction between the words i and j should be invariant under
commutation of the inputs. The authors also design a weighting scheme for co-
occurrences to reflect the relationship between frequency and semantic relevance. But
since I’m trying to keep these summary articles to around 800 words, we’ll cover all
that in Part III!
[EMNLP] What is GloVe? Part III
An introduction to unsupervised learning of word
embeddings from co-occurrence matrices.
towardsdatascience.com
Please check out the source paper!
GloVe: Global Vectors for Word Representation
Jeffrey Pennington, Richard Socher, Christopher D. Manning
Computer Science Department, Stanford University, Stanford, CA 94305
jpennin@stanford.edu, richard@socher.org, manning@stanford.edu
Abstract
Recent methods for learning vector space representations of words have succeeded in capturing fine-grained semantic and syntactic regularities using vector arithmetic, but the origin of these regularities has remained opaque. We analyze and make explicit the model properties needed for such regularities to emerge in word vectors. The result is a new global log-bilinear regression model that combines the advantages of the two major model families in the literature: global matrix factorization and local context window methods. Our model efficiently leverages statistical information by training only on the nonzero elements in a word-word co-occurrence matrix, rather than on the entire sparse matrix or on individual context windows in a large corpus. The model produces a vector space with meaningful substructure, as evidenced by its performance of 75% on a recent word analogy task. It also outperforms related models on similarity tasks and named entity recognition.
1 Introduction
Semantic vector space models of language represent each word with a real-valued vector. These vectors can be used as features in a variety of applications, such as information retrieval (Manning et al., 2008), document classification (Sebastiani, 2002), question answering (Tellex et al., 2003), named entity recognition (Turian et al., 2010), and parsing (Socher et al., 2013).

Most word vector methods rely on the distance or angle between pairs of word vectors as the primary method for evaluating the intrinsic quality of such a set of word representations. Recently, Mikolov et al. (2013c) introduced a new evaluation scheme based on word analogies that probes the finer structure of the word vector space by examining not the scalar distance between word vectors, but rather their various dimensions of difference. For example, the analogy “king is to queen as man is to woman” should be encoded in the vector space by the vector equation king − queen = man − woman. This evaluation scheme favors models that produce dimensions of meaning, thereby capturing the multi-clustering idea of distributed representations (Bengio, 2009).

The two main model families for learning word vectors are: 1) global matrix factorization methods, such as latent semantic analysis (LSA) (Deerwester et al., 1990), and 2) local context window methods, such as the skip-gram model of Mikolov et al. (2013c). Currently, both families suffer significant drawbacks. While methods like LSA efficiently leverage statistical information, they do relatively poorly on the word analogy task, indicating a sub-optimal vector space structure. Methods like skip-gram may do better on the analogy task, but they poorly utilize the statistics of the corpus since they train on separate local context windows instead of on global co-occurrence counts.

In this work, we analyze the model properties necessary to produce linear directions of meaning and argue that global log-bilinear regression models are appropriate for doing so. We propose a specific weighted least squares model that trains on global word-word co-occurrence counts and thus makes efficient use of statistics. The model produces a word vector space with meaningful substructure, as evidenced by its state-of-the-art performance of 75% accuracy on the word analogy dataset. We also demonstrate that our methods outperform other current methods on several word similarity tasks, and also on a common named entity recognition (NER) benchmark.

We provide the source code for the model as well as trained word vectors at http://nlp.stanford.edu/projects/glove/.