[EMNLP] What is GloVe? Part II
An introduction to unsupervised learning of word embeddings from
co-occurrence matrices.
Brendan Whitaker
May 25, 2018 · 5 min read
In this article, we’ll discuss one of the newer methods of creating vector space models
of word semantics, more commonly known as word embeddings. The original paper by
J. Pennington, R. Socher, and C. Manning is available here:
http://www.aclweb.org/anthology/D14-1162. This method combines elements from
the two main word embedding models that existed when GloVe, short for “Global
Vectors [for word representation],” was proposed: global matrix factorization and local
context window methods. In Part I, we compared these two approaches. Now we’ll
explain the GloVe embedding generation algorithm and how it improves on these
previous methods.
. . .
Co-occurrence probabilities.
Recall from Part I of this series that term-term frequency matrices encode how
often terms appear in the context of one another by enumerating each unique token in
the corpus along both axes of a large 2-dimensional matrix. Performing matrix
factorization gives us a low-rank approximation of the data contained in
the original matrix. However, as we’ll explain in a moment, the authors of GloVe
discovered via empirical methods that instead of learning the raw co-occurrence
probabilities, it may make more sense to learn ratios of these co-occurrence
probabilities, which seem to better discriminate subtleties in term-term relevance.
To illustrate this, we borrow an example from their paper: suppose we wish to study
the relationship between two words, i = ice and j = steam. We’ll do this by examining
the co-occurrence probabilities of these words with various “probe” words. We define
the co-occurrence probability of an arbitrary word i with an arbitrary word j to be the
probability that word j appears in the context of word i. This is represented by the
equation and definitions below.
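$$P_{ij} = P(j \mid i) = \frac{X_{ij}}{X_i}, \qquad X_i = \sum_k X_{ik}$$

Here X_{ij} denotes the number of times word j occurs in the context of word i.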
Note that X_i is the number of times any word appears in the context of word i; that
is, the sum over all words k of the number of times word k occurs in the
context of word i. So if we choose a probe word k = solid, which is closely related to i =
ice but not to j = steam, we expect the ratio P_{ik}/P_{jk} of co-occurrence
probabilities to be large, since solid should, in theory, appear in the context of ice more
often than it appears in the context of steam (ice is a solid and steam is
not). Conversely, for a choice of k = gas, we would expect the same ratio to be small,
since steam is more closely related to gas than ice is. Then we also have words like
water, which are closely related to both ice and steam, but not more to one than the
other. And we also have words like fashion, which are not closely related to either of the
words in question. For both water and fashion, we expect the ratio to be close to 1,
since there shouldn’t be any bias toward either ice or steam.
Now it is important to note that since we are trying to determine information about the
relationship between the words ice and steam, water doesn’t give us a lot of useful
information. For discriminative purposes, it doesn’t give us a good idea of how “far
apart” steam is from ice, and the information that steam, ice, and water are all related
is already captured in the discriminative information between ice and water, and
steam and water. Words that don’t help us distinguish between i and j are referred to
as noise points, and it is the use of the ratio of co-occurrence probabilities that helps
filter out these noise points. This is well illustrated by the real data for these example
words, which we reproduce here from Table 1 of the GloVe paper.
[Table 1 of the GloVe paper: co-occurrence probabilities of ice and steam with the probe words solid, gas, water, and fashion, along with their ratios.]
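If you want to play with these quantities yourself, here is a minimal sketch (not the GloVe reference implementation; the function names and toy corpus are just for illustration) that builds a co-occurrence matrix with a symmetric context window and computes a probability ratio:

```python
from collections import defaultdict

def cooccurrence_counts(tokens, window=5):
    """Count X_ij: how often word j appears within `window` tokens of word i."""
    X = defaultdict(lambda: defaultdict(float))
    for pos, word in enumerate(tokens):
        lo = max(0, pos - window)
        hi = min(len(tokens), pos + window + 1)
        for ctx_pos in range(lo, hi):
            if ctx_pos != pos:
                X[word][tokens[ctx_pos]] += 1.0
    return X

def cooccurrence_prob(X, i, k):
    """P_ik = P(k | i) = X_ik / X_i, where X_i sums the counts over all context words of i."""
    X_i = sum(X[i].values())
    return X[i][k] / X_i if X_i else 0.0

# Toy usage; a real corpus would be orders of magnitude larger.
tokens = "ice is a solid and steam is a gas but both are forms of water".split()
X = cooccurrence_counts(tokens, window=3)
ratio = cooccurrence_prob(X, "ice", "solid") / cooccurrence_prob(X, "steam", "solid")
print(f"P(solid | ice) / P(solid | steam) = {ratio:.2f}")
```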
So let’s take a step back for a moment and recall the overall structure of the problem.
We want to take data from the corpus in the form of global statistics and learn a
function that gives us information about the relationship between any two words in
said corpus, given only the words themselves. Now the authors have discovered that
ratios of co-occurrence probabilities are a good source of this information, so it would
be nice if our function mapped the two words being compared, together with a
context word, to the space of co-occurrence probability ratios. So let the function our
model is learning be given by F. A naive interpretation of the desired model is given by
the authors as:
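$$F(w_i, w_j, \tilde{w}_k) = \frac{P_{ik}}{P_{jk}}$$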
Note that the w’s are real-valued word vectors. Now since we want to encode
information about the ratios between two words, the authors suggest using vector
differences as inputs to our function. Then we have the following:
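$$F(w_i - w_j, \tilde{w}_k) = \frac{P_{ik}}{P_{jk}}$$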
Now we’re getting closer to something that could work, but the final function for the
GloVe model will be considerably more complex to accurately reflect certain desirable
symmetries, since the distinction between the words i and j should be invariant under
commutation of the inputs. The authors also design a weighting scheme for co-
occurrences to reflect the relationship between frequency and semantic relevance. But
since I’m trying to keep these summary articles to around 800 words, we’ll cover all
that in Part III!
[EMNLP] What is GloVe? Part III
An introduction to unsupervised learning of word
embeddings from co-occurrence matrices.
towardsdatascience.com
Please check out the source paper!
GloVe: Global Vectors for Word Representation
Jeffrey Pennington, Richard Socher, Christopher D. Manning
Computer Science Department, Stanford University, Stanford, CA 94305
jpennin@stanford.edu, richard@socher.org, manning@stanford.edu
Abstract
Recent methods for learning vector space representations of words have succeeded in capturing fine-grained semantic and syntactic regularities using vector arithmetic, but the origin of these regularities has remained opaque. We analyze and make explicit the model properties needed for such regularities to emerge in word vectors. The result is a new global log-bilinear regression model that combines the advantages of the two major model families in the literature: global matrix factorization and local context window methods. Our model efficiently leverages statistical information by training only on the nonzero elements in a word-word co-occurrence matrix, rather than on the entire sparse matrix or on individual context windows in a large corpus. The model produces a vector space with meaningful substructure, as evidenced by its performance of 75% on a recent word analogy task. It also outperforms related models on similarity tasks and named entity recognition.
1 Introduction
Semantic vector space models of language represent each word with a real-valued vector. These vectors can be used as features in a variety of applications, such as information retrieval (Manning et al., 2008), document classification (Sebastiani, 2002), question answering (Tellex et al., 2003), named entity recognition (Turian et al., 2010), and parsing (Socher et al., 2013).

Most word vector methods rely on the distance or angle between pairs of word vectors as the primary method for evaluating the intrinsic quality of such a set of word representations. Recently, Mikolov et al. (2013c) introduced a new evaluation scheme based on word analogies that probes the finer structure of the word vector space by examining not the scalar distance between word vectors, but rather their various dimensions of difference. For example, the analogy “king is to queen as man is to woman” should be encoded in the vector space by the vector equation king − queen = man − woman. This evaluation scheme favors models that produce dimensions of meaning, thereby capturing the multi-clustering idea of distributed representations (Bengio, 2009).

The two main model families for learning word vectors are: 1) global matrix factorization methods, such as latent semantic analysis (LSA) (Deerwester et al., 1990), and 2) local context window methods, such as the skip-gram model of Mikolov et al. (2013c). Currently, both families suffer significant drawbacks. While methods like LSA efficiently leverage statistical information, they do relatively poorly on the word analogy task, indicating a sub-optimal vector space structure. Methods like skip-gram may do better on the analogy task, but they poorly utilize the statistics of the corpus since they train on separate local context windows instead of on global co-occurrence counts.

In this work, we analyze the model properties necessary to produce linear directions of meaning and argue that global log-bilinear regression models are appropriate for doing so. We propose a specific weighted least squares model that trains on global word-word co-occurrence counts and thus makes efficient use of statistics. The model produces a word vector space with meaningful substructure, as evidenced by its state-of-the-art performance of 75% accuracy on the word analogy dataset. We also demonstrate that our methods outperform other current methods on several word similarity tasks, and also on a common named entity recognition (NER) benchmark.

We provide the source code for the model as well as trained word vectors at http://nlp.stanford.edu/projects/glove/.