word2vec in theory and practice with tensorflow


www.bgoncalves.com | @bgoncalves
https://github.com/bmtgoncalves/word2vec-and-friends/
• Computers are really good at crunching numbers but not so much when it comes to words.
• Perhaps we can represent words numerically?



• Can we do it in a way that preserves semantic information?





• Words that have similar meanings are used in similar contexts, and the context in which a word is used helps us understand its meaning.
Teaching machines to read!
The red house is beautiful.

The blue house is old.

The red car is beautiful.

The blue car is old.
“You shall know a word by the company it keeps”

(J. R. Firth)
a 1
about 2
above 3
after 4
again 5
against 6
all 7
am 8
an 9
and 10
any 11
are 12
aren't 13
as 14
… …
One-hot encoding:
$v_{\text{after}} = (0, 0, 0, 1, 0, 0, \cdots)^T$
$v_{\text{above}} = (0, 0, 1, 0, 0, 0, \cdots)^T$
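As a concrete illustration, a minimal Python sketch of building a vocabulary index and the corresponding one-hot vectors (the function and variable names are illustrative, not from the talk):

import numpy as np

def build_vocabulary(words):
    # Map each distinct word to an integer index (sorted for reproducibility).
    return {word: idx for idx, word in enumerate(sorted(set(words)))}

def one_hot(word, vocabulary):
    # Return the one-hot vector for `word` over a vocabulary of size V.
    vector = np.zeros(len(vocabulary))
    vector[vocabulary[word]] = 1.0
    return vector

vocab = build_vocabulary("the red house is beautiful the blue house is old".split())
print(one_hot("house", vocab))   # a length-V vector with a single 1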
Teaching machines to read!
➡Words with similar meanings should have similar representations.
➡From a word we can get some idea about the context where it might appear 



➡And from the context we have some idea about possible words
“You shall know a word by the company it keeps”

(J. R. Firth)
max p (C|w)
max p (w|C)
The red _____ is beautiful.

The blue _____ is old.
___ ___ house __ ____.

___ ___ car __ _______.

word2vec
Skip-gram: $\max p(C|w)$        Continuous Bag of Words (CBOW): $\max p(w|C)$
[Architecture diagram: Skip-gram takes the one-hot vector of the word $w_j$, maps it through $\Theta_1$ (the word embeddings) and $\Theta_2$ (the context embeddings) via an activation function, and predicts the context words $w_{j-1}, w_{j+1}$; CBOW runs in the opposite direction, from context to word.]
Mikolov 2013
Skipgram
• Let us take a better look at a simplified case with a single context word.
• Words are one-hot encoded vectors of length V: $w_j = (0, 0, 1, 0, 0, 0, \cdots)^T$
• $\Theta_1$ is an $(M \times V)$ matrix, so that when we take the product $\Theta_1 \cdot w_j$ we are effectively selecting the j'th column of $\Theta_1$: $v_j = \Theta_1 \cdot w_j$
• The linear activation function simply passes this value along, which is then multiplied by $\Theta_2$, a $(V \times M)$ matrix.
• Each element k of the output layer is then given by $u_k^T \cdot v_j$, where $u_k$ is the k'th row of $\Theta_2$.
• We convert these values to a normalized probability distribution by using the softmax.
[Network diagram: $w_j \to \Theta_1 \to \Theta_2 \to \text{softmax} \to w_{j+1}$]
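As a sketch of this forward pass in NumPy (matrix names and sizes are assumptions chosen to match the notation above, not code from the talk):

import numpy as np

V, M = 10000, 300                      # vocabulary size and embedding dimension (example values)
theta1 = np.random.randn(M, V) * 0.01  # word embeddings, one column per word
theta2 = np.random.randn(V, M) * 0.01  # context embeddings, one row u_k per word

def forward(j):
    # Probability distribution over context words given the input word index j.
    w_j = np.zeros(V)
    w_j[j] = 1.0                 # one-hot input
    v_j = theta1 @ w_j           # equivalent to selecting column j of theta1
    scores = theta2 @ v_j        # u_k . v_j for every k
    scores -= scores.max()       # shift for numerical stability (does not change the softmax)
    exp_scores = np.exp(scores)
    return exp_scores / exp_scores.sum()   # softmax, defined on the next slide

p = forward(42)                  # p[k] approximates p(w_k | w_j)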
Softmax
• A standard way of converting a set of numbers to a normalized probability distribution:
$\mathrm{softmax}(x)_j = \dfrac{\exp(x_j)}{\sum_l \exp(x_l)}$
• With this final ingredient we obtain:
$p(w_k|w_j) \equiv \mathrm{softmax}\left(u_k^T \cdot v_j\right) = \dfrac{\exp\left(u_k^T \cdot v_j\right)}{\sum_l \exp\left(u_l^T \cdot v_j\right)}$
• Our goal is then to learn $\Theta_1$ and $\Theta_2$ so that we can predict what the next word is likely to be using $p(w_{j+1}|w_j)$.
• But how can we quantify how far we are from the correct answer? Our error measure shouldn't be just binary (right or wrong)…
Cross-Entropy
• First we have to recall that we are, in effect, comparing two probability distributions: $p(w_k|w_j)$ and the one-hot encoding of the context, $w_{j+1} = (0, 0, 0, 1, 0, 0, \cdots)^T$.
• The Cross Entropy measures the distance, in number of bits, between two probability distributions p and q:
$H(p, q) = -\sum_k p_k \log q_k$
• In our case, this becomes:
$H\left[w_{j+1}, p(w_k|w_j)\right] = -\sum_k w^k_{j+1} \log p(w_k|w_j)$
• So it's clear that the only non-zero term is the one that corresponds to the "hot" element of $w_{j+1}$:
$H = -\log p(w_{j+1}|w_j)$
• This is our Error function. But how can we use this to update the values of $\Theta_1$ and $\Theta_2$?
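A tiny NumPy check of this observation (illustrative names; p_pred stands for the softmax output of the network):

import numpy as np

def cross_entropy(p_true, p_pred):
    # H(p, q) = -sum_k p_k log q_k
    return -np.sum(p_true * np.log(p_pred))

V = 6
p_pred = np.full(V, 1.0 / V)   # some predicted distribution over the vocabulary
target = np.zeros(V)
target[3] = 1.0                # one-hot encoding of the observed context word

# Because the target is one-hot, only the "hot" term survives:
assert np.isclose(cross_entropy(target, p_pred), -np.log(p_pred[3]))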
Gradient Descent
• Find the gradient $\dfrac{\partial H}{\partial \theta_{mn}}$ for each training batch.
• Take a step downhill along the direction of the gradient:
$\theta_{mn} \leftarrow \theta_{mn} - \alpha \dfrac{\partial H}{\partial \theta_{mn}}$
• where $\alpha$ is the step size.
• Repeat until "convergence".
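As a minimal sketch of the update rule on a toy one-dimensional cost, $H(\theta) = (\theta - 3)^2$ (purely illustrative; in word2vec, θ stands for every entry of $\Theta_1$ and $\Theta_2$ and the gradient comes from backpropagating the cross-entropy):

import numpy as np

alpha = 0.1                    # step size
theta = np.random.randn()      # random initialization

for step in range(100):
    grad = 2 * (theta - 3.0)   # dH/dtheta for H = (theta - 3)^2
    theta -= alpha * grad      # take a step downhill along the gradient
print(theta)                   # approaches the minimum at 3.0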
Chain-rule
• How can we calculate
$\dfrac{\partial H}{\partial \theta_{mn}} = -\dfrac{\partial}{\partial \theta_{mn}} \log p(w_{j+1}|w_j), \qquad \theta_{mn} \in \left\{\theta^{(1)}_{mn}, \theta^{(2)}_{mn}\right\}$ ?
• we rewrite (with k the index of the observed context word $w_{j+1}$):
$\dfrac{\partial H}{\partial \theta_{mn}} = -\dfrac{\partial}{\partial \theta_{mn}} \log \dfrac{\exp\left(u_k^T \cdot v_j\right)}{\sum_l \exp\left(u_l^T \cdot v_j\right)}$
• and expand:
$u_k^T \cdot v_j = \sum_q \theta^{(2)}_{kq} \theta^{(1)}_{qj}$
• Then we can rewrite:
$\dfrac{\partial H}{\partial \theta_{mn}} = -\dfrac{\partial}{\partial \theta_{mn}} \left[ u_k^T \cdot v_j - \log \sum_l \exp\left(u_l^T \cdot v_j\right) \right]$
• and apply the chain rule:
$\dfrac{\partial f(g(x))}{\partial x} = \dfrac{\partial f(g(x))}{\partial g(x)} \cdot \dfrac{\partial g(x)}{\partial x}$
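Carrying the derivative through gives the standard softmax-plus-cross-entropy result, sketched here for completeness rather than taken from the slide. Writing $s_k = u_k^T \cdot v_j$ for the scores and $k^*$ for the index of the observed context word $w_{j+1}$:

$\dfrac{\partial H}{\partial s_k} = p(w_k|w_j) - \delta_{k,k^*}$

so that, by the chain rule, $\dfrac{\partial H}{\partial u_k} = \left[p(w_k|w_j) - \delta_{k,k^*}\right] v_j$ and $\dfrac{\partial H}{\partial v_j} = \sum_k \left[p(w_k|w_j) - \delta_{k,k^*}\right] u_k$: the familiar "prediction minus target" form.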
SkipGram with Larger Contexts
• Use the same $\Theta_2$ for all context words.
• Use the average of the cross entropies:
$H = -\log p(w_{j+1}|w_j) \quad\longrightarrow\quad H = -\frac{1}{T} \sum_t \log p(w_{j+t}|w_j)$
(with the sum running over the $T$ context offsets $t \neq 0$).
• Word order is not important (the average does not change).
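A minimal sketch of how (word, context) training pairs might be generated from a token stream with a symmetric window (names are illustrative):

def skipgram_pairs(tokens, window=2):
    # Yield (center, context) pairs for every position and every offset within the window.
    for j, center in enumerate(tokens):
        for t in range(-window, window + 1):
            if t == 0:
                continue
            if 0 <= j + t < len(tokens):
                yield center, tokens[j + t]

pairs = list(skipgram_pairs("the red house is beautiful".split(), window=2))
# includes e.g. ('house', 'red'), ('house', 'is'), ('red', 'the'), ...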
Continuous Bag of Words
• The process is essentially the same, just with the direction reversed.
[Diagram: the context words $w_{j-1}$ and $w_{j+1}$ feed through $\Theta_2$, and the central word $w_j$ is predicted through $\Theta_1$.]
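A sketch of the CBOW forward pass under the common formulation in which the context embeddings are averaged (the matrix names and roles here are an assumption for illustration; the slide's diagram labels the matrices in the mirrored order):

import numpy as np

V, M = 10000, 300
embed_in = np.random.randn(V, M) * 0.01    # embeddings applied to the context words
embed_out = np.random.randn(V, M) * 0.01   # embeddings used to score the central word

def cbow_forward(context_indices):
    # Average the context embeddings, then softmax over all candidate central words.
    h = embed_in[context_indices].mean(axis=0)
    scores = embed_out @ h
    scores -= scores.max()
    p = np.exp(scores)
    return p / p.sum()                      # p[k] approximates p(w_j = k | context)

p = cbow_forward([17, 93])                  # e.g. the indices of w_{j-1} and w_{j+1}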
Variations
• Hierarchical Softmax:
• Approximate the softmax using a binary tree.
• Reduce the number of calculations per training example from $V$ to $\log_2 V$ and increase performance by orders of magnitude.
• Negative Sampling:
• Undersample the most frequent words by removing them from the text before generating the contexts.
• Similar idea to removing stop-words — very frequent words are less informative.
• Effectively makes the window larger, increasing the amount of information available for context.
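Negative sampling proper replaces the full softmax by contrasting the observed context word with a handful of randomly drawn "negative" words; the frequent-word undersampling described above is a companion preprocessing trick from the same paper. In TensorFlow 1.x (the API used later in these slides) this kind of sampled objective is typically written with tf.nn.nce_loss or tf.nn.sampled_softmax_loss. A minimal sketch, with sizes and variable names chosen for illustration:

import math
import tensorflow as tf

V, M = 50000, 128           # vocabulary size and embedding dimension (example values)
num_sampled = 64            # number of negative samples per training example

train_inputs = tf.placeholder(tf.int32, shape=[None])      # indices of center words
train_labels = tf.placeholder(tf.int32, shape=[None, 1])   # indices of observed context words

embeddings = tf.Variable(tf.random_uniform([V, M], -1.0, 1.0))
embed = tf.nn.embedding_lookup(embeddings, train_inputs)    # like selecting columns of Theta1

nce_weights = tf.Variable(tf.truncated_normal([V, M], stddev=1.0 / math.sqrt(M)))
nce_biases = tf.Variable(tf.zeros([V]))

loss = tf.reduce_mean(
    tf.nn.nce_loss(weights=nce_weights, biases=nce_biases,
                   labels=train_labels, inputs=embed,
                   num_sampled=num_sampled, num_classes=V))
optimizer = tf.train.GradientDescentOptimizer(1.0).minimize(loss)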
Comments
• word2vec, even in its original formulation, is actually a family of algorithms using various combinations of:
• Skip-gram, CBOW
• Hierarchical Softmax, Negative Sampling
• The output of this neural network is deterministic:
• If two words appear in the same context ("blue" vs "red", e.g.), they will have similar internal representations in $\Theta_1$ and $\Theta_2$.
• $\Theta_1$ and $\Theta_2$ are vector embeddings of the input words and the context words, respectively.
• Words that are too rare are also removed.
• The original implementation had a dynamic window size:
• for each word in the corpus a window size $k'$ is sampled uniformly between 1 and $k$.
Online resources
• C - https://code.google.com/archive/p/word2vec/ (the original one)
• Python/tensorflow - https://www.tensorflow.org/tutorials/word2vec
• Both a minimalist and an efficient version are available in the tutorial
• Python/gensim - https://radimrehurek.com/gensim/models/word2vec.html
• Pretrained embeddings:
• 90 languages, trained using Wikipedia: https://github.com/facebookresearch/fastText/blob/master/pretrained-vectors.md
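For instance, a minimal gensim sketch (parameter names follow gensim 3.x, roughly contemporary with this talk; gensim 4.x renames size to vector_size):

from gensim.models import Word2Vec

sentences = [
    "the red house is beautiful".split(),
    "the blue house is old".split(),
    "the red car is beautiful".split(),
    "the blue car is old".split(),
]

# sg=1 selects skip-gram (sg=0 is CBOW); hs=0 together with negative>0 uses negative sampling.
model = Word2Vec(sentences, size=50, window=2, min_count=1, sg=1, hs=0, negative=5)

print(model.wv["house"][:5])           # the learned embedding vector (toy corpus, so not meaningful)
print(model.wv.most_similar("house"))  # nearest neighbours in embedding space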
Analogies
• The embedding of each word is a function of the context it appears in:
$v(\text{red}) = f(\text{context}(\text{red}))$
• Words that appear in similar contexts will have similar embeddings:
$\text{context}(\text{red}) \approx \text{context}(\text{blue}) \implies v(\text{red}) \approx v(\text{blue})$
• The "Distributional hypothesis" in linguistics.
“You shall know a word by the company it keeps”

(J. R. Firth)
Geometrical relations between contexts imply semantic relations between words!
[Diagram: the capitals Paris, Rome, Washington DC and Lisbon line up with their countries France, Italy, USA and Portugal, forming parallel "capital" and "country" contexts.]
$v(\text{France}) - v(\text{Paris}) + v(\text{Rome}) = v(\text{Italy})$
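With embeddings trained on a large corpus (for example the pretrained .vec files linked above), gensim's most_similar implements exactly this vector arithmetic; the file name below is a placeholder:

from gensim.models import KeyedVectors

# Placeholder path; substitute any word2vec-format embedding file.
vectors = KeyedVectors.load_word2vec_format("pretrained_vectors.vec")
result = vectors.most_similar(positive=["France", "Rome"], negative=["Paris"], topn=3)
print(result)   # for good embeddings, "Italy" should rank near the top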
Visualization
https://github.com/bmtgoncalves/word2vec-and-friends/
Tensorflow
A diversion… https://www.tensorflow.org/
• Let's imagine I want to perform these calculations:
$y = f(x)$
$z = g(y)$
• for some given $x$.
• To calculate $z$ we must follow a certain sequence of operations, which can be shortened if we are interested in just the value of $y$.
• In Tensorflow, this is called a Computational Graph and it's the most fundamental concept to understand.
• Data, in the form of tensors, flows through the graph from inputs to outputs.
• Tensorflow is, essentially, a way of defining arbitrary computational graphs in a way that can be automatically distributed and optimized.
[Diagram: $x \to$ Apply $f \to$ Assign $y \to$ Apply $g \to$ Assign $z$]
• If we use base functions, tensorflow knows how to automatically calculate the respective
gradients
• Automatic BackProp
• Graphs can have multiple outputs
• Predictions
• Cost functions
• etc…
Computational Graphs https://www.tensorflow.org/
Sessions
• After we have defined the computational graph, we can start using it
to make calculations
• All computations must take place within a "session" that supplies the values of all required inputs.
https://www.tensorflow.org/
• Which values are required for a specific computation depends on what part of the graph is actually being executed.
• When you request the value of a specific output, tensorflow
determines what is the specific subgraph that must be executed
and what are the required input values.
• For optimization purposes, it can also execute independent parts of
the graph in different devices (CPUs, GPUs, TPUs, etc) at the same
time.
A basic Tensorflow program https://www.tensorflow.org/
import tensorflow as tf

x = tf.placeholder(tf.float32)
y = tf.placeholder(tf.float32)
c = tf.constant(3.)

m = tf.add(x, y)
z = tf.multiply(m, c)

with tf.Session() as sess:
    output = sess.run(z, feed_dict={x: 1., y: 2.})
    print("Output value is:", output)

basic.py
$z = c * (x + y)$
[Graph diagram: the placeholders x and y feed an add node; its result and the constant c feed a multiply node that produces z.]
Running the same graph with the inputs x = 1 and y = 2, the values flow through the add and multiply nodes to produce the output 9.
Linear Regression
linear.py

import numpy as np
import matplotlib.pyplot as plt
import tensorflow as tf

learning_rate = 0.01
N = 100
N_steps = 300

# Training Data
train_X = np.linspace(-10, 10, N)
train_Y = 2*train_X + 3 + 5*np.random.random(N)

# Computational Graph
X = tf.placeholder("float")
Y = tf.placeholder("float")

W = tf.Variable(np.random.randn(), name="weight")
b = tf.Variable(np.random.randn(), name="bias")

y = tf.add(tf.multiply(X, W), b)
cost = tf.reduce_mean(tf.pow(y-Y, 2))

optimizer = tf.train.GradientDescentOptimizer(learning_rate).minimize(cost)
init = tf.global_variables_initializer()

with tf.Session() as sess:
    sess.run(init)

    for step in range(N_steps):
        sess.run(optimizer, feed_dict={X: train_X, Y: train_Y})

        cost_val, W_val, b_val = sess.run([cost, W, b], feed_dict={X: train_X, Y: train_Y})
        print("step", step, "cost", cost_val, "w", W_val, "b", b_val)

$y = W * x + b$
$\mathrm{cost} = \frac{1}{N}\sum_i (y_i - Y_i)^2$
Jupyter Notebook