2. Administrative
● A3 is out. Due May 22.
● Milestone is due next Wednesday.
○ Read Piazza post for milestone requirements.
○ Need to finish data preprocessing and have initial results by then.
● Don't discuss exam yet since people are still taking it.
3. Overview
● Unsupervised Learning
● Generative Models
○ PixelRNN and PixelCNN
○ Variational Autoencoders (VAE)
○ Generative Adversarial Networks (GAN)
4. Supervised vs Unsupervised Learning
Supervised Learning
Data: (x, y)
x is data, y is label
Goal: Learn a function to map x -> y
Examples: Classification, regression, object detection, semantic segmentation, image captioning, etc.
5. Supervised vs Unsupervised Learning
Supervised Learning example: Classification - map an image x to a label y (e.g. "Cat").
This image is CC0 public domain.
6. Supervised vs Unsupervised Learning
Supervised Learning example: Object Detection - map an image x to labeled boxes (e.g. DOG, DOG, CAT).
This image is CC0 public domain.
7. Supervised vs Unsupervised Learning
Supervised Learning example: Semantic Segmentation - map an image x to a per-pixel labeling (e.g. GRASS, CAT, TREE, SKY).
8. Supervised vs Unsupervised Learning
Supervised Learning example: Image Captioning - map an image x to a sentence (e.g. "A cat sitting on a suitcase on the floor").
Caption generated using neuraltalk2. Image is CC0 public domain.
9. Supervised vs Unsupervised Learning
Unsupervised Learning
Data: x - just data, no labels!
Goal: Learn some underlying hidden structure of the data
Examples: Clustering, dimensionality reduction, feature learning, density estimation, etc.
10. Supervised vs Unsupervised Learning
Unsupervised Learning example: K-means clustering.
This image is CC0 public domain.
11. Supervised vs Unsupervised Learning
Unsupervised Learning example: Principal Component Analysis (dimensionality reduction), e.g. projecting 3-d data to 2-d.
This image from Matthias Scholz is CC0 public domain.
12. Supervised vs Unsupervised Learning
Unsupervised Learning example: Autoencoders (feature learning).
13. Supervised vs Unsupervised Learning
Unsupervised Learning example: 1-d and 2-d density estimation.
2-d density images left and right are CC0 public domain. Figure copyright Ian Goodfellow, 2016. Reproduced with permission.
14. Supervised vs Unsupervised Learning
Supervised Learning - Data: (x, y); x is data, y is label. Goal: learn a function to map x -> y. Examples: classification, regression, object detection, semantic segmentation, image captioning, etc.
Unsupervised Learning - Data: x; just data, no labels! Goal: learn some underlying hidden structure of the data. Examples: clustering, dimensionality reduction, feature learning, density estimation, etc.
15. Supervised vs Unsupervised Learning
Unsupervised Learning - training data is cheap (no labels needed).
Holy grail: solve unsupervised learning => understand the structure of the visual world.
16. Generative Models
Given training data, generate new samples from same
distribution
Training data ~ p_data(x); generated samples ~ p_model(x).
Want to learn p_model(x) similar to p_data(x).
17. Generative Models
Given training data, generate new samples from same
distribution
Training data ~ p_data(x); generated samples ~ p_model(x). Want to learn p_model(x) similar to p_data(x).
Addresses density estimation, a core problem in unsupervised learning.
Several flavors:
- Explicit density estimation: explicitly define and solve for p_model(x)
- Implicit density estimation: learn a model that can sample from p_model(x) without explicitly defining it
18. Why Generative Models?
- Realistic samples for artwork, super-resolution, colorization, etc.
- Generative models of time-series data can be used for simulation and planning (reinforcement learning applications!)
- Training generative models can also enable inference of latent representations that can be useful as general features
Figures from L-R are copyright: (1) Alec Radford et al. 2016; (2) Phillip Isola et al. 2017, reproduced with authors' permission; (3) BAIR Blog.
19. Taxonomy of Generative Models
Generative models
- Explicit density
  - Tractable density: Fully Visible Belief Nets (NADE, MADE, PixelRNN/CNN), NICE / RealNVP, Glow, Ffjord
  - Approximate density
    - Variational: Variational Autoencoder
    - Markov Chain: Boltzmann Machine
- Implicit density
  - Direct: GAN
  - Markov Chain: GSN
Figure copyright and adapted from Ian Goodfellow, Tutorial on Generative Adversarial Networks, 2017.
20. Taxonomy of Generative Models
Today: discuss the 3 most popular types of generative models - PixelRNN/CNN (Fully Visible Belief Nets), Variational Autoencoder, and GAN.
Figure copyright and adapted from Ian Goodfellow, Tutorial on Generative Adversarial Networks, 2017.
21. PixelRNN and PixelCNN
22. Fully visible belief network
Explicit density model.
Use chain rule to decompose the likelihood of an image x into a product of 1-d distributions: the likelihood of image x is the product over pixels of the probability of the i'th pixel value given all previous pixels.
Then maximize the likelihood of the training data.
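The equation on this slide is rendered as an image and not captured in the transcript; the chain-rule factorization it describes is the standard one, with training maximizing the log-likelihood of the training images:

```latex
p(x) = \prod_{i=1}^{n} p(x_i \mid x_1, \dots, x_{i-1})
```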
23. Fully visible belief network
The conditional distribution over pixel values is complex => express it using a neural network!
Then maximize the likelihood of the training data.
24. Fully visible belief network
We will need to define an ordering of the "previous pixels".
25. PixelRNN [van den Oord et al. 2016]
Generate image pixels starting from the corner.
The dependency on previous pixels is modeled using an RNN (LSTM).
26-28. PixelRNN [van den Oord et al. 2016]
Drawback: sequential generation is slow!
29. PixelCNN [van den Oord et al. 2016]
Still generate image pixels starting from the corner.
The dependency on previous pixels is now modeled using a CNN over a context region.
Figure copyright van den Oord et al., 2016. Reproduced with permission.
30. PixelCNN [van den Oord et al. 2016]
Training: maximize the likelihood of the training images, with a softmax loss at each pixel.
31. PixelCNN [van den Oord et al. 2016]
Training is faster than PixelRNN (convolutions can be parallelized since the context-region values are known from the training images).
Generation must still proceed sequentially => still slow.
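A minimal sketch of PixelCNN-style training with a softmax loss at each pixel, assuming PyTorch; the masked convolution, layer sizes, and random stand-in batch below are illustrative assumptions, not the paper's architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MaskedConv2d(nn.Conv2d):
    """Conv layer whose kernel is masked so each output pixel only sees
    pixels above it and to its left (the "context region")."""
    def __init__(self, mask_type, *args, **kwargs):
        super().__init__(*args, **kwargs)
        kH, kW = self.kernel_size
        mask = torch.ones_like(self.weight)
        mask[:, :, kH // 2, kW // 2 + (mask_type == 'B'):] = 0  # center row, at/right of center
        mask[:, :, kH // 2 + 1:, :] = 0                          # all rows below center
        self.register_buffer('mask', mask)

    def forward(self, x):
        self.weight.data *= self.mask
        return super().forward(x)

# A tiny PixelCNN: predicts a 256-way softmax over each pixel value.
model = nn.Sequential(
    MaskedConv2d('A', 1, 64, 7, padding=3), nn.ReLU(),
    MaskedConv2d('B', 64, 64, 3, padding=1), nn.ReLU(),
    nn.Conv2d(64, 256, 1),                  # logits over 256 intensity levels per pixel
)

opt = torch.optim.Adam(model.parameters(), lr=1e-3)
x = torch.randint(0, 256, (8, 1, 28, 28))   # stand-in minibatch of grayscale images
logits = model(x.float() / 255.0)           # (B, 256, H, W)
loss = F.cross_entropy(logits, x.squeeze(1))  # softmax loss at each pixel
loss.backward(); opt.step()
```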
32. Generation Samples
32x32 CIFAR-10 samples and 32x32 ImageNet samples.
Figures copyright Aaron van den Oord et al., 2016. Reproduced with permission.
33. PixelRNN and PixelCNN
Improving PixelCNN performance:
- Gated convolutional layers
- Short-cut connections
- Discretized logistic loss
- Multi-scale
- Training tricks
- Etc.
See van den Oord et al., NIPS 2016, and Salimans et al., 2017 (PixelCNN++).
Pros:
- Can explicitly compute likelihood p(x)
- Explicit likelihood of training data gives a good evaluation metric
- Good samples
Con:
- Sequential generation => slow
35. So far...
PixelCNNs define a tractable density function and optimize the likelihood of the training data.
36. So far...
VAEs define an intractable density function with latent z: we cannot optimize it directly, so we derive and optimize a lower bound on the likelihood instead.
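The two density expressions on these slides appeared as images; in the standard notation of the cited papers they are:

```latex
\text{PixelCNN:}\quad p_\theta(x) = \prod_{i=1}^{n} p_\theta(x_i \mid x_1, \dots, x_{i-1})
\qquad
\text{VAE:}\quad p_\theta(x) = \int p_\theta(z)\, p_\theta(x \mid z)\, dz
```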
37. Some background first: Autoencoders
Input data x -> Encoder -> Features z.
An unsupervised approach for learning a lower-dimensional feature representation from unlabeled training data.
38. Some background first: Autoencoders
Encoder: originally linear + nonlinearity (sigmoid); later deep, fully-connected; later ReLU CNN.
39. Some background first: Autoencoders
z is usually smaller than x (dimensionality reduction).
Q: Why dimensionality reduction?
40. Some background first: Autoencoders
A: We want the features to capture meaningful factors of variation in the data.
41. Some background first: Autoencoders
How do we learn this feature representation?
42. Some background first: Autoencoders
Train such that the features can be used to reconstruct the original data ("autoencoding" - encoding the data itself).
Features z -> Decoder -> Reconstructed input data.
43. Some background first: Autoencoders
Decoder: originally linear + nonlinearity (sigmoid); later deep, fully-connected; later ReLU CNN (upconv).
44. Some background first: Autoencoders
Example: encoder is a 4-layer conv network, decoder is a 4-layer upconv network.
Input data -> Encoder -> Features -> Decoder -> Reconstructed data.
45. Some background first: Autoencoders
L2 loss function on the reconstruction, ||x - x_hat||^2: train such that the features can be used to reconstruct the original data.
Encoder: 4-layer conv; decoder: 4-layer upconv.
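A minimal sketch of such a conv encoder / upconv decoder trained with an L2 reconstruction loss, assuming PyTorch; the layer sizes here are illustrative, not the lecture's.

```python
import torch
import torch.nn as nn

encoder = nn.Sequential(
    nn.Conv2d(3, 32, 4, stride=2, padding=1), nn.ReLU(),    # 32x32 -> 16x16
    nn.Conv2d(32, 64, 4, stride=2, padding=1), nn.ReLU(),   # 16x16 -> 8x8
    nn.Conv2d(64, 128, 4, stride=2, padding=1), nn.ReLU(),  # 8x8 -> 4x4
    nn.Conv2d(128, 64, 4), nn.Flatten(),                    # 4x4 -> 1x1; z has 64 dims
)
decoder = nn.Sequential(
    nn.Unflatten(1, (64, 1, 1)),
    nn.ConvTranspose2d(64, 128, 4), nn.ReLU(),                        # 1x1 -> 4x4
    nn.ConvTranspose2d(128, 64, 4, stride=2, padding=1), nn.ReLU(),   # 4x4 -> 8x8
    nn.ConvTranspose2d(64, 32, 4, stride=2, padding=1), nn.ReLU(),    # 8x8 -> 16x16
    nn.ConvTranspose2d(32, 3, 4, stride=2, padding=1),                # 16x16 -> 32x32
)

opt = torch.optim.Adam(list(encoder.parameters()) + list(decoder.parameters()), lr=1e-3)
x = torch.rand(8, 3, 32, 32)        # stand-in minibatch of images
z = encoder(x)                      # lower-dimensional features
x_hat = decoder(z)                  # reconstructed input
loss = ((x - x_hat) ** 2).mean()    # L2 reconstruction loss
loss.backward(); opt.step()
```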
47. Some background first: Autoencoders
After training, throw away the decoder.
48. Some background first: Autoencoders
The encoder can be used to initialize a supervised model: attach a classifier on top of the features, use a supervised loss (softmax, etc.) to predict labels (plane, dog, deer, bird, truck, ...), and fine-tune the encoder jointly with the classifier.
Train for the final task (sometimes with small data).
49. Some background first: Autoencoders
Autoencoders can reconstruct data and can learn features to initialize a supervised model.
The features capture factors of variation in the training data. Can we generate new images from an autoencoder?
50. Variational Autoencoders
A probabilistic spin on autoencoders - will let us sample from the model to generate data!
51. Variational Autoencoders
A probabilistic spin on autoencoders - will let us sample from the model to generate data!
Assume the training data is generated from an underlying unobserved (latent) representation z: sample z from the true prior, then sample x from the true conditional given z.
Kingma and Welling, "Auto-Encoding Variational Bayes", ICLR 2014
52. Variational Autoencoders
Intuition (remember from autoencoders!): x is an image, z is the latent factors used to generate x: attributes, orientation, etc.
53. Variational Autoencoders
We want to estimate the true parameters of this generative model.
54. Variational Autoencoders
How should we represent this model?
55. Variational Autoencoders
Choose the prior p(z) to be simple, e.g. Gaussian. This is reasonable for latent attributes, e.g. pose, how much smile.
56. Variational Autoencoders
The conditional p(x|z) is complex (it generates an image) => represent it with a neural network: the decoder network.
Kingma and Welling, "Auto-Encoding Variational Bayes", ICLR 2014
57. Variational Autoencoders
How do we train the model?
58. Variational Autoencoders
Remember the strategy for training generative models from FVBNs: learn model parameters to maximize the likelihood of the training data.
59. Variational Autoencoders
Now with latent z, the data likelihood requires marginalizing over z.
60. Variational Autoencoders
Q: What is the problem with this?
61. Variational Autoencoders
A: The likelihood is intractable!
Kingma and Welling, "Auto-Encoding Variational Bayes", ICLR 2014
66. Variational Autoencoders: Intractability
Data likelihood: pθ(x) = ∫ pθ(z) pθ(x|z) dz. The prior pθ(z) and the decoder output pθ(x|z) are each tractable (✔), but the integral over all z is intractable.
The posterior density pθ(z|x) = pθ(x|z) pθ(z) / pθ(x) is also intractable, since it requires the intractable pθ(x).
68. Variational Autoencoders: Intractability
Solution: In addition to the decoder network modeling pθ(x|z), define an additional encoder network qɸ(z|x) that approximates the intractable posterior pθ(z|x).
We will see that this allows us to derive a lower bound on the data likelihood that is tractable, which we can optimize.
Kingma and Welling, "Auto-Encoding Variational Bayes", ICLR 2014
69. Variational Autoencoders
Since we are modeling probabilistic generation of data, the encoder and decoder networks are probabilistic:
- Encoder network (parameters ɸ): outputs the mean and (diagonal) covariance of z | x
- Decoder network (parameters θ): outputs the mean and (diagonal) covariance of x | z
Kingma and Welling, "Auto-Encoding Variational Bayes", ICLR 2014
70. Variational Autoencoders
Sample z from the encoder distribution qɸ(z|x); sample x|z from the decoder distribution pθ(x|z).
71. Variational Autoencoders
The encoder and decoder networks are also called "recognition"/"inference" and "generation" networks.
Kingma and Welling, "Auto-Encoding Variational Bayes", ICLR 2014
72. Variational Autoencoders
Now equipped with our encoder and decoder networks, let's work out the (log) data likelihood.
73-77. Variational Autoencoders
Key step: take the expectation of log pθ(x) with respect to z sampled from the encoder distribution qɸ(z|x) (this will come in handy later), then expand the log term and regroup it into an expectation plus two KL-divergence terms.
78. Variational Autoencoders
The expectation with respect to z (using the encoder network) lets us write nice KL terms.
79. Variational Autoencoders
- Decoder term: the decoder network gives pθ(x|z), so we can compute an estimate of this term through sampling (the sampling is differentiable through the reparameterization trick, see paper).
- KL term between the encoder distribution qɸ(z|x) and the prior on z: both are Gaussians, so it has a nice closed-form solution!
- KL term involving pθ(z|x): intractable (saw earlier), so we can't compute it, but we know KL divergence is always >= 0.
80. Variational Autoencoders
We want to maximize the data likelihood. The first two terms are tractable (sampling for the decoder term, closed form for the KL to the prior), while the third KL term is intractable but always >= 0.
81. Variational Autoencoders
Dropping the intractable KL term leaves a tractable lower bound which we can take the gradient of and optimize! (pθ(x|z) is differentiable, and the closed-form KL term is differentiable.)
82. Variational Autoencoders
This is the variational lower bound ("ELBO").
Training: maximize the lower bound.
83. Variational Autoencoders
The first term of the bound says: reconstruct the input data. The second (KL) term says: make the approximate posterior distribution close to the prior.
Training: maximize the lower bound.
84. Variational Autoencoders
Putting it all together: maximizing the likelihood lower bound.
85. Variational Autoencoders
Let's look at computing the bound (the forward pass) for a given minibatch of input data.
86. Variational Autoencoders
Pass the input minibatch through the encoder network to get qɸ(z|x).
87. Variational Autoencoders
Compute the KL term that makes the approximate posterior distribution close to the prior.
88. Variational Autoencoders
Sample z from qɸ(z|x).
90. Variational Autoencoders
Pass z through the decoder network to get pθ(x|z), sample x|z from it, and maximize the likelihood of the original input being reconstructed.
91. Variational Autoencoders
For every minibatch of input data: compute this forward pass, and then backprop!
92. Variational Autoencoders: Generating Data!
Use the decoder network, but now sample z from the prior!
93. Variational Autoencoders: Generating Data!
Sample z from the prior p(z), then sample x|z from the decoder pθ(x|z).
94. Variational Autoencoders: Generating Data!
Data manifold for 2-d z: varying z1 and z2 sweeps out the manifold of generated images.
Kingma and Welling, "Auto-Encoding Variational Bayes", ICLR 2014
95. Variational Autoencoders: Generating Data!
Varying z1 and z2 changes interpretable factors of variation such as degree of smile and head pose. The diagonal prior on z => independent latent variables, so different dimensions of z encode interpretable factors of variation.
96. Variational Autoencoders: Generating Data!
z is also a good feature representation, and it can be computed using qɸ(z|x)!
Kingma and Welling, "Auto-Encoding Variational Bayes", ICLR 2014
97. Variational Autoencoders: Generating Data!
Samples on 32x32 CIFAR-10 and on Labeled Faces in the Wild.
Figures copyright (L) Dirk Kingma et al. 2016; (R) Anders Larsen et al. 2017. Reproduced with permission.
98. Variational Autoencoders
A probabilistic spin on traditional autoencoders => allows generating data.
Defines an intractable density => derive and optimize a (variational) lower bound.
Pros:
- Principled approach to generative models
- Allows inference of q(z|x), which can be a useful feature representation for other tasks
Cons:
- Maximizes a lower bound of the likelihood: okay, but not as good an evaluation as PixelRNN/PixelCNN
- Samples are blurrier and lower quality compared to state-of-the-art (GANs)
Active areas of research:
- More flexible approximations, e.g. a richer approximate posterior instead of a diagonal Gaussian, e.g. Gaussian mixture models
- Incorporating structure in latent variables, e.g. categorical distributions
100. So far...
PixelCNNs define a tractable density function and optimize the likelihood of the training data.
VAEs define an intractable density function with latent z: we cannot optimize it directly, so we derive and optimize a lower bound on the likelihood instead.
101. So far...
What if we give up on explicitly modeling the density and just want the ability to sample?
102. So far...
GANs: don't work with any explicit density function!
Instead, take a game-theoretic approach: learn to generate from the training distribution through a 2-player game.
103. Generative Adversarial Networks
Ian Goodfellow et al., "Generative Adversarial Nets", NIPS 2014
Problem: We want to sample from a complex, high-dimensional training distribution. There is no direct way to do this!
Solution: Sample from a simple distribution, e.g. random noise, and learn a transformation to the training distribution.
Q: What can we use to represent this complex transformation?
104. Generative Adversarial Networks
A: A neural network! Input: random noise z -> Generator Network -> Output: sample from the training distribution.
105. Training GANs: Two-player game
Generator network: tries to fool the discriminator by generating real-looking images.
Discriminator network: tries to distinguish between real and fake images.
106. Training GANs: Two-player game
Random noise z -> Generator Network -> fake images. These, together with real images from the training set, are fed to the Discriminator Network, which outputs Real or Fake.
Ian Goodfellow et al., "Generative Adversarial Nets", NIPS 2014
Fake and real images copyright Emily Denton et al. 2015. Reproduced with permission.
107. Training GANs: Two-player game
Train the two networks jointly in a minimax game with a minimax objective function.
108. Training GANs: Two-player game
The discriminator outputs a likelihood in (0, 1) that an image is real: D(x) is the discriminator output for real data x, and D(G(z)) is the discriminator output for generated fake data G(z).
109. Training GANs: Two-player game
- The discriminator (θd) wants to maximize the objective such that D(x) is close to 1 (real) and D(G(z)) is close to 0 (fake).
- The generator (θg) wants to minimize the objective such that D(G(z)) is close to 1 (the discriminator is fooled into thinking the generated G(z) is real).
Ian Goodfellow et al., "Generative Adversarial Nets", NIPS 2014
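The objective itself appeared as an image on these slides; from the cited Goodfellow et al. paper it is:

```latex
\min_{\theta_g}\ \max_{\theta_d}\;
  \mathbb{E}_{x \sim p_{\text{data}}}\big[\log D_{\theta_d}(x)\big]
  + \mathbb{E}_{z \sim p(z)}\big[\log\big(1 - D_{\theta_d}(G_{\theta_g}(z))\big)\big]
```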
110. Training GANs: Two-player game
Alternate between:
1. Gradient ascent on the discriminator
2. Gradient descent on the generator
111. Training GANs: Two-player game
In practice, optimizing this generator objective does not work well! When a sample is likely fake, we want to learn from it to improve the generator, but the gradient of log(1 - D(G(z))) in that region is relatively flat; the gradient signal is dominated by the region where the sample is already good.
Ian Goodfellow et al., "Generative Adversarial Nets", NIPS 2014
112. Training GANs: Two-player game
Alternate between:
1. Gradient ascent on the discriminator
2. Instead: gradient ascent on the generator with a different objective. Instead of minimizing the likelihood of the discriminator being correct, now maximize the likelihood of the discriminator being wrong.
Same objective of fooling the discriminator, but now there is a higher gradient signal for bad samples => works much better! Standard in practice.
113. Training GANs: Two-player game
Aside: Jointly training two networks is challenging and can be unstable. Choosing objectives with better loss landscapes helps training; this is an active area of research.
Ian Goodfellow et al., "Generative Adversarial Nets", NIPS 2014
114. Training GANs: Two-player game
Putting it together: the GAN training algorithm alternates k discriminator updates with one generator update.
115. Training GANs: Two-player game
Some find k = 1 more stable, others use k > 1; there is no best rule. Recent work (e.g. Wasserstein GAN) alleviates this problem and gives better stability!
Ian Goodfellow et al., "Generative Adversarial Nets", NIPS 2014
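A minimal sketch of that alternating loop (assuming PyTorch; the tiny fully-connected G and D, the learning rates, and the loader of flattened real images are illustrative assumptions, not the paper's setup):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

z_dim, k = 100, 1   # k discriminator steps per generator step (k = 1 is common)
G = nn.Sequential(nn.Linear(z_dim, 256), nn.ReLU(), nn.Linear(256, 784), nn.Tanh())
D = nn.Sequential(nn.Linear(784, 256), nn.LeakyReLU(0.2), nn.Linear(256, 1))
opt_g = torch.optim.Adam(G.parameters(), lr=2e-4)
opt_d = torch.optim.Adam(D.parameters(), lr=2e-4)

def d_step(x_real):
    # Gradient ascent on the discriminator: push D(x) -> 1 and D(G(z)) -> 0.
    z = torch.randn(x_real.size(0), z_dim)
    logits_real, logits_fake = D(x_real), D(G(z).detach())
    loss = F.binary_cross_entropy_with_logits(logits_real, torch.ones_like(logits_real)) + \
           F.binary_cross_entropy_with_logits(logits_fake, torch.zeros_like(logits_fake))
    opt_d.zero_grad(); loss.backward(); opt_d.step()

def g_step(batch_size):
    # Non-saturating generator objective: maximize log D(G(z)).
    z = torch.randn(batch_size, z_dim)
    logits_fake = D(G(z))
    loss = F.binary_cross_entropy_with_logits(logits_fake, torch.ones_like(logits_fake))
    opt_g.zero_grad(); loss.backward(); opt_g.step()

for x_real in loader:                  # assumed loader of flattened images scaled to [-1, 1]
    for _ in range(k):
        d_step(x_real)
    g_step(x_real.size(0))
```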
116. Training GANs: Two-player game
After training, use the generator network to generate new images: feed it random noise z.
Ian Goodfellow et al., "Generative Adversarial Nets", NIPS 2014
Fake and real images copyright Emily Denton et al. 2015. Reproduced with permission.
117. Generative Adversarial Nets
Generated samples; the nearest neighbor from the training set is shown for comparison.
Figures copyright Ian Goodfellow et al., 2014. Reproduced with permission.
118. Generative Adversarial Nets
Generated samples (CIFAR-10); the nearest neighbor from the training set is shown for comparison.
Ian Goodfellow et al., "Generative Adversarial Nets", NIPS 2014
119. Generative Adversarial Nets: Convolutional Architectures
The generator is an upsampling network with fractionally-strided convolutions; the discriminator is a convolutional network.
Radford et al., "Unsupervised Representation Learning with Deep Convolutional Generative Adversarial Networks", ICLR 2016
120. Generative Adversarial Nets: Convolutional Architectures
The DCGAN generator architecture.
Radford et al., "Unsupervised Representation Learning with Deep Convolutional Generative Adversarial Networks", ICLR 2016
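A hypothetical sketch of such a generator (layer sizes follow the common DCGAN recipe, not necessarily the exact figure on the slide): fractionally-strided (transposed) convolutions upsample a noise vector into an RGB image.

```python
import torch
import torch.nn as nn

generator = nn.Sequential(
    nn.ConvTranspose2d(100, 512, 4, stride=1, padding=0), nn.BatchNorm2d(512), nn.ReLU(),  # 1x1 -> 4x4
    nn.ConvTranspose2d(512, 256, 4, stride=2, padding=1), nn.BatchNorm2d(256), nn.ReLU(),  # 4x4 -> 8x8
    nn.ConvTranspose2d(256, 128, 4, stride=2, padding=1), nn.BatchNorm2d(128), nn.ReLU(),  # 8x8 -> 16x16
    nn.ConvTranspose2d(128, 64, 4, stride=2, padding=1), nn.BatchNorm2d(64), nn.ReLU(),    # 16x16 -> 32x32
    nn.ConvTranspose2d(64, 3, 4, stride=2, padding=1), nn.Tanh(),                          # 32x32 -> 64x64
)

z = torch.randn(16, 100, 1, 1)   # noise vectors treated as 1x1 spatial maps
fake_images = generator(z)       # (16, 3, 64, 64), values in [-1, 1]
```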
122. Generative Adversarial Nets: Convolutional Architectures
Interpolating between random points in latent space.
Radford et al., ICLR 2016
123. Generative Adversarial Nets: Interpretable Vector Math
Samples from the model: smiling woman, neutral woman, neutral man.
124. Generative Adversarial Nets: Interpretable Vector Math
Average the z vectors for each group, then do arithmetic on them.
125. Generative Adversarial Nets: Interpretable Vector Math
Smiling woman - neutral woman + neutral man => smiling man.
Radford et al., ICLR 2016
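A tiny sketch of that arithmetic, assuming the hypothetical generator above and batches of z vectors already collected for each group (the variable names are placeholders):

```python
# Average the z vectors for each group, do arithmetic, and decode the result.
z_new = (z_smiling_woman.mean(0) - z_neutral_woman.mean(0) + z_neutral_man.mean(0))
image = generator(z_new.view(1, 100, 1, 1))
```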
126. Generative Adversarial Nets: Interpretable Vector Math
Samples from the model: glasses man, no glasses man, no glasses woman.
127. Generative Adversarial Nets: Interpretable Vector Math
Glasses man - no glasses man + no glasses woman => woman with glasses.
Radford et al., ICLR 2016
128. 2017: Explosion of GANs
"The GAN Zoo": https://github.com/hindupuravinash/the-gan-zoo
129. 2017: Explosion of GANs
See also https://github.com/soumith/ganhacks for tips and tricks for training GANs.
130. 2017: Explosion of GANs
Better training and generation: LSGAN (Mao et al. 2017), Wasserstein GAN (Arjovsky et al. 2017), Improved Wasserstein GAN (Gulrajani et al. 2017), Progressive GAN (Karras et al. 2018).
131. 2017: Explosion of GANs
Many GAN applications:
- Source -> target domain transfer: CycleGAN (Zhu et al. 2017)
- Image-to-image translation: Pix2pix (Isola et al. 2017); many examples at https://phillipi.github.io/pix2pix/
- Text -> image synthesis: Reed et al. 2017
132. 2019: BigGAN
Brock et al., 2019.
133. GANs
Don't work with an explicit density function.
Take a game-theoretic approach: learn to generate from the training distribution through a 2-player game.
Pros:
- Beautiful, state-of-the-art samples!
Cons:
- Trickier / more unstable to train
- Can't solve inference queries such as p(x), p(z|x)
Active areas of research:
- Better loss functions, more stable training (Wasserstein GAN, LSGAN, many others)
- Conditional GANs, GANs for all kinds of applications
134. Taxonomy of Generative Models
Recap of the taxonomy from slide 19: explicit density, either tractable (Fully Visible Belief Nets: NADE, MADE, PixelRNN/CNN; NICE / RealNVP; Glow; Ffjord) or approximate (Variational Autoencoder, Boltzmann Machine), and implicit density (GAN, GSN).
Figure copyright and adapted from Ian Goodfellow, Tutorial on Generative Adversarial Networks, 2017.
135. Useful Resources on Generative Models
CS 236: Deep Generative Models (Stanford)
CS 294-158: Deep Unsupervised Learning (Berkeley)
136. Recap
Generative Models:
- PixelRNN and PixelCNN: explicit density model, optimizes exact likelihood, good samples. But inefficient sequential generation.
- Variational Autoencoders (VAE): optimize a variational lower bound on the likelihood. Useful latent representation, inference queries. But current sample quality is not the best.
- Generative Adversarial Networks (GANs): game-theoretic approach, best samples! But can be tricky and unstable to train, and no inference queries.