Fei-Fei Li, Jiajun Wu, Ruohan Gao Lecture 14 - May 17, 2022
Lecture 14:
Self-Supervised Learning
Administrative
- Assignment 3 due in two weeks, on 5/25
- Midterm grades are out
- Regrade requests:
  - Gradescope regrades only for mistakes according to the current rubric
  - Teaching team will discuss concerns in MC & T/F next Monday
Last Lecture: Generative Modeling
Training data ~ p_data(x)
Given training data, generate new samples from the same distribution.
Objectives:
1. Learn p_model(x) that approximates p_data(x)
2. Sample new x from p_model(x)
Last Lecture: Generative Modeling
Taxonomy of generative models:
● Explicit density
  ○ Tractable density: Fully Visible Belief Nets (NADE, MADE, PixelRNN/CNN), NICE / RealNVP, Glow, Ffjord
  ○ Approximate density: Variational (Variational Autoencoder), Markov Chain (Boltzmann Machine)
● Implicit density
  ○ Direct: GAN
  ○ Markov Chain: GSN
Figure copyright and adapted from Ian Goodfellow, Tutorial on Generative Adversarial Networks, 2017.
Generative vs. Self-supervised Learning
● Both aim to learn from data without manual label annotation.
● Generative learning aims to model the data distribution p_data(x), e.g., generating realistic images.
● Self-supervised learning methods solve "pretext" tasks that produce good features for downstream tasks.
  ○ Learn with supervised learning objectives, e.g., classification, regression.
  ○ Labels for these pretext tasks are generated automatically.
Self-supervised pretext tasks
Example: learn to predict image transformations or complete corrupted images, e.g., rotation prediction (θ = ?), "jigsaw puzzles", colorization, and image completion.
1. Solving the pretext tasks allows the model to learn good features.
2. We can automatically generate labels for the pretext tasks.
Generative vs. Self-supervised Learning
Left: Drawing of a dollar bill from memory. Right: Drawing subsequently made
with a dollar bill present. Image source: Epstein, 2016
Learning to generate pixel-level details is often unnecessary; learn
high-level semantic features with pretext tasks instead
Source: Anand, 2020
How to evaluate a self-supervised learning method?
We usually don’t care about the performance of the self-supervised
learning task, e.g., we don’t care if the model learns to predict image
rotation perfectly.
Evaluate the learned feature encoders on downstream target tasks
How to evaluate a self-supervised learning method?
1. Learn good feature extractors from self-supervised pretext tasks, e.g., predicting image rotations (90°): lots of unlabeled data → self-supervised learning → feature extractor (e.g., a convnet with conv and fc layers)
How to evaluate a self-supervised learning method?
1. Learn good feature extractors from self-supervised pretext tasks, e.g., predicting image rotations: lots of unlabeled data → self-supervised learning → feature extractor (e.g., a convnet).
2. Attach a shallow network (e.g., a linear classifier) on top of the feature extractor; train the shallow network on the target task with a small amount of labeled data, then evaluate on the target task, e.g., classification ("bird") or detection.
Broader picture
language modeling: GPT-3 (Brown, Mann, Ryder, Subbiah et al., 2020)
speech synthesis: WaveNet (van den Oord et al., 2016)
computer vision: Doersch et al., 2015, ... (today's lecture)
robot / reinforcement learning: Dense Object Net (Florence and Manuelli et al., 2018)
Today’s Agenda
Pretext tasks from image transformations
- Rotation, inpainting, rearrangement, coloring
Contrastive representation learning
- Intuition and formulation
- Instance contrastive learning: SimCLR and MoCo
- Sequence contrastive learning: CPC
Pretext task: predict rotations
Hypothesis: a model could recognize the correct rotation of an object
only if it has the “visual commonsense” of what the object should look
like unperturbed.
(Image source: Gidaris et al. 2018)
Pretext task: predict rotations
Self-supervised learning by rotating the entire input images. The model learns to predict which rotation was applied (4-way classification).
(Image source: Gidaris et al. 2018)
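The rotation labels come for free from the transformation itself. A minimal NumPy sketch of the label-generation step (`make_rotation_batch` is a hypothetical helper, not the authors' code; the real pipeline feeds these through a convnet with a 4-way classification head):

```python
import numpy as np

def make_rotation_batch(images):
    """Given a batch of images (N, H, W, C), return every image in all four
    rotations plus its rotation label (0: 0 deg, 1: 90, 2: 180, 3: 270).
    The pretext labels are generated automatically, with no annotation."""
    rotated, labels = [], []
    for img in images:
        for k in range(4):                      # k quarter-turns
            rotated.append(np.rot90(img, k=k))  # rotate in the (H, W) plane
            labels.append(k)
    return np.stack(rotated), np.array(labels)

# toy usage: 2 fake "images" of shape 8x8x3
batch = np.random.rand(2, 8, 8, 3)
x, y = make_rotation_batch(batch)
print(x.shape, y[:4])   # (8, 8, 8, 3) [0 1 2 3]
```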
Evaluation on semi-supervised learning
(Image source: Gidaris et al. 2018)
Self-supervised learning on CIFAR10 (entire training set). Freeze conv1 + conv2; learn conv3 + linear layers with a subset of labeled CIFAR10 data (classification).
Transfer learned features to supervised learning
Source: Gidaris et al. 2018
Self-supervised learning on ImageNet (entire training set) with AlexNet, then finetune on labeled data from Pascal VOC 2007. Compared settings: pretrained with full ImageNet supervision, no pretraining, and self-supervised learning with rotation prediction.
Visualize learned visual attentions
(Image source: Gidaris et al. 2018)
Pretext task: predict relative patch locations
(Image source: Doersch et al., 2015)
Pretext task: solving “jigsaw puzzles”
(Image source: Noroozi & Favaro, 2016)
Transfer learned features to supervised learning
(source: Noroozi & Favaro, 2016)
"Ours" is the feature learned by solving image jigsaw puzzles (Noroozi & Favaro, 2016); Doersch et al. is the method based on relative patch location prediction.
Pretext task: predict missing pixels (inpainting)
Source: Pathak et al., 2016
Context Encoders: Feature Learning by Inpainting (Pathak et al., 2016)
Learning to inpaint by reconstruction
Learning to reconstruct the missing pixels
Source: Pathak et al., 2016
Inpainting evaluation
Source: Pathak et al., 2016
Input (context) reconstruction
Learning to inpaint by reconstruction
Source: Pathak et al., 2016
Loss = reconstruction + adversarial loss
Adversarial loss between "real" images and inpainted images
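As a toy illustration of the reconstruction half of this loss (the adversarial term is omitted; `masked_reconstruction_loss` is a hypothetical helper, not the authors' code):

```python
import numpy as np

def masked_reconstruction_loss(pred, target, mask):
    """pred, target: (H, W, C) images; mask: (H, W) boolean, True where
    pixels were dropped. The L2 loss is computed only on the missing region,
    so the model is trained purely to fill in what it cannot see."""
    diff = (pred - target) ** 2
    return diff[mask].mean()

H, W, C = 16, 16, 3
target = np.random.rand(H, W, C)
pred = target.copy()
mask = np.zeros((H, W), dtype=bool)
mask[4:12, 4:12] = True            # central "missing" square
pred[mask] += 0.1                  # imperfect reconstruction in that region
loss = masked_reconstruction_loss(pred, target, mask)
print(round(float(loss), 3))       # 0.01
```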
Inpainting evaluation
Source: Pathak et al., 2016
Input (context) reconstruction adversarial recon + adv
Transfer learned features to supervised learning
Self-supervised learning on the ImageNet training set; transfer to classification (Pascal VOC 2007), detection (Pascal VOC 2007), and semantic segmentation (Pascal VOC 2012).
Source: Pathak et al., 2016
Pretext task: image coloring
Source: Richard Zhang / Phillip Isola
Pretext task: image coloring
Source: Richard Zhang / Phillip Isola
Learning features from colorization:
Split-brain Autoencoder
Source: Richard Zhang / Phillip Isola
Idea: cross-channel predictions
Transfer learned features to supervised learning
Self-supervised learning on ImageNet (entire training set). Use concatenated features from F1 and F2. Labeled data is from the Places dataset (Zhou et al., 2016). (Plot legend: supervised vs. this paper.)
Source: Zhang et al., 2017
Pretext task: image coloring
Source: Richard Zhang / Phillip Isola
Pretext task: image coloring
Source: Richard Zhang / Phillip Isola
Pretext task: video coloring
Source: Vondrick et al., 2018
Given a reference frame at t = 0, how should we color the subsequent frames t = 1, 2, 3, ...?
Idea: model the temporal coherence of colors in videos
Should be the same color!
Hypothesis: learning to color video frames should allow the model to learn to track regions or objects without labels!
Learning to color videos
Source: Vondrick et al., 2018
Learning objective: establish mappings between reference and target frames in a learned feature space. Use the mapping as "pointers" to copy the correct color (Lab color space).
Learning to color videos
Source: Vondrick et al., 2018
Compute an attention map on the reference frame; the predicted color is a weighted sum of the reference colors; apply a loss between the predicted color and the ground-truth color.
Colorizing videos (qualitative)
Reference frame; target frames (gray); predicted color.
Source: Google AI blog post
Tracking emerges from colorization
Propagate segmentation masks using learned attention
Source: Google AI blog post
Tracking emerges from colorization
Propagate pose keypoints using learned attention
Source: Google AI blog post
Summary: pretext tasks from image transformations
● Pretext tasks focus on "visual common sense", e.g., predicting rotations, inpainting, rearrangement, and colorization.
● The models are forced to learn good features about natural images, e.g., the semantic representation of an object category, in order to solve the pretext tasks.
● We don't care about the performance of these pretext tasks, but rather how useful the learned features are for downstream tasks (classification, detection, segmentation).
● Problems: 1) coming up with individual pretext tasks is tedious, and 2) the learned representations may not be general.
Pretext tasks from image transformations
Pretext tasks: rotation prediction (θ = ?), "jigsaw puzzle", colorization, image completion.
Learned representations may be tied to a specific pretext task! Can we come up with a more general pretext task?
A more general pretext task?
same object
different object
Contrastive Representation Learning
attract
repel
reference
positive
negative
A formulation of contrastive learning
What we want:
s(f(x), f(x+)) >> s(f(x), f(x-))
where x is a reference sample, x+ a positive sample, and x- a negative sample.
Given a chosen score function s, we aim to learn an encoder function f that yields a high score for positive pairs (x, x+) and low scores for negative pairs (x, x-).
A formulation of contrastive learning
Loss function given 1 positive sample and N - 1 negative samples:

L = -E[ log( exp(s(f(x), f(x+))) / ( exp(s(f(x), f(x+))) + Σ_{j=1..N-1} exp(s(f(x), f(x_j-))) ) ) ]

The numerator contains the score for the positive pair; the sum in the denominator covers the scores for the N - 1 negative pairs. This seems familiar: it is the cross-entropy loss for an N-way softmax classifier! I.e., learn to find the positive sample among the N samples.
This loss is commonly known as the InfoNCE loss (van den Oord et al., 2018). It is a lower bound on the mutual information between f(x) and f(x+); the larger the negative sample size N, the tighter the bound. Detailed derivation: Poole et al., 2019.
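The N-way softmax view of the loss can be made concrete with a small sketch (`info_nce` is an illustrative helper; the scalar scores stand in for s(f(x), f(x+)) and s(f(x), f(x-)) values):

```python
import numpy as np

def log_softmax(x):
    x = x - x.max()
    return x - np.log(np.exp(x).sum())

def info_nce(s_pos, s_neg):
    """s_pos: score for the (x, x+) pair; s_neg: scores for the N - 1
    (x, x-) pairs. N-way cross entropy with the positive pair as class 0."""
    scores = np.concatenate(([s_pos], s_neg))
    return -log_softmax(scores)[0]

# With all scores equal, the model is at chance among N candidates,
# so the loss equals log N.
print(np.isclose(info_nce(0.0, np.zeros(7)), np.log(8)))   # True

# Raising the positive score lowers the loss, as expected.
print(info_nce(5.0, np.zeros(7)) < info_nce(0.0, np.zeros(7)))   # True
```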
SimCLR: A Simple Framework for Contrastive Learning
Source: Chen et al., 2020
Use a projection network g(·) to project features to a space where contrastive learning is applied.
Generate positive samples through data augmentation: random cropping, random color distortion, and random blur.
Cosine similarity as the score function: s(u, v) = u·v / (||u|| ||v||)
SimCLR: generating positive samples from
data augmentation
Source: Chen et al., 2020
SimCLR
Source: Chen et al., 2020
● Generate a positive pair by sampling data augmentation functions.
● InfoNCE loss: use all non-positive samples in the batch as x-.
● Iterate through and use each of the 2N samples as the reference; compute the average loss.
*We use a slightly different formulation in the assignment. You should follow the assignment instructions.
SimCLR: mini-batch training
List of positive pairs: each (2k, 2k + 1) pair of elements is a positive pair. The "affinity matrix" of pairwise scores then defines a classification problem, with one classification label per row.
*We use a slightly different formulation in the assignment. You should follow the assignment instructions.
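A rough NumPy sketch of this mini-batch recipe (an illustrative `simclr_loss`, not the assignment's formulation; the temperature and shapes are arbitrary choices):

```python
import numpy as np

def simclr_loss(z, temperature=0.5):
    """NT-Xent-style sketch over a batch of 2N projected embeddings z of
    shape (2N, D), arranged so rows 2k and 2k + 1 are the two augmented
    views of example k."""
    z = z / np.linalg.norm(z, axis=1, keepdims=True)   # cosine similarity
    sim = z @ z.T / temperature                        # the "affinity matrix"
    np.fill_diagonal(sim, -np.inf)                     # exclude self-pairs
    n = z.shape[0]
    labels = np.arange(n) ^ 1                          # partner view: 0<->1, 2<->3, ...
    logp = sim - np.log(np.exp(sim).sum(axis=1, keepdims=True))
    return -logp[np.arange(n), labels].mean()          # average over all 2N rows

rng = np.random.default_rng(0)
z = rng.normal(size=(8, 16))        # N = 4 examples, two views each
loss = simclr_loss(z)
print(loss > 0)                     # True
```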
Training linear classifier on SimCLR features
Train the feature encoder on ImageNet (entire training set) using SimCLR. Freeze the feature encoder and train a linear classifier on top with labeled data.
Source: Chen et al., 2020
Semi-supervised learning on SimCLR features
Train the feature encoder on ImageNet (entire training set) using SimCLR. Finetune the encoder with 1% / 10% of the labeled ImageNet data.
Source: Chen et al., 2020
SimCLR design choices: projection head
Adding a (linear or non-linear) projection head improves representation learning.
A possible explanation:
● the contrastive learning objective may discard information that is useful for downstream tasks
● the projection space z is trained to be invariant to data transformations
● by applying the contrastive loss after the projection head g(·), more information can be preserved in the representation space h
Source: Chen et al., 2020
SimCLR design choices: large batch size
A large training batch size is crucial for SimCLR! Large batch sizes cause a large memory footprint during backpropagation: this requires distributed training on TPUs (for the ImageNet experiments).
Source: Chen et al., 2020
Momentum Contrastive Learning (MoCo)
Key differences from SimCLR:
● Keep a running queue of keys (negative samples).
● Compute gradients and update the encoder only through the queries; the key encoder runs under no_grad.
● Decouple the mini-batch size from the number of keys: can support a large number of negative samples.
● The key encoder progresses slowly through the momentum update rule: θ_k ← m·θ_k + (1 - m)·θ_q
Source: He et al., 2020
MoCo
Training loop (He et al., 2020):
● Generate a positive pair by sampling data augmentation functions.
● No gradient through the positive (key) sample.
● Use the running queue of keys as the negative samples.
● Compute the InfoNCE loss.
● Update f_k through the momentum rule.
● Update the FIFO negative sample queue.
Source: He et al., 2020
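The two MoCo-specific updates can be sketched in a few lines (the "encoders" here are plain weight vectors; `momentum_update` and the queue size are illustrative stand-ins for the real training code):

```python
import numpy as np
from collections import deque

def momentum_update(theta_q, theta_k, m=0.999):
    """theta_k <- m * theta_k + (1 - m) * theta_q; no gradient flows here."""
    return m * theta_k + (1.0 - m) * theta_q

queue = deque(maxlen=4096)     # FIFO queue of past keys (negative samples)

theta_q = np.ones(3)           # query encoder weights (normally updated by SGD)
theta_k = np.zeros(3)          # key encoder weights (updated only by momentum)
for _ in range(3):
    theta_k = momentum_update(theta_q, theta_k)
    queue.append(np.random.rand(8))   # enqueue this batch's keys

# After 3 steps, theta_k has moved a fraction 1 - m^3 of the way to theta_q
print(len(queue), np.allclose(theta_k, 1 - 0.999 ** 3))
```

When the queue is full, `deque(maxlen=...)` silently drops the oldest batch of keys, which mirrors MoCo's first-in-first-out dictionary.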
“MoCo V2”
A hybrid of ideas from SimCLR and MoCo:
● From SimCLR: non-linear projection head and strong data
augmentation.
● From MoCo: momentum-updated queues that allow training
on a large number of negative samples (no TPU required!).
Source: Chen et al., 2020
MoCo vs. SimCLR vs. MoCo V2
Source: Chen et al., 2020
Key takeaways:
● A non-linear projection head and strong data augmentation are crucial for contrastive learning.
● Decoupling the mini-batch size from the negative sample size allows MoCo V2 to outperform SimCLR with a smaller batch size (256 vs. 8192).
● ... all with a much smaller memory footprint! ("end-to-end" means SimCLR here)
Instance vs. Sequence Contrastive Learning
Instance-level contrastive learning: contrastive learning based on positive & negative instances. Examples: SimCLR, MoCo.
Sequence-level contrastive learning: contrastive learning based on sequential / temporal orders. Example: Contrastive Predictive Coding (CPC).
Source: van den Oord et al., 2018
Contrastive Predictive Coding (CPC)
Source: van den Oord et al., 2018,
Figure source
Contrastive: contrast between "right" and "wrong" sequences using contrastive learning.
Predictive: the model has to predict future patterns given the current context.
Coding: the model learns useful feature vectors, or "codes", for downstream tasks, similar to other self-supervised methods.
(Figure legend: context, positive, negative)
Contrastive Predictive Coding (CPC)
Source: van den Oord et al., 2018, Figure source
1. Encode all samples in a sequence into vectors z_t = g_enc(x_t).
2. Summarize the context (e.g., half of a sequence) into a context code c_t using an auto-regressive model g_ar. The original paper uses a GRU-RNN here.
3. Compute the InfoNCE loss between the context c_t and the future code z_{t+k} using the time-dependent score function s_k(z_{t+k}, c_t) = z_{t+k}ᵀ W_k c_t, where W_k is a trainable matrix.
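Step 3's bilinear score and the resulting InfoNCE objective can be sketched as follows (toy dimensions; the W_k matrices and all codes are random stand-ins for trained encoder outputs):

```python
import numpy as np

rng = np.random.default_rng(0)
d_z, d_c, K = 8, 16, 3
W = [rng.normal(size=(d_z, d_c)) for _ in range(K)]   # one trainable W_k per step k

def cpc_score(z_future, c_t, k):
    """Time-dependent bilinear score s_k(z_{t+k}, c_t) = z_{t+k}^T W_k c_t."""
    return z_future @ W[k - 1] @ c_t

c_t = rng.normal(size=d_c)          # context code from g_ar
z_pos = rng.normal(size=d_z)        # the true future code z_{t+k}
z_negs = rng.normal(size=(7, d_z))  # codes drawn from other sequences

# InfoNCE over 8 candidates: identify the positive (index 0) by its score
scores = np.array([cpc_score(z_pos, c_t, 1)] +
                  [cpc_score(z, c_t, 1) for z in z_negs])
logp = scores - scores.max() - np.log(np.exp(scores - scores.max()).sum())
loss = -logp[0]
print(scores.shape, np.isfinite(loss))
```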
CPC example: modeling audio sequences
Source: van den Oord et al., 2018,
CPC example: modeling audio sequences
Linear classification on trained
representations (LibriSpeech dataset)
Source: van den Oord et al., 2018,
CPC example: modeling visual context
Source: van den Oord et al., 2018,
Idea: split image into patches, model rows of patches from top to bottom
as a sequence. I.e., use top rows as context to predict bottom rows.
CPC example: modeling visual context
Source: van den Oord et al., 2018,
● Compares favorably with other pretext-task-based self-supervised learning methods.
● Doesn't do as well as newer instance-based contrastive learning methods on image feature learning.
Summary: Contrastive Representation Learning
A general formulation for contrastive learning: the InfoNCE loss (van den Oord et al., 2018), an N-way classification among positive and negative samples, and a lower bound on the mutual information between f(x) and f(x+).
Summary: Contrastive Representation Learning
SimCLR: a simple framework for contrastive
representation learning
● Key ideas: non-linear projection head to
allow flexible representation learning
● Simple to implement, effective in learning
visual representation
● Requires large training batch size to be
effective; large memory footprint
Summary: Contrastive Representation Learning
MoCo (v1, v2): contrastive learning using
momentum sample encoder
● Decouples negative sample size from
minibatch size; allows large batch training
without TPU
● MoCo-v2 combines the key ideas from
SimCLR, i.e., nonlinear projection head,
strong data augmentation, with momentum
contrastive learning
Summary: Contrastive Representation Learning
CPC: sequence-level contrastive learning
● Contrast “right” sequence with “wrong”
sequence.
● InfoNCE loss with a time-dependent score
function.
● Can be applied to a variety of learning
problems, but not as effective in learning
image representations compared to
instance-level methods.
Other examples
CLIP (Contrastive Language–Image Pre-training) Radford et al., 2021
Contrastive learning between image and natural language sentences
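A sketch of this symmetric image-text objective (random embeddings stand in for the image and text encoder outputs; `clip_loss` is an illustrative helper, not OpenAI's implementation):

```python
import numpy as np

def log_softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    return x - np.log(np.exp(x).sum(axis=axis, keepdims=True))

def clip_loss(img_emb, txt_emb, temperature=0.07):
    """Matched (image, text) rows attract; every other pairing in the batch
    repels. Symmetric cross-entropy over rows (image -> text) and columns
    (text -> image) of the cosine-similarity matrix."""
    img = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    txt = txt_emb / np.linalg.norm(txt_emb, axis=1, keepdims=True)
    logits = img @ txt.T / temperature         # (N, N); diagonal = matched pairs
    idx = np.arange(logits.shape[0])
    loss_i = -log_softmax(logits, axis=1)[idx, idx]   # image -> text
    loss_t = -log_softmax(logits, axis=0)[idx, idx]   # text -> image
    return 0.5 * (loss_i + loss_t).mean()

rng = np.random.default_rng(0)
img = rng.normal(size=(4, 32))      # stand-in image embeddings
txt = rng.normal(size=(4, 32))      # stand-in text embeddings
loss = clip_loss(img, txt)
print(loss > 0)                     # True
```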
Other examples
Dense Object Net, Florence et al., 2018
Contrastive learning on pixel-wise feature descriptors
Next time: Low-Level Vision
Today’s Agenda
Pretext tasks from image transformations
- Rotation, inpainting, rearrangement, coloring
Contrastive representation learning
- Intuition and formulation
- Instance contrastive learning: SimCLR and MoCo
- Sequence contrastive learning: CPC
Frontier:
- Contrastive Language Image Pre-training (CLIP)
Frontier: Contrastive Language–Image Pre-training (CLIP)
Self-Supervised Learning
General idea: pretend there is a part of the data you don't know and train the neural network to predict it.
Source: LeCun, 2019 keynote at ISSCC
“The Cake of Learning”
Source: LeCun, 2019 keynote at ISSCC
Learn good features through self-supervision; the resulting feature extractor is then used for downstream tasks.
Can we do better?
SimCLR; Momentum Contrast (MoCo)
Source: Chen et al., 2020b
More Related Content

PPTX
Lecture_16_Self-supervised_Learning.pptx
PDF
598_WI2022_lecture22.pdf data analysis and data prediction
PDF
PAISS (PRAIRIE AI Summer School) Digest July 2018
PDF
lecture_13_jiajun.pdf Generative models GAN
PDF
Self-supervised Visual Learning 2020 - Xavier Giro-i-Nieto - UPC Barcelona
PDF
Learning Visual Representations from Uncurated Data
PDF
Deep Self-supervised Learning for All - Xavier Giro - X-Europe 2020
PDF
lec_11_self_supervised_learning.pdf
Lecture_16_Self-supervised_Learning.pptx
598_WI2022_lecture22.pdf data analysis and data prediction
PAISS (PRAIRIE AI Summer School) Digest July 2018
lecture_13_jiajun.pdf Generative models GAN
Self-supervised Visual Learning 2020 - Xavier Giro-i-Nieto - UPC Barcelona
Learning Visual Representations from Uncurated Data
Deep Self-supervised Learning for All - Xavier Giro - X-Europe 2020
lec_11_self_supervised_learning.pdf

Similar to lecture_14_jiajun.pdf Self supervised Learning (20)

PDF
Unsupervised visual representation learning overview: Toward Self-Supervision
PDF
NTU DBME5028 Week8 Transfer Learning
PPTX
cs231n_2019_lecture11_Tispptisneededforth.pptx
PDF
A Simple Framework for Contrastive Learning of Visual Representations
PDF
lecture_5_ruohan image classification with CNN
PDF
Self-supervised Learning from Video Sequences - Xavier Giro - UPC Barcelona 2019
PPTX
CM20315_01_Intro_Machine_Learning_ap.pptx
PDF
Cs231n 2017 lecture13 Generative Model
PDF
Emerging Properties in Self-Supervised Vision Transformers
PPTX
Self-Supervised Learning recent trends 1 2
PDF
Unsupervised Deep Learning (D2L1 Insight@DCU Machine Learning Workshop 2017)
PDF
Transfer Learning and Domain Adaptation - Ramon Morros - UPC Barcelona 2018
PPTX
Deep Learning: Towards General Artificial Intelligence
PDF
Taskonomy of Transfer Learning
PDF
lecture_14_jiajun.pdf: Self-Supervised Learning

  • 1. Fei-Fei Li, Jiajun Wu, Ruohan Gao Lecture 14 - May 17, 2022 1 Lecture 14: Self-Supervised Learning
  • 2. Administrative: Assignment 3 due in two weeks (5/25). Midterm grades are out. Regrade requests: Gradescope regrades are only for mistakes according to the current rubric; the teaching team will discuss concerns about the MC & T/F questions next Monday.
  • 3. Last Lecture: Generative Modeling. Given training data ~ p_data(x), generate new samples from the same distribution. Objectives: 1. Learn p_model(x) that approximates p_data(x). 2. Sample new x from p_model(x).
  • 4. Last Lecture: Generative Modeling. Taxonomy of generative models: explicit density splits into tractable density (Fully Visible Belief Nets: NADE, MADE, PixelRNN/CNN; also NICE / RealNVP, Glow, Ffjord) and approximate density (variational: Variational Autoencoder; Markov chain: Boltzmann Machine); implicit density splits into direct (GAN) and Markov chain (GSN). Figure copyright and adapted from Ian Goodfellow, Tutorial on Generative Adversarial Networks, 2017.
  • 5. Generative vs. Self-supervised Learning. Both aim to learn from data without manual label annotation. Generative learning aims to model the data distribution p_data(x), e.g., generating realistic images. Self-supervised learning methods solve "pretext" tasks that produce good features for downstream tasks: they learn with supervised learning objectives (e.g., classification, regression), and the labels for these pretext tasks are generated automatically.
  • 6. Self-supervised pretext tasks. Example: learn to predict image transformations or complete corrupted images (rotation prediction, "jigsaw puzzle", colorization, image completion). 1. Solving the pretext tasks allows the model to learn good features. 2. We can automatically generate labels for the pretext tasks.
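To make "labels generated automatically" concrete, here is a minimal pure-Python sketch of the rotation pretext task's data pipeline, using toy 2D-list "images"; the function names are ours, not from any paper:

```python
import random

def rotate90_cw(img):
    """Rotate a 2D list (H x W image) 90 degrees clockwise."""
    return [list(row) for row in zip(*img[::-1])]

def make_rotation_batch(images, rng=random):
    """Rotation pretext task: rotate each image by a random multiple of
    90 degrees and emit (rotated_image, label) pairs. The label
    k in {0, 1, 2, 3} comes for free -- no human annotation needed."""
    batch = []
    for img in images:
        k = rng.randrange(4)
        rot = img
        for _ in range(k):
            rot = rotate90_cw(rot)
        batch.append((rot, k))
    return batch
```

A 4-way classifier trained on such pairs has to pick up on object-orientation cues (sky up, grass down) to do well, which is exactly the "visual commonsense" the pretext task is after.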
  • 7. Generative vs. Self-supervised Learning. Left: drawing of a dollar bill from memory. Right: drawing subsequently made with a dollar bill present. (Image source: Epstein, 2016.) Learning to generate pixel-level details is often unnecessary; learn high-level semantic features with pretext tasks instead. (Source: Anand, 2020)
  • 8. How to evaluate a self-supervised learning method? We usually don't care about the performance on the self-supervised task itself; e.g., we don't care if the model learns to predict image rotation perfectly. Instead, evaluate the learned feature encoders on downstream target tasks.
  • 9. How to evaluate a self-supervised learning method? 1. Learn good feature extractors (e.g., a convnet) from self-supervised pretext tasks on lots of unlabeled data, e.g., predicting image rotations.
  • 10. How to evaluate a self-supervised learning method? 1. Learn good feature extractors from self-supervised pretext tasks, e.g., predicting image rotations. 2. Attach a shallow network (e.g., a linear classifier) to the feature extractor and train it on the target task (e.g., classification, detection) with a small amount of labeled data.
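Step 2 of this evaluation protocol can be sketched in miniature: the frozen encoder is assumed to have already mapped each input to a feature vector, and we fit only a linear softmax classifier on top. A pure-Python toy, not any paper's implementation:

```python
import math

def train_linear_probe(features, labels, n_classes, lr=0.1, epochs=100):
    """Fit a linear softmax classifier on frozen features by plain
    gradient descent. features: list of equal-length float lists;
    labels: list of ints in [0, n_classes)."""
    d = len(features[0])
    W = [[0.0] * d for _ in range(n_classes)]
    b = [0.0] * n_classes
    for _ in range(epochs):
        for x, y in zip(features, labels):
            logits = [sum(w * xi for w, xi in zip(W[c], x)) + b[c]
                      for c in range(n_classes)]
            m = max(logits)                      # subtract max for stability
            exps = [math.exp(l - m) for l in logits]
            Z = sum(exps)
            for c in range(n_classes):
                g = exps[c] / Z - (1.0 if c == y else 0.0)  # dL/dlogit_c
                b[c] -= lr * g
                for i in range(d):
                    W[c][i] -= lr * g * x[i]
    return W, b
```

The probe's accuracy on held-out labeled data is then the measure of how good the self-supervised features are.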
  • 11. Broader picture. Language modeling: GPT-3 (Brown, Mann, Ryder, Subbiah et al., 2020). Speech synthesis: WaveNet (van den Oord et al., 2016). Computer vision: Doersch et al., 2015 (today's lecture). Robot / reinforcement learning: Dense Object Net (Florence and Manuelli et al., 2018).
  • 12. Today's Agenda. Pretext tasks from image transformations: rotation, inpainting, rearrangement, coloring. Contrastive representation learning: intuition and formulation; instance contrastive learning (SimCLR and MoCo); sequence contrastive learning (CPC).
  • 14. Pretext task: predict rotations. Hypothesis: a model can recognize the correct rotation of an object only if it has the "visual commonsense" of what the object should look like unperturbed. (Image source: Gidaris et al., 2018)
  • 15. Pretext task: predict rotations. Self-supervised learning by rotating the entire input image; the model learns to predict which rotation was applied (4-way classification: 0°, 90°, 180°, 270°). (Image source: Gidaris et al., 2018)
  • 17. Evaluation on semi-supervised learning. Self-supervised learning on CIFAR10 (entire training set). Freeze conv1 + conv2; learn conv3 + linear layers with a subset of labeled CIFAR10 data (classification). (Image source: Gidaris et al., 2018)
  • 18. Transfer learned features to supervised learning. Self-supervised learning on ImageNet (entire training set) with AlexNet, then finetune on labeled data from Pascal VOC 2007. Compared against pretraining with full ImageNet supervision and no pretraining. (Source: Gidaris et al., 2018)
  • 19. Visualize learned visual attentions. (Image source: Gidaris et al., 2018)
  • 20. Pretext task: predict relative patch locations. (Image source: Doersch et al., 2015)
  • 21. Pretext task: solving "jigsaw puzzles". (Image source: Noroozi & Favaro, 2016)
  • 22. Transfer learned features to supervised learning. "Ours" is the feature learned from solving image jigsaw puzzles (Noroozi & Favaro, 2016); "Doersch et al." is the relative patch location method. (Source: Noroozi & Favaro, 2016)
  • 23. Pretext task: predict missing pixels (inpainting). Context Encoders: Feature Learning by Inpainting (Pathak et al., 2016).
  • 24. Learning to inpaint by reconstruction: learn to reconstruct the missing pixels. (Source: Pathak et al., 2016)
  • 25. Inpainting evaluation: input (context) vs. reconstruction. (Source: Pathak et al., 2016)
  • 26. Learning to inpaint by reconstruction. Loss = reconstruction + adversarial learning; the adversarial loss is between "real" images and inpainted images. (Source: Pathak et al., 2016)
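The joint objective (reconstruction plus adversarial term) can be sketched on flattened pixel lists. The 0.999 / 0.001 weighting follows the Context Encoders paper, but treat the exact form below as illustrative, not the authors' code:

```python
import math

def context_encoder_loss(pred, target, d_fake, lam_rec=0.999, lam_adv=0.001):
    """Joint inpainting loss: a pixel-wise L2 reconstruction term plus an
    adversarial term that rewards fooling the discriminator. pred/target
    are flattened pixel lists for the masked region; d_fake is the
    discriminator's probability that the inpainted region is real."""
    rec = sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)
    adv = -math.log(max(d_fake, 1e-12))  # generator side of the GAN loss
    return lam_rec * rec + lam_adv * adv
```

The reconstruction term captures coarse structure; the adversarial term pushes the filled-in pixels toward the manifold of realistic images, which is why the combined result looks sharper than L2 alone.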
  • 27. Inpainting evaluation: input (context), reconstruction only, adversarial only, reconstruction + adversarial. (Source: Pathak et al., 2016)
  • 28. Transfer learned features to supervised learning. Self-supervised learning on the ImageNet training set, then transfer to classification (Pascal VOC 2007), detection (Pascal VOC 2007), and semantic segmentation (Pascal VOC 2012). (Source: Pathak et al., 2016)
  • 29. Pretext task: image coloring. (Source: Richard Zhang / Phillip Isola)
  • 31. Learning features from colorization: Split-Brain Autoencoder. Idea: cross-channel predictions. (Source: Richard Zhang / Phillip Isola)
  • 32. Learning features from colorization: Split-Brain Autoencoder. (Source: Richard Zhang / Phillip Isola)
  • 33. Learning features from colorization: Split-Brain Autoencoder. (Source: Richard Zhang / Phillip Isola)
  • 34. Transfer learned features to supervised learning. Self-supervised learning on ImageNet (entire training set); use the concatenated features from F1 and F2. Labeled data is from the Places dataset (Zhou et al., 2016). (Source: Zhang et al., 2017)
  • 35. Pretext task: image coloring. (Source: Richard Zhang / Phillip Isola)
  • 37. Pretext task: video coloring. Idea: model the temporal coherence of colors in videos. Given a reference frame at t = 0, how should the frames at t = 1, 2, 3, ... be colored? (Source: Vondrick et al., 2018)
  • 38. Pretext task: video coloring. Corresponding regions should be the same color! Hypothesis: learning to color video frames should allow the model to learn to track regions or objects without labels. (Source: Vondrick et al., 2018)
  • 39. Learning to color videos. Learning objective: establish mappings between reference and target frames in a learned feature space, then use the mapping as "pointers" to copy the correct color (in Lab color space). (Source: Vondrick et al., 2018)
  • 40. Learning to color videos: compute an attention map on the reference frame. (Source: Vondrick et al., 2018)
  • 41. Learning to color videos: predicted color = weighted sum of the reference colors, using the attention map on the reference frame. (Source: Vondrick et al., 2018)
  • 42. Learning to color videos: loss between the predicted color and the ground-truth color. (Source: Vondrick et al., 2018)
  • 43. Colorizing videos (qualitative): reference frame, target frames (gray), predicted color. (Source: Google AI blog post)
  • 44. Colorizing videos (qualitative): reference frame, target frames (gray), predicted color. (Source: Google AI blog post)
  • 45. Tracking emerges from colorization: propagate segmentation masks using the learned attention. (Source: Google AI blog post)
  • 46. Tracking emerges from colorization: propagate pose keypoints using the learned attention. (Source: Google AI blog post)
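The attention-then-copy mechanism in slides 40 to 42 can be sketched in pure Python on toy feature/color lists. This is a simplified stand-in for the model in Vondrick et al., 2018, not their implementation; the names and the dot-product similarity are our assumptions:

```python
import math

def colorize_targets(ref_feats, ref_colors, tgt_feats, temperature=1.0):
    """For each target-frame location i, attend over reference locations j
    with A_ij = softmax_j(f_i . f_j / T), then predict
    color_i = sum_j A_ij * c_j (a weighted copy of reference colors)."""
    out = []
    for fi in tgt_feats:
        sims = [sum(a * b for a, b in zip(fi, fj)) / temperature
                for fj in ref_feats]
        m = max(sims)
        exps = [math.exp(s - m) for s in sims]
        Z = sum(exps)
        attn = [e / Z for e in exps]  # attention map over the reference frame
        color = [sum(a * c[k] for a, c in zip(attn, ref_colors))
                 for k in range(len(ref_colors[0]))]
        out.append(color)
    return out
```

Because color is only ever copied through the attention map, training the color loss forces the features to match corresponding regions across frames, which is exactly why tracking emerges for free.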
  • 47. Summary: pretext tasks from image transformations. Pretext tasks focus on "visual common sense", e.g., predicting rotations, inpainting, rearrangement, and colorization. The models are forced to learn good features about natural images, e.g., a semantic representation of an object category, in order to solve the pretext tasks. We don't care about the performance on these pretext tasks themselves, but rather how useful the learned features are for downstream tasks (classification, detection, segmentation).
  • 48. Summary: pretext tasks from image transformations (continued). Problems: 1) coming up with individual pretext tasks is tedious, and 2) the learned representations may not be general.
  • 49. Pretext tasks from image transformations (rotation prediction, "jigsaw puzzle", colorization, image completion). Learned representations may be tied to a specific pretext task! Can we come up with a more general pretext task?
  • 50. A more general pretext task? Two views of the same object should map to nearby features.
  • 51. A more general pretext task? Views of the same object vs. views of a different object.
  • 52. Contrastive Representation Learning: attract features of the same object; repel features of different objects.
  • 53. Today's Agenda. Pretext tasks from image transformations: rotation, inpainting, rearrangement, coloring. Contrastive representation learning: intuition and formulation; instance contrastive learning (SimCLR and MoCo); sequence contrastive learning (CPC).
  • 54. Contrastive Representation Learning: attract positives, repel negatives.
  • 55. Contrastive Representation Learning: a reference sample, a positive sample, and negative samples.
  • 56. A formulation of contrastive learning. What we want: x is a reference sample, x+ a positive sample, and x- a negative sample. Given a chosen score function s, we aim to learn an encoder function f that yields high scores for positive pairs (x, x+) and low scores for negative pairs (x, x-).
  • 57. A formulation of contrastive learning. Loss function given 1 positive sample and N - 1 negative samples: a softmax cross-entropy over the scores.
  • 58. The loss normalizes the exponentiated positive score by the sum of exponentiated scores over all N samples.
  • 59. The numerator holds the score for the positive pair; the denominator adds the scores for the N - 1 negative pairs. This seems familiar...
  • 60. It is the cross-entropy loss for an N-way softmax classifier! I.e., learn to find the positive sample among the N samples.
  • 61. Commonly known as the InfoNCE loss (van den Oord et al., 2018): L = -E[log(exp(s(f(x), f(x+))) / (exp(s(f(x), f(x+))) + sum_{j=1}^{N-1} exp(s(f(x), f(x_j-)))))]. It is a lower bound on the mutual information between f(x) and f(x+): I(f(x); f(x+)) >= log(N) - L. The larger the negative sample size N, the tighter the bound. Detailed derivation: Poole et al., 2019.
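A minimal numerical sketch of the InfoNCE loss for one reference sample, taking precomputed scores as input (the score function itself is left abstract; the helper name is ours):

```python
import math

def info_nce_loss(pos_score, neg_scores):
    """-log softmax of the positive score against N-1 negative scores,
    i.e., cross-entropy for an N-way classifier whose correct class is
    the positive pair."""
    scores = [pos_score] + list(neg_scores)
    m = max(scores)  # subtract the max for numerical stability
    log_z = m + math.log(sum(math.exp(s - m) for s in scores))
    return -(pos_score - log_z)
```

With all scores equal, the loss is exactly log(N) (the classifier is at chance); pushing the positive score up drives the loss toward 0, which is what tightens the mutual-information bound.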
  • 62. SimCLR: A Simple Framework for Contrastive Learning. Cosine similarity as the score function: s(u, v) = u^T v / (||u|| ||v||). Use a projection network g(.) to project features to the space where contrastive learning is applied. Generate positive samples through data augmentation: random cropping, random color distortion, and random blur. (Source: Chen et al., 2020)
  • 63. SimCLR: generating positive samples from data augmentation. (Source: Chen et al., 2020)
  • 64. SimCLR. Generate a positive pair by sampling data augmentation functions. *We use a slightly different formulation in the assignment; you should follow the assignment instructions. (Source: Chen et al., 2020)
  • 65. SimCLR. InfoNCE loss: use all non-positive samples in the batch as x-. (Source: Chen et al., 2020)
  • 66. SimCLR. Iterate through and use each of the 2N samples as the reference, then compute the average loss. (Source: Chen et al., 2020)
  • 67. SimCLR: mini-batch training. Elements 2k and 2k + 1 form a positive pair; build the "affinity matrix" of pairwise scores from the list of positive pairs.
  • 68. SimCLR: mini-batch training. The list of positive pairs gives the classification label for each row of the affinity matrix.
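The mini-batch procedure above can be sketched in pure Python: build the cosine "affinity matrix" over 2N projected embeddings (pairs at indices 2k and 2k+1), mask the diagonal, and average the softmax cross-entropy over all rows. A minimal sketch under our own naming, which differs in details from both the paper's NT-Xent and the assignment formulation:

```python
import math

def cosine(u, v):
    """Cosine similarity of two nonzero vectors (plain float lists)."""
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return sum(a * b for a, b in zip(u, v)) / (nu * nv)

def simclr_loss(z, tau=0.5):
    """Average InfoNCE over 2N embeddings z, where (z[2k], z[2k+1]) are
    the two augmented views of sample k; each row's denominator uses all
    other samples in the batch (self excluded) as candidates."""
    n = len(z)
    total = 0.0
    for i in range(n):
        j = i + 1 if i % 2 == 0 else i - 1  # index of i's positive view
        sims = [cosine(z[i], z[k]) / tau for k in range(n) if k != i]
        pos = cosine(z[i], z[j]) / tau
        m = max(sims)
        log_z = m + math.log(sum(math.exp(s - m) for s in sims))
        total += -(pos - log_z)
    return total / n
```

A batch whose paired views already agree scores a much lower loss than a shuffled batch, which is the signal gradient descent exploits.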
  • 69. Training a linear classifier on SimCLR features. Train the feature encoder on ImageNet (entire training set) using SimCLR; freeze the feature encoder; train a linear classifier on top with labeled data. (Source: Chen et al., 2020)
  • 70. Semi-supervised learning on SimCLR features. Train the feature encoder on ImageNet (entire training set) using SimCLR; finetune the encoder with 1% / 10% of labeled data on ImageNet. (Source: Chen et al., 2020)
  • 71. SimCLR design choices: projection head. Linear / non-linear projection heads improve representation learning. A possible explanation: the contrastive objective may discard information useful for downstream tasks, and the representation space z is trained to be invariant to data transformations; by applying the projection head g(.), more information can be preserved in the h representation space. (Source: Chen et al., 2020)
  • 72. SimCLR design choices: large batch size. A large training batch size is crucial for SimCLR! Large batch sizes cause a large memory footprint during backpropagation, requiring distributed training on TPUs for the ImageNet experiments. (Source: Chen et al., 2020)
  • 73. Momentum Contrastive Learning (MoCo). Key differences to SimCLR: keep a running queue of keys (negative samples); compute gradients and update the encoder only through the queries; decouple the mini-batch size from the number of keys, so a large number of negative samples can be supported. (Source: He et al., 2020)
  • 74. The key encoder progresses slowly through the momentum update rule: theta_k <- m * theta_k + (1 - m) * theta_q. (Source: He et al., 2020)
  • 75. MoCo. Generate a positive pair by sampling data augmentation functions; no gradient flows through the positive sample (key). Use the running queue of keys as the negative samples; apply the InfoNCE loss; update f_k through the momentum rule; update the FIFO negative-sample queue. (Source: He et al., 2020)
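The two MoCo-specific pieces of that training loop, the momentum update and the FIFO key queue, can be sketched with parameters flattened into plain float lists; helper names are ours:

```python
from collections import deque

def momentum_update(theta_k, theta_q, m=0.999):
    """Key-encoder update theta_k <- m*theta_k + (1-m)*theta_q.
    Only the query encoder theta_q receives gradients; the key encoder
    trails it slowly, keeping the queued keys roughly consistent."""
    return [m * k + (1.0 - m) * q for k, q in zip(theta_k, theta_q)]

def enqueue_keys(queue, new_keys, max_size):
    """FIFO negative-sample queue: push the newest mini-batch of keys and
    evict the oldest entries until the queue fits max_size."""
    queue.extend(new_keys)
    while len(queue) > max_size:
        queue.popleft()
    return queue
```

Because negatives come from the queue rather than the current batch, the number of negatives (queue length) is decoupled from the mini-batch size, which is the core memory saving over SimCLR.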
  • 76. "MoCo V2": a hybrid of ideas from SimCLR and MoCo. From SimCLR: non-linear projection head and strong data augmentation. From MoCo: momentum-updated queue that allows training with a large number of negative samples (no TPU required!). (Source: Chen et al., 2020)
  • 77. MoCo vs. SimCLR vs. MoCo V2. Key takeaway: a non-linear projection head and strong data augmentation are crucial for contrastive learning. (Source: Chen et al., 2020)
  • 78. Decoupling the mini-batch size from the negative sample size allows MoCo V2 to outperform SimCLR with a smaller batch size (256 vs. 8192). (Source: Chen et al., 2020)
  • 79. ... all with a much smaller memory footprint! ("end-to-end" means SimCLR here.) (Source: Chen et al., 2020)
  • 80. Instance vs. Sequence Contrastive Learning. Instance-level contrastive learning: based on positive and negative instances; examples: SimCLR, MoCo. Sequence-level contrastive learning: based on sequential / temporal order; example: Contrastive Predictive Coding (CPC). (Source: van den Oord et al., 2018)
  • 81. Contrastive Predictive Coding (CPC). Contrastive: contrast between "right" and "wrong" sequences using contrastive learning. Predictive: the model has to predict future patterns given the current context. Coding: the model learns useful feature vectors, or "code", for downstream tasks, similar to other self-supervised methods. (Source: van den Oord et al., 2018)
  • 82. CPC step 1: encode all samples in a sequence into vectors z_t = g_enc(x_t).
  • 83. CPC step 2: summarize the context (e.g., half of a sequence) into a context code c_t using an auto-regressive model g_ar; the original paper uses a GRU-RNN here.
  • 84. CPC step 3: compute the InfoNCE loss between the context c_t and a future code z_{t+k} using the time-dependent score function s_k(c_t, z_{t+k}) = z_{t+k}^T W_k c_t, where W_k is a trainable matrix.
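The time-dependent bilinear score from step 3 is a one-liner on nested lists; a minimal sketch, with one W_k per prediction offset k assumed to be supplied by the caller:

```python
def cpc_score(z_future, W_k, c_t):
    """CPC score s_k(c_t, z_{t+k}) = z^T W_k c, where W_k maps the context
    dimension to the code dimension. z_future and c_t are float lists,
    W_k a list of rows (len(z_future) x len(c_t))."""
    Wc = [sum(w * c for w, c in zip(row, c_t)) for row in W_k]  # W_k @ c_t
    return sum(z * wc for z, wc in zip(z_future, Wc))           # z^T (W_k c_t)
```

These scores feed straight into the InfoNCE loss: the true future code z_{t+k} is the positive, and codes drawn from other times or other sequences are the negatives.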
  • 85. CPC example: modeling audio sequences. (Source: van den Oord et al., 2018)
  • 86. CPC example: modeling audio sequences. Linear classification on the trained representations (LibriSpeech dataset). (Source: van den Oord et al., 2018)
  • 87. CPC example: modeling visual context. Idea: split the image into patches and model rows of patches from top to bottom as a sequence, i.e., use the top rows as context to predict the bottom rows. (Source: van den Oord et al., 2018)
  • 88. CPC example: modeling visual context. Compares favorably with other pretext-task-based self-supervised learning methods, but doesn't do as well on image feature learning as newer instance-based contrastive methods. (Source: van den Oord et al., 2018)
  • 89. Summary: Contrastive Representation Learning. A general formulation for contrastive learning: the InfoNCE loss (van den Oord et al., 2018), an N-way classification among positive and negative samples, and a lower bound on the mutual information between f(x) and f(x+).
  • 90. SimCLR: a simple framework for contrastive representation learning. Key idea: a non-linear projection head to allow flexible representation learning. Simple to implement and effective in learning visual representations, but requires a large training batch size to be effective, with a large memory footprint.
  • 91. MoCo (v1, v2): contrastive learning using a momentum-updated sample encoder. Decouples the negative sample size from the mini-batch size, allowing large-negative-set training without TPUs. MoCo V2 combines the key ideas from SimCLR (non-linear projection head, strong data augmentation) with momentum contrastive learning.
  • 92. CPC: sequence-level contrastive learning. Contrasts a "right" sequence with "wrong" sequences; uses the InfoNCE loss with a time-dependent score function. Can be applied to a variety of learning problems, but is not as effective at learning image representations as instance-level methods.
  • 93. Other examples: CLIP (Contrastive Language-Image Pre-training, Radford et al., 2021), contrastive learning between images and natural-language sentences.
  • 94. Other examples: Dense Object Net (Florence et al., 2018), contrastive learning on pixel-wise feature descriptors.
  • 95. Other examples: Dense Object Net (Florence et al., 2018).
  • 97. Next time: Low-Level Vision.
  • 98. Today's Agenda. Pretext tasks from image transformations: rotation, inpainting, rearrangement, coloring. Contrastive representation learning: intuition and formulation; instance contrastive learning (SimCLR and MoCo); sequence contrastive learning (CPC). Frontier: Contrastive Language-Image Pre-training (CLIP).
  • 99. Frontier: Contrastive Language-Image Pre-training (CLIP).
  • 100. Self-Supervised Learning. General idea: pretend there is a part of the data you don't know and train the neural network to predict it. (Source: LeCun, 2019 keynote at ISSCC)
  • 101. "The Cake of Learning": learn good features through self-supervision with a feature extractor, then use them for downstream tasks. (Source: LeCun, 2019 keynote at ISSCC)
  • 102. Can we do better? SimCLR vs. Momentum Contrast (MoCo). (Source: Chen et al., 2020b)