Fei-Fei Li, Jiajun Wu, Ruohan Gao Lecture 14 - May 17, 2022
Lecture 14:
Self-Supervised Learning
Administrative
- Assignment 3 due in two weeks, on 5/25
- Midterm grades are out
- Regrade requests:
  - Gradescope regrades only for mistakes according to the current rubric
  - Teaching team will discuss concerns in MC & T/F next Monday
Last Lecture: Generative Modeling
Training data ~ p_data(x)
Given training data, generate new samples from the same distribution.
Objectives:
1. Learn p_model(x) that approximates p_data(x)
2. Sample new x from p_model(x)
Last Lecture: Generative Modeling
Taxonomy of generative models:
● Explicit density
  ○ Tractable density: Fully Visible Belief Nets (NADE, MADE, PixelRNN/CNN), NICE / RealNVP, Glow, Ffjord
  ○ Approximate density: Variational (Variational Autoencoder), Markov Chain (Boltzmann Machine)
● Implicit density
  ○ Direct: GAN
  ○ Markov Chain: GSN
Figure copyright and adapted from Ian Goodfellow, Tutorial on Generative Adversarial Networks, 2017.
Generative vs. Self-supervised Learning
● Both aim to learn from data without manual label annotation.
● Generative learning aims to model the data distribution p_data(x), e.g., generating realistic images.
● Self-supervised learning methods solve "pretext" tasks that produce good features for downstream tasks.
  ○ Learn with supervised learning objectives, e.g., classification, regression.
  ○ Labels for these pretext tasks are generated automatically.
Self-supervised pretext tasks
Example: learn to predict image transformations or complete corrupted images, e.g., rotation prediction (θ = ?), "jigsaw puzzles", colorization, and image completion.
1. Solving the pretext tasks allows the model to learn good features.
2. We can automatically generate labels for the pretext tasks.
Generative vs. Self-supervised Learning
Left: Drawing of a dollar bill from memory. Right: Drawing subsequently made
with a dollar bill present. Image source: Epstein, 2016
Learning to generate pixel-level details is often unnecessary; learn
high-level semantic features with pretext tasks instead
Source: Anand, 2020
How to evaluate a self-supervised learning method?
We usually don’t care about the performance of the self-supervised
learning task, e.g., we don’t care if the model learns to predict image
rotation perfectly.
Evaluate the learned feature encoders on downstream target tasks
How to evaluate a self-supervised learning method?
1. Learn good feature extractors from self-supervised pretext tasks, e.g., predicting image rotations (90°): lots of unlabeled data → self-supervised learning → feature extractor (e.g., a convnet with conv and fc layers)
How to evaluate a self-supervised learning method?
1. Learn good feature extractors from self-supervised pretext tasks, e.g., predicting image rotations: lots of unlabeled data → self-supervised learning → feature extractor (e.g., a convnet).
2. Attach a shallow network (e.g., a linear classifier) on top of the feature extractor; train the shallow network on the target task with a small amount of labeled data, then evaluate on the target task, e.g., classification ("bird") or detection.
Broader picture
language modeling: GPT-3 (Brown, Mann, Ryder, Subbiah et al., 2020)
speech synthesis: WaveNet (van den Oord et al., 2016)
computer vision: Doersch et al., 2015, ... (today's lecture)
robot / reinforcement learning: Dense Object Net (Florence and Manuelli et al., 2018)
Today’s Agenda
Pretext tasks from image transformations
- Rotation, inpainting, rearrangement, coloring
Contrastive representation learning
- Intuition and formulation
- Instance contrastive learning: SimCLR and MoCo
- Sequence contrastive learning: CPC
Pretext task: predict rotations
Hypothesis: a model could recognize the correct rotation of an object
only if it has the “visual commonsense” of what the object should look
like unperturbed.
(Image source: Gidaris et al. 2018)
Pretext task: predict rotations
Self-supervised learning by rotating the entire input images. The model learns to predict which rotation was applied (4-way classification).
(Image source: Gidaris et al. 2018)
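The rotation labels come for free from the transformation itself. A minimal NumPy sketch of the label-generation step (`make_rotation_batch` is a hypothetical helper, not the authors' code; the real pipeline feeds these through a convnet with a 4-way classification head):

```python
import numpy as np

def make_rotation_batch(images):
    """Given a batch of images (N, H, W, C), return every image in all four
    rotations plus its rotation label (0: 0 deg, 1: 90, 2: 180, 3: 270).
    The pretext labels are generated automatically, with no annotation."""
    rotated, labels = [], []
    for img in images:
        for k in range(4):                      # k quarter-turns
            rotated.append(np.rot90(img, k=k))  # rotate in the (H, W) plane
            labels.append(k)
    return np.stack(rotated), np.array(labels)

# toy usage: 2 fake "images" of shape 8x8x3
batch = np.random.rand(2, 8, 8, 3)
x, y = make_rotation_batch(batch)
print(x.shape, y[:4])   # (8, 8, 8, 3) [0 1 2 3]
```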
Evaluation on semi-supervised learning
(Image source: Gidaris et al. 2018)
Self-supervised learning on CIFAR10 (entire training set). Freeze conv1 + conv2; learn conv3 + linear layers with a subset of labeled CIFAR10 data (classification).
Transfer learned features to supervised learning
Source: Gidaris et al. 2018
Self-supervised learning on ImageNet (entire training set) with AlexNet, then finetune on labeled data from Pascal VOC 2007. Compared settings: pretrained with full ImageNet supervision, no pretraining, and self-supervised learning with rotation prediction.
Visualize learned visual attentions
(Image source: Gidaris et al. 2018)
Pretext task: predict relative patch locations
(Image source: Doersch et al., 2015)
Pretext task: solving “jigsaw puzzles”
(Image source: Noroozi & Favaro, 2016)
Transfer learned features to supervised learning
(source: Noroozi & Favaro, 2016)
"Ours" is the feature learned by solving image jigsaw puzzles (Noroozi & Favaro, 2016); Doersch et al. is the method based on relative patch location prediction.
Pretext task: predict missing pixels (inpainting)
Source: Pathak et al., 2016
Context Encoders: Feature Learning by Inpainting (Pathak et al., 2016)
Learning to inpaint by reconstruction
Learning to reconstruct the missing pixels
Source: Pathak et al., 2016
Inpainting evaluation
Source: Pathak et al., 2016
Input (context) reconstruction
Learning to inpaint by reconstruction
Source: Pathak et al., 2016
Loss = reconstruction + adversarial loss
Adversarial loss between "real" images and inpainted images
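As a toy illustration of the reconstruction half of this loss (the adversarial term is omitted; `masked_reconstruction_loss` is a hypothetical helper, not the authors' code):

```python
import numpy as np

def masked_reconstruction_loss(pred, target, mask):
    """pred, target: (H, W, C) images; mask: (H, W) boolean, True where
    pixels were dropped. The L2 loss is computed only on the missing region,
    so the model is trained purely to fill in what it cannot see."""
    diff = (pred - target) ** 2
    return diff[mask].mean()

H, W, C = 16, 16, 3
target = np.random.rand(H, W, C)
pred = target.copy()
mask = np.zeros((H, W), dtype=bool)
mask[4:12, 4:12] = True            # central "missing" square
pred[mask] += 0.1                  # imperfect reconstruction in that region
loss = masked_reconstruction_loss(pred, target, mask)
print(round(float(loss), 3))       # 0.01
```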
Inpainting evaluation
Source: Pathak et al., 2016
Input (context) reconstruction adversarial recon + adv
Transfer learned features to supervised learning
Self-supervised learning on the ImageNet training set; transfer to classification (Pascal VOC 2007), detection (Pascal VOC 2007), and semantic segmentation (Pascal VOC 2012).
Source: Pathak et al., 2016
Pretext task: image coloring
Source: Richard Zhang / Phillip Isola
Pretext task: image coloring
Source: Richard Zhang / Phillip Isola
Learning features from colorization:
Split-brain Autoencoder
Source: Richard Zhang / Phillip Isola
Idea: cross-channel predictions
Transfer learned features to supervised learning
Self-supervised learning on ImageNet (entire training set). Use concatenated features from F1 and F2. Labeled data is from the Places dataset (Zhou et al., 2016). (Plot legend: supervised vs. this paper.)
Source: Zhang et al., 2017
Pretext task: image coloring
Source: Richard Zhang / Phillip Isola
Pretext task: image coloring
Source: Richard Zhang / Phillip Isola
Pretext task: video coloring
Source: Vondrick et al., 2018
Given a reference frame at t = 0, how should we color the subsequent frames t = 1, 2, 3, ...?
Idea: model the temporal coherence of colors in videos
Should be the same color!
Hypothesis: learning to color video frames should allow the model to learn to track regions or objects without labels!
Learning to color videos
Source: Vondrick et al., 2018
Learning objective: establish mappings between reference and target frames in a learned feature space. Use the mapping as "pointers" to copy the correct color (Lab color space).
Learning to color videos
Source: Vondrick et al., 2018
Compute an attention map on the reference frame; the predicted color is a weighted sum of the reference colors; apply a loss between the predicted color and the ground-truth color.
Colorizing videos (qualitative)
Reference frame; target frames (gray); predicted color.
Source: Google AI blog post
Tracking emerges from colorization
Propagate segmentation masks using learned attention
Source: Google AI blog post
Tracking emerges from colorization
Propagate pose keypoints using learned attention
Source: Google AI blog post
Summary: pretext tasks from image transformations
● Pretext tasks focus on "visual common sense", e.g., predicting rotations, inpainting, rearrangement, and colorization.
● The models are forced to learn good features about natural images, e.g., the semantic representation of an object category, in order to solve the pretext tasks.
● We don't care about the performance of these pretext tasks, but rather how useful the learned features are for downstream tasks (classification, detection, segmentation).
● Problems: 1) coming up with individual pretext tasks is tedious, and 2) the learned representations may not be general.
Pretext tasks from image transformations
Pretext tasks: rotation prediction (θ = ?), "jigsaw puzzle", colorization, image completion.
Learned representations may be tied to a specific pretext task! Can we come up with a more general pretext task?
A more general pretext task?
same object
different object
Contrastive Representation Learning
attract
repel
reference
positive
negative
A formulation of contrastive learning
What we want:
s(f(x), f(x+)) >> s(f(x), f(x-))
where x is a reference sample, x+ a positive sample, and x- a negative sample.
Given a chosen score function s, we aim to learn an encoder function f that yields a high score for positive pairs (x, x+) and low scores for negative pairs (x, x-).
A formulation of contrastive learning
Loss function given 1 positive sample and N - 1 negative samples:

L = -E[ log( exp(s(f(x), f(x+))) / ( exp(s(f(x), f(x+))) + Σ_{j=1..N-1} exp(s(f(x), f(x_j-))) ) ) ]

The numerator contains the score for the positive pair; the sum in the denominator covers the scores for the N - 1 negative pairs. This seems familiar: it is the cross-entropy loss for an N-way softmax classifier! I.e., learn to find the positive sample among the N samples.
This loss is commonly known as the InfoNCE loss (van den Oord et al., 2018). It is a lower bound on the mutual information between f(x) and f(x+); the larger the negative sample size N, the tighter the bound. Detailed derivation: Poole et al., 2019.
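The N-way softmax view of the loss can be made concrete with a small sketch (`info_nce` is an illustrative helper; the scalar scores stand in for s(f(x), f(x+)) and s(f(x), f(x-)) values):

```python
import numpy as np

def log_softmax(x):
    x = x - x.max()
    return x - np.log(np.exp(x).sum())

def info_nce(s_pos, s_neg):
    """s_pos: score for the (x, x+) pair; s_neg: scores for the N - 1
    (x, x-) pairs. N-way cross entropy with the positive pair as class 0."""
    scores = np.concatenate(([s_pos], s_neg))
    return -log_softmax(scores)[0]

# With all scores equal, the model is at chance among N candidates,
# so the loss equals log N.
print(np.isclose(info_nce(0.0, np.zeros(7)), np.log(8)))   # True

# Raising the positive score lowers the loss, as expected.
print(info_nce(5.0, np.zeros(7)) < info_nce(0.0, np.zeros(7)))   # True
```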
SimCLR: A Simple Framework for Contrastive Learning
Source: Chen et al., 2020
Use a projection network g(·) to project features to a space where contrastive learning is applied.
Generate positive samples through data augmentation: random cropping, random color distortion, and random blur.
Cosine similarity as the score function: s(u, v) = u·v / (||u|| ||v||)
SimCLR: generating positive samples from
data augmentation
Source: Chen et al., 2020
SimCLR
Source: Chen et al., 2020
● Generate a positive pair by sampling data augmentation functions.
● InfoNCE loss: use all non-positive samples in the batch as x-.
● Iterate through and use each of the 2N samples as the reference; compute the average loss.
*We use a slightly different formulation in the assignment. You should follow the assignment instructions.
SimCLR: mini-batch training
List of positive pairs: each (2k, 2k + 1) pair of elements is a positive pair. The "affinity matrix" of pairwise scores then defines a classification problem, with one classification label per row.
*We use a slightly different formulation in the assignment. You should follow the assignment instructions.
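A rough NumPy sketch of this mini-batch recipe (an illustrative `simclr_loss`, not the assignment's formulation; the temperature and shapes are arbitrary choices):

```python
import numpy as np

def simclr_loss(z, temperature=0.5):
    """NT-Xent-style sketch over a batch of 2N projected embeddings z of
    shape (2N, D), arranged so rows 2k and 2k + 1 are the two augmented
    views of example k."""
    z = z / np.linalg.norm(z, axis=1, keepdims=True)   # cosine similarity
    sim = z @ z.T / temperature                        # the "affinity matrix"
    np.fill_diagonal(sim, -np.inf)                     # exclude self-pairs
    n = z.shape[0]
    labels = np.arange(n) ^ 1                          # partner view: 0<->1, 2<->3, ...
    logp = sim - np.log(np.exp(sim).sum(axis=1, keepdims=True))
    return -logp[np.arange(n), labels].mean()          # average over all 2N rows

rng = np.random.default_rng(0)
z = rng.normal(size=(8, 16))        # N = 4 examples, two views each
loss = simclr_loss(z)
print(loss > 0)                     # True
```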
Training linear classifier on SimCLR features
Train the feature encoder on ImageNet (entire training set) using SimCLR. Freeze the feature encoder and train a linear classifier on top with labeled data.
Source: Chen et al., 2020
Semi-supervised learning on SimCLR features
Train the feature encoder on ImageNet (entire training set) using SimCLR. Finetune the encoder with 1% / 10% of the labeled ImageNet data.
Source: Chen et al., 2020
SimCLR design choices: projection head
Adding a (linear or non-linear) projection head improves representation learning.
A possible explanation:
● the contrastive learning objective may discard information that is useful for downstream tasks
● the projection space z is trained to be invariant to data transformations
● by applying the contrastive loss after the projection head g(·), more information can be preserved in the representation space h
Source: Chen et al., 2020
SimCLR design choices: large batch size
A large training batch size is crucial for SimCLR! Large batch sizes cause a large memory footprint during backpropagation: this requires distributed training on TPUs (for the ImageNet experiments).
Source: Chen et al., 2020
Momentum Contrastive Learning (MoCo)
Key differences from SimCLR:
● Keep a running queue of keys (negative samples).
● Compute gradients and update the encoder only through the queries; the key encoder runs under no_grad.
● Decouple the mini-batch size from the number of keys: can support a large number of negative samples.
● The key encoder progresses slowly through the momentum update rule: θ_k ← m·θ_k + (1 - m)·θ_q
Source: He et al., 2020
MoCo
Training loop (He et al., 2020):
● Generate a positive pair by sampling data augmentation functions.
● No gradient through the positive (key) sample.
● Use the running queue of keys as the negative samples.
● Compute the InfoNCE loss.
● Update f_k through the momentum rule.
● Update the FIFO negative sample queue.
Source: He et al., 2020
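The two MoCo-specific updates can be sketched in a few lines (the "encoders" here are plain weight vectors; `momentum_update` and the queue size are illustrative stand-ins for the real training code):

```python
import numpy as np
from collections import deque

def momentum_update(theta_q, theta_k, m=0.999):
    """theta_k <- m * theta_k + (1 - m) * theta_q; no gradient flows here."""
    return m * theta_k + (1.0 - m) * theta_q

queue = deque(maxlen=4096)     # FIFO queue of past keys (negative samples)

theta_q = np.ones(3)           # query encoder weights (normally updated by SGD)
theta_k = np.zeros(3)          # key encoder weights (updated only by momentum)
for _ in range(3):
    theta_k = momentum_update(theta_q, theta_k)
    queue.append(np.random.rand(8))   # enqueue this batch's keys

# After 3 steps, theta_k has moved a fraction 1 - m^3 of the way to theta_q
print(len(queue), np.allclose(theta_k, 1 - 0.999 ** 3))
```

When the queue is full, `deque(maxlen=...)` silently drops the oldest batch of keys, which mirrors MoCo's first-in-first-out dictionary.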
“MoCo V2”
A hybrid of ideas from SimCLR and MoCo:
● From SimCLR: non-linear projection head and strong data
augmentation.
● From MoCo: momentum-updated queues that allow training
on a large number of negative samples (no TPU required!).
Source: Chen et al., 2020
MoCo vs. SimCLR vs. MoCo V2
Source: Chen et al., 2020
Key takeaways:
● A non-linear projection head and strong data augmentation are crucial for contrastive learning.
● Decoupling the mini-batch size from the negative sample size allows MoCo V2 to outperform SimCLR with a smaller batch size (256 vs. 8192).
● ... all with a much smaller memory footprint! ("end-to-end" means SimCLR here)
Instance vs. Sequence Contrastive Learning
Instance-level contrastive learning: contrastive learning based on positive & negative instances. Examples: SimCLR, MoCo.
Sequence-level contrastive learning: contrastive learning based on sequential / temporal orders. Example: Contrastive Predictive Coding (CPC).
Source: van den Oord et al., 2018
Contrastive Predictive Coding (CPC)
Source: van den Oord et al., 2018,
Figure source
Contrastive: contrast between "right" and "wrong" sequences using contrastive learning.
Predictive: the model has to predict future patterns given the current context.
Coding: the model learns useful feature vectors, or "codes", for downstream tasks, similar to other self-supervised methods.
(Figure legend: context, positive, negative)
Contrastive Predictive Coding (CPC)
Source: van den Oord et al., 2018, Figure source
1. Encode all samples in a sequence into vectors z_t = g_enc(x_t).
2. Summarize the context (e.g., half of a sequence) into a context code c_t using an auto-regressive model g_ar. The original paper uses a GRU-RNN here.
3. Compute the InfoNCE loss between the context c_t and the future code z_{t+k} using the time-dependent score function s_k(z_{t+k}, c_t) = z_{t+k}ᵀ W_k c_t, where W_k is a trainable matrix.
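Step 3's bilinear score and the resulting InfoNCE objective can be sketched as follows (toy dimensions; the W_k matrices and all codes are random stand-ins for trained encoder outputs):

```python
import numpy as np

rng = np.random.default_rng(0)
d_z, d_c, K = 8, 16, 3
W = [rng.normal(size=(d_z, d_c)) for _ in range(K)]   # one trainable W_k per step k

def cpc_score(z_future, c_t, k):
    """Time-dependent bilinear score s_k(z_{t+k}, c_t) = z_{t+k}^T W_k c_t."""
    return z_future @ W[k - 1] @ c_t

c_t = rng.normal(size=d_c)          # context code from g_ar
z_pos = rng.normal(size=d_z)        # the true future code z_{t+k}
z_negs = rng.normal(size=(7, d_z))  # codes drawn from other sequences

# InfoNCE over 8 candidates: identify the positive (index 0) by its score
scores = np.array([cpc_score(z_pos, c_t, 1)] +
                  [cpc_score(z, c_t, 1) for z in z_negs])
logp = scores - scores.max() - np.log(np.exp(scores - scores.max()).sum())
loss = -logp[0]
print(scores.shape, np.isfinite(loss))
```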
CPC example: modeling audio sequences
Source: van den Oord et al., 2018,
CPC example: modeling audio sequences
Linear classification on trained
representations (LibriSpeech dataset)
Source: van den Oord et al., 2018,
CPC example: modeling visual context
Source: van den Oord et al., 2018,
Idea: split image into patches, model rows of patches from top to bottom
as a sequence. I.e., use top rows as context to predict bottom rows.
CPC example: modeling visual context
Source: van den Oord et al., 2018,
● Compares favorably with other pretext-task-based self-supervised learning methods.
● Doesn't do as well as newer instance-based contrastive learning methods on image feature learning.
Summary: Contrastive Representation Learning
A general formulation for contrastive learning: the InfoNCE loss (van den Oord et al., 2018), an N-way classification among positive and negative samples, and a lower bound on the mutual information between f(x) and f(x+).
Summary: Contrastive Representation Learning
SimCLR: a simple framework for contrastive
representation learning
● Key ideas: non-linear projection head to
allow flexible representation learning
● Simple to implement, effective in learning
visual representation
● Requires large training batch size to be
effective; large memory footprint
Summary: Contrastive Representation Learning
MoCo (v1, v2): contrastive learning using
momentum sample encoder
● Decouples negative sample size from
minibatch size; allows large batch training
without TPU
● MoCo-v2 combines the key ideas from
SimCLR, i.e., nonlinear projection head,
strong data augmentation, with momentum
contrastive learning
Summary: Contrastive Representation Learning
CPC: sequence-level contrastive learning
● Contrast “right” sequence with “wrong”
sequence.
● InfoNCE loss with a time-dependent score
function.
● Can be applied to a variety of learning
problems, but not as effective in learning
image representations compared to
instance-level methods.
Other examples
CLIP (Contrastive Language–Image Pre-training) Radford et al., 2021
Contrastive learning between image and natural language sentences
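A sketch of this symmetric image-text objective (random embeddings stand in for the image and text encoder outputs; `clip_loss` is an illustrative helper, not OpenAI's implementation):

```python
import numpy as np

def log_softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    return x - np.log(np.exp(x).sum(axis=axis, keepdims=True))

def clip_loss(img_emb, txt_emb, temperature=0.07):
    """Matched (image, text) rows attract; every other pairing in the batch
    repels. Symmetric cross-entropy over rows (image -> text) and columns
    (text -> image) of the cosine-similarity matrix."""
    img = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    txt = txt_emb / np.linalg.norm(txt_emb, axis=1, keepdims=True)
    logits = img @ txt.T / temperature         # (N, N); diagonal = matched pairs
    idx = np.arange(logits.shape[0])
    loss_i = -log_softmax(logits, axis=1)[idx, idx]   # image -> text
    loss_t = -log_softmax(logits, axis=0)[idx, idx]   # text -> image
    return 0.5 * (loss_i + loss_t).mean()

rng = np.random.default_rng(0)
img = rng.normal(size=(4, 32))      # stand-in image embeddings
txt = rng.normal(size=(4, 32))      # stand-in text embeddings
loss = clip_loss(img, txt)
print(loss > 0)                     # True
```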
Other examples
Dense Object Net, Florence et al., 2018
Contrastive learning on pixel-wise feature descriptors
Next time: Low-Level Vision
Today’s Agenda
Pretext tasks from image transformations
- Rotation, inpainting, rearrangement, coloring
Contrastive representation learning
- Intuition and formulation
- Instance contrastive learning: SimCLR and MoCo
- Sequence contrastive learning: CPC
Frontier:
- Contrastive Language Image Pre-training (CLIP)
Frontier: Contrastive Language–Image Pre-training (CLIP)
Self-Supervised Learning
General idea: pretend there is a part of the data you don't know and train the neural network to predict it.
Source: LeCun, 2019 keynote at ISSCC
“The Cake of Learning”
Source: LeCun, 2019 keynote at ISSCC
Learn good features through self-supervision; the resulting feature extractor is then used for downstream tasks.
Can we do better?
SimCLR; Momentum Contrast (MoCo)
Source: Chen et al., 2020b
More Related Content

PPTX
Lecture_16_Self-supervised_Learning.pptx
PDF
598_WI2022_lecture22.pdf data analysis and data prediction
PDF
PAISS (PRAIRIE AI Summer School) Digest July 2018
PDF
lecture_13_jiajun.pdf Generative models GAN
PDF
Self-supervised Visual Learning 2020 - Xavier Giro-i-Nieto - UPC Barcelona
PDF
Learning Visual Representations from Uncurated Data
PDF
Deep Self-supervised Learning for All - Xavier Giro - X-Europe 2020
PDF
lec_11_self_supervised_learning.pdf
Lecture_16_Self-supervised_Learning.pptx
598_WI2022_lecture22.pdf data analysis and data prediction
PAISS (PRAIRIE AI Summer School) Digest July 2018
lecture_13_jiajun.pdf Generative models GAN
Self-supervised Visual Learning 2020 - Xavier Giro-i-Nieto - UPC Barcelona
Learning Visual Representations from Uncurated Data
Deep Self-supervised Learning for All - Xavier Giro - X-Europe 2020
lec_11_self_supervised_learning.pdf

Similar to lecture_14_jiajun.pdf Self supervised Learning (20)

PDF
Unsupervised visual representation learning overview: Toward Self-Supervision
PDF
NTU DBME5028 Week8 Transfer Learning
PPTX
cs231n_2019_lecture11_Tispptisneededforth.pptx
PDF
A Simple Framework for Contrastive Learning of Visual Representations
PDF
lecture_5_ruohan image classification with CNN
PDF
Self-supervised Learning from Video Sequences - Xavier Giro - UPC Barcelona 2019
PPTX
CM20315_01_Intro_Machine_Learning_ap.pptx
PDF
Cs231n 2017 lecture13 Generative Model
PDF
Emerging Properties in Self-Supervised Vision Transformers
PPTX
Self-Supervised Learning recent trends 1 2
PDF
Unsupervised Deep Learning (D2L1 Insight@DCU Machine Learning Workshop 2017)
PDF
Transfer Learning and Domain Adaptation - Ramon Morros - UPC Barcelona 2018
PPTX
Deep Learning: Towards General Artificial Intelligence
PDF
Taskonomy of Transfer Learning
PDF
lecture_14_jiajun.pdf: Self-Supervised Learning

  • 1. Fei-Fei Li, Jiajun Wu, Ruohan Gao Lecture 14 - May 17, 2022 1 Lecture 14: Self-Supervised Learning
  • 2. Administrative: Assignment 3 due in two weeks (5/25). Midterm grades are out. Regrade requests: Gradescope regrades are only for mistakes according to the current rubric; the teaching team will discuss concerns about the MC & T/F questions next Monday.
  • 3. Last Lecture: Generative Modeling. Given training data ~ p_data(x), generate new samples from the same distribution. Objectives: 1. Learn p_model(x) that approximates p_data(x). 2. Sample new x from p_model(x).
  • 4. Last Lecture: Generative Modeling. Taxonomy of generative models: explicit density splits into tractable density (Fully Visible Belief Nets: NADE, MADE, PixelRNN/CNN; also NICE / RealNVP, Glow, Ffjord) and approximate density (variational: Variational Autoencoder; Markov chain: Boltzmann Machine); implicit density splits into direct (GAN) and Markov chain (GSN). Figure copyright and adapted from Ian Goodfellow, Tutorial on Generative Adversarial Networks, 2017.
  • 5. Generative vs. Self-supervised Learning. Both aim to learn from data without manual label annotation. Generative learning aims to model the data distribution p_data(x), e.g., generating realistic images. Self-supervised learning methods solve "pretext" tasks that produce good features for downstream tasks: they learn with supervised learning objectives (e.g., classification, regression), and the labels for these pretext tasks are generated automatically.
  • 6. Self-supervised pretext tasks. Example: learn to predict image transformations or complete corrupted images (rotation prediction, "jigsaw puzzle", colorization, image completion). 1. Solving the pretext tasks allows the model to learn good features. 2. We can automatically generate labels for the pretext tasks.
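To make "labels generated automatically" concrete, here is a minimal pure-Python sketch of the rotation pretext task's data pipeline, using toy 2D-list "images"; the function names are ours, not from any paper:

```python
import random

def rotate90_cw(img):
    """Rotate a 2D list (H x W image) 90 degrees clockwise."""
    return [list(row) for row in zip(*img[::-1])]

def make_rotation_batch(images, rng=random):
    """Rotation pretext task: rotate each image by a random multiple of
    90 degrees and emit (rotated_image, label) pairs. The label
    k in {0, 1, 2, 3} comes for free -- no human annotation needed."""
    batch = []
    for img in images:
        k = rng.randrange(4)
        rot = img
        for _ in range(k):
            rot = rotate90_cw(rot)
        batch.append((rot, k))
    return batch
```

A 4-way classifier trained on such pairs has to pick up on object-orientation cues (sky up, grass down) to do well, which is exactly the "visual commonsense" the pretext task is after.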
  • 7. Generative vs. Self-supervised Learning. Left: drawing of a dollar bill from memory. Right: drawing subsequently made with a dollar bill present. (Image source: Epstein, 2016.) Learning to generate pixel-level details is often unnecessary; learn high-level semantic features with pretext tasks instead. (Source: Anand, 2020)
  • 8. How to evaluate a self-supervised learning method? We usually don't care about the performance on the self-supervised task itself; e.g., we don't care if the model learns to predict image rotation perfectly. Instead, evaluate the learned feature encoders on downstream target tasks.
  • 9. How to evaluate a self-supervised learning method? 1. Learn good feature extractors (e.g., a convnet) from self-supervised pretext tasks on lots of unlabeled data, e.g., predicting image rotations.
  • 10. How to evaluate a self-supervised learning method? 1. Learn good feature extractors from self-supervised pretext tasks, e.g., predicting image rotations. 2. Attach a shallow network (e.g., a linear classifier) to the feature extractor and train it on the target task (e.g., classification, detection) with a small amount of labeled data.
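Step 2 of this evaluation protocol can be sketched in miniature: the frozen encoder is assumed to have already mapped each input to a feature vector, and we fit only a linear softmax classifier on top. A pure-Python toy, not any paper's implementation:

```python
import math

def train_linear_probe(features, labels, n_classes, lr=0.1, epochs=100):
    """Fit a linear softmax classifier on frozen features by plain
    gradient descent. features: list of equal-length float lists;
    labels: list of ints in [0, n_classes)."""
    d = len(features[0])
    W = [[0.0] * d for _ in range(n_classes)]
    b = [0.0] * n_classes
    for _ in range(epochs):
        for x, y in zip(features, labels):
            logits = [sum(w * xi for w, xi in zip(W[c], x)) + b[c]
                      for c in range(n_classes)]
            m = max(logits)                      # subtract max for stability
            exps = [math.exp(l - m) for l in logits]
            Z = sum(exps)
            for c in range(n_classes):
                g = exps[c] / Z - (1.0 if c == y else 0.0)  # dL/dlogit_c
                b[c] -= lr * g
                for i in range(d):
                    W[c][i] -= lr * g * x[i]
    return W, b
```

The probe's accuracy on held-out labeled data is then the measure of how good the self-supervised features are.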
  • 11. Broader picture. Language modeling: GPT-3 (Brown, Mann, Ryder, Subbiah et al., 2020). Speech synthesis: WaveNet (van den Oord et al., 2016). Computer vision: Doersch et al., 2015 (today's lecture). Robot / reinforcement learning: Dense Object Net (Florence and Manuelli et al., 2018).
  • 12. Today's Agenda. Pretext tasks from image transformations: rotation, inpainting, rearrangement, coloring. Contrastive representation learning: intuition and formulation; instance contrastive learning (SimCLR and MoCo); sequence contrastive learning (CPC).
  • 14. Pretext task: predict rotations. Hypothesis: a model can recognize the correct rotation of an object only if it has the "visual commonsense" of what the object should look like unperturbed. (Image source: Gidaris et al., 2018)
  • 15. Pretext task: predict rotations. Self-supervised learning by rotating the entire input image; the model learns to predict which rotation was applied (4-way classification: 0°, 90°, 180°, 270°). (Image source: Gidaris et al., 2018)
  • 17. Evaluation on semi-supervised learning. Self-supervised learning on CIFAR10 (entire training set). Freeze conv1 + conv2; learn conv3 + linear layers with a subset of labeled CIFAR10 data (classification). (Image source: Gidaris et al., 2018)
  • 18. Transfer learned features to supervised learning. Self-supervised learning on ImageNet (entire training set) with AlexNet, then finetune on labeled data from Pascal VOC 2007. Compared against pretraining with full ImageNet supervision and no pretraining. (Source: Gidaris et al., 2018)
  • 19. Visualize learned visual attentions. (Image source: Gidaris et al., 2018)
  • 20. Pretext task: predict relative patch locations. (Image source: Doersch et al., 2015)
  • 21. Pretext task: solving "jigsaw puzzles". (Image source: Noroozi & Favaro, 2016)
  • 22. Transfer learned features to supervised learning. "Ours" is the feature learned from solving image jigsaw puzzles (Noroozi & Favaro, 2016); "Doersch et al." is the relative patch location method. (Source: Noroozi & Favaro, 2016)
  • 23. Pretext task: predict missing pixels (inpainting). Context Encoders: Feature Learning by Inpainting (Pathak et al., 2016).
  • 24. Learning to inpaint by reconstruction: learn to reconstruct the missing pixels. (Source: Pathak et al., 2016)
  • 25. Inpainting evaluation: input (context) vs. reconstruction. (Source: Pathak et al., 2016)
  • 26. Learning to inpaint by reconstruction. Loss = reconstruction + adversarial learning; the adversarial loss is between "real" images and inpainted images. (Source: Pathak et al., 2016)
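The joint objective (reconstruction plus adversarial term) can be sketched on flattened pixel lists. The 0.999 / 0.001 weighting follows the Context Encoders paper, but treat the exact form below as illustrative, not the authors' code:

```python
import math

def context_encoder_loss(pred, target, d_fake, lam_rec=0.999, lam_adv=0.001):
    """Joint inpainting loss: a pixel-wise L2 reconstruction term plus an
    adversarial term that rewards fooling the discriminator. pred/target
    are flattened pixel lists for the masked region; d_fake is the
    discriminator's probability that the inpainted region is real."""
    rec = sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)
    adv = -math.log(max(d_fake, 1e-12))  # generator side of the GAN loss
    return lam_rec * rec + lam_adv * adv
```

The reconstruction term captures coarse structure; the adversarial term pushes the filled-in pixels toward the manifold of realistic images, which is why the combined result looks sharper than L2 alone.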
  • 27. Inpainting evaluation: input (context), reconstruction only, adversarial only, reconstruction + adversarial. (Source: Pathak et al., 2016)
  • 28. Transfer learned features to supervised learning. Self-supervised learning on the ImageNet training set, then transfer to classification (Pascal VOC 2007), detection (Pascal VOC 2007), and semantic segmentation (Pascal VOC 2012). (Source: Pathak et al., 2016)
  • 29. Pretext task: image coloring. (Source: Richard Zhang / Phillip Isola)
  • 31. Learning features from colorization: Split-Brain Autoencoder. Idea: cross-channel predictions. (Source: Richard Zhang / Phillip Isola)
  • 32. Learning features from colorization: Split-Brain Autoencoder. (Source: Richard Zhang / Phillip Isola)
  • 33. Learning features from colorization: Split-Brain Autoencoder. (Source: Richard Zhang / Phillip Isola)
  • 34. Transfer learned features to supervised learning. Self-supervised learning on ImageNet (entire training set); use the concatenated features from F1 and F2. Labeled data is from the Places dataset (Zhou et al., 2016). (Source: Zhang et al., 2017)
  • 35. Pretext task: image coloring. (Source: Richard Zhang / Phillip Isola)
  • 37. Pretext task: video coloring. Idea: model the temporal coherence of colors in videos. Given a reference frame at t = 0, how should the frames at t = 1, 2, 3, ... be colored? (Source: Vondrick et al., 2018)
  • 38. Pretext task: video coloring. Corresponding regions should be the same color! Hypothesis: learning to color video frames should allow the model to learn to track regions or objects without labels. (Source: Vondrick et al., 2018)
  • 39. Learning to color videos. Learning objective: establish mappings between reference and target frames in a learned feature space, then use the mapping as "pointers" to copy the correct color (in Lab color space). (Source: Vondrick et al., 2018)
  • 40. Learning to color videos: compute an attention map on the reference frame. (Source: Vondrick et al., 2018)
  • 41. Learning to color videos: predicted color = weighted sum of the reference colors, using the attention map on the reference frame. (Source: Vondrick et al., 2018)
  • 42. Learning to color videos: loss between the predicted color and the ground-truth color. (Source: Vondrick et al., 2018)
  • 43. Colorizing videos (qualitative): reference frame, target frames (gray), predicted color. (Source: Google AI blog post)
  • 44. Colorizing videos (qualitative): reference frame, target frames (gray), predicted color. (Source: Google AI blog post)
  • 45. Tracking emerges from colorization: propagate segmentation masks using the learned attention. (Source: Google AI blog post)
  • 46. Tracking emerges from colorization: propagate pose keypoints using the learned attention. (Source: Google AI blog post)
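The attention-then-copy mechanism in slides 40 to 42 can be sketched in pure Python on toy feature/color lists. This is a simplified stand-in for the model in Vondrick et al., 2018, not their implementation; the names and the dot-product similarity are our assumptions:

```python
import math

def colorize_targets(ref_feats, ref_colors, tgt_feats, temperature=1.0):
    """For each target-frame location i, attend over reference locations j
    with A_ij = softmax_j(f_i . f_j / T), then predict
    color_i = sum_j A_ij * c_j (a weighted copy of reference colors)."""
    out = []
    for fi in tgt_feats:
        sims = [sum(a * b for a, b in zip(fi, fj)) / temperature
                for fj in ref_feats]
        m = max(sims)
        exps = [math.exp(s - m) for s in sims]
        Z = sum(exps)
        attn = [e / Z for e in exps]  # attention map over the reference frame
        color = [sum(a * c[k] for a, c in zip(attn, ref_colors))
                 for k in range(len(ref_colors[0]))]
        out.append(color)
    return out
```

Because color is only ever copied through the attention map, training the color loss forces the features to match corresponding regions across frames, which is exactly why tracking emerges for free.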
  • 47. Summary: pretext tasks from image transformations. Pretext tasks focus on "visual common sense", e.g., predicting rotations, inpainting, rearrangement, and colorization. The models are forced to learn good features about natural images, e.g., a semantic representation of an object category, in order to solve the pretext tasks. We don't care about the performance on these pretext tasks themselves, but rather how useful the learned features are for downstream tasks (classification, detection, segmentation).
  • 48. Summary: pretext tasks from image transformations (continued). Problems: 1) coming up with individual pretext tasks is tedious, and 2) the learned representations may not be general.
  • 49. Pretext tasks from image transformations (rotation prediction, "jigsaw puzzle", colorization, image completion). Learned representations may be tied to a specific pretext task! Can we come up with a more general pretext task?
  • 50. A more general pretext task? Two views of the same object should map to nearby features.
  • 51. A more general pretext task? Views of the same object vs. views of a different object.
  • 52. Contrastive Representation Learning: attract features of the same object; repel features of different objects.
  • 53. Today's Agenda. Pretext tasks from image transformations: rotation, inpainting, rearrangement, coloring. Contrastive representation learning: intuition and formulation; instance contrastive learning (SimCLR and MoCo); sequence contrastive learning (CPC).
  • 54. Contrastive Representation Learning: attract positives, repel negatives.
  • 55. Contrastive Representation Learning: a reference sample, a positive sample, and negative samples.
  • 56. A formulation of contrastive learning. What we want: x is a reference sample, x+ a positive sample, and x- a negative sample. Given a chosen score function s, we aim to learn an encoder function f that yields high scores for positive pairs (x, x+) and low scores for negative pairs (x, x-).
  • 57. A formulation of contrastive learning. Loss function given 1 positive sample and N - 1 negative samples: a softmax cross-entropy over the scores.
  • 58. The loss normalizes the exponentiated positive score by the sum of exponentiated scores over all N samples.
  • 59. The numerator holds the score for the positive pair; the denominator adds the scores for the N - 1 negative pairs. This seems familiar...
  • 60. It is the cross-entropy loss for an N-way softmax classifier! I.e., learn to find the positive sample among the N samples.
  • 61. Commonly known as the InfoNCE loss (van den Oord et al., 2018): L = -E[log(exp(s(f(x), f(x+))) / (exp(s(f(x), f(x+))) + sum_{j=1}^{N-1} exp(s(f(x), f(x_j-)))))]. It is a lower bound on the mutual information between f(x) and f(x+): I(f(x); f(x+)) >= log(N) - L. The larger the negative sample size N, the tighter the bound. Detailed derivation: Poole et al., 2019.
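A minimal numerical sketch of the InfoNCE loss for one reference sample, taking precomputed scores as input (the score function itself is left abstract; the helper name is ours):

```python
import math

def info_nce_loss(pos_score, neg_scores):
    """-log softmax of the positive score against N-1 negative scores,
    i.e., cross-entropy for an N-way classifier whose correct class is
    the positive pair."""
    scores = [pos_score] + list(neg_scores)
    m = max(scores)  # subtract the max for numerical stability
    log_z = m + math.log(sum(math.exp(s - m) for s in scores))
    return -(pos_score - log_z)
```

With all scores equal, the loss is exactly log(N) (the classifier is at chance); pushing the positive score up drives the loss toward 0, which is what tightens the mutual-information bound.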
  • 62. SimCLR: A Simple Framework for Contrastive Learning. Cosine similarity as the score function: s(u, v) = u^T v / (||u|| ||v||). Use a projection network g(.) to project features to the space where contrastive learning is applied. Generate positive samples through data augmentation: random cropping, random color distortion, and random blur. (Source: Chen et al., 2020)
  • 63. SimCLR: generating positive samples from data augmentation. (Source: Chen et al., 2020)
  • 64. SimCLR. Generate a positive pair by sampling data augmentation functions. *We use a slightly different formulation in the assignment; you should follow the assignment instructions. (Source: Chen et al., 2020)
  • 65. SimCLR. InfoNCE loss: use all non-positive samples in the batch as x-. (Source: Chen et al., 2020)
  • 66. SimCLR. Iterate through and use each of the 2N samples as the reference, then compute the average loss. (Source: Chen et al., 2020)
  • 67. SimCLR: mini-batch training. Elements 2k and 2k + 1 form a positive pair; build the "affinity matrix" of pairwise scores from the list of positive pairs.
  • 68. SimCLR: mini-batch training. The list of positive pairs gives the classification label for each row of the affinity matrix.
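The mini-batch procedure above can be sketched in pure Python: build the cosine "affinity matrix" over 2N projected embeddings (pairs at indices 2k and 2k+1), mask the diagonal, and average the softmax cross-entropy over all rows. A minimal sketch under our own naming, which differs in details from both the paper's NT-Xent and the assignment formulation:

```python
import math

def cosine(u, v):
    """Cosine similarity of two nonzero vectors (plain float lists)."""
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return sum(a * b for a, b in zip(u, v)) / (nu * nv)

def simclr_loss(z, tau=0.5):
    """Average InfoNCE over 2N embeddings z, where (z[2k], z[2k+1]) are
    the two augmented views of sample k; each row's denominator uses all
    other samples in the batch (self excluded) as candidates."""
    n = len(z)
    total = 0.0
    for i in range(n):
        j = i + 1 if i % 2 == 0 else i - 1  # index of i's positive view
        sims = [cosine(z[i], z[k]) / tau for k in range(n) if k != i]
        pos = cosine(z[i], z[j]) / tau
        m = max(sims)
        log_z = m + math.log(sum(math.exp(s - m) for s in sims))
        total += -(pos - log_z)
    return total / n
```

A batch whose paired views already agree scores a much lower loss than a shuffled batch, which is the signal gradient descent exploits.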
  • 69. Training a linear classifier on SimCLR features. Train the feature encoder on ImageNet (entire training set) using SimCLR; freeze the feature encoder; train a linear classifier on top with labeled data. (Source: Chen et al., 2020)
  • 70. Semi-supervised learning on SimCLR features. Train the feature encoder on ImageNet (entire training set) using SimCLR; finetune the encoder with 1% / 10% of labeled data on ImageNet. (Source: Chen et al., 2020)
  • 71. SimCLR design choices: projection head. Linear / non-linear projection heads improve representation learning. A possible explanation: the contrastive objective may discard information useful for downstream tasks, and the representation space z is trained to be invariant to data transformations; by applying the projection head g(.), more information can be preserved in the h representation space. (Source: Chen et al., 2020)
  • 72. SimCLR design choices: large batch size. A large training batch size is crucial for SimCLR! Large batch sizes cause a large memory footprint during backpropagation, requiring distributed training on TPUs for the ImageNet experiments. (Source: Chen et al., 2020)
  • 73. Momentum Contrastive Learning (MoCo). Key differences to SimCLR: keep a running queue of keys (negative samples); compute gradients and update the encoder only through the queries; decouple the mini-batch size from the number of keys, so a large number of negative samples can be supported. (Source: He et al., 2020)
  • 74. The key encoder progresses slowly through the momentum update rule: theta_k <- m * theta_k + (1 - m) * theta_q. (Source: He et al., 2020)
  • 75. MoCo. Generate a positive pair by sampling data augmentation functions; no gradient flows through the positive sample (key). Use the running queue of keys as the negative samples; apply the InfoNCE loss; update f_k through the momentum rule; update the FIFO negative-sample queue. (Source: He et al., 2020)
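The two MoCo-specific pieces of that training loop, the momentum update and the FIFO key queue, can be sketched with parameters flattened into plain float lists; helper names are ours:

```python
from collections import deque

def momentum_update(theta_k, theta_q, m=0.999):
    """Key-encoder update theta_k <- m*theta_k + (1-m)*theta_q.
    Only the query encoder theta_q receives gradients; the key encoder
    trails it slowly, keeping the queued keys roughly consistent."""
    return [m * k + (1.0 - m) * q for k, q in zip(theta_k, theta_q)]

def enqueue_keys(queue, new_keys, max_size):
    """FIFO negative-sample queue: push the newest mini-batch of keys and
    evict the oldest entries until the queue fits max_size."""
    queue.extend(new_keys)
    while len(queue) > max_size:
        queue.popleft()
    return queue
```

Because negatives come from the queue rather than the current batch, the number of negatives (queue length) is decoupled from the mini-batch size, which is the core memory saving over SimCLR.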
  • 76. "MoCo V2": a hybrid of ideas from SimCLR and MoCo. From SimCLR: non-linear projection head and strong data augmentation. From MoCo: momentum-updated queue that allows training with a large number of negative samples (no TPU required!). (Source: Chen et al., 2020)
  • 77. MoCo vs. SimCLR vs. MoCo V2. Key takeaway: a non-linear projection head and strong data augmentation are crucial for contrastive learning. (Source: Chen et al., 2020)
  • 78. Decoupling the mini-batch size from the negative sample size allows MoCo V2 to outperform SimCLR with a smaller batch size (256 vs. 8192). (Source: Chen et al., 2020)
  • 79. ... all with a much smaller memory footprint! ("end-to-end" means SimCLR here.) (Source: Chen et al., 2020)
  • 80. Instance vs. Sequence Contrastive Learning. Instance-level contrastive learning: based on positive and negative instances; examples: SimCLR, MoCo. Sequence-level contrastive learning: based on sequential / temporal order; example: Contrastive Predictive Coding (CPC). (Source: van den Oord et al., 2018)
  • 81. Contrastive Predictive Coding (CPC). Contrastive: contrast between "right" and "wrong" sequences using contrastive learning. Predictive: the model has to predict future patterns given the current context. Coding: the model learns useful feature vectors, or "code", for downstream tasks, similar to other self-supervised methods. (Source: van den Oord et al., 2018)
  • 82. CPC step 1: encode all samples in a sequence into vectors z_t = g_enc(x_t).
  • 83. CPC step 2: summarize the context (e.g., half of a sequence) into a context code c_t using an auto-regressive model g_ar; the original paper uses a GRU-RNN here.
  • 84. CPC step 3: compute the InfoNCE loss between the context c_t and a future code z_{t+k} using the time-dependent score function s_k(c_t, z_{t+k}) = z_{t+k}^T W_k c_t, where W_k is a trainable matrix.
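The time-dependent bilinear score from step 3 is a one-liner on nested lists; a minimal sketch, with one W_k per prediction offset k assumed to be supplied by the caller:

```python
def cpc_score(z_future, W_k, c_t):
    """CPC score s_k(c_t, z_{t+k}) = z^T W_k c, where W_k maps the context
    dimension to the code dimension. z_future and c_t are float lists,
    W_k a list of rows (len(z_future) x len(c_t))."""
    Wc = [sum(w * c for w, c in zip(row, c_t)) for row in W_k]  # W_k @ c_t
    return sum(z * wc for z, wc in zip(z_future, Wc))           # z^T (W_k c_t)
```

These scores feed straight into the InfoNCE loss: the true future code z_{t+k} is the positive, and codes drawn from other times or other sequences are the negatives.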
  • 85. CPC example: modeling audio sequences. (Source: van den Oord et al., 2018)
  • 86. CPC example: modeling audio sequences. Linear classification on the trained representations (LibriSpeech dataset). (Source: van den Oord et al., 2018)
  • 87. CPC example: modeling visual context. Idea: split the image into patches and model rows of patches from top to bottom as a sequence, i.e., use the top rows as context to predict the bottom rows. (Source: van den Oord et al., 2018)
  • 88. CPC example: modeling visual context. Compares favorably with other pretext-task-based self-supervised learning methods, but doesn't do as well on image feature learning as newer instance-based contrastive methods. (Source: van den Oord et al., 2018)
  • 89. Summary: Contrastive Representation Learning. A general formulation for contrastive learning: the InfoNCE loss (van den Oord et al., 2018), an N-way classification among positive and negative samples, and a lower bound on the mutual information between f(x) and f(x+).
  • 90. SimCLR: a simple framework for contrastive representation learning. Key idea: a non-linear projection head to allow flexible representation learning. Simple to implement and effective in learning visual representations, but requires a large training batch size to be effective, with a large memory footprint.
  • 91. MoCo (v1, v2): contrastive learning using a momentum-updated sample encoder. Decouples the negative sample size from the mini-batch size, allowing large-negative-set training without TPUs. MoCo V2 combines the key ideas from SimCLR (non-linear projection head, strong data augmentation) with momentum contrastive learning.
  • 92. CPC: sequence-level contrastive learning. Contrasts a "right" sequence with "wrong" sequences; uses the InfoNCE loss with a time-dependent score function. Can be applied to a variety of learning problems, but is not as effective at learning image representations as instance-level methods.
  • 93. Other examples: CLIP (Contrastive Language-Image Pre-training, Radford et al., 2021), contrastive learning between images and natural-language sentences.
  • 94. Other examples: Dense Object Net (Florence et al., 2018), contrastive learning on pixel-wise feature descriptors.
  • 95. Other examples: Dense Object Net (Florence et al., 2018).
  • 97. Next time: Low-Level Vision.
  • 98. Today's Agenda. Pretext tasks from image transformations: rotation, inpainting, rearrangement, coloring. Contrastive representation learning: intuition and formulation; instance contrastive learning (SimCLR and MoCo); sequence contrastive learning (CPC). Frontier: Contrastive Language-Image Pre-training (CLIP).
  • 99. Frontier: Contrastive Language-Image Pre-training (CLIP).
  • 100. Self-Supervised Learning. General idea: pretend there is a part of the data you don't know and train the neural network to predict it. (Source: LeCun, 2019 keynote at ISSCC)
  • 101. "The Cake of Learning": learn good features through self-supervision with a feature extractor, then use them for downstream tasks. (Source: LeCun, 2019 keynote at ISSCC)
  • 102. Can we do better? SimCLR vs. Momentum Contrast (MoCo). (Source: Chen et al., 2020b)