2. Downside of Strong Supervision
- New Task -> New Dataset
- Hard to obtain annotation (e.g. medical domain)
- Availability of large pools of unlabeled data from YouTube, Facebook, etc.
- ~1B images are uploaded to Facebook every day
- 300 hours of video are uploaded to YouTube every minute
- Weakly Supervised, Semi-Supervised, Self-Supervised
3. Self-Supervision in Text Domain
Given - A large corpus of text
Task - Train a model which maps each word to a feature vector
Constraint - Words appearing in similar contexts should lie close together in the feature space
Example: Word2Vec
Image Credit: https://bit.ly/2FElezW
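The idea behind Word2Vec can be illustrated with a toy sketch (not the actual model): generating skip-gram (target, context) pairs, the raw training signal a model like Word2Vec learns from. The corpus and window size here are illustrative choices.

```python
# Toy sketch: extract skip-gram (target, context) pairs from a corpus.
# A Word2Vec-style model would then learn embeddings such that words
# sharing many context words end up close in feature space.
def skipgram_pairs(tokens, window=2):
    pairs = []
    for i, target in enumerate(tokens):
        # context = words within `window` positions of the target
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((target, tokens[j]))
    return pairs

corpus = "the cat sat on the mat".split()
pairs = skipgram_pairs(corpus, window=1)
```

Note that no labels are needed: the supervision comes entirely from which words co-occur in the raw text.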
4. What is Self-Supervision?
- A form of unsupervised learning where the data provides the supervision
- Define a proxy task/loss that forces the network to learn what we really want it to learn
Vondrick et al., “Tracking Emerges by Colorizing Videos”, ECCV’18
5. Unsupervised Visual Representation Learning by
Context Prediction
Doersch et al., ICCV’15
- Given two patches sampled from one of eight spatial configurations, without any other context
- The model should predict the position of one patch relative to the other
- To do well at this task, the network has to learn a good representation of scenes and objects
6. Unsupervised Visual Representation Learning by
Context Prediction
Doersch et al., ICCV’15
To generate training patches,
- sample the first patch uniformly from the image, without any reference to the image content
- given the position of the first patch, sample the second patch from one of the 8 possible neighbouring locations
- a Siamese-style network then classifies the pair into one of the 8 relative positions
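The patch-pair generation above can be sketched roughly as follows (a simplified version, not the authors' code; the patch size and numpy image layout are illustrative assumptions):

```python
import numpy as np

# Sketch of the Doersch et al. pretext task: sample an anchor patch
# uniformly, then one of its 8 neighbours; the neighbour index is the label.
OFFSETS = [(-1, -1), (-1, 0), (-1, 1), (0, -1), (0, 1), (1, -1), (1, 0), (1, 1)]

def sample_patch_pair(image, patch=32, rng=None):
    if rng is None:
        rng = np.random.default_rng()
    h, w = image.shape[:2]
    # anchor position chosen uniformly, leaving room for any neighbour
    y = rng.integers(patch, h - 2 * patch)
    x = rng.integers(patch, w - 2 * patch)
    label = rng.integers(8)                  # which of the 8 configurations
    dy, dx = OFFSETS[label]
    p1 = image[y:y + patch, x:x + patch]
    p2 = image[y + dy * patch:y + dy * patch + patch,
               x + dx * patch:x + dx * patch + patch]
    return p1, p2, int(label)
```

Each (p1, p2, label) triple is one training example for the 8-way classification head.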
7. Unsupervised Visual Representation Learning by
Context Prediction
Doersch et al., ICCV’15
To evaluate the learned feature representation,
- pre-train a CNN using self-supervision
- use this pre-trained CNN as the backbone of R-CNN
8. Unsupervised Visual Representation Learning by
Context Prediction
Doersch et al., ICCV’15
Avoiding trivial shortcuts:
- Low-level cues such as boundary patterns or texture continuing between patches can serve as a “shortcut”
- Include a gap between patches (not enough by itself)
- Randomly jitter each patch’s location
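A hedged sketch of how the gap and jitter could be applied when placing the second patch (the gap and jitter sizes here are illustrative choices, not necessarily the paper's exact values):

```python
import numpy as np

# Shortcut-avoidance sketch: space neighbouring patches by patch + gap,
# then jitter each position by a few pixels so boundary patterns and
# continuing textures no longer line up between the two patches.
def jittered_position(y, x, dy, dx, patch=32, gap=16, jitter=7, rng=None):
    if rng is None:
        rng = np.random.default_rng()
    stride = patch + gap                     # neighbour spacing includes the gap
    jy, jx = rng.integers(-jitter, jitter + 1, size=2)
    return y + dy * stride + int(jy), x + dx * stride + int(jx)
```

With both tricks, matching low-level cues at patch borders is far less informative, so the network is pushed toward higher-level scene understanding.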
15. Shuffle and Learn: Unsupervised Learning using
Temporal Order Verification
Misra et al., ECCV’16
- Learn a visual representation from raw spatio-temporal signals such as videos
- Formulate the problem as an unsupervised sequence-verification task
- Learn a powerful CNN without any semantic labels
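The verification task can be sketched as below (a simplified variant: the paper constructs its negatives somewhat differently, e.g. using frames from outside the window, which is omitted here):

```python
import random

# Sketch of temporal order verification: a triplet of frame indices in
# temporal order is a positive example; permuting it out of order makes
# a negative. A CNN is then trained to classify ordered vs. shuffled.
def make_triplet(n_frames, positive, rng=random):
    a, b, c = sorted(rng.sample(range(n_frames), 3))
    if positive:
        return (a, b, c), 1
    # negative: swap the last two frames so the order is wrong
    return (a, c, b), 0
```

The binary label comes for free from how the triplet was constructed, so no human annotation is needed.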
16. Shuffle and Learn: Unsupervised Learning using
Temporal Order Verification
Misra et al., ECCV’16
- Temporal windows with very little motion cause ambiguity about the frame order
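One simple way to filter out such ambiguous windows (an assumption on my part: the paper uses optical-flow magnitude, for which mean frame difference is a cheap stand-in here):

```python
import numpy as np

# Keep only temporal windows with enough motion: near-static windows
# give no signal about frame order, so they are discarded before
# building order-verification triplets.
def has_enough_motion(frames, threshold=5.0):
    frames = np.asarray(frames, dtype=np.float32)   # shape (T, H, W)
    diffs = np.abs(frames[1:] - frames[:-1]).mean(axis=(1, 2))
    return bool(diffs.mean() > threshold)
```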
17. Shuffle and Learn: Unsupervised Learning using
Temporal Order Verification
Misra et al., ECCV’16
18. Shuffle and Learn: Unsupervised Learning using
Temporal Order Verification
Misra et al., ECCV’16
19. Shuffle and Learn: Unsupervised Learning using
Temporal Order Verification
Misra et al., ECCV’16
20. Tracking Emerges by Colorizing Videos
Vondrick et al., ECCV’18
- Learn models for visual tracking from a large pool of unlabeled data
- Exploit the natural temporal coherence of colors
21. Tracking Emerges by Colorizing Videos
Vondrick et al., ECCV’18
- Colorize by Pointing
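The pointing mechanism can be sketched with plain numpy: each target pixel attends over reference-frame pixels via a softmax over embedding similarity and copies a weighted mix of the reference colours. The embeddings here are stand-ins for the CNN features the paper learns.

```python
import numpy as np

# Sketch of "colorize by pointing": predict the colour of each target
# pixel by softly pointing at similar pixels in a reference frame.
def point_and_colorize(ref_emb, tgt_emb, ref_colors, temperature=1.0):
    # ref_emb: (N, D), tgt_emb: (M, D), ref_colors: (N, C)
    sim = tgt_emb @ ref_emb.T / temperature          # (M, N) similarities
    sim = sim - sim.max(axis=1, keepdims=True)       # numerical stability
    attn = np.exp(sim)
    attn /= attn.sum(axis=1, keepdims=True)          # softmax "pointer"
    return attn @ ref_colors                         # predicted colours (M, C)
```

Because the attention weights say where each pixel's colour came from, the same pointer doubles as a tracker at test time: propagate an object mask instead of colours.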
22. Tracking Emerges by Colorizing Videos
Vondrick et al., ECCV’18
- Learn models for visual tracking from a large pool of unlabeled data
- Exploit the natural temporal coherence of colors
23. Tracking Emerges by Colorizing Videos
Vondrick et al., ECCV’18
- Colorize by Pointing
24. Tracking Emerges by Colorizing Videos
Vondrick et al., ECCV’18
- Colorize by Pointing
27. Summary
- It is important to select informative data
- If the task is too easy, the network will not learn anything
- Choose a proxy task that encourages the network to learn feature representations useful for the target task
- Consider how the network might solve the task by cheating, and design the task to prevent it