2. Downside of Strong Supervision
- New Task -> New Dataset
- Hard to obtain annotation (e.g. medical domain)
- Availability of large pools of unlabeled data from YouTube, Facebook, etc.
- ~1B images are uploaded to Facebook every day
- 300 hours of video are uploaded to YouTube every minute
- Weakly Supervised, Semi-Supervised, Self-Supervised
3. Self-Supervision in Text Domain
Given - A large corpus of text
Task - Train a model which maps each word to a feature vector
Constraint - Words appearing in similar contexts should lie close together in the feature space
Example: Word2Vec
Image Credit: https://bit.ly/2FElezW
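The idea behind Word2Vec can be illustrated with a toy sketch (not the actual model): generating skip-gram (target, context) pairs, the raw training signal a model like Word2Vec learns from. The corpus and window size here are illustrative choices.

```python
# Toy sketch: extract skip-gram (target, context) pairs from a corpus.
# A Word2Vec-style model would then learn embeddings such that words
# sharing many context words end up close in feature space.
def skipgram_pairs(tokens, window=2):
    pairs = []
    for i, target in enumerate(tokens):
        # context = words within `window` positions of the target
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((target, tokens[j]))
    return pairs

corpus = "the cat sat on the mat".split()
pairs = skipgram_pairs(corpus, window=1)
```

Note that no labels are needed: the supervision comes entirely from which words co-occur in the raw text.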
4. What is Self-Supervision?
- A form of unsupervised learning where the data provides the supervision
- Define a proxy task/loss that forces the network to learn what we really want it to learn
Vondrick et al., “Tracking Emerges by Colorizing Videos”, ECCV’18
5. Unsupervised Visual Representation Learning by
Context Prediction
Doersch et al., ICCV’15
- Given two patches sampled from one of eight spatial configurations, without any other context
- The model should predict the position of one patch relative to the other
- To do well at this task, the network has to learn a good representation of scenes and objects
6. Unsupervised Visual Representation Learning by
Context Prediction
Doersch et al., ICCV’15
To generate training patches,
- sample the first patch uniformly from the image, without any reference to the image content
- given the position of the first patch, sample the second patch from one of the 8 possible neighbouring locations
- a Siamese-style network then classifies the pair into one of the 8 relative positions
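The patch-pair generation above can be sketched roughly as follows (a simplified version, not the authors' code; the patch size and numpy image layout are illustrative assumptions):

```python
import numpy as np

# Sketch of the Doersch et al. pretext task: sample an anchor patch
# uniformly, then one of its 8 neighbours; the neighbour index is the label.
OFFSETS = [(-1, -1), (-1, 0), (-1, 1), (0, -1), (0, 1), (1, -1), (1, 0), (1, 1)]

def sample_patch_pair(image, patch=32, rng=None):
    if rng is None:
        rng = np.random.default_rng()
    h, w = image.shape[:2]
    # anchor position chosen uniformly, leaving room for any neighbour
    y = rng.integers(patch, h - 2 * patch)
    x = rng.integers(patch, w - 2 * patch)
    label = rng.integers(8)                  # which of the 8 configurations
    dy, dx = OFFSETS[label]
    p1 = image[y:y + patch, x:x + patch]
    p2 = image[y + dy * patch:y + dy * patch + patch,
               x + dx * patch:x + dx * patch + patch]
    return p1, p2, int(label)
```

Each (p1, p2, label) triple is one training example for the 8-way classification head.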
7. Unsupervised Visual Representation Learning by
Context Prediction
Doersch et al., ICCV’15
To evaluate the learned feature representation,
- pre-train a CNN using self-supervision
- use this pre-trained CNN as the backbone of R-CNN
8. Unsupervised Visual Representation Learning by
Context Prediction
Doersch et al., ICCV’15
Avoiding trivial shortcuts:
- Low-level cues such as boundary patterns or texture continuing between patches can serve as a “shortcut”
- Include a gap between patches (not enough by itself)
- Randomly jitter each patch’s location
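A hedged sketch of how the gap and jitter could be applied when placing the second patch (the gap and jitter sizes here are illustrative choices, not necessarily the paper's exact values):

```python
import numpy as np

# Shortcut-avoidance sketch: space neighbouring patches by patch + gap,
# then jitter each position by a few pixels so boundary patterns and
# continuing textures no longer line up between the two patches.
def jittered_position(y, x, dy, dx, patch=32, gap=16, jitter=7, rng=None):
    if rng is None:
        rng = np.random.default_rng()
    stride = patch + gap                     # neighbour spacing includes the gap
    jy, jx = rng.integers(-jitter, jitter + 1, size=2)
    return y + dy * stride + int(jy), x + dx * stride + int(jx)
```

With both tricks, matching low-level cues at patch borders is far less informative, so the network is pushed toward higher-level scene understanding.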
15. Shuffle and Learn: Unsupervised Learning using
Temporal Order Verification
Misra et al., ECCV’16
- Learn a visual representation from raw spatio-temporal signals such as videos
- Formulate the problem as an unsupervised sequence-verification task
- Learn a powerful CNN without any semantic labels
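The verification task can be sketched as below (a simplified variant: the paper constructs its negatives somewhat differently, e.g. using frames from outside the window, which is omitted here):

```python
import random

# Sketch of temporal order verification: a triplet of frame indices in
# temporal order is a positive example; permuting it out of order makes
# a negative. A CNN is then trained to classify ordered vs. shuffled.
def make_triplet(n_frames, positive, rng=random):
    a, b, c = sorted(rng.sample(range(n_frames), 3))
    if positive:
        return (a, b, c), 1
    # negative: swap the last two frames so the order is wrong
    return (a, c, b), 0
```

The binary label comes for free from how the triplet was constructed, so no human annotation is needed.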
16. Shuffle and Learn: Unsupervised Learning using
Temporal Order Verification
Misra et al., ECCV’16
- Temporal windows with very little motion cause ambiguity about the frame order
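One simple way to filter out such ambiguous windows (an assumption on my part: the paper uses optical-flow magnitude, for which mean frame difference is a cheap stand-in here):

```python
import numpy as np

# Keep only temporal windows with enough motion: near-static windows
# give no signal about frame order, so they are discarded before
# building order-verification triplets.
def has_enough_motion(frames, threshold=5.0):
    frames = np.asarray(frames, dtype=np.float32)   # shape (T, H, W)
    diffs = np.abs(frames[1:] - frames[:-1]).mean(axis=(1, 2))
    return bool(diffs.mean() > threshold)
```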
17. Shuffle and Learn: Unsupervised Learning using
Temporal Order Verification
Misra et al., ECCV’16
18. Shuffle and Learn: Unsupervised Learning using
Temporal Order Verification
Misra et al., ECCV’16
19. Shuffle and Learn: Unsupervised Learning using
Temporal Order Verification
Misra et al., ECCV’16
20. Tracking Emerges by Colorizing Videos
Vondrick et al., ECCV’18
- Learn models for visual tracking from a large pool of unlabeled data
- Exploit the natural temporal coherence of colors
21. Tracking Emerges by Colorizing Videos
Vondrick et al., ECCV’18
- Colorize by Pointing
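The pointing mechanism can be sketched with plain numpy: each target pixel attends over reference-frame pixels via a softmax over embedding similarity and copies a weighted mix of the reference colours. The embeddings here are stand-ins for the CNN features the paper learns.

```python
import numpy as np

# Sketch of "colorize by pointing": predict the colour of each target
# pixel by softly pointing at similar pixels in a reference frame.
def point_and_colorize(ref_emb, tgt_emb, ref_colors, temperature=1.0):
    # ref_emb: (N, D), tgt_emb: (M, D), ref_colors: (N, C)
    sim = tgt_emb @ ref_emb.T / temperature          # (M, N) similarities
    sim = sim - sim.max(axis=1, keepdims=True)       # numerical stability
    attn = np.exp(sim)
    attn /= attn.sum(axis=1, keepdims=True)          # softmax "pointer"
    return attn @ ref_colors                         # predicted colours (M, C)
```

Because the attention weights say where each pixel's colour came from, the same pointer doubles as a tracker at test time: propagate an object mask instead of colours.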
22. Tracking Emerges by Colorizing Videos
Vondrick et al., ECCV’18
- Learn models for visual tracking from a large pool of unlabeled data
- Exploit the natural temporal coherence of colors
23. Tracking Emerges by Colorizing Videos
Vondrick et al., ECCV’18
- Colorize by Pointing
24. Tracking Emerges by Colorizing Videos
Vondrick et al., ECCV’18
- Colorize by Pointing
27. Summary
- It is important to select informative data
- If the task is too easy, the network will not learn anything
- Choose a proxy task that encourages the network to learn feature representations useful for the target task
- Consider how the network might solve the task by cheating, and design the task to prevent it