Self-Supervised Learning: Recent Trends
Downside of Strong Supervision
- Every new task requires a new labeled dataset
- Annotations are hard to obtain (e.g. in the medical domain)
- Large pools of unlabeled data are available from YouTube, Facebook, etc.
  - 1B images are uploaded to Facebook every day
  - 300 hours of video are uploaded to YouTube every minute
- Alternatives: weakly supervised, semi-supervised, and self-supervised learning
Self-Supervision in Text Domain
Given - A large corpus of text
Task - Train a model which maps each word to a feature vector
Constraint - Words that occur in similar contexts should lie close together in the feature space
Example: Word2Vec
Image Credit: https://bit.ly/2FElezW
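The idea that the text itself supplies the labels can be made concrete with a minimal sketch of how skip-gram style training pairs might be generated. This is not the Word2Vec implementation; the function name `skipgram_pairs` and the `window` parameter are hypothetical names chosen for illustration.

```python
def skipgram_pairs(tokens, window=2):
    """Return (center, context) pairs for every token in `tokens`.

    Each pair is a free training example: the context word acts as the
    label for the center word, so no human annotation is needed.
    """
    pairs = []
    for i, center in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs


corpus = "the cat sat on the mat".split()
pairs = skipgram_pairs(corpus, window=1)
print(pairs[:3])
```

A model trained to predict the context word from the center word is pushed to give similar vectors to words that share contexts, which is the constraint stated above.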
What is Self-Supervision?
- A form of unsupervised learning where the data provides the supervision
- Define a proxy task and loss whose solution forces the network to learn the representation we actually want
Vondrick et al., “Tracking emerges by colorizing videos”, ECCV’18
Unsupervised Visual Representation Learning by
Context Prediction
Doersch et al., ICCV’15
- The model is given two patches in one of eight possible spatial configurations, without any surrounding context
- It must predict the position of one patch relative to the other
- To do well at this task, the network has to learn a good representation of scenes and objects
To generate training patches:
- Sample the first patch uniformly from the image, without any reference to the image content
- Given the position of the first patch, sample the second patch from one of the 8 neighboring locations
- Train a Siamese-style network to classify which of the 8 locations was used
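The sampling procedure above can be sketched as follows. This is a minimal illustration that works on box coordinates only, not the authors' code; `sample_patch_pair`, `NEIGHBOUR_OFFSETS`, and the default patch size are assumptions made for the example.

```python
import random

# (row, col) offsets of the 8 neighbours in a 3x3 grid; the list index
# is the class label the network would be trained to predict.
NEIGHBOUR_OFFSETS = [(-1, -1), (-1, 0), (-1, 1),
                     (0, -1),           (0, 1),
                     (1, -1),  (1, 0),  (1, 1)]


def sample_patch_pair(height, width, patch=96, rng=random):
    """Return (first_box, second_box, label); boxes are (top, left, size).

    Assumes the image is at least 3 * patch pixels in each dimension.
    """
    # Keep one full patch of margin so every neighbour fits in the image.
    top = rng.randint(patch, height - 2 * patch)
    left = rng.randint(patch, width - 2 * patch)
    label = rng.randrange(8)
    d_row, d_col = NEIGHBOUR_OFFSETS[label]
    second = (top + d_row * patch, left + d_col * patch, patch)
    return (top, left, patch), second, label


first, second, label = sample_patch_pair(300, 400)
print(first, second, label)
```

The label comes for free from the sampling step, which is what makes the task self-supervised.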
To evaluate the learned feature representation:
- Pre-train a CNN with the self-supervised task
- Use the pre-trained CNN as the backbone of R-CNN for object detection
Avoiding trivial shortcuts:
- Low-level cues, such as boundary patterns or textures that continue across patches, can serve as a “shortcut”
- Leave a gap between patches (not sufficient on its own)
- Randomly jitter each patch location
- Keep only one of the color channels, so the network cannot use chromatic aberration to infer where a patch came from
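A rough sketch of how the gap, the jitter, and the single-channel trick might look in code. The function names and the default values for `gap` and `jitter` are illustrative assumptions, not values taken from the slides, and zeroing the dropped channels is a simplification chosen for this example.

```python
import random


def second_patch_origin(top, left, offset, patch=96, gap=48, jitter=7, rng=random):
    """Top-left corner of the neighbouring patch, with a gap and jitter."""
    d_row, d_col = offset                   # e.g. (0, 1) means "to the right"
    step = patch + gap                      # the gap removes shared boundaries
    j_row = rng.randint(-jitter, jitter)    # jitter breaks exact grid alignment
    j_col = rng.randint(-jitter, jitter)
    return top + d_row * step + j_row, left + d_col * step + j_col


def keep_one_channel(pixels, rng=random):
    """Keep one randomly chosen channel for a whole patch of (r, g, b) pixels.

    The other two channels are set to 0 here (a simplifying assumption).
    """
    keep = rng.randrange(3)
    return [tuple(v if i == keep else 0 for i, v in enumerate(p)) for p in pixels]


print(second_patch_origin(200, 200, (0, 1)))
print(keep_one_channel([(10, 20, 30), (40, 50, 60)]))
```

The channel is chosen once per patch rather than per pixel, so the surviving channel still forms a consistent image.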
Context Encoders: Feature Learning by
Inpainting
Pathak et al., CVPR’16
Image Credit: https://bit.ly/2sB4gdS
Shuffle and Learn: Unsupervised Learning using
Temporal Order Verification
Misra et al., ECCV’16
- Learn a visual representation from raw spatiotemporal signals such as videos
- Formulate the task as unsupervised sequence verification: decide whether a sequence of frames is in the correct temporal order
- Learn a powerful CNN without any semantic labels
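The verification task can be sketched as a tuple-sampling routine that emits one correctly ordered and one shuffled example per call. This is a simplified illustration under stated assumptions: `order_verification_examples` is a hypothetical name, and the three-frame tuple with an out-of-span middle frame is one plausible way to build negatives.

```python
import random


def order_verification_examples(frame_ids, rng=random):
    """Build one positive and one negative 3-frame tuple from a video.

    frame_ids: list of at least 5 distinct frame indices.
    Returns ((frames, 1), (frames, 0)); label 1 means "in temporal order".
    """
    a, b, c, d, e = sorted(rng.sample(frame_ids, 5))
    positive = ((b, c, d), 1)           # true temporal order
    outside = rng.choice([a, e])        # a frame from outside the b..d span
    negative = ((b, outside, d), 0)     # middle frame is out of order
    return positive, negative


pos, neg = order_verification_examples(list(range(30)))
print(pos, neg)
```

Both examples share their first and last frames, so the network has to reason about the middle frame rather than rely on the endpoints.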
- A temporal window with very little motion makes the frame order ambiguous, so frames should be sampled from windows with high motion
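One way to avoid low-motion windows is to sample window positions in proportion to how much the frames change. The sketch below uses the mean absolute difference between consecutive frames as a stand-in for a real motion estimate such as optical flow; that substitution, the function names, and the flat-list frame format are all assumptions for illustration.

```python
import random


def frame_motion(prev, curr):
    """Mean absolute pixel difference between two frames (flat lists)."""
    return sum(abs(p - q) for p, q in zip(prev, curr)) / len(curr)


def sample_high_motion_start(frames, window=5, rng=random):
    """Pick a window start index with probability proportional to motion."""
    starts = list(range(len(frames) - window + 1))
    weights = [
        sum(frame_motion(frames[i], frames[i + 1]) for i in range(s, s + window - 1))
        for s in starts
    ]
    if sum(weights) == 0:               # completely static video: fall back to uniform
        return rng.choice(starts)
    return rng.choices(starts, weights=weights, k=1)[0]


# Six static frames followed by six frames that keep changing.
frames = [[0] * 4 for _ in range(6)] + [[i * 10] * 4 for i in range(1, 7)]
print(sample_high_motion_start(frames))
```

Windows that contain only static frames get zero weight and are never selected unless the whole video is static.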
Tracking Emerges by Colorizing Videos
Vondrick et al., ECCV’18
- Learn models for visual tracking from a large pool of unlabeled data
- Exploit the natural temporal coherence of colors
- Colorize by pointing: the model colorizes a target frame by copying colors from a reference frame instead of predicting them directly
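The pointing mechanism can be read as soft attention over reference pixels: each target pixel compares its feature with every reference feature and copies a similarity-weighted mix of the reference colors. The sketch below is a plain-Python illustration under that reading; the function names are hypothetical, and using raw color vectors (rather than any particular color encoding) is a simplifying assumption.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def colorize_by_pointing(ref_feats, ref_colors, tgt_feats):
    """For each target pixel, copy a weighted mix of reference colors.

    Weights are a softmax over dot-product similarities between the
    target pixel's feature and every reference pixel's feature.
    """
    n_channels = len(ref_colors[0])
    out = []
    for f in tgt_feats:
        sims = [sum(a * b for a, b in zip(f, g)) for g in ref_feats]
        weights = softmax(sims)
        out.append([
            sum(w * c[k] for w, c in zip(weights, ref_colors))
            for k in range(n_channels)
        ])
    return out


ref_feats = [[10.0, 0.0], [0.0, 10.0]]
ref_colors = [[1.0, 0.0, 0.0], [0.0, 0.0, 1.0]]
result = colorize_by_pointing(ref_feats, ref_colors, [[10.0, 0.0]])
print(result)
```

Because the weights say which reference pixel each target pixel points to, the same weights can be reused at test time to copy labels other than color, such as a segmentation mask, which is how tracking emerges.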
Summary
- It is important to select informative data
- If the data is too easy, the network will not learn anything useful
- Choose a proxy task that encourages the network to learn feature representations useful for the target task
- Consider how the network might cheat, and design the task so that it can only be solved without shortcuts
Questions?