15/05/2021
Multi-modal self-supervised
learning from videos
Adrià Recasens Continente
DeepMind
We learn from the world through multimodal experience
[...] towards the root and try to get as close to the root as possible, nice long strokes [...]
Success of supervised learning
Pose estimation
[Towards Accurate Multi-person Pose Estimation in
the Wild, Papandreou, Zhu, Kanazawa, Toshev,
Tompson, Bregler and Murphy, CVPR17]
Image Segmentation
[Mask R-CNN, He, Gkioxari, Dollár, and Girshick, ICCV17]
Supervised learning
Labels are expensive. Annotator agreement is also hard: how is each label defined, and at what granularity?
Supervised learning
Labels are expensive, and even more problematic for videos.
Self-supervised learning
Vision: SimCLR (Chen et al., 2020); MoCo (He et al., 2020)
Vision+Language: MIL-NCE (Miech, Alayrac et al., 2020); VideoBERT (Sun et al., 2019)
Vision+Audio: XDC (Alwassel et al., 2020); L3 (Arandjelovic and Zisserman, 2017); GDT (Patrick et al., 2020); DaveNet (Harwath et al., 2018); Sound of Pixels (Zhao et al., 2018)
Outline of the talk
01
Multimodal Versatile Networks
Motivation
MMV Model
Versatility checklist
Video Network Deflation
Potential applications
02
BraVe: Broaden your views for
self-supervised learning
Narrow and broad views
Main idea
Motivation
Research questions
Evaluation
01 Multi-modal Versatile Networks
Motivation
Research questions:
Are three modalities better than two for downstream tasks?
Are there natural requirements for such a multimodal network?
Self-supervised learning on modalities naturally present in videos:
Vision, Audio and Language
Main Idea
Positive pairs: modalities that co-occur in the same video, e.g., Video 1 with "Play the guitar" and Video 2 with "Cut the onion".
Negative pairs: modalities taken from different videos.
This is an "old" idea: DeViSE (Frome et al., NeurIPS 2013) and WSABIE (Weston et al., IJCAI 2011). A toy sketch of the resulting contrastive objective follows.
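A minimal NumPy sketch of the NCE-style objective this slide implies: matching (video, narration) pairs are pulled together, cross-pairs are pushed apart. Embedding dimensions and function names here are illustrative, not the MMV implementation (which, e.g., treats multiple neighbouring narrations as positives, MIL-NCE style).

```python
import numpy as np

def nce_loss(video_emb, text_emb, temperature=0.07):
    """Contrastive loss over a batch: video_emb[i] and text_emb[i] come from
    the same video (positive pair); all other cross-terms are negatives."""
    # L2-normalise so the dot product is a cosine similarity.
    v = video_emb / np.linalg.norm(video_emb, axis=1, keepdims=True)
    t = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    logits = v @ t.T / temperature                       # (batch, batch) similarities
    # Softmax cross-entropy with the diagonal (matching pairs) as targets.
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_probs))

# Toy usage with two videos and their narrations.
rng = np.random.default_rng(0)
video_emb = rng.normal(size=(2, 128))  # e.g., visual features of Video 1 and Video 2
text_emb = rng.normal(size=(2, 128))   # e.g., features of "Play the guitar" / "Cut the onion"
print(nce_loss(video_emb, text_emb))
```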
Which pretraining datasets?
HowTo100M: 1M videos, 100M clips, 20K tasks; text obtained from ASR.
[HowTo100M: Learning a Text-Video Embedding by Watching Hundred Million Narrated Video Clips, Miech, Zhukov, Alayrac et al., ICCV19]
AudioSet: 2M videos (with audio tracks); we do not extract text for this dataset.
[Audio Set: An ontology and human-labeled dataset for audio events, Gemmeke et al., ICASSP 2017]
Versatility checklist
1. Ingest any modality: takes as input any of the three modalities.
2. Specificity: respects the specificity of modalities.
3. Compare modalities: enables the different modalities to be easily compared.
4. Transfer to images: efficiently applicable to visual data in the form of videos or images.
Embedding graph design: Fine and Coarse
Intuition: audio is more fine-grained (e.g., many different guitar sounds) whereas text is coarser (a single word for "guitar") ⇒ the Fine and Coarse (FAC) design:
✓ enables the different modalities to be easily compared
✓ has the best results in several downstream tasks
✓ respects the specificity of modalities
Fine space: shared by vision and audio. Coarse space: shared by vision, audio and text.
[Self-supervised Multi-Modal Versatile Networks, NeurIPS 2020]
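A small sketch of how the fine and coarse spaces could be wired up. The va/vat naming follows the slides, but the layer sizes and single linear heads are illustrative stand-ins for the learned MLP projection heads in MMV.

```python
import numpy as np

rng = np.random.default_rng(0)
relu = lambda x: np.maximum(x, 0.0)

# Illustrative projection matrices (learned MLP heads in the real model).
W_v_to_va = rng.normal(size=(512, 256))    # vision -> fine (visual-audio) space
W_a_to_va = rng.normal(size=(512, 256))    # audio  -> fine space
W_va_to_vat = rng.normal(size=(256, 256))  # fine   -> coarse (visual-audio-text) space
W_t_to_vat = rng.normal(size=(300, 256))   # text   -> coarse space directly

vision_feat = rng.normal(size=(512,))      # backbone outputs (dimensions illustrative)
audio_feat = rng.normal(size=(512,))
text_feat = rng.normal(size=(300,))

# Fine space: vision and audio are compared here, at full granularity.
v_fine = relu(vision_feat @ W_v_to_va)
a_fine = relu(audio_feat @ W_a_to_va)

# Coarse space: vision and audio reach it through the va->vat head; text enters
# directly, so the coarser text modality never constrains the fine space.
v_coarse = v_fine @ W_va_to_vat
a_coarse = a_fine @ W_va_to_vat
t_coarse = text_feat @ W_t_to_vat
```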
Do more modalities help?
State-of-the-art comparison
Versatility checklist (revisited)
Requirements 1–3 (ingest any modality, specificity, compare modalities) are met by the FAC design; requirement 4, transfer to images, is addressed next by network deflation.
Network Deflation
Motivation: most prior work learns from images first and then transfers the models to video.
Goal: we train our models on video and apply them efficiently to image inputs.
A standard solution: inflate the input (repeat the image over time and run the video network). Proposed solution: deflate the video network into an image network.
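The two options can be contrasted in a few lines. This is a rough sketch for a single 3D convolution kernel; in the paper the deflated parameters are additionally trained so the image network matches the video network's outputs, which this sketch omits.

```python
import numpy as np

rng = np.random.default_rng(0)
T = 8  # number of frames the video network expects

# A 3D convolution kernel: (time, height, width, in_channels, out_channels).
kernel3d = rng.normal(size=(3, 3, 3, 16, 32))
image = rng.normal(size=(224, 224, 16))  # a single image's feature map

# Option 1: inflated input. Repeat the image T times along the temporal axis
# and run the unchanged video network -- correct but wasteful.
video_like = np.repeat(image[None], T, axis=0)  # shape (T, 224, 224, 16)

# Option 2: deflated network. For a constant-in-time input, summing the kernel
# over its temporal axis gives an equivalent 2D kernel (up to boundary effects),
# so the image can be processed by a cheap 2D convolution instead.
kernel2d = kernel3d.sum(axis=0)  # shape (3, 3, 16, 32)
```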
Multimodal Versatile Networks: Potential Applications
Audio to video retrieval
Given an audio query, the top three retrieved videos (Rank 1, Rank 2, Rank 3) are shown as frames in the slides.
Text to video retrieval
Input text: "add fresh chopped tomatoes and stir" → top three retrieved videos (Rank 1, Rank 2, Rank 3) shown in the slides.
Input text: "pour some oil into a hot pan" → top three retrieved videos (Rank 1, Rank 2, Rank 3) shown in the slides.
Text to audio retrieval in the coarse space
Even though the link between audio and text was never explicit during training, we can use the FAC architecture to perform text to audio retrieval:
1. The audio samples are first embedded in the joint visual-audio (fine) space, using the ResNet50 audio backbone.
2. The va→vat projection head is then used to project the audio embeddings into the joint visual-audio-text (coarse) space.
3. Given a text input query, we simply embed it into the joint space and retrieve the closest audio embedding. A sketch of these steps follows.
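Putting the three steps together, a small sketch of the retrieval procedure using the same illustrative stand-in projections as earlier; the real pipeline uses the trained MMV heads and backbone features.

```python
import numpy as np

rng = np.random.default_rng(0)

def normalize(x):
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

# Step 1: audio backbone outputs, already embedded in the fine (va) space.
audio_fine = rng.normal(size=(1000, 256))   # 1000 candidate audio clips

# Step 2: project the audio embeddings into the coarse (vat) space via va->vat.
W_va_to_vat = rng.normal(size=(256, 256))   # stands in for the trained head
audio_coarse = normalize(audio_fine @ W_va_to_vat)

# Step 3: embed the text query in the coarse space and take the nearest
# neighbour by cosine similarity.
query_coarse = normalize(rng.normal(size=(256,)))  # e.g., embedding of "airplane"
scores = audio_coarse @ query_coarse
print("Rank 1 audio clip:", int(np.argmax(scores)))
```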
Example queries: for the input texts "airplane" and "chirping bird", the Rank 1 retrieved audio clip is shown in the slides.
Resources
Pretrained models available on TF-Hub: [S3D] [TSM-RN] [TSM-RNx2]
Models in JAX with an action recognition downstream task!
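For reference, loading one of the released checkpoints looks roughly like this; the handle and signature name below are placeholders, so follow the TF-Hub links above for the exact usage.

```python
import tensorflow as tf
import tensorflow_hub as hub

# Hypothetical handle: substitute the real [S3D] / [TSM-RN] / [TSM-RNx2] links above.
module = hub.load("https://tfhub.dev/deepmind/mmv/s3d/1")

# A batch of RGB clips in [0, 1]: (batch, frames, height, width, channels).
clip = tf.random.uniform((1, 32, 224, 224, 3))
embedding = module.signatures["video"](clip)  # signature name is an assumption
```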
However...
Most available videos do not contain narrations.
Using negatives for self-supervision is expensive, as it requires training with large batch sizes.
Our training misses larger context, as the views of the data cover at most 3 seconds.
Outline of the talk
02 BraVe: Broaden your views for self-supervised learning
Narrow and broad views
Main idea
Motivation
Research questions
Evaluation
Main Idea
A narrow view of the video (a clip of a few seconds) is trained to predict the representation of a broad view covering the whole clip.
Motivation
Goal: learn good representations by regressing a representation of a broad view of the video.
BraVe learns strong video representations because the narrow view needs to predict the representation of the whole video clip (the broad view).
We use separate backbones to process the two views, as they perform different tasks. This also enables using different augmentations and modalities in each view.
Optical flow, or alternative representations of the video, can provide a strong signal for learning. A schematic of the objective follows.
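A schematic of the BraVe objective in a few lines of NumPy. The backbones and predictor are stand-ins; the real model adds projector heads, a symmetric broad-to-narrow loss, and optionally flow or audio in the broad view.

```python
import numpy as np

rng = np.random.default_rng(0)

# Separate (stand-in) backbones for the two views, plus a predictor head.
W_narrow = rng.normal(size=(512, 128))  # processes the short, augmented clip
W_broad = rng.normal(size=(512, 128))   # processes the long clip (or its flow)
W_pred = rng.normal(size=(128, 128))    # predicts the broad representation

narrow_view = rng.normal(size=(4, 512))  # features of ~1s narrow views (batch of 4)
broad_view = rng.normal(size=(4, 512))   # features of the corresponding broad views

def normalize(x):
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

# Regression objective: the narrow view predicts the broad representation.
# No negatives are needed, so no large batches are required.
pred = normalize(np.tanh(narrow_view @ W_narrow) @ W_pred)
target = normalize(np.tanh(broad_view @ W_broad))
loss = np.mean(np.sum((pred - target) ** 2, axis=-1))
```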
Research Questions
1. Importance of the broad view
2. Modality in the broad view
3. Weight sharing across views
4. Syncing the narrow and broad views
[Broaden Your Views for Self-Supervised Video Learning. Adrià Recasens, Pauline Luc, Jean-Baptiste Alayrac, Luyu Wang, Florian Strub, Corentin Tallec, Mateusz Malinowski, Viorica Patraucean, Florent Altché, Michal Valko, Jean-Bastien Grill, Aäron van den Oord, Andrew Zisserman. arXiv 2021.]
Comparison to SoTA: video-only models
Comparison to SoTA: audio-visual models
Conclusions
Videos are a rich source of self-supervision for video, audio and image models.
Both MMV and BraVe achieve SoTA results for self-supervised learning on several downstream tasks.
Audio, text and larger video context are all useful self-supervisory signals.
Thank you!
