15/05/2021
Multi-modal self-supervised
learning from videos
Adrià Recasens Continente
DeepMind
We learn from the world through multimodal experience
[...] towards the root and try to get as close to the root as possible, nice long strokes [...]
Success of supervised learning
Pose estimation
[Towards Accurate Multi-person Pose Estimation in
the Wild, Papandreou, Zhu, Kanazawa, Toshev,
Tompson, Bregler and Murphy, CVPR17]
Image Segmentation
[Mask R-CNN, He, Gkioxari, Dollár, and Girshick, ICCV17]
Supervised learning
Labels are expensive. Annotator agreement is also hard: how is each label defined, and at what granularity?
Supervised learning
Labels are expensive, and even more problematic for videos.
Self-supervised learning
Vision: SimCLR (Chen et al., 2020); MoCo (He et al., 2020)
Vision+Language: MIL-NCE (Miech, Alayrac et al., 2020); VideoBERT (Sun et al., 2019)
Vision+Audio: XDC (Alwassel et al., 2020); L3 (Arandjelovic and Zisserman, 2017); GDT (Patrick et al., 2020); DaveNet (Harwath et al., 2018); Sound of Pixels (Zhao et al., 2018)
Outline of the talk
01
Multimodal Versatile Networks
Motivation
MMV Model
Versatility checklist
Video Network Deflation
Potential applications
02
BraVe: Broaden your views for
self-supervised learning
Narrow and broad views
Main idea
Motivation
Research questions
Evaluation
01 Multi-modal Versatile Networks
Motivation
Research questions:
Are three modalities better than two for downstream tasks?
Are there natural requirements for such a multimodal network?
Self-supervised learning on modalities naturally present in videos:
Vision, Audio and Language
Main Idea
Positive pairs: modalities that co-occur in the same video, e.g., Video 1 with "Play the guitar" and Video 2 with "Cut the onion".
Negative pairs: modalities taken from different videos.
This is an "old" idea: DeViSE (Frome et al., NeurIPS 2013) and WSABIE (Weston et al., IJCAI 2011). A toy sketch of the resulting contrastive objective follows.
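A minimal NumPy sketch of the NCE-style objective this slide implies: matching (video, narration) pairs are pulled together, cross-pairs are pushed apart. Embedding dimensions and function names here are illustrative, not the MMV implementation (which, e.g., treats multiple neighbouring narrations as positives, MIL-NCE style).

```python
import numpy as np

def nce_loss(video_emb, text_emb, temperature=0.07):
    """Contrastive loss over a batch: video_emb[i] and text_emb[i] come from
    the same video (positive pair); all other cross-terms are negatives."""
    # L2-normalise so the dot product is a cosine similarity.
    v = video_emb / np.linalg.norm(video_emb, axis=1, keepdims=True)
    t = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    logits = v @ t.T / temperature                       # (batch, batch) similarities
    # Softmax cross-entropy with the diagonal (matching pairs) as targets.
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_probs))

# Toy usage with two videos and their narrations.
rng = np.random.default_rng(0)
video_emb = rng.normal(size=(2, 128))  # e.g., visual features of Video 1 and Video 2
text_emb = rng.normal(size=(2, 128))   # e.g., features of "Play the guitar" / "Cut the onion"
print(nce_loss(video_emb, text_emb))
```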
Which pretraining datasets?
HowTo100M: 1M videos, 100M clips, 20K tasks; text obtained from ASR.
[HowTo100M: Learning a Text-Video Embedding by Watching Hundred Million Narrated Video Clips, Miech, Zhukov, Alayrac et al., ICCV19]
AudioSet: 2M videos (with audio tracks); we do not extract text for this dataset.
[Audio Set: An ontology and human-labeled dataset for audio events, Gemmeke et al., ICASSP 2017]
Versatility checklist
1. Ingest any modality: takes as input any of the three modalities.
2. Specificity: respects the specificity of modalities.
3. Compare modalities: enables the different modalities to be easily compared.
4. Transfer to images: efficiently applicable to visual data in the form of videos or images.
Embedding graph design: Fine and Coarse
Intuition: audio is more fine-grained (e.g., many different guitar sounds) whereas text is coarser (a single word for "guitar") ⇒ the Fine and Coarse (FAC) design:
✓ enables the different modalities to be easily compared
✓ has the best results in several downstream tasks
✓ respects the specificity of modalities
Fine space: shared by vision and audio. Coarse space: shared by vision, audio and text.
[Self-supervised Multi-Modal Versatile Networks, NeurIPS 2020]
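A small sketch of how the fine and coarse spaces could be wired up. The va/vat naming follows the slides, but the layer sizes and single linear heads are illustrative stand-ins for the learned MLP projection heads in MMV.

```python
import numpy as np

rng = np.random.default_rng(0)
relu = lambda x: np.maximum(x, 0.0)

# Illustrative projection matrices (learned MLP heads in the real model).
W_v_to_va = rng.normal(size=(512, 256))    # vision -> fine (visual-audio) space
W_a_to_va = rng.normal(size=(512, 256))    # audio  -> fine space
W_va_to_vat = rng.normal(size=(256, 256))  # fine   -> coarse (visual-audio-text) space
W_t_to_vat = rng.normal(size=(300, 256))   # text   -> coarse space directly

vision_feat = rng.normal(size=(512,))      # backbone outputs (dimensions illustrative)
audio_feat = rng.normal(size=(512,))
text_feat = rng.normal(size=(300,))

# Fine space: vision and audio are compared here, at full granularity.
v_fine = relu(vision_feat @ W_v_to_va)
a_fine = relu(audio_feat @ W_a_to_va)

# Coarse space: vision and audio reach it through the va->vat head; text enters
# directly, so the coarser text modality never constrains the fine space.
v_coarse = v_fine @ W_va_to_vat
a_coarse = a_fine @ W_va_to_vat
t_coarse = text_feat @ W_t_to_vat
```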
Do more modalities help?
State-of-the-art comparison
Versatility checklist (revisited)
Requirements 1–3 (ingest any modality, specificity, compare modalities) are met by the FAC design; requirement 4, transfer to images, is addressed next by network deflation.
Network Deflation
Motivation: most prior work learns from images first and then transfers the models to video.
Goal: we train our models on video and apply them efficiently to image inputs.
A standard solution: inflate the input (repeat the image over time and run the video network). Proposed solution: deflate the video network into an image network.
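The two options can be contrasted in a few lines. This is a rough sketch for a single 3D convolution kernel; in the paper the deflated parameters are additionally trained so the image network matches the video network's outputs, which this sketch omits.

```python
import numpy as np

rng = np.random.default_rng(0)
T = 8  # number of frames the video network expects

# A 3D convolution kernel: (time, height, width, in_channels, out_channels).
kernel3d = rng.normal(size=(3, 3, 3, 16, 32))
image = rng.normal(size=(224, 224, 16))  # a single image's feature map

# Option 1: inflated input. Repeat the image T times along the temporal axis
# and run the unchanged video network -- correct but wasteful.
video_like = np.repeat(image[None], T, axis=0)  # shape (T, 224, 224, 16)

# Option 2: deflated network. For a constant-in-time input, summing the kernel
# over its temporal axis gives an equivalent 2D kernel (up to boundary effects),
# so the image can be processed by a cheap 2D convolution instead.
kernel2d = kernel3d.sum(axis=0)  # shape (3, 3, 16, 32)
```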
Multimodal Versatile Networks: Potential Applications
Audio to video retrieval
Given an audio query, the top three retrieved videos (Rank 1, Rank 2, Rank 3) are shown as frames in the slides.
Text to video retrieval
Input text: "add fresh chopped tomatoes and stir" → top three retrieved videos (Rank 1, Rank 2, Rank 3) shown in the slides.
Input text: "pour some oil into a hot pan" → top three retrieved videos (Rank 1, Rank 2, Rank 3) shown in the slides.
Text to audio retrieval in the coarse space
Even though the link between audio and text was never explicit during training, we can use the FAC architecture to perform text to audio retrieval:
1. The audio samples are first embedded in the joint visual-audio (fine) space, using the ResNet50 audio backbone.
2. The va→vat projection head is then used to project the audio embeddings into the joint visual-audio-text (coarse) space.
3. Given a text input query, we simply embed it into the joint space and retrieve the closest audio embedding. A sketch of these steps follows.
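Putting the three steps together, a small sketch of the retrieval procedure using the same illustrative stand-in projections as earlier; the real pipeline uses the trained MMV heads and backbone features.

```python
import numpy as np

rng = np.random.default_rng(0)

def normalize(x):
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

# Step 1: audio backbone outputs, already embedded in the fine (va) space.
audio_fine = rng.normal(size=(1000, 256))   # 1000 candidate audio clips

# Step 2: project the audio embeddings into the coarse (vat) space via va->vat.
W_va_to_vat = rng.normal(size=(256, 256))   # stands in for the trained head
audio_coarse = normalize(audio_fine @ W_va_to_vat)

# Step 3: embed the text query in the coarse space and take the nearest
# neighbour by cosine similarity.
query_coarse = normalize(rng.normal(size=(256,)))  # e.g., embedding of "airplane"
scores = audio_coarse @ query_coarse
print("Rank 1 audio clip:", int(np.argmax(scores)))
```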
Example queries: for the input texts "airplane" and "chirping bird", the Rank 1 retrieved audio clip is shown in the slides.
Resources
Pretrained models available on TF-Hub: [S3D] [TSM-RN] [TSM-RNx2]
Models in JAX with an action recognition downstream task!
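For reference, loading one of the released checkpoints looks roughly like this; the handle and signature name below are placeholders, so follow the TF-Hub links above for the exact usage.

```python
import tensorflow as tf
import tensorflow_hub as hub

# Hypothetical handle: substitute the real [S3D] / [TSM-RN] / [TSM-RNx2] links above.
module = hub.load("https://tfhub.dev/deepmind/mmv/s3d/1")

# A batch of RGB clips in [0, 1]: (batch, frames, height, width, channels).
clip = tf.random.uniform((1, 32, 224, 224, 3))
embedding = module.signatures["video"](clip)  # signature name is an assumption
```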
However...
Most available videos do not contain narrations.
Using negatives for self-supervision is expensive, as it requires training with large batch sizes.
Our training misses larger context, as the views of the data cover at most 3 seconds.
Outline of the talk
02 BraVe: Broaden your views for self-supervised learning
Narrow and broad views
Main idea
Motivation
Research questions
Evaluation
Main Idea
A narrow view of the video (a clip of a few seconds) is trained to predict the representation of a broad view covering the whole clip.
Motivation
Goal: learn good representations by regressing a representation of a broad view of the video.
BraVe learns strong video representations because the narrow view needs to predict the representation of the whole video clip (the broad view).
We use separate backbones to process the two views, as they perform different tasks. This also enables using different augmentations and modalities in each view.
Optical flow, or alternative representations of the video, can provide a strong signal for learning. A schematic of the objective follows.
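A schematic of the BraVe objective in a few lines of NumPy. The backbones and predictor are stand-ins; the real model adds projector heads, a symmetric broad-to-narrow loss, and optionally flow or audio in the broad view.

```python
import numpy as np

rng = np.random.default_rng(0)

# Separate (stand-in) backbones for the two views, plus a predictor head.
W_narrow = rng.normal(size=(512, 128))  # processes the short, augmented clip
W_broad = rng.normal(size=(512, 128))   # processes the long clip (or its flow)
W_pred = rng.normal(size=(128, 128))    # predicts the broad representation

narrow_view = rng.normal(size=(4, 512))  # features of ~1s narrow views (batch of 4)
broad_view = rng.normal(size=(4, 512))   # features of the corresponding broad views

def normalize(x):
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

# Regression objective: the narrow view predicts the broad representation.
# No negatives are needed, so no large batches are required.
pred = normalize(np.tanh(narrow_view @ W_narrow) @ W_pred)
target = normalize(np.tanh(broad_view @ W_broad))
loss = np.mean(np.sum((pred - target) ** 2, axis=-1))
```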
Research Questions
1. Importance of the broad view
2. Modality in the broad view
3. Weight sharing across views
4. Syncing the narrow and broad views
[Broaden Your Views for Self-Supervised Video Learning. Adrià Recasens, Pauline Luc, Jean-Baptiste Alayrac, Luyu Wang, Florian Strub, Corentin Tallec, Mateusz Malinowski, Viorica Patraucean, Florent Altché, Michal Valko, Jean-Bastien Grill, Aäron van den Oord, Andrew Zisserman. arXiv 2021.]
Comparison to SoTA: video-only models
Comparison to SoTA: audio-visual models
Conclusions
Videos are a rich source of self-supervision for video, audio and image models.
Both MMV and BraVe achieve SoTA results for self-supervised learning on several downstream tasks.
Audio, text and larger video context are all useful self-supervisory signals.
Thank you!
