http://bit.ly/dlsl2018
Xavier Giro-i-Nieto
xavier.giro@upc.edu
Associate Professor
Universitat Politecnica de Catalunya
Technical University of Catalonia
Audio and Vision
Day 4 Lecture 3
#DLUPC
2
Audio & Vision
Vision
Audio
Speech
3
Audio & Vision
Vision
Audio
Speech
Video
Synchronization among modalities
captured by video is exploited in a
self-supervised manner.
4
Audio & Vision
● Feature Learning
● Cross-modal Retrieval
● Cross-modal Translation
5
Audio & Vision
● Feature Learning
● Cross-modal Retrieval
● Cross-modal Translation
6
Vision
Audio
Video
Visual Feature Learning
7
Visual Feature Learning
Owens, Andrew, Jiajun Wu, Josh H. McDermott, William T. Freeman, and Antonio Torralba. "Ambient sound
provides supervision for visual learning." ECCV 2016
Based on the assumption that ambient sound in video is related to the visual
semantics.
8
Owens, Andrew, Jiajun Wu, Josh H. McDermott, William T. Freeman, and Antonio Torralba. "Ambient sound
provides supervision for visual learning." ECCV 2016
Use videos to train a CNN that predicts the audio statistics of a frame.
Visual Feature Learning
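The supervision here is free: the regression targets are summary statistics computed directly from each video's soundtrack, so no human labels are needed. A minimal numpy sketch of such targets (band log-energies are a simplified stand-in, not the paper's exact statistics):

```python
import numpy as np

def audio_stats(waveform, n_bands=8):
    """Summary statistics of ambient sound used as free supervision:
    mean log-energy per frequency band of the clip's spectrum."""
    spectrum = np.abs(np.fft.rfft(waveform)) ** 2
    bands = np.array_split(spectrum, n_bands)
    return np.array([np.log(b.mean() + 1e-8) for b in bands])

# One second of synthetic audio stands in for a video's soundtrack.
wave = np.random.randn(16000)
stats = audio_stats(wave)
# stats is the regression target for the frame paired with this audio.
```

A CNN is then trained to predict this vector from the corresponding video frame.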
9
Task: use the predicted audio statistics to cluster images. Audio clusters are built
with K-means over the training set.
Owens, Andrew, Jiajun Wu, Josh H. McDermott, William T. Freeman, and Antonio Torralba. "Ambient sound
provides supervision for visual learning." ECCV 2016
Average stats
Cluster assignments at test time (one row = one cluster)
Visual Feature Learning
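The clustering step can be sketched with scikit-learn; the audio-statistic vectors below are random placeholders for the real precomputed features:

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical precomputed audio-statistic vectors for training clips.
train_stats = np.random.rand(200, 8)
kmeans = KMeans(n_clusters=5, n_init=10, random_state=0).fit(train_stats)

# At test time, a frame's *predicted* stats are assigned to a cluster;
# frames sharing a cluster are shown together (one row per cluster).
predicted_stats = np.random.rand(10, 8)
assignments = kmeans.predict(predicted_stats)
```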
10
Although the CNN was not trained with class labels, local units with semantic meaning
emerge.
Owens, Andrew, Jiajun Wu, Josh H. McDermott, William T. Freeman, and Antonio Torralba. "Ambient sound
provides supervision for visual learning." ECCV 2016
Visual Feature Learning
11
Vision
Audio
Video
Audio Feature Learning
12
Aytar, Yusuf, Carl Vondrick, and Antonio Torralba. "Soundnet: Learning sound representations from unlabeled
video." NIPS 2016.
13
Audio Feature Learning: SoundNet
Aytar, Yusuf, Carl Vondrick, and Antonio Torralba. "Soundnet: Learning sound representations from unlabeled
video." NIPS 2016
Pretrained visual ConvNets supervise the training of a model for sound representation.
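This teacher-student setup can be sketched as minimizing the KL divergence between the visual teacher's class posteriors and the sound network's predictions for the same video; the distributions below are mocked, not SoundNet's actual outputs:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def kl(p, q, eps=1e-8):
    """KL divergence KL(p || q) between two discrete distributions."""
    return float(np.sum(p * (np.log(p + eps) - np.log(q + eps))))

# Teacher: a pretrained visual ConvNet's class posteriors for a frame
# (mocked with random logits). Student: the sound network's predictions
# for the synchronized raw audio.
teacher_probs = softmax(np.random.randn(1000))   # e.g. object classes
student_logits = np.random.randn(1000)
loss = kl(teacher_probs, softmax(student_logits))
# Training minimizes this loss so the audio branch mimics the visual teacher.
```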
14
Videos for training are unlabeled; the approach relies on ConvNets trained on labeled images.
Aytar, Yusuf, Carl Vondrick, and Antonio Torralba. "Soundnet: Learning sound representations from unlabeled
video." NIPS 2016
Audio Feature Learning: SoundNet
15
Hidden layers of SoundNet are used to train a standard SVM
classifier that outperforms the state of the art.
Aytar, Yusuf, Carl Vondrick, and Antonio Torralba. "Soundnet: Learning sound representations from unlabeled
video." NIPS 2016
Audio Feature Learning: SoundNet
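The evaluation recipe is the classic linear probe; a sketch with scikit-learn, where random vectors stand in for actual SoundNet hidden-layer activations and scene labels:

```python
import numpy as np
from sklearn.svm import LinearSVC

# Stand-ins for hidden-layer activations (e.g. conv5/conv7) of labeled clips.
features = np.random.rand(100, 256)
labels = np.random.randint(0, 4, size=100)   # e.g. acoustic scene classes

# A linear SVM on frozen features measures how good the representation is.
clf = LinearSVC(max_iter=5000).fit(features, labels)
acc = clf.score(features, labels)
```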
16
Visualization of the 1D filters over raw audio in conv1.
Aytar, Yusuf, Carl Vondrick, and Antonio Torralba. "Soundnet: Learning sound representations from unlabeled
video." NIPS 2016
Audio Feature Learning: SoundNet
17
Visualization of the 1D filters over raw audio in conv1.
Aytar, Yusuf, Carl Vondrick, and Antonio Torralba. "Soundnet: Learning sound representations from unlabeled
video." NIPS 2016
Audio Feature Learning: SoundNet
18
Visualize samples that most activate a neuron in a late layer (conv7)
Aytar, Yusuf, Carl Vondrick, and Antonio Torralba. "Soundnet: Learning sound representations from unlabeled
video." NIPS 2016
Audio Feature Learning: SoundNet
19
Visualization of the video frames associated with the sounds that activate some of the
last hidden units (conv7):
Aytar, Yusuf, Carl Vondrick, and Antonio Torralba. "Soundnet: Learning sound representations from unlabeled
video." NIPS 2016
Audio Feature Learning: SoundNet
20
Hearing sounds that most activate a neuron in the sound network (conv7)
Aytar, Yusuf, Carl Vondrick, and Antonio Torralba. "Soundnet: Learning sound representations from unlabeled
video." NIPS 2016
Audio Feature Learning: SoundNet
21
Hearing sounds that most activate a neuron in the sound network (conv5)
Aytar, Yusuf, Carl Vondrick, and Antonio Torralba. "Soundnet: Learning sound representations from unlabeled
video." NIPS 2016
Audio Feature Learning: SoundNet
22
Vision
Audio
Audio & Visual Feature Learning
Video
23
Arandjelović, Relja, and Andrew Zisserman. "Look, Listen and Learn." ICCV 2017.
Audio and visual features learned by assessing correspondence.
Audio & Visual Feature Learning
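The correspondence pretext task can be sketched as building positive pairs (a clip with its own soundtrack) and negative pairs (a clip with another clip's soundtrack), then training a binary classifier on them; the embeddings below are random placeholders for the two subnetworks' outputs:

```python
import numpy as np

rng = np.random.default_rng(0)
video_feats = rng.standard_normal((32, 128))   # per-clip visual embeddings
audio_feats = rng.standard_normal((32, 128))   # embeddings of their soundtracks

# Positives: each clip with its own audio.
pos = np.concatenate([video_feats, audio_feats], axis=1)
# Negatives: shift the audio by one clip so every pair is mismatched.
neg = np.concatenate([video_feats, np.roll(audio_feats, 1, axis=0)], axis=1)

pairs = np.concatenate([pos, neg])
targets = np.concatenate([np.ones(32), np.zeros(32)])
# A small classifier trained on (pairs, targets) learns correspondence,
# and its two branches yield audio and visual features as a by-product.
```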
24
Audio & Vision
● Feature Learning
● Cross-modal retrieval
● Cross-modal Translation
25
Owens, Andrew, Phillip Isola, Josh McDermott, Antonio Torralba, Edward H. Adelson, and William T.
Freeman. "Visually indicated sounds." CVPR 2016.
26
Cross-modal Retrieval
Owens, Andrew, Phillip Isola, Josh McDermott, Antonio Torralba, Edward H. Adelson, and William T.
Freeman. "Visually indicated sounds." CVPR 2016.
Learn to synthesize sounds from videos of people hitting objects with a drumstick.
27
Owens, Andrew, Phillip Isola, Josh McDermott, Antonio Torralba, Edward H. Adelson, and William T.
Freeman. "Visually indicated sounds." CVPR 2016.
Not end-to-end
Cross-modal Retrieval
28
The Greatest Hits Dataset
Owens, Andrew, Phillip Isola, Josh McDermott, Antonio Torralba, Edward H. Adelson, and William T.
Freeman. "Visually indicated sounds." CVPR 2016.
Cross-modal Retrieval
29
[Paper draft]
Cross-modal Retrieval
Surís, Didac, Amanda Duarte, Amaia Salvador, Jordi Torres, and Xavier Giró-i-Nieto. "Cross-modal
Embeddings for Video and Audio Retrieval." arXiv preprint arXiv:1801.02200 (2018).
30
Best
match
Visual feature Audio feature
Video sonorization
Cross-modal Retrieval
Surís, Didac, Amanda Duarte, Amaia Salvador, Jordi Torres, and Xavier Giró-i-Nieto. "Cross-modal
Embeddings for Video and Audio Retrieval." arXiv preprint arXiv:1801.02200 (2018).
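Once both modalities live in a shared embedding space, retrieval reduces to a nearest-neighbor search; a sketch with cosine similarity (the embeddings here are random placeholders, not the trained model's):

```python
import numpy as np

def l2norm(x):
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

# Hypothetical embeddings already mapped into the shared space.
audio_bank = l2norm(np.random.rand(50, 64))   # candidate soundtracks
query_vis = l2norm(np.random.rand(64))        # embedding of a silent video

# Video sonorization: pick the audio whose embedding best matches the video.
scores = audio_bank @ query_vis               # cosine similarities
best = int(np.argmax(scores))
```

Audio coloring is the symmetric query: an audio embedding retrieves the best-matching visual embedding from a bank of videos.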
31
Surís, Didac, Amanda Duarte, Amaia Salvador, Jordi Torres, and Xavier Giró-i-Nieto. "Cross-modal
Embeddings for Video and Audio Retrieval." arXiv preprint arXiv:1801.02200 (2018).
Visual feature Audio feature
Best
match
Audio coloring
Cross-modal Retrieval
32
Audio & Vision
● Feature Learning
● Cross-modal retrieval
● Cross-modal Translation
33
Audio & Vision
Vision Speech
Video
34
Audio & Vision
Vision Speech
Video
35
Ephrat, Ariel, Tavi Halperin, and Shmuel Peleg. "Improved speech reconstruction from silent video." In
ICCV 2017 Workshop on Computer Vision for Audio-Visual Media. 2017.
36
Ephrat et al. Vid2speech: Speech Reconstruction from Silent Video. ICASSP 2017
Speech Generation from Video
CNN
(VGG)
Frame from a
silent video
Audio feature
Post-hoc
synthesis
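The pipeline above can be sketched as visual encoding followed by a regression to acoustic features; everything here (the pooling "encoder", the weight matrix, the feature dimension) is a hypothetical stand-in for the paper's VGG-based model:

```python
import numpy as np

def cnn_features(frame):
    """Stand-in for a VGG-style visual encoder (hypothetical)."""
    return frame.mean(axis=(0, 1))            # crude pooling over pixels

def predict_audio_feature(visual_feat, W):
    """Regression head mapping visual features to an acoustic feature
    vector (e.g. a spectral envelope); W is a learned weight matrix."""
    return visual_feat @ W

frame = np.random.rand(64, 64, 3)             # one frame of silent video
W = np.random.rand(3, 40)                     # hypothetical learned weights
audio_feat = predict_audio_feature(cnn_features(frame), W)
# A separate vocoder then synthesizes the waveform from audio_feat
# (the "post-hoc synthesis" step in the slide's pipeline).
```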
37
Speech Generation from Video
Ephrat, Ariel, Tavi Halperin, and Shmuel Peleg. "Improved speech reconstruction from silent video." In
ICCV 2017 Workshop on Computer Vision for Audio-Visual Media. 2017.
38
Chung, Joon Son, Amir Jamaludin, and Andrew Zisserman. "You said that?." BMVC 2017.
39
Audio & Vision
Vision Speech
Video
40
Chung, Joon Son, Amir Jamaludin, and Andrew Zisserman. "You said that?." BMVC 2017.
Speech to Video Synthesis (mouth)
41
Karras, Tero, Timo Aila, Samuli Laine, Antti Herva, and Jaakko Lehtinen. "Audio-driven facial animation by
joint end-to-end learning of pose and emotion." SIGGRAPH 2017
42
Suwajanakorn, Supasorn, Steven M. Seitz, and Ira Kemelmacher-Shlizerman. "Synthesizing Obama: learning lip sync from
audio." SIGGRAPH 2017.
Speech to Video Synthesis (mouth)
43
Karras, Tero, Timo Aila, Samuli Laine, Antti Herva, and Jaakko Lehtinen. "Audio-driven facial animation by
joint end-to-end learning of pose and emotion." SIGGRAPH 2017
44
Karras, Tero, Timo Aila, Samuli Laine, Antti Herva, and Jaakko Lehtinen. "Audio-driven facial animation by
joint end-to-end learning of pose and emotion." SIGGRAPH 2017
Speech to Video Synthesis (pose & emotion)
45
L. Chen, S. Srivastava, Z. Duan and C. Xu. Deep Cross-Modal Audio-Visual Generation. ACM
Multimedia Thematic Workshops 2017.
Audio & Visual Generation
46
"Hello"
SLPA
Speech2Signs (work in progress)
47
Audio & Vision
● Feature Learning
● Cross-modal retrieval
● Cross-modal Translation
48
Questions?

Deep Audio and Vision (DLSL D2L4 2018 UPC Deep Learning for Speech and Language)
