http://bit.ly/dlsl2018
Xavier Giro-i-Nieto
xavier.giro@upc.edu
Associate Professor
Universitat Politecnica de Catalunya
Technical University of Catalonia
Audio and Vision
Day 4 Lecture 3
#DLUPC
2
Audio & Vision
Vision
Audio
Speech
3
Audio & Vision
Vision
Audio
Speech
Video
Synchronization among modalities
captured by video is exploited in a
self-supervised manner.
4
Audio & Vision
● Feature Learning
● Cross-modal Retrieval
● Cross-modal Translation
5
Audio & Vision
● Feature Learning
● Cross-modal Retrieval
● Cross-modal Translation
6
Vision
Audio
Video
Visual Feature Learning
7
Visual Feature Learning
Owens, Andrew, Jiajun Wu, Josh H. McDermott, William T. Freeman, and Antonio Torralba. "Ambient sound
provides supervision for visual learning." ECCV 2016
Based on the assumption that ambient sound in video is related to the visual
semantics.
8
Owens, Andrew, Jiajun Wu, Josh H. McDermott, William T. Freeman, and Antonio Torralba. "Ambient sound
provides supervision for visual learning." ECCV 2016
Use videos to train a CNN that predicts the audio statistics of a frame.
Visual Feature Learning
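The supervision here is free: the regression targets are summary statistics computed directly from each video's soundtrack, so no human labels are needed. A minimal numpy sketch of such targets (band log-energies are a simplified stand-in, not the paper's exact statistics):

```python
import numpy as np

def audio_stats(waveform, n_bands=8):
    """Summary statistics of ambient sound used as free supervision:
    mean log-energy per frequency band of the clip's spectrum."""
    spectrum = np.abs(np.fft.rfft(waveform)) ** 2
    bands = np.array_split(spectrum, n_bands)
    return np.array([np.log(b.mean() + 1e-8) for b in bands])

# One second of synthetic audio stands in for a video's soundtrack.
wave = np.random.randn(16000)
stats = audio_stats(wave)
# stats is the regression target for the frame paired with this audio.
```

A CNN is then trained to predict this vector from the corresponding video frame.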
9
Task: use the predicted audio statistics to cluster images. Audio clusters are built
with K-means over the training set.
Owens, Andrew, Jiajun Wu, Josh H. McDermott, William T. Freeman, and Antonio Torralba. "Ambient sound
provides supervision for visual learning." ECCV 2016
Average stats
Cluster assignments at test time (one row = one cluster)
Visual Feature Learning
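The clustering step can be sketched with scikit-learn; the audio-statistic vectors below are random placeholders for the real precomputed features:

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical precomputed audio-statistic vectors for training clips.
train_stats = np.random.rand(200, 8)
kmeans = KMeans(n_clusters=5, n_init=10, random_state=0).fit(train_stats)

# At test time, a frame's *predicted* stats are assigned to a cluster;
# frames sharing a cluster are shown together (one row per cluster).
predicted_stats = np.random.rand(10, 8)
assignments = kmeans.predict(predicted_stats)
```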
10
Although the CNN was not trained with class labels, local units with semantic meaning
emerge.
Owens, Andrew, Jiajun Wu, Josh H. McDermott, William T. Freeman, and Antonio Torralba. "Ambient sound
provides supervision for visual learning." ECCV 2016
Visual Feature Learning
11
Vision
Audio
Video
Audio Feature Learning
12
Aytar, Yusuf, Carl Vondrick, and Antonio Torralba. "Soundnet: Learning sound representations from unlabeled
video." NIPS 2016.
13
Audio Feature Learning: SoundNet
Aytar, Yusuf, Carl Vondrick, and Antonio Torralba. "Soundnet: Learning sound representations from unlabeled
video." NIPS 2016
Pretrained visual ConvNets supervise the training of a model for sound representation.
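This teacher-student setup can be sketched as minimizing the KL divergence between the visual teacher's class posteriors and the sound network's predictions for the same video; the distributions below are mocked, not SoundNet's actual outputs:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def kl(p, q, eps=1e-8):
    """KL divergence KL(p || q) between two discrete distributions."""
    return float(np.sum(p * (np.log(p + eps) - np.log(q + eps))))

# Teacher: a pretrained visual ConvNet's class posteriors for a frame
# (mocked with random logits). Student: the sound network's predictions
# for the synchronized raw audio.
teacher_probs = softmax(np.random.randn(1000))   # e.g. object classes
student_logits = np.random.randn(1000)
loss = kl(teacher_probs, softmax(student_logits))
# Training minimizes this loss so the audio branch mimics the visual teacher.
```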
14
Videos for training are unlabeled; the approach relies on ConvNets trained on labeled images.
Aytar, Yusuf, Carl Vondrick, and Antonio Torralba. "Soundnet: Learning sound representations from unlabeled
video." NIPS 2016
Audio Feature Learning: SoundNet
15
Hidden layers of SoundNet are used to train a standard SVM
classifier that outperforms the state of the art.
Aytar, Yusuf, Carl Vondrick, and Antonio Torralba. "Soundnet: Learning sound representations from unlabeled
video." NIPS 2016
Audio Feature Learning: SoundNet
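The evaluation recipe is the classic linear probe; a sketch with scikit-learn, where random vectors stand in for actual SoundNet hidden-layer activations and scene labels:

```python
import numpy as np
from sklearn.svm import LinearSVC

# Stand-ins for hidden-layer activations (e.g. conv5/conv7) of labeled clips.
features = np.random.rand(100, 256)
labels = np.random.randint(0, 4, size=100)   # e.g. acoustic scene classes

# A linear SVM on frozen features measures how good the representation is.
clf = LinearSVC(max_iter=5000).fit(features, labels)
acc = clf.score(features, labels)
```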
16
Visualization of the 1D filters over raw audio in conv1.
Aytar, Yusuf, Carl Vondrick, and Antonio Torralba. "Soundnet: Learning sound representations from unlabeled
video." NIPS 2016
Audio Feature Learning: SoundNet
17
Visualization of the 1D filters over raw audio in conv1.
Aytar, Yusuf, Carl Vondrick, and Antonio Torralba. "Soundnet: Learning sound representations from unlabeled
video." NIPS 2016
Audio Feature Learning: SoundNet
18
Visualize samples that most activate a neuron in a late layer (conv7)
Aytar, Yusuf, Carl Vondrick, and Antonio Torralba. "Soundnet: Learning sound representations from unlabeled
video." NIPS 2016
Audio Feature Learning: SoundNet
19
Visualization of the video frames associated with the sounds that activate some of the
last hidden units (conv7):
Aytar, Yusuf, Carl Vondrick, and Antonio Torralba. "Soundnet: Learning sound representations from unlabeled
video." NIPS 2016
Audio Feature Learning: SoundNet
20
Hearing sounds that most activate a neuron in the sound network (conv7)
Aytar, Yusuf, Carl Vondrick, and Antonio Torralba. "Soundnet: Learning sound representations from unlabeled
video." NIPS 2016
Audio Feature Learning: SoundNet
21
Hearing sounds that most activate a neuron in the sound network (conv5)
Aytar, Yusuf, Carl Vondrick, and Antonio Torralba. "Soundnet: Learning sound representations from unlabeled
video." NIPS 2016
Audio Feature Learning: SoundNet
22
Vision
Audio
Audio & Visual Feature Learning
Video
23
Arandjelović, Relja, and Andrew Zisserman. "Look, Listen and Learn." ICCV 2017.
Audio and visual features learned by assessing correspondence.
Audio & Visual Feature Learning
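The correspondence pretext task can be sketched as building positive pairs (a clip with its own soundtrack) and negative pairs (a clip with another clip's soundtrack), then training a binary classifier on them; the embeddings below are random placeholders for the two subnetworks' outputs:

```python
import numpy as np

rng = np.random.default_rng(0)
video_feats = rng.standard_normal((32, 128))   # per-clip visual embeddings
audio_feats = rng.standard_normal((32, 128))   # embeddings of their soundtracks

# Positives: each clip with its own audio.
pos = np.concatenate([video_feats, audio_feats], axis=1)
# Negatives: shift the audio by one clip so every pair is mismatched.
neg = np.concatenate([video_feats, np.roll(audio_feats, 1, axis=0)], axis=1)

pairs = np.concatenate([pos, neg])
targets = np.concatenate([np.ones(32), np.zeros(32)])
# A small classifier trained on (pairs, targets) learns correspondence,
# and its two branches yield audio and visual features as a by-product.
```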
24
Audio & Vision
● Feature Learning
● Cross-modal retrieval
● Cross-modal Translation
25
Owens, Andrew, Phillip Isola, Josh McDermott, Antonio Torralba, Edward H. Adelson, and William T.
Freeman. "Visually indicated sounds." CVPR 2016.
26
Cross-modal Retrieval
Owens, Andrew, Phillip Isola, Josh McDermott, Antonio Torralba, Edward H. Adelson, and William T.
Freeman. "Visually indicated sounds." CVPR 2016.
Learn to synthesize sounds from videos of people hitting objects with a drumstick.
27
Owens, Andrew, Phillip Isola, Josh McDermott, Antonio Torralba, Edward H. Adelson, and William T.
Freeman. "Visually indicated sounds." CVPR 2016.
Not end-to-end
Cross-modal Retrieval
28
The Greatest Hits Dataset
Owens, Andrew, Phillip Isola, Josh McDermott, Antonio Torralba, Edward H. Adelson, and William T.
Freeman. "Visually indicated sounds." CVPR 2016.
Cross-modal Retrieval
29
[Paper draft]
Cross-modal Retrieval
Surís, Didac, Amanda Duarte, Amaia Salvador, Jordi Torres, and Xavier Giró-i-Nieto. "Cross-modal
Embeddings for Video and Audio Retrieval." arXiv preprint arXiv:1801.02200 (2018).
30
Best
match
Visual feature Audio feature
Video sonorization
Cross-modal Retrieval
Surís, Didac, Amanda Duarte, Amaia Salvador, Jordi Torres, and Xavier Giró-i-Nieto. "Cross-modal
Embeddings for Video and Audio Retrieval." arXiv preprint arXiv:1801.02200 (2018).
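Once both modalities live in a shared embedding space, retrieval reduces to a nearest-neighbor search; a sketch with cosine similarity (the embeddings here are random placeholders, not the trained model's):

```python
import numpy as np

def l2norm(x):
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

# Hypothetical embeddings already mapped into the shared space.
audio_bank = l2norm(np.random.rand(50, 64))   # candidate soundtracks
query_vis = l2norm(np.random.rand(64))        # embedding of a silent video

# Video sonorization: pick the audio whose embedding best matches the video.
scores = audio_bank @ query_vis               # cosine similarities
best = int(np.argmax(scores))
```

Audio coloring is the symmetric query: an audio embedding retrieves the best-matching visual embedding from a bank of videos.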
31
Surís, Didac, Amanda Duarte, Amaia Salvador, Jordi Torres, and Xavier Giró-i-Nieto. "Cross-modal
Embeddings for Video and Audio Retrieval." arXiv preprint arXiv:1801.02200 (2018).
Visual feature Audio feature
Best
match
Audio coloring
Cross-modal Retrieval
32
Audio & Vision
● Feature Learning
● Cross-modal retrieval
● Cross-modal Translation
33
Audio & Vision
Vision Speech
Video
34
Audio & Vision
Vision Speech
Video
35
Ephrat, Ariel, Tavi Halperin, and Shmuel Peleg. "Improved speech reconstruction from silent video." In
ICCV 2017 Workshop on Computer Vision for Audio-Visual Media. 2017.
36
Ephrat et al. Vid2speech: Speech Reconstruction from Silent Video. ICASSP 2017
Speech Generation from Video
CNN
(VGG)
Frame from a
silent video
Audio feature
Post-hoc
synthesis
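The pipeline above can be sketched as visual encoding followed by a regression to acoustic features; everything here (the pooling "encoder", the weight matrix, the feature dimension) is a hypothetical stand-in for the paper's VGG-based model:

```python
import numpy as np

def cnn_features(frame):
    """Stand-in for a VGG-style visual encoder (hypothetical)."""
    return frame.mean(axis=(0, 1))            # crude pooling over pixels

def predict_audio_feature(visual_feat, W):
    """Regression head mapping visual features to an acoustic feature
    vector (e.g. a spectral envelope); W is a learned weight matrix."""
    return visual_feat @ W

frame = np.random.rand(64, 64, 3)             # one frame of silent video
W = np.random.rand(3, 40)                     # hypothetical learned weights
audio_feat = predict_audio_feature(cnn_features(frame), W)
# A separate vocoder then synthesizes the waveform from audio_feat
# (the "post-hoc synthesis" step in the slide's pipeline).
```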
37
Speech Generation from Video
Ephrat, Ariel, Tavi Halperin, and Shmuel Peleg. "Improved speech reconstruction from silent video." In
ICCV 2017 Workshop on Computer Vision for Audio-Visual Media. 2017.
38
Chung, Joon Son, Amir Jamaludin, and Andrew Zisserman. "You said that?." BMVC 2017.
39
Audio & Vision
Vision Speech
Video
40
Chung, Joon Son, Amir Jamaludin, and Andrew Zisserman. "You said that?." BMVC 2017.
Speech to Video Synthesis (mouth)
41
Karras, Tero, Timo Aila, Samuli Laine, Antti Herva, and Jaakko Lehtinen. "Audio-driven facial animation by
joint end-to-end learning of pose and emotion." SIGGRAPH 2017
42
Suwajanakorn, Supasorn, Steven M. Seitz, and Ira Kemelmacher-Shlizerman. "Synthesizing Obama: learning lip sync from
audio." SIGGRAPH 2017.
Speech to Video Synthesis (mouth)
43
Karras, Tero, Timo Aila, Samuli Laine, Antti Herva, and Jaakko Lehtinen. "Audio-driven facial animation by
joint end-to-end learning of pose and emotion." SIGGRAPH 2017
44
Karras, Tero, Timo Aila, Samuli Laine, Antti Herva, and Jaakko Lehtinen. "Audio-driven facial animation by
joint end-to-end learning of pose and emotion." SIGGRAPH 2017
Speech to Video Synthesis (pose & emotion)
45
L. Chen, S. Srivastava, Z. Duan and C. Xu. Deep Cross-Modal Audio-Visual Generation. ACM
Multimedia Thematic Workshops 2017.
Audio & Visual Generation
46
"Hello"
SLPA
Speech2Signs (work in progress)
47
Audio & Vision
● Feature Learning
● Cross-modal retrieval
● Cross-modal Translation
48
Questions?

Deep Audio and Vision (DLSL D2L4 2018 UPC Deep Learning for Speech and Language)
