@DocXavi
Module 4 - Lecture 6
Video Analysis with
CNNs
31 January 2017
Xavier Giró-i-Nieto
[http://pagines.uab.cat/mcv/]
Acknowledgments
2
Víctor Campos Alberto Montes
Linked slides
Outline
1. Recognition
2. Optical Flow
3. Object Tracking
4. Audio and Video
5. Generative models
4
Recognition
Demo: Clarifai
MIT Technology Review: “A start-up’s Neural Network Can Understand Video” (3/2/2015)
5
Figure: Karpathy, A., Toderici, G., Shetty, S., Leung, T., Sukthankar, R., & Fei-Fei, L. (2014, June). Large-scale video classification with
convolutional neural networks. In Computer Vision and Pattern Recognition (CVPR), 2014 IEEE Conference on (pp. 1725-1732). IEEE.
6
Recognition
7
Recognition
Figure: Tran, Du, Lubomir Bourdev, Rob Fergus, Lorenzo Torresani, and Manohar Paluri. "Learning spatiotemporal features with 3D
convolutional networks." In Proceedings of the IEEE International Conference on Computer Vision, pp. 4489-4497. 2015
8
Recognition
Figure: Tran, Du, Lubomir Bourdev, Rob Fergus, Lorenzo Torresani, and Manohar Paluri. "Learning spatiotemporal features with 3D
convolutional networks." In Proceedings of the IEEE International Conference on Computer Vision, pp. 4489-4497. 2015
Previous lectures
9
Recognition
Figure: Tran, Du, Lubomir Bourdev, Rob Fergus, Lorenzo Torresani, and Manohar Paluri. "Learning spatiotemporal features with 3D
convolutional networks." In Proceedings of the IEEE International Conference on Computer Vision, pp. 4489-4497. 2015
Karpathy, A., Toderici, G., Shetty, S., Leung, T., Sukthankar, R., & Fei-Fei, L. (2014, June). Large-scale video
classification with convolutional neural networks. In Computer Vision and Pattern Recognition (CVPR), 2014
IEEE Conference on (pp. 1725-1732). IEEE.
Slides extracted from ReadCV seminar by Victor Campos 10
Recognition: DeepVideo
Karpathy, A., Toderici, G., Shetty, S., Leung, T., Sukthankar, R., & Fei-Fei, L. (2014, June). Large-scale video classification with convolutional
neural networks. In Computer Vision and Pattern Recognition (CVPR), 2014 IEEE Conference on (pp. 1725-1732). IEEE. 11
Recognition: DeepVideo: Demo
Karpathy, A., Toderici, G., Shetty, S., Leung, T., Sukthankar, R., & Fei-Fei, L. (2014, June). Large-scale video classification with convolutional
neural networks. In Computer Vision and Pattern Recognition (CVPR), 2014 IEEE Conference on (pp. 1725-1732). IEEE. 12
Recognition: DeepVideo: Architectures
Karpathy, A., Toderici, G., Shetty, S., Leung, T., Sukthankar, R., & Fei-Fei, L. (2014, June). Large-scale video classification with convolutional
neural networks. In Computer Vision and Pattern Recognition (CVPR), 2014 IEEE Conference on (pp. 1725-1732). IEEE. 13
Unsupervised learning [Le et al.’11] Supervised learning [Karpathy et al.’14]
Recognition: DeepVideo: Features
Karpathy, A., Toderici, G., Shetty, S., Leung, T., Sukthankar, R., & Fei-Fei, L. (2014, June). Large-scale video classification with convolutional
neural networks. In Computer Vision and Pattern Recognition (CVPR), 2014 IEEE Conference on (pp. 1725-1732). IEEE. 14
Recognition: DeepVideo: Multiscale
Karpathy, A., Toderici, G., Shetty, S., Leung, T., Sukthankar, R., & Fei-Fei, L. (2014, June). Large-scale video classification with convolutional
neural networks. In Computer Vision and Pattern Recognition (CVPR), 2014 IEEE Conference on (pp. 1725-1732). IEEE. 15
Recognition: DeepVideo: Results
16
Recognition
Figure: Tran, Du, Lubomir Bourdev, Rob Fergus, Lorenzo Torresani, and Manohar Paluri. "Learning spatiotemporal features with 3D
convolutional networks." In Proceedings of the IEEE International Conference on Computer Vision, pp. 4489-4497. 2015
17
Recognition: C3D
Figure: Tran, Du, Lubomir Bourdev, Rob Fergus, Lorenzo Torresani, and Manohar Paluri. "Learning
spatiotemporal features with 3D convolutional networks." In Proceedings of the IEEE International
Conference on Computer Vision, pp. 4489-4497. 2015
18
Tran, Du, Lubomir Bourdev, Rob Fergus, Lorenzo Torresani, and Manohar Paluri. "Learning spatiotemporal features with 3D convolutional networks."
In Proceedings of the IEEE International Conference on Computer Vision, pp. 4489-4497. 2015
Recognition: C3D: Demo
19
K. Simonyan, A. Zisserman, Very Deep Convolutional Networks for Large-Scale Image Recognition, ICLR 2015.
Recognition: C3D: Spatial dimension
Spatial dimensions (XY) of all kernels are fixed to 3x3, following Simonyan & Zisserman (ICLR 2015).
20
Tran, Du, Lubomir Bourdev, Rob Fergus, Lorenzo Torresani, and Manohar Paluri. "Learning spatiotemporal features with 3D convolutional networks."
In Proceedings of the IEEE International Conference on Computer Vision, pp. 4489-4497. 2015
Recognition: C3D: Temporal dimension
3D ConvNets are more suitable for spatiotemporal feature learning compared to 2D ConvNets
(Figure: networks with varying temporal depth compared against 2D ConvNets)
21
Tran, Du, Lubomir Bourdev, Rob Fergus, Lorenzo Torresani, and Manohar Paluri. "Learning spatiotemporal features with 3D convolutional networks."
In Proceedings of the IEEE International Conference on Computer Vision, pp. 4489-4497. 2015
A homogeneous architecture with small 3 × 3 × 3 convolution kernels in all layers is among the best
performing architectures for 3D ConvNets
Recognition: C3D: Temporal dimension
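A minimal sketch (assuming PyTorch, which the lecture does not prescribe) of this homogeneous design: every layer uses 3x3x3 kernels, so one conv-pool stage looks like this.

```python
import torch
import torch.nn as nn

class C3DBlock(nn.Module):
    """One 3D conv + ReLU + max-pool stage with the 3x3x3 kernels the paper advocates."""
    def __init__(self, in_ch, out_ch, pool=(2, 2, 2)):
        super().__init__()
        self.conv = nn.Conv3d(in_ch, out_ch, kernel_size=3, padding=1)
        self.pool = nn.MaxPool3d(kernel_size=pool, stride=pool)

    def forward(self, x):  # x: (batch, channels, time, height, width)
        return self.pool(torch.relu(self.conv(x)))

clip = torch.randn(1, 3, 16, 112, 112)       # one 16-frame RGB clip
out = C3DBlock(3, 64, pool=(1, 2, 2))(clip)  # first stage keeps the temporal depth
print(out.shape)                             # torch.Size([1, 64, 16, 56, 56])
```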
22
Tran, Du, Lubomir Bourdev, Rob Fergus, Lorenzo Torresani, and Manohar Paluri. "Learning spatiotemporal features with 3D convolutional networks."
In Proceedings of the IEEE International Conference on Computer Vision, pp. 4489-4497. 2015
No gain when varying the temporal depth across layers.
Recognition: C3D: Temporal dimension
23
Tran, Du, Lubomir Bourdev, Rob Fergus, Lorenzo Torresani, and Manohar Paluri. "Learning spatiotemporal features with 3D convolutional networks."
In Proceedings of the IEEE International Conference on Computer Vision, pp. 4489-4497. 2015
Recognition: C3D: Architecture
Feature vector
24
Tran, Du, Lubomir Bourdev, Rob Fergus, Lorenzo Torresani, and Manohar Paluri. "Learning spatiotemporal features with 3D convolutional networks."
In Proceedings of the IEEE International Conference on Computer Vision, pp. 4489-4497. 2015
Recognition: C3D: Feature vector
Video sequence split into 16-frame clips with an 8-frame overlap
25
Tran, Du, Lubomir Bourdev, Rob Fergus, Lorenzo Torresani, and Manohar Paluri. "Learning spatiotemporal features with 3D convolutional networks."
In Proceedings of the IEEE International Conference on Computer Vision, pp. 4489-4497. 2015
Recognition: C3D: Feature vector
(Figure: activations of successive 16-frame clips are averaged into a 4096-dim video descriptor, which is then L2-normalized)
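A minimal sketch (NumPy) of the descriptor pipeline shown above; `c3d_features` is a hypothetical stand-in for a trained C3D network returning one 4096-dim activation per clip.

```python
import numpy as np

def video_descriptor(frames, c3d_features, clip_len=16, stride=8):
    """Split the video into 16-frame clips with an 8-frame overlap,
    average the per-clip 4096-dim activations, and L2-normalize."""
    feats = [c3d_features(frames[t:t + clip_len])
             for t in range(0, len(frames) - clip_len + 1, stride)]
    desc = np.mean(feats, axis=0)           # 4096-dim video descriptor
    return desc / np.linalg.norm(desc)      # L2 norm
```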
26
Tran, Du, Lubomir Bourdev, Rob Fergus, Lorenzo Torresani, and Manohar Paluri. "Learning spatiotemporal features with 3D convolutional networks."
In Proceedings of the IEEE International Conference on Computer Vision, pp. 4489-4497. 2015
Recognition: C3D: Visualization
Based on Deconvnets by Zeiler and Fergus [ECCV 2014] - See [ReadCV Slides] for more details.
27
Tran, Du, Lubomir Bourdev, Rob Fergus, Lorenzo Torresani, and Manohar Paluri. "Learning spatiotemporal features with 3D convolutional networks."
In Proceedings of the IEEE International Conference on Computer Vision, pp. 4489-4497. 2015
Recognition: C3D: Compactness
28
Tran, Du, Lubomir Bourdev, Rob Fergus, Lorenzo Torresani, and Manohar Paluri. "Learning spatiotemporal features with 3D convolutional networks."
In Proceedings of the IEEE International Conference on Computer Vision, pp. 4489-4497. 2015
Convolutional 3D (C3D) features combined with a simple linear classifier outperform state-of-the-art methods on 4 different benchmarks and are comparable with the state of the art on the other 2 benchmarks.
Recognition: C3D: Performance
29
Tran, Du, Lubomir Bourdev, Rob Fergus, Lorenzo Torresani, and Manohar Paluri. "Learning spatiotemporal features with 3D convolutional networks."
In Proceedings of the IEEE International Conference on Computer Vision, pp. 4489-4497. 2015
Recognition: C3D: Software
Implementation by Michael Gygli (GitHub)
30 Simonyan, Karen, and Andrew Zisserman. "Two-stream convolutional networks for action recognition in videos." NIPS 2014.
Recognition: Two stream
Two CNNs in parallel:
● One for RGB images
● One for optical flow (hand-crafted features)
Fusion after the softmax layer
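A minimal sketch of this late fusion, assuming the two streams are already trained: the simplest variant just averages their softmax scores (the paper also explores an SVM on the stacked scores).

```python
import numpy as np

def two_stream_predict(rgb_softmax, flow_softmax):
    """rgb_softmax / flow_softmax: class scores from the two parallel CNNs."""
    fused = (rgb_softmax + flow_softmax) / 2.0   # fusion after the softmax layer
    return int(np.argmax(fused))
```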
31 Feichtenhofer, Christoph, Axel Pinz, and Andrew Zisserman. "Convolutional two-stream network fusion for video action recognition." CVPR 2016. [code]
Recognition: Two stream
Two CNNs in parallel:
● One for RGB images
● One for optical flow (hand-crafted features)
Fusion at a convolutional layer
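A hedged sketch (PyTorch) of fusing at a convolutional layer: one of the fusion functions studied in the paper stacks the two streams' feature maps along channels and mixes them with a 1x1 convolution.

```python
import torch
import torch.nn as nn

class ConvFusion(nn.Module):
    def __init__(self, channels):
        super().__init__()
        # 1x1 filters learn how to combine corresponding RGB and flow channels
        self.mix = nn.Conv2d(2 * channels, channels, kernel_size=1)

    def forward(self, rgb_map, flow_map):  # both: (batch, channels, H, W)
        return self.mix(torch.cat([rgb_map, flow_map], dim=1))
```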
32
Shou, Zheng, Dongang Wang, and Shih-Fu Chang. "Temporal action localization in untrimmed videos via multi-stage cnns." CVPR 2016.
(Slidecast and Slides by Alberto Montes)
Recognition: Localization
33
Recognition: Localization
Shou, Zheng, Dongang Wang, and Shih-Fu Chang. "Temporal action localization in untrimmed videos via multi-stage cnns." CVPR 2016.
(Slidecast and Slides by Alberto Montes)
34
Recognition: Localization
Shou, Zheng, Dongang Wang, and Shih-Fu Chang. "Temporal action localization in untrimmed videos via multi-stage cnns." CVPR 2016.
(Slidecast and Slides by Alberto Montes)
Outline
1. Recognition
2. Optical Flow
3. Object Tracking
4. Audio and Video
5. Generative models
35
Optical Flow
Weinzaepfel, P., Revaud, J., Harchaoui, Z., & Schmid, C. (2013, December). DeepFlow: Large displacement optical flow with deep matching. In
Computer Vision (ICCV), 2013 IEEE International Conference on (pp. 1385-1392). IEEE 36
Optical Flow: DeepFlow
Weinzaepfel, P., Revaud, J., Harchaoui, Z., & Schmid, C. (2013, December). DeepFlow: Large displacement optical flow with deep matching. In
Computer Vision (ICCV), 2013 IEEE International Conference on (pp. 1385-1392). IEEE 37
Andrei Bursuc
Postdoc, INRIA
@abursuc
Optical Flow: DeepFlow
Weinzaepfel, P., Revaud, J., Harchaoui, Z., & Schmid, C. (2013, December). DeepFlow: Large displacement optical flow with deep matching. In
Computer Vision (ICCV), 2013 IEEE International Conference on (pp. 1385-1392). IEEE 38
● Deep (hierarchy) ✔
● Convolution ✔
● Learning ❌
Optical Flow: Small vs Large
Weinzaepfel, P., Revaud, J., Harchaoui, Z., & Schmid, C. (2013, December). DeepFlow: Large displacement optical flow with deep matching. In
Computer Vision (ICCV), 2013 IEEE International Conference on (pp. 1385-1392). IEEE 39
Weinzaepfel, P., Revaud, J., Harchaoui, Z., & Schmid, C. (2013, December). DeepFlow: Large displacement optical flow with deep matching. In
Computer Vision (ICCV), 2013 IEEE International Conference on (pp. 1385-1392). IEEE 40
Optical Flow
Classic approach:
Rigid matching of HoG or
SIFT descriptors
Deep Matching:
Allow each subpatch to move:
● independently
● in a limited range
depending on its size
Weinzaepfel, P., Revaud, J., Harchaoui, Z., & Schmid, C. (2013, December). DeepFlow: Large displacement optical flow with deep matching. In
Computer Vision (ICCV), 2013 IEEE International Conference on (pp. 1385-1392). IEEE 41
Optical Flow: Deep Matching
Source: Matlab R2015b documentation for normxcorr2 by MathWorks
42
Optical Flow: 2D correlation
(Figure: an image, a sub-image, and the offset of the sub-image with respect to the image [0,0])
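A brute-force NumPy sketch of the idea behind normxcorr2: slide the sub-image over the image and return the offset where the normalized cross-correlation peaks.

```python
import numpy as np

def best_offset(image, template):
    th, tw = template.shape
    t = (template - template.mean()) / (template.std() + 1e-8)
    best, score = (0, 0), -np.inf
    for y in range(image.shape[0] - th + 1):
        for x in range(image.shape[1] - tw + 1):
            w = image[y:y + th, x:x + tw]
            w = (w - w.mean()) / (w.std() + 1e-8)
            s = float((w * t).mean())        # normalized cross-correlation
            if s > score:
                best, score = (y, x), s
    return best                              # offset of the sub-image in the image
```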
Weinzaepfel, P., Revaud, J., Harchaoui, Z., & Schmid, C. (2013, December). DeepFlow: Large displacement optical flow with deep matching. In
Computer Vision (ICCV), 2013 IEEE International Conference on (pp. 1385-1392). IEEE 43
Instead of pre-trained filters, a convolution is defined between each:
● patch of the reference image
● target image
...as a result, a correlation map is generated for each reference patch (sketched below).
Optical Flow: Deep Matching
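A hedged sketch of that bottom level: each 4x4 reference patch acts as the filter of a correlation (the "convolution" above), producing one response map per patch.

```python
import numpy as np
from scipy.signal import correlate2d

def patch_response_maps(reference, target, patch=4):
    """One correlation map over the target image per 4x4 reference patch."""
    maps = {}
    for y in range(0, reference.shape[0] - patch + 1, patch):
        for x in range(0, reference.shape[1] - patch + 1, patch):
            p = reference[y:y + patch, x:x + patch]
            p = p / (np.linalg.norm(p) + 1e-8)   # the patch is the filter
            maps[(y, x)] = correlate2d(target, p, mode='valid')
    return maps
```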
Weinzaepfel, P., Revaud, J., Harchaoui, Z., & Schmid, C. (2013, December). DeepFlow: Large displacement optical flow with deep matching. In
Computer Vision (ICCV), 2013 IEEE International Conference on (pp. 1385-1392). IEEE 44
Optical Flow: Deep Matching
(Figure: the most discriminative response map vs. the least discriminative response map)
Weinzaepfel, P., Revaud, J., Harchaoui, Z., & Schmid, C. (2013, December). DeepFlow: Large displacement optical flow with deep matching. In
Computer Vision (ICCV), 2013 IEEE International Conference on (pp. 1385-1392). IEEE 45
Key idea: Build (bottom-up) a pyramid of correlation maps to run an efficient (top-down) search.
Optical Flow: Deep Matching
(Figure: pyramid of 4x4, 8x8, 16x16 and 32x32 patches; bottom-up extraction (BU) and top-down matching (TD))
Weinzaepfel, P., Revaud, J., Harchaoui, Z., & Schmid, C. (2013, December). DeepFlow: Large displacement optical flow with deep matching. In
Computer Vision (ICCV), 2013 IEEE International Conference on (pp. 1385-1392). IEEE 46
Key idea: Build (bottom-up) a pyramid of correlation maps to run an efficient (top-down) search.
Optical Flow: Deep Matching
(Figure: pyramid of 4x4, 8x8, 16x16 and 32x32 patches; bottom-up extraction (BU))
Weinzaepfel, P., Revaud, J., Harchaoui, Z., & Schmid, C. (2013, December). DeepFlow: Large displacement optical flow with deep matching. In
Computer Vision (ICCV), 2013 IEEE International Conference on (pp. 1385-1392). IEEE 47
Optical Flow: Deep Matching (BU)
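A heavily simplified, hedged sketch of the bottom-up step: the response map of an NxN patch is assembled from its four (N/2)x(N/2) children, whose maps are max-filtered (tolerating small independent motion), subsampled and averaged. The real algorithm also shifts each child map according to the child's position and applies a nonlinearity.

```python
import numpy as np
from scipy.ndimage import maximum_filter

def parent_response(child_maps):
    """Aggregate four child correlation maps into the parent's map (simplified)."""
    pooled = [maximum_filter(m, size=3)[::2, ::2] for m in child_maps]
    return np.mean(pooled, axis=0)
```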
Weinzaepfel, P., Revaud, J., Harchaoui, Z., & Schmid, C. (2013, December). DeepFlow: Large displacement optical flow with deep matching. In
Computer Vision (ICCV), 2013 IEEE International Conference on (pp. 1385-1392). IEEE 48
Key idea: Build (bottom-up) a pyramid of correlation maps to run an efficient (top-down) search.
Optical Flow: Deep Matching (TD)
(Figure: pyramid of 4x4, 8x8, 16x16 and 32x32 patches; top-down matching (TD))
Weinzaepfel, P., Revaud, J., Harchaoui, Z., & Schmid, C. (2013, December). DeepFlow: Large displacement optical flow with deep matching. In
Computer Vision (ICCV), 2013 IEEE International Conference on (pp. 1385-1392). IEEE 49
Optical Flow: Deep Matching (TD)
Each local maximum in the top layer corresponds to a shift of one of the biggest (32x32) patches.
By focusing on a local maximum, we can retrieve the corresponding responses one scale below and recover the shifts of the sub-patches that generated it.
Weinzaepfel, P., Revaud, J., Harchaoui, Z., & Schmid, C. (2013, December). DeepFlow: Large displacement optical flow with deep matching. In
Computer Vision (ICCV), 2013 IEEE International Conference on (pp. 1385-1392). IEEE 50
Optical Flow: Deep Matching (TD)
Weinzaepfel, P., Revaud, J., Harchaoui, Z., & Schmid, C. (2013, December). DeepFlow: Large displacement optical flow with deep matching. In
Computer Vision (ICCV), 2013 IEEE International Conference on (pp. 1385-1392). IEEE 51
Optical Flow: Deep Matching
Weinzaepfel, P., Revaud, J., Harchaoui, Z., & Schmid, C. (2013, December). DeepFlow: Large displacement optical flow with deep matching. In
Computer Vision (ICCV), 2013 IEEE International Conference on (pp. 1385-1392). IEEE 52
(Figure: Ground truth vs. Dense HOG [Brox & Malik 2011] vs. Deep Matching)
Optical Flow: Deep Matching
Weinzaepfel, P., Revaud, J., Harchaoui, Z., & Schmid, C. (2013, December). DeepFlow: Large displacement optical flow with deep matching. In
Computer Vision (ICCV), 2013 IEEE International Conference on (pp. 1385-1392). IEEE 53
Optical Flow: Deep Matching
Optical Flow
Dosovitskiy, A., Fischer, P., Ilg, E., Hausser, P., Hazirbas, C., Golkov, V., van der Smagt, P., Cremers, D. and Brox, T., 2015. FlowNet: Learning
Optical Flow With Convolutional Networks. In Proceedings of the IEEE International Conference on Computer Vision (pp. 2758-2766). 54
Optical Flow: FlowNet
Dosovitskiy, A., Fischer, P., Ilg, E., Hausser, P., Hazirbas, C., Golkov, V., van der Smagt, P., Cremers, D. and Brox, T., 2015. FlowNet: Learning
Optical Flow With Convolutional Networks. In Proceedings of the IEEE International Conference on Computer Vision (pp. 2758-2766). 55
Optical Flow: FlowNet
Dosovitskiy, A., Fischer, P., Ilg, E., Hausser, P., Hazirbas, C., Golkov, V., van der Smagt, P., Cremers, D. and Brox, T., 2015. FlowNet: Learning
Optical Flow With Convolutional Networks. In Proceedings of the IEEE International Conference on Computer Vision (pp. 2758-2766). 56
End-to-end supervised learning of optical flow.
Optical Flow: FlowNet (contracting)
Dosovitskiy, A., Fischer, P., Ilg, E., Hausser, P., Hazirbas, C., Golkov, V., van der Smagt, P., Cremers, D. and Brox, T., 2015. FlowNet: Learning
Optical Flow With Convolutional Networks. In Proceedings of the IEEE International Conference on Computer Vision (pp. 2758-2766). 57
Option A: Stack both input images together and feed them through a generic network.
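A minimal sketch (PyTorch) of Option A, the "FlowNetSimple" input: the two RGB frames are concatenated into a 6-channel tensor before the first convolution.

```python
import torch
import torch.nn as nn

frame_t0 = torch.randn(1, 3, 384, 512)
frame_t1 = torch.randn(1, 3, 384, 512)
stacked = torch.cat([frame_t0, frame_t1], dim=1)              # (1, 6, 384, 512)
conv1 = nn.Conv2d(6, 64, kernel_size=7, stride=2, padding=3)  # generic first layer
features = conv1(stacked)                                     # (1, 64, 192, 256)
```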
Optical Flow: FlowNet (contracting)
Dosovitskiy, A., Fischer, P., Ilg, E., Hausser, P., Hazirbas, C., Golkov, V., van der Smagt, P., Cremers, D. and Brox, T., 2015. FlowNet: Learning
Optical Flow With Convolutional Networks. In Proceedings of the IEEE International Conference on Computer Vision (pp. 2758-2766). 58
Option B: Create two separate, yet identical processing streams for the two images and combine them at a
later stage.
Optical Flow: FlowNet (contracting)
Dosovitskiy, A., Fischer, P., Ilg, E., Hausser, P., Hazirbas, C., Golkov, V., van der Smagt, P., Cremers, D. and Brox, T., 2015. FlowNet: Learning
Optical Flow With Convolutional Networks. In Proceedings of the IEEE International Conference on Computer Vision (pp. 2758-2766). 59
Option B: Create two separate, yet identical processing streams for the two images and combine them at a
later stage.
Correlation layer:
convolution of data patches from the two layers to be combined.
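A hedged sketch (patch size 1, in PyTorch) of such a correlation layer: each position of the first feature map is dot-multiplied with displaced positions of the second map, one output channel per displacement; the paper correlates small patches over a bounded displacement range.

```python
import torch
import torch.nn.functional as F

def correlation(f1, f2, max_disp=4):
    b, c, h, w = f1.shape
    f2 = F.pad(f2, [max_disp] * 4)            # zero-pad so every shift is valid
    out = []
    for dy in range(2 * max_disp + 1):
        for dx in range(2 * max_disp + 1):
            shifted = f2[:, :, dy:dy + h, dx:dx + w]
            out.append((f1 * shifted).sum(dim=1, keepdim=True) / c)
    return torch.cat(out, dim=1)              # (b, (2*max_disp+1)**2, h, w)
```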
Optical Flow: FlowNet (expanding)
Dosovitskiy, A., Fischer, P., Ilg, E., Hausser, P., Hazirbas, C., Golkov, V., van der Smagt, P., Cremers, D. and Brox, T., 2015. FlowNet: Learning
Optical Flow With Convolutional Networks. In Proceedings of the IEEE International Conference on Computer Vision (pp. 2758-2766). 60
Upconvolutional layers: unpooling feature maps + convolution.
Upconvolved feature maps are concatenated with the corresponding map from the contracting part.
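A minimal sketch (PyTorch) of one refinement step in the expanding part; the channel sizes are illustrative, not taken from the paper.

```python
import torch
import torch.nn as nn

upconv = nn.ConvTranspose2d(1024, 512, kernel_size=4, stride=2, padding=1)

def expand_step(coarse, skip):
    up = torch.relu(upconv(coarse))       # upconvolution: doubles spatial resolution
    return torch.cat([up, skip], dim=1)   # reuse features from the contracting part
```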
Optical Flow: FlowNet
Dosovitskiy, A., Fischer, P., Ilg, E., Hausser, P., Hazirbas, C., Golkov, V., van der Smagt, P., Cremers, D. and Brox, T., 2015. FlowNet: Learning
Optical Flow With Convolutional Networks. ICCV 2015 61
Since existing ground-truth datasets are not sufficiently large to train a ConvNet, a synthetic Flying Chairs dataset is generated… and augmented (translation, rotation and scaling transformations; additive Gaussian noise; changes in brightness, contrast, gamma and color).
ConvNets trained on these unrealistic data generalize well to existing datasets such as Sintel and KITTI.
Data
augmentation
Dosovitskiy, A., Fischer, P., Ilg, E., Hausser, P., Hazirbas, C., Golkov, V., van der Smagt, P., Cremers, D. and Brox, T., 2015. FlowNet: Learning
Optical Flow With Convolutional Networks. In Proceedings of the IEEE International Conference on Computer Vision (pp. 2758-2766). 62
Optical Flow: FlowNet
Outline
1. Recognition
2. Optical Flow
3. Object Tracking
4. Audio and Video
5. Generative models
63
Object tracking: MDNet
64
Nam, Hyeonseob, and Bohyung Han. "Learning multi-domain convolutional neural networks for visual tracking." ICCV VOT Workshop (2015)
Object tracking: MDNet
65
Nam, Hyeonseob, and Bohyung Han. "Learning multi-domain convolutional neural networks for visual tracking." ICCV VOT Workshop (2015)
Object tracking: MDNet: Architecture
66
Nam, Hyeonseob, and Bohyung Han. "Learning multi-domain convolutional neural networks for visual tracking." ICCV VOT Workshop (2015)
Domain-specific layers are used during training for each sequence, but are replaced by a single one at test
time.
Object tracking: MDNet: Online update
67
Nam, Hyeonseob, and Bohyung Han. "Learning multi-domain convolutional neural networks for visual tracking." ICCV VOT Workshop (2015)
MDNet is updated online at test time with hard negative mining, that is, selecting the negative samples with the highest positive score.
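A hedged sketch (PyTorch) of that mining step: among the candidate negative boxes, keep the ones the current network scores most confidently as the target.

```python
import torch

def hard_negatives(net, neg_batch, k=32):
    """neg_batch: features of candidate negative samples; net is assumed to
    output two scores per sample (background, target) in that order."""
    with torch.no_grad():
        target_scores = net(neg_batch)[:, 1]   # "positive" score of each negative
    return neg_batch[target_scores.topk(k).indices]
```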
Object tracking: FCNT
68
Wang, Lijun, Wanli Ouyang, Xiaogang Wang, and Huchuan Lu. "Visual Tracking with Fully Convolutional Networks." ICCV 2015 [code]
Object tracking: FCNT
69
Wang, Lijun, Wanli Ouyang, Xiaogang Wang, and Huchuan Lu. "Visual Tracking with Fully Convolutional Networks." In Proceedings of the IEEE
International Conference on Computer Vision, pp. 3119-3127. 2015 [code]
Focus on conv4-3 and conv5-3 of the VGG-16 network, pre-trained for ImageNet image classification.
conv4-3 conv5-3
Object tracking: FCNT: Specialization
70
Wang, Lijun, Wanli Ouyang, Xiaogang Wang, and Huchuan Lu. "Visual Tracking with Fully Convolutional Networks." In Proceedings of the IEEE
International Conference on Computer Vision, pp. 3119-3127. 2015 [code]
Most feature maps in VGG-16 conv4-3 and conv5-3 are not related to the foreground regions in a tracking
sequence.
Object tracking: FCNT: Localization
71
Wang, Lijun, Wanli Ouyang, Xiaogang Wang, and Huchuan Lu. "Visual Tracking with Fully Convolutional Networks." In Proceedings of the IEEE
International Conference on Computer Vision, pp. 3119-3127. 2015 [code]
Although trained for image classification, feature maps in conv5-3 enable object localization…
...but they are not discriminative enough to distinguish different objects of the same category.
Object tracking: Localization
72
Zhou, Bolei, Aditya Khosla, Agata Lapedriza, Aude Oliva, and Antonio Torralba. "Object detectors emerge in deep scene cnns." ICLR 2015.
Other works have shown how feature maps in convolutional layers allow object localization.
Object tracking: FCNT: Localization
73
Wang, Lijun, Wanli Ouyang, Xiaogang Wang, and Huchuan Lu. "Visual Tracking with Fully Convolutional Networks." In Proceedings of the IEEE
International Conference on Computer Vision, pp. 3119-3127. 2015 [code]
On the other hand, feature maps from conv4-3 are more sensitive to intra-class appearance variation…
conv4-3 conv5-3
Object tracking: FCNT: Architecture
74
Wang, Lijun, Wanli Ouyang, Xiaogang Wang, and Huchuan Lu. "Visual Tracking with Fully Convolutional Networks." In Proceedings of the IEEE
International Conference on Computer Vision, pp. 3119-3127. 2015 [code]
SNet=Specific Network (online update)
GNet=General Network (fixed)
Object tracking: FCNT: Results
75
Wang, Lijun, Wanli Ouyang, Xiaogang Wang, and Huchuan Lu. "Visual Tracking with Fully Convolutional Networks." In Proceedings of the IEEE
International Conference on Computer Vision, pp. 3119-3127. 2015 [code]
Outline
1. Recognition
2. Optical Flow
3. Object Tracking
4. Audio and Video
5. Generative models
76
77
Audio and Video
Audio Vision
78
Audio and Video: Soundnet
Aytar, Yusuf, Carl Vondrick, and Antonio Torralba. "Soundnet: Learning sound representations from
unlabeled video." In Advances in Neural Information Processing Systems, pp. 892-900. 2016.
Object and scene recognition in videos by analysing the audio track (only).
79
Aytar, Yusuf, Carl Vondrick, and Antonio Torralba. "Soundnet: Learning sound representations from
unlabeled video." NIPS 2016.
Videos for training are unlabeled; the approach relies on CNNs trained on labeled images.
Audio and Video: Soundnet
80
Aytar, Yusuf, Carl Vondrick, and Antonio Torralba. "Soundnet: Learning sound representations from
unlabeled video." NIPS 2016.
Videos for training are unlabeled; the approach relies on CNNs trained on labeled images.
Audio and Video: Soundnet
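A hedged sketch (PyTorch) of this teacher-student transfer: the audio network is trained so that its class posteriors match those a pretrained vision CNN assigns to frames of the same unlabeled video, e.g. with a KL-divergence loss.

```python
import torch.nn.functional as F

def soundnet_loss(audio_logits, teacher_probs):
    """teacher_probs: softmax output of the pretrained image CNN on video frames."""
    return F.kl_div(F.log_softmax(audio_logits, dim=1),
                    teacher_probs, reduction='batchmean')
```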
81
Aytar, Yusuf, Carl Vondrick, and Antonio Torralba. "Soundnet: Learning sound representations from
unlabeled video." In Advances in Neural Information Processing Systems, pp. 892-900. 2016.
Audio and Video: Soundnet
82
Aytar, Yusuf, Carl Vondrick, and Antonio Torralba. "Soundnet: Learning sound representations from
unlabeled video." NIPS 2016.
Hidden layers of Soundnet are used to train a standard SVM classifier that outperforms the state of the art.
Audio and Video: Soundnet
83
Aytar, Yusuf, Carl Vondrick, and Antonio Torralba. "Soundnet: Learning sound representations from
unlabeled video." NIPS 2016.
Visualization of the 1D filters over raw audio in conv1.
Audio and Video: Soundnet
84
Aytar, Yusuf, Carl Vondrick, and Antonio Torralba. "Soundnet: Learning sound representations from
unlabeled video." NIPS 2016.
Visualization of the 1D filters over raw audio in conv1.
Audio and Video: Soundnet
85
Aytar, Yusuf, Carl Vondrick, and Antonio Torralba. "Soundnet: Learning sound representations from
unlabeled video." NIPS 2016.
Visualization of the 1D filters over raw audio in conv1.
Audio and Video: Soundnet
86
Aytar, Yusuf, Carl Vondrick, and Antonio Torralba. "Soundnet: Learning sound representations from
unlabeled video." In Advances in Neural Information Processing Systems, pp. 892-900. 2016.
Visualization of the video frames associated with the sounds that activate some of the last hidden units (conv7):
Audio and Video: Soundnet
87
Audio and Video: Sonorization
Owens, Andrew, Phillip Isola, Josh McDermott, Antonio Torralba, Edward H. Adelson, and William T.
Freeman. "Visually indicated sounds." CVPR 2016.
Learning to synthesize sounds from videos of people hitting objects with a drumstick.
88
Audio and Video: Visual Sounds
Owens, Andrew, Phillip Isola, Josh McDermott, Antonio Torralba, Edward H. Adelson, and William T.
Freeman. "Visually indicated sounds." CVPR 2016.
Not end-to-end
89
Audio and Video: Visual Sounds
Owens, Andrew, Phillip Isola, Josh McDermott, Antonio Torralba, Edward H. Adelson, and William T.
Freeman. "Visually indicated sounds." CVPR 2016.
90
Learn more
Ruslan Salakhutdinov, “Multimodal Machine Learning” (NIPS 2015 Workshop)
Generative models for Video
91
Slides
D2L5 by Santi Pascual.
92
What are Generative Models?
We want our model, with parameters θ = {weights, biases} and outputs distributed as Pmodel, to estimate the distribution of our training data, Pdata.
Example: y = f(x), where y is a scalar; we make Pmodel similar to Pdata by training the parameters θ to maximize their similarity.
Key idea: our model cares about what distribution generated the input data points, and we want to mimic it with our probabilistic model. Our learned model should be able to make up new samples from the distribution, not just copy and paste existing samples!
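Written as an equation (a standard maximum-likelihood formulation, added here for clarity rather than taken from the slide):

```latex
\theta^{*} = \arg\max_{\theta}\; \mathbb{E}_{x \sim P_{\text{data}}}\left[\log P_{\text{model}}(x;\theta)\right]
```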
93
What are Generative Models?
Figure from NIPS 2016 Tutorial: Generative Adversarial Networks (I. Goodfellow)
94
Video Frame Prediction
Mathieu, Michael, Camille Couprie, and Yann LeCun. "Deep multi-scale video prediction beyond mean square error." ICLR
2016
95
Video Frame Prediction
Mathieu, Michael, Camille Couprie, and Yann LeCun. "Deep multi-scale video prediction beyond mean square error." ICLR
2016
96
Video Frame Prediction
Mathieu, Michael, Camille Couprie, and Yann LeCun. "Deep multi-scale video prediction beyond mean square error." ICLR
2016
Adversarial Training analogy
Imagine we have a counterfeiter (G) trying to make fake money, and the police (D)
has to detect whether money is real or fake.
(Figure: fake vs. real $100 bills) “It’s not even green”
Adversarial Training analogy
Imagine we have a counterfeiter (G) trying to make fake money, and the police (D)
has to detect whether money is real or fake.
(Figure: fake vs. real $100 bills) “There is no watermark”
Adversarial Training analogy
Imagine we have a counterfeiter (G) trying to make fake money, and the police (D)
has to detect whether money is real or fake.
(Figure: fake vs. real $100 bills) “Watermark should be rounded”
Adversarial Training analogy
Imagine we have a counterfeiter (G) trying to make fake money, and the police (D)
has to detect whether money is real or fake.
After enough iterations, and if the counterfeiter is good enough (in terms of the G network, this means “has enough parameters”), the police should be confused.
Adversarial Training (batch update)
● Pick a sample x from training set
● Show x to D and update weights to
output 1 (real)
Adversarial Training (batch update)
● G maps sample z to x̂
● Show x̂ to D and update weights to output 0 (fake)
Adversarial Training (batch update)
● Freeze D weights
● Update G weights to make D output 1 (just G weights!)
● Unfreeze D Weights and repeat
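A minimal sketch (PyTorch) of the batch update just described; `G`, `D` and their optimizers are assumed to be defined elsewhere, with D ending in a sigmoid.

```python
import torch
import torch.nn.functional as F

def gan_step(G, D, opt_D, opt_G, x, z):
    # 1) Show real x to D and push its output towards 1
    # 2) Show generated G(z) to D and push its output towards 0
    d_real, d_fake = D(x), D(G(z).detach())
    loss_D = F.binary_cross_entropy(d_real, torch.ones_like(d_real)) + \
             F.binary_cross_entropy(d_fake, torch.zeros_like(d_fake))
    opt_D.zero_grad(); loss_D.backward(); opt_D.step()

    # 3) Freeze D (only G's weights are updated) and make D output 1 on fakes
    d_fake = D(G(z))
    loss_G = F.binary_cross_entropy(d_fake, torch.ones_like(d_fake))
    opt_G.zero_grad(); loss_G.backward(); opt_G.step()
```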
104
Generative Adversarial Networks (GANs)
Slide credit: Víctor Garcia
(Figure: a random seed (z) feeds the Generator G(·); the Discriminator D(·) classifies samples from the real world vs. generated ones as Real/Synthetic)
105 Slide credit: Víctor Garcia
Conditional Adversarial Networks
(Figure: as above, but both the Generator G(·) and the Discriminator D(·) additionally receive a condition alongside the real-world/synthetic samples)
Generative Adversarial Networks (GANs)
Generating images/frames
(Radford et al. 2015)
Deep Conv. GAN (DCGAN) effectively generated 64x64 RGB images in a single shot, for example bedrooms from the LSUN dataset.
Generating images/frames conditioned on captions
(Reed et al. 2016b) (Zhang et al. 2016)
Unsupervised feature extraction/learning representations
Similarly to word2vec, GANs learn a distributed representation that disentangles
concepts such that we can perform operations on the data manifold:
v(Man with glasses) - v(man) + v(woman) = v(woman with glasses)
(Radford et al. 2015)
Image super-resolution
Bicubic: not using data statistics. SRResNet: trained with MSE. SRGAN is able to
understand that there are multiple correct answers, rather than averaging.
(Ledig et al. 2016)
Image super-resolution
Averaging is a serious problem we face when dealing with complex distributions.
(Ledig et al. 2016)
Manipulating images and assisted content creation
https://youtu.be/9c4z6YsBGQ0?t=126 https://youtu.be/9c4z6YsBGQ0?t=161
(Zhu et al. 2016)
112
Adversarial Networks
Slide credit: Víctor Garcia
Isola, Phillip, Jun-Yan Zhu, Tinghui Zhou, and Alexei A. Efros. "Image-to-image translation with
conditional adversarial networks." arXiv preprint arXiv:1611.07004 (2016).
(Figure: the Generator produces generated pairs; the Discriminator compares them with real-world ground-truth pairs; Loss → BCE)
113 Víctor Garcia and Xavier Giró-i-Nieto (work in progress)
(Figure: Generator and Discriminator trained with a GAN loss {binary cross-entropy}, labeling samples 1/0)
Generative Adversarial Networks (GANs)
Generative models for video
114
Vondrick, Carl, Hamed Pirsiavash, and Antonio Torralba. "Generating videos with scene dynamics." NIPS 2016.
Generative models for video
115
Vondrick, Carl, Hamed Pirsiavash, and Antonio Torralba. "Generating videos with scene dynamics." NIPS 2016.
116
Adversarial Networks
Goodfellow, Ian, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua
Bengio. "Generative adversarial nets." NIPS 2014
Goodfellow, Ian. "NIPS 2016 Tutorial: Generative Adversarial Networks." arXiv preprint arXiv:1701.00160 (2016).
F. Van Veen, “The Neural Network Zoo” (2016)
Outline
1. Recognition
2. Optical Flow
3. Object Tracking
4. Audio and Video
5. Generative models
117
118
Thank you!
https://imatge.upc.edu/web/people/xavier-giro
https://twitter.com/DocXavi
https://www.facebook.com/ProfessorXavi
xavier.giro@upc.edu
Xavier Giró-i-Nieto
[Part B: Video and audio]
More Related Content

PDF
Deep Convnets for Video Processing (Master in Computer Vision Barcelona, 2016)
PDF
Deep Learning for Computer Vision (3/4): Video Analytics @ laSalle 2016
PDF
Deep Learning for Video: Action Recognition (UPC 2018)
PDF
Deep Learning for Computer Vision: Video Analytics (UPC 2016)
PDF
Multimodal Deep Learning (D4L4 Deep Learning for Speech and Language UPC 2017)
PDF
Speaker ID II (D4L1 Deep Learning for Speech and Language UPC 2017)
PDF
Video Analysis (D4L2 2017 UPC Deep Learning for Computer Vision)
PDF
Advanced Deep Architectures (D2L6 Deep Learning for Speech and Language UPC 2...
Deep Convnets for Video Processing (Master in Computer Vision Barcelona, 2016)
Deep Learning for Computer Vision (3/4): Video Analytics @ laSalle 2016
Deep Learning for Video: Action Recognition (UPC 2018)
Deep Learning for Computer Vision: Video Analytics (UPC 2016)
Multimodal Deep Learning (D4L4 Deep Learning for Speech and Language UPC 2017)
Speaker ID II (D4L1 Deep Learning for Speech and Language UPC 2017)
Video Analysis (D4L2 2017 UPC Deep Learning for Computer Vision)
Advanced Deep Architectures (D2L6 Deep Learning for Speech and Language UPC 2...

What's hot (20)

PDF
Deep Learning for Computer Vision (2/4): Object Analytics @ laSalle 2016
PDF
Deep Video Object Tracking - Xavier Giro - UPC Barcelona 2019
PDF
Learning Representations for Sign Language Videos - Xavier Giro - NIST TRECVI...
PDF
One Perceptron to Rule Them All: Language and Vision
PDF
Learning with Videos (D4L4 2017 UPC Deep Learning for Computer Vision)
PDF
Neural Architectures for Still Images - Xavier Giro- UPC Barcelona 2019
PDF
Closing, Course Offer 17/18 & Homework (D5 2017 UPC Deep Learning for Compute...
PDF
Neural Architectures for Video Encoding
PDF
Self-Supervised Audio-Visual Learning - Xavier Giro - UPC TelecomBCN Barcelon...
PDF
Deep Video Object Tracking 2020 - Xavier Giro - UPC TelecomBCN Barcelona
PDF
Deep Video Object Segmentation - Xavier Giro - UPC Barcelona 2019
PDF
Self-supervised Audiovisual Learning - Xavier Giro - UPC Barcelona 2019
PDF
Deep Learning Architectures for Video - Xavier Giro - UPC Barcelona 2019
PDF
Wav2Pix: Speech-conditioned face generation using Generative Adversarial Netw...
PDF
Self-supervised Audiovisual Learning 2020 - Xavier Giro-i-Nieto - UPC Telecom...
PDF
Welcome (D1L1 2017 UPC Deep Learning for Computer Vision)
PDF
Language and Vision with Deep Learning - Xavier Giró - ACM ICMR 2020 (Tutorial)
PDF
Deep Learning for Computer Vision (1/4): Image Analytics @ laSalle 2016
PDF
Action Recognitionの歴史と最新動向
PDF
Deep Learning for Computer Vision: ImageNet Challenge (UPC 2016)
Deep Learning for Computer Vision (2/4): Object Analytics @ laSalle 2016
Deep Video Object Tracking - Xavier Giro - UPC Barcelona 2019
Learning Representations for Sign Language Videos - Xavier Giro - NIST TRECVI...
One Perceptron to Rule Them All: Language and Vision
Learning with Videos (D4L4 2017 UPC Deep Learning for Computer Vision)
Neural Architectures for Still Images - Xavier Giro- UPC Barcelona 2019
Closing, Course Offer 17/18 & Homework (D5 2017 UPC Deep Learning for Compute...
Neural Architectures for Video Encoding
Self-Supervised Audio-Visual Learning - Xavier Giro - UPC TelecomBCN Barcelon...
Deep Video Object Tracking 2020 - Xavier Giro - UPC TelecomBCN Barcelona
Deep Video Object Segmentation - Xavier Giro - UPC Barcelona 2019
Self-supervised Audiovisual Learning - Xavier Giro - UPC Barcelona 2019
Deep Learning Architectures for Video - Xavier Giro - UPC Barcelona 2019
Wav2Pix: Speech-conditioned face generation using Generative Adversarial Netw...
Self-supervised Audiovisual Learning 2020 - Xavier Giro-i-Nieto - UPC Telecom...
Welcome (D1L1 2017 UPC Deep Learning for Computer Vision)
Language and Vision with Deep Learning - Xavier Giró - ACM ICMR 2020 (Tutorial)
Deep Learning for Computer Vision (1/4): Image Analytics @ laSalle 2016
Action Recognitionの歴史と最新動向
Deep Learning for Computer Vision: ImageNet Challenge (UPC 2016)
Ad

Similar to Video Analysis with Convolutional Neural Networks (Master Computer Vision Barcelona 2017) (20)

PPTX
Reading group - Week 2 - Trajectory Pooled Deep-Convolutional Descriptors (TDD)
DOCX
Large-scale Video Classification with Convolutional Neural Net.docx
PPTX
Learning spatiotemporal features with 3 d convolutional networks
PDF
IRJET-Multiple Object Detection using Deep Neural Networks
PDF
物件偵測與辨識技術
PDF
Human Behavior Understanding: From Human-Oriented Analysis to Action Recognit...
PPTX
ACTION-Net_Multipath_Excitation_for_Action_Recognition.pptx
PPTX
Deep Learning for Image Processing on 16 June 2025 MITS.pptx
PPTX
Data Con LA 2019 - State of the Art of Innovation in Computer Vision by Chris...
PDF
Content-Based Image Retrieval (D2L6 Insight@DCU Machine Learning Workshop 2017)
PPTX
Large-scale Video Classification with Convolutional Neural Network
PPTX
2014 - CVPR Tutorial on Deep Learning for Vision - Object Detection.pptx
PPTX
FINAL_Team_4.pptx
PDF
Image Classification on ImageNet (D1L3 Insight@DCU Machine Learning Workshop ...
PDF
Deep and Young Vision Learning at UPC BarcelonaTech (NIPS 2016)
PDF
Overview of Video Concept Detection using (CNN) Convolutional Neural Network
PDF
Video Classification: Human Action Recognition on HMDB-51 dataset
PDF
IRJET- Identification of Scene Images using Convolutional Neural Networks - A...
PDF
Deep Neural Networks Presentation
PDF
Efficient video perception through AI
Reading group - Week 2 - Trajectory Pooled Deep-Convolutional Descriptors (TDD)
Large-scale Video Classification with Convolutional Neural Net.docx
Learning spatiotemporal features with 3 d convolutional networks
IRJET-Multiple Object Detection using Deep Neural Networks
物件偵測與辨識技術
Human Behavior Understanding: From Human-Oriented Analysis to Action Recognit...
ACTION-Net_Multipath_Excitation_for_Action_Recognition.pptx
Deep Learning for Image Processing on 16 June 2025 MITS.pptx
Data Con LA 2019 - State of the Art of Innovation in Computer Vision by Chris...
Content-Based Image Retrieval (D2L6 Insight@DCU Machine Learning Workshop 2017)
Large-scale Video Classification with Convolutional Neural Network
2014 - CVPR Tutorial on Deep Learning for Vision - Object Detection.pptx
FINAL_Team_4.pptx
Image Classification on ImageNet (D1L3 Insight@DCU Machine Learning Workshop ...
Deep and Young Vision Learning at UPC BarcelonaTech (NIPS 2016)
Overview of Video Concept Detection using (CNN) Convolutional Neural Network
Video Classification: Human Action Recognition on HMDB-51 dataset
IRJET- Identification of Scene Images using Convolutional Neural Networks - A...
Deep Neural Networks Presentation
Efficient video perception through AI
Ad

More from Universitat Politècnica de Catalunya (20)

PDF
Deep Generative Learning for All - The Gen AI Hype (Spring 2024)
PDF
Deep Generative Learning for All
PDF
The Transformer in Vision | Xavier Giro | Master in Computer Vision Barcelona...
PDF
Towards Sign Language Translation & Production | Xavier Giro-i-Nieto
PDF
The Transformer - Xavier Giró - UPC Barcelona 2021
PDF
Open challenges in sign language translation and production
PPTX
Generation of Synthetic Referring Expressions for Object Segmentation in Videos
PPTX
Discovery and Learning of Navigation Goals from Pixels in Minecraft
PDF
Learn2Sign : Sign language recognition and translation using human keypoint e...
PDF
Intepretability / Explainable AI for Deep Neural Networks
PDF
Convolutional Neural Networks - Xavier Giro - UPC TelecomBCN Barcelona 2020
PDF
Attention for Deep Learning - Xavier Giro - UPC TelecomBCN Barcelona 2020
PDF
Generative Adversarial Networks GAN - Xavier Giro - UPC TelecomBCN Barcelona ...
PDF
Q-Learning with a Neural Network - Xavier Giró - UPC Barcelona 2020
PDF
Image Segmentation with Deep Learning - Xavier Giro & Carles Ventura - ISSonD...
PDF
Curriculum Learning for Recurrent Video Object Segmentation
PDF
Deep Self-supervised Learning for All - Xavier Giro - X-Europe 2020
PDF
Deep Learning Representations for All - Xavier Giro-i-Nieto - IRI Barcelona 2020
PDF
Transcription-Enriched Joint Embeddings for Spoken Descriptions of Images and...
PDF
Object Detection with Deep Learning - Xavier Giro-i-Nieto - UPC School Barcel...
Deep Generative Learning for All - The Gen AI Hype (Spring 2024)
Deep Generative Learning for All
The Transformer in Vision | Xavier Giro | Master in Computer Vision Barcelona...
Towards Sign Language Translation & Production | Xavier Giro-i-Nieto
The Transformer - Xavier Giró - UPC Barcelona 2021
Open challenges in sign language translation and production
Generation of Synthetic Referring Expressions for Object Segmentation in Videos
Discovery and Learning of Navigation Goals from Pixels in Minecraft
Learn2Sign : Sign language recognition and translation using human keypoint e...
Intepretability / Explainable AI for Deep Neural Networks
Convolutional Neural Networks - Xavier Giro - UPC TelecomBCN Barcelona 2020
Attention for Deep Learning - Xavier Giro - UPC TelecomBCN Barcelona 2020
Generative Adversarial Networks GAN - Xavier Giro - UPC TelecomBCN Barcelona ...
Q-Learning with a Neural Network - Xavier Giró - UPC Barcelona 2020
Image Segmentation with Deep Learning - Xavier Giro & Carles Ventura - ISSonD...
Curriculum Learning for Recurrent Video Object Segmentation
Deep Self-supervised Learning for All - Xavier Giro - X-Europe 2020
Deep Learning Representations for All - Xavier Giro-i-Nieto - IRI Barcelona 2020
Transcription-Enriched Joint Embeddings for Spoken Descriptions of Images and...
Object Detection with Deep Learning - Xavier Giro-i-Nieto - UPC School Barcel...

Recently uploaded (20)

PPTX
STERILIZATION AND DISINFECTION-1.ppthhhbx
PDF
Transcultural that can help you someday.
PPT
Miokarditis (Inflamasi pada Otot Jantung)
PPTX
Introduction to Basics of Ethical Hacking and Penetration Testing -Unit No. 1...
PPT
Predictive modeling basics in data cleaning process
PPTX
Introduction to Firewall Analytics - Interfirewall and Transfirewall.pptx
PPTX
IBA_Chapter_11_Slides_Final_Accessible.pptx
PPTX
modul_python (1).pptx for professional and student
PPTX
Data_Analytics_and_PowerBI_Presentation.pptx
PDF
Introduction to Data Science and Data Analysis
PDF
Business Analytics and business intelligence.pdf
PPTX
Database Infoormation System (DBIS).pptx
PDF
Clinical guidelines as a resource for EBP(1).pdf
PPT
Reliability_Chapter_ presentation 1221.5784
PDF
168300704-gasification-ppt.pdfhghhhsjsjhsuxush
PPTX
STUDY DESIGN details- Lt Col Maksud (21).pptx
PPTX
The THESIS FINAL-DEFENSE-PRESENTATION.pptx
PPTX
01_intro xxxxxxxxxxfffffffffffaaaaaaaaaaafg
PPT
Quality review (1)_presentation of this 21
PPTX
Acceptance and paychological effects of mandatory extra coach I classes.pptx
STERILIZATION AND DISINFECTION-1.ppthhhbx
Transcultural that can help you someday.
Miokarditis (Inflamasi pada Otot Jantung)
Introduction to Basics of Ethical Hacking and Penetration Testing -Unit No. 1...
Predictive modeling basics in data cleaning process
Introduction to Firewall Analytics - Interfirewall and Transfirewall.pptx
IBA_Chapter_11_Slides_Final_Accessible.pptx
modul_python (1).pptx for professional and student
Data_Analytics_and_PowerBI_Presentation.pptx
Introduction to Data Science and Data Analysis
Business Analytics and business intelligence.pdf
Database Infoormation System (DBIS).pptx
Clinical guidelines as a resource for EBP(1).pdf
Reliability_Chapter_ presentation 1221.5784
168300704-gasification-ppt.pdfhghhhsjsjhsuxush
STUDY DESIGN details- Lt Col Maksud (21).pptx
The THESIS FINAL-DEFENSE-PRESENTATION.pptx
01_intro xxxxxxxxxxfffffffffffaaaaaaaaaaafg
Quality review (1)_presentation of this 21
Acceptance and paychological effects of mandatory extra coach I classes.pptx

Video Analysis with Convolutional Neural Networks (Master Computer Vision Barcelona 2017)

  • 1. @DocXavi Module 4 - Lecture 6 Video Analysis with CNNs 31 January 2017 Xavier Giró-i-Nieto [http://pagines.uab.cat/mcv/]
  • 4. Outline 1. Recognition 2. Optical Flow 3. Object Tracking 4. Audio and Video 5. Generative models 4
  • 5. Recognition Demo: Clarifai MIT Technology Review : “A start-up’s Neural Network Can Understand Video” (3/2/2015) 5
  • 6. Figure: Karpathy, A., Toderici, G., Shetty, S., Leung, T., Sukthankar, R., & Fei-Fei, L. (2014, June). Large-scale video classification with convolutional neural networks. In Computer Vision and Pattern Recognition (CVPR), 2014 IEEE Conference on (pp. 1725-1732). IEEE. 6 Recognition
  • 7. 7 Recognition Figure: Tran, Du, Lubomir Bourdev, Rob Fergus, Lorenzo Torresani, and Manohar Paluri. "Learning spatiotemporal features with 3D convolutional networks." In Proceedings of the IEEE International Conference on Computer Vision, pp. 4489-4497. 2015
  • 8. 8 Recognition Figure: Tran, Du, Lubomir Bourdev, Rob Fergus, Lorenzo Torresani, and Manohar Paluri. "Learning spatiotemporal features with 3D convolutional networks." In Proceedings of the IEEE International Conference on Computer Vision, pp. 4489-4497. 2015 Previous lectures
  • 9. 9 Recognition Figure: Tran, Du, Lubomir Bourdev, Rob Fergus, Lorenzo Torresani, and Manohar Paluri. "Learning spatiotemporal features with 3D convolutional networks." In Proceedings of the IEEE International Conference on Computer Vision, pp. 4489-4497. 2015
  • 10. Karpathy, A., Toderici, G., Shetty, S., Leung, T., Sukthankar, R., & Fei-Fei, L. (2014, June). Large-scale video classification with convolutional neural networks. In Computer Vision and Pattern Recognition (CVPR), 2014 IEEE Conference on (pp. 1725-1732). IEEE. Slides extracted from ReadCV seminar by Victor Campos 10 Recognition: DeepVideo
  • 11. Karpathy, A., Toderici, G., Shetty, S., Leung, T., Sukthankar, R., & Fei-Fei, L. (2014, June). Large-scale video classification with convolutional neural networks. In Computer Vision and Pattern Recognition (CVPR), 2014 IEEE Conference on (pp. 1725-1732). IEEE. 11 Recognition: DeepVideo: Demo
  • 12. Karpathy, A., Toderici, G., Shetty, S., Leung, T., Sukthankar, R., & Fei-Fei, L. (2014, June). Large-scale video classification with convolutional neural networks. In Computer Vision and Pattern Recognition (CVPR), 2014 IEEE Conference on (pp. 1725-1732). IEEE. 12 Recognition: DeepVideo: Architectures
  • 13. Karpathy, A., Toderici, G., Shetty, S., Leung, T., Sukthankar, R., & Fei-Fei, L. (2014, June). Large-scale video classification with convolutional neural networks. In Computer Vision and Pattern Recognition (CVPR), 2014 IEEE Conference on (pp. 1725-1732). IEEE. 13 Unsupervised learning [Le at al’11] Supervised learning [Karpathy et al’14] Recognition: DeepVideo: Features
  • 14. Karpathy, A., Toderici, G., Shetty, S., Leung, T., Sukthankar, R., & Fei-Fei, L. (2014, June). Large-scale video classification with convolutional neural networks. In Computer Vision and Pattern Recognition (CVPR), 2014 IEEE Conference on (pp. 1725-1732). IEEE. 14 Recognition: DeepVideo: Multiscale
  • 15. Karpathy, A., Toderici, G., Shetty, S., Leung, T., Sukthankar, R., & Fei-Fei, L. (2014, June). Large-scale video classification with convolutional neural networks. In Computer Vision and Pattern Recognition (CVPR), 2014 IEEE Conference on (pp. 1725-1732). IEEE. 15 Recognition: DeepVideo: Results
  • 16. 16 Recognition Figure: Tran, Du, Lubomir Bourdev, Rob Fergus, Lorenzo Torresani, and Manohar Paluri. "Learning spatiotemporal features with 3D convolutional networks." In Proceedings of the IEEE International Conference on Computer Vision, pp. 4489-4497. 2015
  • 17. 17 Recognition: C3D Figure: Tran, Du, Lubomir Bourdev, Rob Fergus, Lorenzo Torresani, and Manohar Paluri. "Learning spatiotemporal features with 3D convolutional networks." In Proceedings of the IEEE International Conference on Computer Vision, pp. 4489-4497. 2015
  • 18. 18 Tran, Du, Lubomir Bourdev, Rob Fergus, Lorenzo Torresani, and Manohar Paluri. "Learning spatiotemporal features with 3D convolutional networks." In Proceedings of the IEEE International Conference on Computer Vision, pp. 4489-4497. 2015 Recognition: C3D: Demo
  • 19. 19 K. Simonyan, A. Zisserman, Very Deep Convolutional Networks for Large-Scale Image Recognition ICLR 2015. Recognition: C3D: Spatial dimension Spatial dimensions (XY) of the used kernels are fixed to 3x3, following Symonian & Zisserman (ICLR 2015).
  • 20. 20 Tran, Du, Lubomir Bourdev, Rob Fergus, Lorenzo Torresani, and Manohar Paluri. "Learning spatiotemporal features with 3D convolutional networks." In Proceedings of the IEEE International Conference on Computer Vision, pp. 4489-4497. 2015 Recognition: C3D: Temporal dimension 3D ConvNets are more suitable for spatiotemporal feature learning compared to 2D ConvNets Temporal depth 2D ConvNets
  • 21. 21 Tran, Du, Lubomir Bourdev, Rob Fergus, Lorenzo Torresani, and Manohar Paluri. "Learning spatiotemporal features with 3D convolutional networks." In Proceedings of the IEEE International Conference on Computer Vision, pp. 4489-4497. 2015 A homogeneous architecture with small 3 × 3 × 3 convolution kernels in all layers is among the best performing architectures for 3D ConvNets Recognition: C3D: Temporal dimension
  • 22. 22 Tran, Du, Lubomir Bourdev, Rob Fergus, Lorenzo Torresani, and Manohar Paluri. "Learning spatiotemporal features with 3D convolutional networks." In Proceedings of the IEEE International Conference on Computer Vision, pp. 4489-4497. 2015 No gain when varying the temporal depth across layers. Recognition: C3D: Temporal dimension
  • 23. 23 Tran, Du, Lubomir Bourdev, Rob Fergus, Lorenzo Torresani, and Manohar Paluri. "Learning spatiotemporal features with 3D convolutional networks." In Proceedings of the IEEE International Conference on Computer Vision, pp. 4489-4497. 2015 Recognition: C3D: Architecture Feature vector
  • 24. 24 Tran, Du, Lubomir Bourdev, Rob Fergus, Lorenzo Torresani, and Manohar Paluri. "Learning spatiotemporal features with 3D convolutional networks." In Proceedings of the IEEE International Conference on Computer Vision, pp. 4489-4497. 2015 Recognition: C3D: Feature vector Video sequence 16 frames-long clips 8 frames-long overlap
  • 25. 25 Tran, Du, Lubomir Bourdev, Rob Fergus, Lorenzo Torresani, and Manohar Paluri. "Learning spatiotemporal features with 3D convolutional networks." In Proceedings of the IEEE International Conference on Computer Vision, pp. 4489-4497. 2015 Recognition: C3D: Feature vector 16-frame clip 16-frame clip 16-frame clip 16-frame clip ... Average 4096-dimvideodescriptor 4096-dimvideodescriptor L2 norm
  • 26. 26 Tran, Du, Lubomir Bourdev, Rob Fergus, Lorenzo Torresani, and Manohar Paluri. "Learning spatiotemporal features with 3D convolutional networks." In Proceedings of the IEEE International Conference on Computer Vision, pp. 4489-4497. 2015 Recognition: C3D: Visualization Based on Deconvnets by Zeiler and Fergus [ECCV 2014] - See [ReadCV Slides] for more details.
  • 27. 27 Tran, Du, Lubomir Bourdev, Rob Fergus, Lorenzo Torresani, and Manohar Paluri. "Learning spatiotemporal features with 3D convolutional networks." In Proceedings of the IEEE International Conference on Computer Vision, pp. 4489-4497. 2015 Recognition: C3D: Compactness
  • 28. 28 Tran, Du, Lubomir Bourdev, Rob Fergus, Lorenzo Torresani, and Manohar Paluri. "Learning spatiotemporal features with 3D convolutional networks." In Proceedings of the IEEE International Conference on Computer Vision, pp. 4489-4497. 2015 Convolutional 3D(C3D) combined with a simple linear classifier outperforms state-of-the-art methods on 4 different benchmarks and are comparable with state of the art methods on other 2 benchmarks Recognition: C3D: Performance
  • 29. 29 Tran, Du, Lubomir Bourdev, Rob Fergus, Lorenzo Torresani, and Manohar Paluri. "Learning spatiotemporal features with 3D convolutional networks." In Proceedings of the IEEE International Conference on Computer Vision, pp. 4489-4497. 2015 Recognition: C3D: Software Implementation by Michael Gygli (GitHub)
  • 30. 30Simonyan, Karen, and Andrew Zisserman. "Two-stream convolutional networks for action recognition in videos." 2014. Recognition: Two stream Two CNNs in paralel: ● One for RGB images ● One for Optical flow (hand-crafted features) Fusion after the softmax layer
  • 31. 31Feichtenhofer, Christoph, Axel Pinz, and Andrew Zisserman. "Convolutional two-stream network fusion for video action recognition." CVPR 2016. [code] Recognition: Two stream Two CNNs in paralel: ● One for RGB images ● One for Optical flow (hand-crafted features) Fusion at a convolutional layer
  • 32. 32 Shou, Zheng, Dongang Wang, and Shih-Fu Chang. "Temporal action localization in untrimmed videos via multi-stage cnns." CVPR 2016. (Slidecast and Slides by Alberto Montes) Recognition: Localization
  • 33. 33 Recognition: Localization Shou, Zheng, Dongang Wang, and Shih-Fu Chang. "Temporal action localization in untrimmed videos via multi-stage cnns." CVPR 2016. (Slidecast and Slides by Alberto Montes)
  • 34. 34 Recognition: Localization Shou, Zheng, Dongang Wang, and Shih-Fu Chang. "Temporal action localization in untrimmed videos via multi-stage cnns." CVPR 2016. (Slidecast and Slides by Alberto Montes)
  • 35. Outline 1. Recognition 2. Optical Flow 3. Object Tracking 4. Audio and Video 5. Generative models 35
  • 36. Optical Flow Weinzaepfel, P., Revaud, J., Harchaoui, Z., & Schmid, C. (2013, December). DeepFlow: Large displacement optical flow with deep matching. In Computer Vision (ICCV), 2013 IEEE International Conference on (pp. 1385-1392). IEEE 36
  • 37. Optical Flow: DeepFlow Weinzaepfel, P., Revaud, J., Harchaoui, Z., & Schmid, C. (2013, December). DeepFlow: Large displacement optical flow with deep matching. In Computer Vision (ICCV), 2013 IEEE International Conference on (pp. 1385-1392). IEEE 37 Andrei Bursuc Postoc INRIA @abursuc
  • 38. Optical Flow: DeepFlow Weinzaepfel, P., Revaud, J., Harchaoui, Z., & Schmid, C. (2013, December). DeepFlow: Large displacement optical flow with deep matching. In Computer Vision (ICCV), 2013 IEEE International Conference on (pp. 1385-1392). IEEE 38 ● Deep (hierarchy) ✔ ● Convolution ✔ ● Learning ❌
  • 39. Optical Flow: Small vs Large Weinzaepfel, P., Revaud, J., Harchaoui, Z., & Schmid, C. (2013, December). DeepFlow: Large displacement optical flow with deep matching. In Computer Vision (ICCV), 2013 IEEE International Conference on (pp. 1385-1392). IEEE 39
  • 40. Weinzaepfel, P., Revaud, J., Harchaoui, Z., & Schmid, C. (2013, December). DeepFlow: Large displacement optical flow with deep matching. In Computer Vision (ICCV), 2013 IEEE International Conference on (pp. 1385-1392). IEEE 40 Optical Flow Classic approach: Rigid matching of HoG or SIFT descriptors Deep Matching: Allow each subpatch to move: ● independently ● in a limited range depending on its size
  • 41. Weinzaepfel, P., Revaud, J., Harchaoui, Z., & Schmid, C. (2013, December). DeepFlow: Large displacement optical flow with deep matching. In Computer Vision (ICCV), 2013 IEEE International Conference on (pp. 1385-1392). IEEE 41 Optical Flow: Deep Matching
  • 42. Source: Matlab R2015b documentation for normxcorr2 by Mathworks 42 Optical Flow: 2D correlation Image Sub-Image Offset of the sub-image with respect to the image [0,0].
  • 43. Weinzaepfel, P., Revaud, J., Harchaoui, Z., & Schmid, C. (2013, December). DeepFlow: Large displacement optical flow with deep matching. In Computer Vision (ICCV), 2013 IEEE International Conference on (pp. 1385-1392). IEEE 43 Instead of pre-trained filters, a convolution is defined between each: ● patch of the reference image ● target image ...as a results, a correlation map is generated for each reference patch. Optical Flow: Deep Matching
  • 44. Weinzaepfel, P., Revaud, J., Harchaoui, Z., & Schmid, C. (2013, December). DeepFlow: Large displacement optical flow with deep matching. In Computer Vision (ICCV), 2013 IEEE International Conference on (pp. 1385-1392). IEEE 44 Optical Flow: Deep Matching The most discriminative response map The less discriminative response map
  • 45. Weinzaepfel, P., Revaud, J., Harchaoui, Z., & Schmid, C. (2013, December). DeepFlow: Large displacement optical flow with deep matching. In Computer Vision (ICCV), 2013 IEEE International Conference on (pp. 1385-1392). IEEE 45 Key idea: Build (bottom-up) a pyramid of correlation maps to run an efficient (top-down) search. Optical Flow: Deep Matching 4x4 patches 8x8 patches 16x16 patches 32x32 patches Top-down matching (TD)Bottom-up extraction (BU)
  • 46. Weinzaepfel, P., Revaud, J., Harchaoui, Z., & Schmid, C. (2013, December). DeepFlow: Large displacement optical flow with deep matching. In Computer Vision (ICCV), 2013 IEEE International Conference on (pp. 1385-1392). IEEE 46 Key idea: Build (bottom-up) a pyramid of correlation maps to run an efficient (top-down) search. Optical Flow: Deep Matching 4x4 patches 8x8 patches 16x16 patches 32x32 patches Bottom-up extraction (BU)
  • 47. Weinzaepfel, P., Revaud, J., Harchaoui, Z., & Schmid, C. (2013, December). DeepFlow: Large displacement optical flow with deep matching. In Computer Vision (ICCV), 2013 IEEE International Conference on (pp. 1385-1392). IEEE 47 Optical Flow: Deep Matching (BU)
  • 48. Weinzaepfel, P., Revaud, J., Harchaoui, Z., & Schmid, C. (2013, December). DeepFlow: Large displacement optical flow with deep matching. In Computer Vision (ICCV), 2013 IEEE International Conference on (pp. 1385-1392). IEEE 48 Key idea: Build (bottom-up) a pyramid of correlation maps to run an efficient (top-down) search. Optical Flow: Deep Matching (TD) 4x4 patches 8x8 patches 16x16 patches 32x32 patches Top-down matching (TD)
  • 49. Weinzaepfel, P., Revaud, J., Harchaoui, Z., & Schmid, C. (2013, December). DeepFlow: Large displacement optical flow with deep matching. In Computer Vision (ICCV), 2013 IEEE International Conference on (pp. 1385-1392). IEEE 49 Optical Flow: Deep Matching (TD) Each local maxima in the top layer corresponds to a shift of one of the biggest (32x32) patches. If we focus on local maximum, we can retrieve the corresponding responses one scale below and focus on shift of the sub-patches that generated it
  • 50. Weinzaepfel, P., Revaud, J., Harchaoui, Z., & Schmid, C. (2013, December). DeepFlow: Large displacement optical flow with deep matching. In Computer Vision (ICCV), 2013 IEEE International Conference on (pp. 1385-1392). IEEE 50 Optical Flow: Deep Matching (TD)
  • 51. Weinzaepfel, P., Revaud, J., Harchaoui, Z., & Schmid, C. (2013, December). DeepFlow: Large displacement optical flow with deep matching. In Computer Vision (ICCV), 2013 IEEE International Conference on (pp. 1385-1392). IEEE 51 Optical Flow: Deep Matching
  • 52. Weinzaepfel, P., Revaud, J., Harchaoui, Z., & Schmid, C. (2013, December). DeepFlow: Large displacement optical flow with deep matching. In Computer Vision (ICCV), 2013 IEEE International Conference on (pp. 1385-1392). IEEE 52 Ground truth Dense HOG [Brox & Malik 2011] Deep Matching Optical Flow: Deep Matching
  • 53. Weinzaepfel, P., Revaud, J., Harchaoui, Z., & Schmid, C. (2013, December). DeepFlow: Large displacement optical flow with deep matching. In Computer Vision (ICCV), 2013 IEEE International Conference on (pp. 1385-1392). IEEE 53 Optical Flow: Deep Matching
  • 54. Optical Flow Dosovitskiy, A., Fischer, P., Ilg, E., Hausser, P., Hazirbas, C., Golkov, V., van der Smagt, P., Cremers, D. and Brox, T., 2015. FlowNet: Learning Optical Flow With Convolutional Networks. In Proceedings of the IEEE International Conference on Computer Vision (pp. 2758-2766). 54
  • 55. Optical Flow: FlowNet Dosovitskiy, A., Fischer, P., Ilg, E., Hausser, P., Hazirbas, C., Golkov, V., van der Smagt, P., Cremers, D. and Brox, T., 2015. FlowNet: Learning Optical Flow With Convolutional Networks. In Proceedings of the IEEE International Conference on Computer Vision (pp. 2758-2766). 55
  • 56. Optical Flow: FlowNet Dosovitskiy, A., Fischer, P., Ilg, E., Hausser, P., Hazirbas, C., Golkov, V., van der Smagt, P., Cremers, D. and Brox, T., 2015. FlowNet: Learning Optical Flow With Convolutional Networks. In Proceedings of the IEEE International Conference on Computer Vision (pp. 2758-2766). 56 End to end supervised learning of optical flow.
  • 57. Optical Flow: FlowNet (contracting) Dosovitskiy, A., Fischer, P., Ilg, E., Hausser, P., Hazirbas, C., Golkov, V., van der Smagt, P., Cremers, D. and Brox, T., 2015. FlowNet: Learning Optical Flow With Convolutional Networks. In Proceedings of the IEEE International Conference on Computer Vision (pp. 2758-2766). 57 Option A: Stack both input images together and feed them through a generic network.
  • 58. Optical Flow: FlowNet (contracting) Dosovitskiy, A., Fischer, P., Ilg, E., Hausser, P., Hazirbas, C., Golkov, V., van der Smagt, P., Cremers, D. and Brox, T., 2015. FlowNet: Learning Optical Flow With Convolutional Networks. In Proceedings of the IEEE International Conference on Computer Vision (pp. 2758-2766). 58 Option B: Create two separate, yet identical processing streams for the two images and combine them at a later stage.
  • 59. Optical Flow: FlowNet (contracting) Dosovitskiy, A., Fischer, P., Ilg, E., Hausser, P., Hazirbas, C., Golkov, V., van der Smagt, P., Cremers, D. and Brox, T., 2015. FlowNet: Learning Optical Flow With Convolutional Networks. In Proceedings of the IEEE International Conference on Computer Vision (pp. 2758-2766). 59 Option B: Create two separate, yet identical processing streams for the two images and combine them at a later stage. Correlation layer: convolution of feature patches from the two streams to be combined.
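A minimal PyTorch sketch of such a correlation layer, following the idea rather than the paper's exact CUDA implementation (feature shapes and the displacement range are assumptions):

```python
import torch
import torch.nn.functional as F

def correlation(f1, f2, max_disp=4):
    """f1, f2: feature maps of shape (B, C, H, W) from the two streams.
    Compares each location of f1 against f2 over all shifts up to
    max_disp, returning (B, (2*max_disp+1)^2, H, W)."""
    B, C, H, W = f1.shape
    f2 = F.pad(f2, [max_disp] * 4)             # pad so shifts stay in bounds
    out = []
    for dy in range(2 * max_disp + 1):
        for dx in range(2 * max_disp + 1):
            shifted = f2[:, :, dy:dy + H, dx:dx + W]
            out.append((f1 * shifted).mean(dim=1, keepdim=True))
    return torch.cat(out, dim=1)
```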
  • 60. Optical Flow: FlowNet (expanding) Dosovitskiy, A., Fischer, P., Ilg, E., Hausser, P., Hazirbas, C., Golkov, V., van der Smagt, P., Cremers, D. and Brox, T., 2015. FlowNet: Learning Optical Flow With Convolutional Networks. In Proceedings of the IEEE International Conference on Computer Vision (pp. 2758-2766). 60 Upconvolutional layers: unpooling feature maps + convolution. Upconvolved feature maps are concatenated with the corresponding maps from the contractive part.
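One refinement step of the expanding part could look like the following hedged PyTorch sketch (layer sizes are illustrative; the paper also concatenates an upsampled coarse flow prediction, omitted here):

```python
import torch
import torch.nn as nn

class RefineBlock(nn.Module):
    """Upconvolve coarse features, concatenate the matching skip
    features from the contracting part, and predict a 2-channel flow."""
    def __init__(self, in_ch, skip_ch, out_ch):
        super().__init__()
        self.upconv = nn.ConvTranspose2d(in_ch, out_ch, 4, stride=2, padding=1)
        self.flow = nn.Conv2d(out_ch + skip_ch, 2, 3, padding=1)

    def forward(self, coarse, skip):
        up = torch.relu(self.upconv(coarse))   # doubles spatial resolution
        merged = torch.cat([up, skip], dim=1)  # skip-connection concat
        return merged, self.flow(merged)       # features + flow estimate
```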
  • 61. Optical Flow: FlowNet Dosovitskiy, A., Fischer, P., Ilg, E., Hausser, P., Hazirbas, C., Golkov, V., van der Smagt, P., Cremers, D. and Brox, T., 2015. FlowNet: Learning Optical Flow With Convolutional Networks. ICCV 2015 61 Since existing ground truth datasets are not sufficiently large to train a Convnet, a synthetic Flying Chairs dataset is generated… and augmented (translation, rotation and scaling transformations; additive Gaussian noise; changes in brightness, contrast, gamma and color). Convnets trained on this unrealistic data generalize well to existing datasets such as Sintel and KITTI. Data augmentation
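A toy version of the photometric part of this augmentation (parameters are made up; the geometric transforms, which must also warp the ground-truth flow field, are omitted):

```python
import numpy as np

def augment(img1, img2, rng=np.random):
    """img1, img2: float arrays in [0, 1]. Applies the same random gain
    (brightness/contrast) to both frames plus independent Gaussian noise."""
    gain = rng.uniform(0.8, 1.2)
    noisy = lambda x: x + rng.normal(0.0, 0.02, x.shape)
    return (np.clip(noisy(img1 * gain), 0, 1),
            np.clip(noisy(img2 * gain), 0, 1))
```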
  • 62. Dosovitskiy, A., Fischer, P., Ilg, E., Hausser, P., Hazirbas, C., Golkov, V., van der Smagt, P., Cremers, D. and Brox, T., 2015. FlowNet: Learning Optical Flow With Convolutional Networks. In Proceedings of the IEEE International Conference on Computer Vision (pp. 2758-2766). 62 Optical Flow: FlowNet
  • 63. Outline 1. Recognition 2. Optical Flow 3. Object Tracking 4. Audio and Video 5. Generative models 63
  • 64. Object tracking: MDNet 64 Nam, Hyeonseob, and Bohyung Han. "Learning multi-domain convolutional neural networks for visual tracking." ICCV VOT Workshop (2015)
  • 65. Object tracking: MDNet 65 Nam, Hyeonseob, and Bohyung Han. "Learning multi-domain convolutional neural networks for visual tracking." ICCV VOT Workshop (2015)
  • 66. Object tracking: MDNet: Architecture 66 Nam, Hyeonseob, and Bohyung Han. "Learning multi-domain convolutional neural networks for visual tracking." ICCV VOT Workshop (2015) Domain-specific layers are used during training for each sequence, but are replaced by a single one at test time.
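A minimal sketch of the multi-domain idea, with fixed-size input features and illustrative layer widths (the actual MDNet uses three conv layers plus two shared fc layers before the per-domain branches):

```python
import torch.nn as nn

class MDNetSketch(nn.Module):
    """Shared layers + one small target-vs-background branch per
    training sequence (domain); at test time the branches are
    discarded and a single fresh branch is trained online."""
    def __init__(self, num_domains, feat_dim=512):
        super().__init__()
        self.shared = nn.Sequential(nn.Linear(feat_dim, 512), nn.ReLU(),
                                    nn.Linear(512, 512), nn.ReLU())
        self.branches = nn.ModuleList(
            [nn.Linear(512, 2) for _ in range(num_domains)])

    def forward(self, x, domain):
        return self.branches[domain](self.shared(x))
```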
  • 67. Object tracking: MDNet: Online update 67 Nam, Hyeonseob, and Bohyung Han. "Learning multi-domain convolutional neural networks for visual tracking." ICCV VOT Workshop (2015) MDNet is updated online at test time with hard negative mining, that is, by selecting the negative samples with the highest positive scores.
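A hedged sketch of that selection step, where `score_fn` stands in for the network's binary classification head (all names here are assumptions):

```python
import torch

def hard_negatives(score_fn, candidates, k=32):
    """candidates: tensor of negative candidate features. Returns the k
    negatives the current model is most confidently wrong about, i.e.
    those with the highest score for class 1 (the target)."""
    with torch.no_grad():
        pos_scores = score_fn(candidates)[:, 1]
    idx = torch.topk(pos_scores, k).indices
    return candidates[idx]
```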
  • 68. Object tracking: FCNT 68 Wang, Lijun, Wanli Ouyang, Xiaogang Wang, and Huchuan Lu. "Visual Tracking with Fully Convolutional Networks." ICCV 2015 [code]
  • 69. Object tracking: FCNT 69 Wang, Lijun, Wanli Ouyang, Xiaogang Wang, and Huchuan Lu. "Visual Tracking with Fully Convolutional Networks." In Proceedings of the IEEE International Conference on Computer Vision, pp. 3119-3127. 2015 [code] Focus on conv4-3 and conv5-3 of the VGG-16 network pre-trained for ImageNet image classification. conv4-3 conv5-3
  • 70. Object tracking: FCNT: Specialization 70 Wang, Lijun, Wanli Ouyang, Xiaogang Wang, and Huchuan Lu. "Visual Tracking with Fully Convolutional Networks." In Proceedings of the IEEE International Conference on Computer Vision, pp. 3119-3127. 2015 [code] Most feature maps in VGG-16 conv4-3 and conv5-3 are not related to the foreground regions in a tracking sequence.
  • 71. Object tracking: FCNT: Localization 71 Wang, Lijun, Wanli Ouyang, Xiaogang Wang, and Huchuan Lu. "Visual Tracking with Fully Convolutional Networks." In Proceedings of the IEEE International Conference on Computer Vision, pp. 3119-3127. 2015 [code] Although trained for image classification, feature maps in conv5-3 enable object localization… but they are not discriminative enough to distinguish between objects of the same category.
  • 72. Object tracking: Localization 72 Zhou, Bolei, Aditya Khosla, Agata Lapedriza, Aude Oliva, and Antonio Torralba. "Object detectors emerge in deep scene cnns." ICLR 2015. Other works have shown how feature maps in convolutional layers allow object localization.
  • 73. Object tracking: FCNT: Localization 73 Wang, Lijun, Wanli Ouyang, Xiaogang Wang, and Huchuan Lu. "Visual Tracking with Fully Convolutional Networks." In Proceedings of the IEEE International Conference on Computer Vision, pp. 3119-3127. 2015 [code] On the other hand, feature maps from conv4-3 are more sensitive to intra-class appearance variation… conv4-3 conv5-3
  • 74. Object tracking: FCNT: Architecture 74 Wang, Lijun, Wanli Ouyang, Xiaogang Wang, and Huchuan Lu. "Visual Tracking with Fully Convolutional Networks." In Proceedings of the IEEE International Conference on Computer Vision, pp. 3119-3127. 2015 [code] SNet=Specific Network (online update) GNet=General Network (fixed)
  • 75. Object tracking: FCNT: Results 75 Wang, Lijun, Wanli Ouyang, Xiaogang Wang, and Huchuan Lu. "Visual Tracking with Fully Convolutional Networks." In Proceedings of the IEEE International Conference on Computer Vision, pp. 3119-3127. 2015 [code]
  • 76. Outline 1. Recognition 2. Optical Flow 3. Object Tracking 4. Audio and Video 5. Generative models 76
  • 78. 78 Audio and Video: Soundnet Aytar, Yusuf, Carl Vondrick, and Antonio Torralba. "Soundnet: Learning sound representations from unlabeled video." In Advances in Neural Information Processing Systems, pp. 892-900. 2016. Object and scene recognition in videos by analyzing only the audio track.
  • 79. 79 Aytar, Yusuf, Carl Vondrick, and Antonio Torralba. "Soundnet: Learning sound representations from unlabeled video." NIPS 2016. Videos for training are unlabeled. Relies on CNNs trained on labeled images. Audio and Video: Soundnet
  • 80. 80 Aytar, Yusuf, Carl Vondrick, and Antonio Torralba. "Soundnet: Learning sound representations from unlabeled video." NIPS 2016. Videos for training are unlabeled. Relies on CNNs trained on labeled images. Audio and Video: Soundnet
  • 81. 81 Aytar, Yusuf, Carl Vondrick, and Antonio Torralba. "Soundnet: Learning sound representations from unlabeled video." In Advances in Neural Information Processing Systems, pp. 892-900. 2016. Audio and Video: Soundnet
  • 82. 82 Aytar, Yusuf, Carl Vondrick, and Antonio Torralba. "Soundnet: Learning sound representations from unlabeled video." NIPS 2016. Hidden layers of Soundnet are used to train a standard SVM classifier that outperforms the state of the art. Audio and Video: Soundnet
  • 83. 83 Aytar, Yusuf, Carl Vondrick, and Antonio Torralba. "Soundnet: Learning sound representations from unlabeled video." NIPS 2016. Visualization of the 1D filters over raw audio in conv1. Audio and Video: Soundnet
  • 84. 84 Aytar, Yusuf, Carl Vondrick, and Antonio Torralba. "Soundnet: Learning sound representations from unlabeled video." NIPS 2016. Visualization of the 1D filters over raw audio in conv1. Audio and Video: Soundnet
  • 85. 85 Aytar, Yusuf, Carl Vondrick, and Antonio Torralba. "Soundnet: Learning sound representations from unlabeled video." NIPS 2016. Visualization of the 1D filters over raw audio in conv1. Audio and Video: Soundnet
  • 86. 86 Aytar, Yusuf, Carl Vondrick, and Antonio Torralba. "Soundnet: Learning sound representations from unlabeled video." In Advances in Neural Information Processing Systems, pp. 892-900. 2016. Visualization of the video frames associated to the sounds that activate some of the last hidden units (conv7): Audio and Video: Soundnet
  • 87. 87 Audio and Video: Sonorization Owens, Andrew, Phillip Isola, Josh McDermott, Antonio Torralba, Edward H. Adelson, and William T. Freeman. "Visually indicated sounds." CVPR 2016. Learn to synthesize sounds from videos of people hitting objects with a drumstick.
  • 88. 88 Audio and Video: Visual Sounds Owens, Andrew, Phillip Isola, Josh McDermott, Antonio Torralba, Edward H. Adelson, and William T. Freeman. "Visually indicated sounds." CVPR 2016. Not end-to-end
  • 89. 89 Audio and Video: Visual Sounds Owens, Andrew, Phillip Isola, Josh McDermott, Antonio Torralba, Edward H. Adelson, and William T. Freeman. "Visually indicated sounds." CVPR 2016.
  • 90. 90 Learn more Ruslan Salakhutdinov, “Multimodal Machine Learning” (NIPS 2015 Workshop)
  • 91. Generative models for Video 91 Slides D2L5 by Santi Pascual.
  • 92. 92 What are Generative Models? We want our model, with parameters θ = {weights, biases} and outputs distributed as Pmodel, to estimate the distribution of our training data, Pdata. Example: y = f(x), where y is a scalar; we make Pmodel similar to Pdata by training the parameters θ to maximize their similarity.
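A standard way to make "maximize their similarity" precise is maximum likelihood, which (as a general identity, not specific to these slides) is equivalent to minimizing the KL divergence between the data and model distributions:

```latex
\theta^{*}
  = \arg\max_{\theta}\; \mathbb{E}_{x \sim P_{\text{data}}}\big[\log P_{\text{model}}(x;\theta)\big]
  = \arg\min_{\theta}\; D_{\mathrm{KL}}\big(P_{\text{data}} \,\big\|\, P_{\text{model}}(\cdot\,;\theta)\big)
```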
  • 93. Key Idea: our model cares about what distribution generated the input data points, and we want to mimic it with our probabilistic model. Our learned model should be able to make up new samples from the distribution, not just copy and paste existing samples! 93 What are Generative Models? Figure from NIPS 2016 Tutorial: Generative Adversarial Networks (I. Goodfellow)
  • 94. 94 Video Frame Prediction Mathieu, Michael, Camille Couprie, and Yann LeCun. "Deep multi-scale video prediction beyond mean square error." ICLR 2016
  • 95. 95 Video Frame Prediction Mathieu, Michael, Camille Couprie, and Yann LeCun. "Deep multi-scale video prediction beyond mean square error." ICLR 2016
  • 96. 96 Video Frame Prediction Mathieu, Michael, Camille Couprie, and Yann LeCun. "Deep multi-scale video prediction beyond mean square error." ICLR 2016
  • 97. Adversarial Training analogy Imagine we have a counterfeiter (G) trying to make fake money, and the police (D) has to detect whether money is real or fake. 100 100 It’s not even green
  • 98. Adversarial Training analogy Imagine we have a counterfeiter (G) trying to make fake money, and the police (D) has to detect whether money is real or fake. 100 100 There is no watermark
  • 99. Adversarial Training analogy Imagine we have a counterfeiter (G) trying to make fake money, and the police (D) has to detect whether money is real or fake. 100 100 Watermark should be rounded
  • 100. Adversarial Training analogy Imagine we have a counterfeiter (G) trying to make fake money, and the police (D) has to detect whether money is real or fake. ? After enough iterations, and if the counterfeiter is good enough (for the G network, this means it has enough parameters), the police should be confused.
  • 101. Adversarial Training (batch update) ● Pick a sample x from the training set ● Show x to D and update D's weights to output 1 (real)
  • 102. Adversarial Training (batch update) ● G maps sample z to ẍ ● Show ẍ to D and update D's weights to output 0 (fake)
  • 103. Adversarial Training (batch update) ● Freeze D weights ● Update G weights to make D output 1 (just G weights!) ● Unfreeze D weights and repeat
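The three slides above map directly onto a standard GAN batch update; a minimal PyTorch sketch, assuming G and D are given modules and D ends in a sigmoid producing shape (batch, 1):

```python
import torch
import torch.nn.functional as F

def gan_step(G, D, x_real, opt_D, opt_G, z_dim=100):
    b = x_real.size(0)
    ones, zeros = torch.ones(b, 1), torch.zeros(b, 1)

    # (1)+(2) Discriminator update: real -> 1, fake -> 0
    x_fake = G(torch.randn(b, z_dim)).detach()   # no gradient into G here
    loss_D = (F.binary_cross_entropy(D(x_real), ones) +
              F.binary_cross_entropy(D(x_fake), zeros))
    opt_D.zero_grad()
    loss_D.backward()
    opt_D.step()

    # (3) Generator update: D is effectively frozen, since only opt_G
    # steps on G's parameters; push D(G(z)) towards 1
    loss_G = F.binary_cross_entropy(D(G(torch.randn(b, z_dim))), ones)
    opt_G.zero_grad()
    loss_G.backward()
    opt_G.step()
    return loss_D.item(), loss_G.item()
```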
  • 104. 104 Generative Adversarial Networks (GANs) Slide credit: Víctor Garcia Discriminator D(·) Generator G(·) Real World Random seed (z) Real/Synthetic
  • 105. 105Slide credit: Víctor Garcia Conditional Adversarial Networks Real World Real/Synthetic Condition Discriminator D(·) Generator G(·) Generative Adversarial Networks (GANs)
  • 106. Generating images/frames (Radford et al. 2015) Deep Convolutional GAN (DCGAN) effectively generated 64x64 RGB images in a single shot; for example, bedrooms from the LSUN dataset.
  • 107. Generating images/frames conditioned on captions (Reed et al. 2016b) (Zhang et al. 2016)
  • 108. Unsupervised feature extraction/learning representations Similarly to word2vec, GANs learn a distributed representation that disentangles concepts such that we can perform operations on the data manifold: v(Man with glasses) - v(man) + v(woman) = v(woman with glasses) (Radford et al. 2015)
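As a toy illustration of this arithmetic (the concept codes would really be averages of z vectors for samples showing each attribute, and G a trained DCGAN generator; everything below is a placeholder):

```python
import torch

# placeholders standing in for averaged latent codes of each concept
z_man_glasses, z_man, z_woman = (torch.randn(100) for _ in range(3))

z = z_man_glasses - z_man + z_woman   # "woman with glasses" code
# image = G(z.unsqueeze(0))           # decode with the trained generator G
```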
  • 109. Image super-resolution Bicubic: not using data statistics. SRResNet: trained with MSE. SRGAN is able to understand that there are multiple correct answers, rather than averaging. (Ledig et al. 2016)
  • 110. Image super-resolution Averaging is a serious problem we face when dealing with complex distributions. (Ledig et al. 2016)
  • 111. Manipulating images and assisted content creation https://youtu.be/9c4z6YsBGQ0?t=126 https://youtu.be/9c4z6YsBGQ0?t=161 (Zhu et al. 2016)
  • 112. 112 Adversarial Networks Slide credit: Víctor Garcia Isola, Phillip, Jun-Yan Zhu, Tinghui Zhou, and Alexei A. Efros. "Image-to-image translation with conditional adversarial networks." arXiv preprint arXiv:1611.07004 (2016). Generator Discriminator Generated Pairs Real World Ground Truth Pairs Loss → BCE
  • 113. 113 Víctor Garcia and Xavier Giró-i-Nieto (work in progress) Generator Discriminator Loss2 GAN {Binary Crossentropy} 1/0 Generative Adversarial Networks (GANs)
  • 114. Generative models for video 114 Vondrick, Carl, Hamed Pirsiavash, and Antonio Torralba. "Generating videos with scene dynamics." NIPS 2016.
  • 115. Generative models for video 115 Vondrick, Carl, Hamed Pirsiavash, and Antonio Torralba. "Generating videos with scene dynamics." NIPS 2016.
  • 116. 116 Adversarial Networks Goodfellow, Ian, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. "Generative adversarial nets." NIPS 2014 Goodfellow, Ian. "NIPS 2016 Tutorial: Generative Adversarial Networks." arXiv preprint arXiv:1701.00160 (2016). F. Van Veen, “The Neural Network Zoo” (2016)
  • 117. Outline 1. Recognition 2. Optical Flow 3. Object Tracking 4. Audio and Video 5. Generative models 117