@DocXavi
Module 4 - Lecture 6
Video Analysis with
CNNs
31 January 2017
Xavier Giró-i-Nieto
[http://pagines.uab.cat/mcv/]
Acknowledgments
2
Víctor Campos Alberto Montes
Linked slides
Outline
1. Recognition
2. Optical Flow
3. Object Tracking
4. Audio and Video
5. Generative models
4
Recognition
Demo: Clarifai
MIT Technology Review: “A start-up’s Neural Network Can Understand Video” (3/2/2015)
5
Figure: Karpathy, A., Toderici, G., Shetty, S., Leung, T., Sukthankar, R., & Fei-Fei, L. (2014, June). Large-scale video classification with
convolutional neural networks. In Computer Vision and Pattern Recognition (CVPR), 2014 IEEE Conference on (pp. 1725-1732). IEEE.
6
Recognition
7
Recognition
Figure: Tran, Du, Lubomir Bourdev, Rob Fergus, Lorenzo Torresani, and Manohar Paluri. "Learning spatiotemporal features with 3D
convolutional networks." In Proceedings of the IEEE International Conference on Computer Vision, pp. 4489-4497. 2015
8
Recognition
Figure: Tran, Du, Lubomir Bourdev, Rob Fergus, Lorenzo Torresani, and Manohar Paluri. "Learning spatiotemporal features with 3D
convolutional networks." In Proceedings of the IEEE International Conference on Computer Vision, pp. 4489-4497. 2015
Previous lectures
9
Recognition
Figure: Tran, Du, Lubomir Bourdev, Rob Fergus, Lorenzo Torresani, and Manohar Paluri. "Learning spatiotemporal features with 3D
convolutional networks." In Proceedings of the IEEE International Conference on Computer Vision, pp. 4489-4497. 2015
Karpathy, A., Toderici, G., Shetty, S., Leung, T., Sukthankar, R., & Fei-Fei, L. (2014, June). Large-scale video
classification with convolutional neural networks. In Computer Vision and Pattern Recognition (CVPR), 2014
IEEE Conference on (pp. 1725-1732). IEEE.
Slides extracted from ReadCV seminar by Victor Campos 10
Recognition: DeepVideo
Karpathy, A., Toderici, G., Shetty, S., Leung, T., Sukthankar, R., & Fei-Fei, L. (2014, June). Large-scale video classification with convolutional
neural networks. In Computer Vision and Pattern Recognition (CVPR), 2014 IEEE Conference on (pp. 1725-1732). IEEE. 11
Recognition: DeepVideo: Demo
Karpathy, A., Toderici, G., Shetty, S., Leung, T., Sukthankar, R., & Fei-Fei, L. (2014, June). Large-scale video classification with convolutional
neural networks. In Computer Vision and Pattern Recognition (CVPR), 2014 IEEE Conference on (pp. 1725-1732). IEEE. 12
Recognition: DeepVideo: Architectures
Karpathy, A., Toderici, G., Shetty, S., Leung, T., Sukthankar, R., & Fei-Fei, L. (2014, June). Large-scale video classification with convolutional
neural networks. In Computer Vision and Pattern Recognition (CVPR), 2014 IEEE Conference on (pp. 1725-1732). IEEE. 13
Unsupervised learning [Le et al.’11] Supervised learning [Karpathy et al.’14]
Recognition: DeepVideo: Features
Karpathy, A., Toderici, G., Shetty, S., Leung, T., Sukthankar, R., & Fei-Fei, L. (2014, June). Large-scale video classification with convolutional
neural networks. In Computer Vision and Pattern Recognition (CVPR), 2014 IEEE Conference on (pp. 1725-1732). IEEE. 14
Recognition: DeepVideo: Multiscale
Karpathy, A., Toderici, G., Shetty, S., Leung, T., Sukthankar, R., & Fei-Fei, L. (2014, June). Large-scale video classification with convolutional
neural networks. In Computer Vision and Pattern Recognition (CVPR), 2014 IEEE Conference on (pp. 1725-1732). IEEE. 15
Recognition: DeepVideo: Results
16
Recognition
Figure: Tran, Du, Lubomir Bourdev, Rob Fergus, Lorenzo Torresani, and Manohar Paluri. "Learning spatiotemporal features with 3D
convolutional networks." In Proceedings of the IEEE International Conference on Computer Vision, pp. 4489-4497. 2015
17
Recognition: C3D
Figure: Tran, Du, Lubomir Bourdev, Rob Fergus, Lorenzo Torresani, and Manohar Paluri. "Learning
spatiotemporal features with 3D convolutional networks." In Proceedings of the IEEE International
Conference on Computer Vision, pp. 4489-4497. 2015
18
Tran, Du, Lubomir Bourdev, Rob Fergus, Lorenzo Torresani, and Manohar Paluri. "Learning spatiotemporal features with 3D convolutional networks."
In Proceedings of the IEEE International Conference on Computer Vision, pp. 4489-4497. 2015
Recognition: C3D: Demo
19
K. Simonyan, A. Zisserman, Very Deep Convolutional Networks for Large-Scale Image Recognition, ICLR 2015.
Recognition: C3D: Spatial dimension
Spatial dimensions (XY) of all kernels are fixed to 3x3, following Simonyan & Zisserman (ICLR 2015).
20
Tran, Du, Lubomir Bourdev, Rob Fergus, Lorenzo Torresani, and Manohar Paluri. "Learning spatiotemporal features with 3D convolutional networks."
In Proceedings of the IEEE International Conference on Computer Vision, pp. 4489-4497. 2015
Recognition: C3D: Temporal dimension
3D ConvNets are more suitable for spatiotemporal feature learning compared to 2D ConvNets
(Figure: networks with varying temporal depth compared against 2D ConvNets)
21
Tran, Du, Lubomir Bourdev, Rob Fergus, Lorenzo Torresani, and Manohar Paluri. "Learning spatiotemporal features with 3D convolutional networks."
In Proceedings of the IEEE International Conference on Computer Vision, pp. 4489-4497. 2015
A homogeneous architecture with small 3 × 3 × 3 convolution kernels in all layers is among the best
performing architectures for 3D ConvNets
Recognition: C3D: Temporal dimension
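A minimal sketch (assuming PyTorch, which the lecture does not prescribe) of this homogeneous design: every layer uses 3x3x3 kernels, so one conv-pool stage looks like this.

```python
import torch
import torch.nn as nn

class C3DBlock(nn.Module):
    """One 3D conv + ReLU + max-pool stage with the 3x3x3 kernels the paper advocates."""
    def __init__(self, in_ch, out_ch, pool=(2, 2, 2)):
        super().__init__()
        self.conv = nn.Conv3d(in_ch, out_ch, kernel_size=3, padding=1)
        self.pool = nn.MaxPool3d(kernel_size=pool, stride=pool)

    def forward(self, x):  # x: (batch, channels, time, height, width)
        return self.pool(torch.relu(self.conv(x)))

clip = torch.randn(1, 3, 16, 112, 112)       # one 16-frame RGB clip
out = C3DBlock(3, 64, pool=(1, 2, 2))(clip)  # first stage keeps the temporal depth
print(out.shape)                             # torch.Size([1, 64, 16, 56, 56])
```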
22
Tran, Du, Lubomir Bourdev, Rob Fergus, Lorenzo Torresani, and Manohar Paluri. "Learning spatiotemporal features with 3D convolutional networks."
In Proceedings of the IEEE International Conference on Computer Vision, pp. 4489-4497. 2015
No gain when varying the temporal depth across layers.
Recognition: C3D: Temporal dimension
23
Tran, Du, Lubomir Bourdev, Rob Fergus, Lorenzo Torresani, and Manohar Paluri. "Learning spatiotemporal features with 3D convolutional networks."
In Proceedings of the IEEE International Conference on Computer Vision, pp. 4489-4497. 2015
Recognition: C3D: Architecture
Feature vector
24
Tran, Du, Lubomir Bourdev, Rob Fergus, Lorenzo Torresani, and Manohar Paluri. "Learning spatiotemporal features with 3D convolutional networks."
In Proceedings of the IEEE International Conference on Computer Vision, pp. 4489-4497. 2015
Recognition: C3D: Feature vector
Video sequence split into 16-frame clips with an 8-frame overlap
25
Tran, Du, Lubomir Bourdev, Rob Fergus, Lorenzo Torresani, and Manohar Paluri. "Learning spatiotemporal features with 3D convolutional networks."
In Proceedings of the IEEE International Conference on Computer Vision, pp. 4489-4497. 2015
Recognition: C3D: Feature vector
(Figure: activations of successive 16-frame clips are averaged into a 4096-dim video descriptor, which is then L2-normalized)
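A minimal sketch (NumPy) of the descriptor pipeline shown above; `c3d_features` is a hypothetical stand-in for a trained C3D network returning one 4096-dim activation per clip.

```python
import numpy as np

def video_descriptor(frames, c3d_features, clip_len=16, stride=8):
    """Split the video into 16-frame clips with an 8-frame overlap,
    average the per-clip 4096-dim activations, and L2-normalize."""
    feats = [c3d_features(frames[t:t + clip_len])
             for t in range(0, len(frames) - clip_len + 1, stride)]
    desc = np.mean(feats, axis=0)           # 4096-dim video descriptor
    return desc / np.linalg.norm(desc)      # L2 norm
```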
26
Tran, Du, Lubomir Bourdev, Rob Fergus, Lorenzo Torresani, and Manohar Paluri. "Learning spatiotemporal features with 3D convolutional networks."
In Proceedings of the IEEE International Conference on Computer Vision, pp. 4489-4497. 2015
Recognition: C3D: Visualization
Based on Deconvnets by Zeiler and Fergus [ECCV 2014] - See [ReadCV Slides] for more details.
27
Tran, Du, Lubomir Bourdev, Rob Fergus, Lorenzo Torresani, and Manohar Paluri. "Learning spatiotemporal features with 3D convolutional networks."
In Proceedings of the IEEE International Conference on Computer Vision, pp. 4489-4497. 2015
Recognition: C3D: Compactness
28
Tran, Du, Lubomir Bourdev, Rob Fergus, Lorenzo Torresani, and Manohar Paluri. "Learning spatiotemporal features with 3D convolutional networks."
In Proceedings of the IEEE International Conference on Computer Vision, pp. 4489-4497. 2015
Convolutional 3D (C3D) features combined with a simple linear classifier outperform state-of-the-art methods on 4 different benchmarks and are comparable with the state of the art on the other 2 benchmarks.
Recognition: C3D: Performance
29
Tran, Du, Lubomir Bourdev, Rob Fergus, Lorenzo Torresani, and Manohar Paluri. "Learning spatiotemporal features with 3D convolutional networks."
In Proceedings of the IEEE International Conference on Computer Vision, pp. 4489-4497. 2015
Recognition: C3D: Software
Implementation by Michael Gygli (GitHub)
30 Simonyan, Karen, and Andrew Zisserman. "Two-stream convolutional networks for action recognition in videos." NIPS 2014.
Recognition: Two stream
Two CNNs in parallel:
● One for RGB images
● One for optical flow (hand-crafted features)
Fusion after the softmax layer
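A minimal sketch of this late fusion, assuming the two streams are already trained: the simplest variant just averages their softmax scores (the paper also explores an SVM on the stacked scores).

```python
import numpy as np

def two_stream_predict(rgb_softmax, flow_softmax):
    """rgb_softmax / flow_softmax: class scores from the two parallel CNNs."""
    fused = (rgb_softmax + flow_softmax) / 2.0   # fusion after the softmax layer
    return int(np.argmax(fused))
```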
31 Feichtenhofer, Christoph, Axel Pinz, and Andrew Zisserman. "Convolutional two-stream network fusion for video action recognition." CVPR 2016. [code]
Recognition: Two stream
Two CNNs in parallel:
● One for RGB images
● One for optical flow (hand-crafted features)
Fusion at a convolutional layer
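A hedged sketch (PyTorch) of fusing at a convolutional layer: one of the fusion functions studied in the paper stacks the two streams' feature maps along channels and mixes them with a 1x1 convolution.

```python
import torch
import torch.nn as nn

class ConvFusion(nn.Module):
    def __init__(self, channels):
        super().__init__()
        # 1x1 filters learn how to combine corresponding RGB and flow channels
        self.mix = nn.Conv2d(2 * channels, channels, kernel_size=1)

    def forward(self, rgb_map, flow_map):  # both: (batch, channels, H, W)
        return self.mix(torch.cat([rgb_map, flow_map], dim=1))
```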
32
Shou, Zheng, Dongang Wang, and Shih-Fu Chang. "Temporal action localization in untrimmed videos via multi-stage cnns." CVPR 2016.
(Slidecast and Slides by Alberto Montes)
Recognition: Localization
33
Recognition: Localization
Shou, Zheng, Dongang Wang, and Shih-Fu Chang. "Temporal action localization in untrimmed videos via multi-stage cnns." CVPR 2016.
(Slidecast and Slides by Alberto Montes)
34
Recognition: Localization
Shou, Zheng, Dongang Wang, and Shih-Fu Chang. "Temporal action localization in untrimmed videos via multi-stage cnns." CVPR 2016.
(Slidecast and Slides by Alberto Montes)
Outline
1. Recognition
2. Optical Flow
3. Object Tracking
4. Audio and Video
5. Generative models
35
Optical Flow
Weinzaepfel, P., Revaud, J., Harchaoui, Z., & Schmid, C. (2013, December). DeepFlow: Large displacement optical flow with deep matching. In
Computer Vision (ICCV), 2013 IEEE International Conference on (pp. 1385-1392). IEEE 36
Optical Flow: DeepFlow
Weinzaepfel, P., Revaud, J., Harchaoui, Z., & Schmid, C. (2013, December). DeepFlow: Large displacement optical flow with deep matching. In
Computer Vision (ICCV), 2013 IEEE International Conference on (pp. 1385-1392). IEEE 37
Andrei Bursuc
Postdoc, INRIA
@abursuc
Optical Flow: DeepFlow
Weinzaepfel, P., Revaud, J., Harchaoui, Z., & Schmid, C. (2013, December). DeepFlow: Large displacement optical flow with deep matching. In
Computer Vision (ICCV), 2013 IEEE International Conference on (pp. 1385-1392). IEEE 38
● Deep (hierarchy) ✔
● Convolution ✔
● Learning ❌
Optical Flow: Small vs Large
Weinzaepfel, P., Revaud, J., Harchaoui, Z., & Schmid, C. (2013, December). DeepFlow: Large displacement optical flow with deep matching. In
Computer Vision (ICCV), 2013 IEEE International Conference on (pp. 1385-1392). IEEE 39
Weinzaepfel, P., Revaud, J., Harchaoui, Z., & Schmid, C. (2013, December). DeepFlow: Large displacement optical flow with deep matching. In
Computer Vision (ICCV), 2013 IEEE International Conference on (pp. 1385-1392). IEEE 40
Optical Flow
Classic approach:
Rigid matching of HoG or
SIFT descriptors
Deep Matching:
Allow each subpatch to move:
● independently
● in a limited range
depending on its size
Weinzaepfel, P., Revaud, J., Harchaoui, Z., & Schmid, C. (2013, December). DeepFlow: Large displacement optical flow with deep matching. In
Computer Vision (ICCV), 2013 IEEE International Conference on (pp. 1385-1392). IEEE 41
Optical Flow: Deep Matching
Source: Matlab R2015b documentation for normxcorr2 by MathWorks
42
Optical Flow: 2D correlation
(Figure: an image, a sub-image, and the offset of the sub-image with respect to the image [0,0])
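A brute-force NumPy sketch of the idea behind normxcorr2: slide the sub-image over the image and return the offset where the normalized cross-correlation peaks.

```python
import numpy as np

def best_offset(image, template):
    th, tw = template.shape
    t = (template - template.mean()) / (template.std() + 1e-8)
    best, score = (0, 0), -np.inf
    for y in range(image.shape[0] - th + 1):
        for x in range(image.shape[1] - tw + 1):
            w = image[y:y + th, x:x + tw]
            w = (w - w.mean()) / (w.std() + 1e-8)
            s = float((w * t).mean())        # normalized cross-correlation
            if s > score:
                best, score = (y, x), s
    return best                              # offset of the sub-image in the image
```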
Weinzaepfel, P., Revaud, J., Harchaoui, Z., & Schmid, C. (2013, December). DeepFlow: Large displacement optical flow with deep matching. In
Computer Vision (ICCV), 2013 IEEE International Conference on (pp. 1385-1392). IEEE 43
Instead of pre-trained filters, a convolution is defined between each:
● patch of the reference image
● target image
...as a result, a correlation map is generated for each reference patch (sketched below).
Optical Flow: Deep Matching
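A hedged sketch of that bottom level: each 4x4 reference patch acts as the filter of a correlation (the "convolution" above), producing one response map per patch.

```python
import numpy as np
from scipy.signal import correlate2d

def patch_response_maps(reference, target, patch=4):
    """One correlation map over the target image per 4x4 reference patch."""
    maps = {}
    for y in range(0, reference.shape[0] - patch + 1, patch):
        for x in range(0, reference.shape[1] - patch + 1, patch):
            p = reference[y:y + patch, x:x + patch]
            p = p / (np.linalg.norm(p) + 1e-8)   # the patch is the filter
            maps[(y, x)] = correlate2d(target, p, mode='valid')
    return maps
```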
Weinzaepfel, P., Revaud, J., Harchaoui, Z., & Schmid, C. (2013, December). DeepFlow: Large displacement optical flow with deep matching. In
Computer Vision (ICCV), 2013 IEEE International Conference on (pp. 1385-1392). IEEE 44
Optical Flow: Deep Matching
(Figure: the most discriminative response map vs. the least discriminative response map)
Weinzaepfel, P., Revaud, J., Harchaoui, Z., & Schmid, C. (2013, December). DeepFlow: Large displacement optical flow with deep matching. In
Computer Vision (ICCV), 2013 IEEE International Conference on (pp. 1385-1392). IEEE 45
Key idea: Build (bottom-up) a pyramid of correlation maps to run an efficient (top-down) search.
Optical Flow: Deep Matching
(Figure: pyramid of 4x4, 8x8, 16x16 and 32x32 patches; bottom-up extraction (BU) and top-down matching (TD))
Weinzaepfel, P., Revaud, J., Harchaoui, Z., & Schmid, C. (2013, December). DeepFlow: Large displacement optical flow with deep matching. In
Computer Vision (ICCV), 2013 IEEE International Conference on (pp. 1385-1392). IEEE 46
Key idea: Build (bottom-up) a pyramid of correlation maps to run an efficient (top-down) search.
Optical Flow: Deep Matching
(Figure: pyramid of 4x4, 8x8, 16x16 and 32x32 patches; bottom-up extraction (BU))
Weinzaepfel, P., Revaud, J., Harchaoui, Z., & Schmid, C. (2013, December). DeepFlow: Large displacement optical flow with deep matching. In
Computer Vision (ICCV), 2013 IEEE International Conference on (pp. 1385-1392). IEEE 47
Optical Flow: Deep Matching (BU)
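A heavily simplified, hedged sketch of the bottom-up step: the response map of an NxN patch is assembled from its four (N/2)x(N/2) children, whose maps are max-filtered (tolerating small independent motion), subsampled and averaged. The real algorithm also shifts each child map according to the child's position and applies a nonlinearity.

```python
import numpy as np
from scipy.ndimage import maximum_filter

def parent_response(child_maps):
    """Aggregate four child correlation maps into the parent's map (simplified)."""
    pooled = [maximum_filter(m, size=3)[::2, ::2] for m in child_maps]
    return np.mean(pooled, axis=0)
```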
Weinzaepfel, P., Revaud, J., Harchaoui, Z., & Schmid, C. (2013, December). DeepFlow: Large displacement optical flow with deep matching. In
Computer Vision (ICCV), 2013 IEEE International Conference on (pp. 1385-1392). IEEE 48
Key idea: Build (bottom-up) a pyramid of correlation maps to run an efficient (top-down) search.
Optical Flow: Deep Matching (TD)
(Figure: pyramid of 4x4, 8x8, 16x16 and 32x32 patches; top-down matching (TD))
Weinzaepfel, P., Revaud, J., Harchaoui, Z., & Schmid, C. (2013, December). DeepFlow: Large displacement optical flow with deep matching. In
Computer Vision (ICCV), 2013 IEEE International Conference on (pp. 1385-1392). IEEE 49
Optical Flow: Deep Matching (TD)
Each local maximum in the top layer corresponds to a shift of one of the biggest (32x32) patches.
By focusing on a local maximum, we can retrieve the corresponding responses one scale below and recover the shifts of the sub-patches that generated it.
Weinzaepfel, P., Revaud, J., Harchaoui, Z., & Schmid, C. (2013, December). DeepFlow: Large displacement optical flow with deep matching. In
Computer Vision (ICCV), 2013 IEEE International Conference on (pp. 1385-1392). IEEE 50
Optical Flow: Deep Matching (TD)
Weinzaepfel, P., Revaud, J., Harchaoui, Z., & Schmid, C. (2013, December). DeepFlow: Large displacement optical flow with deep matching. In
Computer Vision (ICCV), 2013 IEEE International Conference on (pp. 1385-1392). IEEE 51
Optical Flow: Deep Matching
Weinzaepfel, P., Revaud, J., Harchaoui, Z., & Schmid, C. (2013, December). DeepFlow: Large displacement optical flow with deep matching. In
Computer Vision (ICCV), 2013 IEEE International Conference on (pp. 1385-1392). IEEE 52
(Figure: Ground truth vs. Dense HOG [Brox & Malik 2011] vs. Deep Matching)
Optical Flow: Deep Matching
Weinzaepfel, P., Revaud, J., Harchaoui, Z., & Schmid, C. (2013, December). DeepFlow: Large displacement optical flow with deep matching. In
Computer Vision (ICCV), 2013 IEEE International Conference on (pp. 1385-1392). IEEE 53
Optical Flow: Deep Matching
Optical Flow
Dosovitskiy, A., Fischer, P., Ilg, E., Hausser, P., Hazirbas, C., Golkov, V., van der Smagt, P., Cremers, D. and Brox, T., 2015. FlowNet: Learning
Optical Flow With Convolutional Networks. In Proceedings of the IEEE International Conference on Computer Vision (pp. 2758-2766). 54
Optical Flow: FlowNet
Dosovitskiy, A., Fischer, P., Ilg, E., Hausser, P., Hazirbas, C., Golkov, V., van der Smagt, P., Cremers, D. and Brox, T., 2015. FlowNet: Learning
Optical Flow With Convolutional Networks. In Proceedings of the IEEE International Conference on Computer Vision (pp. 2758-2766). 55
Optical Flow: FlowNet
Dosovitskiy, A., Fischer, P., Ilg, E., Hausser, P., Hazirbas, C., Golkov, V., van der Smagt, P., Cremers, D. and Brox, T., 2015. FlowNet: Learning
Optical Flow With Convolutional Networks. In Proceedings of the IEEE International Conference on Computer Vision (pp. 2758-2766). 56
End-to-end supervised learning of optical flow.
Optical Flow: FlowNet (contracting)
Dosovitskiy, A., Fischer, P., Ilg, E., Hausser, P., Hazirbas, C., Golkov, V., van der Smagt, P., Cremers, D. and Brox, T., 2015. FlowNet: Learning
Optical Flow With Convolutional Networks. In Proceedings of the IEEE International Conference on Computer Vision (pp. 2758-2766). 57
Option A: Stack both input images together and feed them through a generic network.
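A minimal sketch (PyTorch) of Option A, the "FlowNetSimple" input: the two RGB frames are concatenated into a 6-channel tensor before the first convolution.

```python
import torch
import torch.nn as nn

frame_t0 = torch.randn(1, 3, 384, 512)
frame_t1 = torch.randn(1, 3, 384, 512)
stacked = torch.cat([frame_t0, frame_t1], dim=1)              # (1, 6, 384, 512)
conv1 = nn.Conv2d(6, 64, kernel_size=7, stride=2, padding=3)  # generic first layer
features = conv1(stacked)                                     # (1, 64, 192, 256)
```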
Optical Flow: FlowNet (contracting)
Dosovitskiy, A., Fischer, P., Ilg, E., Hausser, P., Hazirbas, C., Golkov, V., van der Smagt, P., Cremers, D. and Brox, T., 2015. FlowNet: Learning
Optical Flow With Convolutional Networks. In Proceedings of the IEEE International Conference on Computer Vision (pp. 2758-2766). 58
Option B: Create two separate, yet identical processing streams for the two images and combine them at a
later stage.
Optical Flow: FlowNet (contracting)
Dosovitskiy, A., Fischer, P., Ilg, E., Hausser, P., Hazirbas, C., Golkov, V., van der Smagt, P., Cremers, D. and Brox, T., 2015. FlowNet: Learning
Optical Flow With Convolutional Networks. In Proceedings of the IEEE International Conference on Computer Vision (pp. 2758-2766). 59
Option B: Create two separate, yet identical processing streams for the two images and combine them at a
later stage.
Correlation layer:
convolution of data patches from the two layers to be combined.
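A hedged sketch (patch size 1, in PyTorch) of such a correlation layer: each position of the first feature map is dot-multiplied with displaced positions of the second map, one output channel per displacement; the paper correlates small patches over a bounded displacement range.

```python
import torch
import torch.nn.functional as F

def correlation(f1, f2, max_disp=4):
    b, c, h, w = f1.shape
    f2 = F.pad(f2, [max_disp] * 4)            # zero-pad so every shift is valid
    out = []
    for dy in range(2 * max_disp + 1):
        for dx in range(2 * max_disp + 1):
            shifted = f2[:, :, dy:dy + h, dx:dx + w]
            out.append((f1 * shifted).sum(dim=1, keepdim=True) / c)
    return torch.cat(out, dim=1)              # (b, (2*max_disp+1)**2, h, w)
```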
Optical Flow: FlowNet (expanding)
Dosovitskiy, A., Fischer, P., Ilg, E., Hausser, P., Hazirbas, C., Golkov, V., van der Smagt, P., Cremers, D. and Brox, T., 2015. FlowNet: Learning
Optical Flow With Convolutional Networks. In Proceedings of the IEEE International Conference on Computer Vision (pp. 2758-2766). 60
Upconvolutional layers: unpooling feature maps + convolution.
Upconvolved feature maps are concatenated with the corresponding map from the contracting part.
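A minimal sketch (PyTorch) of one refinement step in the expanding part; the channel sizes are illustrative, not taken from the paper.

```python
import torch
import torch.nn as nn

upconv = nn.ConvTranspose2d(1024, 512, kernel_size=4, stride=2, padding=1)

def expand_step(coarse, skip):
    up = torch.relu(upconv(coarse))       # upconvolution: doubles spatial resolution
    return torch.cat([up, skip], dim=1)   # reuse features from the contracting part
```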
Optical Flow: FlowNet
Dosovitskiy, A., Fischer, P., Ilg, E., Hausser, P., Hazirbas, C., Golkov, V., van der Smagt, P., Cremers, D. and Brox, T., 2015. FlowNet: Learning
Optical Flow With Convolutional Networks. ICCV 2015 61
Since existing ground-truth datasets are not sufficiently large to train a ConvNet, a synthetic Flying Chairs dataset is generated… and augmented (translation, rotation and scaling transformations; additive Gaussian noise; changes in brightness, contrast, gamma and color).
ConvNets trained on these unrealistic data generalize well to existing datasets such as Sintel and KITTI.
Data
augmentation
Dosovitskiy, A., Fischer, P., Ilg, E., Hausser, P., Hazirbas, C., Golkov, V., van der Smagt, P., Cremers, D. and Brox, T., 2015. FlowNet: Learning
Optical Flow With Convolutional Networks. In Proceedings of the IEEE International Conference on Computer Vision (pp. 2758-2766). 62
Optical Flow: FlowNet
Outline
1. Recognition
2. Optical Flow
3. Object Tracking
4. Audio and Video
5. Generative models
63
Object tracking: MDNet
64
Nam, Hyeonseob, and Bohyung Han. "Learning multi-domain convolutional neural networks for visual tracking." ICCV VOT Workshop (2015)
Object tracking: MDNet
65
Nam, Hyeonseob, and Bohyung Han. "Learning multi-domain convolutional neural networks for visual tracking." ICCV VOT Workshop (2015)
Object tracking: MDNet: Architecture
66
Nam, Hyeonseob, and Bohyung Han. "Learning multi-domain convolutional neural networks for visual tracking." ICCV VOT Workshop (2015)
Domain-specific layers are used during training for each sequence, but are replaced by a single one at test
time.
Object tracking: MDNet: Online update
67
Nam, Hyeonseob, and Bohyung Han. "Learning multi-domain convolutional neural networks for visual tracking." ICCV VOT Workshop (2015)
MDNet is updated online at test time with hard negative mining, that is, selecting the negative samples with the highest positive score.
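A hedged sketch (PyTorch) of that mining step: among the candidate negative boxes, keep the ones the current network scores most confidently as the target.

```python
import torch

def hard_negatives(net, neg_batch, k=32):
    """neg_batch: features of candidate negative samples; net is assumed to
    output two scores per sample (background, target) in that order."""
    with torch.no_grad():
        target_scores = net(neg_batch)[:, 1]   # "positive" score of each negative
    return neg_batch[target_scores.topk(k).indices]
```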
Object tracking: FCNT
68
Wang, Lijun, Wanli Ouyang, Xiaogang Wang, and Huchuan Lu. "Visual Tracking with Fully Convolutional Networks." ICCV 2015 [code]
Object tracking: FCNT
69
Wang, Lijun, Wanli Ouyang, Xiaogang Wang, and Huchuan Lu. "Visual Tracking with Fully Convolutional Networks." In Proceedings of the IEEE
International Conference on Computer Vision, pp. 3119-3127. 2015 [code]
Focus on conv4-3 and conv5-3 of the VGG-16 network, pre-trained for ImageNet image classification.
conv4-3 conv5-3
Object tracking: FCNT: Specialization
70
Wang, Lijun, Wanli Ouyang, Xiaogang Wang, and Huchuan Lu. "Visual Tracking with Fully Convolutional Networks." In Proceedings of the IEEE
International Conference on Computer Vision, pp. 3119-3127. 2015 [code]
Most feature maps in VGG-16 conv4-3 and conv5-3 are not related to the foreground regions in a tracking
sequence.
Object tracking: FCNT: Localization
71
Wang, Lijun, Wanli Ouyang, Xiaogang Wang, and Huchuan Lu. "Visual Tracking with Fully Convolutional Networks." In Proceedings of the IEEE
International Conference on Computer Vision, pp. 3119-3127. 2015 [code]
Although trained for image classification, feature maps in conv5-3 enable object localization…
...but they are not discriminative enough to distinguish different objects of the same category.
Object tracking: Localization
72
Zhou, Bolei, Aditya Khosla, Agata Lapedriza, Aude Oliva, and Antonio Torralba. "Object detectors emerge in deep scene cnns." ICLR 2015.
Other works have shown how feature maps in convolutional layers allow object localization.
Object tracking: FCNT: Localization
73
Wang, Lijun, Wanli Ouyang, Xiaogang Wang, and Huchuan Lu. "Visual Tracking with Fully Convolutional Networks." In Proceedings of the IEEE
International Conference on Computer Vision, pp. 3119-3127. 2015 [code]
On the other hand, feature maps from conv4-3 are more sensitive to intra-class appearance variation…
conv4-3 conv5-3
Object tracking: FCNT: Architecture
74
Wang, Lijun, Wanli Ouyang, Xiaogang Wang, and Huchuan Lu. "Visual Tracking with Fully Convolutional Networks." In Proceedings of the IEEE
International Conference on Computer Vision, pp. 3119-3127. 2015 [code]
SNet=Specific Network (online update)
GNet=General Network (fixed)
Object tracking: FCNT: Results
75
Wang, Lijun, Wanli Ouyang, Xiaogang Wang, and Huchuan Lu. "Visual Tracking with Fully Convolutional Networks." In Proceedings of the IEEE
International Conference on Computer Vision, pp. 3119-3127. 2015 [code]
Outline
1. Recognition
2. Optical Flow
3. Object Tracking
4. Audio and Video
5. Generative models
76
77
Audio and Video
Audio Vision
78
Audio and Video: Soundnet
Aytar, Yusuf, Carl Vondrick, and Antonio Torralba. "Soundnet: Learning sound representations from
unlabeled video." In Advances in Neural Information Processing Systems, pp. 892-900. 2016.
Object and scene recognition in videos by analysing the audio track (only).
79
Aytar, Yusuf, Carl Vondrick, and Antonio Torralba. "Soundnet: Learning sound representations from
unlabeled video." NIPS 2016.
Videos for training are unlabeled; the approach relies on CNNs trained on labeled images.
Audio and Video: Soundnet
80
Aytar, Yusuf, Carl Vondrick, and Antonio Torralba. "Soundnet: Learning sound representations from
unlabeled video." NIPS 2016.
Videos for training are unlabeled; the approach relies on CNNs trained on labeled images.
Audio and Video: Soundnet
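A hedged sketch (PyTorch) of this teacher-student transfer: the audio network is trained so that its class posteriors match those a pretrained vision CNN assigns to frames of the same unlabeled video, e.g. with a KL-divergence loss.

```python
import torch.nn.functional as F

def soundnet_loss(audio_logits, teacher_probs):
    """teacher_probs: softmax output of the pretrained image CNN on video frames."""
    return F.kl_div(F.log_softmax(audio_logits, dim=1),
                    teacher_probs, reduction='batchmean')
```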
81
Aytar, Yusuf, Carl Vondrick, and Antonio Torralba. "Soundnet: Learning sound representations from
unlabeled video." In Advances in Neural Information Processing Systems, pp. 892-900. 2016.
Audio and Video: Soundnet
82
Aytar, Yusuf, Carl Vondrick, and Antonio Torralba. "Soundnet: Learning sound representations from
unlabeled video." NIPS 2016.
Hidden layers of Soundnet are used to train a standard SVM classifier that outperforms the state of the art.
Audio and Video: Soundnet
83
Aytar, Yusuf, Carl Vondrick, and Antonio Torralba. "Soundnet: Learning sound representations from
unlabeled video." NIPS 2016.
Visualization of the 1D filters over raw audio in conv1.
Audio and Video: Soundnet
84
Aytar, Yusuf, Carl Vondrick, and Antonio Torralba. "Soundnet: Learning sound representations from
unlabeled video." NIPS 2016.
Visualization of the 1D filters over raw audio in conv1.
Audio and Video: Soundnet
85
Aytar, Yusuf, Carl Vondrick, and Antonio Torralba. "Soundnet: Learning sound representations from
unlabeled video." NIPS 2016.
Visualization of the 1D filters over raw audio in conv1.
Audio and Video: Soundnet
86
Aytar, Yusuf, Carl Vondrick, and Antonio Torralba. "Soundnet: Learning sound representations from
unlabeled video." In Advances in Neural Information Processing Systems, pp. 892-900. 2016.
Visualization of the video frames associated with the sounds that activate some of the last hidden units (conv7):
Audio and Video: Soundnet
87
Audio and Video: Sonorization
Owens, Andrew, Phillip Isola, Josh McDermott, Antonio Torralba, Edward H. Adelson, and William T.
Freeman. "Visually indicated sounds." CVPR 2016.
Learning to synthesize sounds from videos of people hitting objects with a drumstick.
88
Audio and Video: Visual Sounds
Owens, Andrew, Phillip Isola, Josh McDermott, Antonio Torralba, Edward H. Adelson, and William T.
Freeman. "Visually indicated sounds." CVPR 2016.
Not end-to-end
89
Audio and Video: Visual Sounds
Owens, Andrew, Phillip Isola, Josh McDermott, Antonio Torralba, Edward H. Adelson, and William T.
Freeman. "Visually indicated sounds." CVPR 2016.
90
Learn more
Ruslan Salakhutdinov, “Multimodal Machine Learning” (NIPS 2015 Workshop)
Generative models for Video
91
Slides
D2L5 by Santi Pascual.
92
What are Generative Models?
We want our model, with parameters θ = {weights, biases} and outputs distributed as Pmodel, to estimate the distribution of our training data, Pdata.
Example: y = f(x), where y is a scalar; we make Pmodel similar to Pdata by training the parameters θ to maximize their similarity.
Key idea: our model cares about what distribution generated the input data points, and we want to mimic it with our probabilistic model. Our learned model should be able to make up new samples from the distribution, not just copy and paste existing samples!
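Written as an equation (a standard maximum-likelihood formulation, added here for clarity rather than taken from the slide):

```latex
\theta^{*} = \arg\max_{\theta}\; \mathbb{E}_{x \sim P_{\text{data}}}\left[\log P_{\text{model}}(x;\theta)\right]
```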
93
What are Generative Models?
Figure from NIPS 2016 Tutorial: Generative Adversarial Networks (I. Goodfellow)
94
Video Frame Prediction
Mathieu, Michael, Camille Couprie, and Yann LeCun. "Deep multi-scale video prediction beyond mean square error." ICLR
2016
95
Video Frame Prediction
Mathieu, Michael, Camille Couprie, and Yann LeCun. "Deep multi-scale video prediction beyond mean square error." ICLR
2016
96
Video Frame Prediction
Mathieu, Michael, Camille Couprie, and Yann LeCun. "Deep multi-scale video prediction beyond mean square error." ICLR
2016
Adversarial Training analogy
Imagine we have a counterfeiter (G) trying to make fake money, and the police (D)
has to detect whether money is real or fake.
(Figure: fake vs. real $100 bills) “It’s not even green”
Adversarial Training analogy
Imagine we have a counterfeiter (G) trying to make fake money, and the police (D)
has to detect whether money is real or fake.
(Figure: fake vs. real $100 bills) “There is no watermark”
Adversarial Training analogy
Imagine we have a counterfeiter (G) trying to make fake money, and the police (D)
has to detect whether money is real or fake.
(Figure: fake vs. real $100 bills) “Watermark should be rounded”
Adversarial Training analogy
Imagine we have a counterfeiter (G) trying to make fake money, and the police (D)
has to detect whether money is real or fake.
After enough iterations, and if the counterfeiter is good enough (in terms of the G network, this means “has enough parameters”), the police should be confused.
Adversarial Training (batch update)
● Pick a sample x from training set
● Show x to D and update weights to
output 1 (real)
Adversarial Training (batch update)
● G maps sample z to x̂
● Show x̂ to D and update weights to output 0 (fake)
Adversarial Training (batch update)
● Freeze D weights
● Update G weights to make D output 1 (just G weights!)
● Unfreeze D Weights and repeat
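A minimal sketch (PyTorch) of the batch update just described; `G`, `D` and their optimizers are assumed to be defined elsewhere, with D ending in a sigmoid.

```python
import torch
import torch.nn.functional as F

def gan_step(G, D, opt_D, opt_G, x, z):
    # 1) Show real x to D and push its output towards 1
    # 2) Show generated G(z) to D and push its output towards 0
    d_real, d_fake = D(x), D(G(z).detach())
    loss_D = F.binary_cross_entropy(d_real, torch.ones_like(d_real)) + \
             F.binary_cross_entropy(d_fake, torch.zeros_like(d_fake))
    opt_D.zero_grad(); loss_D.backward(); opt_D.step()

    # 3) Freeze D (only G's weights are updated) and make D output 1 on fakes
    d_fake = D(G(z))
    loss_G = F.binary_cross_entropy(d_fake, torch.ones_like(d_fake))
    opt_G.zero_grad(); loss_G.backward(); opt_G.step()
```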
104
Generative Adversarial Networks (GANs)
Slide credit: Víctor Garcia
(Figure: a random seed (z) feeds the Generator G(·); the Discriminator D(·) classifies samples from the real world vs. generated ones as Real/Synthetic)
105 Slide credit: Víctor Garcia
Conditional Adversarial Networks
(Figure: as above, but both the Generator G(·) and the Discriminator D(·) additionally receive a condition alongside the real-world/synthetic samples)
Generative Adversarial Networks (GANs)
Generating images/frames
(Radford et al. 2015)
Deep Conv. GAN (DCGAN) effectively generated 64x64 RGB images in a single shot, for example bedrooms from the LSUN dataset.
Generating images/frames conditioned on captions
(Reed et al. 2016b) (Zhang et al. 2016)
Unsupervised feature extraction/learning representations
Similarly to word2vec, GANs learn a distributed representation that disentangles
concepts such that we can perform operations on the data manifold:
v(Man with glasses) - v(man) + v(woman) = v(woman with glasses)
(Radford et al. 2015)
Image super-resolution
Bicubic: not using data statistics. SRResNet: trained with MSE. SRGAN is able to
understand that there are multiple correct answers, rather than averaging.
(Ledig et al. 2016)
Image super-resolution
Averaging is a serious problem we face when dealing with complex distributions.
(Ledig et al. 2016)
Manipulating images and assisted content creation
https://youtu.be/9c4z6YsBGQ0?t=126 https://youtu.be/9c4z6YsBGQ0?t=161
(Zhu et al. 2016)
112
Adversarial Networks
Slide credit: Víctor Garcia
Isola, Phillip, Jun-Yan Zhu, Tinghui Zhou, and Alexei A. Efros. "Image-to-image translation with
conditional adversarial networks." arXiv preprint arXiv:1611.07004 (2016).
(Figure: the Generator produces generated pairs; the Discriminator compares them with real-world ground-truth pairs; Loss → BCE)
113 Víctor Garcia and Xavier Giró-i-Nieto (work in progress)
(Figure: Generator and Discriminator trained with a GAN loss {binary cross-entropy}, labeling samples 1/0)
Generative Adversarial Networks (GANs)
Generative models for video
114
Vondrick, Carl, Hamed Pirsiavash, and Antonio Torralba. "Generating videos with scene dynamics." NIPS 2016.
Generative models for video
115
Vondrick, Carl, Hamed Pirsiavash, and Antonio Torralba. "Generating videos with scene dynamics." NIPS 2016.
116
Adversarial Networks
Goodfellow, Ian, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua
Bengio. "Generative adversarial nets." NIPS 2014
Goodfellow, Ian. "NIPS 2016 Tutorial: Generative Adversarial Networks." arXiv preprint arXiv:1701.00160 (2016).
F. Van Veen, “The Neural Network Zoo” (2016)
Outline
1. Recognition
2. Optical Flow
3. Object Tracking
4. Audio and Video
5. Generative models
117
118
Thank you!
https://imatge.upc.edu/web/people/xavier-giro
https://twitter.com/DocXavi
https://www.facebook.com/ProfessorXavi
xavier.giro@upc.edu
Xavier Giró-i-Nieto
[Part B: Video and audio]
More Related Content

PDF
Deep Convnets for Video Processing (Master in Computer Vision Barcelona, 2016)
PDF
Deep Learning for Computer Vision (3/4): Video Analytics @ laSalle 2016
PDF
Deep Learning for Video: Action Recognition (UPC 2018)
PDF
Deep Learning for Computer Vision: Video Analytics (UPC 2016)
PDF
Multimodal Deep Learning (D4L4 Deep Learning for Speech and Language UPC 2017)
PDF
Speaker ID II (D4L1 Deep Learning for Speech and Language UPC 2017)
PDF
Video Analysis (D4L2 2017 UPC Deep Learning for Computer Vision)
PDF
Advanced Deep Architectures (D2L6 Deep Learning for Speech and Language UPC 2...
Deep Convnets for Video Processing (Master in Computer Vision Barcelona, 2016)
Deep Learning for Computer Vision (3/4): Video Analytics @ laSalle 2016
Deep Learning for Video: Action Recognition (UPC 2018)
Deep Learning for Computer Vision: Video Analytics (UPC 2016)
Multimodal Deep Learning (D4L4 Deep Learning for Speech and Language UPC 2017)
Speaker ID II (D4L1 Deep Learning for Speech and Language UPC 2017)
Video Analysis (D4L2 2017 UPC Deep Learning for Computer Vision)
Advanced Deep Architectures (D2L6 Deep Learning for Speech and Language UPC 2...

What's hot (20)

PDF
Deep Learning for Computer Vision (2/4): Object Analytics @ laSalle 2016
PDF
Deep Video Object Tracking - Xavier Giro - UPC Barcelona 2019
PDF
Learning Representations for Sign Language Videos - Xavier Giro - NIST TRECVI...
PDF
One Perceptron to Rule Them All: Language and Vision
PDF
Learning with Videos (D4L4 2017 UPC Deep Learning for Computer Vision)
PDF
Neural Architectures for Still Images - Xavier Giro- UPC Barcelona 2019
PDF
Closing, Course Offer 17/18 & Homework (D5 2017 UPC Deep Learning for Compute...
PDF
Neural Architectures for Video Encoding
PDF
Self-Supervised Audio-Visual Learning - Xavier Giro - UPC TelecomBCN Barcelon...
PDF
Deep Video Object Tracking 2020 - Xavier Giro - UPC TelecomBCN Barcelona
PDF
Deep Video Object Segmentation - Xavier Giro - UPC Barcelona 2019
PDF
Self-supervised Audiovisual Learning - Xavier Giro - UPC Barcelona 2019
PDF
Deep Learning Architectures for Video - Xavier Giro - UPC Barcelona 2019
PDF
Wav2Pix: Speech-conditioned face generation using Generative Adversarial Netw...
PDF
Self-supervised Audiovisual Learning 2020 - Xavier Giro-i-Nieto - UPC Telecom...
PDF
Welcome (D1L1 2017 UPC Deep Learning for Computer Vision)
PDF
Language and Vision with Deep Learning - Xavier Giró - ACM ICMR 2020 (Tutorial)
PDF
Deep Learning for Computer Vision (1/4): Image Analytics @ laSalle 2016
PDF
Action Recognitionの歴史と最新動向
PDF
Deep Learning for Computer Vision: ImageNet Challenge (UPC 2016)
Deep Learning for Computer Vision (2/4): Object Analytics @ laSalle 2016
Deep Video Object Tracking - Xavier Giro - UPC Barcelona 2019
Learning Representations for Sign Language Videos - Xavier Giro - NIST TRECVI...
One Perceptron to Rule Them All: Language and Vision
Learning with Videos (D4L4 2017 UPC Deep Learning for Computer Vision)
Neural Architectures for Still Images - Xavier Giro- UPC Barcelona 2019
Closing, Course Offer 17/18 & Homework (D5 2017 UPC Deep Learning for Compute...
Neural Architectures for Video Encoding
Self-Supervised Audio-Visual Learning - Xavier Giro - UPC TelecomBCN Barcelon...
Deep Video Object Tracking 2020 - Xavier Giro - UPC TelecomBCN Barcelona
Deep Video Object Segmentation - Xavier Giro - UPC Barcelona 2019
Self-supervised Audiovisual Learning - Xavier Giro - UPC Barcelona 2019
Deep Learning Architectures for Video - Xavier Giro - UPC Barcelona 2019
Wav2Pix: Speech-conditioned face generation using Generative Adversarial Netw...
Self-supervised Audiovisual Learning 2020 - Xavier Giro-i-Nieto - UPC Telecom...
Welcome (D1L1 2017 UPC Deep Learning for Computer Vision)
Language and Vision with Deep Learning - Xavier Giró - ACM ICMR 2020 (Tutorial)
Deep Learning for Computer Vision (1/4): Image Analytics @ laSalle 2016
Action Recognitionの歴史と最新動向
Deep Learning for Computer Vision: ImageNet Challenge (UPC 2016)
Ad

Similar to Video Analysis with Convolutional Neural Networks (Master Computer Vision Barcelona 2017) (20)

PPTX
Reading group - Week 2 - Trajectory Pooled Deep-Convolutional Descriptors (TDD)
DOCX
Large-scale Video Classification with Convolutional Neural Net.docx
PPTX
Learning spatiotemporal features with 3 d convolutional networks
PDF
IRJET-Multiple Object Detection using Deep Neural Networks
PDF
物件偵測與辨識技術
PDF
Human Behavior Understanding: From Human-Oriented Analysis to Action Recognit...
PPTX
ACTION-Net_Multipath_Excitation_for_Action_Recognition.pptx
PPTX
Deep Learning for Image Processing on 16 June 2025 MITS.pptx
PPTX
Data Con LA 2019 - State of the Art of Innovation in Computer Vision by Chris...
PDF
Content-Based Image Retrieval (D2L6 Insight@DCU Machine Learning Workshop 2017)
PPTX
Large-scale Video Classification with Convolutional Neural Network
PPTX
2014 - CVPR Tutorial on Deep Learning for Vision - Object Detection.pptx
PPTX
FINAL_Team_4.pptx
PDF
Image Classification on ImageNet (D1L3 Insight@DCU Machine Learning Workshop ...
PDF
Deep and Young Vision Learning at UPC BarcelonaTech (NIPS 2016)
PDF
Overview of Video Concept Detection using (CNN) Convolutional Neural Network
PDF
Video Classification: Human Action Recognition on HMDB-51 dataset
PDF
IRJET- Identification of Scene Images using Convolutional Neural Networks - A...
PDF
Deep Neural Networks Presentation
PDF
Efficient video perception through AI
Reading group - Week 2 - Trajectory Pooled Deep-Convolutional Descriptors (TDD)
Large-scale Video Classification with Convolutional Neural Net.docx
Learning spatiotemporal features with 3 d convolutional networks
IRJET-Multiple Object Detection using Deep Neural Networks
物件偵測與辨識技術
Human Behavior Understanding: From Human-Oriented Analysis to Action Recognit...
ACTION-Net_Multipath_Excitation_for_Action_Recognition.pptx
Deep Learning for Image Processing on 16 June 2025 MITS.pptx
Data Con LA 2019 - State of the Art of Innovation in Computer Vision by Chris...
Content-Based Image Retrieval (D2L6 Insight@DCU Machine Learning Workshop 2017)
Large-scale Video Classification with Convolutional Neural Network
2014 - CVPR Tutorial on Deep Learning for Vision - Object Detection.pptx
FINAL_Team_4.pptx
Image Classification on ImageNet (D1L3 Insight@DCU Machine Learning Workshop ...
Deep and Young Vision Learning at UPC BarcelonaTech (NIPS 2016)
Overview of Video Concept Detection using (CNN) Convolutional Neural Network
Video Classification: Human Action Recognition on HMDB-51 dataset
IRJET- Identification of Scene Images using Convolutional Neural Networks - A...
Deep Neural Networks Presentation
Efficient video perception through AI
Ad

More from Universitat Politècnica de Catalunya (20)

PDF
Deep Generative Learning for All - The Gen AI Hype (Spring 2024)
PDF
Deep Generative Learning for All
PDF
The Transformer in Vision | Xavier Giro | Master in Computer Vision Barcelona...
PDF
Towards Sign Language Translation & Production | Xavier Giro-i-Nieto
PDF
The Transformer - Xavier Giró - UPC Barcelona 2021
PDF
Open challenges in sign language translation and production
PPTX
Generation of Synthetic Referring Expressions for Object Segmentation in Videos
PPTX
Discovery and Learning of Navigation Goals from Pixels in Minecraft
PDF
Learn2Sign : Sign language recognition and translation using human keypoint e...
PDF
Intepretability / Explainable AI for Deep Neural Networks
PDF
Convolutional Neural Networks - Xavier Giro - UPC TelecomBCN Barcelona 2020
PDF
Attention for Deep Learning - Xavier Giro - UPC TelecomBCN Barcelona 2020
PDF
Generative Adversarial Networks GAN - Xavier Giro - UPC TelecomBCN Barcelona ...
PDF
Q-Learning with a Neural Network - Xavier Giró - UPC Barcelona 2020
PDF
Image Segmentation with Deep Learning - Xavier Giro & Carles Ventura - ISSonD...
PDF
Curriculum Learning for Recurrent Video Object Segmentation
PDF
Deep Self-supervised Learning for All - Xavier Giro - X-Europe 2020
PDF
Deep Learning Representations for All - Xavier Giro-i-Nieto - IRI Barcelona 2020
PDF
Transcription-Enriched Joint Embeddings for Spoken Descriptions of Images and...
PDF
Object Detection with Deep Learning - Xavier Giro-i-Nieto - UPC School Barcel...
Deep Generative Learning for All - The Gen AI Hype (Spring 2024)
Deep Generative Learning for All
The Transformer in Vision | Xavier Giro | Master in Computer Vision Barcelona...
Towards Sign Language Translation & Production | Xavier Giro-i-Nieto
The Transformer - Xavier Giró - UPC Barcelona 2021
Open challenges in sign language translation and production
Generation of Synthetic Referring Expressions for Object Segmentation in Videos
Discovery and Learning of Navigation Goals from Pixels in Minecraft
Learn2Sign : Sign language recognition and translation using human keypoint e...
Intepretability / Explainable AI for Deep Neural Networks
Convolutional Neural Networks - Xavier Giro - UPC TelecomBCN Barcelona 2020
Attention for Deep Learning - Xavier Giro - UPC TelecomBCN Barcelona 2020
Generative Adversarial Networks GAN - Xavier Giro - UPC TelecomBCN Barcelona ...
Q-Learning with a Neural Network - Xavier Giró - UPC Barcelona 2020
Image Segmentation with Deep Learning - Xavier Giro & Carles Ventura - ISSonD...
Curriculum Learning for Recurrent Video Object Segmentation
Deep Self-supervised Learning for All - Xavier Giro - X-Europe 2020
Deep Learning Representations for All - Xavier Giro-i-Nieto - IRI Barcelona 2020
Transcription-Enriched Joint Embeddings for Spoken Descriptions of Images and...
Object Detection with Deep Learning - Xavier Giro-i-Nieto - UPC School Barcel...

Recently uploaded (20)

PPTX
STERILIZATION AND DISINFECTION-1.ppthhhbx
PDF
Transcultural that can help you someday.
PPT
Miokarditis (Inflamasi pada Otot Jantung)
PPTX
Introduction to Basics of Ethical Hacking and Penetration Testing -Unit No. 1...
PPT
Predictive modeling basics in data cleaning process
PPTX
Introduction to Firewall Analytics - Interfirewall and Transfirewall.pptx
PPTX
IBA_Chapter_11_Slides_Final_Accessible.pptx
PPTX
modul_python (1).pptx for professional and student
PPTX
Data_Analytics_and_PowerBI_Presentation.pptx
PDF
Introduction to Data Science and Data Analysis
PDF
Business Analytics and business intelligence.pdf
PPTX
Database Infoormation System (DBIS).pptx
PDF
Clinical guidelines as a resource for EBP(1).pdf
PPT
Reliability_Chapter_ presentation 1221.5784
PDF
168300704-gasification-ppt.pdfhghhhsjsjhsuxush
PPTX
STUDY DESIGN details- Lt Col Maksud (21).pptx
PPTX
The THESIS FINAL-DEFENSE-PRESENTATION.pptx
PPTX
01_intro xxxxxxxxxxfffffffffffaaaaaaaaaaafg
PPT
Quality review (1)_presentation of this 21
PPTX
Acceptance and paychological effects of mandatory extra coach I classes.pptx
STERILIZATION AND DISINFECTION-1.ppthhhbx
Transcultural that can help you someday.
Miokarditis (Inflamasi pada Otot Jantung)
Introduction to Basics of Ethical Hacking and Penetration Testing -Unit No. 1...
Predictive modeling basics in data cleaning process
Introduction to Firewall Analytics - Interfirewall and Transfirewall.pptx
IBA_Chapter_11_Slides_Final_Accessible.pptx
modul_python (1).pptx for professional and student
Data_Analytics_and_PowerBI_Presentation.pptx
Introduction to Data Science and Data Analysis
Business Analytics and business intelligence.pdf
Database Infoormation System (DBIS).pptx
Clinical guidelines as a resource for EBP(1).pdf
Reliability_Chapter_ presentation 1221.5784
168300704-gasification-ppt.pdfhghhhsjsjhsuxush
STUDY DESIGN details- Lt Col Maksud (21).pptx
The THESIS FINAL-DEFENSE-PRESENTATION.pptx
01_intro xxxxxxxxxxfffffffffffaaaaaaaaaaafg
Quality review (1)_presentation of this 21
Acceptance and paychological effects of mandatory extra coach I classes.pptx

Video Analysis with Convolutional Neural Networks (Master Computer Vision Barcelona 2017)

  • 1. @DocXavi Module 4 - Lecture 6 Video Analysis with CNNs 31 January 2017 Xavier Giró-i-Nieto [http://pagines.uab.cat/mcv/]
  • 4. Outline 1. Recognition 2. Optical Flow 3. Object Tracking 4. Audio and Video 5. Generative models 4
  • 5. Recognition Demo: Clarifai MIT Technology Review : “A start-up’s Neural Network Can Understand Video” (3/2/2015) 5
  • 6. Figure: Karpathy, A., Toderici, G., Shetty, S., Leung, T., Sukthankar, R., & Fei-Fei, L. (2014, June). Large-scale video classification with convolutional neural networks. In Computer Vision and Pattern Recognition (CVPR), 2014 IEEE Conference on (pp. 1725-1732). IEEE. 6 Recognition
  • 7. 7 Recognition Figure: Tran, Du, Lubomir Bourdev, Rob Fergus, Lorenzo Torresani, and Manohar Paluri. "Learning spatiotemporal features with 3D convolutional networks." In Proceedings of the IEEE International Conference on Computer Vision, pp. 4489-4497. 2015
  • 8. 8 Recognition Figure: Tran, Du, Lubomir Bourdev, Rob Fergus, Lorenzo Torresani, and Manohar Paluri. "Learning spatiotemporal features with 3D convolutional networks." In Proceedings of the IEEE International Conference on Computer Vision, pp. 4489-4497. 2015 Previous lectures
  • 9. 9 Recognition Figure: Tran, Du, Lubomir Bourdev, Rob Fergus, Lorenzo Torresani, and Manohar Paluri. "Learning spatiotemporal features with 3D convolutional networks." In Proceedings of the IEEE International Conference on Computer Vision, pp. 4489-4497. 2015
  • 10. Karpathy, A., Toderici, G., Shetty, S., Leung, T., Sukthankar, R., & Fei-Fei, L. (2014, June). Large-scale video classification with convolutional neural networks. In Computer Vision and Pattern Recognition (CVPR), 2014 IEEE Conference on (pp. 1725-1732). IEEE. Slides extracted from ReadCV seminar by Victor Campos 10 Recognition: DeepVideo
  • 11. Karpathy, A., Toderici, G., Shetty, S., Leung, T., Sukthankar, R., & Fei-Fei, L. (2014, June). Large-scale video classification with convolutional neural networks. In Computer Vision and Pattern Recognition (CVPR), 2014 IEEE Conference on (pp. 1725-1732). IEEE. 11 Recognition: DeepVideo: Demo
  • 12. Karpathy, A., Toderici, G., Shetty, S., Leung, T., Sukthankar, R., & Fei-Fei, L. (2014, June). Large-scale video classification with convolutional neural networks. In Computer Vision and Pattern Recognition (CVPR), 2014 IEEE Conference on (pp. 1725-1732). IEEE. 12 Recognition: DeepVideo: Architectures
  • 13. Karpathy, A., Toderici, G., Shetty, S., Leung, T., Sukthankar, R., & Fei-Fei, L. (2014, June). Large-scale video classification with convolutional neural networks. In Computer Vision and Pattern Recognition (CVPR), 2014 IEEE Conference on (pp. 1725-1732). IEEE. 13 Unsupervised learning [Le at al’11] Supervised learning [Karpathy et al’14] Recognition: DeepVideo: Features
  • 14. Karpathy, A., Toderici, G., Shetty, S., Leung, T., Sukthankar, R., & Fei-Fei, L. (2014, June). Large-scale video classification with convolutional neural networks. In Computer Vision and Pattern Recognition (CVPR), 2014 IEEE Conference on (pp. 1725-1732). IEEE. 14 Recognition: DeepVideo: Multiscale
  • 15. Karpathy, A., Toderici, G., Shetty, S., Leung, T., Sukthankar, R., & Fei-Fei, L. (2014, June). Large-scale video classification with convolutional neural networks. In Computer Vision and Pattern Recognition (CVPR), 2014 IEEE Conference on (pp. 1725-1732). IEEE. 15 Recognition: DeepVideo: Results
  • 16. 16 Recognition Figure: Tran, Du, Lubomir Bourdev, Rob Fergus, Lorenzo Torresani, and Manohar Paluri. "Learning spatiotemporal features with 3D convolutional networks." In Proceedings of the IEEE International Conference on Computer Vision, pp. 4489-4497. 2015
  • 17. 17 Recognition: C3D Figure: Tran, Du, Lubomir Bourdev, Rob Fergus, Lorenzo Torresani, and Manohar Paluri. "Learning spatiotemporal features with 3D convolutional networks." In Proceedings of the IEEE International Conference on Computer Vision, pp. 4489-4497. 2015
  • 18. 18 Tran, Du, Lubomir Bourdev, Rob Fergus, Lorenzo Torresani, and Manohar Paluri. "Learning spatiotemporal features with 3D convolutional networks." In Proceedings of the IEEE International Conference on Computer Vision, pp. 4489-4497. 2015 Recognition: C3D: Demo
  • 19. 19 K. Simonyan, A. Zisserman, Very Deep Convolutional Networks for Large-Scale Image Recognition ICLR 2015. Recognition: C3D: Spatial dimension Spatial dimensions (XY) of the used kernels are fixed to 3x3, following Symonian & Zisserman (ICLR 2015).
  • 20. 20 Tran, Du, Lubomir Bourdev, Rob Fergus, Lorenzo Torresani, and Manohar Paluri. "Learning spatiotemporal features with 3D convolutional networks." In Proceedings of the IEEE International Conference on Computer Vision, pp. 4489-4497. 2015 Recognition: C3D: Temporal dimension 3D ConvNets are more suitable for spatiotemporal feature learning compared to 2D ConvNets Temporal depth 2D ConvNets
  • 21. 21 Tran, Du, Lubomir Bourdev, Rob Fergus, Lorenzo Torresani, and Manohar Paluri. "Learning spatiotemporal features with 3D convolutional networks." In Proceedings of the IEEE International Conference on Computer Vision, pp. 4489-4497. 2015 A homogeneous architecture with small 3 × 3 × 3 convolution kernels in all layers is among the best performing architectures for 3D ConvNets Recognition: C3D: Temporal dimension
  • 22. 22 Tran, Du, Lubomir Bourdev, Rob Fergus, Lorenzo Torresani, and Manohar Paluri. "Learning spatiotemporal features with 3D convolutional networks." In Proceedings of the IEEE International Conference on Computer Vision, pp. 4489-4497. 2015 No gain when varying the temporal depth across layers. Recognition: C3D: Temporal dimension
  • 23. 23 Tran, Du, Lubomir Bourdev, Rob Fergus, Lorenzo Torresani, and Manohar Paluri. "Learning spatiotemporal features with 3D convolutional networks." In Proceedings of the IEEE International Conference on Computer Vision, pp. 4489-4497. 2015 Recognition: C3D: Architecture Feature vector
  • 24. 24 Tran, Du, Lubomir Bourdev, Rob Fergus, Lorenzo Torresani, and Manohar Paluri. "Learning spatiotemporal features with 3D convolutional networks." In Proceedings of the IEEE International Conference on Computer Vision, pp. 4489-4497. 2015 Recognition: C3D: Feature vector Video sequence 16 frames-long clips 8 frames-long overlap
  • 25. 25 Tran, Du, Lubomir Bourdev, Rob Fergus, Lorenzo Torresani, and Manohar Paluri. "Learning spatiotemporal features with 3D convolutional networks." In Proceedings of the IEEE International Conference on Computer Vision, pp. 4489-4497. 2015 Recognition: C3D: Feature vector 16-frame clip 16-frame clip 16-frame clip 16-frame clip ... Average 4096-dimvideodescriptor 4096-dimvideodescriptor L2 norm
  • 26. 26 Tran, Du, Lubomir Bourdev, Rob Fergus, Lorenzo Torresani, and Manohar Paluri. "Learning spatiotemporal features with 3D convolutional networks." In Proceedings of the IEEE International Conference on Computer Vision, pp. 4489-4497. 2015 Recognition: C3D: Visualization Based on Deconvnets by Zeiler and Fergus [ECCV 2014] - See [ReadCV Slides] for more details.
  • 27. 27 Tran, Du, Lubomir Bourdev, Rob Fergus, Lorenzo Torresani, and Manohar Paluri. "Learning spatiotemporal features with 3D convolutional networks." In Proceedings of the IEEE International Conference on Computer Vision, pp. 4489-4497. 2015 Recognition: C3D: Compactness
  • 28. 28 Tran, Du, Lubomir Bourdev, Rob Fergus, Lorenzo Torresani, and Manohar Paluri. "Learning spatiotemporal features with 3D convolutional networks." In Proceedings of the IEEE International Conference on Computer Vision, pp. 4489-4497. 2015 Convolutional 3D(C3D) combined with a simple linear classifier outperforms state-of-the-art methods on 4 different benchmarks and are comparable with state of the art methods on other 2 benchmarks Recognition: C3D: Performance
  • 29. 29 Tran, Du, Lubomir Bourdev, Rob Fergus, Lorenzo Torresani, and Manohar Paluri. "Learning spatiotemporal features with 3D convolutional networks." In Proceedings of the IEEE International Conference on Computer Vision, pp. 4489-4497. 2015 Recognition: C3D: Software Implementation by Michael Gygli (GitHub)
  • 30. 30Simonyan, Karen, and Andrew Zisserman. "Two-stream convolutional networks for action recognition in videos." 2014. Recognition: Two stream Two CNNs in paralel: ● One for RGB images ● One for Optical flow (hand-crafted features) Fusion after the softmax layer
  • 31. 31Feichtenhofer, Christoph, Axel Pinz, and Andrew Zisserman. "Convolutional two-stream network fusion for video action recognition." CVPR 2016. [code] Recognition: Two stream Two CNNs in paralel: ● One for RGB images ● One for Optical flow (hand-crafted features) Fusion at a convolutional layer
  • 32. 32 Shou, Zheng, Dongang Wang, and Shih-Fu Chang. "Temporal action localization in untrimmed videos via multi-stage cnns." CVPR 2016. (Slidecast and Slides by Alberto Montes) Recognition: Localization
  • 33. 33 Recognition: Localization Shou, Zheng, Dongang Wang, and Shih-Fu Chang. "Temporal action localization in untrimmed videos via multi-stage cnns." CVPR 2016. (Slidecast and Slides by Alberto Montes)
  • 34. 34 Recognition: Localization Shou, Zheng, Dongang Wang, and Shih-Fu Chang. "Temporal action localization in untrimmed videos via multi-stage cnns." CVPR 2016. (Slidecast and Slides by Alberto Montes)
  • 35. Outline 1. Recognition 2. Optical Flow 3. Object Tracking 4. Audio and Video 5. Generative models 35
  • 36. Optical Flow Weinzaepfel, P., Revaud, J., Harchaoui, Z., & Schmid, C. (2013, December). DeepFlow: Large displacement optical flow with deep matching. In Computer Vision (ICCV), 2013 IEEE International Conference on (pp. 1385-1392). IEEE 36
  • 37. Optical Flow: DeepFlow Weinzaepfel, P., Revaud, J., Harchaoui, Z., & Schmid, C. (2013, December). DeepFlow: Large displacement optical flow with deep matching. In Computer Vision (ICCV), 2013 IEEE International Conference on (pp. 1385-1392). IEEE 37 Andrei Bursuc Postoc INRIA @abursuc
  • 38. Optical Flow: DeepFlow Weinzaepfel, P., Revaud, J., Harchaoui, Z., & Schmid, C. (2013, December). DeepFlow: Large displacement optical flow with deep matching. In Computer Vision (ICCV), 2013 IEEE International Conference on (pp. 1385-1392). IEEE 38 ● Deep (hierarchy) ✔ ● Convolution ✔ ● Learning ❌
  • 39. Optical Flow: Small vs Large Weinzaepfel, P., Revaud, J., Harchaoui, Z., & Schmid, C. (2013, December). DeepFlow: Large displacement optical flow with deep matching. In Computer Vision (ICCV), 2013 IEEE International Conference on (pp. 1385-1392). IEEE 39
  • 40. Weinzaepfel, P., Revaud, J., Harchaoui, Z., & Schmid, C. (2013, December). DeepFlow: Large displacement optical flow with deep matching. In Computer Vision (ICCV), 2013 IEEE International Conference on (pp. 1385-1392). IEEE 40 Optical Flow Classic approach: Rigid matching of HoG or SIFT descriptors Deep Matching: Allow each subpatch to move: ● independently ● in a limited range depending on its size
  • 41. Weinzaepfel, P., Revaud, J., Harchaoui, Z., & Schmid, C. (2013, December). DeepFlow: Large displacement optical flow with deep matching. In Computer Vision (ICCV), 2013 IEEE International Conference on (pp. 1385-1392). IEEE 41 Optical Flow: Deep Matching
  • 42. Source: Matlab R2015b documentation for normxcorr2 by Mathworks 42 Optical Flow: 2D correlation Image Sub-Image Offset of the sub-image with respect to the image [0,0].
  • 43. Weinzaepfel, P., Revaud, J., Harchaoui, Z., & Schmid, C. (2013, December). DeepFlow: Large displacement optical flow with deep matching. In Computer Vision (ICCV), 2013 IEEE International Conference on (pp. 1385-1392). IEEE 43 Instead of pre-trained filters, a convolution is defined between each: ● patch of the reference image ● target image ...as a results, a correlation map is generated for each reference patch. Optical Flow: Deep Matching
  • 44. Weinzaepfel, P., Revaud, J., Harchaoui, Z., & Schmid, C. (2013, December). DeepFlow: Large displacement optical flow with deep matching. In Computer Vision (ICCV), 2013 IEEE International Conference on (pp. 1385-1392). IEEE 44 Optical Flow: Deep Matching The most discriminative response map The less discriminative response map
  • 45. Weinzaepfel, P., Revaud, J., Harchaoui, Z., & Schmid, C. (2013, December). DeepFlow: Large displacement optical flow with deep matching. In Computer Vision (ICCV), 2013 IEEE International Conference on (pp. 1385-1392). IEEE 45 Key idea: Build (bottom-up) a pyramid of correlation maps to run an efficient (top-down) search. Optical Flow: Deep Matching 4x4 patches 8x8 patches 16x16 patches 32x32 patches Top-down matching (TD)Bottom-up extraction (BU)
  • 46. Weinzaepfel, P., Revaud, J., Harchaoui, Z., & Schmid, C. (2013, December). DeepFlow: Large displacement optical flow with deep matching. In Computer Vision (ICCV), 2013 IEEE International Conference on (pp. 1385-1392). IEEE 46 Key idea: Build (bottom-up) a pyramid of correlation maps to run an efficient (top-down) search. Optical Flow: Deep Matching 4x4 patches 8x8 patches 16x16 patches 32x32 patches Bottom-up extraction (BU)
  • 47. Weinzaepfel, P., Revaud, J., Harchaoui, Z., & Schmid, C. (2013, December). DeepFlow: Large displacement optical flow with deep matching. In Computer Vision (ICCV), 2013 IEEE International Conference on (pp. 1385-1392). IEEE 47 Optical Flow: Deep Matching (BU)
  • 48. Weinzaepfel, P., Revaud, J., Harchaoui, Z., & Schmid, C. (2013, December). DeepFlow: Large displacement optical flow with deep matching. In Computer Vision (ICCV), 2013 IEEE International Conference on (pp. 1385-1392). IEEE 48 Key idea: Build (bottom-up) a pyramid of correlation maps to run an efficient (top-down) search. Optical Flow: Deep Matching (TD) 4x4 patches 8x8 patches 16x16 patches 32x32 patches Top-down matching (TD)
  • 49. Weinzaepfel, P., Revaud, J., Harchaoui, Z., & Schmid, C. (2013, December). DeepFlow: Large displacement optical flow with deep matching. In Computer Vision (ICCV), 2013 IEEE International Conference on (pp. 1385-1392). IEEE 49 Optical Flow: Deep Matching (TD) Each local maxima in the top layer corresponds to a shift of one of the biggest (32x32) patches. If we focus on local maximum, we can retrieve the corresponding responses one scale below and focus on shift of the sub-patches that generated it
  • 50. Weinzaepfel, P., Revaud, J., Harchaoui, Z., & Schmid, C. (2013, December). DeepFlow: Large displacement optical flow with deep matching. In Computer Vision (ICCV), 2013 IEEE International Conference on (pp. 1385-1392). IEEE 50 Optical Flow: Deep Matching (TD)
  • 51. Weinzaepfel, P., Revaud, J., Harchaoui, Z., & Schmid, C. (2013, December). DeepFlow: Large displacement optical flow with deep matching. In Computer Vision (ICCV), 2013 IEEE International Conference on (pp. 1385-1392). IEEE 51 Optical Flow: Deep Matching
  • 52. Weinzaepfel, P., Revaud, J., Harchaoui, Z., & Schmid, C. (2013, December). DeepFlow: Large displacement optical flow with deep matching. In Computer Vision (ICCV), 2013 IEEE International Conference on (pp. 1385-1392). IEEE 52 Ground truth Dense HOG [Brox & Malik 2011] Deep Matching Optical Flow: Deep Matching
  • 53. Weinzaepfel, P., Revaud, J., Harchaoui, Z., & Schmid, C. (2013, December). DeepFlow: Large displacement optical flow with deep matching. In Computer Vision (ICCV), 2013 IEEE International Conference on (pp. 1385-1392). IEEE 53 Optical Flow: Deep Matching
  • 54. Optical Flow Dosovitskiy, A., Fischer, P., Ilg, E., Hausser, P., Hazirbas, C., Golkov, V., van der Smagt, P., Cremers, D. and Brox, T., 2015. FlowNet: Learning Optical Flow With Convolutional Networks. In Proceedings of the IEEE International Conference on Computer Vision (pp. 2758-2766). 54
  • 55. Optical Flow: FlowNet Dosovitskiy, A., Fischer, P., Ilg, E., Hausser, P., Hazirbas, C., Golkov, V., van der Smagt, P., Cremers, D. and Brox, T., 2015. FlowNet: Learning Optical Flow With Convolutional Networks. In Proceedings of the IEEE International Conference on Computer Vision (pp. 2758-2766). 55
  • 56. Optical Flow: FlowNet Dosovitskiy, A., Fischer, P., Ilg, E., Hausser, P., Hazirbas, C., Golkov, V., van der Smagt, P., Cremers, D. and Brox, T., 2015. FlowNet: Learning Optical Flow With Convolutional Networks. In Proceedings of the IEEE International Conference on Computer Vision (pp. 2758-2766). 56 End to end supervised learning of optical flow.
  • 57. Optical Flow: FlowNet (contracting) Dosovitskiy, A., Fischer, P., Ilg, E., Hausser, P., Hazirbas, C., Golkov, V., van der Smagt, P., Cremers, D. and Brox, T., 2015. FlowNet: Learning Optical Flow With Convolutional Networks. In Proceedings of the IEEE International Conference on Computer Vision (pp. 2758-2766). 57 Option A: Stack both input images together and feed them through a generic network.
  • 58. Optical Flow: FlowNet (contracting) Dosovitskiy, A., Fischer, P., Ilg, E., Hausser, P., Hazirbas, C., Golkov, V., van der Smagt, P., Cremers, D. and Brox, T., 2015. FlowNet: Learning Optical Flow With Convolutional Networks. In Proceedings of the IEEE International Conference on Computer Vision (pp. 2758-2766). 58 Option B: Create two separate, yet identical processing streams for the two images and combine them at a later stage.
  • 59. Optical Flow: FlowNet (contracting) Dosovitskiy, A., Fischer, P., Ilg, E., Hausser, P., Hazirbas, C., Golkov, V., van der Smagt, P., Cremers, D. and Brox, T., 2015. FlowNet: Learning Optical Flow With Convolutional Networks. In Proceedings of the IEEE International Conference on Computer Vision (pp. 2758-2766). 59 Option B: Create two separate, yet identical processing streams for the two images and combine them at a later stage. Correlation layer: convolution of feature patches from the two streams to be combined.
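A minimal PyTorch sketch of such a correlation layer, following the idea rather than the paper's exact CUDA implementation (feature shapes and the displacement range are assumptions):

```python
import torch
import torch.nn.functional as F

def correlation(f1, f2, max_disp=4):
    """f1, f2: feature maps of shape (B, C, H, W) from the two streams.
    Compares each location of f1 against f2 over all shifts up to
    max_disp, returning (B, (2*max_disp+1)^2, H, W)."""
    B, C, H, W = f1.shape
    f2 = F.pad(f2, [max_disp] * 4)             # pad so shifts stay in bounds
    out = []
    for dy in range(2 * max_disp + 1):
        for dx in range(2 * max_disp + 1):
            shifted = f2[:, :, dy:dy + H, dx:dx + W]
            out.append((f1 * shifted).mean(dim=1, keepdim=True))
    return torch.cat(out, dim=1)
```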
  • 60. Optical Flow: FlowNet (expanding) Dosovitskiy, A., Fischer, P., Ilg, E., Hausser, P., Hazirbas, C., Golkov, V., van der Smagt, P., Cremers, D. and Brox, T., 2015. FlowNet: Learning Optical Flow With Convolutional Networks. In Proceedings of the IEEE International Conference on Computer Vision (pp. 2758-2766). 60 Upconvolutional layers: unpooling feature maps + convolution. Upconvolved feature maps are concatenated with the corresponding maps from the contractive part.
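One refinement step of the expanding part could look like the following hedged PyTorch sketch (layer sizes are illustrative; the paper also concatenates an upsampled coarse flow prediction, omitted here):

```python
import torch
import torch.nn as nn

class RefineBlock(nn.Module):
    """Upconvolve coarse features, concatenate the matching skip
    features from the contracting part, and predict a 2-channel flow."""
    def __init__(self, in_ch, skip_ch, out_ch):
        super().__init__()
        self.upconv = nn.ConvTranspose2d(in_ch, out_ch, 4, stride=2, padding=1)
        self.flow = nn.Conv2d(out_ch + skip_ch, 2, 3, padding=1)

    def forward(self, coarse, skip):
        up = torch.relu(self.upconv(coarse))   # doubles spatial resolution
        merged = torch.cat([up, skip], dim=1)  # skip-connection concat
        return merged, self.flow(merged)       # features + flow estimate
```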
  • 61. Optical Flow: FlowNet Dosovitskiy, A., Fischer, P., Ilg, E., Hausser, P., Hazirbas, C., Golkov, V., van der Smagt, P., Cremers, D. and Brox, T., 2015. FlowNet: Learning Optical Flow With Convolutional Networks. ICCV 2015 61 Since existing ground truth datasets are not sufficiently large to train a Convnet, a synthetic Flying Chairs dataset is generated… and augmented (translation, rotation and scaling transformations; additive Gaussian noise; changes in brightness, contrast, gamma and color). Convnets trained on this unrealistic data generalize well to existing datasets such as Sintel and KITTI. Data augmentation
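A toy version of the photometric part of this augmentation (parameters are made up; the geometric transforms, which must also warp the ground-truth flow field, are omitted):

```python
import numpy as np

def augment(img1, img2, rng=np.random):
    """img1, img2: float arrays in [0, 1]. Applies the same random gain
    (brightness/contrast) to both frames plus independent Gaussian noise."""
    gain = rng.uniform(0.8, 1.2)
    noisy = lambda x: x + rng.normal(0.0, 0.02, x.shape)
    return (np.clip(noisy(img1 * gain), 0, 1),
            np.clip(noisy(img2 * gain), 0, 1))
```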
  • 62. Dosovitskiy, A., Fischer, P., Ilg, E., Hausser, P., Hazirbas, C., Golkov, V., van der Smagt, P., Cremers, D. and Brox, T., 2015. FlowNet: Learning Optical Flow With Convolutional Networks. In Proceedings of the IEEE International Conference on Computer Vision (pp. 2758-2766). 62 Optical Flow: FlowNet
  • 63. Outline 1. Recognition 2. Optical Flow 3. Object Tracking 4. Audio and Video 5. Generative models 63
  • 64. Object tracking: MDNet 64 Nam, Hyeonseob, and Bohyung Han. "Learning multi-domain convolutional neural networks for visual tracking." ICCV VOT Workshop (2015)
  • 65. Object tracking: MDNet 65 Nam, Hyeonseob, and Bohyung Han. "Learning multi-domain convolutional neural networks for visual tracking." ICCV VOT Workshop (2015)
  • 66. Object tracking: MDNet: Architecture 66 Nam, Hyeonseob, and Bohyung Han. "Learning multi-domain convolutional neural networks for visual tracking." ICCV VOT Workshop (2015) Domain-specific layers are used during training for each sequence, but are replaced by a single one at test time.
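A minimal sketch of the multi-domain idea, with fixed-size input features and illustrative layer widths (the actual MDNet uses three conv layers plus two shared fc layers before the per-domain branches):

```python
import torch.nn as nn

class MDNetSketch(nn.Module):
    """Shared layers + one small target-vs-background branch per
    training sequence (domain); at test time the branches are
    discarded and a single fresh branch is trained online."""
    def __init__(self, num_domains, feat_dim=512):
        super().__init__()
        self.shared = nn.Sequential(nn.Linear(feat_dim, 512), nn.ReLU(),
                                    nn.Linear(512, 512), nn.ReLU())
        self.branches = nn.ModuleList(
            [nn.Linear(512, 2) for _ in range(num_domains)])

    def forward(self, x, domain):
        return self.branches[domain](self.shared(x))
```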
  • 67. Object tracking: MDNet: Online update 67 Nam, Hyeonseob, and Bohyung Han. "Learning multi-domain convolutional neural networks for visual tracking." ICCV VOT Workshop (2015) MDNet is updated online at test time with hard negative mining, that is, by selecting the negative samples with the highest positive scores.
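A hedged sketch of that selection step, where `score_fn` stands in for the network's binary classification head (all names here are assumptions):

```python
import torch

def hard_negatives(score_fn, candidates, k=32):
    """candidates: tensor of negative candidate features. Returns the k
    negatives the current model is most confidently wrong about, i.e.
    those with the highest score for class 1 (the target)."""
    with torch.no_grad():
        pos_scores = score_fn(candidates)[:, 1]
    idx = torch.topk(pos_scores, k).indices
    return candidates[idx]
```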
  • 68. Object tracking: FCNT 68 Wang, Lijun, Wanli Ouyang, Xiaogang Wang, and Huchuan Lu. "Visual Tracking with Fully Convolutional Networks." ICCV 2015 [code]
  • 69. Object tracking: FCNT 69 Wang, Lijun, Wanli Ouyang, Xiaogang Wang, and Huchuan Lu. "Visual Tracking with Fully Convolutional Networks." In Proceedings of the IEEE International Conference on Computer Vision, pp. 3119-3127. 2015 [code] Focus on conv4-3 and conv5-3 of the VGG-16 network pre-trained for ImageNet image classification. conv4-3 conv5-3
  • 70. Object tracking: FCNT: Specialization 70 Wang, Lijun, Wanli Ouyang, Xiaogang Wang, and Huchuan Lu. "Visual Tracking with Fully Convolutional Networks." In Proceedings of the IEEE International Conference on Computer Vision, pp. 3119-3127. 2015 [code] Most feature maps in VGG-16 conv4-3 and conv5-3 are not related to the foreground regions in a tracking sequence.
  • 71. Object tracking: FCNT: Localization 71 Wang, Lijun, Wanli Ouyang, Xiaogang Wang, and Huchuan Lu. "Visual Tracking with Fully Convolutional Networks." In Proceedings of the IEEE International Conference on Computer Vision, pp. 3119-3127. 2015 [code] Although trained for image classification, feature maps in conv5-3 enable object localization… but they are not discriminative enough to distinguish between objects of the same category.
  • 72. Object tracking: Localization 72 Zhou, Bolei, Aditya Khosla, Agata Lapedriza, Aude Oliva, and Antonio Torralba. "Object detectors emerge in deep scene cnns." ICLR 2015. Other works have shown how feature maps in convolutional layers allow object localization.
  • 73. Object tracking: FCNT: Localization 73 Wang, Lijun, Wanli Ouyang, Xiaogang Wang, and Huchuan Lu. "Visual Tracking with Fully Convolutional Networks." In Proceedings of the IEEE International Conference on Computer Vision, pp. 3119-3127. 2015 [code] On the other hand, feature maps from conv4-3 are more sensitive to intra-class appearance variation… conv4-3 conv5-3
  • 74. Object tracking: FCNT: Architecture 74 Wang, Lijun, Wanli Ouyang, Xiaogang Wang, and Huchuan Lu. "Visual Tracking with Fully Convolutional Networks." In Proceedings of the IEEE International Conference on Computer Vision, pp. 3119-3127. 2015 [code] SNet=Specific Network (online update) GNet=General Network (fixed)
  • 75. Object tracking: FCNT: Results 75 Wang, Lijun, Wanli Ouyang, Xiaogang Wang, and Huchuan Lu. "Visual Tracking with Fully Convolutional Networks." In Proceedings of the IEEE International Conference on Computer Vision, pp. 3119-3127. 2015 [code]
  • 76. Outline 1. Recognition 2. Optical Flow 3. Object Tracking 4. Audio and Video 5. Generative models 76
  • 78. 78 Audio and Video: Soundnet Aytar, Yusuf, Carl Vondrick, and Antonio Torralba. "Soundnet: Learning sound representations from unlabeled video." In Advances in Neural Information Processing Systems, pp. 892-900. 2016. Object and scene recognition in videos by analyzing only the audio track.
  • 79. 79 Aytar, Yusuf, Carl Vondrick, and Antonio Torralba. "Soundnet: Learning sound representations from unlabeled video." NIPS 2016. Videos for training are unlabeled. Relies on CNNs trained on labeled images. Audio and Video: Soundnet
  • 80. 80 Aytar, Yusuf, Carl Vondrick, and Antonio Torralba. "Soundnet: Learning sound representations from unlabeled video." NIPS 2016. Videos for training are unlabeled. Relies on CNNs trained on labeled images. Audio and Video: Soundnet
  • 81. 81 Aytar, Yusuf, Carl Vondrick, and Antonio Torralba. "Soundnet: Learning sound representations from unlabeled video." In Advances in Neural Information Processing Systems, pp. 892-900. 2016. Audio and Video: Soundnet
  • 82. 82 Aytar, Yusuf, Carl Vondrick, and Antonio Torralba. "Soundnet: Learning sound representations from unlabeled video." NIPS 2016. Hidden layers of Soundnet are used to train a standard SVM classifier that outperforms the state of the art. Audio and Video: Soundnet
  • 83. 83 Aytar, Yusuf, Carl Vondrick, and Antonio Torralba. "Soundnet: Learning sound representations from unlabeled video." NIPS 2016. Visualization of the 1D filters over raw audio in conv1. Audio and Video: Soundnet
  • 84. 84 Aytar, Yusuf, Carl Vondrick, and Antonio Torralba. "Soundnet: Learning sound representations from unlabeled video." NIPS 2016. Visualization of the 1D filters over raw audio in conv1. Audio and Video: Soundnet
  • 85. 85 Aytar, Yusuf, Carl Vondrick, and Antonio Torralba. "Soundnet: Learning sound representations from unlabeled video." NIPS 2016. Visualization of the 1D filters over raw audio in conv1. Audio and Video: Soundnet
  • 86. 86 Aytar, Yusuf, Carl Vondrick, and Antonio Torralba. "Soundnet: Learning sound representations from unlabeled video." In Advances in Neural Information Processing Systems, pp. 892-900. 2016. Visualization of the video frames associated to the sounds that activate some of the last hidden units (conv7): Audio and Video: Soundnet
  • 87. 87 Audio and Video: Sonorization Owens, Andrew, Phillip Isola, Josh McDermott, Antonio Torralba, Edward H. Adelson, and William T. Freeman. "Visually indicated sounds." CVPR 2016. Learn to synthesize sounds from videos of people hitting objects with a drumstick.
  • 88. 88 Audio and Video: Visual Sounds Owens, Andrew, Phillip Isola, Josh McDermott, Antonio Torralba, Edward H. Adelson, and William T. Freeman. "Visually indicated sounds." CVPR 2016. Not end-to-end
  • 89. 89 Audio and Video: Visual Sounds Owens, Andrew, Phillip Isola, Josh McDermott, Antonio Torralba, Edward H. Adelson, and William T. Freeman. "Visually indicated sounds." CVPR 2016.
  • 90. 90 Learn more Ruslan Salakhutdinov, “Multimodal Machine Learning” (NIPS 2015 Workshop)
  • 91. Generative models for Video 91 Slides D2L5 by Santi Pascual.
  • 92. 92 What are Generative Models? We want our model, with parameters θ = {weights, biases} and outputs distributed as Pmodel, to estimate the distribution of our training data, Pdata. Example: y = f(x), where y is a scalar; we make Pmodel similar to Pdata by training the parameters θ to maximize their similarity.
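A standard way to make "maximize their similarity" precise is maximum likelihood, which (as a general identity, not specific to these slides) is equivalent to minimizing the KL divergence between the data and model distributions:

```latex
\theta^{*}
  = \arg\max_{\theta}\; \mathbb{E}_{x \sim P_{\text{data}}}\big[\log P_{\text{model}}(x;\theta)\big]
  = \arg\min_{\theta}\; D_{\mathrm{KL}}\big(P_{\text{data}} \,\big\|\, P_{\text{model}}(\cdot\,;\theta)\big)
```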
  • 93. Key Idea: our model cares about what distribution generated the input data points, and we want to mimic it with our probabilistic model. Our learned model should be able to make up new samples from the distribution, not just copy and paste existing samples! 93 What are Generative Models? Figure from NIPS 2016 Tutorial: Generative Adversarial Networks (I. Goodfellow)
  • 94. 94 Video Frame Prediction Mathieu, Michael, Camille Couprie, and Yann LeCun. "Deep multi-scale video prediction beyond mean square error." ICLR 2016
  • 95. 95 Video Frame Prediction Mathieu, Michael, Camille Couprie, and Yann LeCun. "Deep multi-scale video prediction beyond mean square error." ICLR 2016
  • 96. 96 Video Frame Prediction Mathieu, Michael, Camille Couprie, and Yann LeCun. "Deep multi-scale video prediction beyond mean square error." ICLR 2016
  • 97. Adversarial Training analogy Imagine we have a counterfeiter (G) trying to make fake money, and the police (D) has to detect whether money is real or fake. 100 100 It’s not even green
  • 98. Adversarial Training analogy Imagine we have a counterfeiter (G) trying to make fake money, and the police (D) has to detect whether money is real or fake. 100 100 There is no watermark
  • 99. Adversarial Training analogy Imagine we have a counterfeiter (G) trying to make fake money, and the police (D) has to detect whether money is real or fake. 100 100 Watermark should be rounded
  • 100. Adversarial Training analogy Imagine we have a counterfeiter (G) trying to make fake money, and the police (D) has to detect whether money is real or fake. ? After enough iterations, and if the counterfeiter is good enough (for the G network, this means it has enough parameters), the police should be confused.
  • 101. Adversarial Training (batch update) ● Pick a sample x from the training set ● Show x to D and update D's weights to output 1 (real)
  • 102. Adversarial Training (batch update) ● G maps sample z to ẍ ● Show ẍ to D and update D's weights to output 0 (fake)
  • 103. Adversarial Training (batch update) ● Freeze D weights ● Update G weights to make D output 1 (just G weights!) ● Unfreeze D weights and repeat
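The three slides above map directly onto a standard GAN batch update; a minimal PyTorch sketch, assuming G and D are given modules and D ends in a sigmoid producing shape (batch, 1):

```python
import torch
import torch.nn.functional as F

def gan_step(G, D, x_real, opt_D, opt_G, z_dim=100):
    b = x_real.size(0)
    ones, zeros = torch.ones(b, 1), torch.zeros(b, 1)

    # (1)+(2) Discriminator update: real -> 1, fake -> 0
    x_fake = G(torch.randn(b, z_dim)).detach()   # no gradient into G here
    loss_D = (F.binary_cross_entropy(D(x_real), ones) +
              F.binary_cross_entropy(D(x_fake), zeros))
    opt_D.zero_grad()
    loss_D.backward()
    opt_D.step()

    # (3) Generator update: D is effectively frozen, since only opt_G
    # steps on G's parameters; push D(G(z)) towards 1
    loss_G = F.binary_cross_entropy(D(G(torch.randn(b, z_dim))), ones)
    opt_G.zero_grad()
    loss_G.backward()
    opt_G.step()
    return loss_D.item(), loss_G.item()
```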
  • 104. 104 Generative Adversarial Networks (GANs) Slide credit: Víctor Garcia Discriminator D(·) Generator G(·) Real World Random seed (z) Real/Synthetic
  • 105. 105Slide credit: Víctor Garcia Conditional Adversarial Networks Real World Real/Synthetic Condition Discriminator D(·) Generator G(·) Generative Adversarial Networks (GANs)
  • 106. Generating images/frames (Radford et al. 2015) Deep Convolutional GAN (DCGAN) effectively generated 64x64 RGB images in a single shot; for example, bedrooms from the LSUN dataset.
  • 107. Generating images/frames conditioned on captions (Reed et al. 2016b) (Zhang et al. 2016)
  • 108. Unsupervised feature extraction/learning representations Similarly to word2vec, GANs learn a distributed representation that disentangles concepts such that we can perform operations on the data manifold: v(Man with glasses) - v(man) + v(woman) = v(woman with glasses) (Radford et al. 2015)
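As a toy illustration of this arithmetic (the concept codes would really be averages of z vectors for samples showing each attribute, and G a trained DCGAN generator; everything below is a placeholder):

```python
import torch

# placeholders standing in for averaged latent codes of each concept
z_man_glasses, z_man, z_woman = (torch.randn(100) for _ in range(3))

z = z_man_glasses - z_man + z_woman   # "woman with glasses" code
# image = G(z.unsqueeze(0))           # decode with the trained generator G
```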
  • 109. Image super-resolution Bicubic: not using data statistics. SRResNet: trained with MSE. SRGAN is able to understand that there are multiple correct answers, rather than averaging. (Ledig et al. 2016)
  • 110. Image super-resolution Averaging is a serious problem we face when dealing with complex distributions. (Ledig et al. 2016)
  • 111. Manipulating images and assisted content creation https://youtu.be/9c4z6YsBGQ0?t=126 https://youtu.be/9c4z6YsBGQ0?t=161 (Zhu et al. 2016)
  • 112. 112 Adversarial Networks Slide credit: Víctor Garcia Isola, Phillip, Jun-Yan Zhu, Tinghui Zhou, and Alexei A. Efros. "Image-to-image translation with conditional adversarial networks." arXiv preprint arXiv:1611.07004 (2016). Generator Discriminator Generated Pairs Real World Ground Truth Pairs Loss → BCE
  • 113. 113 Víctor Garcia and Xavier Giró-i-Nieto (work in progress) Generator Discriminator Loss2 GAN {Binary Crossentropy} 1/0 Generative Adversarial Networks (GANs)
  • 114. Generative models for video 114 Vondrick, Carl, Hamed Pirsiavash, and Antonio Torralba. "Generating videos with scene dynamics." NIPS 2016.
  • 115. Generative models for video 115 Vondrick, Carl, Hamed Pirsiavash, and Antonio Torralba. "Generating videos with scene dynamics." NIPS 2016.
  • 116. 116 Adversarial Networks Goodfellow, Ian, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. "Generative adversarial nets." NIPS 2014 Goodfellow, Ian. "NIPS 2016 Tutorial: Generative Adversarial Networks." arXiv preprint arXiv:1701.00160 (2016). F. Van Veen, “The Neural Network Zoo” (2016)
  • 117. Outline 1. Recognition 2. Optical Flow 3. Object Tracking 4. Audio and Video 5. Generative models 117