Human Behavior Understanding: From Human-Oriented Analysis to Action Recognition I

Human Behavior Understanding:
From Human-Oriented Analysis to Action Recognition
Ting Yao
Principal Researcher, Vision and Multimedia Lab, JD AI Research
Tutorial @ ICME, July 8th, 2019

[Figure: examples of visual understanding: object labels (horse, grass, person), a caption (“a boy is cleaning the floor”), and a comment (“not just beautiful”).]

Timeline of action recognition methods, 2011 to 2016:
• Hand-crafted feature: Action recognition by dense trajectories. [Wang et al. CVPR 2011]
• 2D convolutional network: Large-scale Video Classification with Convolutional Neural Networks. [Karpathy et al. CVPR 2014]; Two-Stream Convolutional Networks for Action Recognition in Videos. [Simonyan et al. NIPS 2014]
• 2D CNN + LSTM (LRCN): Long-term Recurrent Convolutional Networks for Visual Recognition and Description. [Donahue et al. CVPR 2015]
• 3D convolutional network (C3D): Learning Spatiotemporal Features with 3D Convolutional Networks. [Tran et al. ICCV 2015]
• Temporal segment networks (TSN): Temporal Segment Networks: Towards Good Practices for Deep Action Recognition. [Wang et al. ECCV 2016]

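Action recognition pipeline:
Input Video → Preprocessing → Sample Strategy → Backbone Network → Feature Aggregation → Stream Fusion
• Preprocessing: Frame, Optical Flow
• Sample Strategy: Sparse Sample, Dense Sample
• Backbone Network: 2D CNN, 3D CNN
• Feature Aggregation: Pooling, Quantization, Attention, RNN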
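The sparse and dense sampling strategies named in the pipeline can be illustrated with a minimal sketch. The function names, the choice of the centre frame of each segment, and the clamping at the last frame are assumptions made for illustration; they are not taken from the slides.

```python
def sparse_sample(num_frames, num_segments):
    """Sparse sampling: split the video into equal segments, take one frame per segment.

    Assumption: the centre frame of each segment is used so the result is deterministic.
    """
    seg_len = num_frames / num_segments
    return [int(seg_len * i + seg_len / 2) for i in range(num_segments)]


def dense_sample(num_frames, clip_len, start=0, stride=1):
    """Dense sampling: take `clip_len` frames starting at `start`, `stride` frames apart.

    Indices are clamped to the last frame of the video.
    """
    return [min(start + i * stride, num_frames - 1) for i in range(clip_len)]


print(sparse_sample(90, 3))           # [15, 45, 75]
print(dense_sample(90, 4, start=10))  # [10, 11, 12, 13]
```

Sparse sampling covers the whole video with few frames, which suits 2D CNN backbones; dense sampling yields a contiguous clip, which is what a 3D CNN typically consumes.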

Backbone Network

State of the Art
Image domain → Video domain
• VGG [Simonyan et al. ICLR 2015] → C3D [Tran et al. ICCV 2015]
• Inception [Szegedy et al. CVPR 2015] → I3D [Carreira et al. CVPR 2017]
• ResNet [He et al. CVPR 2016] → P3D [Qiu et al. ICCV 2017]

Convolution: 2D vs. 3D
• 2D convolution (2D ResNet), ResNet-152:
  • Time cost: 9 × C² × H × W
  • Model size: 230 MB
• 3D convolution (3D ResNet), 3D ResNet-152:
  • Time cost: 27 × C² × T × H × W
  • Model size: 690 MB
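The cost figures above follow from counting multiply-accumulate operations for a 3x3 versus a 3x3x3 kernel with C input and C output channels. A small sketch, with hypothetical function names, that reproduces the factors 9 and 27:

```python
def conv2d_cost(c, h, w, k=3):
    """Multiply-accumulates of a k x k convolution, C -> C channels, H x W output."""
    return k * k * c * c * h * w


def conv3d_cost(c, t, h, w, k=3):
    """Multiply-accumulates of a k x k x k convolution, C -> C channels, T x H x W output."""
    return k ** 3 * c * c * t * h * w


# Per-frame cost ratio of 3D to 2D convolution: 27 / 9 = 3.
ratio = conv3d_cost(64, 16, 56, 56) / (16 * conv2d_cost(64, 56, 56))
print(ratio)  # 3.0
```

The same factor of three separates the two model sizes quoted above (230 MB versus 690 MB).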

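Bottleneck Architecture:
(a) Residual Unit: 1x1 conv → 3x3 conv → 1x1 conv, added (+) to the shortcut
(b) P3D-A: 1x1x1 conv → 1x3x3 conv → 3x1x1 conv → 1x1x1 conv, added (+) to the shortcut (spatial and temporal convolutions in series)
(c) P3D-B: 1x1x1 conv → 1x3x3 conv and 3x1x1 conv in parallel, summed (+) → 1x1x1 conv, added (+) to the shortcut
(d) P3D-C: 1x1x1 conv → 1x3x3 conv → 3x1x1 conv, with the 1x3x3 output also added (+) to the 3x1x1 output → 1x1x1 conv, added (+) to the shortcut
In all four units, ReLU activations follow the convolutions and the shortcut addition.
A very deep 3D CNN that is still a lighter model than C3D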
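The three P3D variants differ only in how the spatial (1x3x3) and temporal (3x1x1) convolutions are wired. A minimal sketch of that wiring, with the convolutions passed in as callables; the 1x1x1 bottleneck convolutions and the ReLUs are omitted, and the function names and toy stand-ins are assumptions for illustration:

```python
def p3d_a(x, spatial, temporal):
    """P3D-A, serial: the temporal conv is applied to the spatial conv's output."""
    return x + temporal(spatial(x))


def p3d_b(x, spatial, temporal):
    """P3D-B, parallel: both convs see the same input and their outputs are summed."""
    return x + spatial(x) + temporal(x)


def p3d_c(x, spatial, temporal):
    """P3D-C, serial with a shortcut: the spatial output is also added to the temporal output."""
    s = spatial(x)
    return x + s + temporal(s)


# Toy scalar stand-ins for the two convolutions (not real convolutions).
def toy_spatial(v):
    return 2 * v


def toy_temporal(v):
    return v + 1


print(p3d_a(3, toy_spatial, toy_temporal))  # 10
print(p3d_b(3, toy_spatial, toy_temporal))  # 13
print(p3d_c(3, toy_spatial, toy_temporal))  # 16
```

Replacing one 3x3x3 kernel (27 weights per channel pair) with a 1x3x3 plus a 3x1x1 kernel (9 + 3 = 12 weights) is what keeps the deep network comparatively light.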

• R(2+1)D > MCx > rMCx > R3D > R2D

• ResNeXt-101 > Wide ResNet-50 > ResNet-200 > ResNet-152 > ResNet-101 > ResNet-50 > DenseNet-201 > DenseNet-121

• Incorporate long-range (global) context into representation learning
• Model the diffusion between local and global features

Feature Aggregation

[Figure: Nf video frames pass through a 2D CNN / 3D CNN backbone to produce a local feature set, which is aggregated into a single visual representation.]

• Global Average Pooling
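Global average pooling is the simplest of the aggregation options: the per-frame (or per-clip) feature vectors are averaged dimension by dimension. A minimal sketch on plain Python lists; the function name and the toy features are assumptions for illustration:

```python
def global_average_pool(frame_features):
    """Average a list of equal-length feature vectors into one video-level vector."""
    n = len(frame_features)
    dim = len(frame_features[0])
    return [sum(feat[d] for feat in frame_features) / n for d in range(dim)]


# Two frames with 2-D features.
print(global_average_pool([[1.0, 2.0], [3.0, 4.0]]))  # [2.0, 3.0]
```

Every frame contributes equally here, which is the limitation that attention-based aggregation tries to address.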

[Figure: attention-based aggregation. Per-frame CNN features are weighted (×) by attention cells (AttCell) and passed to a sequence of LSTM units.]

• Attention
  • Visual Attention [Sharma et al. ICLR workshop 2015]
  • Recurrent Attention [Du et al. TIP 2018]
  • Unified Attention [Li et al. TMM 2018]

• RNN
  • LRCN [Donahue et al. CVPR 2015]
  • Hybrid Framework (LSTM) [Wu et al. ACM MM 2015]
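The general idea behind attention-based aggregation can be sketched as a softmax-weighted sum of frame features. The dot-product scoring against a `query` vector is an assumption made for illustration; the cited papers each define their own attention mechanism.

```python
import math


def attention_pool(frame_features, query):
    """Weight each frame feature by softmax(query . feature) and sum the results.

    Returns the pooled vector and the attention weights.
    """
    scores = [sum(q * f for q, f in zip(query, feat)) for feat in frame_features]
    top = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(frame_features[0])
    pooled = [
        sum(w * feat[d] for w, feat in zip(weights, frame_features))
        for d in range(dim)
    ]
    return pooled, weights


features = [[1.0, 0.0], [0.0, 1.0]]
pooled, weights = attention_pool(features, [0.0, 0.0])
print(pooled, weights)  # [0.5, 0.5] [0.5, 0.5]
```

With a zero query all frames score the same and the result reduces to average pooling; a query aligned with one frame's feature shifts almost all of the weight to that frame.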

• Global Activations
  • Fully connected layer
  • Global pooling layer
• Fisher Vector with Variational Auto-Encoder (FV-VAE)
• Fisher Vector (FV)
[Figure: three encodings of convolutional activations: global activations, FV encoding, and FV-VAE encoding. Labels: normalization term; generative model (GMM for FV, VAE for FV-VAE).]

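[Figure: FV-VAE training and extraction. Training: encoder → sampling → decoder, with reconstruction, regularization, and classification losses. Extraction: encoder → identity → decoder, with the reconstruction loss back-propagated into a gradient vector accumulator.]
• Assumption of FV
  • Data is generated from a Gaussian Mixture Model, which may not hold in practice
• VAE
  • Encoder ($q_\phi(\mathbf{z} \mid \mathbf{x})$): learns a new representation $\mathbf{z}$ for the given input $\mathbf{x}$
  • Decoder ($p_\theta(\mathbf{x} \mid \mathbf{z})$): generates the FV of the new representation $\mathbf{z}$
FV:
$$\mathcal{G}_\theta^X = F_\theta^{-1/2} \nabla_\theta \log u_\theta(X) = -F_\theta^{-1/2} \sum_{t=1}^{T_x} \nabla_\theta \mathcal{L}_{rec}(\mathbf{x}_t; \theta, \phi)$$
Reconstruction loss:
$$\mathcal{L}_{rec}(\mathbf{x}_t; \theta, \phi) = -\log u_\theta(\mathbf{x}_t) = -\log p_\theta(\mathbf{x}_t \mid \mathbf{z}_t)$$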
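The extraction step accumulates, over the T_x inputs, the negated gradient of the reconstruction loss with respect to the decoder parameters. A toy sketch of that accumulation, assuming a one-parameter Gaussian decoder with unit variance and mean theta * z, and assuming the Fisher normalization is the identity; neither assumption comes from the slides, and the function names are hypothetical:

```python
def reconstruction_grad(x, z, theta):
    """Gradient w.r.t. theta of L_rec = 0.5 * (x - theta * z) ** 2.

    This L_rec equals -log N(x; theta * z, 1) up to an additive constant.
    """
    return -(x - theta * z) * z


def gradient_vector(xs, zs, theta):
    """Accumulate -sum_t grad L_rec(x_t); the Fisher term is taken as identity (assumption)."""
    return -sum(reconstruction_grad(x, z, theta) for x, z in zip(xs, zs))


# Two inputs with latent codes z = 1 and decoder parameter theta = 0.5.
print(gradient_vector([1.0, 2.0], [1.0, 1.0], 0.5))  # 2.0
```

In the real model theta is the full set of decoder weights, so the accumulated gradient is a high-dimensional vector that serves as the video representation.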

[Figure: CNN → convolutional feature → spatial pyramid pooling → region feature set → FV-VAE → gradient vector → video representation. A loss function on the video label (example: Ice Dancing) is used in the training epoch; the gradient vector is computed in the extraction epoch.]
• FV-VAE based action recognition framework
  • CNN as convolutional feature extractor
  • Encoding SPP output using FV-VAE

Stream Fusion

[Figure: example actions: Playing Guitar (objects: human, guitar), Jumping Jack, Cliff Diving, Basketball Dunk.]
Granularities: single frame, consecutive frames, clip (multiple adjacent frames), whole video
Different actions may span different granularities!

• Multi-granular spatio-temporal architecture for video action recognition
• Hierarchical modeling (4 granularities)
• Fusion based on multi-granular score distribution

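[Figure: multi-granular score fusion for the class "Surfing". Softmax scores from the four granularities (single frame, consecutive frames, clips, video), e.g. 0.4, …, 0.2, …, 0.7, …, 0.9, are sorted in descending order (0.9, …, 0.7, …, 0.4, …, 0.2) and combined with a weight vector w into an improved Surfing score (0.8).]
• w = [1, 0, …, 0]: max-pooling
• w = [1, 1, …, 1]: average pooling
• Optimized w: distribution-based classifier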
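The fusion rule can be sketched as a weighted sum over the class scores sorted in descending order. Dividing by the sum of the weights is an assumption made here so that w = [1, 1, …, 1] yields the average; the function name is hypothetical and the learned weights in the last call are made-up numbers.

```python
import math


def fuse_scores(scores, weights):
    """Sort the scores in descending order and take a normalized weighted sum."""
    ranked = sorted(scores, reverse=True)
    if len(ranked) != len(weights):
        raise ValueError("scores and weights must have the same length")
    return sum(w * s for w, s in zip(weights, ranked)) / sum(weights)


surfing = [0.4, 0.2, 0.7, 0.9]
max_pooled = fuse_scores(surfing, [1, 0, 0, 0])  # picks the top score, 0.9
avg_pooled = fuse_scores(surfing, [1, 1, 1, 1])  # mean of the four scores
learned = fuse_scores(surfing, [0.5, 0.3, 0.15, 0.05])  # illustrative weights only
print(max_pooled, avg_pooled, learned)
```

Because the weights act on rank positions rather than on fixed granularities, max-pooling and average pooling fall out as special cases, and an optimized w interpolates between them.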


Method UCF101 HMDB51
Improved dense trajectories (IDT) [Wang et al. ICCV 2013] 85.9% 57.2%
Higher dimensional IDT [Peng et al. CVIU 2016] 87.9% 61.1%
2D CNN Slow Fusion [Karpathy et al. CVPR 2014] 65.4% --
Two-stream ConvNet [Simonyan et al. NIPS 2014] 88.0% 59.4%
Factorized ST-ConvNet [Sun et al. ICCV 2015] 88.1% 59.1%
Two-stream + LSTM [Yue-Hei Ng et al. CVPR 2015] 88.6% --
Two-stream Conv fusion [Feichtenhofer et al. CVPR 2016] 92.5% 67.3%
Two-stream ST Residual Networks [Feichtenhofer et al. NIPS 2016] 93.4% 66.4%
Temporal Segment Networks [Wang et al. ECCV 2016] 94.0% 68.5%
C3D [Tran et al. ICCV 2015] 82.3% 56.8%
P3D ResNet [Qiu et al. ICCV 2017] 89.8% 58.6%
Two-stream P3D ResNet [Qiu et al. ICCV 2017] 94.5% 71.8%
I3D [Carreira et al. CVPR 2017] 93.4% 66.4%
I3D + Kinetics pre-train [Carreira et al. CVPR 2017] 97.9% 80.2%
LGD-3D + Kinetics pre-train [Qiu et al. CVPR 2019] 98.2% 80.5%

[Figure: a feature extractor yields three "Pole vault" predictions with scores 0.61, 0.83, and 0.51.]

[Figure: a 3D CNN with a Gaussian kernel yields a "Pole vault" score of 0.96.]


Thanks!
tingyao.ustc@gmail.com
