Human Behavior Understanding:
From Human-Oriented Analysis to Action Recognition
Ting Yao
Principal Researcher, Vision and Multimedia Lab, JD AI Research
Tutorial @ ICME, July 8th, 2019
[Figure: example visual understanding outputs: the object labels "horse", "grass", and "person"; the caption "a boy is cleaning the floor"; and the comment "not just beautiful"]
Timeline of action recognition methods, 2011 to 2016:
• Hand-crafted feature: Action recognition by dense trajectories. [Wang et al. CVPR 2011]
• 2D convolutional network: Large-scale Video Classification with Convolutional Neural Networks. [Karpathy et al. CVPR 2014]; Two-Stream Convolutional Networks for Action Recognition in Videos. [Simonyan et al. NIPS 2014]
• 2D CNN + LSTM (LRCN): Long-term Recurrent Convolutional Networks for Visual Recognition and Description. [Donahue et al. CVPR 2015]
• 3D convolutional network (C3D): Learning Spatiotemporal Features with 3D Convolutional Networks. [Tran et al. ICCV 2015]
• Temporal segment networks (TSN): Temporal Segment Networks: Towards Good Practices for Deep Action Recognition. [Wang et al. ECCV 2016]
Action recognition pipeline:
Input Video → Preprocessing (Frame, Optical Flow) → Sample Strategy (Sparse Sample, Dense Sample) → Backbone Network (2D CNN, 3D CNN) → Feature Aggregation (Pooling, Quantization, Attention, RNN) → Stream Fusion
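The two sample strategies named in the pipeline can be sketched as index selection over a video's frames. This is a minimal illustration, not the tutorial's code: the function names are hypothetical, and picking the centre frame of each segment is a deterministic simplification chosen here.

```python
def dense_sample(num_frames, clip_len, start=0):
    """Return the indices of one clip of `clip_len` consecutive frames."""
    if clip_len > num_frames:
        raise ValueError("clip_len exceeds the number of frames")
    # Clamp the start so the clip never runs past the end of the video.
    start = min(start, num_frames - clip_len)
    return list(range(start, start + clip_len))


def sparse_sample(num_frames, num_segments):
    """Return one frame index from the centre of each equal-length segment."""
    if num_segments > num_frames:
        raise ValueError("more segments than frames")
    seg_len = num_frames / num_segments
    return [int(seg_len * i + seg_len / 2) for i in range(num_segments)]


dense_idx = dense_sample(300, 16, start=10)   # 16 adjacent frames
sparse_idx = sparse_sample(300, 3)            # 3 frames spread over the video
```

Dense sampling keeps short-range motion inside one clip, while sparse sampling covers the whole video with few frames.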
Backbone Network
State of the Art
Image Domain | Video Domain
VGG [Simonyan et al. ICLR 2015] | C3D [Tran et al. ICCV 2015]
Inception [Szegedy et al. CVPR 2015] | I3D [Carreira et al. CVPR 2017]
ResNet [He et al. CVPR 2016] | P3D [Qiu et al. ICCV 2017]
2D Convolution vs. 3D Convolution
• 2D ResNet (ResNet-152): time cost 9 × C² × H × W; model size 230MB
• 3D ResNet (3D ResNet-152): time cost 27 × C² × T × H × W; model size 690MB
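The time-cost expressions above can be checked with a short calculation. The sketch below counts multiply-accumulates for one 3x3 and one 3x3x3 convolution layer with C input and C output channels. The layer sizes (C = 64, T = 16, H = W = 56) are illustrative assumptions, not values from the slides.

```python
from math import prod


def conv_cost(kernel, channels, out_positions):
    """Multiply-accumulates of one conv layer with equal in/out channels.

    cost = (kernel volume) * C^2 * (number of output positions)
    """
    return prod(kernel) * channels ** 2 * prod(out_positions)


C, T, H, W = 64, 16, 56, 56                     # assumed sizes
cost_2d = conv_cost((3, 3), C, (H, W))          # 9 * C^2 * H * W
cost_3d = conv_cost((3, 3, 3), C, (T, H, W))    # 27 * C^2 * T * H * W
ratio = cost_3d // cost_2d                      # 3 * T
```

Per frame, the 3D kernel costs 3 times the 2D kernel (27 vs. 9 weights per channel pair), the same factor as the 690MB vs. 230MB model sizes quoted above.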
[Figure: panels labeled "Spatial 2D"]
Bottleneck Architecture (each unit uses ReLU activations and a residual shortcut, shown as +):
(a) Residual Unit: 1x1 conv → 3x3 conv → 1x1 conv
(b) P3D-A: 1x1x1 conv → 1x3x3 conv → 3x1x1 conv → 1x1x1 conv (spatial and temporal convolutions in series)
(c) P3D-B: 1x1x1 conv → 1x3x3 conv and 3x1x1 conv in parallel, summed (+) → 1x1x1 conv
(d) P3D-C: 1x1x1 conv → 1x3x3 conv → 3x1x1 conv, with the 1x3x3 output also added (+) to the 3x1x1 output → 1x1x1 conv
Very deep 3D CNN, but still lighter in weights than C3D
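The three P3D wirings can be sketched as compositions of a spatial operator S (the 1x3x3 conv) and a temporal operator T (the 3x1x1 conv). This is a toy illustration under stated assumptions: S and T are stand-in scalar functions, the 1x1x1 convolutions and ReLUs are omitted, and the function names are hypothetical. The weight count compares a full 3x3x3 kernel with the factorized pair for an assumed C = 64 channels.

```python
from math import prod


def p3d_a(x, S, T):
    """Serial: temporal conv applied to the spatial conv output."""
    return x + T(S(x))


def p3d_b(x, S, T):
    """Parallel: spatial and temporal outputs are summed."""
    return x + S(x) + T(x)


def p3d_c(x, S, T):
    """Serial, plus a direct path from the spatial output to the sum."""
    return x + S(x) + T(S(x))


def kernel_weights(kernel, channels):
    """Weights of one conv kernel with equal in/out channels (no bias)."""
    return prod(kernel) * channels ** 2


S = lambda v: 2 * v      # stand-in for the 1x3x3 spatial conv
T = lambda v: v + 1      # stand-in for the 3x1x1 temporal conv
outputs = (p3d_a(3, S, T), p3d_b(3, S, T), p3d_c(3, S, T))

C = 64                   # assumed channel count
full_3d = kernel_weights((3, 3, 3), C)                                   # 27 * C^2
pseudo_3d = kernel_weights((1, 3, 3), C) + kernel_weights((3, 1, 1), C)  # 12 * C^2
```

Under these assumptions the factorized pair needs 12 × C² weights instead of 27 × C², which is one way to read the "lighter weights" remark.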
• R(2+1)D > MCx > rMCx > R3D > R2D
• ResNeXt-101 > Wide ResNet-50 > ResNet-200 > ResNet-152 > ResNet-101 > ResNet-50 > DenseNet-201 > DenseNet-121
• Incorporate long-range (global) context into representation learning
• Model the diffusion between local and global features
Feature Aggregation
[Figure: Frames (Nf) → 2D CNN / 3D CNN → Convolutional Activations → Spatial Pyramid Pooling → Local Feature Set → FV-VAE → Gradient Vector → Visual Representation, with a Loss Function on the Video Label; Training Epoch and Extraction Epoch paths]
[Figure: CNN → AttCell → X → LSTM, repeated across time steps]
• Global Average Pooling
• Attention
  • Visual Attention [Sharma et al. ICLR workshop 2015]
  • Recurrent Attention [Du et al. TIP 2018]
  • Unified Attention [Li et al. TMM 2018]
• RNN
  • LRCN [Donahue et al. CVPR 2015]
  • Hybrid Framework (LSTM) [Wu et al. ACM MM 2015]
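Two of the aggregation options listed above, global average pooling and attention, can be contrasted on a small set of per-frame feature vectors. The sketch below is a generic softmax-weighted pooling, not the specific attention model of any cited paper. The `query` vector used to score frames is a hypothetical stand-in for learned parameters.

```python
import math


def average_pool(features):
    """Mean of the per-frame feature vectors, one value per dimension."""
    n = len(features)
    return [sum(col) / n for col in zip(*features)]


def attention_pool(features, query):
    """Softmax-weighted sum of frames; weights come from dot(frame, query)."""
    scores = [sum(f * q for f, q in zip(frame, query)) for frame in features]
    top = max(scores)                       # subtract the max for stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    dims = len(features[0])
    pooled = [sum(w * frame[d] for w, frame in zip(weights, features))
              for d in range(dims)]
    return pooled, weights


frames = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]    # toy per-frame features
avg = average_pool(frames)
uniform, uniform_w = attention_pool(frames, [0.0, 0.0])
focused, focused_w = attention_pool(frames, [10.0, 0.0])
```

With an all-zero query every frame gets the same weight, so attention pooling reduces to average pooling. A non-zero query shifts weight toward the frames it scores highly.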
• Global Activations
  • Fully connected layer
  • Global pooling layer
• Fisher Vector with Variational Auto-Encoder (FV-VAE)
• Fisher Vector (FV)
[Figure: Convolutional Activations → Global Activations; Convolutional Activations → FV Encoding and FV-VAE Encoding. Labels: Normalization term; Generative model (GMM for FV, VAE for FV-VAE)]
[Figure: Training: Encoder → Sampling → Decoder, with Reconstruct Loss, Regularization Loss, and Classification Loss. Extraction: Encoder → Identity → Decoder, with Reconstruct Loss, Back Propagation, and a Gradient Vector Accumulator]
• Assumption of FV
  • Data is generated from a Gaussian Mixture Model, which may not hold in practice
• VAE
  • Encoder ($q_\phi(\mathbf{z} \mid \mathbf{x})$): learn new representations $\mathbf{z}$ for the given input $\mathbf{x}$
  • Decoder ($p_\theta(\mathbf{x} \mid \mathbf{z})$): generate FV of new representations $\mathbf{z}$
FV: $\mathcal{G}_\theta^X = F_\theta^{-1/2} \nabla_\theta \log u_\theta(X) = -F_\theta^{-1/2} \sum_{t=1}^{T_x} \nabla_\theta \mathcal{L}_{rec}(\mathbf{x}_t; \theta, \phi)$
Reconstruct loss: $\mathcal{L}_{rec} = -\log u_\theta(\mathbf{x}_t) = -\log p_\theta(\mathbf{x}_t \mid \mathbf{z}_t)$
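The accumulation step in the FV formula, the sum of reconstruction-loss gradients over the T_x local features, can be illustrated with a toy decoder. The sketch makes strong simplifying assumptions that are not from the slides: the decoder is p(x | z) = N(x; theta · z, I) with a scalar code z, the codes are given rather than produced by an encoder, and the Fisher normalization F^{-1/2} is left out. Only the unnormalized vector −Σ_t ∇_θ L_rec is computed.

```python
def rec_loss_grad(x, z, theta):
    """Gradient w.r.t. theta of L_rec = 0.5 * ||x - theta * z||^2.

    For the assumed toy decoder p(x | z) = N(x; theta * z, I), this loss
    equals -log p(x | z) up to an additive constant.
    """
    return [-(x_d - th_d * z) * z for x_d, th_d in zip(x, theta)]


def gradient_vector(xs, zs, theta):
    """Accumulate -sum_t grad_theta L_rec(x_t) over all local features."""
    acc = [0.0] * len(theta)
    for x, z in zip(xs, zs):
        grad = rec_loss_grad(x, z, theta)
        acc = [a - g for a, g in zip(acc, grad)]
    return acc


xs = [[1.0, 2.0], [3.0, 4.0]]   # two toy local features (T_x = 2)
zs = [1.0, 2.0]                 # their assumed latent codes
theta = [0.5, 0.5]              # toy decoder parameters
g = gradient_vector(xs, zs, theta)
```

The resulting vector has one entry per decoder parameter, so its length depends on the model rather than on the number of local features.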
[Figure: CNN → Convolutional Feature → Spatial Pyramid Pooling → Region Feature Set → FV-VAE → Gradient Vector → Video Representation, with a Loss Function on the example label "Ice Dancing"; Training Epoch and Extraction Epoch paths]
• FV-VAE based action recognition framework
  • CNN as convolutional feature extractor
  • Encoding SPP output using FV-VAE
Stream Fusion
Different actions may span different granularities!
• Example actions: Playing Guitar (human, guitar), Jumping Jack, Cliff diving, Basketball dunk
• Granularities: Single Frame, Consecutive Frames, Clip (multiple adjacent frames), Whole video
• Multi-granular spatio-temporal architecture for video action recognition
• Hierarchical modeling (4 granularities)
• Fusion based on multi-granular score distribution
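Fusion based on the score distribution (example class: Surfing):
• Softmax scores from the Single Frame, Consecutive Frames, Clips, and Video granularities: 0.4, …, 0.2, …, 0.7, …, 0.9
• Sort: 0.9, …, 0.7, …, 0.4, …, 0.2
• Improved Surfing score: 0.8
Choice of weights w over the sorted scores:
• w = [1, 0, …, 0]: max-pooling
• w = [1, 1, …, 1]: ave-pooling
• optimized w: distribution-based classifier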
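The relation between the weight vector w and the pooling variants can be sketched as a weighted sum over the sorted scores. This is an illustration under stated assumptions: the four score values are the ones visible on the slide (the elided ones are ignored), dividing by the sum of w is a normalization chosen here so that an all-ones w gives the average, and the function name is hypothetical. The optimized w from the slide is not reproduced because its values are not given.

```python
def fuse_scores(scores, w):
    """Weighted sum of the scores sorted in descending order, divided by sum(w)."""
    ranked = sorted(scores, reverse=True)
    if len(w) != len(ranked):
        raise ValueError("w must have one weight per score")
    return sum(w_i * s_i for w_i, s_i in zip(w, ranked)) / sum(w)


surfing = [0.4, 0.2, 0.7, 0.9]                  # visible per-granularity scores
max_pooled = fuse_scores(surfing, [1, 0, 0, 0])  # keeps only the top score
ave_pooled = fuse_scores(surfing, [1, 1, 1, 1])  # treats all scores equally
```

A learned w sits between these two extremes: it can trust the top-ranked scores more without discarding the rest of the distribution.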
Method | UCF101 | HMDB51
Improved dense trajectories (IDT) [Wang et al. ICCV 2013] | 85.9% | 57.2%
Higher dimensional IDT [Peng et al. CVIU 2016] | 87.9% | 61.1%
2D CNN Slow Fusion [Karpathy et al. CVPR 2014] | 65.4% | --
Two-stream ConvNet [Simonyan et al. NIPS 2014] | 88.0% | 59.4%
Factorized ST-ConvNet [Sun et al. ICCV 2015] | 88.1% | 59.1%
Two-stream + LSTM [Ng et al. CVPR 2015] | 88.6% | --
Two-stream Conv fusion [Feichtenhofer et al. CVPR 2016] | 92.5% | 67.3%
Two-stream ST Residual Networks [Feichtenhofer et al. NIPS 2016] | 93.4% | 66.4%
Temporal Segment Networks [Wang et al. ECCV 2016] | 94.0% | 68.5%
C3D [Tran et al. ICCV 2015] | 82.3% | 56.8%
P3D ResNet [Qiu et al. ICCV 2017] | 89.8% | 58.6%
Two-stream P3D ResNet [Qiu et al. ICCV 2017] | 94.5% | 71.8%
I3D [Carreira et al. CVPR 2017] | 93.4% | 66.4%
I3D + Kinetics pre-train [Carreira et al. CVPR 2017] | 97.9% | 80.2%
LGD-3D + Kinetics pre-train [Qiu et al. CVPR 2019] | 98.2% | 80.5%
[Figure: Feature Extractor; predictions "Pole vault" 0.61, "Pole vault" 0.83, "Pole vault" 0.51]
[Figure: 3D CNN; Gaussian Kernel; prediction "Pole vault" 0.96]
Thanks!
tingyao.ustc@gmail.com
