A survey on deep learning based approaches for action and gesture recognition in image sequences

Natural Language Processing Lab
M2020064
Danbi Cho
Published in: 2017 IEEE 12th International Conference on Automatic Face & Gesture Recognition
URL: https://ieeexplore.ieee.org/abstract/document/7961779

Content
1. Introduction
2. Taxonomy (architecture & challenges)
3. Action/Activity & Gesture Recognition
4. Discussion
#Kookmin_University #Natural_Language_Processing_lab.

Introduction
> Action and Gesture recognition + Deep learning
> Challenges: large amounts of data to process and high model complexity
> Proposed models: RNN and LSTM for action/gesture recognition
+ 3D convolutional networks
+ pre-computed motion-based features
+ combination of multiple visual modalities
> Our goal: examine how these models treat the temporal dimension of the data
Computer vision and pattern recognition
Temporal dimension in sequences

Taxonomy
1. Architectures
2. Fusion Strategies
3. Datasets
4. Challenges

Architectures
Action/Gesture Recognition Approaches
> 3D Models (3D conv & pool)
> Motion-based input features
> Temporal Methods
- 2D Models + RNN + LSTM
- 2D Models + B-RNN + LSTM
- 2D Models + H-RNN + LSTM
- 2D Models + D-RNN + LSTM
- 2D Models + HMM
- 2D/3D Models + Auxiliary outputs
- 2D/3D Models + Hand-crafted features
* B: Bidirectional, H: Hierarchical, D: Differential

Architectures
> How to deal with the temporal dimension
in deep learning based human action and gesture recognition?
1) Using 3D filters in the convolutional layer
> It captures discriminative features along both spatial and temporal dimensions
while maintaining a certain temporal structure
2) Motion features
> Motion features are pre-computed from the video
> The features are fed to the network as additional input channels
3) Combining a 2D (or 3D) CNN applied to individual frames with temporal sequence modeling
> e.g., with an RNN or LSTM
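
To make option 1) concrete, here is a minimal sketch of a 3D convolution written in plain Python. The function name `conv3d_valid`, the toy clip, and the kernel are all illustrative assumptions, not code from the surveyed paper. It shows that a filter with a temporal extent responds to change across frames while the output keeps a time axis.

```python
def conv3d_valid(clip, kernel):
    """Naive 'valid' 3D cross-correlation of a T x H x W clip with a t x h x w kernel."""
    T, H, W = len(clip), len(clip[0]), len(clip[0][0])
    t, h, w = len(kernel), len(kernel[0]), len(kernel[0][0])
    out = []
    for i in range(T - t + 1):              # slide along time
        plane = []
        for j in range(H - h + 1):          # slide along height
            row = []
            for k in range(W - w + 1):      # slide along width
                acc = 0.0
                for a in range(t):
                    for b in range(h):
                        for c in range(w):
                            acc += clip[i + a][j + b][k + c] * kernel[a][b][c]
                row.append(acc)
            plane.append(row)
        out.append(plane)
    return out


# Toy clip (assumed data): 4 frames of 3x3 pixels.
# A bright pixel appears at the centre from frame 2 onwards.
clip = [[[0.0] * 3 for _ in range(3)] for _ in range(4)]
for f in (2, 3):
    clip[f][1][1] = 1.0

# Temporal-difference kernel: 2 frames deep, 1x1 in space (next frame minus current frame).
kernel = [[[-1.0]], [[1.0]]]

response = conv3d_valid(clip, kernel)
print(len(response), len(response[0]), len(response[0][0]))  # 3 3 3
print([response[i][1][1] for i in range(3)])                 # [0.0, 1.0, 0.0]
```

The response at the centre pixel is non-zero only at the step where the pixel switches on, and the output still has three time steps, which is the "certain temporal structure" the slide refers to.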

Architectures

Fusion Strategies
> Main variants for information fusion in deep learning models
1) Early
> Information from multiple sources is fused directly at the input,
> before the data is fed into the model
2) Late
> Outputs of deep learning models are combined
3) Middle
> Intermediate layers fuse information
Additional fusion strategies: ensembles or stacked networks
to combine the information from parts of a segmented video sequence
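
The three fusion variants above can be sketched with toy linear models. Everything here is an assumption made for illustration: the two sources (`rgb`, `flow`), the weight matrices, and the use of averaging for late fusion. Real systems would use trained deep networks in place of `linear`.

```python
def linear(weights, x):
    """Score each output unit as the dot product of its weight row with the input."""
    return [sum(w * v for w, v in zip(row, x)) for row in weights]


def relu(x):
    return [max(0.0, v) for v in x]


rgb = [1.0, 0.0]    # hypothetical appearance features
flow = [0.0, 2.0]   # hypothetical motion features

# Early fusion: concatenate the sources, then run a single model.
W_early = [[1.0, 0.0, 0.0, 1.0],
           [0.0, 1.0, 1.0, 0.0]]
early_scores = linear(W_early, rgb + flow)

# Late fusion: one model per source, then combine (here: average) the outputs.
W_rgb = [[1.0, 0.0], [0.0, 1.0]]
W_flow = [[0.0, 1.0], [1.0, 0.0]]
late_scores = [(a + b) / 2
               for a, b in zip(linear(W_rgb, rgb), linear(W_flow, flow))]

# Middle fusion: per-source hidden layers, concatenated before a shared output layer.
hidden = relu(linear(W_rgb, rgb)) + relu(linear(W_flow, flow))
W_out = [[1.0, 0.0, 1.0, 0.0],
         [0.0, 1.0, 0.0, 1.0]]
middle_scores = linear(W_out, hidden)

print(early_scores, late_scores, middle_scores)
```

The only difference between the variants is where the concatenation or combination happens: at the input, at an intermediate layer, or at the output.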

Challenges

Reviews: Action/Activity & Gesture Recognition
1. 3D Convolutional Neural Networks
2. Motion-based Features
3. Temporal Deep Learning Models: RNN and LSTM
4. Deep Learning with Fusion Strategies

3D Convolutional Neural Networks
> Extending the convolution along the temporal axis (in 3D CNN)
- Initializing the weights of a 3D CNN by using 2D weights learned from ImageNet
- Factorizing the 3D convolutional kernel learning
as a sequential process of learning 2D spatial and 1D temporal kernels in different layers
- Performing 3D convolutions over stacks of optical flow maps
- Using multiple 3D CNNs in a multi-stage fashion
- Combining 3D CNN models with sequence modeling methods
or hand-crafted feature descriptors
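
The kernel factorization item can be illustrated with a small check in plain Python. This is a sketch under a simplifying assumption: it applies one 2D spatial filter followed by one 1D temporal filter and compares the result with a single 3D filter built as their outer product. The helper names and the toy data are hypothetical, and the surveyed methods learn such kernels in different layers rather than fixing them by hand.

```python
def spatial_conv(clip, k2d):
    """Apply a 2D 'valid' cross-correlation to every frame independently."""
    h, w = len(k2d), len(k2d[0])
    out = []
    for frame in clip:
        H, W = len(frame), len(frame[0])
        out.append([[sum(frame[j + b][k + c] * k2d[b][c]
                         for b in range(h) for c in range(w))
                     for k in range(W - w + 1)]
                    for j in range(H - h + 1)])
    return out


def temporal_conv(clip, k1d):
    """Apply a 1D 'valid' cross-correlation along the time axis at every pixel."""
    t = len(k1d)
    T, H, W = len(clip), len(clip[0]), len(clip[0][0])
    return [[[sum(clip[i + a][j][k] * k1d[a] for a in range(t))
              for k in range(W)]
             for j in range(H)]
            for i in range(T - t + 1)]


def conv3d(clip, k3d):
    """Reference 'valid' 3D cross-correlation with a full t x h x w kernel."""
    t, h, w = len(k3d), len(k3d[0]), len(k3d[0][0])
    T, H, W = len(clip), len(clip[0]), len(clip[0][0])
    return [[[sum(clip[i + a][j + b][k + c] * k3d[a][b][c]
                  for a in range(t) for b in range(h) for c in range(w))
              for k in range(W - w + 1)]
             for j in range(H - h + 1)]
            for i in range(T - t + 1)]


# Toy integer clip (assumed data): 3 frames of 4x4 pixels.
clip = [[[(f * f + j * k + k) % 5 for k in range(4)] for j in range(4)]
        for f in range(3)]
k2d = [[1, 0], [0, -1]]   # spatial kernel
k1d = [1, 2]              # temporal kernel
k3d = [[[kt * ks for ks in row] for row in k2d] for kt in k1d]  # outer product

factorized = temporal_conv(spatial_conv(clip, k2d), k1d)
full = conv3d(clip, k3d)
print(factorized == full)  # True

params_3d = len(k1d) * len(k2d) * len(k2d[0])            # 8 weights
params_factorized = len(k2d) * len(k2d[0]) + len(k1d)    # 6 weights
```

The equality holds only for kernels that separate into a spatial and a temporal part, so the factorized form trades some expressiveness for fewer parameters.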

Motion-based Features
> Incorporating pre-computed temporal features within the deep model
- Presenting two-stream CNN (spatial and temporal networks)
- Exploiting motion vectors from video compression
- Extending the convolutions in time with long-term temporal convolutions
> Extending the CNN capabilities using trajectory features
- Pooling and normalization
- Learning bag-of-features from dense trajectories of synthetic 3D human models
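
As a rough illustration of feeding pre-computed motion to a network, the sketch below stacks temporal difference maps next to an appearance frame as extra input channels. Frame differencing is used here only as a simple stand-in for optical flow or compressed-domain motion vectors, and the function names and toy frames are assumptions.

```python
def frame_difference(prev, curr):
    """Per-pixel difference between two grayscale frames (a crude motion cue)."""
    return [[c - p for p, c in zip(prow, crow)]
            for prow, crow in zip(prev, curr)]


def build_inputs(frames, stack=2):
    """For each frame with enough history, return its channels:
    one appearance channel followed by `stack` motion channels."""
    diffs = [frame_difference(frames[i], frames[i + 1])
             for i in range(len(frames) - 1)]
    samples = []
    for end in range(stack, len(frames)):
        appearance = frames[end]
        motion = diffs[end - stack:end]   # the differences leading up to frame `end`
        samples.append([appearance] + motion)
    return samples


# Toy sequence (assumed data): a single bright pixel moving right in a 1x4 image.
frames = [[[1, 0, 0, 0]],
          [[0, 1, 0, 0]],
          [[0, 0, 1, 0]],
          [[0, 0, 0, 1]]]

samples = build_inputs(frames, stack=2)
print(len(samples), len(samples[0]))  # 2 samples, 3 channels each
print(samples[0])
```

In a two-stream design the appearance channel and the motion channels would go to separate spatial and temporal networks instead of being stacked into one input.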

Temporal Deep Learning Models: RNN and LSTM
> Combining CNN with temporal sequence models (RNN or LSTM)
- Modeling the change in motion information between successive frames
- Presenting a multi-stream (motion and appearance) network using a bi-directional RNN
- Observing video frames and deciding both where to look next and when to emit a prediction
- Using 3D skeleton sequences to regularize an LSTM network (LSTM+CNN) on video frames
- A multimodal (depth video, skeleton, and speech) system built on an RNN
- Multi-RNN to facilitate the handling of variable-length gestures
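
The CNN plus recurrent model pattern can be sketched as a recurrent update over per-frame feature vectors. The sketch below uses a plain (Elman) RNN rather than an LSTM to stay short, and the weights and "CNN features" are made-up numbers. The point is only that clips of different lengths map to a state of the same size.

```python
import math


def rnn_last_state(features, W_in, W_rec, bias):
    """Run a plain RNN over per-frame feature vectors and return the final hidden state."""
    hidden = [0.0] * len(bias)
    for x in features:
        # The new state mixes the current frame's features with the previous state.
        hidden = [math.tanh(sum(w * v for w, v in zip(W_in[u], x))
                            + sum(w * h for w, h in zip(W_rec[u], hidden))
                            + bias[u])
                  for u in range(len(bias))]
    return hidden


# Hypothetical weights for 3-dimensional frame features and 2 hidden units.
W_in = [[0.5, -0.2, 0.1],
        [0.0, 0.3, -0.4]]
W_rec = [[0.1, 0.2],
         [-0.3, 0.1]]
bias = [0.0, 0.0]

# Two clips of different length, each a list of per-frame feature vectors
# (in practice these would come from a 2D or 3D CNN).
short_clip = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
long_clip = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0],
             [1.0, 1.0, 0.0], [0.0, 1.0, 1.0]]

state_short = rnn_last_state(short_clip, W_in, W_rec, bias)
state_long = rnn_last_state(long_clip, W_in, W_rec, bias)
print(len(state_short), len(state_long))  # 2 2
```

A classifier placed on top of the final state (or on every intermediate state) would then produce the action or gesture label.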

Deep Learning with Fusion Strategies
> Using diverse fusion schemes to improve action recognition performance
- Learning an end-to-end hierarchical RNN with skeleton data
- DeepConvLSTM based on convolutional and LSTM recurrent units
- HMM (Hidden Markov Model), GMM (Gaussian Mixture Model)
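
The slide only names the HMM, so the following is an assumed usage rather than a method taken from the paper. Per-frame class probabilities (for example from a CNN) are treated as emission scores and decoded with the Viterbi algorithm under sticky transitions. All probabilities below are invented for illustration.

```python
import math


def viterbi(log_emissions, log_trans, log_start):
    """Most likely state sequence given per-frame log scores and transition log-probabilities."""
    n_states = len(log_start)
    score = [log_start[s] + log_emissions[0][s] for s in range(n_states)]
    back = []
    for frame in log_emissions[1:]:
        prev_best, new_score = [], []
        for s in range(n_states):
            best = max(range(n_states), key=lambda p: score[p] + log_trans[p][s])
            prev_best.append(best)
            new_score.append(score[best] + log_trans[best][s] + frame[s])
        back.append(prev_best)
        score = new_score
    # Backtrack from the best final state.
    state = max(range(n_states), key=lambda s: score[s])
    path = [state]
    for prev_best in reversed(back):
        state = prev_best[state]
        path.append(state)
    return path[::-1]


def logs(rows):
    return [[math.log(p) for p in row] for row in rows]


# State 0 = "rest", state 1 = "gesture" (hypothetical labels).
frame_probs = [[0.9, 0.1], [0.4, 0.6], [0.8, 0.2], [0.3, 0.7], [0.1, 0.9]]
transitions = [[0.8, 0.2], [0.2, 0.8]]   # states tend to persist
start = [0.5, 0.5]

framewise = [max(range(2), key=lambda s: p[s]) for p in frame_probs]
decoded = viterbi(logs(frame_probs), logs(transitions), [math.log(p) for p in start])
print(framewise)  # [0, 1, 0, 1, 1]
print(decoded)    # [0, 0, 0, 1, 1]
```

The frame-by-frame decision flickers at the second frame, while the HMM decoding removes the isolated switch because two transitions cost more than one weak frame score.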

Discussion
> Comprehensive overview of deep-based models for action and gesture recognition
- How does a method deal with temporal information?
- How can such a large network be trained with small datasets?
> 3D networks over a long sequence can learn complex temporal patterns
> Temporal models (RNN and LSTM) have the crucial advantage of coping with longer-range
temporal relations
> Ensemble learning (fusion strategies) reduces the bias and variance errors of the
learning algorithm

Other papers
“Modeling Spatial-Temporal Clues in a Hybrid Deep Learning Framework for Video Classification”
(ACM 2015)

Other papers
“Long-term Recurrent Convolutional Networks for Visual Recognition and Description”
(CVPR 2015)

Other papers
“FASTER Recurrent Networks for Efficient Video Classification”
(AAAI 2020)

Other papers
“Attention Boosted Deep Networks for Video Classification”
(IEEE 2020)

Other papers
“Traditional Bangladeshi Sports Video Classification Using Deep Learning Method”
(Applied Sciences 2021)

Thank You.
