Natural Language Processing Lab
M2020064
Danbi Cho
Published in: 2017 IEEE 12th International Conference on Automatic Face & Gesture Recognition
URL: https://ieeexplore.ieee.org/abstract/document/7961779
Contents
1. Introduction
2. Taxonomy (architecture & challenges)
3. Action/Activity & Gesture Recognition
4. Discussion
#Kookmin_University #Natural_Language_Processing_lab. 1
Introduction
> Action and gesture recognition + deep learning
> Challenges: the amount of data to be processed and model complexity
> Proposed models: RNN and LSTM for action/gesture recognition
+ 3D convolutional networks
+ pre-computed motion-based features
+ combination of multiple visual cues
> Our goal: how do they treat the temporal dimension of the data?
Computer vision and pattern recognition
Temporal dimension in sequences
Taxonomy
1. Architectures
2. Fusion Strategies
3. Datasets
4. Challenges
Architectures
Action/Gesture Recognition Approaches
3D Models (3D conv & pool)
Motion-based input features
Temporal Methods
2D Models + RNN + LSTM
2D Models + B-RNN + LSTM
2D Models + H-RNN + LSTM
2D Models + D-RNN + LSTM
2D Models + HMM
2D/3D Models + Auxiliary outputs
2D/3D Models + Hand-crafted features
* B: Bidirectional, H: Hierarchical, D: Differential
Architectures
> How do deep-learning-based methods for human action and gesture recognition deal with the temporal dimension?
1) Using 3D filters in the convolutional layers
> They capture discriminative features along both the spatial and temporal dimensions while maintaining a certain temporal structure
2) Motion features
> Motion features are pre-computed
> The features are input to the network as additional channels
3) Combining a 2D (or 3D) CNN applied to individual frames with temporal sequence modeling
> with an RNN or LSTM
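The first option, a 3D filter sliding over both space and time, can be sketched as follows. This is a minimal NumPy illustration, not the implementation of any surveyed model: the function name `conv3d_valid`, the clip size, and the averaging kernel are all assumptions chosen for the example.

```python
import numpy as np

def conv3d_valid(clip, kernel):
    """Naive 'valid' 3D cross-correlation of a (T, H, W) clip with a (t, h, w) kernel."""
    T, H, W = clip.shape
    t, h, w = kernel.shape
    out = np.zeros((T - t + 1, H - h + 1, W - w + 1))
    for i in range(out.shape[0]):          # temporal position
        for j in range(out.shape[1]):      # vertical position
            for k in range(out.shape[2]):  # horizontal position
                out[i, j, k] = np.sum(clip[i:i + t, j:j + h, k:k + w] * kernel)
    return out

# Hypothetical input: 8 grayscale frames of 16x16 pixels.
clip = np.random.default_rng(0).random((8, 16, 16))
kernel = np.ones((3, 3, 3)) / 27.0  # illustrative averaging filter, not a learned one
features = conv3d_valid(clip, kernel)
print(features.shape)  # (6, 14, 14)
```

The output keeps a temporal axis (6 steps here), which is what "maintaining a certain temporal structure" refers to: the filter mixes neighbouring frames but does not collapse the whole clip into one map.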
Fusion Strategies
> Main variants of information fusion in deep learning models
1) Early
> Information from multiple sources is fused directly, before the data is fed into the model
2) Late
> The outputs of the deep learning models are combined
3) Middle
> Intermediate layers fuse the information
Additional fusion strategies: ensembles or stacked networks to combine the information from parts of a segmented video sequence
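The three fusion variants can be contrasted in a small sketch. Everything here is a hypothetical stand-in: `linear_classifier` replaces a trained deep model, the weights are random, and the feature sizes and the averaging rule for late fusion are assumptions made only for illustration.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def linear_classifier(x, weights):
    """Hypothetical stand-in for a trained deep model: one linear layer + softmax."""
    return softmax(weights @ x)

rng = np.random.default_rng(0)
rgb_feat = rng.random(4)   # hypothetical appearance features
flow_feat = rng.random(3)  # hypothetical motion features
n_classes = 5

# Early fusion: concatenate the sources before a single model sees them.
w_early = rng.standard_normal((n_classes, 7))
early_scores = linear_classifier(np.concatenate([rgb_feat, flow_feat]), w_early)

# Late fusion: one model per source, then combine the output scores (here: average).
w_rgb = rng.standard_normal((n_classes, 4))
w_flow = rng.standard_normal((n_classes, 3))
late_scores = 0.5 * (linear_classifier(rgb_feat, w_rgb)
                     + linear_classifier(flow_feat, w_flow))

# Middle fusion: each source has its own first layer; hidden activations are merged.
h_rgb = np.tanh(rng.standard_normal((6, 4)) @ rgb_feat)
h_flow = np.tanh(rng.standard_normal((6, 3)) @ flow_feat)
w_mid = rng.standard_normal((n_classes, 12))
middle_scores = linear_classifier(np.concatenate([h_rgb, h_flow]), w_mid)
```

The only difference between the variants is where the concatenation (or averaging) happens: at the input, at a hidden layer, or at the class scores.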
Datasets
Challenges
Reviews: Action/Activity & Gesture Recognition
1. 3D Convolutional Neural Networks
2. Motion-based Features
3. Temporal Deep Learning Models: RNN and LSTM
4. Deep Learning with Fusion Strategies
3D Convolutional Neural Networks
> Extending the convolution along the temporal axis (3D CNN)
- Initializing the weights of a 3D CNN by using 2D weights learned from ImageNet
- Factorizing the 3D convolutional kernel learning as a sequential process of learning 2D spatial and 1D temporal kernels in different layers
- Performing 3D convolutions over stacks of optical flow maps
- Using multiple 3D CNNs in multiple stages
- Combining 3D CNN models with sequence modeling methods or hand-crafted feature descriptors
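One way to initialize a 3D filter from a 2D one is sketched below. The slide does not say which scheme is used, so the replicate-and-rescale rule here is an assumption, and `inflate_2d_kernel` is a hypothetical helper name; the random `kernel_2d` merely stands in for a filter pretrained on ImageNet.

```python
import numpy as np

def inflate_2d_kernel(kernel_2d, depth):
    """Repeat a (h, w) kernel `depth` times along a new temporal axis, scaled by 1/depth."""
    return np.repeat(kernel_2d[np.newaxis, :, :], depth, axis=0) / depth

rng = np.random.default_rng(0)
kernel_2d = rng.standard_normal((3, 3))      # stand-in for a pretrained 2D filter
kernel_3d = inflate_2d_kernel(kernel_2d, 3)  # shape (3, 3, 3)

# On a static clip (the same frame repeated), the inflated filter
# gives the same response as the original 2D filter on one frame.
frame_patch = rng.random((3, 3))
static_clip_patch = np.repeat(frame_patch[np.newaxis], 3, axis=0)
response_2d = np.sum(frame_patch * kernel_2d)
response_3d = np.sum(static_clip_patch * kernel_3d)
```

Dividing by the temporal depth keeps the activation scale of the 2D network on motionless input, so the 3D network starts from behaviour the 2D weights already learned.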
Motion-based Features
> Incorporating pre-computed temporal features within the deep model
- Presenting a two-stream CNN (spatial and temporal networks)
- Exploiting motion vectors from video compression
- Extending the convolutions in time with long-term temporal convolutions
> Extending the CNN capabilities using trajectory features
- Pooling and normalization
- Learning bag-of-features from dense trajectories of synthetic 3D human models
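Feeding pre-computed motion features to a network as additional channels can be sketched as follows. Note the assumption: simple frame differences are used here as a crude motion cue in place of optical flow or compressed-domain motion vectors, and `add_motion_channels` is a hypothetical helper, not a function from any surveyed work.

```python
import numpy as np

def add_motion_channels(frames):
    """frames: (T, H, W, C) -> (T-1, H, W, 2C): each frame plus its difference to the previous one."""
    frames = np.asarray(frames, dtype=np.float64)
    diffs = frames[1:] - frames[:-1]  # crude motion cue, NOT optical flow
    return np.concatenate([frames[1:], diffs], axis=-1)

# Hypothetical input: 5 RGB frames of 8x8 pixels.
frames = np.random.default_rng(0).random((5, 8, 8, 3))
stacked = add_motion_channels(frames)
print(stacked.shape)  # (4, 8, 8, 6)
```

A two-stream design would instead send the appearance channels and the motion channels to two separate networks and fuse their outputs later.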
Temporal Deep Learning Models: RNN and LSTM
> Combining CNN with temporal sequence models (RNN or LSTM)
- Capturing the change in motion information between successive frames
- Presenting a multi-stream (motion and appearance) model using a bidirectional RNN
- Observing video frames and deciding both where to look next and when to emit a prediction
- Using 3D skeleton sequences to regularize an LSTM network (LSTM + CNN) on video frames
- An RNN-based multimodal (depth video, skeleton, and speech) system
- Multi-RNN to facilitate the handling of variable-length gestures
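The general pattern of per-frame CNN features followed by a recurrent model can be sketched as below. This is a plain (Elman) RNN with random weights rather than an LSTM or any specific surveyed architecture; `rnn_classify`, the dimensions, and the random "CNN features" are all assumptions for illustration.

```python
import numpy as np

def rnn_classify(frame_features, w_in, w_rec, w_out):
    """Run a plain (Elman) RNN over (T, D) per-frame features; classify from the last state."""
    hidden = np.zeros(w_rec.shape[0])
    for x_t in frame_features:  # one step per frame, any T works
        hidden = np.tanh(w_in @ x_t + w_rec @ hidden)
    logits = w_out @ hidden
    return int(np.argmax(logits)), hidden

rng = np.random.default_rng(0)
feat_dim, hidden_dim, n_classes = 16, 8, 4
w_in = rng.standard_normal((hidden_dim, feat_dim)) * 0.1
w_rec = rng.standard_normal((hidden_dim, hidden_dim)) * 0.1
w_out = rng.standard_normal((n_classes, hidden_dim)) * 0.1

short_clip = rng.random((5, feat_dim))   # stand-in for CNN features of 5 frames
long_clip = rng.random((20, feat_dim))   # the same weights handle 20 frames
label_short, _ = rnn_classify(short_clip, w_in, w_rec, w_out)
label_long, state = rnn_classify(long_clip, w_in, w_rec, w_out)
```

Because the same weights are applied at every step, sequences of different lengths need no padding or resizing, which is why recurrent models suit variable-length gestures.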
Deep Learning with Fusion Strategies
> Using diverse fusion schemes to improve action recognition performance
- Learning an end-to-end hierarchical RNN with skeleton data
- DeepConvLSTM based on convolutional and LSTM recurrent units
- HMM (Hidden Markov Model), GMM (Gaussian Mixture Model)
Discussion
> Comprehensive overview of deep-learning-based models for action and gesture recognition
- How does a method deal with temporal information?
- How can such a large network be trained with small datasets?
> 3D networks over a long sequence can learn complex temporal patterns
> Temporal models (RNN and LSTM) have the crucial advantage of coping with longer-range temporal relations
> Ensemble learning (fusion strategies) reduces the bias and variance errors of the learning algorithm
Other papers
- “Modeling Spatial-Temporal Clues in a Hybrid Deep Learning Framework for Video Classification” (ACM 2015)
- “Long-term Recurrent Convolutional Networks for Visual Recognition and Description” (CVPR 2015)
- “FASTER Recurrent Networks for Efficient Video Classification” (AAAI 2020)
- “Attention Boosted Deep Networks for Video Classification” (IEEE 2020)
- “Traditional Bangladeshi Sports Video Classification Using Deep Learning Method” (Applied Sciences 2021)
Thank You.
A survey on deep learning based approaches for action and gesture recognition in image sequences
