Human Pose Estimation by Deep Learning
Wei Yang
Supervisor: Prof. WANG Xiaogang, Prof. OUYANG Wanli
IVP Lab, CUHK
September 11, 2015
Outline
• Introduction
• Traditional Approaches
• Deep Learning Methods
– Global view (holistic view)
– Local appearance
– Combination of local appearance and global view
– Others
2015/9/11 2
Introduction
• What is articulated body pose estimation?
“recovers the pose of an articulated body, which consists of joints and rigid parts
using image-based observations.”
Applications
• Action recognition
• Clothing parsing
• Gaming
• Human tracking
Challenges
Traditional Approaches
Fischler & Elschlager 1973
Felzenszwalb & Huttenlocher 2005
Pictorial Structure
• Unary Templates
• Pairwise Springs
Yang & Ramanan 2011
Mixtures of “mini-parts”
• Mixture of part 𝑖
• Unary template for part 𝑖 with mixture 𝑚𝑖
• Pairwise springs between part 𝑖 with
mixture 𝑚𝑖 and part 𝑗 with mixture 𝑚𝑗
[Figure labels: head, torso, leg]
Example of mini-parts: near-vertical and near-horizontal limbs
Deep Learning for Pose Estimation
• Holistic View
– e.g., joint position regression
• Local View
– e.g., body part detection
• Combining global and local information
– e.g., body part detection + joint position regression
• Others
– e.g., motion features, pose estimation in videos
Holistic View
DeepPose: Human Pose Estimation via Deep Neural Networks
Holistic Reasoning
• Why holistic reasoning?
– Besides extreme variability in articulations, many of the joints are barely visible
DeepPose: A CNN Regressor
• Network architecture: AlexNet
– Krizhevsky, Sutskever, and Hinton, NIPS 2012 (ImageNet)
– The first time a deep model was shown to be effective on a large-scale computer vision task
[Toshev & Szegedy, CVPR 2014]
Results on LSP (Leeds Sports Pose) dataset
Cascade of Pose Regressors
• The pose estimation results from the initial stage are very coarse:
– Due to its fixed input size of 220 × 220, the network has limited capacity to look at detail
– Train a cascade of pose regressors for more precise joint localization
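The cascade idea can be sketched as a loop: crop a patch around the current estimate of each joint, let a stage regressor predict a correction, and repeat. This is only a minimal illustration, assuming a single-channel image; `crop_around`, `refine_joints` and the toy `toward_brightest` regressor are hypothetical names standing in for the trained DNN stages, which the slides do not specify in code.

```python
import numpy as np

def crop_around(image, center, size):
    """Crop a size x size patch centred on `center` = (x, y), zero-padded at borders."""
    half = size // 2
    h, w = image.shape
    x = int(np.clip(round(center[0]), 0, w - 1))
    y = int(np.clip(round(center[1]), 0, h - 1))
    padded = np.pad(image, half, mode="constant")
    # After padding by `half`, the patch centred on (x, y) starts at row y, column x.
    return padded[y:y + size, x:x + size]

def refine_joints(image, joints, stage_regressors, patch_size=32):
    """Apply each cascade stage: crop around the estimate, add the predicted offset."""
    joints = np.asarray(joints, dtype=float).copy()
    for regress_offset in stage_regressors:
        for k in range(len(joints)):
            patch = crop_around(image, joints[k], patch_size)
            joints[k] += regress_offset(patch)
    return joints

def toward_brightest(patch):
    """Toy stand-in for a trained stage regressor (assumption, not the DeepPose network)."""
    row, col = np.unravel_index(np.argmax(patch), patch.shape)
    half = patch.shape[0] // 2
    return np.array([col - half, row - half], dtype=float)
```

Because every stage sees a crop around the previous estimate, later stages effectively work at a higher resolution than the first full-image pass.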
Cascade of Pose Regressors
Refined pose estimation
Percentage of Correct Parts (PCP) on LSP dataset
Local Appearance Method
Articulated Pose Estimation by a Graphical Model with Image Dependent Pairwise Relations
Motivation
• Local image patches are able to capture:
– Part presence
– Pairwise part spatial relationships
Number of mixture types for each pair: 6
– 1 neighbor: number of relationships $6^1 = 6$
– 2 neighbors: number of relationships $6^2 = 36$
[Figure labels: lower arm, upper arm]
[Chen & Yuille NIPS 2014]
Tree-structured Relational Graph
• $T = (V, E)$
– $V$: body parts
– $E$: pairwise relationships between parts
• $\mathbf{p} = \{p_i = (x_i, y_i)\}$
– $p_i$: pixel location of part $i$
• $\mathbf{t} = \{t_{ij}, t_{ji} \mid (i, j) \in E\}$
– Pairwise relationship
– Defined by relative position
– $t_{ij} \in \{1, \dots, T_{ij}\}$
– In experiments: 13 types for each pair $(i, j) \in E$
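Formulation
$$F(\mathbf{p}, \mathbf{t} \mid I; \boldsymbol{\omega}, \theta) = \sum_{i \in V} A_i(p_i \mid I; \theta) + \sum_{(i,j) \in E} R(p_i, p_j, t_{ij}, t_{ji} \mid I; \theta)$$
– $A_i = \omega_i \cdot (\text{part presence})$
– $R = \boldsymbol{\omega}_{ij}^{t_{ij}} \cdot (\text{pairwise deformation}) + \omega_{ij} \cdot (\text{pairwise relationship})$
$\omega_i$, $\omega_{ij}$, $\boldsymbol{\omega}_{ij}^{t_{ij}}$ are learned by a latent structural SVM
Inference: $(\mathbf{p}^*, \mathbf{t}^*) = \arg\max_{\mathbf{p}, \mathbf{t}} F(\mathbf{p}, \mathbf{t} \mid I; \boldsymbol{\omega}, \theta)$
• Tree structure
• Can be solved efficiently by dynamic programming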
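Why a tree structure makes the arg max tractable can be illustrated with a generic max-sum dynamic program. The sketch below is an assumption-laden simplification: each part has a small discrete set of candidate states (standing in for location and type pairs), and `unary` and `pairwise` are precomputed score tables rather than DCNN outputs. The function name `tree_max_sum` is hypothetical.

```python
import numpy as np

def tree_max_sum(unary, pairwise, parent):
    """Exact arg max of sum_i unary[i][s_i] + sum_c pairwise[c][s_c, s_parent(c)] on a tree.

    unary:    list of 1-D score arrays, one per part
    pairwise: dict child -> (K_child x K_parent) score matrix for the edge to its parent
    parent:   parent[i] is the parent of part i, -1 for the root (part 0);
              parts are assumed to be ordered so that parent[i] < i
    """
    n = len(unary)
    msg = [np.asarray(u, dtype=float).copy() for u in unary]
    best_child_state = {}
    # Leaves-to-root pass: each child tells its parent the best score it can reach.
    for c in range(n - 1, 0, -1):
        total = msg[c][:, None] + pairwise[c]        # shape (K_child, K_parent)
        best_child_state[c] = total.argmax(axis=0)
        msg[parent[c]] += total.max(axis=0)
    # Root-to-leaves pass: read back the maximising state of every part.
    states = [0] * n
    states[0] = int(msg[0].argmax())
    for c in range(1, n):
        states[c] = int(best_child_state[c][states[parent[c]]])
    return states, float(msg[0].max())
```

The cost is linear in the number of parts and quadratic in the number of states per edge, instead of exponential in the number of parts for exhaustive search.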
Learning DCNN parameters $\theta$
Derive the type label for each patch
• Use the relative position $d_{ij}$ to represent the pairwise relations
• Cluster the relative positions over the whole training set $\{d_{ij}^{n}\}_{n=1}^{N}$
• Type label $t_{ij}^{n}$: cluster index
• Mean relative position $r_{ij}^{t_{ij}}$: cluster center
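The type-label derivation can be sketched with a plain k-means over the relative positions. The slide only says "cluster", so k-means is an assumption here, and `cluster_relative_positions` is a hypothetical helper name.

```python
import numpy as np

def cluster_relative_positions(d, num_types, num_iters=20, seed=0):
    """Plain k-means over relative positions d (N x 2).

    Returns (labels, centres): labels[n] plays the role of the type label for
    sample n, centres[t] the mean relative position of type t.
    """
    d = np.asarray(d, dtype=float)
    rng = np.random.default_rng(seed)
    centres = d[rng.choice(len(d), size=num_types, replace=False)]

    def assign(centres):
        dist = np.linalg.norm(d[:, None, :] - centres[None, :, :], axis=2)
        return dist.argmin(axis=1)

    for _ in range(num_iters):
        labels = assign(centres)
        for t in range(num_types):
            if np.any(labels == t):              # leave empty clusters where they are
                centres[t] = d[labels == t].mean(axis=0)
    return assign(centres), centres
```

With the 13 types per pair mentioned earlier, `num_types` would be 13 and `d` would hold one relative position per training example of that part pair.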
Casting Full Connections into Convolutions
[Figure: for the elbow, a part presence map and a pairwise relationship map]
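Casting a fully connected layer into a convolution amounts to reshaping each row of the weight matrix into a filter of the same size as the layer's original input and sliding it over a larger feature map. The NumPy sketch below illustrates that equivalence with explicit loops; the function name, the H x W x C layout and the row-major flattening order are assumptions made for the illustration.

```python
import numpy as np

def fc_as_convolution(feature_map, fc_weights, fc_bias, k):
    """Slide a fully connected layer, trained on flattened k x k x C inputs,
    over an H x W x C feature map and return an (H-k+1) x (W-k+1) x num_out map."""
    h, w, c = feature_map.shape
    num_out = fc_weights.shape[0]
    # Each row of the FC weight matrix becomes one k x k x C convolution filter.
    kernels = fc_weights.reshape(num_out, k, k, c)
    out = np.empty((h - k + 1, w - k + 1, num_out))
    for y in range(h - k + 1):
        for x in range(w - k + 1):
            window = feature_map[y:y + k, x:x + k, :]
            out[y, x] = np.tensordot(kernels, window, axes=([1, 2, 3], [0, 1, 2])) + fc_bias
    return out
```

Every output position equals what the original layer would produce on the corresponding k x k window, so dense score maps come from one pass rather than one forward pass per patch.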
PCP and PDJ on LSP dataset and FLIC dataset
PCP on LSP:
Method         Torso  Head  U.Leg  L.Leg  U.Arm  L.Arm  Mean
DCNN           92.5   85.1  82.7   76.3   70.2   55.9   74.8
Ouyang et al.  85.8   83.1  76.5   72.2   63.3   46.6   68.6
[Figure: PDJ curves on LSP and FLIC]
Combining Local Appearance and Holistic View
Dual-Source Deep Neural Networks for Human Pose Estimation
Dual-Source CNN
• Integrate both the local part appearance and the holistic view
of each local part for more accurate human pose estimation
• Each input is an image pair
– Part patches
– Body patches
Part patches: incorporate local appearance
• Generated by region proposals with some
restrictions
– Not too small (at least contain a body part)
– Not too big (may contain too many body parts and
lacks sufficient resolution)
• All classes of joints are covered by a similar number of part patches
• During testing, part patches are selected
from multi-scale sliding windows
Body patches: holistic view
• Also from region proposals
– Must cover all body parts
– In testing stage, the body patch can be generated by human detection
• For DS-CNN, each training sample is made up of 3 components
– A part patch
– A body patch
– A binary mask specifying the location of the part patch in the body patch
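The third component, the binary mask, can be sketched as follows. The box convention (x0, y0, x1, y1) with exclusive upper bounds, the square mask resolution and the function name `part_location_mask` are all assumptions for illustration; the slides do not give these details.

```python
import numpy as np

def part_location_mask(body_box, part_box, mask_size):
    """Binary mask (mask_size x mask_size) marking where the part patch lies
    inside the body patch. Boxes are (x0, y0, x1, y1) in image pixels."""
    bx0, by0, bx1, by1 = body_box
    px0, py0, px1, py1 = part_box
    body_w, body_h = bx1 - bx0, by1 - by0
    # Map the part box into mask coordinates, then clip to the mask extent.
    x0 = int(np.clip(np.floor((px0 - bx0) * mask_size / body_w), 0, mask_size))
    x1 = int(np.clip(np.ceil((px1 - bx0) * mask_size / body_w), 0, mask_size))
    y0 = int(np.clip(np.floor((py0 - by0) * mask_size / body_h), 0, mask_size))
    y1 = int(np.clip(np.ceil((py1 - by0) * mask_size / body_h), 0, mask_size))
    mask = np.zeros((mask_size, mask_size), dtype=np.uint8)
    mask[y0:y1, x0:x1] = 1
    return mask
```

A training sample would then be the triple (part patch, body patch, mask), with the mask telling the network where the local patch sits in the holistic view.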
Training of the DS-CNN
[Figure labels: shared weights; classification (softmax); regression (L2 distance)]
Testing
• Part heat map
– Same size as the input image
– Uniformly distributed probability for each sliding window
– Sum and average over all pixels
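One possible reading of the heat-map construction: each sliding window spreads its predicted part probability uniformly over the pixels it covers, and each pixel averages the values of all windows that contain it. The sketch below implements that reading; the exact normalisation is an assumption, and `part_heat_map` is a hypothetical name.

```python
import numpy as np

def part_heat_map(image_shape, windows, probs):
    """Accumulate per-window probabilities into a heat map of the image size.

    windows: list of (x0, y0, x1, y1) boxes with exclusive upper bounds
    probs:   probability that each window contains the part
    """
    h, w = image_shape
    total = np.zeros((h, w))
    count = np.zeros((h, w))
    for (x0, y0, x1, y1), p in zip(windows, probs):
        total[y0:y1, x0:x1] += p        # same value for every pixel of the window
        count[y0:y1, x0:x1] += 1
    # Average over the windows covering each pixel; uncovered pixels stay at 0.
    return np.divide(total, count, out=np.zeros_like(total), where=count > 0)
```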
Testing
• Final pose estimation
– Weighted average of predicted joint locations within part patches with high
responses.
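The final fusion step can be sketched as a response-weighted average over the strongest part patches. Keeping a fixed number of top patches (`top_k`) is an assumption; the slide only says "high responses", and `fuse_joint_predictions` is a hypothetical name.

```python
import numpy as np

def fuse_joint_predictions(locations, responses, top_k=3):
    """Weighted average of per-patch joint predictions.

    locations: (M, 2) joint locations regressed from M part patches
    responses: (M,) classification responses of those patches
    """
    locations = np.asarray(locations, dtype=float)
    responses = np.asarray(responses, dtype=float)
    keep = np.argsort(responses)[-top_k:]              # patches with the highest responses
    weights = responses[keep] / responses[keep].sum()
    return weights @ locations[keep]
```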
Results: PCP on LSP
Other Methods & Applications
• MoDeep: A Deep Learning Framework Using Motion
Features for Human Pose Estimation
• Flowing ConvNets for Human Pose Estimation in Videos
Using Motion Features for Human Pose Estimation
• Motion is a powerful visual cue that alone can be used to extract high-level information, including articulated pose.
Image credit: Large displacement optical flow: descriptor matching in variational motion estimation
Thomas Brox, J. Malik. IEEE TPAMI, 33(3): 500-513, 2011
MoDeep: Using Motion Features for Human Pose Estimation
• Extends the Frames Labeled In Cinema (FLIC) dataset with additional motion features
MoDeep: A Deep Learning Framework Using Motion Features for Human Pose Estimation. Jain et al., ACCV 2014
[Figure: average of frame pair; optical flow]
Multi-resolution efficient sliding window model
Simple Spatial Model
• FLIC: multiple people with only one annotated person
• Testing: incorporate annotated torso position with simple
spatial model
[Figure: predicted left shoulder; spatial mask of left shoulder; result]
Experiment results
[Figure: results without and with motion features under occlusion, cluttered background and motion blur]
Flowing ConvNets for Human Pose Estimation in Videos
• A CNN can benefit from temporal context by combining information across multiple frames using optical flow.
Spatial ConvNet
Why regress a heatmap instead of joint coordinates?
• The predicted heatmap can be multi-modal
• Regressing coordinates directly is a highly non-linear mapping that is more difficult to learn
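The contrast between the two output representations can be made concrete with a small sketch. A Gaussian bump as the regression target and arg max decoding are common choices and are assumed here; `gaussian_heatmap` and `decode_heatmap` are hypothetical names.

```python
import numpy as np

def gaussian_heatmap(shape, joint_xy, sigma=1.5):
    """Target heatmap: a Gaussian bump centred on the joint location (x, y)."""
    h, w = shape
    ys, xs = np.mgrid[0:h, 0:w]
    x, y = joint_xy
    return np.exp(-((xs - x) ** 2 + (ys - y) ** 2) / (2.0 * sigma ** 2))

def decode_heatmap(heatmap):
    """Read the joint location back as the (x, y) position of the maximum."""
    row, col = np.unravel_index(np.argmax(heatmap), heatmap.shape)
    return int(col), int(row)

# A heatmap can keep two competing hypotheses alive at once, which a single
# (x, y) regression output cannot: it would have to commit to one point.
two_modes = gaussian_heatmap((20, 30), (5, 5)) + 0.6 * gaussian_heatmap((20, 30), (25, 15))
```

In `two_modes` both candidate locations remain visible as separate peaks, and the weaker one can still win later if other evidence, such as neighbouring frames, supports it.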
Warping neighbouring heatmaps for improving pose estimates
• Heatmaps from frames (t − n) and (t + n) warped to frame t
using tracks from optical flow (green & blue lines) can help
refine the wrongly estimated part location
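The warping step can be sketched with a backward warp: for every pixel of frame t, look up the heatmap value at the position the flow points to in the neighbouring frame, then combine the aligned heatmaps. Nearest-neighbour sampling and a plain average are simplifications chosen for brevity, not the pooling used in the paper, and the flow convention is an assumption stated in the docstring.

```python
import numpy as np

def warp_heatmap(heatmap, flow):
    """Backward-warp a neighbouring frame's heatmap into frame t.

    flow[y, x] = (dx, dy) is assumed to point from pixel (x, y) in frame t
    to its matching position in the neighbouring frame.
    """
    h, w = heatmap.shape
    ys, xs = np.mgrid[0:h, 0:w]
    src_x = np.clip(np.rint(xs + flow[..., 0]).astype(int), 0, w - 1)
    src_y = np.clip(np.rint(ys + flow[..., 1]).astype(int), 0, h - 1)
    return heatmap[src_y, src_x]

def pool_heatmaps(current, neighbours, flows):
    """Average the current heatmap with the flow-aligned neighbouring heatmaps."""
    warped = [warp_heatmap(hm, fl) for hm, fl in zip(neighbours, flows)]
    return np.mean([current] + warped, axis=0)
```

If the current frame's heatmap is weak or wrong for a joint, confident detections in frames t − n and t + n contribute at the correct, motion-compensated location.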
Results
Ongoing Project
• End-to-end pose estimation
– Joint learning of pose features and pose configurations
– Allow local appearance to be fine-tuned by pose configuration
[Figure: unary response and pairwise relationships]
Ongoing Project
Add constraints between body parts in a network
[Figure: network diagram with unary responses, distance transforms and pairwise relationships across Part p−1, Part p−2 and Part p−3; nodes $x_{t-1}, x_t, x_{t+1}$ and $z_{t-1}, z_t, z_{t+1}$; weights $w_{dt}$ and $w_m$]
Preliminary Results (PCP on LSP)
Head  Torso  U.arms  L.arms  U.legs  L.legs  Mean
84.7  91     68.7    53.6    80.7    73.3    72.82
• Future work
– Pose relational graph learning
– Multi-task learning
• Human detection
• Human segmentation
– Combining global information
Recent developments
• DeepPose: Human Pose Estimation via Deep Neural Networks
– A. Toshev, C. Szegedy. CVPR, 2014
• Joint Training of a Convolutional Network and a Graphical Model for Human Pose Estimation
– J. J. Tompson, A. Jain, Y. LeCun, C. Bregler. NIPS, 2014
• Human Pose Estimation with Iterative Error Feedback
– J. Carreira et al. arXiv preprint arXiv:1507.06550, 2015
• Maximum-Margin Structured Learning with Deep Networks for 3D Human Pose Estimation
– S. Li, W. Zhang, A. B. Chan. arXiv preprint arXiv:1508.06708, 2015
• Heterogeneous Multi-task Learning for Human Pose Estimation with Deep Convolutional Neural Network
– S. Li, Z. Q. Liu, A. B. Chan. CVPR Workshop, 2014
• Flowing ConvNets for Human Pose Estimation in Videos
– T. Pfister, J. Charles, A. Zisserman. ICCV, 2015
• R-CNNs for Pose Estimation and Action Detection
– G. Gkioxari, B. Hariharan, R. Girshick, J. Malik. arXiv preprint arXiv:1406.5212, 2014
• MoDeep: A Deep Learning Framework Using Motion Features for Human Pose Estimation
– A. Jain, J. Tompson, Y. LeCun, C. Bregler. ACCV, 2014
• Efficient Object Localization Using Convolutional Networks
– J. Tompson, R. Goroshin, A. Jain, Y. LeCun, C. Bregler. CVPR, 2015
• Combining Local Appearance and Holistic View: Dual-Source Deep Neural Networks for Human Pose Estimation
– X. Fan, K. Zheng, Y. Lin, S. Wang. CVPR, 2015
• Parsing Occluded People by Flexible Compositions
– X. Chen, A. L. Yuille. CVPR, 2015
• Articulated Pose Estimation by a Graphical Model with Image Dependent Pairwise Relations
– X. Chen, A. L. Yuille. NIPS, 2014
• …
Thank you
Human Pose Estimation by Deep Learning
Wei Yang
IVP Lab, CUHK
September 11, 2015
Evaluation Metrics
• Percentage of Correct Parts (PCP)
– Measures the percentage of correctly localized body parts.
– A candidate body part is treated as correct if its segment endpoints lie within 50% of the length of the ground-truth segment from their annotated locations.
• Percentage of Detected Joints (PDJ)
– Measures performance as a curve: the percentage of correctly localized joints as the localization precision threshold varies. The threshold is normalized by a scale defined as the distance between the left shoulder and the right hip
– Invariant to scale
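Both metrics can be sketched in a few lines of NumPy. The PCP variant below requires both endpoints to be within the threshold, which is one common reading of the definition; array shapes and the function names `pcp` and `pdj` are assumptions for the illustration.

```python
import numpy as np

def pcp(pred_a, pred_b, gt_a, gt_b, alpha=0.5):
    """Fraction of parts whose two predicted endpoints both lie within
    alpha * (ground-truth part length) of the annotated endpoints.
    All inputs have shape (P, 2): one endpoint per part."""
    pred_a, pred_b, gt_a, gt_b = (np.asarray(v, dtype=float) for v in (pred_a, pred_b, gt_a, gt_b))
    length = np.linalg.norm(gt_a - gt_b, axis=-1)
    ok_a = np.linalg.norm(pred_a - gt_a, axis=-1) <= alpha * length
    ok_b = np.linalg.norm(pred_b - gt_b, axis=-1) <= alpha * length
    return float(np.mean(ok_a & ok_b))

def pdj(pred, gt, torso_size, thresholds):
    """Detection rate per threshold. pred, gt: (N, J, 2) joint locations;
    torso_size: (N,) distance between left shoulder and right hip."""
    pred, gt = np.asarray(pred, dtype=float), np.asarray(gt, dtype=float)
    torso_size = np.asarray(torso_size, dtype=float)
    err = np.linalg.norm(pred - gt, axis=-1) / torso_size[:, None]
    return np.array([np.mean(err <= t) for t in thresholds])
```

Dividing the error by the torso size is what makes PDJ invariant to the scale of the person in the image.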


Editor's Notes

  • #2: Good afternoon everyone. Welcome to the first IVP seminar of this term. I’m YANG Wei. In the last two seminars, Xingyu and Chu Xiao gave us a comprehensive overview of object detection as well as traditional human pose estimation approaches. In this talk, I will continue the discussion with recent developments in human pose estimation based on powerful deep learning methods. I hope you can benefit from these methods.
  • #3: First, we will briefly review the problem of human pose estimation. We will also go over the traditional approaches for pose estimation, which were discussed in the seminar given by Chu Xiao. Then we will spend most of the time discussing several important approaches based on deep learning techniques, from both the global view and the local view.
  • #4: According to Wikipedia, the goal of articulated pose estimation is to recover the joint positions of articulated limbs, as we show here for a man playing baseball.
  • #5: There are lots of applications where being able to estimate human pose is useful. For example, pose estimation is helpful for recognizing actions. It also helps to parse clothing in fashion photographs. Recently, pose estimation has been successfully applied in human tracking and gaming systems.
  • #6: However, in unconstrained images, human pose estimation can be a very hard problem because people can appear with a variety of poses, clothing, and body shapes. In the slides, you can see some very interesting and unusual examples that demonstrate how flexible the human pose is.
  • #7: Traditional approaches for human pose estimation model the human as a set of parts, such as a head, torso, arm, and leg part. In 3D, these parts can be modeled as cylinders. Pictorial structures use 2D part models, where geometric relations between parts are encoded by springs. However, capturing the whole range of appearances using pictorial structures is still quite difficult. A big problem is that even projections of a simple cylinder into 2D yields many different appearances. So one usually has to explicitly evaluate many different possible in-plane orientations and foreshortenings in order to find a good match for a part template. Yang propose mini parts to approximate these transformations. in this case the mini-parts are tuned to represent near-vertical and near horizontal limbs.
  • #8: With the fast development of deep learning, several pose estimation methods based on deep learning techniques have been proposed in the last two years. Some are based on a holistic view (global view), e.g., directly regressing body joint locations. Some are based on local appearance. Some combine the global view and the local view in a unified framework and achieve state-of-the-art results. Finally, we will also discuss some pioneering work on pose estimation in videos.
  • #10: For example, in the left image, we can guess the location of the right arm only because we see the rest of the pose and anticipate the activity of the person. Similarly, in the right image, the left half of the person's body is not visible at all. Since deep neural networks (DNNs) can model very complex relationships, the authors believe that a DNN can provide holistic reasoning.
  • #11: The initial stage of DeepPose is quite straightforward. It trains a DNN to regress the locations of all the body joints given an input image. DeepPose adopts AlexNet as the basic network structure. This structure was proposed in 2012. It won the ImageNet competition by a large margin, and it was the first time that a deep model was shown to be effective on a large-scale computer vision task.
  • #12: These are the visualized results on the LSP dataset. We can see that this method has limitations on parts that require high-precision localization, such as lower arms and lower legs. It is worth mentioning that this method is very fast, since predictions can be obtained by batched forward propagation.
  • #13: The pose estimation results from the initial stage are very coarse, especially for parts that require high-precision localization. One possible reason is that, because the input size is fixed at 220 by 220, the network has limited capacity to look at details. To refine the coarse regression results, the authors further train a cascade of pose regressors for more precise joint localization.
  • #14: Given the predicted joint locations from the previous stage, we first crop image patches centered at the predicted locations, and then train a DNN-based regressor to refine the respective locations. This process can be repeated several times. It helps to refine the coarse predictions because the network can see higher-resolution regions.
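The crop-and-refine loop described above can be sketched as follows. This is a minimal illustration, not the DeepPose implementation: `crop_patch`, `refine_joints`, and the toy `brightest_pixel_refiner` are hypothetical names, the image is a single-channel array, and the toy refiner merely stands in for the trained DNN regressor of each stage.

```python
import numpy as np

def crop_patch(image, center, size):
    """Crop a size x size patch centered at (x, y), zero-padding at the borders."""
    half = size // 2
    h, w = image.shape
    padded = np.pad(image, ((half, half), (half, half)), mode="constant")
    x = int(np.clip(round(center[0]), 0, w - 1))
    y = int(np.clip(round(center[1]), 0, h - 1))
    # After padding, original pixel (y, x) sits at (y + half, x + half),
    # so the patch centered on it starts at (y, x) in the padded image.
    return padded[y:y + size, x:x + size]

def brightest_pixel_refiner(patch):
    """Toy stand-in for a stage regressor: returns a (dx, dy) displacement."""
    half = patch.shape[0] // 2
    row, col = np.unravel_index(np.argmax(patch), patch.shape)
    return np.array([col - half, row - half], dtype=float)

def refine_joints(image, joints, refiner, size=9, stages=2):
    """Repeatedly crop around each predicted joint and apply a displacement."""
    joints = np.array(joints, dtype=float)
    h, w = image.shape
    for _ in range(stages):
        for k in range(len(joints)):
            patch = crop_patch(image, joints[k], size)
            joints[k] += refiner(patch)
            joints[k] = np.clip(joints[k], [0, 0], [w - 1, h - 1])
    return joints

# Usage: a coarse guess near a bright "joint" at x=12, y=20 is pulled onto it.
image = np.zeros((32, 32))
image[20, 12] = 1.0
refined = refine_joints(image, [[10, 18]], brightest_pixel_refiner)
```

In a real cascade each stage would use its own trained network, and the crop size would typically depend on the body scale rather than being a fixed constant.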
  • #15: The ground truths are in green and the predicted poses are in red. We can see that the initial stage usually succeeds at estimating a roughly correct pose. However, the results are not precise enough. After one stage of refinement, the results are much more accurate.
  • #18: We observe that local image patches are not only able to capture part presence, but are also able to reason about pairwise spatial relationships. For example, the patch centered at the wrist can predict the relative position of the elbow; the patch centered at the elbow can reliably predict the positions of the shoulder and wrist. We use a mixture model to define different types of spatial relationships. The right panel shows typical spatial relationships the wrist can have with its neighbor, the elbow. The left panel shows the typical spatial relationships the elbow can have with its two neighbors, namely the shoulder and wrist.
  • #19: Based on this observation, we can define human pose as a tree-structured graph, where each node denotes the position of a part, and the edges denote the pairwise spatial relationships.
  • #20: We define the score function of part locations p and pairwise relation types t. It is computed by summing the unary appearance term and the pairwise relationship term. The unary term is the part presence map, indicating the probability that part i appears at each location of the image. The pairwise term consists of two parts. The first part is the pairwise relationship map, and the second part is the deformation cost. Theta are the parameters, which are learned by the CNN. Inference is to find the positions and mixture types that maximize this score. As the relational graph is tree-structured, it can be solved efficiently by dynamic programming.
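The score function described above can be written out roughly as follows. The notation is an assumption made here for illustration, since the slide's own formula is not reproduced in these notes: V and E are the nodes and edges of the tree, p_i is the location of part i, t_{ij} is the relation type of part i with respect to its neighbor j, U is the part presence term, R is the pairwise relationship map term, and D is a non-negative deformation cost.

```latex
F(p, t \mid I; \theta)
  = \sum_{i \in V} U(p_i \mid I; \theta)
  + \sum_{(i,j) \in E} \Big[ R(p_i, t_{ij} \mid I; \theta)
                           + R(p_j, t_{ji} \mid I; \theta)
                           - D(p_i, p_j, t_{ij}, t_{ji}) \Big],
\qquad
(p^*, t^*) = \arg\max_{p,\, t} F(p, t \mid I; \theta)
```

Because every pairwise term touches only the two ends of one tree edge, the maximization decomposes edge by edge, which is what makes dynamic programming applicable.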
  • #21: Here we talk about how to learn theta. Given an image, we want to produce a score map that indicates the probability of each specific type. This is done by learning a multi-class classifier on local image patches. First, we need to derive a type label for each patch.
  • #22: Then we use two convolutional layers with 1 by 1 kernels to replace the original fully connected layers. The network then becomes a fully convolutional network that can perform convolutions on an input image of arbitrary size, and the output is the score map for each type, as we want. Then we can easily compute the part presence map and the pairwise relationship maps, as this figure illustrates. For example, to compute the part presence map of the elbow, we just add together all the score maps associated with elbow-to-shoulder and elbow-to-wrist types. To compute the pairwise relationship maps, we need to perform marginalization.
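The summation and marginalization steps above can be sketched with NumPy. The tensor layout is an assumption for illustration: for one part with two neighbors (for example the elbow), `score_maps[a, b]` is taken to be the score map for the combination of relation type a toward the first neighbor and type b toward the second. The function names are hypothetical, and normalizing the marginal by the presence map is one possible choice rather than something stated in the notes.

```python
import numpy as np

def part_presence(score_maps):
    """Sum the score maps of every type combination of this part.

    score_maps has shape (T_first, T_second, H, W); the result is (H, W).
    """
    return score_maps.sum(axis=(0, 1))

def pairwise_type_map(score_maps, neighbor_axis):
    """Marginalize out the other neighbor's types.

    Returns an array of shape (T_neighbor, H, W) that sums to 1 over the
    type axis wherever the part has non-zero presence.
    """
    other_axis = 1 - neighbor_axis
    marginal = score_maps.sum(axis=other_axis)
    presence = marginal.sum(axis=0, keepdims=True)
    return marginal / np.maximum(presence, 1e-12)

# Usage: 3 elbow-to-shoulder types, 4 elbow-to-wrist types, on a 5 x 6 grid.
rng = np.random.default_rng(0)
elbow_maps = rng.random((3, 4, 5, 6))
elbow_presence = part_presence(elbow_maps)
elbow_to_wrist = pairwise_type_map(elbow_maps, neighbor_axis=1)
```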
  • #23: Here are
  • #24: As we discussed before, both global and local methods have merits and drawbacks for human pose estimation. Hence, at this year's CVPR, a paper combining both local appearance and the holistic view was proposed.
  • #25: In this paper, the authors train a network from dual sources, which is to say that each input is an image pair. One image is the part patch, which incorporates local appearance information. The other is the body patch, which incorporates the global context information. The authors hope that this combination will result in more accurate human pose estimation.
  • #26: The authors first use objectness methods to generate many category-independent object proposals, shown as the boxes in the image. Then the part patches are selected under some restrictions. First, the region cannot be too small: it must contain a whole body part. Second, the proposed region cannot be too big either. Because all patches are warped to the same size as the input of the network, regions that are too large lack sufficient resolution. Moreover, for efficient training, all classes of joints are covered by a similar number of part patches. During testing, part patches are selected from multi-scale sliding windows.
  • #27: Body patches are also selected from region proposals. The region must cover all the body parts. In the testing stage, these regions can be generated by human detection. The binary mask is concatenated with the body patch as an additional alpha channel.
  • #28: During training, both the part patch and the body patch are fed into a two-branch CNN. The local part branch predicts the label of the part patch. This is a classification problem, and it is trained using the softmax loss function. The global branch predicts the x, y coordinates given the body patch and the corresponding part mask. This is a regression problem, and it is trained using the Euclidean loss function. Note that the structures of the two branches are the same, hence the weights are shared except for the last layer.
  • #29: In the test stage, a heat map is generated for each part. The heat map has the same size as the input image. First, the part patches are obtained by the sliding window method. Then the trained network is used to predict the probability of each part label for each patch. All pixels within a patch receive the same probability. Finally, we sum and average over all pixels to get the final heat map.
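The heat map construction above can be sketched as follows, for a single part. This is one reading of "sum and average", assumed here for illustration: every pixel inside a window receives that window's probability, and overlapping windows are averaged per pixel. The function name and the `(x0, y0, x1, y1)` window format (end-exclusive) are hypothetical.

```python
import numpy as np

def heat_map_from_windows(shape, windows, probs):
    """Spread each window's part probability over its pixels and average overlaps."""
    total = np.zeros(shape, dtype=float)
    count = np.zeros(shape, dtype=float)
    for (x0, y0, x1, y1), p in zip(windows, probs):
        total[y0:y1, x0:x1] += p
        count[y0:y1, x0:x1] += 1
    # Pixels covered by no window keep a value of 0.
    return np.divide(total, count, out=np.zeros(shape, dtype=float), where=count > 0)

# Usage: two overlapping 2 x 2 windows on a 4 x 4 image.
windows = [(0, 0, 2, 2), (1, 1, 3, 3)]
probs = [0.2, 0.8]
heat = heat_map_from_windows((4, 4), windows, probs)
```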
  • #30: While the heat map provides a rough estimate of the joint location, it is insufficient to accurately localize the body joints. Remember that the global branch predicts the accurate joint location within a given patch. Hence, for a specific part, we select the part patches with high probability and compute the weighted sum of their predicted joint locations to get the final joint location.
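The final fusion step above can be sketched as a probability-weighted average. The selection rule (keeping the `top_k` most probable patches) and the function name are assumptions for illustration; the notes only say that patches with high probability are selected.

```python
import numpy as np

def fuse_joint_location(pred_xy, probs, top_k=3):
    """Weighted sum of per-patch joint predictions, using the top_k most probable patches."""
    pred_xy = np.asarray(pred_xy, dtype=float)
    probs = np.asarray(probs, dtype=float)
    keep = np.argsort(probs)[::-1][:top_k]      # indices of the most probable patches
    weights = probs[keep] / probs[keep].sum()   # normalize so the weights sum to 1
    return (weights[:, None] * pred_xy[keep]).sum(axis=0)

# Usage: two confident patches agree; a low-probability outlier is ignored.
predictions = [[10.0, 10.0], [12.0, 10.0], [100.0, 100.0]]
patch_probs = [0.5, 0.5, 0.01]
joint = fuse_joint_location(predictions, patch_probs, top_k=2)
```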
  • #31: Here are the PCP values on the LSP dataset. We can see that this method improves the performance by a large margin.
  • #32: OK. After discussing methods from the local and global views, let's discuss some applications of pose estimation in videos.
  • #33: We all know that motion is a powerful…. This figure illustrates the optical flow. The left side is the average of two adjacent frames. The right side is the estimated optical flow. We can see that the background can be greatly suppressed by the motion feature, which is a great help for pose estimation.
  • #34: Here, a method called MoDeep tries to incorporate motion features to improve human pose estimation. This method extends the FLIC dataset with additional motion features, as shown in the figure.
  • #35: Then it trains a multi-resolution convolutional network to predict the heat map for each body part, with the additional motion features as input.
  • #36: FLIC images can contain multiple people, but only one person is annotated. In the testing stage, the torso box can therefore be used to determine whose pose is to be estimated. This method computes a spatial mask for each part with respect to the torso box. This mask helps suppress false positives.
  • #37: Here are some experimental results. The first row shows the estimated poses without motion features, and the second row shows those with motion features. We can see that motion can greatly improve the results under occlusion, cluttered backgrounds, and motion blur.
  • #38: A very similar work also uses optical flow to track human pose in videos. This work was published at this year's ICCV. It first uses a CNN to predict heat maps for each body part in each frame. Then, for the t-th frame, it computes the optical flow of frames t-n to t+n with respect to the t-th frame. The heat maps are then warped to the t-th frame according to the optical flow. Finally, the authors use a 1 by 1 convolutional layer to combine all the heat maps.
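The warp-and-combine step above can be sketched with NumPy for a single part. Several things here are assumptions for illustration rather than details from the paper: the flow is taken to map each pixel of frame t to its position in the neighboring frame, the warp uses nearest-neighbor sampling with edge clamping, and the 1 by 1 convolution over stacked frames is written as a per-pixel weighted sum with hand-picked weights instead of learned ones. The function names are hypothetical.

```python
import numpy as np

def warp_heatmap(heat, flow):
    """Warp a neighboring frame's heat map into frame t.

    flow[y, x] = (dx, dy) points from pixel (x, y) of frame t to the
    matching pixel in the neighboring frame.
    """
    h, w = heat.shape
    ys, xs = np.mgrid[0:h, 0:w]
    src_x = np.clip(np.rint(xs + flow[..., 0]).astype(int), 0, w - 1)
    src_y = np.clip(np.rint(ys + flow[..., 1]).astype(int), 0, h - 1)
    return heat[src_y, src_x]

def combine_1x1(warped_stack, weights):
    """A 1 x 1 convolution across frames: per-pixel weighted sum of the stack."""
    return np.tensordot(np.asarray(weights, dtype=float), warped_stack, axes=1)

# Usage: the joint is at x=2 in frame t and at x=3 in the neighboring frame.
current = np.zeros((5, 5))
current[2, 2] = 1.0
neighbor = np.zeros((5, 5))
neighbor[2, 3] = 1.0
flow = np.zeros((5, 5, 2))
flow[..., 0] = 1.0  # everything moved one pixel to the right
warped = warp_heatmap(neighbor, flow)
pooled = combine_1x1(np.stack([current, warped]), [0.5, 0.5])
```

After warping, the neighbor's peak lines up with the current frame's peak, so the pooled map reinforces the same location.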
  • #39: Here is an illustration of the network producing part heat maps. The authors discussed why….
  • #42: Finally, I want to give a brief introduction to my ongoing project. As we have discussed before, most pose estimation frameworks are not end-to-end. They often learn pose features first, and then fix the features to optimize a relationship model. In my work, I design an end-to-end pose estimation framework. It can be viewed as a feature extraction part plus a deformable part model. However, we plug the deformable part model into the network, so the parameters of both parts can be learned jointly.
  • #43: Here is an illustration of the deformable part model. Since the relation graphs of human pose are often tree-structured, we can use a message passing method for efficient inference. This is very similar to a traditional recurrent neural network. Here each time step denotes a part. The message is passed from the leaves to the root. The deformation weights are shared across different parts. To learn the parameters, we can use backpropagation to learn the deformation weights and the weights of the convolutional and fully connected layers jointly.
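Leaf-to-root message passing on a tree can be sketched as max-sum dynamic programming. This is a generic illustration, not the project's actual model: locations are one-dimensional for brevity, the deformation cost is a quadratic penalty with a single shared weight, and the function name and the convention that `parent[i] < i` (with part 0 as the root) are assumptions.

```python
import numpy as np

def max_sum_tree(unary, parent, deform_weight):
    """Find the best location for every part of a tree-structured model.

    unary: (num_parts, num_locations) appearance scores.
    parent: parent index of each part, -1 for the root (part 0), parent[i] < i.
    deform_weight: shared weight of the quadratic deformation penalty.
    """
    n, num_locs = unary.shape
    locs = np.arange(num_locs)
    # deform[a, b]: penalty when the parent is at a and the child is at b.
    deform = -deform_weight * (locs[:, None] - locs[None, :]) ** 2
    belief = unary.astype(float).copy()
    best_child_loc = np.zeros((n, num_locs), dtype=int)
    for i in range(n - 1, 0, -1):                 # leaves first, root last
        scores = deform + belief[i][None, :]      # indexed (parent_loc, child_loc)
        best_child_loc[i] = scores.argmax(axis=1)
        belief[parent[i]] += scores.max(axis=1)   # message sent to the parent
    pose = np.zeros(n, dtype=int)
    pose[0] = belief[0].argmax()
    for i in range(1, n):                         # backtrack from root to leaves
        pose[i] = best_child_loc[i][pose[parent[i]]]
    return pose, belief[0].max()

# Usage: a root with two children on a line of 4 candidate locations.
unary = np.array([[0.0, 0.0, 5.0, 0.0],
                  [4.0, 0.0, 0.0, 0.0],
                  [0.0, 0.0, 0.0, 3.0]])
parent = [-1, 0, 0]
pose, score = max_sum_tree(unary, parent, deform_weight=1.0)
```

Every operation in the forward pass is a sum or a max, so in principle gradients can flow through it, which is what allows such a layer to be trained jointly with the feature extractor.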
  • #44: Preliminary experiments show that the proposed method outperforms most of the traditional approaches. However, it is still not better than recent deep learning methods. In the future, we plan to learn the pose relational graph from the dataset. Meanwhile, pose estimation may benefit from related tasks such as human detection and human segmentation. Finally, we need to figure out how to incorporate global information into this framework.