Human Pose Estimation by Deep Learning

Wei Yang
Supervisors: Prof. WANG Xiaogang, Prof. OUYANG Wanli
IVP Lab, CUHK
September 11, 2015

Outline
• Introduction
• Traditional Approaches
• Deep Learning Methods
– Global view (holistic view)
– Local appearance
– Combination of local appearance and global view
– Others

Introduction
• What is articulated body pose estimation?
“recovers the pose of an articulated body, which consists of joints and rigid parts
using image-based observations.”

Applications
• Action recognition
• Clothing parsing
• Gaming
• Human tracking

Traditional Approaches
• Pictorial structures (Fischler & Elschlager 1973; Felzenszwalb & Huttenlocher 2005)
– Unary templates
– Pairwise springs
• Mixtures of “mini-parts” (Yang & Ramanan 2011)
– Mixture of part $i$
– Unary template for part $i$ with mixture $m_i$
– Pairwise springs between part $i$ with mixture $m_i$ and part $j$ with mixture $m_j$
– Example of mini-parts: near-vertical and near-horizontal limbs
[Figure: pictorial structure with head, torso, and leg parts]

Deep Learning for Pose Estimation
• Holistic View
– e.g., joint position regression
• Local View
– e.g., body part detection
• Combining global and local information
– e.g., body part detection + joint position regression
• Others
– e.g., motion features, pose estimation in videos

Holistic View
DeepPose: Human Pose Estimation via Deep Neural Networks

Holistic Reasoning
• Why holistic reasoning?
– Besides the extreme variability in articulation, many joints are barely visible, so they cannot be localized from local evidence alone

DeepPose: A CNN Regressor
• Network architecture: AlexNet
– Krizhevsky, Sutskever, and Hinton, NIPS 2012 (ImageNet)
– The first time a deep model was shown to be effective at large scale
[Toshev & Szegedy, CVPR 2014]

Results on LSP (Leeds Sports Pose) dataset

Cascade of Pose Regressors
• The initial pose estimates are coarse:
– Because of the fixed 220 × 220 input size, the network has limited capacity to look at detail
• Solution: train a cascade of pose regressors for more precise joint localization
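The cascade idea can be sketched in a few lines: each stage crops a small window around the current estimate of every joint and re-predicts that joint inside the crop, where more detail is visible. This is only an illustrative sketch, not the DeepPose implementation. The names `crop_around`, `refine_pose`, and `brightest_pixel` are hypothetical, and the stage "regressor" is a toy stand-in (it returns the brightest pixel of the crop) where the real method uses a CNN.

```python
import numpy as np

def crop_around(image, center, half):
    """Crop a (2*half+1)-sized window around center=(x, y), clipped to the image."""
    h, w = image.shape[:2]
    cx, cy = int(round(center[0])), int(round(center[1]))
    x0, x1 = max(cx - half, 0), min(cx + half + 1, w)
    y0, y1 = max(cy - half, 0), min(cy + half + 1, h)
    return image[y0:y1, x0:x1], (x0, y0)

def refine_pose(image, joints, stage_regressors, half=8):
    """Run the refinement stages; each stage re-predicts every joint inside
    a crop centred on the previous estimate of that joint."""
    joints = np.asarray(joints, dtype=float).copy()
    for regress in stage_regressors:
        for k in range(len(joints)):
            patch, (x0, y0) = crop_around(image, joints[k], half)
            px, py = regress(patch, k)      # prediction in patch coordinates
            joints[k] = (x0 + px, y0 + py)  # map back to image coordinates
    return joints

def brightest_pixel(patch, joint_index):
    """Toy stand-in for a stage regressor (an assumption, not a CNN)."""
    iy, ix = np.unravel_index(np.argmax(patch), patch.shape)
    return float(ix), float(iy)
```

Because every stage sees a crop at higher effective resolution than the full 220 × 220 input, later stages can correct small errors left by the first, coarse prediction.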

Cascade of Pose Regressors

Refined pose estimation

Percentage of Correct Parts (PCP) on LSP dataset

Local Appearance Method
Articulated Pose Estimation by a Graphical Model with Image Dependent Pairwise Relations

Motivation
• Local image patches are able to capture:
– Part presence
– Pairwise part spatial relationships
• Number of mixture types for each pair: 6
– 1 neighbor: $6^1 = 6$ relationships
– 2 neighbors: $6^2 = 36$ relationships
[Figure: image patches of the lower arm and upper arm]
[Chen & Yuille, NIPS 2014]

Tree-structured Relational Graph
• $T = (V, E)$
– $V$: body parts
– $E$: pairwise relationships between parts
• $\mathbf{p} = \{p_i\}$, $p_i = (x_i, y_i)$
– $p_i$: pixel location of part $i$
• $\mathbf{t} = \{t_{ij}, t_{ji} \mid (i, j) \in E\}$
– Pairwise relationship types
– Defined by relative position
– $t_{ij} \in \{1, \dots, T_{ij}\}$
– In the experiments: 13 types for each pair $(i, j) \in E$

Formulation
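$$F(\mathbf{p}, \mathbf{t} \mid I; \boldsymbol{\omega}, \theta) = \sum_{i \in V} A_i(p_i \mid I; \theta) + \sum_{(i,j) \in E} R(p_i, p_j, t_{ij}, t_{ji} \mid I; \theta)$$
• $A_i$: part presence term, weighted by $\omega_i$
• $R$: pairwise term, made up of
– a pairwise deformation term, weighted by $\boldsymbol{\omega}_{ij}^{t_{ij}}$
– a pairwise relationship term, weighted by $\omega_{ij}$
• Inference: $(\mathbf{p}^*, \mathbf{t}^*) = \arg\max_{\mathbf{p}, \mathbf{t}} F(\mathbf{p}, \mathbf{t} \mid I; \boldsymbol{\omega}, \theta)$
– Tree structure
– Can be solved efficiently by dynamic programming
• $\omega_i$, $\omega_{ij}$, $\boldsymbol{\omega}_{ij}^{t_{ij}}$ are learned by a latent structural SVM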
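To make the dynamic programming step concrete, here is a minimal max-sum sketch on a tree. It is a simplification under stated assumptions, not the authors' code: each part has a small discrete set of candidate states (a state could encode a location together with its relationship types), the unary and pairwise scores are given as precomputed tables, and parts are numbered so that every parent index is smaller than its children's. The function name `tree_max_sum` is hypothetical.

```python
import numpy as np

def tree_max_sum(unary, pairwise, parent):
    """Exact max-sum inference on a tree.

    unary[i]         : array (S_i,) of unary scores for part i
    pairwise[(p, c)] : array (S_p, S_c) of scores for parent p and child c
    parent[i]        : parent index of part i (part 0 is the root);
                       assumes parent[i] < i for all i > 0
    Returns the best total score and the best state for every part.
    """
    n = len(unary)
    subtree = [np.asarray(u, dtype=float).copy() for u in unary]
    best_child_state = {}
    # Leaves-to-root pass: fold every child's best subtree score into its parent.
    for c in range(n - 1, 0, -1):
        p = parent[c]
        total = np.asarray(pairwise[(p, c)], dtype=float) + subtree[c][None, :]
        best_child_state[c] = total.argmax(axis=1)
        subtree[p] = subtree[p] + total.max(axis=1)
    # Root-to-leaves pass: read off the maximizing states.
    states = [0] * n
    states[0] = int(subtree[0].argmax())
    for c in range(1, n):
        states[c] = int(best_child_state[c][states[parent[c]]])
    return float(subtree[0].max()), states
```

The cost is linear in the number of parts and quadratic in the number of states per part, which is why the tree structure matters: a graph with loops would not admit this exact two-pass solution.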

Learning DCNN parameters 𝜃
Derive the type label for each patch
• Use the relative position $d_{ij}$ to represent the pairwise relations
• Cluster the relative positions over the whole training set $\{d_{ij}^{n}\}_{n=1}^{N}$
• Type label $t_{ij}^{n}$: cluster index
• Mean relative position $r_{ij}^{t_{ij}}$: cluster center
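A small sketch of how type labels could be derived by clustering relative positions. The slide only says "cluster"; plain k-means with caller-supplied initial centers is an assumption made here for illustration, and the function name `cluster_relative_positions` is hypothetical.

```python
import numpy as np

def cluster_relative_positions(offsets, init_centers, iters=20):
    """Plain k-means on 2-D relative positions d_ij (one row per training sample).

    Returns (labels, centers): labels[n] plays the role of the type label
    t_ij^n, and centers[k] the mean relative position r_ij^k.
    """
    offsets = np.asarray(offsets, dtype=float)
    centers = np.asarray(init_centers, dtype=float).copy()
    for _ in range(iters):
        # Assign every offset to its nearest center.
        dists = np.linalg.norm(offsets[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Move every non-empty center to the mean of its members.
        for k in range(len(centers)):
            members = offsets[labels == k]
            if len(members) > 0:
                centers[k] = members.mean(axis=0)
    # Final assignment against the converged centers.
    dists = np.linalg.norm(offsets[:, None, :] - centers[None, :, :], axis=2)
    return dists.argmin(axis=1), centers
```

The cluster index then serves as a classification target for the network, so each patch is labeled with both the part it contains and the type of its relation to each neighbor.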

Casting Fully Connected Layers into Convolutions
[Figure: for the elbow, the network outputs a part presence map and a pairwise relationship map]
PCP and PDJ on LSP dataset and FLIC dataset
PCP on LSP:
Method          Torso   Head   U.Leg   L.Leg   U.Arm   L.Arm   Mean PCP
DCNN            92.5    85.1   82.7    76.3    70.2    55.9    74.8
Ouyang et al.   85.8    83.1   76.5    72.2    63.3    46.6    68.6
[Figure: PDJ plots, panels: LSP, FLIC]

Combining Local Appearance and Holistic View
Dual-Source Deep Neural Networks for Human Pose Estimation

Dual-Source CNN
• Integrate both the local part appearance and the holistic view
of each local part for more accurate human pose estimation
• Each input is an image pair
– Part patches
– Body patches

Part patches: incorporate local appearance
• Generated by region proposals with some restrictions
– Not too small (must contain at least one body part)
– Not too big (would contain too many body parts and lack sufficient resolution)
• All classes of joints are covered by a similar number of part patches
• During testing, part patches are selected from multi-scale sliding windows

Body patches: holistic view
• Also from region proposals
– Must cover all body parts
– At test time, the body patch can be generated by human detection
• For DS-CNN, each training sample is made up of three components
– A part patch
– A body patch
– A binary mask specifying the location of the part patch in the body patch

Training of the DS-CNN
[Figure: DS-CNN architecture with shared weights and two outputs: classification (softmax) and regression (L2 distance)]

Testing
• Part heat map
– Same size as the input image
– The probability of each sliding window is distributed uniformly over its pixels
– Contributions are summed and averaged at each pixel
[Figure: example part heat map]
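One possible reading of the heat map construction, written as a sketch. The exact normalization is not given on the slide, so two choices here are assumptions: each window spreads its probability evenly over its own area, and each pixel is then averaged over the windows that cover it. The function name `part_heat_map` and the `(x0, y0, x1, y1)` window format are hypothetical.

```python
import numpy as np

def part_heat_map(image_shape, windows, probs):
    """Accumulate sliding-window probabilities into a per-pixel heat map.

    image_shape : (H, W) of the input image
    windows     : list of (x0, y0, x1, y1) boxes, end-exclusive
    probs       : probability of the part for each window
    """
    heat = np.zeros(image_shape, dtype=float)
    count = np.zeros(image_shape, dtype=float)
    for (x0, y0, x1, y1), p in zip(windows, probs):
        area = (y1 - y0) * (x1 - x0)
        heat[y0:y1, x0:x1] += p / area   # uniform share of the window's probability
        count[y0:y1, x0:x1] += 1
    # Average over covering windows; pixels covered by no window stay at zero.
    return heat / np.maximum(count, 1)
```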

Testing
• Final pose estimation
– Weighted average of predicted joint locations within part patches with high
responses.
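The final fusion step can be sketched as a score-weighted mean. How "high responses" are selected is not specified on the slide; keeping the `top_k` highest-scoring patches and weighting by the raw score are assumptions, and the function name `fuse_joint_predictions` is hypothetical.

```python
import numpy as np

def fuse_joint_predictions(locations, scores, top_k=3):
    """Weighted average of the joint locations predicted by the best patches.

    locations : array (N, 2), one predicted (x, y) per part patch
    scores    : array (N,), response of each patch for this joint
    """
    locations = np.asarray(locations, dtype=float)
    scores = np.asarray(scores, dtype=float)
    best = np.argsort(scores)[::-1][:top_k]   # indices of the strongest patches
    weights = scores[best]
    return (weights[:, None] * locations[best]).sum(axis=0) / weights.sum()
```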

Results: PCP on LSP

Other Methods & Applications
• MoDeep: A Deep Learning Framework Using Motion Features for Human Pose Estimation
• Flowing ConvNets for Human Pose Estimation in Videos

Using Motion Features for Human Pose Estimation
• Motion is a powerful visual cue that alone can be used to extract high-level information, including articulated pose.
Image credit: T. Brox, J. Malik. Large displacement optical flow: descriptor matching in variational motion estimation. IEEE TPAMI, 33(3): 500-513, 2011

MoDeep: Using Motion Features for Human Pose Estimation
• Extends the Frames Labeled In Cinema (FLIC) dataset with additional motion features
MoDeep: A Deep Learning Framework Using Motion Features for Human Pose Estimation. Jain et al., ACCV 2014
[Figure: average of a frame pair; optical flow]

Multi-resolution efficient sliding window model

Simple Spatial Model
• FLIC: images may contain multiple people, but only one person is annotated
• Testing: incorporate the annotated torso position through a simple spatial model
[Figure: predicted left shoulder; spatial mask of left shoulder; result]

Experiment results
[Figure: results without and with motion features under occlusion, cluttered background, and motion blur]

Flowing ConvNets for Human Pose Estimation in Videos
• A CNN can benefit from temporal context by combining information across multiple frames using optical flow.

Spatial ConvNet
Why regress a heatmap instead of joint coordinates?
• The network output can be multi-modal
• Regressing coordinates directly is a highly non-linear mapping that is more difficult to learn
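A brief sketch of what heatmap regression looks like on the target side. Using a Gaussian blob as the regression target is a common choice assumed here, not something stated on the slide, and the helper names `gaussian_heatmap` and `decode_heatmap` are hypothetical.

```python
import numpy as np

def gaussian_heatmap(height, width, cx, cy, sigma=1.5):
    """Target heatmap: a Gaussian blob centred on the joint at (cx, cy)."""
    ys, xs = np.mgrid[0:height, 0:width]
    return np.exp(-((xs - cx) ** 2 + (ys - cy) ** 2) / (2.0 * sigma ** 2))

def decode_heatmap(heatmap):
    """Read the joint estimate back as the (x, y) of the strongest response."""
    iy, ix = np.unravel_index(np.argmax(heatmap), heatmap.shape)
    return int(ix), int(iy)
```

A heatmap can hold confidence at several locations at once, for example when the left and right wrists are both plausible, and the final decision is deferred to the decoding step. A coordinate regressor has to commit to a single point and tends to average between the alternatives.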

Warping neighbouring heatmaps for improving pose estimates
• Heatmaps from frames (t − n) and (t + n), warped to frame t using tracks from optical flow (green and blue lines), can help correct wrongly estimated part locations
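The warping step can be illustrated with a nearest-neighbour backward warp. This is a sketch under assumptions: the flow field is taken as given (it would come from an optical flow method), `flow[y, x] = (dx, dy)` is assumed to say where pixel (x, y) of frame t is found in the neighbouring frame, and the warped maps are combined by a plain mean, which is a simplification chosen for this example. The function names are hypothetical.

```python
import numpy as np

def warp_heatmap(heatmap, flow):
    """Warp a neighbouring frame's heatmap into frame t.

    heatmap : array (H, W) from the neighbouring frame
    flow    : array (H, W, 2); flow[y, x] = (dx, dy) points from pixel (x, y)
              in frame t to its position in the neighbouring frame
    """
    h, w = heatmap.shape
    ys, xs = np.mgrid[0:h, 0:w]
    src_x = np.clip(np.rint(xs + flow[..., 0]).astype(int), 0, w - 1)
    src_y = np.clip(np.rint(ys + flow[..., 1]).astype(int), 0, h - 1)
    return heatmap[src_y, src_x]

def pool_heatmaps(current, warped_neighbours):
    """Combine the current heatmap with the warped neighbours by averaging."""
    return np.mean([current] + list(warped_neighbours), axis=0)
```

After pooling, a spurious peak that appears in only one frame is damped, while a peak supported by the aligned neighbours stays strong.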

Ongoing Project
• End-to-end pose estimation
– Joint learning of pose features and pose configurations
– Allows local appearance to be fine-tuned by pose configuration
[Figure: unary response and pairwise relationships]

Ongoing Project
• Add constraints between body parts in a network
[Figure: network diagram labeled with unary response, distance transform, and pairwise relationships; nodes $x_{t-1}, x_t, x_{t+1}$ and $z_{t-1}, z_t, z_{t+1}$ for parts $p-1$, $p-2$, $p-3$; weights $w_{dt}$ and $w_m$]

Preliminary Results (PCP on LSP)
Head    Torso   U.arms   L.arms   U.legs   L.legs   Mean
84.7    91.0    68.7     53.6     80.7     73.3     72.82
• Future work
– Pose relational graph learning
– Multi-task learning
  • Human detection
  • Human segmentation
– Combining global information

Recent developments
• DeepPose: Human Pose Estimation via Deep Neural Networks
– A. Toshev, C. Szegedy. CVPR, 2014
• Joint Training of a Convolutional Network and a Graphical Model for Human Pose Estimation
– J. J. Tompson, A. Jain, Y. LeCun, C. Bregler. NIPS, 2014
• Human Pose Estimation with Iterative Error Feedback
– J. Carreira et al. arXiv preprint arXiv:1507.06550, 2015
• Maximum-Margin Structured Learning with Deep Networks for 3D Human Pose Estimation
– S. Li, W. Zhang, A. B. Chan. arXiv preprint arXiv:1508.06708, 2015
• Heterogeneous Multi-task Learning for Human Pose Estimation with Deep Convolutional Neural Network
– S. Li, Z. Q. Liu, A. B. Chan. CVPR Workshop, 2014
• Flowing ConvNets for Human Pose Estimation in Videos
– T. Pfister, J. Charles, A. Zisserman. ICCV, 2015
• R-CNNs for Pose Estimation and Action Detection
– G. Gkioxari, B. Hariharan, R. Girshick, J. Malik. arXiv preprint arXiv:1406.5212, 2014
• MoDeep: A Deep Learning Framework Using Motion Features for Human Pose Estimation
– A. Jain, J. Tompson, Y. LeCun, C. Bregler. ACCV, 2014
• Efficient Object Localization Using Convolutional Networks
– J. Tompson, R. Goroshin, A. Jain, Y. LeCun, C. Bregler. CVPR, 2015
• Combining Local Appearance and Holistic View: Dual-Source Deep Neural Networks for Human Pose Estimation
– X. Fan, K. Zheng, Y. Lin, S. Wang. CVPR, 2015
• Parsing Occluded People by Flexible Compositions
– X. Chen, A. L. Yuille. CVPR, 2015
• Articulated Pose Estimation by a Graphical Model with Image Dependent Pairwise Relations
– X. Chen, A. L. Yuille. NIPS, 2014
• …

Thank you

Evaluation Metrics
• Percentage of Correct Parts (PCP)
– Measures the percentage of correctly localized body parts.
– A candidate body part is treated as correct if each of its segment endpoints lies within 50% of the ground-truth segment length from the corresponding annotated endpoint.
• Percentage of Detected Joints (PDJ)
– Measures performance as a curve: the percentage of correctly localized joints as the localization precision threshold varies.
– The threshold is normalized by the torso scale, defined as the distance between the left shoulder and the right hip, so the metric is invariant to scale.
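Both metrics are short enough to write down directly. The sketch below follows the definitions above for a single image; requiring both endpoints to pass (sometimes called strict PCP) is the reading assumed here, and the function names and argument layout are hypothetical.

```python
import numpy as np

def pcp(pred_a, pred_b, gt_a, gt_b, alpha=0.5):
    """Percentage of Correct Parts over a set of limb segments.

    Each argument is an array (P, 2) of endpoint coordinates. A part counts
    as correct when both predicted endpoints lie within alpha times the
    ground-truth segment length of their annotated endpoints.
    """
    pred_a, pred_b = np.asarray(pred_a, float), np.asarray(pred_b, float)
    gt_a, gt_b = np.asarray(gt_a, float), np.asarray(gt_b, float)
    limit = alpha * np.linalg.norm(gt_a - gt_b, axis=-1)
    ok_a = np.linalg.norm(pred_a - gt_a, axis=-1) <= limit
    ok_b = np.linalg.norm(pred_b - gt_b, axis=-1) <= limit
    return float(np.mean(ok_a & ok_b))

def pdj(pred, gt, left_shoulder, right_hip, threshold):
    """Percentage of Detected Joints at one normalized threshold.

    pred, gt : arrays (J, 2); left_shoulder and right_hip are joint indices.
    A joint is detected when its error is at most threshold times the
    ground-truth distance between the left shoulder and the right hip.
    """
    pred, gt = np.asarray(pred, float), np.asarray(gt, float)
    torso = np.linalg.norm(gt[left_shoulder] - gt[right_hip])
    errors = np.linalg.norm(pred - gt, axis=-1)
    return float(np.mean(errors <= threshold * torso))
```

Sweeping `threshold` over a range of values and plotting the result gives the PDJ curve.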
