Large Scale Datasets and Predictive Methods
for 3D Human Motion Estimation
Dragoş Papavă
Master Thesis
Faculty of Automatic Control and Computers
“POLITEHNICA” University of Bucharest
Bucharest, Romania
September 16, 2012
Advisor:
Prof. Dr. Cristian Sminchisescu
© Copyright author
Contents
1 Introduction 1
2 Related Work 5
2.1 Methods for detection and 2D localization 5
3 Human Motion Capture Methodology 11
3.1 Motion capture dataset 12
3.2 Experimental setting 12
3.3 Dataset Structure 14
3.4 Image processing pipeline 18
4 Discriminative Methods for Human Pose Prediction 23
4.1 Simple discriminative learning methods 23
5 Experiments 27
5.1 Training and Testing scenarios 27
5.2 Image Descriptors 27
5.3 Results 28
6 Online Evaluation Tools 31
7 Conclusions 33
References 35
Acknowledgements
First of all, I would like to thank my advisor Prof. Dr. Cristian Sminchisescu for his
personal support, great patience, and an excellent way of sharing his knowledge.
I also wish to thank my friend Catalin Ionescu for his help in acquiring the database and for his experimental expertise.
Chapter 1
Introduction
One of the many research directions in Computer Vision is the classic problem of
detection, segmentation and pose estimation of people in images. The organization of a
Computer Vision system is highly application dependent. Some systems are stand-alone
applications which solve a specific measurement or detection problem, while others
constitute a sub-system of a larger design. Computer Vision is closely related to the
study of biological vision. The field of biological vision studies and models the processes
behind visual perception in humans and other animals. Computer Vision, on the other
hand, studies and describes the processes implemented in software and hardware behind
artificial vision systems. Over the last century, there has been an extensive study of
brain structures devoted to processing of visual stimuli in humans. One of the tasks that
humans do extremely well is estimating the position or orientation of a specific object
relative to themselves. 3D human pose reconstruction is the problem of determining the transformation of a human in a 2D image that gives the 3D limb positions. It is possible to estimate the 3D pose of a human from a single 2D photo if an approximate 3D model of the person is known and the corresponding joint positions in the 2D image are known.
The importance of solving the pose estimation problem can be seen in the many tasks involving interaction with humans that depend on understanding what people are doing. Being able to infer the human pose allows analysis at a
higher level, such as action recognition. This provides a wide range of applications such
as video indexing or surveillance of streets and airports. The uses of 3D human pose
reconstruction range from entertainment and environmental awareness to human-computer interaction.
Some of the difficulties of real life pose estimation are illumination changes,
partial views and/or heavy occlusions, but the biggest obstacle is the large number of
possible poses and the very different appearances these take under different viewpoints.
In the monocular case, the same image evidence can be consistent with different 3D
poses. This is common and very difficult to recover from.
The contour shape of a human body is one of the most important cues to a person's pose, but clothing and body shape variability further complicate shape-based pose inference. Loose clothes introduce appearance variations that make analyzing pose very difficult because they hide the underlying body shape and give only minimal information about the body. Clothing texture introduces ambiguities when inferring depth discontinuities or some internal edges.
Varying degrees of self-occlusion are common in real images. In monocular pose
estimation, it is an obvious problem when the arms are occluded by the body and no
other information exists about them. In movies, a fundamental difficulty is that many
people are partially or fully occluded for long periods of time. A person tracker is then
used to associate people detections across different frames. It is known that a
human representation based on parts is necessary to address the challenge of detecting
partially visible people in images.
Solving human pose estimation faces a bigger problem than any of the above, namely the large number of possible poses. The difficulty arises from the combinatorial nature of the problem. Joint parameters are relatively independent of one another, and only weak physical constraints or statistical correlations (i.e., some poses are much more likely to appear than others) tie them together and constrain part of this state space. Because of this, statistical models that aim to reliably infer human pose from images need very large amounts of training data, which in turn places computational constraints on algorithm design.
Over the past 10 years the field has made significant progress, fueled by new optimization and modeling methodology, discriminative methods, feature design and standardized datasets for model training. It is now widely agreed that any human sensing system, be it generative, discriminative or mixed, needs a significant training component – not just measurement – in order to be successful, particularly under scene effects like monocular viewing or occlusion. Such situations do not represent
infrequent extremes but are rather commonplace in real world situations. Yet these
cannot be handled very well with the existing modeling and training tools. Part of
the problem is that humans are highly flexible, move in complex ways against natural
backgrounds, and their clothing deforms. Other issues, like occlusions, have to do
with more comprehensive scene modeling beyond that of the humans involved, and they
stretch the ability of the pose sensing system to exploit prior knowledge and correlations,
by using the sparse visible information in order to constrain estimates of unobserved
body parts. Assuming humans have been appropriately identified however, one of the
key problems is insufficient data coverage in trainable systems.
Existing state-of-the-art datasets like HumanEva [32] contain about 40,000 different poses, and the class of motions covered is regular and relatively small, reflecting the
original design purpose of algorithm evaluation. While we want to continue to be able
to offer difficult benchmarks, we also wish to collect datasets to be used in building
Figure 1.1: Example of a complex real image with multiple people in different poses (left)
as well as a matching sample of actors from our dataset in similar poses (middle), and their
reconstructed 3D poses (right).
systems that can operate in realistic environments. The space of possible human poses is 30-dimensional or more. This space cannot be sampled exhaustively, so we put an emphasis on typical human action scenarios and larger datasets. Real-world people move less regularly than assumed in many of our datasets. Consider the case of a pedestrian
walking. It is not so frequent, particularly in busy urban environments, to encounter
‘perfect’ walkers. Driven by their daily tasks, people carry bags, walk with hands in
their pockets and gesticulate when talking to other people or on the phone. We need to
be able to cover a fair number of such scenarios and handle their composites as well.
While we here focus on large datasets acquired with accurate marker based motion
capture systems and moderately realistic clothing and backgrounds, other technologies
for unconstrained capture based on non-invasive sensors or attached body cameras have
emerged recently [41, 27]. Results are very promising with around 4,600 different poses
available in [27]. As technology matures, developments along all fronts are welcome,
particularly as the data we provide is complementary in terms of its choice of activities
to existing datasets, e.g. [12, 32, 27]. While even through a combined community effort,
we cannot hope to sample the 30+ dimensional space of all human poses very densely,
an emphasis on typical scenarios and larger datasets, in line with current efforts in visual
recognition, may still offer a degree of prior knowledge bootstrapping that can improve
the performance of existing measurement and search based human sensing systems.
The contribution of this thesis consists of a novel dataset with increased complexity and diversity that complements earlier work and exceeds it in size by several orders of magnitude. In particular,
by design, we aim to cover the following aspects:
Large Set of Human Poses, Diverse Activities: We collect over 3.6 million different
human poses, viewed from 4 different angles with an accurate human motion capture
system. The motions were executed by 11 professional actors, and cover a diverse set
of everyday scenarios including conversations, eating, greeting, talking on the phone,
posing, sitting, smoking, taking photos, waiting, walking in various non-typical scenarios
(with hand in pocket, talking on the phone), walking a dog, or buying an item.
Synchronized Modalities, 2D and 3D data, Subject Body Scans: We collect and fully
synchronize both 2D and 3D data, in particular images from 4 high-speed progressive
scan, high-resolution video cameras, a Time of Flight (TOF) depth sensor, as well as
human motion capture data acquired by 10 cameras. We also provide 3D full body
meshes of all subjects in the dataset, acquired with a 3D laser scanner.
Evaluation Benchmarks, Complex Backgrounds, Occlusion: The dataset provides
not only training, validation and testing sources for data collected in the laboratory,
but also a variety of realistic mixed-reality settings where complex graphics characters
have been inserted in video environments collected using real digital cameras. The
insertions and occlusions are geometrically correct, based on estimates of the camera
motion and 3D reconstructions of the filmed environment.
Online Software and Evaluation Support Tools: We provide online software for
model visualization including both joint positions and joint angle formats, background
subtraction, person bounding box information, as well as feature extraction and prediction
baselines including linear and kernel regressors and structured predictors. All these
models are complemented with linear Fourier approximation extensions, in order to
allow accurate large scale training for non-linear models.
This work is based on a technical report [23].
Chapter 2
Related Work
There are many ways in which one could address the problem of extracting 3D pose
data from an image or video. Learning-based methods use a machine learning system that learns the mapping from 2D image features to the 3D pose.
In short, this means that a sufficiently large set of images of the object, in different
poses, must be presented to the system during a learning phase. Once the learning
phase is completed, the system should be able to present an estimate of the object’s
pose given an image of the object. The accuracy of such systems is limited to situations represented in their database of images; their goal, however, is to recognize a pose rather than to determine it.
While recognizing a human pose involves visually identifying a discrete action out of
a range of possible ones, reconstructing a pose involves predicting its 3D configuration
out of a continuous set of data (e.g., joint angles).
Many people have tried to reconstruct the human pose from monocular video or
by using additional information taken with specialized cameras, e.g. time-of-flight [18]
or motion capture systems [32]. The multitude of methods created for reconstructing human poses gives a glimpse of the difficulty of this endeavor. I will cover some of the most important methods in the following sections.
2.1 Methods for detection and 2D localization
Most of these methods rely on extracting low level features from images and matching
them against those of the respective object to be detected. This can be done by scanning
the image with different sized windows, while looking for common edges, corners, interest
points or by using feature descriptors.
Appearance-based methods work by searching an image for the pattern of an object. These methods take a long time to run, especially if the search is done over multiple scales and orientations. They also give poor results if the object in the image is rotated relative to the one the model was trained on.
Feature-based methods try to find feasible matches between object features and
image features. The main idea consists of finding the correspondences between a
collection of image features and a subset of object features. These matches are used to
find the orientation of the object inside the image frame. Further cues are then used to
validate the object hypothesis in the image.
Viola & Jones face detector
Paul Viola and Michael Jones’ [40] detector is a fast detection framework for processing
images in the scope of object detection. While it can be trained to detect different
object classes, it was first motivated by the problem of face detection. The features
used by the framework involve sums of pixels inside rectangular regions. These regions
are composed of positive rectangles, where the values of pixels are added to the sum,
and negative rectangles, where the pixels are subtracted from the sum, as represented
in figure 2.1. By using an image representation called the integral image, these features
can be evaluated in constant time, which gives them a great advantage in computing
time over other features.
Integral image is the name introduced to computer vision by Viola and Jones for the summed-area table. It works by computing a matrix the same size as the image, where I(x, y) = Σ_{x′ ≤ x, y′ ≤ y} i(x′, y′). Thus, computing the sum of pixels in the ABCD rectangle becomes Sum = I(A) + I(C) − I(B) − I(D).
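As an illustration, a minimal NumPy sketch of the summed-area table and a constant-time rectangle sum; function names, the corner/index convention and the example feature are illustrative, not part of the original framework:

```python
import numpy as np

def integral_image(img):
    """Summed-area table: I[y, x] holds the sum of img[:y+1, :x+1]."""
    return img.cumsum(axis=0).cumsum(axis=1)

def rect_sum(I, top, left, bottom, right):
    """Sum of pixels in the inclusive rectangle [top:bottom, left:right],
    evaluated with at most 4 lookups regardless of the rectangle size."""
    total = I[bottom, right]
    if top > 0:
        total -= I[top - 1, right]
    if left > 0:
        total -= I[bottom, left - 1]
    if top > 0 and left > 0:
        total += I[top - 1, left - 1]
    return total

# A two-rectangle Haar-like feature: left half minus right half of a window.
img = np.random.rand(24, 24)
I = integral_image(img)
feature = rect_sum(I, 0, 0, 23, 11) - rect_sum(I, 0, 12, 23, 23)
```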
Figure 2.1: Feature types used by Viola and Jones
In a 24x24 pixel sub-window there are more than 40,000 possible features, making
it very expensive to evaluate them all. The object detection framework trains classifiers
using the AdaBoost learning algorithm to select the best features to use and to train
classifiers on them. These are used in a linear cascaded architecture where each successive
classifier is run only when the previous ones do not reject the sub-window under inspection. They are ordered by complexity such that the most complex classifiers are run last, thus making real-time detection rates possible.
Global detectors based on HoG
Histograms of Oriented Gradients (HoG) are image feature descriptors commonly used for
object detection in the field of computer vision. They were first introduced by Navneet
Dalal and Bill Triggs in their 2005 CVPR paper [14]. Originally, their algorithm was
used for people detection in static images, but it was later extended to human detection
in videos, as well as to other classes of objects in images and film.
Figure 2.2: The HoG feature. Image is partitioned into blocks, and for each block we
compute the histogram of gradient orientations. Pictured on the right is the distribution
of gradients for each block.
The essential idea of the HoG descriptor is that the local distribution of gradients
is a good approximation to an object’s shape. The image is divided into small regions,
and for each region a histogram of gradient directions is computed, as seen in figure
2.2. Most of the time, the gradient histograms are contrast-normalized over larger blocks for added invariance to changes in illumination or shadowing.
A common method of obtaining the image gradients is by filtering the color or
intensity values of the pixels by the [−1, 0, 1] and [−1, 0, 1]T kernels. Dalal and Triggs
tested other techniques such as applying the 3x3 Sobel operator or pre-smoothing
the image before applying the derivative kernels, but these methods provided poorer
performance in their experiments.
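A simplified NumPy sketch of the cell-level part of this computation, assuming a grayscale float image; the full HoG descriptor would additionally apply block normalization, and all names here are illustrative:

```python
import numpy as np

def hog_cells(gray, cell=8, bins=9):
    """Unsigned-orientation gradient histograms on a grid of cells."""
    gray = np.asarray(gray, dtype=np.float64)
    # Central differences, equivalent to filtering with [-1, 0, 1] and its transpose.
    gx = np.zeros_like(gray); gy = np.zeros_like(gray)
    gx[:, 1:-1] = gray[:, 2:] - gray[:, :-2]
    gy[1:-1, :] = gray[2:, :] - gray[:-2, :]
    mag = np.hypot(gx, gy)
    ang = np.rad2deg(np.arctan2(gy, gx)) % 180.0        # unsigned orientation

    h, w = gray.shape
    ch, cw = h // cell, w // cell
    hist = np.zeros((ch, cw, bins))
    bin_idx = np.minimum((ang / (180.0 / bins)).astype(int), bins - 1)
    for i in range(ch):
        for j in range(cw):
            ys = slice(i * cell, (i + 1) * cell)
            xs = slice(j * cell, (j + 1) * cell)
            # Each pixel votes into its orientation bin, weighted by gradient magnitude.
            np.add.at(hist[i, j], bin_idx[ys, xs].ravel(), mag[ys, xs].ravel())
    return hist
```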
Methods of localization - Pictorial Structures
Pictorial Structures (PS) are extensively used for part-based recognition of scenes,
people, animals and multi-part objects. Usually, the structure and parameterization
of the model is restricted by assuming a tree dependency structure between its parts.
Part based models for recognition of articulated objects were proposed long ago
by Fischler and Elschlager [15]. B. Sapp, C. Jordan and B. Taskar [29] proposed a
semi-parametric approach that combines the ease of pictorial structure inference with the flexibility of non-parametric methods (i.e., skeleton structures). They represent an object
as a collection of distinctive parts with geometric relationships between them. The model
characterizes local visual properties of object parts and posits spring-like connections
between pairs of parts, which express variability of part locations. It determines an
object match in an image by selecting part locations that minimize appearance matching
costs and deformation costs for pairs of connected parts.
The basic Pictorial Structures model consists of a graph where its nodes represent
object parts and the edges between parts encode pairwise geometric relationships. For
human modeling, the PS model decomposes as a tree structure into unary potentials
(also referred to as appearance terms) and pairwise potentials between pairs of physically
connected parts, as pictured in figure 2.3.
Figure 2.3: The Pictorial Structures model. Objects are decomposed into parts and
spatial relations among parts. Reprinted from [15].
Local detectors - Poselets
Poselets build upon the Pictorial Structures idea by decoupling the tree-like structure from the model. This enables a data-driven search procedure that finds poselets
that are tightly clustered in both 3D joint configuration space as well as 2D image
appearance. To permit this, L. Bourdev and J. Malik [11] have created a dataset, H3D,
of annotations of humans in 2D photographs with 3D joint information, inferred using
anthropometric constraints. In order to reconstruct the 3D pose from an image, a set of
poselets (linear support vector machines [17]) scans the input image at multiple scales
and votes for the location of torso bounds or body keypoints. The probability that a
keypoint O is detected at position x is simply P(O|x) = Σ_i w_i a_i(x), where a_i(x) is the score that a poselet classifier assigns to location x and w_i is the weight of the poselet.
The weights w account for the fact that some poselets are more discriminative than
others.
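A toy sketch of this weighted voting, under the simplifying assumption that each poselet votes for the keypoint through a single fixed offset; in the actual method the annotated keypoint distributions play this role, and all names below are hypothetical:

```python
import numpy as np

def keypoint_vote_map(poselet_scores, poselet_weights, offsets, image_shape):
    """Accumulate weighted poselet votes for one keypoint.

    poselet_scores:  list of score maps a_i(x), one H x W array per poselet
    poselet_weights: learned discriminativeness weights w_i
    offsets:         assumed (dy, dx) displacement from each poselet to the keypoint
    """
    votes = np.zeros(image_shape)
    for a_i, w_i, (dy, dx) in zip(poselet_scores, poselet_weights, offsets):
        # Shift each score map so its vote lands on the keypoint location.
        votes += w_i * np.roll(np.roll(a_i, dy, axis=0), dx, axis=1)
    return votes  # the argmax gives the most supported keypoint location
```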
Compared to other methods, poselets do very well at detecting the visible portions
of a person in the presence of severe occlusion. Using the annotated keypoint locations,
the expected image locations of a set of keypoints can be predicted conditioned on the locations of other keypoints.
Skeletal models
Skeletal reconstruction methods recover an interpretation tree of possible 3D joint
positions, based on user-specified image joint positions [24, 37]. Lee and Chen [24]
attempt to prune their perspective interpretation tree using physical reasoning, while
Taylor [37] relies on additional user input to specify plausible relative joint-centre
depths for his affine one. Although these methods do incorporate the forward-backward
Figure 2.4: The flipping ambiguities of the forearm and hand under monocular
perspective. A single frontal view of the subject is barely usable for inferring the real
3D pose from its 2D projection, while these side views show great differences. Reprinted
from [35].
flipping ambiguity, they cannot reconstruct skeletal joint angles, which makes them
inappropriate for tracking applications.
Sminchisescu and Triggs [35] used a 3D body model that consists of a kinematic
‘skeleton’ of articulated joints controlled by angular joint parameters. Each limb has
attached ellipsoids to approximate the surface of the human body [9]. Each configuration
of the skeletal kinematic tree has an associated interpretation tree of 3D skeletal configurations that can be obtained from the given one by forwards/backwards flips. The tree contains all configurations that are image-consistent, in the sense that their joint centres have the same image projections as the given one (see figure 2.4). Their algorithm
generates random samples in the joint space and optimizes them using nonlinear local
optimization, respecting joint constraints. This is a fast method for investigating
alternative limb rotations, as choosing the wrong limb configuration usually leads to
limb mistracking.
Space embeddings for dimensional reduction
In [33], the authors presented the difficulty of optimizing 3D human motion estimation over high-dimensional state spaces. The typical human skeleton used in computer vision consists of 15 to 20 joints, leading to a 45- to 60-dimensional representation space. Constraints, such as knowing that the person is upright, or simple relations between joints, greatly restrict the affected dimensions. This means that there is a lower-dimensional subspace, contained in the original 60-dimensional one, that retains a low representation error of the skeleton. Transforming the data into a space of fewer dimensions can be done linearly, as in principal component analysis, or using nonlinear dimensionality reduction techniques, as this paper proposed. The authors compared learning in
the original space and in the lower-dimensional space reduced by Laplacian eigenmaps
and reported joint angle errors of 2◦ mean and 35.6◦ maximum error for the first case
and a 1.4◦ mean and 4.3◦ maximum error using their method.
Generative models based on silhouettes
Agarwal and Triggs [7] described a learning based method for recovering 3D human body
pose from single images and monocular image sequences, without using an explicit body model or prior labeling of body parts in the image. Instead, they recover 3D poses
by direct nonlinear regression against shape descriptor vectors extracted automatically
from image silhouettes. For robustness against local silhouette segmentation errors,
the silhouette shapes are encoded by histogram-of-shape-contexts descriptors. Their
regressor is a Relevance Vector Machine (RVM) [38, 39], a sparse Bayesian
approach to classification and regression.
Silhouettes contain enough information for human pose recognition and RVM is
a good regression model which automatically chooses the relevant input data from
the training set and discards the others. Matching silhouettes is then reduced to
matching shape context distributions, a comparison of 100-D histograms. Histograms
are built by allowing context vectors to vote softly into the few centres nearest to them,
with Gaussian weights. This histogram-of-shape-contexts scheme gives some degree of
robustness to occlusions and local silhouette segmentation failures. The RVM regressor
would retain only 6% of its training data, thus giving a large effective reduction in
storage space compared to other methods.
As Agarwal and Triggs [7] state, and as Sminchisescu [36] discussed earlier, silhouettes do not fully constrain the pose estimate. High estimation errors occur when similar silhouettes arise from very different poses, in which case the regressor outputs a compromise solution.
Chapter 3
Human Motion Capture
Methodology
In order to reconstruct the 3D pose of a human being given 2D images, there is a need for
spatial 3D information to be used when training a model (i.e., there is no way in which
3D poses could be inferred just from simple 2D images). In 3D space, the human joint
positions are highly correlated (e.g., the distance between the shoulders is roughly the
same regardless of orientation), unlike the 2D space where the positions of ankles could
be near the head for a person that is lying in bed. Current 3D annotated databases
contain poorly labeled body parts or joint positions, or are incomplete. This gives rise
to the need for a custom dataset.
To create a custom dataset for my experiments, I am using a Vicon motion capture
system [5] consisting of 4 DV video cameras [2], 10 motion capture MX cameras [4]
and one time-of-flight camera [6], all being synchronized to give solid inter-camera
coordination. This system is calibrated, so that each camera lies in a common coordinate
system, allowing for reliable localization of 3D human joint positions in captured videos.
The 10 motion capture cameras work by recognizing spherical markers glued to human
body or clothes, by generating rays for each detected marker and then by calculating the
points in space where rays from multiple cameras intersect, thus obtaining a point cloud
of 3D marker positions. We do not really need the 3D marker positions, but the 3D
human joint positions. For this to happen, I am running a kinematic fitting algorithm
- provided by Vicon Nexus software, and specifically tuned for each captured person -
that outputs the desired person’s skeleton (i.e., 3D joint positions).
Reprojecting the captured 3D joint positions to the captured videos required a
pipeline of operations to be set up. At first, there was the need for extracting the
kinematic fit from Vicon’s software, which did not offer this possibility. I managed to
stream the data into Autodesk MotionBuilder and export the animated skeleton from
there. Projecting the 3D joint positions to videos in Matlab required many coordinate
system changes, as each of Vicon, MotionBuilder and Matlab uses a different one (e.g.
Y-up, Z-up, left-handed, right-handed). Knowing the location of each body joint in the
captured images and its actual 3D position allows for training a model that, being given
2D images of people, would extract the desired 3D pose information.
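To make the reprojection step concrete, a minimal pinhole-projection sketch is given below; it is illustrative only, omits radial distortion, and the axis-swap helper is just one example of the Y-up/Z-up conversions mentioned above, not the exact Vicon/MotionBuilder/Matlab conventions:

```python
import numpy as np

def project_joints(joints_world, R, t, K):
    """Project Nx3 world-space joints into pixel coordinates.

    R (3x3) and t (3,) map world coordinates to the camera frame;
    K (3x3) holds the camera intrinsics. Distortion is omitted for brevity.
    """
    cam = joints_world @ R.T + t          # world -> camera frame
    uvw = cam @ K.T                       # camera frame -> homogeneous pixels
    return uvw[:, :2] / uvw[:, 2:3]       # perspective divide

def yup_to_zup(joints):
    """Example axis swap when moving between Y-up and Z-up conventions."""
    x, y, z = joints[:, 0], joints[:, 1], joints[:, 2]
    return np.stack([x, -z, y], axis=1)
```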
Our Vicon system doesn’t allow for reliable outdoor capturing. Because of this,
we will extend our dataset by synthetically inserting rendered characters into real
images. I have written a Python script for MotionBuilder that takes the recorded
animated skeleton and transfers its motion to a character (a realistic fully clothed male
or female) with an operation called retargeting. This character is then exported from
MotionBuilder and imported into Autodesk 3ds Max where a 3ds Max script (written
in MAXScript) takes it, prepares it and then renders it over the video background.
Simply rendering the characters over the videos would not be sufficient, because the background videos were filmed with a moving camera rather than a static one. This is where Vicon boujou [3] came to help. It allowed the estimation of both the intrinsic and extrinsic parameters of the video camera. It is able to track corners and custom points in filmed sequences, as well as to reconstruct their 3D positions. This allows for a good scene geometry approximation, and a natural feel of the character placement and rendering results.
3.1 Motion capture dataset
HumanEva [32] is the state-of-the-art motion capture (mocap) dataset at this moment and contains only 5 motions (walking, jogging, throwing and catching, boxing, and gestures), not nearly enough to cover the real poses encountered in day-to-day life, especially since these are all idealized versions of the motions. I want to show that, using a combination of motion capture and sprite insertion, a dataset can be generated that is large enough and good enough for making progress on pose estimation in real images (not just in constrained laboratories).
Using the time-of-flight camera allows for capturing 3D motion sensor data synchronized
with the mocap data. This would allow prediction of more than just joint angles (e.g.
surfaces). Furthermore, it would allow prediction of poses for people wearing loose clothes. Data with mild occlusions is also beneficial. For instance, in the real world, different objects
could be placed in the field of view of the video camera. People sitting at a table cannot
avoid the occlusion of their lower body. We took advantage of the Vicon system and
positioned the MX cameras so that they would record as many markers as possible.
3.2 Experimental setting
We have gathered this data using our state-of-the-art Vicon motion capture system.
The system relies on reflective markers and works by tracking these markers over time.
Tracking maintains the label identity and propagates this information through time from
Figure 3.1: A schematic view of our experiment setup with the capture surface and
the placement of the 4 DV cameras and our TOF sensor. 10 other high-speed motion
cameras are rigged on the walls, to maximize capture volume. The 3d laser scanning
system used to capture the body models of actors is also shown in the right image.
MX System: model Vicon T40; 10 cameras; 4 Megapixel resolution; 200 Hz
DV System: model Basler piA1000-60gc; 4 cameras; 1000x1000 resolution; 50 Hz; hardware sync
TOF System: model Mesa Imaging SR4000; 1 camera; 176x144 resolution; 25 Hz; software sync
Table 3.1: The motion capture system specifications. Motion capture cameras are
denoted with MX, the video data with DV and time-of-flight with TOF.
Figure 3.2: A sample of the data provided in our dataset. Aside from the image and pose data, we provide time-of-flight data, with example range data shown in the middle image, and a mesh of the subject shown next to it.
an initial pose which is easily labeled (either manually or automatically). A fitting
process uses the position and identity of these labels, as well as proprietary human
motion models to infer accurate pose parameters.
The laboratory setup is represented in figure 3.1. It comprises a 6 m by 5 m capture area, and within it a 4 m by 3 m effective capture space where all the subjects were fully
visible in all cameras. The placement of the 4 digital video (DV) cameras as well as
the time-of-flight sensor (TOF) are shown in the figure. A set of 10 motion capture (MX) cameras is placed on the walls around the capture surface so as to maximize the effective experiment volume. More detailed specifications of the system are given in
table 3.1.
A Human Solutions Scanworks 2.9 3D laser body scanner was used to obtain accurate
volumetric information about the subjects.
3.3 Dataset Structure
In this section we discuss the choice of poses in the dataset, the data types provided as
well as the image processing and the input annotations that are pre-computed.
Actors
The experiments were performed by 11 subjects (5 female and 6 male) chosen to span
a body mass index (BMI) from 17 to 29. This provides a moderate amount of body
shape variability as well as different ranges of mobility. This variability is not available in previous, smaller datasets and more accurately reflects real-world people. Volumetric
information was gathered on each of the actors, and this may be used to obtain more
detailed 3D information about the subjects than joint positions alone. This data can
be used to also evaluate human body shape estimation algorithms [31].
The subjects wore regular clothing (as opposed to special motion capture costumes)
to maintain as much of the realism as possible. The motion capture technology requires
the presence of reflective markers which compromises somewhat the natural appearance
of the actors. We chose a larger set of markers because the more complicated poses create partial marker occlusions, which would make kinematic fitting with a minimal marker set very inaccurate. Using our own more accurate image calibration, we can compute masks based on the 3D marker positions, which can be used to cover the existing markers, thus limiting their negative impact on image features. The actors were given detailed tasks with examples in order to help them plan a stable set of poses between repetitions for the creation of training, validation and test sets. In the execution of these tasks the actors were, however, given considerable freedom to move naturally rather than follow a strict, rigid interpretation of the tasks.
Dataset details
The dataset contains 910K frames per camera, with 4 cameras, or almost 3.6 million images in the monocular case (before mirroring, which would raise it to more than 7 million frames).
Several training and testing scenarios were prepared. We fix training and testing subjects
and motions to make quantitative comparisons possible. In the simplest scenario we
Figure 3.3: A sample of different scenarios performed by 6 of the actors. From left to
right, top to bottom: Discussion, Talking on the phone, Waiting, Sitting on a chair,
Posing, Purchases.
consider training on each individual subject and testing on the corresponding test
set of that subject. This can help isolate the pose from the body shape and clothing
variability. A more challenging task is prediction with a model trained on the set of
training subjects and tested on the subjects from the test set.
We use 7 subjects (3 female and 4 male) for training and validation, and 4 subjects
(2 female and 2 male) for testing. The structure of the dataset is presented in table 3.2.
A 3D body scan of all our subjects gives an accurate volumetric model which is provided
as additional data. This can be used to more accurately model the 3D geometry of the
subjects. The mesh, skinned to a skeleton, will be released as part of the dataset.
Annotation, Image Processing, Silhouettes and Person Bounding Boxes
We provide accurate pixel-wise foreground-background segmentation for all images, obtained using background models.
Scenarios | Train | Validation | Test
Upper body movement (Directions, Discussion) | 254,008 | 193,424 | 245,412
Full body upright variations (Greeting, Posing, Purchases, TakingPhoto, Waiting) | 335,192 | 325,484 | 312,488
Walking variations (Walking, WalkDog, WalkTogether) | 210,364 | 181,612 | 196,324
Variations while seated on a chair (Eating, TalkPhone, Sitting, Smoking) | 330,940 | 322,840 | 376,792
Sitting down on the floor (SittingDown) | 142,896 | 88,400 | 124,612
Total | 1,273,400 | 1,111,760 | 1,255,628
Table 3.2: A breakdown of the number of 3d human poses by motion type for training,
validation and testing. We use 7 subjects for training and validation (3 female and 4
male) and 4 subjects for testing (2 female and 2 male). Notice that in practice, due to
our combined system sampling rate (table 3.1) we have 3.6 million 3d poses as well as
3.6 million pairs of synchronized images with 3d pose ground truth.
We train our background models as Gaussian distributions in each of the RGB and HSV color channels, as well as in the image gradient of each RGB channel (3 + 3 + 2×3 = 12 features in total).
We use these background models in a graphcut to obtain the final pixel labelings.
The weights between the input features of the graphcut are learned by optimizing a measure of pixel segmentation accuracy against manually labeled ground truth on a small subset of images sampled from the videos. The segmentation measure we used was the standard segment overlap (intersection over union). We use the Nelder-Mead simplex algorithm for optimization because the objective is non-smooth.
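A minimal sketch of this weight learning loop, assuming SciPy's Nelder-Mead implementation; here a simple threshold stands in for the full graph-cut segmentation, and the variable names (feature_maps, gt_masks, threshold) are illustrative:

```python
import numpy as np
from scipy.optimize import minimize

def segment_overlap(pred_mask, gt_mask):
    """Intersection-over-union between two boolean masks."""
    inter = np.logical_and(pred_mask, gt_mask).sum()
    union = np.logical_or(pred_mask, gt_mask).sum()
    return inter / max(union, 1)

def learn_feature_weights(feature_maps, gt_masks, threshold=1.0):
    """Fit per-feature weights by maximizing mean IoU on labeled frames.

    feature_maps: list of (H x W x 12) per-pixel background deviation scores
    gt_masks:     list of (H x W) manually labeled foreground masks
    """
    def neg_mean_iou(w):
        ious = [segment_overlap((f @ w) > threshold, m)
                for f, m in zip(feature_maps, gt_masks)]
        return -np.mean(ious)   # Nelder-Mead minimizes, so negate the overlap

    w0 = np.ones(feature_maps[0].shape[-1])
    res = minimize(neg_mean_iou, w0, method='Nelder-Mead')
    return res.x
```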
Since some algorithms are still working with bounding box annotation we also
provide ground truth bounding box annotation. This data was obtained by projecting
the skeleton into the image and fitting a rectangular box around the projection.
A separate camera calibration procedure was performed to improve the accuracy of
the default one provided by the Vicon system. This is necessary because the camera
distortion parameters are not estimated by the default Vicon calibration procedure. This
data is also provided with the release of the dataset. It was obtained by positioning
a large number of reflective markers (about 30) on the capture surface and manually
labeling them in each of the cameras with subpixel accuracy. Camera models with radial
distortion parameters were fitted which improved calibration significantly.
Joint Positions and Joint Angle Skeleton Representations
Common pose parametrizations considered in the literature include relative 3D joint
positions (R3DJP) and kinematic representation (KR). Our dataset provides data in
both parametrizations, with full skeletons containing the same 32 joints in both cases.
In the first case (R3DJP), the joint positions in 3D space are provided. They are
obtained from the joint angles provided by the Vicon skeleton fitting procedure by
applying forward kinematics on the subject skeleton. R3DJP is challenging because it is very hard to estimate the size of the person. This problem is obviated in practice by providing the same skeleton (limb length) information for all subjects, including those in testing, if needed. The parametrization is called relative because there is a joint, usually
called root joint (roughly corresponding to the human pelvis bone position), which is
taken as the center of the prediction coordinate system and the other joints are estimated
relative to it. The kinematic representation (KR) considers the relative joint angles between limbs and is more convenient because it is invariant to both scale and body proportions. The dependencies between variables are, however, much more complex,
making estimation more difficult. The process of obtaining joint angle values involves a
complex constrained non-linear optimization process which is not guaranteed to reach a
global optimum and can sometimes introduce errors in the data. We devoted significant
effort to ensure that data is clean and the fitting process is accurate. Outputs were
visually inspected multiple times, during different processing phases, to ensure accuracy.
These representations can be directly used in independent monocular predictions or in
multi camera settings. The monocular prediction dataset can be increased 4-fold by
globally rotating and translating the pose coordinates so as to bring the 4 DV cameras into a single coordinate system (code is provided for this data manipulation). As seen in table 3.1, poses are available at four-fold faster rates than images from DV cameras.
The code provided can also be used to double both image and pose data by considering
their mirror symmetric versions.
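For illustration, a small sketch of these two data manipulations; it is not the released code, and the lateral axis and left/right joint index lists are placeholders for the dataset's actual conventions:

```python
import numpy as np

def pose_to_camera_frame(joints_world, R_cam, cam_center):
    """Express Nx3 world-frame joints in one camera's coordinate frame, so that
    each of the 4 DV views yields an independent monocular training example."""
    return (joints_world - cam_center) @ R_cam.T   # R_cam maps world to camera axes

def mirror_pose(joints, left_idx, right_idx):
    """Double the data by reflecting poses about the sagittal plane and
    swapping the left/right joint labels (index lists are hypothetical)."""
    mirrored = joints.copy()
    mirrored[:, 0] *= -1                           # flip the assumed lateral axis
    mirrored[left_idx + right_idx] = mirrored[right_idx + left_idx]
    return mirrored
```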
Additional Mixed Reality Test Data
Besides the laboratory test sets created we also focused on providing test data to cover
variations in clothing and complex backgrounds, as well as camera motion and occlusion.
We are not aware of any setting of this level of difficulty in the literature. Real images
contain people in complex poses, but the diverse backgrounds as well as the scene
illumination and occlusions can vary independently and represent important nuisance
factors that vision systems should be robust against. Although approaches to handle such cases exist in principle in the literature, it is still difficult to annotate real images.
Our dataset contains a section that has been especially designed to address such issues.
This is by no means the only possible realistic testing scenario – different sport motions
or backgrounds appear in [27, 41].
We create movies by inserting high quality 3D rigged animation models in real
videos, to create a realistic and complex background, good quality image data and very
accurate 3d pose information. The mixed reality movies were created by inserting and
rendering 3D models of a fully clothed man and woman in real videos. The poses used
for animating the models were extracted directly from our laboratory test set. The Euler
ZXY joint angles we extracted were used to create BVH files whose limb lengths were
matched automatically to the models. This was necessary in the next step, where we
retargeted the captured motion of the BVH data to one of the models’ skeletons using
Autodesk MotionBuilder software. The actual insertion required solving the camera
motion of the backgrounds, as well as its internal parameters, for good quality rendering.
This was achieved using camera matchmoving software (Boujou). The exported camera
Figure 3.4: A sample of images from our mixed reality test set. The dataset is quite
challenging due to the complexity of backgrounds, viewpoints, diverse subject poses,
camera motion and occlusion. See our supplementary material for video.
tracks as well as the model were then imported into Autodesk 3ds Max where the
actual rendering was made. The scene was set up and rendered using the mental ray
(raytracing) renderer, with several well-placed area lights and skylights. To improve
quality, we have placed a transparent plane on the ground, to receive shadows. Scenes
with occlusion were also created. The dataset contains 5 different dynamic backgrounds obtained with a moving camera, for a total of 10,350 examples, out of which 1,270 frames contain various degrees of occlusion. A sample of the images created is shown in figure
3.4.
Additional information
A 3D scan of our subjects provides an accurate volumetric model which is given as
additional data. This can be used to more accurately model the 3D geometry of the
subjects. We are currently considering skinning this mesh to a skeleton and providing
it as part of the dataset.
A low resolution Time-of-Flight camera synchronized with the rest of our system is
also available. Though the data is not very accurately calibrated with the rest of our system, a rough calibration will be provided to interested users.
3.4 Image processing pipeline
Color reconstruction from a color filter array
While testing the Vicon system, I have discovered that the DV cameras use the Bayer
color encoding [1] of a color filter array (CFA). While this allows for reduced bandwidth
Figure 3.5: Standard deviation of noise before (red) and after (green) filtering. The
filtered images are more stable.
between the cameras and the computer, great amounts of structured noise are generated
as the naïve full color reproduction implemented by Vicon simply interpolates the
neighboring noisy pixels in each color plane. To reduce this noise, I have used a temporal
smoothing filter over the captured raw Bayer images:
I′_t = (w_{t−1} I_{t−1} + I_t + w_{t+1} I_{t+1}) / (w_{t−1} + 1 + w_{t+1}),   (3.1)

where

w_{t−1} = e^{−k·(I_{t−1} − I_t)},   (3.2)
w_{t+1} = e^{−k·(I_{t+1} − I_t)},   (3.3)

and I_{t−1} is the previous image, I_t is the current image, and I_{t+1} is the next image, seen as either matrices or individual pixels. The weights w account for differences between
successive frames and the constant k gives the amount of filtering that is applied. The
resulting images show great improvements: the background contains much less noise
and the edges between the moving objects and the background are well preserved and
not blurred, as plotted in figure 3.5.
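A minimal NumPy sketch of the filter in equations (3.1)-(3.3); the absolute difference in the exponent is an assumption on my part (so the weights stay in (0, 1]), and the value of k is illustrative:

```python
import numpy as np

def temporal_smooth(prev, cur, nxt, k=0.05):
    """Edge-preserving temporal filter over three consecutive raw Bayer frames:
    neighboring frames that differ a lot from the current one (e.g. because of
    motion) receive exponentially smaller weights."""
    prev, cur, nxt = (f.astype(np.float64) for f in (prev, cur, nxt))
    w_prev = np.exp(-k * np.abs(prev - cur))
    w_next = np.exp(-k * np.abs(nxt - cur))
    return (w_prev * prev + cur + w_next * nxt) / (w_prev + 1.0 + w_next)
```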
For the reconstruction of full color data from the filtered grayscale Bayer pattern
(fig. 3.6), I have implemented the CFA interpolation framework of Lukac et al. [26] (fig.
3.7).
Based on the Bayer CFA pattern structure, only one color component is available
for each spatial location. Assuming that x_{(i,j)k} is the pixel value available for the k-th color plane at pixel position (i, j), the missing G components x_{(r,s)2} are generated
Figure 3.6: (Left): Bayer encoding scheme with alternating colors over the image pixels.
(Middle): Raw grayscale image from one camera, showing the CFA pattern. (Right):
Reconstructed color image. Note that the navy blue wires coming down from the
cameras appear less than a pixel thick, and it is very hard for the algorithm not to create a false color effect on them. The same problem sometimes occurs on the sharp edges of objects, but it has no impact on the subsequent processing steps.
using a weighted sum of the surrounding original G components as follows:

x_{(r,s)2} = Σ_{(i,j)∈ζ} w_{(i,j)} x_{(i,j)2},   (3.4)

where (r, s) is the location at the centre of the diamond-shaped structure of the four original G components described as ζ = {(r−1, s), (r, s−1), (r, s+1), (r+1, s)}. The normalized weights are generated via w_{(i,j)} = u_{(i,j)} / Σ_{(g,h)∈ζ} u_{(g,h)}, and the positive edge-sensing coefficients u_{(g,h)} are defined as follows:

u_{(g,h)} = 1 / (1 + Σ_{(e,f)∈ζ} |x_{(g,h)2} − x_{(e,f)2}|)   (3.5)
The subsequent steps require interpolating the R (k = 1) and B (k = 3) color planes, keeping in mind that they are subsampled twice as sparsely (fig. 3.6, left, shows how the green component represents half of the original spatial frequency, while the red and blue components represent only 1/4 of the original). The interpolation of each of these two color components, reconstructing as much as possible of the original image, is given by

x_{(r,s)k} = x_{(r,s)2} Σ_{(i,j)∈ζ} w_{(i,j)} {x_{(i,j)k} / x_{(i,j)2}},   (3.6)

where w_{(i,j)} are the weights (3.5) obtained for k = 1 and k = 3 and then normalized. The color ratios in the interpolator (3.6) are generated using the G components x_{(i,j)2} in the same spatial positions as the R or B components x_{(i,j)k}. The square-shaped structure ζ = {(r−1, s−1), (r−1, s+1), (r+1, s−1), (r+1, s+1)} is used for interpolating the R and B components (fig. 3.7b), while the color ratios account for the missing edges due to subsampling when doubling the color spatial frequency.
Figure 3.7: Reconstruction steps of color data from the CFA pattern. (a)-diamond shape
interpolation of green channel. (b)-square shape interpolation of red and blue channels.
(c)-diamond shape interpolation of the red channel. (d)-diamond shape interpolation of
the blue channel.
The remaining missing R and B components are generated using (3.6) with a diamond-
shaped structure as shown in fig. 3.7c,d.
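A direct, unoptimized NumPy reading of the green-channel step, equations (3.4)-(3.5); the green_mask input describing the Bayer layout and the function name are assumptions, and the R/B planes would follow analogously with the square neighborhood:

```python
import numpy as np

def interpolate_green(bayer, green_mask):
    """Edge-sensing interpolation of the missing G values.

    bayer:      raw single-channel CFA image
    green_mask: boolean map of positions where G was actually sampled
    """
    H, W = bayer.shape
    green = np.where(green_mask, bayer, 0.0).astype(np.float64)
    out = green.copy()
    zeta = [(-1, 0), (0, -1), (0, 1), (1, 0)]          # diamond-shaped neighborhood
    for r in range(1, H - 1):
        for s in range(1, W - 1):
            if green_mask[r, s]:
                continue
            vals = np.array([green[r + dy, s + dx] for dy, dx in zeta])
            # Edge-sensing coefficients: small when a neighbor disagrees with the rest.
            u = 1.0 / (1.0 + np.abs(vals[:, None] - vals[None, :]).sum(axis=1))
            w = u / u.sum()
            out[r, s] = (w * vals).sum()
    return out
```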
Background Subtraction
Figure 3.8 shows the result of the next step of my pipeline, the background subtraction.
At first, I used only the RGB channels, computing the mean and standard deviation of each pixel over the background video and testing the foreground video against this distribution. This was not good enough, so I switched to the HSV space, which improved the result. The main problem in training the background model in HSV space came from the H channel, which is circular, so simply taking its mean is not mathematically correct. I therefore trained sin(H) and cos(H) separately, and multiplying their responses back together to obtain the response from hue worked a little better. The idea is that applying more filters to the image, testing whether each response belongs to the background, and combining the results in a way that reinforces them gives a better result than using only one filter.
Currently, after further development, the algorithm fits 12 Gaussian distributions per pixel over 2500 frames of pure background video, for the following filters: RGB and HSV color spaces, and image gradients over the RGB color space. To classify a pixel as foreground/background, the same series of filters is computed and tested against the background model. To combine all these filter responses, they are weighted and then summed, giving a single scalar value per pixel where 0 represents a 100% chance of background and larger values represent a chance of foreground:
F(w) = Σ_{i∈Ψ} w_i · (X_i − µ_i)/σ_i,   (3.7)

where Ψ = {R, G, B, H, S, V, ∇x(R), ∇y(R), ∇x(G), ∇y(G), ∇x(B), ∇y(B)}, X_i is the output of the pixel filters, and µ_i and σ_i are the background's mean value and standard
Figure 3.8: (Left): Unary potential scaled to view range. (Middle): Thresholded image.
(Right): Result using graph cuts.
deviation of noise. The weights w_i are learned against ground truth masks GT by maximizing the segmentation overlap (intersection over union):

w* = arg max_w  Area(F(w) ∩ GT) / Area(F(w) ∪ GT)   (3.8)
In order to reliably threshold this image, I am using a graph cut model, by creating
an energy minimization problem, where the unary potential is (3.7) and the pairwise
potential is the coherence between neighboring pixels. They are defined as neighbors
if they are adjacent either horizontally or vertically. This way, the uncertainties of
the background/foreground classifier are greatly reduced thanks to the smoothing term
introduced by the second order potential, as seen in figure 3.8. For the implementation,
I have used R. Howe’s [20] optimized C/C++ code provided by the author.
Because an actor casts many shadows on the ground and walls, the algorithm handles them by detecting areas where the hue and saturation are
approximately the same as the background model, but the pixel luminance drops. Over
those areas, I have used a soft thresholding approach for implementing shadow removal.
The final step of the algorithm retains only the largest connected component, in
order to remove scattered noise. The results of background subtraction are shown in the second image of figure 3.2. An improvement to the above method would be to use the reprojections of joints and bones into the image, either by forcing the current background subtraction algorithm to treat the image regions at joint and bone locations as foreground, or simply by not discarding segmented areas that contain a joint or a bone when keeping only the largest connected component.
Chapter 4
Discriminative Methods for
Human Pose Prediction
We provide several simple discriminative learning methods with our dataset and test
them in different situations to better understand the complexity of the data and establish
a set of baselines for other methods to compare against. We adopt a fully automatic
discriminative framework for simplicity and efficiency in view of the amount of data we
are analyzing.
4.1 Simple discriminative learning methods
In a discriminative framework, the estimation problem is framed as one of learning a mapping (or an index) from image descriptors, extracted on the person silhouette or its bounding box, to the pose in either the joint position or the joint angle representation. If X_i is the image descriptor and Y_i the pose representation for frame i, and f or f_W is this mapping, where W are the parameters of the function, then the goal of learning is to obtain a mapping such that f_W(X) = Y, with X and Y being a pair of image descriptor and pose not seen during training. Here we perform experiments with different methods on our dataset. The methods we considered are: k-nearest neighbor (KNN), linear and kernel ridge regression (LinKRR, KRR), as well as a structured prediction method based on kernel dependency estimation (KDE) [13, 34].
k-Nearest neighbor regression (KNN) is probably the simplest yet very effective method for learning f, and it is still relatively popular in computer vision [16]. For this method, training consists of storing the training examples or, in our case, a subset of them. Depending on the distance function, an intermediate data structure, typically a KD-tree or, more recently, a cover tree [10], is constructed in the training phase to help speed up the subsequent inference. These data structures, however, are dependent on the input metric and offer an efficiency payoff only in problems with low input dimensionality. In our case we use χ², a metric for histograms that is known to perform very well on
image data. If X = [x_1 . . . x_d] and Y = [y_1 . . . y_d] are two vectors, then the χ² distance is defined as

χ²(X, Y) = (1/d) Σ_l √((x_l − y_l)² / (x_l + y_l))   (4.1)
Since we have a very large input dimensionality, we dispense with the KD-tree. For efficiency reasons we chose to put an upper bound of 400K training examples and subsampled the data if it surpassed this bound. To obtain a prediction, KNN performs an inference step where the closest k examples according to the input metric are chosen, together with their corresponding poses (as seen in training), and an inference rule is applied to them. Given this set of poses, a large number of inference rules can be applied to obtain the final prediction. A simple one is the average

f(X) = (1/k) Σ_{i∈N_k(X)} Y_i,

another being a weighted average based on local kernels

f(X) = (1/k) Σ_{i∈N_k(X)} k(∥X − X_i∥) Y_i,   (4.2)

where k(·) is an increasing function and N_k(X) denotes the k nearest neighbors of X. In the experiments we have found that both perform similarly, but the first one has fewer parameters, so we chose the simple average as our inference procedure for KNN.
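A minimal sketch of the simple-average KNN predictor under the χ² metric; names and the default k are illustrative, and the 1/d factor of equation (4.1) is dropped because it does not change the neighbor ranking:

```python
import numpy as np

def knn_predict(X_train, Y_train, x_query, k=25, eps=1e-10):
    """Average-pose KNN regression under the chi-squared input metric."""
    # chi2 distance from the query descriptor to every training descriptor
    dists = np.sqrt((X_train - x_query) ** 2 / (X_train + x_query + eps)).sum(axis=1)
    nn = np.argsort(dists)[:k]
    return Y_train[nn].mean(axis=0)    # simple average of the k nearest poses
```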
Kernel ridge regression (KRR) is a simple and very reliable kernel method [19]. It
is a non-linear least-squares problem with ℓ2 regularization, i.e.
arg min_α  (1/2) Σ_j ∥ Σ_i α_i k(X_j, X_i) − Y_j ∥²_2 + λ ∥α∥²_2   (4.3)

There is also a particular appeal coming from the fact that there is a closed form solution for this problem, making it very easy to implement:

α = (K + λI)^{−1} Y   (4.4)
with K_ij = k(X_i, X_j) and Y = [Y_1 . . . Y_n]. This simple solution also reveals the weakness of the method, which is the roughly O(n³) dependency on the training set size, due to the inversion of an n × n matrix. In our experiments, as before, we choose χ² as our input metric and use the exponential map to transform the metric into a kernel, i.e. k(X_i, X_j) = exp(−β χ²(X_i, X_j)), where β is a scalar parameter. This kernel is called the exponential-χ² kernel in the literature. Prediction is done by f_{α,β}(X) = Σ_i α_i k(X, X_i).
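A compact sketch of closed-form KRR; inside the exponential I use the common un-rooted form of the χ² distance, Σ (a − b)²/(a + b), which may differ slightly from equation (4.1), and the O(n²) kernel matrix limits this to modest training set sizes:

```python
import numpy as np

def chi2_kernel(A, B, beta=1.0, eps=1e-10):
    """Exponential-chi2 kernel matrix between rows of A and rows of B."""
    num = (A[:, None, :] - B[None, :, :]) ** 2
    den = A[:, None, :] + B[None, :, :] + eps
    return np.exp(-beta * (num / den).sum(-1))

def krr_fit(X, Y, beta=1.0, lam=1e-3):
    """Closed-form KRR solution alpha = (K + lambda I)^-1 Y (equation 4.4)."""
    K = chi2_kernel(X, X, beta)
    return np.linalg.solve(K + lam * np.eye(K.shape[0]), Y)

def krr_predict(X_train, alpha, X_test, beta=1.0):
    return chi2_kernel(X_test, X_train, beta) @ alpha
```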
Linear approximations for kernel ridge regression (LinKRR) provide a more recent approach that overcomes the computational issues of KRR while maintaining most of its performance benefits. In a seminal article [28], Rahimi and Recht noticed that, using a theorem by Bochner, a certain class of kernels, which includes the ubiquitous Gaussian and Laplace kernels but also many other useful ones, can be approximated using their Fourier representations. The approximation comes as an expectation over the frequency domain
of a feature function ϕ that depends on the input:

k(X_i, X_j) ≃ ∫_ω ϕ(X_i; ω) ϕ(X_j; ω) µ(ω) dω   (4.5)
This is a very important observation because the approximation fits perfectly in a Monte Carlo integration framework. In this framework, not only do we obtain an explicit representation of the kernel that is separable in the inputs, i.e. k(X_i, X_j) ≃ Φ(X_i)⊤Φ(X_j), with Φ(X_i) = [ϕ(X_i; ω_1) . . . ϕ(X_i; ω_D)] a vector of the ϕ(X_i; ω) values and ω_1, . . . , ω_D being D samples from µ(ω), but we also have a guarantee on the quality of the kernel approximation that is independent of the learning method. Using standard duality arguments one can show that equation 4.3 is equivalent to
arg min_W  (1/2) Σ_i ∥W⊤Φ(X_i) − Y_i∥²_2 + λ ∥W∥²_2   (4.6)

This is a least squares regression model applied to non-linearly mapped data, which has a simple closed form solution

W = (Φ(X)⊤Φ(X) + λ I_D)^{−1} Φ(X)⊤ Y   (4.7)
Compared to the corresponding equation for KRR, although there is a matrix inversion in this case as well, the matrix is now D × D. The inversion is thus independent of the number of examples, making this a very appealing candidate model for large scale training sets. Constructing the matrix Φ(X)⊤Φ(X) is a linear operation in n and can be computed with little memory load in an online fashion. Note that D is a parameter of the method and allows for a trade-off between efficiency (larger D makes the inversion more demanding) and performance (larger D makes the approximation more accurate). Experiments show that, indeed, when D is large enough there is little or no performance loss for many interesting kernels. In a perfect parallel to KRR, we use the exponential-χ² kernel discussed earlier, with the approximation proposed in [25].
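As a hedged illustration of the random-feature idea, the sketch below uses Rahimi-Recht random Fourier features for a Gaussian kernel as a stand-in for the exponential-χ² approximation of [25] actually used here; names, D and gamma are illustrative:

```python
import numpy as np

def random_fourier_features(X, D=3000, gamma=1.0, rng=np.random):
    """Random features approximating k(x, y) = exp(-gamma ||x - y||^2)."""
    d = X.shape[1]
    omega = rng.normal(scale=np.sqrt(2.0 * gamma), size=(d, D))  # spectral samples
    b = rng.uniform(0.0, 2.0 * np.pi, size=D)                    # random phases
    return np.sqrt(2.0 / D) * np.cos(X @ omega + b)

def linkrr_fit(Phi, Y, lam=1e-3):
    """Ridge regression in feature space; the inverted matrix is D x D
    (equation 4.7), independent of the number of training examples."""
    D = Phi.shape[1]
    return np.linalg.solve(Phi.T @ Phi + lam * np.eye(D), Phi.T @ Y)

def linkrr_predict(W, Phi_test):
    return Phi_test @ W
```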
Kernel Dependency Estimation. Finally, as a simple structured prediction model, we use a model first described and applied in a slightly more complex pose estimation context in [22]. The model is an application of the above approximation methodology to the kernel dependency estimation (KDE) of [13]. KRR and a host of other classical regression models assume that the targets are uncorrelated. In many cases this is a reasonable assumption, and it explains the success of these models. In pose estimation, however, there are clear correlations between joint positions due to the constraints imposed by the limbs: if the elbow and the wrist of an arm were independent, knowing the position of the elbow would tell us nothing about where to find the wrist, yet the physical structure of the body ensures that the distance between the two is exactly the length of the forearm. One simple way to deal with this correlation is to find an orthogonal decomposition of the targets [42], using for instance
Kernel Principal Component Analysis (KPCA) [30]. This becomes an intermediate, low-dimensional representation of the target data, which KRR is perfectly suited to predict. An overview of the training algorithm is given in Algorithm 1.

input : X, Y - image features and poses
output: W - weights
Compute Φ(X) and Φ(Y), the approximations of the input and target kernel feature maps;
Center the data (required by KPCA): Φ0(Y) = Φ(Y) − (1/N) 1 Φ(Y);
Solve KPCA to obtain the k largest eigenvectors: V = svd(Φ0(Y), k);
Solve KRR: arg min_W ∥W⊤Φ(X) − V⊤Φ0(Y)∥^2_2 + λ∥W∥^2_2;
Algorithm 1: LinKDE training algorithm.
KPCA provides a simple methodology for recovering a good space onto which to project the data, and we can predict coordinates in this space from the input data using KRR. Unfortunately, to obtain the final prediction we need to map from KPCA space back to the original pose space, which is not trivial. For this, the literature suggests directly solving the pre-image problem [8]
\arg\min_{Y} \|W^\top \Phi(X) - V^\top \Phi_0(Y)\|_2^2 \qquad (4.8)
Intuitively, the optimization looks for the point Y in the target space whose KPCA projection is closest to the prediction given by the input regressor. It has been shown that, although in theory this is a non-linear, non-convex optimization problem, it can be solved approximately, with great success, using classical methods like gradient descent. Using the kernel approximation methodology we can not only perform a very efficient regression, by solving a problem like the LinKRR problem above albeit with different outputs, but, by applying it to the kernels over the targets, also perform KPCA very efficiently.
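A rough sketch of this pipeline (Algorithm 1 plus prediction) with explicit feature maps is given below. It is a simplified illustration, not the released implementation: the KPCA step is written as an SVD of the centred target features, and the pre-image step is replaced by a nearest-neighbour search over training poses in the KPCA space, a stand-in for the gradient-based optimisation of equation 4.8; all names are ours.

```python
import numpy as np

def linkde_train(PhiX, PhiY, k=20, lam=1e-3):
    """Algorithm 1: KPCA on the target feature map followed by ridge regression.
    PhiX: n x Dx input feature map, PhiY: n x Dy target feature map."""
    mu = PhiY.mean(axis=0)
    Phi0Y = PhiY - mu                              # centring required by KPCA
    _, _, Vt = np.linalg.svd(Phi0Y, full_matrices=False)
    V = Vt[:k].T                                   # Dy x k leading principal directions
    Z = Phi0Y @ V                                  # n x k low-dimensional targets
    W = np.linalg.solve(PhiX.T @ PhiX + lam * np.eye(PhiX.shape[1]), PhiX.T @ Z)
    return W, V, mu

def linkde_predict(PhiXtest, W, V, mu, PhiYtrain, Ytrain):
    """Approximate pre-image: return the training pose whose KPCA projection is
    closest to the regressor output (a simplification of equation 4.8)."""
    Zpred = PhiXtest @ W                           # m x k predictions in KPCA space
    Ztrain = (PhiYtrain - mu) @ V                  # n x k projections of training poses
    d2 = ((Zpred[:, None, :] - Ztrain[None, :, :]) ** 2).sum(-1)
    return Ytrain[d2.argmin(axis=1)]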
Chapter 5
Experiments
5.1 Training and Testing scenarios
Several training and testing scenarios were prepared. In the simplest one, we consider data from each subject separately (we call this the Subject Specific Model, or SSM, scenario). Our 15 motions are each captured in 2 trials, used for training and for validation, respectively. A set of 2 motions from each subject was reserved for testing. These include different poses and motions that appear in the 15 training motions (one involves sitting, the second one does not). This scenario was designed to help isolate the pose variability from the body shape and clothing variability. A second, more challenging scenario considers prediction with a model trained on a set of 7 fixed training subjects (5 for training and 2 for validation) and tested on the remaining 4 subjects, on a per motion basis (we call this the Activity Specific Model, or ASM, scenario). Finally, we consider a scenario where all motions are considered together, with the same split among subjects (we call this the General Model, or GM, scenario).
5.2 Image Descriptors
We use a pyramid of grid SIFT descriptors with 3 levels (2x2, 4x4 and 8x8) and 9
orientation bins. Variants of these features have been shown to work very well on
previous datasets (e.g. HumanEva) and we show that they are quite effective even in
this more complex setting. Since we provide both background subtraction (BS) and bounding box (BB) localizations for our subjects, we perform experiments both with features extracted over the entire bounding box and with descriptors where the BS mask is used to filter out some of the background. Our 3d pose data is mapped to the coordinate system of a virtual camera centered at (0, 0, 0) with orientation I3, and all predictions are made in that coordinate system. For the joint position output we provide, the root joint of the skeleton is always at the center of the coordinate system used for prediction. All errors are reported using the mean per joint error (MPJE) and are measured in mm.
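For concreteness, a small sketch of how the MPJE figures reported below can be computed, assuming the convention above (both skeletons expressed in the virtual camera frame, root joint at the origin, coordinates in mm); the function name is ours, not the released evaluation code.

```python
import numpy as np

def mpje(pred, gt):
    """Mean per joint error in mm. pred and gt have shape (n_frames, n_joints, 3),
    expressed in the virtual camera frame with the root joint at the origin."""
    return np.linalg.norm(pred - gt, axis=2).mean()
```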
5.3 Results
Linear Fourier methods used input embeddings based on a 10,000-dimensional random
feature map and a 4,000-dimensional output embedding. Typical running times for models with 400,000 examples, on an 8-core PC, include 16h for testing kNN models, and 1.4h for training and 1.9h for testing KRR (on a 50K training sample). For the full training set
of 2.4 million examples (where only linear approximations can be effectively applied),
training LinKRR takes 5h and testing takes about 1h. Code for all methods will be
provided with the data.
We first test the baseline methods on the simplest, SSM scenario. We notice very
little difference between background subtraction (BS) and bounding box (BB) results
with a slight edge to BB. This is not entirely surprising as in examples involving, e.g.,
sitting poses, the chair used for sitting is included in the foreground, and the image
descriptor computation is affected.
In our second scenario we test our baselines on each motion separately (the activity
specific model or ASM scenario). We noticed that errors are considerably higher both
because our test set is large and because significant subject body variation is introduced.
Our sitting down motion is by far the most challenging. It consists of subjects sitting
on the floor in different poses. This is complex to process because of a very high degree
of self-occlusion. It also stretches the use of regular image descriptors, suggesting that
while these may be reasonable for standing poses, they are not ideal for general ones.
The other ‘Sitting’ scenario in the dataset is very challenging due to the use of external
objects, in this case a chair. The ‘taking photo’ and ‘walking dog’ motions are also
difficult because they are less repeatable and more liberty was given to the actors in
performing them. Overall, we feel the dataset offers a good balance between somewhat
more standard settings and difficult or very challenging ones, making it also a reasonable
benchmark for new modeling and algorithmic developments.
Our final baseline scenario is the one where models are trained on all motions from all subjects; this is our general model (GM) setting, with results shown in table 5.3. It is encouraging that linear Fourier approximations to kernel methods can be applied to large datasets with reasonable results. The models we have tested still do not seem able to leverage the structure in all the data very effectively, showing that ample space remains for future research and better methodology, in both feature design and predictors.
method mask S1 S2 S3 S4 S5 S6 S7 S8 S9 S10 S11
KNN BB 150.4 149.3 162.8 144.7 164.6 188.1 159.8 98.8 131.1 153.7 163.3
KNN BS 177.6 144.2 152.8 154.7 166.8 191.2 158.5 81.7 144.3 155.9 169.0
KRR BB 108.6 104.3 108.9 104.4 121.9 130.2 105.8 64.3 104.6 100.1 117.9
KRR BS 117.7 102.4 112.7 112.6 124.5 128.0 111.8 64.6 108.6 106.1 126.2
LinKRR BB 120.3 116.3 121.2 119.5 135.8 144.3 120.4 76.3 118.5 112.6 129.3
LinKRR BS 130.2 117.7 128.3 132.2 139.1 144.5 130.2 77.9 122.7 122.9 141.3
LinKDE BB 105.7 103.1 111.4 105.1 113.1 134.5 104.4 64.0 103.1 97.2 114.7
LinKDE BS 110.5 97.2 110.9 109.4 119.7 127.4 105.6 61.8 103.9 99.0 118.8
Table 5.1: Results of baseline methods in the subject specific modeling (SSM) scenario
for all subjects S1 - S11, in the dataset. KNN indicates nearest neighbor (K=1), KRR
is kernel ridge regression, and LinKRR represents a linear Fourier approximation of
KRR. LinKDE is the linear Fourier model for a structured predictor based on Kernel
Dependency Estimation (KDE). Errors are given in mm using MPJE.
method mask Directions Discussion Eating Greeting Phone Posing Buying Sitting
KNN BB 183.29 200.68 151.01 207.39 191.21 198.81 217.61 231.47
KNN BS 204.82 209.81 204.44 226.23 216.73 245.36 232.12 266.84
KRR BB 132.58 147.80 135.14 154.32 157.71 144.91 185.12 179.25
KRR BS 139.59 134.54 131.83 151.06 143.64 168.47 165.19 184.99
LinKRR BB 166.33 180.49 156.94 164.60 184.00 183.66 216.51 212.19
LinKRR BS 156.32 152.01 152.52 169.14 162.81 182.53 191.86 215.99
LinKDE BB 140.20 133.83 127.73 159.06 149.10 166.16 179.88 211.31
LinKDE BS 144.96 132.78 130.47 161.31 142.25 173.04 179.07 203.73
method mask SittingDown Smoking TakingPhoto Waiting Walking WalkingDog WalkTogether
KNN BB 321.82 206.77 257.50 208.62 163.50 276.24 210.79
KNN BS 365.71 237.57 313.74 241.02 201.15 287.84 223.56
KRR BB 237.39 151.83 177.76 152.59 117.88 190.87 156.70
KRR BS 247.85 149.81 217.13 162.81 134.19 187.55 164.29
LinKRR BB 276.83 154.28 225.11 168.00 148.92 224.67 172.67
LinKRR BS 304.10 176.33 238.74 179.85 149.56 216.91 175.85
LinKDE BB 245.72 147.51 221.16 163.81 143.55 216.94 172.46
LinKDE BS 305.09 153.50 231.36 161.18 128.37 205.62 175.55
Table 5.2: Comparison of baseline methods in the activity specific setting ASM. KNN
indicates a nearest neighbor (K=1), KRR is kernel ridge regression, LinKRR is a linear
Fourier approximation of KRR, and LinKDE is the linear Fourier model for a structured
predictor based on Kernel Dependency Estimation (KDE). Errors are given in mm using
MPJE.
Joint Positions Joint Angles
BB BS BB BS
KNN KRR LinKRR KNN KRR LinKRR KNN KRR LinKRR KNN KRR LinKRR
214.1 155.4 166.1 254.0 169.4 179.2 181.8 152.3 155.7 183.9 155.2 158.6
Table 5.3: Results obtained using our approximated kernel models, trained on 2.4 million
examples and tested on 1.2 million data points. KNN is K-nearest neighbor (K=1)
where the training set was subsampled to 400,000 (for efficiency), LinKRR is a kernel
model based on random Fourier approximations trained/tested on 2.4M/1.2M. The
KRR results are indicative only, obtained on a subset of 50,000 human poses subsampled from the 2.4 million training set. While the large-scale linear approximation does not
currently improve on the full non-linear sub-sampled model, it is still encouraging that
it can use all data in training. New methods that better leverage all training data, and
have better image (input) representations should in the long run fill the accuracy gap.
Activity Specific Model (ASM)
BB BS
KNN KRR LinKRR KNN KRR LinKRR
Phoning 216.3 187.7 197.3 227.4 194.9 204.5
Posing 297.7 190.7 199.1 251.8 179.1 197.4
SittingDown 347.1 251.0 278.4 270.8 244.7 275.0
TakingPhoto 259.6 201.9 228.7 251.7 222.0 236.1
Waiting 199.3 161.4 188.2 206.0 164.8 193.2
Walking 209.7 179.8 197.5 218.7 167.2 209.8
General Model (GM)
BB BS
KNN KRR LinKRR KNN KRR LinKRR
Phoning 302.9 253.4 264.8 344.2 245.1 262.3
Posing 300.5 216.2 196.8 270.1 236.9 219.1
SittingDown 305.6 241.4 238.0 273.3 241.9 241.9
TakingPhoto 240.3 215.5 231.4 273.7 226.3 223.3
Waiting 252.0 191.6 199.4 275.5 189.0 203.5
Walking 219.0 232.6 303.9 272.8 248.5 298.7
Table 5.4: Pose estimation errors on part of our mixed-reality data, with moving cameras, real backgrounds and occlusion (see fig. 3.4). LinKRR indicates linear Fourier
approximations to kernel models. The models are trained on the data captured in the
laboratory. For ASM we use the model trained on motions of the same type as the test
motion. The results show promise but also clear scope for modeling and algorithmic
improvements in such challenging settings.
Chapter 6
Online Evaluation Tools
In order to support the development of novel pose prediction models on our newly
acquired dataset, we provide a set of online tools for easy evaluation. The website offers
users the ability to browse and download our dataset. For inspecting the data we share MATLAB visualization code, together with implementations of simple baselines, to give the community an easy way of handling the data. This includes code for data manipulation, feature extraction and some simple large scale learning methods.
On the front page, users are greeted with a description of the dataset. Several tabs
let them register on our website, contact us by email, and log in. Once
logged in, they can view and download the data files.
An important part of the website is dedicated to handling users' results on our dataset. Their 2D and 3D predictions on the test set videos can be exported into a highly compressed file using our provided code, and uploaded to the website, where submissions are automatically evaluated and scored. We hold out the test set poses in order to provide an independent, unbiased verification of the performance of the proposed methods.
In the future, the development of these online tools will continue. We hope to give the community the possibility of viewing the dataset online without needing to download any of the large videos or the visualization code.
Figure 6.1: The Register page: the header image, main tabs, and the usual user form.
For added security, users are required to fill in a CAPTCHA in order to prevent abuse
by automated programs.
Figure 6.2: A view of the dataset description: the header image, main tabs and
secondary tabs, showing the Mixed Reality content along with a short description of it.
Chapter 7
Conclusions
We have introduced a large scale dataset, Human3.6M, in which 3.6 million different poses were captured from a set of professional male and female actors. The dataset
complements existing benchmarks with a variety of human poses typical of natural
environments, and provides 2d and 3d data (including time of flight, high quality image
and motion capture data), accurate 3d human models (body scans) of the actors, and
mixed reality settings for performance evaluation under realistic backgrounds, correct
geometry, and occlusion. We also provide a set of consistency studies and evaluation
benchmarks for automatic discriminative methods, including linear and non-linear methods
and structured predictors, showing that the data is useful and that, at the same time, there is scope for improving the current methodology. The data, as well as the software tools for
visualization, prediction, and evaluation, will be made available online to the research
community.
The dataset and its framework have a primary role in supporting the progress of
new algorithms in the area of 3d human pose reconstruction and can be considered as
one step in the direction of increasing the complexity and diversity of both algorithms
and data. We hope that our proposed dataset will foster further research in broad
domains like computer vision, machine learning, interaction and visual perception and
bridge the gap towards artificial image-based 3d human sensing systems that can operate
seamlessly in natural environments.
The data we provide can be useful in training real models that can be employed in actual applications. For instance, HCI [21] received a lot of attention when the Kinect was released as an easy to use, plug-and-play device for controlling computer games; however, it has numerous problems in outdoor settings. Our data could allow the development of devices with Kinect-like capability that work with normal video cameras, in areas like human-computer interaction, security and virtual presence. In addition, our dataset of 3.6 million poses can help in tracking 3d human poses in real time more accurately, or offer new insights in previously less explored areas like medicine and health care. Human studies in sociology can also be supported by employing activity recognition.
We hope that Human3.6M will address many of the needs in the above domains, while serving as a testbed for improving current algorithms and for developing new ones.
References
[1] http://en.wikipedia.org/wiki/Bayer_filter.
[2] http://www.motionanalysisinc.com/pia100048gc.html.
[3] http://www.vicon.com/boujou/index.html.
[4] http://www.vicon.com/products/tseries.html.
[5] http://www.vicon.com/products/viconmx.html.
[6] Mesa Imaging SwissRanger SR4000, http://www.mesa-imaging.ch/prodview4k.php.
[7] Ankur Agarwal and Bill Triggs. 3D Human Pose from Silhouettes by Relevance
Vector Regression. In CVPR (2), pages 882–888, 2004.
[8] G. Bakir, J. Weston, and B. Schölkopf. Learning to find pre-images. 2004.
[9] A. Barr. Global and Local Deformations of Solid Primitives. Computer Graphics,
18:21–30, 1984.
[10] Alina Beygelzimer, Sham Kakade, and John Langford. Cover trees for nearest
neighbor. In ICML ’06: Proceedings of the 23rd international conference on
Machine learning, pages 97–104, New York, NY, USA, 2006. ACM.
[11] Lubomir D. Bourdev and Jitendra Malik. Poselets: Body part detectors trained
using 3D human pose annotations. In ICCV, pages 1365–1372, 2009.
[12] C. G. L. CMU. Human Motion Capture DataBase. Available online at http://mocap.cs.cmu.edu/search.html, 2003.
[13] C. Cortes, M. Mohri, and J. Weston. A general regression technique for learning
transductions. In ICML, pages 153–160, 2005.
[14] Navneet Dalal and Bill Triggs. Histograms of oriented gradients for human
detection. In CVPR (1), pages 886–893. IEEE Computer Society, 2005.
[15] M. Fischler and R. Elschlager. The Representation and Matching of Pictorial Structures. IEEE Transactions on Computers, C-22(1):67–92, 1973.
[16] G. Shakhnarovich, T. Darrell, and P. Indyk, editors. Nearest-Neighbor Methods in Learning and Vision: Theory and Practice. MIT Press, 2006.
[17] Ingo Graf, Ulrich Kressel, and Jürgen Franke. Polynomial Classifiers and Support
Vector Machines. In ICANN, pages 397–402, 1997.
[18] Martin Haker, Martin Böhme, Thomas Martinetz, and Erhardt Barth.
Self-Organizing Maps for Pose Estimation with a Time-of-Flight Camera. In
Dyn3D, pages 142–153, 2009.
[19] Thomas Hofmann, Bernhard Schölkopf, and Alexander J. Smola. Kernel methods
in machine learning. The Annals of Statistics, Jan 2008.
[20] N. Howe and A. Deschamps. Better Foreground Segmentation Through Graph
Cuts. Computer Vision and Pattern Recognition, 2004.
[21] Stephen S. Intille, Jason Nawyn, Beth Logan, and Gregory D. Abowd. Developing
shared home behavior datasets to advance hci and ubiquitous computing research.
In Dan R. Olsen Jr., Richard B. Arthur, Ken Hinckley, Meredith Ringel Morris,
Scott E. Hudson, and Saul Greenberg, editors, Proceedings of the 27th International
Conference on Human Factors in Computing Systems, CHI 2009, Extended
Abstracts Volume, Boston, MA, USA, April 4-9, 2009, pages 4763–4766. ACM,
2009.
[22] C. Ionescu, F. Li, and C. Sminchisescu. Latent Structured Models for Human Pose
Estimation. In IEEE International Conference on Computer Vision, November
2011.
[23] Catalin Ionescu, Dragos Papava, and Cristian Sminchisescu. Human3.6M: Large
Scale Datasets for 3D Human Sensing in Natural Environments. Technical report,
Institute of Mathematics of the Romanian Academy and University of Bonn.
[24] Hsi-Jian Lee and Zen Chen. Determination of 3D human body postures from a
single view. Computer Vision, Graphics, and Image Processing, 30(2):148–168,
1985.
[25] F. Li, G. Lebanon, and C. Sminchisescu. Chebyshev Approximations to the
Histogram χ2 Kernel. In IEEE International Conference on Computer Vision and
Pattern Recognition, 2012.
[26] R. Lukac, K. Martin, K.N. Plataniotis, B. Smolka, and A.N. Venetsanopoulos.
Bayer pattern based CFA zooming / CFA interpolation framework.
[27] G. Pons-Moll, A. Baak, J. Gall, L. Leal-Taixe, M. Mueller, H. P. Seidel, and
B. Rosenhahn. Outdoor human motion capture using inverse kinematics and von Mises-Fisher sampling. 2011.
[28] A. Rahimi and B. Recht. Random features for large-scale kernel machines. In
NIPS, 2007.
[29] Benjamin Sapp, Chris Jordan, and Ben Taskar. Adaptive pose priors for pictorial
structures. In CVPR, pages 422–429, 2010.
[30] B. Schölkopf, A. Smola, and K. Müller. Nonlinear component analysis as a kernel
eigenvalue problem. Neural Computation, 10:1299–1319, 1998.
[31] L. Sigal, A. Balan, and M. J. Black. Combined discriminative and generative
articulated pose and non-rigid shape estimation. 2007.
[32] Leonid Sigal, Alexandru O. Balan, and Michael J. Black. HumanEva: Synchronized
Video and Motion Capture Dataset and Baseline Algorithm for Evaluation
of Articulated Human Motion. International Journal of Computer Vision,
87(1-2):4–27, 2010.
[33] C. Sminchisescu and A. Jepson. Generative Modeling for Continuous Non-Linearly
Embedded Visual Inference. In International Conference on Machine Learning,
pages 759–766, Banff, 2004.
[34] C. Sminchisescu, A. Kanaujia, and D. Metaxas. BM3E: Discriminative Density
Propagation for Visual Tracking. PAMI, 2007.
[35] C. Sminchisescu and B. Triggs. Kinematic Jump Processes for Monocular 3D
Human Tracking. In IEEE International Conference on Computer Vision and
Pattern Recognition, volume 1, pages 69–76, Madison, 2003.
[36] Cristian Sminchisescu. Consistency and Coupling in Human Model Likelihoods. In
FGR, pages 27–32, 2002.
[37] Camillo J. Taylor. Reconstruction of Articulated Objects from Point
Correspondences in a Single Uncalibrated Image. Computer Vision and Image
Understanding, 80(3):349–363, 2000.
[38] Michael E. Tipping. The Relevance Vector Machine. In NIPS, pages 652–658, 1999.
[39] Michael E. Tipping. Sparse Bayesian Learning and the Relevance Vector Machine.
Journal of Machine Learning Research, 1:211–244, 2001.
[40] Paul A. Viola and Michael J. Jones. Robust real-time face detection. In ICCV,
page 747, 2001.
[41] Daniel Vlasic, Rolf Adelsberger, Giovanni Vannucci, John Barnwell, Markus Gross,
Wojciech Matusik, and Jovan Popović. Practical motion capture in everyday
surroundings. 2007.
[42] Jason Weston, Olivier Chapelle, André Elisseeff, Bernhard Schölkopf, and Vladimir Vapnik. Kernel dependency estimation. In Suzanna Becker, Sebastian Thrun, and Klaus Obermayer, editors, NIPS, pages 873–880. MIT Press, 2002.
More Related Content

PDF
Human gait recognition using preprocessing and classification techniques
PDF
Dark, Beyond Deep: A Paradigm Shift to Cognitive AI with Humanlike Common Sense
PDF
V4 n2 139
PDF
IRJET - Body Posture Guiding System
PDF
Comparative Studies for the Human Facial Expressions Recognition Techniques
PDF
Human Segmentation Using Haar-Classifier
PDF
Paper id 25201413
PDF
CLASSIFICATION OF SMART ENVIRONMENT SCENARIOS IN COMBINATION WITH A HUMANWEAR...
Human gait recognition using preprocessing and classification techniques
Dark, Beyond Deep: A Paradigm Shift to Cognitive AI with Humanlike Common Sense
V4 n2 139
IRJET - Body Posture Guiding System
Comparative Studies for the Human Facial Expressions Recognition Techniques
Human Segmentation Using Haar-Classifier
Paper id 25201413
CLASSIFICATION OF SMART ENVIRONMENT SCENARIOS IN COMBINATION WITH A HUMANWEAR...

Similar to Dragos_Papava_dissertation (20)

PPTX
seminar Islideshow.pptx
DOC
PDF
Intellectual Person Identification Using 3DMM, GPSO and Genetic Algorithm
PDF
A Methodology for Extracting Standing Human Bodies From Single Images
PDF
IRJET- Recurrent Neural Network for Human Action Recognition using Star S...
PDF
Human activity detection based on edge point movements and spatio temporal fe...
PDF
Paper id 25201433
PDF
Hand Gesture Recognition System for Human-Computer Interaction with Web-Cam
DOCX
PDF
A SURVEY ON HUMAN POSE ESTIMATION AND CLASSIFICATION
PDF
SURVEY ON VARIOUS GESTURE RECOGNITION TECHNIQUES FOR INTERFACING MACHINES BAS...
PDF
Identifier of human emotions based on convolutional neural network for assist...
PDF
hpe3d_report.pdf
DOCX
Senior Project Paper
DOC
DOJProposal7.doc
PPTX
Human Pose estimation project for computer vision
PDF
Hoip10 articulo counting people in crowded environments_univ_berlin
PDF
Analysis of Different Techniques for Person Re-Identification: An Assessment
PDF
CONSIDERATION OF HUMAN COMPUTER INTERACTION IN ROBOTIC FIELD
PDF
A COMPARATIVE STUDY ON HUMAN ACTION RECOGNITION USING MULTIPLE SKELETAL FEATU...
seminar Islideshow.pptx
Intellectual Person Identification Using 3DMM, GPSO and Genetic Algorithm
A Methodology for Extracting Standing Human Bodies From Single Images
IRJET- Recurrent Neural Network for Human Action Recognition using Star S...
Human activity detection based on edge point movements and spatio temporal fe...
Paper id 25201433
Hand Gesture Recognition System for Human-Computer Interaction with Web-Cam
A SURVEY ON HUMAN POSE ESTIMATION AND CLASSIFICATION
SURVEY ON VARIOUS GESTURE RECOGNITION TECHNIQUES FOR INTERFACING MACHINES BAS...
Identifier of human emotions based on convolutional neural network for assist...
hpe3d_report.pdf
Senior Project Paper
DOJProposal7.doc
Human Pose estimation project for computer vision
Hoip10 articulo counting people in crowded environments_univ_berlin
Analysis of Different Techniques for Person Re-Identification: An Assessment
CONSIDERATION OF HUMAN COMPUTER INTERACTION IN ROBOTIC FIELD
A COMPARATIVE STUDY ON HUMAN ACTION RECOGNITION USING MULTIPLE SKELETAL FEATU...
Ad

Dragos_Papava_dissertation

  • 1. Large Scale Datasets and Predictive Methods for 3D Human Motion Estimation Drago¸s Papav˘a Master Thesis Faculty of Automatic Control and Computers “POLITEHNICA” University of Bucharest Bucharest, Romania September 16, 2012 Advisor: Prof. Dr. Cristian Sminchisescu c⃝Copyright author
  • 3. Contents 1 Introduction 1 2 Related Work 5 2.1 Methods for detection and 2D localization . . . . . . . . . . . . . . . . . 5 3 Human Motion Capture Methodology 11 3.1 Motion capture dataset . . . . . . . . . . . . . . . . . . . . . . . . . . . 12 3.2 Experimental setting . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12 3.3 Dataset Structure . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14 3.4 Image processing pipeline . . . . . . . . . . . . . . . . . . . . . . . . . . 18 4 Discriminative Methods for Human Pose Prediction 23 4.1 Simple discriminative learning methods . . . . . . . . . . . . . . . . . . 23 5 Experiments 27 5.1 Training and Testing scenarios . . . . . . . . . . . . . . . . . . . . . . . 27 5.2 Image Descriptors . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27 5.3 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28 6 Online Evaluation Tools 31 7 Conclusions 33 References 35
  • 5. Acknowledgements First of all, I would like to thank my advisor Prof. Dr. Cristian Sminchisescu for his personal support, great patience, and an excellent way of sharing his knowledge. I also wish to thank my friend Catalin Ionescu, for his help while acquiring the database and his experimental expertise.
  • 6. Chapter 1 Introduction One of the many research directions in Computer Vision is the classic problem of detection, segmentation and pose estimation of people in images. The organization of a Computer Vision system is highly application dependent. Some systems are stand-alone applications which solve a specific measurement or detection problem, while others constitute a sub-system of a larger design. Computer Vision is closely related to the study of biological vision. The field of biological vision studies and models the processes behind visual perception in humans and other animals. Computer Vision, on the other hand, studies and describes the processes implemented in software and hardware behind artificial vision systems. Over the last century, there has been an extensive study of brain structures devoted to processing of visual stimuli in humans. One of the tasks that humans do extremely well is estimating the position or orientation of a specific object relative to the them. 3D human pose reconstruction is the problem of determining the transformation of a human in a 2D image which gives the 3D limb positions. It is possible to estimate the 3D position of a human from a single 2D photo, if its approximate 3D model is known and the corresponding joint positions in the 2D image are known. The importance of solving the pose estimation problem can be seen in many of the tasks involving interactions with humans who depend on understanding what people are doing. Being able to infer the human pose allows better analysis on a higher level, such as action recognition. This provides a wide range of applications such as video indexing or surveillance of streets and airports. The uses of 3D human pose reconstruction range from entertainment, environmental awareness to human computer interaction. Some of the difficulties of real life pose estimation are illumination changes, partial views and/or heavy occlusions, but the biggest obstacle is the large number of possible poses and the very different aspects that these take under different view points. In the monocular case, the same image evidence can be consistent with different 3D poses. This is common and very difficult to recover from.
  • 7. 2 CHAPTER 1. INTRODUCTION The contour shape of a human body is one of the most important cues to his pose, but clothing and body shape variability further complicate the inference of its pose, based on shape. Loose clothes introduce aspect variations that make analyzing pose very difficult because they hide the underlying body shape and give only minimal information about the body. The clothes texture introduces ambiguities in inferring depth discontinuities or some internal edges. Varying degrees of self-occlusion are common in real images. In monocular pose estimation, it is an obvious problem when the arms are occluded by the body and no other information exists about them. In movies, a fundamental difficulty is that many people are partially or fully occluded for long periods of time. A person tracker is then used to associate between people detections in different frames. It is known that a human representation based on parts is necessary to address the challenge of detecting partially visible people in images. Solving the human pose estimation faces a bigger problem than any of the above, namely the large number of possible poses. The difficulty rises from the combinatorial nature of the problem. Joints parameters are relatively independent from one another and only weak physical constraints or statistical correlations (i.e., some poses are much more likely to appear than others) tie them together, having some of this state space constrained. Because of this, statistical models that will try to reliably infer human pose from images will need very large amounts of training data. Some computational constraints on algorithm design are then likely to appear. Over the past 10 years the field has made significant progress fueled by the new optimization and modeling methodology, discriminative methods, feature design and standardized datasets for model training. It is now widely agreed that any successful human sensing system, be it generative, discriminative or mixed, would need a significant training component – not just measurement – in order to be successful, particularly under scene effects like monocular viewing or occlusion. Such situations do not represent infrequent extremes but are rather commonplace in real world situations. Yet these cannot be handled very well with the existing modeling and training tools. Part of the problem is that humans are highly flexible, move in complex ways against natural backgrounds, and their clothing deforms. Other issues, like occlusions, have to do with more comprehensive scene modeling beyond that of the humans involved, and they stretch the ability of the pose sensing system to exploit prior knowledge and correlations, by using the sparse visible information in order to constrain estimates of unobserved body parts. Assuming humans have been appropriately identified however, one of the key problems is insufficient data coverage in trainable systems. Existing state of the art datasets like HumanEva [32], contain about 40,000 different poses and the class of motions covered is regular and relatively small, reflecting the original design purpose of algorithm evaluation. While we want to continue to be able to offer difficult benchmarks, we also wish to collect datasets to be used in building
  • 8. 3 Figure 1.1: Example of a complex real image with multiple people in different poses (left) as well as a matching sample of actors from our dataset in similar poses (middle), and their reconstructed 3D poses (right). systems that can operate in realistic environments. The space of possible human poses is 30-dimensional or more. All this space cannot be sampled efficiently, so we put an emphasis on typical human action scenarios and larger datasets. Real world people move less regularly than assumed in many of our datasets. Consider the case of a pedestrian walking. It is not so frequent, particularly in busy urban environments, to encounter ‘perfect’ walkers. Driven by their daily tasks, people carry bags, walk with hands in their pockets and gesticulate when talking to other people or on the phone. We need to be able to cover a fair number of such scenarios and handle their composites as well. While we here focus on large datasets acquired with accurate marker based motion capture systems and moderately realistic clothing and backgrounds, other technologies for unconstrained capture based on non-invasive sensors or attached body cameras have emerged recently [41, 27]. Results are very promising with around 4,600 different poses available in [27]. As technology matures, developments along all fronts are welcome, particularly as the data we provide is complementary in terms of its choice of activities to existing datasets, e.g. [12, 32, 27]. While even through a combined community effort, we cannot hope to sample the 30+ dimensional space of all human poses very densely, an emphasis on typical scenarios and larger datasets, in line with current efforts in visual recognition, may still offer a degree of prior knowledge bootstrapping that can improve the performance of existing measurement and search based human sensing systems. The contribution of this thesis consists of a novel dataset with increased complexity and diversity that complements earlier work by several orders of magnitude. In particular, by design, we aim to cover the following aspects: Large Set of Human Poses, Diverse Activities: We collect over 3.6 million different human poses, viewed from 4 different angles with an accurate human motion capture system. The motions were executed by 11 professional actors, and cover a diverse set of everyday scenarios including conversations, eating, greeting, talking on the phone, posing, sitting, smoking, taking photos, waiting, walking in various non-typical scenarios (with hand in pocket, talking on the phone), walking a dog, or buying an item. Synchronized Modalities, 2D and 3D data, Subject Body Scans: We collect and fully synchronize both 2D and 3D data, in particular images from 4 high-speed progressive scan, high-resolution video cameras, a Time of Flight (TOF) depth sensor, as well as
  • 9. 4 CHAPTER 1. INTRODUCTION human motion capture data acquired by 10 cameras. We also provide 3D full body meshes of all subjects in the dataset, acquired with a 3D laser scanner. Evaluation Benchmarks, Complex Backgrounds, Occlusion: The dataset provides not only training, validation and testing sources for data collected in the laboratory, but also a variety of realistic mixed-reality settings where complex graphics characters have been inserted in video environments collected using real digital cameras. The insertions and occlusions are geometrically correct, based on estimates of the camera motion and 3D reconstructions of the filmed environment. Online Software and Evaluation Support Tools: We provide online software for model visualization including both joint positions and joint angle formats, background subtraction, person bounding box information, as well as feature extraction and prediction baselines including linear and kernel regressors and structured predictors. All these models are complemented with linear Fourier approximation extensions, in order to allow accurate large scale training for non-linear models. This work is based on a technical report [23].
  • 10. Chapter 2 Related Work There are many ways in which one could address the problem of extracting 3D pose data from an image or video. Learning based methods use an artificial intelligence based system which finds out the mapping from 2D image features to 3D pose transformation. In short, this means that a sufficiently large set of images of the object, in different poses, must be presented to the system during a learning phase. Once the learning phase is completed, the system should be able to present an estimate of the object’s pose given an image of the object. The accuracy of such systems is limited to situations which are represented in their database of images, however their goal is to recognize a pose, rather than determine it. While recognizing a human pose involves visually identifying a discrete action out of a range of possible ones, reconstructing a pose involves predicting its 3D configuration out of a continuous set of data (e.g., joint angles). Many people have tried to reconstruct the human pose from monocular video or by using additional information taken with specialized cameras, e.g. time-of-flight [18] or motion capture systems [32]. The multitude of methods they have created for reconstructing human poses gives a short glimpse about the difficulty of this ambition. I will account for some of the most important methods in the following sections. 2.1 Methods for detection and 2D localization Most of these methods rely on extracting low level features from images and matching them against those of the respective object to be detected. This can be done by scanning the image with different sized windows, while looking for common edges, corners, interest points or by using feature descriptors. Appearance-based methods work by searching in an image for the pattern of an object. These methods take a lot of time to run, especially if the search is done for multiple scales and orientations. Also, they give bad results if the object in the image is slightly rotated than the one trained in the model.
  • 11. 6 CHAPTER 2. RELATED WORK Feature-based methods try to find feasible matches between object features and image features. The main idea consists of finding the correspondences between a collection of image features and a subset of object features. These matches are used to find the orientation of the object inside the image frame. Further cues are then used to validate the object hypothesis in the image. Viola & Jones face detector Paul Viola and Michael Jones’ [40] detector is a fast detection framework for processing images in the scope of object detection. While it can be trained to detect different object classes, it was first motivated by the problem of face detection. The features used by the framework involve sums of pixels inside rectangular regions. These regions are composed of positive rectangles, where the values of pixels are added to the sum, and negative rectangles, where the pixels are subtracted from the sum, as represented in figure 2.1. By using an image representation called the integral image, these features can be evaluated in constant time, which gives them a great advantage in computing time over other features. Integral images is the name introduced by Viola and Jones for computer vision of summed area tables. It works by computing a matrix the same size as the image, where I(x, y) = ∑ x′ ≤x,y′ ≤y i(x ′ , y ′ ). Thus, computing the sum of pixels in the ABCD rectangle becomes Sum = I(A) + I(C) − I(B) − I(D). Figure 2.1: Feature types used by Viola and Jones In a 24x24 pixel sub-window there are more than 40,000 possible features, making it very expensive to evaluate them all. The object detection framework trains classifiers using the AdaBoost learning algorithm to select the best features to use and to train classifiers on them. These are used in a linear cascaded architecture where each successive classifier is run only when the previous ones don’t reject the sub-window under inspection. They are ordered by complexity such that the most complex classifiers are ran last, thus making possible real-time detection rates. Global detectors based on HoG Histogram of Oriented Gradients (HoG) are image feature descriptors commonly used for object detection in the field of computer vision. They were first introduced by Navneet Dalal and Bill Triggs in their 2005 CVPR paper [14]. Originally, their algorithm was
  • 12. 2.1. METHODS FOR DETECTION AND 2D LOCALIZATION 7 used for people detection in static images, but it was later extended to human detection in videos, as well as to other classes of objects in images and film. Figure 2.2: The HoG feature. Image is partitioned into blocks, and for each block we compute the histogram of gradient orientations. Pictured on the right is the distribution of gradients for each block. The essential idea of the HoG descriptor is that the local distribution of gradients is a good approximation to an object’s shape. The image is divided into small regions, and for each region a histogram of gradient directions is computed, as seen in figure 2.2. Most of the time, the color values of the image are block-normalized for added invariance to changes in illumination or shadowing. A common method of obtaining the image gradients is by filtering the color or intensity values of the pixels by the [−1, 0, 1] and [−1, 0, 1]T kernels. Dalal and Triggs tested other techniques such as applying the 3x3 Sobel operator or pre-smoothing the image before applying the derivative kernels, but these methods provided poorer performance in their experiments. Methods of localization - Pictorial Structures Pictorial Structures (PS) are extensively used for part-based recognition of scenes, people, animals and multi-part objects. Usually, the structure and parameterization of the model is restricted, by assuming tree dependency structure between its parts. Part based models for recognition of articulated objects were proposed long ago by Fischler and Elschlager [15]. B. Sapp, C. Jordan and B. Taskar [29] proposed a semi-parametric approach that combines the easy pictorial structure inference with the flexibility of non-parametric methods (i.e., skeleton stuctures). They represent an object as a collection of distinctive parts with geometric relationships between them. The model characterizes local visual properties of object parts and posits spring-like connections between pairs of parts, which express variability of part locations. It determines an object match in an image by selecting part locations that minimize appearance matching costs and deformation costs for pairs of connected parts. The basic Pictorial Structures model consists of a graph where its nodes represent object parts and the edges between parts encode pairwise geometric relationships. For human modeling, the PS model decomposes as a tree structure into unary potentials
  • 13. 8 CHAPTER 2. RELATED WORK (also referred to as appearance terms) and pairwise potentials between pairs of physically connected parts, as pictured in figure 2.3. Figure 2.3: The Pictorial Structures model. Objects are decomposed into parts and spatial relations among parts. Reprinted from [15]. Local detectors - Poselets Poselets build upon the Pictorial Structures idea, by decoupling the tree-like structure from the model. This enables for a data-driven search procedure that finds poselets that are tightly clustered in both 3D joint configuration space as well as 2D image appearance. To permit this, L. Bourdev and J. Malik [11] have created a dataset, H3D, of annotations of humans in 2D photographs with 3D joint information, inferred using anthropometric constraints. In order to reconstruct the 3D pose from an image, a set of poselets (linear support vector machines [17]) scans the input image at multiple scales and votes for the location of torso bounds or body keypoints. The probability that a keypoint O is detected at position x is simply P(O|x) ∑ i wiai(x), where ai(x) is the score that a poselet classifier assigns to location x and wi is the weight of the poselet. The weights w account for the fact that some poselets are more discriminative than others. Compared to other methods, poselets do very well at detecting the visible portions of a person in the presence of severe occlusion. Using the annotated keypoint locations, the expected image locations of a set of keypoints can be conditionally detected on the locations of other keypoints. Skeletal models Skeletal reconstruction methods recover an interpretation tree of possible 3D joint positions, based on user-specified image joint positions [24, 37]. Lee and Chen [24] attempt to prune their perspective interpretation tree using physical reasoning, while Taylor [37] relies on additional user input to specify plausible relative joint-centre depths for his affine one. Although these methods do incorporate the forward-backward
  • 14. 2.1. METHODS FOR DETECTION AND 2D LOCALIZATION 9 Figure 2.4: The flipping ambiguities of the forearm and hand under monocular perspective. A single frontal view of the subject is barely usable for inferring the real 3D pose from its 2D projection, while these side views show great differences. Reprinted from [35]. flipping ambiguity, they can not reconstruct skeletal joint angles, and this makes them inappropriate for tracking applications. Sminchisescu and Triggs [35] used a 3D body model that consists of a kinematic ‘skeleton’ of articulated joints controlled by angular joint parameters. Each limb has attached ellipsoids to approximate the surface of the human body [9]. Each configuration of the skeletal kinematic tree has an associated interpretation tree of the assigned 3D skeletal configurations that can be obtained from the given one by forwards/backwards flips. The tree all configurations that are image consistent in the sense that their joint centres have the same image projections as the given one (see figure 2.4). Their algorithm generates random samples in the joint space and optimizes them using nonlinear local optimization, respecting joint constraints. This is a fast method for investigating alternative limb rotations, as choosing the wrong limb configuration usually leads to limb mistracking. Space embeddings for dimensional reduction In this paper - [33] - the authors presented the difficulty of optimizing the 3D human motion estimation over high-dimensional state spaces. The typical human skeleton used in computer vision consists of 15 to 20 joints, thus having a 45 to 60-dimensional representation space for it. Given some constraints - like knowing that the person is upright, or simply some relations between joints - greatly enforces some restrictions on those affected dimensions. This means that there is a lower-dimensional subspace contained in the original 60-dimensional one, that retains a low representation error of the skeleton. Transforming the data into a space of fewer dimension can be done linearly, as in principal component analysis, or using some techniques for a nonlinear dimensionality reduction, as this paper proposed. The authors compared learning in the original space and in the lower-dimensional space reduced by Laplacian eigenmaps
  • 15. 10 CHAPTER 2. RELATED WORK and reported joint angle errors of 2◦ mean and 35.6◦ maximum error for the first case and a 1.4◦ mean and 4.3◦ maximum error using their method. Generative models based on silhouettes Agarwal and Triggs [7] described a learning based method for recovering 3D human body pose from single images and monocular image sequences without using an explicit body model nor prior labeling of body parts in the image. Instead, they recover 3D poses by direct nonlinear regression against shape descriptor vectors extracted automatically from image silhouettes. For robustness against local silhouette segmentation errors, the silhouette shapes are encoded by histogram-of-shape-contexts descriptors. Their regressor is made of a Relevance Vector Machine (RVM) [38, 39], a sparse Bayesian approach to classification and regression. Silhouettes contain enough information for human pose recognition and RVM is a good regression model which automatically chooses the relevant input data from the training set and discards the others. Matching silhouettes is then reduced to matching shape context distributions, a comparison of 100-D histograms. Histograms are built by allowing context vectors to vote softly into the few centres nearest to them, with Gaussian weights. This histogram-of-shape-contexts scheme gives some degree of robustness to occlusions and local silhouette segmentation failures. The RVM regressor would retain only 6% of its training data, thus giving a large effective reduction in storage space compared to other methods. As Agarwal and Triggs [7] are stating, and Sminchisescu [36] discussed earlier, silhouettes do not entirely constrain pose estimations. High estimation errors occur when similar silhouettes arise from very different poses, so that the regressor will output a compromise solution.
  • 16. Chapter 3 Human Motion Capture Methodology In order to reconstruct the 3D pose of a human being given 2D images, there is a need for spatial 3D information to be used when training a model (i.e., there is no way in which 3D poses could be inferred just from simple 2D images). In 3D space, the human joint positions are highly correlated (e.g., the distance between the shoulders is roughly the same regardless of orientation), unlike the 2D space where the positions of ankles could be near the head for a person that is lying in bed. Current 3D annotated databases contain poorly labeled body parts or joint positions, or are incomplete. This gives rise to the need for a custom dataset. To create a custom dataset for my experiments, I am using a Vicon motion capture system [5] consisting of 4 DV video cameras [2], 10 motion capture MX cameras [4] and one time-of-flight camera [6], all being synchronized to give solid inter-camera coordination. This system is calibrated, so that each camera lies in a common coordinate system, allowing for reliable localization of 3D human joint positions in captured videos. The 10 motion capture cameras work by recognizing spherical markers glued to human body or clothes, by generating rays for each detected marker and then by calculating the points in space where rays from multiple cameras intersect, thus obtaining a point cloud of 3D marker positions. We do not really need the 3D marker positions, but the 3D human joint positions. For this to happen, I am running a kinematic fitting algorithm - provided by Vicon Nexus software, and specifically tuned for each captured person - that outputs the desired person’s skeleton (i.e., 3D joint positions). Reprojecting the captured 3D joint positions to the captured videos required a pipeline of operations to be set up. At first, there was the need for extracting the kinematic fit from Vicon’s software, which did not offer this possibility. I managed to stream the data into Autodesk MotionBuilder and export the animated skeleton from there. Projecting the 3D joint positions to videos in Matlab required many coordinate system changes, as each of Vicon, MotionBuilder and Matlab uses a different one (e.g.
  • 17. 12 CHAPTER 3. HUMAN MOTION CAPTURE METHODOLOGY Y-up, Z-up, left-handed, right-handed). Knowing the location of each body joint in the captured images and its actual 3D position allows for training a model that, being given 2D images of people, would extract the desired 3D pose information. Our Vicon system doesn’t allow for reliable outdoor capturing. Because of this, we will extend our dataset by synthetically inserting rendered characters into real images. I have written a Python script for MotionBuilder that takes the recorded animated skeleton and transfers its motion to a character (a realistic fully clothed male or female) with an operation called retargeting. This character is then exported from MotionBuilder and imported into Autodesk 3ds Max where a 3ds Max script (written in MAXScript) takes it, prepares it and then renders it over the video background. Simply rendering the characters over the videos would not be sufficient, as we needed a moving camera, because the videos would not be static. This is where Vicon boujou [3] came into help. It allowed the estimation of both the intrinsic and extrinsic parameters of the video camera. It is able to track corners and some custom points into filmed sequences, as well as to reconstruct their 3D position. This allows for a good scene geometry approximation, and a natural feel of the character placement and rendering results. 3.1 Motion capture dataset HumanEva [32] is the state-of-the-art motion capture (mocap) dataset at this moment and contains only 5 motions (walking, jogging, throwing and catching, boxing and gestures) not nearly enough for covering the real poses encountered in day to day life especially since these are all the idealized versions of the motions. I want to show that using a combination of motion capture and sprite insertion, a dataset can be generated that is large enough and good enough for getting ahead with pose estimation in real images (not just from constrained laboratories). Using the time-of-flight camera allows for capturing 3D motion sensor data synchronized with the mocap data. This would allow prediction of more than just joint angles (e.g. surfaces). Furthermore, it would allow prediction of poses for people using loose clothes. Small occlusion data is beneficial. For instance, in the real world, different objects could be placed in the field of view of the video camera. People sitting at a table cannot avoid the occlusion of their lower body. We took advantage of the Vicon system and positioned the MX cameras so that they would record as many markers as possible. 3.2 Experimental setting We have gathered this data using our state-of-the-art Vicon motion capture system. The system relies on reflective markers and works by tracking these markers over time. Tracking maintains the label identity and propagates this information though time from
  • 18. 3.2. EXPERIMENTAL SETTING 13 Figure 3.1: A schematic view of our experiment setup with the capture surface and the placement of the 4 DV cameras and our TOF sensor. 10 other high-speed motion cameras are rigged on the walls, to maximize capture volume. The 3d laser scanning system used to capture the body models of actors is also shown in the right image. MX System MX model Vicon T40 MX cameras No. 10 MX resolution 4 Megapixels MX freq. 200Hz DV System DV model Basler piA1000-60gc DV cameras No. 4 DV resolution 1000x1000 DV freq. 50Hz DV sync hardware TOF System TOF model Mesa Imaging SR4000 TOF No. 1 TOF resolution 176x144 TOF freq. 25Hz TOF sync Software sync Table 3.1: The motion capture system specifications. Motion capture cameras are denoted with MX, the video data with DV and time-of-flight with TOF. Figure 3.2: A sample of the data provided in our dataset. Aside from the image and pose data we provide the time of flight camera, with example range data shown in the middle image, a mesh of the subject shown next to it. an initial pose which is easily labeled (either manually or automatically). A fitting process uses the position and identity of these labels, as well as proprietary human motion models to infer accurate pose parameters. The laboratory setup is represented in figure 3.1. It comprises of a 6m by 5m capture area, and within it a 4m by 3m effective capture space where all the subjects were fully
  • 19. 14 CHAPTER 3. HUMAN MOTION CAPTURE METHODOLOGY visible in all cameras. The placement of the 4 digital video (DV) cameras as well as the time-of-flight sensor (TOF) are shown in the figure. A set of 10 motion capture (MX) cameras are placed on the walls around the capture surface such to maximize the effective experiment volume. More detailed specifications of the system are given in table 3.1. A Human Solutions Scanworks 2.9 3D laser body scanner was used to obtain accurate volumetric information about the subjects. 3.3 Dataset Structure In this section we discuss the choice of poses in the dataset, the data types provided as well as the image processing and the input annotations that are pre-computed. Actors The experiments were performed by 11 subjects (5 female and 6 male) chosen to span a body mass index (BMI) from 17 to 29. This provides a moderate amount of body shape variability as well as different ranges of mobility. This variability is not available in previous, smaller datasets and reflects more accurately real world people. Volumetric information was gathered on each of the actors, and this may be used to obtain more detailed 3D information about the subjects than joint positions alone. This data can be used to also evaluate human body shape estimation algorithms [31]. The subjects wore regular clothing (as opposed to special motion capture costumes) to maintain as much of the realism as possible. The motion capture technology requires the presence of reflective markers which compromises somewhat the natural appearance of the actors. We chose a larger set of markers because the more complicated poses would create partial occlusions on the markers, thus making the kinematic fitting very inaccurate compared to the minimal set of markers. Using our own more accurate image calibration, we can compute masks based on the 3D marker positions which can be used to cover the existing markers thus limiting their negative impact on image features. The actors were given detailed tasks with examples in order to help them plan a stable set of poses between repetitions for the creation of training, validation and test sets. In the execution of these tasks the actors were however given quite a bit of freedom in moving naturally over a strict, rigid interpretation of the tasks. Dataset details The dataset contains 910K frames with 4 cameras or almost 3.6 million images in the monocular (before mirroring which would raise it up to more than 7 million frames). Several training and testing scenarios were prepared. We fix training and testing subjects and motions to make quantitative comparisons possible. In the simplest scenario we
Figure 3.3: A sample of different scenarios performed by 6 of the actors. From left to right, top to bottom: Discussion, Talking on the phone, Waiting, Sitting on a chair, Posing, Purchases.

consider training on each individual subject and testing on the corresponding test set of that subject. This helps isolate pose variability from body shape and clothing variability. A more challenging task is prediction with a model trained on the set of training subjects and tested on the subjects from the test set. We use 7 subjects (3 female and 4 male) for training and validation, and 4 subjects (2 female and 2 male) for testing. The structure of the dataset is presented in table 3.2. A 3D body scan of all our subjects gives an accurate volumetric model which is provided as additional data. This can be used to more accurately model the 3D geometry of the subjects. The mesh, skinned to a skeleton, will be released as part of the dataset.

Annotation, Image Processing, Silhouettes and Person Bounding Boxes

We provide accurate pixel-wise foreground-background segmentation for all images, obtained using background models. We train our models as Gaussian distributions in each of the RGB
and HSV color channels, as well as the image gradient in each RGB channel (3+3+2x3 = 12 channels in total). We use these background models in a graph cut to obtain the final pixel labelings. The weights between the input features of the graph cut are learned by optimizing a measure of pixel segmentation accuracy against manually labeled ground truth on a small subset of images sampled from the videos. The segmentation measure we used was the standard segment overlap (intersection over union). We use the Nelder-Mead simplex algorithm for optimization because the objective is non-smooth.

Scenarios                                                           Train      Validation  Test
Upper body movement (Directions, Discussion)                        254,008    193,424     245,412
Full body upright variations (Greeting, Posing, Purchases,
  TakingPhoto, Waiting)                                             335,192    325,484     312,488
Walking variations (Walking, WalkDog, WalkTogether)                 210,364    181,612     196,324
Variations while seated on a chair (Eating, TalkPhone, Sitting,
  Smoking)                                                          330,940    322,840     376,792
Sitting down on the floor (SittingDown)                             142,896    88,400      124,612
Total                                                               1,273,400  1,111,760   1,255,628

Table 3.2: A breakdown of the number of 3d human poses by motion type for training, validation and testing. We use 7 subjects for training and validation (3 female and 4 male) and 4 subjects for testing (2 female and 2 male). Notice that in practice, due to our combined system sampling rate (table 3.1), we have 3.6 million 3d poses as well as 3.6 million pairs of synchronized images with 3d pose ground truth.

Since some algorithms still work with bounding box annotations, we also provide ground truth bounding box annotations. This data was obtained by projecting the skeleton into the image and fitting a rectangular box around the projection.

A separate camera calibration procedure was performed to improve the accuracy of the default one provided by the Vicon system. This is necessary because the camera distortion parameters are not estimated by the default Vicon calibration procedure. This data is also provided with the release of the dataset. It was obtained by positioning a large number of reflective markers (about 30) on the capture surface and manually labeling them in each of the cameras with subpixel accuracy. Camera models with radial distortion parameters were fitted, which improved calibration significantly.

Joint Positions and Joint Angle Skeleton Representations

Common pose parametrizations considered in the literature include relative 3D joint positions (R3DJP) and the kinematic representation (KR). Our dataset provides data in both parametrizations, with full skeletons containing the same 32 joints in both cases. In the first case (R3DJP), the joint positions in 3D space are provided. They are obtained from the joint angles provided by the Vicon skeleton fitting procedure by applying forward kinematics on the subject skeleton. R3DJP is challenging because it is very hard to estimate the size of the person. This problem is obviated in practice by providing the same skeleton (limb length) information for all subjects, including those in testing, if needed. The parametrization is called relative because there is a joint, usually
called the root joint (roughly corresponding to the position of the human pelvis), which is taken as the center of the prediction coordinate system, and the other joints are estimated relative to it. The kinematic representation (KR) considers the relative joint angles between limbs and is more convenient because it is invariant to both scale and body proportions. The dependencies between variables are, however, much more complex, making estimation more difficult. The process of obtaining joint angle values involves a complex constrained non-linear optimization which is not guaranteed to reach a global optimum and can sometimes introduce errors in the data. We devoted significant effort to ensure that the data is clean and the fitting process is accurate. Outputs were visually inspected multiple times, during different processing phases, to ensure accuracy. These representations can be directly used in independent monocular predictions or in multi-camera settings. The monocular prediction dataset can be increased 4-fold by globally rotating and translating the pose coordinates so as to bring the 4 DV cameras into a single coordinate system (code is provided for this data manipulation). As seen in table 3.1, poses are available at a four-fold faster rate than images from the DV cameras. The code provided can also be used to double both image and pose data by considering their mirror-symmetric versions.

Additional Mixed Reality Test Data

Besides the laboratory test sets, we also focused on providing test data that covers variations in clothing and complex backgrounds, as well as camera motion and occlusion. We are not aware of any setting of this level of difficulty in the literature. Real images contain people in complex poses, but the diverse backgrounds as well as the scene illumination and occlusions can vary independently and represent important nuisance factors that vision systems should be robust against. Although approaches to handle such cases exist in principle in the literature, it is still difficult to annotate real images. Our dataset contains a section that has been especially designed to address such issues. This is by no means the only possible realistic testing scenario – different sport motions or backgrounds appear in [27, 41]. We create movies by inserting high quality 3D rigged animation models into real videos, to obtain a realistic and complex background, good quality image data and very accurate 3d pose information. The mixed reality movies were created by inserting and rendering 3D models of a fully clothed man and woman in real videos. The poses used for animating the models were extracted directly from our laboratory test set. The Euler ZXY joint angles we extracted were used to create BVH files whose limb lengths were matched automatically to the models. This was necessary for the next step, where we retargeted the captured motion of the BVH data to one of the models' skeletons using Autodesk MotionBuilder. The actual insertion required solving for the camera motion of the background videos, as well as the camera's internal parameters, for good quality rendering. This was achieved using camera matchmoving software (Boujou). The exported camera
Figure 3.4: A sample of images from our mixed reality test set. The dataset is quite challenging due to the complexity of backgrounds, viewpoints, diverse subject poses, camera motion and occlusion. See our supplementary material for video.

tracks as well as the model were then imported into Autodesk 3ds Max, where the actual rendering was performed. The scene was set up and rendered using the mental ray (raytracing) renderer, with several well-placed area lights and skylights. To improve quality, we placed a transparent plane on the ground to receive shadows. Scenes with occlusion were also created. The dataset contains 5 different dynamic backgrounds obtained with a moving camera, for a total of 10,350 examples, out of which 1,270 frames contain various degrees of occlusion. A sample of the images created is shown in figure 3.4.

Additional information

A 3D scan of our subjects provides an accurate volumetric model which is given as additional data. This can be used to more accurately model the 3D geometry of the subjects. We are currently considering skinning this mesh to a skeleton and providing it as part of the dataset. A low resolution Time-of-Flight camera synchronized with the rest of our system is also available. Though its data is not very accurately calibrated with the rest of our system, a rough calibration will be provided to interested users.

3.4 Image processing pipeline

Color reconstruction from a color filter array

While testing the Vicon system, I discovered that the DV cameras use the Bayer color encoding [1] of a color filter array (CFA). While this allows for reduced bandwidth
Figure 3.5: Standard deviation of noise before (red) and after (green) filtering. The filtered images are more stable.

between the cameras and the computer, great amounts of structured noise are generated, as the naïve full color reproduction implemented by Vicon simply interpolates the neighboring noisy pixels in each color plane. To reduce this noise, I used a temporal smoothing filter over the captured raw Bayer images:

$$I'_t = \frac{w_{t-1} I_{t-1} + I_t + w_{t+1} I_{t+1}}{w_{t-1} + 1 + w_{t+1}}, \qquad (3.1)$$

where

$$w_{t-1} = e^{-k (I_{t-1} - I_t)} \qquad (3.2)$$
$$w_{t+1} = e^{-k (I_{t+1} - I_t)} \qquad (3.3)$$

and $I_{t-1}$ is the previous image, $I_t$ is the current image, and $I_{t+1}$ is the next image, seen either as matrices or as individual pixels. The weights $w$ account for differences between successive frames and the constant $k$ controls the amount of filtering that is applied. The resulting images show great improvements: the background contains considerably less noise, and the edges between the moving objects and the background are well preserved and not blurred, as plotted in figure 3.5.

For the reconstruction of full color data from the filtered grayscale Bayer pattern (fig. 3.6), I implemented the CFA interpolation framework of Lukac et al. [26] (fig. 3.7). Based on the Bayer CFA pattern structure, only one color component is available at each spatial location. Assuming that $x_{(i,j)k}$ is the pixel value available for the $k$th color plane at the $(i, j)$ pixel position, the missing G components $x_{(r,s)2}$ are generated
Figure 3.6: (Left): Bayer encoding scheme with alternating colors over the image pixels. (Middle): Raw grayscale image from one camera, showing the CFA pattern. (Right): Reconstructed color image. Note that the navy blue wires coming down from the cameras appear less than a pixel thick, making it very hard for the algorithm to avoid a false color effect on them. The same problem sometimes occurs on the sharp edges of objects, but it has no impact on the subsequent processing steps.

using a weighted sum of the surrounding original G components as follows:

$$x_{(r,s)2} = \sum_{(i,j)\in\zeta} w_{(i,j)}\, x_{(i,j)2}, \qquad (3.4)$$

where $(r, s)$ is the location at the centre of the diamond-shaped structure of the four original G components, described as $\zeta = \{(r-1, s), (r, s-1), (r, s+1), (r+1, s)\}$. The normalized weights are generated via $w_{(i,j)} = u_{(i,j)} / \sum_{(g,h)\in\zeta} u_{(g,h)}$, and the positive edge-sensing coefficients $u_{(g,h)}$ are defined as follows:

$$u_{(g,h)} = \frac{1}{1 + \sum_{(e,f)\in\zeta} |x_{(g,h)2} - x_{(e,f)2}|} \qquad (3.5)$$

The subsequent steps require interpolating the R ($k = 1$) and B ($k = 3$) color planes, keeping in mind that they are subsampled twice as sparsely (fig. 3.6, left, shows how the green component retains half of the original spatial frequency, while the red and blue components retain only 1/4 of the original). The interpolation of each of these two color components, reconstructing as much as possible of the original image, is given by

$$x_{(r,s)k} = x_{(r,s)2} \sum_{(i,j)\in\zeta} w_{(i,j)}\, \frac{x_{(i,j)k}}{x_{(i,j)2}}, \qquad (3.6)$$

where $w_{(i,j)}$ are the weights (3.5) obtained for $k = 1$ and $k = 3$ and then normalized. The color ratios in the interpolator of (3.6) are generated using the G components $x_{(i,j)2}$ at the same spatial positions as the R or B components $x_{(i,j)k}$. The square-shaped structure $\zeta = \{(r-1, s-1), (r-1, s+1), (r+1, s-1), (r+1, s+1)\}$ is used for interpolating the R and B components (fig. 3.7b), while the color ratios account for the missing edges due to subsampling when doubling the color spatial frequency.
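To make these two pre-processing steps concrete, the sketch below (Python/NumPy, not part of the released code) implements the temporal Bayer smoothing of (3.1)-(3.3) and the diamond-shaped, edge-sensing green-channel interpolation of (3.4)-(3.5). The filtering constant k, the absolute value in the weight exponent, and the boolean green_mask describing the CFA layout are illustrative assumptions; the final diamond-shaped refinement of the R and B planes, described next, is omitted for brevity.

```python
import numpy as np

def temporal_bayer_filter(prev, cur, nxt, k=0.05):
    # Temporal smoothing of raw Bayer frames, eqs. (3.1)-(3.3).
    # prev, cur, nxt: float arrays of identical shape. The absolute value in
    # the exponent (and the value of k) are assumptions made here for symmetry.
    w_prev = np.exp(-k * np.abs(prev - cur))
    w_next = np.exp(-k * np.abs(nxt - cur))
    return (w_prev * prev + cur + w_next * nxt) / (w_prev + 1.0 + w_next)

def interpolate_green(bayer, green_mask):
    # Edge-sensing interpolation of missing G values, eqs. (3.4)-(3.5).
    # bayer: filtered raw image; green_mask: True where G was actually sampled.
    # Border pixels are left untouched to keep the sketch short.
    H, W = bayer.shape
    G = np.where(green_mask, bayer, 0.0).astype(float)
    out = G.copy()
    for r in range(1, H - 1):
        for c in range(1, W - 1):
            if green_mask[r, c]:
                continue
            zeta = [(r - 1, c), (r, c - 1), (r, c + 1), (r + 1, c)]
            vals = np.array([G[i, j] for (i, j) in zeta])
            # u is small where local green differences are large, so the
            # interpolation does not average across edges (eq. 3.5).
            u = 1.0 / (1.0 + np.abs(vals[:, None] - vals[None, :]).sum(axis=1))
            w = u / u.sum()
            out[r, c] = (w * vals).sum()  # eq. (3.4)
    return out
```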
Figure 3.7: Reconstruction steps of color data from the CFA pattern. (a) Diamond-shaped interpolation of the green channel. (b) Square-shaped interpolation of the red and blue channels. (c) Diamond-shaped interpolation of the red channel. (d) Diamond-shaped interpolation of the blue channel.

The remaining missing R and B components are generated using (3.6) with a diamond-shaped structure, as shown in fig. 3.7c,d.

Background Subtraction

Figure 3.8 shows the result of the next step of my pipeline, the background subtraction. At first, I used only the RGB channels, computing the mean and standard deviation of each pixel over the background video and testing the foreground video against this distribution. This was not good enough, so I switched to the HSV space, which improved the result. The main problem with training the background model in HSV space came from the H channel, which is circular, so simply averaging it is not mathematically correct. I therefore trained sin(H) and cos(H) separately, which worked a little better, and multiplied their responses back together to obtain the response for hue. The idea is that applying more filters to the image to test whether a pixel belongs to the background, and combining their results in a way that reinforces their responses, gives a better result than using only one filter. Currently, after further development, the algorithm fits 12 Gaussian distributions per pixel over 2500 frames of pure background video, using the following filters: the RGB and HSV color spaces, and image gradients over the RGB color space. To classify a pixel as foreground/background, the same series of filters is computed and tested against the background model. The filter responses are weighted and then summed, giving a single scalar value per pixel where 0 represents a 100% chance of background and larger values indicate foreground:

$$F(w) = \sum_{i\in\Psi} w_i \, \frac{X_i - \mu_i}{\sigma_i}, \qquad (3.7)$$

where $\Psi = \{R, G, B, H, S, V, \nabla_x(R), \nabla_y(R), \nabla_x(G), \nabla_y(G), \nabla_x(B), \nabla_y(B)\}$, $X_i$ is the output of pixel filter $i$, and $\mu$ and $\sigma$ are the background's mean value and standard
Figure 3.8: (Left): Unary potential scaled to view range. (Middle): Thresholded image. (Right): Result using graph cuts.

deviation of noise. The weights $w_i$ are learned against ground truth masks, GT, by maximizing the segmentation overlap:

$$w^* = \arg\max_w \frac{\mathrm{Area}(F(w) \cap GT)}{\mathrm{Area}(F(w) \cup GT)} \qquad (3.8)$$

In order to reliably threshold this image, I use a graph cut model, formulated as an energy minimization problem where the unary potential is (3.7) and the pairwise potential is the coherence between neighboring pixels. Pixels are defined as neighbors if they are adjacent either horizontally or vertically. This way, the uncertainties of the background/foreground classifier are greatly reduced thanks to the smoothing term introduced by the second order potential, as seen in figure 3.8. For the implementation, I used N. Howe's [20] optimized C/C++ code, provided by the author. Because an actor casts many shadows on the ground or walls, the algorithm takes care of them by detecting areas where the hue and saturation are approximately the same as in the background model but the pixel luminance drops. Over those areas, I used a soft thresholding approach to implement shadow removal. The final step of the algorithm retains only the largest connected component, in order to remove scattered noise. The second image of figure 3.2 shows the results of background subtraction. An improvement to the above method would be to use the reprojections of joints and bones into the image, either by forcing the current background subtraction algorithm to treat the foreground that results at the joint and bone locations as such, or simply by not discarding the segmented areas that contain a joint or a bone when keeping only the largest connected component.
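As an illustration of how the unary term and the learned weights fit together, the sketch below (Python/NumPy with SciPy, not the code actually used in the pipeline) computes the per-pixel score of (3.7) and fits the weights with Nelder-Mead by maximizing the overlap of (3.8). For brevity it replaces the graph-cut smoothing with a plain threshold and assumes a single annotated training image; the function and parameter names are illustrative.

```python
import numpy as np
from scipy.optimize import minimize

def unary_potential(features, bg_mean, bg_std, w):
    # Per-pixel foreground score, eq. (3.7): weighted sum of z-scored filter
    # responses. features: (H, W, 12) for the 12 channels in Psi;
    # bg_mean/bg_std: per-pixel background statistics of the same shape.
    z = (features - bg_mean) / bg_std
    return np.tensordot(z, w, axes=([2], [0]))  # shape (H, W)

def segmentation_overlap(pred_mask, gt_mask):
    # Intersection-over-union used as the training objective, eq. (3.8).
    inter = np.logical_and(pred_mask, gt_mask).sum()
    union = np.logical_or(pred_mask, gt_mask).sum()
    return inter / max(union, 1)

def fit_weights(features, bg_mean, bg_std, gt_mask, thresh=0.5):
    # Learn the filter weights with Nelder-Mead, since the IoU objective is
    # non-smooth. The graph-cut step is replaced by a simple threshold here,
    # so this is only an illustrative simplification.
    def neg_overlap(w):
        mask = unary_potential(features, bg_mean, bg_std, w) > thresh
        return -segmentation_overlap(mask, gt_mask)
    w0 = np.ones(features.shape[2]) / features.shape[2]
    res = minimize(neg_overlap, w0, method="Nelder-Mead")
    return res.x
```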
Chapter 4

Discriminative Methods for Human Pose Prediction

We provide several simple discriminative learning methods with our dataset and test them in different situations to better understand the complexity of the data and to establish a set of baselines for other methods to compare against. We adopt a fully automatic discriminative framework for simplicity and efficiency, in view of the amount of data we are analyzing.

4.1 Simple discriminative learning methods

In a discriminative framework, the estimation problem is framed as that of learning a mapping (or an index) from image descriptors, extracted on the person silhouette or its bounding box, to the pose in either the joint position or the joint angle representation. If $X_i$ is the image descriptor data and $Y_i$ the pose representation for frame $i$, and $f$ or $f_W$ is this mapping, where $W$ are the parameters of the function, then the goal of learning is to obtain a mapping with $f_W(X) = Y$, with $X$ and $Y$ being a pair of image descriptor and pose not seen in training. Here we perform some experiments with different methods on our dataset. The methods we considered are: k-nearest neighbor (KNN), linear and kernel ridge regression (LinKRR, KRR), as well as a structured prediction method based on kernel dependency estimation (KDE) [13, 34].

k-Nearest neighbor regression (KNN) is probably the simplest yet very effective method for learning $f$ and is still relatively popular in computer vision [16]. For this method, training consists of storing the training examples or, in our case, a subset of them. Depending on the distance function, an intermediate data structure, typically a KD-tree or, more recently, cover trees [10], is constructed in the training phase to help speed up the subsequent inference. These data structures, however, are dependent on the input metric and offer an efficiency payoff only in problems with low input dimensionality. In our case we use χ2, a metric for histograms, that is known to perform very well on
image data. If $X = [x_1 \ldots x_d]$ and $Y = [y_1 \ldots y_d]$ are two vectors then the χ2 distance is defined as

$$\chi^2(X, Y) = \frac{1}{d} \sum_l \frac{(x_l - y_l)^2}{x_l + y_l} \qquad (4.1)$$

Since we have a very large input dimensionality we dispense with the KD-tree. For efficiency reasons we chose to put an upper bound of 400K training examples and subsampled the data if it surpassed this upper bound. To obtain a prediction, KNN selects the closest $k$ examples according to the input metric, together with their corresponding poses (as seen in training), and applies an inference rule to them. Given this set of poses, a large number of inference rules can be applied to obtain the final prediction, a simple one being the average $f(X) = \frac{1}{k}\sum_{i=1}^{k} Y_i$, another being a weighted average based on local kernels,

$$f(X) = \frac{1}{k} \sum_{i=1}^{k} k(\|X - X_i\|)\, Y_i \qquad (4.2)$$

where the sum runs over the $k$ nearest neighbors and $k$ is an increasing function. In the experiments we found that both perform similarly, but the first one has fewer parameters, so we chose the simple average as our inference procedure for KNN.

Kernel ridge regression (KRR) is a simple and very reliable kernel method [19]. It is a non-linear least-squares with $\ell_2$ regularization, i.e.

$$\arg\min_\alpha \frac{1}{2} \sum_j \Big\| \sum_i \alpha_i k(X_j, X_i) - Y_j \Big\|_2^2 + \lambda \|\alpha\|_2^2 \qquad (4.3)$$

There is also a particular appeal coming from the fact that the problem has a closed form solution, making it very easy to implement:

$$\alpha = (K + \lambda I)^{-1} Y \qquad (4.4)$$

with $K_{ij} = k(X_i, X_j)$ and $Y = [Y_1 \ldots Y_n]$. This simple solution also reveals the weakness of the method, which is the roughly $O(n^3)$ dependency on the training set size, due to the inversion of an $n \times n$ matrix. In our experiments, as before, we choose χ2 as our input metric and use the exponential map to transform the metric into a kernel, i.e. $k(X_i, X_j) = \exp(-\beta \chi^2(X_i, X_j))$, where $\beta$ is a simple scalar parameter. This kernel is called the exponential-χ2 kernel in the literature. Prediction is done by $f_{\alpha,\beta}(X) = \sum_i \alpha_i k(X, X_i)$.

Linear approximations for kernel ridge regression (LinKRR) are a more recent approach that overcomes the computational issues of KRR while maintaining most of its performance benefits. In a seminal article [28], Rahimi and Recht noticed that, using a theorem by Bochner, a certain class of kernels, which includes the ubiquitous Gaussian and Laplace kernels but also many other useful ones, can be approximated using their Fourier representations. The approximation comes as an expectation over the frequency domain
of a feature function $\phi$ which depends on the input:

$$k(X_i, X_j) \simeq \int_\omega \phi(X_i; \omega)\, \phi(X_j; \omega)\, \mu(\omega)\, d\omega \qquad (4.5)$$

This is a very important observation because the approximation fits perfectly in a Monte Carlo integration framework. In this framework, not only can we obtain an explicit representation of the kernel which is separable in the inputs, i.e. $k(X_i, X_j) \simeq \Phi(X_i)^\top \Phi(X_j)$, with $\Phi(X_i) = [\phi(X_i; \omega_1) \ldots \phi(X_i; \omega_D)]$ a vector of the $\phi(X_i; \omega)$ and the $\omega_s$ being $D$ samples from $\mu(\omega)$, but at the same time we have a guarantee on the kernel approximation which is independent of the learning method. Using standard duality arguments one can show that equation 4.3 is equivalent to

$$\arg\min_W \frac{1}{2} \sum_i \|W^\top \Phi(X_i) - Y_i\|_2^2 + \lambda \|W\|_2^2 \qquad (4.6)$$

This is a least squares regression model applied to non-linearly mapped data, which has a simple closed form solution

$$W = (\Phi(\mathbf{X})^\top \Phi(\mathbf{X}) + \lambda I_D)^{-1} \Phi(\mathbf{X})^\top \mathbf{Y} \qquad (4.7)$$

Compared to the corresponding equation of KRR, although there is a matrix inversion in this case as well, this time the matrix has dimension $D \times D$. Thus the inversion is independent of the number of examples, making this a very appealing candidate model for large scale training sets. Constructing the matrix $\Phi(\mathbf{X})^\top \Phi(\mathbf{X})$ is a linear operation in $n$ and can be computed with little memory load in an online fashion. Note that $D$ is a parameter of the method and allows for a trade-off between efficiency (larger $D$ makes the inversion more demanding) and performance (larger $D$ makes the approximation more accurate). Experiments show that indeed, when $D$ is large enough, there is little or no performance loss for many interesting kernels. In a perfect parallel to KRR, we use the exponential-χ2 kernel discussed earlier, approximated following [25].

Kernel Dependency Estimation. Finally, as a simple structured prediction model, we use a model first described and applied in a slightly more complex pose estimation context in [22]. The model is an application of the above approximation methodology to the kernel dependency estimation (KDE) of [13]. KRR and a host of other classical regression models assume that targets are uncorrelated. In many cases this is a reasonable assumption, and this explains the success of these models. In the case of pose estimation, however, there are clear correlations between the positions of the joints due to the constraints imposed by limbs: e.g. if the elbow and the wrist of an arm were independent, then knowing the position of the elbow would tell us nothing about where to find the wrist; the physical structure of the body, however, ensures that the distance between the two is exactly the length of the forearm. One simple way to deal with this correlation is to find an orthogonal decomposition of the targets [42] using for instance
Kernel Principal Component Analysis (KPCA) [30]. This becomes an intermediate, low dimensional representation of the target data, which KRR is perfectly suited to predict. An overview of the training algorithm is given in Algorithm 1.

Algorithm 1: LinKDE training algorithm.
  Input: X, Y – image features and poses
  Output: W – weights
  1. Compute Φ(X) and Φ(Y), the approximations for the input and target kernels.
  2. Center the data (required by KPCA): Φ0(Y) = Φ(Y) − (1/N) Σ Φ(Y).
  3. Solve KPCA to obtain the k largest eigenvectors: V = svd(Φ0(Y), k).
  4. Solve KRR: arg min_W ||W⊤Φ(X) − V⊤Φ0(Y)||²₂ + λ||W||²₂.

KPCA provides a simple methodology for recovering a good space to project the data onto, and we are able to predict this space from the input data using KRR; unfortunately, to obtain the final prediction we need to map from the KPCA space back to the original pose space, which is not trivial to do. For this, the literature suggests directly solving the pre-image problem [8]

$$\arg\min_Y \|W^\top \Phi(X) - V^\top \Phi_0(Y)\|_2^2 \qquad (4.8)$$

Intuitively, the optimization looks for the point $Y$ in the target space whose KPCA projection is closest to the prediction given by the input regressor. It has been shown that, although this is in theory a non-linear non-convex optimization problem, it can be solved approximately using classical methods like gradient descent with great success. Using the kernel approximation methodology we can not only perform a very efficient regression, by solving a problem like the LinKRR problem above albeit with different outputs, but, by applying it to the kernels over the targets, we can also perform KPCA very efficiently.
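To make the LinKRR pipeline of (4.6)-(4.7) concrete, the following sketch (Python/NumPy, not the released code) draws a random Fourier basis, maps the descriptors, and solves the D x D regularized system in closed form. It uses the standard Gaussian-kernel feature map of Rahimi and Recht [28] as a stand-in for the exponential-χ2 approximation of [25]; the parameter values (D, gamma, lam) and the function names are illustrative assumptions.

```python
import numpy as np

def sample_fourier_basis(d, D, gamma, seed=0):
    # Sample the random frequencies and phases once; they must be reused at
    # train and test time so that the feature map is the same.
    rng = np.random.default_rng(seed)
    omega = rng.normal(scale=np.sqrt(2.0 * gamma), size=(d, D))
    b = rng.uniform(0.0, 2.0 * np.pi, size=D)
    return omega, b

def fourier_features(X, omega, b):
    # phi(X; omega) for a Gaussian kernel k(x,y) = exp(-gamma ||x-y||^2).
    # The thesis approximates the exponential-chi^2 kernel instead [25]; the
    # Gaussian map is used here only to keep the sketch self-contained.
    D = omega.shape[1]
    return np.sqrt(2.0 / D) * np.cos(X @ omega + b)

def train_linkrr(Phi, Y, lam=1e-3):
    # Closed-form solution of eq. (4.7): W = (Phi^T Phi + lam I_D)^-1 Phi^T Y.
    # Phi^T Phi is D x D, so the cost of the solve is independent of n.
    D = Phi.shape[1]
    return np.linalg.solve(Phi.T @ Phi + lam * np.eye(D), Phi.T @ Y)

# Usage sketch (shapes only): X_train is (n, d) image descriptors, Y_train is
# (n, 3*32) joint positions for the 32-joint skeleton.
# omega, b = sample_fourier_basis(X_train.shape[1], D=10000, gamma=1.0)
# W = train_linkrr(fourier_features(X_train, omega, b), Y_train)
# Y_pred = fourier_features(X_test, omega, b) @ W
```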
  • 32. Chapter 5 Experiments 5.1 Training and Testing scenarios Several training and testing model scenarios were prepared. In the simplest one, we consider data from each subject separately (we call this Subject Specific Model or SSM scenario). Our 15 motions are each captured in 2 trials which are used for training and for validation, respectively. A set of 2 motions from each subject were reserved for testing. These include different poses and motions that appear in the 15 training motions (one involves sitting, the second one does not). This scenario was designed to help isolate the pose variability from the body shape and clothing variability. A second more challenging scenario considers prediction with a model trained on a set of 7 fixed training subjects (5 for training, and 2 for validation) and tested on the remaining 4 subjects on a per motion basis (we call it Activity Specific Model or the ASM scenario). Finally we consider a scenario where all motions are considered together in the same split among subjects (we call this a General Model, or GM scenario). 5.2 Image Descriptors We use a pyramid of grid SIFT descriptors with 3 levels (2x2, 4x4 and 8x8) and 9 orientation bins. Variants of these features have been shown to work very well on previous datasets (e.g. HumanEva) and we show that they are quite effective even in this more complex setting. Since we provide both background subtraction (BS) and bounding box (BB) localizations for our subjects we perform experiments using both features extracted over the entire bounding box and descriptors where the BS mask is used to filter some of the background. Our 3d pose data is mapped to the coordinate system of a virtual camera centered in (0, 0, 0) with orientation I3, and all predictions are achieved in that coordinate system. For the joint position output we provide, the root joint of the skeleton is always in the center of the coordinate system used for prediction. All errors are reported using MPJE and are measured in mm.
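Since all baselines are compared with this error measure, a minimal sketch of how MPJE can be computed from predicted and ground-truth joint positions is given below (Python/NumPy). The array shapes, and the assumption that both poses are already expressed in the prediction coordinate system with the root joint at the origin, are ours.

```python
import numpy as np

def mpje(pred, gt):
    # Mean per-joint error in mm. pred, gt: (n_frames, n_joints, 3) arrays of
    # 3d joint positions, both expressed in the prediction coordinate system
    # (root joint at the origin) and measured in mm.
    per_joint = np.linalg.norm(pred - gt, axis=2)  # (n_frames, n_joints)
    return per_joint.mean()                        # average over joints and frames
```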
5.3 Results

Linear Fourier methods used input embeddings based on a 10,000-dimensional random feature map, with a 4000-dimensional output embedding. Typical running times for models with 400,000 examples, on an 8-core PC, include 16h for testing kNN models, and 1.4h for training and 1.9h for testing KRR (50K training sample). For the full training set of 2.4 million examples (where only linear approximations can be effectively applied), training LinKRR takes 5h and testing takes about 1h. Code for all methods will be provided with the data.

We first test the baseline methods on the simplest scenario, SSM. We notice very little difference between background subtraction (BS) and bounding box (BB) results, with a slight edge to BB. This is not entirely surprising as, in examples involving, e.g., sitting poses, the chair used for sitting is included in the foreground and the image descriptor computation is affected.

In our second scenario we test our baselines on each motion separately (the activity specific model or ASM scenario). We noticed that errors are considerably higher, both because our test set is large and because significant subject body variation is introduced. Our sitting down motion is by far the most challenging. It consists of subjects sitting on the floor in different poses. This is complex to process because of a very high degree of self-occlusion. It also stretches the use of regular image descriptors, suggesting that while these may be reasonable for standing poses, they are not ideal for general ones. The other 'Sitting' scenario in the dataset is very challenging due to the use of external objects, in this case a chair. The 'taking photo' and 'walking dog' motions are also difficult because they are less repeatable and more liberty was given to the actors in performing them. Overall, we feel the dataset offers a good balance between somewhat more standard settings and difficult or very challenging ones, making it also a reasonable benchmark for new modeling and algorithmic developments.

Our final baseline scenario is the one where models are trained on all motions from all subjects – this is our general model (GM) setting, with results shown in table 5.3. It is encouraging that linear Fourier approximations to kernel methods can be applied to large datasets with reasonable results. The models we have tested still do not seem able to effectively leverage the structure in all the data, showing that ample space for future research and better methodology – both feature design and predictors – remains.
method  mask  S1     S2     S3     S4     S5     S6     S7     S8    S9     S10    S11
KNN     BB    150.4  149.3  162.8  144.7  164.6  188.1  159.8  98.8  131.1  153.7  163.3
KNN     BS    177.6  144.2  152.8  154.7  166.8  191.2  158.5  81.7  144.3  155.9  169.0
KRR     BB    108.6  104.3  108.9  104.4  121.9  130.2  105.8  64.3  104.6  100.1  117.9
KRR     BS    117.7  102.4  112.7  112.6  124.5  128.0  111.8  64.6  108.6  106.1  126.2
LinKRR  BB    120.3  116.3  121.2  119.5  135.8  144.3  120.4  76.3  118.5  112.6  129.3
LinKRR  BS    130.2  117.7  128.3  132.2  139.1  144.5  130.2  77.9  122.7  122.9  141.3
LinKDE  BB    105.7  103.1  111.4  105.1  113.1  134.5  104.4  64.0  103.1  97.2   114.7
LinKDE  BS    110.5  97.2   110.9  109.4  119.7  127.4  105.6  61.8  103.9  99.0   118.8

Table 5.1: Results of baseline methods in the subject specific modeling (SSM) scenario for all subjects S1 - S11 in the dataset. KNN indicates nearest neighbor (K=1), KRR is kernel ridge regression, and LinKRR represents a linear Fourier approximation of KRR. LinKDE is the linear Fourier model for a structured predictor based on Kernel Dependency Estimation (KDE). Errors are given in mm using MPJE.

method  mask  Directions  Discussion  Eating  Greeting  Phone   Posing  Buying  Sitting
KNN     BB    183.29      200.68      151.01  207.39    191.21  198.81  217.61  231.47
KNN     BS    204.82      209.81      204.44  226.23    216.73  245.36  232.12  266.84
KRR     BB    132.58      147.80      135.14  154.32    157.71  144.91  185.12  179.25
KRR     BS    139.59      134.54      131.83  151.06    143.64  168.47  165.19  184.99
LinKRR  BB    166.33      180.49      156.94  164.60    184.00  183.66  216.51  212.19
LinKRR  BS    156.32      152.01      152.52  169.14    162.81  182.53  191.86  215.99
LinKDE  BB    140.20      133.83      127.73  159.06    149.10  166.16  179.88  211.31
LinKDE  BS    144.96      132.78      130.47  161.31    142.25  173.04  179.07  203.73

method  mask  SittingDown  Smoking  TakingPhoto  Waiting  Walking  WalkingDog  WalkTogether
KNN     BB    321.82       206.77   257.50       208.62   163.50   276.24      210.79
KNN     BS    365.71       237.57   313.74       241.02   201.15   287.84      223.56
KRR     BB    237.39       151.83   177.76       152.59   117.88   190.87      156.70
KRR     BS    247.85       149.81   217.13       162.81   134.19   187.55      164.29
LinKRR  BB    276.83       154.28   225.11       168.00   148.92   224.67      172.67
LinKRR  BS    304.10       176.33   238.74       179.85   149.56   216.91      175.85
LinKDE  BB    245.72       147.51   221.16       163.81   143.55   216.94      172.46
LinKDE  BS    305.09       153.50   231.36       161.18   128.37   205.62      175.55

Table 5.2: Comparison of baseline methods in the activity specific setting ASM. KNN indicates a nearest neighbor (K=1), KRR is kernel ridge regression, LinKRR is a linear Fourier approximation of KRR, and LinKDE is the linear Fourier model for a structured predictor based on Kernel Dependency Estimation (KDE). Errors are given in mm using MPJE.

             Joint Positions                          Joint Angles
       BB                   BS                   BB                   BS
KNN    KRR    LinKRR   KNN    KRR    LinKRR   KNN    KRR    LinKRR   KNN    KRR    LinKRR
214.1  155.4  166.1    254.0  169.4  179.2    181.8  152.3  155.7    183.9  155.2  158.6

Table 5.3: Results obtained using our approximated kernel models, trained on 2.4 million examples and tested on 1.2 million data points. KNN is K-nearest neighbor (K=1) where the training set was subsampled to 400,000 (for efficiency), LinKRR is a kernel model based on random Fourier approximations trained/tested on 2.4M/1.2M. The KRR results are indicative, obtained on a subset of 50,000 human poses (sub-)sampled from the 2.4 million training set. While the large-scale linear approximation does not currently improve on the full non-linear sub-sampled model, it is still encouraging that it can use all data in training. New methods that better leverage all training data, and have better image (input) representations, should in the long run fill the accuracy gap.
Activity Specific Model (ASM)
                    BB                          BS
             KNN     KRR     LinKRR     KNN     KRR     LinKRR
Phoning      216.3   187.7   197.3      227.4   194.9   204.5
Posing       297.7   190.7   199.1      251.8   179.1   197.4
SittingDown  347.1   251.0   278.4      270.8   244.7   275.0
TakingPhoto  259.6   201.9   228.7      251.7   222.0   236.1
Waiting      199.3   161.4   188.2      206.0   164.8   193.2
Walking      209.7   179.8   197.5      218.7   167.2   209.8

General Model (GM)
                    BB                          BS
             KNN     KRR     LinKRR     KNN     KRR     LinKRR
Phoning      302.9   253.4   264.8      344.2   245.1   262.3
Posing       300.5   216.2   196.8      270.1   236.9   219.1
SittingDown  305.6   241.4   238.0      273.3   241.9   241.9
TakingPhoto  240.3   215.5   231.4      273.7   226.3   223.3
Waiting      252.0   191.6   199.4      275.5   189.0   203.5
Walking      219.0   232.6   303.9      272.8   248.5   298.7

Table 5.4: Pose estimation error for some of our mixed reality dataset with moving cameras, backgrounds and occlusion (see fig. 3.4). LinKRR indicates linear Fourier approximations to kernel models. The models are trained on the data captured in the laboratory. For ASM we use the model trained on motions of the same type as the test motion. The results show promise but also clear scope for modeling and algorithmic improvements in such challenging settings.
Chapter 6

Online Evaluation Tools

In order to support the development of novel pose prediction models on our newly acquired dataset, we provide a set of online tools for easy evaluation. The website offers users the ability to browse and download our dataset. For inspecting the data we share MATLAB visualization code, together with implementations of simple baselines, to give the community an easy way of handling the data. This includes code for data manipulation, feature extraction and some simple large scale learning methods.

On the front page, users are greeted with a description of the dataset. Several tabs let them register on our website, contact us by email, and log in. Once logged in, they can view and download the data files. An important part of the website is dedicated to handling users' results on our dataset. Their 2D and 3D predictions on the test set videos can be exported into a highly compressed file using our provided code, and uploaded to the website, where their submissions are automatically evaluated and scored. We hold out the test set poses in order to provide an independent, unbiased verification of the performance of the proposed methods.

In the future, the development of these online tools will continue. We hope to make it possible for the community to view the dataset online without needing to download any of the large videos or the visualization code.
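The exact submission format is defined by the code distributed with the dataset; purely as an illustration of the workflow, the sketch below (Python/NumPy, hypothetical function and file names) packs per-video 3D joint predictions into a single compressed file of the kind that could be uploaded for evaluation.

```python
import numpy as np

def export_predictions(preds_by_video, path="submission.npz"):
    # preds_by_video maps a video identifier to an array of shape
    # (n_frames, n_joints, 3). One compressed archive keeps the upload small;
    # this packing is illustrative, not the official submission format.
    arrays = {vid: np.asarray(p, dtype=np.float32)
              for vid, p in preds_by_video.items()}
    np.savez_compressed(path, **arrays)
```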
  • 37. 32 CHAPTER 6. ONLINE EVALUATION TOOLS Figure 6.1: The Register page: the header image, main tabs, and the usual user form. For added security, users are required to fill in a CAPTCHA in order to prevent abuse by automated programs. Figure 6.2: A view of the dataset description: the header image, main tabs and secondary tabs, showing the Mixed Reality content along with a short description of it.
  • 38. Chapter 7 Conclusions We have introduced a large scale dataset, Human3.6M, where 3.6 million different poses were captured from a set of professional men and women actors. The dataset complements existing benchmarks with a variety of human poses typical of natural environments, and provides 2d and 3d data (including time of flight, high quality image and motion capture data), accurate 3d human models (body scans) of the actors, and mixed reality settings for performance evaluation under realistic backgrounds, correct geometry, and occlusion. We also provide a set of consistency studies and evaluation benchmarks for automatic discriminative methods, including linear and non-linear methods and structured predictors, showing that data is useful and, at the same time, scope for improving the current methodology exists. The data, as well as the software tools for visualization, prediction, and evaluation, will be made available online to the research community. The dataset and its framework have a primary role in supporting the progress of new algorithms in the area of 3d human pose reconstruction and can be considered as one step in the direction of increasing the complexity and diversity of both algorithms and data. We hope that our proposed dataset will foster further research in broad domains like computer vision, machine learning, interaction and visual perception and bridge the gap towards artificial image-based 3d human sensing systems that can operate seamlessly in natural environments. The data we provide can be useful in training real models that can be employed in actual applications. For instance HCI [21] has received a lot of attention when Kinect was released as an easy to use, plug-and-play device for controlling computer games. However it has numerous problems for application in outdoor settings. Our data could allow for developing devices with Kinect capability that work with normal video cameras in areas like human-computer interaction, security and virtual presence. In addition, our 3.6 million poses dataset can help in better tracking the 3d human poses in real time, or shed some new insights in previously unreached areas like medicine and health care. Human studies can be well supported by employing activity recognition in sociology.
  • 39. 34 CHAPTER 7. CONCLUSIONS We hope that Human3.6M will address many of the necessities in the above domains while being the testbed for improving the current algorithms and for developing new ones.
References [1] http://en.wikipedia.org/wiki/Bayer_filter. [2] http://www.motionanalysisinc.com/pia100048gc.html. [3] http://www.vicon.com/boujou/index.html. [4] http://www.vicon.com/products/tseries.html. [5] http://www.vicon.com/products/viconmx.html. [6] Mesa Imaging SwissRanger SR4000, http://www.mesa-imaging.ch/prodview4k.php. [7] Ankur Agarwal and Bill Triggs. 3D Human Pose from Silhouettes by Relevance Vector Regression. In CVPR (2), pages 882–888, 2004. [8] G. Bakir, J. Weston, and B. Scholkopf. Learning to find pre-images. In NIPS, 2004. [9] A. Barr. Global and Local Deformations of Solid Primitives. Computer Graphics, 18:21–30, 1984. [10] Alina Beygelzimer, Sham Kakade, and John Langford. Cover trees for nearest neighbor. In ICML ’06: Proceedings of the 23rd international conference on Machine learning, pages 97–104, New York, NY, USA, 2006. ACM. [11] Lubomir D. Bourdev and Jitendra Malik. Poselets: Body part detectors trained using 3D human pose annotations. In ICCV, pages 1365–1372, 2009. [12] CMU Graphics Lab. Human Motion Capture Database. Available online at http://mocap.cs.cmu.edu/search.html, 2003. [13] C. Cortes, M. Mohri, and J. Weston. A general regression technique for learning transductions. In ICML, pages 153–160, 2005. [14] Navneet Dalal and Bill Triggs. Histograms of oriented gradients for human detection. In CVPR (1), pages 886–893. IEEE Computer Society, 2005. [15] M. Fischler and R. Elschlager. The Representation and Matching of Pictorial Structures. IEEE Transactions on Computers, C-22(1):67–92, 1973.
  • 41. 36 REFERENCES [16] P. Indyk G. Shakhnarovich, T. Darrell. Nearest-Neighbors methods in Learning and Vision: Theory and Practice. [17] Ingo Graf, Ulrich Kressel, and J¨urgen Franke. Polynominal Classifiers and Support Vector Machines. In ICANN, pages 397–402, 1997. [18] Martin Haker, Martin B¨ohme, Thomas Martinetz, and Erhardt Barth. Self-Organizing Maps for Pose Estimation with a Time-of-Flight Camera. In Dyn3D, pages 142–153, 2009. [19] Thomas Hofmann, Bernhard Sch¨olkopf, and Alexander J. Smola. Kernel methods in machine learning. The Annals of Statistics, Jan 2008. [20] N. Howe and A. Deschamps. Better Foreground Segmentation Through Graph Cuts. Computer Vision and Pattern Recognition, 2004. [21] Stephen S. Intille, Jason Nawyn, Beth Logan, and Gregory D. Abowd. Developing shared home behavior datasets to advance hci and ubiquitous computing research. In Dan R. Olsen Jr., Richard B. Arthur, Ken Hinckley, Meredith Ringel Morris, Scott E. Hudson, and Saul Greenberg, editors, Proceedings of the 27th International Conference on Human Factors in Computing Systems, CHI 2009, Extended Abstracts Volume, Boston, MA, USA, April 4-9, 2009, pages 4763–4766. ACM, 2009. [22] C. Ionescu, F. Li, and C. Sminchisescu. Latent Structured Models for Human Pose Estimation. In IEEE International Conference on Computer Vision, November 2011. [23] Catalin Ionescu, Dragos Papava, and Cristian Sminchisescu. Human3.6M: Large Scale Datasets for 3D Human Sensing in Natural Environments. Technical report, Institute of Mathematics of the Romanian Academy and University of Bonn. [24] Hsi-Jian Lee and Zen Chen. Determination of 3D human body postures from a single view. Computer Vision, Graphics, and Image Processing, 30(2):148–168, 1985. [25] F. Li, G. Lebanon, and C. Sminchisescu. Chebyshev Approximations to the Histogram χ2 Kernel. In IEEE International Conference on Computer Vision and Pattern Recognition, 2012. [26] R. Lukac, K. Martin, K.N. Plataniotis, B. Smolka, and A.N. Venetsanopoulos. Bayer pattern based CFA zooming / CFA interpolation framework. [27] G. Pons-Moll, A. Baak, J. Gall, L. Leal-Taixe, M. Mueller, H. P. Seidel, and B. Rosenhahn. Outdoor human motion capture using inverse kinematics and von mises-fisher sampling. 2011.
  • 42. REFERENCES 37 [28] A. Rahimi and B. Recht. Random features for large-scale kernel machines. In NIPS, 2007. [29] Benjamin Sapp, Chris Jordan, and Ben Taskar. Adaptive pose priors for pictorial structures. In CVPR, pages 422–429, 2010. [30] B. Sch¨olkopf, A. Smola, and K. M¨uller. Nonlinear component analysis as a kernel eigenvalue problem. Neural Computation, 10:1299–1319, 1998. [31] L. Sigal, A. Balan, and M. J. Black. Combined discriminative and generative articulated pose and non-rigid shape estimation. 2007. [32] Leonid Sigal, Alexandru O. Balan, and Michael J. Black. HumanEva: Synchronized Video and Motion Capture Dataset and Baseline Algorithm for Evaluation of Articulated Human Motion. International Journal of Computer Vision, 87(1-2):4–27, 2010. [33] C. Sminchisescu and A. Jepson. Generative Modeling for Continuous Non-Linearly Embedded Visual Inference. In International Conference on Machine Learning, pages 759–766, Banff, 2004. [34] C. Sminchisescu, A. Kanaujia, and D. Metaxas. BM3E: Discriminative Density Propagation for Visual Tracking. PAMI, 2007. [35] C. Sminchisescu and B. Triggs. Kinematic Jump Processes for Monocular 3D Human Tracking. In IEEE International Conference on Computer Vision and Pattern Recognition, volume 1, pages 69–76, Madison, 2003. [36] Cristian Sminchisescu. Consistency and Coupling in Human Model Likelihoods. In FGR, pages 27–32, 2002. [37] Camillo J. Taylor. Reconstruction of Articulated Objects from Point Correspondences in a Single Uncalibrated Image. Computer Vision and Image Understanding, 80(3):349–363, 2000. [38] Michael E. Tipping. The Relevance Vector Machine. In NIPS, pages 652–658, 1999. [39] Michael E. Tipping. Sparse Bayesian Learning and the Relevance Vector Machine. Journal of Machine Learning Research, 1:211–244, 2001. [40] Paul A. Viola and Michael J. Jones. Robust real-time face detection. In ICCV, page 747, 2001. [41] Daniel Vlasic, Rolf Adelsberger, Giovanni Vannucci, John Barnwell, Markus Gross, Wojciech Matusik, and Jovan Popovi´c. Practical motion capture in everyday surroundings. 2007.
  • 43. 38 REFERENCES [42] Jason Weston, Olivier Chapelle, Andr´e Elisseeff, Bernhard Sch¨olkopf, and Vladimir Vapnik. Kernel dependency estimation. In Suzanna Becker, Sebastian Thrun, Klaus Obermayer, Suzanna Becker, Sebastian Thrun, and Klaus Obermayer, editors, NIPS, pages 873–880. MIT Press, 2002.