UNIVERSIDAD POLITÉCNICA DE MADRID
FACULTAD DE INFORMÁTICA
DOCTORAL THESIS
Efficient Model-based 3D Tracking by Using Direct Image Registration
submitted to the
FACULTAD DE INFORMÁTICA
of the
UNIVERSIDAD POLITÉCNICA DE MADRID
for the degree of
DOCTOR EN INFORMÁTICA
AUTHOR: Enrique Muñoz Corral
SUPERVISOR: Luis Baumela Molina
Madrid, 2012
Acknowledgements
The truth is that the ten years (ten!) it has taken me to write this thesis give for
many things, and if I had to thank everyone who has helped me, I would need an
entire chapter. First of all I would like to thank Luis Baumela, a great thesis advisor
and an even better person, for awakening in me the itch for research and, above all,
for having enough patience to put up with my stubbornness. Luis, if it were not for
you, I would never have joined the University and I would be in the private sector
earning a fortune: yeah, thank you so much!
A thousand thanks to Javier de Lope, for endless technical (and not so technical)
discussions, and above all to José Miguel Buenaposada, who during all these years
has put up with me, helped me, irritated me, joked with me, and even found me a
job. I cannot forget the good times spent at lunch with the "girls" from Statistics
(Maribel, Arminda, Concha and Juan Antonio), who endured my never-ending rants
about the housing bubble and our national politicians. A mention as well for all the
colleagues who have passed through laboratory L-3202 over these years: "Javi's
kids" (Javi, Juan, Bea and Yadira), Juan Bekios, the two "Pablos" (Márquez and
Herrero), Antonio and Rubén.
I would also like to thank Lourdes Agapito for letting me take part in the project
Automated facial expression analysis using computer vision, funded by the Royal
Society of the United Kingdom. Thanks to this project I had the privilege of working
with Lourdes and with Xavier Lladó, and above all of meeting that singular character
called Alessio del Bue. I have no words to thank Alessio for being so kind and for
stoically putting up with us sponging off him so many times. Nor can I forget the
help provided by Professor Thomas Vetter and his group at the University of Basel
(especially Brian Amberg and Pascal Paysan); they took the trouble to build a
three-dimensional model of my face, including deformations and expressions. I would
not like to close these acknowledgements without noting that part of the work in this
thesis was carried out under project TIC2002-00591 of the Ministerio de Ciencia y
Tecnología and project TIN2008-06815-C02-02 of the Ministerio de Ciencia e
Innovación.
And last, but not least, I thank Susana for the patience she has shown during all
these (many) years in which I have been tied up with the thesis. This one is for you,
Susana!
January 2012
Contents
Resumen xvii
Summary xix
Notations 1
1 Introduction 5
1.1 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
1.2 Applications . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
1.3 Contributions of the Thesis . . . . . . . . . . . . . . . . . . . . . . . 9
2 Literature Review 13
2.1 Image Registration vs. Tracking . . . . . . . . . . . . . . . . . . . . . 13
2.2 Image Registration . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
2.3 Model-based 3D Tracking . . . . . . . . . . . . . . . . . . . . . . . . 15
2.3.1 Modelling assumptions . . . . . . . . . . . . . . . . . . . . . . 15
2.3.2 Rigid Objects . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
2.3.3 Nonrigid Objects . . . . . . . . . . . . . . . . . . . . . . . . . 18
2.3.4 Facial Motion Capture . . . . . . . . . . . . . . . . . . . . . . 18
3 Efficient Direct Image Registration 21
3.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
3.2 Modelling Assumptions . . . . . . . . . . . . . . . . . . . . . . . . . . 21
3.2.1 Imaging Geometry . . . . . . . . . . . . . . . . . . . . . . . . 21
3.2.2 Brightness Constancy Constraint . . . . . . . . . . . . . . . . 23
3.2.3 Image Registration by Optimization . . . . . . . . . . . . . . . 23
3.2.4 Additive vs. Compositional . . . . . . . . . . . . . . . . . . . 25
3.3 Additive approaches . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
3.3.1 Lucas-Kanade Algorithm . . . . . . . . . . . . . . . . . . . . . 27
3.3.2 Hager-Belhumeur Factorization Algorithm . . . . . . . . . . . 29
3.4 Compositional approaches . . . . . . . . . . . . . . . . . . . . . . . . 31
3.4.1 Forward Compositional Algorithm . . . . . . . . . . . . . . . . 33
3.4.2 Inverse Compositional Algorithm . . . . . . . . . . . . . . . . 35
3.5 Other Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
3.6 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38
4 Equivalence of Gradients 39
4.1 Image Gradients . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
4.1.1 Image Gradients in R^2 . . . . . . . . . . . . . . . . . . . . . . 40
4.1.2 Image Gradients in P^2 . . . . . . . . . . . . . . . . . . . . . . 42
4.1.3 Image Gradients in R^3 . . . . . . . . . . . . . . . . . . . . . . 43
4.2 The Gradient Equivalence Equation . . . . . . . . . . . . . . . . . . . 45
4.2.1 Relevance of the Gradient Equivalence Equation . . . . . . . . 46
4.2.2 General Approach to Gradient Replacement . . . . . . . . . . 46
4.3 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48
5 Additive Algorithms 51
5.1 Gradient Replacement Requirements . . . . . . . . . . . . . . . . . . 52
5.2 Systematic Factorization . . . . . . . . . . . . . . . . . . . . . . . . . 52
5.3 3D Rigid Motion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55
5.3.1 3D Textured Models . . . . . . . . . . . . . . . . . . . . . . . 55
5.3.2 Shape-induced Homography . . . . . . . . . . . . . . . . . . . 57
5.3.3 Change to the Reference Frame . . . . . . . . . . . . . . . . . 57
5.3.4 Optimization Outline . . . . . . . . . . . . . . . . . . . . . . . 61
5.3.5 Gradient Replacement . . . . . . . . . . . . . . . . . . . . . . 61
5.3.6 Systematic Factorization . . . . . . . . . . . . . . . . . . . . . 63
5.4 3D Nonrigid Motion . . . . . . . . . . . . . . . . . . . . . . . . . . . 65
5.4.1 Nonrigid Morphable Models . . . . . . . . . . . . . . . . . . . 65
5.4.2 Nonrigid Shape-induced Homography . . . . . . . . . . . . . . 65
5.4.3 Change of Variables to the Reference Frame . . . . . . . . . . 66
5.4.4 Optimization Outline . . . . . . . . . . . . . . . . . . . . . . . 69
5.4.5 Gradient Replacement . . . . . . . . . . . . . . . . . . . . . . 69
5.4.6 Systematic Factorization . . . . . . . . . . . . . . . . . . . . . 71
5.5 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 75
6 Compositional Algorithms 77
6.1 Unravelling the Inverse Compositional Algorithm . . . . . . . . . . . 77
6.1.1 Change of Variables in IC . . . . . . . . . . . . . . . . . . . . 79
6.1.2 The Efficient Forward Compositional Algorithm . . . . . . . . 79
6.1.3 Rationale of the Change of Variables in IC . . . . . . . . . . . 82
6.1.4 Differences between IC and EFC . . . . . . . . . . . . . . . . . 84
6.2 Requirements for Compositional Warps . . . . . . . . . . . . . . . . . 85
6.2.1 Requirement on Warp Composition . . . . . . . . . . . . . . . 85
6.2.2 Requirement on Gradient Equivalence . . . . . . . . . . . . . 85
6.3 Other Compositional Algorithms . . . . . . . . . . . . . . . . . . . . 86
6.3.1 Generalized Inverse Compositional Algorithm . . . . . . . . . 86
6.4 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 89
7 Computational Complexity 91
7.1 Complexity Measures . . . . . . . . . . . . . . . . . . . . . . . . . . . 91
7.1.1 Number of Operations . . . . . . . . . . . . . . . . . . . . . . 91
7.1.2 Complexity of Matrix Operations . . . . . . . . . . . . . . . . 92
7.1.3 Comparing Algorithm Complexities . . . . . . . . . . . . . . . 93
7.2 Algorithm Naming Conventions . . . . . . . . . . . . . . . . . . . . . 94
7.2.1 Additive Algorithms . . . . . . . . . . . . . . . . . . . . . . . 95
7.2.2 Compositional Algorithms . . . . . . . . . . . . . . . . . . . . 96
7.3 Complexity of Algorithms . . . . . . . . . . . . . . . . . . . . . . . . 96
7.3.1 Additive Algorithms . . . . . . . . . . . . . . . . . . . . . . . 97
7.3.2 Compositional Algorithms . . . . . . . . . . . . . . . . . . . . 103
7.4 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 105
8 Experiments 107
8.1 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 107
8.2 Features and Measures . . . . . . . . . . . . . . . . . . . . . . . . . . 113
8.2.1 Numerical Ranges for Features . . . . . . . . . . . . . . . . . . 115
8.3 Generation of Synthetic Experiments . . . . . . . . . . . . . . . . . . 116
8.3.1 Synthetic Datasets and Images . . . . . . . . . . . . . . . . . 118
8.3.2 Generation of Result Plots . . . . . . . . . . . . . . . . . . . . 120
8.4 Implementation Details . . . . . . . . . . . . . . . . . . . . . . . . . . 122
8.4.1 Convergence Criteria . . . . . . . . . . . . . . . . . . . . . . . 122
8.4.2 Visibility Management . . . . . . . . . . . . . . . . . . . . . . 122
8.4.3 Scale of Homographies . . . . . . . . . . . . . . . . . . . . . . 125
8.4.4 Minimization of Jacobian Operations . . . . . . . . . . . . . . 126
8.5 Additive Algorithms . . . . . . . . . . . . . . . . . . . . . . . . . . . 126
8.5.1 Experimental Hypotheses . . . . . . . . . . . . . . . . . . . . 126
8.5.2 Experiments with Synthetic Rigid data . . . . . . . . . . . . . 127
8.5.3 Experiments with Synthetic Nonrigid data . . . . . . . . . . . 142
8.5.4 Experiments With Nonrigid Sequence . . . . . . . . . . . . . . 151
8.5.5 Experiments with real Rigid data . . . . . . . . . . . . . . . . 154
8.5.6 Experiment with real Nonrigid data . . . . . . . . . . . . . . . 158
8.6 Compositional Algorithms . . . . . . . . . . . . . . . . . . . . . . . . 163
8.6.1 Experimental Hypotheses . . . . . . . . . . . . . . . . . . . . 163
8.6.2 Experiments with Synthetic Rigid data . . . . . . . . . . . . . 163
8.7 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 173
9 Conclusions and Future work 179
9.1 Summary of Contributions . . . . . . . . . . . . . . . . . . . . . . . . 179
9.2 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 180
9.3 Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 183
A Gauss-Newton Optimization 201
B Plane-induced Homography 203
C Plane+Parallax-constrained Homography 205
C.1 Compositional Form . . . . . . . . . . . . . . . . . . . . . . . . . . . 207
D Methodical Factorization 209
D.1 Basic Definitions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 209
D.2 Lemmas that Re-organize Product of Matrices . . . . . . . . . . . . . 211
D.3 Lemmas that Re-organize Kronecker Products . . . . . . . . . . . . . 215
D.4 Lemmas that Re-organize Sums of Matrices . . . . . . . . . . . . . . 216
E Methodical Factorization of f3DTM 219
F Methodical Factorization of f3DMM (Partial case) 223
G Methodical Factorization of f3DMM (Full case) 225
H Detailed Complexity of Algorithms 235
H.1 Warp f3DTM . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 235
H.2 Warp f3DMM . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 236
H.3 Jacobian of Algorithm HB3DTM . . . . . . . . . . . . . . . . . . . . 237
H.4 Jacobian of Algorithm HB3DTMNF . . . . . . . . . . . . . . . . . . 239
H.5 Jacobian of Algorithm HB3DMMNF . . . . . . . . . . . . . . . . . 241
H.6 Jacobian of Algorithm HB3DMMSF . . . . . . . . . . . . . . . . . . 246
List of Figures
1.1 Example of 3D rigid tracking. . . . . . . . . . . . . . . . . . . . . 6
1.2 3D Nonrigid Tracking. . . . . . . . . . . . . . . . . . . . . . . . . 6
1.3 Image registration. . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
1.4 Industrial applications of 3D tracking. . . . . . . . . . . . . . . 9
1.5 Motion capture in the film industry. . . . . . . . . . . . . . . . 10
1.6 Markerless facial motion capture. . . . . . . . . . . . . . . . . . 11
3.1 Imaging geometry. . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
3.2 Iterative gradient descent image registration. . . . . . . . . . . 24
3.3 Generic descent method for image registration. . . . . . . . . . 26
3.4 Lucas-Kanade image registration. . . . . . . . . . . . . . . . . . 28
3.5 Hager-Belhumeur image registration. . . . . . . . . . . . . . . . 32
3.6 Forward compositional image registration. . . . . . . . . . . . . 34
3.7 Inverse compositional image registration. . . . . . . . . . . . . 36
4.1 Depiction of Image Gradients. . . . . . . . . . . . . . . . . . . . 41
4.2 Image Gradient in P^2. . . . . . . . . . . . . . . . . . . . . . . . . 43
4.3 Image gradient in R^3. . . . . . . . . . . . . . . . . . . . . . . . . . 45
4.4 Comparison between BCC and GEE. . . . . . . . . . . . . . . . 47
4.5 Gradients and Convergence. . . . . . . . . . . . . . . . . . . . . . 49
4.6 Open Subsets in Various Domains. . . . . . . . . . . . . . . . . . 49
5.1 3D Textured Model. . . . . . . . . . . . . . . . . . . . . . . . . . . 56
5.2 Shape-induced homographies. . . . . . . . . . . . . . . . . . . . . 58
5.3 Warp defined on the reference frame. . . . . . . . . . . . . . . . 59
5.4 Reference frame advantages. . . . . . . . . . . . . . . . . . . . . . 60
5.5 Nonrigid Morphable Models. . . . . . . . . . . . . . . . . . . . . 65
5.6 Nonrigid shape-induced homographies. . . . . . . . . . . . . . . 67
5.7 Deformable warp defined on the reference frame. . . . . . . . 68
6.1 Change of variables in IC. . . . . . . . . . . . . . . . . . . . . . . 80
6.2 Forward compositional image registration. . . . . . . . . . . . . 83
6.3 Generalized inverse compositional image registration. . . . . . 88
7.1 Complexity of Additive Algorithms. . . . . . . . . . . . . . . . . 102
7.2 Complexities of Compositional Algorithms . . . . . . . . . . . 105
8.1 Registration vs. Tracking. . . . . . . . . . . . . . . . . . . . . . . 109
8.2 Algorithm initialization . . . . . . . . . . . . . . . . . . . . . . . . 110
8.3 Accuracy and convergence. . . . . . . . . . . . . . . . . . . . . . 114
8.4 Ground Truth and Noise Variance. . . . . . . . . . . . . . . . . 117
8.5 Definition of Datasets. . . . . . . . . . . . . . . . . . . . . . . . . 118
8.6 Example of Synthetic Datasets. . . . . . . . . . . . . . . . . . . . 119
8.7 Experimental Evaluation with Synthetic Data . . . . . . . . . 121
8.8 Visibility management. . . . . . . . . . . . . . . . . . . . . . . . . 123
8.9 Efficient solving of WLS. . . . . . . . . . . . . . . . . . . . . . 125
8.10 The cube model. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 128
8.11 The face model. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 128
8.12 The tea box model. . . . . . . . . . . . . . . . . . . . . . . . . . . 129
8.13 Results from dataset DS1 for cube. . . . . . . . . . . . . . . . . . 130
8.14 Results from dataset DS2 for cube. . . . . . . . . . . . . . . . . . 131
8.15 Results from dataset DS3 for cube. . . . . . . . . . . . . . . . . . 132
8.16 Results from dataset DS4 for cube. . . . . . . . . . . . . . . . . . 133
8.17 Results from dataset DS5 for cube. . . . . . . . . . . . . . . . . . 134
8.18 Results from dataset DS6 for cube. . . . . . . . . . . . . . . . . . 135
8.19 tea box sequence. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 136
8.20 Results for the tea box sequence. . . . . . . . . . . . . . . . . . . 137
8.21 Estimated parameters from teabox sequence. . . . . . . . . . . 138
8.22 Estimated parameters from face sequence. . . . . . . . . . . . . 140
8.23 Good texture vs. bad texture. . . . . . . . . . . . . . . . . . . . 141
8.24 The face-deform model. . . . . . . . . . . . . . . . . . . . . . . . . 142
8.25 Distribution of Synthetic Datasets. . . . . . . . . . . . . . . . . 143
8.26 Results from dataset DS1 for face-deform. . . . . . . . . . . . . 145
8.27 Results from dataset DS2 for face-deform. . . . . . . . . . . . . 146
8.28 Results from dataset DS3 for face-deform. . . . . . . . . . . . . 147
8.29 Results from dataset DS4 for face-deform. . . . . . . . . . . . . 148
8.30 Results from dataset DS5 for face-deform. . . . . . . . . . . . . 149
8.31 Results from dataset DS6 for face-deform. . . . . . . . . . . . . 150
8.32 face-deform sequence. . . . . . . . . . . . . . . . . . . . . . . . . . 151
8.33 Results from face-deform sequence. . . . . . . . . . . . . . . . . 152
8.34 Estimated parameters from face-deform sequence. . . . . . . . 153
8.35 The cube-real model. . . . . . . . . . . . . . . . . . . . . . . . . . 154
8.36 The cube-real sequence. . . . . . . . . . . . . . . . . . . . . . . . 156
8.37 Results from cube-real sequence. . . . . . . . . . . . . . . . . . . 157
8.38 Selected facial scans used to build the model. . . . . . . . . . . 158
8.39 Unfolded texture model. . . . . . . . . . . . . . . . . . . . . . . . 159
8.40 The face-real sequence. . . . . . . . . . . . . . . . . . . . . . . . 160
8.41 Anchor points in the model. . . . . . . . . . . . . . . . . . . . . . 161
8.42 Results for the face-real sequence. . . . . . . . . . . . . . . . . 162
8.43 The plane model. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 164
8.44 Distribution of Synthetic Datasets. . . . . . . . . . . . . . . . . 165
8.45 Results from dataset DS1 for plane. . . . . . . . . . . . . . . . . 167
8.46 Results from dataset DS2 for plane. . . . . . . . . . . . . . . . . 168
8.47 Results from dataset DS3 for plane. . . . . . . . . . . . . . . . . 169
8.48 Results from dataset DS4 for plane. . . . . . . . . . . . . . . . . 170
8.49 Results from dataset DS5 for plane. . . . . . . . . . . . . . . . . 171
8.50 Results from dataset DS6 for plane. . . . . . . . . . . . . . . . . 172
8.51 Average Time per iteration. . . . . . . . . . . . . . . . . . . . . . 176
9.1 Spiderweb Plots for Image Registration Algorithms. . . . . . 182
9.2 Spherical Harmonics-based Illumination Model . . . . . . . . . 184
9.3 Tracking by simultaneously using texture and edges information . . 185
9.4 Efficient tracking using multiple views . . . . . . . . . . . . . . 186
B.1 Plane-induced homography. . . . . . . . . . . . . . . . . . . . . . 203
C.1 Plane+Parallax-constrained homograpy. . . . . . . . . . . . . . 206
List of Tables
4.1 Characteristics of the warps . . . . . . . . . . . . . . . . . . . . . 50
6.1 Relationship between compositional algorithms and warps . . 89
6.2 Requirements for Optimization Algorithms . . . . . . . . . . . 90
7.1 Complexity of matrix operations. . . . . . . . . . . . . . . . . . 93
7.2 Additive testing algorithms. . . . . . . . . . . . . . . . . . . . . . 95
7.3 Compositional testing algorithms. . . . . . . . . . . . . . . . . . . 96
7.4 Complexity of Algorithm LK3DTM. . . . . . . . . . . . . . . . . 97
7.5 Complexity of Algorithm HB3DTM. . . . . . . . . . . . . . . . 98
7.6 Complexity of Algorithm LK3DMM. . . . . . . . . . . . . . . . 98
7.7 Complexity of Algorithm HB3DMMNF. . . . . . . . . . . . . . 99
7.8 Complexity of Algorithm HB3DMM. . . . . . . . . . . . . . . . 100
7.9 Complexity of Algorithm HB3DMMSF. . . . . . . . . . . . . . 101
7.10 Complexities of Additive Algorithms. . . . . . . . . . . . . . . . 101
7.11 Complexity of Algorithm LKH8. . . . . . . . . . . . . . . . . . . 103
7.12 Complexity of Algorithm ICH8. . . . . . . . . . . . . . . . . . . 103
7.13 Complexity of Algorithm HBH8. . . . . . . . . . . . . . . . . . . 104
7.14 Complexity of Algorithm GICH8. . . . . . . . . . . . . . . . . . 104
7.15 Complexities of Compositional Algorithms. . . . . . . . . . . . 106
7.16 Comparison of Relative Complexities for Additive Algorithms . . . 106
7.17 Comparison of Relative Complexities for Compositional Algorithms . 106
8.1 Registration vs. tracking in efficient methods . . . . . . . . . . 111
8.2 Features and Measures. . . . . . . . . . . . . . . . . . . . . . . . . 115
8.3 Numerical Ranges for Features. . . . . . . . . . . . . . . . . . . . 115
8.4 Evaluated Additive Algorithms . . . . . . . . . . . . . . . . . . . 127
8.5 Ranges of parameters for cube experiments. . . . . . . . . . . . 129
8.6 Average reprojection error vs. noise for cube. . . . . . . . . . . 129
8.7 Ranges of parameters for face-deform experiments. . . . . . . 144
8.8 Average reprojection error vs. noise for face-deform. . . . . . 144
8.9 Evaluated Compositional Algorithms . . . . . . . . . . . . . . . 164
8.10 Ranges of motion parameters for each dataset. . . . . . . . . . 165
8.11 Average reprojection error vs. noise for plane. . . . . . . . . . 166
9.1 Classification of Motion Warps. . . . . . . . . . . . . . . . . . . . 181
D.1 Lemmas used to re-arrange matrix products. . . . . . . . . . . . 214
D.2 Lemmas used to re-arrange Kronecker matrix products. . . . 216
List of Algorithms
1 Outline of the basic GN-based descent method for image registration . 26
2 Outline of the Lucas-Kanade algorithm. . . . . . . . . . . . . . 28
3 Outline of the Hager-Belhumeur algorithm. . . . . . . . . . . . 31
4 Outline of the Forward Compositional algorithm. . . . . . . . 34
5 Outline of the Inverse Compositional algorithm. . . . . . . . . 36
6 Iterative factorization of the Jacobian matrix. . . . . . . . . . 54
7 Outline of the HB3DTM algorithm. . . . . . . . . . . . . . . . . 64
8 Outline of the full-factorized HB3DMM algorithm. . . . . . . 75
9 Outline of the HB3DMMSF algorithm. . . . . . . . . . . . . . . 76
10 Outline of the Efficient Forward Compositional algorithm. . . 82
11 Outline of the Generalized Inverse Compositional algorithm. 88
12 Creating the synthetic datasets. . . . . . . . . . . . . . . . . . . 119
13 Outline of the GN algorithm. . . . . . . . . . . . . . . . . . . . . 202
Resumen
This thesis addresses the problem of efficiently tracking 3D objects in image
sequences. We tackle 3D tracking by using direct image registration, a technique
that aligns two images using their intensity values. Image registration is usually
solved with iterative optimization methods, in which the function to be minimized
depends on the error in the intensity values. In this thesis we examine the most
common image registration methods, with emphasis on those that use efficient
optimization algorithms.
We investigate two forms of efficient registration. The first comprises the additive
registration methods: the motion parameters are computed incrementally through a
linear approximation of the error function. Within this type of algorithm we focus
on the factorization method of Hager and Belhumeur. We introduce a necessary
requirement that the factorization algorithm must satisfy to achieve good
convergence. In addition, we propose an automatic factorization procedure that
allows us to track both rigid and deformable 3D objects.
The second form comprises the so-called compositional registration methods, in
which the error norm is rewritten using function composition. We study the most
common compositional methods, with emphasis on the fastest registration method,
the inverse compositional algorithm. We introduce a new compositional registration
method, the Efficient Forward Compositional algorithm, which allows us to interpret
the working mechanisms of the inverse compositional algorithm. Thanks to this
novel interpretation, we state two fundamental requirements for efficient
compositional algorithms.
Finally, we carry out a series of experiments with real and synthetic data to verify
the theoretical claims. Furthermore, we distinguish between the registration and
tracking problems for efficient algorithms: those algorithms that meet their
requirement(s) may be used for image registration, but not for tracking.
Abstract
This thesis deals with the problem of efficiently tracking 3D objects in sequences of
images. We tackle the efficient 3D tracking problem by using direct image registra-
tion. This problem is posed as an iterative optimization procedure that minimizes
a brightness error norm. We review the most popular iterative methods for image
registration in the literature, turning our attention to those algorithms that use
efficient optimization techniques.
Two forms of efficient registration algorithms are investigated. The first type
comprises the additive registration algorithms: these algorithms incrementally com-
pute the motion parameters by linearly approximating the brightness error function.
We centre our attention on Hager and Belhumeur’s factorization-based algorithm for
image registration. We propose a fundamental requirement that factorization-based
algorithms must satisfy to guarantee good convergence, and introduce a systematic
procedure that automatically computes the factorization. Finally, we introduce two
warp functions that satisfy this requirement and register rigid and nonrigid 3D
targets.
The second type comprises the compositional registration algorithms, where the
brightness error function is rewritten by using function composition. We study the
current approaches to compositional image alignment, and we emphasize the impor-
tance of the Inverse Compositional method, which is known to be the most efficient
image registration algorithm. We introduce a new algorithm, the Efficient Forward
Compositional image registration: this algorithm avoids the necessity of inverting
the warping function, and provides a new interpretation of the working mechanisms
of the inverse compositional alignment. By using this information, we propose two
fundamental requirements that guarantee the convergence of compositional image
registration methods.
Finally, we support our claims by using extensive experimental testing with
synthetic and real-world data. We propose a distinction between image registration
and tracking when using efficient algorithms. We show that, depending on whether
the fundamental requirements hold, some efficient algorithms are eligible for image
registration but not for tracking.
Notations

Specific Sets and Constants

X          Set of target points or target region.
Ω          Set of target points currently visible.
N          Number of points in the target region—i.e., N = |X|.
N_Ω        Number of visible target points—i.e., N_Ω = |Ω|.
P          Dimension of the parameter space.
C          Number of image channels.
K          Dimension of the deformation space.
F          Number of frames in the image sequence.

Vectors and Matrices

a                   Lowercase bold letters denote vectors.
A_{m×n}             Monospace uppercase letters denote m × n matrices.
vec(A)              Vectorization of matrix A: if A is an m × n matrix, vec(A) is an mn × 1 vector.
I_k ∈ M_{k×k}       k × k identity matrix.
I                   3 × 3 identity matrix.
0_k ∈ R^k           k × 1 vector full of zeroes.
0_{m×n} ∈ M_{m×n}   m × n matrix full of zeroes.

Camera Model Notations

x ∈ R^2             Pixel location in the image.
x̂ ∈ P^2             Location in the projective space.
X ∈ R^3             Point in Cartesian coordinates.
X_c ∈ R^3           Point expressed in the camera reference system.
K ∈ M_{3×3}         3 × 3 camera intrinsics matrix.
P ∈ M_{3×4}         3 × 4 camera projection matrix.

Imaging Notations

T(x) ∈ R^C          Brightness value of the template image at pixel x.
I(x, t) ∈ R^C       Brightness value of the current image for pixel x at instant t.
I_t(x)              Another notation for I(x, t).
T, I_t              Vector forms of functions T and I_t.
[ ]                 Composite function of I ◦ p, that is, I[x] = I(p(x)).
Optimization Notations

µ ∈ R^P             Column vector of motion parameters.
µ_0 ∈ R^P           Initial guess of the optimization.
µ_i ∈ R^P           Parameters at the i-th iteration of the optimization.
µ* ∈ R^P            Actual optimum of the optimization.
µ_t ∈ R^P           Parameters at image t.
µ_J ∈ R^P           Parameters where the Jacobian is computed for efficient algorithms.
δµ ∈ R^P            Incremental step at the current state of the optimization.
ℓ(δµ)               Linear model for the incremental step δµ.
L(δµ)               Local minimizer for the incremental step δµ.
r(µ) ∈ R^N          N × 1 vector-valued residuals function at parameters µ.
∇_x f(x̂)            Derivatives of function f with respect to the variables x, instantiated at x̂.
J(µ) ∈ M_{N×P}      Jacobian matrix of the brightness dissimilarity at µ (i.e., J(µ) = ∇_µ D(X; µ)).
H(µ) ∈ M_{P×P}      Hessian matrix of the brightness dissimilarity at µ (i.e., H(µ) = ∇²_µ D(X; µ)).
Warp Function Notations

f(x; µ) : R^n × R^P → R^n     Motion model or warp.
p : R^n → R^2                 Projection into the Cartesian plane.
R ∈ M_{3×3}                   3 × 3 rotation matrix.
r_i ∈ R^3                     Columns of the rotation matrix R (i.e., R = (r_1, r_2, r_3)).
t ∈ R^3                       Translation vector in Euclidean space.
D : R^2 × R^p → R             Dissimilarity function.
U : R^p × R^p → R^p           Parameters update function.
ψ : R^p × R^p → R^p           Jacobian update function for algorithm GIC.
Factorization Notations

⊗                   Kronecker product.
⊙                   Row-wise Kronecker product.
S(x)                Constant matrix in the factorization method that is computed from the target structure and camera calibration.
M(µ)                Variable matrix in the factorization methods that is computed from motion parameters.
W ∈ M_{p×p}         Weighting matrix for Weighted Least-Squares.
π : R^n → R^n       Permutation of the set {1, . . . , n}.
P_π(n) ∈ M_{n×n}    Permutation matrix of the set {1, . . . , n}.
π(n, q)             Permutation of the set {1, . . . , n} with ratio q.

3D Models Notations

F ⊂ R^2             Reference frame for algorithm HB.
S : F → R^3         Target shape function.
T : F → R^C         Target texture function.
u ∈ F               Target coordinates in the reference frame.
S ∈ M_{3×Nv}        Target 3D shape.
s ∈ R^3             Shape coordinates in the Euclidean space.
s^0 ∈ R^3           Mean shape of the target generative model.
s^i ∈ R^3           i-th basis of deformation of the target generative model.
n^⊤ ∈ R^3           Normal vector to a given triangle. n^⊤ is normalized with the triangle depth (i.e., if x belongs to the triangle, then n^⊤ x = 1).
B_s ∈ M_{3×K}       Basis of deformations.
c ∈ R^K             Vector containing K deformation coefficients.
H_A ∈ M_{3×3}       Affine warp between the image reference frame and F.
Ṙ_∆                 Derivatives of the rotation matrix R with respect to the Euler angle ∆ = {α, β, γ}.
λ ∈ R               Homogeneous scale factor.
v ∈ R^3             Change of variables defined as v = K^{-1} H_A û.
Function Naming Conventions

f_H82D : P^2 → P^2     8-dof homography.
f_H6P : P^2 → P^2      Plane-induced homography.
f_H6S : P^2 → P^2      Shape-induced homography.
f_3DTM : P^2 → P^2     3D Textured Model motion model.
f_H6D : P^2 → P^2      Deformable shape-induced homography.
f_3DMM : P^2 → P^2     3D Textured Morphable Model motion model.
ε : R^p → R            Reprojection error function.
Algorithm Naming Conventions

LK          Lucas-Kanade algorithm [Lucas and Kanade, 1981] (1).
HB          Hager-Belhumeur factorization algorithm [Hager and Belhumeur, 1998].
IC          Inverse Compositional algorithm [Baker and Matthews, 2004].
FC          Forward Compositional algorithm [Baker and Matthews, 2004].
GIC         Generalized Inverse Compositional algorithm [Brooks and Arbel, 2010].
EFC         Efficient Forward Compositional algorithm.
LKH8        Lucas-Kanade algorithm for homographies.
LKH6        Lucas-Kanade algorithm for plane-induced homographies.
LK3DTM      Lucas-Kanade algorithm for 3D Textured Models (rigid).
LK3DMM      Lucas-Kanade algorithm for 3D Morphable Models (deformable).
HB3DTR      Full-factorized HB algorithm for 6-dof motion in R^3 [Sepp, 2006].
HB3DTM      Full-factorized HB algorithm for 3D Textured Models (rigid).
HB3DMM      Full-factorized HB algorithm for 3D Morphable Models (deformable).
HB3DMMSF    Semi-factorized HB algorithm for 3D Morphable Models.
HB3DMMNF    HB algorithm for 3D Morphable Models without the factorization stage.
ICH8        IC algorithm for homographies.
ICH6        IC algorithm for plane-induced homographies.
GICH8       GIC algorithm for homographies.
GICH6       GIC algorithm for plane-induced homographies.
IC3DRT      IC algorithm for 6-dof motion in R^3 [Muñoz et al., 2005].
FCH6PP      FC algorithm for plane+parallax homographies.

(1) We only show the most relevant citation for each algorithm.
Chapter 1
Introduction
This thesis deals with the problems of registration and tracking in sequences of
images. Both problems are classical topics in Computer Vision and Image Processing
that have been widely studied in the past. We summarize the subjects of this thesis
in the dissertation title:
Efficient Model-based 3D Tracking by using Direct Image Registration
What is 3D Tracking? Let the target be a part of the scene—e.g. the cube in
Figure 1.1. We define tracking as the process of repeatedly computing the target
state in a sequence of images. When we describe this state as the relative 3D
orientation and location of the target with respect to the coordinate system of the
camera (or another arbitrary reference system), we refer to this process as 3D rigid
tracking (see Figure 1.1). If we also include state parameters that describe the
possible deformation of the object, we have 3D nonrigid or deformable tracking (see
Figure 1.2). We use 3D tracking to refer to both the rigid and the nonrigid cases.
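As an illustration of such a state, one possible parameterization (the precise warps and parameter vectors are defined in Chapter 5) stacks a 6-dof pose for the rigid case and appends K deformation coefficients for the nonrigid case:

\[
\mu_{\mathrm{rigid}} = (\alpha, \beta, \gamma, t_x, t_y, t_z)^{\top} \in \mathbb{R}^{6},
\qquad
\mu_{\mathrm{nonrigid}} = (\alpha, \beta, \gamma, t_x, t_y, t_z, c_1, \ldots, c_K)^{\top} \in \mathbb{R}^{6+K},
\]

where (α, β, γ) are the Euler angles of the rotation, t the translation, and c_1, . . . , c_K the deformation coefficients.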
What is Direct Image Registration? When the target is imaged by two cameras
from different points of view, the resulting images are different although they
represent the same portion of the scene (see Figure 1.3). Image Registration or
Image Alignment computes the geometric transformation that best aligns the coor-
dinate systems of both images such that their pixel-wise differences are minimal (cf.
Figure 1.3). We say that the image registration is a direct method when we register
the coordinate systems by just using the brightness differences of the images.
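In its simplest form (a sketch assuming brightness constancy and an L2 error norm; the general formulation is developed in Chapter 3), direct registration seeks the warp parameters that minimize the pixel-wise brightness differences between the warped current image and the template:

\[
\mu^{*} = \arg\min_{\mu} \sum_{x \in X} \left\| I\!\left( f(x; \mu), t \right) - T(x) \right\|^{2},
\]

where T is the template, I the current image, f the motion model or warp, and X the target region.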
What is Model-based? We say that a technique is model-based when we re-
strict the information from the real world by using certain assumptions: on the
target dynamics, on the target structure, on the camera sensing process, etc—e.g.
in Figure 1.1 we model the target with a cube structure and rigid body dynamics.
Figure 1.1: Example of 3D rigid tracking (Left) Selected frames of a scene containing
a textured cube. We track the object and we overlay its state in blue. (Right) The relative
position of the camera—represented by a coloured pyramid—and the cube is computed
from the estimated 3D parameters.
Figure 1.2: 3D Nonrigid Tracking. Selected frames from a sequence of a cushion
under a bending motion. We track some landmarks on the cushion through the sequence,
and we plot the resulting triangular mesh for the selected frames. The motion of the
landmarks is both global—translation of the mesh—and local—changes on the relative
position of the mesh vertices due to the deformation. Source: Alessio del Bue.
Figure 1.3: Image registration. (Top-row) Images of a portion of the scene from two
distinct points of view. We have outlined the target in blue (Top-left) and green (Top-
right). (Bottom) The left image is warped such that the coordinates of the target match
up in both images. Source: Graffiti sequence, from Oxford Visual Geometry Group.
And Finally, What does Efficient mean? We say that a method is efficient
if it substantially improves the computation time with respect to gold-standard
techniques. In a more practical way, efficient is equivalent to real-time, i.e. the
tracking procedure operates at 25 frames per second.
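As a rough figure, operating at 25 frames per second leaves a budget of 1/25 s = 40 ms per frame; all the work done for that frame (warping, Jacobian computation and the linear solve, over every iteration of the optimization) must fit within this budget.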
1.1 Motivation
In less than thirty years, and despite being largely confined to academic and military
environments, video tracking has gained widespread recognition, mainly thanks to the media.
Thus, video tracking is now a staple in sci-fi shows and films where futuristic Head-
up Displays (hud) work in a show-and-tell fashion, a camera surveillance system
can locate an object or a person, or a robot can address people and even recognize
their mood.
However, tv is, sad to say, years ahead of reality. Actual video tracking systems
are still at a primitive stage: they are inaccurate, sloppy, slow, and usually work in
laboratory conditions only. Nevertheless, video tracking is progressing by leaps and
bounds, and it will probably match some sci-fi standards soon.
We investigate the problem of efficiently tracking an object in a video sequence.
Nowadays there exist several efficient optimization algorithms for video tracking
or image registration. We study two of the fastest algorithms available: the Hager-
Belhumeur factorization algorithm and the Baker-Matthews inverse compositional
algorithm. Both algorithms, although very efficient for planar registration, present
diverse problems for 3D tracking. This thesis studies which assumptions can be made
with these algorithms whilst underlining their limitations through extensive testing.
Ultimately, the objective is to provide a detailed description of each algorithm, pointing
out pros and cons, leading to a kind of Quick Guide to Efficient Tracking Algorithms.
1.2 Applications
Typical applications for 3D tracking include target localization for military oper-
ations; security and surveillance tasks such as person counting, face identification,
people detection, determining people activity or detecting left objects; it also in-
cludes human-computer interaction for computer security, aids for disabled people
or even controlling video-games. Tracking is used for augmenting video sequences
with additional information such as advertisements, expanding information about
the scene, or adding or removing objects of the scene. We show some examples of
actual industrial applications in Figure 1.4.
A tracking process that is widely used in the film industry is Motion Capture: we
track the motion of the different parts of an actor's body using a suit equipped
with reflective markers; then, we transfer the estimated motion to a computer-
generated character (see Figure 1.5). Using this technique, we can animate a syn-
thetic 3D character in a movie as Gollum in the Lord of the Rings trilogy (2001),
or Jar-Jar Binks in the new Star Wars trilogy (1999). Other relevant movies
that employ these techniques are Polar Express (2004), King Kong (2005), Beowulf
(2007), A Christmas Carol (2009), and Avatar (2009). Furthermore, we can generate
a complete computer-generated movie populated with characters animated through
motion capture. Facial motion capture is of special interest for us: we animate a
computer-generated facial expression by facial expression tracking (see Figure 1.5).
We turn our attention to markerless facial motion capture, that is, the process
of recovering the face expression and orientation without using fiducial markers.
Markerless motion capture does not require special equipment—such as close-up
Figure 1.4: Industrial applications of 3D tracking. (Top-left) Augmented reality
inserts virtual objects into the scene. (Top-middle) Augmented reality shows additional
information about tracked objects in the scene. Source: Hawk-eye, Hawk-Eye Innovations
Ltd., copyright © 2008. (Top-right) Tracking a pedestrian for video surveillance. Source:
Martin Communications, copyright © 1998-2007. (Bottom-left) People flow counter by
tracking. Source: EasyCount, by Keeneo, copyright © 2010. (Bottom-middle) Car track-
ing detects possible traffic infractions or estimates car speed. Source: Fibridge, copy-
right ©. (Bottom-right) Body tracking is used for interactive controlling of video-games.
Source: Kinect, Microsoft, copyright © 2010.
cameras—or a complicated set-up on the actor’s face—such as special reflective
make-up or facial stickers. In this thesis we propose a technique that captures facial
expression motion by only using brightness information and prior knowledge of
the deformation of the target (see Figure 1.6).
1.3 Contributions of the Thesis
We outline the remaining chapters of the thesis and their principal contributions as
follows:
Chapter 2: Literature Review We provide a detailed survey of the literature
on techniques for both image registration and tracking.
Chapter 3: Efficient Direct Image Registration We review the state of the art on
efficient methods. We introduce the taxonomy for efficient registration algorithms:
Figure 1.5: Motion capture in the film industry. Facial and body motion capture
from AvatarTM (Top-row) and Polar ExpressTM (Bottom-row). (Left-column) The
body motion and head pose are computed using reflective fiducial markers—grey spheres
of the motion capture jumpsuit. For facial expression capture they use plenty of smaller
markers and even close-up cameras. (Right-column) They use the estimated motion to
animate characters in the movie. Source: Avatar, 20th Century Fox, copyright © 2009;
Polar Express, Warner Bros. Pictures, copyright © 2004.
an algorithm is classified as either additive or compositional.
Chapter 4: Equivalence of Gradients We introduce the gradient equiva-
lence equation constraint: we show that satisfying this assumption has a positive
effect on the performance of the algorithms.
Chapter 5: Additive Algorithms We review which constraints determine the
convergence of additive registration algorithms, especially the factorization approach.
We provide a methodical procedure to factorize an algorithm in general form; we
state a basic set of theorems and lemmas that enable us to systematize the factor-
ization. We introduce two tracking algorithms using factorization: one for rigid 3D
objects and another for deformable 3D objects.
Figure 1.6: Markerless facial motion capture. (Top) Several frames where the
face modifies both its orientation—due to a rotation—and its shape structure—due to
changes in facial expression. (Bottom) The tracking state vector includes both pose and
deformation. Legend: Blue Actual projection of the target shape using the estimated
parameters; Pink Highlighted projections corresponding to profiles of the jaw, eyebrows,
lips and nasolabial wrinkles.
Chapter 6: Compositional Algorithms We review the basic inverse composi-
tional algorithm. We introduce an alternative efficient compositional algorithm that
is equivalent to the inverse compositional algorithm under certain assumptions. We
show that if the gradient equivalence equation holds then both efficient compositional
methods shall converge.
Chapter 7: Computational Complexity We study the resources used by the
registration algorithms in terms of their computational complexity. We compare the
theoretical complexities of efficient and nonefficient algorithms.
Chapter 8: Experiments We devise a set of experimental tests that shall confirm
our assumptions on the registration algorithms, that is, (1) verify the dependence
of the convergence on the algorithm constraint, and (2) evaluate the theoretical
complexities with actual data.
Chapter 9: Conclusions and Future Work Finally, we draw conclusions
about where each technique is most suitable, and we provide insight into
future work to improve the proposed methods.
Chapter 2
Literature Review
In this chapter we review the basic literature on tracking and image registration.
First we introduce the basic similarities and differences between image registration
and tracking. Then, we review the usual methods for both tracking and image
registration.
2.1 Image Registration vs. Tracking
The frontier between image registration and tracking is a bit fuzzy: tracking identi-
fies the location of an object in a sequence of images, whereas registration finds the
pixel-to-pixel correspondence between a pair of images. Note that in both cases we
compute a geometric and photometric transformation between images: pairwise in
the context of image registration and among multiple images for the tracking case.
Although we may indistinctly use the terms registration and tracking, we define the
following subtle semantic differences between them:
• Image registration finds the best alignment between two images of the same
scene. We use a geometric transformation to align the images of both
cameras. We consider that image registration emphasizes finding the best
alignment between two images in visual terms, not in accurately recovering
parameters of the transformation—this is usually the case in e.g., medical
applications.
• Tracking finds the location of a target object in each frame of a sequence. We
assume that the difference of object position between two consecutive frames is
small. In tracking we are typically interested in recovering the parameters de-
scribing the state of the object rather than the coordinates of the location: we
can describe an object using richer information that just its position (e.g. 3D
orientation, modes of deformation, lighting changes, etc.). This is usually the
case in robotics [Benhimane and Malis, 2007; Cobzas et al., 2009; Nick Molton,
2004], or augmented reality [Pilet et al., 2008; Simon et al., 2000; Zhu et al.,
2006].
Also, image registration involves two images with arbitrary baseline whereas track-
ing usually operates in a sequence with a small inter-frame baseline. We assume
that tracking is a higher level problem than image registration. Furthermore, we
propose a tracking-by-registration approach: we track an object through a sequence
by iteratively registering pairs of consecutive images [Baker and Matthews, 2004];
however, we can perform tracking without any registration at all (e.g. tracking-
by-detection [Viola and Jones, 2004], or tracking-by-classification [Vacchetti et al.,
2004]).
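The tracking-by-registration idea can be outlined as a simple loop; the sketch below is only illustrative (the register routine, the frame iterator and the parameter object are hypothetical placeholders, not part of any specific algorithm in this thesis):

def track_by_registration(frames, template, mu0, register):
    """Track a target by registering each incoming frame against the template."""
    mu = mu0                  # initial guess, e.g. from a detector or a manual initialization
    trajectory = []
    for frame in frames:
        # Tracking assumes a small inter-frame baseline, so the previous
        # estimate is a good starting point for the iterative registration.
        mu = register(frame, template, mu)
        trajectory.append(mu)
    return trajectory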
2.2 Image Registration
Image registration is a classic topic in computer vision and numerous approaches
have been proposed in the literature; two good surveys in the subject are [Brown,
1992] and [Zitova, 2003]. The process involves computing the pixel-to-pixel corre-
spondence between the two images: that is, for each pixel on one image we find
the corresponding pixel in the other image so that both pixels project from the
same actual point in the scene (cf. Figure 1.3). Applications include image mo-
saicing [Capel, 2004; Irani and Anandan, 1999; Shum and Szeliski, 2000], video
stitching [Caspi and Irani, 2002], super-resolution [Capel, 2004; Irani and Peleg,
1991], region tracking [Baker and Matthews, 2004; Hager and Belhumeur, 1998; Lu-
cas and Kanade, 1981], recovering scene/camera motion [Bartoli et al., 2003; Irani
et al., 2002], or medical image analysis [Lester and Arridge, 1999].
Image registration methods commonly fall into one of the two following groups [Bar-
toli, 2008; Capel, 2004; Irani and Anandan, 1999]:
Direct methods A direct image registration method aligns two images by only
using the colour—or intensity in greyscale data—values of the pixels that
are common to both images (namely, the region of support). Direct meth-
ods minimize an error measure based on image brightness from the region of
support. Typical error measures include an L2-norm of the brightness difference
[Irani and Anandan, 1999; Lucas and Kanade, 1981], normalized cross-correlation
[Brooks and Arbel, 2010; Lewis, 1995], or mutual information [Dowson and Bowden,
2008; Viola and Wells, 1997]; a small sketch of the first two measures is given
after this list.
Feature-based methods In feature-based methods, we align two images by com-
puting the geometric transformation between a set of salient features that
we detect in each image. The idea is to abstract distinct geometric image
features that are more reliable than the raw intensity values; typically these
features show invariance with respect to modifications of the camera point-of-
view, illumination conditions, scale, or orientation of the scene [Schmid et al.,
2000]. Corners or interest points [Bay et al., 2008; Harris and Stephens, 1988;
Lowe, 2004; Torr and Zisserman, 1999] are classical features in the literature,
although we can use other features such as edges [Bartoli et al., 2003], or
extremal image regions [Matas et al., 2002].
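To make the first two error measures concrete, a minimal sketch (assuming the warped image patch and the template have already been sampled into equally-sized arrays; NumPy is used only for brevity):

import numpy as np

def ssd(warped, template):
    # L2-norm of the brightness difference (sum of squared differences).
    return np.sum((warped - template) ** 2)

def ncc(warped, template):
    # Normalized cross-correlation; invariant to affine brightness changes.
    w = (warped - warped.mean()) / warped.std()
    t = (template - template.mean()) / template.std()
    return np.mean(w * t)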
Direct or feature-based methods? Choosing between direct or feature-based
methods is not an easy task: we have to know the strong points of each method
and for what applications it is more suitable. A good comparison between the two
types of methods is [Capel, 2004]. Feature-based methods typically show strong
invariance to a wide range of photometric and geometric transformations of the im-
age, and they are more robust to partial occlusions of the scene than their direct
counterparts [Capel, 2004; Torr and Zisserman, 1999]. On the other hand, direct
methods can align images with sub-pixel accuracy, estimate dominant motion even
when multiple motions are present, and they can provide a dense motion field in case of
3D estimation [Irani and Anandan, 1999]. Moreover, direct methods do not require
high-frequency textured surfaces (corners) to operate, but have optimal performance
with smooth graylevel transitions [Benhimane et al., 2007].
2.3 Model-based 3D Tracking
In this section we define what model-based tracking is, and we review the previous
literature on 3D tracking of rigid and nonrigid objects. A special case of interest
for nonrigid objects is the 3D tracking of human faces or facial motion capture.
We can recover the 3D orientation and position of the target with respect
to the camera (or an arbitrary reference system), or the relative displacement and
orientation of the camera with respect to the target (or another arbitrary reference
system in the scene) [Sepp, 2008]. A good survey on the subject is [Lepetit and
Fua, 2005].
2.3.1 Modelling assumptions
In model-based techniques we use a priori knowledge about the scene, the target,
or the sensing device, as a basis for the tracking procedure. We classify these
assumptions on the real-world information as follows:
Target model
The target model specifies how to represent the information about the structure of
the scene in our algorithms. Template tracking or template matching simply repre-
sents the target as the pixel intensity values inside a region defined on one image:
we call this region—or the image itself—the reference image or template. One of
the first proposed techniques for template matching was [Lucas and Kanade, 1981],
although it was initially devised for solving optical flow problems. The literature
proposes numerous extensions to this technique [Baker and Matthews, 2004; Benhi-
mane and Malis, 2007; Brooks and Arbel, 2010; Hager and Belhumeur, 1998; Jurie
and Dhome, 2002a].
We may also allow the target to deform its shape: this deformation induces
changes in the target projected appearance. We model these changes in target
texture by using generative models such as eigenimages [Black and Jepson, 1998;
Buenaposada et al., 2009], Active Appearance Models (aam) [Cootes et al., 2001],
active blobs [Sclaroff and Isidoro, 2003], or subspace representation [Ross et al.,
2004]. Instead of modelling brightness variations we may represent target shape
deformation by using a linear model representing the location of a set of feature
points [Blanz and Vetter, 2003; Bregler et al., 2000; Del Bue et al., 2004], or Finite
Element Meshes [Pilet et al., 2005; Zhu et al., 2006]. Alternative approaches model
non-rigid motion of the target by using anthropometric data [Decarlo and Metaxas,
2000], or by using a probability distribution of the intensity values of the target
region [Comaniciu et al., 2000; Zimmermann et al., 2009].
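The linear shape models mentioned above typically express any admissible shape as a mean shape plus a weighted combination of deformation bases; in the notation defined earlier (a form of the kind that reappears for the Nonrigid Morphable Models of Chapter 5),

\[
s(c) = s^{0} + \sum_{i=1}^{K} c_i \, s^{i},
\]

where s^0 is the mean shape, s^i the i-th basis of deformation, and c = (c_1, . . . , c_K)^⊤ the vector of deformation coefficients.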
These techniques are suitable to track planar objects of the scene. If we add fur-
ther knowledge about the scene, we can track more complex objects: with a proper
model we are able to recover 3D information. Typically, we use a wireframe 3D
model of the target and tracking consists of finding the best alignment between the
sensed image and the 3D model [Cipolla and Drummond, 1999; Kollnig and Nagel,
1997; Marchand et al., 1999]. We can augment this model by adding further texture
priors either from the image stream [Cobzas et al., 2009; Muñoz et al., 2005; Sepp
and Hirzinger, 2003; Vacchetti et al., 2004; Xiao et al., 2004a; Zimmermann et al.,
2006], or from an external source (e.g. a 3D scanner or a texture mosaic) [Hong
and Chung, 2007; La Cascia et al., 2000; Masson et al., 2004, 2005; Pressigout and
Marchand, 2007; Romdhani and Vetter, 2003].
Motion model
The motion model describes the target kinematics (i.e. how the object modifies
its position in the image/scene). The motion model is tightly coupled to the tar-
get model: it is usually represented by a geometric transformation that maps the
coordinates of the target model into a different set of coordinates. For a planar
target, these geometric transformations are typically affine [Hager and Belhumeur,
1998], homographic [Baker and Matthews, 2004; Buenaposada and Baumela, 1999],
or spline-based warps [Bartoli and Zisserman, 2004; Brunet et al., 2009; Lester and
Arridge, 1999; Masson et al., 2005]. For actual 3D targets, the geometric warps
compute the rotation and translation of the object using a 6 degree-
of-freedom (dof) rigid body transformation [Cipolla and Drummond, 1999; La Cascia
et al., 2000; Marchand et al., 1999; Sepp and Hirzinger, 2003].
Camera model
The camera model specifies how the images are sensed by the camera. The pin-
hole camera models the imaging device as a projector of the coordinates of the
scene [Hartley and Zisserman, 2004]. For tracking zoomed objects located far away,
we may use orthographic projection [Brand and R.Bhotika, 2001; Del Bue et al.,
2004; Tomasi and Kanade, 1992; Torresani et al., 2002]. The perspective projection
accounts for perspective distortion, and it is more suitable for close-up views [Muñoz
et al., 2005, 2009]. The camera model may also account for model deviations such
as lens distortion [Claus and Fitzgibbon, 2005; Tsai, 1987].
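As a concrete sketch of the pinhole model in the notation used here (homogeneous coordinates assumed), a scene point X projects onto the image as

\[
\lambda \, \hat{x} = K \, [\, R \mid t \,] \begin{pmatrix} X \\ 1 \end{pmatrix}
= P \begin{pmatrix} X \\ 1 \end{pmatrix},
\qquad
x = p(\hat{x}),
\]

where λ is the homogeneous scale factor and p the projection onto the Cartesian plane; the orthographic model instead drops the dependence on depth.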
Other model assumptions
We can also model prior photometric knowledge about the target/scene such as
illumination cues [La Cascia et al., 2000; Lagger et al., 2008; Romdhani and Vetter,
2003], or global colour [Bartoli, 2008].
2.3.2 Rigid Objects
We can follow two strategies to recover the 3D parameters of a rigid object:
2D Tracking The first group of methods involves a two-step process: first we
compute the 2D motion of the object as a displacement of the target projection
on the image; second, we recover the actual 3D parameters from the computed
2D displacements by using the scene geometry. A natural choice is to use
optical flow: [Irani et al., 1997] computes the dominant 2D parametric motion
between two frames to register the images; the residual displacement—the
image regions that cannot be registered—is used to recover the 3D motion.
When the object is a 3D plane, we can use a homographic transformation to
compute plane-to-plane correspondences between two images; then we recover
the actual 3D motion of the plane using the camera geometry [Buenaposada
and Baumela, 2002; Lourakis and Argyros, 2006; Simon et al., 2000]. We
can also compute the inter-frame displacements by using linear regressors or
predictors, and then we robustly adjust the projections to a target model—
using RANSAC—to compute the 3D parameters [Zimmermann et al., 2009]. An
alternative method is to compute pixel-to-pixel correspondences by using a
classifier [Lepetit and Fua, 2006], and then recover the target 3D pose using
POSIT [Dementhon and Davis, 1995], or equivalent methods [Lepetit et al.,
2009].
3D Tracking These methods directly compute the actual 3D motion of the object
from the image stream. They mainly use a 3D model of the target to compute
the motion parameters; the 3D model contains a priori knowledge of the target
that improves the estimation of motion parameters (e.g. to get rid of projec-
tive ambiguities). The simplest way to represent a 3D target is using a texture
model—a set of image patches sensed from one or several reference images—as
in [Cobzas et al., 2009; Devernay et al., 2006; Jurie and Dhome, 2002b; Masson
et al., 2004; Sepp and Hirzinger, 2003; Xu and Roy-Chowdhury, 2008]. The
main drawback of these methods is the lack of robustness against changes in
scene illumination or specular reflections. We can alternatively fit the projection
of a 3D wireframe model (e.g. a cad model) to the edges of the image [Drum-
mond and Cipolla, 2002]. However, these methods also have problems with
cluttered backgrounds [Lepetit and Fua, 2005]. To gain robustness, we can use
hybrid models of texture and contours such as [Marchand et al., 1999; Masson
et al., 2003; Vacchetti et al., 2004], or simply use an additional model to deal
with illumination [Romdhani and Vetter, 2003].
2.3.3 Nonrigid Objects
Tracking methods for nonrigid objects fall into the same categories as those used for
rigid ones. Point-to-point correspondences of the deformable target can recover
the pose and/or deformation parameters using subspace methods [Del Bue, 2010;
Torresani et al., 2008], or fitting a deformable triangle mesh [Pilet et al., 2008;
Salzmann et al., 2007]. We can alternatively fit the 2D silhouette of the target to a
3D skeletal deformable model of the object [Bowden et al., 2000].
Direct estimation of the 3D parameters unifies the processes of matching pixel
correspondences, and estimating the pose and deformation of the target. [Brand,
2001; Brand and R.Bhotika, 2001] constrains the optical flow by using a linear
generative model to represent the deformation of the object. [Gay-Bellile et al.,
2010] models the object 3D deformations, including self-occlusions, by using a set
of Radial Basis Functions (rbf).
2.3.4 Facial Motion Capture
Estimation of facial motion parameters is a challenging task; head 3D orientation
was typically estimated by using fiducial markers to overcome the inherent difficulty
of the problem [Bickel et al., 2007].
However, markerless methods have also been developed in recent years. Facial
motion capture involves recovering head 3D orientation and/or face deformation due
to changes in expression. We first review techniques for recovering head 3D pose,
then we review techniques for recovering both pose and expression.
Head pose estimation There are numerous techniques to compute head pose or
3D orientation. In the following, we review a number of them—a recent detailed
survey on the subject is [Murphy-Chutorian and Trivedi, 2009]. The main difficulty
of estimating head pose lies in the nonconvex structure of the human head. Classic
2D approaches such as [Black and Yacoob, 1997; Hager and Belhumeur, 1998] are
only suitable to track motions of the head parallel to the image plane: the rea-
son is that these methods only use information from a single reference image. To
fully recover the 3D rotation parameters of the head we need additional informa-
tion. [La Cascia et al., 2000] uses a texture map that was computed by cylindrical
projection of different point-of-view images of the head; [Baker et al., 2004a; Jang
and Kanade, 2008] also use an analogous cylindrical model. In a similar fashion, we
can use a 3D ellipsoid shape [An and Chung, 2008; Basu et al., 1996; Choi and Kim,
2008; Malciu and Prêteux, 2000]. Instead of using a cylinder or an ellipsoid, we can
have a detailed model of the head like a 3D Morphable Model (3dmm) [Blanz and
Vetter, 2003; Muñoz et al., 2009; Xu and Roy-Chowdhury, 2008], an aam coupled
together with a 3dmm [Faggian et al., 2006], or a triangular mesh model of the
face [Vacchetti et al., 2004]. The latter is robustly tracked in [Strom et al., 1999]
using an Extended Kalman Filter. We can also have a head model with reduced
complexity as in [B. Tordoff et al., 2002].
Face expression estimation A change of facial expression induces a deforma-
tion in the 3D structure of the face. The estimation of this deformation can be
used for face expression recognition, expression detection, or facial motion trans-
fer. Classic 2D approaches such as aams [Cootes et al., 2001; Matthews and Baker,
2004] are only suitable to recover expressions from a frontal face. 3D aams are the
three-dimensional extension to these 2D methods: they adjust a statistical model
of 3D shapes and texture—typically a PCA model—to the pixel intensities of the
image [Chen and Wang, 2008; Dornaika and Ahlberg, 2006]. Hybrid methods that
combine 2D and 3D aams show both real-time performance and actual 3D head
pose estimation: we can use the 3D aams to simultaneously constrain the 2D aams
motion and compute the 3D pose [Xiao et al., 2004b], or directly compute the fa-
cial motion from the 2D aams parameters [Zhu et al., 2006]. In contrast to pure
2D aams, 3D aams can recover actual 3D pose and expression from faces that are
not frontal to the camera. However, the out-of-plane rotations that can be recov-
ered by these methods are typically smaller than using a pure 3D model (e.g. a
3dmm). [Blanz and Vetter, 2003; Romdhani and Vetter, 2003] search the best con-
figuration for a 3dmm such that the differences between the rendered model and the
image are minimal; both methods also show great performance recovering strong fa-
cial deformations. Real-time alternatives using 3dmm include [Hiwada et al., 2003;
Muñoz et al., 2009]. [Pighin et al., 1999] uses a linear combination of 3D face models
fitted to match the images to estimate realistic facial expressions. Finally, [Decarlo
and Metaxas, 2000] derives an anthropometric physically-based face model that may
be adjusted to each individual face; in addition, they solve a dynamic system for
the face pose and expression parameters by using optical flow constrained by the
edges of the face.
Chapter 3
Efficient Direct Image Registration
3.1 Introduction
This chapter reviews the problem of efficiently registering two images. We define
the Direct Image Alignment (dia) problem as the process of computing the trans-
formation between two frames using only image brightness information. We orga-
nize the chapter as follows: Section 3.2 introduces basic registration notions; Sec-
tion 3.3 reviews additive registration algorithms such as Lucas-Kanade or Hager-
Belhumeur; Section 3.4 reviews compositional registration algorithms such as Baker
and Matthews’ Forward Compositional and Inverse Compositional; finally, other
methods are reviewed in Section 3.5.
3.2 Modelling Assumptions
This section reviews those assumptions on the real world that we use to mathemat-
ically model the registration procedure. We introduce the notation on the imaging
process through a pinhole camera. We ascertain the Brightness Constancy Assump-
tion or Brightness Constancy Constraint (bcc) as the cornerstone of the direct
image registration techniques. We also pose the registration problem as an itera-
tive optimization problem. Finally, we provide a classification of the existing direct
registration algorithms.
3.2.1 Imaging Geometry
We represent points of the scene using Cartesian coordinates in R^3 (e.g. X = (X, Y, Z)^⊤).
We represent points on the image with homogeneous coordinates, so that the pixel position
x = (i, j)^⊤ is written using the notation for augmented points as x̃ = (i, j, 1)^⊤. The
homogeneous point x̃ = (x_1, x_2, x_3)^⊤ is conversely represented in Cartesian coordinates
using the mapping p : P^2 → R^2, such that p(x̃) = x = (x_1/x_3, x_2/x_3). The scene is
imaged through a perfect pin-hole camera [Hartley and Zisserman, 2004]; by abuse of
notation, we define the perspective
Figure 3.1: Imaging geometry. An object of the scene is imaged through camera
centres C1 and C2 onto two distinct images I1 and I2 (related by a rotation R and
a translation t). The point X is projected to the points x_1 = p(K [I | 0] X̃) and
x_2 = p(K [R | −Rt] X̃) in the two images.
projection p : R^3 → R^2 that maps scene coordinates onto image points,
$$\mathbf{x} = p(\mathbf{X}_c) = \left( \frac{\mathbf{k}_1^\top \mathbf{X}_c}{\mathbf{k}_3^\top \mathbf{X}_c},\; \frac{\mathbf{k}_2^\top \mathbf{X}_c}{\mathbf{k}_3^\top \mathbf{X}_c} \right)^{\!\top},$$
where K = (k_1^⊤, k_2^⊤, k_3^⊤)^⊤ is the 3 × 3 matrix that contains the camera intrinsics
(cf. [Hartley and Zisserman, 2004]), and X_c = (X_c, Y_c, Z_c)^⊤. We implicitly assume
that X_c represents a point in the camera reference system. If the points to project
are expressed in an arbitrary reference system of the scene we need an additional
mapping; hence, the perspective projection for a point X in the scene is
$$\tilde{\mathbf{x}} = K \left[ R \mid -R\mathbf{t} \right] \begin{pmatrix} \mathbf{X} \\ 1 \end{pmatrix},$$
where R and t are the rotation and translation between the scene and the camera
coordinate system (see Figure 3.1). Our input is a smooth sequence of images—i.e.
inter-frame differences are small—where I_t is the t-th frame of the sequence. We denote
T as the reference image or template. Images are discrete matrices of brightness values,
although we represent them as functions from R^2 to R^C, where C is the number of image
channels (i.e. C = 3 for colour images, and C = 1 for gray-scale images): I_t(x) is the
brightness value at pixel x. For non-discrete pixel coordinates, we use bilinear interpolation.
If X is a set of pixels, we collect the brightness values I(x), ∀x ∈ X, in a single column
vector I(X)—i.e., I(X) = (I(x_1), . . . , I(x_N))^⊤, {x_1, . . . , x_N} ∈ X.
3.2.2 Brightness Constancy Constraint
The bcc relates brightness information between two frames of a sequence [Hager
and Belhumeur, 1998; Irani and Anandan, 1999]. The reference image T is one
arbitrary image of the sequence. We define the target region X as a set of pixel
coordinates X = {x1, . . . , xN} defined on T (see Figure 3.2). We define the template
as the image values of the target region, that is, T (X). Let us assume we know
the transformation of the target region between T and another arbitrary image of
the sequence, It. The motion model f defines this transformation as Xt = f(X; µt),
where the set of coordinates Xt is the target region on It and µt are the motion
parameters. The bcc states that the brightness values of the template T and the
input image It warped by f with parameters µt should be equal,
T (X) = It(f(X; µt)). (3.1)
The direct conclusion from Equation 3.1 is that the brightness of the target does not
depend on its motion—i.e., the relative position and orientation of the camera with
respect to the target does not affect the brightness of the latter. However, we may aug-
ment the bcc to include appearance changes [Black and Jepson, 1998; Buenaposada
et al., 2009; Matthews and Baker, 2004], and changes in illumination conditions due
to ambient [Bartoli, 2008; Basri and Jacobs, 2003] or specular lighting [Blanz and
Vetter, 2003].
3.2.3 Image Registration by Optimization
Direct image registration is usually posed as an optimization problem. We minimize
an error function based on the brightness pixel-wise difference that is parameterized
by motion variables:
$$\mu^{*} = \arg\min_{\mu}\, \{\, D(X;\mu)^{2} \,\}, \qquad (3.2)$$
where
$$D(X;\mu) = T(X) - I_{t}(f(X;\mu)) \qquad (3.3)$$
is a dissimilarity measure based on the bcc (Equation 3.1).
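As a small illustration of Equations 3.2 and 3.3, the sketch below evaluates the residual vector D(X; µ) and its squared L2 norm (the ssd discussed below) for a toy translational warp; the image, the target region and the warp are invented for the example.

```python
import numpy as np
from scipy.ndimage import map_coordinates

def residual(T_vals, I, X, f, mu):
    """D(X; mu) = T(X) - I_t(f(X; mu)) for a 2xN pixel set X (Equation 3.3)."""
    Xw = f(X, mu)
    return T_vals - map_coordinates(I, [Xw[1], Xw[0]], order=1)

def ssd(T_vals, I, X, f, mu):
    """Squared L2 norm of the brightness dissimilarity (the objective of Equation 3.2)."""
    r = residual(T_vals, I, X, f, mu)
    return float(r @ r)

# Toy setup: the frame is the template cyclically shifted by 2 rows and 3 columns
T = np.random.rand(60, 80)
I = np.roll(T, shift=(2, 3), axis=(0, 1))
ii, jj = np.meshgrid(np.arange(20, 50), np.arange(20, 40), indexing='ij')
X = np.vstack([jj.ravel(), ii.ravel()]).astype(float)   # x = (column, row) pairs
T_vals = T[ii.ravel(), jj.ravel()]
translate = lambda X, mu: X + np.reshape(mu, (2, 1))    # a two-parameter warp

print(ssd(T_vals, I, X, translate, [0.0, 0.0]))   # large: wrong motion parameters
print(ssd(T_vals, I, X, translate, [3.0, 2.0]))   # ~0: the true displacement
```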
Descent Methods
Recovering these parameters is typically a non-linear problem as it depends on
image brightness—which is usually non-linearly related to the motion parameters.
The usual approach is iterative gradient-based descent (GD): from a starting point
µ0 in the search space, the method iteratively computes a series of partial solu-
tions µ_1, µ_2, . . . , µ_k that, under certain conditions, converge to the local minimizer
µ^* [Madsen et al., 2004] (see Figure 3.2). We typically use Gauss-Newton (GN)
methods for efficient registration because they provide good convergence without
computing second derivatives (see Appendix A). Hence, the basic GN-based algo-
rithm for image registration operates as we outline in Algorithm 1 and depict in
Figure 3.3. We describe the four stages of the algorithm in the following:
Figure 3.2: Iterative gradient descent image registration. Top-left Template
image for the registration. We highlight the target region as a green quadrangle. Top-
right Image that we register against the template. We generate the image by rotating the
image around its centre and translating it in the X-axis. We highlight the corresponding
target region in yellow. We also display the initial guess for the optimization as a green
quadrangle. Notice that it exactly corresponds to the position of the target region at the
template. Bottom-left Contour plot of the image brightness dissimilarity. The axis show
the values of the search space: image rotation and translation. We show the successive
iterations in the search space: we reach the solution in four steps—µ0 to µ4. Bottom-
right We show the target region that corresponds to the parameters of each iteration.
The colour of each quadrangle matches the colour of the parameters that generated it as
seen in the Bottom-left figure.
Dissimilarity measure The dissimilarity measure is a function on the image bright-
ness error between two images. The usual measure for image registration is
the Sum of Squared Differences (ssd), that is, the squared L2-norm of the difference
of pixel brightness values (Equation 3.3) [Brooks and Arbel, 2010; Hager and Bel-
humeur, 1998; Irani and Anandan, 1999; Lucas and Kanade, 1981]. However,
we can use other measures such as normalized cross-correlation [Brooks and
Arbel, 2010; Lewis, 1995], or mutual information [Brooks and Arbel, 2010;
Dowson and Bowden, 2008; Viola and Wells, 1997].
Linearize the dissimilarity The next stage linearizes the brightness function about
the current search parameters µ; this linearization enables us to transform
the problem into a system of linear equations on the search variables. We
typically approximate the function using Taylor series expansion; depending
on how many terms—derivatives—we compute, we have optimisation methods
like Gradient Descent [Amberg and Vetter, 2009], Newton-Raphson [Lucas and
Kanade, 1981; Shi and Tomasi, 1994], Gauss-Newton [Baker and Matthews,
2004; Brooks and Arbel, 2010; Hager and Belhumeur, 1998] or even higher-
order methods [Benhimane and Malis, 2007; Keller and Averbuch, 2004, 2008;
Megret et al., 2008]. This is theoretically a good approximation when the dis-
similarity is small [Irani and Anandan, 1999], although the estimation can be
improved by using coarse-to-fine iterative methods [Irani and Anandan, 1999],
or by selecting appropriate pixels [Benhimane et al., 2007]. Although Taylor
series expansion is the usual approach to compute the coefficients of the sys-
tem, other approaches such as linear regression [Cootes et al., 2001; Jurie and
Dhome, 2002a] or numeric differentiation [Gleicher, 1997] may be used.
Compute the descent direction The descent direction is a vector δµ in the
search space such that D(µ+δµ) < D(µ). In a GN-based algorithm, we solve
the linear system of equations of the previous stage using least-squares [Baker
and Matthews, 2004; Madsen et al., 2004]. Note that we do not perform the
line search stage—i.e., we implicitly assume that the step size α = 1, cf.
Appendix A.
Update the search parameters Once we have determined the search direction,
δµ, we compute the next point in the series by using the update function
U : R^P → R^P: µ_1 = U(µ_0, δµ). We compute the dissimilarity value at µ_1 to
check convergence: if the dissimilarity is below a given threshold, then µ_1 is the
minimizer µ^*—i.e., µ^* = µ_1; otherwise, we repeat the whole process (with µ_1
as the new current parameters µ) until we find a suitable minimizer.
3.2.4 Additive vs. Compositional
We turn our attention to the step 4 of Algorithm 1: how to compute the new es-
timation of the optimization parameters. In a GN optimization scheme, the new
Algorithm 1 Outline of the basic GN-based descent method for image
registration
On-line: Let µi = µ0 be the initial guess.
1: while no convergence do
2: Compute the dissimilarity function at D(µi).
3: Compute the search direction: linearize the dissimilarity and compute the
descent direction, δµi.
4: Update the optimization parameters: µi+1 = U(µi, δµi).
5: end while
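The sketch below mirrors the four stages of Algorithm 1 for a toy translational warp: the dissimilarity is linearized numerically (an assumption made only to keep the example generic; the following sections derive analytic Jacobians), the descent direction is the least-squares solution of the linearized system, and the update is additive. The image, region and warp are again synthetic.

```python
import numpy as np
from scipy.ndimage import map_coordinates

def gn_registration(T_vals, I, X, warp, mu0, n_iters=20, eps=1e-3, tol=1e-10):
    """Generic GN-based descent of Algorithm 1 with a finite-difference linearization."""
    mu = np.asarray(mu0, dtype=float)
    def sample(m):
        Xw = warp(X, m)
        return map_coordinates(I, [Xw[1], Xw[0]], order=1)
    for _ in range(n_iters):
        r = T_vals - sample(mu)                    # 1. dissimilarity measure (Eq. 3.3)
        if r @ r < tol:                            # convergence check
            break
        # 2. linearize the dissimilarity: numeric Jacobian of the residuals w.r.t. mu
        J = np.stack([(T_vals - sample(mu + eps * e)) - r for e in np.eye(mu.size)],
                     axis=1) / eps
        # 3. descent direction: least-squares solution of the linearized system
        delta = np.linalg.lstsq(J, -r, rcond=None)[0]
        mu = mu + delta                            # 4. (additive) parameter update
    return mu

# Toy problem: register a smooth template against itself shifted by (3, 2) pixels
ii, jj = np.meshgrid(np.arange(60), np.arange(80), indexing='ij')
T = np.exp(-(((ii - 30) / 12.0) ** 2 + ((jj - 40) / 15.0) ** 2))
I = np.roll(T, shift=(2, 3), axis=(0, 1))
X = np.vstack([jj[20:50, 20:40].ravel(), ii[20:50, 20:40].ravel()]).astype(float)
T_vals = T[20:50, 20:40].ravel()
translate = lambda X, mu: X + mu.reshape(2, 1)

print(gn_registration(T_vals, I, X, translate, mu0=[1.0, 1.0]))   # -> approx. [3, 2]
```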
Figure 3.3: Generic descent method for image registration. We initialize the
current parameter estimation at frame It+1 (µ = µ0) using the local minimizer at the
previous frame It (µ0 = µ∗
t ). We compute the Dissimilarity Measure between the Im-
age and the Template using µ (Equation 3.3). We linearize the dissimilarity measure
to compute the descent direction of the search parameters (δµ). We update the search
parameters using the search direction and we obtain an approximation to the minimum
(µ1). We check if µ1 is a local minimizer by using the brightness dissimilarity: if D is
small enough, then µ1 is the local minimizer (µ∗ = µ1); otherwise, we repeat the
process using µ1 as the current parameter estimate (µ = µ1).
parameters are typically computed by adding the former optimization parameters
to the search direction vector: µt+1 = µt + δµt (cf. Appendix A); this summation
is a direct consequence of the definition of Taylor series [Madsen et al., 2004]. We
refer to methods that update the parameters by addition as additive approaches
[Hager and Belhumeur, 1998; Irani and Anandan, 1999; Lucas and Kanade,
1981]. Nonetheless, Baker and Matthews [Baker and Matthews, 2004] subsequently
proposed a GN-based method that updated the parameters using composition—
i.e., µt+1 = µt ◦ δµt. We call these methods compositional approaches [Baker and
Matthews, 2004; Cobzas et al., 2009; Mu˜noz et al., 2005; Romdhani and Vetter,
2003; Xu and Roy-Chowdhury, 2008].
3.3 Additive approaches
In this section we review some works that use additive update. We introduce the
Lucas-Kanade algorithm, the fundamental work on direct image registration. We
show the basic algorithm as well as the common problems regarding the method. We
also introduce the Hager-Belhumeur approach to image registration and we point
out its highlights.
3.3.1 Lucas-Kanade Algorithm
The Lucas-Kanade (LK) algorithm [Lucas and Kanade, 1981] solves the registration
problem using a GN optimization scheme. The algorithm defines the residuals r of
Equation 3.3 as
r(µ) ≡ T(x) − I(f(x; µ)). (3.4)
The corresponding linear model for these residuals is
$$r(\mu + \delta\mu) \simeq \ell(\delta\mu) \equiv r(\mu) + r'(\mu)\,\delta\mu = r(\mu) + J(\mu)\,\delta\mu, \qquad (3.5)$$
where
$$r(\mu) \equiv T(x) - I(f(x;\mu)), \quad\text{and}\quad J(\mu) \equiv \left.\frac{\partial I(f(x;\hat{\mu}))}{\partial\hat{\mu}}\right|_{\hat{\mu}=\mu}. \qquad (3.6)$$
Hence, our optimization process now amounts to minimising
$$\delta\mu^{*} = \arg\min_{\delta\mu} \{\, \ell(\delta\mu)^{\top}\ell(\delta\mu) \,\} = \arg\min_{\delta\mu} \{\, L(\delta\mu) \,\}. \qquad (3.7)$$
We compute the local minimizer of L(δµ) as follows:
$$0 = L'(\delta\mu) = \nabla_{\delta\mu}\!\left[ r(\mu)^{\top} r(\mu) + 2\,\delta\mu^{\top} J(\mu)^{\top} r(\mu) + \delta\mu^{\top} J(\mu)^{\top} J(\mu)\,\delta\mu \right] = J(\mu)^{\top} r(\mu) + J(\mu)^{\top} J(\mu)\,\delta\mu. \qquad (3.8)$$
Again, we obtain an approximation to the local minimum at
$$\delta\mu = -\left( J(\mu)^{\top} J(\mu) \right)^{-1} J(\mu)^{\top} r(\mu), \qquad (3.9)$$
which we iteratively refine until we find a suitable solution. We summarize the
optimization process in Algorithm 2 and Figure 3.4.
Algorithm 2 Outline of the Lucas-Kanade algorithm.
On-line: Let µ_i = µ_0 be the initial guess.
1: while no convergence do
2:   Compute the residuals r(µ_i) from Equation 3.4.
3:   Linearize the dissimilarity: J(µ_i) = ∇_µ r(µ_i).
4:   Compute the search direction: δµ_i = −(J(µ_i)^⊤ J(µ_i))^{−1} J(µ_i)^⊤ r(µ_i).
5:   Update the optimization parameters: µ_{i+1} = µ_i + δµ_i.
6: end while
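The following sketch shows how the LK Jacobian of Equation 3.6 can be assembled by the chain rule for an arbitrary warp: the image gradients are sampled at the warped pixel positions and multiplied by ∂f/∂µ. The translational warp (for which ∂f/∂µ is simply the identity), the synthetic image and the helper names are assumptions of the example.

```python
import numpy as np
from scipy.ndimage import map_coordinates

def lk_jacobian(I, X, warp, warp_jac, mu):
    """J(mu): each row is grad I(f(x; mu)) * d f(x; mu)/d mu for one pixel x of X."""
    Xw = warp(X, mu)
    gI_row, gI_col = np.gradient(I)                 # image gradients w.r.t. row and column
    # gradient at each warped pixel, ordered as (d/d col, d/d row) to match x = (col, row)
    gI = np.stack([map_coordinates(gI_col, [Xw[1], Xw[0]], order=1),
                   map_coordinates(gI_row, [Xw[1], Xw[0]], order=1)], axis=1)   # N x 2
    dfdmu = warp_jac(X, mu)                                                     # N x 2 x P
    return np.einsum('nk,nkp->np', gI, dfdmu)                                   # N x P

# Translational warp f(x; mu) = x + mu: its Jacobian w.r.t. mu is the 2x2 identity
translate = lambda X, mu: X + np.asarray(mu, float).reshape(2, 1)
translate_jac = lambda X, mu: np.broadcast_to(np.eye(2), (X.shape[1], 2, 2))

ii, jj = np.meshgrid(np.arange(60), np.arange(80), indexing='ij')
I = np.exp(-(((ii - 30) / 12.0) ** 2 + ((jj - 40) / 15.0) ** 2))
X = np.vstack([jj[20:50, 20:40].ravel(), ii[20:50, 20:40].ravel()]).astype(float)

J = lk_jacobian(I, X, translate, translate_jac, mu=[1.0, 1.0])
print(J.shape)    # (600, 2): one row per pixel of the target region
```

Because both the image gradients and ∂f/∂µ are evaluated at the current µ on the incoming frame, this matrix has to be rebuilt at every iteration of every frame; this is exactly the cost discussed below and the one that the HB and IC algorithms avoid.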
Figure 3.4: Lucas-Kanade image registration. We initialize the current parameter
estimation at frame It+1 (µ ≡ µ0) using the local minimizer at the previous frame It
(µ0 ≡ µ∗
t ). We compute the dissimilarity residuals between the Image and the Template
using µ (Equation 3.4). We linearize the residuals at the current parameters µ, and
we compute the descent direction of the search parameters (δµ). We additively update
the search parameters using the search direction and we obtain an approximation to the
minimum—i.e. µ1 = µ0 + δµ. We check if µ1 is a local minimizer by using the brightness
dissimilarity: if D is small enough, then µ1 is the local minimizer (µ∗ ≡ µ1); otherwise,
we repeat the process using µ1 as the current parameter estimate (µ ≡ µ1).
Known Issues
The LK algorithm is one instance of a well-known technique for object tracking [Baker
and Matthews, 2004]. The most remarkable feature of this algorithm is its robust-
ness: given a suitable bcc, the LK algorithm typically ensures good convergence.
However, the algorithm has a series of weaknesses that degrade the overall perfor-
mance of the tracking:
Computational Cost The LK algorithm computes the Jacobian at each iteration
of the optimization loop. Furthermore, the minimization cycle is repeated
between each two consecutive frames of the video sequence. The consequence
is that the Jacobian is computed F × L times, where F is the number of frames
and L is the number of iterations in the optimization loop. The computational
burden of these operations is really high if the Jacobian is large: we have to
compute the derivatives at each point of the target region, and each point
contributes to a row in the Jacobian. As an example, Table 7.15—page 106—
compares the computational complexity of LK algorithm with respect to other
efficient methods.
Local Minima The GN optimization scheme, which is the basis for the LK al-
gorithm, is prone to get trapped in local minima. The very essence of the
minimization implies that the algorithm converges to the closest minimum
to the starting point. So, we must choose the initial guess of the optimiza-
tion very carefully to ensure convergence to the true optimum. The best way
to guarantee that the starting point for tracking and the optimum are close
enough is to impose that the differences between consecutive images are small.
Conversely, image pairs with a large baseline cause problems for LK, since falling
into a local minimum, and hence an incorrect alignment, becomes more likely. To solve
this problem, which is common to all direct approaches, a pyramidal implementation
of the optimization may be used [Bouguet, 2000].
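Below is a small sketch of the coarse-to-fine idea: the images are repeatedly downsampled, the motion is estimated at the coarsest level, and the estimate is propagated (doubled, for a translation) to initialize the next finer level. The single-level routine here is a deliberately crude brute-force integer search, used only so the example is self-contained; in practice it would be the LK loop itself. All values are invented for illustration.

```python
import numpy as np

def downsample(I):
    """Halve resolution by 2x2 block averaging (a crude stand-in for blur + decimation)."""
    h, w = (I.shape[0] // 2) * 2, (I.shape[1] // 2) * 2
    return I[:h, :w].reshape(h // 2, 2, w // 2, 2).mean(axis=(1, 3))

def register_level(T, I, mu0, radius=2):
    """Toy single-level 'registration': brute-force integer translation search around mu0."""
    best, best_err = np.round(np.asarray(mu0, float)), np.inf
    for di in range(-radius, radius + 1):
        for dj in range(-radius, radius + 1):
            d = np.round(np.asarray(mu0, float)) + [dj, di]       # candidate (col, row) shift
            shifted = np.roll(I, shift=(-int(d[1]), -int(d[0])), axis=(0, 1))
            err = np.sum((T - shifted) ** 2)
            if err < best_err:
                best, best_err = d, err
    return best

def pyramidal_register(T, I, mu0=(0.0, 0.0), levels=4):
    """Coarse-to-fine registration: translations double when moving one level finer."""
    pyr = [(T, I)]
    for _ in range(levels - 1):
        pyr.append((downsample(pyr[-1][0]), downsample(pyr[-1][1])))
    mu = np.asarray(mu0, float) / 2 ** (levels - 1)               # guess at the coarsest level
    for lvl in reversed(range(levels)):
        mu = register_level(*pyr[lvl], mu)
        if lvl > 0:
            mu = mu * 2.0                                         # propagate to the finer level
    return mu

# Demo: a smooth template shifted by (16, 8) pixels, far outside the single-level search range
ii, jj = np.meshgrid(np.arange(64), np.arange(96), indexing='ij')
T = np.exp(-(((ii - 32) / 10.0) ** 2 + ((jj - 48) / 14.0) ** 2))
I = np.roll(T, shift=(8, 16), axis=(0, 1))
print(pyramidal_register(T, I))                                   # -> [16.  8.]
```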
3.3.2 Hager-Belhumeur Factorization Algorithm
We now review an efficient algorithm for determining the motion parameters of the
target. The algorithm is similar to LK, but uses a priori information about the tar-
get motion and structure to save computation time. The Hager-Belhumeur (HB) or
factorization algorithm was first proposed by G. Hager and P. Belhumeur in [Hager
and Belhumeur, 1998]. The authors noticed the high computational cost when lin-
earizing the brightness error function in the LK algorithm: the dissimilarity depends
on each different frame of the sequence, It. The method focuses on how to efficiently
compute the Jacobian matrix of step 3 of the LK algorithm (see Algorithm 2). The
computation of the Jacobian in the HB algorithm has two separate stages:
1. Gradient replacement
The key idea is to use the derivatives at the template T instead of computing
the derivatives at frame It when estimating J. Hager and Belhumeur dealt with
this issue in a very neat way: they noticed that, if the bcc (Equation 3.1) relates
image and template brightness values, it may also relate image and template
derivatives—cf. [Hager and Belhumeur, 1998]. The derivatives of both sides of
Equation 3.1 with respect to the target region coordinates are
$$\nabla_{x} T(x) = \nabla_{x} I_{t}(f(x;\mu_t)) = \nabla_{x} I_{t}(x)\, \nabla_{x} f(x;\mu), \qquad x \in X. \qquad (3.10)$$
On the other hand, we compute the Jacobian as
$$J = \nabla_{\mu_t} I_{t}(f(x;\mu_t)) = \nabla_{x} I_{t}(x)\, \nabla_{\mu_t} f(x;\mu). \qquad (3.11)$$
We isolate the term ∇_x I_t(x) in Equations 3.10 and 3.11, and we equate the
remaining terms as follows:
$$J = \nabla_{x} T(x)\, \nabla_{x} f(x;\mu)^{-1}\, \nabla_{\mu_t} f(x;\mu). \qquad (3.12)$$
Notice that in Equation 3.12 the Jacobian depends on the template derivatives,
∇_x T(x), which are constant. Using template derivatives speeds up the whole
process up to 10-fold (cf. Table 7.16, page 106).
2. Factorization
Equation 3.12 reveals the internal structure of the Jacobian: it comprises
the product of three matrices: a matrix ∇xT (x) that depends on template
brightness values and two matrices,∇xf(x; µ)−1
and ∇µt
f(x; µ), whose values
depend on both the target shape coordinates and the motion parameters µt.
The factorization stage re-arranges the Jacobian internal structure such that
we speed up the computation of this matrix product.
A word about factorization In the literature, matrix factorization or ma-
trix decomposition refers to the process that expresses the values of a
matrix as the product of matrices of special types. One major example
is to factorize a matrix A into the product of a lower triangular ma-
trix L and an upper triangular matrix U, A = LU. This factorization
is called lu decomposition and it allows us to solve the linear system
Ax = b more efficiently: solving Ux = L^{−1}b requires fewer additions and
multiplications than the original system [Golub and Van Loan, 1996].
Other famous examples of matrix factorization are spectral decomposi-
tion, Cholesky factorization, Singular Value Decomposition (svd) and
qr factorization (see [Golub and Van Loan, 1996] for more information).
The key concept on using factorization in this problem states as follows:
Given a matrix product whose operands contain both constant and
variable terms, we want to re-arrange the product such that one
operand contains only constant values and the other one only con-
tains variable terms.
We rewrite this idea as follows:
$$J = \nabla_{x} T(x)\, \nabla_{x} f(x;\mu)^{-1}\, \nabla_{\mu_t} f(x;\mu) = S(x)\, M(\mu), \qquad (3.13)$$
where S(x) contains only target coordinate values and M(µ) contains only
motion parameters. The process of decomposing the matrix J into the product
S(x)M(µ) is generally ad hoc: we must gain insight into the analytic structure
of the matrices ∇_x f(x;µ)^{-1} and ∇_{µ_t} f(x;µ) to re-arrange their entries into
S(x)M(µ) [Hager and Belhumeur, 1998]. This process is not obvious at all,
and it has been a frequent source of criticism of the HB algorithm [Baker
and Matthews, 2004]. However, we shall introduce procedures for systematic
factorization in Chapter 5.
We outline the basic HB optimization in Algorithm 3; notice that the only
difference with the LK algorithm lies on the Jacobian computation. We depict the
differences more clearly in Figure 3.5: in the dissimilarity linearization stage we use
the derivatives of the template instead of the frame.
Algorithm 3 Outline of the Hager-Belhumeur algorithm.
Off-line:
1: Compute S(x).
On-line: Let µ_i = µ_0 be the initial guess.
2: while no convergence do
3:   Compute the residuals r(µ_i) from Equation 3.4.
4:   Compute the matrix M(µ_i).
5:   Compute the Jacobian: J(µ_i) = S(x) M(µ_i).
6:   Compute the search direction: δµ_i = −(J(µ_i)^⊤ J(µ_i))^{−1} J(µ_i)^⊤ r(µ_i).
7:   Update the optimization parameters: µ_{i+1} = µ_i + δµ_i.
8: end while
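To make the off-line/on-line split of Algorithm 3 explicit, the sketch below runs the HB loop for the simplest possible warp, a pure translation, for which Equation 3.12 collapses to J = ∇T(x): S(x) holds the template gradients and M(µ) is just the 2 × 2 identity. This trivial factorization is meant only to show the structure of the loop (the Jacobian is never recomputed from the incoming frame); the images, the region and the sign convention of the update are assumptions of the example.

```python
import numpy as np
from scipy.ndimage import map_coordinates

# Off-line stage: template, target region and the constant matrix S(x)
ii, jj = np.meshgrid(np.arange(60), np.arange(80), indexing='ij')
T = np.exp(-(((ii - 30) / 12.0) ** 2 + ((jj - 40) / 15.0) ** 2))
ri, rj = ii[20:50, 20:40].ravel(), jj[20:50, 20:40].ravel()
X = np.vstack([rj, ri]).astype(float)                   # target region, x = (col, row)
T_vals = T[ri, rj]
gT_row, gT_col = np.gradient(T)
S = np.stack([gT_col[ri, rj], gT_row[ri, rj]], axis=1)  # N x 2 template gradients, constant

def hb_translation(I, mu0, n_iters=20):
    """On-line HB loop: the Jacobian is assembled as S(x) M(mu); no image gradients needed."""
    mu = np.asarray(mu0, dtype=float)
    for _ in range(n_iters):
        Xw = X + mu.reshape(2, 1)
        r = T_vals - map_coordinates(I, [Xw[1], Xw[0]], order=1)   # residuals (Eq. 3.4)
        M = np.eye(2)                     # trivial motion-dependent factor for a translation
        J = S @ M
        # GN step; with r = T - I(f) and J = dI/dmu (= grad T here), the correction is
        # +(J^T J)^{-1} J^T r, which we obtain directly by least squares.
        mu = mu + np.linalg.lstsq(J, r, rcond=None)[0]
    return mu

I = np.roll(T, shift=(2, 3), axis=(0, 1))                # frame: template shifted by (3, 2)
print(hb_translation(I, mu0=[1.0, 1.0]))                 # -> approx. [3, 2]
```

For richer warps the factorization S(x)M(µ) is far from trivial; Chapter 5 returns to this point.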
3.4 Compositional approaches
From Section 3.2.4 we recall the definition of compositional method: a GN-like
optimization method that updates the search parameters using function composition.
We review two compositional algorithms: the Forward Compositional (FC) and the
Inverse Compositional (IC) [Baker and Matthews, 2004].
A word about composition Function composition applies one function to the
result of another. Let f : X → Y and g : Y → Z be two functions. We define
the composite function g ◦ f : X → Z as (g ◦ f)(x) = g(f(x)). In the literature
on image registration the problem is posed as follows: let f : R^2 → R^2 be the
target motion model parameterized by µ. We compose the target motion as
Figure 3.5: Hager-Belhumeur image registration. We initialize the current param-
eter estimation at frame It+1 (µ ≡ µ0) using the local minimizer at the previous frame
It (µ0 ≡ µ∗
t ). We additionally create the matrix S(x) whose entries depend on the target
values. We compute the dissimilarity residuals between the Image and the Template using
µ (Equation 3.4). Instead of linearizing the residuals, we compute the Jacobian matrix
at µ using Equation 3.12, and we solve for the descent direction using Equation 3.9. We
additively update the search parameters using the search direction and we obtain an ap-
proximation to the minimum— i.e. µ1 = µ0 + δµ. We check if µ1 is a local minimizer
by using the brightness dissimilarity: if D is small enough, then µ1 is the local minimizer
(µ∗ ≡ µ1); otherwise, we repeat the process using µ1 as the current parameter
estimate (µ ≡ µ1).
z = f(f(x; µ1); µ2) = f(x; µ1 ◦ µ2) ≡ f(x; µ3), that is, the coordinates z
are the result of mapping x onto y = f(x; µ1) and y onto z = f(y; µ2). We
represent the composite parameters as µ3 = µ1 ◦ µ2 such that z = f(x; µ3).
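For warps that can be written as a matrix acting on homogeneous coordinates (e.g. homographies), composing parameters is simply multiplying matrices: applying µ1 and then µ2 corresponds to the matrix product H2 H1. The following numeric check, with invented matrices, illustrates z = f(f(x; µ1); µ2) = f(x; µ1 ◦ µ2).

```python
import numpy as np

def warp(x, H):
    """Apply a 3x3 projective warp to 2xN Cartesian points: lift, multiply, project back."""
    x_tilde = np.vstack([x, np.ones(x.shape[1])])
    y_tilde = H @ x_tilde
    return y_tilde[:2] / y_tilde[2]

H1 = np.array([[1.0, 0.1, 5.0], [0.0, 1.0, -3.0], [0.0, 0.0, 1.0]])    # an affine-like warp
H2 = np.array([[0.9, 0.0, 2.0], [0.1, 1.1, 1.0], [1e-4, 0.0, 1.0]])    # with a perspective term
x = np.array([[10.0, 50.0, 120.0], [20.0, 80.0, 40.0]])

lhs = warp(warp(x, H1), H2)       # f(f(x; mu1); mu2)
rhs = warp(x, H2 @ H1)            # f(x; mu1 o mu2): apply H1 first, then H2
print(np.allclose(lhs, rhs))      # True
```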
3.4.1 Forward Compositional Algorithm
The FC algorithm was first proposed in [Shum and Szeliski, 2000], although the
terminology was introduced in [Baker and Matthews, 2001]: FC is an optimization
algorithm, equivalent to the LK approach, that relies on a compositional update
step. Compositional algorithms for image registration use a brightness dissimilarity
function slightly different from Equation 3.3; we pose the image registration problem
as the following optimization:
$$\mu^{*} = \arg\min_{\mu}\, \{\, D(X;\mu)^{2} \,\}, \qquad (3.14)$$
with
$$D(X;\mu) = T(X) - I_{t+1}(f(f(X;\mu);\mu_{t})), \qquad (3.15)$$
where µ_t comprises the optimal parameters at the image I_t. Note that our search
variables µ are those parameters that should be composed with the current estimation
to yield the minimum. The residuals corresponding to Equation 3.15 are
$$r(\mu) \equiv T(x) - I_{t+1}(f(f(x;\mu);\mu_{t})). \qquad (3.16)$$
As in the LK algorithm, we compute the linear model of the residuals, but now at
the point µ = 0 in the search space:
$$r(0 + \delta\mu) \simeq \ell(\delta\mu) \equiv r(0) + r'(0)\,\delta\mu = r(0) + J(0)\,\delta\mu, \qquad (3.17)$$
where
$$r(0) \equiv T(x) - I_{t+1}(f(f(x;0);\mu_{t})), \quad\text{and}\quad J(0) \equiv \left.\frac{\partial I_{t+1}(f(f(x;\hat{\mu});\mu_{t}))}{\partial\hat{\mu}}\right|_{\hat{\mu}=0}. \qquad (3.18)$$
Notice that, in this case, µ_t acts as a constant in the derivative. Again, the local
minimizer is
$$\delta\mu = -\left( J(0)^{\top} J(0) \right)^{-1} J(0)^{\top} r(0). \qquad (3.19)$$
We iterate the above procedure until convergence. The next point in the iterative
series is not computed as µ_{t+1} = µ_t + δµ, but as µ_{t+1} = µ_t ◦ δµ, to be coherent with
Equation 3.16. Also notice that the Jacobian J(0) (Equation 3.18) is not constant,
as it depends on both the image I_{t+1} and the parameters µ_t. Figure 3.6 shows a
graphical depiction of the algorithm that is outlined in Algorithm 4.
Algorithm 4 Outline of the Forward Compositional algorithm.
On-line: Let µ_i = µ_0 be the initial guess.
1: while no convergence do
2:   Compute the residuals r(µ_i) from Equation 3.16.
3:   Linearize the dissimilarity: J = ∇_µ̂ r(0), using Equation 3.18.
4:   Compute the search direction: δµ_i = −(J(0)^⊤ J(0))^{−1} J(0)^⊤ r(0).
5:   Update the optimization parameters: µ_{i+1} = µ_i ◦ δµ_i.
6: end while
Figure 3.6: Forward compositional image registration. We initialize the current
parameter estimation at frame It+1 (µ ≡ µ0) using the local minimizer at the previous
frame It (µ0 ≡ µ∗
t ). We compute the dissimilarity residuals between the Image and
the Template using µ (Equation 3.15). We linearize the residuals at µ = 0 and we
compute the descent direction δµ using Equation 3.19. We update the parameters using
function composition— i.e. µ1 = µ0 ◦ δµ. We check if µ1 is a local minimizer by using
the brightness dissimilarity: if D (Equation 3.15) is small enough, then µ1 is the local
minimizer (µ∗ ≡ µ1); otherwise, we repeat the process using µ1 as the current
parameter estimate (µ ≡ µ1).
3.4.2 Inverse Compositional Algorithm
The IC algorithm reinterprets the FC optimization scheme by reversing the roles
of the template and the image—hence its name. The key feature of IC is that its
GN Jacobian is constant, because it is computed using only template brightness
values. Using a constant Jacobian speeds up the whole computation, as the
linearization stage is the most time-consuming one. We rewrite the residual function
from FC (Equation 3.16) as follows:
$$r(\mu) \equiv T(f(x;\mu)) - I_{t+1}(f(x;\mu_{t})), \qquad (3.20)$$
yielding the residuals for IC. Notice that the template brightness values now depend
on the search parameters µ. We linearize Equation 3.20 around the point µ = 0
in the search space:
$$r(0 + \delta\mu) \simeq \ell(\delta\mu) \equiv r(0) + r'(0)\,\delta\mu = r(0) + J(0)\,\delta\mu, \qquad (3.21)$$
where
$$r(0) \equiv T(f(x;0)) - I_{t+1}(f(x;\mu_{t})), \quad\text{and}\quad J(0) \equiv \left.\frac{\partial T(f(x;\hat{\mu}))}{\partial\hat{\mu}}\right|_{\hat{\mu}=0}. \qquad (3.22)$$
We compute the local minimizer of Equation 3.7 by differentiating it with respect to δµ and
equating to zero,
$$0 = L'(\delta\mu) = \nabla_{\delta\mu}\!\left[ r(0)^{\top} r(0) + 2\,\delta\mu^{\top} J(0)^{\top} r(0) + \delta\mu^{\top} J(0)^{\top} J(0)\,\delta\mu \right] = J(0)^{\top} r(0) + J(0)^{\top} J(0)\,\delta\mu. \qquad (3.23)$$
Again, we obtain an approximation to the local minimum at
$$\delta\mu = -\left( J(0)^{\top} J(0) \right)^{-1} J(0)^{\top} r(0), \qquad (3.24)$$
which we iteratively refine until we find a suitable solution. We summarize the
optimization process in Algorithm 5 and Figure 3.7.
Note that the Jacobian matrix J(0) is constant, as it is computed on the template
image—which is fixed—at the point µ = 0 (cf. Equation 3.22). Notice that the
crucial point of the derivation of the algorithm lies in the change of variables in
Equation 3.20. Solving for the search direction consists only of computing the
IC residuals and the least-squares approximation (Equation 3.24). The
Dissimilarity Linearization stage of Algorithm 1 is no longer required, which
boosts the performance of the algorithm.
Algorithm 5 Outline of the Inverse Compositional algorithm.
Off-line: Compute J(0) = ∇_µ r(0) using Equation 3.22.
On-line: Let µ_i = µ_0 be the initial guess.
1: while no convergence do
2:   Compute the residuals r(µ_i) from Equation 3.20.
3:   Compute the search direction: δµ_i = −(J(0)^⊤ J(0))^{−1} J(0)^⊤ r(0).
4:   Update the optimization parameters: µ_{i+1} = µ_i ◦ δµ_i^{−1}.
5: end while
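The sketch below shows why IC is cheap at run time: for a translational warp, J(0) of Equation 3.22 reduces to the template gradients, and both J(0) and (J(0)^⊤ J(0))^{−1} are precomputed once. The on-line loop of Algorithm 5 then only evaluates residuals, applies Equation 3.24, and performs the inverse-compositional update (a subtraction, for a translation). The images and the region are synthetic assumptions.

```python
import numpy as np
from scipy.ndimage import map_coordinates

# Off-line: template, target region and the constant Jacobian J(0) (Equation 3.22)
ii, jj = np.meshgrid(np.arange(60), np.arange(80), indexing='ij')
T = np.exp(-(((ii - 30) / 12.0) ** 2 + ((jj - 40) / 15.0) ** 2))
ri, rj = ii[20:50, 20:40].ravel(), jj[20:50, 20:40].ravel()
X = np.vstack([rj, ri]).astype(float)
T_vals = T[ri, rj]
gT_row, gT_col = np.gradient(T)
J0 = np.stack([gT_col[ri, rj], gT_row[ri, rj]], axis=1)    # dT(f(x; mu))/dmu at mu = 0
H_inv = np.linalg.inv(J0.T @ J0)                           # precomputed GN "Hessian" inverse

def ic_translation(I, mu0, n_iters=20):
    """On-line IC loop: constant Jacobian, inverse-compositional parameter update."""
    mu = np.asarray(mu0, dtype=float)
    for _ in range(n_iters):
        Xw = X + mu.reshape(2, 1)
        r = T_vals - map_coordinates(I, [Xw[1], Xw[0]], order=1)   # residuals (Eq. 3.20)
        delta = -H_inv @ (J0.T @ r)                                # Eq. 3.24
        mu = mu - delta       # mu o delta^{-1}: for a translation, composing is adding
    return mu

I = np.roll(T, shift=(2, 3), axis=(0, 1))
print(ic_translation(I, mu0=[0.5, 0.5]))    # -> approx. [3, 2]
```

Compared with the LK and HB sketches above, nothing gradient-related is computed inside the loop.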
Figure 3.7: Inverse compositional image registration. We initialize the current
parameter estimation at frame It+1 (µ ≡ µ0) using the local minimizer at the previous
frame It (µ0 ≡ µ∗
t ). At this point we compute the Jacobian J(0) using Equation 3.22.
We compute the dissimilarity residuals between the Image and the Template using µ
(Equation 3.15). Using J(0) we compute the descent direction δµ (Equation 3.24). We
update the parameters using inverse function composition—i.e. µ1 = µ0 ◦ δµ^{−1}. We
check if µ1 is a local minimizer by using the brightness dissimilarity: if D (Equation 3.15)
is small enough, then µ1 is the local minimizer (µ∗ ≡ µ1); otherwise, we repeat the
process using µ1 as the current parameter estimate (µ ≡ µ1).
Relevance of IC
The IC algorithm is known to be the most efficient optimization technique for direct
image registration [Baker and Matthews, 2004]. The algorithm was initially pro-
posed for template tracking, although it was later extended to aams [Matthews
and Baker, 2004], register 3D Morphable Models [Romdhani and Vetter, 2003; Xu
and Roy-Chowdhury, 2008], account for photometric changes [Bartoli, 2008] and
allow for appearance variation [Gonzalez-Mora et al., 2009].
Some efficient algorithms using a constant residual Jacobian with additive in-
crements have been proposed in the literature, but none shows reliable performance:
in [Cootes et al., 2001] an iterative regression-based gradient scheme is proposed to
align aams to frontal images of faces. The regression matrix (similar to our Jaco-
bian matrix) is numerically computed off-line and it remains constant during the
Gauss-Newton optimisation. The method shows good performance because the so-
lution does not depart far from the initial guess. The method is revisited in [Donner
et al., 2006] using Canonical Correlation Analysis instead of numerical differentia-
tion to achieve better convergence rate and range. In [La Cascia et al., 2000] the
authors propose a Gauss-Newton scheme with constant Jacobian matrix for 6-dof
3D tracking of heads. The method needs regularisation constraints to improve the
convergence of the optimisation.
Recently, [Brooks and Arbel, 2010] augmented the scope of the IC framework
with the Generalized Inverse Compositional (GIC) image registration: they propose
an additive update to the parameters that is equivalent to the compositional update
from IC; therefore, they can adapt the IC to other optimization methods than GN,
such as Broyden-Fletcher-Goldfarb-Shanno (bfgs) [Press et al., 1992].
3.5 Other Methods
Iterative gradient-based optimization algorithms (see Figure 3.4) can improve their
efficiency in two different ways: (1) by speeding up the linearization of the dissim-
ilarity function, and (2) by reducing the number of iterations of the process. The
algorithms that we have presented—i.e. HB and IC—belong to the first type. The
second type of methods achieve efficiency by using a more involved linearization
that converges faster to the minimum. [Averbuch and Keller, 2002] approximates
the error function in both the template and the current image and averages the
least-squares solutions of both. They show it converges in fewer iterations than
LK, although the time per iteration is higher. Malis et al. [Benhimane and Malis,
2007] propose a similar method called Efficient Second-Order Minimization (esm)
which differs from the latter in using an efficient linearization on the template by
means of Lie algebra properties. Recently, both methods have been revisited and
reformulated in a common Bi-directional Framework in [Megret et al., 2008]. [Keller
and Averbuch, 2008] derives a high-order approximation to the error function that
leads to a faster algorithm with a wider convergence basin. Unfortunately—with
the exception of esm—none of these algorithms is appropriate for real-time image
registration.
3.6 Summary
We have introduced the basic concepts on direct image registration. We pose the reg-
istration problem as the result of gradient-descent optimizing a dissimilarity function
based on brightness differences. We classify the direct image registration algorithms
as either additive or compositional: in the former group we highlight the LK and the
HB algorithms, whereas the FC and IC algorithms belong to the latter.
Chapter 4
Equivalence of Gradients
In this chapter we introduce the concept of Equivalence of Gradients, that is, the
process of replacing the gradient of a brightness function by an equivalent alterna-
tive. In Chapter 3 we have shown that some efficient algorithms for direct image
registration use a gradient replacement technique as a basis for their speed improve-
ment: (1) HB algorithm transforms the template derivatives using the target warp to
yield the image derivatives; and (2) IC algorithm replaces the image derivatives by
the template derivatives without any modification, but they change the parameters
update rule so the GN-like optimization converges. We introduce a new constraint,
the Gradient Equivalence Equation, and we show that this constraint is a necessary
requirement for the high computational efficiency of both HB and IC algorithms.
We organize the chapter as follows: Section 4.1 introduces the basic concepts
on image gradients in R^2 and their extension to higher-dimensional spaces such as P^2
and R^3; Section 4.2 introduces the Gradient Equivalence Equation, which shall be
subsequently used to impose some requirements on the registration algorithms.
4.1 Image Gradients
We introduce the concept of gradient of a scalar function below. We consider images
as functions in two dimensions that assign a brightness value to an image pixel
position.
The Concept of Gradient The gradient of a scalar function f : R^n → R at a
point x ∈ R^n is a vector ∇f(x) ∈ R^n that points in the direction of greatest
rate of increase of f(x). The length of the gradient vector |∇f(x)| is the greatest
rate of change of the function.
Image Gradients Grayscale images are discrete scalar functions I : R^2 → R
ranging from 0 (black) to 255 (white)—see Figure 4.1. We turn our attention to
grayscale images, but we may deal with colour-channelled images (e.g. rgb images)
by simply considering them as one grayscale image per colour plane. Grayscale
images are discrete functions: we represent an image as a matrix whose elements
I(i, j) are the brightness function values. We continuously approximate the discrete
function by using interpolation (see Figure 4.1).
We introduce the image gradients in the most common domains in Computer
Vision—R^2, P^2, and R^3. Image gradients are naturally defined in R^2, since
images are functions defined on that domain. In some Computer Vision applications
the domain of x, D, is not constrained to R^2, but is P^2 [Buenaposada and Baumela,
2002; Cobzas et al., 2009] or R^3 [Sepp, 2006; Xu and Roy-Chowdhury, 2008]. In
the following, the target coordinates are expressed in a domain D ∈ {R^3, P^2}, so
we need a projection function to map the target coordinates onto the image. We
generically define the projection mapping as p : D → R^2.
The corresponding projectors are the homogeneous-to-Cartesian mapping, p :
P^2 → R^2, and the perspective projection, p : R^3 → R^2. Image gradients in domains
other than R^2 are computed by using the chain rule with the projector p : R^n → R^2:
$$\nabla_{\hat{x}}(I \circ p)(x) = \nabla_{\hat{x}} I(p(x)) = \left.\frac{\partial I(\hat{X})}{\partial \hat{X}}\right|_{\hat{X}=p(x)} \left.\frac{\partial p(\hat{Y})}{\partial \hat{Y}}\right|_{\hat{Y}=x} = \nabla_{\hat{p}(x)} I(p(x))\, \nabla_{\hat{x}} p(x), \qquad x \in D \subset R^{n}. \qquad (4.1)$$
Equation 4.1 represents image gradients in the domain D as the image gradient in
R^2 lifted up onto the higher-dimensional space D by means of the Jacobian matrix
∇_x̂ p(x).
Notation We use operator [ ] to denote the composite function I ◦ p, that is,
I(p(x)) = I[x].
4.1.1 Image Gradients in R^2
If the target and its kinematics are expressed in R^2, there is no need to use a
projector, as the target and the image share a common reference frame. The
gradient of a grayscale image at the point x = (i, j)^⊤ is the vector
$$\nabla_{\hat{x}} I(x) = (\nabla_{i} I(x), \nabla_{j} I(x)) = \left( \frac{\partial I(x)}{\partial i}, \frac{\partial I(x)}{\partial j} \right), \qquad (4.2)$$
which flows from the darker areas of the image to the brighter ones (see Figure 4.1).
Moreover, the direction of the gradient vector at a point x ∈ R^2 is orthogonal to the
level set of the brightness function at that point (see Figure 4.1).
Figure 4.1: Depiction of Image Gradients. (Top-left) An image is a rectangular
array where each element is a brightness value. (Top-right) Continuous representation
of the image brightness values; we compute the values from the discrete array by interpo-
lation. (Bottom-left) Image gradients are vectors from each image array element in the
direction of maximum increase of brightness (compare to the top-right image). (Bottom-
right) Gradient vectors are orthogonal to the brightness function contour curves. Legend:
blue Gradient vectors. different colours Contour curves.
4.1.2 Image Gradients in P^2
Projective warps map the projective plane onto itself, f : P^2 → P^2. We represent
points in P^2 using homogeneous coordinates (see Section 3.2.1).
We compute the image derivatives on the projective plane by using derivatives
on homogeneous coordinates. We compute the gradient of the composite function
I ◦ p at the point x̃ = (x, y, w)^⊤ ∈ P^2 using the chain rule:
$$\nabla_{\tilde{x}} I[\tilde{x}] = \left.\frac{\partial I(p(\hat{x}))}{\partial \hat{x}}\right|_{\hat{x}=\tilde{x}} = \left.\frac{\partial I(\hat{p})}{\partial \hat{p}}\right|_{\hat{p}=p(\tilde{x})} \left.\frac{\partial p(\hat{x})}{\partial \hat{x}}\right|_{\hat{x}=\tilde{x}} = \begin{pmatrix}\nabla_{i} I & \nabla_{j} I\end{pmatrix} \begin{pmatrix}\frac{1}{w} & 0 & -\frac{x}{w^{2}}\\[2pt] 0 & \frac{1}{w} & -\frac{y}{w^{2}}\end{pmatrix} = \begin{pmatrix}\frac{1}{w}\nabla_{i} I & \frac{1}{w}\nabla_{j} I & -\frac{1}{w^{2}}(x\nabla_{i} I + y\nabla_{j} I)\end{pmatrix}. \qquad (4.3)$$
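The following sketch evaluates Equation 4.3 numerically: it lifts the R^2 image gradient of a synthetic image to P^2 with the Jacobian of the homogeneous-to-Cartesian mapping, compares the result with the closed form of Equation 4.3, and checks the incidence relation l^⊤ x̃ = 0 stated by Proposition 1 below. The image, the test point and the nearest-pixel gradient lookup are simplifying assumptions of this example.

```python
import numpy as np

# A smooth synthetic image and its pixel-domain gradients (first axis = i, second = j)
ii, jj = np.meshgrid(np.arange(60), np.arange(80), indexing='ij')
I = np.exp(-(((ii - 30) / 12.0) ** 2 + ((jj - 40) / 15.0) ** 2))
gIi, gIj = np.gradient(I)

def grad_P2(x_tilde):
    """Lift the R^2 gradient to P^2 via the chain rule (Eq. 4.1) with the Jacobian of p."""
    x, y, w = x_tilde
    i, j = int(round(x / w)), int(round(y / w))          # p(x~), assumed to equal (i, j)
    dp = np.array([[1 / w, 0.0, -x / w ** 2],
                   [0.0, 1 / w, -y / w ** 2]])           # Jacobian of p at x~
    return np.array([gIi[i, j], gIj[i, j]]) @ dp

x_tilde = np.array([60.0, 90.0, 2.0])                    # the pixel (30, 45) written with w = 2
l = grad_P2(x_tilde)

# Closed form of Equation 4.3 and the incidence relation of Proposition 1
i, j, w = 30, 45, 2.0
closed = np.array([gIi[i, j] / w, gIj[i, j] / w,
                   -(60.0 * gIi[i, j] + 90.0 * gIj[i, j]) / w ** 2])
print(np.allclose(l, closed), np.isclose(l @ x_tilde, 0.0))   # True True
```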
Geometric Interpretation The image brightness gradient in P^2 has a geometric
interpretation. The following proposition defines the geometric locus of the image
gradient in P^2.
Proposition 1. The image gradient of x̃ ∈ P^2 is a projective line l incident to the
point—i.e. l^⊤ x̃ = 0.
Proof. The projective line l = ( (1/w)∇_i I, (1/w)∇_j I, −(1/w^2)(x∇_i I + y∇_j I) ) is incident
to the projective point x̃ = (x, y, w)^⊤ since:
$$l^{\top}\tilde{x} = \frac{x}{w}\nabla_{i} I + \frac{y}{w}\nabla_{j} I - \frac{w}{w^{2}}\left( x\nabla_{i} I + y\nabla_{j} I \right) = 0. \qquad (4.4)$$
Figure 4.2 depicts the image gradients in P^2 for image T. This may also be derived
by using Euler's homogeneous function theorem¹: I[x] is a homogeneous function of
degree 0 in x ∈ P^2,
$$I[\lambda x] = I[x] = \lambda^{n} I[x],$$
with λ ≠ 0 and n = 0. Then, by Euler's theorem we have
$$\frac{\partial I[\lambda x]}{\partial \lambda} = 0 = n \cdot I[x] = x\frac{\partial I[x]}{\partial x} + y\frac{\partial I[x]}{\partial y} + w\frac{\partial I[x]}{\partial w}.$$
¹ Let f : R^{n}_{+} → R be a continuously differentiable homogeneous function of degree n (i.e.
f(λx) = λ^{n} f(x)); then nf(x) = Σ_i x_i ∂f/∂x_i.
Figure 4.2: Image Gradient in P^2. (Left) Coordinates in P^2 are equal up to scale, that
is, x ∼ λx ∼ λ′x. Incidence is also preserved up to scale—i.e., l^⊤x = l^⊤(λx) = l^⊤(λ′x) = 0.
(Right) The image gradient in P^2, l^⊤, is tangent to the contour of the brightness function.
Notice that the normal to l^⊤ is parallel to the image gradient in R^2, ∇_x̂ I(x).
Corollary 1. The director vector of the image gradient at ˜x ∈ P2 is orthogonal to
the image brightness contour at p(x).
Proof. The image gradient at p(x), ∇ ˆp(x)I(p(x)), is tangent to the contour of the
brightness function at that point. Thus, its director vector, which is 1
w
∇iI 1
w
∇jI ,
is orthogonal to the brightness function contour curve—see Figure 4.2.
4.1.3 Image Gradients in R^3
Often, the target is not defined in the whole of R^3, but on a manifold in R^3—e.g., a
surface embedded in 3D space. Images defined on the whole R^3 space provide 3D
volumetric data instead of two-dimensional information [Baker et al., 2004b].
We assume that the target is defined on a manifold D ⊂ R^3, and that the warp
f : R^3 → R^3 defines the motion of the manifold D. In this case P : R^3 → R^2 is a
projector such that (x, y, z)^⊤ → (f x/z, f y/z)^⊤, with f the camera focal length.
We compute the image derivatives by using the chain rule on the function I ◦ P
at the point x = (x, y, z)^⊤ ∈ R^3:
$$\nabla_{\hat{x}} I[x] = \left.\frac{\partial I(P(\hat{x}))}{\partial \hat{x}}\right|_{\hat{x}=x} = \left.\frac{\partial I(\hat{p})}{\partial \hat{p}}\right|_{\hat{p}=P(x)} \left.\frac{\partial P(\hat{x})}{\partial \hat{x}}\right|_{\hat{x}=x} = \begin{pmatrix}\nabla_{i} I & \nabla_{j} I\end{pmatrix} \begin{pmatrix}\frac{f}{z} & 0 & -\frac{f x}{z^{2}}\\[2pt] 0 & \frac{f}{z} & -\frac{f y}{z^{2}}\end{pmatrix} = \begin{pmatrix}\frac{f}{z}\nabla_{i} I & \frac{f}{z}\nabla_{j} I & -\frac{f}{z^{2}}(x\nabla_{i} I + y\nabla_{j} I)\end{pmatrix}. \qquad (4.5)$$
Geometric Interpretation As in projective coordinates, image gradients in R3
can be geometrically represented. Sepp [Sepp, 2008] introduces the following propo-
sition:
Proposition 2. The image gradient of a point x ∈ R^3 is a plane through x and the
origin.
Proof. A plane through a point x ∈ R^3 and the origin o is a 3-element vector π
such that π^⊤ x = 0. This is true in our case, as:
$$\pi^{\top} x = \frac{x}{z}\nabla_{i} I + \frac{y}{z}\nabla_{j} I - \frac{z}{z^{2}}\left( x\nabla_{i} I + y\nabla_{j} I \right) = 0, \qquad (4.6)$$
with π = ( (1/z)∇_i I, (1/z)∇_j I, −(1/z^2)(x∇_i I + y∇_j I) )^⊤.
Thus, the point x belongs to the plane, as π^⊤ x = 0. Besides, o = 0 trivially belongs
to the plane, as π does not have an independent term (i.e. π^⊤ 0 = 0). We show the
geometry of the image gradient in R^3 in Figure 4.3. As in P^2, this proposition
is an immediate result of the fact that I[x] is a homogeneous function of degree
0 in x ∈ R^3.
We also infer the following two corollaries from Proposition 2:
Corollary 2. The image gradient of a point x ∈ R^3 is a plane π through the origin
that also contains the projection of the point, P(x).
Proof. Let x̃ = (f x/z, f y/z, f)^⊤ be the projection of x onto the image plane by means
of P. The point x̃ belongs to the plane π defined by the target image gradient and
the origin, since π^⊤ x̃ = 0:
$$\pi^{\top}\tilde{x} = f\frac{x}{z^{2}}\nabla_{i} I + f\frac{y}{z^{2}}\nabla_{j} I - \frac{f}{z^{2}}\left( x\nabla_{i} I + y\nabla_{j} I \right) = 0. \qquad (4.7)$$
Figure 4.3: Image gradient in R3. (Left) A 3D model is imaged onto the camera. The
gradient on the image of the target point x is a plane that contains both x and the origin
(the camera centre). (Right) Close-up of the image at x. Plane π contains ∇ˆxI(x)⊥,
the tangent to the brightness function. Thus, the plane vector π is orthogonal to the
brightness function.
This corollary could be immediately proved by noticing that the projection of the
point x onto the image belongs to the line through o and x by definition of perspec-
tive projection, and this line is contained in the plane π (see Figure 4.3).
Corollary 3. The image gradient of a point x ∈ R3, π, is a plane going through the
origin whose director vector is orthogonal to the brightness contour at the image point
P(x).
Proof. The image gradient at p(x) is tangent to the contour of the brightness func-
tion at that point; the normal vector is thus orthogonal to the brightness contour
curve.
Note that P^2 can be interpreted as a Euclidean space where points are lines through
the origin and lines are planes through the origin [Hartley and Zisserman, 2004].
Thus, the results for P^2 and R^3 are equivalent.
4.2 The Gradient Equivalence Equation
We recall the brightness constancy constraint bcc from Equation 3.1:
T (X) = It(f(X; µt)).
The Gradient Equivalence Equation (GEE) relates the derivatives of the brightness
function between two frames of the sequence:
∇ˆxT (X) = ∇ˆxIt(f(X; µt)). (4.8)
We define the GEE to be the differential counterpart of the bcc: the bcc relates
image brightness values between two images whereas the GEE relates image gradi-
ents (see Figure 4.4). Note that image brightness are scalars (in grayscale images)
but image gradients are vectors
4.2.1 Relevance of the Gradient Equivalence Equation
We verify whether we can substitute image derivatives by their template derivatives
by using the gradient equivalence equation: that is, if Equation 4.8 holds then we
can swap gradients. We shall show in Chapters 5 and 6 that swapping gradients is
the cornerstone to speed improvement for both HB and IC algorithms:
• Additive algorithms such as HB rewrite the image gradient as a function of
the template gradient, increasing the speed of the optimization.
• Compositional algorithms such as IC rely on a compositional formulation of
the bcc to directly use the template gradient.
We shall show that if the GEE does not hold, the convergence of the algorithms
worsens when using gradient swapping. The foundations for this statement are sim-
ple: the Jacobian of a GN-like optimization method—e.g. HB or IC—depends on
the image derivatives. The Jacobian establishes the search direction towards the
minimum in the optimization space. When we build the Jacobian from template
derivatives—to gain efficiency—we must guarantee that the resulting Jacobian is
equivalent to the one computed from image derivatives. If the GEE (Equation 4.8)
holds, then the Jacobian matrices computed from both template and image deriva-
tives are equivalent. If the Jacobians are not equivalent, the iterative search direc-
tions may not converge to the optimum (see Figure 4.5). Hence, the GEE is directly
related to the algorithm convergence.
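The claim above can be checked numerically for a 2D warp: if we build a frame so that the bcc holds exactly, the template gradients and the chain-rule combination of the (warped) image gradients agree up to discretization error. The rotation-plus-translation warp, the synthetic Gaussian image and the use of finite differences are assumptions of this sketch.

```python
import numpy as np
from scipy.ndimage import map_coordinates

theta, c, t = np.deg2rad(10.0), np.array([60.0, 60.0]), np.array([3.0, -2.0])
R = np.array([[np.cos(theta), -np.sin(theta)], [np.sin(theta), np.cos(theta)]])
f     = lambda x: R @ (x - c[:, None]) + (c + t)[:, None]       # warp of 2xN points
f_inv = lambda u: R.T @ (u - (c + t)[:, None]) + c[:, None]
g     = lambda x: np.exp(-((x - c[:, None]) ** 2).sum(axis=0) / (2 * 20.0 ** 2))

# Template and frame built so that the BCC holds exactly: T(x) = I(f(x)) = g(x)
u0, u1 = np.meshgrid(np.arange(120.0), np.arange(120.0), indexing='ij')
grid = np.vstack([u0.ravel(), u1.ravel()])
T = g(grid).reshape(120, 120)
I = g(f_inv(grid)).reshape(120, 120)

# GEE check on a target region: grad T(x)  vs  grad I(f(x)) * grad_x f(x)
gT0, gT1 = np.gradient(T)
gI0, gI1 = np.gradient(I)
r0, r1 = np.meshgrid(np.arange(40, 80), np.arange(40, 80), indexing='ij')
X = np.vstack([r0.ravel(), r1.ravel()]).astype(float)
Xw = f(X)
lhs = np.stack([gT0[r0.ravel(), r1.ravel()], gT1[r0.ravel(), r1.ravel()]], axis=1)
gIf = np.stack([map_coordinates(gI0, Xw, order=1),
                map_coordinates(gI1, Xw, order=1)], axis=1)
rhs = gIf @ R                                   # grad_x f(x) = R for this rigid 2D warp
print(np.abs(lhs - rhs).max())                  # small: the GEE holds for this 2D warp
```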
4.2.2 General Approach to Gradient Replacement
Once we have acknowledged the importance of the GEE, one question remains still
open: how do we verify that GEE holds? Gradient equivalence (Equation 4.8)
implies that the image gradient vectors in both T and It are equal for each point
x ∈ X. Recall that two vectors are equal if both their directions and lengths match.
From basic calculus we recall the following lemma:
Lemma 3. Let f and g be two functions f, g : D → R that coincide in an open set
Ω ∋ x0. Then
$$\left.\frac{\partial f(\hat{x})}{\partial \hat{x}}\right|_{\hat{x}=x_{0}} = \left.\frac{\partial g(\hat{x})}{\partial \hat{x}}\right|_{\hat{x}=x_{0}}.$$
Figure 4.4: Comparison between BCC and GEE. (Top-row) The image on the left
is rotated to generate the image on the right. Their bcc states that the image values are
equal for the target regions of both images despite their orientation. (Middle-row) Gra-
dients corresponding to both images; from left to right: ∇iT (x), ∇jT (x), ∇iIt(f(x; µt)),
and ∇jIt(f(x; µt)). Notice that the relative values of ∇iT (x) and ∇iIt(f(x; µt)) are equal
despite their orientation—ditto for ∇jT (x) and ∇jIt(f(x; µt)). (Bottom-row) The GEE
states that the gradient vectors of both images are coherent up to the warp that transforms
one image into the other: if a point of an image undergoes a rotation, its gradient vector is
rotated by the same amount.
The proof of this lemma is immediate from the definition of open set and the
equality of f and g in Ω—see Figure 4.5.
Corollary 4. Let T [x] be the reference texture image, It[f(x; µt)] the input image at
time t warped by f, and x a pixel location. If the bcc holds—i.e. T [x] = It[f(x; µt)]—
in an open set Ω, then the GEE holds,
∇xT [x] = ∇xIt[f(x; µt)], ∀x ∈ Ω.
The proof is immediate from Lemma 3.
We may assume that both T [x] and It[x; µt] are continuous functions of x by
using interpolation from neighbouring pixel locations. Then, the bcc holds in an
open set Ω ⊂ R2
or Ω ⊂ P2
, except maybe for those pixels at image borders or
occlusion boundaries. The GEE consequently holds from Corollary 4.
Unfortunately, the GEE generally does not hold in R^3. In the case of 2.5d track-
ing [Matthews et al., 2007]—a 2D surface embedded in 3D space—T[x] and It[x; µt]
do not coincide in an open set, since in general T[x] ≠ It[x; µt] for points outside
the surface. Thus, the GEE does not hold as a consequence of Lemma 3—see Fig-
ure 4.6.
4.3 Summary
As we shall show later, the GEE is the cornerstone of the speed improvement in efficient
image registration algorithms. If the image warping function is defined on R^2 or P^2,
the GEE is satisfied. If the warping is defined on a 3D manifold D ⊂ R^3—i.e., the
case of 2.5d tracking—the GEE does not hold. In Table 4.1 we enumerate some
warping functions and state whether they satisfy the GEE.
Figure 4.5: Gradients and Convergence. (Left) We optimize a quadratic form
starting from the initial guess x0 = (0, 14) using two gradient-based iterative methods.
Solid red line: Algorithm that uses a correct Jacobian to compute each iterative step
xi towards the solution; by definition, the gradient is orthogonal to the isocurve of the
function at each xi. The algorithm reaches the optimum x∗ after some iterations. Dotted
black line: Algorithm that uses an incorrect Jacobian to compute iterative steps ˜xi; the
computed gradients are no longer orthogonal to the function curve. The algorithm reaches
a solution ˜xn that is different to the actual optimum. (Right) Functions f and g, and
their respective gradients (see figure legend). The grey region represents the open interval
D = (3.5, 5.5).
Figure 4.6: Open Subsets in Various Domains. (Left) Open subset in R2. The
neighbourhood D ⊂ R2 of the point is fully contained in R2. (Right) Open subset for a
point x ∈ R3 in a 3D manifold D ⊂ R3. The neighbourhood of the point x is not fully
contained in the manifold D.
Table 4.1: Characteristics of the warps

Warp Name                              Domain     dof   Allowed Motion                                GEE
2D Affine                              R^2        6     rotation, translation, scaling, shearing      YES
Homography                             R^2, P^2   8     rotation, translation, scaling, shearing,     YES
                                                        perspective deformation
Plane-induced homography               P^2        6     3D rotation, 3D translation                   YES(1)
(see Appendix B)
3D Rigid Body                          R^3        6     3D rotation, 3D translation                   NO

(1) NO in the case of the Plane+Parallax-constrained homography (see Appendix C)
Chapter 5
Additive Algorithms
This chapter discusses efficient additive algorithms for image registration. We turn
our attention to the HB algorithm, as it is known to be the most efficient additive
algorithm for image registration—we do not consider GIC as additive, but composi-
tional.
Open Issues with the Factorization Algorithm Although the HB algorithm
beats LK performance and enables us to achieve real-time image registration (cf. [Hager
and Belhumeur, 1998]), it is not free of criticism. Matthews and Baker analyse the
algorithm in [Baker and Matthews, 2004], and we summarize their criticisms in the
following two open questions:
Is every warp suitable for HB optimization?
The usual criticism of HB is that it works with only a limited number of
motion models: pure translation, an affine model (scaling, translation, rota-
tion and shearing), a restricted affine model (without shearing) and an “es-
oteric” (according to [Baker and Matthews, 2004]) non-linear model. [Baker
and Matthews, 2004] also stated that the algorithm could not use homographic
warps, although [Buenaposada and Baumela, 2002] subsequently solved the
problem by using homogeneous coordinates. However, there is no evidence
that the algorithm could work with other warps. In this chapter we intro-
duce a requirement that determines whether a given warp works with the HB
algorithm. We show that this requirement is related to the Gradient Re-
placement stage of the HB algorithm.
Can we systematize the factorization scheme?
The second quibble refers to stage two of the HB algorithm: the Factorization
step. [Baker and Matthews, 2004] argue that the factorization step must be
done using ad hoc techniques for each motion model. In this chapter and Ap-
pendix D we provide lemmas and theorems that systematize the factorization
of any expression involving matrices.
This chapter provides some insight into the two stages of the HB algorithm: in Sec-
tion 5.1 we study the requirements to perform the gradient replacement, and in Sec-
tion 5.2 we study the process of methodical factorization. We subsequently apply
this knowledge to register a 3D target model under a rigid body motion (Section 5.3):
we provide a suitable warp that can be directly used with the HB algorithm, and
we show the resulting HB optimization. Finally, Section 5.4 shows how to register
morphable models in a similar fashion.
5.1 Gradient Replacement Requirements
In this section we show the necessary requirements to perform the gradient replace-
ment operation: that is, we state the requirements on target motion and structure
that ensure a proper convergence of the HB algorithm. We recall the scheme of the
gradient replacement operation: we expand the GEE (Equation 4.8) using the chain
rule,
$$\nabla_{\hat{x}} T[x] = \nabla_{\hat{x}} I[x]\, \nabla_{\hat{x}} f(x;\mu), \qquad (5.1)$$
and we isolate the term ∇_x̂ I[x] as follows:
$$\nabla_{\hat{x}} I[x] = \nabla_{\hat{x}} T[x]\, \left( \nabla_{\hat{x}} f(x;\mu) \right)^{-1}. \qquad (5.2)$$
Then, we insert Equation 5.2 into the equation of the Jacobian expanded using the
chain rule (Equation 3.11),
$$J = \nabla_{\hat{x}} T[x]\, \nabla_{\hat{x}} f(\hat{x};\mu)^{-1}\, \nabla_{\hat{\mu}} f(x;\mu), \qquad (5.3)$$
which expresses the Jacobian matrix in terms of the template gradients. Notice that
the GEE has a key role in Equation 5.3—it is the basis for the gradient replacement.
Thus, we formulate our requirement using the GEE as follows:
Requirement 1. The gradient replacement operation within the HB algorithm is
feasible if and only if the GEE holds.
We do not prove Requirement 1 as it is a direct consequence of the GEE. Requirement 1 is a rule that lets us establish whether we can use a warp with the HB optimization. We thoroughly studied the GEE in Chapter 4, and we summarized the relation between warps and the GEE in Table 4.1. Thus, those warps that do not hold the GEE do not satisfy Requirement 1, and consequently are not suitable for the HB algorithm.
It is important to note that we additionally require that an inverse must exist
for the derivative ∇ˆxf(ˆx; µ)—i.e., warp f must be invertible. If this derivative is
singular, we shall not be able to compute Equation 5.3.
5.2 Systematic Factorization
In this section we provide insight into the factorization process. We introduce a methodical procedure to perform the factorization. This is by and large a challenging task—as we shall show, the factorization is generally nonunique—but we provide a theoretical framework to support our claims.
Why use factorization? The efficiency of the HB algorithm depends on two factors: (1) the gradient replacement operation, and (2) the factorization of the Jacobian matrix. The improvement due to the gradient replacement is noticeable, as it avoids repeatedly computing image gradients. However, the improvement in speed due to the factorization stage is not obvious at all.
Let us suppose a chain of matrix products,
E_{n×r} = A_{n×m} B_{m×p} C_{p×q} D_{q×r},    (5.4)
where red matrices—A and D—depend on parameters x, and green matrices—B and C—depend on parameters y. In the factorization we group those matrices whose elements depend on the same parameters, that is,
E_{n×r} = A_{n×m} D′_{m×r′} B′_{r′×p′} C′_{p′×r} = X_{n×r′} Y_{r′×r}.    (5.5)
We compute matrix D′ by reordering the elements of matrix D into a suitable matrix—idem for matrices B′ and C′. Furthermore, matrix X in Equation 5.5 only depends on the parameters x, as we compute it from the matrices A and D′ (we equivalently compute matrix Y). Notice that although the matrices in Equation 5.4 are different from those in Equation 5.5, the final product E is identical. When some of the parameters are constant—and so are their associated matrices—the improvement in speed due to factorization is rather noticeable. For example, if we assume that the parameters x are constant, then we avoid recomputing the product AD′ of Equation 5.5 at every iteration; hence, the key point of applying factorization to compute E is to spare spurious operations due to constant terms.
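To make the bookkeeping concrete, the following NumPy sketch illustrates the same idea with the reversible-operator identity vec(AMB) = (B⊤ ⊗ A) vec(M): the factors A and B play the role of the constant matrices and are grouped into a single precomputed matrix, while the varying part is isolated in vec(M). All names and sizes are illustrative and not taken from the thesis.

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.normal(size=(4, 3))            # constant factor (depends on "x" only)
B = rng.normal(size=(3, 5))            # constant factor (depends on "x" only)
S = np.kron(B.T, A)                    # precomputed off-line, 20 x 9

def vec(M):
    return M.reshape(-1, order="F")    # column-major vectorization

for _ in range(3):                     # "on-line" iterations
    M = rng.normal(size=(3, 3))        # varying factor (depends on "y" only)
    E_direct = A @ M @ B               # recomputes the full chain every time
    E_fact = (S @ vec(M)).reshape(4, 5, order="F")
    assert np.allclose(E_direct, E_fact)
```

The point of the toy example is not that this particular case is faster, but that everything depending only on the constant parameters has been folded into S once, so the per-iteration work touches only the varying part.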
Factorization via reversible operators There is still an open question left:
How do we systematize the factorization? [Brand and R.Bhotika, 2001] introduces
the following definition:
Definition 5.2.1. We may reorder arbitrarily a sequence of matrix operations
(sums, products, reshapings and rearrangements) using matrix reversible operators.
[Brand and R.Bhotika, 2001] define the matrix reversible operators as those that
do not lose matrix information. They state that the Kronecker product, ⊗, or
column vectorization, vec(A) [K. B. Petersen], are reversible whereas matrix multi-
plication or division are not [Brand and R.Bhotika, 2001].
We also introduce the operator ⊙ which performs a row-wise Kronecker product
of two matrices. In Appendix D we introduce a series of theorems and lemmas that,
using the aforementioned reversible operators, rearrange the product and sum of
matrices.
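As an aside, the operator ⊙ used later (e.g. in Equation 5.63) can be implemented in a few lines. The sketch below assumes the common convention in which row i of A ⊙ B is the Kronecker product of row i of A with row i of B; the precise definition adopted in the thesis is the one given in Appendix D.

```python
import numpy as np

def vec(M):
    """Column-wise vectorization vec(M)."""
    return M.reshape(-1, order="F")

def rowwise_kron(A, B):
    """Row-wise Kronecker product: row i of the result is kron(A[i], B[i])."""
    assert A.shape[0] == B.shape[0]
    return np.einsum("ij,ik->ijk", A, B).reshape(A.shape[0], -1)

A = np.arange(6.0).reshape(2, 3)
B = np.arange(8.0).reshape(2, 4)
C = rowwise_kron(A, B)                      # 2 x 12
assert np.allclose(C[0], np.kron(A[0], B[0]))
```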
Uniqueness of the factorization In general, the factorization of Equation 5.5
is not unique: we can rearrange the matrices of Equation 5.4 in different ways such
that the result of Equations 5.4 and 5.5 is identical. This situation is particularly
noticeable when we have distributive products: we can either apply the distributive
property then factorize, or factorize first, then distribute.
Jacobian matrix factorization We apply the aforementioned procedures to de-
compose the Jacobian matrix (Equation 5.3). First, we represent the Jacobian ma-
trix as a product of matrices whose elements are either shape or motion parameters.
Notice that the terms ∇ˆxf(ˆx; µ)⁻¹ and ∇ˆµf(x; µ) intermingle shape and motion elements such that the factorization is not obvious at all—on the other hand, ∇ˆxT[x] only depends on shape terms. The objective is to represent J as a leftmost matrix whose elements are only shape terms—or constant values—and a rightmost matrix whose elements are only motion terms:
J = ∇ˆxT[x] ∇ˆxf(ˆx; µ)⁻¹ ∇ˆµf(x; µ) = S M.    (5.6)
We use an iterative procedure to compute the factorization. We represent this
procedure in Algorithm 6.
Algorithm 6 Iterative factorization of the Jacobian matrix.
Off-line: Represent the Jacobian as a chain of matrix sums or products.
On-line:
1: Select two adjacent elements such that the right-hand one is a motion term and the left-hand one is a shape term.
2: Use one of the factorization lemmas to reverse the order of the terms.
3: Repeat until the Jacobian is as Equation 5.6.
Notation of the Factorization Process We consistently use the following no-
tation to describe the factorization procedure:
• We represent a matrix whose terms are either constants or shape terms using
a red box S .
• We represent a matrix whose terms are either constants or motion terms using
a green box M .
• When we reverse a pair of shape-motion terms by using a lemma we highlight
the result by using a blue box R . We represent the application of a given
lemma by using an arrow labelled with the name of the rule (see Equation 5.7).
E_{m×r} = A_{m×n} B_{n×p} C_{p×q} D_{q×r}
  --Lemma-->  A_{m×n} B′_{n×p′} C′_{p′×q} D_{q×r}
  = A_{m×n} B′_{n×p′} C′_{p′×q} D_{q×r}.    (5.7)
5.3 3D Rigid Motion
In this section we describe how to register images from a 3D target under a rigid
body motion in R³ by using an HB optimization. The fundamental challenge lies in satisfying Requirement 1: according to Table 4.1, the usual rigid body warp does not verify the GEE. Hence, we need a warp that both models the target dynamics and holds Requirement 1.
Previous Work The original HB algorithm was not specifically suited for 3D tar-
gets: [Hager and Belhumeur, 1998] only defined affine warps over 2D templates.
Later, [Hager and Belhumeur, 1999] extended the HB algorithm to handle 3D motion;
nonetheless, the algorithm seems to be limited as it only handles small rotations—
around 10 degrees. [Buenaposada and Baumela, 2002] extended the HB algorithm
to handle homogeneous coordinates, so a full projective homography could be used.
Using this homography, the authors effectively computed the 3D orientation of a
plane in space. [Sepp and Hirzinger, 2003] proposed an HB algorithm to register complex 3D targets. Unfortunately, the results seem poor, as the algorithm handles limited out-of-plane rotations—less than 30°, cf. [Sepp and Hirzinger, 2003]. We may explain this flaw in performance as a direct result of an improper gradient replacement: the rigid body transformation does not verify the GEE (cf. Chapter 4), hence Requirement 1 does not hold.
We define an algorithm that effectively registers 3D targets by using a family of
homographies: one homography per plane/triangle of the model. We combine these
homographies together by considering that all the triangles of the target model
are parameterized by the same rotation and translation [Muñoz et al., 2009]. In the following we introduce the convention to represent target models, 3D Textured Models, and a new warp that parameterizes a family of homographies for a set of planes, the shape-induced homography.
5.3.1 3D Textured Models
We describe the target using 3D Textured Models (3dtm) [Blanz and Vetter, 2003]. We follow a convention similar to [Blanz and Vetter, 1999]: we model shape and colour separately, but both are defined on a common bi-dimensional space, F ⊂ R², that we call the Reference Frame. Target shape is a discrete function S : F → R³ that maps ui = (xi, yi)⊤ ∈ F into si = (Xi, Yi, Zi)⊤ ∈ R³ for i = 1, . . . , N, where N is the number of vertices of the target model. Vertices are arranged in a discrete polygonal mesh that we define by a list of triangles. We provide continuity in the space F by interpolating among neighbouring vertices in the triangle list [Romdhani and Vetter, 2003]. The target colour or texture, T : F → R^C, similarly maps the bi-dimensional space u ∈ F into the colour space—RGB if coloured or grey scale if monochrome (see Figure 5.1). Again, we achieve continuity in the colour space by interpolating the colour values at the vertices in the triangle list.
Figure 5.1: 3D Textured Model. (Top-left) The bi-dimensional reference frame F. We associate a colour triplet to each discrete point in F, resulting in a texture image. (Top-right) The shape of the model in R³. (Bottom-left) Close-up of F (green square in top-left image). The reference frame is discretized in triangles. (Bottom-right) Close-up of the shape s (green square in top-right image). The shape is a 3D triangular mesh. The bi-dimensional triangle (u1, u2, u3)⊤ is mapped into the 3D triangle (s1, s2, s3)⊤. The interior point u of the triangle is coherently mapped into its corresponding point in the shape by using barycentric coordinates.
Notation By abuse of notation, T denotes both the usual image function, T : R² → R^C, and the texture defined in the reference space, T : F → R^C.
5.3.2 Shape-induced Homography
We introduce a new warp that (1) represents the target rigid body motion in R³, and (2) holds Requirement 1. We base this warp on the plane-induced homography fh6—see Appendix B. The plane-induced homography relates the projections of a plane that rotates and translates in R³. We equivalently define the shape-induced homographies fh6s as a family of plane-induced homographies that relate the projections of a shape that rotates and translates in space,
fh6s(˜x, n; µ) = ˜x′ = K(R − Rtn⊤)K⁻¹˜x,    (5.8)
where ˜x and ˜x′ are the projections of a generic vertex s of shape S (see Figure 5.2). We consider that vertex s is the centroid of the triangle with normal n, located at depth d from the origin—i.e. n⊤s = d. We normalize n by the triangle depth, n = n/d, such that n⊤s = 1. Vector µ contains a parameterization for R and t. Notice that µ is common to every point in S but n is not; hence, we have one plane-induced homography for each pair of projections ˜x ↔ ˜x′, but every homography shares the same R and t (see Figure 5.2).
5.3.3 Change to the Reference Frame
Equation 5.8 relates the projections of a shape vertex in two views. We describe
now how to express Equation 5.8 in terms of F-coordinates. Let u1, u2, and u3
be the coordinates in the reference frame of the shape triangle s1, s2, and s3—i.e.
si = S(ui). Let ˜xi be the projection of si on a given image. Figure 5.3 shows the relationship among the triangles (s1, s2, s3)⊤, (˜x1, ˜x2, ˜x3)⊤, and (˜u1, ˜u2, ˜u3)⊤. We represent the transformation between vertices (˜u1, ˜u2, ˜u3)⊤ and (˜x1, ˜x2, ˜x3)⊤ using an affine warp HA,
˜xi = HA˜ui = [a11 a12 tx; a21 a22 ty; 0 0 1] (ui⊤, 1)⊤,   i = 1, 2, 3,    (5.9)
where ˜ui ∈ P² is the augmented vector of ui. The affine transformation HA is explicitly defined by the three correspondences ˜u1 ↔ ˜x1, ˜u2 ↔ ˜x2, and ˜u3 ↔ ˜x3; the interior points of the triangles are coherently transformed as the affinity is invariant to barycentric combinations [Hartley and Zisserman, 2004]. When we extend Equation 5.9 to the N vertices of S we obtain a piecewise affine transformation [Matthews and Baker, 2004] between F and the view (see Figure 5.3). If the affinity HA is not
degenerate—i.e. det(HA) ≠ 0—then we can rewrite Equation 5.8 as follows:
f3DTM(˜u, n; µ) = ˜u′ = K(R − Rtn⊤)K⁻¹HA˜u.    (5.10)
The transformation f3dtm (Equation 5.10) relates the 3D motion of the shape to the reference frame F (see Figure 5.3).
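For illustration, a minimal NumPy sketch of how one per-triangle affine map HA could be obtained from the three correspondences ˜ui ↔ ˜xi; the point coordinates and helper name below are made up for the example.

```python
import numpy as np

def affine_from_triangle(u, x):
    """Affine map H_A (3x3, last row [0,0,1]) such that x_i = H_A @ u_i for the
    three triangle vertices; u and x are 3x2 arrays of 2D points."""
    U = np.vstack([u.T, np.ones(3)])    # homogeneous reference-frame vertices
    X = np.vstack([x.T, np.ones(3)])    # homogeneous image vertices
    return X @ np.linalg.inv(U)         # exact for three non-collinear points

u = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])       # reference-frame vertices
x = np.array([[10.0, 12.0], [25.0, 14.0], [11.0, 30.0]])  # their projections
H_A = affine_from_triangle(u, x)

# interior points (barycentric combinations) are mapped consistently
b = np.array([0.2, 0.3, 0.5])                    # barycentric coordinates
p_ref = b @ u                                    # point inside the reference triangle
p_img = (H_A @ np.append(p_ref, 1.0))[:2]
assert np.allclose(p_img, b @ x)                 # the affinity preserves barycentric combos
```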
Figure 5.2: Shape-induced homographies. (Top) We image the shape S using P = [I|0] (left view) and P′ = [R| −Rt] (right view), respectively. (Middle) Close-up of the 3D shape. We select three shape points s1, s2, s3 ∈ R³; each one belongs to a triangle located on a different plane in R³—whose normals are respectively π1, π2, π3 ∈ R³. (Bottom) Close-ups of the two views. We image the shape point si as ˜xi on the left view, and ˜x′i on the right view, for i = {1, 2, 3}. The point si on plane πi induces the homography ˜Hi between ˜xi and ˜x′i. Note that ˜H1, ˜H2, and ˜H3 are different homographies that share the R and t parameters (cf. Equation 5.8).
Figure 5.3: Warp defined on the reference frame. (Top-left) Shape triangle mesh and texture on the reference frame F. (Top-right) The image corresponding to the left view of Figure 5.2. This is the reference image. (Middle) Close-up of the 3D shape. We select three shape points s1, s2, s3 ∈ R³; each one belongs to a triangle located on a different plane in R³—whose normals are respectively π1, π2, π3 ∈ R³. (Bottom) Close-ups from the images of the top row: the reference frame (left) and the reference image (right). We image the shape point si as ˜xi on the left view, and ˜x′i on the right view, for i = {1, 2, 3}. On the other hand, we know the relationship between the point si and its correspondence ˜ui in the reference frame by means of the function S. Thus, there exists a correspondence ˜ui ↔ ˜xi by means of si. We compute such a correspondence as an affine transformation H_A^i between ˜ui and ˜xi—i.e. ˜xi = H_A^i ˜ui; the correspondence is a piecewise affine transformation. Note that the transformations H_A^i are different from each other since they depend on the correspondence ˜ui ↔ ˜xi.
Does f3DTM hold the GEE? Equation 5.10 is a homography resulting from
chaining two nondegenerate homographies. Thus, according to Lemma 3 any ho-
mographic transformation such as f3dtm holds the GEE and, by extension, holds
Requirement 1—see Table 4.1.
Advantages of the Reference Frame
Many previous approaches to 3D tracking, e.g. [Baker and Matthews, 2004; Bue-
naposada et al., 2004; Cobzas et al., 2009; Decarlo and Metaxas, 2000; Hager and
Belhumeur, 1998; Sepp, 2006], use a reference template—or a selected frame of the
sequence—to define the texture function T (see Figure 5.4). Using this texture in-
formation they define the brightness dissimilarity function that they subsequently
optimize (e.g., by using Equation 3.1).
Figure 5.4: Reference frame advantages. (Top row) We rotate the shape model around the Y axis by β = {0°, 30°, 60°, 90°}. (Bottom row) Visible points for each view of the rotated shape. Green dots: hidden triangles due to self-occlusions. We consider that a shape triangle is visible in the image if the angle between the triangle normal and the camera ray is less than 70°.
This information is valid in the neighbourhood of the reference image only. The
dissimilarity function uses the texture that is available from the imaged target at that
reference image. Thus, there is no texture information from those parts of the object
that are not imaged in the template. As the projected appearance changes due to
the motion of the target, more and more uncertainty is introduced in the brightness dissimilarity function. This leads to a decrease in the performance of the tracker for large rotations [Sepp, 2006; Xu and Roy-Chowdhury, 2008] (see Figure 5.4).
We solve the problem by using the texture reference frame [Romdhani and Vetter,
2003]. With the texture reference frame we can define a continuous brightness
dissimilarity function over the whole 3D target. Using this continuous texture we
can define the brightness dissimilarity even in the case of large rotations of the
target (see Figure 5.4). Another approach to this problem is to use several reference
images [Vacchetti et al., 2004]. When the target is a 3D plane, one reference image
suffices to provide texture information [Baker and Matthews, 2004; Buenaposada
and Baumela, 2002; Hager and Belhumeur, 1998]. Although it does not suffer from
self-occlusions, it may have aliasing artefacts.
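A minimal sketch of the kind of visibility test described for Figure 5.4: keep a triangle when the angle between its normal and the viewing ray is below a threshold. The sign conventions (outward normals, camera at the origin) are assumptions made for the example.

```python
import numpy as np

def visible(normal, centroid, cam_center=np.zeros(3), max_angle_deg=70.0):
    """Rough self-occlusion test for one triangle of the shape mesh."""
    ray = cam_center - centroid                      # viewing ray towards the camera
    cosang = np.dot(normal, ray) / (np.linalg.norm(normal) * np.linalg.norm(ray))
    angle = np.degrees(np.arccos(np.clip(cosang, -1.0, 1.0)))
    return angle < max_angle_deg
```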
5.3.4 Optimization Outline
We define the brightness dissimilarity function (using Equation 5.10) as follows:
D(F; µ) = T [F] − It[f3dtm(F, n; µ)], (5.11)
where n ≡ n(˜u) is the normal vector to the plane that contains the point S(˜u), ∀˜u ∈ F—strictly speaking, ˜u = (u, 1)⊤, with u ∈ F. The dissimilarity function (Equation 5.11) is continuous over the reference space F—as is the normal function n(F). Notice that the dissimilarity function is defined over the texture function T instead of a single template—as in Equation 3.1. We rewrite Equation 5.11 in residuals form as:
r(µ) = T[˜u] − It+1[f3dtm(˜u; µ)], (5.12)
where we drop the parameter n of f3dtm for clarity. The corresponding linear model
for the residuals of Equation 5.12 is
ℓ(δµ) ≡ r(µ) + J(µ)δµ, (5.13)
where
J(µ) = ∂It+1[f3dtm(˜u; ˆµ)]/∂ˆµ |ˆµ=µ.    (5.14)
5.3.5 Gradient Replacement
We rewrite Equation 5.14 using the gradient replacement equation (Equation 5.3)
as follows:
J(µ) = (∇ˆuT[˜u])⊤ (∇ˆuf3dtm(˜u; µ))⁻¹ (∇ˆµf3dtm(˜u; µ)).    (5.15)
In the following, we individually analyze each term of Equation 5.15.
Template gradients on F The first term deals with the template derivatives on
the reference frame F:
∇ˆuT[˜u]⊤ = ( (1/w)∇iT[˜u],  (1/w)∇jT[˜u],  −(1/w)(u∇iT[˜u] + v∇jT[˜u]) )⊤.    (5.16)
61
Warp gradients on target coordinates The second term handles the gradients
of the warp f3dtm(˜u; µ) with respect to the target coordinates ˜u. The target is defined on the projective plane P², so we trivially compute the gradient as:
∇ˆuf3dtm(˜u, n; µ) = K(R − Rtn⊤)K⁻¹HA.    (5.17)
The resulting homography matrix (see Equation 5.17) must be inverted for each point ˜u in the reference frame. We directly invert Equation 5.17 as follows:
∇ˆuf3dtm⁻¹ = (K(R − Rtn⊤)K⁻¹HA)⁻¹ = HA⁻¹K(R − Rtn⊤)⁻¹K⁻¹ = HA⁻¹K(I − tn⊤)⁻¹R⊤K⁻¹.    (5.18)
We invert the term I − tn⊤ using the Sherman–Morrison matrix inversion formula [K. B. Petersen]:
(I − tn⊤)⁻¹ = I + tn⊤/(1 − n⊤t).    (5.19)
Plugging Equation 5.19 into Equation 5.18 results in
∇ˆuf3dtm⁻¹ = HA⁻¹K(I + tn⊤/(1 − n⊤t))R⊤K⁻¹
           = HA⁻¹K(((1 − n⊤t)I + tn⊤)/(1 − n⊤t))R⊤K⁻¹
           = λHA⁻¹K(I − (n⊤t)I + tn⊤)R⊤K⁻¹,    (5.20)
where λ = 1/(1 − n⊤t) is a homogeneous scale factor that depends on each shape point.
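As a quick numerical sanity check of Equation 5.19 (with made-up values for t and n):

```python
import numpy as np

t = np.array([[0.1], [0.2], [0.3]])
n = np.array([[0.4], [0.5], [0.6]])           # here n.T @ t = 0.32, so 1 - n.T t != 0

lhs = np.linalg.inv(np.eye(3) - t @ n.T)
rhs = np.eye(3) + (t @ n.T) / (1.0 - float(n.T @ t))
assert np.allclose(lhs, rhs)                  # Sherman-Morrison rank-one inversion
```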
Target motion gradients The third term computes the gradients of the warp f3dtm(˜u; µ) with respect to the motion parameters µ. The resulting Jacobian matrix has the following form:
∇ˆµf3dtm(˜u; µ) = [ ∇ˆRf3dtm(˜u; R)   ∇ˆtf3dtm(˜u; t) ].    (5.21)
The derivative of the warp with respect to each rotation parameter is computed as follows:
∇ˆ∆f3dtm(˜u; ∆) = KṘ∆K⁻¹HA˜u,    (5.22)
where Ṙ∆ is the derivative of the rotation matrix R with respect to the Euler angle ∆ = {α, β, γ}. We trivially compute the derivatives of the warp with respect to the translation parameters t as follows:
∇ˆtf3dtm(˜u; t) = Kn⊤K⁻¹HA˜u.    (5.23)
Notice that Equation 5.23 only depends on the target shape—˜u, the target coordinates in F, and n, its corresponding world plane—but does not depend on the motion parameters anymore; hence, Equation 5.23 is constant. Plugging Equations 5.22 and 5.23 into Equation 5.21 we obtain the final form of the derivatives:
∇ˆµf3dtm(˜u; µ) = [ KṘαK⁻¹HA˜u   KṘβK⁻¹HA˜u   KṘγK⁻¹HA˜u   Kn⊤K⁻¹HA˜u ].    (5.24)
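The matrices Ṙ∆ have simple closed forms. As a sanity check, the sketch below compares the analytic derivative of a rotation about one axis with central finite differences; the Euler-angle convention used here is only an example and need not match the one adopted in the thesis.

```python
import numpy as np

def rot_x(a):
    """Rotation about the X axis (one possible Euler-angle convention)."""
    c, s = np.cos(a), np.sin(a)
    return np.array([[1, 0, 0], [0, c, -s], [0, s, c]])

def drot_x(a):
    """Analytic derivative of rot_x with respect to the angle."""
    c, s = np.cos(a), np.sin(a)
    return np.array([[0, 0, 0], [0, -s, -c], [0, c, -s]])

a, eps = 0.3, 1e-6
numeric = (rot_x(a + eps) - rot_x(a - eps)) / (2 * eps)   # central differences
assert np.allclose(numeric, drot_x(a), atol=1e-8)
```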
Assemblage of the Jacobian Substituting Equations 5.16, 5.20, and 5.24 back
into Equation 5.15 we have the analytic form of each row of the Jacobian matrix
(Equation 5.14):
J⊤ = [ J1 J2 J3 J4 J5 J6 ],    (5.25)
with
J1 = ∇ˆuT[˜u]⊤ λHA⁻¹K(I − (n⊤t)I + tn⊤)R⊤ṘαK⁻¹HA˜u,
J2 = ∇ˆuT[˜u]⊤ λHA⁻¹K(I − (n⊤t)I + tn⊤)R⊤ṘβK⁻¹HA˜u,
J3 = ∇ˆuT[˜u]⊤ λHA⁻¹K(I − (n⊤t)I + tn⊤)R⊤ṘγK⁻¹HA˜u,
J4 = ∇ˆuT[˜u]⊤ λHA⁻¹K(I − (n⊤t)I + tn⊤)r1n⊤K⁻¹HA˜u,
J5 = ∇ˆuT[˜u]⊤ λHA⁻¹K(I − (n⊤t)I + tn⊤)r2n⊤K⁻¹HA˜u,
J6 = ∇ˆuT[˜u]⊤ λHA⁻¹K(I − (n⊤t)I + tn⊤)r3n⊤K⁻¹HA˜u,    (5.26)
where ri for i = 1, . . . , 3 is the i-th column vector of the rotation matrix R—i.e. R = (r1, r2, r3).
5.3.6 Systematic Factorization
We proceed with the systematic factorization of Equation 5.25 using the theorems
and lemmas from Appendix D. We attempt to rewrite the terms of the Jacobian such
that we compute the products in Equation 5.15 more efficiently. Our first operation
is a change of variables v = K⁻¹HA˜u that rewrites Equations 5.26 as follows:
J1 = ∇ˆuT[˜u]⊤ λHA⁻¹K(I − (n⊤t)I + tn⊤)R⊤Ṙαv,
J2 = ∇ˆuT[˜u]⊤ λHA⁻¹K(I − (n⊤t)I + tn⊤)R⊤Ṙβv,
J3 = ∇ˆuT[˜u]⊤ λHA⁻¹K(I − (n⊤t)I + tn⊤)R⊤Ṙγv,
J4 = ∇ˆuT[˜u]⊤ λHA⁻¹K(I − (n⊤t)I + tn⊤)r1n⊤v,
J5 = ∇ˆuT[˜u]⊤ λHA⁻¹K(I − (n⊤t)I + tn⊤)r2n⊤v,
J6 = ∇ˆuT[˜u]⊤ λHA⁻¹K(I − (n⊤t)I + tn⊤)r3n⊤v.    (5.27)
We factorize each term of Equation 5.27 as we indicated in Section 5.3. Thus, we
rewrite the i-th row of the Jacobian matrix J as
J(i)⊤ = S(i)⊤ M,    (5.28)
where S(i)⊤ = [ S1(i)⊤  S2(i)⊤ ]. We define the matrices S1(i)⊤ and S2(i)⊤ as follows:
S1(i)⊤ = ∇ˆuT[˜u]⊤ [ (I3 ⊗ n(i)⊤)A + ( (I3 ⊗ n(i)⊤)(I9 ⊗ v(i)⊤) − (I3 ⊗ n(i)⊤)(Pπ(9:3) ⊗ v(i)⊤) ) B ],
S2(i)⊤ = S1(i)⊤ ( (I3 ⊗ n(i)) ⊗ I4 ).    (5.29)
We build the motion matrix M as
M = [ ((1, t⊤)⊤ ⊗ I9) [ vec(Ṙα⊤R)  vec(Ṙβ⊤R)  vec(Ṙγ⊤R) ]      0_{36×3}
      0_{12×3}                                                  (I3 ⊗ (1, t⊤)⊤) R ].    (5.30)
The full derivation of the matrices from Equation 5.29 is presented in Appendix E.
We assemble the Jacobian J by stacking the N rows J(i)⊤ into a single matrix. Since the matrix M is the same for each row J(i)⊤, we can extract M as a common factor and write the Jacobian matrix as
J = SM,    (5.31)
where S is the matrix that we compute by stacking the N entries S1(i)⊤ and S2(i)⊤:
S = [ S1(1)⊤  S2(1)⊤
      ...     ...
      S1(N)⊤  S2(N)⊤ ].    (5.32)
Outline of Algorithm HB3DTM We define the HB algorithm for the warp f3dtm
as hb3dtm. We use the factorization equations (Equations 5.29) as a basis for our
algorithm; we show the outline for algorithm hb3dtm in Algorithm 7.
Algorithm 7 Outline of the HB3DTM algorithm.
Off-line: Let µi = µ0 be the initial guess.
1: for i = 1 to N do
2:   Compute S1(i) and S2(i) using Equation 5.29.
3: end for
4: Assemble matrix S using Equation 5.32.
On-line:
5: while no convergence do
6:   Compute the residual function r(µi) from Equation 5.12.
7:   Compute matrix M(µi) using Equation 5.30.
8:   Assemble the Jacobian: J(µi) = SM(µi) (Equation 5.31).
9:   Compute the search direction: δµi = −(J(µi)⊤J(µi))⁻¹J(µi)⊤r(µi).
10:  Additively update the optimization parameters: µi+1 = µi + δµi.
11: end while
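A schematic NumPy rendering of the on-line loop of Algorithm 7, assuming the constant matrix S has been assembled off-line and that residual(mu) and motion_matrix(mu) are problem-specific callbacks standing in for Equations 5.12 and 5.30 (both names are hypothetical):

```python
import numpy as np

def hb_gauss_newton(S, residual, motion_matrix, mu0, n_iters=20, tol=1e-6):
    """Additive Gauss-Newton loop with the factorized Jacobian J = S @ M(mu)."""
    mu = np.asarray(mu0, dtype=float).copy()
    for _ in range(n_iters):
        r = residual(mu)                    # length-N brightness residuals
        J = S @ motion_matrix(mu)           # N x 6 Jacobian, assembled per iteration
        delta, *_ = np.linalg.lstsq(J, -r, rcond=None)   # = -(J^T J)^-1 J^T r
        mu = mu + delta                     # additive update (step 10 of Algorithm 7)
        if np.linalg.norm(delta) < tol:
            break
    return mu
```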
5.4 3D Nonrigid Motion
In addition to the motion of the target in space, we allow deformations of the target itself. This nonrigid motion is caused by elastic deformations of the target (e.g. deformations of an elastic sheet, changes in facial expression), or by nonelastic motion of target portions (e.g. jaw rotation due to mouth opening).
5.4.1 Nonrigid Morphable Models
As in the rigid case (Section 5.3), we describe the target using nonrigid morphable
models (3dmm) [Romdhani and Vetter, 2003]: we describe the shape deformation
using a linear combination of modes of deformation that we have obtained by ap-
plying pca on a set of shape samples:
s = s0 + Σ_{k=1}^{K} ck sk,    s, s0, sk ∈ R³,    (5.33)
where s0 ∈ R³ is the pca average sample shape, the sk are the K modes of variation from pca, and the ck are the coefficients of the linear combination (see Figure 5.5). Note that each shape point s has a different s0 and sk, but all of them share the same K coefficients:
S_{3×N} = S0_{3×N} + Σ_{k=1}^{NK} ck Sk_{3×N},    (5.34)
where S0_{3×N} and Sk_{3×N} are the matrices that we compute by joining the N shape averages s0 and the modes sk.
Figure 5.5: Nonrigid Morphable Models. The 3dmm models the 3D shape as a
linear combination of 3D shapes that represent the modes of variation.
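A toy NumPy version of the linear shape model of Equations 5.33–5.34, with random numbers standing in for a learned pca basis:

```python
import numpy as np

# toy 3DMM with N vertices and K modes of variation
N, K = 5, 2
S0 = np.arange(3 * N, dtype=float).reshape(3, N)              # mean shape S0 (3 x N)
modes = [np.random.default_rng(k).normal(size=(3, N)) for k in range(K)]
c = np.array([0.7, -0.3])                                     # deformation coefficients c_k

S = S0 + sum(ck * Sk for ck, Sk in zip(c, modes))             # Equation 5.34
```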
5.4.2 Nonrigid Shape-induced Homography
As for the rigid case, we model the target dynamics using shape-induced homo-
graphies (see Equation 5.8), but we must account for the target deformation. We
equivalently define the nonrigid shape-induced homography fh6d as a family of plane-
induced homographies that relate the projections of a shape that rotates, translates
and deforms in space:
fh6d(˜x, n; µ) = ˜x′ = K(R + RBscn⊤ − Rtn⊤)K⁻¹˜x,    (5.35)
where ˜x and ˜x′ are the projections of a generic shape point s located on the plane π⊤ = (n⊤, 1)⊤ (see Figure 5.6), Bs = [s1, . . . , sK] is a 3 × K matrix that contains the modes of variation, and c = (c1, . . . , cK)⊤ is the vector of deformation coefficients. Vector µ contains a parameterization for the rotation matrix R, the translation t and the deformation coefficients c.
Warp Rationale We project the generic shape point s onto the cameras
P = K[I|0],    P′ = K[R| −Rt].    (5.36)
We describe the point s by rewriting Equation 5.33 as
s = s0 + Σ_{k=1}^{K} ck sk = s0 + Bsc,    s, s0, sk ∈ R³.    (5.37)
We explicitly assume that the target does not deform in the first view, that is, we image s under P as
˜x = K[I|0] ((s0 + Bs0)⊤, 1)⊤.    (5.38)
If we encode the deformation between the two views as c, then we image s under P′ as
˜x′ = K[R| −Rt] ((s0 + Bsc)⊤, 1)⊤ = K(Rs0 + RBsc − Rt).    (5.39)
The world plane π⊤ = (n⊤, 1)⊤ naturally satisfies n⊤s0 = 1; thus, we rewrite Equation 5.39 as follows:
˜x′ = K(Rs0 + RBscn⊤s0 − Rtn⊤s0) = K(R + RBscn⊤ − Rtn⊤)s0.    (5.40)
Using Equation 5.38 we rewrite Equation 5.40 as the nonrigid shape-induced homography between the projections ˜x ↔ ˜x′:
˜x′ = K(R + RBscn⊤ − Rtn⊤)K⁻¹˜x.    (5.41)
5.4.3 Change of Variables to the Reference Frame
We can also express the target coordinates in terms of the reference frame F. As
in the rigid case, there exists an affine transformation HA such that ˜x = HA˜u —see
Figure 5.7. Thus, we write the warp f3dmm that relates shape coordinates in F with
the projections onto a view due to the target motion:
f3dmm = ˜x′ = K(R + RBscn⊤ − Rtn⊤)K⁻¹HA˜u.    (5.42)
Figure 5.6: Nonrigid shape-induced homographies. (Top-left) We image the average shape s1 using P = [I|0] onto the left view. (Top-right) We compute the deformed shape s′1 using Equation 5.33, and we image this shape onto the right view by using P′ = [R| −Rt]. (Middle-left) Close-up of the average shape. Point s1 on the average shape lies on the plane with coordinates (π1, 1)⊤. (Middle-right) Close-up of the deformed shape. The plane in which s′1 lies differs from (π1, 1)⊤ by a rigid body transformation and a shape deformation. (Bottom-left) Close-up of the left view. We image the average shape point s1 as ˜x1. (Bottom-right) Close-up of the right view. We image the deformed point s′1 as ˜x′1. The correspondence s1 ↔ s′1 induces a homography ˜H1 between ˜x1 and ˜x′1. Note that each shape induces a family of homographies that are parameterized by a common R and t (cf. Equation 5.40).
Figure 5.7: Deformable warp defined on the reference frame. (Top-left) Shape triangle mesh and texture on the reference frame F. (Top-right) The image corresponding to the left view of Figure 5.6—the reference image. (Middle) Close-up of the average shape (see Figure 5.6, middle-left). (Bottom-left) Close-up of the top-left image. We compute the point ˜u1 on the reference frame that corresponds to the average shape point s1 by using the shape function S. (Bottom-right) Close-up of the top-right image. We image the point s1 as ˜x1. Thus, there exists a correspondence ˜u1 ↔ ˜x1 by means of s1. We compute such a correspondence as an affine transformation H_A^1 between ˜u1 and ˜x1. This transformation holds for all the points on the triangle that contains ˜u1. There is a different transformation H_A^i for each i-th triangle in the shape. Hence, the mapping between the reference frame and the reference image is a piecewise affine transformation.
5.4.4 Optimization Outline
We define the brightness dissimilarity function using Equation 5.42 as follows:
D(F; µ) = T [F] − It[f3dmm(F, n(F); µ)], (5.43)
where n ≡ n(˜u) is the vector normal to the plane that contains the point S(˜u), ∀˜u ∈ F. As in the rigid case (Equation 5.11), Equation 5.43 is continuous over F. We rewrite Equation 5.43 as residuals,
r(µ) = T[˜u] − It+1[f3dmm(˜u; µ)],    (5.44)
where we drop the parameter n from f3dmm for clarity. The corresponding linear model for the residuals of Equation 5.44 is
ℓ(δµ) ≡ r(µ) + J(µ)δµ,    (5.45)
where
J(µ) = ∂It+1[f3dmm(˜u; ˆµ)]/∂ˆµ |ˆµ=µ.    (5.46)
5.4.5 Gradient Replacement
We rewrite Equation 5.46 using the gradient replacement equation (Equation 5.3)
as follows:
J(µ) = (∇ˆuT[˜u])⊤ (∇ˆuf3dmm(˜u; µ))⁻¹ (∇ˆµf3dmm(˜u; µ)).    (5.47)
In the following, we analyze each term of Equation 5.47 separately.
Template gradients on F The first term deals with the template derivatives on
the reference frame F. These derivatives are identical to Equation 5.16 since they
do not depend upon the target dynamics.
Warp gradients on target coordinates The second term handles the gradients
of the warp f3dmm(˜u; µ) with respect to the target coordinates ˜u. We compute the
gradient as follows:
∇ˆuf3dmm(˜u, n; µ) = K(R + RBscn⊤ − Rtn⊤)K⁻¹HA = KR(I + Bscn⊤ − tn⊤)K⁻¹HA.    (5.48)
Equation 5.47 calls for the inverse form of Equation 5.48, thus
∇ˆuf3dmm(˜u, n; µ)⁻¹ = HA⁻¹K(I + Bscn⊤ − tn⊤)⁻¹R⊤K⁻¹.    (5.49)
Again, we analytically invert I + Bscn⊤ − tn⊤ by using the Sherman–Morrison inversion formula [K. B. Petersen]:
(I + (Bsc − t)n⊤)⁻¹ = I − (Bsc − t)n⊤/(1 + n⊤(Bsc − t)).    (5.50)
Plugging (5.50) into (5.49) results in
∇ˆuf3dmm(˜u, n; µ)⁻¹ = HA⁻¹K(I − (Bsc − t)n⊤/(1 + n⊤(Bsc − t)))R⊤K⁻¹
                     = HA⁻¹K(((1 + n⊤(Bsc − t))I − (Bsc − t)n⊤)/(1 + n⊤(Bsc − t)))R⊤K⁻¹
                     = λHA⁻¹K(I + (n⊤Bsc)I − (n⊤t)I − Bscn⊤ + tn⊤)R⊤K⁻¹,    (5.51)
where λ = 1/(1 + n⊤(Bsc − t)) is a homogeneous scale factor that depends on each target point.
Target motion gradients The third term computes the gradients of the warp f3dmm(˜u; µ) with respect to the motion parameters µ:
∇ˆµf3dmm(˜u; µ) = [ ∇ˆRf3dmm(˜u; R)   ∇ˆtf3dmm(˜u; t)   ∇ˆcf3dmm(˜u; c) ].    (5.52)
We compute the derivatives of the warp with respect to each one of the rotation parameters as follows:
∇ˆ∆f3dmm(˜u; ∆) = KṘ∆K⁻¹HA˜u + KṘ∆Bscn⊤K⁻¹HA˜u = KṘ∆(I + Bscn⊤)K⁻¹HA˜u,    (5.53)
where Ṙ∆ is the derivative of the rotation matrix R with respect to the Euler angle ∆ = {α, β, γ}. We trivially compute the derivatives of the warp with respect to the translation parameters t as follows:
∇ˆtf3dmm(˜u; t) = Kn⊤K⁻¹HA˜u.    (5.54)
We additionally compute the derivatives of f3dmm with respect to the deformation parameters c:
∇ˆck f3dmm(˜u; ck) = KRBkn⊤K⁻¹HA˜u,    (5.55)
where Bk is the k-th column of the matrix Bs—i.e. Bk is the k-th mode of deformation.
Assemblage of the Jacobian Substituting Equations 5.53, 5.54, and 5.55 back into Equation 5.47 yields the analytic form of each row of the Jacobian matrix,
J⊤ = [ J1 J2 J3 J4 J5 J6 J7 · · · J(6+NK) ],    (5.56)
with
J1 = ∇ˆuT[˜u]⊤ DṘα(I + Bscn⊤)v,
J2 = ∇ˆuT[˜u]⊤ DṘβ(I + Bscn⊤)v,
J3 = ∇ˆuT[˜u]⊤ DṘγ(I + Bscn⊤)v,
J4 = ∇ˆuT[˜u]⊤ Dr1n⊤v,
J5 = ∇ˆuT[˜u]⊤ Dr2n⊤v,
J6 = ∇ˆuT[˜u]⊤ Dr3n⊤v,
J6+k = ∇ˆuT[˜u]⊤ DBkn⊤v,   k = 1, . . . , NK,    (5.57)
where D is short for
D = λHA⁻¹K(I + (n⊤Bsc)I − (n⊤t)I − Bscn⊤ + tn⊤)R⊤,    (5.58)
v is short for v = K⁻¹HA˜u, and ri for i = 1, . . . , 3 is the i-th column vector of the rotation matrix R—i.e. R = (r1, r2, r3).
5.4.6 Systematic Factorization
In this section we introduce the factorization of Equation 5.57. As we will see in Chapter 7, the full factorization of the nonrigid warp f3dmm does not increase the efficiency of the original model (Equation 5.57): the overhead of repeated operations between parameters is a computational burden. We solve the problem by introducing a partial factorization procedure: we only factorize and precompute the nonrepeated combinations of parameters, which is faster than computing the full factorization.
Full Factorization
We factorize Equation 5.57 using the theorems and lemmas from Appendix D. We present the full derivation of the matrices S and M in Appendix G. As in the rigid case, we proceed with each row of the Jacobian matrix (Equation 5.56) by rewriting it as
J⊤_{1×(6+K)} = S⊤M.    (5.59)
We write the row vector S⊤ as
S⊤ = [ S⊤1, S⊤1, S⊤1, S⊤2, S⊤3 ]_{1×(210+280K+72K²)},    (5.60)
where we define the vectors S⊤i as follows:
S1 = [  (D⊤(I3 ⊗ n⊤Bs)(I3 ⊗ v⊤))⊤
        (D⊤Bs(IK ⊗ n⊤)(I3 ⊗ v⊤))⊤
        (D⊤(I3 ⊗ n⊤Bs)(I3 ⊗ vec(Bs⊤)v))⊤
       −(D⊤Bs(IK ⊗ n⊤)(I3 ⊗ vec(Bs⊤)v))⊤
        (D⊤(I3 ⊗ v⊤))⊤
       −(D⊤(I3 ⊗ n⊤)(I3 ⊗ v⊤))⊤
        (D⊤(I3 ⊗ n⊤)(I3 ⊗ v⊤))⊤
        (D⊤(I3 ⊗ v⊤vec(Bs⊤)⊤))⊤
       −(D⊤(I3 ⊗ n⊤)(I3 ⊗ v⊤vec(Bs⊤)⊤))⊤
        (D⊤(I3 ⊗ n⊤)(I3 ⊗ v⊤vec(Bs⊤)⊤))⊤ ]⊤_{1×(63+81K+18K²)},
S2 = [  (D⊤n⊤v(I3 ⊗ n⊤Bs))⊤
       −(D⊤n⊤v Bs(IK ⊗ n⊤))⊤
        (D⊤n⊤v)⊤
       −(D⊤n⊤v(I3 ⊗ n⊤))⊤
        (D⊤n⊤v(I3 ⊗ n⊤))⊤ ]⊤_{1×(21+6K)},
and
S3 = [  (D⊤(n⊤v)B(IK ⊗ (n⊤Bs)))⊤
       −(D⊤(n⊤v)Bs(IK ⊗ n⊤)(I3K ⊗ vec(B)⊤))⊤
        (D⊤Bn⊤v)⊤
       −(D⊤(I3 ⊗ n⊤)(n⊤v)(I9 ⊗ vec(B)⊤))⊤
        (D⊤(I3 ⊗ n⊤)(n⊤v)(I9 ⊗ vec(B)⊤))⊤ ]⊤_{1×(31K+18K²)}.    (5.61)
The matrix M comprises the motion terms; we define this matrix as
M = [ M1  0   0   0   0
      0   M2  0   0   0
      0   0   M3  0   0
      0   0   0   M4  0
      0   0   0   0   M5 ]_{(210+280K+72K²)×(6+K)},    (5.62)
where
M1 = [ vec(Ṙα⊤R(IK ⊗ c⊤))
       vec(Ṙα⊤R(I3 ⊗ c⊤))
       vec((IK ⊗ c)R(I3 ⊗ c⊤))
       vec((IK ⊗ c)R(I3 ⊗ c⊤))
       vec(Ṙα⊤R)
       vec(Ṙα⊤R(I3 ⊗ t⊤))
       vec(Ṙα⊤R(t⊤ ⊗ I3))
       vec((I3 ⊗ c)R)
       vec((I3 ⊗ c)R⊤(I3 ⊗ t⊤))
       vec((I3 ⊗ c)R⊤(t ⊗ I3)) ]_{(63+81K+18K²)×1},
M2 and M3 are identical to M1 with Ṙβ and Ṙγ, respectively, in place of Ṙα,
M4 = [ (IK ⊗ c)R⊤
       (I3 ⊗ c)R⊤
       R⊤
       (I3 ⊗ t)R⊤
       (t ⊗ I3)R⊤ ]_{(21+6K)×3},
and
M5 = [ (I3K ⊗ c)
       (I3K ⊙ (I3 ⊗ c))
       IK
       ((I3 ⊗ t) ⊙ IK)
       ((t ⊗ I3) ⊙ IK) ]_{(31K+18K²)×K}.    (5.63)
We assemble the Jacobian matrix (Equation 5.47) as
J = SM, (5.64)
where we define S as the concatenation of N rows S⊤
(Equation 5.60).
Partial Factorization
We introduce an alternate decomposition of the Jacobian matrix (Equation 5.47).
The main feature of this decomposition is that it does not provide a full factorization:
the factorization does not completely separate structure and motion terms, but
provides a partial separation instead. In the experiments we show that this partial
factorization is more efficient than using no factorization at all or using the full
factorization procedure. The partial factorization provides a speed improvement as it precomputes some operations among shape parameters.
Appendix F. The resulting elements for a row of the Jacobian are
J1 =D1D2R⊤ ˙Rαt I3 + Bscn⊤
v,
J2 =D1D2R⊤ ˙Rβt I3 + Bscn⊤
v,
J3 =D1D2R⊤ ˙Rγt I3 + Bscn⊤
v,
J4 =D1D2r1n⊤
v,
J5 =D1D2r2n⊤
v,
J6 =D1D2r3n⊤
v,
Jk =D1D2R⊤
Bk, i = 1, . . . , K,
(5.65)
where
D1 = I3P′
+ [s1P + s2Q] Q′
,
and
D2 =

I3 ⊗


1
t
c



 .
(5.66)
Note that there are shape terms post-multiplying the motion term D2 (see Equa-
tion 8.5), so we cannot express the Jacobian as in the full factorization case—i.e.
J = SM. We show that the partial factorization (Equations F.9) is far more efficient
than (1) not using factorization at all, and (2) using the full factorization (Equa-
tion 5.61). The reason for the latter is that if we try to compute a full factorization
from Equations 8.5, the computational cost increases due to the larger size of the
inner product matrices. We give the theoretical foundations of this fact in Chapter 7.
Outline of Algorithm HB3DMM We define the full-factorization HB algorithm
for the warp f3dmm as hb3dmm. We use the factorization equations (Equations 5.59)
as a basis for our algorithm; we show the outline for algorithm hb3dmm in Algo-
rithm 8.
Algorithm 8 Outline of the full-factorized HB3DMM algorithm.
Off-line: Let µi = µ0 be the initial guess.
1: for i = 1 to N do
2:   Compute S1(i), S2(i), and S3(i) using Equation 5.61.
3: end for
4: Assemble the matrix S.
On-line:
5: while no convergence do
6:   Compute the residual function r(µi) from Equation 5.44.
7:   Compute the matrix M(µi) using Equation 5.63.
8:   Assemble the Jacobian: J(µi) = SM(µi) (Equation 5.59).
9:   Compute the search direction: δµi = −(J(µi)⊤J(µi))⁻¹J(µi)⊤r(µi).
10:  Additively update the optimization parameters: µi+1 = µi + δµi.
11: end while
Outline of Algorithm HB3DMMSF In this section we define again the HB algorithm for the warp f3dmm. In this case we deal with the partial factorization; thus, we rename this algorithm hb3dmmsf to differentiate it from the full-factorized algorithm—i.e. the hb3dmm algorithm. We use the factorization equations (Equation 5.65) as a basis for our algorithm; we show the outline for algorithm hb3dmmsf in Algorithm 9.
5.5 Summary
• In this chapter we have analysed the HB factorization-based optimization in depth; we have shown that the efficiency of the method relies on (1) a gradient replacement procedure, and (2) a neat factorization of the Jacobian matrix.
• We have proposed a necessary requirement that constrains the motion model for the HB algorithm. We have addressed a fundamental criticism of the HB algorithm by proposing a systematic factorization framework.
• We have also introduced two motion/warping models that enable us to efficiently track 3D rigid targets—the shape-induced homography—and nonrigid targets—the nonrigid shape-induced homography—by using a factorization approach.
Algorithm 9 Outline of the HB3DMMSF algorithm.
Off-line: Let µi = µ0 be the initial guess.
1: for i = 1 to N do
2:   Compute D1(i) using Equation 5.66.
3: end for
On-line:
4: while no convergence do
5:   Compute the residual function r(µi) from Equation 5.44.
6:   Compute D2 using Equation 5.66 on the current parameters t and c.
7:   for i = 1 to N do
8:     Compute J1(i), . . . , J6+K(i) using Equation 5.65.
9:   end for
10:  Assemble the Jacobian J(µi).
11:  Compute the search direction: δµi = −(J(µi)⊤J(µi))⁻¹J(µi)⊤r(µi).
12:  Additively update the optimization parameters: µi+1 = µi + δµi.
13: end while
Chapter 6
Compositional Algorithms
In this chapter we discuss compositional algorithms in greater depth than in Section 3.4. We organize this chapter as follows: in Section 6.1 we provide the basic assumptions on compositional image registration; besides, we give some insights into the workings of the IC algorithm, especially the reversal of the roles of template and image; moreover, we introduce the Efficient Forward Compositional algorithm, and we show that IC can be derived as a special case of this algorithm. Section 6.2 introduces two basic requirements that compositional algorithms must hold. Finally, Section 6.3 studies in detail other compositional methods such as the Generalized Inverse Compositional algorithm.
6.1 Unravelling the Inverse Compositional Algorithm
IC is known to be the fastest algorithm for image registration [Baker and Matthews, 2004; Buenaposada et al., 2009]. Although it is widely used [Brooks and Arbel, 2010; Dowson and Bowden, 2008; Guskov, 2004; Megret et al., 2008, 2006; Muñoz et al., 2005; Papandreu and Maragos, 2008; Romdhani and Vetter, 2003; Tzimiropoulos et al., 2011; Xu and Roy-Chowdhury, 2008], it is still not well understood how the IC algorithm works in terms of traditional gradient descent algorithms. We summarize these questions in the following:
Convergence of compositional algorithms In GD optimization the convergence is guaranteed by construction: the algorithm looks for a set of parameters xk+1 ∈ R^N such that the value of the cost function F : R^N → R decreases—i.e. F(xk+1) < F(xk). This problem is solved by expressing xk+1 as xk+1 = xk + h, for some unknown h ∈ R^N; notice that this is equivalent to the additive update step of the additive image registration algorithms (see Section 3.3). The values of h are computed by expanding F(xk+1) in a Taylor series (cf. [Madsen et al., 2004; Press et al., 1992]) as follows:
F(xk+1) ≃ F(xk) + h⊤F′(xk).
Vector h is a descent direction for F at xk if h⊤F′(xk) < 0; hence the requirement F(xk+1) < F(xk) holds, as F(xk+1) − F(xk) < 0. Then, the next iteration is computed by using xk+1 = xk + h.
In the case of compositional algorithms we cannot use the previous approach: the
next iteration in the search space is not computed as xk+1 = xk + h but as xk+1 =
Ψ(xk, h) for some composition function Ψ. The algorithm is not a GD method in
the strict sense. Convergence is assured in GD methods by construction: the cost function value at the next step is always lower than at the previous one (cf. [Madsen et al., 2004; Press et al., 1992]). However, such a statement cannot be made for the IC algorithm, as it is not possible to relate the values of the objective function between two steps due to the non-additive step.
Origins of inverse composition In Section 3.4.2 we showed that the crucial
point in the improvement of efficiency in the IC with respect to the FC algorithm is
to rewrite the brightness error function: the FC brightness dissimilarity function,
D(X; δµ) = T (X) − It(f(f(X; δµ); µ)), (6.1)
is rewritten in the IC brightness dissimilarity as
D(X; δµ) = T(f⁻¹(X; δµ)) − It(f(X; µ)).    (6.2)
The vector δµ comprises the optimization variables—µ is deemed constant. The local minimizer based on the residuals of Equation 6.2 has a constant Jacobian (cf. Section 3.4.2), unlike the minimizer based on Equation 6.1.
In the original formulation of the IC algorithm [Baker and Matthews, 2001, 2004],
Baker and Matthews simply stated Equation 6.2 without any further explanation:
they did not justify how to transform the FC dissimilarity (Equation 6.1) into the
IC dissimilarity (Equation 6.2). Here we show that this transformation depends on
a change of variables in Equation 6.1.
Can we always use Inverse Composition? [Baker and Matthews, 2004] state
that we can reverse the roles of template and image provided that the following
requirements on the warp f are satisfied:
1. The warp is closed under composition (i.e. f(x; µ′) = f(f(x; δµ); µ)).
2. The warp has an inverse f⁻¹ such that x = f⁻¹(f(x; µ); µ), ∀µ.
3. The warp identity is µ = 0 (i.e. f(x; 0) = x).
These requirements imply that the warp f must form a group [Baker and Matthews, 2004]. However, the IC algorithm is not suitable for certain problems such as 2.5d tracking [Matthews et al., 2007]: even when the group constraint holds, the algorithm does not properly converge. We introduce two requirements that effectively constrain the warp.
6.1.1 Change of Variables in IC
We define a new variable U = f(X; δµ), such that X = f⁻¹(U; δµ) also holds. We substitute this variable in Equation 6.1 as follows:
D(U; δµ) = T(f⁻¹(U; δµ)) − It(f(U; µ)).    (6.3)
Note that Equations 6.1 and 6.3 are equivalent by construction: we have just changed the domain in which the functions are defined (see Figure 6.1), but the coordinates X and f⁻¹(U; δµ) are exactly the same—ditto for U and f(X; δµ). Also note that
Equation 6.3 is similar to the IC dissimilarity function—cf. Equation 6.2. The only
difference is that both equations are defined in different domains: Equation 6.3 is
defined in domain U and Equation 6.2 (IC) is defined in X. This difference is not trivial at all: X and U are only identical if δµ = 0 (see Figure 6.1); hence, the IC problem should be solved in the unknown domain U—we do not know the coordinates U as they depend on the unknown variables δµ. Thus, we are facing a chicken-and-egg situation: we have to know δµ to solve for δµ. In [Baker and Matthews, 2004] they simply ignore the problem and solve for δµ using Equation 6.2, which raises the following question:
How does the IC algorithm (Equation 6.2) converge, if it is defined in the wrong
domain?
We shall show that this is not always true; we demonstrate this assertion by in-
troducing a new FC algorithm that is equivalent to the IC under certain assumptions
only.
6.1.2 The Efficient Forward Compositional Algorithm
We define the Efficient Forward Compositional (EFC) algorithm as an FC algorithm with a constant Jacobian: the EFC is similar to the IC—both are GN-like methods
with constant Jacobian. Their critical difference is that EFC does not reverse the roles
of template and image: the brightness dissimilarity is linearized in the image—as in
FC—and not in the template—as the IC algorithm does.
First, we rewrite Equation 6.1 such that the variable t appears in explicit form,
D(X; δµ, t + 1) = T (X) − I(f(f(X; δµ); µt), t + 1), (6.4)
Figure 6.1: Change of variables in IC. (Top-left) We overlay the target region X (yellow square) onto the template T. (Top-right) We transform the target region X by means of f(f(X; δµ); µt), and we overlay it onto the image It+1 (yellow square). (Bottom-left) We overlay the target region U = f(X; δµ) (green square) onto the template T. We also depict the transformed region f⁻¹(U; δµ) (dotted blue line). (Bottom-right) We transform the target region U by means of f(U; µt), and we overlay it onto the image It+1 (green square). Notice that the regions X and f⁻¹(U; δµ) delimit identical image areas—ditto for f(f(X; δµ); µt) and f(U; µt)—but X and U do not.
where the dissimilarity is now a two-variable function. Let tτ = t + τ be a time instant in the range t ≤ t + τ ≤ t + 1 such that its brightness constancy assumption,
I(f(X; µtτ); tτ) = T(X),    (6.5)
holds for parameters µtτ. We rewrite Equation 6.4 as a residuals vector as follows:
rEFC(µtτ; ˆµ, ˆτ) ≡ T(x) − I(f(f(x; ˆµ); µtτ), ˆτ),    (6.6)
where ˆµ are registration parameters such that
T(x) = I(f(f(x; ˆµ); µtτ), ˆτ).    (6.7)
We approximate the residuals function rEFC(µtτ; 0 + δµ, ˆτ) by using a first-order Taylor expansion at ˆµ = 0 and ˆτ = tτ,
rEFC(µtτ; δµ, t+1) ≡ rEFC(µtτ; 0, tτ) + ∇ˆµrEFC(µtτ; 0, tτ)δµ + ∇ˆτrEFC(µtτ; 0, tτ)∆t + O(δµ, ∆t)²,    (6.8)
where ∆t = t + 1 − tτ,
rEFC(µtτ; 0, tτ) = T(x) − I(f(f(x; 0); µtτ); tτ) = T(x) − I(f(x; µtτ); tτ),    (6.9)
∇ˆµrEFC(µtτ; 0, tτ) = − ∂I(f(f(x; ˆµ); µtτ), tτ)/∂ˆµ |ˆµ=0,    (6.10)
and
∇ˆτrEFC(µtτ; 0, tτ) = − ∂I(f(f(x; 0); µtτ), ˆτ)/∂ˆτ |ˆτ=tτ.    (6.11)
This linear approximation is valid for any µtτ provided that δµ and ∆t are small enough. We can then make the additional approximation
∂I(f(f(x; 0); µtτ), ˆτ)/∂ˆτ |ˆτ=tτ ∆t ≈ I(f(f(x; 0); µtτ), t+1) − I(f(f(x; 0); µtτ), tτ) + O(∆t)².    (6.12)
Inserting Equation 6.12 into Equation 6.8 we get
rEFC(µtτ; δµ, t + 1) ≃ ℓ(δµ) ≡ rEFC(µtτ; 0, t + 1) + JEFC(µtτ; 0, tτ)δµ,    (6.13)
where
rEFC(µtτ; 0, t + 1) = T(x) − I(f(f(x; 0); µtτ), t + 1),    (6.14)
JEFC(µtτ; 0, tτ) = ∂I(f(f(x; ˆµ); µtτ), tτ)/∂ˆµ |ˆµ=0 = ( ∂I(f(ˆx; µtτ), tτ)/∂ˆx |ˆx=x )⊤ ∂f(x; ˆµ)/∂ˆµ |ˆµ=0.    (6.15)
The first term of Equation 6.15 is the gradient of the warped image at time tτ. The second is the Jacobian of the warp at µ = 0, which is constant. From Equation 6.5 we know that the image at time tτ warped by µtτ is identical to the template image.
Note that, actually, only the images associated with time instants t and t + 1 shall be available. This is not a problem, since we are only interested in substituting the gradient of the warped image by that of the template.
Therefore, the gradient of the warped image should be equal to the gradient of the template:
∂I(f(ˆx; µtτ); tτ)/∂ˆx |ˆx=x = ∂T(ˆx)/∂ˆx |ˆx=x.    (6.16)
Equation 6.16 holds if the GEE does. In this case, we may rewrite Equation 6.15 as
JEFC(0) = ( ∂T(ˆx)/∂ˆx |ˆx=x )⊤ ∂f(x; ˆµ)/∂ˆµ |ˆµ=0,    (6.17)
which is constant by construction as it only depends on x and µ = 0¹—thus, we remove the dependencies on µtτ and tτ from Equation 6.17.
Outline of the EFC Algorithm We compute the local minimizer of ℓEFC(δµ) by using
δµ = −(JEFC(0)⊤JEFC(0))⁻¹JEFC(0)⊤rEFC(0),    (6.18)
which will be iterated until convergence using µt+1 = µt ◦ δµ as the update rule—the algorithm is still forward compositional. We outline the algorithm in Figure 6.2 and Algorithm 10.
Algorithm 10 Outline of the Efficient Forward Compositional algorithm.
Off-line:
1: Compute the constant Jacobian, JEFC(0), by using Equation 6.17.
On-line: Let µi = µ0 be the initial guess.
2: while no convergence do
3:   Compute the residual function rEFC(µi; 0, t + 1) from Equation 6.6.
4:   Compute the search direction: δµi = −(JEFC(0)⊤JEFC(0))⁻¹JEFC(0)⊤rEFC(µi; 0, t + 1).
5:   Update the optimization parameters: µi+1 = µi ◦ δµi.
6: end while
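For a homography-parameterized warp, the compositional update of step 5 amounts to a matrix product; the sketch below only illustrates that bookkeeping, with arbitrary numbers.

```python
import numpy as np

def compose(H_t, dH):
    """Compositional update: f(x; mu_{t+1}) = f(f(x; dmu); mu_t) corresponds
    to H_{t+1} = H_t @ dH for homography warps."""
    H = H_t @ dH
    return H / H[2, 2]                      # fix the homogeneous scale

H_t = np.array([[1.0, 0.0, 5.0],
                [0.0, 1.0, 2.0],
                [0.0, 0.0, 1.0]])           # current warp parameters (a translation)
dH = np.array([[1.01, 0.00, 0.30],
               [0.00, 0.99, -0.10],
               [0.00, 0.00, 1.00]])         # incremental warp from one iteration
H_next = compose(H_t, dH)
```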
6.1.3 Rationale of the Change of Variables in IC
We show now that we may transform any EFC problem (Equation 6.1) into its
corresponding IC equivalent (Equation 6.2) in the following proposition:
¹We thankfully acknowledge J. M. Buenaposada for rewriting the FC algorithm by using the GEE.
Figure 6.2: Forward compositional image registration. We compute the target region on frame t + 1 (Image) using the parameters of frame t (µt). Using the target region at the Template we compute a Dissimilarity Measure. We linearize the dissimilarity measure around 0 and we compute the descent direction in the search space using Least-squares. We update the parameters using composition and we recompute the target region on frame t + 1 using the new parameters. The process is iterated until convergence.
Proposition 4. The EFC problem is equivalent to the IC problem.
Proof. The GEE holds by definition of EFC. Thus, Corollary 4 holds up to a first-order approximation. Let us assume there exists an open set Ω ∋ x0, and some δµ such that f⁻¹(x0; δµ) = x′ ∈ Ω; then the bcc holds both in x0 and x′,
T[x0] = It+1[f(f(x0; δµ); µt)],    (6.19)
and
T[x′] = It+1[f(f(x′; δµ); µt)].    (6.20)
Thus, we may rewrite Equation 6.20 as
T[f⁻¹(x0; δµ)] = It+1[f(f(f⁻¹(x0; δµ); δµ); µt)] = It+1[f(x0; µt)],
since x′ = f⁻¹(x0; δµ) and f(f⁻¹(x0; δµ); δµ) = x0; this is the IC equivalent formulation of Equation 6.19.
6.1.4 Differences between IC and EFC
Although both IC and EFC are equivalent according to Proposition 4, there are subtle differences between them. In Section 6.1.1 we introduced the notion that the IC dissimilarity function,
D(X; δµ) = T(f⁻¹(X; δµ)) − It(f(X; µ)),    (6.21)
is the result of a change of variables in the EFC dissimilarity function,
D(X; δµ) = T(X) − It(f(f(X; δµ); µ)).
However, the original IC formulation computes the warp function at the template, not its inverse (cf. [Baker and Matthews, 2004]),
D(X; δµ) = T(f(X; δµ)) − It(f(X; µ)).    (6.22)
Equations 6.21 and 6.22 are equivalent but they do not yield the same result: the parameters δµ computed from Equation 6.21 are different from those computed from Equation 6.22, but both yield the same parameters µt+1; the update function for Equation 6.21 is µt+1 = µt ◦ δµ, whereas Equation 6.22 computes µt+1 as µt+1 = µt ◦ δµ⁻¹.
Thus, although equivalent, EFC has an immediate advantage over the original IC:
efficiency in the EFC algorithm does not depend on any inversion, neither in the warp
nor in the parameters update function. Note that inversion may pose a problem for
warps such as 3DMM or AAM—as pointed out in [Romdhani and Vetter, 2003].
6.2 Requirements for Compositional Warps
In this section we state two requirements that every efficient compositional algorithm
should meet; note that we refer to efficient methods, that is, the EFC, IC, and GIC
algorithms. We intentionally leave out the FC algorithm, as it must verify only one
of the requirements.
6.2.1 Requirement on Warp Composition
The first requirement constrains the properties of the motion warp; [Baker and
Matthews, 2004] states a similar property by requiring the warp f to be closed
under composition, that is
f(x; µt+1) = f(f(x; δµ); µt), (6.23)
for some parameters µt, µt+1, and δµ. We generalize this property by allowing
the composition between different warps: f, g : X × P → X are two warps map-
ping domain X into itself, that are parameterized in the domain P; we state the
requirement as follows:
Requirement 2. The composition f ◦ g must be a warp f, that is, for any µ ∈ P there exist δµ ∈ P and µ′ ∈ P such that
f(X; µ′) = f(g(X; δµ); µ).
This generalization is useful for some warps—e.g. for plane+parallax homographies h3dpp (see Appendix C)—although in most cases we can safely assume that g ≡ f. Besides, there must exist identity parameters µ0 such that
g(X; µ0) = X.
This constraint is similar to the one proposed in [Baker and Matthews, 2004], where µ0 = 0. Requirement 2 is mandatory to transform the dissimilarity function
D(X; µt+1) = T(X) − It+1(f(X; µt+1)),    (6.24)
into the equivalent error function
D(X; µt+1) = T(X) − It+1(f(f(X; δµ); µt)),    (6.25)
which is intrinsic to every compositional algorithm—FC, EFC, IC, and GIC.
6.2.2 Requirement on Gradient Equivalence
The second requirement is absolutely necessary to achieve efficiency in compositional
algorithms.
Requirement 3. A GN-like algorithm with a constant Jacobian is feasible if the GEE holds for the brightness error.
Requirement 3 lets us transform the FC algorithm into the EFC constant-Jacobian algorithm (see Section 6.1.2). Furthermore, Requirement 3 allows us to effectively perform the change of variables needed by the IC algorithm (see Section 6.1.3). Notice that Requirement 3 is similar to Requirement 1 proposed for additive image registration algorithms; although both requirements are identical, we distinguish between them to have separate requirements for additive and compositional approaches.
6.3 Other Compositional Algorithms
[Baker et al., 2004b] introduced FC and IC as the basic algorithms for compositional
image registration. However, other authors have also proposed modifications to these
algorithms to extend their functionality. In this section we review the Generalized Inverse Compositional algorithm, which extends the IC to optimization methods other than GN.
6.3.1 Generalized Inverse Compositional Algorithm
We introduced the GIC algorithm [Brooks and Arbel, 2010] in Section 3.5: the motivation behind the GIC was to create an efficient algorithm—i.e. one with a constant Jacobian—that could be used with optimization methods with an additive update other than GN, such as bfgs.
We now review the GIC algorithm using the change-of-variable procedure that
we applied to IC. We recall the IC residuals from Equation 3.20,
r(δµ) ≡ T(f(x; δµ)) − It+1(f(x; µt)). (6.26)
We rewrite Equation 6.26 introducing a function ψ such that δµ = ψ(µt, µt+1); note that we can always define ψ because µt+1 and µt are related through f(x; µt+1) = f(f(x; δµ); µt). Thus, we rewrite Equation 6.26 as follows:
r(δµ) ≡ T(f(x; ψ(µt, µt+1))) − It+1(f(x; µt)).    (6.27)
Notice that Equation 6.27 does not explicitly depend on δµ, but on µt+1. However, the GIC algorithm implicitly defines this relationship as µt+1 = µt + δµ (as in the LK algorithm). Substituting this constraint in Equation 6.27 we have
r(µt + δµ) ≡ T(f(x; ψ(µt, µt + δµ))) − It+1(f(x; µt)). (6.28)
We linearize Equation 6.28 around µt by using Taylor series,
r(µt + δµ) ≃ ℓ(δµ) ≡ r(µt) + J(µt)δµ, (6.29)
where
r(µt) = T(f(x; ψ(µt, µt))) − It+1(f(x; µt)) = T(f(x; 0)) − It+1(f(x; µt)),    (6.30)
and
J(µt) = ∂T(f(x; ψ(µt; ˆµ)))/∂ˆµ |ˆµ=µt.    (6.31)
Notice that ψ(µt, µt) = 0 by definition. Unlike the Jacobian in the IC algorithm (Equation 3.22), the Jacobian in Equation 6.31 is not constant, as it depends on µt. However, we can obtain a pseudo-constant Jacobian from Equation 6.31 by using the chain rule:
J(µt) = ∂T(f(x; ψ(µt; ˆµ)))/∂ˆµ |ˆµ=µt
      = ( ∂T(f(x; ˆψ))/∂ˆψ |ˆψ=ψ(µt,µt)=0 ) ( ∂ψ(µt, ˆµ)/∂ˆµ |ˆµ=µt )
      = JIC(0) ∂ψ(µt, ˆµ)/∂ˆµ |ˆµ=µt.    (6.32)
The Jacobian matrix J(µt) is not constant: JIC(0) is constant but ∇µψ(µt) is not—it depends on µt. However, computing J(µt) is efficient, as we only need to compute ∇µψ(µt), which is a square matrix of the size of the number of parameters, and the matrix product of Equation 6.32.
Again, we compute the local minimizer of Equation 6.29 using least-squares:
δµ = −(J(µt)⊤J(µt))⁻¹J(µt)⊤r(µt).    (6.33)
Unlike in the IC algorithm, where the update is compositional, the GIC algorithm additively updates the current parameters with the descent direction. We outline the algorithm in Figure 6.3 and Algorithm 11.
Discussion of the GIC algorithm The GIC algorithm expresses the compositional optimization in terms of the usual gradient descent formulation. However, the whole procedure still depends upon the implicit change of variables of the IC residuals (cf. Equations 6.26 and 6.27). Thus, GIC must comply with the same requirements as IC, namely Requirements 2 and 3. This implication reduces the number of warps that provide good convergence for GIC. One could infer at first glance that GIC is a slower copy of IC; nonetheless, we must take into account the impact on the algorithm performance of using more powerful optimization schemes such as bfgs or Quasi-Newton [Brooks and Arbel, 2010].
Algorithm 11 Outline of the Generalized Inverse Compositional algo-
rithm.
On-line: Let µi = µ0 be the initial guess.
1: while no convergence do
2:   Compute the residual r(µi) from Equation 6.26.
3:   Linearize the dissimilarity: J = ∇µr(0) ∇µψ(µi), using Equation 6.32.
4:   Compute the search direction: δµi = −(J(µi)⊤ J(µi))⁻¹ J(µi)⊤ r(µi).
5:   Update the optimization parameters: µi+1 = µi + δµi.
6: end while
Figure 6.3: Generalized inverse compositional image registration. We compute
the target region on the frame t + 1 (Image) using the parameters of frame t (µt). Using
the target region at the Template we compute a Dissimilarity Measure. We linearize the
dissimilarity measure around 0 and we compute the descent direction in the search space
using Least-squares. We additively update the parameters and we recompute the target
region on frame t + 1 using the new parameters. The process is iterated until convergence.
88
Table 6.1: Relationship between compositional algorithms and warps
Warp Name                          FC      EFC     IC      GIC
2D Affine (2DAFF)                  YES     YES     YES     YES
Homography (H8)                    YES     YES     YES     YES
Plane-induced Homography (H6)      NO(1)   NO(1)   NO(1)   NO(1)
Plane+Parallax (H6PP)              YES     NO(2)   NO(2)   NO(2)
3D Rigid Body (3DRT)               YES     NO(2)   NO(2)   NO(2)
(1) Does not meet Requirement 2
(2) Does not meet Requirement 3
6.4 Summary
• In this chapter we have analysed in detail the compositional image alignment
approach, and introduced two requirements that a warp function must satisfy
to be used within this paradigm.
• We have introduced the Efficient Forward Compositional (EFC), a new com-
positional algorithm, and proved that it is equivalent to the well-known IC
algorithm. The EFC algorithm provides a new interpretation of IC that allows
us to state a basic requirement for the algorithm to be valid.
• We have also reviewed the GIC image alignment algorithm, and proved that
its requirements for convergence are the same as those of IC.
Table 6.2 summarizes the requirements and the principal characteristics of the al-
gorithms reviewed in Chapters 5 and 6. Table 6.1 compares each compositional
algorithm to the warps introduced in Chapter 4. We consider whether a warp is
suitable for an optimization algorithm or not—YES/NO in the table—in terms of
the compliance of the warp with the algorithm requirements.
89
Table 6.2: Requirements for Optimization Algorithms
Algorithm                                  Jacobian            Update Rule      Warp Requirements
Lucas-Kanade (LK)                          Variable            Additive         None
Hager-Belhumeur (HB)                       Part-constant(1)    Additive         Requirement 1
Forward Compositional (FC)                 Variable            Compositional    Requirement 2
Inverse Compositional (IC)                 Constant            Compositional    Requirements 2 and 3
Efficient Forward Compositional (EFC)      Constant            Compositional    Requirements 2 and 3
Generalized Inverse Compositional (GIC)    Part-constant(2)    Additive         Requirements 2 and 3
(1) The Jacobian is partially factorized
(2) The Jacobian is post-multiplied by a nonconstant matrix
90
Chapter 7
Computational Complexity
In this chapter we study the resources that the registration algorithms require to
solve a problem—i.e. their computational complexity. We organize the chapter as
follows: Section 7.1 describes the measures and the criteria that we shall use to
compare the complexities of the algorithms. Section 7.2 introduces the algorithms
that we shall experimentally evaluate in later chapters; we propose a naming con-
vention for the algorithms and we define two sets of testing algorithms: additive
and compositional algorithms. Finally, in Section 7.3, we compute the theoretical
complexity for each algorithm and provide some comparisons between them.
7.1 Complexity Measures
We can measure the complexity of an algorithm using either (1) the time that it re-
quires, its time complexity, or (2) the computational resources—i.e. memory space—
that it requires, its space complexity. Both measures are usually
expressed as a function of the length of the input: the running time of an algorithm
depends on the size of the input (e.g. larger problems require more run-time or
more memory). In the analysis of algorithms, which provides theoretical estimates for
the resources needed by an algorithm, big O notation describes the usage of
computational resources as a function of the problem size. For example, finding
an item in an unsorted vector takes O(n) time, where n is the length of the vector.
Run-time is expressed as a function of that length, so if we increase the size of the
vector, the time complexity of the algorithm remains O(n).
7.1.1 Number of Operations
Although big O notation is the most common measure in algorithm analysis, we
prefer to define our own measure. The reason is that big O notation is not suitable
for fine-grained comparisons: for example, both the IC and HB algorithms yield O(n)
complexity, although we know that the former is more efficient than the latter. We
provide a fine-grained comparison by using the number of operations. We define the
number of operations of an algorithm, Θ, as the total aggregate of multiplications
and additions for each step of the algorithm. The number of operations of some
algorithm alg, Θalg, is written as
Θalg = ⟨number of multiplications⟩ M + ⟨number of additions⟩ A,   (7.1)
where M and A respectively stand for multiplications and additions. Notice that
the + operator is used only for the sake of notation: it indicates that
the total number of operations is the aggregate of the number of multiplications and
the number of additions, but it is not an actual addition. As with big O notation,
the number of operations of an algorithm depends on the problem size. We use two
variables to account for the scale of the problem: NΩ represents the size of the
visible target region, and K represents the number of deformation bases.
7.1.2 Complexity of Matrix Operations
In this section we describe the number of operations for the most common matrix
operations: the dot product, the matrix product, and the matrix summation.
Vector Scalar Product : If a = (a1, . . . , an)⊤ and b = (b1, . . . , bn)⊤ are n × 1
vectors, we define their scalar or dot product as
a⊤b = a1 × b1 + . . . + an × bn.   (7.2)
We compute the number of operations of the vector scalar product by counting
the number of products and sums in Equation 7.2,
Θa⊤b = ⟨n⟩ M + ⟨n − 1⟩ A.   (7.3)
The complexity depends on the number of elements of the vector, n.
Matrix Product : If A is an m × n matrix and B is an n × p matrix, then the
matrix product AB is the m × p matrix of dot products
AB = ( a⊤i bj ),   i = 1, . . . , m,   j = 1, . . . , p,   (7.4)
92
Table 7.1: Complexity of matrix operations.
Operation                          Multiplications    Additions
a⊤b   (a, b of size n × 1)         n                  n − 1
ab⊤   (a, b of size n × 1)         n²                 0
AB    (A: m × n, B: n × p)         mpn                mp(n − 1)
A + B (A, B of size m × n)         0                  mn
where a⊤i = (ai1, . . . , ain) is the i-th row of matrix A, and bj = (b1j, . . . , bnj)
is the j-th column of matrix B. Equation 7.4 thus reformulates the matrix
product as m × p dot products of rows and columns. Hence, we compute the
number of operations of the matrix product from the Θ of the scalar product:
ΘAB = (mp) Θa⊤b = ⟨mpn⟩ M + ⟨mp(n − 1)⟩ A.   (7.5)
Matrix Addition : If A and B are m × n matrices, then their sum is the matrix
of entrywise sums,
A + B = ( aij + bij ),   i = 1, . . . , m,   j = 1, . . . , n.   (7.6)
There are no multiplication operations in the definition, so the complexity of
summing up two matrices is
ΘA+B = ⟨mn⟩ A.   (7.7)
We summarize the complexities of matrix operations in Table 7.1.
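As an illustration of this bookkeeping, the small Python helpers below (the naming is ours, not part of the thesis) return the (multiplications, additions) pairs of Table 7.1.

def ops_dot(n):
    """a^T b with a and b of length n (Equation 7.3)."""
    return (n, n - 1)

def ops_outer(n):
    """a b^T with a and b of length n."""
    return (n * n, 0)

def ops_matmul(m, n, p):
    """A (m x n) times B (n x p) (Equation 7.5)."""
    return (m * p * n, m * p * (n - 1))

def ops_matadd(m, n):
    """A + B with A and B of size m x n (Equation 7.7)."""
    return (0, m * n)

# Example: for a Jacobian with N rows and 6 columns, the product J^T J costs
# ops_matmul(6, N, 6) = (36*N, 36*N - 36), which reproduces Step 3 of Table 7.4.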
7.1.3 Comparing Algorithm Complexities
When computing the complexity of an algorithm—using the operations from Ta-
ble 7.1—we shall compare two algorithms as fairly as we can by using the following
conventions.
Only Count Non-Zero Operations If we compute the product of two matrices,
we only take into account those operations that affect non-zero entries. A typical
example is when a matrix multiplication involves a Kronecker product. Let a and b
be n × 1 vectors and Im be the m × m identity matrix; the product (Im ⊗ a)(Im ⊗ b⊤)
is the block-diagonal matrix
(Im ⊗ a)(Im ⊗ b⊤) = diag(ab⊤, ab⊤, . . . , ab⊤),   (7.8)
with m identical n × n blocks ab⊤ on its diagonal and n × n zero blocks elsewhere. The complexity
of this matrix product, computed with the general formula of Table 7.1, is
Θ = ⟨m³n²⟩ M + ⟨m²n²(m − 1)⟩ A. Nonetheless, Equation 7.8 shows that many of
these sums and products operate over zero entries of the matrices; hence, these
operations can be spared. In fact, the non-zero operations lie on the block-diagonal
of the result matrix: the diagonal comprises m outer products ab⊤, each of which costs
⟨n²⟩ M. Thus, the total number of non-zero operations for Equation 7.8 is Θ = ⟨mn²⟩ M.
Neglect Duplicated Operations If an operation has to be computed several
times, we count that operation just once. A typical example is a matrix
product with repeated entries (as in the Kronecker product): in Equation 7.8, each
matrix in the block-diagonal of the result is the same product ab⊤; we therefore
count its complexity as ⟨n²⟩ M instead of ⟨mn²⟩ M.
Matrix Chain Multiplication As we showed in Section 5.3, the factorization of
the Jacobian matrix is generally not unique. Furthermore, given a single factoriza-
tion in the form of a chain of matrix products, there are many ways to choose the order
in which we perform the multiplications. Matrix chain multiplication, or matrix
parenthesization, is a well-known optimization problem [Cormen et al., 2001]. We
may use dynamic programming [Neapolitan and Naimipour, 1996] to compute the
most efficient multiplication order for our factorization.
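For illustration, the standard dynamic-programming solution of the matrix-chain ordering problem [Cormen et al., 2001] can be sketched as follows; this is generic textbook code in Python, not the implementation used in the thesis.

def matrix_chain_cost(dims):
    """dims[i], dims[i+1] are the row/column sizes of the i-th matrix in the chain."""
    n = len(dims) - 1                       # number of matrices
    cost = [[0] * n for _ in range(n)]      # cost[i][j]: best cost for A_i..A_j
    for length in range(2, n + 1):          # increasing chain length
        for i in range(n - length + 1):
            j = i + length - 1
            cost[i][j] = min(
                cost[i][k] + cost[k + 1][j] + dims[i] * dims[k + 1] * dims[j + 1]
                for k in range(i, j)
            )
    return cost[0][n - 1]

# Example: (10 x 30)(30 x 5)(5 x 60) -> 4500 multiplications with ((A1 A2) A3).
print(matrix_chain_cost([10, 30, 5, 60]))   # 4500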
7.2 Algorithm Naming Conventions
We define a testing algorithm as the combination of an optimization scheme and
a warp. We write a given algorithm using a fixed-width font—e.g. HB3DMM is the
combination of the optimization algorithm HB and the warp 3DMM.
94
7.2.1 Additive Algorithms
We present the testing algorithms that use additive update in Table 7.2. For con-
venience, we keep the naming convention for optimization algorithms that we used
in Chapter 3; warp names are accordingly taken from Table 4.1 and Chapter 5.
Table 7.2: Additive testing algorithms.
Algorithm    Warp                                        Optimization        Commentaries
LK3DTM       3D Shape-induced Homography (f3dtm)         LK (Algorithm 2)    We use the original GN algorithm from [Lucas and Kanade, 1981].
HB3DTM       3D Shape-induced Homography (f3dtm)         HB (Algorithm 3)    We implement Algorithm 7 (see page 64).
HB3DRT       3D Rigid Body (f3drt)                       HB (Algorithm 3)    We use the original algorithm from [Sepp and Hirzinger, 2003].
LK3DMM       Nonrigid Shape-induced Homography (f3dmm)   LK (Algorithm 2)    We use the original GN algorithm from [Lucas and Kanade, 1981].
HB3DMM       Nonrigid Shape-induced Homography (f3dmm)   HB (Algorithm 3)    We implement Algorithm 8 (see page 75).
HB3DMMSF     Nonrigid Shape-induced Homography (f3dmm)   HB (Algorithm 3)    We implement Algorithm 9 (see page 76).
LKH8         Cartesian Homography (fh82d)                LK (Algorithm 2)    We use the original GN algorithm from [Lucas and Kanade, 1981].
LKH6         Plane-induced Homography (fh6p)             LK (Algorithm 2)    —
95
7.2.2 Compositional Algorithms
We present the testing algorithms that use compositional update in Table 7.3. For
convenience, we keep the naming convention for optimization algorithms that we
used in Chapter 3; warp names are accordingly taken from Table 4.1 and Chapter 5.
Table 7.3: Compositional testing algorithms.
Algorithm    Warp                                Optimization         Commentaries
ICH8         Cartesian Homography (fh82d)        IC (Algorithm 5)     —
GICH8        Cartesian Homography (fh82d)        GIC (Algorithm 11)   —
ICH6         Plane-induced Homography (fh6p)     IC (Algorithm 5)     —
GICH6        Plane-induced Homography (fh6p)     GIC (Algorithm 11)   —
FCH6PP       Plane+Parallax Homography (fh6pp)   FC (Algorithm 4)     —
7.3 Complexity of Algorithms
In this section we show the computational complexity of several testing algorithms.
We are especially interested in comparing the complexities of additive algorithms:
we show that the extensions to the HB algorithm to track 3D targets—either rigid
or nonrigid—are much more efficient than their LK counterparts.
A Word About Implementation The complexity of a testing algorithm is re-
lated to an iteration of the optimization loop: the total complexity is the sum of
the complexities of each step of the algorithm. We ensure the scalability of our
estimations of complexity by using the following variables:
96
1. NΩ is the number of visible points in the current view, i.e. NΩ = |Ω| (see
Page 122). This variable measures how many times a given operation is
repeated (once per visible point in the target region).
2. K is the number of deformation components of the morphable model.
Another implementation issue is how we deal with derivatives. We compute such
derivatives as image gradients and the Jacobian by using central differences [Press
et al., 1992],
Jˆµi = ( r(µi + δ) − r(µi − δ) ) / (2δ),   (7.9)
where Jˆµi = ∇ˆµi r(µ) is the i-th column of the Jacobian matrix of the iterative
optimization. We also compute the derivatives of the function ψ in the GIC algorithm (see
Equation 6.32) by using numerical differentiation instead of explicit methods.
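A minimal NumPy sketch of the central-difference Jacobian of Equation 7.9 follows; residual is a hypothetical callable mapping a parameter vector µ of length P to the residual vector r(µ) of length N.

import numpy as np

def numerical_jacobian(residual, mu, delta=1e-4):
    r0 = residual(mu)
    J = np.zeros((r0.size, mu.size))
    for i in range(mu.size):
        step = np.zeros_like(mu)
        step[i] = delta
        # i-th column: (r(mu + delta*e_i) - r(mu - delta*e_i)) / (2*delta)
        J[:, i] = (residual(mu + step) - residual(mu - step)) / (2.0 * delta)
    return J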
7.3.1 Additive Algorithms
Tables 7.4–7.9 show the complexities for the additive algorithms from Table 7.2. We
break down every algorithm into its basic steps; we compute the number of operations
for each step by using the conventions in Section 7.1. The detailed steps of the
derivation are shown in Appendix H. The final complexity is the summation of the
number of operations for each step of the algorithm.
Table 7.4: Complexity of Algorithm LK3DTM.
Step  Action                                        Multiplications   Additions
1.    Compute visibility set Ω.                     —                 —
2.    Compute J                                     894NΩ             685NΩ
      - Compute f3dtm using Equation 5.10.          74                51
      - Compute J using Equation 7.9.               6 × 149           6 × 115
3.    Compute J⊤J.                                  36NΩ              −36 + 36NΩ
4.    Invert J⊤J.                                   —                 —
5.    Compute r(µ) (Equation 5.12).                 74NΩ              52NΩ
6.    Compute J⊤r(µ).                               6NΩ               −6 + 6NΩ
7.    Compute (J⊤J)⁻¹ J⊤r(µ).                       36                30
TOTAL                                               36 + 1010NΩ       −12 + 779NΩ
We summarize the total complexities for additive registration algorithms in Ta-
ble 7.10. Direct comparison of values from Table 7.10 is difficult as the complexities
depend upon the variables NΩ and K. We ease the comparison by plotting the
complexities for different values of NΩ and K in Figure 7.1.
97
Table 7.5: Complexity of Algorithm HB3DTM.
Step  Action                                        Multiplications   Additions
1.    Compute visibility set Ω.                     —                 —
2.    Compute J                                     81 + 75NΩ         54 + 66NΩ
      - Compute R⊤ ˙R{α,β,γ}.                       81                54
      - Compute J using Equation 5.31.              75                66
3.    Compute J⊤J.                                  36NΩ              −36 + 36NΩ
4.    Invert J⊤J.                                   —                 —
5.    Compute r(µ) (Equation 5.12).                 74NΩ              52NΩ
6.    Compute J⊤r(µ).                               6NΩ               −6 + 6NΩ
7.    Compute (J⊤J)⁻¹ J⊤r(µ).                       36                30
TOTAL                                               117 + 191NΩ       42 + 160NΩ
Table 7.6: Complexity of Algorithm LK3DMM.
Step  Action                                   Multiplications               Additions
1.    Compute visibility set Ω.                —                             —
2.    Compute J                                (1002+203K+6K²)NΩ             (762+175K+3K²)NΩ
      - Compute f3dmm using Equation 5.42.     83+3K                         57+3K
3.    Compute J⊤J.                             (36+12K+K²)NΩ                 (−36−12K−K²) + (36+12K+K²)NΩ
4.    Invert J⊤J.                              —                             —
5.    Compute r(µ) from Equation 5.44.         (83+3K)NΩ                     (58+3K)NΩ
6.    Compute J⊤r(µ).                          (6+K)NΩ                       (−6−K) + (6+K)NΩ
7.    Compute (J⊤J)⁻¹ J⊤r(µ).                  36+K²                         30+K²
TOTAL                                          (36+K²) + (1127+207K+7K²)NΩ   (−12−K) + (863+179K+9K²)NΩ
98
Table 7.7: Complexity of Algorithm HB3DMMNF.
Step  Action                                   Multiplications               Additions
1.    Compute visibility set Ω.                —                             —
2.    Compute J                                81 + (219+24K)NΩ              54 + (171+16K)NΩ
      - Compute R⊤ ˙R{α,β,γ}.                  81                            54
3.    Compute J⊤J.                             (36+12K+K²)NΩ                 (−36−12K−K²) + (36+12K+K²)NΩ
4.    Invert J⊤J.                              —                             —
5.    Compute r(µ) from Equation 5.44.         (83+3K)NΩ                     (58+3K)NΩ
6.    Compute J⊤r(µ).                          (6+K)NΩ                       (−6−K) + (6+K)NΩ
7.    Compute (J⊤J)⁻¹ J⊤r(µ).                  36+K²                         30+K²
TOTAL                                          (117+K²) + (344+40K+K²)NΩ     (42−13K) + (271+32K+K²)NΩ
Results for 3D rigid targets show that HB performs roughly 80% fewer operations
than its LK counterpart (see Figure 7.1–(Top-left)). Results for nonrigid targets
are similar to those for rigid ones: the HB algorithm that uses a semi-factorization
approach (HB3DMMSF) is six times faster—84% fewer operations—than its LK equivalent
(LK3DMM). The resulting complexities are similar for the three nonrigid cases—K =
6, 9, 15—in terms of speed gain, although the absolute numbers change according to
the size of the deformation basis: the bigger the basis, the higher the number of
operations.
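The quoted figure for the rigid case can be checked directly from the leading per-point terms of Tables 7.4 and 7.5 (the constant terms are negligible for large NΩ); the short computation below is only an illustration of that arithmetic.

# Per-point operations (multiplications + additions) from Tables 7.4 and 7.5.
lk3dtm_per_point = 1010 + 779
hb3dtm_per_point = 191 + 160
saving = 1.0 - hb3dtm_per_point / lk3dtm_per_point
print(f"HB3DTM saves about {100 * saving:.1f}% of the operations")   # ~80.4%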
99
Table 7.8: Complexity of Algorithm HB3DMM.
Step  Action                                   Multiplications               Additions
1.    Compute visibility set Ω.                —                             —
2.    Compute J                                (210+280K+72K²)NΩ             (204+280K+72K²)NΩ
      - Compute S⊤1 Mi, for i = 1, . . . , 3.  63+81K+18K²                   62+81K+18K²
      - Compute S⊤2 M4.                        21+6K                         20+6K
      - Compute S⊤3 M5.                        31K+18K²                      −1+31K+18K²
3.    Compute J⊤J.                             (36+12K+K²)NΩ                 (−36−12K−K²) + (36+12K+K²)NΩ
4.    Invert J⊤J.                              —                             —
5.    Compute r(µ) from Equation 5.44.         (83+3K)NΩ                     (58+3K)NΩ
6.    Compute J⊤r(µ).                          (6+K)NΩ                       (−6−K) + (6+K)NΩ
7.    Compute (J⊤J)⁻¹ J⊤r(µ).                  36+K²                         30+K²
TOTAL                                          (36+K²) + (335+296K+73K²)NΩ   (−12−13K) + (304+296K+73K²)NΩ
Figure 7.1 also points out the advantages of using a factorization procedure:
the semi-factorization approach reduces the number of operations by about
30%—compare algorithms HB3DMMNF and HB3DMMSF in Figure 7.1. However, the full-
factorization HB scheme (HB3DMM) is much slower than LK3DMM due to the difficulties
of the factorization: completely separating motion and structural variables requires so
many resources that it cancels out the advantages of the HB algorithm.
100
Table 7.9: Complexity of Algorithm HB3DMMSF.
Step  Action                                               Multiplications               Additions
1.    Compute visibility set Ω.                            —                             —
2.    Compute J                                            81 + (60+18K)NΩ               54 + (36+14K)NΩ
      - Compute R⊤ ˙R{α,β,γ}.                              81                            54
      - Compute J(i)1, . . . , J(i)6+K using Equations 8.5. 60+18K                        36+14K
3.    Compute J⊤J.                                         (36+12K+K²)NΩ                 (−36−12K−K²) + (36+12K+K²)NΩ
4.    Invert J⊤J.                                          —                             —
5.    Compute r(µ) from Equation 5.44.                     (83+3K)NΩ                     (58+3K)NΩ
6.    Compute J⊤r(µ).                                      (6+K)NΩ                       (−6−K) + (6+K)NΩ
7.    Compute (J⊤J)⁻¹ J⊤r(µ).                              36+K²                         30+K²
TOTAL                                                      (117+K²) + (185+34K+K²)NΩ     (42−13K) + (136+30K+K²)NΩ
Table 7.10: Complexities of Additive Algorithms.
Algorithm               Multiplications                    Additions
LK3DTM (Table 7.4)      36 + 1010NΩ                        −12 + 779NΩ
HB3DTM (Table 7.5)      117 + 191NΩ                        42 + 160NΩ
LK3DMM (Table 7.6)      36+K² + (1127+207K+7K²)NΩ          −12−K + (863+179K+9K²)NΩ
HB3DMMNF (Table 7.7)    117+K² + (344+40K+K²)NΩ            42−13K + (271+32K+K²)NΩ
HB3DMM (Table 7.8)      36+K² + (335+296K+73K²)NΩ          −12−13K + (304+296K+73K²)NΩ
HB3DMMSF (Table 7.9)    117+K² + (185+35K+K²)NΩ            42−13K + (136+30K+K²)NΩ
101
Figure 7.1: Complexity of Additive Algorithms. (Top-Left) Number of operations
vs. target size for additive rigid registration algorithms: algorithm LK3DTM (blue line),
and algorithm HB3DTM (red line). We also display the number of operations vs. target
size for additive nonrigid registration algorithms: LK3DMM (red), HB3DMM (blue), HB3DMMNF
(magenta), and HB3DMMSF (green). We compare the complexities for different numbers
of modes of deformation: K = 6 (Top-right), K = 9 (Bottom-left), and K = 15
(Bottom-right).
102
7.3.2 Compositional Algorithms
Tables 7.11–7.14 show the complexities for some compositional registration algo-
rithms. We show the detailed derivation in Appendix H. We report the number of
operations of each algorithm by using a procedure similar to that of Section 7.3.1. Instead
of directly comparing all the algorithms from Table 7.3, we contrast their complexities
for the particular case of an 8-dof homography. We also include the additive algorithm LKH8
for the sake of comparison with its compositional counterparts.
Table 7.11: Complexity of Algorithm LKH8.
Step  Action                                                    Multiplications   Additions
1.    Compute J                                                 160NΩ             104NΩ
      - Compute fH8.                                            11                6
      - Compute J using central differences (Equation 7.9).     8 × 20            8 × 13
2.    Compute J⊤J.                                              64NΩ              −64 + 64NΩ
3.    Invert J⊤J.                                               —                 —
4.    Compute r(µ) (Equation 3.3).                              11NΩ              6NΩ
5.    Compute J⊤r(µ).                                           8NΩ               −8 + 8NΩ
6.    Compute (J⊤J)⁻¹ J⊤r(µ).                                   64                56
TOTAL                                                           64 + 243NΩ        −16 + 182NΩ
Table 7.12: Complexity of Algorithm ICH8.
Step  Action                                   Multiplications   Additions
1.    Compute JIC.                             —                 —
2.    Compute J⊤IC JIC.                        —                 —
3.    Invert J⊤IC JIC.                         —                 —
4.    Compute r(µ) (Equation 3.3).             11NΩ              6NΩ
5.    Compute J⊤IC r(µ).                       8NΩ               −8 + 8NΩ
6.    Compute (J⊤IC JIC)⁻¹ J⊤IC r(µ).          64                56
TOTAL                                          64 + 19NΩ         48 + 14NΩ
103
We choose Lucas-Kanade (LKH8) to be the nonefficient algorithm, whereas the
efficient algorithms are respectively Hager-Belhumeur (HBH8), Inverse Compositional
(ICH8), and Generalized Inverse Compositional (GICH8). We compute the complexity
for one iteration of the search loop for each algorithm and show the results in
Tables 7.11–7.14. As in the additive case, we consider certain constant
operations negligible (denoted by —), such as inverting the Hessian matrix or computing an
offline Jacobian (such as JIC). We also consider the case where the Hessian of the
optimization in algorithm ICH8 is not constant—e.g., due to partial occlusion of the
target; in this case, although the Jacobian JIC is constant, we still have to compute
the matrix product J⊤IC JIC (see Table 7.12, Step 2).
Table 7.13: Complexity of Algorithm HBH8.
Step  Action                                                          Multiplications   Additions
1.    Compute J                                                       24NΩ              16NΩ
      - Compute J = M0Σ(µ) using [Buenaposada and Baumela, 2002].     24                16
2.    Compute J⊤J.                                                    64NΩ              −64 + 64NΩ
3.    Invert J⊤J.                                                     —                 —
4.    Compute r(µ) (Equation 3.3).                                    11NΩ              6NΩ
5.    Compute J⊤r(µ).                                                 8NΩ               −8 + 8NΩ
6.    Compute (J⊤J)⁻¹ J⊤r(µ).                                         64                56
TOTAL                                                                 64 + 107NΩ        −16 + 94NΩ
Table 7.14: Complexity of Algorithm GICH8.
Step  Action                                   Multiplications   Additions
1.    Compute JGIC                             448 + 64NΩ        288 + 56NΩ
      - Compute JIC.                           —                 —
      - Compute ∇µψ(µ).                        8 × 56            8 × 36
      - Compute JGIC = JIC × ∇µψ(µ).           64NΩ              56NΩ
2.    Compute J⊤GIC JGIC.                      64NΩ              −64 + 64NΩ
3.    Invert J⊤GIC JGIC.                       —                 —
4.    Compute r(µ) (Equation 3.3).             11NΩ              6NΩ
5.    Compute J⊤GIC r(µ).                      8NΩ               −8 + 8NΩ
6.    Compute (J⊤GIC JGIC)⁻¹ J⊤GIC r(µ).       64                56
TOTAL                                          512 + 147NΩ       272 + 134NΩ
104
Figure 7.2: Complexity of Compositional Algorithms. Number of operations vs.
target size for compositional registration algorithms: LKH8 (red), HBH8 (blue), ICH8 (green),
and GICH8 (magenta). We also include the ICH8 algorithm with variable Hessian (light
blue) for the sake of comparison.
We summarize the results in Table 7.15; direct inspection of these results shows
that one iteration of the IC algorithm requires fewer operations than the remaining
algorithms—at least ten times fewer than the equivalent LK iteration. We also plot the results
in Figure 7.2 for ease of comparison—we plot the complexities for different values
of NΩ.
The results show that the algorithms fall into three categories, with IC being the
fastest, HB and GIC being in the medium range, and LK being the slowest by a
considerable margin.
7.4 Summary
This chapter computes the computational cost of those image registration algo-
rithms that we shall experimentally evaluate later. We summarize the comparison
of complexities among the different algorithms in Tables 7.16 and 7.17; we read the
tables as follows: the computational cost of the algorithm in the i-th row serves
as the basis to compute the percentage of increase or decrease of cost with respect
to the algorithms in the corresponding columns—e.g., in Table 7.16, in comparison
with algorithm LK3DMM, the algorithm HB3DMMNF is 77% faster, algorithm HB3DMMSF
is 84% faster, and HB3DMM is 93% slower. Thus, for the nonrigid case, a semi-
factorization approach (HB3DMMSF) is more efficient than a proper full factorization
(HB3DMM), and only slightly better than no factorization at all but swapping gra-
dients (HB3DMMNF). For the rigid case, the HB algorithm is 80% faster than the
corresponding LK.
Table 7.15: Complexities of Compositional Algorithms.
Algorithm                             Multiplications   Additions
LKH8 (Table 7.11)                     64 + 243NΩ        −16 + 182NΩ
HBH8 (Table 7.13)                     64 + 107NΩ        −16 + 94NΩ
ICH8 (Table 7.12)                     64 + 19NΩ         18 + 14NΩ
ICH8 (Table 7.12, variable Hessian)   64 + 83NΩ         −48 + 78NΩ
GICH8 (Table 7.14)                    512 + 147NΩ       272 + 134NΩ
We summarize the results for compositional algorithms in Table 7.17. Results
show that IC is much more efficient than the usual LK—about a ten-fold speed-up—
and much faster than efficient algorithms with a nonconstant Jacobian—about five
times faster than HB and GIC. However, if the Hessian of algorithm ICH8 is not
constant, the IC algorithm is only 62% faster than LK, and HB is only 24% slower.
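The relative figures of Table 7.17 can be reproduced from the per-point terms of Table 7.15 (neglecting the constant terms); the following short Python computation is only an illustration of that arithmetic.

# Per-point operations (multiplications + additions) from Table 7.15.
per_point = {
    "LKH8":  243 + 182,
    "HBH8":  107 + 94,
    "ICH8":  19 + 14,
    "GICH8": 147 + 134,
}
for name, ops in per_point.items():
    if name != "LKH8":
        saving = 1 - ops / per_point["LKH8"]
        print(f"{name} vs LKH8: {100 * saving:.1f}% fewer operations")
# HBH8 ~52.7%, ICH8 ~92.2%, GICH8 ~33.9%, matching the first row of Table 7.17.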
Table 7.16: Comparison of Relative Complexities for Additive Algorithms
              HB3DTM      HB3DMM      HB3DMMNF    HB3DMMSF
LK3DTM        80.3800%    —           —           —
LK3DMM        —           93.5067%    77.0790%    84.0844%
HB3DMM        —           —           88.1550%    91.7752%
HB3DMMNF      —           —           —           30.5630%

Table 7.17: Comparison of Relative Complexities for Compositional Algorithms
                   ICH8 (Var. Hes.)   HBH8        GICH8       ICH8
LKH8               62.1176%           52.7058%    33.8805%    92.2350%
ICH8 (Var. Hes.)   —                  24.8446%    74.5387%    79.5024%
HBH8               —                  —           39.8047%    83.5815%
GICH8              —                  —           —           88.2562%
106
Chapter 8
Experiments
In this chapter we describe the experiments that validate the theoretical results
that we introduced in Chapters 5 and 6. We demonstrate in our experiments (1)
the influence of the Gradient Equivalent Equation (GEE) on the convergence of the
algorithms, and (2) the correctness of the hypotheses about the complexity of the
optimization algorithms. We systematize the comparison of algorithms by using a set
of standardized measures. These measures describe certain features of the algorithms
such as efficiency, accuracy or robustness.
We organize the chapter as follows: Section 8.1 draws a distinction between regis-
tration and tracking for efficient algorithms, and introduces basic hypotheses about
the algorithms; Section 8.2 introduces the qualitative features that our experiments
should exploit together with the quantitative measures that we shall use to verify
the former; Section 8.3 describes the procedure that we use to generate the syn-
thetic data needed by our experiments; Section 8.4 discusses some aspects relative
to the implementation of our algorithms; Section 8.5 describes the experiments us-
ing additive algorithms, and Section 8.6 evaluates compositional algorithms; finally,
Section 8.7 summarizes the results and provides some discussion about them.
Notation
Our testing algorithms are iterative: from an initial estimate µ0 in the search space
R^P we iterate until we find an optimum µ∗ ∈ R^P. We optimize the cost function
by descending along the steepest gradient direction given by the Jacobian matrix
J. The Jacobian is constant for efficient optimization methods such as IC or EFC; in
this case, we denote by µJ ∈ R^P the fixed parameters at which we compute this
constant Jacobian.
8.1 Motivation
The purpose of the experiments is to test a set of hypotheses that describe func-
tional characteristics of the algorithms. We aim to demonstrate different properties
about the algorithms such as convergence or efficiency. We informally present these
107
hypotheses in the following questions:
Efficient Registration and Tracking In Section 2.1 we stated the generic differ-
ences between image registration and tracking. However, efficient registration/tracking
algorithms—such as HB, IC, GIC, or EFC—have subtle differences when used for reg-
istration or tracking. By construction, efficient algorithms compute the Jacobian
matrix used in the iterative optimization process at a fixed location µJ. This Ja-
cobian matrix may be either constant—as in IC or EFC algorithm—or partially
constant—as in HB or GIC. We show in the following the main differences between
image registration and tracking when using efficient methods:
Number of images
Efficient registration involves an image and a template: the algorithm warps
the image so that its texture and the template texture coincide (see Fig-
ure 8.1). On the other hand, efficient template-based tracking involves a se-
quence of images and a template: the tracking algorithm searches for the
template in each image of the sequence; besides, the template may not even
be an image of the sequence—see Figure 8.1.
Algorithm initialization
Direct registration methods imply by definition that the template and the
registered/tracked image overlap to some extent: the error between template
and image is linearized into a gradient descent scheme whose outputs are the
target parameters. Thus, it is critical for the algorithm performance to choose
a proper initialization of the optimization procedure.
In registration problems, the template and the registered image must suffi-
ciently overlap (see Figure 8.1). The registration algorithm is usually initialized
at µ∗template, the location of the target region on the template—i.e.,
µ0 = µ∗template, see Figure 8.2. The initial guess must be close enough to µ∗,
the actual target parameters at the registered image: the regions defined by
µ0 and µ∗ in the image must overlap so the image error can be linearized—cf.
Figure 8.2–(Top-right).
In tracking problems, the template and the tracked image may be arbi-
trarily different (see Figure 8.1). The tracking algorithm is not initialized at
the template—i.e., µ0 ≠ µ∗template—but at the previous target location in the
sequence: for the sequence frame captured at t + 1, we initialize the itera-
tive tracking procedure at the optimum computed at frame t, µ∗t—i.e.,
µ0 = µ∗t. Again, the initial guess must be close enough to the actual target
location µ∗t+1 so the error can be linearized: the regions defined by µ∗t and µ∗t+1
must overlap, which is equivalent to saying that the inter-frame differences must
be small enough—see Figure 8.2–(Bottom-right). Notice that the image in
Figure 8.2–(Bottom-right) can be tracked in a sequence but not registered,
as the intersection of the regions defined by µ∗template and µ∗t+1 is empty.
108
Figure 8.1: Registration vs. Tracking. (Top) The registration aligns regions on the
template and the image (green squares). The algorithm warps the image (pink square)
such that the intensity values of image and template coincide. (Bottom) The tracking
algorithm searches for those regions in the sequence whose texture coincides with the tem-
plate. The output is a vector containing the state of the target region (position, orientation,
scale, etc).
109
Figure 8.2: Algorithm initialization. (Top-left) Template image where the Jacobian
matrix is computed at the fixed parameters µJ = µ∗template (green square). (Top-right)
Image to be registered to the template with the actual parameters µ∗ of the target region
(green square). We initialize the registration procedure with the target location at the
template—i.e., µ0 is µ∗template. (Bottom-left) Frame t in the image sequence. We show
the actual target parameters at time t, µ∗t. (Bottom-right) Frame t + 1 in the image
sequence with the actual target parameters µ∗t+1 (green square). The tracking algorithm
is not initialized at µ∗template (yellow square) but at the previous target location µ∗t (pink
square).
110
Table 8.1: Registration vs. tracking in efficient methods
Registration                                      Tracking
The aim is to align image regions                 The aim is to recover the target state
                                                  (position, orientation, velocity, etc.)
An image is registered against the template       A sequence of images is tracked against
                                                  the template(1)
The template and the registered image must        The template and the tracked image may
overlap to some extent                            not overlap
The Jacobian is computed at the initial guess     The Jacobian is computed far away from
of the optimization                               the initial guess of the optimization
(1) The template may not even be a part of the sequence.
Efficient Jacobian
Efficient algorithms compute—partially or totally—the Jacobian matrix of the
brightness error at the fixed parameters µJ. It is usually assumed that
efficient algorithms can be used for either registration or tracking. However,
we show that the assumptions behind efficient algorithms behave very differently
in the two problems.
In registration problems, the efficient Jacobian is computed at the location
of the template, which happens to be also the initial guess for the iterative
registration procedure—i.e., µJ = µ∗template = µ0, see Figure 8.2. This is the
case of the experiments in [Baker and Matthews, 2004] or [Brooks and Arbel,
2010]. The optimization procedure starts with the actual Jacobian at µ0,
which remains constant—or partially constant—for the rest of the iterations
of the algorithm.
In tracking problems, the efficient Jacobian is computed at the location of
the template, which is usually very different from the initial guess for the
iterative tracking procedure—i.e., µJ = µ∗template ≠ µ0, see Figure 8.2. This is
what happens in the experiments in [Hager and Belhumeur, 1998] or [Cobzas
et al., 2009]. In this case, the optimization procedure does not start with the
actual Jacobian—i.e., the Jacobian computed at µ0—but with that computed
at µ∗template. This Jacobian must be somehow transformed by the parameters µ∗t
so that it can be used at frame t + 1 (see Figure 8.2).
We summarize the differences between registration and tracking for efficient algo-
rithms in Table 8.1. In this chapter we show that efficient algorithms that satisfy
their requirements can be used either for tracking or for registration. If the efficient
algorithms do not satisfy their requirements, then they are suitable for registration
but not for tracking.
111
Do Warp Requirements Influence Convergence? In previous sections we
stated the requirements that warps should meet to work with efficient registration
algorithms (see Table 6.1–page 89). We are interested in studying how the com-
pliance of these warp requirements affects algorithms convergence. Our hypothesis
is that the optimization successfully converges if and only if the warp meets its
requirements—e.g. any IC-based optimization algorithm shall converge if its warp
holds Requirements 2 and 3 (cf. Table 6.1).
However, proving this hypothesis is not easy: an algorithm may converge close
to the optimum even when none of its Requirements are met, so we are not facing a
true-or-false situation. Recall that efficient algorithms replace the actual gradient
with an approximation provided by the GEE (see Sections 5.1 and 6.2). Thus, if
the approximated gradient is sufficiently similar to the actual one—that is, the
one computed in each iteration—the optimization may converge close to the actual
optimum.
When the warp requirements are satisfied, we theoretically demonstrated that
the approximation due to gradient swapping was accurate—which should lead to
good convergence of the optimization. However, when the algorithm fails to meet the
requirements, it is actually approximating the Jacobian. We show in our experiments
that, in this case, the optimization converges when µ0 ≡ µJ—i.e. in registration
problems. As µ0 becomes increasingly different from µJ in the parameter space—as
is the case in tracking problems—the performance of the optimization degrades.
From how far do Algorithms Converge? The convergence in gradient-based
optimization heavily depends on the choice of the initial guess [Press et al., 1992].
Gradient-based optimization linearizes the cost function using a first order Tay-
lor series expansion at the starting point. The accuracy of an approximation that
uses Taylor series depends on the order of the approximating polynomial, so a lin-
ear approximation is rather coarse; hence, the initial guess for our gradient-based
optimization must be close to the optimum.
If S ⊆ R^P is the search space, we define the basin of convergence Λ as the
neighbourhood around the optimum µ∗ where the algorithm converges—i.e.
Λ = {µ0 | µ∗ = limk→∞ Υ(µ0)}, where Υ(µ0) = (µ0, µ1, . . . , µk) is the iterative sequence
that begins at µ0. The basin of convergence is typically an open ball with centre µ∗
and radius rΛ. The radius rΛ describes the convergence properties of our algorithm:
the larger the radius, the bigger the basin of convergence, so the algorithm converges
even from initializations far from the optimum. We show in our experiments that the
basin of convergence of a given algorithm strongly depends on the compliance with
the warp requirements.
Do Theoretical and Empirical Complexities Match? The experimental tests
shall also confirm the theoretical complexities that we obtained for our algorithms
in Chapter 7. We are especially interested in confirming the actual speed increase
of the factorization-based algorithms: we have only estimated their complexity
theoretically, and we want to compare it to other approaches such as LK.
112
8.2 Features and Measures
We answer the aforementioned questions by analyzing the experimental results of the
algorithms under review. We describe the qualitative properties of the algorithms by
using a collection of features. We quantify each feature by measuring certain output
values when testing the algorithms. In the following we present these features along
with their corresponding measures.
Accuracy
We measure how accurate the algorithms are, that is, how close their out-
comes are to the actual optimum. We measure the accuracy of our algorithms
using the Reprojection Error: let ˆµ ∈ R^P be the parameters estimated
by an algorithm, and let µ∗ ∈ R^P be the actual optimum for a given configura-
tion. We define the reprojection error ε(ˆµ) as the average Euclidean distance
between the projections of ˆµ and µ∗,
ε(ˆµ) = (1/N) Σx∈X ‖p(f(x; ˆµ)) − p(f(x; µ∗))‖,   (8.1)
where N = |X|, f : R³ × R^P → R³ is the warp function of our algorithm, and
p : R³ → R² is the corresponding projection. Notice that we define the error
function in R²—the image space—instead of R^P—the search space. Thus, the
estimated parameters are accurate if they project the shape to coordinates that
are close enough to those projected by the actual optimum (see Figure 8.3).
We could quantify the accuracy by directly comparing the parameters ˆµ and
µ∗. However, comparing motion parameters in R^P does not have a natural
geometric meaning—e.g. it is difficult to compare homography parameters or
rotation matrices.
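A minimal NumPy sketch of Equation 8.1 follows; project is a hypothetical callable standing for the composition p(f(·; µ)), mapping the N model points X to N × 2 image coordinates for a given parameter vector.

import numpy as np

def reprojection_error(project, X, mu_est, mu_true):
    u_est  = project(X, mu_est)      # N x 2 projections with the estimated parameters
    u_true = project(X, mu_true)     # N x 2 projections with the ground-truth parameters
    return np.mean(np.linalg.norm(u_est - u_true, axis=1))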
Efficiency
The efficiency feature is directly related to the computational complexity of the
algorithm. We theoretically computed the complexity of some algorithms in
Chapter 7, and we corroborate these estimations in the experiments. We break
down the computational burden of each algorithm into (1) the total number of
iterations of the optimization loop, and (2) the time per iteration measured in
seconds.
Robustness
The robustness feature measures the basin of convergence of the algorithm.
Our fundamental assumption is that the initial guess µ0 is located in a small
neighbourhood of the actual optimum. We measure the robustness of the al-
gorithm by successively increasing the radius of this neighbourhood and com-
puting the frequency of convergence of the optimization.
We define the Frequency of Convergence as the percentage of successfully
converged trials: we consider that an algorithm has successfully converged
113
Figure 8.3: Accuracy and convergence. (Left) The target projected by the ac-
tual parameters µ∗. We overlay on the alpha channel the target projected by the es-
timated parameters ˆµ. (Right-Top row) Two snippets of the image that show the
projections u = p(f(x; µ∗)) and u′ = p(f(x′; µ∗)), respectively, together with ˆu = p(f(x; ˆµ))
and ˆu′ = p(f(x′; ˆµ)). (Right-Bottom row) The green circle represents the threshold
in the reprojection error: we consider that the estimated parameters are accurate if the
estimated projection ˆu falls inside the circle.
when the reprojection error is below a given threshold (see Figure 8.3). Be-
sides, we define the Rate of Convergence as the variation of reprojection
error per iteration. We typically plot the rate of convergence as the reprojec-
tion error versus the iteration of the optimization loop.
Localness
Localness measures the convergence of an efficient algorithm with respect to
the point where the gradient is computed. The localness feature helps us to dis-
criminate between image registration and tracking: algorithms that are valid
for registration have high localness, whereas algorithms devised for tracking
should have low localness.
We measure localness by comparing the convergence frequency for different
datasets: algorithms with high localness should converge only for registration-like
problems; on the other hand, algorithms with low localness should converge equally
for all the datasets. We shall show that those algorithms that do not
satisfy their requirements are more local than those that do.
Generality
The generality feature measures the ability of the optimization scheme to
deal with different warp functions. Generality is a generic property of the
algorithms. Thus, we indirectly measure generality by comparing convergence
rates and frequencies of the same algorithm with different warp functions.
We summarize the qualitative features that we expect in our algorithms versus their
corresponding quantitative measures in Table 8.2.
114
Table 8.2: Features and Measures.
Qualitative Features   Quantitative Measures
Accuracy               Reprojection error
Efficiency             Time per iteration; Number of iterations
Robustness             Rate of Convergence; Convergence Frequency
Localness              Rate of Convergence; Convergence Frequency
Generality             Rate of Convergence; Convergence Frequency
Table 8.3: Numerical Ranges for Features.
Qualitative Feature   Measure             Good          Medium                Bad
Convergence           Freq. Convergence   x ≥ 80%       40% < x < 80%         x ≤ 40%
Accuracy              Reproj. Error       x ≤ 1.0 px    1.0 px < x < 5.0 px   x ≥ 5.0 px
Efficiency            Iterations          x ≤ 10        10 < x < 30           x ≥ 30
8.2.1 Numerical Ranges for Features
We compare the algorithms by analyzing the values of their numerical measures.
However, it is also interesting to compare the algorithms with respect to a fixed
scale. We build such a three-fold scale by classifying the numerical outcomes of the
experiments as good, medium, and bad. We show the ranges for the numerical values
of the measures in Table 8.3. Each feature is described by one or more measures, and
each measure is defined according to a given numerical range; we thus classify the
feature as good, medium, or bad (the Good, Medium, and Bad columns of Table 8.3).
Notice that we define these ranges of values arbitrarily, so the final classification may be
subject to interpretation.
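For illustration, the ranges of Table 8.3 can be encoded as small Python helpers; the function names are ours, but the thresholds are exactly those of the table.

def classify_convergence(freq_percent):
    # x >= 80% good, 40% < x < 80% medium, x <= 40% bad
    return "good" if freq_percent >= 80 else "medium" if freq_percent > 40 else "bad"

def classify_accuracy(reproj_error_px):
    # x <= 1.0 px good, 1.0 px < x < 5.0 px medium, x >= 5.0 px bad
    return "good" if reproj_error_px <= 1.0 else "medium" if reproj_error_px < 5.0 else "bad"

def classify_efficiency(iterations):
    # x <= 10 good, 10 < x < 30 medium, x >= 30 bad
    return "good" if iterations <= 10 else "medium" if iterations < 30 else "bad"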
115
8.3 Generation of Synthetic Experiments
We examine the registration/tracking algorithms by using a collection of syntheti-
cally generated experiments. The advantage of using synthetic data is that we can
accurately verify the outcomes of the algorithms. We design our synthetic experiments
to analyze the influence of the features introduced in Section 8.2.
Each experiment consists in registering a 3D target to an image. We synthetically
generate this image by rendering a 3D model of the target at parameters µ∗
, the
actual optimum. For efficient registration algorithms, we also render the template
image at µJ, the fixed parameters where we compute the Jacobian matrix. We
typically choose µJ such that the rendered image best displays the texture template
so the derivatives can be computed as accurately as possible—i.e., the texture is
totally visible, and it is neither distorted nor aliased, see Figure 8.6.
Once the image is generated, we iterate the registration algorithm starting from
the initial guess µ0. Recall from appendix A that the performance of GN-like op-
timization methods depends upon the shape of the cost function between µ0 and
µ∗
—which is usually convex when µ0 and µ∗
are close enough—and the accuracy
of the linear approximation provided by the Jacobian at µJ. Thus, by tuning pa-
rameters µ∗
, µ0, and µJ we may study the behaviour of the registration algorithm.
In the following we show how to generate synthetic datasets that highlight the
features introduced in Section 8.2:
Accuracy: Synthetic Ground-truth Synthetic data naturally provides ground
truth for evaluating accuracy: µ∗ is known by definition, so computing the reprojec-
tion error is straightforward from the algorithms' outcomes. Moreover, results can
be easily compared by sharing the synthetic data among all the algorithms.
Robustness: Gaussian Noise The robustness of a gradient-based optimization
procedure measures its convergence with respect to the difference between the initial
guess µ0 and the actual optimum µ∗
. We study the robustness of the algorithms by
generating experiments whose initial guess increasingly diverges from the ground-
truth optimum. We generate the initial guess for our experiments by corrupting the
actual optimum µ∗
with Gaussian noise of increasing standard deviation σ.
µ0 ∼ N(µ∗, σ²).   (8.2)
We analyze the convergence of the testing algorithms when varying the noise vari-
ance: the higher the variance the greater the distance between the initial guess and
the optimum. We show the ground truth data and several initial guesses (gener-
ated from different variances) in Figure 8.4. Although the variance is defined in
the parameter space, its effects are equivalent in R2
, the image coordinates. Ta-
ble 8.11—page 166—shows the reprojection error for the initial parameters ε(µ0)
(see Equation 8.1) for different noise values. Again, the bigger the variance the
greater the Euclidean distance in the image plane. We prefer to measure the error
116
Figure 8.4: Ground Truth and Noise Variance. We show the ground-truth values
—textured cube—with three parameter samples each—µ0^1 in red, µ0^2 in blue, and µ0^3 in
green—generated by using the corresponding noise variance σ. The noise ranges from
σ = 0.5 to σ = 5.0, and the successive samples increasingly depart from the ground-truth.
in the image to be comparable to previous works such as [Baker and Matthews,
2004; Brooks and Arbel, 2010].
Localness: Multiple Datasets Localness is especially relevant in the case of ef-
ficient algorithms such as HB or IC. Localness relates the convergence of an efficient
algorithm to how far the actual optimum µ∗ and the initial guess µ0 diverge from the
parameters µJ where we compute the Jacobian. We study the influence of localness
on the convergence of the testing algorithms by building multiple datasets: each
dataset contains samples of µ∗ that are increasingly different from µJ. The experi-
ments aim to demonstrate that efficient algorithms are local when they do not meet
the fundamental requirements: that is, the approximation of the gradients provided
by the efficient algorithms, when they do not satisfy the requirements, is only valid
in a local neighbourhood of µJ.
We design six datasets whose samples increasingly depart from the Jacobian
parameters µJ. We name these datasets from DS1 to DS6 (see Figure 8.5). For each
dataset we randomly generate 10,000 samples of the target position µ∗. We control
the divergence between µ∗ and µJ by defining the ranges from which we sample each
target parameter. These ranges define increasingly larger neighbourhoods centred
at µJ that do not overlap with each other. We show an example of datasets DS1–DS6
for parameters µ = {α, β, γ, tx, ty, tz}⊤ in Table 8.5—page 129—and we graphically
represent the values contained in the table in Figure 8.5. We choose the samples in
dataset DS1 to be equal to the Jacobian parameters—that is, µ∗ = µJ for all the
10,000 samples. For the remaining datasets, we arbitrarily choose the parameter
ranges depending on each target—for example, we do not rotate planar targets more
than 60° so that the texture may be accurately recovered from the image. For each
interval Ψi ≡ [ai, bi] corresponding to parameter µi we randomly sample the parameter
µ∗i from a Uniform distribution defined over the support set [−bi, −ai] ∪ [ai, bi]—
Figure 8.5: Definition of Datasets. Ranges of rotation parameters α, β, and γ for
datasets DS1 (Top-Left) to DS6 (Bottom-Right). For each parameter we display the
range interval as a green square annulus. We represent the combination of the three
ranges as a cubic shell—the region between two concentric cubes, plotted in blue. Inside
this region we plot the targets obtained from 8 random samples of the rotation parameters
within their corresponding ranges. The plots show that the rotation of the target increases
for each dataset.
i.e. µ∗i ∼ U([−bi, −ai] ∪ [ai, bi]). We repeat this process for each parameter of µ∗,
as each parameter may be defined over different ranges for the same dataset.
8.3.1 Synthetic Datasets and Images
We generate our synthetic experimental data for a given target by simultaneously us-
ing multiple datasets and Gaussian noise. We present the procedure in Algorithm 12.
Although the parameters and their ranges may change from one experiment to an-
other, the procedure is similar for all the cases.
The procedure generates six datasets DS1, . . . , DS6 with 10,000 samples each. We
generate the corresponding initial guess µ0 for the optimization by sampling 1,000
data points from a Normal distribution with σ ranging from 0.5 to 5.0 in increments
of 0.5—i.e. 1,000 ground-truth samples for each one of the 10 noise values. Thus, we
total 60,000 samples. We render one synthetic image per ground-truth sample: each
image represents the projection of the target under the motion parameters µ∗ and
the camera parameters K,
x = K [R(µ∗) | t(µ∗)] (X⊤, 1)⊤,   (8.3)
118
Algorithm 12 Creating the synthetic datasets.
1: for i = 1 to 6 do
2:   for σ = 0.5 to 5.0 (step 0.5) do
3:     for j = 1 to 1,000 do
4:       Generate a ground-truth sample for range i, µ∗ = (µ1, . . . , µP)⊤,
         where µk ∼ U([−bk, −ak] ∪ [ak, bk]) for k = 1, . . . , P.
5:       Generate the initial guess with Gaussian noise, µ0 ∼ N(µ∗, σ²).
6:     end for
7:   end for
8: end for
where X are the target shape coordinates, R and t are the rigid body motion corre-
sponding to the ground truth µ∗, and x are the target image projections.
We represent the target using a textured triangle mesh (see Chapter 5). We
project the target using Equation 8.3 and we render the texture on the image using
POVRAY, a free tool for ray tracing. The result is a collection of 60, 000 images (see
Figure 8.6).
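A NumPy sketch of the sampling part of Algorithm 12 is given below. The ranges argument is a made-up placeholder for the per-parameter intervals [ai, bi] of one dataset (the actual values are those of Table 8.5), and the rendering step with POVRAY is not shown.

import numpy as np

rng = np.random.default_rng(0)

def sample_ground_truth(ranges):
    """One mu* sample: each parameter drawn from U([-b,-a] U [a,b])."""
    mu = []
    for a, b in ranges:
        value = rng.uniform(a, b)
        mu.append(value * rng.choice([-1.0, 1.0]))   # sign picks one of the two intervals
    return np.array(mu)

def sample_dataset(ranges, sigmas=np.arange(0.5, 5.01, 0.5), n_per_sigma=1000):
    samples = []
    for sigma in sigmas:
        for _ in range(n_per_sigma):
            mu_star = sample_ground_truth(ranges)
            mu_0 = rng.normal(mu_star, sigma)        # noisy initial guess (Equation 8.2)
            samples.append((mu_star, mu_0, sigma))
    return samples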
Figure 8.6: Example of Synthetic Datasets. We select different samples from each
dataset. Datasets range from DS1 (Top) to DS6 (Bottom), according to Table 8.5. Top-
left image represents the position where we compute the Jacobian for efficient methods.
Notice that the successive samples increasingly depart from this location.
119
8.3.2 Generation of Result Plots
We use the synthetic data from Section 8.3.1 with a group of selected algorithms.
Notice that the ground-truth data µ∗j, the corresponding synthetic image, and the
initial guess µj are common to all algorithms.
We register the target with the synthetic image by using each algorithm. From
each optimization, we collect (1) the reprojection error ε(ˆµ) in pixels, (2) the total
optimization time in seconds, and (3) the number of iterations of the optimization.
We average the 1, 000 values of reprojection error, optimization time, and number
of iterations for each noise σ value. Besides, we consider that one optimization has
successfully converged when its reprojection error is below 5.0 pixels.
For each dataset we collect 10, 000 outcomes per algorithm: 1000 samples for
each one of the 10 levels of noise. Using the collected data of all the algorithms,
we generate four plots for each dataset. We plot each algorithm by using different
colours and markers for ease of comparison (e.g., see Figures 8.45–Page 167). Each
plot is computed as follows:
Reprojection Error We plot the average reprojection error (in pixels) of those
optimizations that have successfully converged for each algorithm. We inde-
pendently average these values for each level of noise—see, e.g., Figure 8.45–
(Top-Left). Notice that we are measuring the accuracy of the algorithms under
ideal conditions, as we are only using those optimizations that have success-
fully converged.
Percentage of Convergence We plot the percentage of optimizations that have
successfully converged for each level of noise—see, e.g., Figure 8.45–(Top-
Right).
Convergence Time We plot the average convergence time (in seconds) of those
optimizations that have successfully converged for each algorithm—see, e.g.,
Figure 8.45–(Bottom-Left).
Number of Iterations We plot the total number of iterations of those optimiza-
tions that have successfully converged for each algorithm—see, e.g., Figure 8.45–
(Bottom-Right).
Finally, note that we are only averaging those results from optimizations that have
converged: those plots concerning reprojection error or number of iterations only
consider the best outcomes of those algorithms. Thus, the average results also
depend on the frequency of convergence—i.e., low reprojection error coupled with
low frequency of convergence is less meaningful than low reprojection error coupled
with high percentage of convergence. We summarize the process of generating and
evaluating the experiments in Figure 8.7.
120
Figure 8.7: Experimental Evaluation with Synthetic Data. We generate synthetic data for the 6 datasets. For each dataset
we generate 10, 000 parameters (green) using the given range (magenta). With these parameters we compute both the initial guess for
the optimization (yellow) and the synthetic images (blue). Finally, each synthetic image–initialization pair is evaluated with the
algorithms LK, HB, IC, FC and GIC, and the results are collected (red).
121
8.4 Implementation Details
In this section we give detailed information about the implementation of the regis-
tration/tracking algorithms: the criteria to decide when to stop the minimization,
how to deal with self-occlusions of the model, or further improvements on the fac-
torization.
8.4.1 Convergence Criteria
We implement the gradient-based optimization of all the algorithms using the Gauss-
Newton scheme proposed by Madsen et al. [Madsen et al., 2004]. We use the fol-
lowing stopping criteria:
1. one based on the norm of the gradient: ‖g(x)‖ ≤ ε1, with g(x) being the
gradient evaluated at the parameters x.
2. one based on the parameter increment: ‖x(n + 1) − x(n)‖ ≤ ε2(‖x(n)‖ + ε2),
with x(n + 1) and x(n) being the parameters at iterations n + 1 and n.
Both criteria depend upon the values of the constants ε1 and ε2. For the whole range of
experiments we use the values ε1 = 10⁻⁸ and ε2 = 10⁻⁸. Finally, we define a safeguard
against infinite loops by imposing an upper bound of 50 iterations on the optimization
scheme. The parameter values at that point are taken as the solution.
Notice that the usual Gauss-Newton algorithm recomputes the Jacobian matrix
of the error, whereas other methods such as IC do not. We deal with this issue by
having separate functions to compute the Jacobian and the approximation to
the Hessian matrix while maintaining the overall scheme of the optimization. Each
algorithm was implemented using the MATLAB scripting language.
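A schematic Gauss-Newton loop with these two stopping criteria and the 50-iteration safeguard is sketched below, in Python for illustration only (the thesis implementation is in MATLAB); residual and jacobian are hypothetical callables.

import numpy as np

def gauss_newton(residual, jacobian, mu0, eps1=1e-8, eps2=1e-8, max_iters=50):
    mu = np.asarray(mu0, dtype=float)
    for _ in range(max_iters):
        r = residual(mu)
        J = jacobian(mu)
        g = J.T @ r                                   # gradient of 0.5*||r||^2
        if np.linalg.norm(g) <= eps1:                 # criterion 1: small gradient
            break
        delta = -np.linalg.solve(J.T @ J, g)          # Gauss-Newton step
        if np.linalg.norm(delta) <= eps2 * (np.linalg.norm(mu) + eps2):
            mu = mu + delta
            break                                     # criterion 2: small increment
        mu = mu + delta
    return mu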
8.4.2 Visibility Management
Convex targets—i.e. nonplanar—typically suffer from self-occlusion: the target can-
not be completely imaged but only a portion of it (see Figure 8.8). A triangle of
the target mesh becomes self-occluded when (1) the triangle is covered by other
triangles of the target that are closer to the camera, or (2) the normal vector to the
triangle is orthogonal —or partially orthogonal—to the camera projection rays. The
set of occluded triangles dynamically depends on the relative orientation of target
and camera: some of the triangles appear and others disappear from the image due
to changes on rotation, translation, and even the deformation of the target.
We consequently rewrite the brightness dissimilarity (Equation 3.1) using the
visibility set Ω,
D(Ω; µ) = T(Ω) − I(f(Ω; µ)),   (8.4)
where Ω ⊂ X is the set of visible vertices in the current view I, which depends on µ—
i.e. Ω = Ω(µ). Equation 8.4 actually holds only on the nonoccluded points: if we
Figure 8.8: Visibility management. (Top-row) We rotate the shape model around the Y-
axis by β = {0, 30, 60, 90} degrees. (Middle-row) Block-structure plots of the Jacobian
matrix. We compute the constant Jacobian J at β = 0°. We also compute the weighting
matrices Wi, i = 1, . . . , 4, such that Wi depends on the rotation with angle βi. For each
view, we plot the matrix WiJJ⊤W⊤i; blue dots represent its non-zero entries.
(Bottom-row) Block-structure plots of the Hessian matrices Hi = J⊤W⊤i WiJ. Different colour
values represent different data entries.
include any vertex outside Ω, its dissimilarity will be corrupted due to the erroneous
brightness of the imaged vertex.
We take the self-occluded points into account in the optimization scheme (such as
LK, HB, etc.) using Weighted Least-Squares (wls) [Press et al., 1992]. We transform
the ordinary least-squares problem,
J⊤J δµ = −J⊤r,
into the wls problem,
J⊤W⊤WJ δµ = −J⊤W⊤Wr,
where W is the weight matrix. Matrix W is typically diagonal—the residuals are
uncorrelated—and the i-th entry in the main diagonal indicates the importance of
residual ri(µ)—the i-th entry of vector r. We choose the following weighting matrix:
W = diag(ω1, ω2, . . . , ωNΩ),        (8.5)

where

ωi,i = 1 if xi ∈ Ω ⊂ X,  and  ωi,i = 0 if xi ∉ Ω ⊂ X,        (8.6)
and NΩ = |Ω|. Matrix W in Equation 8.5 includes the i-th residual in the
normal equations (Equation 8.4.2) only if the corresponding i-th target point is visible
(cf. Equation 8.6). Those points not present in Ω do not affect the local minimizer
of Equation 8.4.2, so we have effectively accounted for the self-occluded points.
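Because the weights are binary, applying W simply discards the rows of the Jacobian and the residual entries that belong to occluded points. A minimal sketch of one weighted Gauss-Newton step, assuming a boolean visibility mask is available (Python/NumPy; the names are illustrative), is:

import numpy as np

def wls_step(J, r, visible):
    # J       : (N, p) Jacobian of the residual
    # r       : (N,)   residual vector
    # visible : (N,)   boolean mask, True for the points in the visible set Omega
    w = visible.astype(float)        # omega_i in {0, 1} (Equation 8.6)
    Jw = J * w[:, None]              # rows of occluded points are zeroed out
    H = Jw.T @ Jw                    # J^T W^T W J (W diagonal and binary)
    g = Jw.T @ (w * r)               # J^T W^T W r
    return np.linalg.solve(H, -g)    # local minimizer of the WLS problem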
Constant Hessian and WLS  Note that the Hessian matrix in Equation 8.4.2
includes the inner product W⊤W. If W depends on µ, then the Hessian is no longer
constant. Thus, the efficiency of algorithms that use a constant Hessian, such as IC,
greatly diminishes, as the product J⊤W⊤WJ changes over time.
We can estimate the loss of efficiency of the IC algorithm using the examples
from Chapter 7. The original implementation of algorithm ICH8 is about six times
faster than HBH8 (see Table 7.17). However, if the Jacobian depends on the matrix W,
the resulting IC algorithm with a variable Hessian is slower—cf. Table 7.15, Page 106.
Efficient Solution of WLS  We may alleviate the loss of efficiency of any algorithm
that uses wls by using a technique similar to [Gross et al., 2006]: the key
idea is to keep as many precomputed values as possible. The method subdivides the
tracking region of the object into P non-overlapping partitions Pi. The partitions are
chosen such that the triangles inside each one have a consistent orientation. We precompute
the Hessian matrix for each partition, HPi, and we compute the Hessian matrix for
the optimization as

H = Σi λi HPi ,   i = 1, . . . , P,

where λi is a weight for each partition Pi (see Figure 8.9).
We also adapt the method from [Gross et al., 2006] to the HB algorithm: we
compute the matrix SPi for each partition Pi and we build the matrix S as

S = Σi S⊤Pi SPi .
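A minimal sketch of this partition-based assembly, assuming the per-partition Hessians HPi have been precomputed off-line and the λi are scalar visibility weights (Python/NumPy, purely illustrative):

import numpy as np

def assemble_hessian(partition_hessians, weights):
    # partition_hessians : list of (p, p) arrays H_Pi, computed off-line
    # weights            : list of scalars lambda_i, e.g. the visible fraction
    #                      of each partition in the current view
    H = np.zeros_like(partition_hessians[0])
    for H_i, lam in zip(partition_hessians, weights):
        if lam > 0.0:                # fully occluded partitions contribute nothing
            H += lam * H_i
    return H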
Figure 8.9: Efficient solution of WLS. The texture in the reference space is subdivided
into regions—green squares—whose Hessians are computed separately. The actual
Hessian only takes into account visible regions—blue overlay.
8.4.3 Scale of Homographies
In Section 5.3 we proved that the Equation 5.31 represents a proper factorization as
S (Equation 5.32) only contains target shape elements. However, this is not entirely
accurate as matrix S also depends on the homogeneous scale factor λ = 1/(1 − n⊤
t)
(cf. Appendix E). Note that the λ scale depends on the translation vector t, and
hence, the matrix S cannot be constant.
However, we can still use the factorization with no efficiency loss by employing
wls—as in the case of self-occlusions. We redefine the weighting matrix W (Equa-
tion 8.5) as
W = diag(ω′1, ω′2, . . . , ω′NΩ),        (8.7)

and we define each weight ω′i,i = ωi,i λi, where ωi,i are the occlusion weights (Equation 8.6),
and λi = 1/(1 − n(i)⊤t), with n(i) being the normal of the plane associated with the i-th vertex.
The conclusion is that we can extract the homogeneous scale λ from matrix S,
and account for the factor when solving for the local minimizer in a wls fashion. Thus,
the matrix S is actually constant and we do not lose efficiency, as the weighting
process is unavoidable in any case.
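In practice the new weights are obtained by multiplying the binary occlusion weights by the per-vertex scale λi; a small illustrative sketch (the array names are assumptions) is:

import numpy as np

def homography_scale_weights(visible, normals, t):
    # visible : (N,)   boolean visibility mask (omega_i, Equation 8.6)
    # normals : (N, 3) plane normal n(i) associated with each vertex
    # t       : (3,)   current translation vector
    lam = 1.0 / (1.0 - normals @ t)          # lambda_i = 1 / (1 - n(i)^T t)
    return visible.astype(float) * lam       # omega'_i = omega_i * lambda_i (Eq. 8.7)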
8.4.4 Minimization of Jacobian Operations
We also benefit from the block structure of the factorized form of the Jacobian
matrix. The idea was first proposed in [Sepp and Hirzinger, 2003], and we adapt
it to our algorithms. In the following we show how to apply the idea to algorithm hb3dtm (see
Algorithm 7).
We solve for the local minimizer (Equation 8.4.2) by using a Jacobian matrix
from the factorization (Equation 5.31) as follows:

δµ = ((SM)⊤(SM))⁻¹ (SM)⊤ r.        (8.8)
We turn our attention to the GN Hessian matrix of Equation 8.8, (SM)⊤(SM), and
we rewrite it using Equations 5.32 and 5.30 as follows:

M⊤S⊤SM = [ M1⊤  0   ] [ S1⊤ ] [ S1  S2 ] [ M1  0  ]
          [ 0    M2⊤ ] [ S2⊤ ]            [ 0   M2 ]

        = [ M1⊤  0   ] [ S1⊤S1  S1⊤S2 ] [ M1  0  ]
          [ 0    M2⊤ ] [ S2⊤S1  S2⊤S2 ] [ 0   M2 ]

        = [ M1⊤S1⊤S1M1   M1⊤S1⊤S2M2 ]
          [ M2⊤S2⊤S1M1   M2⊤S2⊤S2M2 ] .        (8.9)
Notice that the matrix in Equation 8.9 is symmetric, so there is no need to compute
the lower off-diagonal block—M2⊤S2⊤S1M1 in this case. Thus, we save roughly
25% of the computations required to build the Hessian matrix (Equation 8.9).
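The saving can be sketched as follows: only the diagonal blocks and the upper off-diagonal block of Equation 8.9 are computed, and the lower block is obtained by transposition (Python/NumPy, illustrative only):

import numpy as np

def block_hessian(S1, S2, M1, M2):
    # Gauss-Newton Hessian of Equation 8.9 assembled block-wise.
    H11 = M1.T @ (S1.T @ S1) @ M1
    H22 = M2.T @ (S2.T @ S2) @ M2
    H12 = M1.T @ (S1.T @ S2) @ M2
    return np.block([[H11, H12],
                     [H12.T, H22]])  # lower block H21 = H12^T is never recomputed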
8.5 Additive Algorithms
In this section we present the experiments that we conducted to evaluate additive
registration algorithms. We organize this section as follows: we introduce the
algorithms that we use to evaluate the hypotheses in Section 8.5.1; we generate
synthetic data for rigid and nonrigid targets and evaluate the algorithms on them
in Sections 8.5.2 and 8.5.3; and we demonstrate the robustness of our algorithms on real data
in Sections 8.5.5 and 8.5.6.
8.5.1 Experimental Hypotheses
The purpose of these tests is to confirm theoretical hypotheses concerning some of the
algorithms. We are especially interested in investigating the relationship between the
fulfilment of certain requirements by an algorithm and its convergence.
We use the same naming convention for algorithms that we introduced in Chap-
ter 7. We select four algorithms from Table 7.2 (see Page 95): two for rigid targets—
LK3DTM and HB3DTM—and two for nonrigid data—LK3DMM and HB3DMM. For the sake
of comparison we also include the algorithm HB3DRT: 6-dof tracking with the HB
algorithm [Sepp, 2006]. Table 8.4 summarizes some characteristics of the selected
algorithms: description of the warp, constancy of the Jacobian matrix, and fulfil-
ment of Requirement 1.
Table 8.4: Evaluated Additive Algorithms
Algorithm Warp Constant Req. 1
LK3DTM Shape-induced homography No —
HB3DTM Shape-induced homography Partially YES
LK3DMM Nonrigid shape-induced homography No —
HB3DMM Nonrigid shape-induced homography Partially YES
HB3DRT Rigid Body Partially NO
We use the algorithms from Table 8.4 to validate that (1) the convergence of HB
depends upon Requirement 1, (2) HB is accurate and robust, and (3) HB is efficient.
8.5.2 Experiments with Synthetic Rigid data
This set of experiments studies the convergence of the evaluated algorithms in a
controlled environment. The synthetic datasets provide us with precise measurements of
the outcomes of the algorithms.
Target Model  Our target is a 3D textured model (3dtm): a 3D triangle mesh
and a texture image—both of them defined in a reference frame—constitute the
target (cf. Section 5.3.1). We use three models in our experiments with rigid data:
(1) a textured cube, (2) a human face, and (3) a textured rectangular box.
The cube model comprises 15,606 vertices and 30,000 triangles, and the texture
image has 640 × 480 pixels (see Figure 8.10). We use the centroid of each triangle
as a target vertex, adding up to 30,000 points. We compute the centroids in both the
triangle mesh and the reference frame using barycentric coordinates. The texture
for each triangle centroid is computed by averaging the colour values of the vertices
of the triangle. We do not consider those vertices close to the edges of the cube;
instead of removing them from the model, we mark them as “forbidden”
and treat them as occluded points (see Section 8.4.2).
The face model¹ comprises 5,866 vertices and 11,344 triangles, and the texture
image has 500 × 500 pixels (see Figure 8.11). We use the centroid of each triangle
as a target vertex, adding up to 11,344 points (see Figure 8.11).
The tea box model comprises 61,206 vertices and 120,000 triangles, and the texture
image has 575 × 750 pixels (see Figure 8.12). We use the centroid of each triangle
as a target vertex, adding up to 120,000 points. We mark some of those vertices as
“forbidden”, and hence we do not use them in our algorithms.
¹ Generously provided by Prof. T. Vetter and S. Rhomdani from Basel University.
Figure 8.10: The cube model. (Left) The model is a textured triangle mesh. We have
downsampled the triangle mesh for a proper visualization (blue lines). (Right) Image
containing the texture map of the model.
Figure 8.11: The face model. (Left) The model is a textured triangle mesh. We have
downsampled the triangle mesh for a proper visualization (blue lines). (Right) Texture
map of the model. The texture image is actually a cylindrical projection of the mesh
colour data. Source: Data provided by T. Vetter and S. Rhondami, University of Basel.
Experiments with cube model
We generate 60,000 synthetic experiments for a rotating cube model by using the
procedure described in Section 8.3. Figure 8.6 shows a subset of selected samples
from this collection. We rotate the model using the Euler angles
α, β, and γ, coupled with translations along the three axes tX, tY, and tZ. We
show the ranges of the parameters used to generate the experiments
in Table 8.5. Notice that we include extreme rotations of the target to test the
robustness of the algorithms. We register the target in each of the 60,000 experiments
by using the algorithms LK3DTM, HB3DTM, and HB3DRT. Table 8.6 shows the average
initial reprojection error for each dataset and level of noise. As expected, the higher
the noise, the larger the reprojection error. We apply the algorithms to the generated
Figure 8.12: The tea box model. (Left) The model is a textured parallelepiped,
representing a box of tea. We have downsampled the triangle mesh (blue lines) for a
proper visualization. (Bottom) Texture image of the model. We obtained the texture by
unfolding, then scanning the actual tea box.
Table 8.5: Ranges of parameters for cube experiments.
Dataset α β γ tx ty tz
Registration DS1 0 0 0 0 0 0
Tracking
DS2 [0,72] [0,72] [0,72] [0,20] [0,10] [0,10]
DS3 [72,144] [72,144] [72,144] [20,40] [10,20] [10,20]
DS4 [144,216] [144,216] [144,216] [40,60] [30,30] [30,30]
DS5 [216,288] [216,288] [216,288] [60,80] [30,40] [30,40]
DS6 [288,360] [288,360] [288,360] [80,100] [40,50] [40,50]
experiments and we show the results in Figures 8.13–8.18.
Table 8.6: Average reprojection error vs. noise for cube.
σ 0.50 1.00 1.50 2.00 2.50 3.00 3.50 4.00 4.50 5.0
DS1 1.56 3.17 4.60 6.38 7.98 9.41 10.98 12.49 14.30 15.69
DS2 1.58 3.21 4.81 6.25 7.93 9.48 11.07 12.76 14.11 15.79
DS3 1.79 3.47 5.07 6.80 8.74 10.39 11.92 14.42 15.35 16.70
DS4 1.61 3.16 4.71 6.35 7.91 9.53 11.21 12.91 14.41 15.56
DS5 1.79 3.86 5.10 7.02 8.76 10.74 12.57 13.89 15.17 16.97
DS6 1.76 3.29 5.11 6.47 7.92 9.73 11.34 12.95 14.36 16.12
Figure 8.13: Results from Additive Rigid Dataset DS1 for cube (Top-left) Ac-
curacy plot: average reprojection error against noise standard deviation. (Top-right)
Robustness plot: average frequency of convergence against noise. (Bottom-left) Effi-
ciency plot: average convergence time against noise standard deviation. (Bottom-right)
Efficiency plot: Average number of iterations against noise.
Dataset DS1  This dataset contains the experiments closest to the registration
problem (i.e. µ∗template = µ0). We show the results in Figure 8.13. The three algorithms
converge for every experiment. The accuracy results are similar for all the
experiments, although the error remains roughly constant around 1.5 pixels. However, for
the number of iterations we obtain different results: the HB3DRT algorithm iterates
approximately twice as many times as the other algorithms—although its resulting
optimization time is half that of LK3DTM due to a lower time per iteration.
Datasets DS2  We show the results for dataset DS2 in Figure 8.14. The accuracy
plot shows that the average reprojection error has increased with respect to dataset
DS1: about 0.5 pixels, roughly 25% more, for algorithms LK3DTM and HB3DTM; moreover,
the reprojection error for HB3DRT is about 50% higher than for the remaining two algorithms.
The results for percentage of convergence differ even more than for DS1:
optimizations for both LK3DTM and HB3DTM converge in around 90% of the trials in the worst
case, while convergence for algorithm HB3DRT drops to about 70%. As in dataset DS1,
the number of iterations needed by HB3DRT to converge approximately doubles the
iterations of LK3DTM and HB3DTM—again, algorithm HB3DRT is faster than LK3DTM
due to a lower time per iteration.
Figure 8.14: Results from dataset DS2 for cube. (Top-left) Accuracy plot: average
reprojection error against noise standard deviation. (Top-right) Robustness plot: average
frequency of convergence against noise. (Bottom-left) Efficiency plot: average conver-
gence time against noise standard deviation. Efficiency plot: (Bottom-right) average
number of iterations against noise.
Datasets DS3  We show the results for dataset DS3 in Figure 8.15. The results for
reprojection error, frequency of convergence and number of iterations of
algorithms LK3DTM and HB3DTM are similar to those in dataset DS2. However, the results for algorithm
HB3DRT significantly worsen: the average reprojection error lies in the range of 4.0
pixels, roughly 100% more than for the other two algorithms; also, the convergence
of algorithm HB3DRT monotonically decreases with the level of noise, from 85% to
roughly 25%. The number of iterations does not grow with respect to dataset DS2:
the algorithm converges in fewer cases, but iterates fewer times in each convergence.
However, for those cases in which HB3DRT converges, the convergence time is better
than that of LK3DTM. Algorithm HB3DTM has the lowest convergence time.
Figure 8.15: Results from dataset DS3 for cube. (Top-left) Accuracy plot: av-
erage reprojection error against noise standard deviation. (Top-right) Robustness plot:
average frequency of convergence against noise. (Bottom-left) Efficiency plot: aver-
age convergence time against noise standard deviation. (Bottom-right) Efficiency plot:
Average number of iterations against noise.
Datasets DS4–DS6  We show the results for these datasets in Figures 8.16–8.18.
The accuracy results are similar to the previous datasets: the reprojection error for
algorithm HB3DRT is higher than for algorithms LK3DTM and HB3DTM—although the
differences are small for dataset DS4. Algorithm HB3DRT also converges less often
than the other two algorithms: the frequency of convergence approximately ranges
from 80% to 10%—although the frequency of convergence for dataset DS6 is slightly
better than for datasets DS4 and DS5. The results for the number of iterations are
similar to the previous datasets: the number of iterations grows linearly with noise,
although the results for algorithm HB3DRT double those of the other two algorithms.
Figure 8.16: Results from dataset DS4 for cube. (Top-left) Accuracy plot: av-
erage reprojection error against noise standard deviation. (Top-right) Robustness plot:
average frequency of convergence against noise. (Bottom-left) Efficiency plot: aver-
age convergence time against noise standard deviation. (Bottom-right) Efficiency plot:
Average number of iterations against noise.
Discussion  The results clearly show that efficient algorithms that do not
hold their requirements are not valid for tracking: the convergence of algorithm
HB3DRT, which does not hold Requirement 1, is very poor—only 80% even under ideal conditions—
for datasets DS3–DS6, i.e., those that represent the tracking problem; however, the
algorithm has good convergence for datasets DS1 and DS2, i.e., those datasets
whose samples represent the registration problem.
On the other hand, the efficient algorithm HB3DTM holds Requirement 1 and its
results are valid for both registration and tracking: although the convergence of
algorithm HB3DTM degrades from dataset DS1 to DS6, the optimizations converge in
more than 80% of the trials even under the worst conditions. Moreover, notice that the results
of algorithm HB3DTM are equivalent to those of algorithm LK3DTM—which does not
assume any requirement. Thus, we conjecture that the degradation of the convergence
is due to the difficulty of the experiments, not to problems with the efficient
approximation.
Figure 8.17: Results from dataset DS5 for cube. (Top-left) Accuracy plot: average
reprojection error against noise standard deviation. (Top-right) Robustness plot: average
frequency of convergence against noise. (Bottom-left) Efficiency plot: average conver-
gence time against noise standard deviation. Efficiency plot: (Bottom-right) average
number of iterations against noise.
Algorithms HB3DTM and LK3DTM show similar accuracy for all datasets. Although
the average accuracy seems to be low—it ranges from 1.5 to 2.2 pixels—the results
are consistent for the two algorithms. The accuracy results for algorithm HB3DRT
are worse than those of the other two algorithms—even for those cases in which it
successfully converged. Thus, neglecting Requirement 1 for the HB algorithm has a
direct impact on the accuracy of the optimization.
Timing results are also affected by Requirement 1. Algorithm HB3DRT consistently
iterates more times to converge than the other two algorithms: if Requirement 1
does not hold, the efficient Jacobian is incorrectly approximated, and the
successive iterations take longer to reach the optimum.
In summary, the satisfaction of Requirement 1 affects the convergence of algorithm
HB and, to a lesser extent, the accuracy of the optimization. Thus,
Requirement 1 is mandatory when using the HB algorithm for tracking.
Figure 8.18: Results from dataset DS6 for cube. (Top-left) Accuracy plot: average
reprojection error against noise standard deviation. (Top-right) Robustness plot: average
frequency of convergence against noise. (Bottom-left) Efficiency plot: average conver-
gence time against noise standard deviation. Efficiency plot: (Bottom-right) average
number of iterations against noise.
Experiments with tea box model
We also test the tea box model. The purpose of these experiments is to verify the
validity of our algorithm for tracking objects that rotate a full 360°. We show that our tracker
naturally handles strong target rotations by unfolding the target texture.
Figure 8.19: tea box sequence. Selected frames from the sequence generated for
the target tea box. We show approximately one frame in every fifteen. The target displays
strong rotations around the three axes, and translations that take the target to the very borders
of the image.
Experiments with the tea box model are different from the previous ones: we
evaluate a continuous sequence of the target rotating and translating through the
scene. We generate a 600-frame sequence in which we completely rotate the target
around its vertical axis—i.e. β ∈ [0, 360]. Besides, we strongly rotate the target
around the remaining axes—α ∈ [0, 120] and γ ∈ [0, 300]—and we move the target near the
borders of the image (see Figure 8.19). We continuously track the object through the
scene using algorithm HB3DTM. We show the results in Figure 8.20. The algorithm
consistently recovers the target position and orientation.
Figure 8.20: Results for the tea box sequence. We overlay the results of algorithm
HB3DTM onto the frames from the tea box sequence. For each frame we project the target
vertices using the parameters resulting from the optimization. We represent these projections
using blue dots.
We confirm the results by plotting the estimated values against the frame number
in Figure 8.21. The target rotation is accurately estimated: the three Euler angles from
the estimation match the ground-truth values despite the extreme orientations. The
translation parameters are also correctly estimated.
Figure 8.21: Estimated parameters from the tea box sequence. (Top-row) Euler
angles against frame number. ∆GT: ground-truth Euler angles used to generate the
sequence; ∆EST: estimated Euler angles using shape-induced HB, for ∆ = {α, β, γ}.
(Bottom-row) Translation parameters against frame number. t∆GT: ground-truth translation
parameters used to generate the sequence; t∆EST: estimated translation parameters,
for ∆ = {X, Y, Z}.
Good Texture to Track
The convergence of direct methods heavily depends on the texture of the target.
Although the existence of texture corners is not strictly necessary for the tracking
algorithms to work, the target texture influences the result [Benhimane et al., 2007].
The question now is: how do we classify a texture as good or bad? A classical
reference on the subject, [Shi and Tomasi, 1994], claims that high-frequency
textures—i.e. targets with high-contrast patterns and clearly defined borders—are
the most suitable for tracking/registration. [Shi and Tomasi, 1994] demonstrate
their claims by tracking Harris corner features using LK; as high-frequency texture
patterns provide more stable estimations of Harris corners, they allegedly should
improve tracking.
However, more recently [Benhimane et al., 2007] demonstrated that low-frequency
textures—i.e. textures that change gradually, or do not have clearly defined borders—
improve the convergence of gradient-based direct methods. They support their
claims by showing that the solution to the least-squares problem involving image gradients
is more accurate when the Jacobian is computed from smooth texture patterns.
This assumption apparently conflicts with the idea that a high-frequency texture
pattern may provide a better estimation of the registration parameters.
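A toy 1D experiment illustrates the point made by [Benhimane et al., 2007]: Gauss-Newton registration of a pure translation typically converges from a larger initial offset when the template is a smooth Gaussian than when it is a high-frequency pattern, whose basin of convergence is limited by its period. The sketch below is only illustrative—the signals, the shift, and the number of iterations are arbitrary choices and not part of the thesis experiments.

import numpy as np

def register_1d(template, image, x0=0.0, iters=50):
    # Estimate a 1D translation by Gauss-Newton on the brightness error.
    grid = np.arange(template.size, dtype=float)
    x = float(x0)
    for _ in range(iters):
        warped = np.interp(grid + x, grid, image)            # I(u + x)
        grad = np.gradient(warped)                           # warp Jacobian
        r = template - warped                                # brightness residual
        x += np.sum(grad * r) / (np.sum(grad * grad) + 1e-12)
    return x

n = np.arange(200, dtype=float)
smooth = np.exp(-(n - 100.0) ** 2 / (2.0 * 25.0 ** 2))       # low-frequency texture
sharp = 0.5 * (1.0 + np.sign(np.sin(n / 2.0)))               # high-frequency texture
true_shift = 12.0
for name, tex in [("smooth", smooth), ("sharp", sharp)]:
    image = np.interp(n - true_shift, n, tex)                # shifted observation
    print(name, "estimated shift:", round(register_1d(tex, image), 2))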
We study the relationship between convergence and texture in HB by performing
an experiment with the face model. We compare the parameters estimated by HB
using the same structure but (1) the usual texture of the face model, and (2) a
texture of Gaussian gradients similar to the cube model (see Figure 8.10). We intend
to isolate the influence of the target texture on the accuracy of the estimation from
other sources such as the target structure or kinematics. Notice that this experiment
with the face model also proves that our proposed algorithm to track shape-induced
homographies with HB can deal with 3D models more complicated than “boxes”—
such as the cube and teabox models.
We build up the experiment by rotating the face model 90° back and forth
around the Y-axis, and estimating the parameters using HB—see Figure 8.23 (Top
rows). We also modify the texture of the face model using a pattern of Gaussian
gradients—see Figure 8.23 (Bottom rows)—and we again estimate the motion parameters
using HB. The face model provides a high-frequency texture (especially in the
eyebrows, lips, eyes, and ears), whereas the Gaussian gradients provide, by definition,
a low-frequency texture. Moreover, the Gaussian texture is uniformly distributed over
the face, whereas eyes, lips, and brows are mainly visible in frontal views—i.e., the
texture on the sides of the face is ill-conditioned.
We also plot the ground-truth values used to generate the sequence side by side
with the values estimated by HB using the usual and the Gaussian
textures. Figure 8.22 shows the values of the target rotational parameters—i.e. the
Euler angles α, β, and γ—for the ground-truth data (αGT), the estimation from the
face model texture (αTXT), and the estimation from the Gaussian texture (αGSS).
The results in Figure 8.22 show that HB with the usual texture cannot deal with
extreme out-of-plane rotations: the estimation loses track at β ≈ 60° and the obtained
solution cannot provide a reliable initial estimation for the remaining frames.
However, HB with a Gaussian gradient texture provides an accurate estimation for
every frame, even for the extreme value of β = ±90°; this is possibly due to
the fact that the Gaussian texture is uniformly distributed all over the model.
We conclude that texture is fundamental for an accurate estimation of the target
motion: results may greatly diverge when using different textures on the same target
structure. Furthermore, we show that the problem of tracking a human head under
large rotations is rather difficult, as the texture of the face at large rotations is
not entirely suitable for gradient-based registration.
Figure 8.22: Estimated parameters from the face sequence. Euler angles for ground-truth
and estimated values plotted against frame number. ∆GT: ground-truth Euler
angles used to generate the sequence; ∆TXT: Euler angles estimated from the usual texture;
∆GSS: Euler angles estimated from the Gaussian gradients texture; for ∆ = α, β, γ.
Figure 8.23: Good texture vs. bad texture. The face model with the usual texture
(Top-rows) and with a Gaussian gradient texture (Bottom-rows). We project the target
vertices onto each image as blue dots.
8.5.3 Experiments with Synthetic Nonrigid data
In this group of experiments we allow the target model to deform, in addition to
changing its position and orientation.
Target Model  We describe our target using a 3D morphable model (3DMM) [Blanz
and Vetter, 2003]: the target comprises a set of linear deformation basis (including
the mean sample) and a texture image, both of them defined in the reference frame.
The face-deform model contains 9 modes of deformation.
Figure 8.24: The face-deform model. Target deformation is encoded using linear
deformation basis bk, k = 1, . . . , 9 with mean ¯x (cf. Equation 5.34).
We derive the face-deform model by adding 9 linear deformation basis to the
mean represented by the face model. Each basis—and the mean—comprises 5,866
vertices distributed in 11,344 triangles. The texture image is 500 × 500 pixels in size.
We compute the deformation basis by applying PCA to a distribution of face
meshes. This distribution results from deforming the face model triangle mesh using
a muscle-based system [Parke and Waters, 1996]: we attach 18 parametric muscles
to certain vertices of the mesh such that modifying the muscles actually deforms the
mesh². We generate 475 keyframes encompassing different values of our synthetic
muscles. Each frame provides a sample mesh for the PCA procedure that estimates
the linear deformation model. We show the resulting model in Figure 8.24.
² J.M. Buenaposada kindly provided the muscle-based animation.
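For illustration, a linear deformation model of this kind can be estimated from the sample meshes with a standard PCA; the following sketch (Python/NumPy; the meshes array and the SVD-based implementation are assumptions, not the exact procedure used in the thesis) computes the mean shape and the first deformation modes.

import numpy as np

def deformation_basis(meshes, num_modes=9):
    # meshes    : (K, V, 3) array of K sample meshes with V vertices each
    # num_modes : number of deformation modes to keep (9 for face-deform)
    K, V, _ = meshes.shape
    X = meshes.reshape(K, 3 * V)                    # one mesh per row
    mean = X.mean(axis=0)
    U, s, Vt = np.linalg.svd(X - mean, full_matrices=False)
    basis = Vt[:num_modes]                          # principal deformation modes
    energy = (s[:num_modes] ** 2).sum() / (s ** 2).sum()   # retained variance
    return mean.reshape(V, 3), basis.reshape(num_modes, V, 3), energy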
Figure 8.25: Distribution of Synthetic Datasets. We select different samples from
each dataset. Datasets range from DS1 (Top) to DS6 (Bottom), according to Table 8.5.
Top-left image represents the position where we compute the Jacobian for efficient meth-
ods. Notice that the successive samples increasingly depart from this location.
Figure 8.25 shows 36 selected samples from a total of 60,000 for the face-deform
model—6 random samples for each dataset. We randomly sample the rotation and
translation parameters from a uniform distribution defined on the ranges displayed
in Table 8.7 (using the procedure described in Section 8.3).
Besides pose and orientation, we also deform the model polygonal mesh. We randomly
select one of the 475 meshes from the shape distribution, and we compute its
corresponding vector of deformation parameters c∗. The corresponding initial guess
is more involved to compute than pose and orientation: we must carefully choose the
variance of the Gaussian noise so that the resulting c0 produces a physically plausible
shape. Thus, we compute the covariance matrix of the deformation coefficients,
Λc, from the shape distribution. We generate the initial guess for deformation, c0,
by corrupting the ground-truth value with Gaussian noise,

c0 ∼ N(c∗, Λc).
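A sketch of how such an initial guess can be drawn, given the ground-truth coefficients of the experiment and the deformation coefficients of the 475 sample meshes (Python/NumPy; the names are assumptions):

import numpy as np

def initial_deformation_guess(c_star, coefficient_samples, rng=None):
    # c_star              : (d,)   ground-truth deformation coefficients c*
    # coefficient_samples : (K, d) coefficients of the sample meshes, used to
    #                       estimate the covariance matrix Lambda_c
    rng = np.random.default_rng() if rng is None else rng
    Lambda_c = np.cov(coefficient_samples, rowvar=False)
    return rng.multivariate_normal(c_star, Lambda_c)   # c0 ~ N(c*, Lambda_c)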
Table 8.7: Ranges of parameters for face-deform experiments.
Dataset α β γ tx ty tz
Registration DS1 0 0 0 0 0 0
Tracking
DS2 [0,10] [0,10] [0,10] [0,10] [0,10] [0,10]
DS3 [10,20] [10,20] [10,20] [10,20] [10,20] [10,20]
DS4 [20,30] [20,30] [20,30] [20,30] [30,30] [30,30]
DS5 [30,40] [30,40] [30,40] [30,30] [30,30] [30,30]
DS6 [40,50] [40,50] [40,50] [30,30] [30,30] [30,30]
We show the average initial reprojection error for the generated initializations in
Table 8.8. Using these initialization values, we execute the optimization procedures
for algorithms LK3DMM and HB3DMMSF. We present joint plots of the results in
Figures 8.26–8.31.
Table 8.8: Average reprojection error vs. noise for face-deform.
σ 0.50 1.00 1.50 2.00 2.50 3.00 3.50 4.00 4.50 5.0
DS1 1.50 3.09 4.47 6.07 7.58 8.99 10.71 12.29 13.50 15.51
DS2 1.54 2.95 4.62 6.08 7.63 9.14 10.41 12.18 13.60 15.05
DS3 1.53 3.05 4.46 6.03 7.64 9.32 10.70 12.09 13.73 15.07
DS4 1.53 2.98 4.59 6.11 7.50 9.20 10.55 12.01 13.64 15.47
DS5 1.60 3.12 4.72 6.26 7.90 9.34 11.06 12.41 14.10 15.66
DS6 1.58 3.08 4.71 6.25 7.76 9.18 11.09 12.74 13.85 15.63
Dataset DS1  In this dataset we study the experiments corresponding to the
registration problem—i.e. µJ = µ0. We show the results in Figure 8.26. The accuracy
results show that the reprojection error grows linearly with the noise variance,
although the average reprojection error for the LK algorithm is smaller than for HB. The
results for frequency of convergence show that HB is less robust than LK: algorithm
HB converges perfectly for low noise—i.e., σ ≤ 1.5—but the convergence monotonically
decreases for higher noise variance. On the other hand, the HB algorithm is more
efficient than LK: for those cases that converged, although both algorithms iterate
roughly the same number of times, HB is almost 10 times faster than LK.
Figure 8.26: Results from dataset DS1 for face-deform. (Top-left) Accuracy plot:
average reprojection error against noise standard deviation. (Top-right) Robustness plot:
average frequency of convergence against noise. (Bottom-left) Efficiency plot: average
convergence time against noise standard deviation. Efficiency plot: (Bottom-right)
average number of iterations against noise.
Datasets DS2–DS6  These datasets contain the experiments that represent the
tracking problem—i.e., those experiments that are not initialized at the position
where the efficient Jacobian is computed. We show the results in Figures 8.27–8.31.
As these results are similar for all the datasets, we summarize their interpretation
in the following.
Figure 8.27: Results from dataset DS2 for face-deform. (Top-left) Accuracy plot:
average reprojection error against noise standard deviation. (Top-right) Robustness plot:
average frequency of convergence against noise. (Bottom-left) Efficiency plot: average
convergence time against noise standard deviation. Efficiency plot: (Bottom-right)
average number of iterations against noise.
LK is consistently more accurate than the HB algorithm across all datasets: the differences
in reprojection error are small for low noise; however, for high noise variance, HB
almost doubles the reprojection error of LK.
Figure 8.28: Results from dataset DS3 for face-deform. (Top-left) Accuracy plot:
average reprojection error against noise standard deviation. (Top-right) Robustness plot:
average frequency of convergence against noise. (Bottom-left) Efficiency plot: average
convergence time against noise standard deviation. Efficiency plot: (Bottom-right)
average number of iterations against noise.
The results for frequency of convergence show that LK is more robust than HB: convergence
for the LK algorithm is around 100% even for high noise variance, whereas convergence
for HB ranges from 60% in DS2 to 10% in DS6. However, notice that the convergence
of both algorithms is similar for low noise variance—i.e., σ ≤ 1.5.
Figure 8.29: Results from dataset DS4 for face-deform. (Top-left) Accuracy plot:
average reprojection error against noise standard deviation. (Top-right) Robustness plot:
average frequency of convergence against noise. (Bottom-left) Efficiency plot: average
convergence time against noise standard deviation. Efficiency plot: (Bottom-right)
average number of iterations against noise.
Efficiency comparisons are mostly favourable to the HB algorithm: the average time per
iteration is 0.023 seconds for HB and 0.178 seconds for LK. The lower time per
iteration compensates for the higher number of iterations of HB with respect to LK: even
when iterating 60% more times to converge, the total time for the HB algorithm is typically seven
times lower than for LK.
Figure 8.30: Results from dataset DS5 for face-deform. (Top-left) Accuracy plot:
average reprojection error against noise standard deviation. (Top-right) Robustness plot:
average frequency of convergence against noise. (Bottom-left) Efficiency plot: average
convergence time against noise standard deviation. Efficiency plot: (Bottom-right)
average number of iterations against noise.
Summarizing, these experiments confirm that the LK algorithm is more robust than
HB for the face-deform model. We conjecture that the HB optimization with deformation
parameters is more sensitive to noise than the HB algorithm for 6-dof tracking; we
support this claim by noting that the HB algorithm has perfect convergence for
low noise in all datasets. On the other hand, HB is more efficient than LK for all the
studied cases: on average, the HB algorithm is more than five times faster than LK.
Figure 8.31: Results from dataset DS6 for face-deform. (Top-left) Accuracy
plot: average reprojection error against noise standard deviation. (Top-right) Robustness
plot: average frequency of convergence against noise. (Bottom-left) Efficiency plot:
average convergence time against noise standard deviation. (Bottom-right) Efficiency
plot: Average number of iterations against noise.
8.5.4 Experiments With Nonrigid Sequence
The results from the previous section showed that the HB algorithm was less accurate and robust
than LK in those cases where the optimal parameters were far from the initial
guess of the optimization—i.e. for high levels of noise. At first glance, from these
results we might conclude that the HB algorithm is not suitable for nonrigid tracking.
We vindicate the HB algorithm by means of a challenging experiment.
Figure 8.32: face-deform sequence. Selected frames from the sequence generated for
the target face-deform. We show approximately one frame in every fifteen. The target
displays strong rotations around the X- and Y-axes, and several deformations.
We generate a synthetic sequence using the face-deform model (see Figure 8.32).
The sequence is 470 frames long, and it shows the model alternately rotating about
the X- and Y-axes while performing several facial expressions: frowning, grimacing,
grinning, raising the eyebrows, and opening the mouth.
Figure 8.33: Results from the face-deform sequence. Selected frames from the face-deform
sequence processed using HB. Blue dots: the vertices of the model projected onto
the image using the estimated parameters. Pink dots: estimated projections of
selected regions of the face: eyebrows, lips, jaw and nasolabial wrinkles.
We show the results from processing the sequence using HB in Figure 8.33. We
overlay the model projection onto the frames using the estimated values; the results
are visually accurate and no drift is noticeable. We confirm this perception by plot-
ting the estimated values against the ground truth values that we used to generate
the sequence in Figure 8.34.
Figure 8.34: (Top-row) Estimated vs. ground-truth rotation parameters. We denote
ground-truth parameters as ∆GT, and estimated parameters as ∆EST, where ∆ = α, β, γ.
(Bottom-row) Estimated vs. ground-truth deformation parameters. We show the parameters
corresponding to the first four basis of deformation (i.e. b1 to b4). We denote
ground-truth parameters as c_i^GT, and estimated parameters as c_i^EST, i = 1, . . . , 4.
The estimated rotation results accurately match the ground-truth values.
We also compare some of the deformation parameters: we select the parameters
c1, . . . , c4, that is, the coefficients of the first four deformation basis b1 to b4.
We choose these parameters because they capture more than 80% of the energy in
the PCA factorization. The quantitative results for the deformation parameters seem less
accurate than those for the rigid motion. However, the impact of this estimation error
on the facial motion is small.
8.5.5 Experiments with real Rigid data
This experiment demonstrates the suitability of our algorithm for tracking a rigid object
in a real-world sequence. Real sequences are hard to process, as the essential assumptions
on which we base our algorithms do not strictly hold: the bcc (Equation 3.1)
is not a strict equality due to (1) lighting changes not included in the model, and
(2) numerical inaccuracies induced by camera discretization, quantization noise, and
aliasing.
Target Model  Besides, modelling and initializing the target is also less accurate
than in the synthetic case. Even for a very simple target—e.g. the cube used in
this experiment—it is difficult to establish a one-to-one correspondence between the
brightness values of our synthetic model and the images of the target in the sequence.
We accordingly require the target model to be (1) easy to build synthetically, and
(2) easy to put in correspondence with a single image of the sequence—the initialization
procedure.
For these reasons we choose the target to be a textured cube. We build
the cube by sticking six squares of rigid cardboard onto a wooden frame. Each
cardboard side has a piece of textured fabric stuck on top of it. The advantage of
using fabric for the texture pattern is that the material does not produce specular
highlights due to changes in the target orientation. We also stick a calibration
pattern on one side of the cube: we shall use this pattern to put the target image
and the synthetic model in correspondence. We also attach an aluminium rod to
the wooden frame to serve as a handle to freely rotate and translate the target
object. We display the target object in Figure 8.35. The cube-real model comprises
Figure 8.35: The cube-real model. (Left) The actual target made of cardboard and
fabric. The calibration pattern is used to initialize the registration algorithm. (Right)
Texture map corresponding to the unfolded cube target.
118,428 vertices arranged in 233,496 triangles. We encode the unwrapped texture
in a 564 × 690 rgb image (see Figure 8.35). As in the synthetic cube model, we
mark the vertices at the border of each side of the cube as “forbidden”—these points
will not be considered in the optimization.
Experiment Arrangement  We capture a 470-frame sequence in which we rotate
and translate the cube device using a handheld video camera (see Figure 8.36).
We use a Panasonic NV-GS400 3CCD DV camera, and we disable features such as
autofocus and automatic white balancing to avoid focal length or sudden lighting
changes. We compute the camera intrinsics using a planar calibration pattern and
Yves Bouguet's Camera Calibration Toolbox [Bouguet].
We capture the sequence by moving the cube across the scene using the rod
handle. To demonstrate the advantages of using an unwrapped texture, we rotate
and translate the cube about the three coordinate axes such that each side of the cube
is visible in the sequence at least once. We deliberately rotate the cube by large
amounts about the three axes to demonstrate the robustness of our method.
The initial guess for the algorithm is the set of rotation and translation parameters
such that the associated projection of the model is perfectly aligned with the first
image of the sequence. We compute this initialization by using Bouguet's calibration
program with the calibration pattern on the cube. We compute the off-line
matrices of algorithm HB3DTM with the initialization values (see Algorithm 7).
Results of the experiment
We show the results in Figure 8.37. For each frame of the sequence we obtain a
rotation matrix and a translation vector as the output of algorithm HB3DTM. We
normalize the brightness values of each side of the cube—in both the pixels recovered
from the sequence and the texture image—to minimize the effects of illumination
changes.
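The thesis does not detail the normalization; one simple choice consistent with the description is a per-side zero-mean, unit-variance normalization applied to both the texture template and the pixels recovered from the frame, e.g.:

import numpy as np

def normalize_brightness(values, eps=1e-8):
    # Zero-mean, unit-variance normalization of a vector of brightness values.
    # Applying it to both the template and the recovered pixels of a cube side
    # removes global gain and offset differences between them.
    values = np.asarray(values, dtype=float)
    return (values - values.mean()) / (values.std() + eps)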
We project the edges of the cube onto the image by using a projection matrix
assembled from these parameters (see Figure 8.37). The algorithm accurately computes
the position and orientation of the target; although the estimation degenerates
in some frames—e.g., frames 40, 160, or 380—the algorithm is able to produce accurate
solutions for most of them. Besides, the algorithm is also able to recover
from erroneous estimations in those frames where the target motion was too fast
or abrupt—see e.g., frames 260 or 380 in Figure 8.37. Note that we could improve
the performance of the algorithm by using a pyramid-based optimization [Bouguet,
2000]; however, we choose not to use such an approach in order to better analyze the behaviour
of the algorithm.
Figure 8.36: The cube-real sequence. Selected frames from the sequence cube-real.
We translate the target whilst it rotates around the axis defined by the handle rod. This
rotational motion involves the three rotation axes of the object. Finally, when a substantial
portion of the target is no longer visible in the image—i.e., the target leaves the
camera field of view—the algorithm stops its execution.
Figure 8.37: Results from cube-real sequence. Selected frames from the sequence
cube-real processed using HB (Algorithm 7). For each frame we compute the rotation
matrix and the translation vector that best registers the model to the image intensity
values. We use these parameters to project the cube wireframe model onto the image
(blue lines).
Figure 8.38: Selected facial scans used to build the model. Each scan is a three-
dimensional textured mesh that represents a facial expression. These 3D meshes are
computed from three views of the subject by using reconstruction algorithms based on
structured light.
8.5.6 Experiment with real Nonrigid data
In this experiment we show the performance of algorithm HB3DMM when registering a
deforming human face as it changes its expression. The algorithm faces the same
challenges as in the real rigid case—that is, deviations between the sequence and
the model caused by illumination or camera quantization—plus those derived from
the nonrigid nature of the target: it is remarkably difficult to accurately model the
deformations of the nonrigid target.
Target Model The face model should capture the maximum variability of the
target structure for each facial expression—joy, disgust, fear, etc. We use PCA to
provide a set of deformation basis that represent the information contained in facial
motion. Professor Thomas Vetter and his team provided us with a complete face
model with expressions³. The model was built from 88 structured-light 3D scans
of the author’s face performing different expressions—joy, sadness, surprise, anger,
winking, etc—as shown in Figure 8.38. These scans are aligned into a common
reference system by using a semi-automatic procedure that uses manually selected
face landmarks. The basis of deformation are computed by applying a PCA
³ The author gratefully thanks Pascal Paysan and Brian Amberg for the scanning session and
the construction of the models.
Figure 8.39: Unfolded texture model. (Left) Spherical coordinates (θ, φ, ρ) are
used to project the 3D shape onto a bidimensional reference space (θ, φ). (Right) The
template of the registration algorithm comprises the actual rgb values of the target texture
projected on the reference space (θ, φ).
procedure to the aligned scans. The mean of the resulting PCA model comprises 97,577
vertices arranged in 195,040 triangles. The basis of deformation are the 88 principal
components computed from the PCA.
The original model has a number of physical details such as the tongue, the eyeballs and
eye sockets, the back of the head, and the neck. As we do not need such details for our
algorithms, we strip them out by deleting their corresponding meshes from the
model. We project the model colour by using a cylindrical projection to render the
texture onto the reference space (see Figure 8.39).
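As a rough sketch of such a projection (the axis convention and the normalization are assumptions, not the exact mapping used to build the texture):

import numpy as np

def cylindrical_unwrap(vertices):
    # vertices : (V, 3) array of points, assumed roughly centred on the
    #            vertical (y) axis of the head.
    x, y, z = vertices[:, 0], vertices[:, 1], vertices[:, 2]
    theta = np.arctan2(x, z)                     # angle around the vertical axis
    u = (theta + np.pi) / (2.0 * np.pi)          # horizontal texture coordinate
    v = (y - y.min()) / (y.max() - y.min() + 1e-12)   # vertical texture coordinate
    return np.stack([u, v], axis=1)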
Experiment Arrangement  We capture a sequence of the face performing both
rigid and nonrigid motion: the rigid motion consists of rotating the head out of
the image plane by about 60°, and then rotating the head inside the image plane; the
nonrigid motion consists of opening the mouth and raising the eyebrows (see
Figure 8.40). We capture the 210-frame sequence by using a Panasonic DV
handheld camera mounted on a fixed tripod. We illuminate the scene with
two halogen lamps located above and below the face. We do not use any facial make-up
or special lighting to avoid specular spots on the target. We provide an initial
guess for the registration algorithm by fitting the morphable model to the first image
of the sequence: we compute the rotation and translation of the morphable model
such that the difference between its projected texture and the frame brightness is
minimal. We ease the procedure by defining anchor points, that is, correspondences
between some vertices of the model and some pixels of the initial frame. We define
a total of 18 anchor points, including the corners of the eyes and mouth, the nostrils, the tip
Figure 8.40: The face-real sequence. Selected frames from the sequence of the face
performing rigid and nonrigid motion. The head rotates to its left around its vertical axis,
then nods from left to right, and finally the mouth opens and the forehead wrinkles with
the rise of the eyebrows.
Figure 8.41: Anchor points in the model. We plot the anchor points used to manually
fit the model to the image (blue circles) on both the reference template (Left) and the
model shape (Right).
of the nose, and tips of the ears (see Figure 8.41).
Moreover, the illumination conditions drastically change the brightness of the face
image with respect to the texture of the morphable model. Light sources—e.g., the
ceiling lamp—create specular highlights on a glossy surface such as human skin, and
non-cast shadows, as some parts of the face are more illuminated than others. However, the
texture of the morphable model represents the light reflected by the skin surface, so
shadows are not considered. We modify the texture colour of the morphable model
by using the illumination basis provided by spherical harmonics [Basri and Jacobs,
2001; Ramamoorthi, 2002]. Using these illumination basis we adjust the texture of
the morphable model to be as similar as possible to the pixel values of the projected
face.
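The general idea can be sketched as follows: evaluate the first nine spherical-harmonics basis functions at the vertex normals and solve a linear least-squares problem for the lighting coefficients that make the relit model texture match the observed pixel values. The code below is only a sketch of this idea (basis functions up to constant factors, single colour channel); the exact basis and fitting procedure follow [Basri and Jacobs, 2001; Ramamoorthi, 2002].

import numpy as np

def sh_basis(normals):
    # First nine spherical-harmonics basis functions evaluated at unit normals
    # (up to constant factors).
    nx, ny, nz = normals[:, 0], normals[:, 1], normals[:, 2]
    return np.stack([np.ones_like(nx), nx, ny, nz,
                     nx * ny, nx * nz, ny * nz,
                     nx ** 2 - ny ** 2, 3.0 * nz ** 2 - 1.0], axis=1)

def relight_texture(texture, normals, observed):
    # texture  : (N,)   model albedo at the visible vertices (one colour channel)
    # normals  : (N, 3) unit normals at those vertices
    # observed : (N,)   pixel values of the projected face in the current frame
    B = sh_basis(normals) * texture[:, None]     # lighting acts on the albedo
    coeffs, *_ = np.linalg.lstsq(B, observed, rcond=None)
    return B @ coeffs                            # relit texture values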
Our fitting process also considers visibility issues of the morphable model. We
remove from the model those areas which may be problematic for the optimization,
such as the lower jaw, the back of the head, the neck, the ear canals, and the nostrils.
We also remove the eyes, as winking produces a sudden change of texture in those
areas. We incorporate the aforementioned issues into our fitting process by using
nonlinear reweighted least-squares. The result is a projection matrix that best fits the
morphable model to the image frame.
Results of the Experiment  We iteratively apply Algorithm 7 to the frames of
the sequence. We show the results in Figure 8.42. The algorithm accurately recovers
the face rotation around the Y-axis; notice that we have restricted the rotation to
≈ 60 degrees, as the experiment in Section 8.5.2 suggested that the face texture may
Figure 8.42: Results for the face-real sequence. We show selected frames from
the face-real sequence processed with the HB algorithm (see Algorithm 7, page 64). For
each frame of the sequence, the algorithm computes a rotation matrix, a translation vector
and a set of deformation coefficients. We use these parameters to project the shape model
onto the image—blue dots.
degenerate for strong rotations. Nonetheless, the algorithm is able to track the
morphable model. Finally, the algorithm recovers the nonrigid motion of the mouth
and forehead due to the change of expression. In summary, the algorithm is able to
recover the face motion and deformation with small drift.
8.6 Compositional Algorithms
In this section we present the experiments that we conducted to evaluate compositional
registration algorithms. We organize this section as follows: we introduce the
experimental hypotheses, together with the algorithms, in Section 8.6.1, and we perform
synthetic experiments involving a rigid 3D target in Section 8.6.2.
8.6.1 Experimental Hypotheses
As for the additive approach, we select some compositional algorithms to validate
the experimental hypotheses. Again, we verify that those algorithms that fulfil Requirements
2 and 3 should have better convergence than those that do not. We
select the following compositional algorithms: Inverse Compositional homography
(ICH8), Generalized Inverse Compositional homography (GICH8), Inverse Compositional
plane-induced homography (ICH6), Generalized Inverse Compositional plane-induced
homography (GICH6), Inverse Compositional rigid body transformation
(IC3DRT), and Forward Compositional plane+parallax plane-induced homography
(FCH6PP). For the sake of comparison we also evaluate two additive LK algorithms: one
for 8-dof homographies and one for 6-dof homographies. We also include algorithm
HB3DTM (see Table 8.4) to compare timing and convergence results. We summarize
the evaluated algorithms in Table 8.9.
Experiments with the EFC algorithm  Note that we do not include the EFC
algorithm in the comparison. The reason is that the numerical results of EFC and
IC are exactly identical for all the experiments—as we theoretically demonstrated
in Chapter 6; thus, we do not plot the results of EFC for ease of visualization.
We test the compositional algorithms from Table 8.9 to validate the following
hypothesis: the convergence of compositional algorithms depends on the compliance
with their requirements. If the requirements hold, then the convergence should be good;
otherwise, convergence problems should arise.
8.6.2 Experiments with Synthetic Rigid data
This set of experiments studies the convergence of the evaluated algorithms in a controlled
environment. The synthetic datasets provide us with precise measurements
of the outcomes of the algorithms.
Table 8.9: Evaluated Compositional Algorithms
Algorithm Warp Update Constant Req. 2 Req. 3
LKH8P 8-dof homography Additive No — —
ICH8 8-dof homography Compositional Yes YES YES
GICH8 8-dof homography Additive Partially YES YES
LKH6 6-dof homography Additive No — —
ICH6 6-dof homography Compositional Yes NO YES
GICH6 6-dof homography Compositional Yes NO YES
FCH6PP 6-dof Plane+Parallax Compositional No YES —
IC3DRT 6-dof in R³ Compositional Yes YES NO
HB3DTM 6-dof homography Additive Partially NO YES(1)
(1) Strictly speaking, HB3DTM does not hold Requirement 3 but Requirement 1; cf. Section 6.2.2, Page 85.
Target Model  For our experiments with compositional algorithms we use a textured
model of a 3D plane. The key point of using a simple plane is that it can
be registered using either 2D or 3D warps.
The plane model comprises 10,201 vertices organized in 20,000 triangles (see
Figure 8.43). We represent the target texture as a collection of RGB triplets, one
per vertex. The texture depicts four Gaussian-based gradient patterns: smooth
gradients are more suitable for direct image registration than high-frequency textures
[Benhimane et al., 2007].
Figure 8.43: The plane model. (Left) The model is a textured triangle mesh. Blue
lines: Triangle mesh, downsampled here for a proper visualization. (Right) Planar tex-
ture map of the model. The texture represents four gradient patterns based on a Gaussian
distribution.
Experiments with plane model
We generate a collection of experiments using the procedure described in Section 8.3.
We show the ranges of motion parameters that we use to generate the datasets in
Table 8.10.
Table 8.10: Ranges of motion parameters for each dataset.
Dataset α β γ tx ty tz
Registration DS1 0 0 0 0 0 0
Tracking
DS2 [0,10] [0,10] [0,10] [0,10] [0,10] [0,10]
DS3 [10,20] [10,20] [10,20] [10,20] [10,20] [10,20]
DS4 [20,30] [20,30] [20,30] [20,30] [20,30] [20,30]
DS5 [30,40] [30,40] [30,40] [20,30] [20,30] [20,30]
DS6 [40,50] [40,50] [40,50] [20,30] [20,30] [20,30]
Figure 8.44: Distribution of Synthetic Datasets. We select different samples from
each dataset. Datasets range from DS1 (Top) to DS6 (Bottom), according to Table 8.10.
Top-left image represents the position where we compute the Jacobian for efficient meth-
ods. Notice that the successive samples increasingly depart from this location.
Table 8.11: Average reprojection error vs. noise for plane.
σ 0.50 1.00 1.50 2.00 2.50 3.00 3.50 4.00 4.50 5.0
DS1 2.43 4.79 7.19 9.71 11.91 14.64 16.37 19.21 21.09 23.61
DS2 2.45 4.70 7.46 9.45 11.86 14.51 16.53 18.72 21.37 24.21
DS3 2.43 4.81 7.21 9.49 11.87 14.43 16.84 18.99 21.27 23.60
DS4 2.34 4.77 7.08 9.70 11.94 13.75 16.06 18.90 21.21 24.09
DS5 2.55 4.96 7.51 10.04 12.68 15.36 17.40 20.02 22.58 24.98
DS6 2.50 4.97 7.20 9.81 12.04 15.20 16.43 20.46 22.67 24.23
As in the experiments with the face model, we show the average initial reprojection
error in Table 8.11. As expected, the higher the noise, the larger the
reprojection error. Moreover, for a given level of noise the average reprojection error is
similar across datasets. We present the results of the
experiments in the following.
Dataset DS1  We present the results for dataset DS1 in Figure 8.45. The reprojection
error plot shows that all the evaluated algorithms have similar accuracy. The
frequency-of-convergence plot shows that algorithms with a constant Jacobian—the IC-
and GIC-based algorithms—converge less often than those algorithms that recompute
the Jacobian, such as FC, LK, or HB: for noise with a
standard deviation above σ = 3.5 (≈ 16 pixels of error), compositional-based
methods converge in fewer than 80% of the trials. Moreover, the convergence of those algorithms that
do not hold Requirement 2—i.e., ICH6 and GICH6—is up to 20 percent worse than
that of the other compositional algorithms. However, algorithm IC3DRT—which does not
hold Requirement 3—has better convergence than algorithms ICH8 and GICH8—which
hold both requirements.
The relationship between timing results and noise behaves as expected:
as the initialization error increases, the algorithms need more iterations to
reach the optimum. Absolute timing results show that inverse compositional methods
are faster than forward compositional or LK-based methods. However, inverse
compositional algorithms that do not satisfy Requirement 2—ICH6 and GICH6—or
Requirement 3—IC3DRT—spend more iterations than those that hold the requirements:
GICH6 iterates up to twice as much as FCH6PP and ICH8, whereas the number
of iterations for IC3DRT is just slightly higher. The plots also show an interesting
result: LK-based methods iterate more times than compositional methods, especially
in low-noise cases. We conjecture that for homography-based tracking, composition
is more natural and better conditioned. Thus, the descent directions of the Jacobian
for additive algorithms show a 'zigzag' pattern near the optimum, and the algorithm
needs more iterations to compute a local minimum with the desired accuracy. This is
known to be a problem with gradient-descent methods and poorly conditioned convex
problems. Notice that LKH8 needs more iterations as its search space has more
dimensions than that of LKH6.
Figure 8.45: Results from dataset DS1 for plane. (Top-left) Accuracy plot: av-
erage reprojection error against noise standard deviation. (Top-right) Robustness plot:
average frequency of convergence against noise. (Bottom-left) Efficiency plot: aver-
age convergence time against noise standard deviation. (Bottom-right) Efficiency plot:
Average number of iterations against noise.
Dataset DS2  We present the results for dataset DS2 in Figure 8.46. All the
algorithms show similar accuracy for those cases that converged—the maximum
divergence is under 0.005. The frequency of convergence decreases with the noise
variance for those algorithms that efficiently compute the Jacobian. The convergence
worsens for those algorithms that do not hold Requirement 2, especially algorithm
GICH6: for noise values above σ = 4.0, the algorithm converges in 40–20% of the
trials against 60–40% in DS1. The convergence of ICH6, IC3DRT, ICH8, and GICH8
is similar to that in DS1. However, the timing results for the LK algorithms are
more coherent than those in DS1: the computation time grows linearly with the noise
variance.
Figure 8.46: Results from dataset DS2 for plane. (Top-left) Accuracy plot: av-
erage reprojection error against noise standard deviation. (Top-right) Robustness plot:
average frequency of convergence against noise. (Bottom-left) Efficiency plot: aver-
age convergence time against noise standard deviation. (Bottom-right) Efficiency plot:
Average number of iterations against noise.
Dataset DS3 Figure 8.47 shows the results for dataset DS3. Again, the accuracy
of the algorithms is very similar for those cases that converged. The frequency
of convergence decreases with the noise variance, and shows noticeable differences
between those algorithms that verify Requirements 2 and 3 and those that do not.
Algorithm GICH6 practically does not converge for noise above σ = 4.0, and the convergence of algorithms ICH6 and IC3DRT decreases by up to 40%. The convergence
of the remaining algorithms is very similar to previous datasets. Even worse, for
those cases where GICH6 converged, the optimization reached the maximum number
of iterations on average. Timing results are consistent with previous datasets.
Figure 8.47: Results from dataset DS3 for plane. (Top-left) Accuracy plot: av-
erage reprojection error against noise standard deviation. (Top-right) Robustness plot:
average frequency of convergence against noise. (Bottom-left) Efficiency plot: aver-
age convergence time against noise standard deviation. (Bottom-right) Efficiency plot:
Average number of iterations against noise.
Dataset DS4 We present the results for dataset DS4 in Figure 8.48. The plot
shows the gap between those algorithms that hold the requirements and those that
do not. We leave algorithm GICH6 out of the comparison as its convergence quickly degrades: the algorithm converges only 10% of the time for σ = 0.5 and does not converge for any trial with noise σ ≥ 3.5. Algorithm ICH6 converges less than half as often as in previous datasets—less than 20% in DS4 against around 40% in DS1–DS3. Convergence results for algorithm IC3DRT range from 80 to 15%. For these two algorithms, even for noise with variance σ = 0.5, the frequency of convergence is approximately 80%, in contrast to 100% convergence for the remaining algorithms. Accuracy results are worse than those of datasets DS1-DS3, although the results are similar among all the algorithms, with ICH6 and IC3DRT showing the lowest reprojection error. Notice that this estimate is not accurate, as the reprojection error is computed using 10% of the samples, instead of 100% in the case of the LK algorithm. The convergence problems are also reflected in the number of iterations: algorithms ICH6 and IC3DRT iterate three times more than the other algorithms.
Figure 8.48: Results from dataset DS4 for plane. (Top-left) Accuracy plot: av-
erage reprojection error against noise standard deviation. (Top-right) Robustness plot:
average frequency of convergence against noise. (Bottom-left) Efficiency plot: aver-
age convergence time against noise standard deviation. (Bottom-right) Efficiency plot:
Average number of iterations against noise.
Dataset DS5 We present the results for dataset DS5 in Figure 8.49. We leave
algorithm GICH6 out of the comparison as it has bad convergence: the convergence
is below 5% for the lowest noise, and it does not converge at all for σ ≥ 2.5.
The convergence of algorithm IC3DRT (40 to 3% of the optimizations successfully converge) is approximately half that of dataset DS4—where convergence is 80 to 15%, cf. Figure 8.48. Also, the convergence of algorithm ICH6 degrades even further: the convergence range drops from 80–15 percent for dataset DS4 to 10–1 percent for dataset DS5. Moreover, the convergence of all algorithms decreases in this dataset—even the LK and FC algorithms perform worse than in previous datasets. The cause may be the difficulty of registering a plane that is skewed with respect to the camera (see Figure 8.44). Notice that the number of iterations the algorithms need to converge increases accordingly. Accuracy is also affected: the reprojection error increases by more than 100 percent with respect to dataset DS4.
Figure 8.49: Results from dataset DS5 for plane. (Top-left) Accuracy plot: av-
erage reprojection error against noise standard deviation. (Top-right) Robustness plot:
average frequency of convergence against noise. (Bottom-left) Efficiency plot: aver-
age convergence time against noise standard deviation. (Bottom-right) Efficiency plot:
Average number of iterations against noise.
Dataset DS6 Figure 8.50 shows the results from the experiments for dataset
DS6. We discard algorithms ICH6 and GICH6 as they do not converge for any of
the 60,000 experiments—i.e., their reprojection error is always above 5.0 pixels. We also leave algorithm IC3DRT out of the comparison, as its convergence is at most 3% for the lowest noise values and it does not converge at all for noise σ ≥ 2.0. The frequency of convergence of the remaining algorithms is worse than in DS5 due to larger rotations of the plane with respect to the camera—see Figure 8.44. The accuracy of all the algorithms is similar to the results in dataset DS5. However, the
timing results are worse: the number of iterations—together with the computation
time—increase with respect to previous datasets. The larger the noise variance, the
higher the number of iterations that the algorithm requires to converge.
Figure 8.50: Results from dataset DS6 for plane. (Top-left) Accuracy plot: average reprojection error against noise standard deviation. (Top-right) Robustness plot: average frequency of convergence against noise. (Bottom-left) Efficiency plot: average convergence time against noise standard deviation. (Bottom-right) Efficiency plot: average number of iterations against noise.
Summarizing, all algorithms show similar accuracy, although those with fewer parameters—e.g., LKH6 and FCH6PP—have lower reprojection error. Efficient algorithms have lower convergence rates than algorithms that recompute the Jacobian at each iteration. Moreover, those algorithms that meet neither the compositional requirements—i.e., ICH6 and GICH6—nor the gradient requirements—i.e., IC3DRT—systematically have lower convergence rates than those that hold Requirements 2 and 3. Finally, algorithms that recompute the Jacobian, such as LK or FC, have higher convergence time than the others; although algorithm FC iterates few times, it has the highest time per iteration—cf. Figure 8.51.
8.7 Discussion
This section draws some conclusions from the results of the experiments with the
cube, face-deform and plane models.
Comparison to Previous Works The proposed experiments are related to some
relevant previous works such as [Baker and Matthews, 2004; Brooks and Arbel, 2010].
These works only consider the registration problem, whereas we have also studied
the tracking problem. Typically, the template is a fixed square region of the image
where the Jacobian is computed—that is, the quadrangle is the projection of the target at µ∗ = µJ. Then, the initial guess for the optimization is computed by perturbing the corners of the quadrilateral with Gaussian noise. [Baker and Matthews, 2004] modifies the corner locations of the ground-truth quadrangle with noise of variance up to σ = 10, which approximately represents an upper-bound error in corner location of 30 pixels (≈ 3σ). [Brooks and Arbel, 2010] uses a similar procedure, although the initial reprojection error ranges from 2 to 80 pixels. In our experiments the Gaussian noise directly modifies the parameters µ∗, and not the target projections; thus, the noise variance σ in both methods is not equivalent. Nonetheless, we may still compare both methods in image coordinate space: our experiments have an average error of 25 pixels for the highest noise values—cf. Table 8.11.
By considering multiple datasets, we also study those cases where µ∗
may differ
from µJ—i.e., the tracking case. Results show that algorithms that do not hold their
corresponding warp requirements are not suitable for tracking but may be eligible
for image registration. Moreover, results also show that efficient algorithms with
constant Jacobian have worse convergence than those that recompute the Jacobian
at each iteration. Using an experimental methodology similar to that in [Baker and
Matthews, 2004; Brooks and Arbel, 2010] would not allow us to obtain such a result.
Convergence of Algorithm HB Depends upon Gradient Equivalence Al-
gorithm HB approximates the actual Jacobian computed at each frame by another
one that is semi-constant. Requirement 1 states that the quality of the Jacobian
approximation is directly related to the GEE: we use the GEE to build the semi-
constant Jacobian that speeds up the algorithm. Thus, if the GEE is satisfied, the
approximated Jacobian shall be identical to the true one, and the convergence shall
not be affected. However, if the GEE does not hold, the approximation may induce
errors during the optimization—thus leading to poor convergence.
Algorithm HB3DRT does not hold Requirement 1, and its convergence deteriorates as a result: the convergence is good for dataset DS1, but it gradually degrades as the noise and the differences between µJ and µ0 increase—e.g., for datasets DS3-DS6 the algorithm HB3DRT converges approximately between 80 and 15 percent of the time. On the other hand, algorithm HB3DTM—which holds Requirement 1—converges at least 80% of the time in the worst case—i.e., DS6 with noise σ = 5.0. Thus,
Requirement 1 is directly related to the convergence of HB-based algorithms.
Convergence of Algorithm IC Depends upon Gradient Equivalence and
Composition Algorithm IC approximates the actual Jacobian by another one
which is efficiently computed. As in the HB case, the quality of the approximation
directly depends on the compliance of Requirements 2 and 3: if both Require-
ments hold, then the approximation shall be accurate, and the optimization shall
successfully converge. We confirm these hypotheses by using the results from the
experiments.
Algorithms ICH8 and GICH8 hold both requirements and have similar behaviour.
Accuracy is good for all datasets, and the convergence is medium for all datasets
except DS6, for which it is bad. However, these results are consistently better than
those algorithms that do not hold the requirements, such as IC3DRT, ICH6 and GICH6.
Algorithms ICH6 and GICH6 only hold Requirement 3, as plane-induced homogra-
phies are not closed under composition. Notice that, although the parameter update
for GIC-based algorithms is additive and not compositional, we require a composi-
tional warp in GICH6 by construction (the warp composition is used in the change of
variables, see Section 6.1.1). Accuracy is good in both algorithms for all datasets,
but the convergence differs between the two: the convergence of ICH6 is medium to bad, whereas the convergence of GICH6 can be described as bad for all datasets. Even for the registration case—dataset DS1—the convergence is borderline bad when high initialization noise is present (see Figure 8.45). Results show that the convergence of IC and GIC algorithms is noticeably worse for the plane-induced homography than for the 8-dof homography. This result demonstrates that compliance with the warp requirements determines the convergence of the algorithm. Besides, there is a noticeable
difference between results for IC and GIC that may be explained by the numerical
approximation to the matrix ∇µψ(µt) of Equation 6.32—results in Figure 8.45 for
low levels of noise appear to be good.
Algorithm IC3DRT holds only Requirement 2, as rigid body transformations in R³ do not verify the GEE. The algorithm shows good accuracy for those cases that converge. The convergence of the algorithm is medium for datasets DS1 and DS2: the results are similar to those of algorithms ICH8 and GICH8. However, the results for datasets DS3 and DS4 are significantly worse, and the algorithm practically does not converge for datasets DS5 and DS6; hence, IC3DRT is eligible for registration but not for tracking.
Finally, we examine algorithm FCH6PP, which only holds Requirement 3: plane+parallax-induced homographies do not verify the GEE (see Table 4.1). However, as FC only requires the warp to be closed under composition, results for FCH6PP show good convergence and accuracy for all datasets. Moreover, FC converges in fewer iterations than the equivalent LK for the plane-induced homography—although the total optimization time is higher. We speculate that, for this model, composition is more natural and better conditioned than the usual GN approach.
Requirements Determine Behaviour for Efficient Algorithms We have shown that the performance of efficient algorithms depends upon compliance with their requirements—Requirement 1 for the HB algorithm, and Requirements 2 and 3
for IC and GIC. We have also made a distinction between registration and tracking
depending on the parameters µ0 and µJ (cf. Section 8.1).
For registration, compliance with the requirements is not a determining factor: algorithms ICH6 and GICH6 have good convergence for datasets DS1 and DS2
(at least for low noise) even when they do not hold Requirement 2—cf. Figures 8.45
and 8.46. Moreover, algorithm IC3DRT does not hold Requirement 3, and the con-
vergence results are even better than those for algorithms ICH6 and GICH6—cf.
Figures 8.45 and 8.46. Additive algorithms have similar performance: algorithm
HB3DRT does not hold Requirement 1, but has good convergence for datasets DS1
and DS2—cf. Figures 8.13 and 8.14.
For tracking, compliance with the requirements is fundamental for the proper convergence of the algorithms: ICH6 and GICH6 (which do not hold Requirement 2) and algorithm IC3DRT (which does not hold Requirement 3) have bad convergence for datasets DS3 to DS6—i.e., those that represent the tracking problem, cf. Figures 8.47–8.50. The convergence especially degrades for dataset DS6, where almost none of the optimizations successfully converged. The additive algorithms behave similarly: HB3DRT does not hold Requirement 1 and its convergence quickly degrades through datasets DS3 to DS6—cf. Figures 8.15–8.18. However, algorithms that
hold their requirements—such as ICH8, GICH8, and HB3DTM—have better conver-
gence, even for the challenging dataset DS6.
Efficient algorithms greatly depend on an approximation to the GN Jacobian
matrix: we substitute the actual Jacobian J(µt) by one fixed at µJ, J(µJ), or we
approximate J(µt) by computing J(µt) = SM(µt). In registration problems, we
assume that µJ ≡ µt, so the approximated Jacobian is similar to the actual one, but
in tracking problems, the difference between µJ and µt may be arbitrarily large. The
requirements determine the accuracy of the approximation to the actual Jacobian.
If the requirements do not hold, the approximated Jacobian is still similar to the actual one in the registration case—i.e., J(µJ) ≡ J(µt); however, the approximated
Jacobian for the tracking case is quite different when the requirements do not hold,
which results in a degraded convergence for those cases.
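To make the role of this approximation concrete, the following minimal sketch (in Python, with hypothetical residual and jacobian callables) contrasts the two strategies: an LK/FC-style loop recomputes J(µt) at every iteration, whereas an efficient IC-style loop reuses a Jacobian frozen at µJ. The additive update is a simplification here, since the true IC update composes warps.

```python
import numpy as np

def gauss_newton_registration(residual, jacobian, mu0, mu_J=None,
                              constant_jacobian=False, max_iter=50, tol=1e-6):
    # residual(mu) -> (N,) brightness error vector e(mu)
    # jacobian(mu) -> (N, p) Jacobian of the residual at mu
    mu = np.asarray(mu0, dtype=float).copy()
    # Efficient algorithms freeze the Jacobian at the reference parameters mu_J;
    # LK/FC-style algorithms recompute it at the current mu_t on every iteration.
    J_fixed = jacobian(mu_J if mu_J is not None else mu) if constant_jacobian else None
    for it in range(max_iter):
        r = residual(mu)
        J = J_fixed if constant_jacobian else jacobian(mu)
        delta, *_ = np.linalg.lstsq(J, -r, rcond=None)   # Gauss-Newton step
        mu = mu + delta   # additive update; the true IC update composes warps instead
        if np.linalg.norm(delta) < tol:
            break
    return mu, it + 1
```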
Timing Results Comparison Figure 8.51 shows the time per iteration for the
algorithms used in the compositional experiments. For each algorithm we average the time per iteration over all 60,000 experiments. Results show that the IC algorithm is undoubtedly the fastest, with GIC ranking second by a small margin. The latter is a direct consequence of updating the constant IC Jacobian. The LK algorithms are about four times slower than their compositional counterparts—either 8 or 6-dof—as computing the Jacobian at each iteration is a costly process. Finally, FC is the slowest of the reviewed algorithms. The explanation is simple: its Jacobian computation involves more operations, as the derivative of the composed function must be evaluated at every iteration. We have also included the HB algorithm for the sake of comparison. Timing
results for HB are comparable to those of the GIC algorithm: about twice the time per
iteration of IC but still much faster than LK or FC. Notice that the final computation
time of an algorithm depends on (1) the number of iterations of the optimization
loop, and (2) the time per iteration.
Figure 8.51: Average Time per iteration. Colour bars show the average optimization
time over 60,000 experiments for each algorithm.
Comparison Between IC and HB We now compare the two efficient registration
techniques reviewed in this chapter. For a fairer comparison, we compare algorithms
that hold their respective requirements: ICH8 verifies both the GEE and the warp
composition, whereas HB3DTM holds the GEE. We evaluate two different homographic
warps to register the plane target: an 8-dof homography and a 6-dof plane-induced
homography. This would make a difference in accuracy or efficiency measures—as the number of parameters changes the shape of the optimization surface—but not in convergence.
Convergence for algorithm HB3DTM is good for datasets DS1-DS3 and medium for the rest, but it always converges at least 40% of the time. HB3DTM consistently performs better than ICH8 for all datasets, with the difference peaking at 15–20% for noise σ = 5.0. These results demonstrate that (1) HB is more robust than IC—convergence degrades less as the noise increases—and (2) HB is less local than IC—convergence is better for the later datasets.
Efficiency is closely contested, although HB performs slightly better in both total
time and number of iterations. Nonetheless, IC beats HB in time per iteration—
0.009 vs. 0.016 seconds, cf. Figure 8.51.
Algorithm GIC has the same Requirements as IC Results also show that
the behaviour of algorithms IC and GIC is exactly the same: the convergence of
both algorithms is good if Requirements 2 and 3 hold. Note that, although the
algorithm GIC additively updates the parameters, results from algorithm GICH6 show
poor convergence as the warp is not closed under composition—i.e., Requirement 3
does not hold. This may be seen as contradictory—an additive update needs a
compositional warp—however, it is a direct consequence of the IC change of variables
that is the basis of the GIC algorithm (cf. Section 6.3.1, Equation 6.26). Thus, if
the warp does not hold Requirement 3, the change of variables of Equation 6.26 is
not valid anymore, and the algorithm would optimize an erroneous cost function.
Nonetheless, it may be interesting to study the impact of using optimization methods such as BFGS and Newton-Raphson when the requirements are not fulfilled.
On the Robustness of Efficient Algorithms Results show that efficient algorithms—i.e., IC and HB—are less robust than LK or FC, even if their corresponding requirements hold. In theory, compliance with the requirements indicates that the
efficient approximations of the Jacobian—constant Jacobian for IC and factorized
Jacobian for HB—are accurate enough to provide a successful optimization.
However, the Jacobian matrix of both efficient algorithms is still an approxi-
mation to the actual gradient. Thus, when the initial guess is not within a small neighbourhood of the actual optimum, the procedure with the approximated Ja-
cobian converges less often. We examine the robustness of the efficient algorithms
by analyzing Figures 8.45–8.50. Results for algorithm ICH8 show that the frequency
of convergence decreases as the initialization noise increases. Moreover, the conver-
gence worsens for the successive datasets, as the initialization is increasingly different
from the reference template. Algorithm HB3DTM shows a similar behaviour, although
the frequency of convergence is better than the ICH8 case. However, both algorithms
display worse frequency of convergence than LK due to the Jacobian approximation:
over 90% for LK against 65–25% for ICH8 and 80–38% for HB3DTM.
We have tested the algorithm HB in both rigid and nonrigid real-world sequences.
Results for both sequences are similar: they show good convergence, although the
estimation of the parameters sometimes lacks accuracy (cf. Figures 8.42 and 8.37).
We may explain the accuracy problems by analyzing the behaviour of the algorithm with respect to noise: the HB algorithm is sensitive to the initialization. Moreover, scene illumination poses a challenge to the brightness constancy assumption, which results in inaccurate estimation of the target position or orientation.
A Proper Target Texture is Critical for Convergence Section Good Texture
to Track demonstrated that the convergence of the HB algorithm heavily depends on
the texture of the target (cf. Section 8.5.2–Page 139). The texture of the face model is especially troublesome in the case of rotations greater than 60° (cf. Figure 8.22). This
may explain the lack of robustness in the face-deform sequence (see Figures 8.26–
8.29).
Chapter 9
Conclusions and Future work
This chapter summarizes the contributions of the thesis and outlines feasible lines of future work.
9.1 Summary of Contributions
We highlight the following contributions of the thesis:
Survey of Existing Methods We have analysed the existing additive and com-
positional image registration algorithms in depth.
Gradient Equivalence Equation We have introduced the GEE as a differential
extension to the BCC. The GEE is crucial for the proper convergence of efficient
registration algorithms.
Fundamental Requirements for Convergence We have proposed some require-
ments on the motion model that efficient image registration algorithms must
satisfy to guarantee accurate results. Motion warps have different requirements
depending on the approach to the image registration: Requirement 1 for ad-
ditive algorithms, and Requirements 2 and 3 for compositional approaches.
Distinction Between Registration and Tracking We have introduced a differ-
entiation between registration and tracking for efficient algorithms: those effi-
cient algorithms that do not hold their requirements are valid for registration
but not for tracking.
Systematic Factorization Framework We have introduced lemmas and theo-
rems that systematize the factorization stage of the HB algorithm.
Efficient 3D Tracking using HB We have proposed two homography-based warps
for tracking 3D targets by using the HB algorithm: the shape-induced ho-
mography represents the rigid motion of a triangle mesh, and the nonrigid
shape-induced homography models both the rigid and nonrigid motion of
a deforming 3D target. We have provided the HB factorization schemes for
both warps by using the proposed systematic factorization procedure.
Efficient Forward Compositional Algorithm We have introduced a new com-
positional algorithm for image registration, the EFC, which is equivalent to the
IC. The EFC algorithm provides a new interpretation of IC which clearly ex-
plains the change of roles between target and template derivatives. Moreover,
the EFC does not require the warping function to be invertible.
9.2 Conclusions
We summarize some thoughts from the results of the experiments in the following
paragraphs.
Handbook of Motion Warps We introduce a classification of motion warps in terms of their suitability for efficient image registration/tracking. We gather information from Tables 4.1 and 6.1 to build the following classification—see Table 9.1:
• Warps in R². Motion warps in R²—affine, rotation-translation-scale, and homography—are suitable for every image registration algorithm.
• Warps in P². Homographies in P² are suitable warps for every image registration algorithm—efficient or not. However, 6-dof homographic warps—such as plane-induced and shape-induced homographies—do not comply with Requirement 3—i.e., these warps do not form a group. Thus, 6-dof homographies are not eligible for compositional algorithms. On the other hand, Plane+Parallax homographies do form a group, but they do not hold the GEE; thus, the Plane+Parallax homography is not eligible for efficient algorithms—either additive or compositional.
• Warps in R³. General rigid body motion does not hold the GEE—cf. Requirements 3 and 1; therefore, warps in R³ are generally not eligible for efficient algorithms.
Handbook of Efficient Algorithms We provide a classification of efficient image registration algorithms in terms of their suitability for different registration and tracking problems. Figure 9.1
shows the qualitative features for algorithms LK, IC, HB, FC, and GIC; we estimate
this qualitative information from the quantitative outcomes of the experiments in
Chapter 8. Using the values depicted in Figure 9.1 we infer the following classifica-
tion:
The most complete: The Lucas-Kanade algorithm is the one that achieves the best marks on almost every feature: it is the most robust and accurate, and it can be applied to any differentiable warp function. Also, it can be robustly used for both registration and tracking. However, its poor efficiency renders the algorithm unusable for real-time applications.
Table 9.1: Classification of Motion Warps.

Motion Warps                        | LK | FC | HB | IC | GIC
------------------------------------|----|----|----|----|----
Affine (R²)                         | ✔  | ✔  | ✔  | ✔  | ✔
Homography (R²)                     | ✔  | ✔  | ✔  | ✔  | ✔
Homography (P²)                     | ✔  | ✔  | ✔  | ✔  | ✔
Plane-induced Homography (P²)       | ✔  | ✘  | ✔  | ✘  | ✘
Shape-induced Homography (P²)       | ✔  | ✘  | ✔  | ✘  | ✘
Plane+Parallax Homography (P²)      | ✔  | ✔  | ✘  | ✘  | ✘
Rigid Body (R³)                     | ✔  | ✔  | ✘  | ✘  | ✘
Camera Rotation (R³)                | ✔  | ✔  | ✔  | ✔  | ✔
The most efficient: Inverse Compositional and Generalized Inverse Compositional are the fastest image registration algorithms reviewed. If the warp function complies with the requirements of Chapter 6, these algorithms offer good convergence at the best speed. However, even in these cases, algorithms with a constant Jacobian lack robustness. Moreover, the IC algorithm cannot track nonplanar 3D targets—but it may handle 3D objects in registration cases.
The most balanced: The Hager-Belhumeur algorithm has a perfect trade-off between speed and accuracy: it is not as accurate and robust as LK, but it is far more efficient; on the other hand, although HB is not as efficient as IC or GIC, it is more robust and can be applied to a wider range of motion warps—especially the plane-induced and shape-induced homographies for 3D tracking. Moreover, it converges better than IC for both registration and tracking problems.
Registration is not Tracking This thesis emphasizes the differences between
registration and tracking: in registration the initial guess in the optimization space
is close to the point where we compute the gradient, whereas in tracking the distance
in optimization space between the initial guess and the gradient parameters can be
arbitrarily large. We empirically show that efficient algorithms that do not hold
their requirements are suitable for registration but not for tracking.
This restriction is obvious for algorithms with a constant Jacobian, such as inverse compositional and its extensions. The critical distinction between registration and
tracking has been discussed here for the first time. Hence, inverse compositional
algorithms would have good convergence when the parameters of the target are
similar to those of the reference image; however, the algorithm may drift —and
accumulate reprojection error—as the target parameters diverge from the reference
ones.
Figure 9.1: Spiderweb Plots for Several Image Registration Algorithms. (Top-
right) Lucas-Kanade algorithm. (Top-left) Inverse Compositional algorithm. (Middle-
left) Hager-Belhumeur algorithm. (Middle-right) Generalized Inverse Compositional
algorithm. (Bottom) Forward Compositional algorithm. Legend: (A) Accuracy, (L)
Localness, (G) Generality, (E) Efficiency, and (R) Robustness.
This behaviour is shown clearly when tracking faces using AAMs [Matthews and
Baker, 2004]: the IC algorithm for AAMs does not hold Requirement 2 as AAMs do not
form a group. The algorithm has good convergence if the face does not move and it
is frontal to the camera—i.e., the parameters of the face are similar to those used to
compute the constant Jacobian; the convergence is also good when the face changes
expression but it is still frontal to the camera and centred; however, the convergence
decreases when the face translates from its central position.
3D Tracking Implies Non-Constant Jacobian By definition, a 3D target is
not totally visible from a single view—except, e.g., if the target is a plane. Efficient
algorithms like IC cannot precompute the Jacobian and the Hessian of 3D targets: the IC algorithm computes the Jacobian only at those target points visible at µJ. However, some points appear and others disappear due to self-occlusions and the relative motion
between the target and the camera; in these cases, the Jacobian must be partially
recomputed to handle the newly visible points. Also notice that the efficiency of the
IC algorithm greatly diminishes when the Jacobian is not constant (cf. Table 7.15–
Page 106).
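A minimal sketch of the kind of partial update this implies is given below; the data layout (a per-point row cache and visibility masks) is ours and only illustrative.

```python
import numpy as np

def jacobian_for_visible_points(J_cache, visible_now, visible_prev, jacobian_rows):
    # J_cache       : dict {point index: (p,) Jacobian row} filled in earlier frames
    # visible_now   : (N,) boolean visibility mask for the current frame
    # visible_prev  : (N,) boolean visibility mask when the cache was last updated
    # jacobian_rows : callable(indices) -> (len(indices), p) freshly computed rows
    newly_visible = np.flatnonzero(visible_now & ~visible_prev)
    if newly_visible.size > 0:
        fresh = jacobian_rows(newly_visible)      # recompute only the new rows
        for k, idx in enumerate(newly_visible):
            J_cache[idx] = fresh[k]
    # Stack the cached rows of the points visible in the current frame.
    return np.vstack([J_cache[idx] for idx in np.flatnonzero(visible_now)])
```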
In Defence of Additive Algorithms The Inverse Compositional algorithm has been synonymous with efficient registration/tracking since it was first published in [Baker and Matthews, 2001]. The rise of IC brought the fall of the existing efficient methods, especially additive ones such as [Hager and Belhumeur, 1998]. In this thesis we
vindicate HB as the most balanced registration/tracking algorithm, as:
1. HB can handle a wider range of motion warps: HB is able to track 3D rigid
and deformable objects (see Table 9.1), which is not possible when using the
IC algorithm—we have shown that IC is not correct for 3D tracking in Sec-
tion 4.2.2.
2. HB roughly performs as efficiently as IC when the Jacobian must be recomputed—
as in the case of 3D tracking, cf. Table 7.15–Page 106.
9.3 Future Work
We also suggest the following lines of investigation for future improvements of the
algorithms.
Illumination Model In this thesis we have assumed that the BCC is purely Lambertian—i.e., the texture of the target depends on neither its position nor its orientation. However, to be physically more accurate, the BCC should account for attached shadows: changes in texture due to the relative orientation of the target and the light source—i.e., side-lighting causes some facets of the object to be more illuminated than others. We propose to use spherical harmonics [Basri and Jacobs, 2003; Ramamoorthi, 2002] to model attached shadows. We display the spherical harmonics
of the model face in Figure 9.2.

Figure 9.2: Spherical Harmonics-based Illumination Model

We propose to handle the changes in illumination due to orientation by augmenting the BCC as follows:

$$\sum_{i=0}^{9} B_i[\mathbf{x}] = I_t[\mathbf{f}(\mathbf{x}; \boldsymbol{\mu}_t)], \quad \forall \mathbf{x} \in \mathcal{X},$$

where $B_0 = T$, and $B_i : \mathbb{R}^2 \rightarrow \mathbb{R}$ is the bi-dimensional brightness function corresponding to the i-th spherical harmonic basis computed from [Basri and Jacobs, 2003]. This equation may also be factorized as the usual BCC using the techniques proposed in the thesis—a similar problem involving the 2D homographic case was solved in [Buenaposada et al., 2009].
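The sketch below shows how the augmented constraint could be evaluated as a residual; the callables B, image and warp are placeholders, and in practice each spherical-harmonic basis would typically carry an illumination coefficient.

```python
import numpy as np

def sh_bcc_residual(B, image, warp, mu, points):
    # B      : sequence of 10 callables, B[0](x) = T[x] being the template and
    #          B[i](x) the i-th spherical-harmonic brightness basis at x
    # image  : callable(u) -> current image I_t sampled at coordinates u
    # warp   : callable(points, mu) -> warped coordinates f(x; mu)
    # mu     : current motion (and deformation) parameters mu_t
    # points : (N, 2) template coordinates x in X
    warped = warp(points, mu)                            # f(x; mu_t)
    target = image(warped)                               # I_t[f(x; mu_t)]
    sh_sum = np.sum([Bi(points) for Bi in B], axis=0)    # sum_i B_i[x]
    return sh_sum - target                               # zero when the augmented BCC holds
```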
Combine Texture and Edges In this thesis we have only used texture infor-
mation to perform the image registration. However, we could improve the regis-
tration/tracking by using features other than texture, such as edges [Decarlo and
Metaxas, 2000; Marchand et al., 1999; Masson et al., 2003; Vacchetti et al., 2004],
or even illumination cues [Lagger et al., 2008; Romdhani and Vetter, 2003]. Besides,
we could devise a factorization to include these terms using the techniques proposed
in the thesis.
Multi-view Registration/Tracking This thesis has presented only monocular tracking procedures. However, we can extend some of the procedures to work in multi-view environments. Using multiple cameras we could (1) estimate the parameters more robustly than with just one of them, and (2) extend the field of view, as several cameras may capture more information from the object (see Figure 9.4).

Figure 9.3: Tracking by simultaneously using texture and edge information

Moreover, multi-view tracking using HB is still efficient. Let P1, . . . , Pv be v
distinct cameras that capture a single target (see Figure 9.4). We assume that the
target is expressed in the scene coordinate system; hence, the motion parameters
are independent of each camera. We set up the following equation:

$$\begin{bmatrix} S_1^\top \\ \vdots \\ S_v^\top \end{bmatrix} M = \begin{bmatrix} e_1 \\ \vdots \\ e_v \end{bmatrix}, \qquad (9.1)$$
where S1, . . . , Sv and e1, . . . , ev are the constant factorization matrices and the error vectors that depend on the cameras P1, . . . , Pv. Notice that the matrix M, which depends on the target motion, is common to all the views.
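As an illustration, the stacked system (9.1) could be solved for the shared matrix M in the least-squares sense as in the following sketch (function and variable names are ours):

```python
import numpy as np

def solve_multiview_motion(S_list, e_list):
    # S_list : per-camera constant factorization matrices S_1, ..., S_v
    # e_list : per-camera error vectors e_1, ..., e_v
    A = np.vstack([S.T for S in S_list])          # stacked S_i^T of Equation (9.1)
    b = np.concatenate(e_list)                    # stacked error vectors e_i
    M, *_ = np.linalg.lstsq(A, b, rcond=None)     # least-squares estimate of M
    return M                                      # shared motion-dependent term
```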
Regularization of Gradient Equivalence Although fast, IC is especially limited in the choice of motion warp: no warp involving nonplanar 3D motion is allowed (cf. Table 9.1). IC places strict constraints on the target motion to (1) allow composition, and (2) hold the gradient equivalence. The non-compliance of the motion warp with
either of these requirements leads to a poor convergence of the algorithm.
Figure 9.4: Efficient tracking using multiple views.
Nonetheless, we can achieve a good convergence even when the requirements are
not met. [Amberg and Vetter, 2009] improves IC registration of AAMs—which do not
form a group—by using regularization. We could use a similar technique to enhance the convergence of IC for (1) those warps that do not hold Requirement 2—such as the plane-induced homography [Cobzas et al., 2009]—or (2) those warps that do not hold Requirements 1 or 3—such as the rigid body transformation [Muñoz et al., 2005].
Quantization of Parameter Stability In Chapter 8 we showed that the convergence of efficient algorithms depends upon holding certain requirements. However, how does the algorithm behave when the requirements are not satisfied? We have only verified this result experimentally. We propose to analytically study the convergence of the algorithms when using an inaccurate approximation to the Jacobian matrix. The idea is to compute some statistics on the results of the optimization—e.g., confidence intervals on each parameter, convergence or accuracy measures, etc.—given numerical or analytic information about the approximate Jacobian: for example, we would like to know for which ranges of 6-dof parameters in R³ the IC algorithm for the rigid body transformation shall converge more than 90% of the time.
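Until such an analytical study is available, the quantity of interest can at least be estimated empirically. The sketch below (with hypothetical run_registration and sample_params callables) computes a Monte Carlo convergence frequency over a chosen parameter range, using the 5-pixel success threshold of the experiments in Chapter 8.

```python
import numpy as np

def convergence_frequency(run_registration, sample_params, n_trials=1000,
                          success_threshold=5.0, seed=None):
    # run_registration : callable(mu_true) -> final reprojection error in pixels
    # sample_params    : callable(rng) -> ground-truth parameters drawn from the
    #                    range of interest (e.g. a box of 6-dof poses)
    # success_threshold: a trial counts as converged below this error (5 pixels,
    #                    as in the experiments of Chapter 8)
    rng = np.random.default_rng(seed)
    errors = np.array([run_registration(sample_params(rng)) for _ in range(n_trials)])
    return float(np.mean(errors < success_threshold))   # empirical convergence rate
```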
Automatic Factorization In Chapter 5 and Appendix D we introduced lemmas
and rules to systematically solve the factorization problem: we have demonstrated
that—under certain assumptions—the factorization is feasible. However, we have
not demonstrated that the obtained factorization is the most efficient possible; re-
call that factorization is similar to the chain matrix multiplication problem (cf.
Section 7.1.3). We propose to build an automatic procedure to compute the factorization: the input would be a chain of matrix operations, and the output would be another chain of matrix operations whose matrices are clearly separated; the resulting chain of matrices would be such that the number of operations is minimal. The optimum order of operations can be computed using dynamic programming triggered by the proposed rules of factorization.
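For reference, the textbook dynamic-programming solution to the chain matrix multiplication problem (which the proposed automatic factorization would build upon) can be sketched as follows; this is the classic algorithm, not the proposed factorization procedure itself.

```python
import sys

def matrix_chain_order(dims):
    # dims[i-1] x dims[i] is the size of the i-th matrix in the chain, i = 1..n.
    n = len(dims) - 1
    cost = [[0] * (n + 1) for _ in range(n + 1)]
    split = [[0] * (n + 1) for _ in range(n + 1)]
    for length in range(2, n + 1):               # length of the sub-chain
        for i in range(1, n - length + 2):
            j = i + length - 1
            cost[i][j] = sys.maxsize
            for k in range(i, j):                # try every split point
                q = cost[i][k] + cost[k + 1][j] + dims[i - 1] * dims[k] * dims[j]
                if q < cost[i][j]:
                    cost[i][j], split[i][j] = q, k
    return cost[1][n], split                     # minimal multiplications and split table
```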
Alternative Computation of the Brightness Error Function In this the-
sis we have posed the registration/tracking problem as the minimization of the
quadratic error function of the brightness differences between the template and the
image—which is usually known as Sum of Squared Differences or SSD. However,
there are alternatives, such as error norms in the Fourier domain [Navarathna et al., 2011] or maximizing the correlation of image gradient orientations [Tzimiropoulos et al., 2011]. We may improve the robustness of the HB algorithm by deriving factorization methods for such cost functions.
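For concreteness, the two kinds of cost can be contrasted in a small sketch: the SSD used throughout the thesis versus a gradient-orientation correlation in the spirit of [Tzimiropoulos et al., 2011] (a simplified reading, not their exact formulation).

```python
import numpy as np

def ssd_cost(template, warped):
    # Sum of Squared Differences between the template and the warped image patch.
    t = np.asarray(template, dtype=float)
    w = np.asarray(warped, dtype=float)
    return float(np.sum((t - w) ** 2))

def gradient_orientation_correlation(template, warped):
    # Correlation of image gradient orientations: large when the orientations of
    # the two patches agree, and insensitive to gradient magnitude.
    t = np.asarray(template, dtype=float)
    w = np.asarray(warped, dtype=float)
    gty, gtx = np.gradient(t)
    gwy, gwx = np.gradient(w)
    phi_t = np.arctan2(gty, gtx)
    phi_w = np.arctan2(gwy, gwx)
    return float(np.sum(np.cos(phi_t - phi_w)))   # maximised when orientations agree
```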
We may even go a step further by using Discriminative Tracking [Avidan, 2001;
Liu, 2007; Lucey, 2008; Wang et al., 2010]: we maximize the classification score of the image of the target instead of minimizing the SSD error norm. We may search
for those parameters that best categorize the target region in the “well-aligned”
class—as opposed to the “bad-aligned” class. We propose to speed-up existing
discriminative tracking techniques by using factorization methods.
Bibliography
Amberg, B. and Vetter, T. (2009). On compositional image alignment, with an
application to active appearance models. In Proc. of CVPR.
An, K. H. and Chung, M. J. (2008). 3d head tracking and pose-robust 2d tex-
ture map-based face recognition using a simple ellipsoid model. In IEEE/RSJ
International Conference on Intelligent Robots and Systems,IROS.
Averbuch, A. and Keller, Y. (2002). Fast motion estimation using bidirectional
gradient methods. In Proc. International Conference on Acoustics, Speech, and
Signal Processing.
Avidan, S. (2001). Support vector tracking. In IEEE Trans. on Pattern Analysis
and Machine Intelligence, pages 184–191.
Tordoff, B., Mayol, W., de Campos, T., and Murray, D. (2002). Head pose estimation for
wearable robot control. In Proc 13th British Machine Vision Conference, Cardiff,
September 2002, volume 2, pages 807–816.
Baker, S. and Matthews, I. (2001). Equivalence and efficiency of image alignment
algorithms. In Proc. of CVPR, volume 1, pages 1090–1097. IEEE.
Baker, S. and Matthews, I. (2004). Lucas-kanade 20 years on: A unifying framework.
International Journal of Computer Vision, 56(3):221–255.
Baker, S., Matthews, I., Xiao, J., Gross, R., Kanade, T., and Ishikawa, T. (2004a).
Real-time non-rigid driver head tracking for driver mental state estimation. In
11th World Congress on Intelligent Transportation Systems.
Baker, S., Patil, R., Cheung, G., and Matthews, I. (2004b). Lucas-kanade 20 years
on: Part 5. Technical Report CMU-RI-TR-04-64, Robotics Institute, Carnegie
Mellon University, Pittsburgh, PA.
Bartoli, A. (2008). Groupwise geometric and photometric direct image registration.
IEEE Transactions on Pattern Analysis and Machine Intelligence, 30(12):2098 –
2108.
Bartoli, A., Hartley, R., and Kahl, F. (2003). Motion from 3d line correspondences:
Linear and non-linear solutions. In Proceedings of the IEEE Conference on Com-
puter Vision and Pattern Recognition, Madison, Wisconsin, USA, pages 477–484.
IEEE CSP.
Bartoli, A. and Zisserman, A. (2004). Direct estimation of non-rigid registration. In
In British Machine Vision Conference.
Basri, R. and Jacobs, D. W. (2001). Lambertian reflectance and linear subspaces.
In Proc. of ICCV, volume 2, pages 383–390.
Basri, R. and Jacobs, D. W. (2003). Lambertian reflectance and linear subspaces.
IEEE Transactions on Pattern Analysis and Machine Intelligence, 25(2):218–233.
Basu, S., Essa, I., and Pentland, A. (1996). Motion regularization for model-based
head tracking. In ICPR ’96: Proceedings of the International Conference on
Pattern Recognition (ICPR ’96) Volume III-Volume 7276, page 611, Washington,
DC, USA. IEEE Computer Society.
Bay, H., Ess, A., Tuytelaars, T., and Van Gool, L. (2008). Speeded-up robust
features (surf). Computer Vision and Image Understanding, 110(3):346 – 359.
Benhimane, S., Ladikos, A., Lepetit, V., and Navab, N. (2007). Linear and quadratic
subsets for template-based tracking. In Proc. of CVPR.
Benhimane, S. and Malis, E. (2007). Homography-based 2d visual tracking and
servoing. International Jounal of Robotics Research, 26(7):661–676.
Bickel, B., Botsch, M., Angst, R., Matusik, W., Otaduy, M., Pfister, H., and Gross,
M. (2007). Multi-scale capture of facial geometry and motion. ACM Trans.
Graph., 26(3):33.
Black, M. J. and Jepson, A. D. (1998). Eigentracking: Robust matching and tracking
of articulated objects using a view-based representation. International Journal of
Computer Vision, 26(1):63–84.
Black, M. J. and Yacoob, Y. (1997). Recognizing facial expressions in image se-
quences using local parameterized models of image motion. International Journal
of Computer Vision, 25(1):23–48.
Blanz, V. and Vetter, T. (1999). A morphable model for the synthesis of 3d faces.
In Proc. of SIGGRAPH, pages 187–194. ACM Press.
Blanz, V. and Vetter, T. (2003). Face recognition based on fitting a 3d morphable
model. IEEE Transactions on Pattern Analysis and Machine Intelligence, 25(9):1–
12.
Bouguet, J. Y. Camera calibration toolbox for matlab.
Bouguet, J.-Y. (2000). Pyramidal implementation of the lucas kanade feature
tracker. Intel Corporation, Microprocessor Research Labs.
Bowden, R., Mitchell, T. A., and Sharhadi, M. (2000). Non-linear statistical mod-
els for the 3d reconstruction of human pose and motion from monocular image
sequences. Image and Vision Computing, 9(18):729–737.
Brand, M. (2001). Morphable 3d models from video. Computer Vision and Pattern
Recognition, IEEE Computer Society Conference on, 2:456+.
Brand, M. and R.Bhotika (2001). Flexible flow for 3d nonrigid tracking and shape
recovery. In IEEE Computer Society Conference on Computer vision and Pattern
Recognition (CVPR), volume 1, pages 315–322.
Bregler, C., Hertzmann, A., and Biermann, H. (2000). Recovering non-rigid 3d
shape from image streams. In Proc. of CVPR, pages 690–696.
Brooks, R. and Arbel, T. (2010). Generalizing inverse compositional and esm image
alignment. International Journal of Computer Vision, 11(87):191–212.
Brown, L. G. (1992). A survey of image registration techniques. ACM Comput.
Surv., 24:325–376.
Brunet, F., Bartoli, A., Navab, N., and Malgouyres, R. (2009). Nurbs warps. In
British Machine Vision Conference (BMVC), London.
Buenaposada, J., Muñoz, E., and Baumela, L. (2009). Efficient illumination indepen-
dent appearance-based face tracking. Image and Vision Computing, 27(5):560–
578.
Buenaposada, J. M. and Baumela, L. (1999). Seguimiento robusto del rostro humano
mediante visión computacional. In Proc. Conferencia Asociación Española para la Inteligencia Artificial, volume I, pages 48–53. AEPIA.
Buenaposada, J. M. and Baumela, L. (2002). Real-time tracking and estimation of
plane pose. In Proc. of ICPR, volume II, pages 697–700, Quebec, Canada. IEEE.
Buenaposada, J. M., Mu˜noz, E., and Baumela, L. (2004). Efficient appearance-
based tracking. In Proc. CVPR-Workshop on Nonrigid and Articulated Motion.
IEEE.
Capel, D. (2004). Image Mosaicing and Super-Resolution (Cphc/Bcs Distinguished
Dissertations.). SpringerVerlag.
Caspi, Y. and Irani, M. (2002). Spatio-temporal alignment of sequences. IEEE
Trans. Pattern Anal. Mach. Intell., 24(11):1409–1424.
Chen, C.-W. and Wang, C.-C. (2008). 3d active appearance model for aligning faces
in 2d images. In IEEE/RSJ International Conference on Robots and Systems
(IROS), Nice, France.
Choi, S. and Kim, D. (2008). Robust head tracking using 3d ellipsoidal head model
in particle filter. Pattern Recogn., 41(9):2901–2915.
Cipolla, R. and Drummond, T. W. (1999). Real-time tracking of complex structures
with on-line camera calibration. In In Proceedings of the 10th British Machine
Vision Conference, BMVC, Nottingham, UK.
Claus, D. and Fitzgibbon, A. W. (2005). A rational function lens distortion model
for general cameras. In CVPR ’05: Proceedings of the 2005 IEEE Computer
Society Conference on Computer Vision and Pattern Recognition (CVPR’05) -
Volume 1, pages 213–219, Washington, DC, USA. IEEE Computer Society.
Cobzas, D., Jagersand, M., and Sturm, P. (2009). 3d ssd tracking with estimated
3d planes. Image and Vision Computing, 27(1-2):69–79.
Comaniciu, D., Ramesh, V., and Meer, P. (2000). Real-time tracking of non-rigid
objects using mean shift. In Proc. of CVPR, pages 142–149. IEEE.
Cootes, T., Edwards, G., and Taylor, C. (2001). Active appearance models. IEEE
Transactions on Pattern Analysis and Machine Intelligence, 23(6):681–685.
Cormen, T. H., Stein, C., Rivest, R. L., and Leiserson, C. E. (2001). Introduction
to Algorithms. McGraw-Hill Higher Education, 2nd edition.
Decarlo, D. and Metaxas, D. (2000). Optical flow constraints on deformable models
with applications to face tracking. International Journal of Computer Vision,
38(2):99–127.
Del Bue, A. (2010). Adaptive metric registration of 3d models to non-rigid image
trajectories. In Daniilidis, K., Maragos, P., and Paragios, N., editors, 11th Euro-
pean Conference on Computer Vision (ECCV 2010), Crete, Greece, volume 6313
of Lecture Notes in Computer Science, pages 87–100. Springer.
Del Bue, A., Smeraldi, F., and Agapito, L. (2004). Non-rigid structure from mo-
tion using non-parametric tracking and non-linear optimization. In Proc. CVPR-
Workshop on Nonrigid and Articulated Motion, volume 1. IEEE.
Dementhon, D. F. and Davis, L. S. (1995). Model-based object pose in 25 lines of
code. Int. J. Comput. Vision, 15(1-2):123–141.
Devernay, F., Mateus, D., and Guilbert, M. (2006). Multi-camera scene flow by
tracking 3-d points and surfels. In Proc. of CVPR, volume II, pages 2203– 2212.
Donner, R., Reiter, M., Langs, G., Peloschek, P., and Bischof, H. (2006). Fast active
appearance models search using canonical correlation analysis. IEEE Transactions
on Pattern Analysis and Machine Intelligence, 28(10):1690–1694.
Dornaika, F. and Ahlberg, J. (2006). Fitting 3d face models for tracking and active
appearance model training. Image and Vision Computing, 24:1010–1024.
Dowson, N. and Bowden, R. (2008). Mutual information for lucas-kanade tracking
(milk): An inverse compositional formulation. IEEE Transactions on Pattern
Analysis and Machine Intelligence, 30(1):180–185.
Drummond, T. and Cipolla, R. (2002). Real-time visual tracking of complex struc-
tures. IEEE Trans. Pattern Anal. Mach. Intell., 24(7):932–946.
Faggian, N., Paplinski, A. P., and Sherrah, J. (2006). Active appearance models
for automatic fitting of 3d morphable models. In AVSS ’06: Proceedings of the
IEEE International Conference on Video and Signal Based Surveillance, page 90,
Washington, DC, USA. IEEE Computer Society.
Gay-Bellile, V., Bartoli, A., and Sayd, P. (2010). Direct estimation of nonrigid
registrations with image-based self-occlusion reasoning. IEEE Transactions on
Pattern Analysis and Machine Intelligence, 32(1):87–104.
Gleicher, M. (1997). Projective registration with difference decomposition. In Proc.
of CVPR, pages 331–337.
Golub, G. H. and Van Loan, C. F. (1996). Matrix Computations (Johns Hopkins
Studies in Mathematical Sciences)(3rd Edition). The Johns Hopkins University
Press, 3rd edition.
Gonzalez-Mora, J., Guil, N., and De la Torre, F. (2009). Efficient image alignment
using linear appearance models. In Proc. of CVPR.
Gross, R., Matthews, I., and Baker, S. (2006). Active appearance models with
occlusion. Image and Vision Computing, 24(6):593–604.
Guskov, I. (2004). Multiscaled inverse compositional image alignment for subdivision
surface maps. In Proc. European Conference on Computer Vision.
Hager, G. and Belhumeur, P. (1998). Efficient region tracking with parametric
models of geometry and illumination. IEEE Transactions on Pattern Analysis
and Machine Intelligence, 20(10):1025–1039.
Hager, G. and Belhumeur, P. (1999). Tracking in 3d: Image variability decomposi-
tion for recovering object pose and illumination. Pattern Analysis and Applica-
tions.
Harris, C. and Stephens, M. (1988). A combined corner and edge detection. In
Proceedings of The Fourth Alvey Vision Conference, pages 147–151.
Hartley, R. and Zisserman, A. (2004). Multiple View Geometry in Computer Vision.
Cambridge University Press, second edition.
Hiwada, K., Maki, A., and Nakashima, A. (2003). Mimicking video: real-time
morphable 3d model fitting. In VRST ’03: Proceedings of the ACM symposium
on Virtual reality software and technology, pages 132–139, New York, NY, USA.
ACM.
Hong, H. S. and Chung, M. J. (2007). 3d pose and camera parameter tracking
algorithm based on lucas-kanade image alignment algorithm. In International
Conference on Control, Automation and Systems, Seoul, Korea.
Irani, M. and Anandan, P. (1999). All about direct methods. In Triggs, W., Zis-
serman, A., and Szeliski, R., editors, Vision Algorithms: Theory and practice.
Springer-Verlag.
Irani, M., Anandan, P., and Cohen, M. (2002). Direct recovery of planar-parallax
from multiple frames. IEEE Transactions on Pattern Analysis and Machine In-
telligence, 24(11):1528–1534.
Irani, M. and Peleg, S. (1991). Improving resolution by image registration. CVGIP:
Graph. Models Image Process., 53(3):231–239.
Irani, M., Rousso, B., and Peleg, S. (1997). Recovery of ego-motion using region
alignment. IEEE Trans. Pattern Anal. Mach. Intell., 19(3):268–272.
Jang, J.-S. and Kanade, T. (2008). Robust 3d head tracking by online feature reg-
istration. In The IEEE International Conference on Automatic Face and Gesture
Recognition.
Jurie, F. and Dhome, M. (2002a). Hyperplane approximation for template matching.
IEEE Transactions on Pattern Analysis and Machine Intelligence, 24(7):996–100.
Jurie, F. and Dhome, M. (2002b). Real time robust template matching. In Proc.
BMVC, pages 123–132.
Petersen, K. B. and Pedersen, M. S. The matrix cookbook.
Keller, Y. and Averbuch, A. (2004). Fast motion estimation using bidirectional
gradient methods. Trans. on IP, 13(8):1042–1054.
Keller, Y. and Averbuch, A. (2008). Global parametric image alignment via high-
order approximation. Computer Vision and Image Understanding, 109(3):244–
259.
Kollnig, H. and Nagel, H. H. (1997). 3d pose estimation by directly matching
polyhedral models to gray value gradients. In International Journal of Computer
Vision, volume 23, pages 283–302.
La Cascia, M., Sclaroff, S., and Athitsos, V. (2000). Fast, reliable head tracking
under varying illumination: An approach based on robust registration of texture-
mapped 3d models. IEEE Transactions on Pattern Analysis and Machine Intel-
ligence, 22(4):322–336.
Lagger, P., Salzmann, M., Lepetit, V., and Fua, P. (2008). 3d pose refinement from
reflections. In Computer Vision and Pattern Recognition.
Lepetit, V. and Fua, P. (2005). Monocular model-based 3d tracking of rigid objects.
Found. Trends. Comput. Graph. Vis., 1(1):1–89.
Lepetit, V. and Fua, P. (2006). Keypoint recognition using randomized trees. IEEE
Transactions on Pattern Analysis and Machine Intelligence, 28(9):1465–1479.
Lepetit, V., Moreno-Noguer, F., and Fua, P. (2009). Epnp: An accurate o(n) solution
to the pnp problem. International Journal of Computer Vision, 81(2).
Lester, H. and Arridge, S. R. (1999). A survey of hierarchical non-linear medical
image registration. Pattern Recognition, 32(1):129 – 149.
Lewis, J. P. (1995). Fast normalized cross-correlation. In Vision Interface, pages
120–123. Canadian Image Processing and Pattern Recognition Society.
Liu, X. (2007). Generic face alignment using boosted appearance model. In in Proc.
IEEE Computer Vision and Pattern Recognition, pages 1079–1088.
Lourakis, M. I. A. and Argyros, A. A. (2006). Chaining planar homographies for
fast and reliable 3d plane tracking. In Proc. of ICPR, pages 582–586, Washington,
DC, USA. IEEE Computer Society.
Lowe, D. G. (2004). Distinctive image features from scale-invariant keypoints. In-
ternational Journal of Computer Vision, 2(60):91–110.
Lucas, B. D. and Kanade, T. (1981). An iterative image registration technique with
an application to stereo vision. In Proc. of Int. Joint Conference on Artificial
Intelligence, pages 674–679.
Lucey, S. (2008). Enforcing non-positive weights for stable support vector tracking.
In IEEE International Conference on Computer Vision and Pattern Recognition
(CVPR).
Madsen, K., Nielsen, H., and Tingleff, O. (2004). Methods for non-linear least
squares problems. Informatics and Mathematical Modelling, Technical University
of Denmark, second edition.
Malciu, M. and Prêteux, F. (2000). A robust model-based approach for 3d head
tracking in video sequences. In FG ’00: Proceedings of the Fourth IEEE Inter-
national Conference on Automatic Face and Gesture Recognition 2000, page 169,
Washington, DC, USA. IEEE Computer Society.
Marchand, E., Bouthemy, P., Chaumette, F., and Moreau, V. (1999). Robust real-
time visual tracking using 2d-3d model-based approach. In In International Con-
ference on Computer Vision, ICCV, Corfu, Greece.
Masson, L., Dhome, M., and Jurie, F. (2004). Robust real time tracking of 3d
objects. In ICPR ’04: Proceedings of the Pattern Recognition, 17th International
Conference on (ICPR’04) Volume 4, pages 252–255, Washington, DC, USA. IEEE
Computer Society.
Masson, L., Dhome, M., and Jurie, F. (2005). Tracking 3d objects using flexible
models. BMVC, 2005.
Masson, L., Jurie, F., and Dhome, M. (2003). Contour/texture approach for visual
tracking. In SCIA’03: Proceedings of the 13th Scandinavian conference on Image
analysis, pages 661–668, Berlin, Heidelberg. Springer-Verlag.
Matas, J., Chum, O., Martin, U., and Pajdla, T. (2002). Robust wide baseline
stereo from maximally stable extremal regions. In Proceedings of British Machine
Vision Conference, volume 1, pages 384–393, London.
Matthews, I. and Baker, S. (2004). Active appearance models revisited. Interna-
tional Journal of Computer Vision, 60(2):135–164.
Matthews, I., Xiao, J., and Baker, S. (2007). 2d vs. 3d deformable face models:
Representational power, construction, and real-time fitting. International Journal
of Computer Vision, 75(1):93–113.
Megret, R., Authesserre, J., and Berthoumieu, Y. (2008). The bi-directional frame-
work for unifying parametric image alignment approaches. In Proc. European
Conference on Computer Vision, pages 400–411.
Megret, R., Mikram, M., and Berthoumieu, Y. (2006). Inverse composition for
multi-kernel tracking.
Muñoz, E., Buenaposada, J. M., and Baumela, L. (2005). Efficient model-based
3d tracking of deformable objects. In Proc. of ICCV, volume I, pages 877–882,
Beijing, China.
Muñoz, E., Buenaposada, J. M., and Baumela, L. (2009). A direct approach for
efficiently tracking with 3d morphable models. In Proc. of ICCV, volume I, Kyoto,
Japan.
Murphy-Chutorian, E. and Trivedi, M. M. (2009). Head pose estimation in computer
vision: A survey. IEEE Trans. Pattern Anal. Mach. Intell., 31(4):607–626.
Navarathna, R., Sridharan, S., and Lucey, S. (2011). Fourier active appearance
models. In Proceedings of IEEE International Conference on Computer Vision
(ICCV 2011).
Neapolitan, R. and Naimipour, K. (1996). Foundations of algorithms. D. C. Heath
and Company, Lexington, MA, USA.
Molton, N., Davison, A., and Reid, I. (2004). Parameterisation and probability in
image alignment. In Proc. of Asian Conference on Computer Vision.
Papandreou, G. and Maragos, P. (2008). Adaptive and constrained algorithms for
inverse compositional active appearance models fitting. In Proc. of CVPR.
Parke, F. I. and Waters, K. (1996). Computer Facial Animation. AK Peters Ltd.
Pighin, F., Salesin, D. H., and Szeliski, R. (1999). Resynthesizing facial animation
through 3d model-based tracking. In In International Conference on Computer
Vision, ICCV, Corfu, Greece.
Pilet, J., Lepetit, V., and Fua, P. (2005). Real-time non-rigid surface detection. In
Proc. of CVPR. IEEE.
Pilet, J., Lepetit, V., and Fua, P. (2008). Fast non-rigid surface detection, registra-
tion and realistic augmentation. Int. J. Comput. Vision, 76(2):109–122.
Press, W. H., Flannery, B. P., Teukolsky, S. A., and Vetterling, W. T. (1992).
Numerical Recipes: The Art of Scientific Computing. Cambridge University Press,
Cambridge (UK) and New York, 2nd edition.
Pressigout, M. and Marchand, E. (2007). Real-time hybrid tracking using edge and
texture information. Int. Journal of Robotics Research, IJRR, 26(7):689–713.
Ramamoorthi, R. (2002). Analytic pca construction for theoretical analysis of light-
ing variability in images of a lambertian object. IEEE Trans. Pattern Analysis
and Machine Intelligence, 24:1322–1333.
Romdhani, S. and Vetter, T. (2003). Efficient, robust and accurate fitting of a 3d
morphable model. In Proc. of ICCV, volume 1, pages 59–66.
Ross, D., Lim, J., and Yang, M.-H. (2004). Adaptive probabilistic visual tracking
with incremental subspace update. In Proc. European Conference on Computer
Vision, volume LNCS 3022, pages 470–482. Springer-Verlag.
Salzmann, M., Pilet, J., Ilic, S., and Fua, P. (2007). Surface deformation models for non-
rigid 3–d shape recovery. IEEE Transactions on Pattern Analysis and Machine
Intelligence, 29(8):1481–1487.
Schmid, C., Mohr, R., and Bauckhage, C. (2000). Evaluation of interest point
detectors. International Journal of Computer Vision, 37(2):151–172.
Sclaroff, S. and Isidoro, J. (2003). Active blobs: region-based, deformable appear-
ance models. Comput. Vis. Image Underst., 89(2-3):197–225.
Sepp, W. (2006). Efficient tracking in 6-dof based on the image-constancy assump-
tion in 3-d. In Proc. of ICPR.
Sepp, W. (2008). Visual Servoing of Textured Free-Form Objects in 6 Degrees
of Freedom. PhD thesis, Institut für Datenverarbeitung, Technische Universität München.
Sepp, W. and Hirzinger, G. (2003). Real-time texture-based 3-d tracking. In Proc. of
Deutsche Arbeitsgemeinschaft für Mustererkennung e.V., volume 2781 of LNCS,
pages 330–337. Springer.
Shi, J. and Tomasi, C. (1994). Good features to track. In 1994 IEEE Conference
on Computer Vision and Pattern Recognition (CVPR’94), pages 593 – 600.
Shum, H.-Y. and Szeliski, R. (2000). Construction of panoramic image mosaics with
global and local alignment. International Journal of Computer Vision, 36(2):101–
130.
Simon, G., Fitzgibbon, A. W., and Zisserman, A. (2000). Markerless tracking using
planar structures in the scene. In International Symposium on Augmented Reality,
pages 120–128.
Strom, J., Jebara, T., Basu, S., and Pentland, A. (1999). Real time tracking and
modeling of faces: An EKF-based analysis by synthesis approach. In Proceed-
ings of the Modelling People Workshop at the 1999 International Conference on
Computer Vision.
Tomasi, C. and Kanade, T. (1992). Shape and motion from image streams under or-
thography: A factorization approach. International Journal of Computer Vision,
9(2):137–154.
Torr, P. H. S. and Zisserman, A. (1999). Feature based methods for structure and
motion estimation. In Triggs, W., Zisserman, A., and Szeliski, R., editors, Vision
Algorithms: Theory and practice, pages 278–295. Springer-Verlag.
Torresani, L., Hertzmann, A., and Bregler, C. (2008). Nonrigid structure-from-
motion: Estimating shape and motion with hierarchical priors. IEEE Trans.
Pattern Anal. Mach. Intell., 30(5):878–892.
Torresani, L., Yang, D., Alexander, G., and Bregler, C. (2002). Tracking and mod-
elling non-rigid objects with rank constraints. In Proc. of CVPR. IEEE.
Tsai, R. (1987). A versatile camera calibration technique for high-accuracy 3d ma-
chine vision metrology using off-the-shelf tv cameras and lenses. Robotics and
Automation, IEEE Journal of, 3(4):323–344.
Tzimiropoulos, G., Zafeiriou, S., and Pantic, M. (2011). Robust and efficient para-
metric face alignment. In Proceedings of IEEE International Conference on Com-
puter Vision (ICCV 2011), pages 1847–1854. Oral.
Vacchetti, L., Lepetit, V., and Fua, P. (2004). Stable real-time 3d tracking using on-
line and offline information. IEEE Transactions on Pattern Analysis and Machine
Intelligence, 26(10):1385–1391.
Viola, P. and Jones, M. J. (2004). Robust real-time face detection. International
Journal of Computer Vision, 57(2):137–154.
Viola, P. and Wells, III, W. M. (1997). Alignment by maximization of mutual
information. Int. J. Comput. Vision, 24(2):137–154.
Wang, X., Hua, G., and Han, T. X. (2010). Discriminative tracking by metric
learning. In ECCV (3), pages 200–214.
Xiao, J., Baker, S., Matthews, I., and Kanade, T. (2004a). Real-time combined
2d+3d active appearance models. In Proc. of CVPR, Washington, D.C. IEEE.
Xiao, J., Baker, S., Matthews, I., and Kanade, T. (2004b). Real-time combined
2d+3d active appearance models. In Proc. of CVPR, volume 2, pages 535 – 542.
Xu, Y. and Roy-Chowdhury, A. K. (2008). Inverse compositional estimation of 3d
pose and lighting in dynamic scenes. IEEE Transactions on Pattern Analysis and
Machine Intelligence, 30(7):1300 – 1307.
Zhu, J., Hoi, S. C., and Lyu, M. R. (2006). Real-time non-rigid shape recovery
via active appearance models for augmented reality. In Proceedings 9th European
Conference on Computer Vision (ECCV2006), Graz, Austria.
Zimmermann, K., Matas, J., and Svoboda, T. (2009). Tracking by an optimal
sequence of linear predictors. IEEE Trans. Pattern Anal. Mach. Intell., 31(4):677–
692.
Zimmermann, K., Svoboda, T., and Matas, J. (2006). Multiview 3d tracking with an
incrementally constructed 3d model. In 3DPVT ’06: Proceedings of the Third In-
ternational Symposium on 3D Data Processing, Visualization, and Transmission
(3DPVT’06), pages 488–495, Washington, DC, USA. IEEE Computer Society.
Zitová, B. and Flusser, J. (2003). Image registration methods: a survey. Image and
Vision Computing, 21(11):977–1000.
Appendix A
Gauss-Newton Optimization
Let f be a vector function, f : R^n → R^m with m ≥ n. We want to find the minimum
of f, i.e. minimise ‖f(x)‖, so we cast the problem as
    x^* = arg min_x {F(x)},
where
    F(x) = (1/2) ‖f(x)‖^2 = (1/2) f(x)^⊤ f(x).
We assume that x^* = x + h, where h ∈ R^n is an arbitrary vector such that F(x + h)
is a local minimizer. We find h by linearizing the function f at x using a truncated
Taylor series:
    f(x + h) ≃ ℓ(h) ≡ f(x) + f′(x) h,
where f′(x) is the first derivative of function f at point x. We redefine the problem
as
    F(x + h) ≃ L(h) ≡ (1/2) ℓ(h)^⊤ ℓ(h)
                    = (1/2) f(x)^⊤ f(x) + h^⊤ J(x)^⊤ f(x) + (1/2) h^⊤ J(x)^⊤ J(x) h,
with J(x) = f′(x). Matrix J is usually referred to as the Jacobian. The Gauss-Newton
step h minimises the linear model L:
    h = arg min_h {L(h)}.
If h is a local minimizer of L, then L′(h) = 0. The first derivative of
the linear model is
    L′(h) = J(x)^⊤ f(x) + J(x)^⊤ J(x) h.
Setting this derivative to zero yields the normal equations:
    J(x)^⊤ J(x) h = −J(x)^⊤ f(x).
Finally, the GN descent direction is computed in closed form as
    h = −(J(x)^⊤ J(x))^{−1} J(x)^⊤ f(x).
We use a line search strategy to compute a step size α along the descent direction
towards the optimum—i.e., α = arg min_{α̂} F(x + α̂ h). Typically, F(x + α h) is not the
true local minimizer, so we repeat the process from the point x′ = x + h. Again, we
linearize F(x′ + h) and compute a new GN step. We iterate the process until
convergence. We outline the whole process in Algorithm 13.
Algorithm 13 Outline of the GN algorithm.
On-line: Let x_i be the starting point, with i = 0.
1: while no convergence do
2:   Compute the Jacobian, J(x_i).
3:   Compute the GN step, h_i = −(J(x_i)^⊤ J(x_i))^{−1} J(x_i)^⊤ f(x_i).
4:   Update x_{i+1} = x_i + h_i, i = i + 1.
5: end while
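A minimal Python sketch of this loop is given below; it assumes a user-supplied residual function fun and its Jacobian jac, and uses a simple backtracking line search for the step size α (these names and the line-search choice are illustrative, not part of the thesis):

```python
import numpy as np

def gauss_newton(fun, jac, x0, tol=1e-8, max_iters=50):
    """Minimise F(x) = 0.5 * ||fun(x)||^2 with Gauss-Newton steps."""
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iters):
        f = fun(x)                                   # residual vector, shape (m,)
        J = jac(x)                                   # Jacobian, shape (m, n)
        # Normal equations: (J^T J) h = -J^T f
        h = np.linalg.solve(J.T @ J, -J.T @ f)
        # Backtracking line search for the step size alpha along h.
        alpha, F0 = 1.0, 0.5 * f @ f
        while 0.5 * np.sum(fun(x + alpha * h) ** 2) > F0 and alpha > 1e-8:
            alpha *= 0.5
        x = x + alpha * h
        if np.linalg.norm(alpha * h) < tol:
            break
    return x

# Example: fit y = exp(a*t) + b to synthetic samples.
t = np.linspace(0.0, 1.0, 20)
y = np.exp(0.7 * t) + 1.3
fun = lambda p: np.exp(p[0] * t) + p[1] - y
jac = lambda p: np.column_stack([t * np.exp(p[0] * t), np.ones_like(t)])
print(gauss_newton(fun, jac, [0.0, 0.0]))            # approximately [0.7, 1.3]
```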
Appendix B
Plane-induced Homography
A plane-induced homography relates the image of a plane in two views, or the image
of two planes in a single view (see Figure B.1).
Figure B.1: (Left) The plane π induces a homography H between the imaged plane
on views C and C′. The plane-induced homography H depends on the relative motion
between C and C′—which depends on R and t. (Right) The plane-induced homography
alternatively represents the motion of a plane in a single view. In this case, the plane-
induced homography depends on the relative motion between the planes π and π′.
Suppose that the camera matrices of
the two views are those of a calibrated stereo rig,
    P = K[I | 0],   P′ = K[R | t],
such that the world origin is at the first camera P. The world plane π has coordinates
π = (n_π^⊤, d)^⊤ such that n_π^⊤ x_π + d = 0, for every point x_π on the plane. The point
x_π = (x_π, y_π, z_π)^⊤ is sensed in both views as
    x̃ = K[I | 0] x̃_π = K x_π,   (B.1)
and
    x̃′ = K[R | t] x̃_π = K (R x_π + t).   (B.2)
We rewrite Equation B.2 using −n_π^⊤ x_π / d = 1 as
    x̃′ = K[R | t] x̃_π = K (R x_π + t n^⊤ x_π),   (B.3)
where n^⊤ = −n_π^⊤ / d. Inserting Equation B.1 into Equation B.3 results in an equation
that relates plane projections x̃ and x̃′,
    x̃′ = K (R + t n^⊤) K^{−1} x̃.   (B.4)
We define the plane-induced homographic motion as a function f_h6p : P^2 → P^2 such
that
    x̃′ = f_h6p(x̃; µ) = H_6 x̃,   (B.5)
where µ = (α, β, γ, t)^⊤, and H_6 = K (R + t n^⊤) K^{−1} is a 6-dof homography. This
homography is parameterized by the rotation R—described by the Euler angles α,
β, and γ—and the translation t.
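As a concrete illustration, the following sketch (not taken from the thesis) builds H_6 from Euler angles, a translation and the scaled plane normal n = −n_π/d, and applies it to a homogeneous point; the calibration matrix K and the Euler-angle convention are assumptions of the example:

```python
import numpy as np

def euler_to_R(alpha, beta, gamma):
    """Rotation matrix from Euler angles (a Z-Y-X convention is assumed here)."""
    ca, sa = np.cos(alpha), np.sin(alpha)
    cb, sb = np.cos(beta), np.sin(beta)
    cg, sg = np.cos(gamma), np.sin(gamma)
    Rx = np.array([[1, 0, 0], [0, ca, -sa], [0, sa, ca]])
    Ry = np.array([[cb, 0, sb], [0, 1, 0], [-sb, 0, cb]])
    Rz = np.array([[cg, -sg, 0], [sg, cg, 0], [0, 0, 1]])
    return Rz @ Ry @ Rx

def plane_induced_homography(K, alpha, beta, gamma, t, n):
    """H6 = K (R + t n^T) K^-1, as in Equation B.4."""
    R = euler_to_R(alpha, beta, gamma)
    return K @ (R + np.outer(t, n)) @ np.linalg.inv(K)

K = np.array([[800.0, 0.0, 320.0], [0.0, 800.0, 240.0], [0.0, 0.0, 1.0]])
H6 = plane_induced_homography(K, 0.10, -0.05, 0.02,
                              t=np.array([0.01, 0.00, 0.10]),
                              n=np.array([0.00, 0.00, -1.00]))
x = np.array([100.0, 120.0, 1.0])      # homogeneous image point on the plane
xp = H6 @ x
print(xp[:2] / xp[2])                  # mapped point in Cartesian coordinates
```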
Appendix C
Plane+Parallax-constrained
Homography
The homography induced by a plane generates a virtual parallax due to the motion
of the plane (cf. [Hartley and Zisserman, 2004], p. 335). Let us suppose we have a
calibrated stereo rig with camera matrices
    P = K[R | −Rt],   P′ = K[R′ | −R′t′].
The world plane π has coordinates π = (n_π^⊤, d)^⊤ such that n_π^⊤ x_π + d = 0, for every
point x_π on the plane. The plane-induced homography H relates the images of the
point x_π on the two views, x̃ and x̃′ (see Appendix B). However, this statement
is no longer true when the plane's coordinates change. Let π′ = (n′_π^⊤, d′)^⊤ be the
plane that results from applying a rigid body transformation with parameters δR and
δt to π (see Figure C.1).
We image the point x_π′ as x̃′′ on the right view. We relate the two image points
x̃ and x̃′′ as
    x̃′′ = H x̃ + ρ ẽ′,   (C.1)
where the scalar ρ is the parallax displacement relative to the plane-induced
homography H, and ẽ′ is the projection of the epipole under P′.
We demonstrate that there exists a closed form of ρ when we know both R′ and
t′. First, we respectively express x_π′ in the reference system of cameras C and C′
as follows:
    x = R x_π′ − R t,
    x′ = R′ x_π′ − R′ t′,   (C.2)
that is, x is x_π′ expressed in coordinates of C, and x′ is x_π′ expressed in coordinates
of C′.
Figure C.1: Plane+Parallax-constrained homography.
We compute the transformation that relates x and x′ by combining Equations C.2
as follows:
    x′ = R′ R^⊤ x + R′ (t − t′).   (C.3)
Notice that x̃′′ = K x′, as x′ is already expressed in coordinates of C′.
We express x by chaining three transformations: (1) from ˜x to xπ,
xπ = R⊤
tn⊤
π R⊤
/(d − n⊤
π t) K˜x, (C.4)
(2) from xπ to xπ′ ,
xπ′ = δR − δRδtn⊤
π /d xπ, (C.5)
and (3) from xπ′ to x,
x = R − Rtn⊤
π δR⊤
/(d − n⊤
π δt) xπ′ . (C.6)
Inserting Equations C.4–C.6 into Equation C.1, and projecting using K leads to an
expression relating ˜x and ˜x′′
,
˜x′′
= KR′
R⊤
R − Rt
n⊤
π δR⊤
(d − n⊤
π δt)
δR − δRδt
n⊤
π
d
R⊤
t
n⊤
π R⊤
(d − n⊤
π t)
K˜x + KR′
(t − t′
)
(C.7)
We rewrite Equation C.7 as
    x̃′′ = H x̃ + ẽ′,   (C.8)
where
H = KR′
R⊤
R − Rt
n⊤
π δR⊤
(d − n⊤
π δt)
δR − δRδt
n⊤
π
d
R⊤
t
n⊤
π R⊤
(d − n⊤
π t)
K−1
, (C.9)
and
    ẽ′ = K R′ (t − t′).   (C.10)
Matrix H is the plane-induced homography between x̃ and x̃′, and ẽ′ is the epipole
in the left view. Note that ẽ′ ∈ P^2 is a point in general projective form—i.e.
ẽ′ = (x, y, w)^⊤. However, if we express ẽ′ as an augmented point on the Euclidean
plane—i.e. ẽ′ = (x/w, y/w, 1)^⊤—then we rewrite Equation C.8 as
    x̃′′ = H x̃ + ρ ẽ′,   (C.11)
where ρ = w is the projective depth of the epipole.
We define the Plane+Parallax Constrained Homography, f_H6PP, using Equation C.9,
    x̃′′ = f_H6PP(x̃; µ) = H x̃ + ρ ẽ′.   (C.12)
C.1 Compositional Form
We may rewrite the warp f_H6PP (Equation C.12) as a composition of two functions,
so it can be directly used in compositional algorithms (see Chapter 6). We recast
Equation C.12 as
    x̃′′ = f_H6PP(x̃; µ) = h(g(x̃; δR, δt); R, R′, t, t′),   (C.13)
where we define the functions h and g as
    h(x̃; R, R′, t, t′) = K R′ R^⊤ x̃ − K R′ (t − t′),
g(˜x; δR, δt) = R − Rt
n⊤
π δR⊤
(d − n⊤
π δt)
δR − δRδt
n⊤
π
d
R⊤
t
n⊤
π R⊤
(d − n⊤
π t)
K˜x.
(C.14)
Notice that the actual optimization parameters are δR and δt—the parameters R
and t are fixed throughout the process, and R′ and t′ depend upon δR and δt.
Appendix D
Methodical Factorization
The main goal of the factorization algorithm is to re-organise the chain of matrix
products of the Jacobian matrix due to gradient replacement. Frequently, this
arrangement is done using ad hoc techniques [Hager and Belhumeur, 1998]. In this
section we propose a method to systematically carry out the factorization step. We
use the following theorems and their corresponding corollaries as a basis for this
technique.
D.1 Basic Definitions
Definition D.1.1. The vec Operator: If A is an m × n matrix with values
    A = [ a_11 ··· a_1n ; … ; a_m1 ··· a_mn ],
the vec operator stacks the matrix columns into a vector
    vec(A) = (a_11, . . . , a_m1, . . . , a_1n, . . . , a_mn)^⊤.
Definition D.1.2. Kronecker Product: If A is an m × n matrix and B is a p × q
matrix, then the Kronecker product A ⊗ B is the (mp) × (nq) block matrix
    A ⊗ B = [ a_11 B ··· a_1n B ; … ; a_m1 B ··· a_mn B ].
For more properties of the Kronecker product we recommend [K. B. Petersen].
Definition D.1.3. Kronecker Row-product: If A is an m × n matrix and B is a
p × q matrix, we define the Kronecker row-product A ⊙ B as the (mp) × q matrix
    A ⊙ B = [ A ⊗ b_1^⊤ ; … ; A ⊗ b_p^⊤ ],
where the b_i^⊤ are the p rows of matrix B.
We define the concepts of permutation and permutation matrix below. We
shall use these definitions to re-organize Kronecker products:
Definition D.1.4. Permutation: Given a set {1, . . . , m}, a permutation, π, of
the set is a bijective map of that set onto itself: π : {1, . . . , m} → {1, . . . , m}. A less
formal definition would describe the permutation of a set as a reordering of the set
elements. We denote the permutation of the set m, π(m), as
    π(m) = ( 1 2 ··· m ; π(1) π(2) ··· π(m) ).
For example, given the set {1, 2, 3}, a valid permutation of that set is
    ( 1 2 3 ; π(1) π(2) π(3) ) = ( 1 2 3 ; 2 3 1 ).
Definition D.1.5. Permutation Matrix: If π(n) is a permutation of the set
{1, . . . , n}, then we define the n × n permutation matrix, P_π(n), as
    P_π(n) = [ e_π(1)^⊤ ; e_π(2)^⊤ ; … ; e_π(n)^⊤ ],
where e_i ∈ R^n is the i-th unit vector, i.e. the vector that is zero in all entries except
the i-th, where it is 1.
210
The permutation matrix enable us to re-order the rows and collumns of a matrix
or vector. We shall use this property in Theorem 6. Additionally, we define the
permutation with ratio as a special sub-class of permutations
Definition D.1.6. The permutation of the set {1, . . . , m} with ratio q, π(m : p),
is the permutation that verifies
1 2 3
π(1) π(2) π(m)
For example, the permutation π(9 : 3) is
π(9 : 3) =
1 2 3 4 5 6 7 8 9
π(1) π(2) π(3) π(4) π(5) π(6) π(7) π(8) π(9)
=
1 2 3 4 5 6 7 8 9
1 4 7 2 5 8 3 6 9
D.2 Lemmas that Re-organize Products of Matrices
Using the above definitions we enunciate the theorem that lets us re-arrange the
multiplication of two matrices while keeping the result of the product the same.
Theorem 5. Let A and B be m × n and n × p matrices respectively. We can rewrite
their product AB as (I_m ⊗ vec(B)^⊤)(I_p ⊙ A).
Proof. The product AB can be alternatively written as a row-wise times column-wise
vector product,
Am×nBn×p =



a⊤
11×n
...
a⊤
m1×n


 b1n×1 · · · bpn×1 ,
=



a⊤
1 b1 · · · a⊤
1 bp
...
...
...
a⊤
mb1 · · · a⊤
mbp,



, =



b⊤
1 a1 · · · b⊤
p a1
...
...
...
b⊤
1 am · · · b⊤
p am,



after some basic matrix manipulations. The result can be re-arranged as the product of
an m × (mnp) matrix times an (mnp) × p matrix,
Am×nBn×p =





b⊤
1 b⊤
2 · · · b⊤
p 0⊤
· · · 0⊤
0⊤
b⊤
1 b⊤
2 · · · b⊤
p · · · 0⊤
...
...
...
...
0⊤
· · · 0⊤
b⊤
1 b⊤
2 · · · b⊤
p
























a1 · · · 0
...
...
...
0 · · · a1
a2 · · · 0
...
...
...
0 · · · a2
...
...
...
am · · · 0
...
...
...
0 · · · am



















which can be compactly rewritten using the Kronecker product, the row-product and
the vec operator as
    A_{m×n} B_{n×p} = (I_m ⊗ vec(B)^⊤)(I_p ⊙ A).
Following this theorem we define four corollaries that deal with the most common
cases.
Corollary 5. If A is an m × n matrix and b an n × 1 column vector, then the product
Ab can be rewritten as (I_m ⊗ b^⊤) vec(A^⊤).
Proof. If we apply theorem 5 using the current matrix dimensions, we get
Am×nbn×1 = Im ⊗ b⊤
(I1 ⊙ A)
= Im ⊗ b⊤





a1
a2
...
am





= Im ⊗ b⊤
vec(A⊤
)
We can reach the same result by just rewriting the matrix A row-wise so the product
is
Ab =



a⊤
1
...
a⊤
m


 b =



a⊤
1 b
...
a⊤
mb


 =



b⊤
a1
...
b⊤
am



The resulting transposed product can be rewritten as a matrix multiplication in the
form
Ab =



b⊤
· · · 01×n
...
...
...
01×n · · · b⊤






a1
...
am



= Im ⊗ b⊤
vec(A⊤
),
compactly written using Kronecker product and vec operator.
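A quick numerical sanity check of Corollary 5 (a sketch assuming numpy; note that stacking columns, as the vec operator does, corresponds to Fortran-order flattening):

```python
import numpy as np

def vec(M):
    """Stack the columns of M into a single vector (Definition D.1.1)."""
    return M.flatten(order="F")

rng = np.random.default_rng(0)
m, n = 4, 3
A = rng.standard_normal((m, n))
b = rng.standard_normal(n)

lhs = A @ b                                  # ordinary matrix-vector product A b
rhs = np.kron(np.eye(m), b) @ vec(A.T)       # (I_m (x) b^T) vec(A^T)
print(np.allclose(lhs, rhs))                 # True
```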
Corollary 6. If b is an m × 1 column vector and A is an m × n matrix, then the product
b^⊤ A can be rewritten as vec(A)^⊤ (I_n ⊗ b).
Proof. We rewrite the product straightforwardly by using Theorem 5, but changing
the dimensions accordingly,
b⊤
1×nAm×n = I1 ⊗ vec(A)⊤
In ⊙ b⊤
= vec(A)⊤
(In ⊗ b) .
Corollary 7. If a is an m × 1 column vector and b^⊤ is a 1 × n row vector, the resulting
m × n matrix a b^⊤ can be rewritten as (I_m ⊗ b^⊤)(a ⊗ I_n).
Proof. We use Theorem 5 to rewrite the product a b^⊤ according to the vector
sizes,
am×1b⊤
1×n = Im ⊗ vec(b⊤
)⊤
(In ⊙ a)
= Im ⊗ b⊤
(a ⊗ In) .
Notice that we can re-organize the row-column product, a^⊤ b, in a direct way by just
rewriting it as b^⊤ a. However, we will use the following corollary several times during
our factorisation.
Corollary 8. Let a and b be two n × 1 column vectors. The product (a^⊤ b) I_m can
be rewritten as (I_m ⊗ a^⊤)(I_m ⊗ b).
Proof. The initial product is compactly rewritten as:
(a⊤
b)Im =



a⊤
b · · · 0
...
...
...
0 · · · a⊤
b



=



a⊤
· · · 01×n
...
...
...
01×n · · · b⊤






b · · · 0n×1
...
...
...
0n×1 · · · b


 ,
= (Im ⊗ a⊤
)(Im ⊗ b).
  Lemma        | Input                             | Output
  Theorem 5    | A (m×n), B (n×p)                  | (I_m ⊗ vec(B)^⊤) [m×(np)] · (I_p ⊙ A) [(np)×p]
  Corollary 5  | A (m×n), b (n×1)                  | (I_m ⊗ b^⊤) [m×(mn)] · vec(A^⊤) [(mn)×1]
  Corollary 6  | b^⊤ (1×m), A (m×n)                | vec(A)^⊤ [1×(mn)] · (I_n ⊗ b) [(mn)×n]
  Corollary 7  | a (m×1), b^⊤ (1×n)                | (I_m ⊗ b^⊤) [m×(mn)] · (a ⊗ I_n) [(mn)×n]
  Corollary 8  | (a^⊤ (1×n) · b (n×1)) I_m (m×m)   | (I_m ⊗ a^⊤) [m×(mn)] · (I_m ⊗ b) [(mn)×m]
Table D.1: Lemmas used to re-arrange matrix products.
D.3 Lemmas that Re-organize Kronecker Prod-
ucts
In this section we shall derive some theorems and corollaries that we shall use to
re-organize those products involving Kronecker matrices (see definition D.1.2).
The first theorem is used to re-order the operands of a Kronecker product. Notice
that, generally, the Kronecker product is not commutative.
Theorem 6. If A is an m × n matrix and B is a p × q matrix, then their Kronecker
product is permutation equivalent, i.e., there exist (mp) × (mp) and (nq) × (nq)
permutation matrices P and Q such that
    A ⊗ B = P (B ⊗ A) Q.
The following theorem and its corollaries re-organize a Kronecker product where
one of the operands is an identity matrix.
Theorem 7. If I_m is the m × m identity matrix and A is an n × p matrix, then we can
re-organize the (mn) × (mp) Kronecker product as
    I_m ⊗ A = P_π((mn):p) (A ⊗ I_m),
where P_π((mn):p) is the permutation matrix of the set {1, . . . , (mn)} with ratio p (see
Definition D.1.6).
From this theorem we derive three corollaries that we shall use directly in our
derivations.
Corollary 9. If I_m is the m × m identity matrix and a and b are n × 1 vectors, we can
rewrite the (mn) × (mn) product (I_m ⊗ a)(I_m ⊗ b^⊤) as
    (I_m ⊗ a)(I_m ⊗ b^⊤) = P^⊤_π((mn):n) (I_m ⊗ b^⊤) P_π((mn²):n) (I_m ⊗ a).
Corollary 10. If I_m is the m × m identity matrix and a and b are n × 1 vectors, we can
rewrite the m × m product (I_m ⊗ a^⊤)(I_m ⊗ b) as
    (I_m ⊗ a^⊤)(I_m ⊗ b) = (b^⊤ ⊗ I_m)(a ⊗ I_m).
  Lemma         | Input                                        | Output
  Theorem 7     | I_m ⊗ A [(mn)×(mp)]                          | P_π((mn):p) [(mn)×(mn)] · (A ⊗ I_m) [(mn)×(mp)]
  Corollary 9   | (I_m ⊗ a) [m×(mn)] · (I_m ⊗ b^⊤) [(mn)×n]    | P^⊤_π((mn):n) [m×m] · (I_m ⊗ b^⊤) [m×(mn)] · P_π((mn²):n) [(mn)×(mn)] · (I_m ⊗ a) [(mn)×m]
  Corollary 10  | (I_m ⊗ a^⊤) [m×(mn)] · (I_m ⊗ b) [(mn)×m]    | (b^⊤ ⊗ I_m) [m×(mn)] · (a ⊗ I_m) [(mn)×m]
  Corollary 11  | (a ⊗ I_m) [(mn)×m] · (I_m ⊗ b^⊤) [m×(mn)]    | (I_(mn) ⊗ b^⊤) [(mn)×(mn²)] · (a ⊗ I_(mn)) [(mn²)×(mn)]
Table D.2: Lemmas used to re-arrange Kronecker matrix products.
Corollary 11. If I_m is the m × m identity matrix, and a and b are n × 1 vectors,
we can rewrite the (mn) × (mn) product (a ⊗ I_m)(I_m ⊗ b^⊤) as
    (a ⊗ I_m)(I_m ⊗ b^⊤) = (I_(mn) ⊗ b^⊤)(a ⊗ I_(mn)).
In addition to this, we shall use the following properties of the Kronecker product
in our derivations. If I_m is the m × m identity matrix, and a and b are n × 1 vectors,
then
    I_m ⊗ (a b^⊤) = (I_m ⊗ a)(I_m ⊗ b^⊤),
    (I_m ⊗ a)^⊤ = I_m ⊗ a^⊤,
    (a ⊗ I_m)^⊤ = a^⊤ ⊗ I_m.
D.4 Lemmas that Re-organize Sums of Matrices
In this section we shall define some propositions that re-arrange the distributive
product of matrices.
Proposition 8. If A is an m × p matrix, B is an m × (np) matrix, I_p is the p × p identity
matrix, and a is an n × 1 vector, then there exist m × (m + np) matrices P and Q such
that we can rewrite the distributive product A + B (I_p ⊗ a) as
    A + B (I_p ⊗ a) = (A P + B Q) ( I_p ⊗ [1 ; a] ).
The matrices P and Q are respectively of the form,
P = e1 0m×n e2 0m×n · · · ei 0m×n · · · 0m×n em ,
and
Q =










0n×1 In 0n×1 0n · · · 0n×1 0n · · · 0n×1 0n
0n×1 0n 0n×1 In · · · 0n×1 0n · · · 0n×1 0n
...
...
...
...
...
...
...
...
...
...
0n×1 0n 0n×1 0n · · · 0n×1 In · · · 0n×1 0n
...
...
...
...
...
...
...
...
...
...
0n×1 0n 0n×1 0n · · · 0n×1 0n · · · 0n×1 In










,
where e_i is the i-th unit vector (see Definition D.1.5), 0_{n×1} is the n × 1 zero vector,
and 0_n is the n × n matrix whose entries are all zero.
Appendix E
Methodical Factorization of f3DTM
The goal of this section is to show how to factorize the Jacobian matrix J in a
systematic way by just using the lemmas of Appendix D. We attempt to obtain
the most efficient factorization (i.e. the factorization that employs the least number
of operations). We separately factorize each element of the Jacobian matrix J
(Equations 5.27).
Decomposition of J1
J1 =∇ˆuT [˜u]⊤
λH−1
A I3 − (n⊤
t)I3 + tn⊤
R⊤ ˙Rα v
Cor. 5
−−−→D⊤
I3 − ( n⊤ t )I3 + tn⊤
(I ⊗ v⊤
)vec(˙R
⊤
α R)
Cor. 8
−−−→D⊤
I3 − (I3 ⊗ n⊤
)(I3 ⊗ t) + t n⊤ (I3 ⊗ v⊤
)vec(˙R
⊤
α R)
Cor. 7
−−−→D⊤
I3 − (I3 ⊗ n⊤
(I3 ⊗ t) + (I3 ⊗ n⊤
)(t ⊗ I3) (I3 ⊗ v⊤
)vec(˙R
⊤
α R),
(E.1)
where D^⊤ = ∇_ûT[ũ]^⊤ λ H_A^{−1}. We continue with the factorization process by eliminating
the distributive product of Equation E.1. First, we insert the term I_3 ⊗ v^⊤ into
the sum of the distributive product,
J1 =D⊤
I3 − (I3 ⊗ n⊤
(I3 ⊗ t) + (I3 ⊗ n⊤
)(t ⊗ I3) (I3 ⊗ v⊤
)vec(˙R
⊤
α R)
=D⊤
(I3 ⊗ v⊤
) − (I3 ⊗ n⊤
(I3 ⊗ t)(I3 ⊗ v⊤
) + (I3 ⊗ n⊤
)(t ⊗ I3)(I3 ⊗ v⊤
) vec(˙R
⊤
α R).
(E.2)
Second, we reorganize the two terms of the expression inside the parentheses of
Equation E.2 containing translations, so that the operands containing t move
to the right side of the product. We reorganize the term (I_3 ⊗ t)(I_3 ⊗ v^⊤) as follows:
(I3 ⊗ n(i)⊤
) (I3 ⊗ t) (I3 ⊗ v(i)⊤
)
Cor. 5
−−−−→(I3 ⊗ n(i)⊤
) Pπ(9:3)(I9 ⊗ v(i)⊤
)(I9 ⊗ v(i)⊤
)Pπ(27:3)(I9 ⊗ t) .
(E.3)
Notice two important facts about Equation E.3: (1) we simplify the term P_π(9:3)(I_9 ⊗ v^⊤)
using a basic property of the Kronecker product [K. B. Petersen], and (2) we use
Corollary 11 to reorder the Kronecker product P_π(27:3)(I_9 ⊗ t). If we apply these
two properties, we obtain the following results:
(I3 ⊗ n⊤
(I3 ⊗ t)(I3 ⊗ v⊤
) =(I3 ⊗ n⊤
) Pπ(9:3)(I9 ⊗ v⊤
) Pπ(27:3)(I9 ⊗ t)
−→(I3 ⊗ n⊤
) (Pπ(9:3) ⊗ v⊤
) Pπ(27:3)(I9 ⊗ t)
Cor. 11
−−−−−→(I3 ⊗ n⊤
)(Pπ(9:3) ⊗ v⊤
) (t ⊗ I9) .
(E.4)
We rearrange the second term in a similar way: we apply the Corollary 11 to the
term (t ⊗ I3)(I3 ⊗ v⊤
), so we obtain:
(I3 ⊗ n⊤
) (t ⊗ I3) (I3 ⊗ v⊤
)
Cor. 11
−−−−→ (I3 ⊗ n⊤
) (I9 ⊗ v⊤
)(t ⊗ I9) . (E.5)
Notice the common factor (t ⊗ I9) in both Equations E.4 and E.5; we can now
reorganize the summation part of the distributivity by applying Proposition 8:
I3 − (I3 ⊗ n⊤
(I3 ⊗ t) + (I3 ⊗ n⊤
)(t ⊗ I3) (I3 ⊗ v⊤
)
Prop. 8
−−−−→ (I3 ⊗ n⊤
)A + (I3 ⊗ n⊤
)(I9 ⊗ v⊤
) − (I3 ⊗ n⊤
)(Pπ(9:3) ⊗ v⊤
) B
1
t
⊗ I9 ,
(E.6)
where the matrices A and B are in the following form (see Theorem 7 for further
details):
A =


1 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 1 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 1 0 0 0

 ,
and
B =


03×1 I3 03×1 03 03×1 03
03×1 03 03×1 I3 03×1 03
03×1 03 03×1 03 03×1 I3

 .
(E.7)
Finally, we rename the product of D⊤
and the highlighted portion of the Equa-
tion E.6 as S⊤
1 :
S⊤
1 = D⊤
(I3 ⊗ n⊤
)A + (I3 ⊗ n⊤
)(I9 ⊗ v⊤
) − (I3 ⊗ n⊤
)(Pπ(9:3) ⊗ v⊤
) B . (E.8)
Therefore, we write the factorized version of J1 as:
J1 = S⊤
1
1
t
⊗ I9 vec(˙R
⊤
αt
R). (E.9)
Notice that Equation E.9 represents a proper factorization (according to the
rules of Section 5.3): the term S_1^⊤ is constant—only made of shape terms—and the
term ((1, t^⊤)^⊤ ⊗ I_9) vec(Ṙ_{α_t}^⊤ R) depends only on motion-variable terms.
Decomposition of J2 and J3 We reorganize the terms J_2 and J_3 in the same way
we did with J_1. The only difference lies in the parameter of the rotational derivative:
we use α for J_1, β for J_2, and γ for J_3. Hence, we show the final factorized forms
for J_2 and J_3 in the next equations:
J2 =S⊤
1
1
t
⊗ I9 vec(˙R
⊤
β R),
and
J3 =S⊤
1
1
t
⊗ I9 vec(˙R
⊤
γ R).
(E.10)
Decomposition of J4 The matrix decomposition for J_4, J_5, and J_6 is slightly
different from the three previous elements, as these elements do not involve a
rotational derivative, that is:
J4 = D⊤
R⊤
− (n⊤
t)R⊤
+ tn⊤
R⊤
r1n⊤
v. (E.11)
Although we could derive a completely different routine to rearrange J_4, we opt
to reuse most of the factorization process of J_1. We reorder the terms of
Equation E.11 as follows:
J4 = D⊤
R⊤
− (n⊤
t)R⊤
+ tn⊤
R⊤ r1 n⊤
v
Cor. 5
−−−−→ R⊤
− (n⊤
t)R⊤
+ tn⊤
R⊤
(I3 ⊗ v⊤
n)vec(r⊤
1 ) ,
(E.12)
where r1 is the first collumn of matrix R— i.e. R = (r1, r2, r3). We decompose
(I3 ⊗ v⊤
n) into (I3 ⊗ v⊤
)(I3 ⊗ n) by using Corollary 8,
J4 = D⊤
I3 − (n⊤
t)I3 + tn⊤
I3 (I3 ⊗ v⊤
n)vec(r⊤
1 )
Cor. 8
−−−−→D⊤
I3 − (n⊤
t)I3 + tn⊤
I3 (I3 ⊗ v⊤
)(I3 ⊗ n) vec(r⊤
1 ).
(E.13)
Now we can reorder Equation E.13 using the result from Equation E.8 and Corol-
lary 11; we show the process in the next equations:
D⊤
I3 − (n⊤
t)I3 + tn⊤
I3 (I3 ⊗ v⊤
) (I3 ⊗ n)vec(r⊤
1 )
Eq. E.8
−−−−−→S1
1
t
⊗ I9 (I3 ⊗ n) vec(r⊤
1 )
Cor. 11
−−−−−→S1 ((I3 ⊗ n) ⊗ I4) I3 ⊗
1
t
vec(r⊤
1 )
(E.14)
We write the expression S⊤
2 as
S⊤
2 = S⊤
1 ((I3 ⊗ n) ⊗ I4) , (E.15)
so we concisely write the term J4 as:
J4 = S⊤
2 I3 ⊗
1
t
r1. (E.16)
Decomposition of J5 and J6 We reorganize the terms J_5 and J_6 in the same
way we did with J_4. The only difference lies in the columns of matrix R that are
involved: we use r_1 for J_4, r_2 for J_5, and r_3 for J_6. We show the decomposition
expressions for J_5 and J_6 in the following equations:
J5 = S⊤
2 I3 ⊗
1
t
r2, (E.17)
and
J6 = S⊤
2 I3 ⊗
1
t
r3. (E.18)
Summary of results for J If we gather Equations E.9, E.10, E.16, E.17 and E.18,
we can rewrite Equation 5.27 as follows:
J⊤
= S⊤
1 S⊤
2 M, (E.19)
where we define the matrix M as follows:
M =




1
t
⊗ I9 vec(˙R
⊤
αt
R) vec(˙R
⊤
βt
R) vec(˙R
⊤
γt
R) 036×3
012×3 I3 ⊗
1
t
R



 . (E.20)
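The practical payoff of a factorization such as Equation E.19 is that the shape-dependent factor is computed once off-line, while only the small motion-dependent factor is rebuilt at every frame. The sketch below illustrates this runtime pattern; the sizes (48 constant columns, 6 motion parameters) and the random placeholders are illustrative assumptions, not the exact matrices derived in this appendix.

```python
import numpy as np

rng = np.random.default_rng(0)
n_points, k_const, n_params = 1000, 48, 6

# Off-line: the constant, shape-only factor (one row block per model point).
S = rng.standard_normal((n_points, k_const))

def motion_matrix(mu):
    """Stand-in for the per-frame, motion-only matrix M(mu) of Equation E.20."""
    return rng.standard_normal((k_const, n_params))

def jacobian(mu):
    # One matrix product per frame instead of re-deriving every Jacobian entry.
    return S @ motion_matrix(mu)

J = jacobian(np.zeros(n_params))
print(J.shape)          # (1000, 6)
```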
Appendix F
Methodical Factorization of
f3DMM (Partial case)
The process of decomposing Equations 5.57 is slightly different from the full-factorization
case. First, we completely factorize the expression D (see Equation 5.58). We group
those motion terms that are common in D
( n⊤
Bs c )I3 − Bs c n⊤
Cor. 5
−−−→ (I3 ⊗ n⊤
Bs)(I3 ⊗ c) − Bs (I6 ⊗ n⊤
)(c ⊗ I3) ,
Cor. 5
−−−→(I3 ⊗ n⊤
Bs)(I3 ⊗ c) − Bs(I6 ⊗ n⊤
) Pπ(9:3)(I3 ⊗ c) ,
Cor. 5
−−−→ (I3 ⊗ n⊤
Bs) − Bs(I6 ⊗ n⊤
)Pπ(9:3) (I3 ⊗ c).
(F.1)
and
t n⊤ − ( n⊤ t )I3
Cor. 5
−−−→ (I3 ⊗ n⊤
)(t ⊗ I3) − (I3 ⊗ n⊤
)(I3 ⊗ t) ,
I3
Cor. 5
−−−→(I3 ⊗ n⊤
) Pπ(9:3)(I3 ⊗ t) − (I3 ⊗ n⊤
)(I3 ⊗ t),
I3
Cor. 5
−−−→ (I3 ⊗ n⊤
)Pπ(9:3) − (I3 ⊗ n⊤
) (I3 ⊗ t).
(F.2)
Using Equations F.1 and F.2 we rewrite D as
D = I3 + s1(I3 ⊗ t) + s2(I3 ⊗ c), (F.3)
where
s1 = (I3 ⊗ n⊤
)Pπ(9:3) − (I3 ⊗ n⊤
) , and
s2 = (I3 ⊗ n⊤
Bs) − Bs(I6 ⊗ n⊤
)Pπ(9:3) .
(F.4)
Notice that we can re-organize the summation terms in Equation F.3 more compactly
by using Proposition 8, as we show here:
D =I3 + [s1P + s2Q] I3 ⊗
t
c
H = I3P′
+ [s1P + s2Q] Q′

I3 ⊗


1
t
c



 ,
(F.5)
where
P =


I3 03 03 03 03 03 03 03 03
03 03 03 I3 03 03 03 03 03
03 03 03 03 03 03 I3 03 03

 ,
Q =


06×3 I6 06×3 06 06×3 06
06×3 06 06×3 I6 06×3 06
06×3 06 06×3 06 06×3 I6

 ,
P′
= e1 03×9 e2 03×9e3 03×9 ,
and
Q′
=


09×1 I9 09×1 09 09×1 09
09×1 09 09×1 I9 09×1 09
09×1 09 09×1 09 09×1 I9

 .
(F.6)
We rewrite Equation F.5 more compactly by using
D1 = I3P′
+ [s1P + s2Q] Q′
,
D2 =

I3 ⊗


1
t
c



 ,
(F.7)
so the resulting representation of D is
D = D1D2. (F.8)
Notice that Equation F.8 has two clearly different parts: D_1 depends only on structure
parameters whereas D_2 solely depends on motion. The key idea of the partial
factorization is to leave untouched those parts of Equation 5.57 whose factorization
process could slow down the computation; thus, we only decompose the term D_1, so
we rewrite the elements of the Jacobian (Equation 5.57) as follows:
J1 =D1D2R⊤ ˙Rαt I3 + Bscn⊤
v,
J2 =D1D2R⊤ ˙Rβt I3 + Bscn⊤
v,
J3 =D1D2R⊤ ˙Rγt I3 + Bscn⊤
v,
J4 =D1D2r1n⊤
v,
J5 =D1D2r2n⊤
v,
J6 =D1D2r3n⊤
v,
and
Jk =D1D2R⊤
Bk.
(F.9)
Appendix G
Methodical Factorization of
f3DMM (Full case)
The goal of this section is to show how to factorize the Jacobian matrix J that spans
Equations 5.57 by using the lemmas of Appendix D. We attempt to obtain the
most efficient factorization (i.e. the factorization that employs the least number of
operations). We separately factorize each element of the Jacobian matrix J in the
following.
Decomposition of J1 We show the expanded version of J1 from Equation 5.57
as follows:
J1 = D⊤
I + (n⊤
Bsct)I − (n⊤
t)I − Bscn⊤
+ tn⊤
R⊤ ˙Rα (I + Bsc) v, (G.1)
where D^⊤ = ∇_ûT[ũ]^⊤ λ H_A^{−1} K. We split Equation G.1 into four chunks by applying the
distributive property as follows:
J
(1)
1 =D⊤
(n⊤
Bsct)I − Bscn⊤
R⊤ ˙Rαv,
J
(2)
1 =D⊤
(n⊤
Bsct)I − Bscn⊤
R⊤
Bscv,
J
(3)
1 =D⊤
I − (n⊤
t)I + tn⊤
R⊤ ˙Rαv,
J
(4)
1 =D⊤
I − (n⊤
t)I + tn⊤
R⊤
Bscv.
(G.2)
We separately re-arrange the terms J_1^{(1)}, J_1^{(2)}, J_1^{(3)}, and J_1^{(4)} from Equations G.2.
We re-organize each term of J_1 by using the lemmas in Appendix D. We show the
re-arranging process for J_1^{(1)} in the following:
J
(1)
1 =D⊤
( n⊤
Bs
ct )I − Bscn⊤
R⊤ ˙Rαv,
Cor. 8
−−−→D⊤
(I3 ⊗ n⊤
Bs)(IK ⊗ c) − Bs c n⊤ R⊤ ˙Rαv,
Cor. 8
−−−→D⊤
(I3 ⊗ n⊤
Bs)(IK ⊗ c) − Bs (IK ⊗ n⊤
)(I3 ⊗ c) R⊤ ˙Rαv,
=D⊤
(I3 ⊗ n⊤
Bs) (IK ⊗ c)R⊤ ˙Rα v
−D⊤
Bs(IK ⊗ n⊤
)(I3 ⊗ c) R⊤ ˙Rαv,
Cor. 5
−−−→D⊤
(I3 ⊗ n⊤
Bs) (I3 ⊗ v⊤
)vec(˙R
⊤
α R(IK ⊗ c⊤
)
−D⊤
Bs(IK ⊗ n⊤
) (I3 ⊗ c)R⊤ ˙Rα v ,
Cor. 5
−−−→D⊤
(I3 ⊗ n⊤
Bs)(I3 ⊗ v⊤
)vec(˙R
⊤
α R(IK ⊗ c⊤
))
−D⊤
Bs(IK ⊗ n⊤
) (I3 ⊗ v⊤
)vec(˙R
⊤
α R(I3 ⊗ c⊤
)) ,
(G.3)
Note that we can rewrite Equation G.3 as a product of two matrices,
J
(1)
1 = D⊤
(I3 ⊗ n⊤
Bs)(I3 ⊗ v⊤
), D⊤
Bs(IK ⊗ n⊤
)(I3 ⊗ v⊤
)
vec(˙R
⊤
α R(IK ⊗ c⊤
))
vec(˙R
⊤
α R(I3 ⊗ c⊤
))
(G.4)
We now proceed with the remaining terms J_1^{(2)}, J_1^{(3)}, and J_1^{(4)} in the following:
J
(2)
1 =D⊤
( n⊤
Bs
ct )I − Bscn⊤
R⊤
Bscv,
Cor. 8
−−−→D⊤
(I3 ⊗ n⊤
Bs)(IK ⊗ c) − Bs c n⊤ R⊤
Bscv,
Cor. 8
−−−→D⊤
(I3 ⊗ n⊤
Bs)(IK ⊗ c) − Bs (IK ⊗ n⊤
)(I3 ⊗ c) R⊤
Bscv,
=D⊤
(I3 ⊗ n⊤
Bs)(IK ⊗ c)R⊤
Bs c v
− D⊤
Bs(IK ⊗ n⊤
)(I3 ⊗ c)R⊤
Bs c v,
Cor. 6
−−−→D⊤
(I3 ⊗ n⊤
Bs)(IK ⊗ c)R⊤
(I3 ⊗ c⊤
)vec(B⊤
s ) v
− D⊤
Bs(IK ⊗ n⊤
)(I3 ⊗ c)R⊤
(I3 ⊗ c⊤
)vec(B⊤
s ) v,
=D⊤
(I3 ⊗ n⊤
Bs) (IK ⊗ c)R⊤
(I3 ⊗ c⊤
)vec (B⊤
s )v
− D⊤
Bs(IK ⊗ n⊤
)(I3 ⊗ c)R⊤
(I3 ⊗ c⊤
)vec(B⊤
s )v,
Cor. 6
−−−→D⊤
(I3 ⊗ n⊤
Bs) I3 ⊗ vec(B⊤
s )v vec((IK ⊗ c)R(I3 ⊗ c⊤
))
− D⊤
Bs(IK ⊗ n⊤
) (I3 ⊗ c)R⊤
(I3 ⊗ c⊤
)vec (B⊤
s )v ,
Cor. 6
−−−→D⊤
(I3 ⊗ n⊤
Bs) I3 ⊗ vec(B⊤
s )v vec((IK ⊗ c)R(I3 ⊗ c⊤
))
− D⊤
Bs(IK ⊗ n⊤
) I3 ⊗ vec(B⊤
s )v vec((IK ⊗ c)R(I3 ⊗ c⊤
)) ,
(G.5)
J
(3)
1 =D⊤
I − ( n⊤ t )I + tn⊤
R⊤ ˙Rαv,
Cor. 8
−−−→D⊤
I − (I3 ⊗ n⊤
)(I3 ⊗ t) + t n⊤ R⊤ ˙Rαv,
Cor. 8
−−−→D⊤
I − (I3 ⊗ n⊤
)(I3 ⊗ t) + (I3 ⊗ n⊤
)(t ⊗ I3) R⊤ ˙Rαv,
=D⊤
R⊤ ˙Rα v − (I3 ⊗ n⊤
)(I3 ⊗ t)R⊤ ˙Rαv
+ (I3 ⊗ n⊤
)(t ⊗ I3)R⊤ ˙Rαv,
Cor. 5
−−−→D⊤
(I3 ⊗ v⊤
)vec(˙R
⊤
α R) − (I3 ⊗ n⊤
) (I3 ⊗ t)R⊤ ˙Rα v
+ (I3 ⊗ n⊤
)(t ⊗ I3)R⊤ ˙Rαv,
Cor. 5
−−−→D⊤
(I3 ⊗ v⊤
)vec(˙R
⊤
α R) − (I3 ⊗ n⊤
) (I3 ⊗ v⊤
)vec(˙R
⊤
α R(I3 ⊗ t⊤
))
+ (I3 ⊗ n⊤
) (t ⊗ I3)R⊤ ˙Rα v ,
Cor. 5
−−−→D⊤
(I3 ⊗ v⊤
)vec(˙R
⊤
α R) − (I3 ⊗ n⊤
)(I3 ⊗ v⊤
)vec(˙R
⊤
α R(I3 ⊗ t⊤
))
+ (I3 ⊗ n⊤
) (I3 ⊗ v⊤
)vec(˙R
⊤
α R(t⊤
⊗ I3)) .
(G.6)
J
(4)
1 =D⊤
I − ( n⊤ t )I + tn⊤
R⊤
Bscv,
Cor. 8
−−−→D⊤
I − (I3 ⊗ n⊤
)(I3 ⊗ t) + t n⊤ R⊤
Bscv,
Cor. 8
−−−→D⊤
I − (I3 ⊗ n⊤
)(I3 ⊗ t) + (I3 ⊗ n⊤
)(t ⊗ I3) R⊤
Bscv,
=D⊤
R⊤
Bs c v − (I3 ⊗ n⊤
)(I3 ⊗ t)R⊤
Bscv
+ (I3 ⊗ n⊤
)(t ⊗ I3)R⊤
Bscv,
Cor. 5
−−−→D⊤
R⊤
(I3 ⊗ c⊤
)vec(B⊤
s ) v − (I3 ⊗ n⊤
)(I3 ⊗ t)R⊤
Bscv
+ (I3 ⊗ n⊤
)(t ⊗ I3)R⊤
Bscv,
=D⊤
R⊤
(I3 ⊗ c⊤
) vec(B⊤
s )v − (I3 ⊗ n⊤
)(I3 ⊗ t)R⊤
Bscv
+ (I3 ⊗ n⊤
)(t ⊗ I3)R⊤
Bscv,
Cor. 5
−−−→D⊤
(I3 ⊗ v⊤
vec(B⊤
s )⊤
)vec ((I3 ⊗ c)R) − (I3 ⊗ n⊤
)(I3 ⊗ t)R⊤
Bs c v
+ (I3 ⊗ n⊤
)(t ⊗ I3)R⊤
Bscv,
Cor. 5
−−−→D⊤
(I3 ⊗ v⊤
vec(B⊤
s )⊤
)vec ((I3 ⊗ c)R) − (I3 ⊗ n⊤
)(I3 ⊗ t)R⊤
(I3 ⊗ c⊤
)vec(B⊤
s ) v
+ (I3 ⊗ n⊤
)(t ⊗ I3)R⊤
Bscv,
(G.7)
=D⊤
(I3 ⊗ v⊤
vec(B⊤
s )⊤
)vec ((I3 ⊗ c)R) − (I3 ⊗ n⊤
) (I3 ⊗ t)R⊤
(I3 ⊗ c⊤
) vec(B⊤
s )v
+ (I3 ⊗ n⊤
)(t ⊗ I3)R⊤
Bscv,
Cor. 5
−−−→D⊤
(I3 ⊗ v⊤
vec(B⊤
s )⊤
)vec ((I3 ⊗ c)R)
− (I3 ⊗ n⊤
) I3 ⊗ v⊤
vec(B⊤
s )⊤
vec((I3 ⊗ c)R⊤
(I3 ⊗ t⊤
))
+ (I3 ⊗ n⊤
)(t ⊗ I3)R⊤
Bs c v,
Cor. 5
−−−→D⊤
(I3 ⊗ v⊤
vec(B⊤
s )⊤
)vec ((I3 ⊗ c)R)
− (I3 ⊗ n⊤
) I3 ⊗ v⊤
vec(B⊤
s )⊤
vec((I3 ⊗ c)R⊤
(I3 ⊗ t⊤
))
+ (I3 ⊗ n⊤
)(t ⊗ I3)R⊤
(I3 ⊗ c⊤
)vec(B⊤
s ) ,
=D⊤
(I3 ⊗ v⊤
vec(B⊤
s )⊤
)vec ((I3 ⊗ c)R)
− (I3 ⊗ n⊤
) I3 ⊗ v⊤
vec(B⊤
s )⊤
vec((I3 ⊗ c)R⊤
(I3 ⊗ t⊤
))
+ (I3 ⊗ n⊤
) (t ⊗ I3)R⊤
(I3 ⊗ c⊤
) vec(B⊤
s ) ,
Cor. 5
−−−→D⊤
(I3 ⊗ v⊤
vec(B⊤
s )⊤
)vec ((I3 ⊗ c)R)
− (I3 ⊗ n⊤
) I3 ⊗ v⊤
vec(B⊤
s )⊤
vec((I3 ⊗ c)R⊤
(I3 ⊗ t⊤
))
+ (I3 ⊗ n⊤
) I3 ⊗ v⊤
vec(B⊤
s )⊤
vec((I3 ⊗ c)R⊤
(t ⊗ I3)) .
We rewrite Equations G.5, G.6, and G.7 in the same way as Equation G.4, as follows:
J
(2)
1 =
D⊤
(I3 ⊗ n⊤
Bs) I3 ⊗ vec(B⊤
s )v
⊤
− D⊤
Bs(IK ⊗ n⊤
) I3 ⊗ vec(B⊤
s )v
⊤
⊤
vec((IK ⊗ c)R(I3 ⊗ c⊤
))
vec((IK ⊗ c)R(I3 ⊗ c⊤
))
,
J
(3)
1 =



D⊤
(I3 ⊗ v⊤
)
⊤
− D⊤
(I3 ⊗ n⊤
)(I3 ⊗ v⊤
)
⊤
D⊤
(I3 ⊗ n⊤
)(I3 ⊗ v⊤
)
⊤



⊤ 


vec(˙R
⊤
α R)
vec(˙R
⊤
α R(I3 ⊗ t⊤
))
vec(˙R
⊤
α R(t⊤
⊗ I3))


 ,
J
(4)
1 =



D⊤
(I3 ⊗ v⊤
vec(B⊤
s )⊤
)
⊤
− D⊤
(I3 ⊗ n⊤
) I3 ⊗ v⊤
vec(B⊤
s )⊤ ⊤
D⊤
(I3 ⊗ n⊤
) I3 ⊗ v⊤
vec(B⊤
s )⊤ ⊤



⊤ 

vec((I3 ⊗ c)R)
vec((I3 ⊗ c)R⊤
(I3 ⊗ t⊤
))
vec((I3 ⊗ c)R⊤
(t ⊗ I3))


(G.8)
Combining Equations G.4 and G.8 we rewrite the Jacobian element J1 (Equa-
tion G.1) as
J1 = S⊤
1 M1, (G.9)
where
S1 =




















D⊤
(I3 ⊗ n⊤
Bs)(I3 ⊗ v⊤
)
⊤
D⊤
Bs(IK ⊗ n⊤
)(I3 ⊗ v⊤
)
⊤
D⊤
(I3 ⊗ n⊤
Bs) I3 ⊗ vec(B⊤
s )v
⊤
− D⊤
Bs(IK ⊗ n⊤
) I3 ⊗ vec(B⊤
s )v
⊤
D⊤
(I3 ⊗ v⊤
)
⊤
− D⊤
(I3 ⊗ n⊤
)(I3 ⊗ v⊤
)
⊤
D⊤
(I3 ⊗ n⊤
)(I3 ⊗ v⊤
)
⊤
D⊤
(I3 ⊗ v⊤
vec(B⊤
s )⊤
)
⊤
− D⊤
(I3 ⊗ n⊤
) I3 ⊗ v⊤
vec(B⊤
s )⊤ ⊤
D⊤
(I3 ⊗ n⊤
) I3 ⊗ v⊤
vec(B⊤
s )⊤ ⊤




















⊤
1×(63+81K+18K2)
,
and
M1 =


















vec(˙R
⊤
α R(IK ⊗ c⊤
))
vec(˙R
⊤
α R(I3 ⊗ c⊤
))
vec((IK ⊗ c)R(I3 ⊗ c⊤
))
vec((IK ⊗ c)R(I3 ⊗ c⊤
))
vec(˙R
⊤
α R)
vec(˙R
⊤
α R(I3 ⊗ t⊤
))
vec(˙R
⊤
α R(t⊤
⊗ I3))
vec((I3 ⊗ c)R)
vec((I3 ⊗ c)R⊤
(I3 ⊗ t⊤
))
vec((I3 ⊗ c)R⊤
(t ⊗ I3))


















(63+81K+18K2)×1
(G.10)
Decomposition of J2 and J3 Decomposing J_2 is identical to the above
procedure but changing the Euler angle—β instead of α—of the rotational derivative.
Hence, we decompose J_2 as the product J_2 = S_1 M_2, where
M2 =


















vec(˙R
⊤
β R(IK ⊗ c⊤
))
vec(˙R
⊤
β R(I3 ⊗ c⊤
))
vec((IK ⊗ c)R(I3 ⊗ c⊤
))
vec((IK ⊗ c)R(I3 ⊗ c⊤
))
vec(˙R
⊤
β R)
vec(˙R
⊤
β R(I3 ⊗ t⊤
))
vec(˙R
⊤
β R(t⊤
⊗ I3))
vec((I3 ⊗ c)R)
vec((I3 ⊗ c)R⊤
(I3 ⊗ t⊤
))
vec((I3 ⊗ c)R⊤
(t ⊗ I3))


















(63+81K+18K2)×1
. (G.11)
Note that there is no need to compute a matrix S_2, as the shape elements (vertices,
normals and basis shapes) do not change with respect to J_1. We equivalently
decompose J_3 as J_3 = S_1 M_3, where
M3 =


















vec(˙R
⊤
γ R(IK ⊗ c⊤
))
vec(˙R
⊤
γ R(I3 ⊗ c⊤
))
vec((IK ⊗ c)R(I3 ⊗ c⊤
))
vec((IK ⊗ c)R(I3 ⊗ c⊤
))
vec(˙R
⊤
γ R)
vec(˙R
⊤
γ R(I3 ⊗ t⊤
))
vec(˙R
⊤
γ R(t⊤
⊗ I3))
vec((I3 ⊗ c)R)
vec((I3 ⊗ c)R⊤
(I3 ⊗ t⊤
))
vec((I3 ⊗ c)R⊤
(t ⊗ I3))


















(63+81K+18K2)×1
. (G.12)
Decomposition of J4 We expand the term J4 from Equation 5.57 as follows:
J4 = D⊤
I + (n⊤
Bsct)I − (n⊤
t)I − Bscn⊤
+ tn⊤
R⊤
n⊤
v, (G.13)
where D⊤
= ∇ˆuT [˜u]⊤
λH−1
A K. We split Equation G.13 in two chunks by applying
the distributive property,
J
(1)
4 = D⊤
I + (n⊤
Bsct) − Bscn⊤
R⊤
n⊤
v,
and
J
(2)
4 = D⊤
I − (n⊤
t)I + tn⊤
R⊤
n⊤
v.
(G.14)
We separately re-arrange the terms J_4^{(1)} and J_4^{(2)} from Equation G.14. We show
the process for J_4^{(1)} in the following:
J
(1)
4 =D⊤
( n⊤
Bs
ct )I − Bscn⊤
R⊤
n⊤
v,
Cor. 8
−−−→D⊤
(I3 ⊗ n⊤
Bs)(IK ⊗ c) − Bs c n⊤ R⊤
n⊤
v,
Cor. 8
−−−→D⊤
(I3 ⊗ n⊤
Bs)(IK ⊗ c) − Bs (IK ⊗ n⊤
)(I3 ⊗ c) R⊤
n⊤
v,
=D⊤
(I3 ⊗ n⊤
Bs)(IK ⊗ c)R⊤
n⊤
v
−D⊤
Bs(IK ⊗ n⊤
)(I3 ⊗ c) r1n⊤
v,
=D⊤
n⊤
v(I3 ⊗ n⊤
Bs)(IK ⊗ c)R⊤
−D⊤
n⊤
v Bs(IK ⊗ n⊤
)(I3 ⊗ c) R⊤
.
(G.15)
Notice that we can place the scalar n^⊤_{1×3} v_{3×1} anywhere in Equation G.15 without
using any of the lemmas of Appendix D. We similarly re-organize the element J_4^{(2)},
J
(2)
4 =D⊤
I − ( n⊤ t )I + tn⊤
R⊤
n⊤
v,
Cor. 8
−−−→D⊤
I − (I3 ⊗ n⊤
)(I3 ⊗ t) + t n⊤ R⊤
n⊤
v,
Cor. 8
−−−→D⊤
I − (I3 ⊗ n⊤
)(I3 ⊗ t) + (I3 ⊗ n⊤
)(t ⊗ I3) R⊤
n⊤
v,
=D⊤
R⊤
n⊤
v − (I3 ⊗ n⊤
)(I3 ⊗ t)R⊤
n⊤
v
+ (I3 ⊗ n⊤
)(t ⊗ I3)R⊤
n⊤
v,
=D⊤
n⊤
vR⊤
− D⊤
n⊤
v(I3 ⊗ n⊤
)(I3 ⊗ t)R⊤
+ D⊤
n⊤
v(I3 ⊗ n⊤
)(t ⊗ I3)R⊤
.
(G.16)
Using Equations G.15 and G.16 we rewrite the Jacobian element J4 as J4 = S⊤
2 M4,
where
S2 =








D⊤
n⊤
v(I3 ⊗ n⊤
Bs)
⊤
− D⊤
n⊤
v Bs(IK ⊗ n⊤
)
⊤
D⊤
n⊤
v
⊤
− D⊤
n⊤
v(I3 ⊗ n⊤
)
⊤
D⊤
n⊤
v(I3 ⊗ n⊤
)
⊤








⊤
1×(21+6K)
,
and
M4 =






(IK ⊗ c)R⊤
(I3 ⊗ c)R⊤
R⊤
(I3 ⊗ t)R⊤
(t ⊗ I3)R⊤






(21+6K)×3
.
(G.17)
Decomposition of JK We expand the term JK from Equation 5.57 as follows:
JK = D⊤
I + (n⊤
Bsct)I − (n⊤
t)I − Bscn⊤
+ tn⊤
Bn⊤
v, (G.18)
where D^⊤ = ∇_ûT[ũ]^⊤ λ H_A^{−1} K. We split Equation G.18 into two chunks by applying
the distributive property,
J
(1)
K = D⊤
+(n⊤
Bsct)I − Bscn⊤
Bn⊤
v,
J
(2)
K = D⊤
I − (n⊤
t)I + tn⊤
Bn⊤
v.
(G.19)
We separately re-arrange the terms J_K^{(1)} and J_K^{(2)} from Equation G.19. We show the
process for J_K^{(1)} in the following:
J
(1)
K =D⊤
(n⊤
Bsct)I − Bs c n⊤ Bn⊤
v,
Cor. 8
−−−→D⊤
(n⊤
Bsct)I − Bs (IK ⊗ n⊤
)(I3 ⊗ c) Bn⊤
v,
=D⊤
(n⊤
Bsct)I Bn⊤
v
−D⊤
Bs(IK ⊗ n⊤
)(I3 ⊗ c)Bn⊤
v,
Cor. ??
−−−−→D⊤
(n⊤
v)B(n⊤
Bsct)IK
−D⊤
Bs(IK ⊗ n⊤
) (I3 ⊗ c) Bn⊤
v ,
Cor. 5
−−−→D⊤
(n⊤
v)B( n⊤
Bs
ct )IK
−D⊤
(n⊤
v)Bs(IK ⊗ n⊤
) (I3K ⊗ vec(B)⊤
)(I3K ⊙ (I3 ⊗ c)) ,
Cor. 5
−−−→D⊤
(n⊤
v)B (IK ⊗ (n⊤
Bs))(I3K ⊗ c)
−D⊤
(n⊤
v)Bs(IK ⊗ n⊤
)(I3K ⊗ vec(B)⊤
)(I3K ⊙ (I3 ⊗ c))
(G.20)
We similarly re-arrange the term J_K^{(2)},
J
(2)
K =D⊤
I − ( n⊤ t )I + tn⊤
Bn⊤
v,
Cor. 5
−−−→D⊤
I − (I3 ⊗ n⊤
)(I3 ⊗ t) + t n⊤ Bn⊤
v,
Cor. 5
−−−→D⊤
I − (I3 ⊗ n⊤
)(I3 ⊗ t) + (I3 ⊗ n⊤
)(t ⊗ I3) Bn⊤
v,
=D⊤
Bn⊤
v − D⊤
(I3 ⊗ n⊤
) (I3 ⊗ t) Bn⊤
v
+D⊤
(I3 ⊗ n⊤
)(t ⊗ I3)Bn⊤
v
Cor. 5
−−−→D⊤
Bn⊤
v − D⊤
(I3 ⊗ n⊤
) (n⊤
v)(I9 ⊗ vec(B)⊤
)((I3 ⊗ t) ⊙ IK)
+D⊤
(I3 ⊗ n⊤
) (t ⊗ I3) Bn⊤
v ,
Cor. 5
−−−→D⊤
Bn⊤
v − D⊤
(I3 ⊗ n⊤
)(n⊤
v)(I9 ⊗ vec(B)⊤
)((I3 ⊗ t) ⊙ IK)
+D⊤
(I3 ⊗ n⊤
) (n⊤
v)(I9 ⊗ vec(B)⊤
)((t ⊗ I3) ⊙ IK) .
(G.21)
Using Equations G.20 and G.21 we rewrite the element JK (Equation G.18) as
JK = S⊤
3 M5, where
S3 =








D⊤
(n⊤
v)B(IK ⊗ (n⊤
Bs))
⊤
− D⊤
(n⊤
v)Bs(IK ⊗ n⊤
)(I3K ⊗ vec(B)⊤
)
⊤
D⊤
Bn⊤
v
⊤
− D⊤
(I3 ⊗ n⊤
)(n⊤
v)(I9 ⊗ vec(B)⊤
)
⊤
D⊤
(I3 ⊗ n⊤
)(n⊤
v)(I9 ⊗ vec(B)⊤
)
⊤








⊤
1×(31K+18K2)
,
and
M5 =






(I3K ⊗ c)
(I3K ⊙ (I3 ⊗ c))
IK
((I3 ⊗ t) ⊙ IK)
((t ⊗ I3) ⊙ IK)






(31K+18K2)×K
.
(G.22)
Summarizing the results for J^⊤ We rewrite Equation 5.57 by gathering Equations
G.10, G.11, G.12, G.13, G.17, and G.22,
J⊤
= S⊤
M, (G.23)
where
S = S⊤
1 , S⊤
1 , S⊤
1 , S⊤
2 , S⊤
3 1×(210+280K+72K2)
,
and
M =






M1 0 0 0 0
0 M2 0 0 0
0 0 M3 0 0
0 0 0 M4 0
0 0 0 0 M5






(210+280K+72K2)×(6+K)
.
(G.24)
Appendix H
Detailed Complexity of
Algorithms
In this section we provide detailed descriptions of the computation of the number
of operations for certain stages of the algorithms under review in Chapter 7.
First, we separately compute the complexities of warps f3DTM and f3DMM. Using
these results, we subsequently compute the complexities of algorithms HB3DTM,
HB3DTMNF, HB3DMM, HB3DMMNF, and HB3DMMSF (see Chapter 7
for a detailed description of these algorithms).
H.1 Warp f3DTM
We compute the number of operations that we need to perform each time we use the
warp f3DTM. We only count the operations directly related to the warp; thus,
we omit operations that are common to every warp and algorithm, such as
the image operation—I or T—or the operation that extracts R or t from µ.
We recall the warp definition from Equation 5.10:
    f3DTM(ũ, n; µ) = K′ (R + t n^⊤) K ũ′,   (H.1)
where K′ = H_A^{−1} K, and ũ′ = H_A ũ. We display the dimensions for each term of
Equation H.1 in the following:
    f3dmmr(ũ, n; µ) = K′_{3×3} ( R_{3×3} + t_{3×1} n^⊤_{1×3} ) K_{3×3} ũ′_{3×1}.   (H.2)
We compute the complexity of Equation H.2 step by step:
f3dmmr(˜u, n; µ) = K′
3×3
R
3×3
+ t
3×1
n⊤
1×3
< 9 >M
K
3×3
˜u′
3×1
,
= K′
3×3
R
3×3
+ tn⊤
3×3
< 9 >A
K
3×3
˜u′
3×1
,
= K′
3×3
R + tn⊤
3×3
< 27 >M+< 18 >A
K
3×3
˜u′
3×1
,
= K′
R + tn⊤
3×3
K
3×3
< 27 >M+< 18 >A
˜u′
3×1
,
= K′
R + tn⊤
K
3×3
˜u′
3×1
< 9 >M+< 6 >A
.
We sum up all the partial complexities to compute the total complexity for the warp:
    Θ3dmmr = <74>M + <51>A.   (H.3)
Notice that we have added 2 extra multiplications in Equation H.3 due to the
homogeneous to Cartesian coordinates mapping.
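The warp of Equation H.1 transcribes directly into numpy; in the sketch below the affine change of coordinates H_A, the calibration K, and the per-point plane normal n are assumed to be given, and the final division is the homogeneous-to-Cartesian mapping mentioned above:

```python
import numpy as np

def warp_f3dtm(u_tilde, n, R, t, K, H_A):
    """f3DTM(u, n; mu) = K' (R + t n^T) K u', with K' = H_A^-1 K and u' = H_A u."""
    K_prime = np.linalg.inv(H_A) @ K
    u_prime = H_A @ u_tilde                    # u_tilde is a homogeneous 3-vector
    x = K_prime @ (R + np.outer(t, n)) @ K @ u_prime
    return x[:2] / x[2]                        # homogeneous -> Cartesian mapping
```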
H.2 Warp f3DMM
We now show how to compute the number of operations of the warp f3dmm. We
recall the warp structure from Equation 5.42 and we show its dimensions as follows:
    f3dmm = K_{3×3} ( R_{3×3} + ( R_{3×3} Bs_{3×K} c_{K×1} − t_{3×1} ) n^⊤_{1×3} ) K^{−1}_{3×3} ũ′_{3×1},   (H.4)
where we define ˜u′
as in Equation H.1. We show the step-by-step complexity in the
following:
f3dmm = K
3×3
R
3×3
+ R
3×3
Bs
3×K
c
K×1
< 3K >M+< 3K − 3 >A
− t
3×1
n⊤
1×3
K−1
3×3
˜u′
3×1
,
= K
3×3
R
3×3
+ R
3×3
Bsc
3×1
− t
3×1
< 3 >A
n⊤
1×3
K−1
3×3
˜u′
3×1
,
= K
3×3
R
3×3
+ R
3×3
Bsc − t
3×1
< 9 >M+< 6 >A
n⊤
1×3
K−1
3×3
˜u′
3×1
,
= K
3×3
R
3×3
+ R Bsc − t
3×1
n⊤
1×3
< 9 >M
K−1
3×3
˜u′
3×1
,
= K
3×3
R
3×3
+ R Bsc − t n⊤
3×3
< 9 >A
K−1
3×3
˜u′
3×1
,
= K
3×3
R + R Bsc − t n⊤
3×3
< 27 >M+< 18 >A
K−1
3×3
˜u′
3×1
,
= K R + R Bsc − t n⊤
3×3
K−1
3×3
< 27 >M+< 18 >A
˜u′
3×1
,
= K R + R Bsc − t n⊤
K−1
3×3
˜u′
3×1
< 9 >M+< 6 >A
.
Summing up these partial complexities—and including 2 multiplications due to
the mapping from P^2 to R^2—we compute the complexity of the warp f3dmm as:
    Θ3dmm = <83 + 3K>M + <57 + 3K>A.   (H.5)
H.3 Jacobian of Algorithm HB3DTM
We calculate the Jacobian matrix by separately computing each column element
from Equation 5.31. Notice that the expression S ((1, t)^⊤ ⊗ I_9) is common to all the
elements of the row of the Jacobian, so we just compute it once to avoid including
repeated operations.
J1 = S1
1×36
1
t
⊗ I9
36×9
< 27 >M+< 27 >A
vec( ˙R
⊤
αt
3×3
R
3×3
)
< 27 >M+< 18 >A
= S1
1
t
⊗ I9
1×9
vec(˙R
⊤
αt
R)
9×1
< 9 >M+< 8 >A
J2 = S1
1
t
⊗ I9
1×9
vec( ˙R
⊤
βt
3×3
R
3×3
)
< 27 >M+< 18 >A
= S1
1
t
⊗ I9
1×9
vec(˙R
⊤
βt
R)
9×1
< 9 >M+< 8 >A
J3 = S1
1
t
⊗ I9
1×9
vec( ˙R
⊤
γt
3×3
R
3×3
)
< 27 >M+< 18 >A
= S1
1
t
⊗ I9
1×9
vec(˙R
⊤
γt
R)
9×1
< 9 >M+< 8 >A
J4 = S1
1
t
⊗ I9
1×9
I3 ⊗ n
9×3
< 9 >M+< 6 >A
R⊤


1
0
0

 ,
3×1
= S1
1
t
⊗ I9 I3 ⊗ n
1×3
R⊤


1
0
0

 .
3×1
< 3 >M+< 2 >A
J5 = S1
1
t
⊗ I9
1×9
I3 ⊗ n
9×3
< 9 >M+< 6 >A
R⊤


0
1
0

 ,
3×1
= S1
1
t
⊗ I9 I3 ⊗ n
1×3
R⊤


0
1
0

 .
3×1
< 3 >M+< 2 >A
J6 = S1
1
t
⊗ I9
1×9
I3 ⊗ n
9×3
< 9 >M+< 6 >A
R⊤


0
0
1

 ,
3×1
= S1
1
t
⊗ I9 I3 ⊗ n
1×3
R⊤


0
0
1

 .
3×1
< 3 >M+< 2 >A
Summing up all the partial complexities of Equations H.3 we have the resulting
complexity for computing the Jacobian matrix JHB3DTM:
ΘJHB3DTM
=< 81 + 75Nv > M+ < 54 + 66Nv > A. (H.6)
Notice that some operations—e.g the product R⊤ ˙R∆—are only computed once whereas
the rest of the complexities are computed Nv times.
H.4 Jacobian of Algorithm HB3DTMNF
In the following we compute the number of operations associated to Equations 5.26,
that is, we compute the complexity of the gradient replacement stage, not the factor-
ization. Again, we compute the complexity for Equations 5.26 in the most efficient
way: first we compute the summation I3 − (n(i)⊤
t)I3 + tn(i)⊤
, and then we com-
pute the remaining products:
J1 = D⊤
3×1
I3
3×3
− ( n(i)⊤
1×3
t
3×1
) I3
3×3
< 6 >M+< 2 >A
+ t
3×1
n(i)⊤
1×3
< 9 >M
R⊤
3×3
˙Rαt
3×3
< 27 >M+< 18 >A
v,
3×1
= D⊤
3×1
I3 − (n⊤
t)I3 + tn(i)⊤
3×3
< 9 >M+< 6 >A
R⊤ ˙Rαt
3×3
v
3×1
< 9 >M+< 6 >A
= D(i)⊤
I3 − (n(i)⊤
t)I3 + tn⊤
1×3
R⊤ ˙Rαt v,
3×1
< 3 >M+< 2 >A
J2 = D⊤
I3 − (n⊤
t)I3 + tn⊤
1×3
R⊤
3×3
˙Rβt
3×3
< 27 >M+< 18 >A
v,
3×1
= D⊤
I3 − (n⊤
t)I3 + tn⊤
1×3
R⊤ ˙Rβt
3×3
v,
3×1
< 9 >M+< 6 >A
= D⊤
I3 − (n⊤
t)I3 + tn⊤
1×3
R⊤ ˙Rβt v,
3×1
< 3 >M+< 2 >A
J3 = D⊤
I3 − (n⊤
t)I3 + tn⊤
1×3
R⊤
3×3
˙Rγt
3×3
< 27 >M+< 18 >A
v,
3×1
= D⊤
I3 − (n⊤
t)I3 + tn⊤
1×3
R⊤ ˙Rγt
3×3
v,
3×1
< 9 >M+< 6 >A
= D⊤
I3 − (n⊤
t)I3 + tn⊤
1×3
R⊤ ˙Rγt v,
3×1
< 3 >M+< 2 >A
J4 = D⊤
I3 − (n⊤
t)I3 + tn⊤
1×3
R⊤
3×3


1
0
0


3×1
n⊤
1×3
v,
3×1
< 9 >M
= D⊤
I3 − (n⊤
t)I3 + tn⊤
1×3
R⊤
3×3


1
0
0


3×1
n⊤
1×3
v,
3×1
< 3 >M+< 2 >A
J5 = D⊤
I3 − (n⊤
t)I3 + tn⊤
1×3
R⊤
3×3


0
1
0


3×1
n⊤
1×3
v,
3×1
< 9 >M
= D⊤
I3 − (n⊤
t)I3 + tn⊤
1×3
R⊤
3×3


0
1
0


3×1
n⊤
1×3
v,
3×1
< 3 >M+< 2 >A
J6 = D⊤
I3 − (n⊤
t)I3 + tn⊤
1×3
R⊤
3×3


0
0
1


3×1
n⊤
1×3
v,
3×1
< 9 >M
= D⊤
I3 − (n⊤
t)I3 + tn⊤
1×3
R⊤
3×3


0
0
1


3×1
n⊤
1×3
v.
3×1
< 3 >M+< 2 >A
Summing up all the partial complexities of Equations H.4 we have the resulting
complexity for computing the Jacobian matrix JHB3DTMNF:
ΘJHB3DTMNF
=< 81 + 96Nv > M+ < 54 + 56Nv > A. (H.7)
Notice that some operations—e.g the product R⊤ ˙R∆—are only computed once whereas
the rest of the complexities are computed Nv times.
H.5 Jacobian of Algorithm HB3DMMNF
In the following we compute the number of operations associated to computing
the Jacobian of algorithm hb3dmmnf (Equation 5.57); notice that this algorithm
only uses the gradient replacement, not the factorization stage. We compute the
complexity of Equations 5.57 in the most efficient way: we first compute the common
term D⊤
(Equation 5.58), that is,
D⊤
= ∇ˆuT [˜u]⊤
1×3
I3
3×3
+( n(i)⊤
1×3
B(i)
3×K
c
K×1
) I3
3×3
−( n(i)⊤
1×3
t
3×1
) I3
3×3
− B(i)
3×K
c
K×1
n(i)⊤
1×3
+ t
3×1
n(i)⊤
1×3
(H.8)
We break down Equation H.8 to easily display the number of operations as
follows:
D⊤
= ∇ˆuT [˜u]⊤
1×3
I3 + ( n⊤
1×3
Bs
3×K
c
K×1
) I3
3×3
< 3K + 6 >M+< 3(K − 1) + 2 >A
− ( n⊤
1×3
t
3×1
) I3
3×3
< 6 >M+< 2 >A
− Bs
3×K
c
K×1
n⊤
1×3
< 3K + 9 >M+< 3(K − 1) >A
+ t
3×1
n⊤
1×3
< 9 >M
= ∇ˆuT [˜u]⊤
1×3
I3
3×3
+ (n⊤
Bsc)I3
3×3
− (n⊤
t)I3
3×3
− Bscn⊤
3×3
+ tn⊤
3×3
< 9 >A+< 9 >A+< 9 >A+< 9 >A
= ∇ˆuT [˜u]⊤
1×3
I3 + (n⊤
Bsc)I3 − (n⊤
t)I3 − Bscn⊤
+ tn⊤
3×3
< 9 >M+< 6 >A
The complexity of computing D⊤
is the sum of all partial complexities of Equa-
tion H.5, that is,
ΘD⊤ =< 39 + 6K > M+ < 40 + 6K > A. (H.9)
Once we have computed D⊤
there is no need to compute it again. We proceed to
compute each term of the Jacobian as follows:
J1 = D⊤
1×3
R⊤
3×3
˙Rα
3×3
< 27 >M+< 18 >A
I3
3×3
+ Bscn⊤
3×3
v
= D⊤
1×3
R⊤ ˙Rα
3×3
I3 + Bscn⊤
< 9 >A
v
= D⊤
1×3
R⊤ ˙Rα
3×3
I3 + Bscn⊤
3×3
< 27 >M+< 18 >A
v
3×1
= D⊤
1×3
R⊤ ˙Rα
3×3
I3 + Bscn⊤
3×3
< 27 >M+< 18 >A
v
3×1
= D⊤
1×3
R⊤ ˙Rα I3 + Bscn⊤
3×3
v.
3×1
< 12 >M+< 8 >A
J2 = D⊤
1×3
R⊤
3×3
˙Rβ
3×3
< 27 >M+< 18 >A
I3 + Bscn⊤
3×3
v
3×1
= D⊤
1×3
R⊤ ˙Rβ
3×3
I3 + Bscn⊤
3×3
< 27 >M+< 18 >A
v
3×1
= D⊤
1×3
R⊤ ˙Rβ I3 + B(i)
cn⊤
3×3
v.
3×1
< 12 >M+< 8 >A
J3 = D⊤
1×3
R⊤
3×3
˙Rγ
3×3
< 27 >M+< 18 >A
I3 + Bscn⊤
3×3
v
3×1
= D⊤
1×3
R⊤ ˙Rγ
3×3
I3 + Bscn⊤
3×3
< 27 >M+< 18 >A
v
3×1
= D⊤
1×3
R⊤ ˙Rγ I3 + Bscn⊤
3×3
v.
3×1
< 12 >M+< 8 >A
J4 = D⊤
1×3
R⊤


1
0
0


3×1
n⊤
1×3
< 9 >M
v
3×1
= D⊤
1×3
R⊤


1
0
0

 n⊤
3×3
v
3×1
< 12 >M+< 8 >A
= D⊤
1×3
R⊤


1
0
0

 n⊤
v
3×1
< 3 >M+< 2 >A
J5 = D⊤
1×3
R⊤


0
1
0


3×1
n⊤
1×3
< 9 >M
v
3×1
= D⊤
1×3
R⊤


0
1
0

 n⊤
3×3
v
3×1
< 12 >M+< 8 >A
= D⊤
1×3
R⊤


0
1
0

 n⊤
v
3×1
< 3 >M+< 2 >A
J6 = D⊤
1×3
R⊤


0
0
1


3×1
n⊤
1×3
< 9 >M
v
3×1
= D⊤
1×3
R⊤


0
0
1

 n⊤
3×3
v
3×1
< 12 >M+< 8 >A
= D⊤
1×3
R⊤


0
0
1

 n⊤
v
3×1
< 3 >M+< 2 >A
Jk = D⊤
1×3
R⊤
3×3
Bk
3×1
< 9 >M+< 6 >A
n⊤
1×3
v
3×1
< 3 >M+< 2 >A
= D⊤
1×3
R⊤
Bk
3×1
n⊤
v
1×1
< 3 >M
= D⊤
1×3
R⊤
Bkn⊤
v
3×1
< 3 >M+< 2 >A
, i = 1, . . . , K.
Summing up all the partial complexities of Equations H.5 and H.9 we have the
resulting complexity for computing the Jacobian matrix Jhb3dmmnf:
ΘJhb3dmmnf =< 81 + (24K + 219)Nv > M+ < 54 + (16K + 171)Nv > A. (H.10)
Notice that some operations—e.g the product R⊤ ˙R∆—are only computed once
whereas the rest of the complexities are computed Nv times.
H.6 Jacobian of Algorithm HB3DMMSF
In the following we compute the number of operations associated with computing the
Jacobian of algorithm hb3dmmsf (Equation 8.5) by means of a partial factorization.
We compute the complexity of Equations 8.5 in the most efficient way, that is, we
avoid computing repeated operations more than once. We show the complexities
associated with each element of the row of the Jacobian in the following:
J1 = S1
1×(3(K+4))

I3 ⊗


1
t
c




(3(K+4))×3
< 3K + 5 >M+< 3K + 3 >A
R⊤
3×3
˙Rαt
3×3
v(i)
3×1
+ B(i)
3×K
c
K×1
n(i)⊤
1×3
v(i)
3×1
= S1

I3 ⊗


1
t
c




1×3
R⊤
3×3
˙Rαt
3×3
< 27 >M+< 18 >A
v(i)
3×1
+ B(i)
3×K
c
K×1
n(i)⊤
1×3
v(i)
3×1
< 3 >M+< 2 >A
= S1

I3 ⊗


1
t
c




1×3
R⊤ ˙Rαt
3×3
v(i)
3×1
+ B(i)
3×K
c
K×1
n(i)⊤
v(i)
1×1
< 3K + 3 >M+< 3K − 3 >A
= S1

I3 ⊗


1
t
c




1×3
R⊤ ˙Rαt
3×3
v(i)
3×1
+ B(i)
cn(i)⊤
v(i)
3×1
< 3 >A
= S1

I3 ⊗


1
t
c




1×3
R⊤ ˙Rαt
3×3
v(i)
+ B(i)
cn(i)⊤
v(i)
.
3×1
< 12 >M+< 8 >A
J2 = S1

I3 ⊗


1
t
c




1×3
R⊤
3×3
˙Rβt
3×3
< 27 >M+< 18 >A
v(i)
+ B(i)
cn(i)⊤
v(i)
3×1
= S1

I3 ⊗


1
t
c




1×3
R⊤ ˙Rβt
3×3
v(i)
B(i)
cn(i)⊤
v(i)
.
3×1
< 12 >M+< 8 >A
J3 = S1

I3 ⊗


1
t
c




1×3
R⊤
3×3
˙Rγt
3×3
< 27 >M+< 18 >A
v(i)
+ B(i)
cn(i)⊤
v(i)
3×1
= S1

I3 ⊗


1
t
c




1×3
R⊤ ˙Rγt
3×3
v(i)
+ B(i)
cn(i)⊤
v(i)
.
3×1
< 12 >M+< 8 >A
J4 = S1

I3 ⊗


1
t
c




1×3
n(i)⊤
v(i)
1×1
< 3 >M
R⊤


1
0
0


3×1
= S1

I3 ⊗


1
t
c



 n(i)⊤
v(i)
1×3
R⊤


1
0
0

 .
3×1
< 3 >M+< 2 >A
J5 = S1

I3 ⊗


1
t
c




1×3
n(i)⊤
v(i)
1×1
< 3 >M
R⊤


0
1
0


3×1
= S1

I3 ⊗


1
t
c



 n(i)⊤
v(i)
1×3
R⊤


0
1
0

 .
3×1
< 3 >M+< 2 >A
J6 = S1

I3 ⊗


1
t
c




1×3
n(i)⊤
v(i)
1×1
< 3 >M
R⊤


0
0
1


3×1
= S1

I3 ⊗


1
t
c



 n(i)⊤
v(i)
1×3
R⊤


0
0
1

 .
3×1
< 3 >M+< 2 >A
Jk = S1

I3 ⊗


1
t
c




1×3
n⊤
v
1×1
< 3 >M
R⊤
3×3
Bk
3×1
< 9 >M+< 6 >A
= S1

I3 ⊗


1
t
c



 n⊤
v
1×3
R⊤
Bk.
3×1
< 3 >M+< 2 >A
Summing up all the partial complexities of Equations H.6 we obtain the resulting
complexity for computing the Jacobian matrix Jhb3dmmsf:
ΘJhb3dmmsf =< 81 + (18K + 60)Nv > M+ < 54 + (14K + 36)Nv > A. (H.11)
Notice that some operations—e.g the product R⊤ ˙R∆—are only computed once whereas
the rest of the complexities are computed Nv times.
  • 9. 7 Computational Complexity 91 7.1 Complexity Measures . . . . . . . . . . . . . . . . . . . . . . . . . . . 91 7.1.1 Number of Operations . . . . . . . . . . . . . . . . . . . . . . 91 7.1.2 Complexity of Matrix Operations . . . . . . . . . . . . . . . . 92 7.1.3 Comparing Algorithm Complexities . . . . . . . . . . . . . . . 93 7.2 Algorithm Naming Conventions . . . . . . . . . . . . . . . . . . . . . 94 7.2.1 Additive Algorithms . . . . . . . . . . . . . . . . . . . . . . . 95 7.2.2 Compositional Algorithms . . . . . . . . . . . . . . . . . . . . 96 7.3 Complexity of Algorithms . . . . . . . . . . . . . . . . . . . . . . . . 96 7.3.1 Additive Algorithms . . . . . . . . . . . . . . . . . . . . . . . 97 7.3.2 Compositional Algorithms . . . . . . . . . . . . . . . . . . . . 103 7.4 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 105 8 Experiments 107 8.1 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 107 8.2 Features and Measures . . . . . . . . . . . . . . . . . . . . . . . . . . 113 8.2.1 Numerical Ranges for Features . . . . . . . . . . . . . . . . . . 115 8.3 Generation of Synthetic Experiments . . . . . . . . . . . . . . . . . . 116 8.3.1 Synthetic Datasets and Images . . . . . . . . . . . . . . . . . 118 8.3.2 Generation of Result Plots . . . . . . . . . . . . . . . . . . . . 120 8.4 Implementation Details . . . . . . . . . . . . . . . . . . . . . . . . . . 122 8.4.1 Convergence Criteria . . . . . . . . . . . . . . . . . . . . . . . 122 8.4.2 Visibility Management . . . . . . . . . . . . . . . . . . . . . . 122 8.4.3 Scale of Homographies . . . . . . . . . . . . . . . . . . . . . . 125 8.4.4 Minimization of Jacobian Operations . . . . . . . . . . . . . . 126 8.5 Additive Algorithms . . . . . . . . . . . . . . . . . . . . . . . . . . . 126 8.5.1 Experimental Hypotheses . . . . . . . . . . . . . . . . . . . . 126 8.5.2 Experiments with Synthetic Rigid data . . . . . . . . . . . . . 127 8.5.3 Experiments with Synthetic Nonrigid data . . . . . . . . . . . 142 8.5.4 Experiments With Nonrigid Sequence . . . . . . . . . . . . . . 151 8.5.5 Experiments with real Rigid data . . . . . . . . . . . . . . . . 154 8.5.6 Experiment with real Nonrigid data . . . . . . . . . . . . . . . 158 8.6 Compositional Algorithms . . . . . . . . . . . . . . . . . . . . . . . . 163 8.6.1 Experimental Hyphoteses . . . . . . . . . . . . . . . . . . . . 163 8.6.2 Experiments with Synthetic Rigid data . . . . . . . . . . . . . 163 8.7 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 173 9 Conclusions and Future work 179 9.1 Summary of Contributions . . . . . . . . . . . . . . . . . . . . . . . . 179 9.2 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 180 9.3 Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 183 A Gauss-Newton Optimization 201 B Plane-induced Homography 203 vii
  • 10. C Plane+Parallax-constrained Homography 205 C.1 Compositional Form . . . . . . . . . . . . . . . . . . . . . . . . . . . 207 D Methodical Factorization 209 D.1 Basic Definitions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 209 D.2 Lemmas that Re-organize Product of Matrices . . . . . . . . . . . . . 211 D.3 Lemmas that Re-organize Kronecker Products . . . . . . . . . . . . . 215 D.4 Lemmas that Re-organize Sums of Matrices . . . . . . . . . . . . . . 216 E Methodical Factorization of f3DTM 219 F Methodical Factorization of f3DMM (Partial case) 223 G Methodical Factorization of f3DMM (Full case) 225 H Detailed Complexity of Algorithms 235 H.1 Warp f3DTM . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 235 H.2 Warp f3DMM . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 236 H.3 Jacobian of Algorithm HB3DTM . . . . . . . . . . . . . . . . . . . . 237 H.4 Jacobian of Algorithm HB3DTMNF . . . . . . . . . . . . . . . . . . 239 H.5 Jacobian of Algorithm HB3DMMNF . . . . . . . . . . . . . . . . . 241 H.6 Jacobian of Algorithm HB3DMMSF . . . . . . . . . . . . . . . . . . 246 viii
  • 11. List of Figures 1.1 Example of 3D rigid tracking. . . . . . . . . . . . . . . . . . . . . 6 1.2 3D Nonrigid Tracking. . . . . . . . . . . . . . . . . . . . . . . . . 6 1.3 Image registration. . . . . . . . . . . . . . . . . . . . . . . . . . . . 7 1.4 Industrial applications of 3D tracking. . . . . . . . . . . . . . . 9 1.5 Motion capture in the film industry. . . . . . . . . . . . . . . . 10 1.6 Markerless facial motion capture. . . . . . . . . . . . . . . . . . 11 3.1 Imaging geometry. . . . . . . . . . . . . . . . . . . . . . . . . . . . 22 3.2 Iterative gradient descent image registration. . . . . . . . . . . 24 3.3 Generic descent method for image registration. . . . . . . . . . 26 3.4 Lucas-Kanade image registration. . . . . . . . . . . . . . . . . . 28 3.5 Hager-Belhumeur image registration. . . . . . . . . . . . . . . . 32 3.6 Forward compositional image registration. . . . . . . . . . . . . 34 3.7 Inverse compositional image registration. . . . . . . . . . . . . 36 4.1 Depiction of Image Gradients. . . . . . . . . . . . . . . . . . . . 41 4.2 Image Gradient in P2 . . . . . . . . . . . . . . . . . . . . . . . . . . 43 4.3 Image gradient in R3 . . . . . . . . . . . . . . . . . . . . . . . . . . 45 4.4 Comparison between BCC and GEE. . . . . . . . . . . . . . . . 47 4.5 Gradients and Convergence. . . . . . . . . . . . . . . . . . . . . . 49 4.6 Open Subsets in Various Domains. . . . . . . . . . . . . . . . . . 49 5.1 3D Textured Model. . . . . . . . . . . . . . . . . . . . . . . . . . . 56 5.2 Shape-induced homographies. . . . . . . . . . . . . . . . . . . . . 58 5.3 Warp defined on the reference frame. . . . . . . . . . . . . . . . 59 5.4 Reference frame advantages. . . . . . . . . . . . . . . . . . . . . . 60 5.5 Nonrigid Morphable Models. . . . . . . . . . . . . . . . . . . . . 65 5.6 Nonrigid shape-induced homographies. . . . . . . . . . . . . . . 67 5.7 Deformable warp defined on the reference frame. . . . . . . . 68 6.1 Change of variables in IC. . . . . . . . . . . . . . . . . . . . . . . 80 6.2 Forward compositional image registration. . . . . . . . . . . . . 83 6.3 Generalized inverse compositional image registration. . . . . . 88 7.1 Complexity of Additive Algorithms. . . . . . . . . . . . . . . . . 102 7.2 Complexities of Compositional Algorithms . . . . . . . . . . . 105 ix
  • 12. 8.1 Registration vs. Tracking. . . . . . . . . . . . . . . . . . . . . . . 109 8.2 Algorithm initialization . . . . . . . . . . . . . . . . . . . . . . . . 110 8.3 Accuracy and convergence. . . . . . . . . . . . . . . . . . . . . . 114 8.4 Ground Truth and Noise Variance. . . . . . . . . . . . . . . . . 117 8.5 Definition of Datasets. . . . . . . . . . . . . . . . . . . . . . . . . 118 8.6 Example of Synthetic Datasets. . . . . . . . . . . . . . . . . . . . 119 8.7 Experimental Evaluation with Synthetic Data . . . . . . . . . 121 8.8 Visibility management. . . . . . . . . . . . . . . . . . . . . . . . . 123 8.9 Efficiently solving of WLS. . . . . . . . . . . . . . . . . . . . . . . 125 8.10 The cube model. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 128 8.11 The face model. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 128 8.12 The tea box model. . . . . . . . . . . . . . . . . . . . . . . . . . . 129 8.13 Results from dataset DS1 for cube. . . . . . . . . . . . . . . . . . 130 8.14 Results from dataset DS2 for cube. . . . . . . . . . . . . . . . . . 131 8.15 Results from dataset DS3 for cube. . . . . . . . . . . . . . . . . . 132 8.16 Results from dataset DS4 for cube. . . . . . . . . . . . . . . . . . 133 8.17 Results from dataset DS5 for cube. . . . . . . . . . . . . . . . . . 134 8.18 Results from dataset DS6 for cube. . . . . . . . . . . . . . . . . . 135 8.19 tea box sequence. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 136 8.20 Results for the tea box sequence. . . . . . . . . . . . . . . . . . . 137 8.21 Estimated parameters from teabox sequence. . . . . . . . . . . 138 8.22 Estimated parameters from face sequence. . . . . . . . . . . . . 140 8.23 Good texture vs. bad texture. . . . . . . . . . . . . . . . . . . . 141 8.24 The face-deform model. . . . . . . . . . . . . . . . . . . . . . . . . 142 8.25 Distribution of Synthetic Datasets. . . . . . . . . . . . . . . . . 143 8.26 Results from dataset DS1 for face-deform. . . . . . . . . . . . . 145 8.27 Results from dataset DS2 for face-deform. . . . . . . . . . . . . 146 8.28 Results from dataset DS3 for face-deform. . . . . . . . . . . . . 147 8.29 Results from dataset DS4 for face-deform. . . . . . . . . . . . . 148 8.30 Results from dataset DS5 for face-deform. . . . . . . . . . . . . 149 8.31 Results from dataset DS6 for face-deform. . . . . . . . . . . . . 150 8.32 face-deform sequence. . . . . . . . . . . . . . . . . . . . . . . . . . 151 8.33 Results from face-deform sequence. . . . . . . . . . . . . . . . . 152 8.34 Estimated parameters from face-deform sequence. . . . . . . . 153 8.35 The cube-real model. . . . . . . . . . . . . . . . . . . . . . . . . . 154 8.36 The cube-real sequence. . . . . . . . . . . . . . . . . . . . . . . . 156 8.37 Results from cube-real sequence. . . . . . . . . . . . . . . . . . . 157 8.38 Selected facial scans used to build the model. . . . . . . . . . . 158 8.39 Unfolded texture model. . . . . . . . . . . . . . . . . . . . . . . . 159 8.40 The face-real sequence. . . . . . . . . . . . . . . . . . . . . . . . 160 8.41 Anchor points in the model. . . . . . . . . . . . . . . . . . . . . . 161 8.42 Results for the face-real sequence. . . . . . . . . . . . . . . . . 162 8.43 The plane model. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 164 8.44 Distribution of Synthetic Datasets. . . . . . . . . . . . . . . . . 165 x
  • 13. 8.45 Results from dataset DS1 for plane. . . . . . . . . . . . . . . . . 167 8.46 Results from dataset DS2 for plane. . . . . . . . . . . . . . . . . 168 8.47 Results from dataset DS3 for plane. . . . . . . . . . . . . . . . . 169 8.48 Results from dataset DS4 for plane. . . . . . . . . . . . . . . . . 170 8.49 Results from dataset DS5 for plane. . . . . . . . . . . . . . . . . 171 8.50 Results from dataset DS6 for plane. . . . . . . . . . . . . . . . . 172 8.51 Average Time per iteration. . . . . . . . . . . . . . . . . . . . . . 176 9.1 Spiderweb Plots for Image Registration Algorithms. . . . . . 182 9.2 Spherical Harmonics-based Illumination Model . . . . . . . . . 184 9.3 Tracking by simultaneously using texture and edges infor- mation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 185 9.4 Efficient tracking using multiple views . . . . . . . . . . . . . . 186 B.1 Plane-induced homography. . . . . . . . . . . . . . . . . . . . . . 203 C.1 Plane+Parallax-constrained homograpy. . . . . . . . . . . . . . 206 xi
  • 15. List of Tables 4.1 Characteristics of the warps . . . . . . . . . . . . . . . . . . . . . 50 6.1 Relationship between compositional algorithms and warps . . 89 6.2 Requirements for Optimization Algorithms . . . . . . . . . . . 90 7.1 Complexity of matrix operations. . . . . . . . . . . . . . . . . . 93 7.2 Additive testing algorithms. . . . . . . . . . . . . . . . . . . . . . 95 7.3 Additive testing algorithms. . . . . . . . . . . . . . . . . . . . . . 96 7.4 Complexity of Algorithm LK3DTM. . . . . . . . . . . . . . . . . 97 7.5 Complexity of Algorithm HB3DTM. . . . . . . . . . . . . . . . 98 7.6 Complexity of Algorithm LK3DMM. . . . . . . . . . . . . . . . 98 7.7 Complexity of Algorithm HB3DMMNF. . . . . . . . . . . . . . 99 7.8 Complexity of Algorithm HB3DMM. . . . . . . . . . . . . . . . 100 7.9 Complexity of Algorithm HB3DMMSF. . . . . . . . . . . . . . 101 7.10 Complexities of Additive Algorithms. . . . . . . . . . . . . . . . 101 7.11 Complexity of Algorithm LKH8. . . . . . . . . . . . . . . . . . . 103 7.12 Complexity of Algorithm ICH8. . . . . . . . . . . . . . . . . . . 103 7.13 Complexity of Algorithm HBH8. . . . . . . . . . . . . . . . . . . 104 7.14 Complexity of Algorithm GICH8. . . . . . . . . . . . . . . . . . 104 7.15 Complexities of Compositional Algorithms. . . . . . . . . . . . 106 7.16 Comparison of Relative Complexities for Additive Algorithms106 7.17 Comparison of Relative Complexities for Compositional Al- gorithms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 106 8.1 Registration vs. tracking in efficient methods . . . . . . . . . . 111 8.2 Features and Measures. . . . . . . . . . . . . . . . . . . . . . . . . 115 8.3 Numerical Ranges for Features. . . . . . . . . . . . . . . . . . . . 115 8.4 Evaluated Additive Algorithms . . . . . . . . . . . . . . . . . . . 127 8.5 Ranges of parameters for cube experiments. . . . . . . . . . . . 129 8.6 Average reprojection error vs. noise for cube. . . . . . . . . . . 129 8.7 Ranges of parameters for face-deform experiments. . . . . . . 144 8.8 Average reprojection error vs. noise for face-deform. . . . . . 144 8.9 Evaluated Compositional Algorithms . . . . . . . . . . . . . . . 164 8.10 Ranges of motion parameters for each dataset. . . . . . . . . . 165 8.11 Average reprojection error vs. noise for plane. . . . . . . . . . 166 xiii
  • 16. 9.1 Classification of Motion Warps. . . . . . . . . . . . . . . . . . . . 181 D.1 Lemmas used to re-arrange matrices product. . . . . . . . . . 214 D.2 Lemmas used to re-arrange Kronecker matrix products. . . . 216 xiv
  • 17. List of Algorithms 1 Outline of the basic GN-based descent method for image registration . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26 2 Outline of the Lucas-Kanade algorithm. . . . . . . . . . . . . . 28 3 Outline of the Hager-Belhumeur algorithm. . . . . . . . . . . . 31 4 Outline of the Forward Compositional algorithm. . . . . . . . 34 5 Outline of the Inverse Compositional algorithm. . . . . . . . . 36 6 Iterative factorization of the Jacobian matrix. . . . . . . . . . 54 7 Outline of the HB3DTM algorithm. . . . . . . . . . . . . . . . . 64 8 Outline of the full-factorized HB3DMM algorithm. . . . . . . 75 9 Outline of the HB3DMMSF algorithm. . . . . . . . . . . . . . . 76 10 Outline of the Efficient Forward Compositional algorithm. . . 82 11 Outline of the Generalized Inverse Compositional algorithm. 88 12 Creating the synthetic datasets. . . . . . . . . . . . . . . . . . . 119 13 Outline of the GN algorithm. . . . . . . . . . . . . . . . . . . . . 202 xv
  • 19. Resumen

This thesis addresses the problem of efficiently tracking 3D objects in image sequences. We approach 3D tracking by using direct image registration, a technique that aligns two images using their intensity values. Image registration is usually solved with iterative optimization methods, in which the function to be minimized depends on the error in the intensity values. In this thesis we examine the most common image registration methods, with emphasis on those that use efficient optimization algorithms.

We investigate two forms of efficient registration. The first comprises the additive registration methods: the motion parameters are computed incrementally through a linear approximation of the error function. Within this family of algorithms, we focus on the factorization method of Hager and Belhumeur. We introduce a necessary requirement that the factorization algorithm must satisfy to achieve good convergence. In addition, we propose an automatic factorization procedure that allows us to track both rigid and deformable 3D objects.

The second type are the so-called compositional registration methods, in which the error norm is rewritten using function composition. We study the most common compositional methods, with emphasis on the fastest registration method, the inverse compositional algorithm. We introduce a new compositional registration method, the Efficient Forward Compositional algorithm, which lets us interpret the working mechanisms of the inverse compositional algorithm. Thanks to this novel interpretation, we state two fundamental requirements for efficient compositional algorithms.

Finally, we perform a series of experiments with real and synthetic data to verify the theoretical claims. In addition, we distinguish between the registration and tracking problems for efficient algorithms: algorithms that fulfil their requirement(s) may be used for image registration, but not for tracking.
  • 21. Abstract

This thesis deals with the problem of efficiently tracking 3D objects in sequences of images. We tackle the efficient 3D tracking problem by using direct image registration. This problem is posed as an iterative optimization procedure that minimizes a brightness error norm. We review the most popular iterative methods for image registration in the literature, turning our attention to those algorithms that use efficient optimization techniques.

Two forms of efficient registration algorithms are investigated. The first type comprises the additive registration algorithms: these algorithms incrementally compute the motion parameters by linearly approximating the brightness error function. We centre our attention on Hager and Belhumeur's factorization-based algorithm for image registration. We propose a fundamental requirement that factorization-based algorithms must satisfy to guarantee good convergence, and introduce a systematic procedure that automatically computes the factorization. Finally, we also introduce two warp functions, satisfying this requirement, that register rigid and nonrigid 3D targets.

The second type comprises the compositional registration algorithms, where the brightness error function is written by using function composition. We study the current approaches to compositional image alignment, and we emphasize the importance of the Inverse Compositional method, which is known to be the most efficient image registration algorithm. We introduce a new algorithm, the Efficient Forward Compositional image registration: this algorithm avoids the need to invert the warping function, and provides a new interpretation of the working mechanisms of the inverse compositional alignment. By using this information, we propose two fundamental requirements that guarantee the convergence of compositional image registration methods.

Finally, we support our claims with extensive experimental testing on synthetic and real-world data. We propose a distinction between image registration and tracking when using efficient algorithms. We show that, depending on whether the fundamental requirements hold, some efficient algorithms are eligible for image registration but not for tracking.
  • 23. Notations

Specific Sets and Constants
  X   Set of target points or target region.
  Ω   Set of target points currently visible.
  N   Number of points in the target region, i.e., N = |X|.
  NΩ   Number of visible target points, i.e., NΩ = |Ω|.
  P   Dimension of the parameter space.
  C   Number of image channels.
  K   Dimension of the deformations space.
  F   Number of frames in the image sequence.

Vectors and Matrices
  a   Lowercase bold letters denote vectors.
  Am×n   Monospace uppercase letters denote m × n matrices.
  vec(A)   Vectorization of matrix A: if A is an m × n matrix, vec(A) is an mn × 1 vector.
  Ik ∈ Mk×k   k × k identity matrix.
  I   3 × 3 identity matrix.
  0k ∈ Rk   k × 1 vector full of zeroes.
  0m×n ∈ Mm×n   m × n matrix full of zeroes.

Camera Model Notations
  x ∈ R2   Pixel location in the image.
  x̂ ∈ P2   Location in the projective space.
  X ∈ R3   Point in Cartesian coordinates.
  Xc ∈ R3   Point expressed in the camera reference system.
  K ∈ M3×3   3 × 3 camera intrinsics matrix.
  P ∈ M3×4   3 × 4 camera projection matrix.
  • 24. Imaging Notations
  T(x) ∈ RC   Brightness value of the template image at pixel x.
  I(x, t) ∈ RC   Brightness value of the current image for pixel x at instant t.
  It(x)   Another notation for I(x, t).
  T, It   Vector forms of the functions T and It.
  I[x]   Composite function of I ◦ p, that is, I[x] = I(p(x)).

Optimization Notations
  µ ∈ RP   Column vector of motion parameters.
  µ0 ∈ RP   Initial guess of the optimization.
  µi ∈ RP   Parameters at the i-th iteration of the optimization.
  µ* ∈ RP   Actual optimum of the optimization.
  µt ∈ RP   Parameters at image t.
  µJ ∈ RP   Parameters where the Jacobian is computed for efficient algorithms.
  δµ ∈ RP   Incremental step at the current state of the optimization.
  ℓ(δµ)   Linear model for the incremental step δµ.
  L(δµ)   Local minimizer for the incremental step δµ.
  r(µ) ∈ RN   N × 1 vector-valued residual function at parameters µ.
  ∇x̂f(x)   Derivatives of function f with respect to variables x, instantiated at x.
  J(µ) ∈ MN×P   Jacobian matrix of the brightness dissimilarity at µ (i.e., J(µ) = ∇µ̂D(X; µ)).
  H(µ) ∈ MP×P   Hessian matrix of the brightness dissimilarity at µ (i.e., H(µ) = ∇²µ̂D(X; µ)).

Warp Function Notations
  f(x; µ) : Rn × RP → Rn   Motion model or warp.
  p : Rn → R2   Projection onto the Cartesian plane.
  R ∈ M3×3   3 × 3 rotation matrix.
  ri ∈ R3   Columns of the rotation matrix R (i.e., R = (r1, r2, r3)).
  t ∈ R3   Translation vector in Euclidean space.
  D : R2 × Rp → R   Dissimilarity function.
  U : Rp × Rp → Rp   Parameters update function.
  ψ : Rp × Rp → Rp   Jacobian update function for algorithm GIC.
  • 25. Factorization Notations
  ⊗   Kronecker product.
  ⊙   Row-wise Kronecker product.
  S(x)   Constant matrix in the factorization method that is computed from the target structure and camera calibration.
  M(µ)   Variable matrix in the factorization methods that is computed from the motion parameters.
  W ∈ Mp×p   Weighting matrix for Weighted Least-Squares.
  π : Rn → Rn   Permutation of the set {1, . . . , n}.
  Pπ(n) ∈ Mn×n   Permutation matrix of the set {1, . . . , n}.
  π(n, q)   Permutation of the set {1, . . . , n} with ratio q.

3D Models Notations
  F ⊂ R2   Reference frame for algorithm HB.
  S : F → R3   Target shape function.
  T : F → RC   Target texture function.
  u ∈ F   Target coordinates in the reference frame.
  S ∈ M3×Nv   Target 3D shape.
  s ∈ R3   Shape coordinates in the Euclidean space.
  s0 ∈ R3   Mean shape of the target generative model.
  si ∈ R3   i-th basis of deformation of the target generative model.
  n⊤ ∈ R3   Normal vector to a given triangle; n⊤ is normalized with the triangle depth (i.e., if x belongs to the triangle, then n⊤x = 1).
  Bs ∈ M3×K   Basis of deformations.
  c ∈ RK   Vector containing K deformation coefficients.
  HA ∈ M3×3   Affine warp between the image reference frame and F.
  Ṙ∆   Derivatives of the rotation matrix R with respect to the Euler angle ∆ ∈ {α, β, γ}.
  λ ∈ R   Homogeneous scale factor.
  v ∈ R3   Change of variables defined as v = K⁻¹HAû.

Function Naming Conventions
  fH82D : P2 → P2   8-dof homography.
  fH6P : P2 → P2   Plane-induced homography.
  fH6S : P2 → P2   Shape-induced homography.
  f3DTM : P2 → P2   3D Textured Model motion model.
  fH6D : P2 → P2   Deformable shape-induced homography.
  f3DMM : P2 → P2   3D Textured Morphable Model motion model.
  ε : Rp → R   Reprojection error function.
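The 3D Models Notations above describe a linear generative shape model: a mean shape s0 deformed by K basis shapes (collected in Bs) weighted by the coefficients c. The sketch below is only an illustration of that notation, not the thesis implementation; array shapes and names are assumptions chosen for convenience.

```python
import numpy as np

def morphable_shape(s0, Bs, c):
    """Evaluate a linear generative (morphable) shape model.

    s0 : (3, Nv) mean shape, one column per vertex.
    Bs : (3, Nv, K) stack of K deformation bases, each the same size as s0.
    c  : (K,) deformation coefficients.

    Returns the deformed 3D shape s = s0 + sum_k c[k] * Bs[..., k].
    """
    return s0 + np.tensordot(Bs, c, axes=([2], [0]))

# Toy usage: a 4-vertex shape with K = 2 deformation modes.
s0 = np.random.rand(3, 4)
Bs = np.random.rand(3, 4, 2)
c = np.array([0.5, -0.2])
print(morphable_shape(s0, Bs, c).shape)  # (3, 4)
```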
  • 26. Algorithms Naming Conventions
  LK   Lucas-Kanade algorithm [Lucas and Kanade, 1981].¹
  HB   Hager-Belhumeur factorization algorithm [Hager and Belhumeur, 1998].
  IC   Inverse Compositional algorithm [Baker and Matthews, 2004].
  FC   Forward Compositional algorithm [Baker and Matthews, 2004].
  GIC   Generalized Inverse Compositional algorithm [Brooks and Arbel, 2010].
  EFC   Efficient Forward Compositional algorithm.
  LKH8   Lucas-Kanade algorithm for homographies.
  LKH6   Lucas-Kanade algorithm for plane-induced homographies.
  LK3DTM   Lucas-Kanade algorithm for 3D Textured Models (rigid).
  LK3DMM   Lucas-Kanade algorithm for 3D Morphable Models (deformable).
  HB3DTR   Full-factorized HB algorithm for 6-dof motion in R3 [Sepp, 2006].
  HB3DTM   Full-factorized HB algorithm for 3D Textured Models (rigid).
  HB3DMM   Full-factorized HB algorithm for 3D Morphable Models (deformable).
  HB3DMMSF   Semi-factorized HB algorithm for 3D Morphable Models.
  HB3DMMNF   HB algorithm for 3D Morphable Models without the factorization stage.
  ICH8   IC algorithm for homographies.
  ICH6   IC algorithm for plane-induced homographies.
  GICH8   GIC algorithm for homographies.
  GICH6   GIC algorithm for plane-induced homographies.
  IC3DRT   IC algorithm for 6-dof motion in R3 [Muñoz et al., 2005].
  FCH6PP   FC algorithm for plane+parallax homographies.

  ¹ We only show the most relevant citation for each algorithm.
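Several of the algorithms listed above (LKH8, ICH8, GICH8) operate on the 8-dof homography warp fH82D defined in the Function Naming Conventions. The following sketch shows one way such a warp can be evaluated on pixel coordinates; the parameterization (the eight entries of H, with the last entry fixed to 1) is an assumption for illustration and need not match the one used in the thesis.

```python
import numpy as np

def f_h8(x, mu):
    """8-dof homography warp applied to Cartesian pixel coordinates.

    x  : (2, N) pixel coordinates, one point per column.
    mu : (8,) homography parameters; H is filled row-wise with H[2, 2] = 1.
         (Illustrative parameterization only.)
    """
    H = np.append(mu, 1.0).reshape(3, 3)
    x_h = np.vstack([x, np.ones((1, x.shape[1]))])   # lift to homogeneous coordinates
    y_h = H @ x_h
    return y_h[:2] / y_h[2]                           # back to Cartesian (the p mapping)

# Identity parameters leave the points unchanged.
mu_id = np.array([1, 0, 0, 0, 1, 0, 0, 0], dtype=float)
pts = np.array([[10.0, 20.0], [30.0, 40.0]]).T        # two pixels as columns
print(np.allclose(f_h8(pts, mu_id), pts))             # True
```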
  • 27. Chapter 1
Introduction

This thesis deals with the problems of registration and tracking in sequences of images. Both problems are classical topics in Computer Vision and Image Processing that have been widely studied in the past. We summarize the subjects of this thesis in the dissertation title:

Efficient Model-based 3D Tracking by using Direct Image Registration

What is 3D Tracking? Let the target be a part of the scene, e.g. the cube in Figure 1.1. We define tracking as the process of repeatedly computing the target state in a sequence of images. When we describe this state as the relative 3D orientation and location of the target with respect to the coordinate system of the camera (or another arbitrary reference system), we refer to this process as 3D rigid tracking (see Figure 1.1). If we also include state parameters that describe the possible deformation of the object, we have 3D nonrigid or deformable tracking (see Figure 1.2). We use 3D tracking to refer to both the rigid and the nonrigid case.

What is Direct Image Registration? When the target is imaged by two cameras from different points of view, the resulting images are different although they represent the same portion of the scene (see Figure 1.3). Image Registration or Image Alignment computes the geometric transformation that best aligns the coordinate systems of both images such that their pixel-wise differences are minimal (cf. Figure 1.3). We say that image registration is a direct method when we align the coordinate systems by using only the brightness differences of the images.

What is Model-based? We say that a technique is model-based when we restrict the information from the real world by using certain assumptions: on the target dynamics, on the target structure, on the camera sensing process, etc. For example, in Figure 1.1 we model the target with a cube structure and rigid body dynamics.
  • 28. Figure 1.1: Example of 3D rigid tracking. (Left) Selected frames of a scene containing a textured cube. We track the object and overlay its state in blue. (Right) The relative position of the camera (represented by a coloured pyramid) and the cube is computed from the estimated 3D parameters.

Figure 1.2: 3D Nonrigid Tracking. Selected frames from a sequence of a cushion under a bending motion. We track some landmarks on the cushion through the sequence, and we plot the resulting triangular mesh for the selected frames. The motion of the landmarks is both global (translation of the mesh) and local (changes in the relative position of the mesh vertices due to the deformation). Source: Alessio del Bue.

And Finally, What does Efficient mean? We say that a method is efficient if it substantially improves the computation time with respect to gold-standard techniques. In more practical terms, efficient is equivalent to real-time, i.e. the tracking procedure operates at 25 frames per second.
  • 29. Figure 1.3: Image registration. (Top row) Images of a portion of the scene from two distinct points of view. We have outlined the target in blue (top-left) and green (top-right). (Bottom) The left image is warped such that the coordinates of the target match up in both images. Source: Graffiti sequence, from the Oxford Visual Geometry Group.

1.1 Motivation

In less than thirty years, video tracking has gone from being confined to academic and military environments to enjoying widespread recognition, mainly thanks to the media.
  • 30. Thus, video tracking is now a staple in sci-fi shows and films, where futuristic head-up displays (hud) work in a show-and-tell fashion, a camera surveillance system can locate an object or a person, or a robot can address people and even recognize their mood. However, TV is, sadly, years ahead of reality. Actual video tracking systems are still at a primitive stage: they are inaccurate, sloppy, slow, and usually work only under laboratory conditions. Nevertheless, video tracking progresses by leaps and bounds, and it will probably match some sci-fi standards soon.

We investigate the problem of efficiently tracking an object in a video sequence. Nowadays there exist several efficient optimization algorithms for video tracking or image registration. We study two of the fastest algorithms available: the Hager-Belhumeur factorization algorithm and the Baker-Matthews inverse compositional algorithm. Both algorithms, although very efficient for planar registration, present diverse problems for 3D tracking. This thesis studies which assumptions can be made with these algorithms while underlining their limitations through extensive testing. Eventually, the objective is to provide a detailed description of each algorithm, pointing out pros and cons, leading to a kind of Quick Guide to Efficient Tracking Algorithms.

1.2 Applications

Typical applications of 3D tracking include target localization for military operations; security and surveillance tasks such as person counting, face identification, people detection, determining people's activity, or detecting abandoned objects; they also include human-computer interaction for computer security, aids for disabled people, or even controlling video-games. Tracking is also used for augmenting video sequences with additional information such as advertisements, expanding information about the scene, or adding or removing objects of the scene. We show some examples of actual industrial applications in Figure 1.4.

A tracking process that is widely used in the film industry is Motion Capture: we track the motion of the different parts of an actor's body using a suit equipped with reflective markers; then, we transfer the estimated motion to a computer-generated character (see Figure 1.5). Using this technique, we can animate a synthetic 3D character in a movie, such as Gollum in the Lord of the Rings trilogy (2001) or Jar-Jar Binks in the new Star Wars trilogy (1999). Other relevant movies that employ these techniques are Polar Express (2004), King Kong (2005), Beowulf (2007), A Christmas Carol (2009), and Avatar (2009). Furthermore, we can generate a complete computer-generated movie populated with characters animated through motion capture. Facial motion capture is of special interest to us: we animate a computer-generated facial expression by facial expression tracking (see Figure 1.5). We turn our attention to markerless facial motion capture, that is, the process of recovering the face expression and orientation without using fiducial markers. Markerless motion capture does not require special equipment, such as close-up cameras, or a complicated set-up on the actor's face, such as special reflective make-up or facial stickers.
  • 31. Figure 1.4: Industrial applications of 3D tracking. (Top-left) Augmented reality inserts virtual objects into the scene. (Top-middle) Augmented reality shows additional information about tracked objects in the scene. Source: Hawk-Eye, Hawk-Eye Innovations Ltd., copyright © 2008. (Top-right) Tracking a pedestrian for video surveillance. Source: Martin Communications, copyright © 1998-2007. (Bottom-left) People flow counting by tracking. Source: EasyCount, by Keeneo, copyright © 2010. (Bottom-middle) Car tracking detects possible traffic infractions or estimates car speed. Source: Fibridge, copyright ©. (Bottom-right) Body tracking is used for interactive control of video-games. Source: Kinect, Microsoft, copyright © 2010.

In this thesis we propose a technique that captures facial expression motion by using only brightness information and prior knowledge of the deformation of the target (see Figure 1.6).

1.3 Contributions of the Thesis

We outline the remaining chapters of the thesis and their principal contributions as follows:

Chapter 2: Literature Review. We provide a detailed survey of the literature on techniques for both image registration and tracking.

Chapter 3: Efficient Image Registration. We review the state of the art on efficient methods. We introduce a taxonomy for efficient registration algorithms: an algorithm is classified as either additive or compositional.
  • 32. Figure 1.5: Motion capture in the film industry. Facial and body motion capture from Avatar™ (top row) and Polar Express™ (bottom row). (Left column) The body motion and head pose are computed using reflective fiducial markers, the grey spheres of the motion capture jumpsuit. For facial expression capture, many smaller markers and even close-up cameras are used. (Right column) The estimated motion is used to animate characters in the movie. Source: Avatar, 20th Century Fox, copyright © 2009; Polar Express, Warner Bros. Pictures, copyright © 2004.

Chapter 4: Equivalence of Gradients. We introduce the gradient equivalence equation constraint: we show that fulfilling this assumption has positive effects on the performance of the algorithms.

Chapter 5: Additive Algorithms. We review which constraints determine the convergence of additive registration algorithms, especially the factorization approach. We provide a methodical procedure to factorize an algorithm in general form; we state a basic set of theorems and lemmas that enable us to systematize the factorization. We introduce two tracking algorithms using factorization: one for rigid 3D objects, and another for deformable 3D objects.
  • 33. Figure 1.6: Markerless facial motion capture. (Top) Several frames in which the face modifies both its orientation (due to a rotation) and its shape structure (due to changes in facial expression). (Bottom) The tracking state vector includes both pose and deformation. Legend: Blue, actual projection of the target shape using the estimated parameters; Pink, highlighted projections corresponding to the profiles of the jaw, eyebrows, lips and nasolabial wrinkles.

Chapter 6: Compositional Algorithms. We review the basic inverse compositional algorithm. We introduce an alternative efficient compositional algorithm that is equivalent to the inverse compositional algorithm under certain assumptions. We show that if the gradient equivalence equation holds, then both efficient compositional methods converge.

Chapter 7: Computational Complexity. We study the resources used by the registration algorithms in terms of their computational complexity. We compare the theoretical complexities of efficient and nonefficient algorithms.

Chapter 8: Experiments. We devise a set of experimental tests that confirm our assumptions on the registration algorithms, that is, (1) the dependence of convergence on the algorithm constraints, and (2) the evaluation of the theoretical complexities with actual data.

Chapter 9: Conclusions and Future Work. Finally, we draw conclusions about where each technique is most suitable, and we provide insight into future work to improve the proposed methods.
  • 35. Chapter 2
Literature Review

In this chapter we review the basic literature on tracking and image registration. First we introduce the basic similarities and differences between image registration and tracking. Then we review the usual methods for both tracking and image registration.

2.1 Image Registration vs. Tracking

The frontier between image registration and tracking is somewhat fuzzy: tracking identifies the location of an object in a sequence of images, whereas registration finds the pixel-to-pixel correspondence between a pair of images. Note that in both cases we compute a geometric and photometric transformation between images: pairwise in the context of image registration, and among multiple images in the tracking case. Although we may use the terms registration and tracking interchangeably, we define the following subtle semantic differences between them:

• Image registration finds the best alignment between two images of the same scene. We use a geometric transformation to align the images of both cameras. We consider that image registration emphasizes finding the best alignment between two images in visual terms, not accurately recovering the parameters of the transformation; this is usually the case in, e.g., medical applications.

• Tracking finds the location of a target object in each frame of a sequence. We assume that the difference in object position between two consecutive frames is small. In tracking we are typically interested in recovering the parameters describing the state of the object rather than the coordinates of its location: we can describe an object using richer information than just its position (e.g. 3D orientation, modes of deformation, lighting changes, etc.). This is usually the case in robotics [Benhimane and Malis, 2007; Cobzas et al., 2009; Nick Molton, 2004], or augmented reality [Pilet et al., 2008; Simon et al., 2000; Zhu et al., 2006].
  • 36. Also, image registration involves two images with an arbitrary baseline, whereas tracking usually operates on a sequence with a small inter-frame baseline. We assume that tracking is a higher-level problem than image registration. Furthermore, we propose a tracking-by-registration approach: we track an object through a sequence by iteratively registering pairs of consecutive images [Baker and Matthews, 2004]; however, we can also perform tracking without any registration at all (e.g. tracking-by-detection [Viola and Jones, 2004], or tracking-by-classification [Vacchetti et al., 2004]).

2.2 Image Registration

Image registration is a classic topic in computer vision, and numerous approaches have been proposed in the literature; two good surveys on the subject are [Brown, 1992] and [Zitova, 2003]. The process involves computing the pixel-to-pixel correspondence between the two images: that is, for each pixel in one image we find the corresponding pixel in the other image so that both pixels project from the same actual point in the scene (cf. Figure 1.1). Applications include image mosaicing [Capel, 2004; Irani and Anandan, 1999; Shum and Szeliski, 2000], video stitching [Caspi and Irani, 2002], super-resolution [Capel, 2004; Irani and Peleg, 1991], region tracking [Baker and Matthews, 2004; Hager and Belhumeur, 1998; Lucas and Kanade, 1981], recovering scene/camera motion [Bartoli et al., 2003; Irani et al., 2002], or medical image analysis [Lester and Arridge, 1999].

Image registration methods commonly fall into one of the two following groups [Bartoli, 2008; Capel, 2004; Irani and Anandan, 1999]:

Direct methods. A direct image registration method aligns two images by using only the colour (or intensity, for greyscale data) values of the pixels that are common to both images (namely, the region of support). Direct methods minimize an error measure based on the image brightness of the region of support. Typical error measures include the L2-norm of the brightness difference [Irani and Anandan, 1999; Lucas and Kanade, 1981], normalized cross-correlation [Brooks and Arbel, 2010; Lewis, 1995], or mutual information [Dowson and Bowden, 2008; Viola and Wells, 1997].

Feature-based methods. In feature-based methods, we align two images by computing the geometric transformation between a set of salient features that we detect in each image. The idea is to abstract distinct geometric image features that are more reliable than the raw intensity values; typically these features show invariance with respect to modifications of the camera point of view, illumination conditions, scale, or orientation of the scene [Schmid et al., 2000]. Corners or interest points [Bay et al., 2008; Harris and Stephens, 1988; Lowe, 2004; Torr and Zisserman, 1999] are classical features in the literature, although we can use other features such as edges [Bartoli et al., 2003], or extremal image regions [Matas et al., 2002].
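As an illustration of the error measures mentioned for direct methods, the sketch below computes the L2 (sum-of-squared-differences) and normalized cross-correlation scores between a template and a warped image patch over the region of support. The function names are illustrative and not taken from any cited implementation.

```python
import numpy as np

def ssd(template, warped):
    """Sum of squared brightness differences (the L2-norm error used by direct methods)."""
    r = template.astype(float).ravel() - warped.astype(float).ravel()
    return float(r @ r)

def ncc(template, warped):
    """Normalized cross-correlation between the template and the warped image patch."""
    a = template.astype(float).ravel()
    b = warped.astype(float).ravel()
    a -= a.mean()
    b -= b.mean()
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom > 0 else 0.0

# Toy usage: identical patches give SSD = 0 and NCC = 1.
patch = np.random.rand(20, 20)
print(ssd(patch, patch), ncc(patch, patch))
```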
  • 37. Direct or feature-based methods? Choosing between direct and feature-based methods is not an easy task: we have to know the strong points of each method and for which applications it is more suitable. A good comparison between the two types of methods is [Capel, 2004]. Feature-based methods typically show strong invariance to a wide range of photometric and geometric transformations of the image, and they are more robust to partial occlusions of the scene than their direct counterparts [Capel, 2004; Torr and Zisserman, 1999]. On the other hand, direct methods can align images with sub-pixel accuracy, estimate the dominant motion even when multiple motions are present, and provide a dense motion field in the case of 3D estimation [Irani and Anandan, 1999]. Moreover, direct methods do not require high-frequency textured surfaces (corners) to operate, but perform optimally with smooth graylevel transitions [Benhimane et al., 2007].

2.3 Model-based 3D Tracking

In this section we define what model-based tracking is, and we review the previous literature on 3D tracking of rigid and nonrigid objects. A special case of interest for nonrigid objects is the 3D tracking of human faces, or facial motion capture. We can recover the 3D orientation and position of the target with respect to the camera (or an arbitrary reference system), or the relative displacement and orientation of the camera with respect to the target (or another arbitrary reference system in the scene) [Sepp, 2008]. A good survey on the subject is [Lepetit and Fua, 2005].

2.3.1 Modelling assumptions

In model-based techniques we use a priori knowledge about the scene, the target, or the sensing device as a basis for the tracking procedure. We classify these assumptions on the real-world information as follows:

Target model. The target model specifies how to represent the information about the structure of the scene in our algorithms. Template tracking or template matching simply represents the target as the pixel intensity values inside a region defined on one image: we call this region (or the image itself) the reference image or template. One of the first techniques proposed for template matching was [Lucas and Kanade, 1981], although it was initially devised for solving optical flow problems. The literature proposes numerous extensions to this technique [Baker and Matthews, 2004; Benhimane and Malis, 2007; Brooks and Arbel, 2010; Hager and Belhumeur, 1998; Jurie and Dhome, 2002a]. We may also allow the target to deform its shape: this deformation induces changes in the target's projected appearance. We model these changes in target texture by using generative models such as eigenimages [Black and Jepson, 1998;
  • 38. Buenaposada et al., 2009], Active Appearance Models (aam) [Cootes et al., 2001], active blobs [Sclaroff and Isidoro, 2003], or subspace representations [Ross et al., 2004]. Instead of modelling brightness variations, we may represent target shape deformation by using a linear model of the locations of a set of feature points [Blanz and Vetter, 2003; Bregler et al., 2000; Del Bue et al., 2004], or Finite Element Meshes [Pilet et al., 2005; Zhu et al., 2006]. Alternative approaches model the nonrigid motion of the target by using anthropometric data [Decarlo and Metaxas, 2000], or by using a probability distribution of the intensity values of the target region [Comaniciu et al., 2000; Zimmermann et al., 2009].

These techniques are suitable for tracking planar objects of the scene. If we add further knowledge about the scene, we can track more complex objects: with a proper model we are able to recover 3D information. Typically, we use a wireframe 3D model of the target, and tracking consists of finding the best alignment between the sensed image and the 3D model [Cipolla and Drummond, 1999; Kollnig and Nagel, 1997; Marchand et al., 1999]. We can augment this model by adding further texture priors either from the image stream [Cobzas et al., 2009; Muñoz et al., 2005; Sepp and Hirzinger, 2003; Vacchetti et al., 2004; Xiao et al., 2004a; Zimmermann et al., 2006], or from an external source (e.g. a 3D scanner or a texture mosaic) [Hong and Chung, 2007; La Cascia et al., 2000; Masson et al., 2004, 2005; Pressigout and Marchand, 2007; Romdhani and Vetter, 2003].

Motion model. The motion model describes the target kinematics (i.e. how the object modifies its position in the image/scene). The motion model is tightly coupled to the target model: it is usually represented by a geometric transformation that maps the coordinates of the target model into a different set of coordinates. For a planar target, these geometric transformations are typically affine [Hager and Belhumeur, 1998], homographic [Baker and Matthews, 2004; Buenaposada and Baumela, 1999], or spline-based warps [Bartoli and Zisserman, 2004; Brunet et al., 2009; Lester and Arridge, 1999; Masson et al., 2005]. For actual 3D targets, the geometric warps account for computing the rotation and translation of the object using a 6 degree-of-freedom (dof) rigid body transformation [Cipolla and Drummond, 1999; La Cascia et al., 2000; Marchand et al., 1999; Sepp and Hirzinger, 2003].

Camera model. The camera model specifies how the images are sensed by the camera. The pinhole camera models the imaging device as a projector of the coordinates of the scene [Hartley and Zisserman, 2004]. For tracking zoomed objects located far away, we may use orthographic projection [Brand and R.Bhotika, 2001; Del Bue et al., 2004; Tomasi and Kanade, 1992; Torresani et al., 2002]. The perspective projection accounts for perspective distortion, and it is more suitable for close-up views [Muñoz et al., 2005, 2009]. The camera model may also account for model deviations such as lens distortion [Claus and Fitzgibbon, 2005; Tsai, 1987].
  • 39. Other model assumptions. We can also model prior photometric knowledge about the target/scene, such as illumination cues [La Cascia et al., 2000; Lagger et al., 2008; Romdhani and Vetter, 2003], or global colour [Bartoli, 2008].

2.3.2 Rigid Objects

We can follow two strategies to recover the 3D parameters of a rigid object:

2D Tracking. The first group of methods involves a two-step process: first, we compute the 2D motion of the object as a displacement of the target projection on the image; second, we recover the actual 3D parameters from the computed 2D displacements by using the scene geometry. A natural choice is to use optical flow: [Irani et al., 1997] computes the dominant 2D parametric motion between two frames to register the images; the residual displacement (the image regions that cannot be registered) is used to recover the 3D motion. When the object is a 3D plane, we can use a homographic transformation to compute plane-to-plane correspondences between two images; then we recover the actual 3D motion of the plane using the camera geometry [Buenaposada and Baumela, 2002; Lourakis and Argyros, 2006; Simon et al., 2000]. We can also compute the inter-frame displacements by using linear regressors or predictors, and then robustly fit the projections to a target model, using RANSAC, to compute the 3D parameters [Zimmermann et al., 2009]. An alternative method is to compute pixel-to-pixel correspondences by using a classifier [Lepetit and Fua, 2006], and then recover the target 3D pose using POSIT [Dementhon and Davis, 1995], or equivalent methods [Lepetit et al., 2009].

3D Tracking. These methods directly compute the actual 3D motion of the object from the image stream. They mainly use a 3D model of the target to compute the motion parameters; the 3D model contains a priori knowledge of the target that improves the estimation of the motion parameters (e.g. to get rid of projective ambiguities). The simplest way to represent a 3D target is using a texture model, a set of image patches sensed from one or several reference images, as in [Cobzas et al., 2009; Devernay et al., 2006; Jurie and Dhome, 2002b; Masson et al., 2004; Sepp and Hirzinger, 2003; Xu and Roy-Chowdhury, 2008]. The main drawback of these methods is the lack of robustness against changes in scene illumination and specular reflections. We can alternatively fit the projection of a 3D wireframe model (e.g. a cad model) to the edges of the image [Drummond and Cipolla, 2002]. However, these methods also have problems with cluttered backgrounds [Lepetit and Fua, 2005]. To gain robustness, we can use hybrid models of texture and contours such as [Marchand et al., 1999; Masson et al., 2003; Vacchetti et al., 2004], or simply use an additional model to deal with illumination [Romdhani and Vetter, 2003].
  • 40. 2.3.3 Nonrigid Objects

Tracking methods for nonrigid objects fall into the same categories that we used for rigid ones. Point-to-point correspondences of the deformable target can recover the pose and/or deformation parameters using subspace methods [Del Bue, 2010; Torresani et al., 2008], or by fitting a deformable triangle mesh [Pilet et al., 2008; Salzmann et al., 2007]. We can alternatively fit the 2D silhouette of the target to a 3D skeletal deformable model of the object [Bowden et al., 2000]. Direct estimation of the 3D parameters unifies the processes of matching pixel correspondences and estimating the pose and deformation of the target. [Brand, 2001; Brand and R.Bhotika, 2001] constrain the optical flow by using a linear generative model to represent the deformation of the object. [Gay-Bellile et al., 2010] models the object's 3D deformations, including self-occlusions, by using a set of Radial Basis Functions (rbf).

2.3.4 Facial Motion Capture

Estimation of facial motion parameters is a challenging task; head 3D orientation was typically estimated by using fiducial markers to overcome the inherent difficulty of the problem [Bickel et al., 2007]. However, markerless methods have also been developed in recent years. Facial motion capture involves recovering head 3D orientation and/or face deformation due to changes in expression. We first review techniques for recovering head 3D pose; then we review techniques for recovering both pose and expression.

Head pose estimation. There are numerous techniques to compute head pose or 3D orientation. In the following we review a number of them; a recent detailed survey on the subject is [Murphy-Chutorian and Trivedi, 2009]. The main difficulty of estimating head pose lies in the nonconvex structure of the human head. Classic 2D approaches such as [Black and Yacoob, 1997; Hager and Belhumeur, 1998] are only suitable for tracking motions of the head parallel to the image plane: the reason is that these methods only use information from a single reference image. To fully recover the 3D rotation parameters of the head we need additional information. [La Cascia et al., 2000] uses a texture map computed by cylindrical projection of images of the head from different points of view; [Baker et al., 2004a; Jang and Kanade, 2008] also use an analogous cylindrical model. In a similar fashion, we can use a 3D ellipsoid shape [An and Chung, 2008; Basu et al., 1996; Choi and Kim, 2008; Malciu and Prêteux, 2000]. Instead of using a cylinder or an ellipsoid, we can have a detailed model of the head, such as a 3D Morphable Model (3dmm) [Blanz and Vetter, 2003; Muñoz et al., 2009; Xu and Roy-Chowdhury, 2008], an aam coupled together with a 3dmm [Faggian et al., 2006], or a triangular mesh model of the face [Vacchetti et al., 2004]. The latter is robustly tracked in [Strom et al., 1999] using an Extended Kalman Filter. We can also use a head model with reduced complexity, as in [B. Tordoff et al., 2002].
  • 41. Face expression estimation. A change of facial expression induces a deformation in the 3D structure of the face. The estimation of this deformation can be used for facial expression recognition, expression detection, or facial motion transfer. Classic 2D approaches such as aams [Cootes et al., 2001; Matthews and Baker, 2004] are only suitable for recovering expressions from a frontal face. 3D aams are the three-dimensional extension of these 2D methods: they adjust a statistical model of 3D shape and texture (typically a PCA model) to the pixel intensities of the image [Chen and Wang, 2008; Dornaika and Ahlberg, 2006]. Hybrid methods that combine 2D and 3D aams show both real-time performance and actual 3D head pose estimation: we can use the 3D aams to simultaneously constrain the 2D aams motion and compute the 3D pose [Xiao et al., 2004b], or directly compute the facial motion from the 2D aams parameters [Zhu et al., 2006]. In contrast to pure 2D aams, 3D aams can recover actual 3D pose and expression from faces that are not frontal to the camera. However, the out-of-plane rotations that these methods can recover are typically smaller than when using a pure 3D model (e.g. a 3dmm). [Blanz and Vetter, 2003; Romdhani and Vetter, 2003] search for the best configuration of a 3dmm such that the differences between the rendered model and the image are minimal; both methods also show great performance recovering strong facial deformations. Real-time alternatives using 3dmms include [Hiwada et al., 2003; Muñoz et al., 2009]. [Pighin et al., 1999] uses a linear combination of 3D face models fitted to match the images to estimate realistic facial expressions. Finally, [Decarlo and Metaxas, 2000] derives an anthropometric, physically-based face model that may be adjusted to each individual face target; in addition, they solve a dynamic system for the face pose and expression parameters by using optical flow constrained by the edges of the face.
  • 43. Chapter 3 Efficient Direct Image Registration 3.1 Introduction This chapter reviews the problem of efficiently registering two images. We define Direct Image Alignment (dia) problem as the process that computes the trans- formation between two frames using only image brightness information. We orga- nize the chapter as follows: Section 3.2 introduces basic registration notions; Sec- tion 3.3 reviews additive registration algorithms such as Lucas-Kanade or Hager- Belhumeur; Section 3.4 reviews compositional registration algorithms such as Baker and Matthews’ Forward Compositional and Inverse Compositional; finally, other methods are reviewed in Section 3.5. 3.2 Modelling Assumptions This section reviews those assumptions on the real world that we use to mathemat- ically model the registration procedure. We introduce the notation on the imaging process through a pinhole camera. We ascertain the Brightness Constancy Assump- tion or Brightness Constancy Constraint (bcc) as the cornerstone of the direct image registration techniques. We also pose the registration problem as an itera- tive optimization problem. Finally, we provide a classification of the existing direct registration algorithms. 3.2.1 Imaging Geometry We represent points of the scene using Cartesian coordinates in R3 (e.g. X = (X, Y, Z)⊤ ). We represent points on the image with homogeneous coordinates, so that the pixel position x = (i, j)⊤ is represented using the notation for augmented points as ˜x = (i, j, 1)⊤ . The homogeneous point ˜x = (x1, x2, x3)⊤ is conversely represented in Cartesian coordinates using the mapping p : P2 → R2 , such that p(˜x) = x = (x1/x3, x2/x3). The scene is imaged through a perfect pin-hole cam- era [Hartley and Zisserman, 2004]; by abuse of notation, we define the perspective 21
  • 44. Figure 3.1: Imaging geometry. An object of the scene is imaged through camera centres C1 and C2 onto two distinct images I1 and I2 (related by a rotation R and a translation t). The point X is projected to the points x1 = p(K I|0 ˜X) and x2 = p(K R − Rt ˜X) in the two images. projection p : R3 → R2 that maps scene coordinates onto image points, x = p(Xc) = k⊤ 1 Xc k⊤ 3 Zc , k⊤ 2 Yc k⊤ 3 Zc ⊤ , where K = (k⊤ 1 , k⊤ 2 , k⊤ 3 )⊤ is the 3 × 3 matrix that contains the camera intrinsics (cf. [Hartley and Zisserman, 2004]), and Xc = (Xc, Yc, Zc)⊤ . We implicitly assume that Xc represents a point in the camera reference system. If the points to project are expressed in an arbitrary reference system of the scene we need an additional mapping; hence, the perspective projection for a point X in the scene is ˜x = K R − Rt X 1 , where R and t are the rotation and translation between the scene and the camera coordinate system (see Figure 3.1). Our input is a smooth sequence of images—i. e. inter-frame differences are small—where It is the t-th frame of the sequence. We de- note T as the reference image or template. Images are discrete matrices of brightness values, although we represent them as functions from R2 to RC , where C is the num- ber of image channels (i.e. C = 3 for colour, and C = 1 for gray-scale images): It(x) is the brightness value at pixel x. For non-discrete pixel coordinates, we use bilinear in- terpolation. If X is a set of pixels, we collect the brightness values of I(x), ∀x ∈ X in a single column vector as I(X)—i.e., I(X) = (I(x1), . . . , I(xN))⊤ , {x1, . . . , xN} ∈ X. 22
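To make the projection model above concrete, the following Python/NumPy sketch implements the homogeneous-to-Cartesian mapping p and the projection of a scene point given the intrinsics K and the scene-to-camera motion (R, t). The function names are illustrative only; they are not part of any library used in this thesis.

```python
import numpy as np

def hom_to_cart(x_tilde):
    """Map homogeneous image coordinates (x1, x2, x3) to Cartesian (x1/x3, x2/x3)."""
    return x_tilde[:2] / x_tilde[2]

def project(X, K, R, t):
    """Project a scene point X onto the image: x~ = K [R | -R t] (X, 1)^T."""
    Xc = R @ (X - t)          # point expressed in the camera reference system
    return hom_to_cart(K @ Xc)

# Toy example: a point one metre in front of a camera with focal length 500.
K = np.array([[500.0,   0.0, 320.0],
              [  0.0, 500.0, 240.0],
              [  0.0,   0.0,   1.0]])
x = project(np.array([0.1, 0.0, 1.0]), K, np.eye(3), np.zeros(3))   # -> [370., 240.]
```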
  • 45. 3.2.2 Brightness Constancy Constraint The bcc relates brightness information between two frames of a sequence [Hager and Belhumeur, 1998; Irani and Anandan, 1999]. The reference image T is one arbitrary image of the sequence. We define the target region X as a set of pixel coordinates X = {x1, . . . , xN} defined on T (see Figure 3.2). We define the template as the image values of the target region, that is, T (X). Let us assume we know the transformation of the target region between T and another arbitrary image of the sequence, It. The motion model f defines this transformation as Xt = f(X; µt), where the set of coordinates Xt is the target region on It and µt are the motion parameters. The bcc states that the brightness values of the template T and the input image It warped by f with parameters µt should be equal, T (X) = It(f(X; µt)). (3.1) The direct conclusion from Equation 3.1 is that the brightness of the target does not depend on its motion—i.e., the relative position and orientation of the camera with respect the target does not affect the brightness of the latter. However, we may aug- ment the bcc to include appearance changes [Black and Jepson, 1998; Buenaposada et al., 2009; Matthews and Baker, 2004], and changes in illumination conditions due to ambient [Bartoli, 2008; Basri and Jacobs, 2003] or specular lighting [Blanz and Vetter, 2003]. 3.2.3 Image Registration by Optimization Direct image registration is usually posed as an optimization problem. We minimize an error function based on the brightness pixel-wise difference that is parameterized by motion variables: µ∗ = arg min µ {D(X; µ)2 }, (3.2) where D(X; µ) = T (X) − It(f(X; µ)) (3.3) is a dissimilarity measure based on the bcc (Equation 3.1). Descent Methods Recovering these parameters is typically a non-linear problem as it depends on image brightness—which is usually non-linearly related to the motion parameters. The usual approach is iterative gradient-based descent (GD): from a starting point µ0 in the search space, the method iteratively computes a series of partial solu- tions µ1, µ2, . . . µk that, under certain conditions, converge to the local minimizer µ∗ [Madsen et al., 2004] (see Figure 3.2). We typically use Gauss-Newton (GN) methods for efficient registration because they provide good convergence without computing second derivatives (see Appendix A). Hence, the basic GN-based algo- rithm for image registration operates as we outline in Algorithm 1 and depict in Figure 3.3. We describe the four stages of the algorithm in the following: 23
  • 46. Figure 3.2: Iterative gradient descent image registration. Top-left Template image for the registration. We highlight the target region as a green quadrangle. Top- right Image that we register against the template. We generate the image by rotating the image around its centre and translating it in the X-axis. We highlight the corresponding target region in yellow. We also display the initial guess for the optimization as a green quadrangle. Notice that it exactly corresponds to the position of the target region at the template. Bottom-left Contour plot of the image brightness dissimilarity. The axis show the values of the search space: image rotation and translation. We show the successive iterations in the search space: we reach the solution in four steps—µ0 to µ4. Bottom- right We show the target region that corresponds to the parameters of each iteration. The colour of each quadrangle matches the colour of the parameters that generated it as seen in the Bottom-left figure. 24
  • 47. Dissimilarity measure The dissimilarity measure is a function on the image bright- ness error between two images. The usual measure for image registration is the Sum of Squared Differences (ssd), that is, the L2 -norm of the difference of pixel brightness (Equation 3.3) [Brooks and Arbel, 2010; Hager and Bel- humeur, 1998; Irani and Anandan, 1999; Lucas and Kanade, 1981]. However, we can use other measures such as normalized cross-correlation [Brooks and Arbel, 2010; Lewis, 1995], or mutual information [Brooks and Arbel, 2010; Dowson and Bowden, 2008; Viola and Wells, 1997]. Linearize the dissimilarity The next stage linearizes the brightness function about the current search parameters µ; this linearization enables us to transform the problem into a system of linear equations on the search variables. We typically approximate the function using Taylor series expansion; depending on how many terms—derivatives—we compute, we have optimisation methods like Gradient Descent [Amberg and Vetter, 2009], Newton-Raphson [Lucas and Kanade, 1981; Shi and Tomasi, 1994], Gauss-Newton [Baker and Matthews, 2004; Brooks and Arbel, 2010; Hager and Belhumeur, 1998] or even higher- order methods [Benhimane and Malis, 2007; Keller and Averbuch, 2004, 2008; Megret et al., 2008]. This is theoretically a good approximation when the dis- similarity is small [Irani and Anandan, 1999], although the estimation can be improved by using coarse-to-fine iterative methods [Irani and Anandan, 1999], or by selecting appropriate pixels [Benhimane et al., 2007]. Although Taylor series expansion is the usual approach to compute the coefficients of the sys- tem, other approaches such as linear regression [Cootes et al., 2001; Jurie and Dhome, 2002a] or numeric differentiation [Gleicher, 1997] may be used. Compute the descent direction The descent direction is a vector δµ in the search space such that D(µ+δµ) < D(µ). In a GN-based algorithm, we solve the linear system of equations of the previous stage using least-squares [Baker and Matthews, 2004; Madsen et al., 2004]. Note that we do not perform the line search stage—i.e., we implicitly assume that the step size α = 1, cf. Appendix A. Update the search parameters Once we have determined the search direction, δµ, we compute the next point in the series by using the update function U : RP → RP : µ1 = U(µ0, δµ). We compute the dissimilarity value at µ1 to check convergence: if the dissimilarity is below a given threshold, then µ1 is the minimizer µ∗ —i.e., µ∗ = µ1; in other case, we repeat the whole process (i.e. µ1 are the actual current parameters µ) until we find a suitable minimizer. 3.2.4 Additive vs. Compositional We turn our attention to the step 4 of Algorithm 1: how to compute the new es- timation of the optimization parameters. In a GN optimization scheme, the new 25
  • 48. Algorithm 1 Outline of the basic GN-based descent method for image registration On-line: Let µi = µ0 be the initial guess. 1: while no convergence do 2: Compute the dissimilarity function at D(µi). 3: Compute the search direction: linearize the dissimilarity and compute the descent direction, δµi. 4: Update the optimization parameters:µi+1 = U(µi, δµi). 5: end while Figure 3.3: Generic descent method for image registration. We initialize the current parameter estimation at frame It+1 (µ = µ0) using the local minimizer at the previous frame It (µ0 = µ∗ t ). We compute the Dissimilarity Measure between the Im- age and the Template using µ (Equation 3.3). We linearize the dissimilarity measure to compute the descent direction of the search parameters (δµ). We update the search parameters using the search direction and we obtain an approximation to the minimum (µ1). We check if µ1 is a local minimizer by using the brightness dissimilarity: if D is small enough, then µ1 is the local minimizer (µ∗ = µ1); in other case, we repeat the process with using µ1 as the current parameters estimation (µ = µ1). 26
  • 49. parameters are typically computed by adding the former optimization parameters to the search direction vector: µt+1 = µt + δµt (cf. Appendix A); this summation is a direct consequence of the definition of Taylor series [Madsen et al., 2004]. We call additive approaches to those methods that update parameters by using addi- tion [Hager and Belhumeur, 1998; Irani and Anandan, 1999; Lucas and Kanade, 1981]. Nonetheless, Baker and Matthews [Baker and Matthews, 2004] subsequently proposed a GN-based method that updated the parameters using composition— i.e., µt+1 = µt ◦ δµt. We call these methods compositional approaches [Baker and Matthews, 2004; Cobzas et al., 2009; Mu˜noz et al., 2005; Romdhani and Vetter, 2003; Xu and Roy-Chowdhury, 2008]. 3.3 Additive approaches In this section we review some works that use additive update. We introduce the Lucas-Kanade algorithm, the fundamental work on direct image registration. We show the basic algorithm as well as the common problems regarding the method. We also introduce the Hager-Belhumeur approach to image registration and we point out its highlights. 3.3.1 Lucas-Kanade Algorithm The Lucas-Kanade (LK) algorithm [Lucas and Kanade, 1981] solves the registration problem using a GN optimization scheme. The algorithm defines the residuals r of Equation 3.3 as r(µ) ≡ T(x) − I(f(x; µ)). (3.4) The corresponding linear model for these residuals is r(µ + δµ) ≃ ℓ(δµ) ≡ r(µ) + r′ (µ)δµ = r(µ) + J(µ)δµ, (3.5) where r(µ) ≡ T(x) − I(f(x; µ)), and J(µ) ≡ ∂I(f(x; ˆµ) ∂ ˆµ ˆµ=µ . (3.6) Hence, our optimization process amounts to minimise now δµ∗ = arg min δµ {ℓ(δµ)⊤ ℓ(δµ)} = arg min δµ {L(δµ)}. (3.7) We compute the local minimizer of L(δµ) as follows: 0 = L′ (δµ) = ∇δµ r(µ)⊤ r(µ) + 2δµ⊤ J(µ)r(µ) + δµ⊤ J(µ)⊤ J(µ)δµ = J(µ)r(µ) + J(µ)⊤ J(µ)δµ. (3.8) Again, we obtain an approximation to the local minimum at δµ = − J(µ)⊤ J(µ) −1 J(µ)⊤ r(µ), (3.9) which we iteratively refine until we find a suitable solution. We summarize the optimization process in Algorithm 2 and Figure 3.4. 27
  • 50. Algorithm 2 Outline of the Lucas-Kanade algorithm. On-line: Let µi = µ0 be the initial guess. 1: while no convergence do 2: Compute the residual function at r(µi) from Equation 3.4. 3: Linearize the dissimilarity: J = ∇µr(µi) 4: Compute the search direction: δµi = − J(µi)⊤ J(µi) −1 J(µi)⊤ r(µi). 5: Update the optimization parameters:µi+1 = µi + δµi. 6: end while Figure 3.4: Lucas-Kanade image registration. We initialize the current parameter estimation at frame It+1 (µ ≡ µ0) using the local minimizer at the previous frame It (µ0 ≡ µ∗ t ). We compute the dissimilarity residuals between the Image and the Template using µ (Equation 3.4). We linearize the residuals at the current parameters µ, and we compute the descent direction of the search parameters (δµ). We additively update the search parameters using the search direction and we obtain an approximation to the minimum—i.e. µ1 = µ0 + δµ. We check if µ1 is a local minimizer by using the brightness dissimilarity: if D is small enough, then µ1 is the local minimizer (µ∗ ≡ µ1); in other case, we repeat the process with using µ1 as the current parameters estimation (µ ≡ µ1). 28
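A minimal sketch of Algorithm 2 follows. For illustration, the warped image is sampled with bilinear interpolation and the Jacobian is obtained by forward finite differences on the residual vector rather than analytically, so the Gauss-Newton step has the form of Equation 3.9 with J being the Jacobian of the residuals. The helper names and the abstract warp(X, mu) are assumptions of this sketch; it exposes the structure of the loop and is not an efficient implementation.

```python
import numpy as np

def bilinear(I, pts):
    """Sample grayscale image I at non-integer (i, j) positions (one row of pts per point)."""
    i, j = pts[:, 0], pts[:, 1]
    i0, j0 = np.floor(i).astype(int), np.floor(j).astype(int)
    di, dj = i - i0, j - j0
    return ((1 - di) * (1 - dj) * I[i0, j0] + di * (1 - dj) * I[i0 + 1, j0]
            + (1 - di) * dj * I[i0, j0 + 1] + di * dj * I[i0 + 1, j0 + 1])

def residuals(template_vals, I, warp, X, mu):
    """r(mu) = T(X) - I(f(X; mu)), Equation 3.4 (warped points assumed inside I)."""
    return template_vals - bilinear(I, warp(X, mu))

def numeric_jacobian(res_fn, mu, eps=1e-4):
    """Forward finite-difference Jacobian of the residual vector with respect to mu."""
    r0 = res_fn(mu)
    J = np.zeros((r0.size, mu.size))
    for k in range(mu.size):
        dmu = np.zeros_like(mu)
        dmu[k] = eps
        J[:, k] = (res_fn(mu + dmu) - r0) / eps
    return J

def lucas_kanade(template_vals, I, warp, X, mu0, n_iters=50, tol=1e-6):
    """Iterate mu <- mu + delta with delta = -(J^T J)^-1 J^T r, cf. Equation 3.9."""
    res_fn = lambda mu: residuals(template_vals, I, warp, X, mu)
    mu = mu0.astype(float).copy()
    for _ in range(n_iters):
        r = res_fn(mu)
        J = numeric_jacobian(res_fn, mu)        # Jacobian of the residuals
        delta = -np.linalg.solve(J.T @ J, J.T @ r)
        mu = mu + delta                         # additive parameter update
        if np.linalg.norm(delta) < tol:
            break
    return mu
```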
  • 51. Known Issues The LK algorithm is an instance of a well-known technique for object tracking [Baker and Matthews, 2004]. The most remarkable feature of this algorithm is its robustness: given a suitable bcc, the LK algorithm typically ensures good convergence. However, the algorithm has a series of weaknesses that degrade the overall performance of the tracking: Computational Cost The LK algorithm computes the Jacobian at each iteration of the optimization loop. Furthermore, the minimization cycle is repeated between each two consecutive frames of the video sequence. The consequence is that the Jacobian is computed F × L times, where F is the number of frames and L is the number of iterations in the optimization loop. The computational burden of these operations is very high if the Jacobian is large: we have to compute the derivatives at each point of the target region, and each point contributes a row to the Jacobian. As an example, Table 7.15—page 106—compares the computational complexity of the LK algorithm with respect to other efficient methods. Local Minima The GN optimization scheme, which is the basis for the LK algorithm, is prone to get trapped in local minima. The very essence of the minimization implies that the algorithm converges to the minimum closest to the starting point. Hence, we must choose the initial guess of the optimization very carefully to ensure convergence to the true optimum. The best way to guarantee that the starting point for tracking and the optimum are close enough is to impose that the differences between consecutive images are small. Conversely, image pairs with a large baseline cause problems for LK, as falling into a local minimum is more likely, which leads to incorrect alignment. To solve this problem, common to all direct approaches, a pyramidal implementation of the optimization may be used [Bouguet, 2000]. 3.3.2 Hager-Belhumeur Factorization Algorithm We review now an efficient algorithm for determining the motion parameters of the target. The algorithm is similar to LK, but uses a priori information about the target motion and structure to save computation time. The Hager-Belhumeur (HB) or factorization algorithm was first proposed by G. Hager and P. Belhumeur in [Hager and Belhumeur, 1998]. The authors noticed the high computational cost of linearizing the brightness error function in the LK algorithm: the dissimilarity depends on each new frame of the sequence, It. The method focuses on how to efficiently compute the Jacobian matrix of step 3 of the LK algorithm (see Algorithm 2). The computation of the Jacobian in the HB algorithm has two separate stages: 1. Gradient replacement The key idea is to use the derivatives of the template T instead of computing the derivatives of frame It when estimating J. Hager and Belhumeur dealt with 29
  • 52. this issue in a very neat way: they noticed that, if the bcc (Equation 3.1) related image and template brightness values, it could possibly also relate image and template derivatives—cf. [Hager and Belhumeur, 1998]. The derivatives of both sides of Equation 3.1 with respect to the target region coordinates are ∇xT (x) = ∇xIt(f(x; µt)) = ∇xIt(x) ∇xf(x; µ), x ∈ X. (3.10) On the other hand, we compute the Jacobian as J = ∇µt It(f(x; µt)) = ∇xIt(x) ∇µt f(x; µ). (3.11) We isolate the term ∇xIt(x) in Equations 3.10 and 3.11, and we equate the remaining terms as follows: J = ∇xT (x) ∇xf(x; µ)−1 ∇µt f(x; µ). (3.12) Notice that in Equation 3.12 the Jacobian depends on the template derivatives, ∇xT (x), which are constant. Using template derivatives speeds up the whole process up to 10-fold (cf. Table 7.16—page 106). 2. Factorization Equation 3.12 reveals the internal structure of the Jacobian: it comprises the product of three matrices: a matrix ∇xT (x) that depends on template brightness values, and two matrices, ∇xf(x; µ)−1 and ∇µt f(x; µ), whose values depend on both the target shape coordinates and the motion parameters µt. The factorization stage re-arranges the Jacobian's internal structure so that we speed up the computation of this matrix product. A word about factorization In the literature, matrix factorization or matrix decomposition refers to the process that expresses a matrix as the product of matrices of special types. One major example is to factorize a matrix A into the product of a lower triangular matrix L and an upper triangular matrix U, A = LU. This factorization is called lu decomposition and it allows us to solve the linear system Ax = b more efficiently: solving Ux = L−1 b requires fewer additions and multiplications than the original system [Golub and Van Loan, 1996]. Other famous examples of matrix factorization are spectral decomposition, Cholesky factorization, Singular Value Decomposition (svd) and qr factorization (see [Golub and Van Loan, 1996] for more information). The key concept behind using factorization in this problem is as follows: Given a matrix product whose operands contain both constant and variable terms, we want to re-arrange the product such that one operand contains only constant values and the other one only contains variable terms (see the sketch below). 30
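The following sketch illustrates how that separation pays off in an HB-style loop, anticipating the factorized form J = S(x) M(µ) derived next and Algorithm 3: the shape-dependent factor S is assembled once off-line, while only the small motion-dependent factor M(µ) and the residuals are recomputed at each iteration. The helpers build_S, build_M, warp and sample are placeholders for whatever a particular warp yields after factorization, and the sign convention assumes that S M factors the Jacobian of the residual vector.

```python
import numpy as np

def hager_belhumeur(template_vals, frame, warp, sample, build_S, build_M,
                    X, mu0, n_iters=30, tol=1e-6):
    """HB loop with a factorized Jacobian J(mu) = S(x) M(mu) (cf. Algorithm 3)."""
    S = build_S(X)                           # constant: depends only on shape/template
    mu = mu0.copy()
    for _ in range(n_iters):
        r = template_vals - sample(frame, warp(X, mu))   # residuals (Eq. 3.4)
        M = build_M(mu)                      # small, motion-dependent factor
        J = S @ M                            # Jacobian re-assembled cheaply
        # Sign convention: S @ M is taken here to factor the Jacobian of the residuals.
        delta = -np.linalg.solve(J.T @ J, J.T @ r)
        mu = mu + delta                      # additive update
        if np.linalg.norm(delta) < tol:
            break
    return mu
```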
  • 53. We rewrite this idea in equation as follows: J = ∇xT (x)∇xf(x; µ)−1 ∇µt f(x; µ) = S(x)M(µ), (3.13) where S(x) contains only target coordinate values and M(µ) contains only motion parameters. The process to decompose the matrix J into the product S(x)M(µ) is generally ad hoc: we must gain insight of the analytic structure of the matrices ∇xf(x; µ)−1 and ∇µt f(x; µ) to re-arrange their entries into S(x)M(µ) [Hager and Belhumeur, 1998]. This process is not obvious at all and it has been a frequent source of criticism for the HB algorithm [Baker and Matthews, 2004]. However, we shall introduce procedures for systematic factorization in Chapter 5 We outline the basic HB optimization in Algorithm 3.3; notice that the only difference with the LK algorithm lies on the Jacobian computation. We depict the differences more clearly in Figure 3.5: in the dissimilarity linearization stage we use the derivatives of the template instead of the frame. Algorithm 3 Outline of the Hager-Belhumeur algorithm. Off-line: Let µi = µ0 be the initial guess. 1: Compute S(x) On-line: 2: while no convergence do 3: Compute the residual function at r(µi) from Equation 3.4. 4: Compute the matrix M(µi) 5: Compute the Jacobian: J(µi) = S(x)M(µi) 6: Compute the search direction: δµi = − J(µi)⊤ J(µi) −1 J(µi)⊤ r(µi). 7: Update the optimization parameters:µi+1 = µi + δµi. 8: end while 3.4 Compositional approaches From Section 3.2.4 we recall the definition of compositional method: a GN-like optimization method that updates the search parameters using function composition. We review two compositional algorithms: the Forward Compositional (FC) and the Inverse Compositional (IC), [Baker and Matthews, 2004]. A word about composition Function composition is usually defined as the ap- plication of the results of a function onto another. Let f : X → Y, and g : Y → Z be two function applications. We define the composite func- tion g ◦ f : X → Z as (g ◦ f)(x) = g(f(x)). In the literature on image registration the problem is posed as follows: Let f : R2 → R2 be the tar- get motion model parameterized by µ. We compose the target motion as 31
  • 54. Figure 3.5: Hager-Belhumeur image registration. We initialize the current param- eter estimation at frame It+1 (µ ≡ µ0) using the local minimizer at the previous frame It (µ0 ≡ µ∗ t ). We additionally create the matrix S(x) whose entries depend on the target values. We compute the dissimilarity residuals between the Image and the Template using µ (Equation 3.4). Instead of linearizing the residuals, we compute the Jacobian matrix at µ using Equation 3.12, and we solve for the descent direction using Equation 3.9. We additively update the search parameters using the search direction and we obtain an ap- proximation to the minimum— i.e. µ1 = µ0 + δµ. We check if µ1 is a local minimizer by using the brightness dissimilarity: if D is small enough, then µ1 is the local minimizer (µ∗ ≡ µ1); in other case, we repeat the process with using µ1 as the current parameters estimation (µ ≡ µ1). 32
  • 55. z = f(f(x; µ1); µ2) = f(x; µ1 ◦ µ2) ≡ f(x; µ3), that is, the coordinates z are the result of mapping x onto y = f(x; µ1) and y onto z = f(y; µ2). We represent the composite parameters as µ3 = µ1 ◦ µ2 such that z = f(x; µ3). 3.4.1 Forward Compositional Algorithm The FC algorithm was first proposed in [Shum and Szeliski, 2000], although the terminology was introduced in [Baker and Matthews, 2001]: FC is an optimization algorithm, equivalent to the LK approach, that relies in a compositional update step. Compositional algorithms for image registration uses a dissimilarity brightness function slightly different from Equation 3.3; we pose the image registration problem as the following optimization: µ∗ = arg min µ {D(X; µ)2 }, (3.14) with D(X; µ) = T (X) − It+1(f(f(X; µ); µt)), (3.15) where µt comprises the optimal parameters at the image It. Note that our search variables µ are those parameters that should be composed with the current estima- tion to yield the minimum. The residuals corresponding to Equation 3.15 are r(µ) ≡ T(x) − It+1(f(f(x; µ); µt)), (3.16) As in the LK algorithm, we compute the linear model of the residuals, but now at the point µ = 0 in the search space: r(0 + δµ) ≃ ℓ(δµ) ≡ r(0) + r′ (0)δµ = r(0) + J(0)δµ, (3.17) where r(0) ≡ T(x) − It+1(f(f(x; 0); µt)), and J(0) ≡ ∂It+1(f(f(x; ˆµ); µt) ∂ ˆµ ˆµ=0 . (3.18) Notice that, in this case, µt acts as a constant in the derivative. Again, the local minimizer is δµ = − J(0)⊤ J(0) −1 J(0)⊤ r(0). (3.19) We iterate the above procedure until convergence. The next point in the iterative series is not computed as µt+1 = µt +δµ, but as µt+1 = µt ◦δµ to be coherent with Equation 3.16. Also notice that the Jacobian J(0) (Equation 3.18) is not constant as it depends both in the image It+1 and the parameters µt. Figure 3.6 shows a graphical depiction of the algorithm that is outlined in Algorithm 4. 33
  • 56. Algorithm 4 Outline of the Forward Compositional algorithm. On-line: Let µi = µ0 be the initial guess. 1: while no convergence do 2: Compute the residual function at r(µi) from Equation 3.16. 3: Linearize the dissimilarity: J = ∇ˆµr(0), using Equation 3.18. 4: Compute the search direction: δµi = − J(0)⊤ J(0) −1 J(0)⊤ r(0). 5: Update the optimization parameters:µi+1 = µi ◦ δµi. 6: end while Figure 3.6: Forward compositional image registration. We initialize the current parameter estimation at frame It+1 (µ ≡ µ0) using the local minimizer at the previous frame It (µ0 ≡ µ∗ t ). We compute the dissimilarity residuals between the Image and the Template using µ (Equation 3.15). We linearize the residuals at µ = 0 and we compute the descent direction δµ using Equation 3.19. We update the parameters using function composition— i.e. µ1 = µ0 ◦ δµ. We check if µ1 is a local minimizer by using the brightness dissimilarity: if D (Equation 3.15) is small enough, then µ1 is the local minimizer (µ∗ ≡ µ1); in other case, we repeat the process with using µ1 as the current parameters estimation (µ ≡ µ1). 34
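To make the compositional update of step 5 concrete, the sketch below spells it out for a projective (homography) warp, where composing parameter vectors amounts to multiplying the corresponding 3 × 3 matrices. The 8-parameter encoding of the homography, with the last entry fixed to 1, is an assumption of this example rather than a convention imposed by the FC algorithm.

```python
import numpy as np

def params_to_H(mu):
    """8 parameters -> 3x3 homography matrix (last entry fixed to 1)."""
    return np.append(mu, 1.0).reshape(3, 3)

def H_to_params(H):
    """3x3 homography -> 8-parameter vector, normalizing the last entry to 1."""
    H = H / H[2, 2]
    return H.ravel()[:8]

def compose(mu, delta_mu):
    """mu o delta_mu: apply the increment first, then the current warp,
    so that f(f(x; delta_mu); mu) = f(x; mu o delta_mu)."""
    return H_to_params(params_to_H(mu) @ params_to_H(delta_mu))
```

With these helpers, step 5 of Algorithm 4 reads mu = compose(mu, delta_mu).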
  • 57. 3.4.2 Inverse Compositional Algorithm The IC algorithm reinterprets the FC optimization scheme by changing the roles of the template and the image. The key feature of IC is that its GN Jacobian is constant: we compute the Jacobian using only template brightness values, therefore it is constant. Using a constant Jacobian speeds up the whole computation as the linearization stage is the most critical in time. The IC algorithm receives its name because we reverse the roles of the template and the current frame (i.e. we compute the Jacobian on the template). We rewrite the residuals function from FC (Equation 3.16) as follows: r(µ) ≡ T(f(x; µ)) − It+1(f(x; µt)), (3.20) yielding the residuals for IC. Notice that the template brightness values now depend on the search parameters µ. We linearize the Equation 3.20 around the point µ = 0 in the search space: r(0 + δµ) ≃ ℓ(δµ) ≡ r(0) + r′ (0)δµ = r(0) + J(0)δµ, (3.21) where r(0) ≡ T(f(x; 0)) − It+1(f(x; µt)), and J(0) ≡ ∂T(f(x; ˆµ)) ∂ ˆµ ˆµ=0 . (3.22) We compute the local minimizer of Equation 3.7 by deriving it respect δµ and equalling to zero, 0 = L′ (δµ) = ∇δµ r(0)⊤ r(0) + 2δµ⊤ J(0)r(0) + δµ⊤ J(0)⊤ J(0)δµ = J(0)r(0) + J(0)⊤ J(0)δµ. (3.23) Again, we obtain an approximation to the local minimum at δµ = − J(0)⊤ J(0) −1 J(0)⊤ r(0), (3.24) which we iteratively refine until we find a suitable solution. We summarize the optimization process in Algorithm 5 and Figure 3.7. Note that the Jacobian matrix J(0) is constant as it is computed on the template image—which is fixed—at the point µ = 0 (cf. Equation 3.22). Notice that the crucial point of the derivation of the algorithm lies in the change of variables in Equation 3.20. Solving for the search direction only consists on computing the IC residuals and computing the least-squares approximation (Equation 3.24). The Dissimilarity Linearization stage from Algorithm 1 is no longer required, which results in a boost of the performance of the algorithm. 35
  • 58. Algorithm 5 Outline of the Inverse Compositional algorithm. Off-line: Compute J(0) = ∇µr(0) using Equation 3.22. On-line: Let µi = µ0 be the initial guess. 1: while no convergence do 2: Compute the residual function at r(µi) from Equation 3.20. 3: Compute the search direction: δµi = − J(0)⊤ J(0) −1 J(0)⊤ r(0). 4: Update the optimization parameters:µi+1 = µi ◦ δµ−1 i . 5: end while Figure 3.7: Inverse compositional image registration. We initialize the current parameter estimation at frame It+1 (µ ≡ µ0) using the local minimizer at the previous frame It (µ0 ≡ µ∗ t ). At this point we compute the Jacobian J(0) using Equation 3.22. We compute the dissimilarity residuals between the Image and the Template using µ (Equation 3.15). Using J(0) we compute the descent direction δµ (Equation 3.24). We update the parameters using inverse function composition— i.e. µ1 = µ0 ◦ δµ−1 . We check if µ1 is a local minimizer by using the brightness dissimilarity: if D (Equation 3.15) is small enough, then µ1 is the local minimizer (µ∗ ≡ µ1); in other case, we repeat the process with using µ1 as the current parameters estimation (µ ≡ µ1). 36
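A minimal sketch of the IC structure follows: the Jacobian and the factor (J⊤J)−1J⊤ are computed once from the template, and each on-line iteration only evaluates residuals and composes the current parameters with the inverted increment. The warp-specific helpers (warp, sample, jacobian_on_template, compose, invert_params) are placeholders, not functions of any particular library.

```python
import numpy as np

def inverse_compositional(template_vals, frame, warp, sample,
                          jacobian_on_template, compose, invert_params,
                          X, mu0, n_iters=30, tol=1e-6):
    """IC loop (Algorithm 5): the Jacobian J(0) is computed once on the template."""
    J = jacobian_on_template(X)                  # constant (Eq. 3.22)
    step = np.linalg.solve(J.T @ J, J.T)         # (J^T J)^-1 J^T, also constant
    mu = mu0.copy()
    for _ in range(n_iters):
        r = template_vals - sample(frame, warp(X, mu))   # residuals r(0), cf. Eq. 3.20
        delta = -step @ r                        # Eq. 3.24
        mu = compose(mu, invert_params(delta))   # mu <- mu o delta^-1
        if np.linalg.norm(delta) < tol:
            break
    return mu
```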
  • 59. Relevance of IC The IC algorithm is known to be the most efficient optimization technique for direct image registration [Baker and Matthews, 2004]. The algorithm was initially proposed for template tracking, although it was later extended to use aams [Matthews and Baker, 2004], register 3D Morphable Models [Romdhani and Vetter, 2003; Xu and Roy-Chowdhury, 2008], account for photometric changes [Bartoli, 2008] and allow for appearance variation [Gonzalez-Mora et al., 2009]. Some efficient algorithms using a constant residual Jacobian with additive increments have been proposed in the literature, but none shows reliable performance: in [Cootes et al., 2001] an iterative regression-based gradient scheme is proposed to align aams to frontal images of faces. The regression matrix (similar to our Jacobian matrix) is numerically computed off-line and remains constant during the Gauss-Newton optimisation. The method shows good performance because the solution does not depart far from the initial guess. The method is revisited in [Donner et al., 2006] using Canonical Correlation Analysis instead of numerical differentiation to achieve a better convergence rate and range. In [La Cascia et al., 2000] the authors propose a Gauss-Newton scheme with a constant Jacobian matrix for 6-dof 3D tracking of heads. The method needs regularisation constraints to improve the convergence of the optimisation. Recently, [Brooks and Arbel, 2010] augmented the scope of the IC framework with the Generalized Inverse Compositional (GIC) image registration: they propose an additive update to the parameters that is equivalent to the compositional update from IC; therefore, they can adapt IC to optimization methods other than GN, such as Broyden-Fletcher-Goldfarb-Shanno (bfgs) [Press et al., 1992]. 3.5 Other Methods Iterative gradient-based optimization algorithms (see Figure 3.4) can improve their efficiency in two different ways: (1) by speeding up the linearization of the dissimilarity function, and (2) by reducing the number of iterations of the process. The algorithms that we have presented—i.e. HB and IC—belong to the first type. The second family of methods achieves efficiency by using a more involved linearization that converges faster to the minimum. [Averbuch and Keller, 2002] approximates the error function in both the template and the current image and averages the least-squares solutions of both. They show that it converges in fewer iterations than LK, although the time per iteration is higher. Malis et al. [Benhimane and Malis, 2007] propose a similar method called Efficient Second-Order Minimization (esm), which differs from it in using an efficient linearization on the template by means of Lie algebra properties. Recently, both methods have been revisited and reformulated in a common Bi-directional Framework in [Megret et al., 2008]. [Keller and Averbuch, 2008] derives a high-order approximation to the error function that leads to a faster algorithm with a wider convergence basin. Unfortunately—with the exception of esm—none of these algorithms is appropriate for real-time image 37
  • 60. registration. 3.6 Summary We have introduced the basic concepts on direct image registration. We pose the reg- istration problem as the result of gradient-descent optimizing a dissimilarity function based on brightness differences. We classify the direct image registration algorithms as either additive or compositional: in the former group we highlight the LK and the HB algorithms, whereas the FC and IC algorithms belong to the latter. 38
  • 61. Chapter 4 Equivalence of Gradients In this chapter we introduce the concept of Equivalence of Gradients, that is, the process of replacing the gradient of a brightness function for an equivalent alterna- tive. In chapter 3 we have shown that some efficient algorithms for direct image registration use a gradient replacement technique as a basis for their speed improve- ment: (1) HB algorithm transforms the template derivatives using the target warp to yield the image derivatives; and (2) IC algorithm replaces the image derivatives by the template derivatives without any modification, but they change the parameters update rule so the GN-like optimization converges. We introduce a new constraint, the Gradient Equivalence Equation, and we show that this constraint is a necessary requirement for the high computational efficiency of both HB and IC algorithms. We organize the chapter as follows: Section 4.1 introduces the basic concepts on image gradients in R2 , and its extension to higher dimension spaces such as P2 and R3 ; Section 4.2 introduces the Gradient Equivalence Equation, that shall be subsequently used to impose some requirements on the registration algorithms. 4.1 Image Gradients We introduce the concept of gradient of a scalar function below. We consider images as functions in two dimensions that assign a brightness value to an image pixel position. The Concept of Gradient The gradient of a scalar function f : Rn → R at a point x ∈ Rn is a vector ∇f(x) ∈ Rn that points towards the direction of greatest rate of increase of f(x). The length of the gradient vector |∇f(x)| is the greatest rate of change of the function. Image Gradients Grayscale images are discrete scalar functions I : R2 → R ranging from 0 (black) to 255 (white)—see Figure 4.1. We turn our attention to grayscale images, but we may deal with colour-channelled images (e.g. rgb images) by simply considering them as one grayscale image per colour plane. Grayscale 39
  • 62. images are discrete functions: we represent an image as a matrix whose elements I(i, j) are the brightness function values. We continuously approximate the discrete function by using interpolation (see Figure 4.1). We introduce the image gradients in the most common domains in Computer Vision—R2 , P2 , and R3 . Image gradients are naturally defined in R2 , since the images are functions defined in such domain. In some Computer Vision applications the domain of x, D, is not constrained to R2 , but to P2 [Buenaposada and Baumela, 2002; Cobzas et al., 2009], or to R3 [Sepp, 2006; Xu and Roy-Chowdhury, 2008]. In the following, the target coordinates are expressed in a domain D ∈ {R3 , P2 }, so we need a projection function to map the target coordinates onto the image. We generically define the projection mapping as p : D → R2 . The corresponding projectors are the homogeneous to Cartesian mapping, p : P2 → R2 , and the perspective projection, p : R3 → R2 . Image gradients in domains other than R2 are computed by using the chain rule with the projector p : Rn → R2 : ∇ˆx(I ◦ p(x)) = ∇ˆxI(p(x)) = ∇ˆxI(x)∇ˆxp(x), =   ∂I( ˆX) ∂ ˆX ˆX=p(x)   ∂p( ˆY) ∂ ˆY ˆY=x , = ∇ ˆp(x)I(p(x))∇ˆxp(x), x ∈ D ⊂ Rn . (4.1) Equation 4.1 represents image gradients in domain D as the image gradient in R2 lifted up onto the higher-dimension space D by means of the Jacobian matrix ∇ˆxp(x). Notation We use operator [ ] to denote the composite function I ◦ p, that is, I(p(x)) = I[x]. 4.1.1 Image Gradients in R2 If the target and its kinematics are expressed in R2 , there is no need to use a projector as both the target and the image share a common reference frame. The gradient of a grayscale image at point x = (i, j)⊤ is the vector ∇ˆxI(x) = (∇iI(x), ∇jI(x)) = ∂I(x) ∂i , ∂I(x) ∂j , (4.2) that flows from the darker areas of the image to the brighter ones (see Figure 4.1). Moreover, the direction of the gradient vector at point x ∈ R2 is orthogonal to the level set of the brightness function at the point (see Figure 4.1). 40
  • 63. Figure 4.1: Depiction of Image Gradients. (Top-left) An image is a rectangular array where each element is a brightness value. (Top-right) Continuous representation of the image brightness values; we compute the values from the discrete array by interpo- lation. (Bottom-left) Image gradients are vectors from each image array element in the direction of maximum increase of brightness (compare to the top-right image). (Bottom- right) Gradient vectors are orthogonal to the brightness function contour curves. Legend: blue Gradient vectors. different colours Contour curves. 41
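In practice, the gradient of Equation 4.2 is approximated by finite differences on the pixel grid; a minimal NumPy version using central differences could read as follows.

```python
import numpy as np

def image_gradient(I):
    """Central-difference approximation of (dI/di, dI/dj), cf. Equation 4.2.

    I is an (H, W) array indexed as I[i, j]; np.gradient differentiates along
    each axis, so the returned arrays hold the gradient at every pixel."""
    dI_di, dI_dj = np.gradient(I.astype(float))
    return dI_di, dI_dj

# Tiny example: brightness grows downwards, so the gradient points from dark to bright.
I = np.array([[0, 0, 0], [100, 100, 100], [255, 255, 255]], dtype=float)
gi, gj = image_gradient(I)   # gi > 0 everywhere, gj == 0 everywhere
```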
  • 64. 4.1.2 Image Gradients in P2 Projective warps map the projective plane onto itself f : P2 → P2 . We represent points in P2 using homogeneous coordinates (see Section 3.2.1). We compute the image derivatives on the projective plane by using derivatives on homogeneous coordinates. We compute the gradient of the composite function I ◦ p at the point ˜x = (x, y, w)⊤ ∈ P2 using the chain rule: ∇˜xI[˜x] = ∂I(p(ˆx)) ∂ˆx ˆx=˜x = ∂I(ˆp) ∂ˆp ˆp=p(˜x) ⊤ ∂p(ˆx) ∂ˆx ˆx=˜x ⊤ , = ∇iI ∇jI ⊤ 1 w 0 − x w2 0 1 w − y w2 , = 1 w∇iI 1 w∇jI − 1 w2 (x∇iI + y∇jI) . (4.3) Geometric Interpretation The image brightness gradient in P2 has a geometric interpretation. The following proposition defines the geometric locus of the image gradient in P2 Proposition 1. The image gradient of ˜x ∈ P2 is a projective line l incident to the point—i.e. l⊤ ˜x = 0. Proof. The projective line l = 1 w ∇iI 1 w ∇jI − 1 w2 (x∇iI + y∇jI) is incident to the projective point ˜x = (x, y, z)⊤ since: l⊤ ˜x = 1 w∇iI 1 w∇jI − 1 w(x∇iI + y∇jI) ⊤   x y w   , = x w ∇jI + y w ∇jI − w w2 (x∇iI + y∇jI) = 0. (4.4) Figure 4.2 depicts the image gradients in P2 for image T . This may also be derived by using Euler’s homogeneous function theorem1 : I[x] is a homogeneous function of degree 0 in x ∈ P2 , I[λx] = I[x] = λn I[x], with λ = 0 and n = 0. Then, by Euler’s theorem we have ∂I[λx] ∂λ = 0 = n · I[x] = x ∂I[x] ∂x + y ∂I[x] ∂y + w ∂I[x] ∂w . 1 Let f : Rn + → R be a continuously differentiable homogeneous function of degree n (i.e. f(λx) = λn f(x)), then nf(x) = i xi ∂f ∂xi 42
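The lifting of Equation 4.3 and the incidence relation of Proposition 1 are easy to check numerically. In the sketch below the R2 gradient components grad_i and grad_j are plain example numbers (they would normally come from a finite-difference approximation such as the one above), and the function name is ours.

```python
import numpy as np

def gradient_P2(grad_i, grad_j, x_tilde):
    """Image gradient at the homogeneous point (x, y, w), Equation 4.3."""
    x, y, w = x_tilde
    return np.array([grad_i / w,
                     grad_j / w,
                     -(x * grad_i + y * grad_j) / w**2])

x_tilde = np.array([3.0, 2.0, 0.5])           # homogeneous point, p(x~) = (6, 4)
l = gradient_P2(grad_i=1.2, grad_j=-0.7, x_tilde=x_tilde)
assert abs(l @ x_tilde) < 1e-12               # Proposition 1: l is incident to x~
```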
  • 65. Figure 4.2: Image Gradient in P2. (Left) Coordinates in P2 are equal up to scale, that is, x ∼ λx ∼ λ′x. Incidence is also preserved up to scale—i.e., xl⊤ = λxl⊤ = λ′xl⊤ = 0. (Right) The image gradient in P2, l⊤ , is tangent to the contour of the brightness function. Notice that the normal to l⊤ is parallel to the image gradient in R2, ∇ˆxI(x). Corollary 1. The director vector of the image gradient at ˜x ∈ P2 is orthogonal to the image brightness contour at p(x). Proof. The image gradient at p(x), ∇ ˆp(x)I(p(x)), is tangent to the contour of the brightness function at that point. Thus, its director vector, which is 1 w ∇iI 1 w ∇jI , is orthogonal to the brightness function contour curve—see Figure 4.2. 4.1.3 Image Gradients in R3 Often, the target is not defined in R3 , but on a manifold in R3 —e.g., a surface embedded in 3D space. Images defined in the whole R3 space provide 3D volumetric data instead of two-dimensional information [Baker et al., 2004b]. We assume that the target is defined on a manifold D ∈ R3 , and that the warp f : R3 → R3 defines the motion of manifold D. In this case P : R3 → R2 is a projector such that (x, y, z)⊤ → (f x z , f y z )⊤ —with f, the camera focal length. We compute the image derivatives by using the chain rule on the function I ◦ P 43
  • 66. at the point x = (x, y, z)⊤ ∈ R3 : ∇ˆxI[x] = ∂I(P(ˆx)) ∂ˆx ˆx=u = ∂I(ˆp) ∂ˆp ˆp=P(u) ⊤ ∂P(ˆx) ∂ˆx ˆx=u ⊤ , = ∇iI ∇jI ⊤ f 1 z 0 −f x z2 0 f 1 z −f y z2 , = f z∇iI f z∇jI − f z2 (x∇iI + y∇jI) . (4.5) Geometric Interpretation As in projective coordinates, image gradients in R3 can be geometrically represented. Sepp [Sepp, 2008] introduces the following propo- sition: Proposition 2. The image gradient of a point x ∈ R3 is a plane through x and the origin. Proof. A plane through a point x ∈ R3 and the origin o is a 3-element vector π such that π⊤ x = 0. This is true in our case as: π⊤ x = 1 z∇iI 1 z∇jI − 1 z(x∇iI + y∇jI) ⊤   x y z   , = x z ∇jI + y z ∇jI − z z2 (x∇iI + y∇jI) = 0. (4.6) Thus, the point x belongs to the plane as π⊤ x = 0. Besides, o = 0 trivially belongs to the plane as π does not have an independent term (i.e. π⊤ 0 = 0). We show the geometry of the image gradient in R3 in Figure 4.3. As in the P2 , this proposition is an immediate result from the fact that I[x] is a homogeneous function of degree 0 in x ∈ R3 . We also infer the following two corollaries from Proposition 2: Corollary 2. The image gradient of a point x ∈ R3, is a plane π through the origin that also contains the projection of the point, P(x). Proof. Let ˜x = (f x z , f y z , f)⊤ be the projection of x onto the image plane by means of P. Point ˆx belongs to the plane π⊤ defined by the target image gradient as π⊤ ˜x = 0 and the origin, π⊤ ˜x = 1 z∇iI 1 z∇jI −1 z(x∇iI + y∇jI) ⊤   f x z f y z f   , = f x z2 ∇jI + f y z2 ∇jI − f z2 (x∇iI + y∇jI) = 0. (4.7) 44
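Analogously, the lifted gradient of Equation 4.5 can be checked against Proposition 2 and Corollary 2: the resulting 3-vector defines a plane through the origin that contains both the point and its projection. The numbers below are example data only.

```python
import numpy as np

def gradient_R3(grad_i, grad_j, x, f):
    """Image gradient at a 3D point x = (X, Y, Z) under perspective projection, Eq. 4.5."""
    X, Y, Z = x
    return np.array([f * grad_i / Z,
                     f * grad_j / Z,
                     -f * (X * grad_i + Y * grad_j) / Z**2])

f = 500.0
x = np.array([0.2, -0.1, 2.0])
pi = gradient_R3(grad_i=0.8, grad_j=1.5, x=x, f=f)

x_proj = np.array([f * x[0] / x[2], f * x[1] / x[2], f])   # projection onto the image plane
assert abs(pi @ x) < 1e-9        # Proposition 2: the plane passes through x (and the origin)
assert abs(pi @ x_proj) < 1e-9   # Corollary 2: it also contains the projection of x
```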
  • 67. Figure 4.3: Image gradient in R3. (Left) A 3D model is imaged onto the camera. The gradient on the image of the target point x is a plane that contains both x and the origin (the camera centre). (Right) Close-up of the image at x. Plane π contains ∇ˆxI(x)⊥, the tangent to the brightness function. Thus, the plane vector π is orthogonal to the brightness function. This corollary could be immediately proved by noticing that the projection of the point x onto the image belongs to the line through o and x by definition of perspec- tive projection, and this line is contained in the plane π (see Figure 4.3). Corollary 3. The image gradient of a point x ∈ R3, π, is a plane going through the origin whose director vector is orthogonal to the brightness contour at the image point P(x). Proof. The image gradient at p(x) is tangent to the contour of the brightness func- tion at that point; the normal vector is thus orthogonal to the brightness contour curve. Note that P2 can be interpreted as an Euclidean space where points are lines through the origin and lines are planes through the origin [Hartley and Zisserman, 2004]. Thus, results for P2 and R3 are equivalent. 4.2 The Gradient Equivalence Equation We recall the brightness constancy constraint bcc from Equation 3.1: T (X) = It(f(X; µt)). The Gradient Equivalence Equation (GEE) relates the derivatives of the brightness function between two frames of the sequence: ∇ˆxT (X) = ∇ˆxIt(f(X; µt)). (4.8) 45
  • 68. We define the GEE to be the differential counterpart of the bcc: the bcc relates image brightness values between two images whereas the GEE relates image gradients (see Figure 4.4). Note that image brightness values are scalars (in grayscale images) but image gradients are vectors. 4.2.1 Relevance of the Gradient Equivalence Equation We use the gradient equivalence equation to verify whether we can substitute image derivatives with template derivatives: that is, if Equation 4.8 holds, then we can swap gradients. We shall show in Chapters 5 and 6 that swapping gradients is the cornerstone of the speed improvement for both the HB and IC algorithms: • Additive algorithms such as HB rewrite the image gradient as a function of the template gradient, increasing the speed of the optimization. • Compositional algorithms such as IC rely on a compositional formulation of the bcc to directly use the template gradient. We shall show that if the GEE does not hold, the convergence of the algorithms worsens when gradients are swapped. The foundations for this statement are simple: the Jacobian of a GN-like optimization method—e.g. HB or IC—depends on the image derivatives. The Jacobian establishes the search direction towards the minimum in the optimization space. When we build the Jacobian from template derivatives—to gain efficiency—we must guarantee that the resulting Jacobian is equivalent to the one computed from image derivatives. If the GEE (Equation 4.8) holds, then the Jacobian matrices computed from template and image derivatives are equivalent. If the Jacobians are not equivalent, the iterative search directions may not converge to the optimum (see Figure 4.5). Hence, the GEE is directly related to the convergence of the algorithm. 4.2.2 General Approach to Gradient Replacement Once we have acknowledged the importance of the GEE, one question still remains open: how do we verify that the GEE holds? Gradient equivalence (Equation 4.8) implies that the image gradient vectors in both T and It are equal at each point x ∈ X. Recall that two vectors are equal if both their directions and lengths match. From basic calculus we recall the following lemma: Lemma 3. Let f and g be two functions f, g : D → R that coincide in an open set Ω ∋ x0. Then ∂f(ˆx)/∂ˆx|ˆx=x0 = ∂g(ˆx)/∂ˆx|ˆx=x0. 46
  • 69. T It Figure 4.4: Comparison between BCC and GEE. (Top-row) The image on the left is rotated to generate the image on the right. Their bcc states that the image values are equal for the target regions of both images despite their orientation. (Middle-row) Gra- dients corresponding to both images; from left to right: ∇iT (x), ∇jT (x), ∇iIt(f(x; µt)), and ∇jIt(f(x; µt)). Notice that the relative values of ∇iT (x) and ∇iIt(f(x; µt)) are equal despite their orientation—ditto for ∇jT (x) and ∇jIt(f(x; µt)). (Bottom-row) The GEE states that the gradient vectors for both images are coherent up to the warp. Transforms one image into another: if a point of an image suffers a rotation, its gradient vector is rotated by the same amount. 47
  • 70. The proof of this lemma is immediate from the definition of open set and the equality of f and g in Ω—see Figure 4.5. Corollary 4. Let T [x] be the reference texture image, It[f(x; µt)] the input image at time t warped by f, and x a pixel location. If the bcc holds—i.e. T [x] = It[f(x; µt)]—in an open set Ω, then the GEE holds, ∇xT [x] = ∇xIt[f(x; µt)], ∀x ∈ Ω. The proof is immediate from Lemma 3. We may assume that both T [x] and It[f(x; µt)] are continuous functions of x by using interpolation from neighbouring pixel locations. Then, the bcc holds in an open set Ω ⊂ R2 or Ω ⊂ P2, except maybe for those pixels at image borders or occlusion boundaries. The GEE consequently holds from Corollary 4. Unfortunately, the GEE generally does not hold in R3. In the case of 2.5d tracking [Matthews et al., 2007]—a 2D surface embedded in 3D space—T [x] and It[f(x; µt)] do not coincide in an open set, since in general T [x] ≠ It[f(x; µt)] for points outside the surface. Thus, the GEE does not hold as a consequence of Lemma 3—see Figure 4.6. 4.3 Summary As we shall show later, the GEE is the cornerstone of the speed improvement in efficient image registration algorithms. If the image warping function is defined on R2 or P2, the GEE is satisfied. If the warping is defined on a 3D manifold D ⊂ R3—i.e., the case of 2.5D tracking—the GEE does not hold. In Table 4.1 we enumerate some warping functions and state whether they satisfy the GEE. 48
  • 71. Figure 4.5: Gradients and Convergence. (Left) We optimize a quadratic form starting from the initial guess x0 = (0, 14) using two gradient-based iterative methods. Solid red line: Algorithm that uses a correct Jacobian to compute each iterative step xi towards the solution; by definition, the gradient is orthogonal to the isocurve of the function at each xi. The algorithm reaches the optimum x∗ after some iterations. Dotted black line: Algorithm that uses an incorrect Jacobian to compute iterative steps ˜xi; the computed gradients are no longer orthogonal to the function curve. The algorithm reaches a solution ˜xn that is different to the actual optimum. (Right) Functions f and g, and their respective gradients (see figure legend). The grey region represents the open interval D = (3.5, 5.5). Figure 4.6: Open Subsets in Various Domains. (Left) Open subset in R2. The neighbourhood D ⊂ R2 of the point is fully contained in R2. (Right) Open subset for a point x ∈ R3 in a 3D manifold D ⊂ R3. The neighbourhood of the point x is not fully contained in the manifold D. 49
  • 72. Table 4.1: Characteristics of the warps Warp Name Domain Geometry dof Allowed Motion GEE 2D Affine R2 6 • rotation • translation • scaling • shearing YES Homography R2 , P2 8 • rotation • translation • scaling • shearing • perspective deformation YES Plane-induced homography (see Appendix B) P2 6 • 3D rotation • 3D translation YES(1) 3D Rigid Body R3 6 • 3D rotation • 3D translation NO (1) NO in the case of Plane+Parallax-constrained homography (see Appendix C) 50
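The first row of Table 4.1 can be verified numerically in its simplest instance: for a pure 2D translation by an integer amount, so that no interpolation is involved, the bcc holds in an open set and, as Corollary 4 predicts, the warped-image gradients coincide with the template gradients. A minimal check:

```python
import numpy as np

rng = np.random.default_rng(1)
T = rng.random((64, 64))                      # template
di, dj = 5, 3                                 # integer 2D translation (warp parameters)
I = np.roll(T, shift=(di, dj), axis=(0, 1))   # "next frame": translated template

# Target region well inside the image, so the open-set condition of Corollary 4 holds.
sl = np.s_[10:50, 10:50]
gT = np.gradient(T)
gI = np.gradient(I)

# BCC: T(X) == I(f(X; mu));  GEE: the gradients agree on the warped region.
assert np.allclose(T[sl], I[10 + di:50 + di, 10 + dj:50 + dj])
assert np.allclose(gT[0][sl], gI[0][10 + di:50 + di, 10 + dj:50 + dj])
assert np.allclose(gT[1][sl], gI[1][10 + di:50 + di, 10 + dj:50 + dj])
```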
  • 73. Chapter 5 Additive Algorithms This chapter discusses efficient additive algorithms for image registration. We turn our attention to the HB algorithm, as it is known to be the most efficient additive algorithm for image registration—we do not consider GIC as additive, but composi- tional. Open Issues with the Factorization Algorithm Although the HB algorithm beats LK performance and enables us to achieve real-time image registration (cf. [Hager and Belhumeur, 1998]), it is not free of criticism. Matthews and Baker analyse the algorithm in [Baker and Matthews, 2004], and we summarize their criticisms in the following two open questions: Is every warp suitable for HB optimization? The usual criticism to HB is that it works with only a limited number of motion models: pure translation, an affine model (scaling, translation, rota- tion and shearing), a restricted affine model (without shearing) and an “es- oteric” (according to [Baker and Matthews, 2004]) non-linear model. [Baker and Matthews, 2004] also stated that the algorithm could not use homographic warps, although [Buenaposada and Baumela, 2002] subsequently solved the problem by using homogeneous coordinates. However, there is no evidence that the algorithm could work with other warps. In this chapter we intro- duce a requirement that determines whether a given warp works with the HB algorithm. We show that this requirement is related to the Gradient Re- placement stage of the HB algorithm. Can we systematize the factorization scheme? The second quibble refers to stage two of the HB algorithm: the Factorization step. [Baker and Matthews, 2004] argue that the factorization step must be done using ad hoc techniques for each motion model. In this chapter and Ap- pendix D we provide lemmas and theorems that systematize the factorization of any expression involving matrices. This chapter provides some insight of the two stages of the HB algorithm: in Sec- tion 5.1 we study the requirements to perform the gradient replacement, and in Sec- 51
  • 74. tion 5.2 we study the process of methodical factorization. We subsequently apply this knowledge to register a 3D target model under a rigid body motion (Section 5.3): we provide a suitable warp that can be directly used with the HB algorithm, and we show the resulting HB optimization. Finally, Section 5.4 shows how to register morphable models in a similar fashion. 5.1 Gradient Replacement Requirements In this section we show the necessary requirements to perform the gradient replacement operation: that is, we state the requirements on target motion and structure that ensure proper convergence of the HB algorithm. We recall the scheme of the gradient replacement operation: we expand the GEE (Equation 4.8) using the chain rule, ∇ˆxT [x] = ∇ˆxI[x] ∇ˆxf(x; µ), (5.1) and we isolate the term ∇ˆxI[x] as follows: ∇ˆxI[x] = ∇ˆxT [x] (∇ˆxf(x; µ))−1 . (5.2) Then, we insert Equation 5.2 into the equation of the Jacobian expanded using the chain rule (Equation 3.11), J = ∇ˆxT [x] ∇ˆxf(ˆx; µ)−1 ∇ˆµf(x; µ), (5.3) which expresses the Jacobian matrix in terms of the template gradients. Notice that the GEE has a key role in Equation 5.3—it is the basis for the gradient replacement. Thus, we formulate our requirement using the GEE as follows: Requirement 1. The gradient replacement operation within the HB algorithm is feasible if and only if the GEE holds. We do not prove Requirement 1 as it is a direct consequence of the GEE. Requirement 1 is a rule that lets us establish whether we can use a given warp with the HB optimization. We thoroughly studied the GEE in Chapter 4, and we summarized the relation between warps and the GEE in Table 4.1. Thus, those warps that do not satisfy the GEE shall not satisfy Requirement 1, and consequently will not be suitable for the HB algorithm. It is important to note that we additionally require that an inverse exists for the derivative ∇ˆxf(ˆx; µ)—i.e., warp f must be invertible. If this derivative is singular, we shall not be able to compute Equation 5.3. 5.2 Systematic Factorization In this section we provide insight into the factorization process. We introduce a methodical procedure to perform the factorization. This is, by and large, quite a challenging 52
  • 75. task—as we shall show, the factorization is generally nonunique—but we provide a theoretical framework to support our claims. Why use factorization? The efficiency of the HB algorithm depends on two fac- tors, (1) the gradient replacement operation, and (2) the factorization of the Jaco- bian matrix. The improvement due to the gradient replacement is noticeable as it avoids to repeatedly compute image gradients. However, the improvement in speed due to the factorization stage is not obvious at all. Let us suppose a chain of product matrices, En×r = An×m Bm×p Cp×q Dq×r , (5.4) where red matrices—A and D—depend on parameters x, and green matrices—B and C—depend on parameters y. In the factorization we group those matrices whose elements depend on the same parameters, that is, En×r = An×m D′ m×r′ B′ r′×p′ C′ p′×r , = Xn×r′ Yr′×r (5.5) We compute matrix D′ by reordering the elements of matrix D into a suitable matrix— idem for matrices B′ and C′ . Furthermore, matrix X in Equation 5.5 only depends on the parameters x as we compute it from the matrices A and D′ (we equivalently compute matrix Y). Notice that although matrices in Equation 5.4 are different from those in Equation 5.5, the final product E is identical. When some of the parameters are constant—and so are their associated matrices—the improvement in speed due to factorization is rather noticeable. For example, if we assume that the parameters x are constant, then we avoid to compute the product AD′ from Equation 5.5; hence, the key point of applying factorization to compute E is to spare computing spurious operations due to constant terms. Factorization via reversible operators There is still an open question left: How do we systematize the factorization? [Brand and R.Bhotika, 2001] introduces the following definition: Definition 5.2.1. We may reorder arbitrarily a sequence of matrix operations (sums, products, reshapings and rearrangements) using matrix reversible operators. [Brand and R.Bhotika, 2001] define the matrix reversible operators as those that do not lose matrix information. They state that the Kronecker product, ⊗, or column vectorization, vec(A) [K. B. Petersen], are reversible whereas matrix multi- plication or division are not [Brand and R.Bhotika, 2001]. We also introduce the operator ⊙ which performs a row-wise Kronecker product of two matrices. In Appendix D we introduce a series of theorems and lemmas that, using the aforementioned reversible operators, rearrange the product and sum of matrices. 53
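As a concrete instance of such a reversible rearrangement, the classical identity vec(A X B) = (B⊤ ⊗ A) vec(X) moves a variable middle factor to the right of a purely constant matrix, which is exactly the constant/variable separation sought in Equation 5.5. A quick numerical check with column-stacking vectorization:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((3, 4))      # constant (e.g. shape-dependent) factor
B = rng.standard_normal((5, 2))      # constant factor
X = rng.standard_normal((4, 5))      # variable (e.g. motion-dependent) factor

vec = lambda M: M.flatten(order="F")            # column-stacking vectorization

lhs = vec(A @ X @ B)
rhs = np.kron(B.T, A) @ vec(X)                  # vec(A X B) = (B^T kron A) vec(X)
assert np.allclose(lhs, rhs)                    # np.kron(B.T, A) is constant and can be precomputed
```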
  • 76. Uniqueness of the factorization In general, the factorization of Equation 5.5 is not unique: we can rearrange the matrices of Equation 5.4 in different ways such that the result of Equations 5.4 and 5.5 is identical. This situation is particularly noticeable when we have distributive products: we can either apply the distributive property then factorize, or factorize first, then distribute. Jacobian matrix factorization We apply the aforementioned procedures to de- compose the Jacobian matrix (Equation 5.3). First, we represent the Jacobian ma- trix as a product of matrices whose elements are either shape or motion parameters. Notice that the terms ∇ˆxf(ˆx; µ)−1 and ∇ˆµf(x; µ) intermingle shape and motion el- ements such that the factorization is not obvious at all—on the other hand, ∇ˆxT [x] only depends on shape terms. The objective is to represent J as a leftmost matrix whose elements are only shape terms—or constant values—and a rightmost matrix whose elements are only motion terms: J =∇ˆxT [x]∇ˆxf(ˆx; µ)−1 ∇ˆµf(x; µ), = S M . (5.6) We use an iterative procedure to compute the factorization. We represent this procedure in Algorithm 6. Algorithm 6 Iterative factorization of the Jacobian matrix. Off-line: Represent the Jacobian as a chain of matrix sums or products. On-line: 1: Select two adjacent elements such that the right-hand one is a motion term and the left-hand one is shape term. 2: Use one of the factorization lemmas to reverse the order of the terms. 3: Repeat until the Jacobian is as Equation 5.6. Notation of the Factorization Process We consistently use the following no- tation to describe the factorization procedure: • We represent a matrix whose terms are either constants or shape terms using a red box S . • We represent a matrix whose terms are either constants or motion terms using a green box M . • When we reverse a pair of shape-motion terms by using a lemma we highlight the result by using a blue box R . We represent the application of a given lemma by using an arrow labelled with the name of the rule (see Equation 5.7). 54
• 77. E_{m×r} = A_{m×n} B_{n×p} C_{p×q} D_{q×r}  —Lemma→  A_{m×n} B′_{n×p′} C′_{p′×q} D_{q×r}. (5.7) 5.3 3D Rigid Motion In this section we describe how to register images of a 3D target under a rigid body motion in R3 by using a HB optimization. The fundamental challenge lies in satisfying Requirement 1: according to Table 4.1, the usual rigid body warp does not verify the GEE. Hence, we demand a warp that both models the target dynamics and holds Requirement 1. Previous Work The original HB algorithm was not specifically suited for 3D targets: [Hager and Belhumeur, 1998] only defined affine warps over 2D templates. Later, [Hager and Belhumeur, 1999] extended the HB algorithm to handle 3D motion; nonetheless, the algorithm seems to be limited, as it only handles small rotations—around 10 degrees. [Buenaposada and Baumela, 2002] extended the HB algorithm to handle homogeneous coordinates, so a full projective homography could be used. Using this homography, the authors effectively computed the 3D orientation of a plane in space. [Sepp and Hirzinger, 2003] proposed a HB algorithm to register complex 3D targets. Unfortunately, the results seem poor, as the algorithm only handles limited out-of-plane rotations—less than 30°, cf. [Sepp and Hirzinger, 2003]. We may explain this poor performance as a direct consequence of an invalid gradient replacement: the rigid body transformation does not verify the GEE (cf. Chapter 4), hence Requirement 1 does not hold. We define an algorithm that effectively registers 3D targets by using a family of homographies: one homography per plane/triangle of the model. We combine these homographies together by considering that all the triangles of the target model are parameterized by the same rotation and translation [Muñoz et al., 2009]. In the following we introduce the convention to represent target models, 3D Textured Models, and a new warp that parameterizes a family of homographies for a set of planes, the shape-induced homography. 5.3.1 3D Textured Models We describe the target using 3D Textured Models (3dtm) [Blanz and Vetter, 2003]. We follow a convention similar to [Blanz and Vetter, 1999]: we model shape and colour separately, but both are defined on a common bi-dimensional space, F ⊂ R2, that we call the Reference Frame. The target shape is a discrete function S : F → R3 that maps ui = (xi, yi)⊤ ∈ F into si = (Xi, Yi, Zi)⊤ ∈ R3 for i = 1, . . . , N, where N is the number of vertices of the target model. Vertices are arranged in a discrete polygonal mesh that we define by a list of triangles. We provide continuity in the 55
  • 78. Figure 5.1: 3D Textured Model. (Top-left) The bi-dimensional reference frame F. We associate a colour triplet to each discrete point in F, resulting in a texture image. (Top-right) The shape of the model in R3. (Bottom-left) Close-up of F (green square in top-left image). The reference frame is discretized in triangles. (Bottom-right) Close up of the shape s (green square in top-right image). The shape is a 3D triangular mesh. The bi-dimensional triangle (u1, u2, u3)⊤ is mapped into the 3D triangle (s1, s2, s3)⊤. The interior point of the triangle u is coherently mapped into its corresponding point in the shape by using barycentric coordinates. space F by interpolating among neighbouring vertices in the triangle list [Romdhani and Vetter, 2003]. The target colour or texture, T : F → RC , similarly maps the bi-dimensional space u ∈ F into the colour space—RGB if coloured or grey scale if monochrome (see Figure 5.1). Again, we achieve continuity in the colour space by interpolating the colour values at the vertices in the triangle list. 56
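The mapping S from the reference frame to the 3D surface reduces, inside each triangle, to barycentric interpolation. The following minimal sketch (with invented vertex coordinates) illustrates it; the colour function T is interpolated over the same triangles in exactly the same way.

import numpy as np

# A point u inside a reference-frame triangle (u1, u2, u3) is mapped onto the
# 3D surface by re-using its barycentric coordinates on the corresponding
# 3D triangle (s1, s2, s3).  The vertex values below are invented for the demo.
def barycentric(u, u1, u2, u3):
    """Barycentric coordinates of 2D point u w.r.t. triangle (u1, u2, u3)."""
    T = np.column_stack((u2 - u1, u3 - u1))          # 2x2 edge matrix
    b, c = np.linalg.solve(T, u - u1)
    return np.array([1.0 - b - c, b, c])

def map_to_shape(u, tri_2d, tri_3d):
    """Interpolate the 3D position of reference-frame point u."""
    w = barycentric(u, *tri_2d)
    return w @ np.vstack(tri_3d)                      # w1*s1 + w2*s2 + w3*s3

tri_2d = [np.array([0.0, 0.0]), np.array([1.0, 0.0]), np.array([0.0, 1.0])]
tri_3d = [np.array([0.0, 0.0, 1.0]), np.array([2.0, 0.0, 1.5]), np.array([0.0, 2.0, 0.5])]
print(map_to_shape(np.array([0.25, 0.25]), tri_2d, tri_3d))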
• 79. Notation By abuse of notation, T denotes both the usual image function, T : R2 → RC, and the texture defined in the reference space, T : F → RC. 5.3.2 Shape-induced Homography We introduce a new warp that (1) represents the target rigid body motion in R3, and (2) holds Requirement 1. We base this warp on the plane-induced homography fh6—see Appendix B. The plane-induced homography relates the projections of a plane that rotates and translates in R3. We equivalently define the shape-induced homographies fh6s as a family of plane-induced homographies that relate the projections of a shape that rotates and translates in space, fh6s(˜x, n; µ) = ˜x′ = K(R − Rtn⊤)K⁻¹ ˜x, (5.8) where ˜x and ˜x′ are the projections of a generic vertex s of shape S (see Figure 5.2). We consider that vertex s is the centroid of the triangle with normal n, located at depth d from the origin—i.e. n⊤s = d. We normalize n by the triangle depth, n = n/d, such that n⊤s = 1. Vector µ contains a parameterization for R and t. Notice that µ is common to every point in S but n is not; hence, we have one plane-induced homography for each pair of projections ˜x ↔ ˜x′, but every homography shares the same R and t (see Figure 5.2). 5.3.3 Change to the Reference Frame Equation 5.8 relates the projections of a shape vertex in two views. We now describe how to express Equation 5.8 in terms of F-coordinates. Let u1, u2, and u3 be the coordinates in the reference frame of the shape triangle s1, s2, and s3—i.e. si = S(ui). Let ˜xi be the projection of si on a given image. Figure 5.3 shows the relationship among the triangles (s1, s2, s3)⊤, (˜x1, ˜x2, ˜x3)⊤, and (˜u1, ˜u2, ˜u3)⊤. We represent the transformation between vertices (˜u1, ˜u2, ˜u3)⊤ and (˜x1, ˜x2, ˜x3)⊤ using an affine warp HA, ˜xi = HA˜ui = [a11 a12 tx; a21 a22 ty; 0 0 1] (ui⊤, 1)⊤, i = 1, 2, 3, (5.9) where ˜ui ∈ P2 is the augmented vector of ui. The affine transformation HA is explicitly defined by the three correspondences ˜u1 ↔ ˜x1, ˜u2 ↔ ˜x2, and ˜u3 ↔ ˜x3; the interior points of the triangles are coherently transformed, as the affinity is invariant to barycentric combinations [Hartley and Zisserman, 2004]. When we extend Equation 5.9 to the N vertices of S we obtain a piecewise affine transformation [Matthews and Baker, 2004] between F and the view (see Figure 5.3). If the affinity HA is not degenerate—i.e. det(HA) ≠ 0—then we can rewrite Equation 5.8 as follows: f3DTM(˜u, n; µ) = ˜u′ = K(R − Rtn⊤)K⁻¹ HA˜u. (5.10) The transformation f3dtm (Equation 5.10) relates the 3D motion of the shape to the reference frame F (see Figure 5.3). 57
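A minimal numerical sketch of Equations 5.8 and 5.10 follows. The intrinsics K, the motion (R, t), the depth-normalized normal n and the affinity HA are placeholder values, not calibrated data.

import numpy as np

# Every triangle shares the same rotation R and translation t, but contributes
# its own depth-normalized normal n and its own affinity H_A from the
# reference frame to the reference image.
def rot_z(gamma):
    c, s = np.cos(gamma), np.sin(gamma)
    return np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])

def shape_induced_homography(K, R, t, n):
    """H such that x' ~ H x for points on the plane n^T X = 1 (Eq. 5.8)."""
    return K @ (R - R @ np.outer(t, n)) @ np.linalg.inv(K)

def f_3dtm(u_tilde, K, R, t, n, H_A):
    """Warp a homogeneous reference-frame point into the second view (Eq. 5.10)."""
    x = shape_induced_homography(K, R, t, n) @ H_A @ u_tilde
    return x / x[2]                                   # back to the image plane

K = np.diag([800.0, 800.0, 1.0])                      # toy intrinsics
R, t = rot_z(np.deg2rad(5.0)), np.array([0.02, 0.0, 0.01])
n = np.array([0.0, 0.0, 1.0]) / 2.0                   # plane normal / depth
H_A = np.eye(3)                                       # identity affinity for the demo
print(f_3dtm(np.array([10.0, 20.0, 1.0]), K, R, t, n, H_A))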
  • 80. Figure 5.2: Shape-induced homographies. (Top) We respectively image the shape S using P = [I|0] (left view) and P′ = [R| − Rt] (right view). (Middle) Close-up of the 3D shape. We select three shape points s1, s2, s3 ∈ R3, each one belongs to a triangle located on a different plane in R3 —whose normals are respectively π1, π2, π3 ∈ R3. (Bottom) Close-ups of the two views. We image the shape point si as ˜xi on the left view, and ˜x′ i on the right view, for i = {1, 2, 3}. The point si on plane πi induces the homography ˜Hi between ˜xi and ˜x′ i. Note that ˜H1, ˜H2, and ˜H3 are different homographies that share R and t parameters (cf. Equation 5.8). 58
  • 81. Figure 5.3: Warp defined on the reference frame. (Top-left) Shape triangle mesh and texture on the reference frame F. (Top-right) The image corresponding to the left view of Figure 5.2. This is the reference image. (Middle) Close-up of the 3D shape. We select three shape points s1, s2, s3 ∈ R3, each one belongs to a triangle located on a different plane in R3—whose normals are respectively π1, π2, π3 ∈ R3. (Bottom) Close- ups from the images of the top row: the reference frame (left) and the reference image (right). We image the shape point si as ˜xi on the left view, and ˜x′ i on the right view, for i = {1, 2, 3}. On the other hand, we know the relationship between the point si and its correspondence in the reference frame ˜ui by means of the function S. Thus, there exists a correspondence ˜ui ↔ ˜xi by means of si. We compute such a correspondence as an affine transformation Hi A between ˜ui and ˜xi —i.e. ˜xi = Hi A˜ui, the correspondence is a piecewise affine transformation. Note that transformations Hi A are different from each other since they depend on the correspondence ˜ui ↔ ˜xi. 59
  • 82. Does f3DTM hold the GEE? Equation 5.10 is a homography resulting from chaining two nondegenerate homographies. Thus, according to Lemma 3 any ho- mographic transformation such as f3dtm holds the GEE and, by extension, holds Requirement 1—see Table 4.1. Advantages of the Reference Frame Many previous approaches to 3D tracking, e.g. [Baker and Matthews, 2004; Bue- naposada et al., 2004; Cobzas et al., 2009; Decarlo and Metaxas, 2000; Hager and Belhumeur, 1998; Sepp, 2006], use a reference template—or a selected frame of the sequence—to define the texture function T (see Figure 5.4). Using this texture in- formation they define the brightness dissimilarity function that they subsequently optimize (e.g., by using Equation 3.1). β = 0o β = 30o β = 60o β = 90o Figure 5.4: Reference frame advantages. (Top-row) We rotate the shape model around Y -axis with degrees β = {0, 30, 60, 90}. (Bottom-row) Visible points for each view of the rotated shape. Green dots: Hidden triangles due to self-occlusions. We consider that a shape triangle is visible in the image if the angle of the normal to the triangle and the camera ray is less than 70o. This information is valid in the neighbourhood of the reference image only. The dissimilarity function uses the texture that is available from the imaged target at that reference image. Thus, there is no texture information from those parts of the object that are not imaged in the template. As the projected appearance changes due to the motion of the target, more and more uncertainty is introduced in the brightness 60
  • 83. dissimilarity function. This leads to a decrease in the performance of the tracker for large rotations [Sepp, 2006; Xu and Roy-Chowdhury, 2008] (see Figure 5.4). We solve the problem by using the texture reference frame [Romdhani and Vetter, 2003]. With the texture reference frame we can define a continuous brightness dissimilarity function over the whole 3D target. Using this continuous texture we can define the brightness dissimilarity even in the case of large rotations of the target (see Figure 5.4). Another approach to this problem is to use several reference images [Vacchetti et al., 2004]. When the target is a 3D plane, one reference image suffices to provide texture information [Baker and Matthews, 2004; Buenaposada and Baumela, 2002; Hager and Belhumeur, 1998]. Although it does not suffer from self-occlusions, it may have aliasing artefacts. 5.3.4 Optimization Outline We define the brightness dissimilarity function (using Equation 5.10) as follows: D(F; µ) = T [F] − It[f3dtm(F, n; µ)], (5.11) where n ≡ n(˜u) is the normal vector to the plane that contains the point S(˜u), ∀˜u ∈ F—strictly speaking, ˜u = (u, 1)⊤ , with u ∈ F. The dissimilarity function (Equa- tion 5.11) is continuous over the reference space F—as well as the normal function n(F). Notice that the dissimilarity function is defined over the texture function T instead of a single template—as in Equation 3.1. We rewrite Equation 5.11 in residuals form as: r(µ) = T[˜u] − It+1[f3dtm(˜u; µ)], (5.12) where we drop the parameter n of f3dtm for clarity. The corresponding linear model for the residuals of Equation 5.12 is ℓ(δµ) ≡ r(µ) + J(µ)δµ, (5.13) where J(µ) = ∂It+1[f3dtm(˜u; ˆµ)] ∂ ˆµ ˆµ=µ . (5.14) 5.3.5 Gradient Replacement We rewrite Equation 5.14 using the gradient replacement equation (Equation 5.3) as follows: J(µ) = (∇ˆuT[˜u])⊤ (∇ˆuf3dtm(˜u; µ))−1 (∇ˆµf3dtm(˜u; µ)) . (5.15) In the following, we individually analyze each term of Equation 5.15. Template gradients on F The first term deals with the template derivatives on the reference frame F: ∇ˆuT [˜u]⊤ = 1 w ∇iT [˜u], 1 w ∇jT [˜u], − 1 w (u∇iT [˜u] + v∇jT [˜u]) ⊤ . (5.16) 61
  • 84. Warp gradients on target coordinates The second term handles the gradients of the warp f3dtm(˜u; µ) with respect to the target coordinates ˜u. The target is defined on the projective plane P2 , so we trivially compute the gradient as: ∇ˆuf3dtm(˜u, n; µ) = K R − Rtn⊤ K−1 HA. (5.17) The resulting homography matrix (see Equation 5.17) must be inverted for each point ˜u in the reference frame. We directly invert the Equation 5.17 as follows: ∇ˆuf−1 3dtm = K R − Rtn⊤ K−1 HA −1 = H−1 A K R − Rtn⊤ −1 K−1 , = H−1 A K I − tn⊤ −1 R⊤ K−1 . (5.18) We invert the term I − tn⊤ using Sherman-Morrison matrix inversion formula [K. B. Pe- tersen]: I − tn⊤ −1 = I + tn⊤ 1 − n⊤t . (5.19) Plugging Equation 5.19 into Equation 5.18 results in ∇ˆuf−1 3dtm = H−1 A K I + tn⊤ 1 − n⊤ t R⊤ K−1 , = H−1 A K (1 − n⊤ t)I + tn⊤ 1 − n⊤ t R⊤ K−1 , = λH−1 A K I − (n⊤ t)I + tn⊤ R⊤ K−1 , (5.20) where λ = 1/(1 − n⊤ t) is a homogeneous scale factor that depends on each shape point. Target motion gradients The third term computes the gradients of the warp f3dtm(˜u; µ) with respect to the motion parameters µ. The resulting Jacobian matrix has the following form: ∇ˆµf3dtm(˜u; µ) = ∇ˆRf3dtm(˜u; R) ∇ˆtf3dtm(˜u; t) . (5.21) The derivative of the warp with respect to each rotation parameter is computed as follows: ∇ˆ∆f3dtm(˜u; ∆) = K˙R∆K−1 HA˜u, (5.22) where ˙R∆ is the derivative of the rotation matrix R with respect to the Euler angle ∆ = {α, β, γ}. We trivially compute the derivatives of the warp with respect to the translation parameters t in the following: ∇ˆtf3dtm(˜u; t) = Kn⊤ K−1 HA˜u. (5.23) Notice that Equation 5.23 only depends on the target shape —˜u, the target coordi- nates in F, and n, its corresponding word plane—but does not depend on the motion parameters anymore; hence, the Equation 5.23 is constant. Plugging Equations 5.22 and 5.23 into the Equation 5.21 we obtain the final form of the derivatives: ∇ˆµf3dtm(˜u; µ) = K˙RαK−1 HA˜u K˙RβK−1 HA˜u K˙RγK−1 HA˜u Kn⊤ K−1 HA˜u . (5.24) 62
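The rank-one inversion used in Equations 5.19 and 5.20 can be checked numerically. The following sketch uses arbitrary test vectors t and n (chosen so that n⊤t ≠ 1 and the matrix is invertible).

import numpy as np

# Numerical check of the Sherman-Morrison step of Equations 5.19-5.20:
# (I - t n^T)^{-1} = I + t n^T / (1 - n^T t).
t = np.array([0.3, -0.1, 0.2])
n = np.array([0.0, 0.5, 1.0])

lhs = np.linalg.inv(np.eye(3) - np.outer(t, n))
rhs = np.eye(3) + np.outer(t, n) / (1.0 - n @ t)
assert np.allclose(lhs, rhs)

# The same identity gives the closed form of Equation 5.20 up to the scale
# lambda = 1 / (1 - n^T t), so no per-point 3x3 inversion is needed at run time.
lam = 1.0 / (1.0 - n @ t)
closed_form = lam * ((1.0 - n @ t) * np.eye(3) + np.outer(t, n))
assert np.allclose(lhs, closed_form)
print("Sherman-Morrison check passed")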
  • 85. Assemblage of the Jacobian Substituting Equations 5.16, 5.20, and 5.24 back into Equation 5.15 we have the analytic form of each row of the Jacobian matrix (Equation 5.14): J⊤ = J1 J2 J3 J4 J5 J6 , (5.25) with J1 = ∇ˆuT [˜u]⊤ λH−1 A K I − (n⊤ t)I + tn⊤ R⊤ ˙RαK−1 HA˜u, J2 = ∇ˆuT [˜u]⊤ λH−1 A K I − (n⊤ t)I + tn⊤ R⊤ ˙RβK−1 HA˜u, J3 = ∇ˆuT [˜u]⊤ λH−1 A K I − (n⊤ t)I + tn⊤ R⊤ ˙RγK−1 HA˜u, J4 = ∇ˆuT [˜u]⊤ λH−1 A K I − (n⊤ t)I + tn⊤ r1n⊤ K−1 HA˜u, J5 = ∇ˆuT [˜u]⊤ λH−1 A K I − (n⊤ t)I + tn⊤ r2n⊤ K−1 HA˜u, J6 = ∇ˆuT [˜u]⊤ λH−1 A K I − (n⊤ t)I + tn⊤ r3n⊤ K−1 HA˜u, (5.26) where ri for i = 1, . . . , 3 is the i − th column vector of the rotation matrix R—i.e. R = (r1, r2, r3). 5.3.6 Systematic Factorization We proceed with the systematic factorization of Equation 5.25 using the theorems and lemmas from Appendix D. We attempt to rewrite the terms of the Jacobian such that we compute the products in Equation 5.15 more efficiently. Our first operation is a change of variables v = K−1 HA˜u that rewrites Equations 5.26 as follows: J1 = ∇ˆuT [˜u]⊤ λH−1 A K I − (n⊤ t)I + tn⊤ R⊤ ˙Rαv, J2 = ∇ˆuT [˜u]⊤ λH−1 A K I − (n⊤ t)I + tn⊤ R⊤ ˙Rβv, J3 = ∇ˆuT [˜u]⊤ λH−1 A K I − (n⊤ t)I + tn⊤ R⊤ ˙Rγv, J4 = ∇ˆuT [˜u]⊤ λH−1 A K I − (n⊤ t)I + tn⊤ r1n⊤ v, J5 = ∇ˆuT [˜u]⊤ λH−1 A K I − (n⊤ t)I + tn⊤ r2n⊤ v, J6 = ∇ˆuT [˜u]⊤ λH−1 A K I − (n⊤ t)I + tn⊤ r3n⊤ v. (5.27) We factorize each term of Equation 5.27 as we indicated in Section 5.3. Thus, we rewrite the i-th row of the Jacobian matrix J as J(i)⊤ = S(i)⊤ M, (5.28) where S(i)⊤ = S (i)⊤ 1 S (i)⊤ 2 . We define matrices S (i)⊤ 1 , and S (i)⊤ 2 as follows: S (i)⊤ 1 =∇ˆuT [˜u]⊤ (I3 ⊗ n(i)⊤ )A + (I3 ⊗ n(i)⊤ )(I9 ⊗ v(i)⊤ ) − (I3 ⊗ n(i)⊤ )(Pπ(9:3) ⊗ v(i)⊤ ) B, S (i)⊤ 2 =S (i)⊤ 1 (I3 ⊗ n(i) ) ⊗ I4 . (5.29) 63
  • 86. We build the motion matrix M as M =     1 t ⊗ I9 vec(˙R ⊤ αt R) vec(˙R ⊤ βt R) vec(˙R ⊤ γt R) 036×3 012×3 I3 ⊗ 1 t R     . (5.30) The full derivation of the matrices from Equation 5.29 is presented in Appendix E. We assemble the Jacobian J by stacking the N rows J(i)⊤ up into a single matrix. Since matrix M is the same for each row J(i)⊤ , we can extract M as a common factor and write the Jacobian matrix as J = SM, (5.31) where S is the matrix that we compute by stacking the N entries S (i)⊤ 1 and S (i)⊤ 2 , S =    S (1)⊤ 1 S (1)⊤ 2 ... ... S (N)⊤ 1 S (N)⊤ 2    . (5.32) Outline of Algorithm HB3DTM We define the HB algorithm for the warp f3dtm as hb3dtm. We use the factorization equations (Equations 5.29) as a basis for our algorithm; we show the outline for algorithm hb3dtm in Algorithm 7. Algorithm 7 Outline of the HB3DTM algorithm. Off-line: Let µi = µ0 be the initial guess. 1: for i = 1 to N do 2: Compute S (i) 1 and S (i) 2 using Equation 5.29. 3: end for 4: Assemble matrix S using Equation 5.32. On-line: 5: while no convergence do 6: Compute the residual function at r(µi) from Equation 5.12 7: Compute matrix M(µi) using Equation 5.30. 8: Assemble the Jacobian: J(µi) = SM(µi) (Equation 5.31). 9: Compute the search direction: δµi = − J(µi)⊤ J(µi) −1 J(µi)⊤ r(µi). 10: Additively update the optimization parameters:µi+1 = µi + δµi. 11: end while 64
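The on-line loop of Algorithm 7 reduces, per iteration, to the small product SM(µ) and a least-squares solve. A minimal sketch of that loop is given below; residuals and motion_matrix are placeholders standing in for Equations 5.12 and 5.30, and the loop itself is an ordinary Gauss-Newton iteration with additive update.

import numpy as np

# Minimal sketch of the on-line part of Algorithm 7 (HB3DTM).  The constant
# structure matrix S is built off-line; per frame we only rebuild the small
# motion matrix M(mu) and solve the normal equations.
def hb_iterations(S, residuals, motion_matrix, mu0, n_iter=10, tol=1e-6):
    mu = mu0.copy()
    for _ in range(n_iter):
        r = residuals(mu)                    # stand-in for Equation 5.12
        M = motion_matrix(mu)                # stand-in for Equation 5.30
        J = S @ M                            # Equation 5.31: J = S M
        delta, *_ = np.linalg.lstsq(J, -r, rcond=None)   # Gauss-Newton step
        mu = mu + delta                      # additive update
        if np.linalg.norm(delta) < tol:
            break
    return mu

The point of the factorization is that S, the large per-pixel part, never changes, so the per-iteration cost is dominated by the product SM and the small normal equations.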
  • 87. 5.4 3D Nonrigid Motion In addition to the motion of the target in space, we allow deformations of the target itself. This nonrigid motion is caused by elastic deformations of the target (e.g. deformations on an elastic sheet, changes on facial expression), or by nonelastic motion of target portions (e.g. jaw rotation due to mouth opening). 5.4.1 Nonrigid Morphable Models As in the rigid case (Section 5.3), we describe the target using nonrigid morphable models (3dmm) [Romdhani and Vetter, 2003]: we describe the shape deformation using a linear combination of modes of deformation that we have obtained by ap- plying pca on a set of shape samples: s = s0 + K k=1 cksk , s, s0 , sk ∈ R3 (5.33) where s0 ∈ R3 is the pca average sample shape, sk are the K modes of variation from pca, and ck are the coefficients of the linear combination (see Figure 5.5). Note that each shape point s has different s0 and sk , but all of them share the same K coefficients: S3×N = S0 3×N + NK k=1 ckSk 3×N, (5.34) where S0 3×N, and Sk 3×N are the matrices that we compute by joining the N shape averages s0 and sk . Figure 5.5: Nonrigid Morphable Models. The 3dmm models the 3D shape as a linear combination of 3D shapes that represent the modes of variation. 5.4.2 Nonrigid Shape-induced Homography As for the rigid case, we model the target dynamics using shape-induced homo- graphies (see Equation 5.8), but we must account for the target deformation. We equivalently define the nonrigid shape-induced homography fh6d as a family of plane- induced homographies that relate the projections of a shape that rotates, translates and deforms in space: fh6d(˜x, n; µ) = ˜x′ = K R + RBscn⊤ − Rtn⊤ K−1 ˜x, (5.35) 65
• 88. where ˜x and ˜x′ are the projections of a generic shape point s located on the plane π = (n⊤, 1)⊤ (see Figure 5.6), Bs = [s1, . . . , sK] is a 3 × K matrix that contains the modes of variation, and c = (c1, . . . , cK)⊤ is the vector of deformation coefficients. Vector µ contains a parameterization for the rotation matrix R, the translation t and the deformation coefficients c. Warp Rationale We project the generic shape point s onto the cameras P = K[I|0] and P′ = K[R| − Rt]. (5.36) We describe the point s by rewriting Equation 5.33 as s = s0 + Σ_{k=1}^{K} ck sk = s0 + Bsc, with s, s0, sk ∈ R3, (5.37) where Bs and c are as defined above. We explicitly assume that the target does not deform in the first view, that is, we image s under P as ˜x = K[I|0] (s0 + Bs0; 1)⊤ = Ks0. (5.38) If we encode the deformation between the two views as c, then we image s under P′ as ˜x′ = K[R| − Rt] (s0 + Bsc; 1)⊤ = K(Rs0 + RBsc − Rt). (5.39) The world plane π = (n⊤, 1)⊤ naturally satisfies n⊤s0 = 1; thus, we rewrite Equation 5.39 as follows: ˜x′ = K(Rs0 + RBscn⊤s0 − Rtn⊤s0) = K(R + RBscn⊤ − Rtn⊤)s0. (5.40) Using Equation 5.38 we rewrite Equation 5.40 as the nonrigid shape-induced homography between the projections ˜x ↔ ˜x′: ˜x′ = K(R + RBscn⊤ − Rtn⊤)K⁻¹ ˜x. (5.41) 5.4.3 Change of Variables to the Reference Frame We can also express the target coordinates in terms of the reference frame F. As in the rigid case, there exists an affine transformation HA such that ˜x = HA˜u—see Figure 5.7. Thus, we write the warp f3dmm that relates shape coordinates in F with the projections onto a view due to the target motion: f3dmm(˜u, n; µ) = ˜x′ = K(R + RBscn⊤ − Rtn⊤)K⁻¹ HA˜u. (5.42) 66
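The following minimal sketch evaluates the nonrigid shape-induced homography of Equation 5.41 for one plane; K, R, t, n, Bs and c are invented values. Setting c = 0 recovers the rigid homography of Equation 5.8.

import numpy as np

# The rigid plane-induced homography gains an extra term R Bs c n^T that
# accounts for the deformation of the triangle between the two views.
def nonrigid_shape_homography(K, R, t, n, Bs, c):
    """H such that x' ~ H x for a deforming plane point (Eq. 5.41)."""
    deform = R @ Bs @ c                               # 3-vector: R Bs c
    H = K @ (R + np.outer(deform, n) - np.outer(R @ t, n)) @ np.linalg.inv(K)
    return H

K = np.diag([800.0, 800.0, 1.0])
R = np.eye(3)
t = np.array([0.05, 0.0, 0.0])
n = np.array([0.0, 0.0, 1.0]) / 2.0                   # depth-normalized normal
Bs = np.array([[0.1, 0.0], [0.0, 0.1], [0.0, 0.0]])   # K = 2 deformation modes
c = np.array([0.5, -0.2])                             # deformation coefficients

x = np.array([100.0, 50.0, 1.0])
x_prime = nonrigid_shape_homography(K, R, t, n, Bs, c) @ x
print(x_prime / x_prime[2])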
  • 89. Figure 5.6: Nonrigid shape-induced homographies. (Top-left) We image the average shape s1 using P = [I|0] onto the left view. (Top-right) We compute the deformed shape s′ 1 using Equation 5.33. We respectively image this shape onto the right view by using P′ = [R| − Rt]. (Middle-left) Close-up of the average shape. Point s1 on the average shape lies on the plane with coordinates (π1, 1)⊤. (Middle-right) Close-up of the deformed shape. The plane in which s′ 1 lies differs from (π1, 1)⊤ by a rigid body transformation and a shape deformation. (Bottom-left) Close-up of the left view. We image the average shape point s1 as ˜x1. (Bottom-right) Close-up of the right view. We image the deformed point s′ 1 as ˜x′ 1. The correspondence s1 ↔ s′ 1 induces a homography ˜H1 between ˜x1 and ˜x′ 1. Note that each shape induces a family of homographies that are parameterized by a common R and t (cf. Equation 5.40). 67
  • 90. Figure 5.7: Deformable warp defined on the reference frame. (Top-left) Shape triangle mesh and texture on the reference frame F. (Top-right) The image corresponding to the left view of Figure 5.6—the reference image. (Middle) Close-up of the average shape (see Figure 5.6—Middle-left). (Bottom-left) Close-up of Top-left. We compute the point ˜u1 on the reference frame that corresponds to the average shape point s1 by using the shape function S. (Bottom-right) Close-up of Top-right. We respectively image the point s1 as ˜x1. Thus, there exists a correspondence ˜u1 ↔ ˜x1 by means of s1. We compute such a correspondence as an affine transformation H1 A between ˜u1 and ˜x1. This transformation holds for all the points on the triangle that contains ˜u1. There is a different transformation Hi A for each i-th triangle in the shape. Hence, the mapping between the reference frame and the reference image is a piecewise affine transformation. 68
• 91. 5.4.4 Optimization Outline We define the brightness dissimilarity function using Equation 5.42 as follows: D(F; µ) = T[F] − It[f3dmm(F, n(F); µ)], (5.43) where n ≡ n(˜u) is the vector normal to the plane that contains the point S(˜u), ∀˜u ∈ F. As in the rigid case (Equation 5.11), Equation 5.43 is continuous over F. We rewrite Equation 5.43 in residuals form as r(µ) = T[˜u] − It+1[f3dmm(˜u; µ)], (5.44) where we drop the parameter n from f3dmm for clarity. The corresponding linear model for the residuals of Equation 5.44 is ℓ(δµ) ≡ r(µ) + J(µ)δµ, (5.45) where J(µ) = ∂It+1[f3dmm(˜u; ˆµ)]/∂ˆµ |ˆµ=µ. (5.46) 5.4.5 Gradient Replacement We rewrite Equation 5.46 using the gradient replacement equation (Equation 5.3) as follows: J(µ) = (∇ˆuT[˜u])⊤ (∇ˆuf3dmm(˜u; µ))⁻¹ (∇ˆµf3dmm(˜u; µ)). (5.47) In the following, we analyze each term of Equation 5.47 separately. Template gradients on F The first term deals with the template derivatives on the reference frame F. These derivatives are identical to Equation 5.16, since they do not depend upon the target dynamics. Warp gradients on target coordinates The second term handles the gradients of the warp f3dmm(˜u; µ) with respect to the target coordinates ˜u. We compute the gradient as follows: ∇ˆuf3dmm(˜u, n; µ) = K(R + RBscn⊤ − Rtn⊤)K⁻¹HA = KR(I + Bscn⊤ − tn⊤)K⁻¹HA. (5.48) Equation 5.47 calls for the inverse of Equation 5.48, thus ∇ˆuf3dmm(˜u, n; µ)⁻¹ = HA⁻¹ K (I + Bscn⊤ − tn⊤)⁻¹ R⊤ K⁻¹. (5.49) Again, we analytically invert I + Bscn⊤ − tn⊤ = I + (Bsc − t)n⊤ by using the Sherman-Morrison inversion formula [K. B. Petersen]: (I + (Bsc − t)n⊤)⁻¹ = I − (Bsc − t)n⊤ / (1 + n⊤(Bsc − t)). (5.50) 69
• 92. Plugging Equation 5.50 into Equation 5.49 results in ∇ˆuf3dmm(˜u, n; µ)⁻¹ = HA⁻¹ K (I − (Bsc − t)n⊤/(1 + n⊤(Bsc − t))) R⊤ K⁻¹ = HA⁻¹ K ((1 + n⊤(Bsc − t))I − (Bsc − t)n⊤)/(1 + n⊤(Bsc − t)) R⊤ K⁻¹ = λ HA⁻¹ K (I + (n⊤Bsc)I − (n⊤t)I − Bscn⊤ + tn⊤) R⊤ K⁻¹, (5.51) where λ = 1/(1 + n⊤(Bsc − t)) is a homogeneous scale factor that depends on each target point. Target motion gradients The third term computes the gradients of the warp f3dmm(˜u; µ) with respect to the motion parameters µ: ∇ˆµf3dmm(˜u; µ) = [∇ˆRf3dmm(˜u; R)  ∇ˆtf3dmm(˜u; t)  ∇ˆcf3dmm(˜u; c)]. (5.52) We compute the derivatives of the warp with respect to each one of the rotation parameters as follows: ∇ˆ∆f3dmm(˜u; ∆) = KṘ∆K⁻¹HA˜u + KṘ∆Bscn⊤K⁻¹HA˜u = KṘ∆(I + Bscn⊤)K⁻¹HA˜u, (5.53) where Ṙ∆ is the derivative of the rotation matrix R with respect to the Euler angle ∆ = {α, β, γ}. We trivially compute the derivatives of the warp with respect to the translation parameters t as follows: ∇ˆtf3dmm(˜u; t) = Kn⊤K⁻¹HA˜u. (5.54) We additionally compute the derivatives of f3dmm with respect to the deformation parameters c: ∇ˆck f3dmm(˜u; ck) = KRBkn⊤K⁻¹HA˜u, (5.55) where Bk is the k-th column of the matrix Bs—i.e. Bk is the k-th mode of deformation. Assemblage of the Jacobian Substituting Equations 5.53, 5.54, and 5.55 back into Equation 5.47 yields the analytic form of each row of the Jacobian matrix, J⊤ = [J1 J2 J3 J4 J5 J6 J7 · · · J(6+NK)], (5.56) 70
  • 93. with J1 = ∇ˆuT [˜u]⊤ D˙Rα (I + Bsc) v, J2 = ∇ˆuT [˜u]⊤ D˙Rβ (I + Bsc) v, J3 = ∇ˆuT [˜u]⊤ D˙Rγ (I + Bsc) v, J4 = ∇ˆuT [˜u]⊤ Dr1n⊤ v, J5 = ∇ˆuT [˜u]⊤ Dr2n⊤ v, J6 = ∇ˆuT [˜u]⊤ Dr3n⊤ v, Jk = ∇ˆuT [˜u]⊤ DBkn⊤ v, k = 1, . . . , NK, (5.57) where D is short for D = λH−1 A K I + (n⊤ Bsct)I − (n⊤ t)I − Bscn⊤ + tn⊤ R⊤ , (5.58) v is short for v = K−1 HA˜u, and ri for i = 1, . . . , 3 is the i − th column vector of the rotation matrix R—i.e. R = (r1, r2, r3). 5.4.6 Systematic Factorization In this Section we introduce the factorization of Equation 5.57. As we will see in Chapter 7, the factorization of the nonrigid warp f3dmm does not increase the effi- ciency of the original model (Equation 5.57). The overhead of repeated operations between parameters is a computational burden. We solve the problem by intro- ducing a partial factorization procedure: we only factorize and precompute those nonrepeated combination of parameters, which is faster than computing the full factorization. Full Factorization We factorize Equation 5.57 using the theorems and lemmas from Appendix D. We present the full derivation of matrices S and M in Appendix G. As in the rigid case, we proceed with each row of the Jacobian matrix (Equation 5.56) by rewriting them as J⊤ 1×(6+K) = S⊤ M. (5.59) We write the row vector S⊤ as S⊤ = S⊤ 1 , S⊤ 1 , S⊤ 1 , S⊤ 4 , S⊤ 5 (210+280K+72K2) , (5.60) 71
  • 94. where the we define the vectors S⊤ i for i = 1, . . . , 5 as follows: S1 =                     D⊤ (I3 ⊗ n⊤ Bs)(I3 ⊗ v⊤ ) ⊤ D⊤ Bs(IK ⊗ n⊤ )(I3 ⊗ v⊤ ) ⊤ D⊤ (I3 ⊗ n⊤ Bs) I3 ⊗ vec(B⊤ s )v ⊤ − D⊤ Bs(IK ⊗ n⊤ ) I3 ⊗ vec(B⊤ s )v ⊤ D⊤ (I3 ⊗ v⊤ ) ⊤ − D⊤ (I3 ⊗ n⊤ )(I3 ⊗ v⊤ ) ⊤ D⊤ (I3 ⊗ n⊤ )(I3 ⊗ v⊤ ) ⊤ D⊤ (I3 ⊗ v⊤ vec(B⊤ s )⊤ ) ⊤ − D⊤ (I3 ⊗ n⊤ ) I3 ⊗ v⊤ vec(B⊤ s )⊤ ⊤ D⊤ (I3 ⊗ n⊤ ) I3 ⊗ v⊤ vec(B⊤ s )⊤ ⊤                     ⊤ 1×(63+81K+18K2) , S2 =         D⊤ n⊤ v(I3 ⊗ n⊤ Bs) ⊤ − D⊤ n⊤ v Bs(IK ⊗ n⊤ ) ⊤ D⊤ n⊤ v ⊤ − D⊤ n⊤ v(I3 ⊗ n⊤ ) ⊤ D⊤ n⊤ v(I3 ⊗ n⊤ ) ⊤         ⊤ 1×(21+6K) , and S3 =         D⊤ (n⊤ v)B(IK ⊗ (n⊤ Bs)) ⊤ − D⊤ (n⊤ v)Bs(IK ⊗ n⊤ )(I3K ⊗ vec(B)⊤ ) ⊤ D⊤ Bn⊤ v ⊤ − D⊤ (I3 ⊗ n⊤ )(n⊤ v)(I9 ⊗ vec(B)⊤ ) ⊤ D⊤ (I3 ⊗ n⊤ )(n⊤ v)(I9 ⊗ vec(B)⊤ ) ⊤         ⊤ 1×(31K+18K2) . (5.61) The matrix M comprises the motion terms; we define this matrix as M =       M1 0 0 0 0 0 M2 0 0 0 0 0 M3 0 0 0 0 0 M4 0 0 0 0 0 M5       (210+280K+72K2)×(6+K) , (5.62) 72
  • 95. where M1 =                   vec(˙R ⊤ α R(IK ⊗ c⊤ )) vec(˙R ⊤ α R(I3 ⊗ c⊤ )) vec((IK ⊗ c)R(I3 ⊗ c⊤ )) vec((IK ⊗ c)R(I3 ⊗ c⊤ )) vec(˙R ⊤ α R) vec(˙R ⊤ α R(I3 ⊗ t⊤ )) vec(˙R ⊤ α R(t⊤ ⊗ I3)) vec((I3 ⊗ c)R) vec((I3 ⊗ c)R⊤ (I3 ⊗ t⊤ )) vec((I3 ⊗ c)R⊤ (t ⊗ I3))                   (63+81K+18K2)×1 , M2 =                   vec(˙R ⊤ β R(IK ⊗ c⊤ )) vec(˙R ⊤ β R(I3 ⊗ c⊤ )) vec((IK ⊗ c)R(I3 ⊗ c⊤ )) vec((IK ⊗ c)R(I3 ⊗ c⊤ )) vec(˙R ⊤ β R) vec(˙R ⊤ β R(I3 ⊗ t⊤ )) vec(˙R ⊤ β R(t⊤ ⊗ I3)) vec((I3 ⊗ c)R) vec((I3 ⊗ c)R⊤ (I3 ⊗ t⊤ )) vec((I3 ⊗ c)R⊤ (t ⊗ I3))                   (63+81K+18K2)×1 , M3 =                   vec(˙R ⊤ γ R(IK ⊗ c⊤ )) vec(˙R ⊤ γ R(I3 ⊗ c⊤ )) vec((IK ⊗ c)R(I3 ⊗ c⊤ )) vec((IK ⊗ c)R(I3 ⊗ c⊤ )) vec(˙R ⊤ γ R) vec(˙R ⊤ γ R(I3 ⊗ t⊤ )) vec(˙R ⊤ γ R(t⊤ ⊗ I3)) vec((I3 ⊗ c)R) vec((I3 ⊗ c)R⊤ (I3 ⊗ t⊤ )) vec((I3 ⊗ c)R⊤ (t ⊗ I3))                   (63+81K+18K2)×1 , M4 =       (IK ⊗ c)R⊤ (I3 ⊗ c)R⊤ R⊤ (I3 ⊗ t)R⊤ (t ⊗ I3)R⊤       (21+6K)×3 , and M5 =       (I3K ⊗ c) (I3K ⊙ (I3 ⊗ c)) IK ((I3 ⊗ t) ⊙ IK) ((t ⊗ I3) ⊙ IK)       (31K+18K2)×K . (5.63) 73
  • 96. We assemble the Jacobian matrix (Equation 5.47) as J = SM, (5.64) where we define S as the concatenation of N rows S⊤ (Equation 5.60). Partial Factorization We introduce an alternate decomposition of the Jacobian matrix (Equation 5.47). The main feature of this decomposition is that it does not provide a full factorization: the factorization does not completely separate structure and motion terms, but provides a partial separation instead. In the experiments we show that this partial factorization is more efficient than using no factorization at all or using the full factorization procedure. The partial factorization provides an speed improvement as it precomputes some operations among shape parameters. We show the detailed derivation for the partial factorization of Equation 5.47 in Appendix F. The resulting elements for a row of the Jacobian are J1 =D1D2R⊤ ˙Rαt I3 + Bscn⊤ v, J2 =D1D2R⊤ ˙Rβt I3 + Bscn⊤ v, J3 =D1D2R⊤ ˙Rγt I3 + Bscn⊤ v, J4 =D1D2r1n⊤ v, J5 =D1D2r2n⊤ v, J6 =D1D2r3n⊤ v, Jk =D1D2R⊤ Bk, i = 1, . . . , K, (5.65) where D1 = I3P′ + [s1P + s2Q] Q′ , and D2 =  I3 ⊗   1 t c     . (5.66) Note that there are shape terms post-multiplying the motion term D2 (see Equa- tion 8.5), so we cannot express the Jacobian as in the full factorization case—i.e. J = SM. We show that the partial factorization (Equations F.9) is far more efficient than (1) not using factorization at all, and (2) using the full factorization (Equa- tion 5.61). The reason for the latter is that if we try to compute a full factorization from Equations 8.5, the computational cost increases due to the larger size of the inner product matrices. We give the theoretical foundations of this fact in Chapter 7. 74
• 97. Outline of Algorithm HB3DMM We define the full-factorization HB algorithm for the warp f3dmm as hb3dmm. We use the factorization equations (Equation 5.59) as the basis for our algorithm; we show the outline of algorithm hb3dmm in Algorithm 8. Algorithm 8 Outline of the full-factorized HB3DMM algorithm. Off-line: Let µi = µ0 be the initial guess. 1: for i = 1 to N do 2: Compute S_1^{(i)}, S_2^{(i)}, and S_3^{(i)} using Equation 5.61. 3: end for 4: Assemble the matrix S. On-line: 5: while no convergence do 6: Compute the residual function r(µi) from Equation 5.44. 7: Compute the matrix M(µi) using Equation 5.63. 8: Assemble the Jacobian: J(µi) = SM(µi) (Equation 5.59). 9: Compute the search direction: δµi = −(J(µi)⊤J(µi))⁻¹ J(µi)⊤ r(µi). 10: Additively update the optimization parameters: µi+1 = µi + δµi. 11: end while Outline of Algorithm HB3DMMSF In this section we define again the HB algorithm for the warp f3dmm, now for the partial factorization case; we rename this algorithm hb3dmmsf to differentiate it from the full-factorized algorithm—i.e. the hb3dmm algorithm. We use the partial factorization equations (Equation 5.65) as the basis for our algorithm; we show the outline of algorithm hb3dmmsf in Algorithm 9. 5.5 Summary • In this section we have analysed the HB factorization-based optimization in depth; we have shown that the efficiency of the method relies on (1) a gradient replacement procedure and (2) a neat factorization of the Jacobian matrix. • We have proposed a necessary requirement that constrains the motion model for the HB algorithm, and we have addressed a fundamental criticism of the HB algorithm by proposing a systematic factorization framework. • We have also introduced two motion/warping models that enable us to efficiently track 3D rigid targets—the shape-induced homography—and nonrigid targets—the nonrigid shape-induced homography—by using a factorization approach. 75
• 98. Algorithm 9 Outline of the HB3DMMSF algorithm. Off-line: Let µi = µ0 be the initial guess. 1: for i = 1 to N do 2: Compute D_1^{(i)} using Equation 5.66. 3: end for On-line: 4: while no convergence do 5: Compute the residual function r(µi) from Equation 5.44. 6: Compute D2 using Equation 5.66 on the current parameters t and c. 7: for i = 1 to N do 8: Compute J_1^{(i)}, . . . , J_{6+K}^{(i)} using Equation 5.65. 9: end for 10: Assemble the Jacobian J(µi). 11: Compute the search direction: δµi = −(J(µi)⊤J(µi))⁻¹ J(µi)⊤ r(µi). 12: Additively update the optimization parameters: µi+1 = µi + δµi. 13: end while 76
• 99. Chapter 6 Compositional Algorithms In this chapter we discuss compositional algorithms in greater depth than in Section 3.4. We organize this chapter as follows: in Section 6.1 we provide the basic assumptions of compositional image registration; besides, we give some insights into the workings of the IC algorithm, especially into the reversed roles of template and image; moreover, we introduce the Efficient Forward Compositional algorithm, and we show that IC can be derived as a special case of this algorithm. Section 6.2 introduces two basic requirements that compositional algorithms must hold. Finally, Section 6.3 studies in detail other compositional methods such as the Generalized Inverse Compositional algorithm. 6.1 Unravelling the Inverse Compositional Algorithm IC is known to be the fastest algorithm for image registration [Baker and Matthews, 2004; Buenaposada et al., 2009]. Although it is widely used [Brooks and Arbel, 2010; Dowson and Bowden, 2008; Guskov, 2004; Megret et al., 2008, 2006; Muñoz et al., 2005; Papandreu and Maragos, 2008; Romdhani and Vetter, 2003; Tzimiropoulos et al., 2011; Xu and Roy-Chowdhury, 2008], it is still not well understood how the IC algorithm works in terms of traditional gradient descent algorithms. We summarize these questions in the following: Convergence of compositional algorithms In GD optimization the convergence is guaranteed by construction: the algorithm looks for a set of parameters xk+1 ∈ RN such that the value of the cost function F : RN → R decreases—i.e. F(xk+1) < F(xk). This problem is solved by expressing xk+1 as xk+1 = xk + h, for some unknown h ∈ RN; notice that this is equivalent to the additive update step of the additive image registration algorithms (see Section 3.3). The values of h are computed by expanding F(xk+1) by Taylor series (cf. [Madsen et al., 2004; Press 77
• 100. et al., 1992]) as follows: F(xk+1) ≃ F(xk) + h⊤ F′(xk). Vector h is a descent direction for F at xk if h⊤ F′(xk) < 0; hence the requirement F(xk+1) < F(xk) holds, as F(xk+1) − F(xk) < 0. Then, the next iteration is computed by using xk+1 = xk + h. In the case of compositional algorithms we cannot use the previous approach: the next iteration in the search space is not computed as xk+1 = xk + h but as xk+1 = Ψ(xk, h) for some composition function Ψ. The algorithm is not a GD method in the strict sense. Convergence is assured in GD methods by construction: the cost function value at the next step is always lower than the previous one (cf. [Madsen et al., 2004; Press et al., 1992]). However, such a statement cannot be made for the IC algorithm, as it is not possible to relate the values of the objective function between two steps due to the non-additive step. Origins of inverse composition In Section 3.4.2 we showed that the crucial point in the improvement of efficiency of IC with respect to the FC algorithm is to rewrite the brightness error function: the FC brightness dissimilarity function, D(X; δµ) = T(X) − It(f(f(X; δµ); µ)), (6.1) is rewritten into the IC brightness dissimilarity D(X; δµ) = T(f−1(X; δµ)) − It(f(X; µ)). (6.2) The vector δµ comprises the optimization variables—µ is deemed constant. The local minimizer based on the residuals of Equation 6.2 has a constant Jacobian (cf. Section 3.4.2), unlike the minimizer based on Equation 6.1. In the original formulation of the IC algorithm [Baker and Matthews, 2001, 2004], Baker and Matthews simply stated Equation 6.2 without any further explanation: they did not justify how to transform the FC dissimilarity (Equation 6.1) into the IC dissimilarity (Equation 6.2). Here we show that this transformation depends on a change of variables in Equation 6.1. Can we always use Inverse Composition? [Baker and Matthews, 2004] state that we can reverse the roles of template and image provided that the following requirements on the warp f are satisfied (the sketch below checks them for a simple affine warp): 1. The warp is closed under composition (i.e. f(x; µ′) = f(f(x; δµ); µ)). 2. The warp has an inverse f−1 such that x = f−1(f(x; µ); µ), ∀µ. 3. The warp identity is µ = 0 (i.e. f(x; 0) = x). 78
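The following minimal sketch checks the three properties above for a 2D affine warp. The six-parameter encoding is ours for the illustration, so the identity parameters are (1, 0, 0, 0, 1, 0) rather than the µ = 0 convention used in the text.

import numpy as np

# Closure, invertibility and the existence of an identity element for a 2D
# affine warp f(x; mu); the test values are arbitrary.
def affine(mu):
    """3x3 homogeneous matrix of a 2D affine warp from its 6 parameters."""
    a11, a12, tx, a21, a22, ty = mu
    return np.array([[a11, a12, tx], [a21, a22, ty], [0.0, 0.0, 1.0]])

mu  = np.array([1.1, 0.1, 2.0, -0.2, 0.9, -1.0])
dmu = np.array([1.0, 0.05, 0.3, 0.0, 1.02, 0.1])
x   = np.array([3.0, 4.0, 1.0])

# 1. Closure: composing two affine warps yields another affine warp.
composed = affine(mu) @ affine(dmu)
assert np.allclose(composed[2], [0.0, 0.0, 1.0])
# 2. Invertibility: f^{-1}(f(x; mu); mu) = x.
assert np.allclose(np.linalg.inv(affine(mu)) @ affine(mu) @ x, x)
# 3. Identity element: (1, 0, 0, 0, 1, 0) gives f(x; mu0) = x.
assert np.allclose(affine([1.0, 0.0, 0.0, 0.0, 1.0, 0.0]) @ x, x)
print("affine warp satisfies the group requirements")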
• 101. These requirements imply that the warp f must form a group [Baker and Matthews, 2004]. However, the IC algorithm is not suitable for certain problems such as 2.5d tracking [Matthews et al., 2007]: even when the group constraint holds, the algorithm does not properly converge. We introduce two requirements that effectively constrain the warp. 6.1.1 Change of Variables in IC We define a new variable U = f(X; δµ), such that X = f−1(U; δµ) also holds. We substitute this variable in Equation 6.1 as follows: D(U; δµ) = T(f−1(U; δµ)) − It(f(U; µ)). (6.3) Note that Equations 6.1 and 6.3 are equivalent by construction: we have just changed the domain in which the functions are defined (see Figure 6.1), but the coordinates X and f−1(U; δµ) are exactly the same—ditto for U and f(X; δµ). Also note that Equation 6.3 is similar to the IC dissimilarity function—cf. Equation 6.2. The only difference is that the two equations are defined in different domains: Equation 6.3 is defined in domain U and Equation 6.2 (IC) is defined in X. This difference is not trivial at all: X and U are only identical if δµ = 0 (see Figure 6.1); hence, the IC problem should be solved in the unknown domain U—we do not know the coordinates U, as they depend on the unknown variables δµ. Thus, we face a chicken-and-egg situation: we have to know δµ to solve for δµ. [Baker and Matthews, 2004] simply ignore the problem and solve for δµ using Equation 6.2, which raises the following question: How does the IC algorithm (Equation 6.2) converge, if it is defined in the wrong domain? We shall show that it does not always do so; we demonstrate this assertion by introducing a new FC algorithm that is equivalent to IC under certain assumptions only. 6.1.2 The Efficient Forward Compositional Algorithm We define the Efficient Forward Compositional (EFC) algorithm as a FC algorithm with a constant Jacobian: the EFC is similar to the IC—both are GN-like methods with constant Jacobian. Their critical difference is that EFC does not reverse the roles of template and image: the brightness dissimilarity is linearized in the image—as in FC—and not in the template—as the IC algorithm does. First, we rewrite Equation 6.1 such that the variable t appears in explicit form, D(X; δµ, t + 1) = T(X) − I(f(f(X; δµ); µt), t + 1), (6.4) 79
  • 102. T It+1 Figure 6.1: Change of variables in IC. (Top-left) We overlay the target region X (yellow square) onto the template T . (Top-right) We transform the target region X by means of f(f(X; δµ); µt), and we overlay it onto the image It+1 (yellow square). (Bottom- left) We overlay the target region U = f(X; δµ) (green square) onto the template T . We also depict the transformed region f−1 (U; δµ) (dotted blue line). (Bottom-right) We transform the target region U by means of f(U; µt), and we overlay it onto the image It+1 (green square). Notice that the regions X and f−1 (U; δµ) delimit identical image areas—ditto for f(f(X; δµ); µt) and f(U; µt)—but X and U do not. 80
  • 103. where the dissimilarity is now a two-variable function. Let tτ = t + τ be a time instant in the range t ≤ t+τ ≤ t+1 such that its brightness constancy assumption, I(f(X; µtτ ); tτ ) = T (X), (6.5) holds for parameters µtτ . We rewrite Equation 6.4 as a residuals vector as follows: rEFC(µtτ ; ˆµ, ˆτ) ≡ T(x) − I(f(f(x; ˆµ); µtτ ), ˆτ), (6.6) where ˆµ are registration parameters such that T(x) = I(f(f(x; ˆµ); µtτ ), ˆτ). (6.7) We approximate the residuals function rEFC(µtτ ; 0 + δµ, ˆτ) by using a first order Taylor expansion at ˆµ = 0 and ˆτ = tτ , rEFC(µtτ ; δµ, t+1) ≡ rEFC(µtτ ; 0, tτ )+∇ˆµrEFC(µtτ ; 0, tτ )δµ+∇ˆτ rEFC(µtτ ; 0, tτ )∆t+O(δµ, ∆t)2 , (6.8) where ∆t = t + 1 − tτ , rEFC(µtτ ; 0, tτ ) = T(x) − I(f(f(x; 0); µtτ ); tτ ) = T(x) − I(f(x; µtτ ); tτ ), (6.9) ∇ˆµrEFC(µtτ ; 0, tτ ) = − ∂I(f(f(x; ˆµ); µtτ ), tτ ) ∂ ˆµ ˆµ=0 , (6.10) and ∇ˆτ rEFC(µtτ ; 0, tτ ) = − ∂I(f(f(x; 0); µtτ ), ˆτ) ∂ˆτ ˆτ=tτ . (6.11) This linear approximation is valid for any µtτ provided that δµ and ∆t are small enough. We can then make the additional approximation ∂I(f(f(x; 0; µtτ ), ˆτ) ∂ˆτ ˆτ=tτ ∆t ≈ I(f(f(x; 0); µtτ ), t+1)−I(f(f(x; 0); µtτ ), tτ )+O(∆t)2 . (6.12) Inserting Equation 6.12 into Equation 6.8 we get rEFC(µt+τ ; δµ, t + 1) ≃ ℓ(δµ) ≡ rEFC(µt+τ ; 0, t + 1) + JEFC(µt+τ ; 0, tτ )δµ, (6.13) where rEFC(µt+τ ; 0, t + 1) = T(x) − I(f(f(x; 0); µt+τ ), t + 1), (6.14) JEFC(µt+τ ; 0, tτ ) = ∂I(f(f(x; ˆµ); µt, tτ ) ∂ ˆµ ˆµ=0 = ∂I(f(ˆx; µtτ ), tτ ) ∂ˆx ˆx=x ⊤ ∂f(x; ˆµ) ∂ ˆµ ˆµ=0 . (6.15) The first term of Equation 6.15 is the gradient of the warped image at time tτ . The second is the Jacobian of the warp at µ = 0, which is constant. From Equation 6.5 we know that the image at time tτ warped by µtτ is identical to the template image. 81
  • 104. Note that, actually, only the images associated to time instants t and t + 1 shall be available. This is not a problem, since we are only interested in substituting the gradient of the warped image by that of the template. Therefore, the gradient of the warped image should be equal to the gradient of the template: ∂I(f(ˆx; µtτ ); tτ ) ∂ˆx ˆx=x = ∂T(ˆx) ∂ˆx ˆx=x . (6.16) Equation 6.16 holds if the GEE does. In this case, we may rewrite Equation 6.15 as JEFC(0) = ∂T(ˆx) ∂ˆx ˆx=x ⊤ ∂f(x; ˆµ) ∂ ˆµ ˆµ=0 , (6.17) which is constant by construction as it only depends on x and µ = 01 —thus, we remove the dependencies on µt+τ and tτ from Equation 6.17. Outline of the EFC Algorithm We compute the local minimizer of ℓEFC(δµ) by using: δµ = − JEFC(0)⊤ JEFC(0) −1 JEFC(0)⊤ rEFC(0), (6.18) which will be iterated until convergence using µt+1 = µt ◦ δµ as update rule—the algorithm is still forward compositional. We outline the algorithm in Figure 6.2 and Algorithm 10. Algorithm 10 Outline of the Efficient Forward Compositional algorithm. Off-line: 1: Compute the constant Jacobian, JEFC(0), by using Equation 6.17. On-line: Let µi = µ0 be the initial guess. 2: while no convergence do 3: Compute the residual function at rEFC(µi; 0, t + 1) from Equation 6.6. 4: Compute the search direction: δµi = − JEFC(0)⊤ JEFC(0) −1 JEFC(0)⊤ rEFC(µi; 0, t + 1). 5: Update the optimization parameters:µi+1 = µi ◦ δµi. 6: end while 6.1.3 Rationale of the Change of Variables in IC We show now that we may transform any EFC problem (Equation 6.1) into its corresponding IC equivalent (Equation 6.2) in the following proposition: 1 We thankfully acknowledge J. M. Buenaposada for rewriting the FC algorithm by using the GEE 82
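A minimal sketch of the on-line loop of Algorithm 10 follows; residuals and compose are placeholders for Equation 6.6 and for the warp-specific composition rule µ ← µ ∘ δµ.

import numpy as np

# Minimal sketch of Algorithm 10 (EFC): the Jacobian J_EFC(0) is assembled
# once off-line from template gradients (Equation 6.17); on-line we only
# evaluate the residuals and compose the update.
def efc_loop(J0, residuals, compose, mu0, n_iter=20, tol=1e-6):
    # Precompute the constant pseudo-inverse (J0^T J0)^{-1} J0^T off-line.
    J0_pinv = np.linalg.pinv(J0)
    mu = mu0.copy()
    for _ in range(n_iter):
        r = residuals(mu)                     # brightness residuals at mu
        delta = -J0_pinv @ r                  # constant-Jacobian GN step
        mu = compose(mu, delta)               # compositional update, not additive
        if np.linalg.norm(delta) < tol:
            break
    return mu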
  • 105. Figure 6.2: Forward compositional image registration. We compute the target region on the frame t + 1 (Image) using the parameters of frame t (µt). Using the target region at the Template we compute a Dissimilarity Measure. We linearize the dissimilarity measure around 0 and we compute descent direction in the search space using Least- squares. We update the parameters using composition and we recompute the target region on frame t + 1 using the new parameters. The process is iterated until convergence. 83
• 106. Proposition 4. The EFC problem is equivalent to the IC problem. Proof. The GEE holds by definition of EFC. Thus, Corollary 4 holds up to a first order approximation. Let us assume there exists an open set Ω ∋ x0, and some δµ such that f−1(x0; δµ) = x′ ∈ Ω; then the bcc holds both in x0 and x′, T[x0] = It+1[f(f(x0; δµ); µt)], (6.19) and T[x′] = It+1[f(f(x′; δµ); µt)]. (6.20) Thus, we may rewrite Equation 6.20 as T[f−1(x0; δµ)] = It+1[f(f(f−1(x0; δµ); δµ); µt)]—where f−1(x0; δµ) = x′ and f(x′; δµ) = x0—which is the IC equivalent formulation of Equation 6.19. 6.1.4 Differences between IC and EFC Although both IC and EFC are equivalent according to Proposition 4, there are subtle differences between them. In Section 6.1.1 we introduced the notion that the IC dissimilarity function, D(X; δµ) = T(f−1(X; δµ)) − It(f(X; µ)), (6.21) is the result of a change of variables in the EFC dissimilarity function, D(X; δµ) = T(X) − It(f(f(X; δµ); µ)). However, the original IC formulation computes the warp function at the template, not its inverse (cf. [Baker and Matthews, 2004]), D(X; δµ) = T(f(X; δµ)) − It(f(X; µ)). (6.22) Equations 6.21 and 6.22 are equivalent but they do not yield the same result; the parameters δµ computed from Equation 6.21 are different from those computed from Equation 6.22, but both yield the same parameters µt+1: the update function for Equation 6.21 is µt+1 = µt ◦ δµ, whereas Equation 6.22 computes µt+1 as µt+1 = µt ◦ δµ−1. Thus, although equivalent, EFC has an immediate advantage over the original IC: efficiency in the EFC algorithm does not depend on any inversion, neither in the warp nor in the parameter update function. Note that inversion may pose a problem for warps such as 3DMM or AAM—as pointed out in [Romdhani and Vetter, 2003]. 84
  • 107. 6.2 Requirements for Compositional Warps In this section we state two requirements that every efficient compositional algorithm should meet; note that we refer to efficient methods, that is, the EFC, IC, and GIC algorithms. We intentionally leave out the FC algorithm, as it must verify only one of the requirements. 6.2.1 Requirement on Warp Composition The first requirement constrains the properties of the motion warp; [Baker and Matthews, 2004] states a similar property by requiring the warp f to be closed under composition, that is f(x; µt+1) = f(f(x; δµ); µt), (6.23) for some parameters µt, µt+1, and δµ. We generalize this property by allowing the composition between different warps: f, g : X × P → X are two warps map- ping domain X into itself, that are parameterized in the domain P; we state the requirement as follows: Requirement 2. The composition f ◦ g must be a warp f, that is, for any µ ∈ P there exist δµ ∈ P and µ′ ∈ P such that f(X; µ′ ) = f(g(X; δµ); µ). This generalization is useful for some warps—e.g. for plane+parallax homogra- phies h3dpp (see Section C)—although for most of the cases we can safely assume that g ≡ f. Besides, there must exist the identity parameters µ0 such that g(X; µ0) = X. This constraint is similar to the one proposed in [Baker and Matthews, 2004] where µ0 = 0. Requirement 2 is mandatory to express the dissimilarity function D(X; µt+1) = T (X) − It+1(f(X; µt+1)), (6.24) into the equivalent error function D(X; µt+1) = T (X) − It+1(f(f(X; δµ); µt)), (6.25) which is intrinsic to every compositional algorithm—FC, EFC, IC, and GIC. 6.2.2 Requirement on Gradient Equivalence The second requirement is absolutely necessary to achieve efficiency in compositional algorithms. 85
  • 108. Requirement 3. A GN-like algorithm with constant Jacobian is feasible if the brightness error GEE holds. Requirement 3 lets us transform the FC algorithm into the EFC constant Jacobian algorithm (see Section 6.1.2). Furthermore, Requirement 3 allows to effectively per- form the change of variables needed by IC algorithm (see Section 6.1.3). Notice that Requirement 3 is similar to the Requirement 1 proposed for additive image registra- tion algorithms; although both requirements are identical, we distinguish between them to have separate requirements for additive and compositional approaches. 6.3 Other Compositional Algorithms [Baker et al., 2004b] introduced FC and IC as the basic algorithms for compositional image registration. However, other authors have also proposed modifications to these algorithms to extend their functionality. In this section we review the Generalized Inverse Compositional, which extends the IC to use other optimization methods than GN. 6.3.1 Generalized Inverse Compositional Algorithm We introduced the GIC algorithm [Brooks and Arbel, 2010] in Section 3.5: the motivation under the GIC was to create an efficient algorithm—i.e. with constant Jacobian—that could be used with other optimization methods with additive update different than GN such as bgfs, etc. We now review the GIC algorithm using the change-of-variable procedure that we applied to IC. We recall the IC residuals from Equation 3.20, r(δµ) ≡ T(f(x; δµ)) − It+1(f(x; µt)). (6.26) We rewrite Equation 6.26 introducing a function ψ such that δµ = ψ(µt, µt+1); note that we can always define ψ because µt+1 and µt are related through f(x; µt+1) = f(f(x; δµ); µt). Thus, we rewrite Equation 6.26 as follows: r(δµ) ≡ T(f(x; ψ(µt, µt+1))) − It+1(f(x; µt)). (6.27) Notice that Equation 6.27 does not explicitly depend on δµ, but on µt+1. However, the GIC algorithm implicitly defines this relationship as µt+1 = µt + δµ (as in the LK algorithm). Substituting this constrain in Equation 6.27 we have r(µt + δµ) ≡ T(f(x; ψ(µt, µt + δµ))) − It+1(f(x; µt)). (6.28) We linearize Equation 6.28 around µt by using Taylor series, r(µt + δµ) ≃ ℓ(δµ) ≡ r(µt) + J(µt)δµ, (6.29) 86
  • 109. where r(µt) =T(f(x; ψ(µt, µt))) − It+1(f(x; µt)), =T(f(x; 0)) − It+1(f(x; µt)), (6.30) and J(µt) = ∂T(f(x; ψ(µt; ˆµ))) ∂ ˆµ ˆµ=µt . (6.31) Notice that ψ(µt, µt) = 0 by definition. Unlike the Jacobian in the IC algorithm (Equation 3.22), the Jacobian in Equation 6.31 is not constant as it depends on µt. However, we can obtain a pseudo-constant Jacobian from Equation 6.31 by using the chain rule: J(µt) = ∂T(f(x; ψ(µt; ˆµ))) ∂ ˆµ ˆµ=µt , = ∂T(f(x; ˆψ)) ∂ ˆψ ˆψ=ψ(µt,µt) ∂ψ(µt, ˆµ) ∂ ˆµ ˆµ=µt , = ∂T(f(x; ˆψ)) ∂ ˆψ ˆψ=0 JIC(0) ∂ψ(µt, ˆµ) ∂ ˆµ ˆµ=µt , =JIC(0) ∂ψ(µt, ˆµ) ∂ ˆµ ˆµ=µt . (6.32) The Jacobian matrix J(µt) is not constant: JIC(0) is constant but ∇µψ(µt) is not— it depends on µt. However, computing J(µt) is efficient as we only need to compute ∇µψ(µt), which is a square matrix of the size of the number of parameters, and the product matrix of Equation 6.32. Again, we compute the local minimizer of Equation 6.29 using least-squares: δµ = − J(µt)⊤ J(µt) −1 J(µt)⊤ r(µt). (6.33) Unlike in the IC algorithm, where the update is compositional, the GIC algorithm additively updates the current parameters with the descent direction. We outline the algorithm in Figure 6.3 and Algorithm 11. Discussion of the GIC algorithm The GIC algorithm expresses the compositional optimization in terms of the usual gradient descent formulation. However, the whole procedure still depends upon the implicit change of variables of IC residuals (cf. Equations 6.26 and 6.27). Thus, GIC must comply the same requirements that IC, namely Requirements 2 and 3. This implication reduces the number of warps that provide good convergence for GIC. One could infer at a first glance that GIC is a slower copy of IC; nonetheless, we must take into account of the impact on the algorithm performance due to the use of more powerful optimization schemes such as bgfs or Quasi-Newton [Brooks and Arbel, 2010]. 87
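A minimal sketch of the resulting GIC iteration follows; J_ic0 is the constant IC Jacobian, and dpsi is a placeholder returning the small matrix ∂ψ/∂µ̂ at the current parameters (Equation 6.32).

import numpy as np

# Minimal sketch of the GIC iteration described above: the large template-side
# Jacobian J_IC(0) is constant and precomputed, and only the small matrix
# dpsi(mu) (one column per parameter) is rebuilt per iteration.
def gic_loop(J_ic0, residuals, dpsi, mu0, n_iter=20, tol=1e-6):
    mu = mu0.copy()
    for _ in range(n_iter):
        J = J_ic0 @ dpsi(mu)                 # pseudo-constant Jacobian (Eq. 6.32)
        r = residuals(mu)
        delta, *_ = np.linalg.lstsq(J, -r, rcond=None)
        mu = mu + delta                      # additive update, unlike IC
        if np.linalg.norm(delta) < tol:
            break
    return mu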
  • 110. Algorithm 11 Outline of the Generalized Inverse Compositional algo- rithm. On-line: Let µi = µ0 be the initial guess. 1: while no convergence do 2: Compute the residual function at r(µi) from Equation 6.26. 3: Linearize the dissimilarity: J = ∇µr(0)∇µψ(µt), using Equation 6.32. 4: Compute the search direction: δµi = − J(µi)⊤ J(µi) −1 J(µi)⊤ r(µi). 5: Update the optimization parameters:µi+1 = µi + δµi. 6: end while Figure 6.3: Generalized inverse compositional image registration. We compute the target region on the frame t + 1 (Image) using the parameters of frame t (µt). Using the target region at the Template we compute a Dissimilarity Measure. We linearize the dissimilarity measure around 0 and we compute descent direction in the search space using Least-squares. We update the parameters using composition and we recompute the target region on frame t+1 using the new parameters. The process is iterated until convergence. 88
  • 111. Table 6.1: Relationship between compositional algorithms and warps Warp Name FC EFC IC GIC 2D Affine (2DAFF) YES YES YES YES Homography (H8) YES YES YES YES Plane-induced Homography (H6) NO(1) NO(1) NO(1) NO(1) Plane+Parallax (H6PP) YES NO(2) NO(2) NO(2) 3D Rigid Body (3DRT) YES NO(2) NO(2) NO(2) (1) Does not meet Requirement 2 (2) Does not meet Requirement 3 6.4 Summary • In this section we have analysed in detail the compositional image alignment approach, and introduced two requirements that a warp function must satisfy to be used within this paradigm. • We have introduced the Efficient Forward Compositional (EFC), a new com- positional algorithm, and proved that it is equivalent to the well-known IC algorithm. The EFC algorithm provides a new interpretation of IC that allow us to state a basic requirement such that the algorithm is valid. • We have also reviewed the GIC image alignment algorithm, and proved that its requirements for convergence are the same to those of IC. Table 6.2 summarizes the requirements and the principal characteristics of the al- gorithms reviewed in Chapters 5 and 6. Table 6.1 compares each compositional algorithm to the warps introduced in Chapter 4. We consider whether a warp is suitable for an optimization algorithm or not—YES/NO in the table—in terms of the compliance of the warp with the algorithm requirements. 89
  • 112. Table 6.2: Requirements for Optimization Algorithms Warp Name Jacobian Update Rule Warp Requirements Lucas-Kanade (LK) Variable Additive None Hager-Belhumeur (HB) Part-constant(1) Additive Requirement 1 Forward Compositional (FC) Variable Compositional Requirement 2 Inverse Compositional (IC) Constant Compositional Requirement 2, and Requirement 3 Efficient Forward Compositional (EFC) Constant Compositional Requirement 2, and Requirement 3 Generalized Inverse Compositional (GIC) Part-Constant(2) Additive Requirement 2, and Requirement 3 (1) The Jacobian is partially factorized (2) The Jacobian is post-multiplied by a nonconstant matrix 90
• 113. Chapter 7 Computational Complexity In this chapter we study the resources that the registration algorithms require to solve a problem—i.e. their computational complexity. We organize the chapter as follows: Section 7.1 describes the measures and the criteria that we shall use to compare the complexities of the algorithms. Section 7.2 introduces the algorithms that we shall experimentally evaluate in later chapters; we propose a naming convention for the algorithms and we define two sets of testing algorithms: additive and compositional algorithms. Finally, in Section 7.3, we compute the theoretical complexity of each algorithm, and we provide some comparisons between them. 7.1 Complexity Measures We can measure the complexity by using either (1) the time that an algorithm requires, its time complexity, or (2) the computational resources—i.e. memory space—that the algorithm requires, its computational complexity. Both measures are often expressed as a function of the length of the input: the running time of an algorithm depends on the size of the input (e.g. larger problems require more run-time or more memory). In the analysis of algorithms, which provides theoretical estimates for the resources needed by an algorithm, the big O notation describes the usage of computational resources as a function of the problem size. For example, finding an item in an unsorted vector takes O(n) time, where n is the length of the vector. Run-time grows with the length, but even if we increase the size of the vector, the time complexity of the algorithm is still O(n). 7.1.1 Number of Operations Although big O notation is the most usual measure in algorithm analysis, we prefer to define our own measure. The reason is that big O notation is not suitable for fine-grained comparisons: for example, both the IC and HB algorithms yield O(n) complexity, while we know that the former is more efficient than the latter. We provide a fine-grained comparison by using the number of operations. We define the number of operations of an algorithm, Θ, as the total aggregate of multiplications 91
and additions for each step of the algorithm. The number of operations of some algorithm alg, Θalg, is written as

    Θalg = <number of multiplications> M + <number of sums> A,    (7.1)

where M and A stand for multiplications and additions, respectively. Notice that the + operator is used only for the sake of notation: it indicates that the total number of operations is the aggregate of the number of multiplications and the number of additions, but it is not an actual addition. As in big O notation, the number of operations of an algorithm depends on the problem size. We use two variables to account for the scale of the problem: NΩ represents the size of the visible target region, and K represents the number of deformation basis vectors.

7.1.2 Complexity of Matrix Operations

In this section we describe the number of operations for the most common matrix operations: the dot product, the matrix product, and the matrix sum.

Vector Scalar Product. If a = (a1, . . . , an)⊤ and b = (b1, . . . , bn)⊤ are n × 1 vectors, we define their scalar or dot product as

    a⊤b = a1 b1 + . . . + an bn.    (7.2)

We compute the number of operations of the vector scalar product by counting the products and sums in Equation 7.2,

    Θa⊤b = <n> M + <n − 1> A.    (7.3)

The complexity depends on the number of elements of the vector, n.
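To make the operation-count notation of Equations 7.1–7.3 concrete, the following short Python sketch (an illustration of ours, not the thesis implementation, which was written in MATLAB) represents a count Θ as a pair of multiplications and additions and aggregates counts the way the + of Equation 7.1 does.

    # Represent an operation count Theta as (multiplications, additions).
    def theta_dot(n):
        """Count of an n-dimensional dot product (Equation 7.3): n M + (n-1) A."""
        return (n, n - 1)

    def theta_sum(*counts):
        """Aggregate several counts, mirroring the '+' of Equation 7.1."""
        return (sum(c[0] for c in counts), sum(c[1] for c in counts))

    # Example: two dot products of length 6 cost 12 M + 10 A in total.
    print(theta_sum(theta_dot(6), theta_dot(6)))   # (12, 10)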
Matrix Product. If A is an m × n matrix and B is an n × p matrix, then the matrix product AB is

    AB = (a⊤1; . . . ; a⊤m)(b1 · · · bp) = (a⊤i bj),   i = 1, . . . , m,  j = 1, . . . , p,    (7.4)

where a⊤i = (ai1, . . . , ain) is the i-th row of matrix A, and bj = (b1j, . . . , bnj)⊤ is the j-th column of matrix B. The final statement of Equation 7.4 rewrites the matrix product as m × p dot products of rows and columns. Hence, we compute the number of operations of the matrix product from the Θ of the scalar product:

    ΘAB = (mp) Θa⊤b = <mpn> M + <mp(n − 1)> A.    (7.5)

Matrix Addition. If A and B are m × n matrices, then their sum A + B is the entrywise matrix

    A + B = (aij + bij),   i = 1, . . . , m,  j = 1, . . . , n.    (7.6)

There are no multiplication operations in the definition, so the complexity of summing two matrices is

    ΘA+B = <mn> A.    (7.7)

We summarize the complexities of matrix operations in Table 7.1.

Table 7.1: Complexity of matrix operations.

Operation | Multiplications | Additions
a⊤(1×n) b(n×1) | n | n − 1
a(n×1) b⊤(1×n) | n² | 0
A(m×n) B(n×p) | mpn | mp(n − 1)
A(m×n) + B(m×n) | 0 | mn

7.1.3 Comparing Algorithm Complexities

When computing the complexity of an algorithm—using the operations from Table 7.1—we compare two algorithms as fairly as we can by making the following assumptions.

Only Count Non-Zero Operations. If we compute the product of two matrices, we only take into account those operations that affect non-zero entries. A typical example is a matrix multiplication that involves a Kronecker product. Let a and b
be n × 1 vectors and let Im be the m × m identity matrix; the product (Im ⊗ a)(Im ⊗ b⊤) is

    (Im ⊗ a)(Im ⊗ b⊤) = diag(a, . . . , a) diag(b⊤, . . . , b⊤) = diag(ab⊤, . . . , ab⊤),    (7.8)

a block-diagonal matrix whose diagonal contains m copies of the n × n block ab⊤; the remaining blocks are zero. The complexity of this matrix product, if computed naively, is Θ = <m³n²> M + <m²n²(m − 1)> A. Nonetheless, the last statement of Equation 7.8 shows that many of these sums and products operate over zero entries of the matrices; hence, these operations can be spared. In fact, the non-zero operations are those on the block-diagonal of the result matrix: the diagonal comprises m blocks whose complexities are <n²> M each. Thus, the total number of non-zero operations for Equation 7.8 is <mn²> M.

Neglect Duplicated Operations. If an operation has to be computed several times, we count that operation just once. A typical example is a matrix product with repeated entries (as in the Kronecker product): in Equation 7.8, each block on the diagonal of the result matrix is the same product ab⊤, so we denote the product complexity as <n²> M instead of <mn²> M.

Matrix Chain Multiplication. As we showed in Section 5.3, the factorization of the Jacobian matrix is generally not unique. Furthermore, given a single factorization in the form of a chain of matrix products, there are many ways to choose the order in which we perform the multiplications. Matrix chain multiplication, or matrix parenthesization, is a well-known optimization problem [Cormen et al., 2001]. We may use dynamic programming [Neapolitan and Naimipour, 1996] to compute the most efficient multiplication order for our factorization.

7.2 Algorithm Naming Conventions

We define a testing algorithm as the combination of an optimization scheme and a warp. We write a given algorithm by using fixed-width fonts—e.g. HB3DMM is the combination of the optimization algorithm HB and the warp 3DMM.
  • 117. 7.2.1 Additive Algorithms We present the testing algorithms that use additive update in Table 7.2. For con- venience, we keep the naming convention for optimization algorithms that we used in Chapter 3; warp names are accordingly taken from Table 4.1 and Chapter 5. Table 7.2: Additive testing algorithms. Algorithm Warp Optimization Commentaries LK3DTM 3D Shape-induced Homography (f3dtm) LK (Algorithm 2) We use the original GN algorithm from [Lucas and Kanade, 1981]. HB3DTM 3D Shape-induced Homography (f3dtm) HB (Algorithm 3) We implement Algo- rithm 7 (see page 64). HB3DRT 3D Rigid Body (f3drt) HB (Algorithm 3) We use the original algorithm from [Sepp and Hirzinger, 2003]. LK3DMM Nonrigid Shape-induced Homography (f3dmm) LK (Algorithm 2) We use the original GN algorithm from [Lucas and Kanade, 1981]. HB3DMM Nonrigid Shape-induced Homography (f3dmm) HB (Algorithm 3) We implement Algo- rithm 8 (see page 75). HB3DMMSF Nonrigid Shape-induced Homography (f3dmm) HB (Algorithm 3) We implement Algo- rithm 9 (see page 76). LKH8 Cartesian Homography (fh82d) LK (Algorithm 2) We use the original GN algorithm from [Lucas and Kanade, 1981]. LKH6 Plane-induced Homography (fh6p) LK (Algorithm 2) — 95
7.2.2 Compositional Algorithms

We present the testing algorithms that use a compositional update in Table 7.3. For convenience, we keep the naming convention for optimization algorithms that we used in Chapter 3; warp names are accordingly taken from Table 4.1 and Chapter 5.

Table 7.3: Compositional testing algorithms.

Algorithm | Warp | Optimization | Commentaries
ICH8 | Cartesian Homography (fh82d) | IC (Algorithm 5) | —
GICH8 | Cartesian Homography (fh82d) | GIC (Algorithm 11) | —
ICH6 | Plane-induced Homography (fh6p) | IC (Algorithm 5) | —
GICH6 | Plane-induced Homography (fh6p) | GIC (Algorithm 11) | —
FCH6PP | Plane+Parallax Homography (fh6pp) | FC (Algorithm 4) | —

7.3 Complexity of Algorithms

In this section we show the computational complexity of several testing algorithms. We are especially interested in comparing the complexities of the additive algorithms: we show that the extensions of the HB algorithm that track 3D targets—either rigid or nonrigid—are much more efficient than their LK counterparts.

A Word About Implementation. The complexity of a testing algorithm is related to one iteration of the optimization loop: the total complexity is the sum of the complexities of each step of the algorithm. We ensure the scalability of our complexity estimates by using the following variables:
1. NΩ: the number of visible points in the current view—i.e. NΩ = |Ω|, see Page 122. This variable measures how many times a given operation is repeated (once per visible point in the target region).

2. K: the number of deformation components of a morphable model.

Another implementation issue is how we deal with derivatives. We compute both the image gradients and the Jacobian by using central differences [Press et al., 1992],

    Jµi = ( r(µi + δ) − r(µi − δ) ) / (2δ),    (7.9)

where Jµi = ∇µi r(µ) is the i-th column of the Jacobian matrix of the iterative optimization. We also compute the derivatives of the function ψ in the GIC algorithm (see Equation 6.32) by numerical differentiation instead of explicit methods.

7.3.1 Additive Algorithms

Tables 7.4–7.9 show the complexities of the additive algorithms from Table 7.2. We break down every algorithm into its basic steps; we compute the number of operations for each step by using the conventions in Section 7.1. The detailed steps of the derivation are shown in Appendix H. The final complexity is the sum of the number of operations of each step of the algorithm.

Table 7.4: Complexity of Algorithm LK3DTM.

Step | Action | Multiplications | Additions
1. | Compute visibility set Ω. | — | —
2. | Compute J | 894NΩ | 685NΩ
   | Compute f3dtm using Equation 5.10. | 74 | 51
   | Compute J using Equation 7.9. | 6 × 149 | 6 × 115
3. | Compute J⊤J. | 36NΩ | −36 + 36NΩ
4. | Invert J⊤J | — | —
5. | Compute r(µ) (Equation 5.12) | 74NΩ | 52NΩ
6. | Compute J⊤r(µ) | 6NΩ | −6 + 6NΩ
7. | Compute (J⊤J)⁻¹ J⊤r(µ) | 36 | 30
TOTAL | | 36 + 1010NΩ | −12 + 779NΩ

We summarize the total complexities of the additive registration algorithms in Table 7.10. Direct comparison of the values in Table 7.10 is difficult because the complexities depend upon the variables NΩ and K. We ease the comparison by plotting the complexities for different values of NΩ and K in Figure 7.1.
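As an illustration of the central-difference Jacobian of Equation 7.9, the following Python sketch (our own helper names; the residual function r is assumed to be supplied by the caller, and this is not the thesis's MATLAB code) builds J column by column.

    import numpy as np

    def numerical_jacobian(r, mu, delta=1e-4):
        """Central-difference Jacobian of a residual r: R^P -> R^N (Equation 7.9)."""
        mu = np.asarray(mu, dtype=float)
        r0 = np.asarray(r(mu))
        J = np.zeros((r0.size, mu.size))
        for i in range(mu.size):
            e = np.zeros_like(mu)
            e[i] = delta
            # i-th column: perturb only the i-th parameter by +/- delta.
            J[:, i] = (np.asarray(r(mu + e)) - np.asarray(r(mu - e))) / (2.0 * delta)
        return J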
  • 120. Table 7.5: Complexity of Algorithm HB3DTM. Step Action Multiplications Additions 1. Compute visibility set Ω. — — 2. Compute J 81+ 75NΩ 54+ 66NΩ Compute R⊤ ˙R{α,β,γ}. 81 54 Compute J using Equation 5.31. 75 66 3. Compute J⊤ J. 36NΩ −36+ 36NΩ 4. Invert J⊤ J — — 5. Compute r(µ) (Equation 5.12) 74NΩ 52NΩ 6. Compute J⊤ r(µ) 6NΩ −6+ 6NΩ 7. Compute J⊤ J −1 J⊤ r(µ) 36 30 TOTAL 117+ 191NΩ 42+ 160NΩ Table 7.6: Complexity of Algorithm LK3DMM. Step Action Multiplications Additions 1. Compute visibility set Ω. — — 2. Compute J (1002+203K+6K2)NΩ (762+175K+3K2)NΩ Compute f3dmm using Equation 5.42. 83+3K 57+3K 3. Compute J⊤ J. (36+12K+K2)NΩ (−36−12K−K2) +(36+12K+K2)NΩ 4. Invert J⊤ J — — 5. Compute r(µ) from Equation 5.44. (83+3K)NΩ (58+3K)NΩ 6. Compute J⊤ r(µ) (6+K)NΩ (−6−K)+(6+K)NΩ 7. Compute J⊤ J −1 J⊤ r(µ) 36+K2 30+K2 TOTAL (36+K2) +(1127+207K+7K2)NΩ (−12−K) +(863+179K+9K2)NΩ 98
Table 7.7: Complexity of Algorithm HB3DMMNF.

Step | Action | Multiplications | Additions
1. | Compute visibility set Ω. | — | —
2. | Compute J | 81 + (219+24K)NΩ | 54 + (171+16K)NΩ
   | Compute R⊤Ṙ{α,β,γ} | 81 | 54
3. | Compute J⊤J. | (36+12K+K²)NΩ | (−36−12K−K²) + (36+12K+K²)NΩ
4. | Invert J⊤J | — | —
5. | Compute r(µ) from Equation 5.44. | (83+3K)NΩ | (58+3K)NΩ
6. | Compute J⊤r(µ) | (6+K)NΩ | (−6−K) + (6+K)NΩ
7. | Compute (J⊤J)⁻¹ J⊤r(µ) | 36+K² | 30+K²
TOTAL | | (117+K²) + (344+40K+K²)NΩ | (42−13K) + (271+32K+K²)NΩ

Results for 3D rigid targets show that HB performs roughly 80% fewer operations than its LK counterpart (see Figure 7.1–(Top-left)). Results for nonrigid targets are similar to those for rigid ones: the HB algorithm that uses a semi-factorization approach (HB3DMMSF) is six times faster—84% fewer operations—than its LK equivalent (LK3DMM). The resulting complexities are similar for the three nonrigid cases—K = 6, 9, 15—in terms of speed gain, although the absolute numbers change according to the size of the deformation basis: the bigger the basis, the higher the number of operations.
Table 7.8: Complexity of Algorithm HB3DMM.

Step | Action | Multiplications | Additions
1. | Compute visibility set Ω. | — | —
2. | Compute J | (210+280K+72K²)NΩ | (204+280K+72K²)NΩ
   | Compute S⊤1 Mi, for i = 1, . . . , 3. | 63+81K+18K² | 62+81K+18K²
   | Compute S⊤2 M4. | 21+6K | 20+6K
   | Compute S⊤3 M5. | 31K+18K² | −1+31K+18K²
3. | Compute J⊤J. | (36+12K+K²)NΩ | (−36−12K−K²) + (36+12K+K²)NΩ
4. | Invert J⊤J | — | —
5. | Compute r(µ) from Equation 5.44. | (83+3K)NΩ | (58+3K)NΩ
6. | Compute J⊤r(µ) | (6+K)NΩ | (−6−K) + (6+K)NΩ
7. | Compute (J⊤J)⁻¹ J⊤r(µ) | 36+K² | 30+K²
TOTAL | | (36+K²) + (335+296K+73K²)NΩ | (−12−13K) + (304+296K+73K²)NΩ

Figure 7.1 also points out the advantage of using a factorization procedure: the semi-factorization approach reduces the number of operations by about 30%—compare algorithms HB3DMMNF and HB3DMMSF in Figure 7.1. However, the full-factorization HB scheme (HB3DMM) is much slower than LK3DMM due to the difficulties of the factorization: completely separating the motion and structure variables requires so many resources that it cancels out the advantages of the HB algorithm.
Table 7.9: Complexity of Algorithm HB3DMMSF.

Step | Action | Multiplications | Additions
1. | Compute visibility set Ω. | — | —
2. | Compute J | 81 + (60+18K)NΩ | 54 + (36+14K)NΩ
   | Compute R⊤Ṙ{α,β,γ} | 81 | 54
   | Compute J(i)1, . . . , J(i)6+K using Equations 8.5. | 60+18K | 36+14K
3. | Compute J⊤J. | (36+12K+K²)NΩ | (−36−12K−K²) + (36+12K+K²)NΩ
4. | Invert J⊤J | — | —
5. | Compute r(µ) from Equation 5.44. | (83+3K)NΩ | (58+3K)NΩ
6. | Compute J⊤r(µ) | (6+K)NΩ | (−6−K) + (6+K)NΩ
7. | Compute (J⊤J)⁻¹ J⊤r(µ) | 36+K² | 30+K²
TOTAL | | (117+K²) + (185+34K+K²)NΩ | (42−13K) + (136+30K+K²)NΩ

Table 7.10: Complexities of Additive Algorithms.

Algorithm | Multiplications | Additions
LK3DTM (Table 7.4) | 36 + 1010NΩ | −12 + 779NΩ
HB3DTM (Table 7.5) | 117 + 191NΩ | 42 + 160NΩ
LK3DMM (Table 7.6) | 36 + K² + (1127+207K+7K²)NΩ | −12 − K + (863+179K+9K²)NΩ
HB3DMMNF (Table 7.7) | 117 + K² + (344+40K+K²)NΩ | 42 − 13K + (271+32K+K²)NΩ
HB3DMM (Table 7.8) | 36 + K² + (335+296K+73K²)NΩ | −12 − 13K + (304+296K+73K²)NΩ
HB3DMMSF (Table 7.9) | 117 + K² + (185+34K+K²)NΩ | 42 − 13K + (136+30K+K²)NΩ
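Comparisons such as those of Figure 7.1 follow directly from evaluating the closed-form counts of Table 7.10 for a given target size NΩ. A hedged Python sketch (our own helper names, not the thesis code) follows; for large NΩ the saving of HB3DTM over LK3DTM tends to the 80.38% reported in Table 7.16.

    # Total operations (multiplications + additions) from Table 7.10 for the rigid case.
    def ops_lk3dtm(N):
        return (36 + 1010 * N) + (-12 + 779 * N)

    def ops_hb3dtm(N):
        return (117 + 191 * N) + (42 + 160 * N)

    if __name__ == "__main__":
        for N in (1_000, 10_000, 30_000):
            lk, hb = ops_lk3dtm(N), ops_hb3dtm(N)
            print(N, lk, hb, f"HB saves {100 * (1 - hb / lk):.2f}%")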
Figure 7.1: Complexity of Additive Algorithms. (Top-left) Number of operations vs. target size for additive rigid registration algorithms: algorithm LK3DTM (blue line) and algorithm HB3DTM (red line). The remaining panels display the number of operations vs. target size for additive nonrigid registration algorithms—LK3DMM (red), HB3DMM (blue), HB3DMMNF (magenta), and HB3DMMSF (green)—comparing the complexities for different numbers of modes of deformation: K = 6 (Top-right), K = 9 (Bottom-left), and K = 15 (Bottom-right).
  • 125. 7.3.2 Compositional Algorithms Tables 7.11–7.14 show the complexities for some compositional registration algo- rithms. We show the detailed derivation in Appendix H. We register the number of operations of each algorithm by using a similar procedure to Section 7.3.1. Instead of directly comparing the algorithms from Table 7.13, we contrast complexities by performing an 8-dof homography. We also include here the additive algorithm LKH8 just for the sake of comparison with its compositional counterparts. Table 7.11: Complexity of Algorithm LKH8. Step Action Multiplications Additions 1. Compute J 160NΩ 104NΩ Compute fH8. 11 6 Compute J using central differences (Equation 7.9). 8 ×20 8 ×13 2. Compute J⊤ J. 64NΩ −64+ 64NΩ 3. Invert J⊤ J — — 4. Compute r(µ) (Equation 3.3). 11NΩ 6NΩ 5. Compute J⊤ r(µ) 8NΩ −8+ 8NΩ 6. Compute J⊤ J −1 J⊤ r(µ) 64 56 TOTAL 64+ 243NΩ −16+ 182NΩ Table 7.12: Complexity of Algorithm ICH8. Step Action Multiplications Additions 1. Compute JIC — — 2. Compute J⊤ ICJIC. — — 3. Invert J⊤ ICJIC — — 4. Compute r(µ) (Equation 3.3). 11NΩ 6NΩ 5. Compute J⊤ ICr(µ) 8NΩ −8+ 8NΩ 6. Compute J⊤ ICJIC −1 J⊤ ICr(µ) 64 56 TOTAL 64+ 19NΩ 48+ 14NΩ 103
  • 126. We choose Lucas-Kanade (LKH8) to be the nonefficient algorithm, whereas the efficient algorithms are respectively Hager-Belhumeur (HBH8), Inverse Compositional ICH8, and Generalized Inverse Compositional (GICH8). We compute the complexity for one iteration of the search loop for each algorithm. We show the results: in Tables 7.11–7.14. As in the additive case, we consider negligible certain constant operations (denoted as—) such as inverting the Hessian matrix, or computing an offline Jacobian (such as JIC). We also consider the case where the Hessian of the optimization in algorithm ICH8 is not constant—e.g., due to partial occlusion of the target; in this case, although the Jacobian JIC is constant, but we have to compute the matrix product J⊤ ICJIC (see Table 7.12–Step 2). Table 7.13: Complexity of Algorithm HBH8. Step Action Multiplications Additions 1. Compute J 24NΩ 16NΩ Compute J = M0Σ(µ) using [Buenaposada and Baumela, 2002]. 24 16 2. Compute J⊤ J. 64NΩ −64+ 64NΩ 3. Invert J⊤ J — — 4. Compute r(µ) (Equation 3.3). 11NΩ 6NΩ 5. Compute J⊤ r(µ) 8NΩ −8+ 8NΩ 6. Compute J⊤ J −1 J⊤ r(µ) 64 56 TOTAL 64+ 107NΩ −16+ 94NΩ Table 7.14: Complexity of Algorithm GICH8. Step Action Multiplications Additions 1. Compute Jgic 448+ 64NΩ 288+ 56NΩ Compute JIC. — — Compute ∇µψ(µ). 8 × 56 8 × 36 Compute JGIC = JIC × ∇µψ(µ). 64NΩ 56NΩ 2. Compute J⊤ GICJGIC. 64NΩ −64+ 64NΩ 3. Invert J⊤ GICJGIC — — 4. Compute r(µ) (Equation 3.3). 11NΩ 6NΩ 5. Compute J⊤ GICr(µ) 8NΩ −8+ 8NΩ 6. Compute J⊤ GICJGIC −1 J⊤ GICr(µ) 64 56 TOTAL 512+ 147NΩ 272+ 134NΩ 104
Figure 7.2: Complexity of Compositional Algorithms. Number of operations vs. target size for compositional registration algorithms: LKH8 (red), HBH8 (blue), ICH8 (green), and GICH8 (magenta). We also include the ICH8 algorithm with variable Hessian (light blue) for the sake of comparison.

We summarize the results in Table 7.15; direct inspection of these results shows that one iteration of the IC algorithm requires fewer operations than the remaining algorithms—it is at least ten times faster than the equivalent LK. We also plot the results in Figure 7.2 for ease of comparison, showing the complexities for different values of NΩ. The results show that the algorithms fall into three categories, with IC being the fastest, HB and GIC in the medium range, and LK the slowest by a considerable margin.

7.4 Summary

This chapter computes the computational cost of the image registration algorithms that we shall experimentally evaluate later. We summarize the comparison of complexities among the different algorithms in Tables 7.16 and 7.17; we read the tables as follows: the computational cost of the algorithm in the i-th row serves as the basis to compute the percentage of increase or decrease in cost with respect to the algorithms in the corresponding columns—e.g., in Table 7.16, in comparison with algorithm LK3DMM, algorithm HB3DMMNF is 77% faster, algorithm HB3DMMSF is 84% faster, and HB3DMM is 93% slower. Thus, for the nonrigid case, a semi-factorization approach (HB3DMMSF) is more efficient than a proper full factorization (HB3DMM), and only slightly better than no factorization at all but swapping gradients (HB3DMMNF). For the rigid case, the HB algorithm is 80% faster than the corresponding LK.
Table 7.15: Complexities of Compositional Algorithms.

Algorithm | Multiplications | Additions
LKH8 (Table 7.11) | 64 + 243NΩ | −16 + 182NΩ
HBH8 (Table 7.13) | 64 + 107NΩ | −16 + 94NΩ
ICH8 (Table 7.12) | 64 + 19NΩ | 48 + 14NΩ
ICH8 (Table 7.12) (Variable Hessian) | 64 + 83NΩ | −48 + 78NΩ
GICH8 (Table 7.14) | 512 + 147NΩ | 272 + 134NΩ

We summarize the results for the compositional algorithms in Table 7.17. The results show that IC is much more efficient than the usual LK—about a ten-fold speed-up—and much faster than the efficient algorithms with a nonconstant Jacobian—about five times faster than HB and GIC. However, if the Jacobian of algorithm ICH8 is not constant, the IC algorithm is only 62% faster than LK, and HB is only 24% slower.

Table 7.16: Comparison of Relative Complexities for Additive Algorithms

 | HB3DTM | HB3DMM | HB3DMMNF | HB3DMMSF
LK3DTM | 80.3800% | | |
LK3DMM | | 93.5067% | 77.0790% | 84.0844%
HB3DMM | | | 88.1550% | 91.7752%
HB3DMMNF | | | | 30.5630%

Table 7.17: Comparison of Relative Complexities for Compositional Algorithms

 | ICH8 (Var. Hes.) | HBH8 | GICH8 | ICH8
LKH8 | 62.1176% | 52.7058% | 33.8805% | 92.2350%
ICH8 (Var. Hes.) | | 24.8446% | 74.5387% | 79.5024%
HBH8 | | | 39.8047% | 83.5815%
GICH8 | | | | 88.2562%
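The percentages in Tables 7.16 and 7.17 are relative differences between per-point operation counts. The small Python sketch below (our own notation, using the NΩ coefficients of Table 7.15) shows how such an entry can be reproduced approximately.

    # Relative cost difference between two algorithms, using the dominant
    # per-point operation counts (multiplications + additions per visible point).
    def relative_change(ops_base, ops_other):
        """Percentage by which ops_other is cheaper than ops_base."""
        return 100.0 * (ops_base - ops_other) / ops_base

    # Per-point counts from Table 7.15 (NΩ coefficients only).
    LKH8, HBH8, ICH8 = 243 + 182, 107 + 94, 19 + 14
    print(relative_change(LKH8, ICH8))  # ~92.2, cf. Table 7.17
    print(relative_change(LKH8, HBH8))  # ~52.7, cf. Table 7.17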
Chapter 8

Experiments

In this chapter we describe the experiments that validate the theoretical results that we introduced in Chapters 5 and 6. We demonstrate in our experiments (1) the influence of the Gradient Equivalent Equation (GEE) on the convergence of the algorithms, and (2) the correctness of the hypotheses about the complexity of the optimization algorithms. We systematize the comparison of algorithms by using a set of standardized measures. These measures describe certain features of the algorithms such as efficiency, accuracy or robustness.

We organize the chapter as follows: Section 8.1 draws a distinction between registration and tracking for efficient algorithms, and introduces basic hypotheses about the algorithms; Section 8.2 introduces the qualitative features that our experiments should exploit together with the quantitative measures that we shall use to verify the former; Section 8.3 describes the procedure that we use to generate the synthetic data needed by our experiments; Section 8.4 discusses some aspects relative to the implementation of our algorithms; Section 8.5 describes the experiments using additive algorithms, and Section 8.6 evaluates compositional algorithms; finally, Section 8.7 summarizes the results and provides some discussion about them.

Notation. Our testing algorithms are iterative: from an initial estimate µ0 in the search space RP we iterate until we find an optimum µ∗ ∈ RP. We optimize the cost function by descending along the steepest gradient direction, given by the Jacobian matrix J. The Jacobian is constant for efficient optimization methods such as IC or EFC; in this case, we denote by µJ ∈ RP the fixed parameters at which we compute this constant Jacobian.

8.1 Motivation

The purpose of the experiments is to test a set of hypotheses that describe functional characteristics of the algorithms. We aim to demonstrate different properties of the algorithms such as convergence or efficiency. We informally present these
hypotheses in the following questions:

Efficient Registration and Tracking. In Section 2.1 we stated the generic differences between image registration and tracking. However, efficient registration/tracking algorithms—such as HB, IC, GIC, or EFC—show subtle differences when used for registration or for tracking. By construction, efficient algorithms compute the Jacobian matrix used in the iterative optimization process at a fixed location µJ. This Jacobian matrix may be either constant—as in the IC or EFC algorithms—or partially constant—as in HB or GIC. In the following we show the main differences between image registration and tracking when using efficient methods:

Number of images. Efficient registration involves an image and a template: the algorithm warps the image so that its texture and the template texture coincide (see Figure 8.1). On the other hand, efficient template-based tracking involves a sequence of images and a template: the tracking algorithm searches for the template in each image of the sequence; besides, the template may not even be an image of the sequence—see Figure 8.1.

Algorithm initialization. Direct registration methods imply by definition that the template and the registered/tracked image overlap to some extent: the error between template and image is linearized into a gradient descent scheme whose outputs are the target parameters. Thus, it is critical for the performance of the algorithm to choose a proper initialization of the optimization procedure.

In registration problems, the template and the registered image must sufficiently overlap (see Figure 8.1). The registration algorithm is usually initialized at µ∗template, the location of the target region on the template—i.e., µ0 = µ∗template, see Figure 8.2. The initial guess must be close enough to µ∗, the actual target parameters at the registered image: the regions defined by µ0 and µ∗ in the image must overlap so that the image error can be linearized—cf. Figure 8.2–(Top-right).

In tracking problems, the template and the tracked image may be arbitrarily different (see Figure 8.1). The tracking algorithm is not initialized at the template—i.e., µ0 ≠ µ∗template—but at the previous target location in the sequence: for the sequence frame captured at t + 1, we initialize the iterative tracking procedure at the optimum computed at frame t, µ∗t—i.e., µ0 = µ∗t. Again, the initial guess must be close enough to the actual target location µ∗t+1 so that the error can be linearized: the regions defined by µ∗t and µ∗t+1 must overlap, which is equivalent to saying that the inter-frame differences must be small enough—see Figure 8.2–(Bottom-right). Notice that the image in Figure 8.2–(Bottom-right) can be tracked in a sequence but not registered, as the intersection of the regions defined by µ∗template and µ∗t+1 is empty.
Figure 8.1: Registration vs. Tracking. (Top) The registration algorithm aligns regions on the template and the image (green squares). The algorithm warps the image (pink square) such that the intensity values of image and template coincide. (Bottom) The tracking algorithm searches for those regions in the sequence whose texture coincides with the template. The output is a vector containing the state of the target region (position, orientation, scale, etc.).
  • 132. Figure 8.2: Algorithm initialization. (Top-left) Template image where the Jacobian matrix is computed at the fixed parameters µJ = µ∗ template (green square). (Top-right) Image to be registered to the template with the actual parameters µ∗ of the target region (green square). We initialize the registration procedure with the target location at the template—i.e., µ0 is µ∗ template. (Bottom-left) Frame t in the image sequence. We show the actual target parameters at time t, µ∗ t . (Bottom-right) Frame t + 1 in the image sequence with the actual target parameters µ∗ t+1 (green square). The tracking algorithm is not initialized at µ∗ template (yellow square) but at the previous target location µ∗ t (pink square). 110
Table 8.1: Registration vs. tracking in efficient methods

Registration | Tracking
The aim is to align image regions | The aim is to recover the target state (position, orientation, velocity, etc.)
An image is registered against the template | A sequence of images is tracked against the template(1)
The template and the registered image must overlap to some extent | The template and the tracked image may not overlap
The Jacobian is computed at the initial guess of the optimization | The Jacobian is computed far away from the initial guess of the optimization

(1) The template may not even be a part of the sequence.

Efficient Jacobian. Efficient algorithms compute—partially or totally—the Jacobian matrix of the brightness error at the fixed parameters µJ. It is usually assumed that efficient algorithms can be used for either registration or tracking. However, we show that the assumptions behind efficient algorithms play out very differently in the two problems.

In registration problems, the efficient Jacobian is computed at the location of the template, which happens to be also the initial guess for the iterative registration procedure—i.e., µJ = µ∗template = µ0, see Figure 8.2. This is the case of the experiments in [Baker and Matthews, 2004] or [Brooks and Arbel, 2010]. The optimization procedure starts with the actual Jacobian at µ0, which remains constant—or partially constant—for the rest of the iterations of the algorithm.

In tracking problems, the efficient Jacobian is computed at the location of the template, which is usually very different from the initial guess for the iterative tracking procedure—i.e., µJ = µ∗template ≠ µ0, see Figure 8.2. This is what happens in the experiments in [Hager and Belhumeur, 1998] or [Cobzas et al., 2009]. In this case, the optimization procedure does not start with the actual Jacobian—i.e., the Jacobian computed at µ0—but with the one computed at µ∗template. This Jacobian must be somehow transformed by the parameters µ∗t so that it can be used at frame t + 1 (see Figure 8.2).

We summarize the differences between registration and tracking for efficient algorithms in Table 8.1. In this chapter we show that efficient algorithms that satisfy their requirements can be used for either tracking or registration. If an efficient algorithm does not meet its requirements, then it is still eligible for registration but not for tracking.
  • 134. Do Warp Requirements Influence Convergence? In previous sections we stated the requirements that warps should meet to work with efficient registration algorithms (see Table 6.1–page 89). We are interested in studying how the com- pliance of these warp requirements affects algorithms convergence. Our hypothesis is that the optimization successfully converges if and only if the warp meets its requirements—e.g. any IC-based optimization algorithm shall converge if its warp holds Requirements 2 and 3 (cf. Table 6.1). However, proving this hypothesis is not easy: an algorithm may converge close to the optimum even when none of its Requirements are met, so we are not facing a true-or-false situation. Recall that efficient algorithms substitute the actual gradient by an approximation provided by the GEE (see Section 5.1 and 6.2). Thus, if the approximated gradient is sufficiently similar to the actual one—that is, the one computed in each iteration—the optimization may converge close to the actual optimum. When the warp requirements are satisfied, we theoretically demonstrated that the approximation due to gradient swapping was accurate—which should lead to good convergence of the optimization. However, when the algorithm fails to meet the requirements, it is actually approximating the Jacobian. We show in our experiments that, in this case, the optimization converges when µ0 ≡ µJ—i.e. in registration problems. As µ0 becomes increasingly different from µJ in the parameter space—as is the case in tracking problems—the performance of the optimization degrades. From how far do Algorithms Converge? The convergence in gradient-based optimization heavily depends on the choice of the initial guess [Press et al., 1992]. Gradient-based optimization linearizes the cost function using a first order Tay- lor series expansion at the starting point. The accuracy of an approximation that uses Taylor series depends on the order of the approximating polynomial, so a lin- ear approximation is rather coarse; hence, the initial guess for our gradient-based optimization must be close to the optimum. If S ⊆ Rp is the search space, we define the basin of convergence Λ as the neighbourhood around the optimum µ∗ where the algorithm converges— i.e. Λ = {µ0|µ∗ = limk→∞ Υ(µ0)}, where Υ(µ0) = (µ0, µ1, . . . , µk) is the iterative sequence that begins at µ0. The basin of convergence is typically an open ball with centre µ∗ and radius rΛ. The radius rΛ describes the convergence properties of our algorithm: the larger the radius, the bigger the basin of convergence, so the algorithm converges far from the optimum. We show in our experiments that the basin of convergence of a given algorithm strongly depends on the compliance of the warp requirements. Do Theoretical and Empirical Complexities Match? The experimental tests shall also confirm the theoretical complexities that we obtained for our algorithms in Chapter 7. We are specially interested in confirming the actual speed increment in the factorization-based algorithms: we have only guessed their complexity in a theoretical fashion, and we want to compare it to other approaches such as LK. 112
8.2 Features and Measures

We answer the aforementioned questions by analyzing the experimental results of the algorithms under review. We describe the qualitative properties of the algorithms by using a collection of features. We quantify each feature by measuring certain output values when testing the algorithms. In the following we present these features along with their corresponding measures.

Accuracy. We measure how accurate the algorithms are, that is, how close their outcomes are to the actual optimum. We measure the accuracy of our algorithms using the Reprojection Error: let µ̂ ∈ RP be the parameters estimated by an algorithm, and let µ∗ ∈ RP be the actual optimum for a given configuration. We define the reprojection error ε(µ̂) as the average Euclidean distance between the projections of µ̂ and µ∗,

    ε(µ̂) = (1/N) Σx∈X ‖ p(f(x; µ̂)) − p(f(x; µ∗)) ‖,    (8.1)

where N = |X|, f : R3 × RP → R3 is the warp function of our algorithm, and p : R3 → R2 is the corresponding projection. Notice that we define the error function in R2—the image space—instead of RP—the search space. Thus, the estimated parameters are accurate if they project the shape to coordinates that are close enough to those projected by the actual optimum (see Figure 8.3). We could quantify the accuracy by directly comparing the parameters µ̂ and µ∗. However, comparing motion parameters in RP does not have a natural geometric meaning—e.g. it is difficult to compare homography parameters or rotation matrices.

Efficiency. The efficiency feature is directly related to the computational complexity of the algorithm. We theoretically computed the complexity of some algorithms in Chapter 7, but we corroborate these estimates in the experiments. We break down the computational burden of each algorithm into (1) the total number of iterations of the optimization loop, and (2) the time per iteration measured in seconds.

Robustness. The robustness feature measures the basin of convergence of the algorithm. Our fundamental assumption is that the initial guess µ0 is located in a small neighbourhood of the actual optimum. We measure the robustness of the algorithm by successively increasing the radius of this neighbourhood and computing the frequency of convergence of the optimization.

We define the Frequency of Convergence as the percentage of successfully converged trials: we consider that an algorithm has successfully converged
  • 136. Figure 8.3: Accuracy and convergence. (Left) The target projected by the ac- tual parameters µ∗. We overlay on the alpha channel the target projected by the es- timated parameters ˆµ. (Right-Top row) Two snippets of the image that show the projections of u = p(f(x; µ∗)) and u′ = p(f(x′; µ∗)), respectively, onto ˆu = p(f(x; ˆµ)) and ˆu′ = p(f(x′; ˆµ)). (Right-Bottom row) The green circle represents the threshold in the reprojection error: we consider that the estimated parameters are accurate if the estimated projection ˆµ belongs to the circle. when the reprojection error is below a given threshold (see Figure 8.3). Be- sides, we define the Rate of Convergence as the variation of reprojection error per iteration. We typically plot the rate of convergence as the reprojec- tion error versus the iteration of the optimization loop. Localness Localness measures the convergence of an efficient algorithm with respect to the point where the gradient is computed. The localness feature helps us to dis- criminate between image registration and tracking: algorithms that are valid for registration have high localness, whereas algorithms devised for tracking should have low localness. We measure localness by comparing the convergence frequency for different datasets: algorithms with high localness should converge for registration-like problems; on the other hand, algorithms with low localness should equally converge for all the datasets. We shall show that those algorithms that do not satisfy their requirements are more local than those who do satisfy. Generality The generality feature measures the ability of the optimization scheme to deal with different warp functions. Generality is a generic property of the algorithms. Thus, we indirectly measure generality by comparing convergence rates and frequencies of the same algorithm with different warp functions. We summarize the qualitative features that we expect in our algorithms versus their corresponding quantitative measures in Table 8.2. 114
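As a concrete illustration of the reprojection error of Equation 8.1, the following Python sketch averages image-plane distances between projections obtained with estimated and ground-truth parameters; project(x, mu) is a placeholder for the composition p(f(x; µ)) and is assumed to be provided elsewhere (this is not the thesis's MATLAB code).

    import numpy as np

    def reprojection_error(project, X, mu_hat, mu_star):
        """Average image-plane distance of Equation 8.1.

        project(x, mu) -> 2D image point, i.e. p(f(x; mu)) (assumed given).
        X is the list of 3D target points used in the comparison.
        """
        dists = [np.linalg.norm(project(x, mu_hat) - project(x, mu_star)) for x in X]
        return float(np.mean(dists))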
  • 137. Table 8.2: Features and Measures. Qualitative Features Quantitative Measures Accuracy Reprojection error Efficiency Time per iteration Number of iterations Robustness Rate of Convergence Convergence Frequency Localness Rate of Convergence Convergence Frequency Generality Rate of Convergence Convergence Frequency Table 8.3: Numerical Ranges for Features. Quantitative Measures Freq. Convergence Reproj. Error Iterations QualitativeFeatures Convergence x ≥ 80% — — 40% < x < 80% — — x ≤ 40% — — Accuracy — x ≤ 1.0px. — — 1.0px. < x < 5.0px. — — x ≥ 5.0px. — Efficiency — — x ≤ 10 — — 10 < x < 30 — — x ≥ 30 8.2.1 Numerical Ranges for Features We compare the algorithms by analyzing the values of their numerical measures. However, it is also interesting to compare the algorithms with respect to a fixed scale. We build such a three-fold scale by classifying the numerical outcomes of the experiments in good, medium, and bad. We show the ranges for the numerical values of the measures in Table 8.3. Each feature is described by one or more measures, and each measure is defined according to a given numerical range. Thus, we classify the feature into good—green rows—medium—yellow rows—and bad—red rows. Notice that we arbitrary define these ranges of values, so the final classification may be subject to interpretation. 115
  • 138. 8.3 Generation of Synthetic Experiments We examine the registration/tracking algorithms by using a collection of syntheti- cally generated experiments. The importance of using synthetic data is to accurately verify the outcomes of the algorithms. We design our synthetic experiments to an- alyze the influence of the features introduced in Section 8.2. Each experiment consists in registering a 3D target to an image. We synthetically generate this image by rendering a 3D model of the target at parameters µ∗ , the actual optimum. For efficient registration algorithms, we also render the template image at µJ, the fixed parameters where we compute the Jacobian matrix. We typically choose µJ such that the rendered image best displays the texture template so the derivatives can be computed as accurately as possible—i.e., the texture is totally visible, and it is neither distorted nor aliased, see Figure 8.6. Once the image is generated, we iterate the registration algorithm starting from the initial guess µ0. Recall from appendix A that the performance of GN-like op- timization methods depends upon the shape of the cost function between µ0 and µ∗ —which is usually convex when µ0 and µ∗ are close enough—and the accuracy of the linear approximation provided by the Jacobian at µJ. Thus, by tuning pa- rameters µ∗ , µ0, and µJ we may study the behaviour of the registration algorithm. In the following we show how to generate synthetic datasets that highlight the features in in Section 8.2: Accuracy: Synthetic Ground-truth Synthetic data naturally provides ground truth for evaluating accuracy: µ∗ is known by definition, so computing the reprojec- tion error is straightforward from the algorithms outcomes. Moreover, results can be easily compared by sharing the synthetic data among all the algorithms. Robustness: Gaussian Noise The robustness of a gradient-based optimization procedure measures its convergence with respect to the difference between the initial guess µ0 and the actual optimum µ∗ . We study the robustness of the algorithms by generating experiments whose initial guess increasingly diverges from the ground- truth optimum. We generate the initial guess for our experiments by corrupting the actual optimum µ∗ with Gaussian noise of increasing standard deviation σ. µ0 ∼ N(µ∗ , σ2 ). (8.2) We analyze the convergence of the testing algorithms when varying the noise vari- ance: the higher the variance the greater the distance between the initial guess and the optimum. We show the ground truth data and several initial guesses (gener- ated from different variances) in Figure 8.4. Although the variance is defined in the parameter space, its effects are equivalent in R2 , the image coordinates. Ta- ble 8.11—page 166—shows the reprojection error for the initial parameters ε(µ0) (see Equation 8.1) for different noise values. Again, the bigger the variance the greater the Euclidean distance in the image plane. We prefer to measure the error 116
  • 139. σ = 0.5 σ = 1.0 σ = 1.5 σ = 2.0 σ = 2.5 σ = 3.0 σ = 3.5 σ = 4.0 σ = 4.5 σ = 5.0 Figure 8.4: Ground Truth and Noise Variance. We show the ground-truth values —textured cube—with three parameters samples each—µ1 0 in red, µ2 0 in blue, and µ3 0 in green—generated by using the corresponding noise variance σ. The noise ranges from σ = 0.5 to σ = 5.0, and the successive samples increasingly depart from the ground-truth. in the image to be comparable to previous works such as [Baker and Matthews, 2004; Brooks and Arbel, 2010]. Localness: Multiple Datasets Localness is specially relevant in the case of ef- ficient algorithms such as HB or IC. Localness relates the convergence of an efficient algorithm while the actual optimum µ∗ and the initial guess µ0 diverge from the parameters µJ where we compute the Jacobian. We study the influence of localness in the convergence of the testing algorithms by building multiple datasets: each dataset contains samples of µ∗ that are increasingly different from µJ. The experi- ments aim to demonstrate that efficient algorithms are local when they do not meet the fundamental requirements: that is, the approximation of the gradients provided by the efficient algorithms, when they do not satisfy the requirements, is only valid in a local neighbourhood of µJ. We design six datasets that increasingly minimize localness. We name these datasets from DS1 to DS6 (see Figure 8.5). For each dataset we randomly generate 10, 000 samples of the target position µ∗ . We control the divergence between µ∗ and µJ by defining the ranges where we sample each target parameter. These ranges define increasingly larger neighbourhoods centred in µJ that do not overlap with each other. We show an example of datasets DS1–DS6 for parameters µ = {α, β, γ, tx, ty, tz}⊤ in Table 8.5—page 129—and we graphically represent the values contained in the table in Figure 8.5. We choose those samples in dataset DS1 to be equal to the Jacobian parameters—that is, µ∗ = µJ for all the 10, 000 samples. For the remaining datasets, we arbitrarily choose the parameter ranges depending on each target—for example, we do not rotate planar targets more than 60o so the texture may be accurately recovered from the image. For each interval range Ψi ≡ [ai, bi] corresponding to parameter µi we randomly sample the parameter µ∗ i from an Uniform distribution defined over the support sets [ai, bi] and [−ai, −bi]— 117
i.e. µ∗i ∼ U([−bi, −ai] ∪ [ai, bi]). We repeat this process for each parameter of µ∗, as each parameter may be defined over different ranges for the same dataset.

Figure 8.5: Definition of Datasets. Ranges of the rotation parameters α, β, and γ for datasets DS1 (Top-Left) to DS6 (Bottom-Right). For each parameter we display the range interval as a green square annulus. We represent the combination of the three ranges as a cubic shell—the region between two concentric cubes, plotted in blue. Inside this region we plot the targets obtained from 8 random samples of the rotation parameters within their corresponding ranges. The plots show that the rotation of the target increases for each dataset.

8.3.1 Synthetic Datasets and Images

We generate our synthetic experimental data for a given target by simultaneously using multiple datasets and Gaussian noise. We present the procedure in Algorithm 12. Although the parameters and their ranges may change from one experiment to another, the procedure is similar for all cases. The procedure generates six datasets DS1, . . . , DS6 with 10,000 samples each. We generate the corresponding initial guesses µ0 for the optimization by sampling 1,000 data points from a Normal distribution with standard deviations ranging from σ = 0.5 to σ = 5.0 in increments of 0.5 units—i.e. 1,000 ground-truth samples for each one of the 10 noise values. This adds up to 60,000 samples. We render one synthetic image per ground-truth sample: each image represents the projection of the target under the motion parameters µ∗ and the camera parameters K,

    x = K [R(µ∗) | t(µ∗)] (X⊤, 1)⊤,    (8.3)
where X are the target shape coordinates, R and t are the rigid body motion corresponding to the ground truth µ∗, and x are the target image projections. We represent the target using a textured triangle mesh (see Chapter 5). We project the target using Equation 8.3 and we render the texture on the image using POVRAY, a free tool for ray tracing. The result is a collection of 60,000 images (see Figure 8.6).

Algorithm 12 Creating the synthetic datasets.
1: for d = 1 to 6 do
2:   for σ = 0.5 to 5.0 do
3:     for j = 1 to 1,000 do
4:       Generate a ground-truth sample for range d, µ∗ = (µ1, . . . , µP)⊤, where µi ∼ U([−bi, −ai] ∪ [ai, bi]) for i = 1, . . . , P.
5:       Generate the initial guess with Gaussian noise, µ0 ∼ N(µ∗, σ²).
6:     end for
7:   end for
8: end for

Figure 8.6: Example of Synthetic Datasets. We select different samples from each dataset. Datasets range from DS1 (Top) to DS6 (Bottom), according to Table 8.5. The top-left image represents the position where we compute the Jacobian for efficient methods. Notice that the successive samples increasingly depart from this location.
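The sampling in Algorithm 12 and Equation 8.2 can be sketched as follows in Python (an illustration under our own naming, not the thesis code); a and b are assumed to be NumPy arrays holding the per-parameter lower and upper bounds of one dataset, as in Table 8.5.

    import numpy as np

    rng = np.random.default_rng(0)

    def sample_ground_truth(a, b):
        """Sample each parameter uniformly from [-b, -a] U [a, b] (Algorithm 12, step 4)."""
        mag = rng.uniform(a, b)                       # per-parameter magnitudes
        sign = rng.choice([-1.0, 1.0], size=a.shape)  # random side of the annulus
        return sign * mag

    def make_dataset(a, b, sigmas, n_per_sigma=1000):
        """Ground-truth / initial-guess pairs for one dataset (Equation 8.2)."""
        samples = []
        for sigma in sigmas:                          # e.g. np.arange(0.5, 5.01, 0.5)
            for _ in range(n_per_sigma):
                mu_star = sample_ground_truth(a, b)
                mu_0 = rng.normal(mu_star, sigma)     # Gaussian-corrupted initial guess
                samples.append((mu_star, mu_0))
        return samples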
  • 142. 8.3.2 Generation of Result Plots We use the synthetic data from Section 8.3.1 with a group of selected algorithms. Notice that the ground-truth data µ∗ j , the corresponding synthetic image, and the initial guess µj are common to all algorithms. We register the target with the synthetic image by using each algorithm. From each optimization, we collect (1) the reprojection error ε(ˆµ) in pixels, (2) the total optimization time in seconds, and (3) the number of iterations of the optimization. We average the 1, 000 values of reprojection error, optimization time, and number of iterations for each noise σ value. Besides, we consider that one optimization has successfully converged when its reprojection error is below 5.0 pixels. For each dataset we collect 10, 000 outcomes per algorithm: 1000 samples for each one of the 10 levels of noise. Using the collected data of all the algorithms, we generate four plots for each dataset. We plot each algorithm by using different colours and markers for ease of comparison (e.g., see Figures 8.45–Page 167). Each plot is computed as follows: Reprojection Error We plot the average reprojection error (in pixels) of those optimizations that have successfully converged for each algorithm. We inde- pendently average these values for each level of noise—see, e.g., Figure 8.45– (Top-Left). Notice that we are measuring the accuracy of the algorithms at ideal conditions, as we are only using those optimizations that have success- fully converged Percentage of Convergence We plot the percentage of optimizations that have successfully converged for each level of noise—see, e.g., Figure 8.45–(Top- Right). Convergence Time We plot the average convergence time (in seconds) of those optimizations that have successfully converged for each algorithm—see, e.g., Figure 8.45–(Bottom-Left). Number of Iterations We plot the total number of iterations of those optimiza- tions that have successfully converged for each algorithm—see, e.g., Figure 8.45– (Bottom-Right). Finally, note that we are only averaging those results from optimizations that have converged: those plots concerning reprojection error or number of iterations only consider the best outcomes of those algorithms. Thus, the average results also depend on the frequency of convergence—i.e., low reprojection error coupled with low frequency of convergence is less meaningful than low reprojection error coupled with high percentage of convergence. We summarize the process of generating and evaluating the experiments in Figure 8.7. 120
  • 143. Figure 8.7: Experimental Evaluation with Synthetic Data. We generate synthetic data for the 6 datasets. For each dataset we generate 10, 000 parameters (green) using the given range (magenta). With these parameters we compute both the initial guess for the optimization (yellow) and the synthetic images (blue). Finally, each pair synthetic image–initialization is evaluated through the algorithms LK, HB, IC, FC and GIC, and the results collected (red). 121
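The aggregation described in Section 8.3.2 can be sketched as follows (Python, our own helper names): for each noise level we keep only the converged trials—reprojection error below 5 pixels—and average error, time and iteration counts over them.

    import numpy as np

    CONV_THRESHOLD = 5.0  # pixels, as in the text

    def summarize(results):
        """results: list of (sigma, reproj_error, time_s, iterations) tuples."""
        summary = {}
        for sigma in sorted({r[0] for r in results}):
            trials = [r for r in results if r[0] == sigma]
            conv = [r for r in trials if r[1] < CONV_THRESHOLD]
            summary[sigma] = {
                "convergence_pct": 100.0 * len(conv) / len(trials),
                "mean_error": np.mean([r[1] for r in conv]) if conv else np.nan,
                "mean_time": np.mean([r[2] for r in conv]) if conv else np.nan,
                "mean_iters": np.mean([r[3] for r in conv]) if conv else np.nan,
            }
        return summary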
8.4 Implementation Details

In this section we give detailed information about the implementation of the registration/tracking algorithms: the criteria used to decide when to stop the minimization, how to deal with self-occlusions of the model, and further improvements on the factorization.

8.4.1 Convergence Criteria

We implement the gradient-based optimization of all the algorithms using the Gauss-Newton scheme proposed by Madsen et al. [Madsen et al., 2004]. We use the following stopping criteria:

1. one based on the value of the gradient: ‖g(x)‖ ≤ ε1, with g(x) being the gradient evaluated at the parameters x;

2. one based on the parameter increment: ‖x(n + 1) − x(n)‖ ≤ ε2(‖x(n)‖ + ε2), with x(n + 1) and x(n) being the parameters at iterations n + 1 and n.

Both criteria depend upon the values of the constants ε1 and ε2. For the whole range of experiments we use the values ε1 = 10−8 and ε2 = 10−8. Finally, we define a safeguard against infinite loops by imposing an upper bound of 50 iterations on the optimisation scheme; the parameter values at that point are taken as the solution.

Notice that the usual Gauss-Newton algorithm recomputes the Jacobian matrix of the error, whereas other methods such as IC do not. We deal with this issue by having separate functions to compute both the Jacobian and the approximation to the Hessian matrix, while maintaining the overall scheme of the optimisation. Each algorithm was implemented using the MATLAB scripting language.
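For illustration only, here is a Python sketch of a Gauss-Newton iteration with the two stopping criteria above (not the thesis's MATLAB implementation); the residual function r and Jacobian function jac are assumed to be supplied by the caller, and the infinity norm in the gradient test is our choice.

    import numpy as np

    def gauss_newton(r, jac, mu0, eps1=1e-8, eps2=1e-8, max_iters=50):
        """Gauss-Newton with gradient and parameter-increment stopping criteria."""
        mu = np.asarray(mu0, dtype=float)
        for _ in range(max_iters):
            res = r(mu)
            J = jac(mu)
            g = J.T @ res                         # gradient of 0.5 * ||r||^2
            if np.linalg.norm(g, np.inf) <= eps1:
                break
            delta = np.linalg.solve(J.T @ J, -g)  # normal equations
            mu = mu + delta
            if np.linalg.norm(delta) <= eps2 * (np.linalg.norm(mu) + eps2):
                break
        return mu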
8.4.2 Visibility Management

Convex targets—i.e. nonplanar ones—typically suffer from self-occlusion: the target cannot be completely imaged, only a portion of it (see Figure 8.8). A triangle of the target mesh becomes self-occluded when (1) the triangle is covered by other triangles of the target that are closer to the camera, or (2) the normal vector of the triangle is orthogonal—or partially orthogonal—to the camera projection rays. The set of occluded triangles dynamically depends on the relative orientation of target and camera: some triangles appear and others disappear from the image due to changes in rotation, translation, and even the deformation of the target. We consequently restrict the brightness dissimilarity (Equation 3.1) to the visible points,

    D(Ω; µ) = T(Ω) − I(f(Ω; µ)),    (8.4)

where Ω ⊂ X is the set of visible vertices in the current view I, which depends on µ—i.e. Ω = Ω(µ). Equation 8.4 actually holds only on the nonoccluded points: if we include any vertex outside Ω, its dissimilarity will be corrupted due to the erroneous brightness of the imaged vertex.

Figure 8.8: Visibility management. (Top row) We rotate the shape model around the Y-axis by β = {0°, 30°, 60°, 90°}. (Middle row) Block-structure plots of the Jacobian matrix. We compute the constant Jacobian J at β = 0°. We also compute the weighting matrices Wi, i = 1, . . . , 4, such that Wi depends on the rotation with angle βi. For each view, we plot the matrix WiJJ⊤W⊤i; blue dots represent its non-zero entries. (Bottom row) Block-structure plots of the Hessian matrices Hi = J⊤W⊤iWiJ. Different colour values represent different data entries.

We take the self-occluded points into account in the optimization schemes (LK, HB, etc.) by using Weighted Least-Squares (WLS) [Press et al., 1992]. We transform the ordinary least-squares problem,

    J⊤J δµ = −J⊤r,
into the WLS problem,

    J⊤W⊤WJ δµ = −J⊤W⊤Wr,

where W is the weight matrix. Matrix W is typically diagonal—the residuals are uncorrelated—and the i-th entry of the main diagonal indicates the importance of the residual ri(µ)—the i-th entry of the vector r. We choose the following weighting matrix:

    W = diag(ω1, ω2, . . . , ωNΩ),    (8.5)

where

    ωi,i = 1 if xi ∈ Ω ⊂ X,  and  ωi,i = 0 if xi ∉ Ω,    (8.6)

and NΩ = |Ω|. The matrix W in Equation 8.5 includes the i-th residual in the normal equations (Equation 8.4.2) only if the corresponding i-th target point is visible (cf. Equation 8.6). Those points not present in Ω will not affect the local minimizer of Equation 8.4.2, so we have effectively taken the self-occluded points into account.

Constant Hessian and WLS. Note that the Hessian matrix in Equation 8.4.2 includes the inner product W⊤W. If W depends on µ, then the Hessian is no longer constant. Thus, the efficiency of algorithms that use a constant Hessian, such as IC, greatly diminishes, as the product J⊤W⊤WJ changes over time. We can estimate the loss of efficiency of the IC algorithm using the examples from Chapter 7. The original implementation of algorithm ICH8 is about six times faster than HBH8 (see Table 7.17). However, if the Jacobian depends on the matrix W, the resulting IC algorithm with variable Hessian is slower—cf. Table 7.15–Page 106.

Efficient Solution of WLS. We may alleviate the loss of efficiency of any algorithm that uses WLS by using a technique similar to [Gross et al., 2006]: the key idea is to keep as many precomputed values as possible. The method subdivides the tracking region of the object into P non-overlapping partitions Pi. The partitions are chosen such that the triangles inside each partition have a consistent orientation. We precompute the Hessian matrix for each partition, HPi, and we compute the Hessian matrix for the optimization as H = Σ_{i=1}^{P} λi HPi, where λi is a weight for each partition Pi (see Figure 8.9). We also adapt the method from [Gross et al., 2006] to the HB algorithm: we compute the matrix SPi for each partition Pi and we build the matrix S as S = Σ_i S⊤Pi SPi.
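As a minimal illustration of the weighted normal equations above, the following Python sketch builds the diagonal visibility weights of Equations 8.5–8.6 and solves one WLS step; the variable names are ours and visible is assumed to be a boolean mask over the residuals.

    import numpy as np

    def wls_step(J, r, visible):
        """One WLS Gauss-Newton step: J^T W^T W J dmu = -J^T W^T W r.

        visible: boolean array, True if the corresponding point is in Omega.
        """
        w = visible.astype(float)      # omega_ii in {0, 1}, Equation 8.6
        Jw = J * w[:, None]            # rows of occluded points are zeroed out
        rw = r * w
        return np.linalg.solve(Jw.T @ Jw, -(Jw.T @ rw))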
  • 147. Figure 8.9: Efficiently solving of WLS. The texture in the reference space is sub- divided in regions—green squares—whose Hessians are computed separately. The actual Hessian only takes into account visible regions—blue overlay. 8.4.3 Scale of Homographies In Section 5.3 we proved that the Equation 5.31 represents a proper factorization as S (Equation 5.32) only contains target shape elements. However, this is not entirely accurate as matrix S also depends on the homogeneous scale factor λ = 1/(1 − n⊤ t) (cf. Appendix E). Note that the λ scale depends on the translation vector t, and hence, the matrix S cannot be constant. However, we can still use the factorization with no efficiency loss by employing wls—as in the case of self-occlusions. We redefine the weighting matrix W (Equa- tion 8.5) as W =      ω′ 1 0 · · · 0 0 ω′ 2 · · · 0 ... ... ... ... 0 0 · · · ω′ NΩ      , (8.7) and we define each weight ω′ i,i = ωi,iλi, where ωi,i are the occlusion weights (Equa- tion 8.6), and λi = 1/(1 − n(i)⊤ t) with n(i)⊤ being the plane normal to the i-th vertex. The conclusion is that we can extract the homogeneous scale λ from matrix S, and account for the factor when solving the local minimizer in a wls fashion. Thus, the matrix S is actually constant and we do not lose efficiency as the weighting 125
process is unavoidable.

8.4.4 Minimization of Jacobian Operations

We also benefit from the block structure of the factorized form of the Jacobian matrix. The idea was first proposed in [Sepp and Hirzinger, 2003], and we adapt it to our algorithms. In the following we show how to apply the idea to algorithm HB3DTM (see Algorithm 7). We solve for the local minimizer (Equation 8.4.2) by using a Jacobian matrix from the factorization (Equation 5.31) as follows:

    δµ = ((SM)⊤(SM))⁻¹ (SM)⊤ r.    (8.8)

We turn our attention to the GN Hessian matrix of Equation 8.8, (SM)⊤(SM), and we rewrite it using Equations 5.32 and 5.30 as follows:

    M⊤S⊤SM = (M⊤1 0; 0 M⊤2) (S⊤1; S⊤2) (S1 S2) (M1 0; 0 M2)
            = (M⊤1 0; 0 M⊤2) (S⊤1S1 S⊤1S2; S⊤2S1 S⊤2S2) (M1 0; 0 M2)
            = (M⊤1S⊤1S1M1 M⊤1S⊤1S2M2; M⊤2S⊤2S1M1 M⊤2S⊤2S2M2).    (8.9)

Notice that the matrix in Equation 8.9 is symmetric, so there is no need to compute the elements of its lower triangular part—the block M⊤2S⊤2S1M1 in this case. Thus, we save roughly 25% of the computations needed to build the Hessian matrix (Equation 8.9).

8.5 Additive Algorithms

In this section we present the experiments that we conducted to evaluate additive registration algorithms. We organize this section as follows: we introduce the algorithms that we use to evaluate the hypotheses in Section 8.5.1; we generate synthetic data for rigid and nonrigid targets that we evaluate using the algorithms in Sections 8.5.2 and 8.5.3; and we demonstrate the robustness of our algorithms on real data in Sections 8.5.5 and 8.5.6.

8.5.1 Experimental Hypotheses

The purpose of the tests is to confirm theoretical hypotheses concerning some of the algorithms. We are especially interested in investigating the relationship between the fulfilment of certain requirements by an algorithm and its convergence. We use the same naming convention for algorithms that we introduced in Chapter 7. We select four algorithms from Table 7.2 (see Page 95): two for rigid targets—LK3DTM and HB3DTM—and two for nonrigid data—LK3DMM and HB3DMM. For the sake
For the sake of comparison we also include the algorithm HB3DRT: 6-dof tracking with the HB algorithm [Sepp, 2006]. Table 8.4 summarizes some characteristics of the selected algorithms: description of the warp, constancy of the Jacobian matrix, and fulfilment of Requirement 1.

Table 8.4: Evaluated Additive Algorithms

  Algorithm  Warp                               Constant   Req. 1
  LK3DTM     Shape-induced homography           No         —
  HB3DTM     Shape-induced homography           Partially  YES
  LK3DMM     Nonrigid shape-induced homography  No         —
  HB3DMM     Nonrigid shape-induced homography  Partially  YES
  HB3DRT     Rigid body                         Partially  NO

We use the algorithms from Table 8.4 to validate that (1) the convergence of HB depends upon Requirement 1, (2) HB is accurate and robust, and (3) HB is efficient.

8.5.2 Experiments with Synthetic Rigid data

This set of experiments studies the convergence of the evaluated algorithms in a controlled environment. The synthetic datasets provide us with precise measurements on the outcomes of the algorithms.

Target Model
Our target is a 3D textured model (3DTM): a 3D triangle mesh and a texture image—both of them defined in a reference frame—constitute the target (cf. Section 5.3.1). We use three models in our experiments with rigid data: (1) a textured cube, (2) a human face, and (3) a textured rectangular box.

The cube model comprises 15,606 vertices and 30,000 triangles, and its texture image has 640 × 480 pixels (see Figure 8.10). We use the centroid of each triangle as target vertex, adding up to 30,000 points. We compute the centroids of the triangle mesh and of the reference frame using barycentric coordinates. The texture value for each triangle centroid is computed by averaging the colour values of the vertices of the triangle. We do not consider those vertices close to the edges of the cube; instead of removing them from the model, we mark them as "forbidden" and they are treated as occluded points (see Section 8.4.2).

The face¹ model comprises 5,866 vertices and 11,344 triangles, and its texture image has 500 × 500 pixels (see Figure 8.11). We use the centroid of each triangle as target vertex, adding up to 11,344 points (see Figure 8.11).

The tea box model comprises 61,206 vertices and 120,000 triangles, and its texture image has 575 × 750 pixels (see Figure 8.12). We use the centroid of each triangle as target vertex, adding up to 120,000 points. We mark some of those vertices as "forbidden", and hence we do not use them in our algorithms.

¹ Generously provided by Prof. T. Vetter and S. Rhomdani from Basel University.
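As an illustration of how the target points and their template values described above could be assembled from a triangle mesh, the following numpy sketch computes the per-triangle centroids and averages the vertex colours; it is a minimal example with made-up names, not the code used to build the models.

```python
import numpy as np

def triangle_centroids(vertices, faces):
    """Centroid of each triangle as the barycentric average of its three vertices.
    vertices: (V, 3) float array; faces: (F, 3) integer array of vertex indices."""
    return vertices[faces].mean(axis=1)          # (F, 3) centroid coordinates

def centroid_colours(vertex_colours, faces):
    """Template value of each centroid: average of the colours of its three vertices."""
    return vertex_colours[faces].mean(axis=1)    # (F, 3) rgb values
```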
  • 150. Figure 8.10: The cube model. (Left) The model is a textured triangle mesh. We have downsampled the triangle mesh for a proper visualization (blue lines). (Right) Image containing the texture map of the model. Figure 8.11: The face model. (Left) The model is a textured triangle mesh. We have downsampled the triangle mesh for a proper visualization (blue lines). (Right) Texture map of the model. The texture image is actually a cylindrical projection of the mesh colour data. Source: Data provided by T. Vetter and S. Rhondami, University of Basel. Experiments with cube model We generate 60, 000 synthetic experiments for a rotating cube model by using the procedure described in Section 8.3. Figure 8.6 shows a subset of selected samples from the aforementioned collection. We rotate the model around the Euler angles α, β, and γ, coupled with translations along the three axis tX, tY , and tZ. We show the ranges of the parameters that we have used to generate the experiments in Table 8.5. Notice that we include extreme rotations of the target to test the robustness of the algorithms. We register each target in the 60, 000 experiments by using the algorithms LK3DTM, HB3DTM, and HB3DRT. Table 8.6 shows the average initial reprojection error for each dataset and level of noise. As expected, the higher the noise the larger the reprojection error. We apply the algorithms to the generated 128
experiments and we show the results in Figures 8.13–8.18.

Figure 8.12: The tea box model. (Left) The model is a textured parallelepiped, representing a box of tea. We have downsampled the triangle mesh (blue lines) for a proper visualization. (Bottom) Texture image of the model. We obtained the texture by unfolding, then scanning, the actual tea box.

Table 8.5: Ranges of parameters for cube experiments.

  Dataset        α          β          γ          tx        ty       tz
  Registration
  DS1            0          0          0          0         0        0
  Tracking
  DS2            [0,72]     [0,72]     [0,72]     [0,20]    [0,10]   [0,10]
  DS3            [72,144]   [72,144]   [72,144]   [20,40]   [10,20]  [10,20]
  DS4            [144,216]  [144,216]  [144,216]  [40,60]   [30,30]  [30,30]
  DS5            [216,288]  [216,288]  [216,288]  [60,80]   [30,40]  [30,40]
  DS6            [288,360]  [288,360]  [288,360]  [80,100]  [40,50]  [40,50]

Table 8.6: Average reprojection error vs. noise for cube.

  σ     0.50  1.00  1.50  2.00  2.50  3.00   3.50   4.00   4.50   5.0
  DS1   1.56  3.17  4.60  6.38  7.98  9.41   10.98  12.49  14.30  15.69
  DS2   1.58  3.21  4.81  6.25  7.93  9.48   11.07  12.76  14.11  15.79
  DS3   1.79  3.47  5.07  6.80  8.74  10.39  11.92  14.42  15.35  16.70
  DS4   1.61  3.16  4.71  6.35  7.91  9.53   11.21  12.91  14.41  15.56
  DS5   1.79  3.86  5.10  7.02  8.76  10.74  12.57  13.89  15.17  16.97
  DS6   1.76  3.29  5.11  6.47  7.92  9.73   11.34  12.95  14.36  16.12
Figure 8.13: Results from Additive Rigid Dataset DS1 for cube. (Top-left) Accuracy plot: average reprojection error against noise standard deviation. (Top-right) Robustness plot: average frequency of convergence against noise. (Bottom-left) Efficiency plot: average convergence time against noise standard deviation. (Bottom-right) Efficiency plot: average number of iterations against noise.

Dataset DS1   This dataset contains the experiments closest to the registration problem (i.e., $\mu^*_{\mathrm{template}} = \mu_0$). We show the results in Figure 8.13. The three algorithms converge in every experiment. The accuracy results are similar for all the experiments, and the error remains roughly constant at around 1.5 pixels. However, the number of iterations differs: the HB3DRT algorithm iterates approximately twice as many times as the other algorithms—although its resulting optimization time is half that of LK3DTM due to a lower time per iteration.

Dataset DS2   We show the results for dataset DS2 in Figure 8.14. The accuracy plot shows that the average reprojection error has increased with respect to dataset DS1: about 0.5 pixels, roughly 25% more, for algorithms LK3DTM and HB3DTM; moreover, the reprojection error for HB3DRT is about 50% higher than for the remaining two algorithms. Results for percentage of convergence differ even more than for DS1: optimizations for both LK3DTM and HB3DTM converge in around 90% of the trials in the worst case, while convergence for algorithm HB3DRT drops to around 70%. As in dataset DS1, the number of iterations needed by HB3DRT to converge approximately doubles the iterations of LK3DTM and HB3DTM—again, algorithm HB3DRT is faster than LK3DTM
  • 153. due to a lower time per iteration. Figure 8.14: Results from dataset DS2 for cube. (Top-left) Accuracy plot: average reprojection error against noise standard deviation. (Top-right) Robustness plot: average frequency of convergence against noise. (Bottom-left) Efficiency plot: average conver- gence time against noise standard deviation. Efficiency plot: (Bottom-right) average number of iterations against noise. 131
  • 154. Datasets DS3 We show the results for dataset DS3 in Figure 8.15. The results for reprojection error, frequency of convergence and number of iterations are similar for algorithms LKSDTM and HB3DTM in dataset DS2. However, the results for algorithm HB3DRT significantly worsen: the average reprojection error lies in the range of 4.0 pixels, roughly a 100% more than for the other two algorithms; also, the convergence of algorithm HB3DRT monotonically decreases with the level of noise from 85% to a rough 25%. The number of iterations do not grow with respect to dataset DS2: the algorithm converges in less cases, but iterates fewer times for each convergence. However, for those cases in which HB3DRT converges, the convergence time is better than LK3DTM. Algorithm HB3DTM has the lowest convergence time. Figure 8.15: Results from dataset DS3 for cube. (Top-left) Accuracy plot: av- erage reprojection error against noise standard deviation. (Top-right) Robustness plot: average frequency of convergence against noise. (Bottom-left) Efficiency plot: aver- age convergence time against noise standard deviation.(Bottom-right) Efficiency plot: Average number of iterations against noise. 132
  • 155. Datasets DS4–DS6 We show the results for these datasets in Figures 8.16–8.18. The accuracy results are similar to previous datasets: the reprojection error for algorithm HB3DRT is higher than for algorithms LK3DRT and HB3DTM—although the differences are small for dataset DS4. Algorithm HB3DRT also converges less times than the other two algorithms: the frequency of convergence approximately ranges from 80% to 10%—although the frequency of convergence for dataset DS6 is slightly better than for datasets DS4 and DS5. The results for the number of iterations are similar to previous datasets: the number of iterations linearly grow with noise, al- though the results for algorithm HB3DRT double those from the other two algorithms. Figure 8.16: Results from dataset DS4 for cube. (Top-left) Accuracy plot: av- erage reprojection error against noise standard deviation. (Top-right) Robustness plot: average frequency of convergence against noise. (Bottom-left) Efficiency plot: aver- age convergence time against noise standard deviation.(Bottom-right) Efficiency plot: Average number of iterations against noise. 133
  • 156. Discussion The results clearly shows that those efficient algorithms that do not hold their requirements are not valid for tracking: the convergence of algorithm HB3DRT, which does not hold Requirement 1, is very poor—80% at ideal conditions— for datasets DS3-DS6—i.e., those that represent the tracking problem; however, the algorithm have good convergence for datasets DS1 and DS2—i.e., those datasets whose samples represent the registration problem. On the other hand, the efficient algorithm HB3DTM holds Requirement 1 and its results are valid for both registration and tracking: although the convergence of algorithm HB3DTM degrades from dataset DS1 to DS6, the optimizations converge more than 80% at the worst possible conditions. Moreover, notice that the results of algorithm HB3DTM are equivalent to those of algorithm LK3DTM—which does not assume any requirement. Thus, we conjecture that the degradation of the conver- gence is due to the difficulty on the experiments, not problems with the efficient approximation. Figure 8.17: Results from dataset DS5 for cube. (Top-left) Accuracy plot: average reprojection error against noise standard deviation. (Top-right) Robustness plot: average frequency of convergence against noise. (Bottom-left) Efficiency plot: average conver- gence time against noise standard deviation. Efficiency plot: (Bottom-right) average number of iterations against noise. 134
  • 157. Algorithms HB3DTM and LK3DTM show similar accuracy for all datasets. Although the average accuracy seems to be low—it ranges from 1.5 to 2.2 pixels —the results are consistent for the two algorithms. The accuracy results for algorithm HB3DRT are worse than those for the other two algorithms—even for those cases that it successfully converged. Thus, neglecting the Requirement 1 for HB algorithm has a direct impact in the accuracy of the optimization. Timing results are also affected by Requirement 1. Algorithm HB3DRT consis- tently iterates more times to converge than the other two algorithms: if Require- ment 1 does not hold, the efficient Jacobian is incorrectly approximated, and the successive iterations spend more times to reach the optimum. In summary, the satisfaction of Requirement 1 affects the convergence of al- gorithm HB and, to a lesser extent, the accuracy of the optimization. Thus, the Requirement 1 is mandatory to use the algorithm HB for tracking. Figure 8.18: Results from dataset DS6 for cube. (Top-left) Accuracy plot: average reprojection error against noise standard deviation. (Top-right) Robustness plot: average frequency of convergence against noise. (Bottom-left) Efficiency plot: average conver- gence time against noise standard deviation. Efficiency plot: (Bottom-right) average number of iterations against noise. 135
Experiments with tea box model   We also test the tea box model. The purpose of these experiments is to verify that our algorithm can track objects that rotate a full 360°. We show that our tracker naturally handles strong target rotations by unfolding the target texture.

Figure 8.19: tea box sequence. Selected frames (15–599) from the sequence generated for the target tea box. We show approximately one frame every fifteen. The target displays strong rotations around the three axes, and translations that take the target to the very borders of the image.
Experiments with the tea box model are different from the previous ones: we evaluate a continuous sequence of the target rotating and translating through the scene. We generate a 600-frame sequence in which we completely rotate the target around its vertical axis—i.e., β ∈ [0, 360]. Besides, we strongly rotate the target around the remaining axes—α ∈ [0, 120] and γ ∈ [0, 300]—and we move the target near the borders of the image (see Figure 8.19). We continuously track the object through the scene using algorithm HB3DTM. We show the results in Figure 8.20. The algorithm consistently recovers the target rotation and translation.

Figure 8.20: Results for the tea box sequence. We overlay the results of algorithm HB3DTM onto the frames from the tea-box sequence. For each frame we project the target vertices using the resulting parameters of the optimization. We represent these projections using blue dots.
  • 160. We confirm the results by plotting the estimated values along the frame number in Figure 8.21. Target rotation is accurately estimated: the three Euler angles from the estimation match the ground-truth values despite the extreme orientations. Also, translation parameters are correctly estimated. Figure 8.21: Estimated parameters from teabox sequence. (Top-row) Euler angles against frame number. ∆GT : Ground-truth Euler angles used to generate the sequence; ∆EST : Estimated Euler angles using shape-induced HB, for ∆ = {α, β, γ}. (Bottom-row) Translation parameters against frame number. tGT ∆ : Ground truth trans- lation parameters used to generated the sequence; tEST ∆ : Estimated translation parameters for ∆ = {α, β, γ}. 138
  • 161. Good Texture to Track The convergence of direct methods heavily depends on the texture of the target. Although the existence of texture corners is not strictly necessary for the tracking algorithms to work, the target texture influence the result [Benhimane et al., 2007]. The question now is: how do we classify a texture as good or bad? A clas- sical reference on the subject, [Shi and Tomasi, 1994], claims that high-frequency textures—i.e. targets with high contrast patterns, and clearly defined borders—are the most suitable for tracking/registration. [Shi and Tomasi, 1994] demonstrates their claims by tracking Harris corners features using LK; as high-frequency texture patterns provide more stable estimations of Harris corners, they allegedly should improve tracking. However, more recently [Benhimane et al., 2007] demonstrated that low-frequency textures—i.e. textures that change gradually, or do not have clearly defined borders— improve the convergence of gradient-based direct methods. They support their claims by showing that the solution to least-squares problem involving image gradi- ents is more accurate when the Jacobian is computed from smooth textures patterns. This assumption apparently conflicts with the idea that a high frequency texture pattern may provide a better estimation of the registration parameters. We study the relationship between convergence and texture in HB by performing an experiment with the face model. We compare the estimated parameters from HB by using the same structure but (1) the usual texture of the face model, and (2) a texture of Gaussian gradients similar to the cube model (see Figure 8.10). We intend to isolate the influence of the target texture on the accuracy of the estimation from other sources such as the target structure or kinematics. Notice that this experiment with the face model also proves that our proposed algorithm to track shape-induced homographies with HB can deal with 3D models more complicated than “boxes”— such as the cube and teabox models. We build up the experiment by rotating the face model 90o back and forth around the Y -axis, and estimating the parameters using HB—see Figure 8.23–(Top rows). We also modify the texture of the face model using a pattern of Gaussian gradients—see Figure 8.23–Bottom-rows—and we again estimate the motion pa- rameters using HB. The face model provides a high-frequency texture (specially in eyebrows, lips, eyes, and ears) whereas the Gaussian gradients provides by definition a low-frequency texture. Even more, Gaussian texture is uniformly distributed over the face, whereas eyes, lips, and brows are mainly visible in frontal views—i.e., the texture on the sides of the face is ill-conditioned. We also plot the ground-truth values that we use to generate the sequence back to back with the estimated values from the HB with the usual and the Gaussian textures. Figure 8.22 shows the values of the target rotational parameters—i.e. Euler angles α, β, and γ—for the ground-truth data (αGT ), the estimation from the face model texture (αTXT ), and the estimation from the Gaussian texture (αGSS ). 139
  • 162. Results from Figure 8.22 show that HB with the usual texture cannot deal with extreme out-of-plane rotations: the estimation loses track at β ≈ 60o and the ob- tained solution cannot provide a reliable initial estimation for the remaining frames. However, HB with a Gaussian gradient texture provide an accurate estimation for every frame, even for the very extreme value of β = ±90o ; this is possibly due to the fact that Gaussian texture is uniformly distributed all over the model. We conclude that texture is fundamental for an accurate estimation of the target motion: results may greatly diverge when using different textures for the same target structure. Furthermore, we show that the problem of tracking a human head under large rotations is rather difficult as the texture from the face at large rotations is not entirely suitable for gradient-based registration. Figure 8.22: Estimated parameters from face sequence. Euler angles for ground- truth and estimated values plotted against frame number. ∆GS: Ground-truth Euler angles used to generate the sequence; ∆TXT : Euler angles estimated from the usual tex- ture; ∆GSS: Euler angles estimated from the Gaussian gradients texture; for ∆ = α, β, γ. 140
Figure 8.23: Good texture vs. bad texture. Selected frames (1–190) of the face model with the usual texture (Top-rows) and with a Gaussian gradient texture (Bottom-rows). We project the target vertices onto each image using blue dots.
8.5.3 Experiments with Synthetic Nonrigid data

In this group of experiments we allow the target model to deform in addition to changing its position and orientation.

Target Model
We describe our target using a 3D morphable model (3DMM) [Blanz and Vetter, 2003]: the target comprises a set of linear deformation basis vectors (including the mean sample) and a texture image, both of them defined in the reference frame. The face-deform model contains 9 modes of deformation.

Figure 8.24: The face-deform model. Panels: mean $\bar{x}$ and deformation basis $b_1, \ldots, b_9$. Target deformation is encoded using the linear deformation basis $b_k$, $k = 1, \ldots, 9$, with mean $\bar{x}$ (cf. Equation 5.34).

We derive the face-deform model by adding 9 linear deformation basis vectors to the mean represented by the face model. Each basis vector—and the mean—comprises 5,866 vertices distributed in 11,344 triangles. The texture image is 500 × 500 pixels in size. We compute the deformation basis by applying PCA to a distribution of face meshes. This distribution results from deforming the face model triangle mesh using a muscle-based system [Parke and Waters, 1996]: we attach 18 parametric muscles to certain vertices of the mesh such that modifying the muscles actually deforms the mesh². We generate 475 keyframes encompassing different values of our synthetic muscles. Each keyframe provides a sample mesh for the PCA procedure that estimates the linear deformation model. We show the resulting model in Figure 8.24.

² J.M. Buenaposada kindly provided the muscle-based animation.
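A linear deformation basis like the one just described can be obtained by stacking the sample meshes and applying PCA. The sketch below shows one possible way to do it with numpy, assuming the 475 keyframe meshes are stored in a single array; the function name, array layout, and number of modes are illustrative only.

```python
import numpy as np

def pca_deformation_basis(meshes, n_modes=9):
    """Linear deformation model from sample meshes.
    meshes: (N, V, 3) array with N deformed copies of a V-vertex mesh.
    Returns the mean shape (V, 3) and n_modes basis vectors (n_modes, V, 3)."""
    n, v, _ = meshes.shape
    X = meshes.reshape(n, v * 3)                  # one row per sample mesh
    mean = X.mean(axis=0)
    _, _, Vt = np.linalg.svd(X - mean, full_matrices=False)
    basis = Vt[:n_modes]                          # principal deformation directions
    return mean.reshape(v, 3), basis.reshape(n_modes, v, 3)
```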
Figure 8.25: Distribution of Synthetic Datasets (rows labelled DS1–DS6). We select different samples from each dataset. Datasets range from DS1 (Top) to DS6 (Bottom), according to Table 8.7. The top-left image represents the position where we compute the Jacobian for efficient methods. Notice that the successive samples increasingly depart from this location.

Figure 8.25 shows 36 selected samples from a total of 60,000 for the face-deform model—6 random samples for each dataset. We randomly sample rotation and translation parameters from a uniform distribution defined on the ranges displayed in Table 8.7 (using the procedure that we described in Section 8.3). Besides pose and orientation, we also deform the model polygonal mesh. We randomly select one of the 475 meshes from the shape distribution, and we compute its corresponding vector of deformation parameters $c^*$. The corresponding initial guess is more involved to compute than those of pose and orientation: we must carefully choose the variance of the Gaussian noise so that the resulting $c_0$ produces a physically plausible shape. Thus, we compute the covariance matrix of the deformation coefficients, $\Lambda_c$, from the shape distribution. We generate the initial guess for deformation, $c_0$, by corrupting the ground-truth value with Gaussian noise, $c_0 \sim \mathcal{N}(c^*, \Lambda_c)$.
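Drawing the initial deformation guess from a Gaussian whose covariance matches the observed shape distribution can be written in a few lines. The sketch below assumes the deformation coefficients of the 475 sample meshes are stacked row-wise in an array; the names are again illustrative, not the thesis implementation.

```python
import numpy as np

def initial_deformation_guess(c_star, sample_coeffs, rng=None):
    """Initial guess c0 ~ N(c*, Lambda_c), with Lambda_c the covariance of the
    deformation coefficients observed in the shape distribution."""
    rng = np.random.default_rng() if rng is None else rng
    Lambda_c = np.cov(sample_coeffs, rowvar=False)   # sample_coeffs: (475, K)
    return rng.multivariate_normal(c_star, Lambda_c)
```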
Table 8.7: Ranges of parameters for face-deform experiments.

  Dataset        α        β        γ        tx       ty       tz
  Registration
  DS1            0        0        0        0        0        0
  Tracking
  DS2            [0,10]   [0,10]   [0,10]   [0,10]   [0,10]   [0,10]
  DS3            [10,20]  [10,20]  [10,20]  [10,20]  [10,20]  [10,20]
  DS4            [20,30]  [20,30]  [20,30]  [20,30]  [30,30]  [30,30]
  DS5            [30,40]  [30,40]  [30,40]  [30,30]  [30,30]  [30,30]
  DS6            [40,50]  [40,50]  [40,50]  [30,30]  [30,30]  [30,30]

We show the average initial reprojection error for the generated initializations in Table 8.8. Using these initialization values, we execute the optimization procedures for algorithms LK3DMM and HB3DMM. We plot the results in Figures 8.26–8.31.

Table 8.8: Average reprojection error vs. noise for face-deform.

  σ     0.50  1.00  1.50  2.00  2.50  3.00  3.50   4.00   4.50   5.0
  DS1   1.50  3.09  4.47  6.07  7.58  8.99  10.71  12.29  13.50  15.51
  DS2   1.54  2.95  4.62  6.08  7.63  9.14  10.41  12.18  13.60  15.05
  DS3   1.53  3.05  4.46  6.03  7.64  9.32  10.70  12.09  13.73  15.07
  DS4   1.53  2.98  4.59  6.11  7.50  9.20  10.55  12.01  13.64  15.47
  DS5   1.60  3.12  4.72  6.26  7.90  9.34  11.06  12.41  14.10  15.66
  DS6   1.58  3.08  4.71  6.25  7.76  9.18  11.09  12.74  13.85  15.63
  • 167. Dataset DS1 In this dataset we study those experiments corresponding to the registration problem—i.e. µJ = µ0. We show the results in Figure 8.26. Results for accuracy show that the reprojection error linearly grows with the noise variance, although the average reprojection error for LK algorithm is smaller than for HB. Re- sults for frequency of convergence shows that HB is less robust than LK: algorithm HB perfectly converges for low noise—i.e., σ ≤ 1.5—but the convergence monotoni- cally decreases for higher noise variance. On the other hand, HB algorithm is more efficient than LK: for those cases that converged, although both algorithms iterate the same, HB is almost 10 times faster than LK. Figure 8.26: Results from dataset DS1 for face-deform. (Top-left) Accuracy plot: average reprojection error against noise standard deviation. (Top-right) Robustness plot: average frequency of convergence against noise. (Bottom-left) Efficiency plot: average convergence time against noise standard deviation. Efficiency plot: (Bottom-right) average number of iterations against noise. 145
  • 168. Datasets DS2–DS6 These datasets contain those experiments that represent the tracking problems—i.e., those experiments that are not initialized in the position where the efficient Jacobian is computed. We show the results in Figures 8.27–8.31. As these results are similar for all the datasets, we summarize their interpretation in the following. Figure 8.27: Results from dataset DS2 for face-deform. (Top-left) Accuracy plot: average reprojection error against noise standard deviation. (Top-right) Robustness plot: average frequency of convergence against noise. (Bottom-left) Efficiency plot: average convergence time against noise standard deviation. Efficiency plot: (Bottom-right) average number of iterations against noise. 146
  • 169. LK is consistently more accurate for all datasets than HB algorithm: the differences in re-projection error are small for low noise; however, for high noise variance, HB almost doubles the re-projection error of LK. Figure 8.28: Results from dataset DS3 for face-deform. (Top-left) Accuracy plot: average reprojection error against noise standard deviation. (Top-right) Robustness plot: average frequency of convergence against noise. (Bottom-left) Efficiency plot: average convergence time against noise standard deviation. Efficiency plot: (Bottom-right) average number of iterations against noise. 147
  • 170. Results for frequency of convergence show that LK is more robust than HB: conver- gence for LK algorithm is around 100% for high noise variance, whereas convergence for HB ranges from 60% in DS2 to 10% in DS6. However, notice that the convergence for both algorithms is similar for low noise variance—i.e., σ ≤ 1.5. Figure 8.29: Results from dataset DS4 for face-deform. (Top-left) Accuracy plot: average reprojection error against noise standard deviation. (Top-right) Robustness plot: average frequency of convergence against noise. (Bottom-left) Efficiency plot: average convergence time against noise standard deviation. Efficiency plot: (Bottom-right) average number of iterations against noise. 148
  • 171. Efficiency comparisons are mostly favourable to HB algorithm: average time per iteration for HB is 0.023 seconds, and 0.178 seconds for LK. The lower time per iteration compensates the higher number of iterations of HB with respect to LK: even iterating a 60% more to converge, the total time for HB algorithm is typically seven times lower than for LK. Figure 8.30: Results from dataset DS5 for face-deform. (Top-left) Accuracy plot: average reprojection error against noise standard deviation. (Top-right) Robustness plot: average frequency of convergence against noise. (Bottom-left) Efficiency plot: average convergence time against noise standard deviation. Efficiency plot: (Bottom-right) average number of iterations against noise. 149
  • 172. Summarizing, these experiments confirm that LK algorithm is more robust than HB for the face-deform model. We conjecture the HB optimization with deformation parameters is more sensitive to noise than the HB algorithm for 6-dof tracking; we support this claim by confirming that the HB algorithm has perfect convergence for low noise in all datasets. On the other hand, HB is more efficient than LK for all the studied cases: on average, the HB algorithm is more than five-times faster than LK. Figure 8.31: Results from dataset DS6 for face-deform. (Top-left) Accuracy plot: average reprojection error against noise standard deviation. (Top-right) Robustness plot: average frequency of convergence against noise. (Bottom-left) Efficiency plot: average convergence time against noise standard deviation.(Bottom-right) Efficiency plot: Average number of iterations against noise. 150
8.5.4 Experiments With Nonrigid Sequence

Results from the previous section showed that the HB algorithm was less accurate and robust than LK in those cases where the optimum parameters were different from the initial guess of the optimization—i.e., for high levels of noise. At first glance, these results might suggest that the HB algorithm is not suitable for nonrigid tracking. We vindicate the HB algorithm by using a challenging experiment.

Figure 8.32: face-deform sequence. Selected frames (2–470) from the sequence generated for the target face-deform. We show approximately one frame every fifteen. The target displays strong rotations around the X- and Y-axes, and several deformations.
We generate a synthetic sequence using the face-deform model (see Figure 8.32). The sequence is 470 frames long, and it shows the model alternately rotating around the X- and Y-axes while performing several facial expressions: frowning, grimacing, grinning, raising the eyebrows, and opening the mouth.

Figure 8.33: Results from the face-deform sequence. Selected frames from the sequence face-deform processed using HB. Blue dots: the vertices of the model projected onto the image by using the estimated parameters. Pink dots: estimated projections of selected regions of the face: eyebrows, lips, jaw and nasolabial wrinkles.

We show the results from processing the sequence using HB in Figure 8.33. We overlay the model projection onto the frames using the estimated values; the results are visually accurate and no drift is noticeable. We confirm this perception by plotting the estimated values against the ground-truth values that we used to generate the sequence in Figure 8.34.
  • 175. Figure 8.34: (Top-row) Estimated vs. ground-truth rotation parameters. We denote ground-truth parameters as ∆GT , and estimated parameters as ∆EST , where ∆ = α, β, γ. (Bottom-row) Estimated vs. ground-truth deformation parameters. We show the pa- rameters corresponding to the first four basis of deformation (i.e. b1 to b4). We denote ground-truth parameters as cGT i , and estimated parameters as cEST i , i = 1, . . . , 4. The estimated results for rotation accurately match the ground-truth values. We also compare some of the deformation parameters: we select the parameters c1, . . . , c4, that is, the first four components of the basis of deformation b1 to b4. We choose these parameters because they collect more than 80% of the energy in the pca factorization. Quantitative results for deformation parameters seem less accurate than those for rigid motion. However, the impact of this estimation error in facial motion is small. 153
  • 176. 8.5.5 Experiments with real Rigid data This experiment demonstrates the suitability of our algorithm to track a rigid object in a real-world sequence. Real sequences are hard to process as the essential assump- tions on which we base our algorithms are not strictly hold: the bcc (Equation 3.1) is not an strict equality due to (1) lighting changes not included in the model, and (2) numerical inaccuracies induced by camera discretization, quantization noise, and aliasing. Target Model Besides, modelling and initializing the target is also less accurate than in the synthetic case. Even for a very simple target—e.g. the cube used in this experiment—it is difficult to establish a one-to-one correspondence between the brightness values of our synthetic model and the images of the target in the sequence. We accordingly require the target model to be (1) easy to be synthetically built, and (2) easy to put in correspondence with a single image of the sequence—initialization procedure. Based on these reasons we choose the target to be a textured cube. We build the cube by sticking six squares of rigid cardboard sheet onto a wood frame. Each cardboard side has a piece of textured fabric stuck on top of it. The advantage of using a fabric for the texture pattern is that the material does not produce specular highlights due to changes in the target orientation. We also stick a calibration pattern on one side of the cube: we shall use this pattern to put in correspondence the target image and the synthetic model. We also attach an aluminium rod to the wooden frame to serve as a handle to freely rotate and translate the target object. We display the target object in Figure 8.35. The cube-real model comprises Figure 8.35: The cube-real model. (Left) The actual target made of cardboard and fabric. The calibration pattern is used to initialize the registration algorithm. (Right) Texture map corresponding to the unfolded cube target. 118, 428 vertices arranged in 233, 496 triangles. We encode the unwrapped texture in a 564 × 690 rgb image (see Figure 8.35). As in the synthetic cube model, we 154
  • 177. mark the vertices at the border of each side of the cube as “forbidden”—these points shall not be considered into the optimization. Experiment Arrangement We capture a 470 frames sequence in which we rotate and translate the cube device using a handheld video camera (see Figure 8.36). We use a Panasonic NV-GS400 3CCD DV camera, and we disable features such as autofocus and automatic white balancing to avoid focal length or sudden lighting changes. We compute the camera intrinsic using a planar calibration pattern and Yves Bouguet’s Camera Calibration Toolbox, [Bouguet]. We capture the sequence by moving the cube across the scene using the rod handle. To demonstrate the advantages of using an unwrapped texture, we rotate and translate the cube in the three coordinate axis such that each side of the cube is visible in the sequence at least once. We deliberately rotate the cube by large amounts in the three axis to demonstrate the robustness of our method. The initial guess for the algorithm are the rotation and translation parameters such that the associated projection of the model is perfectly aligned with the first image of the sequence. We compute such initialization by using Bouguet’s calibra- tion program with the calibration pattern of the cube. We compute the off-line matrices of algorithm HB3DTM with the initialization values (see Algorithm 7). Results of the experiment We show the results in Figure 8.37. For each frame of the sequence we obtain a rotation matrix and a translation vector as a result of the algorithm HB3DTM. We normalize the brightness values of each side of the cube—in both the recovered pixels from the sequence and the texture image—to minimize the effects of illumination changes. We project the edges of the cube onto the image by using a projection matrix assembled from these parameters (see Figure 8.37). The algorithm accurately com- putes the position and orientation of the target; although the estimation degenerates in some frames—e.g., frames 40, 160, or 380—the algorithm is able to produce ac- curate solutions for most of them. Besides, the algorithm is also able to recover from erroneous estimations in those frames where the target motion was too fast or abrupt—see e.g., frames 260 or 380 in Figure 8.37. Note that we may improve the performance of the algorithm by using a Pyramid-base optimization [Bouguet, 2000]; however, we choose not to use such approach to better analyze the behaviour of the algorithm. 155
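The per-side brightness normalization mentioned above is not fully specified in the text; one common choice that removes a per-side affine illumination change is a zero-mean, unit-variance normalization, sketched below with hypothetical names. This is an assumption made for illustration, not the exact procedure used in the experiments.

```python
import numpy as np

def normalize_side(values, eps=1e-8):
    """Zero-mean, unit-variance normalization of the brightness values of one cube side.
    Assumed normalization, shown only to illustrate the idea."""
    v = np.asarray(values, dtype=float)
    return (v - v.mean()) / (v.std() + eps)

# Applied to both the pixels recovered from the current frame and the corresponding
# template values before computing the residual for that side:
# r_side = normalize_side(frame_samples) - normalize_side(template_samples)
```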
Figure 8.36: The cube-real sequence. Selected frames (2–470) from the sequence cube-real. We translate the target whilst it rotates around the axis defined by the handle rod. This rotational motion involves the three axes of rotation of the object. Finally, when a substantial portion of the target is no longer visible in the image—i.e., the target leaves the camera field of view—the algorithm stops its execution.
Figure 8.37: Results from the cube-real sequence. Selected frames (2–470) from the sequence cube-real processed using HB (Algorithm 7). For each frame we compute the rotation matrix and the translation vector that best register the model to the image intensity values. We use these parameters to project the cube wireframe model onto the image (blue lines).
  • 180. Figure 8.38: Selected facial scans used to build the model. Each scan is a three- dimensional textured mesh that represents a facial expression. These 3D meshes are computed from three views of the subject by using reconstruction algorithms based on structured light. 8.5.6 Experiment with real Nonrigid data In this experiment we show the performance of algorithm HB3DMM for registering a deforming human face as it changes its expression. The algorithm faces the same challenges as for the real rigid case—that is, deviations between the sequence and the model caused by illumination or camera quantization—plus those derived from the nonrigid nature of the target: it is remarkably difficult to accurately model the deformations of the nonrigid target. Target Model The face model should capture the maximum variability of the target structure for each facial expression—joy, disgust, fear, etc. We use PCA to provide a set of deformation basis that represent the information contained in facial motion. Professor Thomas Vetter and his team provided us with a complete face model with expressions3 . The model was built from 88 structured light 3D scans of the author’s face performing different expressions—joy, sadness, surprise, anger, winking, etc—as shown in Figure 8.38. These scans are aligned into a common reference system by using a semi-automatic procedure that uses manually selected face landmarks. The basis of deformation are computed by applying a PCA proce- 3 The author gratefully thanks Pascal Paysan and Brian Amberg for the scanning session and the construction of the models. 158
  • 181. Figure 8.39: Unfolded texture model. (Left) Spherical coordinates (θ, φ, ρ) are used to project the 3D shape onto a bidimensional reference space (θ, φ). (Right) The template of the registration algorithm comprises the actual rgb values of the target texture projected on the reference space (θ, φ). dure to the aligned scans. The mean of the resulting PCA model comprises 97, 577 vertices arranged in 195, 040 triangles. The basis of deformation are the 88 principal components computed from PCA. The original model has a number of physical details such as tongue, eyeballs and eye sockets, the back of the head and neck. As we do not need such details for our algorithms, we strip them down by deleting their corresponding meshes from the model. We project the model colour by using cylindrical projection to render the texture onto the reference space (see Figure 8.39). Experiment Arrangement We capture a sequence of the face performing both rigid and nonrigid motion: the rigid motion consists of rotating the head out of the image plane about 60o , and then rotating the head inside the image plane; the nonrigid motion consists on the face opening the mouth and raising eyebrows (see Figure 8.40). We capture the 210 frames long sequence by using a Panasonic DV handheld camera mounted onto a fixed tripod. We illuminate the scene by using two halogen lamps located above and below the face. We do not use any facial make- up nor special lighting to avoid specular spots on the target. We provide an initial guess for the registration algorithm by fitting the morphable model to the first image of the sequence: we compute the rotation and translation of the morphable model such that the differences between its projected texture and the frame brightness are minimal. We ease the procedure by defining anchor points, that is, correspondences between some vertices of the model and some pixels on the initial frame. We define a total of 18 anchor points including the corners of the eyes and mouth, nostrils, tip 159
Figure 8.40: The face-real sequence. Selected frames (1–206) from the sequence of the face performing rigid and nonrigid motion. The head rotates to its left around its vertical axis, then nods from left to right, and finally the mouth opens and the forehead wrinkles as the eyebrows rise.
  • 183. Figure 8.41: Anchor points in the model. We plot the anchor points used to manually fit the model to the image (blue circles) on both the reference template (Left) and the model shape (Right). of the nose, and tips of the ears (see Figure 8.41). Moreover, illumination conditions drastically change the brightness of the face image with respect to the texture of the morphable model. Light sources—i.e., the ceiling lamp—create specular highlights on a glossy surface such as human skin, and non-cast shadows as parts of the face are more illuminated than others. However, the texture of the morphable model represents the light reflected by the skin surface, so shadows are not considered. We modify the texture colour of the morphable model by using the illumination basis provided by spherical harmonics [Basri and Jacobs, 2001; Ramamoorthi, 2002]. Using these illumination basis we adjust the texture of the morphable model to be as similar as possible to the pixel values of the projected face. Our fitting process also considers visibility issues on the morphable model. We remove from the model those areas which may be problematic for the optimization such as the lower jaw, the back of the head, the neck, the ear canal, and the nostrils. We also remove the eyes as winking produces a sudden change of texture in those areas. We consider the aforementioned issues into our fitting process by using non- linear reweighted least-squares. The result is a projection matrix that best fits the morphable model to the image frame. Results of the Experiment We iteratively apply Algorithm 7 to the frames of the sequence. We show the results in Figure 8.42. The algorithm accurately recovers the face rotation around Y -axis ; notice that we have restricted the rotation to ≈ 60 degrees, as the experiment in Section 8.5.2 suggested that face texture may 161
Figure 8.42: Results for the face-real sequence. We show selected frames (1–206) from the face-real sequence processed with the HB algorithm (see Algorithm 7, page 64). For each frame of the sequence, the algorithm computes a rotation matrix, a translation vector and a set of deformation coefficients. We use these parameters to project the shape model onto the image—blue dots.
degenerate for strong rotations. Nonetheless, the algorithm is able to track the morphable model. Finally, the algorithm recovers the nonrigid motion of the mouth and forehead due to the change of expression. In summary, the algorithm is able to recover the face motion and deformation with small drift.

8.6 Compositional Algorithms

In this section we present the experiments that we conducted to evaluate compositional registration algorithms. We organize this section as follows: we introduce the experimental hypotheses, together with the algorithms, in Section 8.6.1, and we perform synthetic experiments involving a rigid 3D target in Section 8.6.2.

8.6.1 Experimental Hypotheses

As for the additive approach, we select some compositional algorithms to validate the experimental hypotheses. Again, we verify whether those algorithms that fulfil Requirements 2 and 3 have better convergence than those that do not. We select the following compositional algorithms: Inverse Compositional Homography (ICH8), Generalized Inverse Compositional homography (GICH8), Inverse Compositional plane-induced homography (ICH6), Generalized Inverse Compositional plane-induced homography (GICH6), Inverse Compositional rigid body transformation (IC3DRT), and Forward Compositional plane+parallax plane-induced homography (FCH6PP). For the sake of comparison we also evaluate two additive LK algorithms: one for 8-dof homographies and one for 6-dof homographies. We also include algorithm HB3DTM (see Table 8.4) to compare timing and convergence results. We summarize the evaluated algorithms in Table 8.9.

Experiments with the EFC algorithm   Note that we do not include the EFC algorithm in the comparison. The reason is that the numerical results of EFC and IC are exactly identical in all the experiments—as we theoretically demonstrated in Chapter 6; thus, we do not plot the results of EFC for ease of visualization.

We test the compositional algorithms from Table 8.9 to validate the following hypothesis: the convergence of compositional algorithms depends on compliance with their requirements. If the requirements hold, then the convergence shall be good; otherwise, problems in convergence shall arise.

8.6.2 Experiments with Synthetic Rigid data

This set of experiments studies the convergence of the evaluated algorithms in a controlled environment. The synthetic datasets provide us with precise measurements on the outcomes of the algorithms.
Table 8.9: Evaluated Compositional Algorithms

  Algorithm  Warp                  Update         Constant   Req. 2  Req. 3
  LKH8P      8-dof homography      Additive       No         —       —
  ICH8       8-dof homography      Compositional  Yes        YES     YES
  GICH8      8-dof homography      Additive       Partially  YES     YES
  LKH6       6-dof homography      Additive       No         —       —
  ICH6       6-dof homography      Compositional  Yes        NO      YES
  GICH6      6-dof homography      Compositional  Yes        NO      YES
  FCH6PP     6-dof Plane+Parallax  Compositional  No         YES     —
  IC3DRT     6-dof in R3           Compositional  Yes        YES     NO
  HB3DTM     6-dof homography      Additive       Partially  NO      YES(1)

  (1) Strictly speaking, HB3DTM does not hold Requirement 3, but Requirement 1; cf. Section 6.2.2, Page 85.

Target Model   For our experiments with compositional algorithms we use a textured model of a 3D plane. The key issue about using a simple plane is that it can be registered using either 2D or 3D warps. The plane model comprises 10,201 vertices organized in 20,000 triangles (see Figure 8.43). We represent the target texture as a collection of RGB triplets, one per vertex. The texture depicts four Gaussian-based gradient patterns: smooth gradients are more suitable for direct image registration than high-frequency textures [Benhimane et al., 2007].

Figure 8.43: The plane model. (Left) The model is a textured triangle mesh. Blue lines: triangle mesh, downsampled here for a proper visualization. (Right) Planar texture map of the model. The texture represents four gradient patterns based on a Gaussian distribution.
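The thesis does not specify the plane texture beyond "four Gaussian-based gradient patterns", so the following sketch only illustrates how such a smooth, low-frequency texture could be generated; the blob layout, the sigma value, and the function name are assumptions.

```python
import numpy as np

def gaussian_gradient_texture(size=512, sigma=0.12,
                              centers=((0.25, 0.25), (0.25, 0.75),
                                       (0.75, 0.25), (0.75, 0.75))):
    """Smooth texture made of four Gaussian gradient blobs; values in [0, 1]."""
    ys, xs = np.mgrid[0:size, 0:size] / float(size)
    tex = np.zeros((size, size))
    for cy, cx in centers:
        tex += np.exp(-((xs - cx) ** 2 + (ys - cy) ** 2) / (2.0 * sigma ** 2))
    return tex / tex.max()
```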
Experiments with plane model   We generate a collection of experiments using the procedure described in Section 8.3. We show the ranges of motion parameters that we use to generate the datasets in Table 8.10.

Table 8.10: Ranges of motion parameters for each dataset.

  Dataset        α        β        γ        tx       ty       tz
  Registration
  DS1            0        0        0        0        0        0
  Tracking
  DS2            [0,10]   [0,10]   [0,10]   [0,10]   [0,10]   [0,10]
  DS3            [10,20]  [10,20]  [10,20]  [10,20]  [10,20]  [10,20]
  DS4            [20,30]  [20,30]  [20,30]  [20,30]  [20,30]  [20,30]
  DS5            [30,40]  [30,40]  [30,40]  [20,30]  [20,30]  [20,30]
  DS6            [40,50]  [40,50]  [40,50]  [20,30]  [20,30]  [20,30]

Figure 8.44: Distribution of Synthetic Datasets (rows labelled DS1–DS6). We select different samples from each dataset. Datasets range from DS1 (Top) to DS6 (Bottom), according to Table 8.10. The top-left image represents the position where we compute the Jacobian for efficient methods. Notice that the successive samples increasingly depart from this location.
Table 8.11: Average reprojection error vs. noise for plane.

  σ     0.50  1.00  1.50  2.00   2.50   3.00   3.50   4.00   4.50   5.0
  DS1   2.43  4.79  7.19  9.71   11.91  14.64  16.37  19.21  21.09  23.61
  DS2   2.45  4.70  7.46  9.45   11.86  14.51  16.53  18.72  21.37  24.21
  DS3   2.43  4.81  7.21  9.49   11.87  14.43  16.84  18.99  21.27  23.60
  DS4   2.34  4.77  7.08  9.70   11.94  13.75  16.06  18.90  21.21  24.09
  DS5   2.55  4.96  7.51  10.04  12.68  15.36  17.40  20.02  22.58  24.98
  DS6   2.50  4.97  7.20  9.81   12.04  15.20  16.43  20.46  22.67  24.23

As for the experiments with the face model, we show the average initial reprojection error in Table 8.11. As expected, the higher the noise, the larger the reprojection error. Moreover, for the same level of noise the average reprojection error is approximately similar among datasets. We present the results of the experiments in the following.

Dataset DS1   We present the results for dataset DS1 in Figure 8.45. The reprojection error plot shows that all the involved algorithms have similar accuracy. The frequency of convergence plot shows that algorithms with a constant Jacobian—the IC- and GIC-based algorithms—converge less often than algorithms that recompute the Jacobian, such as FC, LK, or HB: for noise with standard deviation larger than σ = 3.5 (≈ 16 pixels of error), compositional methods converge below 80% of the time. Moreover, the convergence of those algorithms that do not hold Requirement 2—i.e., ICH6 and GICH6—is up to 20 percent worse than that of the other compositional algorithms. However, algorithm IC3DRT—which does not hold Requirement 3—has better convergence than algorithms ICH8 and GICH8, which hold both requirements. The relationship between timing results and noise behaves as expected: as the initialization error increases, the algorithms need to iterate more times to reach the optimum. Absolute timing results show that inverse compositional methods are faster than forward compositional or LK-based methods. However, the inverse compositional algorithms that do not satisfy Requirement 2—ICH6 and GICH6—or Requirement 3—IC3DRT—spend more iterations than those that hold the requirements: GICH6 iterates up to twice as many times as FCH6PP and ICH8, whereas the number of iterations for IC3DRT is just slightly higher. The plots also show an interesting result: LK-based methods iterate more times than compositional methods, especially in low-noise cases. We conjecture that, for homography-based tracking, composition is more natural and better conditioned. Thus, the descent directions of the Jacobian for additive algorithms show a 'zigzag' pattern near the optimum, and the algorithm needs more iterations to compute a local minimum with the desired accuracy. This is a known problem of gradient-descent methods on poorly conditioned convex problems. Notice that LKH8 needs more iterations because its search space has more dimensions than that of LKH6.
  • 189. Figure 8.45: Results from dataset DS1 for plane. (Top-left) Accuracy plot: av- erage reprojection error against noise standard deviation. (Top-right) Robustness plot: average frequency of convergence against noise. (Bottom-left) Efficiency plot: aver- age convergence time against noise standard deviation.(Bottom-right) Efficiency plot: Average number of iterations against noise. Dataset DS2 We present the results for dataset DS2 in Figure 8.46. All the algorithms show similar accuracy for those cases that converged—the maximum divergence is under 0.005. The frequency of convergence decreases with the noise variance for those algorithms that efficiently compute the Jacobian. The convergence worsens in those algorithms that do not hold Requirement 2, specially algorithm GICH6: for values of noise above σ = 4.0, the algorithm converges between 40–20% of trials against 60–40% in DS1. Convergence of ICH6, IC3DRT, ICH8, and GICH8 are similar to those in DS1. However, the timing results for LK algorithms are more coherent than those in DS1: computation time linearly grows with the noise variance. 167
  • 190. Figure 8.46: Results from dataset DS2 for plane. (Top-left) Accuracy plot: av- erage reprojection error against noise standard deviation. (Top-right) Robustness plot: average frequency of convergence against noise. (Bottom-left) Efficiency plot: aver- age convergence time against noise standard deviation.(Bottom-right) Efficiency plot: Average number of iterations against noise. Dataset DS3 Figure 8.47 shows the results for dataset DS3. Again, the accuracy of the algorithms is very similar for those cases that converged. The frequency of convergence decreases with the noise variance, and shows noticeable differences between those algorithms that verify Requirements 2 and 3 and those that do not. Algorithm GICH6 practically do not converge for noise above σ = 4.0, and the convergence of algorithms ICH6 and IC3DRT decreases up to 40%. The convergence of the remaining algorithms is very similar to previous datasets. Even worse, for those cases where GICH6 converged, the optimization reached the maximum number of iterations on average. Timing results are consistent with previous datasets. 168
  • 191. Figure 8.47: Results from dataset DS3 for plane. (Top-left) Accuracy plot: av- erage reprojection error against noise standard deviation. (Top-right) Robustness plot: average frequency of convergence against noise. (Bottom-left) Efficiency plot: aver- age convergence time against noise standard deviation.(Bottom-right) Efficiency plot: Average number of iterations against noise. Dataset DS4 We present the results for dataset DS4 in Figure 8.48. The plot shows the gap between those algorithms that hold the requirements and those that do not. We leave algorithm GICH6 out of the comparison as its convergence quickly degrades: the algorithm converges only 10% of the times for σ = 0.5 and does not converge for any trial with noise σ ≥ 3.5. Algorithm ICH6 converges less than half of the times than in previous datasets—less than 20% in DS4 against around a 40% in DS1–DS3. Convergence results for algorithm IC3DRT range from 80 to 15%. For these two algorithms, even for noise with variance σ = 0.5, the frequency of convergence is approximately of 80% in contrast to a 100% of convergence for the remaining algorithms. Accuracy results are worse than those of datasets DS1-DS3, although the results are similar among all the algorithms, with ICH6 and IC3DRT showing the lowest reprojection error. Notice that this estimation is not accurate as the reprojection error is computed using 10% of the samples, instead of 100% in case of LK algorithm. The convergence problems also reflects in the number of iterations: algorithms ICH6 and IC3DRT iterate three times more than the other algorithms. 169
  • 192. Figure 8.48: Results from dataset DS4 for plane. (Top-left) Accuracy plot: av- erage reprojection error against noise standard deviation. (Top-right) Robustness plot: average frequency of convergence against noise. (Bottom-left) Efficiency plot: aver- age convergence time against noise standard deviation.(Bottom-right) Efficiency plot: Average number of iterations against noise. Dataset DS5 We present the results for dataset DS5 in Figure 8.49. We leave algorithm GICH6 out of the comparison as it has bad convergence: the convergence is below 5% for the lowest noise, and it does not converge at all for σ ≥ 2.5. The convergence of algorithm IC3DRT (40 to 3% of the optimizations successfully converge) is approximately the half of dataset DS4—convergence is 80 to 15%, cf. Figure 8.48. Also, the convergence of algorithm ICH6 degrades even more: the range of convergence reduces from 80–15 percent for dataset DS4 to 10 to 1 percent for dataset DS5. Moreover, the convergence of all algorithms decreases in this dataset— even LK and FC algorithms have worst performances than in previous datasets. The cause may be the difficulties in registering a plane that is skewed with respect to the camera (see Figure 8.44). Notice that the number of iterations the algorithms need to converge accordingly increase. Accuracy is also affected: the reprojection error increases more than a 100 percent with respect to dataset DS4. 170
  • 193. Figure 8.49: Results from dataset DS5 for plane. (Top-left) Accuracy plot: av- erage reprojection error against noise standard deviation. (Top-right) Robustness plot: average frequency of convergence against noise. (Bottom-left) Efficiency plot: aver- age convergence time against noise standard deviation.(Bottom-right) Efficiency plot: Average number of iterations against noise. Dataset DS6 Figure 8.50 shows the results from the experiments for dataset DS6. We discard algorithms ICH6 and GICH6 as they do not converge for any of the 60, 000 experiments—i.e., their reprojection error is always above 5.0 pixels. We also leave algorithm IC3DRT out of the comparison as it convergence ranges from 3% for the lowest noise values, and it does not converge at all for noise σ ≥ 2.0. The frequency of convergence of the remaining algorithms is worse than in DS5 due to larger rotations of the plane with respect to the camera—see Figure 8.44. The accuracy of all the algorithms are similar to the results in dataset DS5. However, the timing results are worse: the number of iterations—together with the computation time—increase with respect to previous datasets. The larger the noise variance, the higher the number of iterations that the algorithm requires to converge. 171
• 194. Figure 8.50: Results from dataset DS6 for plane. (Top-left) Accuracy plot: average reprojection error against noise standard deviation. (Top-right) Robustness plot: average frequency of convergence against noise. (Bottom-left) Efficiency plot: average convergence time against noise standard deviation. (Bottom-right) Efficiency plot: average number of iterations against noise. Summarizing, all algorithms show similar accuracy, although those with fewer parameters—e.g., LKH6 and FCH6PP—have lower reprojection error. Efficient algorithms have lower convergence rates than algorithms that recompute the Jacobian at each iteration. Moreover, those algorithms that meet neither the compositional requirements—i.e., ICH6 and GICH6—nor the gradient requirements—i.e., IC3DRT—systematically have lower convergence rates than those that hold Requirements 2 and 3. Finally, algorithms that recompute the Jacobian, such as LK or FC, have higher convergence time than the others; although algorithm FC iterates few times, it has the highest time per iteration—cf. Figure 8.51. 172
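To make the summary plots above concrete, the sketch below shows one way the per-trial outcomes of such an experiment could be aggregated into the four reported curves (accuracy, robustness, time, and iterations per noise level). It is only an illustration: the trial-record layout, the field names, and the use of the 5-pixel reprojection-error threshold as the convergence criterion are assumptions made for the example, not the actual evaluation code of the thesis.

```python
from collections import defaultdict

# Hypothetical per-trial records: (noise_sigma, reprojection_error, time_s, iterations).
# In the experiments, each (dataset, algorithm, sigma) cell would contain many such trials.
trials = [
    (0.5, 0.8, 0.012, 6), (0.5, 7.3, 0.034, 25),   # the second trial diverged
    (1.0, 1.4, 0.015, 8), (1.0, 2.1, 0.018, 9),
]

CONVERGENCE_THRESHOLD = 5.0  # pixels; a trial counts as converged below this error (assumed)

def summarize(trials):
    """Aggregate trials into the four curves: accuracy, robustness, time, iterations."""
    by_sigma = defaultdict(list)
    for sigma, err, t, it in trials:
        by_sigma[sigma].append((err, t, it))
    curves = {}
    for sigma, rows in sorted(by_sigma.items()):
        converged = [r for r in rows if r[0] < CONVERGENCE_THRESHOLD]
        curves[sigma] = {
            "frequency_of_convergence": len(converged) / len(rows),
            # accuracy/efficiency averaged over converged trials only (an assumption)
            "avg_reprojection_error": (sum(r[0] for r in converged) / len(converged)
                                       if converged else float("nan")),
            "avg_time": (sum(r[1] for r in converged) / len(converged)
                         if converged else float("nan")),
            "avg_iterations": (sum(r[2] for r in converged) / len(converged)
                               if converged else float("nan")),
        }
    return curves

print(summarize(trials))
```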
• 195. 8.7 Discussion This section draws some conclusions from the results of the experiments with the cube, face-deform and plane models. Comparison to Previous Works The proposed experiments are related to relevant previous works such as [Baker and Matthews, 2004; Brooks and Arbel, 2010]. These works only consider the registration problem, whereas we have also studied the tracking problem. Typically, the template is a fixed square region of the image where the Jacobian is computed—that is, the quadrangle is the projection of the target at µ∗ = µJ. Then, the initial guess for the optimization is computed by perturbing the corners of the quadrilateral using Gaussian noise. [Baker and Matthews, 2004] modifies the corner locations of the ground truth quadrangle with noise of up to σ = 10, which approximately represents an upper-bound error in corner location of 30 pixels (≈ 3σ). [Brooks and Arbel, 2010] uses a similar procedure, although the initial reprojection error ranges from 2 to 80 pixels. In our experiments the Gaussian noise directly modifies the parameters µ∗, not the target projections; thus, the noise level σ is not equivalent between the two methodologies. Nonetheless, we may still compare both methods in image coordinate space: our experiments have an average error of 25 pixels for the highest noise values—cf. Table 8.11. By considering multiple datasets, we also study those cases where µ∗ may differ from µJ—i.e., the tracking case. Results show that algorithms that do not hold their corresponding warp requirements are not suitable for tracking but may be eligible for image registration. Moreover, results also show that efficient algorithms with constant Jacobian have worse convergence than those that recompute the Jacobian at each iteration. Using an experimental methodology similar to that in [Baker and Matthews, 2004; Brooks and Arbel, 2010] would not allow us to obtain such a result. Convergence of Algorithm HB Depends upon Gradient Equivalence Algorithm HB approximates the actual Jacobian computed at each frame by another one that is semi-constant. Requirement 1 states that the quality of the Jacobian approximation is directly related to the GEE: we use the GEE to build the semi-constant Jacobian that speeds up the algorithm. Thus, if the GEE is satisfied, the approximated Jacobian shall be identical to the true one, and the convergence shall not be affected. However, if the GEE does not hold, the approximation may induce errors during the optimization, thus leading to poor convergence. Algorithm HB3DRT does not hold Requirement 1, and its convergence deteriorates as a result: the convergence is good for dataset DS1, but it gradually degrades as the noise and the difference between µJ and µ0 increase—e.g., for datasets DS3–DS6 algorithm HB3DRT converges between approximately 80 and 15 percent of the time. On the other hand, algorithm HB3DTM—which holds Requirement 1—converges at least 80% of the time in the worst case—i.e., DS6 with noise σ = 5.0. Thus, Requirement 1 is directly related to the convergence of HB-based algorithms. 173
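The difference between the two initialization schemes discussed above—perturbing the corners of the ground-truth quadrangle, as in [Baker and Matthews, 2004], versus perturbing the parameters µ∗ directly—can be illustrated with a small sketch. The 8-dof parameter layout, the quadrangle size, and the helper function below are hypothetical choices made for the example only, not the code used in the thesis.

```python
import numpy as np

rng = np.random.default_rng(0)

def homography_from_corners(src, dst):
    """Fit H (with h33 = 1) mapping 4 source corners to 4 destination corners (basic DLT)."""
    A, b = [], []
    for (x, y), (u, v) in zip(src, dst):
        A.append([x, y, 1, 0, 0, 0, -u * x, -u * y]); b.append(u)
        A.append([0, 0, 0, x, y, 1, -v * x, -v * y]); b.append(v)
    h = np.linalg.solve(np.asarray(A, float), np.asarray(b, float))
    return np.append(h, 1.0).reshape(3, 3)

# Ground-truth 8-dof homography parameters (row-major, h33 fixed to 1) -- assumed layout.
mu_star = np.array([1.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0])
corners = np.array([[0, 0], [100, 0], [100, 100], [0, 100]], float)
sigma = 2.0

# (a) Corner-perturbation scheme (Baker & Matthews style): add noise to the projected
#     corners and refit the homography that maps the template corners onto them.
H_star = np.append(mu_star, 1.0).reshape(3, 3)
proj = (H_star @ np.c_[corners, np.ones(4)].T).T
proj = proj[:, :2] / proj[:, 2:3]
H0_corner = homography_from_corners(corners, proj + rng.normal(0, sigma, proj.shape))

# (b) Parameter-perturbation scheme (used in these experiments): add noise to mu* itself.
mu0_param = mu_star + rng.normal(0, sigma, mu_star.shape)

print(H0_corner)
print(mu0_param)
```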
• 196. Convergence of Algorithm IC Depends upon Gradient Equivalence and Composition Algorithm IC approximates the actual Jacobian by another one which is efficiently computed. As in the HB case, the quality of the approximation directly depends on the compliance of Requirements 2 and 3: if both Requirements hold, then the approximation shall be accurate, and the optimization shall successfully converge. We confirm these hypotheses using the results from the experiments. Algorithms ICH8 and GICH8 hold both requirements and have similar behaviour. Accuracy is good for all datasets, and the convergence is medium for all datasets except DS6, for which it is bad. However, these results are consistently better than those of the algorithms that do not hold the requirements, such as IC3DRT, ICH6 and GICH6. Algorithms ICH6 and GICH6 only hold Requirement 3, as plane-induced homographies are not closed under composition. Notice that, although the parameter update for GIC-based algorithms is additive and not compositional, we require a compositional warp in GICH6 by construction (the warp composition is used in the change of variables, see Section 6.1.1). Accuracy is good for both algorithms on all datasets, but the convergence differs between the two: convergence for ICH6 is medium-bad, whereas the convergence of GICH6 is bad for all datasets. Even for the registration case—dataset DS1—the convergence is borderline bad when high initialization noise is present (see Figure 8.45). Results show that the convergence of the IC and GIC algorithms is noticeably worse for the plane-induced homography than for the 8-dof homography. This result demonstrates that compliance with the warp requirements determines the convergence of the algorithm. Besides, there is a noticeable difference between the results for IC and GIC that may be explained by the numerical approximation to the matrix ∇µψ(µt) of Equation 6.32—results in Figure 8.45 for low levels of noise appear to be good. Algorithm IC3DRT holds only Requirement 2, as rigid body transformations in R3 do not verify the GEE. The algorithm shows good accuracy for those cases that converge. The convergence of the algorithm is medium for datasets DS1 and DS2: the results are similar to those of algorithms ICH8 and GICH8. However, the results for datasets DS3 and DS4 are significantly worse, and the algorithm practically does not converge for datasets DS5 and DS6; hence, IC3DRT is eligible for registration but not for tracking. Finally, we examine algorithm FCH6PP, which only holds Requirement 3: plane+parallax-induced homographies do not verify the GEE (see Table 4.1). However, as FC only requires the warp to be closed under composition, results for FCH6PP show good convergence and accuracy for all datasets. Moreover, FC converges in fewer iterations than the equivalent LK for the plane-induced homography—although the total optimization time is higher. We speculate that, for this model, composition is more natural and better conditioned than the usual GN approach. Requirements Determine Behaviour for Efficient Algorithms We have shown that the performance of efficient algorithms depends upon the compliance of their requirements—Requirement 1 for the HB algorithm, and Requirements 2 and 3 174
• 197. for IC and GIC. We have also made a distinction between registration and tracking depending on the parameters µ0 and µJ (cf. Section 8.1). For registration, the compliance of the requirements is not a determining factor: algorithms ICH6 and GICH6 have good convergence for datasets DS1 and DS2 (at least for low noise) even when they do not hold Requirement 2—cf. Figures 8.45 and 8.46. Moreover, algorithm IC3DRT does not hold Requirement 3, and its convergence results are even better than those for algorithms ICH6 and GICH6—cf. Figures 8.45 and 8.46. Additive algorithms show similar performance: algorithm HB3DRT does not hold Requirement 1, but has good convergence for datasets DS1 and DS2—cf. Figures 8.13 and 8.14. For tracking, the compliance of the requirements is fundamental for a proper convergence of the algorithms: ICH6 and GICH6 (which do not hold Requirement 2) and algorithm IC3DRT (which does not hold Requirement 1) have bad convergence for datasets DS3 to DS6—i.e., those that represent the tracking problem, cf. Figures 8.47–8.50. The convergence especially degrades for dataset DS6, where almost none of the optimizations successfully converged. The additive algorithms behave similarly: HB3DRT does not hold Requirement 1 and its convergence quickly degrades through datasets DS3 to DS6—cf. Figures 8.15–8.18. However, algorithms that hold their requirements—such as ICH8, GICH8, and HB3DTM—have better convergence, even for the challenging dataset DS6. Efficient algorithms greatly depend on an approximation to the GN Jacobian matrix: we substitute the actual Jacobian J(µt) by one fixed at µJ, J(µJ), or we approximate J(µt) by computing J(µt) = SM(µt). In registration problems we assume that µJ ≡ µt, so the approximated Jacobian is similar to the actual one; but in tracking problems the difference between µJ and µt may be arbitrarily large. The requirements determine the accuracy of the approximation to the actual Jacobian. If the requirements do not hold, the approximated Jacobian is still similar to the actual one in the registration case—i.e., J(µJ) ≡ J(µt); however, the approximated Jacobian for the tracking case is quite different when the requirements do not hold, which results in degraded convergence for those cases. Timing Results Comparison Figure 8.51 shows the time per iteration for the algorithms used in the compositional experiments. For each algorithm we average the time per iteration over all 60,000 experiments. Results show that the IC algorithm is undoubtedly the fastest, with GIC ranking second by a small margin. The latter is a direct consequence of updating the constant IC Jacobian. The LK algorithms are about four times slower than their compositional counterparts—either 8 or 6-dof—as computing the Jacobian in each iteration is a costly process. Finally, FC is the slowest of the reviewed algorithms. The explanation is simple: the Jacobian computation involves more operations because the warp composition appears inside the function being differentiated. We have also included the HB algorithm for the sake of comparison. Timing results for HB are comparable to those of the GIC algorithm: about twice the time per iteration of IC but still much faster than LK or FC. Notice that the final computation time of an algorithm depends on (1) the number of iterations of the optimization 175
• 198. loop, and (2) the time per iteration. Figure 8.51: Average time per iteration. Colour bars show the average optimization time over 60,000 experiments for each algorithm. Comparison Between IC and HB We now compare the two efficient registration techniques reviewed in this chapter. For a fairer comparison, we compare algorithms that hold their respective requirements: ICH8 verifies both the GEE and the warp composition, whereas HB3DTM holds the GEE. We evaluate two different homographic warps to register the plane target: an 8-dof homography and a 6-dof plane-induced homography. This may make a difference in the accuracy or efficiency measures, as the number of parameters changes the shape of the optimization surface, but it should not affect convergence. Convergence for algorithm HB3DTM is good for datasets DS1–DS3 and medium for the rest, but the algorithm always converges at least 40% of the time. HB3DTM consistently performs better than ICH8 for all datasets, with the difference peaking at 15–20% for noise σ = 5.0. These results demonstrate that (1) HB is more robust than IC—convergence is better as the noise increases—and (2) HB is less local than IC—convergence is better for the later datasets. 176
• 199. Efficiency is closely contested, although HB performs slightly better in both total time and number of iterations. Nonetheless, IC beats HB in time per iteration—0.009 vs. 0.016 seconds, cf. Figure 8.51. Algorithm GIC has the same Requirements as IC Results also show that the behaviour of algorithms IC and GIC is exactly the same: the convergence of both algorithms is good if Requirements 2 and 3 hold. Note that, although the GIC algorithm additively updates the parameters, results from algorithm GICH6 show poor convergence as the warp is not closed under composition—i.e., Requirement 3 does not hold. This may seem contradictory—an additive update needs a compositional warp—however, it is a direct consequence of the IC change of variables that is the basis of the GIC algorithm (cf. Section 6.3.1, Equation 6.26). Thus, if the warp does not hold Requirement 3, the change of variables of Equation 6.26 is no longer valid, and the algorithm would optimize an erroneous cost function. Nonetheless, it may be interesting to study the impact of using optimization methods such as BFGS and Newton-Raphson when the requirements are not fulfilled. On the Robustness of Efficient Algorithms Results show that efficient algorithms—i.e., IC and HB—are less robust than LK or FC, even if their corresponding requirements hold. In theory, the compliance of the requirements indicates that the efficient approximations of the Jacobian—constant Jacobian for IC and factorized Jacobian for HB—are accurate enough to provide a successful optimization. However, the Jacobian matrix of both efficient algorithms is still an approximation to the actual gradient. Thus, when the initial guess is not within a small neighbourhood of the actual optimum, the procedure with the approximated Jacobian converges less often. We examine the robustness of the efficient algorithms by analyzing Figures 8.45–8.50. Results for algorithm ICH8 show that the frequency of convergence decreases as the initialization noise increases. Moreover, the convergence worsens for the successive datasets, as the initialization is increasingly different from the reference template. Algorithm HB3DTM shows a similar behaviour, although its frequency of convergence is better than in the ICH8 case. However, both algorithms display a worse frequency of convergence than LK due to the Jacobian approximation: over 90% for LK against 65–25% for ICH8 and 80–38% for HB3DTM. We have tested the HB algorithm in both rigid and nonrigid real-world sequences. Results for both sequences are similar: they show good convergence, although the estimation of the parameters sometimes lacks accuracy (cf. Figures 8.42 and 8.37). We may explain the accuracy problems by analyzing the behaviour of the algorithm with respect to noise: the HB algorithm is sensitive to the initialization. Moreover, scene illumination poses a challenge to the brightness constancy assumption, which results in inaccurate estimation of the target position or orientation. A Proper Target Texture is Critical for Convergence Section Good Texture to Track demonstrated that the convergence of the HB algorithm heavily depends on 177
• 200. the texture of the target (cf. Section 8.5.2–Page 139). The texture of the face model is especially troublesome in the case of rotations greater than 60° (cf. Figure 8.22). This may explain the lack of robustness in the face-deform sequence (see Figures 8.26–8.29). 178
• 201. Chapter 9 Conclusions and Future Work This chapter summarizes the contributions of the thesis and outlines feasible lines of future work. 9.1 Summary of Contributions We highlight the following contributions of the thesis: Survey of Existing Methods We have analysed the existing additive and compositional image registration algorithms in depth. Gradient Equivalence Equation We have introduced the GEE as a differential extension to the bcc. The GEE is crucial for the proper convergence of efficient registration algorithms. Fundamental Requirements for Convergence We have proposed requirements on the motion model that efficient image registration algorithms must satisfy to guarantee accurate results. Motion warps have different requirements depending on the approach to image registration: Requirement 1 for additive algorithms, and Requirements 2 and 3 for compositional approaches. Distinction Between Registration and Tracking We have introduced a differentiation between registration and tracking for efficient algorithms: those efficient algorithms that do not hold their requirements are valid for registration but not for tracking. Systematic Factorization Framework We have introduced lemmas and theorems that systematize the factorization stage of the HB algorithm. Efficient 3D Tracking using HB We have proposed two homography-based warps for tracking 3D targets by using the HB algorithm: the shape-induced homography represents the rigid motion of a triangle mesh, and the nonrigid shape-induced homography models both the rigid and nonrigid motion of 179
• 202. a deforming 3D target. We have provided the HB factorization schemes for both warps by using the proposed systematic factorization procedure. Efficient Forward Compositional Algorithm We have introduced a new compositional algorithm for image registration, the EFC, which is equivalent to the IC. The EFC algorithm provides a new interpretation of IC which clearly explains the change of roles between target and template derivatives. Moreover, the EFC does not require the warping function to be invertible. 9.2 Conclusions We summarize some thoughts from the results of the experiments in the following paragraphs. Handbook of Motion Warps We introduce a classification of motion warps in terms of their suitability for efficient image registration/tracking. We gather information from Tables 4.1 and 6.1 to build the following classification—see Table 9.1: • Warps in R2 Motion warps in R2—affine, rotation-translation-scale, and homography—are suitable for every image registration algorithm. • Warps in P2 Homographies in P2 are suitable warps for every image registration algorithm—efficient or not. However, 6-dof homographic warps—such as plane-induced and shape-induced homographies—do not comply with Requirement 3—i.e., these warps do not form a group. Thus, 6-dof homographies are not eligible for compositional algorithms. On the other hand, Plane+Parallax homographies do form a group, but they do not hold the GEE; thus, the Plane+Parallax homography is not eligible for efficient algorithms—either additive or compositional. • Warps in R3 General rigid body motion does not hold the GEE—cf. Requirements 3 and 1; therefore, warps in R3 are generally not eligible for efficient algorithms. Handbook of Efficient Algorithms We provide a classification of efficient image registration algorithms in terms of the problems they are suited to solve. Figure 9.1 shows the qualitative features of algorithms LK, IC, HB, FC, and GIC; we estimate this qualitative information from the quantitative outcomes of the experiments in Chapter 8. Using the values depicted in Figure 9.1 we infer the following classification: The most complete: The Lucas-Kanade algorithm is the one that achieves the best marks on almost every feature: it is the most robust and accurate, and it can be applied to any differentiable warp function. Also, it can be robustly used for both registration and tracking. However, its poor efficiency renders the algorithm unusable for real-time applications. 180
• 203. Table 9.1: Classification of Motion Warps.

Motion Warp                  Space   LK   FC   HB   IC   GIC
Affine                       R2      ✔    ✔    ✔    ✔    ✔
Homography                   R2      ✔    ✔    ✔    ✔    ✔
Homography                   P2      ✔    ✔    ✔    ✔    ✔
Plane-induced Homography     P2      ✔    ✘    ✔    ✘    ✘
Shape-induced Homography     P2      ✔    ✘    ✔    ✘    ✘
Plane+Parallax Homography    P2      ✔    ✔    ✘    ✘    ✘
Rigid Body                   R3      ✔    ✔    ✘    ✘    ✘
Camera Rotation              R3      ✔    ✔    ✔    ✔    ✔

The most efficient: Inverse Compositional and Generalized Inverse Compositional are the fastest image registration algorithms around. If the warp function complies with the requirements of Chapter 6, these algorithms offer good convergence at the best speed. However, even in these cases, algorithms with a constant Jacobian lack robustness. Moreover, the IC algorithm cannot track nonplanar 3D targets—but it may handle 3D objects in registration cases. The most balanced: The Hager-Belhumeur algorithm has a perfect trade-off between speed and accuracy: it is not as accurate and robust as LK, but it is obviously far more efficient; on the other hand, although HB is not as efficient as IC or GIC, it is more robust and can be applied to a wider range of motion warps—especially the plane-induced and shape-induced homographies for 3D tracking. Moreover, it converges better than IC for both registration and tracking problems. Registration is not Tracking This thesis emphasizes the differences between registration and tracking: in registration the initial guess in the optimization space is close to the point where we compute the gradient, whereas in tracking the distance in optimization space between the initial guess and the gradient parameters can be arbitrarily large. We empirically show that efficient algorithms that do not hold their requirements are suitable for registration but not for tracking. This restriction is obvious for algorithms with a constant Jacobian such as inverse compositional and its extensions. The critical distinction between registration and tracking has been discussed here for the first time. Hence, inverse compositional algorithms would have good convergence when the parameters of the target are similar to those of the reference image; however, the algorithm may drift—and accumulate reprojection error—as the target parameters diverge from the reference ones. 181
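The registration-versus-tracking distinction above ultimately reduces to where the Gauss-Newton Jacobian is evaluated. The following schematic sketch contrasts the two strategies; residual() and jacobian() are placeholder callables, the additive update is a simplification, and the loop is not the implementation of any specific algorithm in the thesis.

```python
import numpy as np

def register(residual, jacobian, mu0, mu_J=None, n_iters=30):
    """Generic Gauss-Newton registration loop.

    If mu_J is given, the Jacobian is evaluated once at mu_J and kept constant
    (IC/efficient style); otherwise it is recomputed at the current estimate on
    every iteration (LK/FC style).
    """
    mu = np.asarray(mu0, dtype=float)
    J_const = jacobian(mu_J) if mu_J is not None else None
    for _ in range(n_iters):
        r = residual(mu)                               # brightness error at current parameters
        J = J_const if J_const is not None else jacobian(mu)
        delta = np.linalg.solve(J.T @ J, -J.T @ r)     # normal equations
        mu = mu + delta                                # additive update (compositional in IC/FC)
        if np.linalg.norm(delta) < 1e-6:
            break
    return mu

# In registration, mu0 is close to mu_J, so J(mu_J) is a good surrogate for J(mu).
# In tracking, mu can drift arbitrarily far from mu_J and the constant Jacobian degrades.
```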
• 204. Figure 9.1: Spiderweb Plots for Several Image Registration Algorithms. (Top-right) Lucas-Kanade algorithm. (Top-left) Inverse Compositional algorithm. (Middle-left) Hager-Belhumeur algorithm. (Middle-right) Generalized Inverse Compositional algorithm. (Bottom) Forward Compositional algorithm. Legend: (A) Accuracy, (L) Localness, (G) Generality, (E) Efficiency, and (R) Robustness. 182
• 205. This behaviour is clearly shown when tracking faces using AAMs [Matthews and Baker, 2004]: the IC algorithm for AAMs does not hold Requirement 2 as AAMs do not form a group. The algorithm has good convergence if the face does not move and is frontal to the camera—i.e., the parameters of the face are similar to those used to compute the constant Jacobian; the convergence is also good when the face changes expression but is still frontal to the camera and centred; however, the convergence decreases when the face translates away from its central position. 3D Tracking Implies Non-Constant Jacobian By definition, a 3D target is not totally visible from a single view—except, e.g., if the target is a plane. Efficient algorithms like IC cannot precompute the Jacobian and the Hessian of 3D targets: the IC algorithm computes the Jacobian at those target points visible at µJ. However, some points appear and others disappear due to self-occlusions and the relative motion between the target and the camera; in these cases, the Jacobian must be partially recomputed to handle the newly visible points. Also notice that the efficiency of the IC algorithm greatly diminishes when the Jacobian is not constant (cf. Table 7.15–Page 106). In Defence of Additive Algorithms The Inverse Compositional algorithm has been synonymous with efficient registration/tracking since it was first published in [Baker and Matthews, 2001]. The rise of IC brought the fall of the existing efficient methods, especially additive ones such as [Hager and Belhumeur, 1998]. In this thesis we vindicate HB as the most balanced registration/tracking algorithm, as: 1. HB can handle a wider range of motion warps: HB is able to track 3D rigid and deformable objects (see Table 9.1), which is not possible when using the IC algorithm—we have shown that IC is not correct for 3D tracking in Section 4.2.2. 2. HB performs roughly as efficiently as IC when the Jacobian must be recomputed—as in the case of 3D tracking, cf. Table 7.15–Page 106. 9.3 Future Work We also suggest the following lines of investigation for future improvements of the algorithms. Illumination Model In this thesis we have assumed that the BCC is purely Lambertian—i.e., the texture of the target does not depend on either its position or its orientation. However, to be physically more accurate, the BCC should account for attached shadows: changes in texture due to the relative orientation of the target and the light source—i.e., side-lighting causes some facets of the object to be more illuminated than others. We use spherical harmonics [Basri and Jacobs, 2003; Ramamoorthi, 2002] to model attached shadows. We display the spherical harmonics 183
• 206. of the model face in Figure 9.2. Figure 9.2: Spherical Harmonics-based Illumination Model. We propose to handle the changes in illumination due to orientation by augmenting the BCC as follows: $\sum_{i=0}^{9} B_i[x] = I_t[f(x; \mu_t)], \ \forall x \in X$, where $B_0 = T$, and $B_i : \mathbb{R}^2 \to \mathbb{R}$ is the bi-dimensional brightness function corresponding to the i-th spherical harmonic basis computed from [Basri and Jacobs, 2003]. This equation may also be factorized as the usual bcc using the techniques proposed in the thesis—a similar problem involving the 2D homographic case was solved in [Buenaposada et al., 2009]. Combine Texture and Edges In this thesis we have only used texture information to perform the image registration. However, we could improve the registration/tracking by using features other than texture, such as edges [Decarlo and Metaxas, 2000; Marchand et al., 1999; Masson et al., 2003; Vacchetti et al., 2004], or even illumination cues [Lagger et al., 2008; Romdhani and Vetter, 2003]. Besides, we could devise a factorization to include these terms using the techniques proposed in the thesis. Multi-view Registration/Tracking This thesis has presented only monocular tracking procedures. However, we can extend some of the procedures to work in multi-view environments. Using multiple cameras we could (1) estimate the parameters more robustly than by using just one of them, and (2) extend the field of 184
• 207. view as several cameras may capture more information from the object (see Figure 9.4). Figure 9.3: Tracking by simultaneously using texture and edge information. Moreover, multi-view tracking using HB is still efficient. Let P1, . . . , Pv be v distinct cameras that capture a single target (see Figure 9.4). We assume that the target is expressed in the scene coordinate system; hence, the motion parameters are independent of each camera. We set up the following equation: $\begin{bmatrix} S_1^\top \\ \vdots \\ S_v^\top \end{bmatrix} M = \begin{bmatrix} e_1 \\ \vdots \\ e_v \end{bmatrix}$, (9.1) where S1, . . . , Sv and e1, . . . , ev are the constant factorization matrices and the error vectors that depend on the cameras P1, . . . , Pv. Notice that the matrix M, which depends on the target motion, is common to all the views. Regularization of Gradient Equivalence Although fast, IC is especially limited in the choice of motion warp: no warp involving nonplanar 3D motion is allowed (cf. Table 9.1). IC places strict constraints on the target motion to (1) allow composition, and (2) hold the gradient equivalence. The non-compliance of the motion warp with either of these requirements leads to a poor convergence of the algorithm. 185
• 208. Figure 9.4: Efficient tracking using multiple views. Nonetheless, we can achieve a good convergence even when the requirements are not met. [Amberg and Vetter, 2009] improves IC registration of AAMs—which do not form a group—by using regularization. We could use a similar technique to enhance the convergence of IC for (1) those warps that do not hold Requirement 2—such as the plane-induced homography, [Cobzas et al., 2009]—or (2) those warps that do not hold Requirements 1 or 3—such as the rigid body transformation, [Muñoz et al., 2005]. Quantization of Parameter Stability In Chapter 8 we showed that the convergence of efficient algorithms depends upon holding certain requirements. However, how does the algorithm behave when the requirements are not satisfied? We have only verified this result experimentally. We propose to analytically study the convergence of the algorithms when using an inaccurate approximation to the Jacobian matrix. The idea is to compute some statistics on the results of the optimization—e.g., confidence intervals on each parameter, convergence or accuracy measures, etc.—given numerical or analytic information about the approximated Jacobian: for example, we would like to know for which ranges of 6-dof parameters in R3 the IC algorithm for the rigid body transformation converges more than 90% of the time. Automatic Factorization In Chapter 5 and Appendix D we introduced lemmas and rules to systematically solve the factorization problem: we have demonstrated that—under certain assumptions—the factorization is feasible. However, we have not demonstrated that the obtained factorization is the most efficient possible; recall that factorization is similar to the chain matrix multiplication problem (cf. Section 7.1.3). We propose to build an automatic procedure to compute the factorization: the input would be a chain of matrix operations, and the output would be 186
• 209. another chain of matrix operations whose matrices are clearly separated; the resulting chain of matrices would be such that the number of operations is minimal. The optimum order of operations can be computed using dynamic programming guided by the proposed rules of factorization. Alternative Computation of the Brightness Error Function In this thesis we have posed the registration/tracking problem as the minimization of the quadratic error function of the brightness differences between the template and the image—usually known as the Sum of Squared Differences or SSD. However, there are alternatives, such as error norms in the Fourier domain [Navarathna et al., 2011] or maximizing the correlation of image gradient orientations [Tzimiropoulos et al., 2011]. We may improve the robustness of the HB algorithm by deriving factorization methods for such norms. We may even go a step further by using Discriminative Tracking [Avidan, 2001; Liu, 2007; Lucey, 2008; Wang et al., 2010]: we maximize the classification score of the image of the target instead of minimizing the SSD error norm. We may search for those parameters that best categorize the target region in the “well-aligned” class—as opposed to the “badly-aligned” class. We propose to speed up existing discriminative tracking techniques by using factorization methods. 187
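As a concrete reference for the Automatic Factorization idea above, the classic dynamic-programming solution to the matrix-chain multiplication problem (cf. [Cormen et al., 2001]) is sketched below. It only optimizes the parenthesization of a plain matrix product, so it should be read as a starting point for the proposed procedure, not as the procedure itself; the example dimensions are hypothetical.

```python
def matrix_chain_order(dims):
    """Optimal parenthesization of A1*A2*...*An, where Ai has shape dims[i-1] x dims[i].

    Returns (minimal number of scalar multiplications, split table).
    """
    n = len(dims) - 1                      # number of matrices in the chain
    cost = [[0] * (n + 1) for _ in range(n + 1)]
    split = [[0] * (n + 1) for _ in range(n + 1)]
    for length in range(2, n + 1):         # length of the sub-chain
        for i in range(1, n - length + 2):
            j = i + length - 1
            cost[i][j] = float("inf")
            for k in range(i, j):          # split point: (Ai..Ak)(Ak+1..Aj)
                q = cost[i][k] + cost[k + 1][j] + dims[i - 1] * dims[k] * dims[j]
                if q < cost[i][j]:
                    cost[i][j], split[i][j] = q, k
    return cost[1][n], split

# Hypothetical example: for matrices of sizes 1000x8, 8x1000 and 1000x1,
# the multiplication order changes the cost by orders of magnitude.
best, _ = matrix_chain_order([1000, 8, 1000, 1])
print(best)   # scalar multiplications for the best ordering
```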
  • 211. Bibliography Amberg, B. and Vetter, T. (2009). On compositional imge alignment, with an application to active appearance models. In Proc. of CVPR. An, K. H. and Chung, M. J. (2008). 3d head tracking and pose-robust 2d tex- ture map-based face recognition using a simple ellipsoid model. In IEEE/RSJ International Conference on Intelligent Robots and Systems,IROS. Averbuch, A. and Keller, Y. (2002). Fast motion estimation using bidirectional gradient methods. In Proc. International Conference on Acoustics, Speech, and Signal Processing. Avidan, S. (2001). Support vector tracking. In IEEE Trans. on Pattern Analysis and Machine Intelligence, pages 184–191. B. Tordoff, W. M., de Campos, T., and Murray, D. (2002). Head pose estimation for wearable robot control. In Proc 13th British Machine Vision Conference, Cardiff, September 2002, volume 2, pages 807–816. Baker, S. and Matthews, I. (2001). Equivalence and efficiency of image alignment algorithms. In Proc. of CVPR, volume 1, pages 1090–1097. IEEE. Baker, S. and Matthews, I. (2004). Lucas-kanade 20 years on: A unifiying framework. International Journal of Computer Vision, 56(3):221–255. Baker, S., Matthews, I., Xiao, J., Gross, R., Kanade, T., and Ishikawa, T. (2004a). Real-time non-rigid driver head tracking for driver mental state estimation. In 11th World Congress on Intelligent Transportation Systems. Baker, S., Patil, R., Cheung, G., and Matthews, I. (2004b). Lucas-kanade 20 years on: Part 5. Technical Report CMU-RI-TR-04-64, Robotics Institute, Carnegie Mellon University, Pittsburgh, PA. Bartoli, A. (2008). Groupwise geometric and photometric direct image registration. IEEE Transactions on Pattern Analysis and Machine Intelligence, 30(12):2098 – 2108. 189
  • 212. Bartoli, A., Hartley, R., and Kahl, F. (2003). Motion from 3d line correspondences: Linear and non-linear solutions. In Proceedings of the IEEE Conference on Com- puter Vision and Pattern Recognition, Madison, Wisconsin, USA, pages 477–484. IEEE CSP. Bartoli, A. and Zisserman, A. (2004). Direct estimation of non-rigid registration. In In British Machine Vision Conference. Basri, R. and Jacobs, D. W. (2001). Lambertian reflectance and linear subspaces. In Proc. of ICCV, volume 2, pages 383–390. Basri, R. and Jacobs, D. W. (2003). Lambertian reflectance and linear subspaces. IEEE Transactions on Pattern Analysis and Machine Intelligence, 25(2):218–233. Basu, S., Essa, I., and Pentland, A. (1996). Motion regularization for model-based head tracking. In ICPR ’96: Proceedings of the International Conference on Pattern Recognition (ICPR ’96) Volume III-Volume 7276, page 611, Washington, DC, USA. IEEE Computer Society. Bay, H., Ess, A., Tuytelaars, T., and Van Gool, L. (2008). Speeded-up robust features (surf). Computer Vision and Image Understanding, 110(3):346 – 359. Benhimane, S., Ladikos, A., Lepetit, V., and Navab, N. (2007). Linear and quadratic subsets for template-based tracking. In Proc. of CVPR. Benhimane, S. and Malis, E. (2007). Homography-based 2d visual tracking and servoing. International Jounal of Robotics Research, 26(7):661–676. Bickel, B., Botsch, M., Angst, R., Matusik, W., Otaduy, M., Pfister, H., and Gross, M. (2007). Multi-scale capture of facial geometry and motion. ACM Trans. Graph., 26(3):33. Black, M. J. and Jepson, A. D. (1998). Eigentracking: Robust matching and tracking of articulated objects using a view-based representation. International Journal of Computer Vision, 26(1):63–84. Black, M. J. and Yacoob, Y. (1997). Recognizing facial expressions in image se- quences using local parameterized models of image motion. International Journal of Computer Vision, 25(1):23–48. Blanz, V. and Vetter, T. (1999). A morphable model for the synthesis of 3d faces. In Proc. of SIGGRAPH, pages 187–194. ACM Press. Blanz, V. and Vetter, T. (2003). Face recognition based on fitting a 3d morphable model. IEEE Transactions on Pattern Analysis and Machine Intelligence, 25(9):1– 12. Bouguet, J. Y. Camera calibration toolbox for matlab. 190
  • 213. Bouguet, J.-Y. (2000). Pyramidal implementation of the lucas kanade feature tracker. Intel Corporation, Microprocessor Research Labs. Bowden, R., Mitchell, T. A., and Sharhadi, M. (2000). Non-linear statistical mod- els for the 3d reconstruction of human pose and motion from monocular image sequences. Image and Vision Computing, 9(18):729–737. Brand, M. (2001). Morphable 3d models from video. Computer Vision and Pattern Recognition, IEEE Computer Society Conference on, 2:456+. Brand, M. and R.Bhotika (2001). Flexible flow for 3d nonrigid tracking and shape recovery. In IEEE Computer Society Conference on Computer vision and Pattern Recognition (CVPR), volume 1, pages 315–322. Bregler, C., Hertzmann, A., and Biermann, H. (2000). Recovering non-rigid 3d shape from image streams. In Proc. of CVPR, pages 690–696. Brooks, R. and Arbel, T. (2010). Generalizing inverse compositional and esm image alignment. International Journal of Computer Vision, 11(87):191–212. Brown, L. G. (1992). A survey of image registration techniques. ACM Comput. Surv., 24:325–376. Brunet, F., Bartoli, A., Navab, N., and Malgouyres, R. (2009). Nurbs warps. In British Machine Vision Conference (BMVC), London. Buenaposada, J., Mu˜noz, E., and Baumela, L. (2009). Efficient illumination indepen- dent appearance-based face tracking. Image and Vision Computing, 27(5):560– 578. Buenaposada, J. M. and Baumela, L. (1999). Seguimiento robusto del rostro humano mediante visi’on computacional. In Proc. Conferencia Asociacion Espa˜nola para la Inteligencia Artificial, volume I, pages 48–53. AEPIA. Buenaposada, J. M. and Baumela, L. (2002). Real-time tracking and estimation of plane pose. In Proc. of ICPR, volume II, pages 697–700, Quebec, Canada. IEEE. Buenaposada, J. M., Mu˜noz, E., and Baumela, L. (2004). Efficient appearance- based tracking. In Proc. CVPR-Workshop on Nonrigid and Articulated Motion. IEEE. Capel, D. (2004). Image Mosaicing and Super-Resolution (Cphc/Bcs Distinguished Dissertations.). SpringerVerlag. Caspi, Y. and Irani, M. (2002). Spatio-temporal alignment of sequences. IEEE Trans. Pattern Anal. Mach. Intell., 24(11):1409–1424. 191
  • 214. Chen, C.-W. and Wang, C.-C. (2008). 3d active appearance model for aligning faces in 2d images. In IEEE/RSJ International Conference on Robots and Systems (IROS), Nice, France. Choi, S. and Kim, D. (2008). Robust head tracking using 3d ellipsoidal head model in particle filter. Pattern Recogn., 41(9):2901–2915. Cipolla, R. and Drummond, T. W. (1999). Real-time tracking of complex structures with on-line camera calibration. In In Proceedings of the 10th British Machine Vision Conference, BMVC, Nottingham, UK. Claus, D. and Fitzgibbon, A. W. (2005). A rational function lens distortion model for general cameras. In CVPR ’05: Proceedings of the 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR’05) - Volume 1, pages 213–219, Washington, DC, USA. IEEE Computer Society. Cobzas, D., Jagersand, M., and Sturm, P. (2009). 3d ssd tracking with estimated 3d planes. Image and Vision Computing, 27(1-2):69–79. Comaniciu, D., Ramesh, V., and Meer, P. (2000). Real-tiem tracking of non-rigid objects using mean shift. In Proc. of CVPR, pages 142–149. IEEE. Cootes, T., Edwards, G., and Taylor, C. (2001). Active appearance models. IEEE Transactions on Pattern Analysis and Machine Intelligence, 23(6):681–685. Cormen, T. H., Stein, C., Rivest, R. L., and Leiserson, C. E. (2001). Introduction to Algorithms. McGraw-Hill Higher Education, 2nd edition. Decarlo, D. and Metaxas, D. (2000). Optical flow constraints on deformable models with applications to face tracking. International Journal of Computer Vision, 38(2):99–127. Del Bue, A. (2010). Adaptive metric registration of 3d models to non-rigid image trajectories. In Daniilidis, K., Maragos, P., and Paragios, N., editors, 11th Euro- pean Conference on Computer Vision (ECCV 2010), Crete, Greece, volume 6313 of Lecture Notes in Computer Science, pages 87–100. Springer. Del Bue, A., Smeraldi, F., and Agapito, L. (2004). Non-rigid structure from mo- tion using non-parametric tracking and non-linear optimization. In Proc. CVPR- Workshop on Nonrigid and Articulated Motion, volume 1. IEEE. Dementhon, D. F. and Davis, L. S. (1995). Model-based object pose in 25 lines of code. Int. J. Comput. Vision, 15(1-2):123–141. Devernay, F., Mateus, D., and Guilbert, M. (2006). Multi-camera scene flow by tracking 3-d points and surfels. In Proc. of CVPR, volume II, pages 2203– 2212. 192
  • 215. Donner, R., Reiter, M., Langs, G., Peloschek, P., and Bischof, H. (2006). Fast active appearance models search using canonical correlation analysis. IEEE Transactions on Pattern Analysis and Machine Intelligence, 28(10):1690–1694. Dornaika, F. and Ahlberg, J. (2006). Fitting 3d face models for tracking and active appearance model training. Image and Vision Computing, 24:1010–1024. Dowson, N. and Bowden, R. (2008). Mutual information for lucas-kanade tracking (milk): An inverse compositional formulation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 30(1):180–185. Drummond, T. and Cipolla, R. (2002). Real-time visual tracking of complex struc- tures. IEEE Trans. Pattern Anal. Mach. Intell., 24(7):932–946. Faggian, N., Paplinski, A. P., and Sherrah, J. (2006). Active appearance models for automatic fitting of 3d morphable models. In AVSS ’06: Proceedings of the IEEE International Conference on Video and Signal Based Surveillance, page 90, Washington, DC, USA. IEEE Computer Society. Gay-Bellile, V., Bartoli, A., and Sayd, P. (2010). Direct estimation of nonrigid registrations with image-based self-occlusion reasoning. IEEE Transactions on Pattern Analysis and Machine Intelligence, 32(1):87–104. Gleicher, M. (1997). Projective registration with difference decomposition. In Proc. of CVPR, pages 331–337. Golub, G. H. and Van Loan, C. F. (1996). Matrix Computations (Johns Hopkins Studies in Mathematical Sciences)(3rd Edition). The Johns Hopkins University Press, 3rd edition. Gonzalez-Mora, J., Guil, N., and De la Torre, F. (2009). Efficient image alignment using linear appearance models. In Proc. of CVPR. Gross, R., Matthews, I., and Baker, S. (2006). Active appearance models with occlusion. Image and Vision Computing, 24(6):593–604. Guskov, I. (2004). Multiscaled inverse compositional image alingment for subdivision surface maps. In Proc. European Conference on Computer Vision. Hager, G. and Belhumeur, P. (1998). Efficient region tracking with parametric models of geometry and illumination. IEEE Transactions on Pattern Analysis and Machine Intelligence, 20(10):1025–1039. Hager, G. and Belhumeur, P. (1999). Tracking in 3d: Image variability decomposi- tion for recovering object pose and illumination. Pattern Analysis and Applica- tions. Harris, C. and Stephens, M. (1988). A combined corner and edge detection. In Proceedings of The Fourth Alvey Vision Conference, pages 147–151. 193
  • 216. Hartley, R. and Zisserman, A. (2004). Multiple View Geometry in Computer Vision. Cambridge University Press, second edition. Hiwada, K., Maki, A., and Nakashima, A. (2003). Mimicking video: real-time morphable 3d model fitting. In VRST ’03: Proceedings of the ACM symposium on Virtual reality software and technology, pages 132–139, New York, NY, USA. ACM. Hong, H. S. and Chung, M. J. (2007). 3d pose and camera parameter tracking algorithm based on lucas-kanade image alignment algorithm. In International Conference on Control, Automation and Systems, Seoul, Korea. Irani, M. and Anandan, P. (1999). All about direct methods. In Triggs, W., Zis- serman, A., and Szeliski, R., editors, Vision Algorithms: Theory and practice. Springer-Verlag. Irani, M., Anandan, P., and Cohen, M. (2002). Direct recovery of planar-parallax from multiple frames. IEEE Transactions on Pattern Analysis and Machine In- telligence, 24(11):1528–1534. Irani, M. and Peleg, S. (1991). Improving resolution by image registration. CVGIP: Graph. Models Image Process., 53(3):231–239. Irani, M., Rousso, B., and Peleg, S. (1997). Recovery of ego-motion using region alignment. IEEE Trans. Pattern Anal. Mach. Intell., 19(3):268–272. Jang, J.-S. and Kanade, T. (2008). Robust 3d head tracking by online feature reg- istration. In The IEEE International Conference on Automatic Face and Gesture Recognition. Jurie, F. and Dhome, M. (2002a). Hyperplane approximation for template matching. IEEE Transactions on Pattern Analysis and Machine Intelligence, 24(7):996–100. Jurie, F. and Dhome, M. (2002b). Real time robust template matching. In Proc. BMVC, pages 123–132. K. B. Petersen, M. P. The matrix cookbook. Keller, Y. and Averbuch, A. (2004). Fast motion estimation using bidirectional gradient methods. Trans. on IP, 13(8):1042–1054. Keller, Y. and Averbuch, A. (2008). Global parametric image alignment via high- order approximation. Computer Vision and Image Understanding, 109(3):244– 259. Kollnig, H. and Nagel, H. H. (1997). 3d pose estimation by directly matching polyhedral models to gray value gradients. In International Journal of Computer Vision, volume 23, pages 283–302. 194
  • 217. La Cascia, M., Sclaroff, S., and Athitsos, V. (2000). Fast, reliable head tracking under varying illumination: An approach based on robust registration of texture- mapped 3d models. IEEE Transactions on Pattern Analysis and Machine Intel- ligence, 22(4):322–336. Lagger, P., Salzmann, M., Lepetit, V., and Fua, P. (2008). 3d pose refinement from reflections. In Computer Vision and Pattern Recognition. Lepetit, V. and Fua, P. (2005). Monocular model-based 3d tracking of rigid objects. Found. Trends. Comput. Graph. Vis., 1(1):1–89. Lepetit, V. and Fua, P. (2006). Keypoint recognition using randomized trees. IEEE Transactions on Pattern Analysis and Machine Intelligence, 28(9):1465–1479. Lepetit, V., Moreno-Noguer, F., and Fua, P. (2009). Epnp: An accurate o(n) solution to the pnp problem. International Journal of Computer Vision, 81(2). Lester, H. and Arridge, S. R. (1999). A survey of hierarchical non-linear medical image registration. Pattern Recognition, 32(1):129 – 149. Lewis, J. P. (1995). Fast normalized cross-correlation. In Vision Interface, pages 120–123. Canadian Image Processing and Pattern Recognition Society. Liu, X. (2007). Generic face alignment using boosted appearance model. In in Proc. IEEE Computer Vision and Pattern Recognition, pages 1079–1088. Lourakis, M. I. A. and Argyros, A. A. (2006). Chaining planar homographies for fast and reliable 3d plane tracking. In Proc. of ICPR, pages 582–586, Washington, DC, USA. IEEE Computer Society. Lowe, D. G. (2004). Distinctive image features from scale-invariant keypoints. In- ternational Journal of Computer Vision, 2(60):91–110. Lucas, B. D. and Kanade, T. (1981). An iterative image registration technique with an application to stereo vision. In Proc. of Int. Joint Conference on Artificial Intelligence, pages 674–679. Lucey, S. (2008). Enforcing non-positive weights for stable support vector tracking. In IEEE International Conference on Computer Vision and Pattern Recognition (CVPR). Madsen, K., Nielsen, H., and Tingleff, O. (2004). Methods for non-linear least squares problems. Informatiks and Mathematical Modelling, Technical University of Denmark, second edition. Malciu, M. and Prˆeteux, F. (2000). A robust model-based approach for 3d head tracking in video sequences. In FG ’00: Proceedings of the Fourth IEEE Inter- national Conference on Automatic Face and Gesture Recognition 2000, page 169, Washington, DC, USA. IEEE Computer Society. 195
  • 218. Marchand, E., Bouthemy, P., Chaumette, F., and Moreau, V. (1999). Robust real- time visual tracking using 2d-3d model-based approach. In In International Con- ference on Computer Vision, ICCV, Corfu, Greece. Masson, L., Dhome, M., and Jurie, F. (2004). Robust real time tracking of 3d objects. In ICPR ’04: Proceedings of the Pattern Recognition, 17th International Conference on (ICPR’04) Volume 4, pages 252–255, Washington, DC, USA. IEEE Computer Society. Masson, L., Dhome, M., and Jurie, F. (2005). Tracking 3d objects using flexible models. BMVC, 2005. Masson, L., Jurie, F., and Dhome, M. (2003). Contour/texture approach for visual tracking. In SCIA’03: Proceedings of the 13th Scandinavian conference on Image analysis, pages 661–668, Berlin, Heidelberg. Springer-Verlag. Matas, J., Chum, O., Martin, U., and Pajdla, T. (2002). Robust wide baseline stereo from maximally stable extremal regions. In Proceedings of British Machine Vision Conference, volume 1, pages 384–393, London. Matthews, I. and Baker, S. (2004). Active appearance models revisited. Interna- tional Journal of Computer Vision, 60(2):135–164. Matthews, I., Xiao, J., and Baker, S. (2007). 2d vs. 3d deformable face models: Representational power, construction, and real-time fitting. International Journal of Computer Vision, 75(1):93–113. Megret, R., Authesserre, J., and Berthoumieu, Y. (2008). The bi-directional frame- work for unifying parametric image alignment approaches. In Proc. European Conference on Computer Vision, pages 400–411. Megret, R., Mikram, M., and Berthoumieu, Y. (2006). Inverse composition for multi-kernel tracking. Mu˜noz, E., Buenaposada, J. M., and Baumela, L. (2005). Efficient model-based 3d tracking of deformable objects. In Proc. of ICCV, volume I, pages 877–882, Beijing, China. Mu˜noz, E., Buenaposada, J. M., and Baumela, L. (2009). A direct approach for efficiently tracking with 3d morphable models. In Proc. of ICCV, volume I, Kyoto, Japan. Murphy-Chutorian, E. and Trivedi, M. M. (2009). Head pose estimation in computer vision: A survey. IEEE Trans. Pattern Anal. Mach. Intell., 31(4):607–626. Navarathna, R., Sridharan, S., and Lucey, S. (2011). Fourier active appearance models. In Proceedings of IEEE International Conference on Computer Vision (ICCV 2011). 196
  • 219. Neapolitan, R. and Naimipour, K. (1996). Foundations of algorithms. D. C. Heath and Company, Lexington, MA, USA. Nick Molton, Andrew Davison, I. R. (2004). Parameterisation and probability in image alignment. In Proc. of Asian Conference on Computer Vision. Papandreu, G. and Maragos, P. (2008). Adaptative and constrained algorithms for inverse compositional active appearance models fitting. In Proc. of CVPR. Parke, F. I. and Waters, K. (1996). Computer Facial Animation. AK Peters Ltd. Pighin, F., Salesin, D. H., and Szeliski, R. (1999). Resynthesizing facial animation through 3d model-based tracking. In In International Conference on Computer Vision, ICCV, Corfu, Greece. Pilet, J., Lepetit, V., and Fua, P. (2005). Real-time non-rigid surface detection. In Proc. of CVPR. IEEE. Pilet, J., Lepetit, V., and Fua, P. (2008). Fast non-rigid surface detection, registra- tion and realistic augmentation. Int. J. Comput. Vision, 76(2):109–122. Press, W. H., Flannery, B. P., Teukolsky, S. A., and Vetterling, W. T. (1992). Numerical Recipes: The Art of Scientific Computing. Cambridge University Press, Cambridge (UK) and New York, 2nd edition. Pressigout, M. and Marchand, E. (2007). Real-time hybrid tracking using edge and texture information. Int. Journal of Robotics Research, IJRR, 26(7):689–713. Ramamoorthi, R. (2002). Analytic pca construction for theoretical analysis of light- ing variability in images of a lambertian object. IEEE Trans. Pattern Analysis and Machine Intelligence, 24:1322–1333. Romdhani, S. and Vetter, T. (2003). Efficient, robust and accurate fitting of a 3d morphable model. In Proc. of ICCV, volume 1, pages 59–66. Ross, D., Lim, J., and Yang, M.-H. (2004). Adaptive probabilistic visual tracking with incremental subspace update. In Proc. European Conference on Computer Vision, volume LNCS 3022, pages 470–482. Springer-Verlag. Salzmann, M., J.Pilet, S.Ilic, and P.Fua (2007). Surface deformation models for non- rigid 3–d shape recovery. IEEE Transactions on Pattern Analysis and Machine Intelligence, 29(8):1481–1487. Schmid, C., Mohr, R., and Bauckhage, C. (2000). Evaluation of interest point detectors. International Journal of Computer Vision, 37(2):151–172. Sclaroff, S. and Isidoro, J. (2003). Active blobs: region-based, deformable appear- ance models. Comput. Vis. Image Underst., 89(2-3):197–225. 197
  • 220. Sepp, W. (2006). Efficient tracking in 6-dof based on the image-constancy assump- tion in 3-d. In Proc. of ICPR. Sepp, W. (2008). Visual Servoing of Textured Free-Form Objects in 6 Degrees of Freedom. PhD thesis, Institut fr Datenverarbeitung, Technische Universitt Mnchen. Sepp, W. and Hirzinger, G. (2003). Real-time texture-based 3-d tracking. In Proc. of Deutsche Arbeitsgemeinschaft f¨ur Mustererkennung e.V., volume 2781 of LNCS, pages 330–337. Springer. Shi, J. and Tomasi, C. (1994). Good features to track. In 1994 IEEE Conference on Computer Vision and Pattern Recognition (CVPR’94), pages 593 – 600. Shum, H.-Y. and Szeliski, R. (2000). Construction of panoramic image mosaics with global and local alignment. International Journal of Computer Vision, 36(2):101– 130. Simon, G., Fitzgibbon, A. W., and Zisserman, A. (2000). Markerless tracking using planar structures in the scene. In International Symposium on Augmented Reality, pages 120–128. Strom, J., Jebara, T., Basu, S., and Pentland, A. (1999). Real time tracking and modeling of faces: An EKF-based analysis by synthesis approach. In Proceed- ings of the Modelling People Workshop at the 1999 International Conference on Computer Vision. Tomasi, C. and Kanade, T. (1992). Shape and motion from image streams under or- thography: A factorization approach. International Journal of Computer Vision, 9(2):137–154. Torr, P. H. S. and Zisserman, A. (1999). Feature based methods for structure and motion estimation. In Triggs, W., Zisserman, A., and Szeliski, R., editors, Vision Algorithms: Theory and practice, pages 278–295. Springer-Verlag. Torresani, L., Hertzmann, A., and Bregler, C. (2008). Nonrigid structure-from- motion: Estimating shape and motion with hierarchical priors. IEEE Trans. Pattern Anal. Mach. Intell., 30(5):878–892. Torresani, L., Yang, D., Alexander, G., and Bregler, C. (2002). Tracking and mod- elling non-rigid objects with rank constraints. In Proc. of CVPR. IEEE. Tsai, R. (1987). A versatile camera calibration technique for high-accuracy 3d ma- chine vision metrology using off-the-shelf tv cameras and lenses. Robotics and Automation, IEEE Journal of, 3(4):323–344. Tzimiropoulos, G., Zafeiriou, S., and Pantic, M. (2011). Robust and efficient para- metric face alignment. In Proceedings of IEEE International Conference on Com- puter Vision (ICCV 2011), pages 1847–1854. Oral. 198
  • 221. Vacchetti, L., Lepetit, V., and Fua, P. (2004). Stable real-time 3d tracking using on- line and offline information. IEEE Transactions on Pattern Analysis and Machine Intelligence, 26(10):1385–1391. Viola, P. and Jones, M. J. (2004). Robust real-time face detection. International Journal of Computer Vision, 57(2):137–154. Viola, P. and Wells, III, W. M. (1997). Alignment by maximization of mutual information. Int. J. Comput. Vision, 24(2):137–154. Wang, X., Hua, G., and Han, T. X. (2010). Discriminative tracking by metric learning. In ECCV (3), pages 200–214. Xiao, J., Baker, S., Matthews, I., and Kanade, T. (2004a). Real-time combined 2d+3d active appearance models. In Proc. of CVPR, Washington, D.C. IEEE. Xiao, J., Baker, S., Matthews, I., and Kanade, T. (2004b). Real-time combined 2d+3d active appearance models. In Proc. of CVPR, volume 2, pages 535 – 542. Xu, Y. and Roy-Chowdhury, A. K. (2008). Inverse compositional estimation of 3d pose and lighting in dynamic scenes. IEEE Transactions on Pattern Analysis and Machine Intelligence, 30(7):1300 – 1307. Zhu, J., Hoi, S. C., and Lyu, M. R. (2006). Real-time non-rigid shape recovery via active appearance models for augmented reality. In Proceedings 9th European Conference on Computer Vision (ECCV2006), Graz, Austria. Zimmermann, K., Matas, J., and Svoboda, T. (2009). Tracking by an optimal sequence of linear predictors. IEEE Trans. Pattern Anal. Mach. Intell., 31(4):677– 692. Zimmermann, K., Svoboda, T., and Matas, J. (2006). Multiview 3d tracking with an incrementally constructed 3d model. In 3DPVT ’06: Proceedings of the Third In- ternational Symposium on 3D Data Processing, Visualization, and Transmission (3DPVT’06), pages 488–495, Washington, DC, USA. IEEE Computer Society. Zitova, B. (2003). Image registration methods: a survey. Image and Vision Com- puting, 21(11):977–1000. 199
• 223. Appendix A Gauss-Newton Optimization Let f be a vector function, $f : \mathbb{R}^n \to \mathbb{R}^m$ with $m \geq n$. We want to find the minimum of $\|f(x)\|$, so we cast the problem as $x^* = \arg\min_x \{F(x)\}$, where $F(x) = \frac{1}{2}\|f(x)\|^2 = \frac{1}{2} f(x)^\top f(x)$. We assume that $x^* = x + h$, where $h \in \mathbb{R}^n$ is an arbitrary vector such that $F(x + h)$ is a local minimizer. We find $h$ by linearizing the function $f$ at $x$ using a truncated Taylor series: $f(x + h) \simeq \ell(h) \equiv f(x) + f'(x)\,h$, where $f'(x)$ is the first derivative of function $f$ at point $x$. We redefine the problem as $F(x + h) \simeq L(h) \equiv \frac{1}{2}\ell(h)^\top \ell(h) = \frac{1}{2} f(x)^\top f(x) + h^\top J(x)^\top f(x) + \frac{1}{2} h^\top J(x)^\top J(x)\,h$, with $J(x) = f'(x)$. The matrix $J$ is usually referred to as the Jacobian. The Gauss-Newton step $h$ minimises the linear model $L$: $h = \arg\min_h \{L(h)\}$. If $h$ is a local minimizer of $L$, then $L'(h) = 0$. The first derivative of the linear model is $L'(h) = J(x)^\top f(x) + J(x)^\top J(x)\,h$. Setting this derivative to zero yields the normal equations: $J(x)^\top J(x)\,h = -J(x)^\top f(x)$. 201
• 224. Finally, the GN descent direction is computed in closed form as $h = -\left(J(x)^\top J(x)\right)^{-1} J(x)^\top f(x)$. We use a line search strategy to compute a step size $\alpha$ along the descent direction towards the optimum—i.e., $\alpha = \arg\min_{\hat\alpha} F(x + \hat\alpha h)$. Typically, $x + \alpha h$ is not the true local minimizer, so we repeat the process from the point $x' = x + h$. Again, we linearize $F(x' + h)$ and compute a new GN step. We iterate the process until convergence. We outline the whole process in Algorithm 13.

Algorithm 13 Outline of the GN algorithm.
On-line: Let $x_i$ be the starting point, with $i = 0$.
1: while no convergence do
2:   Compute the Jacobian, $J(x_i)$.
3:   Compute the GN step, $h_i = -\left(J(x_i)^\top J(x_i)\right)^{-1} J(x_i)^\top f(x_i)$.
4:   Update $x_{i+1} = x_i + h_i$, $i = i + 1$.
5: end while
202
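As a complement to Algorithm 13, the following is a minimal NumPy sketch of the Gauss-Newton loop, including an optional backtracking line search for the step size α mentioned above. The stopping test, the step-size schedule, and the toy residual are illustrative assumptions rather than the exact procedure used in the thesis.

```python
import numpy as np

def gauss_newton(f, jac, x0, max_iters=100, tol=1e-8):
    """Minimise 0.5 * ||f(x)||^2 with Gauss-Newton plus a simple backtracking line search.

    f   : callable returning the residual vector f(x) in R^m
    jac : callable returning the m x n Jacobian J(x) = f'(x)
    x0  : starting point in R^n
    """
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iters):
        r = f(x)
        J = jac(x)
        # GN step from the normal equations: (J^T J) h = -J^T r
        h = np.linalg.solve(J.T @ J, -J.T @ r)
        # Backtracking line search on the cost F(x) = 0.5 * ||f(x)||^2
        F0, alpha = 0.5 * r @ r, 1.0
        while alpha > 1e-4 and 0.5 * np.sum(f(x + alpha * h) ** 2) > F0:
            alpha *= 0.5
        x = x + alpha * h
        if np.linalg.norm(alpha * h) < tol:
            break
    return x

# Toy example (not an image-registration residual): fit y = a * exp(b * t).
t = np.linspace(0, 1, 50)
y = 2.0 * np.exp(-1.5 * t)
res = lambda p: p[0] * np.exp(p[1] * t) - y
jcb = lambda p: np.stack([np.exp(p[1] * t), p[0] * t * np.exp(p[1] * t)], axis=1)
print(gauss_newton(res, jcb, [1.0, 0.0]))   # should approach [2.0, -1.5]
```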
Appendix B
Plane-induced Homography

A plane-induced homography relates the image of a plane in two views, or the image of two planes in a single view (see Figure B.1).

Figure B.1: (Left) The plane π induces a homography H between the imaged plane on views C and C′. The plane-induced homography H depends on the relative motion between C and C′, which is given by R and t. (Right) The plane-induced homography alternatively represents the motion of a plane in a single view. In this case, the plane-induced homography depends on the relative motion between the planes π and π′.

Suppose that the camera matrices of the two views are those of a calibrated stereo rig,

P = K[I | 0],   P′ = K[R | t],

such that the world origin is at the first camera P. The world plane π has coordinates π = (nπ⊤, d)⊤ such that nπ⊤ xπ + d = 0, for every point xπ on the plane. The point xπ = (xπ, yπ, zπ)⊤ is sensed in both views as

˜x = K[I | 0] ˜xπ = K xπ,   (B.1)
and

˜x′ = K[R | t] ˜xπ = K (R xπ + t).   (B.2)

We rewrite Equation B.2 using −nπ⊤ xπ / d = 1 as

˜x′ = K[R | t] ˜xπ = K (R xπ + t n⊤ xπ),   (B.3)

where n⊤ = −nπ⊤ / d. Inserting Equation B.1 into Equation B.3 results in an equation that relates the plane projections ˜x and ˜x′,

˜x′ = K (R + t n⊤) K⁻¹ ˜x.   (B.4)

We define the plane-induced homographic motion as a function fh6p : P² → P² such that

˜x′ = fh6p(˜x; µ) = H6 ˜x,   (B.5)

where µ = (α, β, γ, t)⊤, and

H6 = K (R + t n⊤) K⁻¹

is a 6-dof homography. This homography is parameterized by the rotation R (described by the Euler angles α, β, and γ) and by the translation t.
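As a quick illustration of Equations B.4 and B.5, the following NumPy sketch builds H6 from K, R, t and the plane (nπ, d), and warps a homogeneous pixel. The Euler-angle convention and all numeric values are illustrative assumptions, not taken from the thesis.

import numpy as np

def euler_to_R(alpha, beta, gamma):
    # Rotation from Euler angles (one common Z-Y-X convention; the thesis
    # parameterisation may differ, this is only an illustrative choice).
    ca, sa = np.cos(alpha), np.sin(alpha)
    cb, sb = np.cos(beta),  np.sin(beta)
    cg, sg = np.cos(gamma), np.sin(gamma)
    Rz = np.array([[ca, -sa, 0], [sa, ca, 0], [0, 0, 1]])
    Ry = np.array([[cb, 0, sb], [0, 1, 0], [-sb, 0, cb]])
    Rx = np.array([[1, 0, 0], [0, cg, -sg], [0, sg, cg]])
    return Rz @ Ry @ Rx

def plane_induced_homography(K, R, t, n_pi, d):
    # H6 = K (R + t n^T) K^{-1}, with n = -n_pi / d (Equation B.4).
    n = -np.asarray(n_pi, dtype=float) / d
    return K @ (R + np.outer(t, n)) @ np.linalg.inv(K)

# Usage: warp a homogeneous pixel with the 6-dof homography (Equation B.5).
K = np.array([[800.0, 0, 320], [0, 800, 240], [0, 0, 1]])
R = euler_to_R(0.05, -0.02, 0.1)
t = np.array([0.1, 0.0, 0.02])
H6 = plane_induced_homography(K, R, t, n_pi=[0, 0, 1], d=2.0)
x = np.array([300.0, 200.0, 1.0])        # pixel in P^2
xp = H6 @ x
xp = xp / xp[2]                          # back to Euclidean pixel coordinates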
Appendix C
Plane+Parallax-constrained Homography

The homography induced by a plane generates a virtual parallax due to the motion of the plane (cf. [Hartley and Zisserman, 2004], p. 335). Let us suppose we have a calibrated stereo rig with camera matrices

P = K[R | −Rt],   P′ = K[R′ | −R′t′].

The world plane π has coordinates π = (nπ⊤, d)⊤ such that nπ⊤ xπ + d = 0, for every point xπ on the plane. The plane-induced homography H relates the images of the point xπ on the two views, ˜x and ˜x′ (see Appendix B). However, this statement is no longer true when the plane's coordinates change. Let π′ = (nπ′⊤, d′)⊤ be the plane resulting from applying a rigid body transformation with parameters δR and δt to π (see Figure C.1). We image the point xπ′ as ˜x′′ on the right view. We relate the two image points ˜x and ˜x′′ as

˜x′′ = H ˜x + ρ ˜e′,   (C.1)

where the scalar ρ is the parallax displacement relative to the plane-induced homography H, and ˜e′ is the projection of the epipole under P′. We demonstrate that there exists a closed form for ρ when we know both R′ and t′. First, we respectively express xπ′ in the reference systems of cameras C and C′ as follows:

x = R xπ′ − R t,
x′ = R′ xπ′ − R′ t′,   (C.2)

that is, x is xπ′ expressed in coordinates of C, and x′ is xπ′ expressed in coordinates of C′.
Figure C.1: Plane+Parallax-constrained homography.

We compute the transformation that relates x and x′ by combining Equations C.2 as follows:

x′ = R′ R⊤ x + R′ (t − t′).   (C.3)

Notice that ˜x′′ = K x′, as x′ is already expressed in coordinates of C′. We express x by chaining three transformations: (1) from ˜x to xπ,

xπ = R⊤ t nπ⊤ R⊤ / (d − nπ⊤ t) K ˜x,   (C.4)

(2) from xπ to xπ′,

xπ′ = (δR − δR δt nπ⊤ / d) xπ,   (C.5)

and (3) from xπ′ to x,

x = (R − R t nπ⊤ δR⊤ / (d − nπ⊤ δt)) xπ′.   (C.6)

Inserting Equations C.4–C.6 into Equation C.3, and projecting using K, leads to an expression relating ˜x and ˜x′′,

˜x′′ = K R′ R⊤ (R − R t nπ⊤ δR⊤ / (d − nπ⊤ δt)) (δR − δR δt nπ⊤ / d) R⊤ t nπ⊤ R⊤ / (d − nπ⊤ t) K ˜x + K R′ (t − t′).   (C.7)

We rewrite Equation C.7 as

˜x′′ = H ˜x + ˜e′,   (C.8)
where

H = K R′ R⊤ (R − R t nπ⊤ δR⊤ / (d − nπ⊤ δt)) (δR − δR δt nπ⊤ / d) R⊤ t nπ⊤ R⊤ / (d − nπ⊤ t) K⁻¹,   (C.9)

and

˜e′ = K R′ (t − t′).   (C.10)

Matrix H is the plane-induced homography between ˜x and ˜x′, and ˜e′ is the epipole in the left view. Note that ˜e′ ∈ P² is a point in general projective form, i.e. ˜e′ = (x, y, w)⊤. However, if we express ˜e′ as an augmented point on the Euclidean plane, i.e. ˜e′ = (x/w, y/w, 1)⊤, then we rewrite Equation C.8 as

˜x′′ = H ˜x + ρ ˜e′,   (C.11)

where ρ = w is the projective depth of the epipole. We define the Plane+Parallax-constrained Homography, fH6PP, using Equation C.9,

˜x′′ = fH6PP(˜x; µ) = H ˜x + ρ ˜e′.   (C.12)

C.1 Compositional Form

We may rewrite the warp fH6PP (Equation C.12) as a composition of two functions, so it can be directly used in compositional algorithms (see Chapter 6). We recast Equation C.12 as

˜x′′ = fH6PP(˜x; µ) = h(g(˜x; δR, δt); R, R′, t, t′),   (C.13)

where we define the functions h and g as

h(˜x; R, R′, t, t′) = K R′ R⊤ ˜x − K R′ (t − t′),
g(˜x; δR, δt) = (R − R t nπ⊤ δR⊤ / (d − nπ⊤ δt)) (δR − δR δt nπ⊤ / d) R⊤ t nπ⊤ R⊤ / (d − nπ⊤ t) K ˜x.   (C.14)

Notice that the actual optimization parameters are δR and δt; the parameters R and t are fixed throughout the process, and R′ and t′ depend upon δR and δt.
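A small numerical check of the relation between Equations C.8 and C.11 (epipole in general projective form versus augmented Euclidean form with an explicit projective depth ρ) is given below; the homography, epipole and pixel are made-up values used only for illustration.

import numpy as np

H = np.array([[1.01, 0.02, -3.0],
              [-0.01, 0.99, 2.0],
              [1e-5, -2e-5, 1.0]])
e_tilde = np.array([400.0, 150.0, 0.5])    # epipole (x, y, w) in P^2
x = np.array([320.0, 240.0, 1.0])          # pixel in homogeneous coordinates

x_c8 = H @ x + e_tilde                     # Equation C.8: epipole in projective form

rho = e_tilde[2]                           # projective depth of the epipole
e_aug = e_tilde / rho                      # augmented Euclidean point (x/w, y/w, 1)
x_c11 = H @ x + rho * e_aug                # Equation C.11

assert np.allclose(x_c8, x_c11)            # both forms give the same point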
Appendix D
Methodical Factorization

The main goal of the factorization algorithm is to re-organise the chain of matrix products of the Jacobian matrix due to gradient replacement. Frequently, this arrangement is done using ad hoc techniques [Hager and Belhumeur, 1998]. In this section we propose a method to systematically carry out the factorization step. We use the following theorems and their corresponding corollaries as a basis for this technique.

D.1 Basic Definitions

Definition D.1.1. The vec Operator: If A is an m × n matrix with values

A = [a11 · · · a1n ; ... ; am1 · · · amn],

the vec operator stacks the matrix columns into a vector

vec(A) = (a11, ..., am1, ..., a1n, ..., amn)⊤.

Definition D.1.2. Kronecker Product: If A is an m × n matrix and B is a p × q
matrix, then the Kronecker product A ⊗ B is the (mp) × (nq) block matrix

A ⊗ B = [a11 B · · · a1n B ; ... ; am1 B · · · amn B].

For more properties of the Kronecker product we recommend reading [K. B. Petersen].

Definition D.1.3. Kronecker Row-product: If A is an m × n matrix and B is a p × q matrix, we define the Kronecker row-product A ⊙ B as the (mp) × q matrix

A ⊙ B = [A ⊗ b1⊤ ; ... ; A ⊗ bp⊤],

where bi⊤ are the p rows of matrix B.

We define the concepts of permutation and permutation matrix below. We shall use these definitions to re-organize Kronecker products:

Definition D.1.4. Permutation: Given a set {1, . . . , m}, a permutation, π, of the set is a bijective map of that set onto itself: π : {1, . . . , m} → {1, . . . , m}. A less formal definition would describe a permutation of a set as a reordering of the set elements. We annotate the permutation of the set, π(m), as

π(m) = (1 2 · · · m ; π(1) π(2) · · · π(m)).

For example, given the set {1, 2, 3}, a valid permutation of that set is

(1 2 3 ; π(1) π(2) π(3)) = (1 2 3 ; 2 3 1).

Definition D.1.5. Permutation Matrix: If π(n) is a permutation of the set {1, . . . , n}, then we define the n × n permutation matrix, Pπ(n), as

Pπ(n) = [eπ(1)⊤ ; eπ(2)⊤ ; ... ; eπ(n)⊤],

where ei ∈ Rn is the i-th unit vector, i.e. the vector that is zero in all entries except the i-th, where it is 1.
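Both operators are available directly in NumPy, which makes it easy to sanity-check the identities used below. The short sketch that follows uses illustrative values only, and also records the standard identity vec(AXB) = (B⊤ ⊗ A) vec(X), which is not stated in this appendix but is often handy when re-deriving the lemmas.

import numpy as np

def vec(A):
    # Stack the columns of A into one long vector (Definition D.1.1).
    return A.reshape(-1, 1, order='F')

A = np.array([[1.0, 2.0], [3.0, 4.0]])              # 2 x 2
B = np.array([[0.0, 5.0], [6.0, 7.0], [1.0, 2.0]])  # 3 x 2
print(vec(A).ravel())        # [1. 3. 2. 4.]  (columns stacked)
print(np.kron(A, B).shape)   # (6, 4) = (2*3, 2*2), Definition D.1.2

# Standard identity relating both operators: vec(A X B) = (B^T kron A) vec(X).
X = np.array([[1.0, -1.0, 0.5], [2.0, 0.0, 1.0]])   # 2 x 3, so A X B is 2 x 2
assert np.allclose(vec(A @ X @ B), np.kron(B.T, A) @ vec(X))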
The permutation matrix enables us to re-order the rows and columns of a matrix or vector. We shall use this property in Theorem 6. Additionally, we define the permutation with ratio as a special sub-class of permutations:

Definition D.1.6. The permutation of the set {1, . . . , m} with ratio q, π(m : q), is the permutation that takes first the elements at positions 1, 1 + q, 1 + 2q, . . ., then those at positions 2, 2 + q, . . ., and so on. For example, the permutation π(9 : 3) is

π(9 : 3) = (1 2 3 4 5 6 7 8 9 ; π(1) π(2) π(3) π(4) π(5) π(6) π(7) π(8) π(9)) = (1 2 3 4 5 6 7 8 9 ; 1 4 7 2 5 8 3 6 9).

D.2 Lemmas that Re-organize Products of Matrices

Using the above definitions we state the theorem that lets us re-arrange the product of two matrices while keeping the result of the product unchanged.

Theorem 5. Let A and B be m × n and n × p matrices respectively. We can rewrite their product AB as (Im ⊗ vec(B)⊤)(Ip ⊙ A).

Proof. The product AB can alternatively be written as a row-wise times column-wise vector product,

A_{m×n} B_{n×p} = [a1⊤ ; ... ; am⊤] [b1 · · · bp] = [a1⊤b1 · · · a1⊤bp ; ... ; am⊤b1 · · · am⊤bp] = [b1⊤a1 · · · bp⊤a1 ; ... ; b1⊤am · · · bp⊤am]
after some basic matrix manipulations. The result can be re-arranged as the product of an m × (mnp) matrix times an (mnp) × p matrix: the first factor is block-diagonal, with the row vector (b1⊤, b2⊤, · · · , bp⊤) = vec(B)⊤ repeated along its m diagonal blocks, and the second factor stacks the (np) × p blocks Ip ⊗ ai for i = 1, . . . , m. This can be compactly rewritten using the Kronecker product, the Kronecker row-product and the vec operator as

A_{m×n} B_{n×p} = (Im ⊗ vec(B)⊤)(Ip ⊙ A).

Following this theorem we define four corollaries that deal with the most common cases.

Corollary 5. If A is an m × n matrix and b is an n × 1 column vector, then the product Ab can be rewritten as (Im ⊗ b⊤) vec(A⊤).

Proof. If we apply Theorem 5 using the current matrix dimensions, we get

A_{m×n} b_{n×1} = (Im ⊗ b⊤)(I1 ⊙ A) = (Im ⊗ b⊤) [a1 ; a2 ; ... ; am] = (Im ⊗ b⊤) vec(A⊤).

We can reach the same result by just rewriting the matrix A row-wise, so the product is

Ab = [a1⊤ ; ... ; am⊤] b = [a1⊤ b ; ... ; am⊤ b] = [b⊤ a1 ; ... ; b⊤ am].

The resulting transposed products can be rewritten as a matrix multiplication in the
form

Ab = [b⊤ · · · 0_{1×n} ; ... ; 0_{1×n} · · · b⊤] [a1 ; ... ; am] = (Im ⊗ b⊤) vec(A⊤),

compactly written using the Kronecker product and the vec operator.

Corollary 6. If b is an m × 1 column vector and A is an m × n matrix, then the product b⊤A can be rewritten as vec(A)⊤ (In ⊗ b).

Proof. We rewrite the product straightforwardly by using Theorem 5, but changing the dimensions accordingly,

b⊤_{1×m} A_{m×n} = (I1 ⊗ vec(A)⊤)(In ⊙ b⊤) = vec(A)⊤ (In ⊗ b).

Corollary 7. If a is an m × 1 column vector and b⊤ is a 1 × n row vector, the resulting m × n matrix ab⊤ can be rewritten as (Im ⊗ b⊤)(a ⊗ In).

Proof. We use Theorem 5 to rewrite the product ab⊤ according to the vector sizes,

a_{m×1} b⊤_{1×n} = (Im ⊗ vec(b⊤)⊤)(In ⊙ a) = (Im ⊗ b⊤)(a ⊗ In).

Notice that we can re-organize the row-column product, a⊤b, in a direct way by just rewriting it as b⊤a. However, we will use the following corollary several times during our factorization.

Corollary 8. Let a and b be two n × 1 column vectors. The product (a⊤b) Im can be rewritten as (Im ⊗ a⊤)(Im ⊗ b).

Proof. The initial product is compactly rewritten as:

(a⊤b) Im = [a⊤b · · · 0 ; ... ; 0 · · · a⊤b] = [a⊤ · · · 0_{1×n} ; ... ; 0_{1×n} · · · a⊤] [b · · · 0_{n×1} ; ... ; 0_{n×1} · · · b] = (Im ⊗ a⊤)(Im ⊗ b).
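The corollaries are easy to confirm numerically; the following NumPy check (random sizes and values, purely illustrative) verifies Corollaries 6, 7 and 8 as stated above.

import numpy as np

def vec(A):
    # Column-stacking vec operator (Definition D.1.1).
    return A.reshape(-1, 1, order='F')

m, n = 3, 4
rng = np.random.default_rng(0)
A = rng.standard_normal((m, n))
b_m = rng.standard_normal((m, 1))      # m x 1 vector for Corollary 6
a_m = rng.standard_normal((m, 1))      # m x 1 vector for Corollary 7
b_n = rng.standard_normal((n, 1))      # n x 1 vectors for Corollaries 7 and 8
a_n = rng.standard_normal((n, 1))

# Corollary 6: b^T A = vec(A)^T (I_n kron b).
assert np.allclose(b_m.T @ A, vec(A).T @ np.kron(np.eye(n), b_m))

# Corollary 7: a b^T = (I_m kron b^T)(a kron I_n).
assert np.allclose(a_m @ b_n.T,
                   np.kron(np.eye(m), b_n.T) @ np.kron(a_m, np.eye(n)))

# Corollary 8: (a^T b) I_m = (I_m kron a^T)(I_m kron b).
assert np.allclose((a_n.T @ b_n).item() * np.eye(m),
                   np.kron(np.eye(m), a_n.T) @ np.kron(np.eye(m), b_n))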
Lemma         Input                              Output
Theorem 5     A (m×n), B (n×p)                   (Im ⊗ vec(B)⊤) [m×(mnp)] · (Ip ⊙ A) [(mnp)×p]
Corollary 5   A (m×n), b (n×1)                   (Im ⊗ b⊤) [m×(mn)] · vec(A⊤) [(mn)×1]
Corollary 6   b⊤ (1×m), A (m×n)                  vec(A)⊤ [1×(mn)] · (In ⊗ b) [(mn)×n]
Corollary 7   a (m×1), b⊤ (1×n)                  (Im ⊗ b⊤) [m×(mn)] · (a ⊗ In) [(mn)×n]
Corollary 8   (a⊤ (1×n) b (n×1)) Im (m×m)        (Im ⊗ a⊤) [m×(mn)] · (Im ⊗ b) [(mn)×m]

Table D.1: Lemmas used to re-arrange matrix products.
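A short numerical check of Theorem 5 and Corollary 5 follows; the row_product helper encodes the reading of the row-product Ip ⊙ A that is used in the proof of Theorem 5 (the (mnp) × p stack of blocks Ip ⊗ ai over the rows ai of A), and all sizes and values are illustrative.

import numpy as np

def vec(A):
    return A.reshape(-1, 1, order='F')

def row_product(I_p, A):
    # I_p (.) A as used in the proof of Theorem 5: stack the blocks
    # I_p kron a_i over the rows a_i of A (each a_i taken as a column).
    return np.vstack([np.kron(I_p, A[i, :].reshape(-1, 1))
                      for i in range(A.shape[0])])

m, n, p = 3, 4, 2
rng = np.random.default_rng(1)
A = rng.standard_normal((m, n))
B = rng.standard_normal((n, p))

# Theorem 5: A B = (I_m kron vec(B)^T) (I_p (.) A).
lhs = A @ B
rhs = np.kron(np.eye(m), vec(B).T) @ row_product(np.eye(p), A)
assert np.allclose(lhs, rhs)

# Corollary 5: A b = (I_m kron b^T) vec(A^T).
b = rng.standard_normal((n, 1))
assert np.allclose(A @ b, np.kron(np.eye(m), b.T) @ vec(A.T))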
D.3 Lemmas that Re-organize Kronecker Products

In this section we shall derive some theorems and corollaries that we shall use to re-organize those products involving Kronecker matrices (see Definition D.1.2). The first theorem is used to re-order the operands of a Kronecker product. Notice that, in general, the Kronecker product is not commutative.

Theorem 6. If A is an m × n matrix and B is a p × q matrix, then their Kronecker product is permutation equivalent, i.e., there exist permutation matrices P ((mp) × (mp)) and Q ((nq) × (nq)) such that

A ⊗ B = P (B ⊗ A) Q.

The following theorem and its corollaries re-organize a Kronecker product where one of the operands is an identity matrix.

Theorem 7. If Im is an m × m identity matrix and A is an n × p matrix, then we can re-organize the (mn) × (mp) Kronecker product as

Im ⊗ A = Pπ((mn):p) (A ⊗ Im),

where Pπ((mn):p) is the permutation matrix of the set {1, . . . , (mn)} with ratio p (see Definition D.1.6).

From this theorem we derive three corollaries that we shall use directly in our derivations.

Corollary 9. If Im is the m × m identity matrix and a and b are n × 1 vectors, we can rewrite the (mn) × (mn) product (Im ⊗ a)(Im ⊗ b⊤) as

(Im ⊗ a)(Im ⊗ b⊤) = P⊤π((mn):n) (Im ⊗ b⊤) Pπ((mn²):n) (Im ⊗ a).

Corollary 10. If Im is the m × m identity matrix and a and b are n × 1 vectors, we can rewrite the m × m product (Im ⊗ a⊤)(Im ⊗ b) as

(Im ⊗ a⊤)(Im ⊗ b) = (b⊤ ⊗ Im)(a ⊗ Im).
Lemma         Input                                        Output
Theorem 7     Im ⊗ A [(mn)×(mp)]                           Pπ((mn):p) [(mn)×(mn)] · (A ⊗ Im) [(mn)×(mp)]
Corollary 9   (Im ⊗ a) [(mn)×m] · (Im ⊗ b⊤) [m×(mn)]       P⊤π((mn):n) · (Im ⊗ b⊤) · Pπ((mn²):n) · (Im ⊗ a)
Corollary 10  (Im ⊗ a⊤) [m×(mn)] · (Im ⊗ b) [(mn)×m]       (b⊤ ⊗ Im) [m×(mn)] · (a ⊗ Im) [(mn)×m]
Corollary 11  (a ⊗ Im) [(mn)×m] · (Im ⊗ b⊤) [m×(mn)]       (I(mn) ⊗ b⊤) [(mn)×(mn²)] · (a ⊗ I(mn)) [(mn²)×(mn)]

Table D.2: Lemmas used to re-arrange Kronecker matrix products.

Corollary 11. If Im is the m × m identity matrix, and a and b are n × 1 vectors, we can rewrite the (mn) × (mn) product (a ⊗ Im)(Im ⊗ b⊤) as

(a ⊗ Im)(Im ⊗ b⊤) = (I(mn) ⊗ b⊤)(a ⊗ I(mn)).

In addition to this, we shall use the following properties of the Kronecker product in our derivations. If Im is the m × m identity matrix, and a and b are n × 1 vectors, then

Im ⊗ (ab⊤) = (Im ⊗ a)(Im ⊗ b⊤),
(Im ⊗ a)⊤ = Im ⊗ a⊤,
(a ⊗ Im)⊤ = a⊤ ⊗ Im.

D.4 Lemmas that Re-organize Sums of Matrices

In this section we shall define some propositions that re-arrange the distributive product of matrices.

Proposition 8. If A is an m × p matrix, B is an m × (np) matrix, Ip is the p × p identity matrix, and a is an n × 1 vector, then there exist m × (m + np) matrices P and Q such
that we can rewrite the distributive product A + B (Ip ⊗ a) as

A + B (Ip ⊗ a) = (AP + BQ) (Ip ⊗ (1, a⊤)⊤).

The matrices P and Q are respectively of the form

P = [e1  0m×n  e2  0m×n  · · ·  ei  0m×n  · · ·  0m×n  em],

and

Q = [0n×1 In 0n×1 0n · · · 0n×1 0n · · · 0n×1 0n ;
     0n×1 0n 0n×1 In · · · 0n×1 0n · · · 0n×1 0n ;
     ... ;
     0n×1 0n 0n×1 0n · · · 0n×1 In · · · 0n×1 0n ;
     ... ;
     0n×1 0n 0n×1 0n · · · 0n×1 0n · · · 0n×1 In],

where ei is the i-th unit vector (see Definition D.1.5), 0n×1 is the n × 1 vector of zeroes, and 0n is the n × n matrix whose entries are all zero.
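Before applying these lemmas in the next appendices, a brief NumPy check (illustrative sizes only) of the permutation matrix with ratio from Definition D.1.6 and of Corollary 10 together with the extra Kronecker properties listed above is given below.

import numpy as np

def perm_ratio_matrix(m, q):
    # P_pi(m:q) from Definition D.1.6: rows e_pi(i)^T, where pi takes the
    # positions 1, 1+q, 1+2q, ..., then 2, 2+q, ..., and so on.
    order = [j + k * q for j in range(q) for k in range(m // q)]
    return np.eye(m)[order, :]

# P_pi(9:3) applied to vec(A) reproduces vec(A^T) for a 3 x 3 matrix A.
A = np.arange(9.0).reshape(3, 3)
P = perm_ratio_matrix(9, 3)
assert np.allclose(P @ A.reshape(-1, 1, order='F'),
                   A.T.reshape(-1, 1, order='F'))

# Corollary 10 and the extra properties, with m = 3 and n = 4.
m, n = 3, 4
rng = np.random.default_rng(2)
a = rng.standard_normal((n, 1))
b = rng.standard_normal((n, 1))
I_m = np.eye(m)
lhs = np.kron(I_m, a.T) @ np.kron(I_m, b)
assert np.allclose(lhs, np.kron(b.T, I_m) @ np.kron(a, I_m))   # Corollary 10
assert np.allclose(lhs, (a.T @ b).item() * I_m)                # both equal (a^T b) I_m
assert np.allclose(np.kron(I_m, a @ b.T), np.kron(I_m, a) @ np.kron(I_m, b.T))
assert np.allclose(np.kron(I_m, a).T, np.kron(I_m, a.T))
assert np.allclose(np.kron(a, I_m).T, np.kron(a.T, I_m))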
  • 241. Appendix E Methodical Factorization of f3DTM The goal of this section is to show how to factorize the Jacobian matrix J in a systhematic way by just using the lemmas of Appendix D. We attempt to obtain the most efficient factorization (i.e. the factorization that employs the least num- ber of operations). We separately factorize each element of the Jacobian matrix J (Equations 5.27). Decomposition of J1 J1 =∇ˆuT [˜u]⊤ λH−1 A I3 − (n⊤ t)I3 + tn⊤ R⊤ ˙Rα v Cor. 5 −−−→D⊤ I3 − ( n⊤ t )I3 + tn⊤ (I ⊗ v⊤ )vec(˙R ⊤ α R) Cor. 8 −−−→D⊤ I3 − (I3 ⊗ n⊤ )(I3 ⊗ t) + t n⊤ (I3 ⊗ v⊤ )vec(˙R ⊤ α R) Cor. 7 −−−→D⊤ I3 − (I3 ⊗ n⊤ (I3 ⊗ t) + (I3 ⊗ n⊤ )(t ⊗ I3) (I3 ⊗ v⊤ )vec(˙R ⊤ α R), (E.1) where D⊤ = ∇ˆuT [˜u]⊤ λH−1 A .We continue with the factorization process by eliminat- ing the distributive product of Equation E.1. First, we insert the term I3 ⊗ v⊤ into the sum of the distributive product, J1 =D⊤ I3 − (I3 ⊗ n⊤ (I3 ⊗ t) + (I3 ⊗ n⊤ )(t ⊗ I3) (I3 ⊗ v⊤ )vec(˙R ⊤ α R) =D⊤ (I3 ⊗ v⊤ ) − (I3 ⊗ n⊤ (I3 ⊗ t)(I3 ⊗ v⊤ ) + (I3 ⊗ n⊤ )(t ⊗ I3)(I3 ⊗ v⊤ ) vec(˙R ⊤ α R). (E.2) Second, we reorganize the two terms of the expression inside the parentheses of Equation E.2 containing translations, such that we scroll the operands containing t to the right side of the product. We reorganize the term (I3 ⊗t)(I3 ⊗v⊤ ) as follows: (I3 ⊗ n(i)⊤ ) (I3 ⊗ t) (I3 ⊗ v(i)⊤ ) Cor. 5 −−−−→(I3 ⊗ n(i)⊤ ) Pπ(9:3)(I9 ⊗ v(i)⊤ )(I9 ⊗ v(i)⊤ )Pπ(27:3)(I9 ⊗ t) . (E.3) Notice two important facts about Equation E.3: (1) we simplify the term Pπ(9:3)(I9 ⊗ v⊤ ) using a basic property of Kronecker product [K. B. Petersen], and (2) we use 219
  • 242. the Corollary 11 to reorder the Kronecker product Pπ(27:3)(I9 ⊗ t). If we apply these two properties, we obtain the following results: (I3 ⊗ n⊤ (I3 ⊗ t)(I3 ⊗ v⊤ ) =(I3 ⊗ n⊤ ) Pπ(9:3)(I9 ⊗ v⊤ ) Pπ(27:3)(I9 ⊗ t) −→(I3 ⊗ n⊤ ) (Pπ(9:3) ⊗ v⊤ ) Pπ(27:3)(I9 ⊗ t) Cor. 11 −−−−−→(I3 ⊗ n⊤ )(Pπ(9:3) ⊗ v⊤ ) (t ⊗ I9) . (E.4) We rearrange the second term in a similar way: we apply the Corollary 11 to the term (t ⊗ I3)(I3 ⊗ v⊤ ), so we obtain: (I3 ⊗ n⊤ ) (t ⊗ I3) (I3 ⊗ v⊤ ) Cor. 11 −−−−→ (I3 ⊗ n⊤ ) (I9 ⊗ v⊤ )(t ⊗ I9) . (E.5) Notice the common factor (t ⊗ I9) in both Equations E.4 and E.5; we can now reorganize the summation part of the distributivity by applying Proposition 8: I3 − (I3 ⊗ n⊤ (I3 ⊗ t) + (I3 ⊗ n⊤ )(t ⊗ I3) (I3 ⊗ v⊤ ) Prop. 8 −−−−→ (I3 ⊗ n⊤ )A + (I3 ⊗ n⊤ )(I9 ⊗ v⊤ ) − (I3 ⊗ n⊤ )(Pπ(9:3) ⊗ v⊤ ) B 1 t ⊗ I9 , (E.6) where the matrices A and B are in the following form (see Theorem 7 for further details): A =   1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0   , and B =   03×1 I3 03×1 03 03×1 03 03×1 03 03×1 I3 03×1 03 03×1 03 03×1 03 03×1 I3   . (E.7) Finally, we rename the product of D⊤ and the highlighted portion of the Equa- tion E.6 as S⊤ 1 : S⊤ 1 = D⊤ (I3 ⊗ n⊤ )A + (I3 ⊗ n⊤ )(I9 ⊗ v⊤ ) − (I3 ⊗ n⊤ )(Pπ(9:3) ⊗ v⊤ ) B . (E.8) Therefore, we write the factorized version of J1 as: J1 = S⊤ 1 1 t ⊗ I9 vec(˙R ⊤ αt R). (E.9) Notice that the Equation E.9 represents a proper factorization (according to the rules of Section 5.3): the term S⊤ 1 is constant—only made of shape terms—and the term 1 t ⊗ I9 vec(˙R ⊤ αt R) depends on only motion-variable terms. 220
  • 243. Decomposition of J2 and J3 We reorganize the terms J2 and J3 in the same way we did with J1. The only difference lies on the parameter of the rotational derivative: we use α for J1, β for J2, and γ for J3. Hence, we show the final factorized forms for J2 and J3 in the next equations: J2 =S⊤ 1 1 t ⊗ I9 vec(˙R ⊤ β R), and J3 =S⊤ 1 1 t ⊗ I9 vec(˙R ⊤ γ R). (E.10) Decomposition of J4 The matrix decomposition for J4, J5, and J6 is slighty different from the three previous elements as the former elements do not involve a rotational derivative, that is: J4 = D⊤ R⊤ − (n⊤ t)R⊤ + tn⊤ R⊤ r1n⊤ v. (E.11) Alghough we could deliver a completely different routine to rearrange J4, we opt to reuse the most part of the factorization process of J1. We reorder the terms of Equation E.11 as follows: J4 = D⊤ R⊤ − (n⊤ t)R⊤ + tn⊤ R⊤ r1 n⊤ v Cor. 5 −−−−→ R⊤ − (n⊤ t)R⊤ + tn⊤ R⊤ (I3 ⊗ v⊤ n)vec(r⊤ 1 ) , (E.12) where r1 is the first collumn of matrix R— i.e. R = (r1, r2, r3). We decompose (I3 ⊗ v⊤ n) into (I3 ⊗ v⊤ )(I3 ⊗ n) by using Corollary 8, J4 = D⊤ I3 − (n⊤ t)I3 + tn⊤ I3 (I3 ⊗ v⊤ n)vec(r⊤ 1 ) Cor. 8 −−−−→D⊤ I3 − (n⊤ t)I3 + tn⊤ I3 (I3 ⊗ v⊤ )(I3 ⊗ n) vec(r⊤ 1 ). (E.13) Now we can reorder Equation E.13 using the result from Equation E.8 and Corol- lary 11; we show the process in the next equations: D⊤ I3 − (n⊤ t)I3 + tn⊤ I3 (I3 ⊗ v⊤ ) (I3 ⊗ n)vec(r⊤ 1 ) Eq. E.8 −−−−−→S1 1 t ⊗ I9 (I3 ⊗ n) vec(r⊤ 1 ) Cor. 11 −−−−−→S1 ((I3 ⊗ n) ⊗ I4) I3 ⊗ 1 t vec(r⊤ 1 ) (E.14) We write the expression S⊤ 2 as S⊤ 2 = S⊤ 1 ((I3 ⊗ n) ⊗ I4) , (E.15) so we concisely write the term J4 as: J4 = S⊤ 2 I3 ⊗ 1 t r1. (E.16) 221
Decomposition of J5 and J6

We reorganize the terms J5 and J6 in the same way we did with J4. The only difference lies in the columns of the matrix R that are involved: we use r1 for J4, r2 for J5, and r3 for J6. We show the decomposition expressions for J5 and J6 in the following equations:

J5 = S2⊤ (I3 ⊗ (1, t⊤)⊤) r2,   (E.17)

and

J6 = S2⊤ (I3 ⊗ (1, t⊤)⊤) r3.   (E.18)

Summary of results for J

If we gather Equations E.9, E.10, E.16, E.17 and E.18, we can rewrite Equation 5.27 as follows:

J⊤ = [S1⊤  S2⊤] M,   (E.19)

where we define the matrix M as follows:

M = [ ((1, t⊤)⊤ ⊗ I9) [vec(Ṙαt⊤ R)  vec(Ṙβt⊤ R)  vec(Ṙγt⊤ R)]     0_{36×3} ;
      0_{12×3}                                                      (I3 ⊗ (1, t⊤)⊤) R ].   (E.20)
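The practical point of Equations E.19 and E.20 is that the bracketed S factors contain only shape terms and can be precomputed, while M is rebuilt from the motion estimate at every iteration. The NumPy sketch below illustrates that split for a single Jacobian row; the S row is a random stand-in (the real one is the S1⊤, S2⊤ pair derived above), and the random 3 × 3 matrices stand in for the rotation derivatives Ṙα, Ṙβ, Ṙγ.

import numpy as np

rng = np.random.default_rng(3)
S = rng.standard_normal((1, 48))      # stand-in for [S1^T  S2^T], computed offline

def motion_matrix(R, t, Rdot_a, Rdot_b, Rdot_g):
    # Assemble M (Equation E.20) from the current motion estimate.
    one_t = np.concatenate(([1.0], t)).reshape(-1, 1)           # (1, t^T)^T, 4 x 1
    vecs = np.column_stack([(Rd.T @ R).reshape(-1, order='F')
                            for Rd in (Rdot_a, Rdot_b, Rdot_g)])  # 9 x 3
    top_left = np.kron(one_t, np.eye(9)) @ vecs                  # 36 x 3
    bottom_right = np.kron(np.eye(3), one_t) @ R                 # 12 x 3
    return np.block([[top_left, np.zeros((36, 3))],
                     [np.zeros((12, 3)), bottom_right]])         # 48 x 6

# Online step: only M changes from iteration to iteration.
R = np.eye(3)
t = np.array([0.1, 0.0, 0.5])
Rdots = [rng.standard_normal((3, 3)) for _ in range(3)]          # stand-ins
J_row = S @ motion_matrix(R, t, *Rdots)                          # 1 x 6 Jacobian row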
  • 245. Appendix F Methodical Factorization of f3DMM (Partial case) The process of decomposing Equations 5.57 is slightly different from the full-factorization case. First, we completely factorize the expression D (see Equation 5.58). We group those motion terms that are common in D ( n⊤ Bs c )I3 − Bs c n⊤ Cor. 5 −−−→ (I3 ⊗ n⊤ Bs)(I3 ⊗ c) − Bs (I6 ⊗ n⊤ )(c ⊗ I3) , Cor. 5 −−−→(I3 ⊗ n⊤ Bs)(I3 ⊗ c) − Bs(I6 ⊗ n⊤ ) Pπ(9:3)(I3 ⊗ c) , Cor. 5 −−−→ (I3 ⊗ n⊤ Bs) − Bs(I6 ⊗ n⊤ )Pπ(9:3) (I3 ⊗ c). (F.1) and t n⊤ − ( n⊤ t )I3 Cor. 5 −−−→ (I3 ⊗ n⊤ )(t ⊗ I3) − (I3 ⊗ n⊤ )(I3 ⊗ t) , I3 Cor. 5 −−−→(I3 ⊗ n⊤ ) Pπ(9:3)(I3 ⊗ t) − (I3 ⊗ n⊤ )(I3 ⊗ t), I3 Cor. 5 −−−→ (I3 ⊗ n⊤ )Pπ(9:3) − (I3 ⊗ n⊤ ) (I3 ⊗ t). (F.2) Using Equations F.1 and F.2 we rewrite D as D = I3 + s1(I3 ⊗ t) + s2(I3 ⊗ c), (F.3) where s1 = (I3 ⊗ n⊤ )Pπ(9:3) − (I3 ⊗ n⊤ ) , and s2 = (I3 ⊗ n⊤ Bs) − Bs(I6 ⊗ n⊤ )Pπ(9:3) . (F.4) Notice that we can re-organize the summation terms in Equation F.3 more compactly by using Proposition 8, as we show here: D =I3 + [s1P + s2Q] I3 ⊗ t c H = I3P′ + [s1P + s2Q] Q′  I3 ⊗   1 t c     , (F.5) 223
  • 246. where P =   I3 03 03 03 03 03 03 03 03 03 03 03 I3 03 03 03 03 03 03 03 03 03 03 03 I3 03 03   , Q =   06×3 I6 06×3 06 06×3 06 06×3 06 06×3 I6 06×3 06 06×3 06 06×3 06 06×3 I6   , P′ = e1 03×9 e2 03×9e3 03×9 , and Q′ =   09×1 I9 09×1 09 09×1 09 09×1 09 09×1 I9 09×1 09 09×1 09 09×1 09 09×1 I9   . (F.6) We rewrite Equation F.5 more compactly by using D1 = I3P′ + [s1P + s2Q] Q′ , D2 =  I3 ⊗   1 t c     , (F.7) so the resulting representation of D is D = D1D2. (F.8) Notice that Equation F.8 has two differently parts: D1 depends only on structure parameters whereas D2 solely depends on motion. The key idea of the partial fac- torization is to leave untouched those parts of Equation 5.57 whose factorization process could slow down the computation; thus, we only decompose the term D1 so we rewrite the elements of the Jacobian (Equation 5.57) as follows: J1 =D1D2R⊤ ˙Rαt I3 + Bscn⊤ v, J2 =D1D2R⊤ ˙Rβt I3 + Bscn⊤ v, J3 =D1D2R⊤ ˙Rγt I3 + Bscn⊤ v, J4 =D1D2r1n⊤ v, J5 =D1D2r2n⊤ v, J6 =D1D2r3n⊤ v, and Jk =D1D2R⊤ Bk. (F.9) 224
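As with the previous appendix, the benefit of Equation F.8 is an offline/online split: D1 gathers structure-only terms and is computed once, while D2 = I3 ⊗ (1, t⊤, c⊤)⊤ is rebuilt from the current translation and shape coefficients. The sketch below only illustrates that assembly; D1 is a random stand-in for the expression derived above, and the number of shape coefficients is an arbitrary choice.

import numpy as np

K_shape = 6                                          # illustrative number of coefficients
rng = np.random.default_rng(4)
D1 = rng.standard_normal((3, 3 * (4 + K_shape)))     # offline, structure-only stand-in

def build_D2(t, c):
    # Online part of Equation F.7: I3 kron (1, t^T, c^T)^T.
    v = np.concatenate(([1.0], t, c)).reshape(-1, 1)
    return np.kron(np.eye(3), v)                     # (3 * (4 + K)) x 3

t = np.array([0.1, 0.0, 0.5])
c = 0.05 * rng.standard_normal(K_shape)
D = D1 @ build_D2(t, c)                              # the 3 x 3 matrix D of Equation F.8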
  • 247. Appendix G Methodical Factorization of f3DMM (Full case) The goal of this section is to show how to factorize the Jacobian matrix J that span Equations 5.57 by using the lemmas of Appendix D. We attempt to obtain the most efficient factorization (i.e. the factorization that employs the least number of operations). We separately factorize each element of the Jacobian matrix J in the following. Decomposition of J1 We show the expanded version of J1 from Equation 5.57 as follows: J1 = D⊤ I + (n⊤ Bsct)I − (n⊤ t)I − Bscn⊤ + tn⊤ R⊤ ˙Rα (I + Bsc) v, (G.1) where D⊤ = ∇ˆuT [˜u]⊤ λH−1 A K. We split Equation G.1 in four chunks by applying the distributive property as follows: J (1) 1 =D⊤ (n⊤ Bsct)I − Bscn⊤ R⊤ ˙Rαv, J (2) 1 =D⊤ (n⊤ Bsct)I − Bscn⊤ R⊤ Bscv, J (3) 1 =D⊤ I − (n⊤ t)I + tn⊤ R⊤ ˙Rαv, J (4) 1 =D⊤ I − (n⊤ t)I + tn⊤ R⊤ Bscv. (G.2) We separately re-arrange the terms J (1) 1 , J (2) 1 , J (3) 1 , and J (4) 1 from Equations G.2. We re-organize each term of J1 by using the Lemmas in Appendix D. We show the 225
  • 248. re-arranging process for J (1) 1 in the following: J (1) 1 =D⊤ ( n⊤ Bs ct )I − Bscn⊤ R⊤ ˙Rαv, Cor. 8 −−−→D⊤ (I3 ⊗ n⊤ Bs)(IK ⊗ c) − Bs c n⊤ R⊤ ˙Rαv, Cor. 8 −−−→D⊤ (I3 ⊗ n⊤ Bs)(IK ⊗ c) − Bs (IK ⊗ n⊤ )(I3 ⊗ c) R⊤ ˙Rαv, =D⊤ (I3 ⊗ n⊤ Bs) (IK ⊗ c)R⊤ ˙Rα v −D⊤ Bs(IK ⊗ n⊤ )(I3 ⊗ c) R⊤ ˙Rαv, Cor. 5 −−−→D⊤ (I3 ⊗ n⊤ Bs) (I3 ⊗ v⊤ )vec(˙R ⊤ α R(IK ⊗ c⊤ ) −D⊤ Bs(IK ⊗ n⊤ ) (I3 ⊗ c)R⊤ ˙Rα v , Cor. 5 −−−→D⊤ (I3 ⊗ n⊤ Bs)(I3 ⊗ v⊤ )vec(˙R ⊤ α R(IK ⊗ c⊤ )) −D⊤ Bs(IK ⊗ n⊤ ) (I3 ⊗ v⊤ )vec(˙R ⊤ α R(I3 ⊗ c⊤ )) , (G.3) Note that we can rewrite Equation G.3 as a product of two matrices, J (1) 1 = D⊤ (I3 ⊗ n⊤ Bs)(I3 ⊗ v⊤ ), D⊤ Bs(IK ⊗ n⊤ )(I3 ⊗ v⊤ ) vec(˙R ⊤ α R(IK ⊗ c⊤ )) vec(˙R ⊤ α R(I3 ⊗ c⊤ )) (G.4) We now proceed with the remaining terms J (2) 1 , J (3) 1 , and J (4) 1 in the following: J (2) 1 =D⊤ ( n⊤ Bs ct )I − Bscn⊤ R⊤ Bscv, Cor. 8 −−−→D⊤ (I3 ⊗ n⊤ Bs)(IK ⊗ c) − Bs c n⊤ R⊤ Bscv, Cor. 8 −−−→D⊤ (I3 ⊗ n⊤ Bs)(IK ⊗ c) − Bs (IK ⊗ n⊤ )(I3 ⊗ c) R⊤ Bscv, =D⊤ (I3 ⊗ n⊤ Bs)(IK ⊗ c)R⊤ Bs c v − D⊤ Bs(IK ⊗ n⊤ )(I3 ⊗ c)R⊤ Bs c v, Cor. 6 −−−→D⊤ (I3 ⊗ n⊤ Bs)(IK ⊗ c)R⊤ (I3 ⊗ c⊤ )vec(B⊤ s ) v − D⊤ Bs(IK ⊗ n⊤ )(I3 ⊗ c)R⊤ (I3 ⊗ c⊤ )vec(B⊤ s ) v, =D⊤ (I3 ⊗ n⊤ Bs) (IK ⊗ c)R⊤ (I3 ⊗ c⊤ )vec (B⊤ s )v − D⊤ Bs(IK ⊗ n⊤ )(I3 ⊗ c)R⊤ (I3 ⊗ c⊤ )vec(B⊤ s )v, Cor. 6 −−−→D⊤ (I3 ⊗ n⊤ Bs) I3 ⊗ vec(B⊤ s )v vec((IK ⊗ c)R(I3 ⊗ c⊤ )) − D⊤ Bs(IK ⊗ n⊤ ) (I3 ⊗ c)R⊤ (I3 ⊗ c⊤ )vec (B⊤ s )v , Cor. 6 −−−→D⊤ (I3 ⊗ n⊤ Bs) I3 ⊗ vec(B⊤ s )v vec((IK ⊗ c)R(I3 ⊗ c⊤ )) − D⊤ Bs(IK ⊗ n⊤ ) I3 ⊗ vec(B⊤ s )v vec((IK ⊗ c)R(I3 ⊗ c⊤ )) , (G.5) 226
  • 249. J (3) 1 =D⊤ I − ( n⊤ t )I + tn⊤ R⊤ ˙Rαv, Cor. 8 −−−→D⊤ I − (I3 ⊗ n⊤ )(I3 ⊗ t) + t n⊤ R⊤ ˙Rαv, Cor. 8 −−−→D⊤ I − (I3 ⊗ n⊤ )(I3 ⊗ t) + (I3 ⊗ n⊤ )(t ⊗ I3) R⊤ ˙Rαv, =D⊤ R⊤ ˙Rα v − (I3 ⊗ n⊤ )(I3 ⊗ t)R⊤ ˙Rαv + (I3 ⊗ n⊤ )(t ⊗ I3)R⊤ ˙Rαv, Cor. 5 −−−→D⊤ (I3 ⊗ v⊤ )vec(˙R ⊤ α R) − (I3 ⊗ n⊤ ) (I3 ⊗ t)R⊤ ˙Rα v + (I3 ⊗ n⊤ )(t ⊗ I3)R⊤ ˙Rαv, Cor. 5 −−−→D⊤ (I3 ⊗ v⊤ )vec(˙R ⊤ α R) − (I3 ⊗ n⊤ ) (I3 ⊗ v⊤ )vec(˙R ⊤ α R(I3 ⊗ t⊤ )) + (I3 ⊗ n⊤ ) (t ⊗ I3)R⊤ ˙Rα v , Cor. 5 −−−→D⊤ (I3 ⊗ v⊤ )vec(˙R ⊤ α R) − (I3 ⊗ n⊤ )(I3 ⊗ v⊤ )vec(˙R ⊤ α R(I3 ⊗ t⊤ )) + (I3 ⊗ n⊤ ) (I3 ⊗ v⊤ )vec(˙R ⊤ α R(t⊤ ⊗ I3)) . (G.6) J (4) 1 =D⊤ I − ( n⊤ t )I + tn⊤ R⊤ Bscv, Cor. 8 −−−→D⊤ I − (I3 ⊗ n⊤ )(I3 ⊗ t) + t n⊤ R⊤ Bscv, Cor. 8 −−−→D⊤ I − (I3 ⊗ n⊤ )(I3 ⊗ t) + (I3 ⊗ n⊤ )(t ⊗ I3) R⊤ Bscv, =D⊤ R⊤ Bs c v − (I3 ⊗ n⊤ )(I3 ⊗ t)R⊤ Bscv + (I3 ⊗ n⊤ )(t ⊗ I3)R⊤ Bscv, Cor. 5 −−−→D⊤ R⊤ (I3 ⊗ c⊤ )vec(B⊤ s ) v − (I3 ⊗ n⊤ )(I3 ⊗ t)R⊤ Bscv + (I3 ⊗ n⊤ )(t ⊗ I3)R⊤ Bscv, =D⊤ R⊤ (I3 ⊗ c⊤ ) vec(B⊤ s )v − (I3 ⊗ n⊤ )(I3 ⊗ t)R⊤ Bscv + (I3 ⊗ n⊤ )(t ⊗ I3)R⊤ Bscv, Cor. 5 −−−→D⊤ (I3 ⊗ v⊤ vec(B⊤ s )⊤ )vec ((I3 ⊗ c)R) − (I3 ⊗ n⊤ )(I3 ⊗ t)R⊤ Bs c v + (I3 ⊗ n⊤ )(t ⊗ I3)R⊤ Bscv, Cor. 5 −−−→D⊤ (I3 ⊗ v⊤ vec(B⊤ s )⊤ )vec ((I3 ⊗ c)R) − (I3 ⊗ n⊤ )(I3 ⊗ t)R⊤ (I3 ⊗ c⊤ )vec(B⊤ s ) v + (I3 ⊗ n⊤ )(t ⊗ I3)R⊤ Bscv, (G.7) 227
  • 250. =D⊤ (I3 ⊗ v⊤ vec(B⊤ s )⊤ )vec ((I3 ⊗ c)R) − (I3 ⊗ n⊤ ) (I3 ⊗ t)R⊤ (I3 ⊗ c⊤ ) vec(B⊤ s )v + (I3 ⊗ n⊤ )(t ⊗ I3)R⊤ Bscv, Cor. 5 −−−→D⊤ (I3 ⊗ v⊤ vec(B⊤ s )⊤ )vec ((I3 ⊗ c)R) − (I3 ⊗ n⊤ ) I3 ⊗ v⊤ vec(B⊤ s )⊤ vec((I3 ⊗ c)R⊤ (I3 ⊗ t⊤ )) + (I3 ⊗ n⊤ )(t ⊗ I3)R⊤ Bs c v, Cor. 5 −−−→D⊤ (I3 ⊗ v⊤ vec(B⊤ s )⊤ )vec ((I3 ⊗ c)R) − (I3 ⊗ n⊤ ) I3 ⊗ v⊤ vec(B⊤ s )⊤ vec((I3 ⊗ c)R⊤ (I3 ⊗ t⊤ )) + (I3 ⊗ n⊤ )(t ⊗ I3)R⊤ (I3 ⊗ c⊤ )vec(B⊤ s ) , =D⊤ (I3 ⊗ v⊤ vec(B⊤ s )⊤ )vec ((I3 ⊗ c)R) − (I3 ⊗ n⊤ ) I3 ⊗ v⊤ vec(B⊤ s )⊤ vec((I3 ⊗ c)R⊤ (I3 ⊗ t⊤ )) + (I3 ⊗ n⊤ ) (t ⊗ I3)R⊤ (I3 ⊗ c⊤ ) vec(B⊤ s ) , Cor. 5 −−−→D⊤ (I3 ⊗ v⊤ vec(B⊤ s )⊤ )vec ((I3 ⊗ c)R) − (I3 ⊗ n⊤ ) I3 ⊗ v⊤ vec(B⊤ s )⊤ vec((I3 ⊗ c)R⊤ (I3 ⊗ t⊤ )) + (I3 ⊗ n⊤ ) I3 ⊗ v⊤ vec(B⊤ s )⊤ vec((I3 ⊗ c)R⊤ (t ⊗ I3)) . We rewrite Equations G.5, G.6, and G.7 in the say way that Equation G.4 as follows: J (2) 1 = D⊤ (I3 ⊗ n⊤ Bs) I3 ⊗ vec(B⊤ s )v ⊤ − D⊤ Bs(IK ⊗ n⊤ ) I3 ⊗ vec(B⊤ s )v ⊤ ⊤ vec((IK ⊗ c)R(I3 ⊗ c⊤ )) vec((IK ⊗ c)R(I3 ⊗ c⊤ )) , J (3) 1 =    D⊤ (I3 ⊗ v⊤ ) ⊤ − D⊤ (I3 ⊗ n⊤ )(I3 ⊗ v⊤ ) ⊤ D⊤ (I3 ⊗ n⊤ )(I3 ⊗ v⊤ ) ⊤    ⊤    vec(˙R ⊤ α R) vec(˙R ⊤ α R(I3 ⊗ t⊤ )) vec(˙R ⊤ α R(t⊤ ⊗ I3))    , J (4) 1 =    D⊤ (I3 ⊗ v⊤ vec(B⊤ s )⊤ ) ⊤ − D⊤ (I3 ⊗ n⊤ ) I3 ⊗ v⊤ vec(B⊤ s )⊤ ⊤ D⊤ (I3 ⊗ n⊤ ) I3 ⊗ v⊤ vec(B⊤ s )⊤ ⊤    ⊤   vec((I3 ⊗ c)R) vec((I3 ⊗ c)R⊤ (I3 ⊗ t⊤ )) vec((I3 ⊗ c)R⊤ (t ⊗ I3))   (G.8) Combining Equations G.4 and G.8 we rewrite the Jacobian element J1 (Equa- tion G.1) as J1 = S⊤ 1 M1, (G.9) 228
  • 251. where S1 =                     D⊤ (I3 ⊗ n⊤ Bs)(I3 ⊗ v⊤ ) ⊤ D⊤ Bs(IK ⊗ n⊤ )(I3 ⊗ v⊤ ) ⊤ D⊤ (I3 ⊗ n⊤ Bs) I3 ⊗ vec(B⊤ s )v ⊤ − D⊤ Bs(IK ⊗ n⊤ ) I3 ⊗ vec(B⊤ s )v ⊤ D⊤ (I3 ⊗ v⊤ ) ⊤ − D⊤ (I3 ⊗ n⊤ )(I3 ⊗ v⊤ ) ⊤ D⊤ (I3 ⊗ n⊤ )(I3 ⊗ v⊤ ) ⊤ D⊤ (I3 ⊗ v⊤ vec(B⊤ s )⊤ ) ⊤ − D⊤ (I3 ⊗ n⊤ ) I3 ⊗ v⊤ vec(B⊤ s )⊤ ⊤ D⊤ (I3 ⊗ n⊤ ) I3 ⊗ v⊤ vec(B⊤ s )⊤ ⊤                     ⊤ 1×(63+81K+18K2) , and M1 =                   vec(˙R ⊤ α R(IK ⊗ c⊤ )) vec(˙R ⊤ α R(I3 ⊗ c⊤ )) vec((IK ⊗ c)R(I3 ⊗ c⊤ )) vec((IK ⊗ c)R(I3 ⊗ c⊤ )) vec(˙R ⊤ α R) vec(˙R ⊤ α R(I3 ⊗ t⊤ )) vec(˙R ⊤ α R(t⊤ ⊗ I3)) vec((I3 ⊗ c)R) vec((I3 ⊗ c)R⊤ (I3 ⊗ t⊤ )) vec((I3 ⊗ c)R⊤ (t ⊗ I3))                   (63+81K+18K2)×1 (G.10) Decomposition of J2 and J3 Decomposing J2 is exactly identical to the above procedure but changing the Euler angle—β instead of α—of the rotational derivative. Hence, we decompose J2 as the product J2 = S1M2 where M2 =                   vec(˙R ⊤ β R(IK ⊗ c⊤ )) vec(˙R ⊤ β R(I3 ⊗ c⊤ )) vec((IK ⊗ c)R(I3 ⊗ c⊤ )) vec((IK ⊗ c)R(I3 ⊗ c⊤ )) vec(˙R ⊤ β R) vec(˙R ⊤ β R(I3 ⊗ t⊤ )) vec(˙R ⊤ β R(t⊤ ⊗ I3)) vec((I3 ⊗ c)R) vec((I3 ⊗ c)R⊤ (I3 ⊗ t⊤ )) vec((I3 ⊗ c)R⊤ (t ⊗ I3))                   (63+81K+18K2)×1 . (G.11) Note that there is no need to compute a matrix S2 as the shape elements (vertices, normals and basis shapes) do not change with respect to J1. We equivalentely 229
  • 252. decompose J3 as J3 = S1M3, where M3 =                   vec(˙R ⊤ γ R(IK ⊗ c⊤ )) vec(˙R ⊤ γ R(I3 ⊗ c⊤ )) vec((IK ⊗ c)R(I3 ⊗ c⊤ )) vec((IK ⊗ c)R(I3 ⊗ c⊤ )) vec(˙R ⊤ γ R) vec(˙R ⊤ γ R(I3 ⊗ t⊤ )) vec(˙R ⊤ γ R(t⊤ ⊗ I3)) vec((I3 ⊗ c)R) vec((I3 ⊗ c)R⊤ (I3 ⊗ t⊤ )) vec((I3 ⊗ c)R⊤ (t ⊗ I3))                   (63+81K+18K2)×1 . (G.12) Decomposition of J4 We expand the term J4 from Equation 5.57 as follows: J4 = D⊤ I + (n⊤ Bsct)I − (n⊤ t)I − Bscn⊤ + tn⊤ R⊤ n⊤ v, (G.13) where D⊤ = ∇ˆuT [˜u]⊤ λH−1 A K. We split Equation G.13 in two chunks by applying the distributive property, J (1) 4 = D⊤ I + (n⊤ Bsct) − Bscn⊤ R⊤ n⊤ v, and J (2) 4 = D⊤ I − (n⊤ t)I + tn⊤ R⊤ n⊤ v. (G.14) We separaterly re-arrange the terms J (1) 4 and J (2) 4 from Equation G.14. We show the process for J (1) 4 in the following: J (1) 4 =D⊤ ( n⊤ Bs ct )I − Bscn⊤ R⊤ n⊤ v, Cor. 8 −−−→D⊤ (I3 ⊗ n⊤ Bs)(IK ⊗ c) − Bs c n⊤ R⊤ n⊤ v, Cor. 8 −−−→D⊤ (I3 ⊗ n⊤ Bs)(IK ⊗ c) − Bs (IK ⊗ n⊤ )(I3 ⊗ c) R⊤ n⊤ v, =D⊤ (I3 ⊗ n⊤ Bs)(IK ⊗ c)R⊤ n⊤ v −D⊤ Bs(IK ⊗ n⊤ )(I3 ⊗ c) r1n⊤ v, =D⊤ n⊤ v(I3 ⊗ n⊤ Bs)(IK ⊗ c)R⊤ −D⊤ n⊤ v Bs(IK ⊗ n⊤ )(I3 ⊗ c) R⊤ . (G.15) 230
  • 253. Notice that we can place the scalar n⊤ 1×3v⊤ 3×1 anywhere in Equation G.15 without using any of the Lemmas of Appendix D. We similarly re-organize the element J (2) 4 , J (2) 4 =D⊤ I − ( n⊤ t )I + tn⊤ R⊤ n⊤ v, Cor. 8 −−−→D⊤ I − (I3 ⊗ n⊤ )(I3 ⊗ t) + t n⊤ R⊤ n⊤ v, Cor. 8 −−−→D⊤ I − (I3 ⊗ n⊤ )(I3 ⊗ t) + (I3 ⊗ n⊤ )(t ⊗ I3) R⊤ n⊤ v, =D⊤ R⊤ n⊤ v − (I3 ⊗ n⊤ )(I3 ⊗ t)R⊤ n⊤ v + (I3 ⊗ n⊤ )(t ⊗ I3)R⊤ n⊤ v, =D⊤ n⊤ vR⊤ − D⊤ n⊤ v(I3 ⊗ n⊤ )(I3 ⊗ t)R⊤ + D⊤ n⊤ v(I3 ⊗ n⊤ )(t ⊗ I3)R⊤ . (G.16) Using Equations G.15 and G.16 we rewrite the Jacobian element J4 as J4 = S⊤ 2 M4, where S2 =         D⊤ n⊤ v(I3 ⊗ n⊤ Bs) ⊤ − D⊤ n⊤ v Bs(IK ⊗ n⊤ ) ⊤ D⊤ n⊤ v ⊤ − D⊤ n⊤ v(I3 ⊗ n⊤ ) ⊤ D⊤ n⊤ v(I3 ⊗ n⊤ ) ⊤         ⊤ 1×(21+6K) , and M4 =       (IK ⊗ c)R⊤ (I3 ⊗ c)R⊤ R⊤ (I3 ⊗ t)R⊤ (t ⊗ I3)R⊤       (21+6K)×3 . (G.17) Decomposition of JK We expand the term JK from Equation 5.57 as follows: JK = D⊤ I + (n⊤ Bsct)I − (n⊤ t)I − Bscn⊤ + tn⊤ Bn⊤ v, (G.18) where D⊤ = ∇ˆuT [˜u]⊤ λH−1 A K. We split Equation G.17 in two chunks by applying the distributive property, J (1) K = D⊤ +(n⊤ Bsct)I − Bscn⊤ Bn⊤ v, J (2) K = D⊤ I − (n⊤ t)I + tn⊤ Bn⊤ v. (G.19) 231
  • 254. We separately re-arrange the terms J (1) K and J (2) K from Equation G.19. We show the process for J (1) K in the following: J (1) K =D⊤ (n⊤ Bsct)I − Bs c n⊤ Bn⊤ v, Cor. 8 −−−→D⊤ (n⊤ Bsct)I − Bs (IK ⊗ n⊤ )(I3 ⊗ c) Bn⊤ v, =D⊤ (n⊤ Bsct)I Bn⊤ v −D⊤ Bs(IK ⊗ n⊤ )(I3 ⊗ c)Bn⊤ v, Cor. ?? −−−−→D⊤ (n⊤ v)B(n⊤ Bsct)IK −D⊤ Bs(IK ⊗ n⊤ ) (I3 ⊗ c) Bn⊤ v , Cor. 5 −−−→D⊤ (n⊤ v)B( n⊤ Bs ct )IK −D⊤ (n⊤ v)Bs(IK ⊗ n⊤ ) (I3K ⊗ vec(B)⊤ )(I3K ⊙ (I3 ⊗ c)) , Cor. 5 −−−→D⊤ (n⊤ v)B (IK ⊗ (n⊤ Bs))(I3K ⊗ c) −D⊤ (n⊤ v)Bs(IK ⊗ n⊤ )(I3K ⊗ vec(B)⊤ )(I3K ⊙ (I3 ⊗ c)) (G.20) We similarly re-arrange the term J (2) K , J (2) K =D⊤ I − ( n⊤ t )I + tn⊤ Bn⊤ v, Cor. 5 −−−→D⊤ I − (I3 ⊗ n⊤ )(I3 ⊗ t) + t n⊤ Bn⊤ v, Cor. 5 −−−→D⊤ I − (I3 ⊗ n⊤ )(I3 ⊗ t) + (I3 ⊗ n⊤ )(t ⊗ I3) Bn⊤ v, =D⊤ Bn⊤ v − D⊤ (I3 ⊗ n⊤ ) (I3 ⊗ t) Bn⊤ v +D⊤ (I3 ⊗ n⊤ )(t ⊗ I3)Bn⊤ v Cor. 5 −−−→D⊤ Bn⊤ v − D⊤ (I3 ⊗ n⊤ ) (n⊤ v)(I9 ⊗ vec(B)⊤ )((I3 ⊗ t) ⊙ IK) +D⊤ (I3 ⊗ n⊤ ) (t ⊗ I3) Bn⊤ v , Cor. 5 −−−→D⊤ Bn⊤ v − D⊤ (I3 ⊗ n⊤ )(n⊤ v)(I9 ⊗ vec(B)⊤ )((I3 ⊗ t) ⊙ IK) +D⊤ (I3 ⊗ n⊤ ) (n⊤ v)(I9 ⊗ vec(B)⊤ )((t ⊗ I3) ⊙ IK) . (G.21) 232
  • 255. Using Equations G.20 and G.21 we rewrite the element JK (Equation G.18) as JK = S⊤ 3 M5, where S3 =         D⊤ (n⊤ v)B(IK ⊗ (n⊤ Bs)) ⊤ − D⊤ (n⊤ v)Bs(IK ⊗ n⊤ )(I3K ⊗ vec(B)⊤ ) ⊤ D⊤ Bn⊤ v ⊤ − D⊤ (I3 ⊗ n⊤ )(n⊤ v)(I9 ⊗ vec(B)⊤ ) ⊤ D⊤ (I3 ⊗ n⊤ )(n⊤ v)(I9 ⊗ vec(B)⊤ ) ⊤         ⊤ 1×(31K+18K2) , and M5 =       (I3K ⊗ c) (I3K ⊙ (I3 ⊗ c)) IK ((I3 ⊗ t) ⊙ IK) ((t ⊗ I3) ⊙ IK)       (31K+18K2)×K . (G.22) Summarizing the results for J⊤ We rewrite Equation 5.57 by gathering Equa- tions G.10, Equations G.11, Equations G.12, Equations G.13, Equations G.17, and Equations G.22, J⊤ = S⊤ M, (G.23) where S = S⊤ 1 , S⊤ 1 , S⊤ 1 , S⊤ 2 , S⊤ 3 1×(210+280K+72K2) , and M =       M1 0 0 0 0 0 M2 0 0 0 0 0 M3 0 0 0 0 0 M4 0 0 0 0 0 M5       (210+280K+72K2)×(6+K) . (G.24) 233
Appendix H
Detailed Complexity of Algorithms

In this section we provide detailed descriptions of the computation of the number of operations for certain stages of the algorithms under review in Chapter 7. First, we separately compute the complexities of the warps f3DTM and f3DMM. Using these results, we subsequently compute the complexities of algorithms HB3DTM, HB3DTMNF, HB3DMM, HB3DMMNF, and HB3DMMSF (see Chapter 7 for a detailed description of these algorithms).

H.1 Warp f3DTM

We compute the number of operations that we need to perform each time we use the warp f3DTM. We only count the operations directly related to the warp; thus, we omit operations that are common to every warp and algorithm, such as the image operation (I or T) or the operation that extracts R and t from µ. We recall the warp definition from Equation 5.10:

f3DTM(˜u, n; µ) = K′ (R + t n⊤) K ˜u′,   (H.1)

where K′ = HA⁻¹ K, and ˜u′ = HA ˜u. We display the dimensions of each term of Equation H.1 in the following:

f3dmmr(˜u, n; µ) = K′ 3×3 (R 3×3 + t 3×1 n⊤ 1×3) K 3×3 ˜u′ 3×1. (H.2)

We compute the complexity of Equation H.2 step by step:

f3dmmr(˜u, n; µ) = K′ 3×3 R 3×3 + t 3×1 n⊤ 1×3 < 9 >M K 3×3 ˜u′ 3×1 ,
  • 258. = K′ 3×3 R 3×3 + tn⊤ 3×3 < 9 >A K 3×3 ˜u′ 3×1 , = K′ 3×3 R + tn⊤ 3×3 < 27 >M+< 18 >A K 3×3 ˜u′ 3×1 , = K′ R + tn⊤ 3×3 K 3×3 < 27 >M+< 18 >A ˜u′ 3×1 , = K′ R + tn⊤ K 3×3 ˜u′ 3×1 < 9 >M+< 6 >A . We sum up all the partial complexities to compute the total complexity for the warp: Θ3dmmr =< 74 > M+ < 51 > A. (H.3) Notice that we have added an 2 extra multiplications in Equation H.3 due to the homogeneous to Cartesian coordinates mapping. H.2 Warp f3DMM We show now how to compute the number of operations of the warp f3dmm. We recall the warp structure from Equation 5.42 and we show its dimensions as follows: f3dmm = K 3×3 R 3×3 + R 3×3 Bs 3×K c K×1 − t 3×1 n⊤ 1×3 K−1 3×3 ˜u′ 3×1 , (H.4) where we define ˜u′ as in Equation H.1. We show the step-by-step complexity in the following: f3dmm = K 3×3 R 3×3 + R 3×3 Bs 3×K c K×1 < 3K >M+< 3K − 3 >A − t 3×1 n⊤ 1×3 K−1 3×3 ˜u′ 3×1 , = K 3×3 R 3×3 + R 3×3 Bsc 3×1 − t 3×1 < 3 >A n⊤ 1×3 K−1 3×3 ˜u′ 3×1 , 236
  • 259. = K 3×3 R 3×3 + R 3×3 Bsc − t 3×1 < 9 >M+< 6 >A n⊤ 1×3 K−1 3×3 ˜u′ 3×1 , = K 3×3 R 3×3 + R Bsc − t 3×1 n⊤ 1×3 < 9 >M K−1 3×3 ˜u′ 3×1 , = K 3×3 R 3×3 + R Bsc − t n⊤ 3×3 < 9 >A K−1 3×3 ˜u′ 3×1 , = K 3×3 R + R Bsc − t n⊤ 3×3 < 27 >M+< 18 >A K−1 3×3 ˜u′ 3×1 , = K R + R Bsc − t n⊤ 3×3 K−1 3×3 < 27 >M+< 18 >A ˜u′ 3×1 , = K R + R Bsc − t n⊤ K−1 3×3 ˜u′ 3×1 < 9 >M+< 6 >A . Summing up these partial complexities—and including 2 multiplications for due to the mapping from P2 to R2 —we compute the complexity of the warp f3dmm as: Θ3dmm =< 83 + 3K > M+ < 57 + 3K > A. (H.5) H.3 Jacobian of Algorithm HB3DTM We calculate the Jacobian matrix by separaterly computing each column element from Equation 5.31. Notice that the expression S (1, t)⊤ ⊗ I9 is common to all the elements of the row of the Jacobian, so we just compute it once to avoid including repeated operations. J1 = S1 1×36 1 t ⊗ I9 36×9 < 27 >M+< 27 >A vec( ˙R ⊤ αt 3×3 R 3×3 ) < 27 >M+< 18 >A 237
  • 260. = S1 1 t ⊗ I9 1×9 vec(˙R ⊤ αt R) 9×1 < 9 >M+< 8 >A J2 = S1 1 t ⊗ I9 1×9 vec( ˙R ⊤ βt 3×3 R 3×3 ) < 27 >M+< 18 >A = S1 1 t ⊗ I9 1×9 vec(˙R ⊤ βt R) 9×1 < 9 >M+< 8 >A J3 = S1 1 t ⊗ I9 1×9 vec( ˙R ⊤ γt 3×3 R 3×3 ) < 27 >M+< 18 >A = S1 1 t ⊗ I9 1×9 vec(˙R ⊤ γt R) 9×1 < 9 >M+< 8 >A J4 = S1 1 t ⊗ I9 1×9 I3 ⊗ n 9×3 < 9 >M+< 6 >A R⊤   1 0 0   , 3×1 = S1 1 t ⊗ I9 I3 ⊗ n 1×3 R⊤   1 0 0   . 3×1 < 3 >M+< 2 >A J5 = S1 1 t ⊗ I9 1×9 I3 ⊗ n 9×3 < 9 >M+< 6 >A R⊤   0 1 0   , 3×1 238
  • 261. = S1 1 t ⊗ I9 I3 ⊗ n 1×3 R⊤   0 1 0   . 3×1 < 3 >M+< 2 >A J6 = S1 1 t ⊗ I9 1×9 I3 ⊗ n 9×3 < 9 >M+< 6 >A R⊤   0 0 1   , 3×1 = S1 1 t ⊗ I9 I3 ⊗ n 1×3 R⊤   0 0 1   . 3×1 < 3 >M+< 2 >A Summing up all the partial complexities of Equations H.3 we have the resulting complexity for computing the Jacobian matrix JHB3DTM: ΘJHB3DTM =< 81 + 75Nv > M+ < 54 + 66Nv > A. (H.6) Notice that some operations—e.g the product R⊤ ˙R∆—are only computed once whereas the rest of the complexities are computed Nv times. H.4 Jacobian of Algorithm HB3DTMNF In the following we compute the number of operations associated to Equations 5.26, that is, we compute the complexity of the gradient replacement stage, not the factor- ization. Again, we compute the complexity for Equations 5.26 in the most efficient way: first we compute the summation I3 − (n(i)⊤ t)I3 + tn(i)⊤ , and then we com- pute the remaining products: J1 = D⊤ 3×1 I3 3×3 − ( n(i)⊤ 1×3 t 3×1 ) I3 3×3 < 6 >M+< 2 >A + t 3×1 n(i)⊤ 1×3 < 9 >M R⊤ 3×3 ˙Rαt 3×3 < 27 >M+< 18 >A v, 3×1 = D⊤ 3×1 I3 − (n⊤ t)I3 + tn(i)⊤ 3×3 < 9 >M+< 6 >A R⊤ ˙Rαt 3×3 v 3×1 < 9 >M+< 6 >A = D(i)⊤ I3 − (n(i)⊤ t)I3 + tn⊤ 1×3 R⊤ ˙Rαt v, 3×1 < 3 >M+< 2 >A 239
  • 262. J2 = D⊤ I3 − (n⊤ t)I3 + tn⊤ 1×3 R⊤ 3×3 ˙Rβt 3×3 < 27 >M+< 18 >A v, 3×1 = D⊤ I3 − (n⊤ t)I3 + tn⊤ 1×3 R⊤ ˙Rβt 3×3 v, 3×1 < 9 >M+< 6 >A = D⊤ I3 − (n⊤ t)I3 + tn⊤ 1×3 R⊤ ˙Rβt v, 3×1 < 3 >M+< 2 >A J3 = D⊤ I3 − (n⊤ t)I3 + tn⊤ 1×3 R⊤ 3×3 ˙Rγt 3×3 < 27 >M+< 18 >A v, 3×1 = D⊤ I3 − (n⊤ t)I3 + tn⊤ 1×3 R⊤ ˙Rγt 3×3 v, 3×1 < 9 >M+< 6 >A = D⊤ I3 − (n⊤ t)I3 + tn⊤ 1×3 R⊤ ˙Rγt v, 3×1 < 3 >M+< 2 >A J4 = D⊤ I3 − (n⊤ t)I3 + tn⊤ 1×3 R⊤ 3×3   1 0 0   3×1 n⊤ 1×3 v, 3×1 < 9 >M = D⊤ I3 − (n⊤ t)I3 + tn⊤ 1×3 R⊤ 3×3   1 0 0   3×1 n⊤ 1×3 v, 3×1 < 3 >M+< 2 >A J5 = D⊤ I3 − (n⊤ t)I3 + tn⊤ 1×3 R⊤ 3×3   0 1 0   3×1 n⊤ 1×3 v, 3×1 < 9 >M 240
  • 263. = D⊤ I3 − (n⊤ t)I3 + tn⊤ 1×3 R⊤ 3×3   0 1 0   3×1 n⊤ 1×3 v, 3×1 < 3 >M+< 2 >A J6 = D⊤ I3 − (n⊤ t)I3 + tn⊤ 1×3 R⊤ 3×3   0 0 1   3×1 n⊤ 1×3 v, 3×1 < 9 >M = D⊤ I3 − (n⊤ t)I3 + tn⊤ 1×3 R⊤ 3×3   0 0 1   3×1 n⊤ 1×3 v. 3×1 < 3 >M+< 2 >A Summing up all the partial complexities of Equations H.4 we have the resulting complexity for computing the Jacobian matrix JHB3DTMNF: ΘJHB3DTMNF =< 81 + 96Nv > M+ < 54 + 56Nv > A. (H.7) Notice that some operations—e.g the product R⊤ ˙R∆—are only computed once whereas the rest of the complexities are computed Nv times. H.5 Jacobian of Algorithm HB3DMMNF In the following we compute the number of operations associated to computing the Jacobian of algorithm hb3dmmnf (Equation 5.57); notice that this algorithm only uses the gradient replacement, not the factorization stage. We compute the complexity of Equations 5.57 in the most efficient way: we first compute the common term D⊤ (Equation 5.58), that is, D⊤ = ∇ˆuT [˜u]⊤ 1×3 I3 3×3 +( n(i)⊤ 1×3 B(i) 3×K c K×1 ) I3 3×3 −( n(i)⊤ 1×3 t 3×1 ) I3 3×3 − B(i) 3×K c K×1 n(i)⊤ 1×3 + t 3×1 n(i)⊤ 1×3 (H.8) 241
  • 264. We break down Equation H.8 into to easily display the number of operations as follows: D⊤ = ∇ˆuT [˜u]⊤ 1×3 I3 + ( n⊤ 1×3 Bs 3×K c K×1 ) I3 3×3 < 3K + 6 >M+< 3(K − 1) + 2 >A − ( n⊤ 1×3 t 3×1 ) I3 3×3 < 6 >M+< 2 >A − Bs 3×K c K×1 n⊤ 1×3 < 3K + 9 >M+< 3(K − 1) >A + t 3×1 n⊤ 1×3 < 9 >M = ∇ˆuT [˜u]⊤ 1×3 I3 3×3 + (n⊤ Bsc)I3 3×3 − (n⊤ t)I3 3×3 − Bscn⊤ 3×3 + tn⊤ 3×3 < 9 >A+< 9 >A+< 9 >A+< 9 >A = ∇ˆuT [˜u]⊤ 1×3 I3 + (n⊤ Bsc)I3 − (n⊤ t)I3 − Bscn⊤ + tn⊤ 3×3 < 9 >M+< 6 >A The complexity of computing D⊤ is the sum of all partial complexities of Equa- tion H.5, that is, ΘD⊤ =< 39 + 6K > M+ < 40 + 6K > A. (H.9) Once we have computed D⊤ there is no need to compute it again. We proceed to compute each term of the Jacobian as follows: J1 = D⊤ 1×3 R⊤ 3×3 ˙Rα 3×3 < 27 >M+< 18 >A I3 3×3 + Bscn⊤ 3×3 v 242
  • 265. = D⊤ 1×3 R⊤ ˙Rα 3×3 I3 + Bscn⊤ < 9 >A v = D⊤ 1×3 R⊤ ˙Rα 3×3 I3 + Bscn⊤ 3×3 < 27 >M+< 18 >A v 3×1 = D⊤ 1×3 R⊤ ˙Rα 3×3 I3 + Bscn⊤ 3×3 < 27 >M+< 18 >A v 3×1 = D⊤ 1×3 R⊤ ˙Rα I3 + Bscn⊤ 3×3 v. 3×1 < 12 >M+< 8 >A J2 = D⊤ 1×3 R⊤ 3×3 ˙Rβ 3×3 < 27 >M+< 18 >A I3 + Bscn⊤ 3×3 v 3×1 = D⊤ 1×3 R⊤ ˙Rβ 3×3 I3 + Bscn⊤ 3×3 < 27 >M+< 18 >A v 3×1 = D⊤ 1×3 R⊤ ˙Rβ I3 + B(i) cn⊤ 3×3 v. 3×1 < 12 >M+< 8 >A J3 = D⊤ 1×3 R⊤ 3×3 ˙Rγ 3×3 < 27 >M+< 18 >A I3 + Bscn⊤ 3×3 v 3×1 = D⊤ 1×3 R⊤ ˙Rγ 3×3 I3 + Bscn⊤ 3×3 < 27 >M+< 18 >A v 3×1 = D⊤ 1×3 R⊤ ˙Rγ I3 + Bscn⊤ 3×3 v. 3×1 < 12 >M+< 8 >A 243
  • 266. J4 = D⊤ 1×3 R⊤   1 0 0   3×1 n⊤ 1×3 < 9 >M v 3×1 = D⊤ 1×3 R⊤   1 0 0   n⊤ 3×3 v 3×1 < 12 >M+< 8 >A = D⊤ 1×3 R⊤   1 0 0   n⊤ v 3×1 < 3 >M+< 2 >A J5 = D⊤ 1×3 R⊤   0 1 0   3×1 n⊤ 1×3 < 9 >M v 3×1 = D⊤ 1×3 R⊤   0 1 0   n⊤ 3×3 v 3×1 < 12 >M+< 8 >A = D⊤ 1×3 R⊤   0 1 0   n⊤ v 3×1 < 3 >M+< 2 >A J6 = D⊤ 1×3 R⊤   0 0 1   3×1 n⊤ 1×3 < 9 >M v 3×1 244
  • 267. = D⊤ 1×3 R⊤   0 0 1   n⊤ 3×3 v 3×1 < 12 >M+< 8 >A = D⊤ 1×3 R⊤   0 0 1   n⊤ v 3×1 < 3 >M+< 2 >A Jk = D⊤ 1×3 R⊤ 3×3 Bk 3×1 < 9 >M+< 6 >A n⊤ 1×3 v 3×1 < 3 >M+< 2 >A = D⊤ 1×3 R⊤ Bk 3×1 n⊤ v 1×1 < 3 >M = D⊤ 1×3 R⊤ Bkn⊤ v 3×1 < 3 >M+< 2 >A , i = 1, . . . , K. Summing up all the partial complexities of Equations H.5 and H.9 we have the resulting complexity for computing the Jacobian matrix Jhb3dmmnf: ΘJhb3dmmnf =< 81 + (24K + 219)Nv > M+ < 54 + (16K + 171)Nv > A. (H.10) Notice that some operations—e.g the product R⊤ ˙R∆—are only computed once whereas the rest of the complexities are computed Nv times. 245
  • 268. H.6 Jacobian of Algorithm HB3DMMSF In the following we compute the number of operations associated to computing the Jacobian of algorithm hb3dmmsf (Equation 8.5) by means of a partial factorization. We compute the complexity of Equations 8.5 in the most efficient way, that is, we avoid to compute repeated operations more than once. We show the complexities associated to each element of the row of the Jacobian in the following: J1 = S1 1×(3(K+4))  I3 ⊗   1 t c     (3(K+4))×3 < 3K + 5 >M+< 3K + 3 >A R⊤ 3×3 ˙Rαt 3×3 v(i) 3×1 + B(i) 3×K c K×1 n(i)⊤ 1×3 v(i) 3×1 = S1  I3 ⊗   1 t c     1×3 R⊤ 3×3 ˙Rαt 3×3 < 27 >M+< 18 >A v(i) 3×1 + B(i) 3×K c K×1 n(i)⊤ 1×3 v(i) 3×1 < 3 >M+< 2 >A = S1  I3 ⊗   1 t c     1×3 R⊤ ˙Rαt 3×3 v(i) 3×1 + B(i) 3×K c K×1 n(i)⊤ v(i) 1×1 < 3K + 3 >M+< 3K − 3 >A = S1  I3 ⊗   1 t c     1×3 R⊤ ˙Rαt 3×3 v(i) 3×1 + B(i) cn(i)⊤ v(i) 3×1 < 3 >A = S1  I3 ⊗   1 t c     1×3 R⊤ ˙Rαt 3×3 v(i) + B(i) cn(i)⊤ v(i) . 3×1 < 12 >M+< 8 >A J2 = S1  I3 ⊗   1 t c     1×3 R⊤ 3×3 ˙Rβt 3×3 < 27 >M+< 18 >A v(i) + B(i) cn(i)⊤ v(i) 3×1 246
  • 269. = S1  I3 ⊗   1 t c     1×3 R⊤ ˙Rβt 3×3 v(i) B(i) cn(i)⊤ v(i) . 3×1 < 12 >M+< 8 >A J3 = S1  I3 ⊗   1 t c     1×3 R⊤ 3×3 ˙Rγt 3×3 < 27 >M+< 18 >A v(i) + B(i) cn(i)⊤ v(i) 3×1 = S1  I3 ⊗   1 t c     1×3 R⊤ ˙Rγt 3×3 v(i) + B(i) cn(i)⊤ v(i) . 3×1 < 12 >M+< 8 >A J4 = S1  I3 ⊗   1 t c     1×3 n(i)⊤ v(i) 1×1 < 3 >M R⊤   1 0 0   3×1 = S1  I3 ⊗   1 t c     n(i)⊤ v(i) 1×3 R⊤   1 0 0   . 3×1 < 3 >M+< 2 >A J5 = S1  I3 ⊗   1 t c     1×3 n(i)⊤ v(i) 1×1 < 3 >M R⊤   0 1 0   3×1 = S1  I3 ⊗   1 t c     n(i)⊤ v(i) 1×3 R⊤   0 1 0   . 3×1 < 3 >M+< 2 >A 247
  • 270. J6 = S1  I3 ⊗   1 t c     1×3 n(i)⊤ v(i) 1×1 < 3 >M R⊤   0 0 1   3×1 = S1  I3 ⊗   1 t c     n(i)⊤ v(i) 1×3 R⊤   0 0 1   . 3×1 < 3 >M+< 2 >A Jk = S1  I3 ⊗   1 t c     1×3 n⊤ v 1×1 < 3 >M R⊤ 3×3 Bk 3×1 < 9 >M+< 6 >A = S1  I3 ⊗   1 t c     n⊤ v 1×3 R⊤ Bk. 3×1 < 3 >M+< 2 >A Summing up all the partial complexities of Equations H.6 we obtain the resulting complexity for computing the Jacobian matrix Jhb3dmmsf: ΘJhb3dmmsf =< 81 + (18K + 60)Nv > M+ < 54 + (14K + 36)Nv > A. (H.11) Notice that some operations—e.g the product R⊤ ˙R∆—are only computed once whereas the rest of the complexities are computed Nv times. 248
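For quick side-by-side comparisons, the per-iteration Jacobian costs derived in Sections H.3 to H.6 (Equations H.6, H.7, H.10 and H.11) can be tabulated as functions of the number of model vertices Nv and basis shapes K; the helper below simply evaluates those formulas for illustrative values of Nv and K.

def jacobian_costs(Nv, K):
    # Multiplications and additions per Jacobian computation,
    # taken from Equations H.6, H.7, H.10 and H.11.
    return {
        "HB3DTM":   (81 + 75 * Nv,               54 + 66 * Nv),
        "HB3DTMNF": (81 + 96 * Nv,               54 + 56 * Nv),
        "HB3DMMNF": (81 + (24 * K + 219) * Nv,   54 + (16 * K + 171) * Nv),
        "HB3DMMSF": (81 + (18 * K + 60) * Nv,    54 + (14 * K + 36) * Nv),
    }

for name, (mults, adds) in jacobian_costs(Nv=5000, K=10).items():
    print(f"{name:9s} {mults:>10d} mults {adds:>10d} adds")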