A Machine Learning (Theory)
Perspective on Computer Vision

            Peter Auer
      Montanuniversität Leoben
Outline

 What I am doing and how computer
 vision approached me (in 2002).
 Some modern machine learning
 algorithms used in computer vision,
 and their development:
   Boosting
   Support Vector Machines
 Concluding remarks
My background
 COLT 1993
   Conference on Learning Theory
   "On-Line Learning of Rectangles in Noisy
   Environments"

 FOCS 1995
   Symp. Foundations of Computer Science
   "Gambling in a Rigged Casino: The Adversarial
   Multi-Armed Bandit Problem"
   with N. Cesa-Bianchi, Y. Freund, R. Schapire

 ICML, NIPS, STOC, …
A computer vision project

 EU-Project LAVA, 2002
   “Learning for adaptable visual
   assistants”
   XRCE: Ch. Dance, R. Mohr
   INRIA Grenoble: C. Schmid, B. Triggs
   RHUL: J. Shawe-Taylor
   IDIAP: S. Bengio
LAVA Proposal
 Vision (goals)
   Recognition of generic objects and events
   Attention Mechanisms
   Baseline and high-level descriptors
 Learning (means)
   Statistical Analysis
   Kernels and models and features
   Online Learning
Online learning
 Online Information Setting
   An input is received, a prediction is made, and
   then feedback is acquired.
   Goal: To make good predictions with respect to
   a (large) set of fixed predictors.
 Online Computation Setting
   The amount of computation per new example –
   to update the learned information – is constant
   (or small).
   Goal: To be fast computationally.
 (Near) real-time learning?
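
A minimal sketch of this protocol in Python; `predict` and `update` are hypothetical hooks standing in for any concrete online learner:

```python
def run_online(predict, update, stream):
    """Minimal sketch of the online protocol: per example, an input is
    received, a prediction is made, feedback is acquired, and the learner
    is updated with (small) constant work. `predict` and `update` are
    hypothetical hooks for any concrete online learner."""
    mistakes = 0
    for x, y in stream:
        y_hat = predict(x)   # prediction on the new input
        if y_hat != y:       # feedback: the true label y
            mistakes += 1
        update(x, y)         # constant-time update of the learned state
    return mistakes
```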
Learning for vision around 2002
 Viola, Jones, CVPR 2001:
   Rapid object detection using a boosted cascade
   of simple features. (Boosting)
 Agarwal, Roth, ECCV 2002:
   Learning a Sparse Representation for Object
   Detection. (Winnow)
 Fergus, Perona, Zisserman, CVPR 2003:
   Object class recognition by unsupervised
   scale-invariant learning. (EM-type algorithm)
 Wallraven, Caputo, Graf, ICCV 2003:
   Recognition with local features: the kernel
   recipe. (SVM)
Our contribution in LAVA

 Opelt, Fussenegger, Pinz, Auer,
 ECCV 2004:
   Weak hypotheses and boosting for
   generic object detection and
   recognition.
Image classification as a learning problem

       Images are represented as vectors x = (x_1, . . . , x_n) ∈ X ⊂ ℝ^n.

       Given
            training images x^(1), . . . , x^(m) ∈ X
            with their classifications y^(1), . . . , y^(m) ∈ Y = {−1, +1},
       a classifier H : X → Y is learned.

       We consider linear classifiers H_w, w ∈ ℝ^n,

            H_w(x) = +1  if w · x ≥ 0
                     −1  if w · x < 0

       (w · x = Σ_{i=1}^n w_i x_i).
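
As a concrete illustration of the definition above, a minimal sketch in Python (NumPy assumed):

```python
import numpy as np

def H(w, x):
    """The linear classifier H_w: +1 if w · x >= 0, else -1."""
    return 1 if np.dot(w, x) >= 0 else -1
```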



The Perceptron algorithm (Rosenblatt, 1958)
   The Perceptron algorithm maintains a weight vector w^(t) as its
   current classifier.

       Initialization: w^(1) = 0.

       Predict ŷ^(t) = +1  if w^(t) · x^(t) ≥ 0
                       −1  if w^(t) · x^(t) < 0

       If ŷ^(t) = y^(t) then w^(t+1) = w^(t),
       else w^(t+1) = w^(t) + η y^(t) x^(t).
       (η is the learning rate.)
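
A sketch of the algorithm in Python, following the slide (one pass over the data; the update fires only on mistakes):

```python
import numpy as np

def perceptron(X, Y, eta=1.0):
    """Rosenblatt's Perceptron as on the slide: update only on mistakes.

    X: (m, n) array of training vectors, Y: (m,) array of labels in {-1, +1}.
    """
    w = np.zeros(X.shape[1])          # w^(1) = 0
    for x, y in zip(X, Y):
        y_hat = 1 if w @ x >= 0 else -1
        if y_hat != y:                # mistake: w^(t+1) = w^(t) + eta*y*x
            w = w + eta * y * x
    return w
```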


       The Perceptron was abandoned in 1969, when Minsky and
       Papert showed that Perceptrons are not able to learn some
       simple functions.
       It was revived only in the 1980s, when neural networks became
       popular.

Perceptron cannot learn XOR




 No single line can separate the green
 from the red boxes.
Non-linear classifiers



       Extending the feature space (or using kernels) circumvents the
       problem:
       Since XOR is a quadratic function, use (1, x_1, x_2, x_1², x_2², x_1 x_2)
       instead of (x_1, x_2).
       For x_1, x_2 ∈ {+1, −1},

                               x_1 XOR x_2 = x_1 x_2.
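
A small sketch verifying this: in the expanded feature space, the weight vector that picks out the x_1 x_2 coordinate classifies all four XOR points correctly:

```python
import numpy as np

def phi(x1, x2):
    """The quadratic feature map (1, x1, x2, x1^2, x2^2, x1*x2) from above."""
    return np.array([1, x1, x2, x1**2, x2**2, x1 * x2])

# The weight vector selecting the x1*x2 coordinate separates XOR:
w = np.array([0, 0, 0, 0, 0, 1])
for x1 in (+1, -1):
    for x2 in (+1, -1):
        assert np.sign(w @ phi(x1, x2)) == x1 * x2  # label = x1 XOR x2 = x1*x2
```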




Winnow (Littlestone 1987)


      Works like the Perceptron algorithm except for the update of
      the weights:

           w_i^(t+1) = w_i^(t) · exp(η y^(t) x_i^(t))

      for some η > 0. (w^(1) = 1.)

      Observe the multiplicative update of the weights, i.e.
           log w_i^(t+1) = log w_i^(t) + η y^(t) x_i^(t).
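
A sketch of Winnow in Python, under the assumption (matching the Perceptron slide) that the multiplicative update is applied only on mistakes; threshold 0 is used as on the Perceptron slide, while the classical Winnow uses a positive threshold:

```python
import numpy as np

def winnow(X, Y, eta=0.5):
    """Winnow sketch: the Perceptron loop with the multiplicative
    weight update from the slide."""
    w = np.ones(X.shape[1])                # w^(1) = 1
    for x, y in zip(X, Y):
        y_hat = 1 if w @ x >= 0 else -1
        if y_hat != y:
            w = w * np.exp(eta * y * x)    # w_i <- w_i * exp(eta * y * x_i)
    return w
```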


      Closely related work:
      The Weighted Majority Algorithm (Littlestone, Warmuth)


Comparison of the Perceptron algorithm and Winnow


      Perceptron and Winnow scale differently with respect to
      relevant, used, and irrelevant attributes:

                         all attributes         n
                         relevant attributes    k
                         used attributes        d

                                     # training ex.
                      Perceptron     √(dk)
                      Winnow         k log n




AdaBoost (Freund, Schapire, 1995)


      AdaBoost maintains weights v_t^(s) on the training examples
      (x^(s), y^(s)) over time t:

      Initialize weights v_0^(s) = 1.
      For t = 1, 2, . . .
           Select the coordinate i_t with maximal correlation with the labels,
           Σ_s v_t^(s) y^(s) x_i^(s), as weak hypothesis.
           Choose α_t which minimizes Σ_s v_t^(s) exp(−α_t y^(s) x_{i_t}^(s)).
           Update v_{t+1}^(s) = v_t^(s) exp(−α_t y^(s) x_{i_t}^(s)).
      For x = (x_1, . . . , x_n) predict sign(Σ_t α_t x_{i_t}).
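
A sketch of this coordinate-wise AdaBoost in Python. The closed-form α_t below is an assumption: it minimizes the weighted exponential loss exactly for ±1-valued features, and is a standard approximation otherwise.

```python
import numpy as np

def adaboost(X, Y, T=10):
    """Coordinate-wise AdaBoost sketch: weak hypotheses are coordinates."""
    m, n = X.shape
    v = np.ones(m)                             # v_0^(s) = 1
    alphas, coords = [], []
    for _ in range(T):
        corr = (v * Y) @ X                     # sum_s v^(s) y^(s) x_i^(s), all i
        i_t = int(np.argmax(np.abs(corr)))     # most (anti)correlated coordinate
        margins = Y * X[:, i_t]
        eps = np.clip(np.sum(v * (margins <= 0)) / np.sum(v), 1e-10, 1 - 1e-10)
        alpha = 0.5 * np.log((1 - eps) / eps)  # minimizes the weighted exp-loss
        v = v * np.exp(-alpha * margins)       # v_{t+1}^(s) update from the slide
        alphas.append(alpha)
        coords.append(i_t)

    def predict(x):                            # sign(sum_t alpha_t x_{i_t})
        return np.sign(sum(a * x[i] for a, i in zip(alphas, coords)))
    return predict
```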



History of Boosting (1)
 Rob Schapire:
 The strength of weak learnability, 1990.
   Showed that classifiers which are only 51%
   correct can be combined into a 99% correct
   classifier.
   Rather a theoretical result, since the algorithm
   was complicated and not practical.
   I know people who thought that this was not
   an interesting result.
History of Boosting (2)

 Yoav Freund:
 Boosting a weak learning algorithm
 by majority, 1995.
   Improved boosting algorithm, but still
   complicated and theoretical.
   Only logarithmically many examples
   are forwarded to the weak learner!
History of Boosting (3)
 Y. Freund and R. Schapire:
 A decision-theoretic generalization of on-line
 learning and an application to boosting, 1995.
   Very simple boosting algorithm, easy to implement.
   Theoretically less interesting.
   Performs very well in practice.

 Won the Gödel Prize in 2003 and the Kanellakis
 Prize in 2004. (Both are prestigious prizes in
 Theoretical Computer Science.)

 Since then, many variants of Boosting have appeared
 (mainly to improve error robustness):
   BrownBoost, soft margin boosting, LPBoost.
Support Vector Machines (SVMs)
 In its vanilla version, the SVM also learns a linear classifier.

 It maximizes the distance between the decision
 boundary and the nearest training points.
    Formulates learning as a well-behaved optimization
    problem.

 Invented by Vladimir Vapnik
 (1979, Russian paper).
    Translated into English in 1982.
    Initially no practical applications,
    since it required linear separability.
Practical SVMs
 Vapnik:
    The Nature of Statistical Learning Theory, 1995.
    Statistical Learning Theory, 1998.

 Cristianini, Shawe-Taylor:
 An Introduction to Support Vector Machines, 2000.

 Soft margin SVMs:
    Tolerate incorrectly labeled training examples (by
    using slack variables).

 Non-linear classification using the “kernel trick”.
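
A minimal soft-margin, kernelized SVM sketch; scikit-learn is an assumption here (it postdates much of this history) and is used only to illustrate the slack parameter C and the kernel:

```python
import numpy as np
from sklearn.svm import SVC

# Four XOR points and labels y = x1*x2, as in the earlier non-linear slide.
X = np.array([[+1, +1], [+1, -1], [-1, +1], [-1, -1]])
y = X[:, 0] * X[:, 1]

# Soft margin via C (smaller C tolerates more slack); non-linearity via a
# degree-2 polynomial kernel, which suffices for the quadratic XOR function.
clf = SVC(C=1.0, kernel="poly", degree=2)
clf.fit(X, y)
print(clf.predict(X))   # recovers the XOR labels [ 1 -1 -1  1]
```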
Support Vector Machines (SVMs)



       [Figure: two classes of training points, '+' and '−', illustrating
       the maximum-margin separating hyperplane between them.]
The kernel trick (1)

       Recall the Perceptron update,

            w^(t+1) = w^(t) + η y^(t) x^(t) = η Σ_{τ=1}^{t} y^(τ) x^(τ),

       and classification,

            ŷ = sign(w^(t+1) · x) = sign(Σ_{τ=1}^{t} y^(τ) x^(τ) · x).

       A kernel function generalizes the inner product,

            ŷ = sign(Σ_{τ=1}^{t} y^(τ) K(x^(τ), x)).
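
A sketch of the resulting dual-form ("kernelized") Perceptron; since w changes only on mistakes, it suffices to store the mistake examples:

```python
def kernel_perceptron(stream, K):
    """Dual-form Perceptron sketch. w is never formed explicitly: prediction
    uses sign(sum_tau y^(tau) K(x^(tau), x)) over the stored mistake rounds
    (eta is dropped, since a positive factor does not change the sign)."""
    support = []                                   # (x^(tau), y^(tau)) pairs
    for x, y in stream:
        s = sum(yt * K(xt, x) for xt, yt in support)
        y_hat = 1 if s >= 0 else -1
        if y_hat != y:                             # mistake: extend the sum
            support.append((x, y))
    return support
```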



The kernel trick (2)


       The inner product x^(τ) · x is a measure of similarity:
       x^(τ) · x is maximal (for normalized vectors) if x^(τ) = x.


       The kernel function is a similarity measure in feature space,
            K(x^(τ), x) = Φ(x^(τ)) · Φ(x).


       Kernel functions can be designed to capture the relevant
       similarities of the domain.
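
For example, a degree-2 polynomial kernel computes the inner product of the quadratic feature map from the XOR slide without ever forming it explicitly (a standard construction, sketched here):

```python
import numpy as np

def poly2_kernel(u, v):
    """(1 + u·v)^2: for 2-d inputs this equals Phi(u)·Phi(v) with
    Phi(x) = (1, √2·x1, √2·x2, x1², x2², √2·x1·x2), i.e. the quadratic
    feature map from the XOR slide up to constant rescalings."""
    return (1.0 + np.dot(u, v)) ** 2
```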


       Aizerman, Braverman, Rozonoer:
       Theoretical foundations of the potential function method in
       pattern recognition learning, 1964.


Where are we going?

 New learning algorithms?
 Better image descriptors!
 Probably they need to be learned.
 Probably they need to be
 hierarchical.
 We need (to use) more data.
Final remark on algorithm evaluation
and benchmarks

 Computer vision is where machine learning
 was 10 years ago (at least for object
 classification).

 Benchmark datasets are starting to
 become available, e.g. PASCAL
 VOC.
