Support Vector Machine
  (and Statistical Learning Theory)

           Tutorial
             Jason Weston
          NEC Labs America
  4 Independence Way, Princeton, USA.
         jasonw@nec-labs.com
1 Support Vector Machines: history
 • SVMs were introduced in COLT-92 by Boser, Guyon & Vapnik, and have
   become rather popular since.

 • Theoretically well motivated algorithm: developed from Statistical
   Learning Theory (Vapnik & Chervonenkis) since the 60s.

 • Empirically good performance: successful applications in many
   fields (bioinformatics, text, image recognition, . . . )
2 Support Vector Machines: history II
 • Centralized website: www.kernel-machines.org.

 • Several textbooks, e.g. "An Introduction to Support Vector
   Machines" by Cristianini and Shawe-Taylor.
 • A large and diverse community works on them: from machine
   learning, optimization, statistics, neural networks, functional
   analysis, etc.
3 Support Vector Machines: basics
[Boser, Guyon, Vapnik ’92],[Cortes & Vapnik ’95]


[Figure: a − class and a + class separated by a hyperplane, with the margin
marked on either side.]
Nice properties: convex, theoretically motivated, nonlinear with kernels.
4 Preliminaries:
 • Machine learning is about learning structure from data.

 • Although the class of algorithms called "SVMs" can do more, in this
   talk we focus on pattern recognition.

 • So we want to learn the mapping: X → Y, where x ∈ X is some
   object and y ∈ Y is a class label.
 • Let’s take the simplest case: 2-class classification. So: x ∈ Rn ,
   y ∈ {±1}.
5 Example:

Suppose we have 50 photographs of elephants and 50 photos of tigers.




[Images: elephant photos vs. tiger photos.]

We digitize them into 100 × 100 pixel images, so we have x ∈ Rⁿ where
n = 10,000.
Now, given a new (different) photograph we want to answer the question:
is it an elephant or a tiger? [we assume it is one or the other.]
6 Training sets and prediction models
 • input/output sets X , Y

 • training set (x1 , y1 ), . . . , (xm , ym )
 • "generalization": given a previously unseen x ∈ X, find a suitable
   y ∈ Y.

 • i.e., we want to learn a classifier: y = f (x, α), where α are the
   parameters of the function.
 • For example, if we are choosing our model from the set of
   hyperplanes in Rn , then we have:

                          f (x, {w, b}) = sign(w · x + b).
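As a concrete sketch (an illustration, not from the original slides), the hyperplane decision function can be written directly in NumPy; the parameters w and b below are made up:

```python
import numpy as np

def f(x, w, b):
    # the hyperplane decision function: f(x, {w, b}) = sign(w · x + b)
    return int(np.sign(np.dot(w, x) + b))

# hypothetical parameters, chosen only for illustration
w, b = np.array([1.0, -1.0]), 0.5
print(f(np.array([2.0, 0.0]), w, b))  # → 1
print(f(np.array([0.0, 2.0]), w, b))  # → -1
```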
7 Empirical Risk and the true Risk
 • We can try to learn f (x, α) by choosing a function that performs well
   on training data:
                Remp(α) = (1/m) Σ_{i=1}^m ℓ(f(xi, α), yi) = Training Error

    where ℓ is the zero-one loss function: ℓ(y, ŷ) = 1 if y ≠ ŷ, and 0
    otherwise. Remp is called the empirical risk.
 • By doing this we are trying to minimize the overall risk:

               R(α) = ∫ ℓ(f(x, α), y) dP(x, y) = Test Error

    where P(x, y) is the (unknown) joint distribution function of x and y.
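As an illustration (not from the slides), the empirical risk under the zero-one loss is just a misclassification rate:

```python
import numpy as np

def empirical_risk(predict, X, y):
    # Remp = (1/m) * sum_i l(predict(x_i), y_i) with the zero-one loss,
    # i.e. the fraction of training points the classifier gets wrong
    predictions = np.array([predict(x) for x in X])
    return float(np.mean(predictions != y))

# toy data and a hypothetical classifier thresholding the first feature
X = np.array([[1.0, 0.0], [-1.0, 0.0], [2.0, 3.0]])
y = np.array([1, -1, -1])
risk = empirical_risk(lambda x: 1 if x[0] > 0 else -1, X, y)
print(risk)  # one of three points is misclassified
```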
8 Choosing the set of functions
What if we allow f (x, α) to range over all functions from X to {±1}?
Training set (x1 , y1 ), . . . , (xm , ym ) ∈ X × {±1}
Test set x̄1, . . . , x̄m̄ ∈ X,
such that the two sets do not intersect.
For any f there exists f∗ such that:
 1. f∗(xi) = f(xi) for all i
 2. f∗(x̄j) ≠ f(x̄j) for all j
Based on the training data alone, there is no means of choosing which
function is better. On the test set however they give different results. So
generalization is not guaranteed.
=⇒ a restriction must be placed on the functions that we allow.
9 Empirical Risk and the true Risk
Vapnik & Chervonenkis showed that an upper bound on the true risk can
be given by the empirical risk + an additional term:


            R(α) ≤ Remp(α) + √( [ h(log(2m/h) + 1) − log(η/4) ] / m )

which holds with probability 1 − η, where h is the VC dimension of the set of
functions parameterized by α.
 • The VC dimension of a set of functions is a measure of their capacity
   or complexity.
 • If you can describe a lot of different phenomena with a set of
   functions then the value of h is large.
[VC dim = the maximum number of points that can be separated in all
possible ways by that set of functions.]
10 VC dimension:

The VC dimension of a set of functions is the maximum number of points
that can be separated in all possible ways by that set of functions. For
hyperplanes in Rn , the VC dimension can be shown to be n + 1.
[Figures: all 2³ = 8 labelings of three points in the plane, each separated by a
hyperplane; three points can be shattered by lines in R², matching a VC
dimension of n + 1 = 3.]
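To make the shattering claim concrete, the sketch below (an illustration not taken from the slides; the perceptron and the three chosen points are assumptions) checks that every one of the 2³ labelings of three non-collinear points in R² is linearly separable:

```python
import itertools
import numpy as np

# Three non-collinear points in R^2: hyperplanes (VC dim = n + 1 = 3)
# should realize every one of the 2^3 labelings.
pts = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])

def separable(X, y, epochs=1000, lr=0.1):
    # simple perceptron with bias; returns True once it finds a
    # hyperplane sign(w·x + b) that reproduces the labeling y exactly
    w, b = np.zeros(2), 0.0
    for _ in range(epochs):
        done = True
        for xi, yi in zip(X, y):
            if yi * (w @ xi + b) <= 0:   # misclassified: perceptron update
                w += lr * yi * xi
                b += lr * yi
                done = False
        if done:
            return True
    return False

shattered = all(separable(pts, np.array(lab))
                for lab in itertools.product([-1, 1], repeat=3))
print(shattered)  # → True
```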
11 VC dimension and capacity of functions

Simplification of bound:
  Test Error ≤ Training Error + Complexity of set of Models

 • Many bounds of this form have been proved (with different measures
   of capacity). The complexity function is often called a regularizer.

 • If you take a high-capacity set of functions (one that can explain a
   lot), you get low training error. But you might "overfit".

 • If you take a very simple set of models, you have low complexity, but
   won’t get low training error.
12 Capacity of a set of functions (classification)




[Figure: classifiers of increasing capacity fitted to the same data. Images
taken from a talk by B. Schoelkopf.]
13 Capacity of a set of functions (regression)
[Figure: a regression problem, showing a sine-curve fit and a hyperplane fit
to samples from the true function.]
14 Controlling the risk: model complexity

[Figure: the bound on the risk as the sum of the empirical risk (training
error) and the confidence interval, plotted against VC dimension; the bound
is minimized at an intermediate h∗ within the nested sequence of model sets
S1 ⊂ · · · ⊂ S∗ ⊂ · · · ⊂ Sn.]
15 Capacity of hyperplanes

Vapnik & Chervonenkis also showed the following:
Consider hyperplanes (w · x) = 0, where w is normalized w.r.t. a set of
points X∗ such that: min_i |w · xi| = 1.
The set of decision functions fw (x) = sign(w · x) defined on X ∗ such
that ||w|| ≤ A has a VC dimension satisfying

                               h ≤ R²A²,

where R is the radius of the smallest sphere around the origin containing
X ∗.
=⇒ minimizing ||w||² gives low capacity
=⇒ minimizing ||w||² is equivalent to obtaining a large margin classifier
[Figure: a separating hyperplane {x | <w, x> + b = 0}, with <w, x> + b > 0
on the + side and <w, x> + b < 0 on the − side.]
[Figure: the margin hyperplanes {x | <w, x> + b = +1} and
{x | <w, x> + b = −1}. Note: for x1 with yi = +1 and x2 with yi = −1 lying
on these hyperplanes,

                 <w, x1> + b = +1
                 <w, x2> + b = −1
            =⇒   <w, (x1 − x2)> = 2
            =⇒   <w/||w||, (x1 − x2)> = 2/||w||

so the margin width is 2/||w||.]
16 Linear Support Vector Machines (at last!)

So, we would like to find the function which minimizes an objective like:
  Training Error + Complexity term
We write that as:
                  (1/m) Σ_{i=1}^m ℓ(f(xi, α), yi) + Complexity term

For now we will choose the set of hyperplanes (we will extend this later),
so f (x) = (w · x) + b:
                  (1/m) Σ_{i=1}^m ℓ(w · xi + b, yi) + ||w||²

subject to min_i |w · xi| = 1.
17 Linear Support Vector Machines II
That function before was a little difficult to minimize because of the step
function in ℓ(y, ŷ) (either 1 or 0).
Let’s assume we can separate the data perfectly. Then we can optimize
the following:
Minimize ||w||², subject to:


                      (w · xi + b) ≥ 1, if yi = 1
                    (w · xi + b) ≤ −1, if yi = −1
These two constraints can be combined into:

                            yi (w · xi + b) ≥ 1

This is a quadratic program.
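As an illustration of the quadratic program, a generic constrained optimizer can solve a tiny instance directly (the data here is a made-up separable toy set, and scipy's SLSQP stands in for the specialized QP solvers real SVM packages use):

```python
import numpy as np
from scipy.optimize import minimize

# hard-margin linear SVM as the QP from the slide:
# minimize ||w||^2 subject to y_i (w · x_i + b) >= 1
X = np.array([[2.0, 2.0], [3.0, 3.0], [-2.0, -2.0], [-3.0, -1.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])

def objective(p):                 # p = (w1, w2, b); b is unpenalized
    return p[0] ** 2 + p[1] ** 2

cons = [{'type': 'ineq',
         'fun': lambda p, xi=xi, yi=yi: yi * (p[:2] @ xi + p[2]) - 1.0}
        for xi, yi in zip(X, y)]

res = minimize(objective, x0=np.zeros(3), constraints=cons)
w, b = res.x[:2], res.x[2]
print(np.sign(X @ w + b))         # reproduces the training labels
```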
18 SVMs : non-separable case
To deal with the non-separable case, one can rewrite the problem as:
Minimize:
                                 ||w||² + C Σ_{i=1}^m ξi
subject to:

                     yi (w · xi + b) ≥ 1 − ξi ,         ξi ≥ 0

This is just the same as the original objective:
                  (1/m) Σ_{i=1}^m ℓ(w · xi + b, yi) + ||w||²

except ℓ is no longer the zero-one loss, but is called the "hinge loss":
ℓ(y, ŷ) = max(0, 1 − yŷ). This is still a quadratic program!
[Figure: a soft-margin separating hyperplane; a + point lying inside the
margin incurs a slack ξi.]
19 Support Vector Machines - Primal
 • Decision function:
                            f (x) = w · x + b

 • Primal formulation:
        min P(w, b) = (1/2)||w||² + C Σ_i H1[ yi f(xi) ]

    where the first term maximizes the margin and the second minimizes
    the training error.
    Ideally H1 would count the number of errors; we approximate it with the
    hinge loss H1(z) = max(0, 1 − z).
    [Figure: plot of H1(z), zero for z ≥ 1 and increasing linearly as z
    decreases below 1.]
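Since this primal is just a regularized hinge-loss objective, it can be minimized with plain stochastic sub-gradient descent. The sketch below is an illustration under that view (the toy data, learning rate, and epoch count are assumptions; real SVM solvers use QP or SMO methods):

```python
import numpy as np

def train_linear_svm(X, y, C=1.0, lr=0.01, epochs=500):
    # stochastic sub-gradient descent on (1/2)||w||^2 + C * sum_i H1(y_i f(x_i));
    # when the hinge is active its sub-gradient contributes -C * y_i * x_i
    w, b = np.zeros(X.shape[1]), 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            if yi * (xi @ w + b) < 1:      # margin violated: hinge active
                w -= lr * (w - C * yi * xi)
                b += lr * C * yi
            else:                          # only the regularizer contributes
                w -= lr * w
    return w, b

# toy separable data, made up for illustration
X = np.array([[2.0, 2.0], [3.0, 1.0], [-1.0, -2.0], [-2.0, -1.0]])
y = np.array([1, 1, -1, -1])
w, b = train_linear_svm(X, y)
print(np.sign(X @ w + b))   # recovers the training labels
```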
20 SVMs : non-linear case
Linear classifiers are sometimes not complex enough. The SVM solution:
map the data into a richer feature space, including nonlinear features, then
construct a hyperplane in that space, so all the other equations stay the same!
Formally, preprocess the data with:

                               x → Φ(x)

and then learn the map from Φ(x) to y:

                          f (x) = w · Φ(x) + b.
21 SVMs : polynomial mapping

                             Φ : R² → R³
           (x1, x2) → (z1, z2, z3) := (x1², √2 x1x2, x2²)

[Figure: data that is not linearly separable in the (x1, x2) plane becomes
linearly separable in the (z1, z2, z3) feature space.]
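The point of this particular map is that inner products in the R³ feature space reduce to a function of inner products in R². A quick numeric check (the input vectors are illustrative values):

```python
import numpy as np

def phi(x):
    # the explicit feature map from the slide: R^2 -> R^3
    x1, x2 = x
    return np.array([x1 ** 2, np.sqrt(2) * x1 * x2, x2 ** 2])

x, z = np.array([1.0, 2.0]), np.array([3.0, -1.0])
# dot product in feature space equals the squared dot product in input space
print(phi(x) @ phi(z), (x @ z) ** 2)  # both equal 1 (up to rounding)
```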
22 SVMs : non-linear case II
For example, MNIST handwriting recognition:
60,000 training examples, 10,000 test examples, 28 × 28 pixel images.
A linear SVM has around 8.5% test error.
A polynomial SVM has around 1% test error.

[Figure: sample MNIST digit images with their labels.]
23 SVMs : full MNIST results

                         Classifier                    Test Error
                           linear                      8.4%
                     3-nearest-neighbor                2.4%
                        RBF-SVM                        1.4%
                      Tangent distance                 1.1%
                           LeNet                       1.1%
                      Boosted LeNet                    0.7%
                 Translation invariant SVM             0.56%


Choosing a good mapping Φ(·) (encoding prior knowledge + getting the right
complexity of function class) for your problem improves results.
24 SVMs : the kernel trick
Problem: the dimensionality of Φ(x) can be very large, making w hard to
represent explicitly in memory, and hard for the QP to solve.
The Representer theorem (Kimeldorf & Wahba, 1971) shows that (with
SVMs as a special case):

    w = Σ_{i=1}^m α_i Φ(x_i)

for some variables α. Instead of optimizing w directly we can thus
optimize α.
The decision rule is now:

    f(x) = Σ_{i=1}^m α_i Φ(x_i) · Φ(x) + b

We call K(x_i, x) = Φ(x_i) · Φ(x) the kernel function.
25 Support Vector Machines - kernel trick II

We can rewrite all the SVM equations we saw before using the expansion
w = Σ_{i=1}^m α_i Φ(x_i):

 • Decision function:

       f(x) = Σ_i α_i Φ(x_i) · Φ(x) + b

            = Σ_i α_i K(x_i, x) + b

 • Dual formulation:

       min P(w, b) = (1/2) ||Σ_{i=1}^m α_i Φ(x_i)||^2 + C Σ_i H1[ y_i f(x_i) ]

       (first term: maximize margin; second term: minimize training error)
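The objective above can be evaluated directly from the Gram matrix: with w = Σ_i α_i Φ(x_i), the margin term ||w||^2 equals αᵀKα where K_ij = Φ(x_i)·Φ(x_j). A minimal sketch of this evaluation (the function names are ours, not from any library):

```python
import numpy as np

def hinge(z):
    # H1(z) = max(0, 1 - z), the hinge loss
    return np.maximum(0.0, 1.0 - z)

def primal_objective(alpha, b, K, y, C):
    # P = 0.5 * ||w||^2 + C * sum_i H1(y_i f(x_i)), with
    # ||w||^2 = alpha^T K alpha and f(x_j) = sum_i alpha_i K(x_i, x_j) + b.
    f = K @ alpha + b
    return 0.5 * alpha @ K @ alpha + C * hinge(y * f).sum()
```

With an identity Gram matrix, α = (1, −1), b = 0, y = (1, −1) and C = 1, all hinge terms vanish and the objective is just (1/2)||w||^2 = 1.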
26 Support Vector Machines - Dual
But people normally write it like this:
  • Dual formulation:

        min_α D(α) = (1/2) Σ_{i,j} α_i α_j Φ(x_i)·Φ(x_j) − Σ_i y_i α_i

        s.t.   Σ_i α_i = 0,   0 ≤ y_i α_i ≤ C

  • Dual decision function:

        f(x) = Σ_i α_i K(x_i, x) + b

  • The kernel function K(·, ·) is used to make an (implicit) nonlinear feature
    map, e.g.
     – Polynomial kernel: K(x, x') = (x · x' + 1)^d.
     – RBF kernel: K(x, x') = exp(−γ||x − x'||^2).
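Both kernels above are one-liners to implement; a quick sketch (the helper names are ours):

```python
import numpy as np

def poly_kernel(x, xp, d=2):
    # Polynomial kernel K(x, x') = (x . x' + 1)^d
    return (np.dot(x, xp) + 1.0) ** d

def rbf_kernel(x, xp, gamma=0.5):
    # RBF kernel K(x, x') = exp(-gamma ||x - x'||^2)
    return np.exp(-gamma * np.sum((x - xp) ** 2))
```

Note that K(x, x) = 1 for the RBF kernel regardless of γ, since the squared distance is zero.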
27 Polynomial-SVMs

The kernel K(x, x') = (x · x')^d gives the same result as the explicit
mapping + dot product that we described before:

    Φ : R^2 → R^3,   (x1, x2) → (z1, z2, z3) := (x1^2, √2 x1 x2, x2^2)

    Φ(x1, x2) · Φ(x1', x2') = (x1^2, √2 x1 x2, x2^2) · (x1'^2, √2 x1' x2', x2'^2)
                            = x1^2 x1'^2 + 2 x1 x1' x2 x2' + x2^2 x2'^2

is the same as:

    K(x, x') = (x · x')^2 = ((x1, x2) · (x1', x2'))^2
             = (x1 x1' + x2 x2')^2 = x1^2 x1'^2 + x2^2 x2'^2 + 2 x1 x1' x2 x2'

Interestingly, even if d is large the kernel still requires only n
multiplications to compute, whereas the explicit representation may not fit in memory!
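For d = 2 this equivalence is easy to check numerically; a small sketch (the helper name is ours):

```python
import numpy as np

def phi(x):
    # Explicit feature map Phi(x1, x2) = (x1^2, sqrt(2) x1 x2, x2^2)
    x1, x2 = x
    return np.array([x1 ** 2, np.sqrt(2.0) * x1 * x2, x2 ** 2])

x  = np.array([1.0, 2.0])
xp = np.array([3.0, 0.5])

explicit = phi(x) @ phi(xp)   # dot product computed in the 3-d feature space
kernel   = (x @ xp) ** 2      # K(x, x') = (x . x')^2, computed in the 2-d input space

assert np.isclose(explicit, kernel)   # both give 16.0
```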
28 RBF-SVMs
The RBF kernel K(x, x') = exp(−γ||x − x'||^2) is one of the most
popular kernel functions. It adds a "bump" around each data point:

    f(x) = Σ_{i=1}^m α_i exp(−γ||x_i − x||^2) + b




[Figure: the map Φ sends input points x, x' to Φ(x), Φ(x') in feature space.]


Using this one can get state-of-the-art results.
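The sum-of-bumps decision function is only a few lines of NumPy; a minimal sketch (the function name is ours, and α, b are assumed to come from training):

```python
import numpy as np

def rbf_decision(x, X_train, alpha, b, gamma=1.0):
    # f(x) = sum_i alpha_i exp(-gamma ||x_i - x||^2) + b:
    # each training point x_i contributes a Gaussian "bump" weighted by alpha_i.
    bumps = np.exp(-gamma * np.sum((X_train - x) ** 2, axis=1))
    return alpha @ bumps + b
```

Near an isolated training point x_i (with γ large) the corresponding bump dominates, so f(x) ≈ α_i + b there.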
29 SVMs : more results

There is much more in the field of SVMs / kernel machines than we could
cover here, including:

 • Regression, clustering, semi-supervised learning and other domains.
 • Lots of other kernels, e.g. string kernels to handle text.

 • Lots of research in modifications, e.g. to improve generalization
   ability, or tailoring to a particular task.
 • Lots of research in speeding up training.

Please see textbooks such as the ones by Cristianini & Shawe-Taylor or
by Schoelkopf and Smola.
30 SVMs : software

Lots of SVM software:
 • LibSVM (C++)

 • SVMLight (C)
As well as complete machine learning toolboxes that include SVMs:

 • Torch (C++)
 • Spider (Matlab)

 • Weka (Java)
All available through www.kernel-machines.org.
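As an illustration, training an RBF-SVM takes only a few lines with scikit-learn, whose SVC class wraps the LibSVM library listed above (scikit-learn itself is our assumption here, and the toy data is hypothetical):

```python
from sklearn.svm import SVC

# Toy 2-class problem: points near the origin vs. points far from it.
X = [[0, 0], [1, 1], [2, 2], [3, 3]]
y = [-1, -1, 1, 1]

# RBF kernel K(x, x') = exp(-gamma ||x - x'||^2), soft-margin constant C
clf = SVC(kernel='rbf', C=1.0, gamma=0.5)
clf.fit(X, y)

print(clf.predict([[0.1, 0.1], [2.9, 2.9]]))
```

Points close to the (−1)-labeled training examples should be assigned class −1, and points close to the (+1)-labeled ones class +1.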
