Machine Learning for Data Mining
Introduction to Support Vector Machines
Andres Mendez-Vazquez
June 22, 2016
1 / 124
Outline
1 History
The Beginning
2 Separable Classes
Separable Classes
Hyperplanes
3 Support Vectors
Support Vectors
Quadratic Optimization
Lagrange Multipliers
Method
Karush-Kuhn-Tucker Conditions
Primal-Dual Problem for Lagrangian
Properties
4 Kernel
Kernel Idea
Higher Dimensional Space
Examples
Now, How to select a Kernel?
5 Soft Margins
Introduction
The Soft Margin Solution
6 More About Kernels
Basic Idea
From Inner products to Kernels
2 / 124
History
Invented by Vladimir Vapnik and Alexey Ya. Chervonenkis in 1963
At the Institute of Control Sciences, Moscow
In the paper “Estimation of dependencies based on empirical data”
Corinna Cortes and Vladimir Vapnik in 1995
They invented the current incarnation - Soft Margins
At AT&T Labs
BTW Corinna Cortes
Danish computer scientist who is known for her contributions to the field of machine learning.
She is currently the Head of Google Research, New York.
Cortes is a recipient of the Paris Kanellakis Theory and Practice Award (ACM) for her work on the theoretical foundations of support vector machines.
4 / 124
In addition
Alexey Yakovlevich Chervonenkis
He was a Soviet and Russian mathematician and, with Vladimir Vapnik, one of the main developers of the Vapnik–Chervonenkis theory, also known as the "fundamental theory of learning", an important part of computational learning theory.
He died on September 22nd, 2014
At Losiny Ostrov National Park.
5 / 124
Applications
Partial List
1 Predictive Control
Control of chaotic systems.
2 Inverse Geosounding Problem
It is used to understand the internal structure of our planet.
3 Environmental Sciences
Spatio-temporal environmental data analysis and modeling.
4 Protein Fold and Remote Homology Detection
Recognizing whether two different species contain similar genes.
5 Facial expression classification
6 Texture Classification
7 E-Learning
8 Handwriting Recognition
9 AND counting....
6 / 124
Separable Classes
Given
xi, i = 1, · · · , N
A set of samples belonging to two classes ω1, ω2.
Objective
We want to obtain a decision function as simple as
g (x) = wT x + w0
8 / 124
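To make the decision function concrete, here is a minimal Python sketch (the weights, bias and points are made-up illustrative values, not from the slides): it evaluates g(x) = wT x + w0 and classifies by the sign of g, as used on the following slides.

```python
import numpy as np

# Hypothetical weights and bias for a 2D problem (illustrative values only).
w = np.array([2.0, -1.0])
w0 = 0.5

def g(x):
    """Linear decision function g(x) = w^T x + w0."""
    return w @ x + w0

# Classify a few sample points by the sign of g(x):
# g(x) >= 0 -> class omega_1 (d = +1), g(x) < 0 -> class omega_2 (d = -1).
for x in [np.array([1.0, 1.0]), np.array([-1.0, 2.0])]:
    d = 1 if g(x) >= 0 else -1
    print(x, g(x), d)
```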
Such that we can do the following
A linear separation function g (x) = wT x + w0
9 / 124
In other words ...
We have the following samples
For x1, · · · , xm ∈ C1
For x1, · · · , xn ∈ C2
We want the following decision surfaces
wT xi + w0 ≥ 0 for di = +1 if xi ∈ C1
wT xj + w0 ≤ 0 for dj = −1 if xj ∈ C2
11 / 124
What do we want?
Our goal is to search for a direction w that gives the maximum possible margin
[Figure: two candidate directions and their margins]
12 / 124
Remember
We have the following
[Figure: the distance d from the origin to the hyperplane and the distance r of a point from its projection]
13 / 124
A Little of Geometry
Thus
[Figure: a right triangle with legs A and B, hypotenuse C, and the distances d and r]
Then
d = |w0| / √(w1^2 + w2^2) , r = |g (x)| / √(w1^2 + w2^2)   (1)
14 / 124
First d = |w0| / √(w1^2 + w2^2)
We can use the following rule in a triangle with a 90° angle
Area = (1/2) C d   (2)
In addition, the area can also be calculated as
Area = (1/2) A B   (3)
Thus
d = A B / C
Remark: Can you get the rest of the values?
15 / 124
What about r = |g(x)| / √(w1^2 + w2^2)?
First, remember
g (xp) = 0 and x = xp + r (w / ||w||)   (4)
Thus, we have
g (x) = wT (xp + r (w / ||w||)) + w0
      = wT xp + w0 + r (wT w / ||w||)
      = wT xp + w0 + r (||w||^2 / ||w||)
      = g (xp) + r ||w||
Then
r = g(x) / ||w||
16 / 124
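A small numerical check of the two distances above, a sketch with the same made-up w and w0 as before: d is the distance from the origin to the hyperplane and r is the signed distance of a point x from it.

```python
import numpy as np

w = np.array([2.0, -1.0])   # hypothetical weights w = (w1, w2)
w0 = 0.5                    # hypothetical bias

def g(x):
    return w @ x + w0

# Distance from the origin to the hyperplane: d = |w0| / sqrt(w1^2 + w2^2).
d = abs(w0) / np.linalg.norm(w)

# Signed distance of an arbitrary point x: r = g(x) / ||w||.
x = np.array([1.0, 3.0])
r = g(x) / np.linalg.norm(w)

# Sanity check: projecting x back onto the hyperplane gives g(x_p) ~ 0.
x_p = x - r * w / np.linalg.norm(w)
print(d, r, g(x_p))   # g(x_p) should be numerically ~ 0
```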
This has the following interpretation
[Figure: the distance r of x from its projection xp onto the hyperplane]
17 / 124
Now
We know that the straight line that we are looking for looks like
wT x + w0 = 0   (5)
What about something like this
wT x + w0 = δ   (6)
Clearly
This will be above or below the initial line wT x + w0 = 0.
18 / 124
Come back to the hyperplanes
We have then, for each border support line, a specific bias!!!
[Figure: the two margin hyperplanes and their support vectors]
19 / 124
Then, normalize by δ
The new margin functions
w'T x + w10 = 1
w'T x + w01 = −1
where w' = w / δ, w10 = w0 / δ, and w01 = w0 / δ
Now, we come back to the middle separator hyperplane, but with the normalized term
wT xi + w0 ≥ w'T x + w10 for di = +1
wT xi + w0 ≤ w'T x + w01 for di = −1
Where w0 is the bias of that central hyperplane!! And w' is the normalized version of w
20 / 124
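A quick numeric illustration of this normalization (δ, w and w0 are made-up values): dividing by δ maps the margin hyperplane wT x + w0 = δ onto w'T x + w0/δ = 1.

```python
import numpy as np

w = np.array([2.0, -1.0])    # hypothetical direction
w0 = 0.5                     # hypothetical bias
delta = 2.0                  # hypothetical margin offset

w_prime = w / delta          # normalized direction  w' = w / delta
w0_prime = w0 / delta        # normalized bias       w0' = w0 / delta

# x lies on the upper margin hyperplane w^T x + w0 = delta  (2*1 - 0.5 + 0.5 = 2).
x = np.array([1.0, 0.5])

print(w @ x + w0)            # equals delta -> 2.0
print(w_prime @ x + w0_prime)  # after normalization -> 1.0
```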
Come back to the hyperplanes
The meaning of what I am saying!!!
21 / 124
A little about Support Vectors
They are the vectors (Here, we assume the normalized w)
xi such that wT xi + w0 = 1 or wT xi + w0 = −1
Properties
They are the vectors nearest to the decision surface and the most difficult to classify.
Because of that, we have the name “Support Vector Machines”.
23 / 124
Now, we can summarize the decision rule for the hyperplane
For the support vectors
g (xi) = wT xi + w0 = −(+)1 for di = −(+)1   (7)
Implies
The distance to the support vectors is:
r = g (xi) / ||w|| =
1/||w|| if di = +1
−1/||w|| if di = −1
24 / 124
Therefore ...
We want the optimum value of the margin of separation as
ρ = 1/||w|| + 1/||w|| = 2/||w||   (8)
And the support vectors define the value of ρ
25 / 124
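A hedged sketch with scikit-learn (assuming it is installed; the toy data are made up): fitting a linear SVM with a very large C to approximate the hard-margin case, then reading off the margin ρ = 2/||w|| and the support vectors.

```python
import numpy as np
from sklearn.svm import SVC

# Tiny linearly separable toy set (illustrative values only).
X = np.array([[2.0, 2.0], [3.0, 3.0], [2.5, 3.5],
              [-2.0, -2.0], [-3.0, -3.0], [-2.5, -3.5]])
d = np.array([1, 1, 1, -1, -1, -1])

# A very large C approximates the hard-margin formulation.
clf = SVC(kernel="linear", C=1e6).fit(X, d)

w = clf.coef_.ravel()                 # learned weight vector
rho = 2.0 / np.linalg.norm(w)         # margin of separation rho = 2 / ||w||
print("margin:", rho)
print("support vectors:\n", clf.support_vectors_)
```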
Quadratic Optimization
Then, we have the samples with labels
T = {(xi, di)}_{i=1}^{N}
Then we can put the decision rule as
di (wT xi + w0) ≥ 1, i = 1, · · · , N
27 / 124
Then, we have the optimization problem
The optimization problem
min_w Φ (w) = (1/2) wT w
s.t. di(wT xi + w0) ≥ 1, i = 1, · · · , N
Observations
The cost function Φ (w) is convex.
The constraints are linear with respect to w.
28 / 124
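Since this primal is a small convex QP, it can be handed to a generic constrained solver. A sketch with scipy's SLSQP (the toy data and the variable packing z = (w1, w2, w0) are my own illustrative choices, not part of the slides):

```python
import numpy as np
from scipy.optimize import minimize

X = np.array([[2.0, 2.0], [2.0, 3.0], [-2.0, -2.0], [-3.0, -2.0]])
d = np.array([1.0, 1.0, -1.0, -1.0])

# Pack the unknowns as z = (w1, w2, w0).
def objective(z):
    w = z[:2]
    return 0.5 * w @ w                      # Phi(w) = 1/2 w^T w

def constraint(z, i):
    w, w0 = z[:2], z[2]
    return d[i] * (w @ X[i] + w0) - 1.0     # d_i (w^T x_i + w0) - 1 >= 0

cons = [{"type": "ineq", "fun": constraint, "args": (i,)} for i in range(len(d))]
res = minimize(objective, x0=np.zeros(3), method="SLSQP", constraints=cons)
w, w0 = res.x[:2], res.x[2]
print("w =", w, "w0 =", w0)
```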
Lagrange Multipliers
The method of Lagrange multipliers
Gives a set of necessary conditions to identify optimal points of equality
constrained optimization problems.
This is done by converting a constrained problem to an equivalent
unconstrained problem with the help of certain unspecified parameters
known as Lagrange multipliers.
30 / 124
Lagrange Multipliers
The classical problem formulation
min f (x1, x2, ..., xn)
s.t. h1 (x1, x2, ..., xn) = 0
It can be converted into
min L (x1, x2, ..., xn, λ) = min {f (x1, x2, ..., xn) − λ h1 (x1, x2, ..., xn)}   (9)
where
L(x, λ) is the Lagrangian function.
λ is an unspecified positive or negative constant called the Lagrange Multiplier.
31 / 124
Finding an Optimum using Lagrange Multipliers
New problem
min L (x1, x2, ..., xn, λ) = min {f (x1, x2, ..., xn) − λ h1 (x1, x2, ..., xn)}
We want a λ = λ* optimal
If the minimum of L (x1, x2, ..., xn, λ*) occurs at (x1, x2, ..., xn)^T = (x1, x2, ..., xn)^{T*} and (x1, x2, ..., xn)^{T*} satisfies h1 (x1, x2, ..., xn) = 0, then (x1, x2, ..., xn)^{T*} minimizes:
min f (x1, x2, ..., xn)
s.t. h1 (x1, x2, ..., xn) = 0
Trick
It is to find the appropriate value for the Lagrange multiplier λ.
32 / 124
Remember
Think about this
Remember First Law of Newton!!!
Yes!!!
A system in equilibrium does not move
Static Body
33 / 124
Lagrange Multipliers
Definition
Gives a set of necessary conditions to identify optimal points of equality constrained optimization problems.
34 / 124
Lagrange was a Physicist
He was thinking of the following formula
A system in equilibrium has the following equation:
F1 + F2 + ... + FK = 0   (10)
But functions do not have forces, do they?
Are you sure?
Think about the following
The Gradient of a surface.
35 / 124
Gradient to a Surface
After all, a gradient is a measure of the maximal change
For example, the gradient of a function of three variables:
∇f (x) = i ∂f (x)/∂x + j ∂f (x)/∂y + k ∂f (x)/∂z   (11)
where i, j and k are unitary vectors in the directions x, y and z.
36 / 124
Example
We have f (x, y) = x exp{−x^2 − y^2}
[Figure: surface plot of f]
37 / 124
Example
With the gradient at the contours when projecting onto the 2D plane
[Figure: contour plot of f with gradient arrows]
38 / 124
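For the same example function, a short sympy sketch (assuming sympy is available) that computes the gradient symbolically:

```python
import sympy as sp

x, y = sp.symbols("x y")
f = x * sp.exp(-x**2 - y**2)

# Gradient: the vector of partial derivatives (the i and j components of grad f).
grad_f = [sp.diff(f, x), sp.diff(f, y)]
print(sp.simplify(grad_f[0]))   # d/dx: (1 - 2x^2) exp(-x^2 - y^2)
print(sp.simplify(grad_f[1]))   # d/dy: -2 x y exp(-x^2 - y^2)
```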
Now, Think about this
Yes, we can use the gradient
However, we need to do some scaling of the forces by using parameters λ
Thus, we have
F0 + λ1 F1 + ... + λK FK = 0   (12)
where F0 is the gradient of the principal cost function and Fi, for i = 1, 2, ..., K, are the gradients of the constraints.
39 / 124
Thus
If we have the following optimization:
min f (x)
s.t. g1 (x) = 0
g2 (x) = 0
40 / 124
Geometric interpretation in the case of minimization
What is wrong? The gradients are going in the other direction; we can fix this by simply multiplying by −1
Here the cost function is f (x, y) = x exp{−x^2 − y^2} and we want to minimize it
[Figure: contours of f (x) with the constraint curves g1 (x) and g2 (x)]
−∇f (x) + λ1 ∇g1 (x) + λ2 ∇g2 (x) = 0
Nevertheless: it is equivalent to ∇f (x) − λ1 ∇g1 (x) − λ2 ∇g2 (x) = 0
40 / 124
Method
Steps
1 The original problem is rewritten as: minimize L (x, λ) = f (x) − λ h1 (x)
2 Take derivatives of L (x, λ) with respect to xi and set them equal to zero.
3 Express all xi in terms of the Lagrange multiplier λ.
4 Plug x in terms of λ into the constraint h1 (x) = 0 and solve for λ.
5 Calculate x by using the just found value for λ.
From step 2
If there are n variables (i.e., x1, · · · , xn) then you will get n equations with n + 1 unknowns (i.e., n variables xi and one Lagrange multiplier λ).
42 / 124
Example
We can apply that to the following problem
min f (x, y) = x^2 − 8x + y^2 − 12y + 48
s.t. x + y = 8
43 / 124
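Following the steps above, a sympy sketch that solves this example: the stationarity equations of the Lagrangian and the constraint are solved jointly for x, y and λ.

```python
import sympy as sp

x, y, lam = sp.symbols("x y lambda")
f = x**2 - 8*x + y**2 - 12*y + 48
h = x + y - 8                       # constraint written as h(x, y) = 0

L = f - lam * h                     # Lagrangian L = f - lambda * h

# Stationarity w.r.t. x and y, plus the original constraint.
sol = sp.solve([sp.diff(L, x), sp.diff(L, y), h], [x, y, lam], dict=True)
print(sol)                          # x = 3, y = 5, lambda = -2
print(f.subs(sol[0]))               # objective value at the optimum: -2
```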
Then, Rewriting The Optimization Problem
The optimization with inequality constraints
min_w Φ (w) = (1/2) wT w
s.t. di(wT xi + w0) ≥ 1, i = 1, · · · , N
44 / 124
Then, for our problem
Using the Lagrange Multipliers (We will call them αi)
We obtain the following cost function that we want to minimize
J(w, w0, α) = (1/2) wT w − Σ_{i=1}^{N} αi [di(wT xi + w0) − 1]
Observation
Minimize with respect to w and w0.
Maximize with respect to α because it dominates
−Σ_{i=1}^{N} αi [di(wT xi + w0) − 1].   (13)
45 / 124
Saddle Point?
At the left the original problem, at the right the Lagrangian!!!
[Figure: contours of f (x) with constraints g1 (x) and g2 (x), next to the saddle-shaped Lagrangian surface]
46 / 124
Karush-Kuhn-Tucker Conditions
First An Inequality Constrained Problem P
min f (x)
s.t. g1 (x) = 0
...
gN (x) = 0
A really minimal version!!! Hey, it is a patch work!!!
A point x is a local minimum of an equality constrained problem P only if a set of non-negative αj’s may be found such that:
∇L (x, α) = ∇f (x) − Σ_{i=1}^{N} αi ∇gi (x) = 0
48 / 124
Karush-Kuhn-Tucker Conditions
Important
Think about this: each constraint corresponds to a sample in one of the two classes, thus
The corresponding αi’s are going to be zero after optimization if a constraint is not active, i.e. di(wT xi + w0) − 1 > 0 (Remember the Maximization).
Again the Support Vectors
This actually defines the idea of support vectors!!!
Thus
Only the αi’s with active constraints (Support Vectors) will be different from zero, i.e. when di(wT xi + w0) − 1 = 0.
49 / 124
A small deviation from the SVM’s for the sake of Vox Populi
Theorem (Karush-Kuhn-Tucker Necessary Conditions)
Let X be a non-empty open set in Rn, and let f : Rn → R and gi : Rn → R for i = 1, ..., m. Consider the problem P to minimize f (x) subject to x ∈ X and gi (x) ≤ 0, i = 1, ..., m. Let x be a feasible solution, and denote I = {i | gi (x) = 0}. Suppose that f and gi for i ∈ I are differentiable at x and that gi for i ∉ I are continuous at x. Furthermore, suppose that ∇gi (x) for i ∈ I are linearly independent. If x solves problem P locally, there exist scalars ui for i ∈ I such that
∇f (x) + Σ_{i∈I} ui ∇gi (x) = 0
ui ≥ 0 for i ∈ I
50 / 124
It is more...
In addition to the above assumptions
If gi for each i ∉ I is also differentiable at x, the previous conditions can be written in the following equivalent form:
∇f (x) + Σ_{i=1}^{m} ui ∇gi (x) = 0
ui gi (x) = 0 for i = 1, ..., m
ui ≥ 0 for i = 1, ..., m
51 / 124
The necessary conditions for optimality
We use the previous theorem on
J (w, w0, α) = (1/2) wT w − Σ_{i=1}^{N} αi [di(wT xi + w0) − 1]   (14)
Condition 1
∂J (w, w0, α) / ∂w = 0
Condition 2
∂J (w, w0, α) / ∂w0 = 0
52 / 124
Using the conditions
We have the first condition
∂J(w, w0, α)/∂w = ∂((1/2) wT w)/∂w − ∂(Σ_{i=1}^{N} αi [di(wT xi + w0) − 1])/∂w = 0
∂J(w, w0, α)/∂w = (1/2)(w + w) − Σ_{i=1}^{N} αi di xi
Thus
w = Σ_{i=1}^{N} αi di xi   (15)
53 / 124
In a similar way ...
We have by the second optimality condition
Σ_{i=1}^{N} αi di = 0
Note
αi [di(wT xi + w0) − 1] = 0
Because the constraint vanishes at the optimal solution, i.e. αi = 0 or di(wT xi + w0) − 1 = 0.
54 / 124
Thus
We need something extra
Our classic trick of transforming a problem into another problem
In this case
We use the Primal-Dual Problem for Lagrangian
Where
We move from a minimization to a maximization!!!
55 / 124
Lagrangian Dual Problem
Consider the following nonlinear programming problem
Primal Problem P
min f (x)
s.t. gi (x) ≤ 0 for i = 1, ..., m
hi (x) = 0 for i = 1, ..., l
x ∈ X
Lagrange Dual Problem D
max Θ (u, v)
s.t. u ≥ 0
where Θ (u, v) = inf_x { f (x) + Σ_{i=1}^{m} ui gi (x) + Σ_{i=1}^{l} vi hi (x) | x ∈ X }
57 / 124
What does this mean?
Assume that the equality constraint does not exist
We have then
min f (x)
s.t gi (x) ≤ 0 for i = 1, ..., m
x ∈ X
Now assume that we finish with only one constraint
We have then
min f (x)
s.t g (x) ≤ 0
x ∈ X
58 / 124
What does this mean?
First, we have the following figure
[Figure: the image set G in the (y, z) plane, a supporting line of slope −u, and the points A and B]
59 / 124
What does this mean?
Thus, in the y − z plane you have
G = {(y, z) | y = g (x) , z = f (x) for some x ∈ X}   (16)
Thus
Given u ≥ 0, we need to minimize f (x) + u g(x) over x ∈ X to find θ (u) - equivalent to ∇f (x) + u ∇g(x) = 0
60 / 124
What does this mean?
Thus, in the y − z plane, we have
z + uy = α   (17)
a line with slope −u.
Then, to minimize z + uy = α
We need to move the line z + uy = α parallel to itself as far down as possible, along its negative gradient, while remaining in contact with G.
61 / 124
In other words
Move the line parallel to itself until it supports G
[Figure: the set G with the supporting line of slope −u]
Note: The set G lies above the line and touches it.
62 / 124
Thus
Thus
Then, the problem is to find the slope of the supporting hyperplane for
G.
Then its intersection with the z-axis
Gives θ(u)
63 / 124
Again
We can see the θ
[Figure: θ(u) as the z-intercept of the supporting line of G with slope −u]
64 / 124
Thus
The dual problem is equivalent
Finding the slope of the supporting hyperplane such that its intercept on
the z-axis is maximal
65 / 124
Or
Such a hyperplane has slope −u and supports G at (y, z)
[Figure: the optimal supporting line of G]
Remark: The optimal solution is u and the optimal dual objective is z.
66 / 124
For more on this Please!!!
Look at this book
From “Nonlinear Programming: Theory and Algorithms” by Mokhtar
S. Bazaraa, and C. M. Shetty. Wiley, New York, (2006)
At Page 260.
67 / 124
Example (Lagrange Dual)
Primal
min x1^2 + x2^2
s.t. −x1 − x2 + 4 ≤ 0
x1, x2 ≥ 0
Lagrange Dual
Θ(u) = inf {x1^2 + x2^2 + u(−x1 − x2 + 4) | x1, x2 ≥ 0}
68 / 124
Solution
Differentiate with respect to x1 and x2
We have two cases to take into account: u ≥ 0 and u < 0
The first case is clear
What about when u < 0?
We have that
θ (u) =
−(1/2) u^2 + 4u if u ≥ 0
4u if u < 0   (18)
69 / 124
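A quick numerical check of θ(u), a sketch assuming scipy is available: for a fixed u, the infimum over x1, x2 ≥ 0 is computed directly and compared with the closed form above.

```python
import numpy as np
from scipy.optimize import minimize

def theta_numeric(u):
    # inf over x1, x2 >= 0 of x1^2 + x2^2 + u * (-x1 - x2 + 4)
    obj = lambda x: x[0]**2 + x[1]**2 + u * (-x[0] - x[1] + 4)
    res = minimize(obj, x0=[1.0, 1.0], bounds=[(0, None), (0, None)])
    return res.fun

def theta_closed(u):
    return -0.5 * u**2 + 4*u if u >= 0 else 4*u

for u in [-1.0, 0.0, 1.0, 2.0, 4.0]:
    print(u, theta_numeric(u), theta_closed(u))   # the two values should agree
```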
Duality Theorem
First Property
If the Primal has an optimal solution, the Dual has one too.
Thus
For w* and α* to be optimal solutions of the primal and dual problems respectively, it is necessary and sufficient that w*:
Is feasible for the primal problem, and
Φ(w*) = J (w*, w0*, α*) = min_w J (w, w0*, α*)
71 / 124
Reformulate our Equations
We have then
J (w, w0, α) = (1/2) wT w − Σ_{i=1}^{N} αi di wT xi − w0 Σ_{i=1}^{N} αi di + Σ_{i=1}^{N} αi
Now, using our 2nd optimality condition (Σ_{i=1}^{N} αi di = 0)
J (w, w0, α) = (1/2) wT w − Σ_{i=1}^{N} αi di wT xi + Σ_{i=1}^{N} αi
72 / 124
We have finally for the 1st Optimality Condition:
First
wT w = Σ_{i=1}^{N} αi di wT xi = Σ_{i=1}^{N} Σ_{j=1}^{N} αi αj di dj xjT xi
Second, setting J (w, w0, α) = Q (α)
Q (α) = Σ_{i=1}^{N} αi − (1/2) Σ_{i=1}^{N} Σ_{j=1}^{N} αi αj di dj xjT xi
73 / 124
From here, we have the problem
This is the problem that we really solve
Given the training sample {(xi, di)}_{i=1}^{N}, find the Lagrange multipliers {αi}_{i=1}^{N} that maximize the objective function
Q(α) = Σ_{i=1}^{N} αi − (1/2) Σ_{i=1}^{N} Σ_{j=1}^{N} αi αj di dj xjT xi
subject to the constraints
Σ_{i=1}^{N} αi di = 0   (19)
αi ≥ 0 for i = 1, · · · , N   (20)
Note
In the Primal, we were trying to minimize the cost function; for this it is necessary to maximize over α. That is the reason why we are maximizing Q (α).
74 / 124
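One way to solve this dual in practice is to hand it to a generic QP solver. A sketch using cvxopt (assuming it is installed; the toy data are made up). cvxopt minimizes (1/2) αT P α + qT α, so the signs are flipped relative to maximizing Q(α).

```python
import numpy as np
from cvxopt import matrix, solvers

# Toy linearly separable data (illustrative values only).
X = np.array([[2.0, 2.0], [2.0, 3.0], [-2.0, -2.0], [-3.0, -2.0]])
d = np.array([1.0, 1.0, -1.0, -1.0])
N = len(d)

# Maximizing Q(a) = sum a_i - 1/2 sum_ij a_i a_j d_i d_j x_i^T x_j is the same as
# minimizing 1/2 a^T P a + q^T a with P_ij = d_i d_j x_i^T x_j and q = -1.
P = matrix(np.outer(d, d) * (X @ X.T) + 1e-8 * np.eye(N))  # tiny ridge for stability
q = matrix(-np.ones(N))
G = matrix(-np.eye(N))          # -a_i <= 0, i.e. a_i >= 0
h = matrix(np.zeros(N))
A = matrix(d.reshape(1, -1))    # sum a_i d_i = 0
b = matrix(0.0)

sol = solvers.qp(P, q, G, h, A, b)
alpha = np.array(sol["x"]).ravel()
print("alpha =", alpha)
```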
Solving for α
We can compute w* once we get the optimal αi* by using (Eq. 15)
w* = Σ_{i=1}^{N} αi* di xi
In addition, we can compute the optimal bias w0* using the optimal weight w*
For this, we use the positive margin equation:
g (x^(s)) = wT x^(s) + w0 = 1
corresponding to a positive support vector.
Then
w0 = 1 − (w*)T x^(s) for d^(s) = 1   (21)
75 / 124
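Continuing the cvxopt sketch from the dual-problem slide above (alpha, d and X as defined there; still toy values), the optimal weight vector and bias can be recovered as follows:

```python
import numpy as np

# Continuation of the cvxopt sketch above: alpha, d, X come from that snippet.
w_star = (alpha * d) @ X          # w* = sum_i alpha_i* d_i x_i

# Pick a positive support vector (alpha_i clearly nonzero, d_i = +1) and apply
# the margin equation w^T x^(s) + w0 = 1 to recover the bias.
sv = np.where((alpha > 1e-6) & (d > 0))[0][0]
w0_star = 1.0 - w_star @ X[sv]
print("w* =", w_star, "w0* =", w0_star)
```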
What do we need?
Until now, we have only a maximal margin algorithm
All this works fine when the classes are separable.
Problem: What happens when they are not separable?
What can we do?
77 / 124
Map to a higher Dimensional Space
Assume that there exists a mapping
x ∈ R^l → y ∈ R^k
Then, it is possible to define the following mapping
79 / 124
Define a map to a higher Dimension
Nonlinear transformations
Given a series of nonlinear transformations
{φi (x)}_{i=1}^{m}
from the input space to the feature space.
We can define the decision surface as
Σ_{i=1}^{m} wi φi (x) + w0 = 0
80 / 124
This allows us to define
The following vector
φ (x) = (φ0 (x) , φ1 (x) , · · · , φm (x))^T
that represents the mapping.
From this mapping
We can define the following kernel function
K : X × X → R
K (xi, xj) = φ (xi)T φ (xj)
81 / 124
Example
Assume
x ∈ R^2 → y = (x1^2, √2 x1 x2, x2^2)^T
We can show that
yiT yj = (xiT xj)^2
83 / 124
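A numeric check of this identity (made-up vectors): the explicit feature map gives the same inner products as the squared dot product in the input space.

```python
import numpy as np

def phi(x):
    # Explicit feature map for the quadratic kernel (2D input -> 3D feature space).
    return np.array([x[0]**2, np.sqrt(2) * x[0] * x[1], x[1]**2])

xi = np.array([1.0, 2.0])   # illustrative points
xj = np.array([3.0, -1.0])

lhs = phi(xi) @ phi(xj)     # y_i^T y_j
rhs = (xi @ xj) ** 2        # (x_i^T x_j)^2
print(lhs, rhs)             # both equal 1.0
```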
Example of Kernels
Polynomials
k (x, z) = (xT z + 1)^q, q > 0
Radial Basis Functions
k (x, z) = exp(−||x − z||^2 / σ^2)
Hyperbolic Tangents
k (x, z) = tanh(β xT z + γ)
84 / 124
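These three kernels are easy to write down directly; a small numpy sketch (the parameter values are arbitrary defaults, not recommendations):

```python
import numpy as np

def poly_kernel(x, z, q=2):
    return (x @ z + 1.0) ** q

def rbf_kernel(x, z, sigma=1.0):
    return np.exp(-np.sum((x - z) ** 2) / sigma**2)

def tanh_kernel(x, z, beta=1.0, gamma=0.0):
    return np.tanh(beta * (x @ z) + gamma)

x = np.array([1.0, 2.0])
z = np.array([0.5, -1.0])
print(poly_kernel(x, z), rbf_kernel(x, z), tanh_kernel(x, z))
```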
Now, How to select a Kernel?
We have a problem
Selecting a specific kernel and its parameters is usually done in a try-and-see manner.
Thus
In general, the Radial Basis Function kernel is a reasonable first choice.
Then
If this fails, we can try the other possible kernels.
86 / 124
Thus, we have something like this
Step 1
Normalize the data.
Step 2
Use cross-validation to adjust the parameters of the selected kernel.
Step 3
Train against the entire dataset.
87 / 124
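One common way to carry out the three steps above with scikit-learn (a sketch, assuming scikit-learn is available; the grid values and toy data are arbitrary):

```python
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV

# Made-up toy data; replace with the real training set.
rng = np.random.RandomState(0)
X = rng.randn(40, 2)
d = (X[:, 0] + X[:, 1] > 0).astype(int)

# Step 1: normalize; Step 2: cross-validate the kernel parameters; Step 3: refit on all data.
pipe = Pipeline([("scale", StandardScaler()), ("svc", SVC(kernel="rbf"))])
grid = {"svc__C": [0.1, 1, 10, 100], "svc__gamma": [0.01, 0.1, 1]}
search = GridSearchCV(pipe, grid, cv=5).fit(X, d)   # refit=True retrains on the full set
print(search.best_params_)
```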
Optimal Hyperplane for non-separable patterns
Important
We have been considering only problems where the classes are linearly separable.
Now
What happens when the patterns are not separable?
Thus, we can still build a separating hyperplane
But errors will happen in the classification... We need to minimize them...
89 / 124
What if the following happens
Some data points invade the “margin” space
[Figure: a data point violating the margin of the optimal hyperplane]
90 / 124
Fixing the Problem - Corinna’s Style
The margin of separation between classes is said to be soft if a data point (xi, di) violates the following condition
di (wT xi + b) ≥ +1, i = 1, 2, ..., N   (22)
This violation can arise in one of two ways
The data point (xi, di) falls inside the region of separation but on the right side of the decision surface - still a correct classification.
91 / 124
We have then
Example
[Figure: a data point inside the margin but on the correct side of the optimal hyperplane]
92 / 124
Or...
This violation can arise in one of two ways
The data point (xi, di) falls on the wrong side of the decision surface - incorrect classification.
Example
[Figure: a data point on the wrong side of the optimal hyperplane]
93 / 124
Solving the problem
What to do?
We introduce a set of nonnegative scalar values {ξi}_{i=1}^{N}.
Introduce this into the decision rule
di (wT xi + b) ≥ 1 − ξi, i = 1, 2, ..., N   (23)
94 / 124
The ξi are called slack variables
What?
In 1995, Corinna Cortes and Vladimir N. Vapnik suggested a modified
maximum margin idea that allows for mislabeled examples.
Ok!!!
Instead of expecting to have constant margin for all the samples, the
margin can change depending of the sample.
What do we have?
ξi measures the deviation of a data point from the ideal condition of
pattern separability.
95 / 124
Properties of ξi
What if?
You have 0 ≤ ξi ≤ 1
We have
[Figure: the data point falls inside the margin but still on the correct side of the decision surface]
96 / 124
Properties of ξi
What if?
You have ξi > 1
We have
[Figure: the data point falls on the wrong side of the decision surface - a misclassification]
97 / 124
Support Vectors
We want
Support vectors that satisfy equation (Eq. 23) even when ξi > 0
di (wT xi + b) ≥ 1 − ξi, i = 1, 2, ..., N
98 / 124
We want the following
We want to find a hyperplane
Such that the average misclassification error over all the samples
(1/N) Σ_{i=1}^{N} ei^2   (24)
is minimized.
99 / 124
First Attempt Into Minimization
We can try the following
Given
I (x) = 0 if x ≤ 0, 1 if x > 0   (25)
Minimize the following
Φ (ξ) = Σ_{i=1}^{N} I (ξi − 1)   (26)
with respect to the weight vector w subject to
1 di (wT xi + b) ≥ 1 − ξi, i = 1, 2, ..., N
2 ||w||^2 ≤ C for a given C.
100 / 124
Problem
Using this first attempt
Minimization of Φ(ξ) with respect to w is a non-convex optimization
problem that is NP-complete.
Thus, we need to use an approximation, maybe
Φ(ξ) = Σ_{i=1}^{N} ξ_i (27)
Now, we simplify the computations by incorporating the vector w into the
objective
Φ(w, ξ) = (1/2) w^T w + C Σ_{i=1}^{N} ξ_i (28)
101 / 124
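To see why (Eq. 27) is a workable stand-in for (Eq. 26), the short sketch below (the slack values are made up for illustration) compares the non-convex count of points with ξi > 1 against the convex sum of slacks. Since ξ_i ≥ I(ξ_i − 1) for every nonnegative ξ_i, the sum never undercounts the errors, which is the upper-bound property noted two slides ahead:

```python
import numpy as np

# Hypothetical slack values for five training points
xi = np.array([0.0, 0.3, 0.9, 1.4, 2.5])

# Eq. 26: non-convex count of points with xi_i > 1 (the misclassified ones)
count_errors = int(np.sum(xi > 1.0))

# Eq. 27: convex surrogate, the plain sum of slacks
sum_slacks = float(np.sum(xi))

print(count_errors, sum_slacks)   # 2 5.1
```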
Important
First
Minimizing the first term in (Eq. 28) is related to minimizing the
Vapnik–Chervonenkis dimension.
Which is a measure of the capacity (complexity, expressive power,
richness, or flexibility) of a statistical classification algorithm.
Second
The second term Σ_{i=1}^{N} ξ_i is an upper bound on the number of test errors.
102 / 124
Some problems for the Parameter C
Little Problem
The parameter C has to be selected by the user.
This can be done in two ways
1 The parameter C is determined experimentally via the standard use of a
training/(validation) test set (see the sketch below).
2 It is determined analytically by estimating the Vapnik–Chervonenkis
dimension.
103 / 124
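As an illustration of option 1 (not part of the original slides), a validation-based search over C could look like the following sketch; it assumes scikit-learn is available, and the data X and labels d are purely hypothetical:

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV

# Hypothetical, roughly linearly separable toy data
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
d = np.sign(X[:, 0] + X[:, 1] + 0.1)

# Pick C by cross-validated accuracy on held-out folds
search = GridSearchCV(SVC(kernel="linear"),
                      param_grid={"C": [0.01, 0.1, 1.0, 10.0, 100.0]},
                      cv=5)
search.fit(X, d)
print(search.best_params_)   # the value of C that generalizes best in validation
```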
Primal Problem
Problem, given samples {(xi, di)}_{i=1}^{N}
min_{w,ξ} Φ(w, ξ) = min_{w,ξ} [ (1/2) w^T w + C Σ_{i=1}^{N} ξ_i ]
s.t. d_i (w^T x_i + w_0) ≥ 1 − ξ_i for i = 1, · · · , N
     ξ_i ≥ 0 for all i
With C a user-specified positive parameter.
104 / 124
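As a sanity check on the primal (not part of the original slides), the sketch below evaluates Φ(w, ξ) for a candidate (w, w_0), exploiting the fact that, once the hyperplane is fixed, the smallest feasible slack is ξ_i = max(0, 1 − d_i(w^T x_i + w_0)); all names and values are hypothetical:

```python
import numpy as np

def primal_objective(w, w0, X, d, C):
    """Soft-margin primal (1/2) w^T w + C sum_i xi_i, with each xi_i taken as
    the smallest slack satisfying d_i (w^T x_i + w0) >= 1 - xi_i and xi_i >= 0."""
    xi = np.maximum(0.0, 1.0 - d * (X @ w + w0))
    return 0.5 * (w @ w) + C * np.sum(xi)

# Hypothetical usage
X = np.array([[2.0, 2.0], [0.5, 0.5], [-1.0, -1.5]])
d = np.array([1.0, 1.0, -1.0])
print(primal_objective(np.array([1.0, 1.0]), -1.0, X, d, C=1.0))   # 2.0
```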
Outline
1 History
The Beginning
2 Separable Classes
Separable Classes
Hyperplanes
3 Support Vectors
Support Vectors
Quadratic Optimization
Lagrange Multipliers
Method
Karush-Kuhn-Tucker Conditions
Primal-Dual Problem for Lagrangian
Properties
4 Kernel
Kernel Idea
Higher Dimensional Space
Examples
Now, How to select a Kernel?
5 Soft Margins
Introduction
The Soft Margin Solution
6 More About Kernels
Basic Idea
From Inner products to Kernels 105 / 124
Final Setup
Using Lagrange multipliers and the primal-dual method, it is possible to
obtain the following setup
Given the training sample {(xi, di)}_{i=1}^{N}, find the Lagrange multipliers
{αi}_{i=1}^{N} that maximize the objective function
max_α Q(α) = max_α [ Σ_{i=1}^{N} α_i − (1/2) Σ_{i=1}^{N} Σ_{j=1}^{N} α_i α_j d_i d_j x_j^T x_i ]
subject to the constraints
Σ_{i=1}^{N} α_i d_i = 0 (29)
0 ≤ α_i ≤ C for i = 1, · · · , N (30)
where C is a user-specified positive parameter.
106 / 124
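A minimal sketch of solving this dual numerically (not part of the slides), assuming the cvxopt QP solver is available; X is a hypothetical N × m matrix of inputs, d the labels in {−1, +1}, and C the soft-margin parameter. The QP is written in cvxopt's minimization form by negating Q(α):

```python
import numpy as np
from cvxopt import matrix, solvers

def solve_soft_margin_dual(X, d, C):
    """Minimize (1/2) a^T H a - 1^T a  (i.e., -Q(alpha))
    s.t. 0 <= a_i <= C and d^T a = 0, where H_ij = d_i d_j x_i^T x_j."""
    X = np.asarray(X, dtype=float)
    d = np.asarray(d, dtype=float)
    N = X.shape[0]
    Xd = d[:, None] * X
    H = Xd @ Xd.T
    P = matrix(H)
    q = matrix(-np.ones(N))
    G = matrix(np.vstack([-np.eye(N), np.eye(N)]))          # -a <= 0 and a <= C
    h = matrix(np.hstack([np.zeros(N), C * np.ones(N)]))
    A = matrix(d.reshape(1, -1))                            # equality: d^T a = 0
    b = matrix(0.0)
    sol = solvers.qp(P, q, G, h, A, b)
    return np.ravel(sol["x"])                               # the multipliers alpha_i
```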
Remarks
Something Notable
Note that neither the slack variables nor their Lagrange multipliers
appear in the dual problem.
The dual problem for the case of non-separable patterns is thus
similar to that for the simple case of linearly separable patterns.
The only big difference
Instead of using the constraint αi ≥ 0, the new problem uses the more
stringent constraint 0 ≤ αi ≤ C.
Note the following
ξi = 0 if αi < C (31)
107 / 124
Finally
The optimal solution for the weight vector w∗
w∗ = Σ_{i=1}^{N_s} α_i^∗ d_i x_i
Where Ns is the number of support vectors.
In addition
The determination of the optimum value of w_0 proceeds in a manner similar
to that described before.
The KKT conditions are as follows
α_i [d_i (w^T x_i + w_0) − 1 + ξ_i] = 0 for i = 1, 2, ..., N.
µ_i ξ_i = 0 for i = 1, 2, ..., N.
108 / 124
Where...
The µi are Lagrange multipliers
They are used to enforce the non-negativity of the slack variables ξi for all
i.
Something Notable
At the saddle point, the derivative of the Lagrangian function for the primal
problem
(1/2) w^T w + C Σ_{i=1}^{N} ξ_i − Σ_{i=1}^{N} α_i [d_i (w^T x_i + w_0) − 1 + ξ_i] − Σ_{i=1}^{N} µ_i ξ_i (32)
with respect to ξi is equal to zero.
109 / 124
Thus
We get
α_i + µ_i = C (33)
Thus, if αi < C
Then µi > 0 ⇒ ξi = 0
We may determine w0
Using any data point (xi, di) in the training set such that 0 < α_i^∗ < C.
Then, given ξi = 0,
w_0^∗ = (1/d_i) − (w^∗)^T x_i (34)
110 / 124
Nevertheless
It is better
To take the mean value of w_0^∗ over all such data points in the training
sample (Burges, 1998).
BTW Burges’ 1998 tutorial, “A Tutorial on Support Vector Machines for
Pattern Recognition”, is a great SVM reference.
111 / 124
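Continuing the hypothetical dual sketch from before (again, not part of the slides), w∗ and the averaged w_0∗ of (Eq. 34) could be recovered from the multipliers like this:

```python
import numpy as np

def recover_hyperplane(alphas, X, d, C, tol=1e-6):
    """Recover w* = sum_i alpha_i d_i x_i over the support vectors, and average
    w0* = 1/d_i - (w*)^T x_i over the points with 0 < alpha_i < C (so xi_i = 0)."""
    sv = alphas > tol                            # support vectors
    w = ((alphas[sv] * d[sv])[:, None] * X[sv]).sum(axis=0)
    margin = sv & (alphas < C - tol)             # support vectors exactly on the margin
    w0 = np.mean(1.0 / d[margin] - X[margin] @ w)
    return w, w0
```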
Outline
1 History
The Beginning
2 Separable Classes
Separable Classes
Hyperplanes
3 Support Vectors
Support Vectors
Quadratic Optimization
Lagrange Multipliers
Method
Karush-Kuhn-Tucker Conditions
Primal-Dual Problem for Lagrangian
Properties
4 Kernel
Kernel Idea
Higher Dimensional Space
Examples
Now, How to select a Kernel?
5 Soft Margins
Introduction
The Soft Margin Solution
6 More About Kernels
Basic Idea
From Inner products to Kernels 112 / 124
Basic Idea
Something Notable
The SVM uses the scalar product ⟨xi, xj⟩ as a measure of similarity
between xi and xj, and of distance to the hyperplane.
Since the scalar product is linear, the SVM is a linear method.
But
Using a nonlinear function instead, we can make the classifier nonlinear.
113 / 124
We do this by defining the following map
Nonlinear transformations
Given a series of nonlinear transformations
{φ_i(x)}_{i=1}^{m}
from input space to the feature space.
We can define the decision surface as
Σ_{i=1}^{m} w_i φ_i(x) + w_0 = 0
114 / 124
This allows us to define
The following vector
φ(x) = (φ_0(x), φ_1(x), · · · , φ_m(x))^T
That represents the mapping.
115 / 124
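As a concrete illustration (not in the original slides), the sketch below uses the explicit degree-2 homogeneous feature map φ(x) = (x_1^2, √2 x_1 x_2, x_2^2)^T for x ∈ R^2 and checks numerically that φ(x)^T φ(z) = (x^T z)^2, previewing the kernel idea developed next:

```python
import numpy as np

def phi(x):
    """Explicit degree-2 (homogeneous) polynomial feature map for x in R^2."""
    x1, x2 = x
    return np.array([x1 ** 2, np.sqrt(2.0) * x1 * x2, x2 ** 2])

x = np.array([1.0, 2.0])
z = np.array([3.0, -1.0])

print(phi(x) @ phi(z))   # inner product computed in the feature space: 1.0
print((x @ z) ** 2)      # the same number computed directly in the input space: 1.0
```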
Outline
1 History
The Beginning
2 Separable Classes
Separable Classes
Hyperplanes
3 Support Vectors
Support Vectors
Quadratic Optimization
Lagrange Multipliers
Method
Karush-Kuhn-Tucker Conditions
Primal-Dual Problem for Lagrangian
Properties
4 Kernel
Kernel Idea
Higher Dimensional Space
Examples
Now, How to select a Kernel?
5 Soft Margins
Introduction
The Soft Margin Solution
6 More About Kernels
Basic Idea
From Inner products to Kernels 116 / 124
Finally
We define the decision surface as
w^T φ(x) = 0 (35)
We now seek "linear" separability of the features, so we may write
w = Σ_{i=1}^{N} α_i d_i φ(x_i) (36)
Thus, we finish with the following decision surface
Σ_{i=1}^{N} α_i d_i φ^T(x_i) φ(x) = 0 (37)
117 / 124
Thus
The term φ^T(x_i) φ(x)
It represents the inner product of two vectors induced in the feature space
by the input patterns.
We can introduce the inner-product kernel
K(x_i, x) = φ^T(x_i) φ(x) = Σ_{j=0}^{m} φ_j(x_i) φ_j(x) (38)
Property: Symmetry
K(x_i, x) = K(x, x_i) (39)
118 / 124
This allows us to redefine the optimal hyperplane
We get
Σ_{i=1}^{N} α_i d_i K(x_i, x) = 0 (40)
Something Notable
Using kernels, we can avoid going from:
Input Space =⇒ Mapping Space =⇒ Inner Product (41)
By directly going from
Input Space =⇒ Inner Product (42)
119 / 124
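A minimal sketch (not from the slides) of evaluating the kernelized decision surface in (Eq. 40). The RBF kernel used here is just one common choice of inner-product kernel, the multipliers, labels and training points are made-up values, and a bias term w0 is kept for generality even though (Eq. 40) omits it:

```python
import numpy as np

def rbf_kernel(xi, x, gamma=0.5):
    """One common inner-product kernel: K(xi, x) = exp(-gamma ||xi - x||^2)."""
    return np.exp(-gamma * np.sum((xi - x) ** 2))

def decision_function(x, alphas, d, X_train, w0=0.0, kernel=rbf_kernel):
    """Evaluate sum_i alpha_i d_i K(x_i, x) + w0 and classify by its sign."""
    return sum(a * di * kernel(xi, x)
               for a, di, xi in zip(alphas, d, X_train)) + w0

# Hypothetical multipliers, labels and training points
X_train = np.array([[0.0, 0.0], [1.0, 1.0]])
d = np.array([-1.0, 1.0])
alphas = np.array([0.7, 0.7])
print(np.sign(decision_function(np.array([0.9, 1.1]), alphas, d, X_train)))   # 1.0
```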
Important
Something Notable
The expansion of (Eq. 38) for the inner-product kernel K(xi, x) is an
important special case of Mercer’s theorem, which arises in functional analysis.
120 / 124
Mercer’s Theorem
Mercer’s Theorem
Let K(x, x′) be a continuous symmetric kernel that is defined in the
closed interval a ≤ x ≤ b, and likewise for x′. The kernel K(x, x′) can be
expanded in the series
K(x, x′) = Σ_{i=1}^{∞} λ_i φ_i(x) φ_i(x′) (43)
With
Positive coefficients, λi > 0 for all i.
121 / 124
Mercer’s Theorem
For this expansion to be valid and for it to converge absolutely and
uniformly
It is necessary and sufficient that the condition
∫_a^b ∫_a^b K(x, x′) ψ(x) ψ(x′) dx dx′ ≥ 0 (44)
holds for all ψ such that ∫_a^b ψ^2(x) dx < ∞ (an example of a quadratic
norm for functions).
122 / 124
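The finite-sample analogue of condition (44) is that the Gram matrix K_ij = K(x_i, x_j) must be positive semidefinite for any finite set of points; the sketch below (not from the slides) checks this numerically for an RBF kernel on hypothetical data:

```python
import numpy as np

def rbf_gram(X, gamma=0.5):
    """Gram matrix K_ij = exp(-gamma ||x_i - x_j||^2) for the rows of X."""
    sq = np.sum(X ** 2, axis=1)
    D = sq[:, None] + sq[None, :] - 2.0 * (X @ X.T)   # squared pairwise distances
    return np.exp(-gamma * D)

X = np.random.default_rng(1).normal(size=(20, 3))     # hypothetical input points
K = rbf_gram(X)
eigvals = np.linalg.eigvalsh(K)                       # K is symmetric
print(eigvals.min() >= -1e-10)                        # True: numerically positive semidefinite
```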
Remarks
First
The functions φi(x) are called eigenfunctions of the expansion and the
numbers λi are called eigenvalues.
Second
The fact that all of the eigenvalues are positive means that the kernel
K(x, x′) is positive definite.
123 / 124
Not only that
We have that
For λi = 1, the ith image √λ_i φ_i(x) induced in the feature space by the
input vector x is an eigenfunction of the expansion.
In theory
The dimensionality of the feature space (i.e., the number of eigenvalues/
eigenfunctions) can be infinitely large.
124 / 124
Not only that
We have that
For λi = 1, the ith image of
√
λiφi (x) induced in the feature space by the
input vector x is an eigenfunction of the expansion.
In theory
The dimensionality of the feature space (i.e., the number of eigenvalues/
eigenfunctions) can be infinitely large.
124 / 124

More Related Content

PDF
An introduction to support vector machines
PDF
Open science 2014
PPTX
Support Vector Machines (SVM) - Text Analytics algorithm introduction 2012
PDF
Support Vector Machines for Classification
PPT
Artificial Intelligence: Data Mining
PDF
Agile for Embedded & System Software Development : Presented by Priyank KS
PPT
3.7 heap sort
PDF
Interfaces to ubiquitous computing
An introduction to support vector machines
Open science 2014
Support Vector Machines (SVM) - Text Analytics algorithm introduction 2012
Support Vector Machines for Classification
Artificial Intelligence: Data Mining
Agile for Embedded & System Software Development : Presented by Priyank KS
3.7 heap sort
Interfaces to ubiquitous computing

Viewers also liked (20)

PDF
iBeacons: Security and Privacy?
PPTX
Demystifying dependency Injection: Dagger and Toothpick
PDF
Dependency Injection with Apex
PDF
Agile London: Industrial Agility, How to respond to the 4th Industrial Revolu...
PPTX
Agile Methodology PPT
PDF
HUG Ireland Event Presentation - In-Memory Databases
PPT
Fp growth algorithm
PDF
Privacy Concerns and Social Robots
PDF
Skiena algorithm 2007 lecture07 heapsort priority queues
PDF
ScrumGuides training: Agile Software Development With Scrum
PDF
Design & Analysis of Algorithms Lecture Notes
PPTX
Going native with less coupling: Dependency Injection in C++
PDF
Final Year Project-Gesture Based Interaction and Image Processing
PPTX
In-Memory Database Performance on AWS M4 Instances
PDF
Machine learning support vector machines
PPT
ARTIFICIAL INTELLIGENCE & NEURAL NETWORKS
PDF
Sap technical deep dive in a column oriented in memory database
PDF
Dependency injection in scala
PDF
RFID Privacy & Security Issues
PPSX
Data Structure (Heap Sort)
iBeacons: Security and Privacy?
Demystifying dependency Injection: Dagger and Toothpick
Dependency Injection with Apex
Agile London: Industrial Agility, How to respond to the 4th Industrial Revolu...
Agile Methodology PPT
HUG Ireland Event Presentation - In-Memory Databases
Fp growth algorithm
Privacy Concerns and Social Robots
Skiena algorithm 2007 lecture07 heapsort priority queues
ScrumGuides training: Agile Software Development With Scrum
Design & Analysis of Algorithms Lecture Notes
Going native with less coupling: Dependency Injection in C++
Final Year Project-Gesture Based Interaction and Image Processing
In-Memory Database Performance on AWS M4 Instances
Machine learning support vector machines
ARTIFICIAL INTELLIGENCE & NEURAL NETWORKS
Sap technical deep dive in a column oriented in memory database
Dependency injection in scala
RFID Privacy & Security Issues
Data Structure (Heap Sort)
Ad

Similar to 09 Machine Learning - Introduction Support Vector Machines (20)

PDF
Deterministic Chaos In Onedimensional Continuous Systems Jan Awrejcewicz
PDF
Logic Colloquium 2007 1st Edition Franoise Delon Ulrich Kohlenbach
PDF
Handbook of Differential Equations Evolutionary Equations Volume 4 1st Editio...
PDF
Handbook of Differential Equations Evolutionary Equations Volume 4 1st Editio...
PPT
Supporting the exploding dimensions of the chemical sciences via global netwo...
PDF
Handbook of Differential Equations Evolutionary Equations Volume 4 1st Editio...
PDF
Causal Competences of Many Kinds
PDF
Handbook of Differential Equations Evolutionary Equations Volume 4 1st Editio...
PDF
Recent Developments In Lie Algebras Groups And Representation Theory Kailash ...
PDF
Handbook of Differential Equations Evolutionary Equations Volume 4 1st Editio...
PDF
Handbook of Differential Equations Evolutionary Equations Volume 4 1st Editio...
PPTX
Knowledge base system appl. p 3,4
PDF
A Study on Evolving Self-organizing Cellular Automata based on Neural Network...
PDF
Nonlinear Cosmic Ray Diffusion Theories Andreas Shalchi
PPTX
Assessing, Creating and Using Knowledge Graph Restrictions
PPT
Knowledge – dynamics – landscape - navigation – what have interfaces to digit...
PDF
Contemporary Mathematics Dynamical Systems And Random Processes 1st Edition J...
PDF
Does the neocortex use grid cell-like mechanisms to learn the structure of ob...
PDF
Spaceefficient Data Structures Streams And Algorithms Papers In Honor Of J Ia...
PDF
Symmetry In Particle Physics Michal Hnati Jaroslav Antos Juha Honkonen
Deterministic Chaos In Onedimensional Continuous Systems Jan Awrejcewicz
Logic Colloquium 2007 1st Edition Franoise Delon Ulrich Kohlenbach
Handbook of Differential Equations Evolutionary Equations Volume 4 1st Editio...
Handbook of Differential Equations Evolutionary Equations Volume 4 1st Editio...
Supporting the exploding dimensions of the chemical sciences via global netwo...
Handbook of Differential Equations Evolutionary Equations Volume 4 1st Editio...
Causal Competences of Many Kinds
Handbook of Differential Equations Evolutionary Equations Volume 4 1st Editio...
Recent Developments In Lie Algebras Groups And Representation Theory Kailash ...
Handbook of Differential Equations Evolutionary Equations Volume 4 1st Editio...
Handbook of Differential Equations Evolutionary Equations Volume 4 1st Editio...
Knowledge base system appl. p 3,4
A Study on Evolving Self-organizing Cellular Automata based on Neural Network...
Nonlinear Cosmic Ray Diffusion Theories Andreas Shalchi
Assessing, Creating and Using Knowledge Graph Restrictions
Knowledge – dynamics – landscape - navigation – what have interfaces to digit...
Contemporary Mathematics Dynamical Systems And Random Processes 1st Edition J...
Does the neocortex use grid cell-like mechanisms to learn the structure of ob...
Spaceefficient Data Structures Streams And Algorithms Papers In Honor Of J Ia...
Symmetry In Particle Physics Michal Hnati Jaroslav Antos Juha Honkonen
Ad

More from Andres Mendez-Vazquez (20)

PDF
2.03 bayesian estimation
PDF
05 linear transformations
PDF
01.04 orthonormal basis_eigen_vectors
PDF
01.03 squared matrices_and_other_issues
PDF
01.02 linear equations
PDF
01.01 vector spaces
PDF
06 recurrent neural_networks
PDF
05 backpropagation automatic_differentiation
PDF
Zetta global
PDF
01 Introduction to Neural Networks and Deep Learning
PDF
25 introduction reinforcement_learning
PDF
Neural Networks and Deep Learning Syllabus
PDF
Introduction to artificial_intelligence_syllabus
PDF
Ideas 09 22_2018
PDF
Ideas about a Bachelor in Machine Learning/Data Sciences
PDF
Analysis of Algorithms Syllabus
PDF
20 k-means, k-center, k-meoids and variations
PDF
18.1 combining models
PDF
17 vapnik chervonenkis dimension
PDF
A basic introduction to learning
2.03 bayesian estimation
05 linear transformations
01.04 orthonormal basis_eigen_vectors
01.03 squared matrices_and_other_issues
01.02 linear equations
01.01 vector spaces
06 recurrent neural_networks
05 backpropagation automatic_differentiation
Zetta global
01 Introduction to Neural Networks and Deep Learning
25 introduction reinforcement_learning
Neural Networks and Deep Learning Syllabus
Introduction to artificial_intelligence_syllabus
Ideas 09 22_2018
Ideas about a Bachelor in Machine Learning/Data Sciences
Analysis of Algorithms Syllabus
20 k-means, k-center, k-meoids and variations
18.1 combining models
17 vapnik chervonenkis dimension
A basic introduction to learning

Recently uploaded (20)

PPTX
CYBER-CRIMES AND SECURITY A guide to understanding
PPTX
bas. eng. economics group 4 presentation 1.pptx
PPTX
Artificial Intelligence
PPTX
Foundation to blockchain - A guide to Blockchain Tech
PDF
SM_6th-Sem__Cse_Internet-of-Things.pdf IOT
PPTX
Sustainable Sites - Green Building Construction
PPTX
UNIT-1 - COAL BASED THERMAL POWER PLANTS
PPTX
CARTOGRAPHY AND GEOINFORMATION VISUALIZATION chapter1 NPTE (2).pptx
PDF
Operating System & Kernel Study Guide-1 - converted.pdf
PDF
July 2025 - Top 10 Read Articles in International Journal of Software Enginee...
PPT
Introduction, IoT Design Methodology, Case Study on IoT System for Weather Mo...
PDF
Mitigating Risks through Effective Management for Enhancing Organizational Pe...
PDF
Enhancing Cyber Defense Against Zero-Day Attacks using Ensemble Neural Networks
DOCX
ASol_English-Language-Literature-Set-1-27-02-2023-converted.docx
PDF
Evaluating the Democratization of the Turkish Armed Forces from a Normative P...
PDF
Automation-in-Manufacturing-Chapter-Introduction.pdf
PPTX
Lecture Notes Electrical Wiring System Components
PPTX
FINAL REVIEW FOR COPD DIANOSIS FOR PULMONARY DISEASE.pptx
PDF
Embodied AI: Ushering in the Next Era of Intelligent Systems
PDF
Model Code of Practice - Construction Work - 21102022 .pdf
CYBER-CRIMES AND SECURITY A guide to understanding
bas. eng. economics group 4 presentation 1.pptx
Artificial Intelligence
Foundation to blockchain - A guide to Blockchain Tech
SM_6th-Sem__Cse_Internet-of-Things.pdf IOT
Sustainable Sites - Green Building Construction
UNIT-1 - COAL BASED THERMAL POWER PLANTS
CARTOGRAPHY AND GEOINFORMATION VISUALIZATION chapter1 NPTE (2).pptx
Operating System & Kernel Study Guide-1 - converted.pdf
July 2025 - Top 10 Read Articles in International Journal of Software Enginee...
Introduction, IoT Design Methodology, Case Study on IoT System for Weather Mo...
Mitigating Risks through Effective Management for Enhancing Organizational Pe...
Enhancing Cyber Defense Against Zero-Day Attacks using Ensemble Neural Networks
ASol_English-Language-Literature-Set-1-27-02-2023-converted.docx
Evaluating the Democratization of the Turkish Armed Forces from a Normative P...
Automation-in-Manufacturing-Chapter-Introduction.pdf
Lecture Notes Electrical Wiring System Components
FINAL REVIEW FOR COPD DIANOSIS FOR PULMONARY DISEASE.pptx
Embodied AI: Ushering in the Next Era of Intelligent Systems
Model Code of Practice - Construction Work - 21102022 .pdf

09 Machine Learning - Introduction Support Vector Machines

  • 1. Machine Learning for Data Mining Introduction to Support Vector Machines Andres Mendez-Vazquez June 22, 2016 1 / 124
  • 2. Outline 1 History The Beginning 2 Separable Classes Separable Classes Hyperplanes 3 Support Vectors Support Vectors Quadratic Optimization Lagrange Multipliers Method Karush-Kuhn-Tucker Conditions Primal-Dual Problem for Lagrangian Properties 4 Kernel Kernel Idea Higher Dimensional Space Examples Now, How to select a Kernel? 5 Soft Margins Introduction The Soft Margin Solution 6 More About Kernels Basic Idea From Inner products to Kernels 2 / 124
  • 3. Outline 1 History The Beginning 2 Separable Classes Separable Classes Hyperplanes 3 Support Vectors Support Vectors Quadratic Optimization Lagrange Multipliers Method Karush-Kuhn-Tucker Conditions Primal-Dual Problem for Lagrangian Properties 4 Kernel Kernel Idea Higher Dimensional Space Examples Now, How to select a Kernel? 5 Soft Margins Introduction The Soft Margin Solution 6 More About Kernels Basic Idea From Inner products to Kernels 3 / 124
  • 4. History Invented by Vladimir Vapnik and Alexey Ya. Chervonenkis in 1963 At the Institute of Control Sciences, Moscow On the paper “Estimation of dependencies based on empirical data” Corinna Cortes and Vladimir Vapnik in 1995 They Invented their Current Incarnation - Soft Margins At the AT&T Labs BTW Corinna Cortes Danish computer scientist who is known for her contributions to the field of machine learning. She is currently the Head of Google Research, New York. Cortes is a recipient of the Paris Kanellakis Theory and Practice Award (ACM) for her work on theoretical foundations of support vector machines. 4 / 124
  • 5. History Invented by Vladimir Vapnik and Alexey Ya. Chervonenkis in 1963 At the Institute of Control Sciences, Moscow On the paper “Estimation of dependencies based on empirical data” Corinna Cortes and Vladimir Vapnik in 1995 They Invented their Current Incarnation - Soft Margins At the AT&T Labs BTW Corinna Cortes Danish computer scientist who is known for her contributions to the field of machine learning. She is currently the Head of Google Research, New York. Cortes is a recipient of the Paris Kanellakis Theory and Practice Award (ACM) for her work on theoretical foundations of support vector machines. 4 / 124
  • 6. History Invented by Vladimir Vapnik and Alexey Ya. Chervonenkis in 1963 At the Institute of Control Sciences, Moscow On the paper “Estimation of dependencies based on empirical data” Corinna Cortes and Vladimir Vapnik in 1995 They Invented their Current Incarnation - Soft Margins At the AT&T Labs BTW Corinna Cortes Danish computer scientist who is known for her contributions to the field of machine learning. She is currently the Head of Google Research, New York. Cortes is a recipient of the Paris Kanellakis Theory and Practice Award (ACM) for her work on theoretical foundations of support vector machines. 4 / 124
  • 7. History Invented by Vladimir Vapnik and Alexey Ya. Chervonenkis in 1963 At the Institute of Control Sciences, Moscow On the paper “Estimation of dependencies based on empirical data” Corinna Cortes and Vladimir Vapnik in 1995 They Invented their Current Incarnation - Soft Margins At the AT&T Labs BTW Corinna Cortes Danish computer scientist who is known for her contributions to the field of machine learning. She is currently the Head of Google Research, New York. Cortes is a recipient of the Paris Kanellakis Theory and Practice Award (ACM) for her work on theoretical foundations of support vector machines. 4 / 124
  • 8. History Invented by Vladimir Vapnik and Alexey Ya. Chervonenkis in 1963 At the Institute of Control Sciences, Moscow On the paper “Estimation of dependencies based on empirical data” Corinna Cortes and Vladimir Vapnik in 1995 They Invented their Current Incarnation - Soft Margins At the AT&T Labs BTW Corinna Cortes Danish computer scientist who is known for her contributions to the field of machine learning. She is currently the Head of Google Research, New York. Cortes is a recipient of the Paris Kanellakis Theory and Practice Award (ACM) for her work on theoretical foundations of support vector machines. 4 / 124
  • 9. History Invented by Vladimir Vapnik and Alexey Ya. Chervonenkis in 1963 At the Institute of Control Sciences, Moscow On the paper “Estimation of dependencies based on empirical data” Corinna Cortes and Vladimir Vapnik in 1995 They Invented their Current Incarnation - Soft Margins At the AT&T Labs BTW Corinna Cortes Danish computer scientist who is known for her contributions to the field of machine learning. She is currently the Head of Google Research, New York. Cortes is a recipient of the Paris Kanellakis Theory and Practice Award (ACM) for her work on theoretical foundations of support vector machines. 4 / 124
  • 10. History Invented by Vladimir Vapnik and Alexey Ya. Chervonenkis in 1963 At the Institute of Control Sciences, Moscow On the paper “Estimation of dependencies based on empirical data” Corinna Cortes and Vladimir Vapnik in 1995 They Invented their Current Incarnation - Soft Margins At the AT&T Labs BTW Corinna Cortes Danish computer scientist who is known for her contributions to the field of machine learning. She is currently the Head of Google Research, New York. Cortes is a recipient of the Paris Kanellakis Theory and Practice Award (ACM) for her work on theoretical foundations of support vector machines. 4 / 124
  • 11. In addition Alexey Yakovlevich Chervonenkis He was a Soviet and Russian mathematician, and, with Vladimir Vapnik, was one of the main developers of the Vapnik–Chervonenkis theory, also known as the "fundamental theory of learning" an important part of computational learning theory. He died in September 22nd, 2014 At Losiny Ostrov National Park on 22 September 2014. 5 / 124
  • 12. In addition Alexey Yakovlevich Chervonenkis He was a Soviet and Russian mathematician, and, with Vladimir Vapnik, was one of the main developers of the Vapnik–Chervonenkis theory, also known as the "fundamental theory of learning" an important part of computational learning theory. He died in September 22nd, 2014 At Losiny Ostrov National Park on 22 September 2014. 5 / 124
  • 13. Applications Partial List 1 Predictive Control Control of chaotic systems. 2 Inverse Geosounding Problem It is used to understand the internal structure of our planet. 3 Environmental Sciences Spatio-temporal environmental data analysis and modeling. 4 Protein Fold and Remote Homology Detection In the recognition if two different species contain similar genes. 5 Facial expression classification 6 Texture Classification 7 E-Learning 8 Handwritten Recognition 9 AND counting.... 6 / 124
  • 14. Applications Partial List 1 Predictive Control Control of chaotic systems. 2 Inverse Geosounding Problem It is used to understand the internal structure of our planet. 3 Environmental Sciences Spatio-temporal environmental data analysis and modeling. 4 Protein Fold and Remote Homology Detection In the recognition if two different species contain similar genes. 5 Facial expression classification 6 Texture Classification 7 E-Learning 8 Handwritten Recognition 9 AND counting.... 6 / 124
  • 15. Applications Partial List 1 Predictive Control Control of chaotic systems. 2 Inverse Geosounding Problem It is used to understand the internal structure of our planet. 3 Environmental Sciences Spatio-temporal environmental data analysis and modeling. 4 Protein Fold and Remote Homology Detection In the recognition if two different species contain similar genes. 5 Facial expression classification 6 Texture Classification 7 E-Learning 8 Handwritten Recognition 9 AND counting.... 6 / 124
  • 16. Applications Partial List 1 Predictive Control Control of chaotic systems. 2 Inverse Geosounding Problem It is used to understand the internal structure of our planet. 3 Environmental Sciences Spatio-temporal environmental data analysis and modeling. 4 Protein Fold and Remote Homology Detection In the recognition if two different species contain similar genes. 5 Facial expression classification 6 Texture Classification 7 E-Learning 8 Handwritten Recognition 9 AND counting.... 6 / 124
  • 17. Applications Partial List 1 Predictive Control Control of chaotic systems. 2 Inverse Geosounding Problem It is used to understand the internal structure of our planet. 3 Environmental Sciences Spatio-temporal environmental data analysis and modeling. 4 Protein Fold and Remote Homology Detection In the recognition if two different species contain similar genes. 5 Facial expression classification 6 Texture Classification 7 E-Learning 8 Handwritten Recognition 9 AND counting.... 6 / 124
  • 18. Applications Partial List 1 Predictive Control Control of chaotic systems. 2 Inverse Geosounding Problem It is used to understand the internal structure of our planet. 3 Environmental Sciences Spatio-temporal environmental data analysis and modeling. 4 Protein Fold and Remote Homology Detection In the recognition if two different species contain similar genes. 5 Facial expression classification 6 Texture Classification 7 E-Learning 8 Handwritten Recognition 9 AND counting.... 6 / 124
  • 19. Applications Partial List 1 Predictive Control Control of chaotic systems. 2 Inverse Geosounding Problem It is used to understand the internal structure of our planet. 3 Environmental Sciences Spatio-temporal environmental data analysis and modeling. 4 Protein Fold and Remote Homology Detection In the recognition if two different species contain similar genes. 5 Facial expression classification 6 Texture Classification 7 E-Learning 8 Handwritten Recognition 9 AND counting.... 6 / 124
  • 20. Applications Partial List 1 Predictive Control Control of chaotic systems. 2 Inverse Geosounding Problem It is used to understand the internal structure of our planet. 3 Environmental Sciences Spatio-temporal environmental data analysis and modeling. 4 Protein Fold and Remote Homology Detection In the recognition if two different species contain similar genes. 5 Facial expression classification 6 Texture Classification 7 E-Learning 8 Handwritten Recognition 9 AND counting.... 6 / 124
  • 21. Applications Partial List 1 Predictive Control Control of chaotic systems. 2 Inverse Geosounding Problem It is used to understand the internal structure of our planet. 3 Environmental Sciences Spatio-temporal environmental data analysis and modeling. 4 Protein Fold and Remote Homology Detection In the recognition if two different species contain similar genes. 5 Facial expression classification 6 Texture Classification 7 E-Learning 8 Handwritten Recognition 9 AND counting.... 6 / 124
  • 22. Outline 1 History The Beginning 2 Separable Classes Separable Classes Hyperplanes 3 Support Vectors Support Vectors Quadratic Optimization Lagrange Multipliers Method Karush-Kuhn-Tucker Conditions Primal-Dual Problem for Lagrangian Properties 4 Kernel Kernel Idea Higher Dimensional Space Examples Now, How to select a Kernel? 5 Soft Margins Introduction The Soft Margin Solution 6 More About Kernels Basic Idea From Inner products to Kernels 7 / 124
  • 23. Separable Classes Given xi, i = 1, · · · , N A set of samples belonging to two classes ω1, ω2. Objective We want to obtain a decision function as simple as g (x) = wT x + w0 8 / 124
  • 24. Separable Classes Given xi, i = 1, · · · , N A set of samples belonging to two classes ω1, ω2. Objective We want to obtain a decision function as simple as g (x) = wT x + w0 8 / 124
  • 25. Such that we can do the following A linear separation function g (x) = wt x + w0 9 / 124
  • 26. Outline 1 History The Beginning 2 Separable Classes Separable Classes Hyperplanes 3 Support Vectors Support Vectors Quadratic Optimization Lagrange Multipliers Method Karush-Kuhn-Tucker Conditions Primal-Dual Problem for Lagrangian Properties 4 Kernel Kernel Idea Higher Dimensional Space Examples Now, How to select a Kernel? 5 Soft Margins Introduction The Soft Margin Solution 6 More About Kernels Basic Idea From Inner products to Kernels 10 / 124
  • 27. In other words ... We have the following samples For x1, · · · , xm ∈ C1 For x1, · · · , xn ∈ C2 We want the following decision surfaces wT xi + w0 ≥ 0 for di = +1 if xi ∈ C1 wT xj + w0 ≤ 0 for dj = −1 if xj ∈ C2 11 / 124
  • 28. In other words ... We have the following samples For x1, · · · , xm ∈ C1 For x1, · · · , xn ∈ C2 We want the following decision surfaces wT xi + w0 ≥ 0 for di = +1 if xi ∈ C1 wT xj + w0 ≤ 0 for dj = −1 if xj ∈ C2 11 / 124
  • 29. In other words ... We have the following samples For x1, · · · , xm ∈ C1 For x1, · · · , xn ∈ C2 We want the following decision surfaces wT xi + w0 ≥ 0 for di = +1 if xi ∈ C1 wT xj + w0 ≤ 0 for dj = −1 if xj ∈ C2 11 / 124
  • 30. In other words ... We have the following samples For x1, · · · , xm ∈ C1 For x1, · · · , xn ∈ C2 We want the following decision surfaces wT xi + w0 ≥ 0 for di = +1 if xi ∈ C1 wT xj + w0 ≤ 0 for dj = −1 if xj ∈ C2 11 / 124
  • 31. What do we want? Our goal is to search for a direction w that gives the maximum possible margin direction 2 MARGINSdirection 1 12 / 124
  • 32. Remember We have the following d Projection r distance 0 13 / 124
  • 33. A Little of Geometry Thus r d A B C Then d = |w0| w2 1 + w2 2 , r = |g (x)| w2 1 + w2 2 (1) 14 / 124
  • 34. A Little of Geometry Thus r d A B C Then d = |w0| w2 1 + w2 2 , r = |g (x)| w2 1 + w2 2 (1) 14 / 124
  • 35. First d = |w0|√ w2 1+w2 2 We can use the following rule in a triangle with a 90o angle Area = 1 2 Cd (2) In addition, the area can be calculated also as Area = 1 2 AB (3) Thus d = AB C Remark: Can you get the rest of values? 15 / 124
  • 36. First d = |w0|√ w2 1+w2 2 We can use the following rule in a triangle with a 90o angle Area = 1 2 Cd (2) In addition, the area can be calculated also as Area = 1 2 AB (3) Thus d = AB C Remark: Can you get the rest of values? 15 / 124
  • 37. First d = |w0|√ w2 1+w2 2 We can use the following rule in a triangle with a 90o angle Area = 1 2 Cd (2) In addition, the area can be calculated also as Area = 1 2 AB (3) Thus d = AB C Remark: Can you get the rest of values? 15 / 124
  • 38. What about r = |g(x)|√ w2 1+w2 2 ? First, remember g (xp) = 0 and x = xp + r w w (4) Thus, we have g (x) =wT xp + r w w + w0 =wT xp + w0 + r wT w w =wT xp + w0 + r w 2 w =g (xp) + r w Then r = g(x) ||w|| 16 / 124
  • 39. What about r = |g(x)|√ w2 1+w2 2 ? First, remember g (xp) = 0 and x = xp + r w w (4) Thus, we have g (x) =wT xp + r w w + w0 =wT xp + w0 + r wT w w =wT xp + w0 + r w 2 w =g (xp) + r w Then r = g(x) ||w|| 16 / 124
  • 40. What about r = |g(x)|√ w2 1+w2 2 ? First, remember g (xp) = 0 and x = xp + r w w (4) Thus, we have g (x) =wT xp + r w w + w0 =wT xp + w0 + r wT w w =wT xp + w0 + r w 2 w =g (xp) + r w Then r = g(x) ||w|| 16 / 124
  • 41. What about r = |g(x)|√ w2 1+w2 2 ? First, remember g (xp) = 0 and x = xp + r w w (4) Thus, we have g (x) =wT xp + r w w + w0 =wT xp + w0 + r wT w w =wT xp + w0 + r w 2 w =g (xp) + r w Then r = g(x) ||w|| 16 / 124
  • 42. What about r = |g(x)|√ w2 1+w2 2 ? First, remember g (xp) = 0 and x = xp + r w w (4) Thus, we have g (x) =wT xp + r w w + w0 =wT xp + w0 + r wT w w =wT xp + w0 + r w 2 w =g (xp) + r w Then r = g(x) ||w|| 16 / 124
  • 43. What about r = |g(x)|√ w2 1+w2 2 ? First, remember g (xp) = 0 and x = xp + r w w (4) Thus, we have g (x) =wT xp + r w w + w0 =wT xp + w0 + r wT w w =wT xp + w0 + r w 2 w =g (xp) + r w Then r = g(x) ||w|| 16 / 124
  • 44. This has the following interpretation The distance from the projection 0 17 / 124
  • 45. Now We know that the straight line that we are looking for looks like wT x + w0 = 0 (5) What about something like this wT x + w0 = δ (6) Clearly This will be above or below the initial line wT x + w0 = 0. 18 / 124
  • 46. Now We know that the straight line that we are looking for looks like wT x + w0 = 0 (5) What about something like this wT x + w0 = δ (6) Clearly This will be above or below the initial line wT x + w0 = 0. 18 / 124
  • 47. Now We know that the straight line that we are looking for looks like wT x + w0 = 0 (5) What about something like this wT x + w0 = δ (6) Clearly This will be above or below the initial line wT x + w0 = 0. 18 / 124
  • 48. Come back to the hyperplanes We have then for each border support line an specific bias!!! Support Vectors 19 / 124
  • 49. Then, normalize by δ The new margin functions w T x + w10 = 1 w T x + w01 = −1 where w = w δ , w10 = w0 δ ,and w01 = w0 δ Now, we come back to the middle separator hyperplane, but with the normalized term wT xi + w0 ≥ w T x + w10 for di = +1 wT xi + w0 ≤ w T x + w01 for di = −1 Where w0 is the bias of that central hyperplane!! And the w is the normalized direction of w 20 / 124
  • 50. Then, normalize by δ The new margin functions w T x + w10 = 1 w T x + w01 = −1 where w = w δ , w10 = w0 δ ,and w01 = w0 δ Now, we come back to the middle separator hyperplane, but with the normalized term wT xi + w0 ≥ w T x + w10 for di = +1 wT xi + w0 ≤ w T x + w01 for di = −1 Where w0 is the bias of that central hyperplane!! And the w is the normalized direction of w 20 / 124
  • 51. Then, normalize by δ The new margin functions w T x + w10 = 1 w T x + w01 = −1 where w = w δ , w10 = w0 δ ,and w01 = w0 δ Now, we come back to the middle separator hyperplane, but with the normalized term wT xi + w0 ≥ w T x + w10 for di = +1 wT xi + w0 ≤ w T x + w01 for di = −1 Where w0 is the bias of that central hyperplane!! And the w is the normalized direction of w 20 / 124
  • 52. Then, normalize by δ The new margin functions w T x + w10 = 1 w T x + w01 = −1 where w = w δ , w10 = w0 δ ,and w01 = w0 δ Now, we come back to the middle separator hyperplane, but with the normalized term wT xi + w0 ≥ w T x + w10 for di = +1 wT xi + w0 ≤ w T x + w01 for di = −1 Where w0 is the bias of that central hyperplane!! And the w is the normalized direction of w 20 / 124
  • 53. Then, normalize by δ The new margin functions w T x + w10 = 1 w T x + w01 = −1 where w = w δ , w10 = w0 δ ,and w01 = w0 δ Now, we come back to the middle separator hyperplane, but with the normalized term wT xi + w0 ≥ w T x + w10 for di = +1 wT xi + w0 ≤ w T x + w01 for di = −1 Where w0 is the bias of that central hyperplane!! And the w is the normalized direction of w 20 / 124
  • 54. Then, normalize by δ The new margin functions w T x + w10 = 1 w T x + w01 = −1 where w = w δ , w10 = w0 δ ,and w01 = w0 δ Now, we come back to the middle separator hyperplane, but with the normalized term wT xi + w0 ≥ w T x + w10 for di = +1 wT xi + w0 ≤ w T x + w01 for di = −1 Where w0 is the bias of that central hyperplane!! And the w is the normalized direction of w 20 / 124
  • 55. Come back to the hyperplanes The meaning of what I am saying!!! 21 / 124
  • 56. Outline 1 History The Beginning 2 Separable Classes Separable Classes Hyperplanes 3 Support Vectors Support Vectors Quadratic Optimization Lagrange Multipliers Method Karush-Kuhn-Tucker Conditions Primal-Dual Problem for Lagrangian Properties 4 Kernel Kernel Idea Higher Dimensional Space Examples Now, How to select a Kernel? 5 Soft Margins Introduction The Soft Margin Solution 6 More About Kernels Basic Idea From Inner products to Kernels 22 / 124
  • 57. A little about Support Vectors They are the vectors (Here, we assume that w) xi such that wT xi + w0 = 1 or wT xi + w0 = −1 Properties The vectors nearest to the decision surface and the most difficult to classify. Because of that, we have the name “Support Vector Machines”. 23 / 124
  • 58. A little about Support Vectors They are the vectors (Here, we assume that w) xi such that wT xi + w0 = 1 or wT xi + w0 = −1 Properties The vectors nearest to the decision surface and the most difficult to classify. Because of that, we have the name “Support Vector Machines”. 23 / 124
  • 59. A little about Support Vectors They are the vectors (Here, we assume that w) xi such that wT xi + w0 = 1 or wT xi + w0 = −1 Properties The vectors nearest to the decision surface and the most difficult to classify. Because of that, we have the name “Support Vector Machines”. 23 / 124
  • 60. Now, we can resume the decision rule for the hyperplane For the support vectors g (xi) = wT xi + w0 = −(+)1 for di = −(+)1 (7) Implies The distance to the support vectors is: r = g (xi) ||w|| =    1 ||w|| if di = +1 − 1 ||w|| if di = −1 24 / 124
  • 61. Now, we can resume the decision rule for the hyperplane For the support vectors g (xi) = wT xi + w0 = −(+)1 for di = −(+)1 (7) Implies The distance to the support vectors is: r = g (xi) ||w|| =    1 ||w|| if di = +1 − 1 ||w|| if di = −1 24 / 124
  • 62. Therefore ... We want the optimum value of the margin of separation as ρ = 1 ||w|| + 1 ||w|| = 2 ||w|| (8) And the support vectors define the value of ρ 25 / 124
  • 63. Therefore ... We want the optimum value of the margin of separation as ρ = 1 ||w|| + 1 ||w|| = 2 ||w|| (8) And the support vectors define the value of ρ Support Vectors 25 / 124
  • 64. Outline 1 History The Beginning 2 Separable Classes Separable Classes Hyperplanes 3 Support Vectors Support Vectors Quadratic Optimization Lagrange Multipliers Method Karush-Kuhn-Tucker Conditions Primal-Dual Problem for Lagrangian Properties 4 Kernel Kernel Idea Higher Dimensional Space Examples Now, How to select a Kernel? 5 Soft Margins Introduction The Soft Margin Solution 6 More About Kernels Basic Idea From Inner products to Kernels 26 / 124
  • 65. Quadratic Optimization Then, we have the samples with labels T = {(xi, di)}N i=1 Then we can put the decision rule as di wT xi + w0 ≥ 1 i = 1, · · · , N 27 / 124
  • 66. Quadratic Optimization Then, we have the samples with labels T = {(xi, di)}N i=1 Then we can put the decision rule as di wT xi + w0 ≥ 1 i = 1, · · · , N 27 / 124
  • 67. Then, we have the optimization problem The optimization problem minwΦ (w) = 1 2wT w s.t. di(wT xi + w0) ≥ 1 i = 1, · · · , N Observations The cost functions Φ (w) is convex. The constrains are linear with respect to w. 28 / 124
  • 68. Then, we have the optimization problem The optimization problem minwΦ (w) = 1 2wT w s.t. di(wT xi + w0) ≥ 1 i = 1, · · · , N Observations The cost functions Φ (w) is convex. The constrains are linear with respect to w. 28 / 124
  • 69. Then, we have the optimization problem The optimization problem minwΦ (w) = 1 2wT w s.t. di(wT xi + w0) ≥ 1 i = 1, · · · , N Observations The cost functions Φ (w) is convex. The constrains are linear with respect to w. 28 / 124
  • 70. Outline 1 History The Beginning 2 Separable Classes Separable Classes Hyperplanes 3 Support Vectors Support Vectors Quadratic Optimization Lagrange Multipliers Method Karush-Kuhn-Tucker Conditions Primal-Dual Problem for Lagrangian Properties 4 Kernel Kernel Idea Higher Dimensional Space Examples Now, How to select a Kernel? 5 Soft Margins Introduction The Soft Margin Solution 6 More About Kernels Basic Idea From Inner products to Kernels 29 / 124
  • 71. Lagrange Multipliers The method of Lagrange multipliers Gives a set of necessary conditions to identify optimal points of equality constrained optimization problems. This is done by converting a constrained problem to an equivalent unconstrained problem with the help of certain unspecified parameters known as Lagrange multipliers. 30 / 124
  • 72. Lagrange Multipliers The method of Lagrange multipliers Gives a set of necessary conditions to identify optimal points of equality constrained optimization problems. This is done by converting a constrained problem to an equivalent unconstrained problem with the help of certain unspecified parameters known as Lagrange multipliers. 30 / 124
  • 73. Lagrange Multipliers The classical problem formulation min f (x1, x2, ..., xn) s.t h1 (x1, x2, ..., xn) = 0 It can be converted into min L (x1, x2, ..., xn, λ) = min {f (x1, x2, ..., xn) − λh1 (x1, x2, ..., xn)} (9) where L(x, λ) is the Lagrangian function. λ is an unspecified positive or negative constant called the Lagrange Multiplier. 31 / 124
  • 74. Lagrange Multipliers The classical problem formulation min f (x1, x2, ..., xn) s.t h1 (x1, x2, ..., xn) = 0 It can be converted into min L (x1, x2, ..., xn, λ) = min {f (x1, x2, ..., xn) − λh1 (x1, x2, ..., xn)} (9) where L(x, λ) is the Lagrangian function. λ is an unspecified positive or negative constant called the Lagrange Multiplier. 31 / 124
  • 75. Lagrange Multipliers The classical problem formulation min f (x1, x2, ..., xn) s.t h1 (x1, x2, ..., xn) = 0 It can be converted into min L (x1, x2, ..., xn, λ) = min {f (x1, x2, ..., xn) − λh1 (x1, x2, ..., xn)} (9) where L(x, λ) is the Lagrangian function. λ is an unspecified positive or negative constant called the Lagrange Multiplier. 31 / 124
  • 76. Finding an Optimum using Lagrange Multipliers New problem min L (x1, x2, ..., xn, λ) = min {f (x1, x2, ..., xn) − λh1 (x1, x2, ..., xn)} We want a λ = λ∗ optimal If the minimum of L (x1, x2, ..., xn, λ∗ ) occurs at (x1, x2, ..., xn)T = (x1, x2, ..., xn)T∗ and (x1, x2, ..., xn) T ∗ satisfies h1 (x1, x2, ..., xn) = 0, then (x1, x2, ..., xn) T ∗ minimizes: min f (x1, x2, ..., xn) s.t h1 (x1, x2, ..., xn) = 0 Trick It is to find appropriate value for Lagrangian multiplier λ. 32 / 124
  • 77. Finding an Optimum using Lagrange Multipliers New problem min L (x1, x2, ..., xn, λ) = min {f (x1, x2, ..., xn) − λh1 (x1, x2, ..., xn)} We want a λ = λ∗ optimal If the minimum of L (x1, x2, ..., xn, λ∗ ) occurs at (x1, x2, ..., xn)T = (x1, x2, ..., xn)T∗ and (x1, x2, ..., xn) T ∗ satisfies h1 (x1, x2, ..., xn) = 0, then (x1, x2, ..., xn) T ∗ minimizes: min f (x1, x2, ..., xn) s.t h1 (x1, x2, ..., xn) = 0 Trick It is to find appropriate value for Lagrangian multiplier λ. 32 / 124
  • 78. Finding an Optimum using Lagrange Multipliers New problem min L (x1, x2, ..., xn, λ) = min {f (x1, x2, ..., xn) − λh1 (x1, x2, ..., xn)} We want a λ = λ∗ optimal If the minimum of L (x1, x2, ..., xn, λ∗ ) occurs at (x1, x2, ..., xn)T = (x1, x2, ..., xn)T∗ and (x1, x2, ..., xn) T ∗ satisfies h1 (x1, x2, ..., xn) = 0, then (x1, x2, ..., xn) T ∗ minimizes: min f (x1, x2, ..., xn) s.t h1 (x1, x2, ..., xn) = 0 Trick It is to find appropriate value for Lagrangian multiplier λ. 32 / 124
  • 79. Remember Think about this Remember First Law of Newton!!! Yes!!! 33 / 124
  • 80. Remember Think about this Remember First Law of Newton!!! Yes!!! A system in equilibrium does not move Static Body 33 / 124
  • 81. Lagrange Multipliers Definition Gives a set of necessary conditions to identify optimal points of equality constrained optimization problem 34 / 124
  • 82. Lagrange was a Physicists He was thinking in the following formula A system in equilibrium has the following equation: F1 + F2 + ... + FK = 0 (10) But functions do not have forces? Are you sure? Think about the following The Gradient of a surface. 35 / 124
  • 83. Lagrange was a Physicists He was thinking in the following formula A system in equilibrium has the following equation: F1 + F2 + ... + FK = 0 (10) But functions do not have forces? Are you sure? Think about the following The Gradient of a surface. 35 / 124
  • 84. Lagrange was a Physicists He was thinking in the following formula A system in equilibrium has the following equation: F1 + F2 + ... + FK = 0 (10) But functions do not have forces? Are you sure? Think about the following The Gradient of a surface. 35 / 124
  • 85. Gradient to a Surface After all a gradient is a measure of the maximal change For example the gradient of a function of three variables: f (x) = i ∂f (x) ∂x + j ∂f (x) ∂y + k ∂f (x) ∂z (11) where i, j and k are unitary vectors in the directions x, y and z. 36 / 124
  • 86. Example We have f (x, y) = x exp {−x2 − y2 } 37 / 124
  • 87. Example With Gradient at the the contours when projecting in the 2D plane 38 / 124
  • 88. Now, Think about this Yes, we can use the gradient However, we need to do some scaling of the forces by using parameters λ Thus, we have F0 + λ1F1 + ... + λKFK = 0 (12) where F0 is the gradient of the principal cost function and Fi for i = 1, 2, .., K. 39 / 124
  • 89. Now, Think about this Yes, we can use the gradient However, we need to do some scaling of the forces by using parameters λ Thus, we have F0 + λ1F1 + ... + λKFK = 0 (12) where F0 is the gradient of the principal cost function and Fi for i = 1, 2, .., K. 39 / 124
  • 90. Thus If we have the following optimization: min f (x) s.tg1 (x) = 0 g2 (x) = 0 40 / 124
  • 91. Geometric interpretation in the case of minimization What is wrong? Gradients are going in the other direction, we can fix by simple multiplying by -1 Here the cost function is f (x, y) = x exp −x2 − y2 we want to minimize f (−→x ) g1 (−→x ) g2 (−→x ) −∇f (−→x ) + λ1∇g1 (−→x ) + λ2∇g2 (−→x ) = 0 Nevertheless: it is equivalent to f −→x − λ1 g1 −→x − λ2 g2 −→x = 0 40 / 124
  • 92. Geometric interpretation in the case of minimization What is wrong? Gradients are going in the other direction, we can fix by simple multiplying by -1 Here the cost function is f (x, y) = x exp −x2 − y2 we want to minimize f (−→x ) g1 (−→x ) g2 (−→x ) −∇f (−→x ) + λ1∇g1 (−→x ) + λ2∇g2 (−→x ) = 0 Nevertheless: it is equivalent to f −→x − λ1 g1 −→x − λ2 g2 −→x = 0 40 / 124
  • 93. Outline 1 History The Beginning 2 Separable Classes Separable Classes Hyperplanes 3 Support Vectors Support Vectors Quadratic Optimization Lagrange Multipliers Method Karush-Kuhn-Tucker Conditions Primal-Dual Problem for Lagrangian Properties 4 Kernel Kernel Idea Higher Dimensional Space Examples Now, How to select a Kernel? 5 Soft Margins Introduction The Soft Margin Solution 6 More About Kernels Basic Idea From Inner products to Kernels 41 / 124
  • 94. Method Steps 1 Original problem is rewritten as: 1 minimize L (x, λ) = f (x) − λh1 (x) 2 Take derivatives of L (x, λ) with respect to xi and set them equal to zero. 3 Express all xi in terms of Lagrangian multiplier λ. 4 Plug x in terms of λ in constraint h1 (x) = 0 and solve λ. 5 Calculate x by using the just found value for λ. From the step 2 If there are n variables (i.e., x1, · · · , xn) then you will get n equations with n + 1 unknowns (i.e., n variables xi and one Lagrangian multiplier λ). 42 / 124
  • 95. Method Steps 1 Original problem is rewritten as: 1 minimize L (x, λ) = f (x) − λh1 (x) 2 Take derivatives of L (x, λ) with respect to xi and set them equal to zero. 3 Express all xi in terms of Lagrangian multiplier λ. 4 Plug x in terms of λ in constraint h1 (x) = 0 and solve λ. 5 Calculate x by using the just found value for λ. From the step 2 If there are n variables (i.e., x1, · · · , xn) then you will get n equations with n + 1 unknowns (i.e., n variables xi and one Lagrangian multiplier λ). 42 / 124
  • 96. Example We can apply that to the following problem min f (x, y) = x2 − 8x + y2 − 12y + 48 s.t x + y = 8 43 / 124
  • 97. Then, Rewriting The Optimization Problem The optimization with equality constraints minwΦ (w) = 1 2wT w s.t. di(wT xi + w0) ≥ 1 i = 1, · · · , N 44 / 124
  • 98. Then, for our problem Using the Lagrange Multipliers (We will call them αi) We obtain the following cost function that we want to minimize J(w, w_0, α) = (1/2) w^T w − Σ_{i=1}^N α_i [d_i(w^T x_i + w_0) − 1] Observation Minimize with respect to w and w_0. Maximize with respect to α because it dominates − Σ_{i=1}^N α_i [d_i(w^T x_i + w_0) − 1]. (13) 45 / 124
  • 102. Saddle Point? At the left the original problem, at the right the Lagrangian!!! (figure: contours of f(x) with the constraints g_1(x) and g_2(x)) 46 / 124
  • 103. Outline 1 History The Beginning 2 Separable Classes Separable Classes Hyperplanes 3 Support Vectors Support Vectors Quadratic Optimization Lagrange Multipliers Method Karush-Kuhn-Tucker Conditions Primal-Dual Problem for Lagrangian Properties 4 Kernel Kernel Idea Higher Dimensional Space Examples Now, How to select a Kernel? 5 Soft Margins Introduction The Soft Margin Solution 6 More About Kernels Basic Idea From Inner products to Kernels 47 / 124
  • 104. Karush-Kuhn-Tucker Conditions First An Equality Constrained Problem P min f(x) s.t. g_1(x) = 0, ..., g_N(x) = 0 A really minimal version!!! Hey, it is a patch work!!! A point x is a local minimum of an equality constrained problem P only if a set of non-negative αj’s may be found such that: ∇L(x, α) = ∇f(x) − Σ_{i=1}^N α_i ∇g_i(x) = 0 48 / 124
  • 106. Karush-Kuhn-Tucker Conditions Important Think about this: each constraint corresponds to a sample in both classes, thus The corresponding αi’s are going to be zero after optimization if a constraint is not active, i.e. d_i(w^T x_i + w_0) − 1 > 0 (Remember Maximization). Again the Support Vectors This actually defines the idea of support vectors!!! Thus Only the αi’s with active constraints (Support Vectors) will be different from zero, i.e. when d_i(w^T x_i + w_0) − 1 = 0. 49 / 124
  • 109. A small deviation from the SVM’s for the sake of Vox Populi Theorem (Karush-Kuhn-Tucker Necessary Conditions) Let X be a non-empty open set in R^n, and let f : R^n → R and g_i : R^n → R for i = 1, ..., m. Consider the problem P of minimizing f(x) subject to x ∈ X and g_i(x) ≤ 0, i = 1, ..., m. Let x be a feasible solution, and denote I = {i | g_i(x) = 0}. Suppose that f and g_i for i ∈ I are differentiable at x and that the g_i for i ∉ I are continuous at x. Furthermore, suppose that the gradients ∇g_i(x) for i ∈ I are linearly independent. If x solves problem P locally, there exist scalars u_i for i ∈ I such that ∇f(x) + Σ_{i∈I} u_i ∇g_i(x) = 0, u_i ≥ 0 for i ∈ I 50 / 124
  • 110. It is more... In addition to the above assumptions If g_i for each i ∉ I is also differentiable at x, the previous conditions can be written in the following equivalent form: ∇f(x) + Σ_{i=1}^m u_i ∇g_i(x) = 0, u_i g_i(x) = 0 for i = 1, ..., m, u_i ≥ 0 for i = 1, ..., m 51 / 124
  • 111. The necessary conditions for optimality We use the previous theorem on J(w, w_0, α) = (1/2) w^T w − Σ_{i=1}^N α_i [d_i(w^T x_i + w_0) − 1] (14) Condition 1 ∂J(w, w_0, α)/∂w = 0 Condition 2 ∂J(w, w_0, α)/∂w_0 = 0 52 / 124
  • 114. Using the conditions We have the first condition ∂J(w, w_0, α)/∂w = ∂[(1/2) w^T w]/∂w − ∂[Σ_{i=1}^N α_i (d_i(w^T x_i + w_0) − 1)]/∂w = 0, that is ∂J(w, w_0, α)/∂w = (1/2)(w + w) − Σ_{i=1}^N α_i d_i x_i = 0 Thus w = Σ_{i=1}^N α_i d_i x_i (15) 53 / 124
  • 117. In a similar way ... We have by the second optimality condition Σ_{i=1}^N α_i d_i = 0 Note α_i [d_i(w^T x_i + w_0) − 1] = 0 Because the constraint vanishes at the optimal solution, i.e. α_i = 0 or d_i(w^T x_i + w_0) − 1 = 0. 54 / 124
  • 119. Thus We need something extra Our classic trick of transforming a problem into another problem In this case We use the Primal-Dual Problem for Lagrangian Where We move from a minimization to a maximization!!! 55 / 124
  • 122. Outline 1 History The Beginning 2 Separable Classes Separable Classes Hyperplanes 3 Support Vectors Support Vectors Quadratic Optimization Lagrange Multipliers Method Karush-Kuhn-Tucker Conditions Primal-Dual Problem for Lagrangian Properties 4 Kernel Kernel Idea Higher Dimensional Space Examples Now, How to select a Kernel? 5 Soft Margins Introduction The Soft Margin Solution 6 More About Kernels Basic Idea From Inner products to Kernels 56 / 124
  • 123. Lagrangian Dual Problem Consider the following nonlinear programming problem Primal Problem P min f(x) s.t. g_i(x) ≤ 0 for i = 1, ..., m, h_i(x) = 0 for i = 1, ..., l, x ∈ X Lagrange Dual Problem D max Θ(u, v) s.t. u ≥ 0, where Θ(u, v) = inf_x { f(x) + Σ_{i=1}^m u_i g_i(x) + Σ_{i=1}^l v_i h_i(x) | x ∈ X } 57 / 124
  • 125. What does this mean? Assume that the equality constraints do not exist We have then min f(x) s.t. g_i(x) ≤ 0 for i = 1, ..., m, x ∈ X Now assume that we are left with only one constraint We have then min f(x) s.t. g(x) ≤ 0, x ∈ X 58 / 124
  • 127. What does this mean? First, we have the following figure (the set G in the (y, z) plane, together with its supporting lines and their slopes) 59 / 124
  • 128. What does this mean? Thus in the y − z plane you have G = {(y, z) | y = g(x), z = f(x) for some x ∈ X} (16) Thus Given u ≥ 0, we need to minimize f(x) + u g(x) to find θ(u) - Equivalent to ∇f(x) + u ∇g(x) = 0 60 / 124
  • 130. What does this mean? Thus in the y − z plane, we have z + uy = α (17), a line with slope −u. Then, to minimize z + uy = α We need to move the line z + uy = α parallel to itself as far down as possible, along its negative gradient, while remaining in contact with G. 61 / 124
  • 132. In other words Move the line parallel to itself until it supports G (figure) Note The set G lies above the line and touches it. 62 / 124
  • 133. Thus Then, the problem is to find the slope of the supporting hyperplane for G. Its intersection with the z-axis Gives θ(u) 63 / 124
  • 135. Again We can see θ(u) in the figure as the intercept of the supporting line on the z-axis 64 / 124
  • 136. Thus The dual problem is equivalent to Finding the slope of the supporting hyperplane such that its intercept on the z-axis is maximal 65 / 124
  • 137. Or Such a hyperplane has slope −u and supports G at (y, z) (figure) Remark: The optimal solution is u and the optimal dual objective is z. 66 / 124
  • 139. For more on this Please!!! Look at this book From “Nonlinear Programming: Theory and Algorithms” by Mokhtar S. Bazaraa, and C. M. Shetty. Wiley, New York, (2006) At Page 260. 67 / 124
  • 140. Example (Lagrange Dual) Primal min x_1² + x_2² s.t. −x_1 − x_2 + 4 ≤ 0, x_1, x_2 ≥ 0 Lagrange Dual Θ(u) = inf { x_1² + x_2² + u(−x_1 − x_2 + 4) | x_1, x_2 ≥ 0 } 68 / 124
  • 142. Solution Derive with respect to x_1 and x_2 We have two cases to take into account: u ≥ 0 and u < 0 The first case is clear (the infimum is attained at x_1 = x_2 = u/2) What about when u < 0? (the infimum is attained at x_1 = x_2 = 0) We have that θ(u) = −(1/2)u² + 4u if u ≥ 0, and θ(u) = 4u if u < 0 (18) 69 / 124
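As a numerical sanity check (a sketch on assumed tooling, not part of the slides), maximizing θ(u) confirms strong duality for this example: the dual optimum u* = 4 with θ(u*) = 8 matches the primal optimum x_1 = x_2 = 2 with objective value 8.

```python
# Sketch: evaluate and maximize the dual function theta(u) of the example above.
import numpy as np

def theta(u):
    # theta(u) = inf_{x >= 0} x1^2 + x2^2 + u * (-x1 - x2 + 4)
    # For u >= 0 the infimum is at x1 = x2 = u/2; for u < 0 it is at x1 = x2 = 0.
    return -0.5 * u**2 + 4.0 * u if u >= 0 else 4.0 * u

us = np.linspace(-2.0, 8.0, 1001)
vals = np.array([theta(u) for u in us])

print(us[vals.argmax()], vals.max())   # approximately u* = 4, theta(u*) = 8
# Matches the primal optimum x1 = x2 = 2 with objective value 8 (strong duality).
```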
  • 145. Outline 1 History The Beginning 2 Separable Classes Separable Classes Hyperplanes 3 Support Vectors Support Vectors Quadratic Optimization Lagrange Multipliers Method Karush-Kuhn-Tucker Conditions Primal-Dual Problem for Lagrangian Properties 4 Kernel Kernel Idea Higher Dimensional Space Examples Now, How to select a Kernel? 5 Soft Margins Introduction The Soft Margin Solution 6 More About Kernels Basic Idea From Inner products to Kernels 70 / 124
  • 146. Duality Theorem First Property If the Primal has an optimal solution, the dual does too. Thus In order for w* and α* to be optimal solutions for the primal and dual problem respectively, It is necessary and sufficient that w*: It is feasible for the primal problem and Φ(w*) = J(w*, w_0*, α*) = min_w J(w, w_0, α*) 71 / 124
  • 148. Reformulate our Equations We have then J(w, w_0, α) = (1/2) w^T w − Σ_{i=1}^N α_i d_i w^T x_i − w_0 Σ_{i=1}^N α_i d_i + Σ_{i=1}^N α_i Now, using our 2nd optimality condition, J(w, w_0, α) = (1/2) w^T w − Σ_{i=1}^N α_i d_i w^T x_i + Σ_{i=1}^N α_i 72 / 124
  • 150. We have finally for the 1st Optimality Condition: First w^T w = Σ_{i=1}^N α_i d_i w^T x_i = Σ_{i=1}^N Σ_{j=1}^N α_i α_j d_i d_j x_j^T x_i Second, setting J(w, w_0, α) = Q(α), Q(α) = Σ_{i=1}^N α_i − (1/2) Σ_{i=1}^N Σ_{j=1}^N α_i α_j d_i d_j x_j^T x_i 73 / 124
  • 152. From here, we have the problem This is the problem that we really solve Given the training sample {(x_i, d_i)}_{i=1}^N, find the Lagrange multipliers {α_i}_{i=1}^N that maximize the objective function Q(α) = Σ_{i=1}^N α_i − (1/2) Σ_{i=1}^N Σ_{j=1}^N α_i α_j d_i d_j x_j^T x_i subject to the constraints Σ_{i=1}^N α_i d_i = 0 (19) and α_i ≥ 0 for i = 1, · · · , N (20) Note In the Primal, we were minimizing the cost function over w and w_0, which requires maximizing over α. That is the reason why we are maximizing Q(α). 74 / 124
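This dual is a quadratic program, so a generic QP solver is enough for the separable case. Below is a minimal sketch that assumes the cvxopt package (the slides do not prescribe any solver); it also recovers w via (Eq. 15) and the bias from the support vectors, anticipating the next slide.

```python
# Sketch of the hard-margin dual QP (assumed tooling: cvxopt; any QP solver works):
#   minimize (1/2) a^T P a - 1^T a   with P_ij = d_i d_j x_i^T x_j
#   s.t.     -a_i <= 0  and  d^T a = 0
import numpy as np
from cvxopt import matrix, solvers

def hard_margin_svm(X, d):
    """X: (N, l) array of samples, d: (N,) array of labels in {-1, +1}."""
    N = X.shape[0]
    K = X @ X.T                                    # Gram matrix x_i^T x_j
    P = matrix(np.outer(d, d) * K)
    q = matrix(-np.ones(N))
    G = matrix(-np.eye(N))                         # encodes alpha_i >= 0
    h = matrix(np.zeros(N))
    A = matrix(d.reshape(1, -1).astype(float))     # encodes sum_i alpha_i d_i = 0
    b = matrix(0.0)

    solvers.options['show_progress'] = False
    alpha = np.ravel(solvers.qp(P, q, G, h, A, b)['x'])

    sv = alpha > 1e-6                              # support vectors: alpha_i > 0
    w = (alpha[sv] * d[sv]) @ X[sv]                # Eq. 15: w = sum alpha_i d_i x_i
    w0 = np.mean(d[sv] - X[sv] @ w)                # from d_s (w^T x_s + w0) = 1
    return w, w0, alpha
```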
  • 154. Solving for α We can compute w* once we get the optimal α_i* by using (Eq. 15) w* = Σ_{i=1}^N α_i* d_i x_i In addition, we can compute the optimal bias w_0* using the optimal weight w* For this, we use the positive margin equation: g(x^(s)) = w^T x^(s) + w_0 = 1, corresponding to a positive support vector. Then w_0 = 1 − (w*)^T x^(s) for d^(s) = 1 (21) 75 / 124
  • 157. Outline 1 History The Beginning 2 Separable Classes Separable Classes Hyperplanes 3 Support Vectors Support Vectors Quadratic Optimization Lagrange Multipliers Method Karush-Kuhn-Tucker Conditions Primal-Dual Problem for Lagrangian Properties 4 Kernel Kernel Idea Higher Dimensional Space Examples Now, How to select a Kernel? 5 Soft Margins Introduction The Soft Margin Solution 6 More About Kernels Basic Idea From Inner products to Kernels 76 / 124
  • 158. What do we need? Until now, we have only a maximal margin algorithm All this works fine when the classes are separable Problem: what happens when they are not separable? What can we do? 77 / 124
  • 161. Outline 1 History The Beginning 2 Separable Classes Separable Classes Hyperplanes 3 Support Vectors Support Vectors Quadratic Optimization Lagrange Multipliers Method Karush-Kuhn-Tucker Conditions Primal-Dual Problem for Lagrangian Properties 4 Kernel Kernel Idea Higher Dimensional Space Examples Now, How to select a Kernel? 5 Soft Margins Introduction The Soft Margin Solution 6 More About Kernels Basic Idea From Inner products to Kernels 78 / 124
  • 162. Map to a higher Dimensional Space Assume that there exists a mapping x ∈ R^l → y ∈ R^k Then, it is possible to define the following mapping 79 / 124
  • 164. Define a map to a higher Dimension Nonlinear transformations Given a series of nonlinear transformations {φ_i(x)}_{i=1}^m from the input space to the feature space, We can define the decision surface as Σ_{i=1}^m w_i φ_i(x) + w_0 = 0 80 / 124
  • 166. This allows us to define The following vector φ(x) = (φ_0(x), φ_1(x), · · · , φ_m(x))^T that represents the mapping. From this mapping We can define the following kernel function K : X × X → R, K(x_i, x_j) = φ(x_i)^T φ(x_j) 81 / 124
  • 168. Outline 1 History The Beginning 2 Separable Classes Separable Classes Hyperplanes 3 Support Vectors Support Vectors Quadratic Optimization Lagrange Multipliers Method Karush-Kuhn-Tucker Conditions Primal-Dual Problem for Lagrangian Properties 4 Kernel Kernel Idea Higher Dimensional Space Examples Now, How to select a Kernel? 5 Soft Margins Introduction The Soft Margin Solution 6 More About Kernels Basic Idea From Inner products to Kernels 82 / 124
  • 169. Example Assume x ∈ R² → y = (x_1², √2 x_1 x_2, x_2²)^T We can show that y_i^T y_j = (x_i^T x_j)² 83 / 124
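A quick numerical check of this identity (a sketch assuming NumPy; not part of the slides):

```python
# Sketch: verify that phi(x) = (x1^2, sqrt(2) x1 x2, x2^2) satisfies
# phi(x_i)^T phi(x_j) = (x_i^T x_j)^2.
import numpy as np

def phi(x):
    return np.array([x[0]**2, np.sqrt(2.0) * x[0] * x[1], x[1]**2])

rng = np.random.default_rng(0)
xi, xj = rng.normal(size=2), rng.normal(size=2)

print(phi(xi) @ phi(xj))    # feature-space inner product
print((xi @ xj) ** 2)       # squared input-space inner product: the same value
```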
  • 171. Example of Kernels Polynomials k(x, z) = (x^T z + 1)^q, q > 0 Radial Basis Functions k(x, z) = exp(−||x − z||² / σ²) Hyperbolic Tangents k(x, z) = tanh(β x^T z + γ) 84 / 124
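For concreteness, here is a minimal sketch of these three kernels as plain functions (an illustration with assumed parameter defaults, not code from the slides):

```python
# Sketch: the three kernels listed above.
import numpy as np

def polynomial_kernel(x, z, q=2):
    return (x @ z + 1.0) ** q                     # (x^T z + 1)^q, q > 0

def rbf_kernel(x, z, sigma=1.0):
    return np.exp(-np.sum((x - z) ** 2) / sigma**2)

def tanh_kernel(x, z, beta=1.0, gamma=-1.0):
    return np.tanh(beta * (x @ z) + gamma)        # a valid kernel only for some (beta, gamma)
```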
  • 174. Outline 1 History The Beginning 2 Separable Classes Separable Classes Hyperplanes 3 Support Vectors Support Vectors Quadratic Optimization Lagrange Multipliers Method Karush-Kuhn-Tucker Conditions Primal-Dual Problem for Lagrangian Properties 4 Kernel Kernel Idea Higher Dimensional Space Examples Now, How to select a Kernel? 5 Soft Margins Introduction The Soft Margin Solution 6 More About Kernels Basic Idea From Inner products to Kernels 85 / 124
  • 175. Now, How to select a Kernel? We have a problem Selecting a specific kernel and its parameters is usually done in a trial-and-error manner. Thus In general, the Radial Basis Functions kernel is a reasonable first choice. Then, if this fails, we can try the other possible kernels. 86 / 124
  • 178. Thus, we have something like this Step 1 Normalize the data. Step 2 Use cross-validation to adjust the parameters of the selected kernel. Step 3 Train against the entire dataset. 87 / 124
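The three steps map naturally onto a standard cross-validation pipeline. The sketch below assumes scikit-learn, a synthetic dataset, and a small parameter grid purely for illustration; the slides do not prescribe any library, grid, or dataset.

```python
# Sketch (assumed tooling): normalize, cross-validate the RBF kernel parameters,
# then train on the entire dataset.
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=200, n_features=5, random_state=0)

pipe = Pipeline([("scale", StandardScaler()),      # Step 1: normalize the data
                 ("svm", SVC(kernel="rbf"))])
grid = {"svm__C": [0.1, 1, 10, 100],               # Step 2: cross-validate parameters
        "svm__gamma": [0.01, 0.1, 1.0]}

search = GridSearchCV(pipe, grid, cv=5)
search.fit(X, y)                                   # Step 3: the best model is refit on all data
print(search.best_params_, search.best_score_)
```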
  • 181. Outline 1 History The Beginning 2 Separable Classes Separable Classes Hyperplanes 3 Support Vectors Support Vectors Quadratic Optimization Lagrange Multipliers Method Karush-Kuhn-Tucker Conditions Primal-Dual Problem for Lagrangian Properties 4 Kernel Kernel Idea Higher Dimensional Space Examples Now, How to select a Kernel? 5 Soft Margins Introduction The Soft Margin Solution 6 More About Kernels Basic Idea From Inner products to Kernels 88 / 124
  • 182. Optimal Hyperplane for non-separable patterns Important We have been considering only problems where the classes are linearly separable. Now What happens when the patterns are not separable? Thus, we can still build a separating hyperplane But errors will happen in the classification... We need to minimize them... 89 / 124
  • 185. What if the following happens Some data points invade the “margin” space (figure: optimal hyperplane with a data point violating the property) 90 / 124
  • 186. Fixing the Problem - Corinna’s Style The margin of separation between classes is said to be soft if a data point (x_i, d_i) violates the following condition d_i(w^T x_i + b) ≥ +1, i = 1, 2, ..., N (22) This violation can arise in one of two ways The data point (x_i, d_i) falls inside the region of separation but on the right side of the decision surface - still a correct classification. 91 / 124
  • 188. We have then Example (figure: optimal hyperplane with a data point inside the margin but on the correct side) 92 / 124
  • 189. Or... This violation can arise in one of two ways The data point (x_i, d_i) falls on the wrong side of the decision surface - an incorrect classification. Example (figure: optimal hyperplane with a misclassified data point) 93 / 124
  • 191. Solving the problem What to do? We introduce a set of nonnegative scalar values {ξ_i}_{i=1}^N. Introduce this into the decision rule d_i(w^T x_i + b) ≥ 1 − ξ_i, i = 1, 2, ..., N (23) 94 / 124
  • 194. The ξi are called slack variables What? In 1995, Corinna Cortes and Vladimir N. Vapnik suggested a modified maximum margin idea that allows for mislabeled examples. Ok!!! Instead of expecting a constant margin for all the samples, the margin can change depending on the sample. What do we have? ξ_i measures the deviation of a data point from the ideal condition of pattern separability. 95 / 124
  • 197. Properties of ξi What if? You have 0 ≤ ξ_i ≤ 1 We have (figure: the data point falls inside the margin but on the correct side of the optimal hyperplane) 96 / 124
  • 199. Properties of ξi What if? You have ξ_i > 1 We have (figure: the data point falls on the wrong side of the optimal hyperplane - a misclassification) 97 / 124
  • 201. Support Vectors We want Support vectors are those that satisfy equation (Eq. 23) with equality even when ξ_i > 0: d_i(w^T x_i + b) = 1 − ξ_i, i = 1, 2, ..., N 98 / 124
  • 202. We want the following We want to find a hyperplane Such that the average error over all the samples, (1/N) Σ_{i=1}^N e_i² (24), is minimized 99 / 124
  • 203. First Attempt Into Minimization We can try the following Given I(x) = 0 if x ≤ 0, 1 if x > 0 (25) Minimize the following Φ(ξ) = Σ_{i=1}^N I(ξ_i − 1) (26) with respect to the weight vector w, subject to 1 d_i(w^T x_i + b) ≥ 1 − ξ_i, i = 1, 2, ..., N 2 ||w||² ≤ C for a given C. 100 / 124
  • 205. Problem Using this first attempt Minimization of Φ(ξ) with respect to w is a non-convex optimization problem that is NP-complete. Thus, we need to use an approximation, maybe Φ(ξ) = Σ_{i=1}^N ξ_i (27) Now, we simplify the computations by incorporating the weight vector w Φ(w, ξ) = (1/2) w^T w + C Σ_{i=1}^N ξ_i (28) 101 / 124
  • 208. Important First Minimizing the first term in (Eq. 28) is related to minimizing the Vapnik–Chervonenkis dimension, Which is a measure of the capacity (complexity, expressive power, richness, or flexibility) of a statistical classification algorithm. Second The second term Σ_{i=1}^N ξ_i is an upper bound on the number of test errors. 102 / 124
  • 211. Some problems for the Parameter C Little Problem The parameter C has to be selected by the user. This can be done in two ways 1 The parameter C is determined experimentally via the standard use of a training / validation (test) set. 2 It is determined analytically by estimating the Vapnik–Chervonenkis dimension. 103 / 124
  • 214. Primal Problem Problem, given samples {(x_i, d_i)}_{i=1}^N min_{w,ξ} Φ(w, ξ) = min_{w,ξ} (1/2) w^T w + C Σ_{i=1}^N ξ_i s.t. d_i(w^T x_i + w_0) ≥ 1 − ξ_i for i = 1, · · · , N and ξ_i ≥ 0 for all i, With C a user-specified positive parameter. 104 / 124
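For intuition, note that at the optimum ξ_i = max(0, 1 − d_i(w^T x_i + w_0)), so this primal is equivalent to an unconstrained hinge-loss objective. The sketch below (an illustration, not the slides' algorithm) minimizes that form with plain subgradient descent.

```python
# Sketch: soft-margin primal via the equivalent hinge-loss form
#   min_{w, w0}  (1/2) ||w||^2 + C * sum_i max(0, 1 - d_i (w^T x_i + w0))
import numpy as np

def soft_margin_primal(X, d, C=1.0, lr=1e-3, epochs=500):
    N, l = X.shape
    w, w0 = np.zeros(l), 0.0
    for _ in range(epochs):
        margins = d * (X @ w + w0)
        viol = margins < 1.0                    # samples with positive slack
        grad_w = w - C * (d[viol] @ X[viol])    # subgradient with respect to w
        grad_w0 = -C * d[viol].sum()            # subgradient with respect to w0
        w -= lr * grad_w
        w0 -= lr * grad_w0
    return w, w0
```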
  • 215. Outline 1 History The Beginning 2 Separable Classes Separable Classes Hyperplanes 3 Support Vectors Support Vectors Quadratic Optimization Lagrange Multipliers Method Karush-Kuhn-Tucker Conditions Primal-Dual Problem for Lagrangian Properties 4 Kernel Kernel Idea Higher Dimensional Space Examples Now, How to select a Kernel? 5 Soft Margins Introduction The Soft Margin Solution 6 More About Kernels Basic Idea From Inner products to Kernels 105 / 124
  • 216. Final Setup Using Lagrange Multipliers and the primal-dual method it is possible to obtain the following setup Given the training sample {(x_i, d_i)}_{i=1}^N, find the Lagrange multipliers {α_i}_{i=1}^N that maximize the objective function Q(α) = Σ_{i=1}^N α_i − (1/2) Σ_{i=1}^N Σ_{j=1}^N α_i α_j d_i d_j x_j^T x_i subject to the constraints Σ_{i=1}^N α_i d_i = 0 (29) and 0 ≤ α_i ≤ C for i = 1, · · · , N (30), where C is a user-specified positive parameter. 106 / 124
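Relative to the hard-margin QP sketched earlier, the only change in a generic QP encoding is the extra upper bound α_i ≤ C. A minimal sketch of that constraint block (again assuming cvxopt, which the slides do not mention):

```python
# Sketch: inequality block encoding 0 <= alpha_i <= C for the soft-margin dual.
import numpy as np
from cvxopt import matrix

def box_constraints(N, C):
    G = matrix(np.vstack([-np.eye(N),          # -alpha_i <= 0
                           np.eye(N)]))        #  alpha_i <= C
    h = matrix(np.hstack([np.zeros(N),
                          C * np.ones(N)]))
    return G, h                                # pass to solvers.qp in place of (G, h)
```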
  • 217. Remarks Something Notable Note that neither the slack variables nor their Lagrange multipliers appear in the dual problem. The dual problem for the case of non-separable patterns is thus similar to that for the simple case of linearly separable patterns The only big difference Instead of using the constraint α_i ≥ 0, the new problem uses the more stringent constraint 0 ≤ α_i ≤ C. Note the following ξ_i = 0 if α_i < C (31) 107 / 124
  • 221. Finally The optimal solution for the weight vector w* w* = Σ_{i=1}^{N_s} α_i* d_i x_i, Where N_s is the number of support vectors. In addition The determination of the optimum values proceeds in a manner similar to that described before. The KKT conditions are as follows α_i [d_i(w^T x_i + w_0) − 1 + ξ_i] = 0 for i = 1, 2, ..., N, and μ_i ξ_i = 0 for i = 1, 2, ..., N. 108 / 124
  • 225. Where... The μi are Lagrange multipliers They are used to enforce the non-negativity of the slack variables ξ_i for all i. Something Notable At the saddle point, the derivative with respect to ξ_i of the Lagrangian function for the primal problem, (1/2) w^T w + C Σ_{i=1}^N ξ_i − Σ_{i=1}^N α_i [d_i(w^T x_i + w_0) − 1 + ξ_i] − Σ_{i=1}^N μ_i ξ_i (32), is zero. 109 / 124
  • 227. Thus We get α_i + μ_i = C (33) Thus, if α_i < C Then μ_i > 0 ⇒ ξ_i = 0 We may determine w_0 Using any data point (x_i, d_i) in the training set such that 0 < α_i* < C. Then, given ξ_i = 0, w_0* = (1/d_i) − (w*)^T x_i (34) 110 / 124
  • 230. Nevertheless It is better To take the mean value of w_0* from all such data points in the training sample (Burges, 1998). BTW He wrote a great tutorial on SVM’s, “A Tutorial on Support Vector Machines for Pattern Recognition” (1998). 111 / 124
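A tiny sketch of that recommendation (helper and variable names are assumptions, not from the slides): average Eq. 34 over all margin support vectors, i.e. the points with 0 < α_i < C, for which ξ_i = 0.

```python
# Sketch: average w0 over the margin support vectors (0 < alpha_i < C).
import numpy as np

def bias_from_margin_svs(alpha, X, d, w, C, tol=1e-6):
    on_margin = (alpha > tol) & (alpha < C - tol)
    # Eq. 34 for each such point: w0 = 1/d_i - w^T x_i  (with d_i in {-1, +1})
    return np.mean(1.0 / d[on_margin] - X[on_margin] @ w)
```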
  • 231. Outline 1 History The Beginning 2 Separable Classes Separable Classes Hyperplanes 3 Support Vectors Support Vectors Quadratic Optimization Lagrange Multipliers Method Karush-Kuhn-Tucker Conditions Primal-Dual Problem for Lagrangian Properties 4 Kernel Kernel Idea Higher Dimensional Space Examples Now, How to select a Kernel? 5 Soft Margins Introduction The Soft Margin Solution 6 More About Kernels Basic Idea From Inner products to Kernels 112 / 124
  • 232. Basic Idea Something Notable The SVM uses the scalar product ⟨x_i, x_j⟩ as a measure of similarity between x_i and x_j, and of distance to the hyperplane. Since the scalar product is linear, the SVM is a linear method. But Using a nonlinear function instead, we can make the classifier nonlinear. 113 / 124
  • 235. We do this by defining the following map Nonlinear transformations Given a series of nonlinear transformations {φ_i(x)}_{i=1}^m from the input space to the feature space, We can define the decision surface as Σ_{i=1}^m w_i φ_i(x) + w_0 = 0. 114 / 124
  • 237. This allows us to define The following vector φ(x) = (φ_0(x), φ_1(x), · · · , φ_m(x))^T That represents the mapping. 115 / 124
  • 238. Outline 1 History The Beginning 2 Separable Classes Separable Classes Hyperplanes 3 Support Vectors Support Vectors Quadratic Optimization Lagrange Multipliers Method Karush-Kuhn-Tucker Conditions Primal-Dual Problem for Lagrangian Properties 4 Kernel Kernel Idea Higher Dimensional Space Examples Now, How to select a Kernel? 5 Soft Margins Introduction The Soft Margin Solution 6 More About Kernels Basic Idea From Inner products to Kernels 116 / 124
  • 239. Finally We define the decision surface as w^T φ(x) = 0 (35) We now seek "linear" separability of features, so we may write w = Σ_{i=1}^N α_i d_i φ(x_i) (36) Thus, we finish with the following decision surface Σ_{i=1}^N α_i d_i φ^T(x_i) φ(x) = 0 (37) 117 / 124
  • 242. Thus The term φ^T(x_i) φ(x) It represents the inner product of the two vectors induced in the feature space by the input patterns. We can introduce the inner-product kernel K(x_i, x) = φ^T(x_i) φ(x) = Σ_{j=0}^m φ_j(x_i) φ_j(x) (38) Property: Symmetry K(x_i, x) = K(x, x_i) (39) 118 / 124
  • 245. This allows to redefine the optimal hyperplane We get Σ_{i=1}^N α_i d_i K(x_i, x) = 0 (40) Something Notable Using kernels, we can avoid going from: Input Space =⇒ Mapping Space =⇒ Inner Product (41) By directly going from Input Space =⇒ Inner Product (42) 119 / 124
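A minimal sketch of this kernelized decision rule (names are assumptions; the slides stop at Eq. 40): classify a new point x by the sign of Σ_i α_i d_i K(x_i, x) over the support vectors, adding a bias term if one is kept explicitly.

```python
# Sketch: kernelized decision function from Eq. 40.
import numpy as np

def decision_function(x, X_sv, d_sv, alpha_sv, kernel, w0=0.0):
    """X_sv, d_sv, alpha_sv: support vectors, their labels and multipliers."""
    k = np.array([kernel(x_i, x) for x_i in X_sv])
    return (alpha_sv * d_sv) @ k + w0

# Predicted label, e.g. with the RBF kernel sketched earlier:
# np.sign(decision_function(x_new, X_sv, d_sv, alpha_sv, rbf_kernel, w0))
```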
  • 248. Important Something Notable The expansion of (Eq. 38) for the inner-product kernel K(x_i, x) is an important special case of Mercer's theorem, which arises in functional analysis. 120 / 124
  • 249. Mercer’s Theorem Mercer’s Theorem Let K(x, x′) be a continuous symmetric kernel that is defined in the closed interval a ≤ x ≤ b and likewise for x′. The kernel K(x, x′) can be expanded in the series K(x, x′) = Σ_{i=1}^∞ λ_i φ_i(x) φ_i(x′) (43) With Positive coefficients, λ_i > 0 for all i. 121 / 124
  • 251. Mercer’s Theorem For this expansion to be valid and for it to converge absolutely and uniformly It is necessary and sufficient that the condition ∫_a^b ∫_a^b K(x, x′) ψ(x) ψ(x′) dx dx′ ≥ 0 (44) holds for all ψ such that ∫_a^b ψ²(x) dx < ∞ (an example of a quadratic norm for functions). 122 / 124
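In practice one often checks a finite-sample analogue of this condition (a sketch under that assumption; not from the slides): the Gram matrix K_ij = K(x_i, x_j) built on the training data must be symmetric positive semidefinite.

```python
# Sketch: check that a kernel's Gram matrix on data X is positive semidefinite.
import numpy as np

def is_psd_gram(X, kernel, tol=1e-10):
    K = np.array([[kernel(xi, xj) for xj in X] for xi in X])
    eigvals = np.linalg.eigvalsh(K)        # eigenvalues of the symmetric Gram matrix
    return bool(np.all(eigvals >= -tol))
```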
  • 252. Remarks First The functions φ_i(x) are called eigenfunctions of the expansion and the numbers λ_i are called eigenvalues. Second The fact that all of the eigenvalues are positive means that the kernel K(x, x′) is positive definite. 123 / 124
  • 254. Not only that We have that For λ_i = 1, the ith image √λ_i φ_i(x) induced in the feature space by the input vector x is an eigenfunction of the expansion. In theory The dimensionality of the feature space (i.e., the number of eigenvalues / eigenfunctions) can be infinitely large. 124 / 124