Support Vector Machines
Amit Praseed
October 23, 2019
Introduction
Support Vector Machines (SVMs) are arguably the best-known machine learning tool.
They are extensively used in text and image classification.
Linear Separability
[Figure: scatter plot of two classes in the x-y plane that are linearly separable]
Linear Inseparability (The XOR Problem)
[Figure: the XOR configuration of points in the x-y plane, which no single straight line can separate]
SVM for Linearly Separable Problems
For now, let us consider only linearly separable cases.
The only thing an SVM does is find the hyperplane (in n dimensions; in two dimensions it simply finds the line) that separates the two classes.
[Figure: two linearly separable classes in the x-y plane]
Which is the Optimal Hyperplane?
[Figure: the same two-class data with several candidate separating lines]
The Optimal Hyperplane
The optimal hyperplane should be maximally separated from the closest data points ("maximize the margin").
The optimal hyperplane should completely separate the data points into two perfectly pure classes.
SVM in 2 Dimensions
In two dimensions, the problem reduces to finding the optimal line separating the two data classes. The conditions for the optimal hyperplane still hold in two dimensions.
Let the line we have to find be y = ax + b. Rearranging,
ax − y + b = 0
Let X = [x, y] and W = [a, −1]; then the equation of the line becomes
W·X + b = 0
which is also the equation of a hyperplane when generalized to n dimensions. Here W is a vector perpendicular to the hyperplane.
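A quick numerical check of this rewriting (a minimal Python sketch; the specific line y = 2x + 1 and the sample points are made up for illustration):

```python
import numpy as np

# Hypothetical line y = 2x + 1, i.e. a = 2, b = 1 (numbers made up for illustration)
a, b = 2.0, 1.0
W = np.array([a, -1.0])            # W = [a, -1] is perpendicular to the line

points = np.array([[1.0, 3.0],     # on the line:  2*1 + 1 = 3
                   [1.0, 5.0],     # above the line
                   [1.0, 0.0]])    # below the line

for X in points:
    print(X, W.dot(X) + b)         # W.X + b = ax - y + b: 0 on the line,
                                   # negative above it, positive below it
```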
SVM in 2 Dimensions
Let us denote the two classes by integer labels. The SVM will output −1 for all the blue data points and +1 for all the red data points.
It can also be noted that for all blue points, which lie to the left of the separating line, W·X + b ≤ 0, while for all red points W·X + b ≥ 0.
Denote
h(x_i) = +1 if W·X_i + b ≥ 0, and −1 if W·X_i + b < 0
[Figure: blue (−1) points on one side and red (+1) points on the other side of the separating line]
SVM in 2 Dimensions
The sign of W·X + b gives the class label.
The magnitude of W·X + b gives the distance of the data point from the separating line.
Let β_i denote the distance from a data point to the separating hyperplane P. Then the closest point to the plane P has the lowest value of β. Let us denote this value as the functional margin F.
However, the distance measures on the blue side are negative while those on the red side are positive. To make them comparable, we simply multiply by the class labels, because all blue points have class label −1 and all red points have class label +1.
F = min_i β_i = min_i y_i (W·X_i + b)
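As a concrete illustration, the functional margin can be computed directly from this definition. The sketch below uses a made-up separating line and four toy points; none of these numbers come from the slides.

```python
import numpy as np

# Made-up separating line x - y = 0 (W = [1, -1], b = 0) and four toy points
W, b = np.array([1.0, -1.0]), 0.0
X = np.array([[1.0, 3.0], [2.0, 5.0],    # blue points, label -1
              [4.0, 1.0], [6.0, 2.0]])   # red points,  label +1
y = np.array([-1, -1, +1, +1])

beta = X.dot(W) + b          # signed values W.X_i + b (negative for blue, positive for red)
F = np.min(y * beta)         # functional margin: min_i y_i (W.X_i + b)
print(F)                     # 2.0 -> positive, so every point is on its correct side
```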
SVM in 2 Dimensions
From a potentially infinite number of hyperplanes, an SVM selects the one which maximizes the margin, in other words the one which is maximally separated from the closest data points of the two classes.
Put mathematically, the SVM selects the hyperplane which maximizes the functional margin.
However, the functional margin is not scale invariant.
For instance, consider two planes given by W1 = (2, 1), b1 = 5 and W2 = (20, 10), b2 = 50. They represent the same plane, because W is a vector orthogonal to the plane P and only its direction matters. However, the second choice multiplies the functional margin of every data point by a factor of 10, which makes the functional margin unreliable as an objective.
We therefore compute a scale invariant distance measure, given by
γ_i = y_i ( (W/|W|)·X_i + b/|W| )
SVM in 2 Dimensions
We define the scale invariant geometric margin as
M = min_i γ_i = min_i y_i ( (W/|W|)·X_i + b/|W| )
The SVM problem reduces to finding the hyperplane with the maximum geometric margin, or in other words,
max M   subject to γ_i ≥ M,   i = 1, 2, ..., m
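Continuing the same made-up example, the sketch below contrasts the two margins: scaling W and b by 10 describes the same plane and multiplies the functional margin by 10, but leaves the geometric margin unchanged.

```python
import numpy as np

X = np.array([[1.0, 3.0], [2.0, 5.0], [4.0, 1.0], [6.0, 2.0]])
y = np.array([-1, -1, +1, +1])

def margins(W, b):
    F = np.min(y * (X.dot(W) + b))       # functional margin
    M = F / np.linalg.norm(W)            # geometric margin M = F / |W|
    return F, M

print(margins(np.array([1.0, -1.0]), 0.0))     # (2.0, 1.414...)
print(margins(np.array([10.0, -10.0]), 0.0))   # (20.0, 1.414...) -- same plane, same M
```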
SVM in 2 Dimensions
Now, we can rewrite the above problem by recalling that M = F/|W|:
max F/|W|   subject to f_i/|W| ≥ F/|W|,   i = 1, 2, ..., m
where f_i = y_i (W·X_i + b) is the functional margin of the i-th point.
Since we are maximizing the geometric margin, which is scale invariant, we can rescale W and b so that F = 1:
max 1/|W|   subject to f_i ≥ 1,   i = 1, 2, ..., m
SVM in 2 Dimensions
We can express the same as a minimization problem:
min |W|   subject to y_i (W·X_i + b) ≥ 1,   i = 1, 2, ..., m
The final optimization problem seen in the literature is a simple modification of this:
min (1/2)|W|²   subject to y_i (W·X_i + b) − 1 ≥ 0,   i = 1, 2, ..., m
This is a convex quadratic optimization problem, which can be solved using Lagrange multipliers.
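In practice this quadratic program is rarely solved by hand. One common shortcut (an assumption here, not something the slides prescribe) is to use scikit-learn's SVC with a very large slack penalty C, which approximates a hard margin on separable data:

```python
import numpy as np
from sklearn.svm import SVC

# Toy linearly separable data (labels -1 and +1)
X = np.array([[1.0, 3.0], [2.0, 5.0], [4.0, 1.0], [6.0, 2.0]])
y = np.array([-1, -1, +1, +1])

# A very large C leaves almost no room for slack, approximating a hard margin
clf = SVC(kernel="linear", C=1e6).fit(X, y)

W, b = clf.coef_[0], clf.intercept_[0]
print(W, b)                          # learned hyperplane W.X + b = 0
print(clf.support_vectors_)          # the points that sit on the margin
print(np.min(y * (X.dot(W) + b)))    # functional margin, approximately 1 by construction
```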
Issues with Hard Margin SVMs
[Figure: Very Small Margin due to Outliers]
[Figure: Linearly Inseparable due to Outliers]
Soft Margin SVMs
Hard Margin SVMs have a strict rule that every point in the training dataset must be classified correctly.
Soft Margin SVMs allow some data points in the training set to be misclassified if that permits a wider margin and a better decision boundary.
They introduce slack variables ζ_i to allow for this, so that the constraint becomes
y_i (W·X_i + b) ≥ 1 − ζ_i,   i = 1, 2, ..., m
The complete minimization problem for a soft margin SVM becomes
min (1/2)|W|² + Σ_i ζ_i   subject to y_i (W·X_i + b) ≥ 1 − ζ_i,   i = 1, 2, ..., m
The additional term Σ_i ζ_i in the objective penalizes large slack values, which would satisfy all the constraints trivially but would allow more misclassification error.
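scikit-learn's soft margin SVC exposes this trade-off through its parameter C, which weights the slack term (the slides write the slack sum unweighted; a C in front of Σ ζ_i is the more common formulation). A rough sketch of its effect on made-up data with one awkward blue point:

```python
import numpy as np
from sklearn.svm import SVC

# Made-up data: two clean clusters plus one blue point lying close to the red side
X = np.array([[1.0, 8.0], [2.0, 9.0], [3.0, 7.0], [5.0, 4.0],   # blue (-1)
              [7.0, 2.0], [8.0, 3.0], [9.0, 1.0]])              # red  (+1)
y = np.array([-1, -1, -1, -1, +1, +1, +1])

for C in (0.01, 1.0, 100.0):
    clf = SVC(kernel="linear", C=C).fit(X, y)
    width = 2.0 / np.linalg.norm(clf.coef_[0])    # margin width 2/|W|
    errors = int((clf.predict(X) != y).sum())     # training points still misclassified
    print(f"C={C}: margin width {width:.2f}, training errors {errors}")
# Larger C penalizes slack more heavily: the margin shrinks (or stays the same)
# in exchange for tolerating fewer violations of y_i(W.X_i + b) >= 1.
```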
A Look Back at Linear Separability
[Figure: a one-dimensional dataset, plotted along the x-axis, that cannot be separated by a single threshold]
The Kernel Trick
A data set that appears to be linearly inseparable in a lower dimensional space can be made linearly separable by projecting it onto a higher dimensional space.
Instead of explicitly transforming all data points into the new higher dimensional space, an SVM uses a kernel that computes inner products between the data points as if they had already been mapped to the higher dimensional space. This is often called the Kernel Trick.
For the earlier dataset, let us apply the mapping φ : R → R² given by
φ(x) = (x, x²)
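A small sketch of this explicit lift, assuming (as the mapping φ : R → R² suggests) that the earlier dataset is one-dimensional: one class near the origin and the other on both sides of it cannot be split by a single threshold on x, but become linearly separable once x² is added as a second coordinate.

```python
import numpy as np
from sklearn.svm import SVC

# 1-D data: class -1 near the origin, class +1 on both sides of it
x = np.array([-3.0, -2.5, 2.5, 3.0, -0.5, 0.0, 0.5])
y = np.array([+1, +1, +1, +1, -1, -1, -1])

# Explicit feature map phi(x) = (x, x^2): R -> R^2
X_lifted = np.column_stack([x, x ** 2])

clf = SVC(kernel="linear").fit(X_lifted, y)
print((clf.predict(X_lifted) == y).all())   # True: a straight line separates the lifted data
```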
Linearly Separable in 2D
[Figure: the same data after the mapping φ(x) = (x, x²), now linearly separable in the 2D feature space]
Another Example
[Figure: a two-dimensional dataset that is not linearly separable]
φ(x, y) = (x², √2 xy, y²)
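This particular map is the classic degree-2 example behind the kernel trick: the inner product of two mapped points equals the squared inner product of the originals, so the mapping never has to be carried out explicitly. A small numerical check (not part of the slides):

```python
import numpy as np

def phi(p):
    """Explicit map (x, y) -> (x^2, sqrt(2)*x*y, y^2)."""
    x, y = p
    return np.array([x ** 2, np.sqrt(2) * x * y, y ** 2])

def kernel(p, q):
    """Degree-2 polynomial kernel: K(p, q) = (p . q)^2."""
    return np.dot(p, q) ** 2

p, q = np.array([3.0, 4.0]), np.array([1.0, 2.0])
print(np.dot(phi(p), phi(q)))   # 121.0 -- inner product computed in the 3-D space
print(kernel(p, q))             # 121.0 -- the same number, computed entirely in 2-D
```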
The Kernel Trick
The data, when projected into three dimensions, is linearly separable.
Types of Kernels
Linear Kernel
Polynomial Kernel
RBF or Gaussian Kernel: an RBF kernel theoretically transforms data into an infinite dimensional space for classification. It is usually a good first choice because it tends to give good results.
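These kernels correspond directly to scikit-learn's SVC options; the comparison below on ring-shaped toy data (the library and the dataset are assumptions for illustration) shows the linear kernel failing where the polynomial and RBF kernels succeed:

```python
from sklearn.datasets import make_circles
from sklearn.svm import SVC

# Ring-shaped toy data that no straight line can separate
X, y = make_circles(n_samples=200, factor=0.3, noise=0.05, random_state=0)

for kernel in ("linear", "poly", "rbf"):
    clf = SVC(kernel=kernel, degree=2, gamma="scale").fit(X, y)
    print(kernel, clf.score(X, y))  # training accuracy: typically near-perfect for
                                    # poly and rbf, close to chance for linear
```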
Multi-class Classification using SVM
SVMs are primarily built for binary classification.
However, two approaches can be used to extend an SVM to classify
multiple classes.
One against All
One against One
One against All SVM
In One against All SVM, an n-class classification problem is decomposed into n binary classification problems.
For example, a problem involving three classes A, B, and C is decomposed into three binary classifiers:
A binary classifier to recognize A
A binary classifier to recognize B
A binary classifier to recognize C
However, this approach can lead to inconsistencies.
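A one-against-all setup can be assembled from binary SVMs via scikit-learn's wrapper (a sketch; the three-class blob data is made up for illustration):

```python
from sklearn.datasets import make_blobs
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import SVC

# Three-class toy problem; the classes play the role of A, B and C
X, y = make_blobs(n_samples=150, centers=3, random_state=0)

# One binary SVM per class: "this class" versus "all the others"
ova = OneVsRestClassifier(SVC(kernel="linear")).fit(X, y)
print(len(ova.estimators_))   # 3 binary classifiers
print(ova.predict(X[:5]))     # each point gets the class with the largest decision value
```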
One against All leads to Ambiguity
[Figure: data in the blue regions is classified into two classes simultaneously]
One against One SVM
In One against One SVM, instead of pitting one class against all the others, each pair of classes is pitted against each other.
This leads to a total of k(k − 1)/2 classifiers for a k-class classification problem.
The class of a data point is decided by simple majority voting.
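scikit-learn's SVC uses the one-against-one scheme internally for multi-class problems; the explicit wrapper below makes the pairwise classifiers visible (again on made-up blob data):

```python
from sklearn.datasets import make_blobs
from sklearn.multiclass import OneVsOneClassifier
from sklearn.svm import SVC

X, y = make_blobs(n_samples=200, centers=4, random_state=0)   # k = 4 classes

# One binary SVM per pair of classes
ovo = OneVsOneClassifier(SVC(kernel="linear")).fit(X, y)
print(len(ovo.estimators_))   # k(k-1)/2 = 6 pairwise classifiers
print(ovo.predict(X[:5]))     # the class collecting the most pairwise votes wins
```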
One against One SVM
One against One still leads to ambiguity if two classes receive the same number of votes.
Anomaly Detection
The fundamental notion of classification is that there are two or more classes that are well represented in the training data, and the classifier must assign one of these labels to each incoming test data point.
However, in a number of scenarios only one class of data is available. The other class is either not available at all, or has very few samples.
E.g., security related domains, such as detecting cyber-attacks or detecting abnormal program execution.
In such scenarios, obtaining training data corresponding to attacks or abnormal runs is expensive or infeasible.
The commonly used approach is to train the classifier on the available data, which models normal behaviour, and use it to identify any anomalies in the test data.
This is often called anomaly detection.
One Class SVMs
SVMs can be used for anomaly detection by modifying the constraints and the objective function.
Instead of maximizing the margin between classes as in a normal SVM, the One Class SVM tries to enclose the training data in a hypersphere (with radius R and centre a) and tries to minimize the radius of that sphere:
min R² + C Σ_i ζ_i
subject to ||x_i − a||² ≤ R² + ζ_i,   ζ_i ≥ 0,   i = 1, 2, ..., N
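scikit-learn provides a related estimator for this setting. Its OneClassSVM implements Schölkopf's hyperplane formulation rather than the hypersphere written above, but with an RBF kernel the two are effectively equivalent; a minimal anomaly-detection sketch (data and parameters are illustrative):

```python
import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(0)
normal = rng.normal(loc=0.0, scale=1.0, size=(200, 2))   # only "normal" data is available
unseen = np.array([[6.0, 6.0], [-5.0, 7.0]])             # anomalies never seen in training

# nu upper-bounds the fraction of training points treated as outliers
detector = OneClassSVM(kernel="rbf", gamma=0.1, nu=0.05).fit(normal)

print(detector.predict(unseen))        # [-1 -1]: flagged as anomalous
print(detector.predict(normal[:5]))    # mostly +1: consistent with the training data
```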