Support Vector Machines
Amit Praseed
October 23, 2019
Introduction
Support Vector Machines (SVMs) are arguably the best-known machine learning tool.
They are extensively used in text and image classification.
Linear Separability
[Figure: scatter plot of two classes in the x-y plane that are linearly separable]
Linear Inseparability (The XOR Problem)
[Figure: the XOR configuration of points in the x-y plane, which no single straight line can separate]
SVM for Linearly Separable Problems
For now, let us consider only linearly separable cases.
The only thing an SVM does is find the hyperplane (in n dimensions; in two dimensions it simply finds the line) that separates the two classes.
[Figure: two linearly separable classes in the x-y plane]
Which is the Optimal Hyperplane?
[Figure: the same two-class data with several candidate separating lines]
The Optimal Hyperplane
The optimal hyperplane should be maximally separated from the closest data points ("maximize the margin").
The optimal hyperplane should completely separate the data points into two perfectly pure classes.
SVM in 2 Dimensions
In two dimensions, the problem reduces to finding the optimal line separating the two data classes. The conditions for the optimal hyperplane still hold in two dimensions.
Let the line we have to find be y = ax + b. Rearranging,
ax − y + b = 0
Let X = [x, y] and W = [a, −1]; then the equation of the line becomes
W·X + b = 0
which is also the equation of a hyperplane when generalized to n dimensions. Here W is a vector perpendicular to the hyperplane.
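A quick numerical check of this rewriting (a minimal Python sketch; the specific line y = 2x + 1 and the sample points are made up for illustration):

```python
import numpy as np

# Hypothetical line y = 2x + 1, i.e. a = 2, b = 1 (numbers made up for illustration)
a, b = 2.0, 1.0
W = np.array([a, -1.0])            # W = [a, -1] is perpendicular to the line

points = np.array([[1.0, 3.0],     # on the line:  2*1 + 1 = 3
                   [1.0, 5.0],     # above the line
                   [1.0, 0.0]])    # below the line

for X in points:
    print(X, W.dot(X) + b)         # W.X + b = ax - y + b: 0 on the line,
                                   # negative above it, positive below it
```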
SVM in 2 Dimensions
Let us denote the two classes by integer labels. The SVM will output −1 for all the blue data points and +1 for all the red data points.
It can also be noted that for all blue points, which lie to the left of the separating line, W·X + b ≤ 0, while for all red points W·X + b ≥ 0.
Denote
h(x_i) = +1 if W·X_i + b ≥ 0, and −1 if W·X_i + b < 0
[Figure: blue (−1) points on one side and red (+1) points on the other side of the separating line]
SVM in 2 Dimensions
The sign of W·X + b gives the class label.
The magnitude of W·X + b gives the distance of the data point from the separating line.
Let β_i denote the distance from a data point to the separating hyperplane P. Then the closest point to the plane P has the lowest value of β. Let us denote this value as the functional margin F.
However, the distance measures on the blue side are negative while those on the red side are positive. To make them comparable, we simply multiply by the class labels, because all blue points have class label −1 and all red points have class label +1.
F = min_i β_i = min_i y_i (W·X_i + b)
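As a concrete illustration, the functional margin can be computed directly from this definition. The sketch below uses a made-up separating line and four toy points; none of these numbers come from the slides.

```python
import numpy as np

# Made-up separating line x - y = 0 (W = [1, -1], b = 0) and four toy points
W, b = np.array([1.0, -1.0]), 0.0
X = np.array([[1.0, 3.0], [2.0, 5.0],    # blue points, label -1
              [4.0, 1.0], [6.0, 2.0]])   # red points,  label +1
y = np.array([-1, -1, +1, +1])

beta = X.dot(W) + b          # signed values W.X_i + b (negative for blue, positive for red)
F = np.min(y * beta)         # functional margin: min_i y_i (W.X_i + b)
print(F)                     # 2.0 -> positive, so every point is on its correct side
```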
SVM in 2 Dimensions
From a potentially infinite number of hyperplanes, an SVM selects the one which maximizes the margin, in other words the one which is maximally separated from the closest data points of the two classes.
Put mathematically, the SVM selects the hyperplane which maximizes the functional margin.
However, the functional margin is not scale invariant.
For instance, consider two planes given by W1 = (2, 1), b1 = 5 and W2 = (20, 10), b2 = 50. They represent the same plane, because W is a vector orthogonal to the plane P and only its direction matters. However, the second choice multiplies the functional margin of every data point by a factor of 10, which makes the functional margin unreliable as an objective.
We therefore compute a scale invariant distance measure, given by
γ_i = y_i ( (W/|W|)·X_i + b/|W| )
SVM in 2 Dimensions
We define the scale invariant geometric margin as
M = min_i γ_i = min_i y_i ( (W/|W|)·X_i + b/|W| )
The SVM problem reduces to finding the hyperplane with the maximum geometric margin, or in other words,
max M   subject to γ_i ≥ M,   i = 1, 2, ..., m
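Continuing the same made-up example, the sketch below contrasts the two margins: scaling W and b by 10 describes the same plane and multiplies the functional margin by 10, but leaves the geometric margin unchanged.

```python
import numpy as np

X = np.array([[1.0, 3.0], [2.0, 5.0], [4.0, 1.0], [6.0, 2.0]])
y = np.array([-1, -1, +1, +1])

def margins(W, b):
    F = np.min(y * (X.dot(W) + b))       # functional margin
    M = F / np.linalg.norm(W)            # geometric margin M = F / |W|
    return F, M

print(margins(np.array([1.0, -1.0]), 0.0))     # (2.0, 1.414...)
print(margins(np.array([10.0, -10.0]), 0.0))   # (20.0, 1.414...) -- same plane, same M
```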
SVM in 2 Dimensions
Now, we can rewrite the above problem by recalling that M = F/|W|:
max F/|W|   subject to f_i/|W| ≥ F/|W|,   i = 1, 2, ..., m
where f_i = y_i (W·X_i + b) is the functional margin of the i-th point.
Since we are maximizing the geometric margin, which is scale invariant, we can rescale W and b so that F = 1:
max 1/|W|   subject to f_i ≥ 1,   i = 1, 2, ..., m
SVM in 2 Dimensions
We can express the same as a minimization problem:
min |W|   subject to y_i (W·X_i + b) ≥ 1,   i = 1, 2, ..., m
The final optimization problem seen in the literature is a simple modification of this:
min (1/2)|W|²   subject to y_i (W·X_i + b) − 1 ≥ 0,   i = 1, 2, ..., m
This is a convex quadratic optimization problem, which can be solved using Lagrange multipliers.
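In practice this quadratic program is rarely solved by hand. One common shortcut (an assumption here, not something the slides prescribe) is to use scikit-learn's SVC with a very large slack penalty C, which approximates a hard margin on separable data:

```python
import numpy as np
from sklearn.svm import SVC

# Toy linearly separable data (labels -1 and +1)
X = np.array([[1.0, 3.0], [2.0, 5.0], [4.0, 1.0], [6.0, 2.0]])
y = np.array([-1, -1, +1, +1])

# A very large C leaves almost no room for slack, approximating a hard margin
clf = SVC(kernel="linear", C=1e6).fit(X, y)

W, b = clf.coef_[0], clf.intercept_[0]
print(W, b)                          # learned hyperplane W.X + b = 0
print(clf.support_vectors_)          # the points that sit on the margin
print(np.min(y * (X.dot(W) + b)))    # functional margin, approximately 1 by construction
```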
Issues with Hard Margin SVMs
[Figure: Very Small Margin due to Outliers]
[Figure: Linearly Inseparable due to Outliers]
Soft Margin SVMs
Hard Margin SVMs have a strict rule that every point in the training dataset must be classified correctly.
Soft Margin SVMs allow some data points in the training set to be misclassified if that permits a wider margin and a better decision boundary.
They introduce slack variables ζ_i to allow for this, so that the constraint becomes
y_i (W·X_i + b) ≥ 1 − ζ_i,   i = 1, 2, ..., m
The complete minimization problem for a soft margin SVM becomes
min (1/2)|W|² + Σ_i ζ_i   subject to y_i (W·X_i + b) ≥ 1 − ζ_i,   i = 1, 2, ..., m
The additional term Σ_i ζ_i in the objective penalizes large slack values, which would satisfy all the constraints trivially but would allow more misclassification error.
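scikit-learn's soft margin SVC exposes this trade-off through its parameter C, which weights the slack term (the slides write the slack sum unweighted; a C in front of Σ ζ_i is the more common formulation). A rough sketch of its effect on made-up data with one awkward blue point:

```python
import numpy as np
from sklearn.svm import SVC

# Made-up data: two clean clusters plus one blue point lying close to the red side
X = np.array([[1.0, 8.0], [2.0, 9.0], [3.0, 7.0], [5.0, 4.0],   # blue (-1)
              [7.0, 2.0], [8.0, 3.0], [9.0, 1.0]])              # red  (+1)
y = np.array([-1, -1, -1, -1, +1, +1, +1])

for C in (0.01, 1.0, 100.0):
    clf = SVC(kernel="linear", C=C).fit(X, y)
    width = 2.0 / np.linalg.norm(clf.coef_[0])    # margin width 2/|W|
    errors = int((clf.predict(X) != y).sum())     # training points still misclassified
    print(f"C={C}: margin width {width:.2f}, training errors {errors}")
# Larger C penalizes slack more heavily: the margin shrinks (or stays the same)
# in exchange for tolerating fewer violations of y_i(W.X_i + b) >= 1.
```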
A Look Back at Linear Separability
[Figure: a one-dimensional dataset, plotted along the x-axis, that cannot be separated by a single threshold]
The Kernel Trick
A data set that appears to be linearly inseparable in a lower dimensional space can be made linearly separable by projecting it onto a higher dimensional space.
Instead of explicitly transforming all data points into the new higher dimensional space, an SVM uses a kernel that computes inner products between the data points as if they had already been mapped to the higher dimensional space. This is often called the Kernel Trick.
For the earlier dataset, let us apply the mapping φ : R → R² given by
φ(x) = (x, x²)
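A small sketch of this explicit lift, assuming (as the mapping φ : R → R² suggests) that the earlier dataset is one-dimensional: one class near the origin and the other on both sides of it cannot be split by a single threshold on x, but become linearly separable once x² is added as a second coordinate.

```python
import numpy as np
from sklearn.svm import SVC

# 1-D data: class -1 near the origin, class +1 on both sides of it
x = np.array([-3.0, -2.5, 2.5, 3.0, -0.5, 0.0, 0.5])
y = np.array([+1, +1, +1, +1, -1, -1, -1])

# Explicit feature map phi(x) = (x, x^2): R -> R^2
X_lifted = np.column_stack([x, x ** 2])

clf = SVC(kernel="linear").fit(X_lifted, y)
print((clf.predict(X_lifted) == y).all())   # True: a straight line separates the lifted data
```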
Linearly Separable in 2D
[Figure: the same data after the mapping φ(x) = (x, x²), now linearly separable in the 2D feature space]
Another Example
[Figure: a two-dimensional dataset that is not linearly separable]
φ(x, y) = (x², √2 xy, y²)
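This particular map is the classic degree-2 example behind the kernel trick: the inner product of two mapped points equals the squared inner product of the originals, so the mapping never has to be carried out explicitly. A small numerical check (not part of the slides):

```python
import numpy as np

def phi(p):
    """Explicit map (x, y) -> (x^2, sqrt(2)*x*y, y^2)."""
    x, y = p
    return np.array([x ** 2, np.sqrt(2) * x * y, y ** 2])

def kernel(p, q):
    """Degree-2 polynomial kernel: K(p, q) = (p . q)^2."""
    return np.dot(p, q) ** 2

p, q = np.array([3.0, 4.0]), np.array([1.0, 2.0])
print(np.dot(phi(p), phi(q)))   # 121.0 -- inner product computed in the 3-D space
print(kernel(p, q))             # 121.0 -- the same number, computed entirely in 2-D
```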
The Kernel Trick
The data, when projected into three dimensions, is linearly separable.
Types of Kernels
Linear Kernel
Polynomial Kernel
RBF or Gaussian Kernel: an RBF kernel theoretically transforms data into an infinite dimensional space for classification. It is usually a good first choice because it tends to give good results.
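These kernels correspond directly to scikit-learn's SVC options; the comparison below on ring-shaped toy data (the library and the dataset are assumptions for illustration) shows the linear kernel failing where the polynomial and RBF kernels succeed:

```python
from sklearn.datasets import make_circles
from sklearn.svm import SVC

# Ring-shaped toy data that no straight line can separate
X, y = make_circles(n_samples=200, factor=0.3, noise=0.05, random_state=0)

for kernel in ("linear", "poly", "rbf"):
    clf = SVC(kernel=kernel, degree=2, gamma="scale").fit(X, y)
    print(kernel, clf.score(X, y))  # training accuracy: typically near-perfect for
                                    # poly and rbf, close to chance for linear
```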
Multi-class Classification using SVM
SVMs are primarily built for binary classification.
However, two approaches can be used to extend an SVM to classify
multiple classes.
One against All
One against One
One against All SVM
In One against All SVM, an n-class classification problem is decomposed into n binary classification problems.
For example, a problem involving three classes A, B, and C is decomposed into three binary classifiers:
A binary classifier to recognize A
A binary classifier to recognize B
A binary classifier to recognize C
However, this approach can lead to inconsistencies.
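A one-against-all setup can be assembled from binary SVMs via scikit-learn's wrapper (a sketch; the three-class blob data is made up for illustration):

```python
from sklearn.datasets import make_blobs
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import SVC

# Three-class toy problem; the classes play the role of A, B and C
X, y = make_blobs(n_samples=150, centers=3, random_state=0)

# One binary SVM per class: "this class" versus "all the others"
ova = OneVsRestClassifier(SVC(kernel="linear")).fit(X, y)
print(len(ova.estimators_))   # 3 binary classifiers
print(ova.predict(X[:5]))     # each point gets the class with the largest decision value
```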
One against All leads to Ambiguity
[Figure: data in the blue regions is classified into two classes simultaneously]
One against One SVM
In One against One SVM, instead of pitting one class against all the others, each pair of classes is pitted against each other.
This leads to a total of k(k − 1)/2 classifiers for a k-class classification problem.
The class of a data point is decided by simple majority voting.
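scikit-learn's SVC uses the one-against-one scheme internally for multi-class problems; the explicit wrapper below makes the pairwise classifiers visible (again on made-up blob data):

```python
from sklearn.datasets import make_blobs
from sklearn.multiclass import OneVsOneClassifier
from sklearn.svm import SVC

X, y = make_blobs(n_samples=200, centers=4, random_state=0)   # k = 4 classes

# One binary SVM per pair of classes
ovo = OneVsOneClassifier(SVC(kernel="linear")).fit(X, y)
print(len(ovo.estimators_))   # k(k-1)/2 = 6 pairwise classifiers
print(ovo.predict(X[:5]))     # the class collecting the most pairwise votes wins
```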
One against One SVM
One against One still leads to ambiguity if two classes receive the same number of votes.
Anomaly Detection
The fundamental notion of classification is that there are two or more classes that are well represented in the training data, and the classifier must assign one of these labels to each incoming test data point.
However, in a number of scenarios only one class of data is available. The other class is either not available at all, or has very few samples.
E.g., security related domains, such as detecting cyber-attacks or detecting abnormal program execution.
In such scenarios, obtaining training data corresponding to attacks or abnormal runs is expensive or infeasible.
The commonly used approach is to train the classifier on the available data, which models normal behaviour, and use it to identify any anomalies in the test data.
This is often called anomaly detection.
One Class SVMs
SVMs can be used for anomaly detection by modifying the constraints and the objective function.
Instead of maximizing the margin between classes as in a normal SVM, the One Class SVM tries to enclose the training data in a hypersphere (with radius R and centre a) and tries to minimize the radius of that sphere:
min R² + C Σ_i ζ_i
subject to ||x_i − a||² ≤ R² + ζ_i,   ζ_i ≥ 0,   i = 1, 2, ..., N
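scikit-learn provides a related estimator for this setting. Its OneClassSVM implements Schölkopf's hyperplane formulation rather than the hypersphere written above, but with an RBF kernel the two are effectively equivalent; a minimal anomaly-detection sketch (data and parameters are illustrative):

```python
import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(0)
normal = rng.normal(loc=0.0, scale=1.0, size=(200, 2))   # only "normal" data is available
unseen = np.array([[6.0, 6.0], [-5.0, 7.0]])             # anomalies never seen in training

# nu upper-bounds the fraction of training points treated as outliers
detector = OneClassSVM(kernel="rbf", gamma=0.1, nu=0.05).fit(normal)

print(detector.predict(unseen))        # [-1 -1]: flagged as anomalous
print(detector.predict(normal[:5]))    # mostly +1: consistent with the training data
```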