CS-13410
Introduction to Machine Learning
Lecture # 15
Support Vector Machine
Non-Linear
A portion of the slides is taken from
Prof. Andrew Moore’s SVM tutorial at
http://www.cs.cmu.edu/~awm/tutorials
Dataset with noise
 Hard Margin: So far we have required that all data points be classified correctly
- no training error
 What if the training set is noisy?
- Solution 1: use very powerful kernels
[Figure: noisy training data; one marker denotes +1, the other denotes −1]
OVERFITTING!
What if the training set is not linearly separable?
Slack variables ξi can be added to allow misclassification of
difficult or noisy examples; the resulting margin is called a soft margin.
[Figure: separating hyperplane wx+b = 0 with margin boundaries wx+b = 1 and wx+b = −1; slack variables ξ2, ξ7, ξ11 mark points that violate the margin]
Soft Margin Classification
Hard Margin vs. Soft Margin
 The old formulation:
 The new formulation incorporating slack variables:
 Parameter C can be viewed as a way to control
overfitting.
Find w and b such that
Φ(w) = ½ wᵀw is minimized, and for all {(xi, yi)}: yi (wᵀxi + b) ≥ 1
Find w and b such that
Φ(w) = ½ wᵀw + C Σξi is minimized, and for all {(xi, yi)}: yi (wᵀxi + b) ≥ 1 − ξi and ξi ≥ 0 for all i
Not Discussed
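As a rough illustration of how C trades margin width against training error, the following sketch (assuming scikit-learn is available; the data and variable names are made up for illustration) fits a linear soft-margin SVM with a small and a large C on the same noisy data and counts margin violations:

import numpy as np
from sklearn.svm import SVC

# Toy 2-D data with one point deliberately placed inside the opposite cluster (noise)
X = np.array([[1.0, 1.0], [2.0, 2.0], [1.0, 2.0],
              [5.0, 5.0], [6.0, 5.0], [5.0, 6.0], [2.2, 2.0]])
y = np.array([-1, -1, -1, 1, 1, 1, 1])   # last point is labeled +1 but lies among the -1 points

for C in (0.1, 1000.0):
    clf = SVC(kernel="linear", C=C).fit(X, y)
    # Points with non-zero slack satisfy y_i * f(x_i) < 1
    violations = int(np.sum(y * clf.decision_function(X) < 1))
    print(f"C={C}: support vectors={len(clf.support_vectors_)}, margin violations={violations}")

A small C tolerates slack (wider margin, some violations); a very large C behaves almost like the hard-margin formulation and tries to classify every training point correctly.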
Linear SVMs: Overview
 The classifier is a separating hyperplane.
 Most “important” training points are support vectors; they define the
hyperplane.
 Quadratic optimization algorithms can identify which training points
xi are support vectors with non-zero Lagrangian multipliers αi.
 Both in the dual formulation of the problem and in the solution
training points appear only inside dot products:
Find α1…αN such that
Q(α) = Σαi − ½ ΣΣ αiαj yiyj xiᵀxj is maximized, and
(1) Σαiyi = 0
(2) 0 ≤ αi ≤ C for all αi
f(x) = Σ αiyi xiᵀx + b
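In scikit-learn the quantities appearing in this dual are exposed directly on a fitted model: support_vectors_ holds the support vectors xi and dual_coef_ holds the products αi·yi, so f(x) can be reassembled from them. A minimal sketch, with toy data and illustrative names:

import numpy as np
from sklearn.svm import SVC

X = np.array([[0, 0], [1, 1], [2, 0], [5, 5], [6, 4], [7, 6]], dtype=float)
y = np.array([-1, -1, -1, 1, 1, 1])

clf = SVC(kernel="linear", C=1.0).fit(X, y)

# dual_coef_[0, k] = alpha_k * y_k for the k-th support vector
alpha_y = clf.dual_coef_[0]
sv = clf.support_vectors_

x_new = np.array([3.0, 2.0])
f_manual = np.sum(alpha_y * (sv @ x_new)) + clf.intercept_[0]   # f(x) = sum_i alpha_i y_i x_i.T x + b
print(f_manual, clf.decision_function([x_new])[0])              # the two values should match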
Non-linear SVMs
 Datasets that are linearly separable with some noise
work out great:
 But what are we going to do if the dataset is just too
hard?
 How about… mapping data to a higher-dimensional
space:
[Figure: 1-D data along the x axis that is linearly separable; harder 1-D data along x that is not; the same data mapped to (x, x²), where it becomes separable]
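The 1-D picture above can be reproduced in a few lines: data that is not separable on the x axis alone becomes separable after mapping each point to (x, x²). A small sketch under those assumptions (the data values are illustrative, not from the slide):

import numpy as np
from sklearn.svm import SVC

# 1-D data: negatives in the middle, positives on both sides -- not linearly separable in x alone
x = np.array([-4.0, -3.0, -0.5, 0.0, 0.5, 3.0, 4.0])
y = np.array([1, 1, -1, -1, -1, 1, 1])

phi = np.column_stack([x, x**2])             # map x -> (x, x^2)
clf = SVC(kernel="linear", C=1e3).fit(phi, y)
print(clf.score(phi, y))                     # 1.0 -- a line in (x, x^2) space separates the classes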
Non-linear SVMs: Feature spaces
 General idea: the original input space can always be
mapped to some higher-dimensional feature space
where the training set is separable:
Φ: x → φ(x)
Example
Suppose we are given a set of positively labeled data points in R² and a set of
negatively labeled data points in R², as shown in the figure below.
Example (Non-linear SVM)
 From the figure we see that no linear separating hyperplane exists in the
input space. Therefore, we must use a nonlinear SVM (that is, one whose mapping
function Φ is a nonlinear mapping from the input space into some feature space).
 Define
(1)
After applying this transformation we can rewrite the data in feature space
for the positive and the negative examples.
Please see the figure on the next slide.
 Now we can easily identify the support vectors
(see the figure)
We will use vectors augmented with 1 as a bias input; the augmented vectors are shown in the figure.
Using the known support vectors, we now need to find two parameters from the following two linear equations.
Given equation (1), this reduces to a simpler system; computing the dot products then gives the two parameter values.
Substituting these values back, we get the separating hyperplane equation
(weight vector and bias as computed above). See the figure below.
Now we can classify any new point as belonging to the positive or the negative
class. If the data point is given in input space, we first map it into feature
space and then use the SVM to identify its class.
import numpy as np
from sklearn import svm, metrics
from sklearn.model_selection import train_test_split

# Data points from the worked example (first three in one class, last three in the other)
X = np.array([[3, 4], [1, 4], [2, 3], [6, -1], [7, -1], [5, -3]])
y = np.array([-1, -1, -1, 1, 1, 1])

# Alternative: an (almost) hard-margin linear SVM on the full data
# from sklearn.svm import SVC
# model = SVC(C=1e5, kernel='linear')
# model.fit(X, y)

# Hold out part of the data, fit a (default RBF-kernel) SVM, and measure accuracy
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, stratify=y, random_state=0)
SVM_Sol = svm.SVC(decision_function_shape="ovr").fit(X_train, y_train)
y_pred = SVM_Sol.predict(X_test)
accuracy = round(metrics.accuracy_score(y_test, y_pred), 2)
print(accuracy)
Applying SVM using Python
(Jupyter Notebook)
The “Kernel Trick”
 The linear classifier relies on the dot product between vectors: K(xi, xj) = xiᵀxj
 If every data point is mapped into a high-dimensional space via some
transformation Φ: x → φ(x), the dot product becomes: K(xi, xj) = φ(xi)ᵀφ(xj)
 A kernel function is some function that corresponds to an inner product in some
expanded feature space.
 Example:
2-dimensional vectors x = [x1 x2]; let K(xi, xj) = (1 + xiᵀxj)².
Need to show that K(xi, xj) = φ(xi)ᵀφ(xj):
K(xi, xj) = (1 + xiᵀxj)²
= 1 + xi1²xj1² + 2 xi1xj1 xi2xj2 + xi2²xj2² + 2 xi1xj1 + 2 xi2xj2
= [1  xi1²  √2 xi1xi2  xi2²  √2 xi1  √2 xi2]ᵀ [1  xj1²  √2 xj1xj2  xj2²  √2 xj1  √2 xj2]
= φ(xi)ᵀφ(xj), where φ(x) = [1  x1²  √2 x1x2  x2²  √2 x1  √2 x2]
Not Discussed
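The algebra above can be checked numerically: for any pair of 2-D vectors, (1 + xiᵀxj)² equals φ(xi)ᵀφ(xj) with the φ given on the slide. A quick sketch (the sample vectors are arbitrary):

import numpy as np

def phi(x):
    # phi(x) = [1, x1^2, sqrt(2) x1 x2, x2^2, sqrt(2) x1, sqrt(2) x2]
    x1, x2 = x
    return np.array([1.0, x1**2, np.sqrt(2)*x1*x2, x2**2, np.sqrt(2)*x1, np.sqrt(2)*x2])

xi = np.array([1.0, 2.0])
xj = np.array([3.0, -1.0])

k_direct = (1.0 + xi @ xj) ** 2        # kernel evaluated in input space
k_mapped = phi(xi) @ phi(xj)           # explicit inner product in feature space
print(k_direct, k_mapped)              # both equal 4.0 here: (1 + 3 - 2)^2 = 4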
Examples of Kernel Functions
 Linear: K(xi, xj) = xiᵀxj
 Polynomial of power p: K(xi, xj) = (1 + xiᵀxj)ᵖ
 Gaussian (radial-basis function network): K(xi, xj) = exp(−‖xi − xj‖² / (2σ²))
 Sigmoid: K(xi, xj) = tanh(β0 xiᵀxj + β1)
Not Discussed
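To see these kernels in action, a short sketch (assuming scikit-learn; make_circles generates an illustrative dataset that is not linearly separable in input space) compares linear, polynomial, and RBF kernels:

from sklearn.datasets import make_circles
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = make_circles(n_samples=300, factor=0.4, noise=0.1, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

for kernel in ("linear", "poly", "rbf"):
    clf = SVC(kernel=kernel, gamma="scale").fit(X_tr, y_tr)
    print(kernel, round(clf.score(X_te, y_te), 2))   # the non-linear kernels should score much higher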
Non-linear SVMs Mathematically
 Dual problem formulation:
Find α1…αN such that
Q(α) = Σαi − ½ ΣΣ αiαj yiyj K(xi, xj) is maximized, and
(1) Σαiyi = 0
(2) αi ≥ 0 for all αi
 The solution is:
f(x) = Σ αiyi K(xi, x) + b
 Optimization techniques for finding the αi’s remain the same!
Not Discussed
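The solution f(x) = Σ αiyi K(xi, x) + b can again be reconstructed from a fitted scikit-learn model, this time with an RBF kernel: dual_coef_ supplies αi·yi and the sum runs only over the support vectors. A hedged sketch with made-up data:

import numpy as np
from sklearn.svm import SVC
from sklearn.metrics.pairwise import rbf_kernel

X = np.array([[0, 0], [1, 1], [0, 1], [4, 4], [5, 4], [4, 5]], dtype=float)
y = np.array([-1, -1, -1, 1, 1, 1])

clf = SVC(kernel="rbf", gamma=0.5, C=1.0).fit(X, y)

x_new = np.array([[2.0, 2.0]])
K = rbf_kernel(clf.support_vectors_, x_new, gamma=0.5)       # K(x_i, x) for each support vector
f_manual = (clf.dual_coef_ @ K)[0, 0] + clf.intercept_[0]    # sum_i alpha_i y_i K(x_i, x) + b
print(f_manual, clf.decision_function(x_new)[0])             # the two values should agree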
 The SVM locates a separating hyperplane in the
feature space and classifies points in that space.
 If a transformation is required, the SVM does not
need to represent the new space explicitly; it suffices
to define a kernel function.
 The kernel function plays the role of the dot
product in the feature space.
Nonlinear SVM - Overview
Properties of SVM
 Flexibility in choosing a similarity function
 Sparseness of solution when dealing with large data sets
- only support vectors are used to specify the separating
hyperplane
 Ability to handle large feature spaces
- complexity does not depend on the dimensionality of the
feature space
 Overfitting can be controlled by soft margin approach
 Nice math property: a simple convex optimization problem
which is guaranteed to converge to a single global solution
 Feature Selection
SVM Applications
 SVM has been used successfully in many
real-world problems
- text (and hypertext) categorization
- image classification
- bioinformatics (Protein classification, Cancer classification)
- hand-written character recognition
Weakness of SVM
 It is sensitive to noise
- A relatively small number of mislabeled examples can dramatically
decrease the performance
 It only considers two classes
- how to do multi-class classification with SVM?
- Answer:
1) With output arity m, learn m SVMs:
 SVM 1 learns “Output==1” vs “Output != 1”
 SVM 2 learns “Output==2” vs “Output != 2”
 :
 SVM m learns “Output==m” vs “Output != m”
2) To predict the output for a new input, just predict with each SVM and
find out which one puts the prediction the furthest into the positive region (see the sketch below).
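The one-vs-rest strategy described above can be written out by hand in a few lines (a minimal sketch, assuming scikit-learn; the three-class blob data is illustrative): train one SVM per class and pick the class whose SVM pushes the point furthest into its positive region.

import numpy as np
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

X, y = make_blobs(n_samples=150, centers=3, random_state=0)   # 3-class toy problem

# Train one SVM per class: "class k" vs "not class k"
classifiers = []
for k in np.unique(y):
    clf = SVC(kernel="linear", C=1.0).fit(X, (y == k).astype(int))
    classifiers.append(clf)

# Predict by taking the class whose SVM gives the largest decision value
x_new = X[:5]
scores = np.column_stack([clf.decision_function(x_new) for clf in classifiers])
print(scores.argmax(axis=1), y[:5])   # predicted classes vs. true classes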
Pros and Cons of SVM
 Pros
1. It is effective in high-dimensional spaces.
2. Effective when the number of features is larger than the number of training
examples.
3. Works well when the classes are separable.
4. The hyperplane is determined only by the support vectors, so outliers
have less impact.
5. SVM is well suited to extreme-case binary classification.
 Cons
1. For larger datasets, training requires a large amount of time.
2. Does not perform well when the classes overlap.
3. Selecting hyperparameters of the SVM that give sufficient generalization
performance can be difficult.
4. Selecting the appropriate kernel function can be tricky.
Application 1: Cancer
Classification
 High Dimensional
- p>1000; n<100
 Imbalanced
- fewer positive samples
 Many irrelevant features
 Noisy
[Table: gene-expression data — rows are patients p-1 … p-n, columns are genes g-1 … g-p]
K[x, x] = k(x, x) + λ·(n₊ / N)
FEATURE SELECTION
In the linear case, wi² gives the ranking of dimension i (see the sketch below).
SVM is sensitive to noisy (mis-labeled) data.
Not Discussed
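The feature-selection remark — in the linear case wi² ranks dimension i — can be sketched with a fitted linear SVM: coef_ holds w, and sorting its squared entries gives a ranking. This uses synthetic data for illustration, not the slide's gene-expression data.

import numpy as np
from sklearn.datasets import make_classification
from sklearn.svm import LinearSVC

# High-dimensional toy data: only a few of the 50 features are informative
X, y = make_classification(n_samples=100, n_features=50, n_informative=5,
                           n_redundant=0, random_state=0)

clf = LinearSVC(C=1.0, max_iter=10000).fit(X, y)
ranking = np.argsort(clf.coef_[0] ** 2)[::-1]   # w_i^2 as an importance score
print("top-5 features by w_i^2:", ranking[:5])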
Application 2: Text
Categorization
 Task: The classification of natural text (or
hypertext) documents into a fixed number
of predefined categories based on their
content.
- email filtering, web searching, sorting documents by topic, etc.
 A document can be assigned to more than
one category, so this can be viewed as a
series of binary classification problems, one
for each category
Not Discussed
Representation of Text
IR’s vector space model (aka bag-of-words representation)
 A doc is represented by a vector indexed by a pre-fixed set or
dictionary of terms
 Values of an entry can be binary or weights
 Normalization, stop words, word stems
 Doc x => φ(x)
Not Discussed
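A minimal sketch of the bag-of-words pipeline described above, assuming scikit-learn; the documents and categories are made up for illustration:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

docs = ["cheap meds buy now", "meeting agenda attached",
        "limited offer buy cheap", "project schedule and agenda"]
labels = ["spam", "work", "spam", "work"]

# phi(x): sparse tf-idf vector indexed by the dictionary of terms (English stop words removed)
vectorizer = TfidfVectorizer(stop_words="english")
X = vectorizer.fit_transform(docs)

clf = LinearSVC().fit(X, labels)
print(clf.predict(vectorizer.transform(["cheap offer now"])))   # likely ['spam']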
Text Categorization using
SVM
 The similarity between two documents is φ(x)·φ(z)
 K(x, z) = φ(x)·φ(z) is a valid kernel, so an SVM can be used
with K(x, z) for discrimination.
 Why SVM?
-High dimensional input space
-Few irrelevant features (dense concept)
-Sparse document vectors (sparse instances)
-Text categorization problems are linearly separable
Not Discussed
Some Issues
 Choice of kernel
- Gaussian or polynomial kernel is default
- if ineffective, more elaborate kernels are needed
- domain experts can give assistance in formulating appropriate similarity
measures
 Choice of kernel parameters
- e.g. σ in Gaussian kernel
- σ is the distance between closest points with different classifications
- In the absence of reliable criteria, applications rely on the use of a
validation set or cross-validation to set such parameters.
 Optimization criterion – Hard margin vs. Soft margin
- a lengthy series of experiments in which various parameters are tested
Not Discussed
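In practice the kernel parameters (e.g. σ, exposed as gamma in scikit-learn) and C are usually chosen by cross-validation, as the slide suggests. A hedged sketch with GridSearchCV; the dataset and parameter grids are illustrative:

from sklearn.datasets import make_circles
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = make_circles(n_samples=300, factor=0.4, noise=0.15, random_state=0)

param_grid = {"C": [0.1, 1, 10, 100], "gamma": [0.01, 0.1, 1, 10]}
search = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=5)   # 5-fold cross-validation
search.fit(X, y)
print(search.best_params_, round(search.best_score_, 2))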
Additional Resources
 An excellent tutorial on VC-dimension and Support Vector
Machines:
C.J.C. Burges. A tutorial on support vector machines for pattern
recognition. Data Mining and Knowledge Discovery, 2(2):121–167, 1998.
 The VC/SRM/SVM Bible:
Statistical Learning Theory by Vladimir Vapnik, Wiley-Interscience, 1998.
http://www.kernel-machines.org/
Editor's Notes
  • #13: ovo = One-vs-one
  • #17: Instead of explicitly calculating the required transformation (which might be computationally expensive or even impossible for very high dimensions), SVM uses a kernel function. This function calculates the relationship between two points in the transformed feature space without explicitly performing the transformation. This is known as the kernel trick, and it allows SVM to handle complex data relationships efficiently.
  • #20:  The arity of a function or operation is the number of arguments or operands the function or operation accepts. The arity of a relation is the dimension of the domain in the corresponding Cartesian product.