CS-13410
Introduction to Machine Learning
Lecture # 15
Support Vector Machine
Non-Linear
A portion of the slides is taken from
Prof. Andrew Moore’s SVM tutorial at
http://www.cs.cmu.edu/~awm/tutorials
Dataset with noise
 Hard Margin: So far we have required that all data points be classified correctly
- no training error
 What if the training set is noisy?
- Solution 1: use very powerful kernels
[Figure: noisy training data; one marker denotes +1, the other denotes −1]
OVERFITTING!
What if the training set is not linearly separable?
Slack variables ξi can be added to allow misclassification of
difficult or noisy examples; the resulting margin is called a soft margin.
[Figure: separating hyperplane wx+b = 0 with margin boundaries wx+b = 1 and wx+b = −1; slack variables ξ2, ξ7, ξ11 mark points that violate the margin]
Soft Margin Classification
Hard Margin vs. Soft Margin
 The old formulation:
 The new formulation incorporating slack variables:
 Parameter C can be viewed as a way to control
overfitting.
Find w and b such that
Φ(w) = ½ wᵀw is minimized, and for all {(xi, yi)}: yi (wᵀxi + b) ≥ 1
Find w and b such that
Φ(w) = ½ wᵀw + C Σξi is minimized, and for all {(xi, yi)}: yi (wᵀxi + b) ≥ 1 − ξi and ξi ≥ 0 for all i
Not Discussed
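As a rough illustration of how C trades margin width against training error, the following sketch (assuming scikit-learn is available; the data and variable names are made up for illustration) fits a linear soft-margin SVM with a small and a large C on the same noisy data and counts margin violations:

import numpy as np
from sklearn.svm import SVC

# Toy 2-D data with one point deliberately placed inside the opposite cluster (noise)
X = np.array([[1.0, 1.0], [2.0, 2.0], [1.0, 2.0],
              [5.0, 5.0], [6.0, 5.0], [5.0, 6.0], [2.2, 2.0]])
y = np.array([-1, -1, -1, 1, 1, 1, 1])   # last point is labeled +1 but lies among the -1 points

for C in (0.1, 1000.0):
    clf = SVC(kernel="linear", C=C).fit(X, y)
    # Points with non-zero slack satisfy y_i * f(x_i) < 1
    violations = int(np.sum(y * clf.decision_function(X) < 1))
    print(f"C={C}: support vectors={len(clf.support_vectors_)}, margin violations={violations}")

A small C tolerates slack (wider margin, some violations); a very large C behaves almost like the hard-margin formulation and tries to classify every training point correctly.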
Linear SVMs: Overview
 The classifier is a separating hyperplane.
 Most “important” training points are support vectors; they define the
hyperplane.
 Quadratic optimization algorithms can identify which training points
xi are support vectors with non-zero Lagrangian multipliers αi.
 Both in the dual formulation of the problem and in the solution
training points appear only inside dot products:
Find α1…αN such that
Q(α) = Σαi − ½ ΣΣ αiαj yiyj xiᵀxj is maximized, and
(1) Σαiyi = 0
(2) 0 ≤ αi ≤ C for all αi
f(x) = Σ αiyi xiᵀx + b
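In scikit-learn the quantities appearing in this dual are exposed directly on a fitted model: support_vectors_ holds the support vectors xi and dual_coef_ holds the products αi·yi, so f(x) can be reassembled from them. A minimal sketch, with toy data and illustrative names:

import numpy as np
from sklearn.svm import SVC

X = np.array([[0, 0], [1, 1], [2, 0], [5, 5], [6, 4], [7, 6]], dtype=float)
y = np.array([-1, -1, -1, 1, 1, 1])

clf = SVC(kernel="linear", C=1.0).fit(X, y)

# dual_coef_[0, k] = alpha_k * y_k for the k-th support vector
alpha_y = clf.dual_coef_[0]
sv = clf.support_vectors_

x_new = np.array([3.0, 2.0])
f_manual = np.sum(alpha_y * (sv @ x_new)) + clf.intercept_[0]   # f(x) = sum_i alpha_i y_i x_i.T x + b
print(f_manual, clf.decision_function([x_new])[0])              # the two values should match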
Non-linear SVMs
 Datasets that are linearly separable with some noise
work out great:
 But what are we going to do if the dataset is just too
hard?
 How about… mapping data to a higher-dimensional
space:
[Figure: 1-D data along the x axis that is linearly separable; harder 1-D data along x that is not; the same data mapped to (x, x²), where it becomes separable]
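The 1-D picture above can be reproduced in a few lines: data that is not separable on the x axis alone becomes separable after mapping each point to (x, x²). A small sketch under those assumptions (the data values are illustrative, not from the slide):

import numpy as np
from sklearn.svm import SVC

# 1-D data: negatives in the middle, positives on both sides -- not linearly separable in x alone
x = np.array([-4.0, -3.0, -0.5, 0.0, 0.5, 3.0, 4.0])
y = np.array([1, 1, -1, -1, -1, 1, 1])

phi = np.column_stack([x, x**2])             # map x -> (x, x^2)
clf = SVC(kernel="linear", C=1e3).fit(phi, y)
print(clf.score(phi, y))                     # 1.0 -- a line in (x, x^2) space separates the classes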
Non-linear SVMs: Feature spaces
 General idea: the original input space can always be
mapped to some higher-dimensional feature space
where the training set is separable:
Φ: x → φ(x)
Example
Suppose we are given a set of positively labeled data points in R² and a set of
negatively labeled data points in R², as shown in the figure below.
Example (Non-linear SVM)
 From the figure we see that no linear separating hyperplane exists in the
input space. Therefore, we must use a nonlinear SVM (that is, one whose mapping
function Φ is a nonlinear mapping from the input space into some feature space).
 Define
(1)
After applying this transformation we can rewrite the data in feature space
for the positive and the negative examples.
Please see the figure on the next slide.
 Now we can easily identify the support vectors
(see the figure)
We will use vectors augmented with 1 as a bias input; the augmented vectors are shown in the figure.
Using the known support vectors, we now need to find two parameters from the following two linear equations.
Given equation (1), this reduces to a simpler system; computing the dot products then gives the two parameter values.
Substituting these values back, we get the separating hyperplane equation
(weight vector and bias as computed above). See the figure below.
Now we can classify any new point as belonging to the positive or the negative
class. If the data point is given in input space, we first map it into feature
space and then use the SVM to identify its class.
import numpy as np
from sklearn import svm, metrics
from sklearn.model_selection import train_test_split

# Data points from the worked example (first three in one class, last three in the other)
X = np.array([[3, 4], [1, 4], [2, 3], [6, -1], [7, -1], [5, -3]])
y = np.array([-1, -1, -1, 1, 1, 1])

# Alternative: an (almost) hard-margin linear SVM on the full data
# from sklearn.svm import SVC
# model = SVC(C=1e5, kernel='linear')
# model.fit(X, y)

# Hold out part of the data, fit a (default RBF-kernel) SVM, and measure accuracy
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, stratify=y, random_state=0)
SVM_Sol = svm.SVC(decision_function_shape="ovr").fit(X_train, y_train)
y_pred = SVM_Sol.predict(X_test)
accuracy = round(metrics.accuracy_score(y_test, y_pred), 2)
print(accuracy)
Applying SVM using Python
(Jupyter Notebook)
The “Kernel Trick”
 The linear classifier relies on the dot product between vectors: K(xi, xj) = xiᵀxj
 If every data point is mapped into a high-dimensional space via some
transformation Φ: x → φ(x), the dot product becomes: K(xi, xj) = φ(xi)ᵀφ(xj)
 A kernel function is some function that corresponds to an inner product in some
expanded feature space.
 Example:
2-dimensional vectors x = [x1 x2]; let K(xi, xj) = (1 + xiᵀxj)².
Need to show that K(xi, xj) = φ(xi)ᵀφ(xj):
K(xi, xj) = (1 + xiᵀxj)²
= 1 + xi1²xj1² + 2 xi1xj1 xi2xj2 + xi2²xj2² + 2 xi1xj1 + 2 xi2xj2
= [1  xi1²  √2 xi1xi2  xi2²  √2 xi1  √2 xi2]ᵀ [1  xj1²  √2 xj1xj2  xj2²  √2 xj1  √2 xj2]
= φ(xi)ᵀφ(xj), where φ(x) = [1  x1²  √2 x1x2  x2²  √2 x1  √2 x2]
Not Discussed
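The algebra above can be checked numerically: for any pair of 2-D vectors, (1 + xiᵀxj)² equals φ(xi)ᵀφ(xj) with the φ given on the slide. A quick sketch (the sample vectors are arbitrary):

import numpy as np

def phi(x):
    # phi(x) = [1, x1^2, sqrt(2) x1 x2, x2^2, sqrt(2) x1, sqrt(2) x2]
    x1, x2 = x
    return np.array([1.0, x1**2, np.sqrt(2)*x1*x2, x2**2, np.sqrt(2)*x1, np.sqrt(2)*x2])

xi = np.array([1.0, 2.0])
xj = np.array([3.0, -1.0])

k_direct = (1.0 + xi @ xj) ** 2        # kernel evaluated in input space
k_mapped = phi(xi) @ phi(xj)           # explicit inner product in feature space
print(k_direct, k_mapped)              # both equal 4.0 here: (1 + 3 - 2)^2 = 4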
Examples of Kernel Functions
 Linear: K(xi, xj) = xiᵀxj
 Polynomial of power p: K(xi, xj) = (1 + xiᵀxj)ᵖ
 Gaussian (radial-basis function network): K(xi, xj) = exp(−‖xi − xj‖² / (2σ²))
 Sigmoid: K(xi, xj) = tanh(β0 xiᵀxj + β1)
Not Discussed
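To see these kernels in action, a short sketch (assuming scikit-learn; make_circles generates an illustrative dataset that is not linearly separable in input space) compares linear, polynomial, and RBF kernels:

from sklearn.datasets import make_circles
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = make_circles(n_samples=300, factor=0.4, noise=0.1, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

for kernel in ("linear", "poly", "rbf"):
    clf = SVC(kernel=kernel, gamma="scale").fit(X_tr, y_tr)
    print(kernel, round(clf.score(X_te, y_te), 2))   # the non-linear kernels should score much higher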
Non-linear SVMs Mathematically
 Dual problem formulation:
Find α1…αN such that
Q(α) = Σαi − ½ ΣΣ αiαj yiyj K(xi, xj) is maximized, and
(1) Σαiyi = 0
(2) αi ≥ 0 for all αi
 The solution is:
f(x) = Σ αiyi K(xi, x) + b
 Optimization techniques for finding the αi’s remain the same!
Not Discussed
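The solution f(x) = Σ αiyi K(xi, x) + b can again be reconstructed from a fitted scikit-learn model, this time with an RBF kernel: dual_coef_ supplies αi·yi and the sum runs only over the support vectors. A hedged sketch with made-up data:

import numpy as np
from sklearn.svm import SVC
from sklearn.metrics.pairwise import rbf_kernel

X = np.array([[0, 0], [1, 1], [0, 1], [4, 4], [5, 4], [4, 5]], dtype=float)
y = np.array([-1, -1, -1, 1, 1, 1])

clf = SVC(kernel="rbf", gamma=0.5, C=1.0).fit(X, y)

x_new = np.array([[2.0, 2.0]])
K = rbf_kernel(clf.support_vectors_, x_new, gamma=0.5)       # K(x_i, x) for each support vector
f_manual = (clf.dual_coef_ @ K)[0, 0] + clf.intercept_[0]    # sum_i alpha_i y_i K(x_i, x) + b
print(f_manual, clf.decision_function(x_new)[0])             # the two values should agree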
 The SVM locates a separating hyperplane in the
feature space and classifies points in that space.
 If a transformation is required, the SVM does not
need to represent the new space explicitly; it suffices
to define a kernel function.
 The kernel function plays the role of the dot
product in the feature space.
Nonlinear SVM - Overview
Properties of SVM
 Flexibility in choosing a similarity function
 Sparseness of solution when dealing with large data sets
- only support vectors are used to specify the separating
hyperplane
 Ability to handle large feature spaces
- complexity does not depend on the dimensionality of the
feature space
 Overfitting can be controlled by soft margin approach
 Nice math property: a simple convex optimization problem
which is guaranteed to converge to a single global solution
 Feature Selection
SVM Applications
 SVM has been used successfully in many
real-world problems
- text (and hypertext) categorization
- image classification
- bioinformatics (Protein classification, Cancer classification)
- hand-written character recognition
Weakness of SVM
 It is sensitive to noise
- A relatively small number of mislabeled examples can dramatically
decrease the performance
 It only considers two classes
- how to do multi-class classification with SVM?
- Answer:
1) With output arity m, learn m SVMs:
 SVM 1 learns “Output==1” vs “Output != 1”
 SVM 2 learns “Output==2” vs “Output != 2”
 :
 SVM m learns “Output==m” vs “Output != m”
2) To predict the output for a new input, just predict with each SVM and
find out which one puts the prediction the furthest into the positive region (see the sketch below).
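The one-vs-rest strategy described above can be written out by hand in a few lines (a minimal sketch, assuming scikit-learn; the three-class blob data is illustrative): train one SVM per class and pick the class whose SVM pushes the point furthest into its positive region.

import numpy as np
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

X, y = make_blobs(n_samples=150, centers=3, random_state=0)   # 3-class toy problem

# Train one SVM per class: "class k" vs "not class k"
classifiers = []
for k in np.unique(y):
    clf = SVC(kernel="linear", C=1.0).fit(X, (y == k).astype(int))
    classifiers.append(clf)

# Predict by taking the class whose SVM gives the largest decision value
x_new = X[:5]
scores = np.column_stack([clf.decision_function(x_new) for clf in classifiers])
print(scores.argmax(axis=1), y[:5])   # predicted classes vs. true classes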
Pros and Cons of SVM
 Pros
1. It is effective in high-dimensional spaces.
2. Effective when the number of features is larger than the number of training
examples.
3. Works well when the classes are separable.
4. The hyperplane is determined only by the support vectors, so outliers
have less impact.
5. SVM is well suited to extreme-case binary classification.
 Cons
1. For larger datasets, training requires a large amount of time.
2. Does not perform well when the classes overlap.
3. Selecting hyperparameters of the SVM that give sufficient generalization
performance can be difficult.
4. Selecting the appropriate kernel function can be tricky.
Application 1: Cancer
Classification
 High Dimensional
- p>1000; n<100
 Imbalanced
- fewer positive samples
 Many irrelevant features
 Noisy
[Table: gene-expression data — rows are patients p-1 … p-n, columns are genes g-1 … g-p]
K[x, x] = k(x, x) + λ·(n₊ / N)
FEATURE SELECTION
In the linear case, wi² gives the ranking of dimension i (see the sketch below).
SVM is sensitive to noisy (mis-labeled) data.
Not Discussed
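The feature-selection remark — in the linear case wi² ranks dimension i — can be sketched with a fitted linear SVM: coef_ holds w, and sorting its squared entries gives a ranking. This uses synthetic data for illustration, not the slide's gene-expression data.

import numpy as np
from sklearn.datasets import make_classification
from sklearn.svm import LinearSVC

# High-dimensional toy data: only a few of the 50 features are informative
X, y = make_classification(n_samples=100, n_features=50, n_informative=5,
                           n_redundant=0, random_state=0)

clf = LinearSVC(C=1.0, max_iter=10000).fit(X, y)
ranking = np.argsort(clf.coef_[0] ** 2)[::-1]   # w_i^2 as an importance score
print("top-5 features by w_i^2:", ranking[:5])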
Application 2: Text
Categorization
 Task: The classification of natural text (or
hypertext) documents into a fixed number
of predefined categories based on their
content.
- email filtering, web searching, sorting documents by topic, etc.
 A document can be assigned to more than
one category, so this can be viewed as a
series of binary classification problems, one
for each category
Not Discussed
Representation of Text
IR’s vector space model (aka bag-of-words representation)
 A doc is represented by a vector indexed by a pre-fixed set or
dictionary of terms
 Values of an entry can be binary or weights
 Normalization, stop words, word stems
 Doc x => φ(x)
Not Discussed
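A minimal sketch of the bag-of-words pipeline described above, assuming scikit-learn; the documents and categories are made up for illustration:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

docs = ["cheap meds buy now", "meeting agenda attached",
        "limited offer buy cheap", "project schedule and agenda"]
labels = ["spam", "work", "spam", "work"]

# phi(x): sparse tf-idf vector indexed by the dictionary of terms (English stop words removed)
vectorizer = TfidfVectorizer(stop_words="english")
X = vectorizer.fit_transform(docs)

clf = LinearSVC().fit(X, labels)
print(clf.predict(vectorizer.transform(["cheap offer now"])))   # likely ['spam']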
Text Categorization using
SVM
 The similarity between two documents is φ(x)·φ(z)
 K(x, z) = φ(x)·φ(z) is a valid kernel, so an SVM can be used
with K(x, z) for discrimination.
 Why SVM?
-High dimensional input space
-Few irrelevant features (dense concept)
-Sparse document vectors (sparse instances)
-Text categorization problems are linearly separable
Not Discussed
Some Issues
 Choice of kernel
- Gaussian or polynomial kernel is default
- if ineffective, more elaborate kernels are needed
- domain experts can give assistance in formulating appropriate similarity
measures
 Choice of kernel parameters
- e.g. σ in Gaussian kernel
- σ is the distance between closest points with different classifications
- In the absence of reliable criteria, applications rely on the use of a
validation set or cross-validation to set such parameters.
 Optimization criterion – Hard margin vs. Soft margin
- a lengthy series of experiments in which various parameters are tested
Not Discussed
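In practice the kernel parameters (e.g. σ, exposed as gamma in scikit-learn) and C are usually chosen by cross-validation, as the slide suggests. A hedged sketch with GridSearchCV; the dataset and parameter grids are illustrative:

from sklearn.datasets import make_circles
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = make_circles(n_samples=300, factor=0.4, noise=0.15, random_state=0)

param_grid = {"C": [0.1, 1, 10, 100], "gamma": [0.01, 0.1, 1, 10]}
search = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=5)   # 5-fold cross-validation
search.fit(X, y)
print(search.best_params_, round(search.best_score_, 2))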
Additional Resources
 An excellent tutorial on VC-dimension and Support Vector
Machines:
C.J.C. Burges. A tutorial on support vector machines for pattern
recognition. Data Mining and Knowledge Discovery, 2(2):121–167, 1998.
 The VC/SRM/SVM Bible:
Statistical Learning Theory by Vladimir Vapnik, Wiley-Interscience, 1998.
http://www.kernel-machines.org/
Editor's Notes
  • #13: ovo = One-vs-one
  • #17: Instead of explicitly calculating the required transformation (which might be computationally expensive or even impossible for very high dimensions), SVM uses a kernel function. This function calculates the relationship between two points in the transformed feature space without explicitly performing the transformation. This is known as the kernel trick, and it allows SVM to handle complex data relationships efficiently.
  • #20:  The arity of a function or operation is the number of arguments or operands the function or operation accepts. The arity of a relation is the dimension of the domain in the corresponding Cartesian product.