Feature Classification Using Support Vector Machines
The support vector machine (SVM) is a classification system based on statistical learning theory (Vapnik, 1995). Support vector machines are binary classifiers, popular for their ability to handle high-dimensional data, and are widely used in feature classification. The technique is said to be independent of the dimensionality of the feature space, since the main idea is to separate the classes with a surface that maximises the margin between them, using boundary pixels to create the decision surface. The data points that are closest to the hyperplane are termed "support vectors". Applying SVMs to any classification problem requires the determination of several user-defined parameters: the choice of a suitable multiclass approach, the choice of an appropriate kernel and its related parameters, the determination of a suitable value of the regularisation parameter (i.e. C), and a suitable optimisation technique.
In the case of a two-class pattern recognition problem in which the classes are linearly separable, the SVM selects, from among the infinite number of linear decision boundaries, the one that minimises the generalisation error. Thus, the selected decision boundary will be one
that leaves the greatest margin between the two classes, where margin is defined as the sum
of the distances to the hyperplane from the closest points of the two classes (Vapnik, 1995).
This problem of maximising the margin can be solved using standard Quadratic
Programming (QP) optimisation techniques. The data points that are closest to the hyperplane
are used to measure the margin; hence these data points are termed ‘support vectors’.
Consider a training data set {(x1, y1), (x2, y2), ..., (xn, yn)}, where the xi are the vectorized training images and yi ∈ {−1, +1} are the labels assigned to each image.
The SVM builds a hyperplane, w^T x − b = 0, that best separates the data points (by the widest margin), where w is the normal to the hyperplane and b is the bias, with b/||w|| being the perpendicular distance from the hyperplane to the origin.
Figure: Hyperplane that best separates the data
For the linearly separable case, the support vector algorithm simply looks for the separating hyperplane with the largest margin.
It does so by minimizing the following objective function:
  minimize   F(w) = (1/2) ||w||^2
  subject to   yi (w^T xi + b) ≥ 1   ∀i
The problem of optimization is simplified by using its dual representation:
  maximize   L(α) = Σi αi − (1/2) Σi Σj αi αj yi yj xi^T xj
  subject to   Σi αi yi = 0,   αi ≥ 0   ∀i
Here αi corresponds to the Lagrange multiplier of the i-th training constraint. The Karush-Kuhn-Tucker (KKT) conditions for the optimum of the constrained function are necessary and sufficient to find the maximum of this equation. The corresponding KKT complementarity conditions are
  αi [ yi (w^T xi + b) − 1 ] = 0   ∀i
The optimal solution is thus given by
  w = Σi αi yi xi
For non-separable data, the above objective function and inequality constraint are modified as:
  minimize   F(w, ξ) = (1/2) ||w||^2 + C Σi ξi
  subject to   yi (w^T xi + b) ≥ 1 − ξi,   ξi ≥ 0   ∀i
and the dual constraints become 0 ≤ αi ≤ C together with Σi αi yi = 0.
Here the ξi are slack variables that allow misclassification of data that are not linearly separable, and C is the penalizing constant.
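As an illustration of this soft-margin formulation, the following minimal sketch (not from the original text; it uses scikit-learn's SVC, which solves the same soft-margin QP, on made-up 2-D data) fits a linear SVM for several values of the penalizing constant C and inspects the support vectors and margin width.

```python
# Minimal sketch of a soft-margin linear SVM on hypothetical 2-D data.
import numpy as np
from sklearn.svm import SVC

rng = np.random.RandomState(0)
X = np.vstack([rng.randn(20, 2) + [2, 2],    # class +1 cluster
               rng.randn(20, 2) - [2, 2]])   # class -1 cluster
y = np.array([1] * 20 + [-1] * 20)

for C in (0.01, 1.0, 100.0):                 # larger C -> fewer margin violations tolerated
    clf = SVC(kernel="linear", C=C).fit(X, y)
    w = clf.coef_[0]                          # normal of the separating hyperplane
    print(f"C={C}: support vectors={len(clf.support_vectors_)}, "
          f"margin width={2.0 / np.linalg.norm(w):.3f}")
```

Larger C penalizes slack more heavily, typically yielding a narrower margin and fewer support vectors on this kind of separable toy data.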
i. Nonlinear Support Vector Machines
If the two classes are not linearly separable, the SVM tries to find the hyperplane that
maximises the margin while, at the same time, minimising a quantity proportional to the
number of misclassification errors. The trade-off between margin and misclassification error
is controlled by a user-defined constant (Cortes and Vapnik, 1995). Training an SVM finds
the large-margin hyperplane, i.e. it sets the parameters αi and b. The SVM has another set of parameters called hyperparameters: the soft-margin constant C and any parameters the kernel function may depend on (e.g. the width of a Gaussian kernel). The SVM can also be extended to handle non-linear decision surfaces. If the input data are not linearly separable in the input space but may be linearly separable in some higher dimensional space, the classification problem can be solved by simply mapping the input data to that higher dimensional space, x → ϕ(x).
Figure: Mapping of input data to higher dimensional data
The SVM performs an implicit mapping of the data into a higher (possibly infinite) dimensional feature space, and then finds a linear separating hyperplane with the maximal margin in this higher dimensional space.
The dual representation is thus given by
  maximize   L(α) = Σi αi − (1/2) Σi Σj αi αj yi yj ϕ(xi)^T ϕ(xj)
  subject to   Σi αi yi = 0,   0 ≤ αi ≤ C   ∀i
The problem with this approach is the very high computational complexity of working in the higher dimensional space. The use of kernel functions eliminates this problem.
A kernel function can be represented as:
  K(xi, xj) = ϕ(xi)^T ϕ(xj)
A number of kernels have been developed so far, but the most popular and promising kernels are:
  K(xi, xj) = xi^T xj                          (Linear kernel)
  K(xi, xj) = exp(−||xi − xj||^2 / (2σ^2))     (Radial basis kernel)
  K(xi, xj) = (1 + xi^T xj)^p                  (Polynomial kernel)
  K(xi, xj) = tanh(a xi^T xj + r)              (Sigmoid kernel)
A new test example x is classified by the following function:
  F(x) = sgn( Σi αi yi K(xi, x) + b )
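To make this decision function concrete, the sketch below (an illustrative setup, not the authors' code) recovers the products αi·yi from a fitted scikit-learn SVC, where they are exposed as dual_coef_, and evaluates sgn(Σi αi yi K(xi, x) + b) by hand.

```python
# Sketch: evaluating F(x) = sgn( sum_i alpha_i * y_i * K(x_i, x) + b ) explicitly.
import numpy as np
from sklearn.svm import SVC
from sklearn.datasets import make_blobs
from sklearn.metrics.pairwise import rbf_kernel

X, y = make_blobs(n_samples=60, centers=2, random_state=1)
y = np.where(y == 0, -1, 1)

gamma = 0.5
clf = SVC(kernel="rbf", gamma=gamma, C=1.0).fit(X, y)

alpha_y = clf.dual_coef_[0]                              # alpha_i * y_i for the support vectors
K = rbf_kernel(X, clf.support_vectors_, gamma=gamma)     # K(x, x_i) for every test point x
manual = np.sign(K @ alpha_y + clf.intercept_[0])

assert np.array_equal(manual, clf.predict(X))            # matches the built-in prediction
print("manual kernel decision function matches SVC.predict")
```

Only the support vectors enter the sum, which is why the trained model needs to store them (and their coefficients) but not the rest of the training set.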
a. The Behaviour of the Sigmoid Kernel
We consider the sigmoid kernel K(xi, xj) = tanh(a xi^T xj + r), which takes two parameters: a and r. For a > 0, we can view a as a scaling parameter of the input data and r as a shifting parameter that controls the threshold of the mapping. For a < 0, the dot product of the input data is not only scaled but reversed.
It can be concluded that the first case, a > 0 and r < 0, is more suitable for the sigmoid kernel.
  a   r   Behaviour
  +   −   K is conditionally positive definite (CPD) once r is sufficiently small; similar to the RBF kernel for small a
  +   +   in general not as good as the (+, −) case
  −   +   the dual objective value goes to −∞ once r is large enough
  −   −   the dual objective value easily goes to −∞
Table 1: Behaviour of the sigmoid kernel for different parameter combinations
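The sketch below (illustrative; the dataset and parameter values are made up) shows how the sigmoid kernel's a and r map onto scikit-learn's gamma and coef0, so the (a > 0, r < 0) region recommended by Table 1 can be compared directly against the (+, +) case.

```python
# Sketch: sigmoid kernel K(xi, xj) = tanh(a * xi^T xj + r).
# In scikit-learn, a corresponds to `gamma` and r to `coef0`.
from sklearn.svm import SVC
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=200, n_features=10, random_state=0)

for a, r in [(0.01, -1.0), (0.01, 1.0)]:     # the (+, -) and (+, +) cases from Table 1
    clf = SVC(kernel="sigmoid", gamma=a, coef0=r, C=1.0)
    score = cross_val_score(clf, X, y, cv=5).mean()
    print(f"a={a}, r={r}: cross-validated accuracy = {score:.3f}")
```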
b. Behaviour of polynomial kernel
The polynomial kernel, K(xi, xj) = (1 + xi^T xj)^p, is a non-stochastic kernel estimate with two parameters: the regularisation constant C and the polynomial degree p. Each data point xi in the training set has an influence on the kernel value at a test point xj, irrespective of its actual distance from xj [14]. It gives good classification accuracy with a minimum number of support vectors and low classification error.
Figure: The effect of the degree of a polynomial kernel.
Higher degree polynomial kernels allow a more flexible decision boundary
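A brief sketch of this effect follows (assumed toy data; in scikit-learn the polynomial kernel is (coef0 + gamma * xi^T xj)^degree, which reduces to the form above with coef0 = 1 and gamma = 1).

```python
# Sketch: polynomial kernel (1 + xi^T xj)^p for several degrees p.
from sklearn.svm import SVC
from sklearn.datasets import make_moons
from sklearn.model_selection import cross_val_score

X, y = make_moons(n_samples=200, noise=0.2, random_state=0)

for p in (1, 2, 3, 5):
    clf = SVC(kernel="poly", degree=p, coef0=1.0, gamma=1.0, C=1.0)
    acc = cross_val_score(clf, X, y, cv=5).mean()
    n_sv = clf.fit(X, y).n_support_.sum()
    print(f"degree={p}: cv accuracy={acc:.3f}, support vectors={n_sv}")
```

Degree 1 recovers an (affine) linear boundary, while higher degrees bend the boundary to follow the data more closely, at the risk of overfitting for very large p.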
c. Gaussian radial basis function
The Gaussian radial basis function kernel, K(xi, xj) = exp(−||xi − xj||^2 / (2σ^2)), deals with data whose conditional probability distribution approaches a Gaussian function. RBF kernels often perform better than the linear and polynomial kernels. However, it is difficult to find the optimum parameter σ and an equivalent C that give the best result for a given problem.
A radial basis function (RBF) is a function of two vectors that depends only on the distance between them, i.e. K(xi, xj) = f(||xi − xj||). The quantity ||xi − xj||^2 may be recognized as the squared Euclidean distance between the two feature vectors. The parameter σ is called the bandwidth.
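As a small check of this definition (a sketch with made-up vectors), the code below computes K(xi, xj) = exp(−||xi − xj||^2 / (2σ^2)) directly and compares it with scikit-learn's rbf_kernel, which is parameterised by gamma = 1/(2σ^2).

```python
# Sketch: the RBF kernel depends only on the squared Euclidean distance.
import numpy as np
from sklearn.metrics.pairwise import rbf_kernel

xi = np.array([1.0, 2.0, 3.0])
xj = np.array([2.0, 0.0, 1.0])
sigma = 1.5                                        # bandwidth

sq_dist = np.sum((xi - xj) ** 2)                   # squared Euclidean distance
k_manual = np.exp(-sq_dist / (2 * sigma ** 2))     # bandwidth parameterisation

gamma = 1.0 / (2 * sigma ** 2)                     # scikit-learn's equivalent parameter
k_sklearn = rbf_kernel(xi.reshape(1, -1), xj.reshape(1, -1), gamma=gamma)[0, 0]

assert np.isclose(k_manual, k_sklearn)
print("K(xi, xj) =", round(k_manual, 4))
```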
Figure: Circled points are support vectors. The two contour lines running through support
vectors are the nonlinear counterparts of the convex hulls. The thick black line is the
classifier. The lines in the image are contour lines of this surface. The classifier runs along
the bottom of the "valley" between the two classes. Smoothness of the contours is controlled
by σ
Kernel parameters also have a significant effect on the decision boundary. The width parameter of the Gaussian kernel controls the flexibility of the resulting classifier.
Figure: The effect of the inverse-width parameter of the Gaussian kernel (γ) for a fixed value of the soft-margin constant (left: gamma = 1; right: gamma = 100). The flexibility of the decision boundary increases with increasing gamma; large values of γ lead to overfitting (right).
Intuitively, the gamma parameter defines how far the influence of a single training example
reaches, with low values meaning ‘far’ and high values meaning ‘close’. The C parameter
trades off misclassification of training examples against simplicity of the decision surface.
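Because C and γ interact, they are usually tuned together. The sketch below (illustrative data and grid, not a prescription from the original text) uses a cross-validated grid search, which is one common way to pick both hyperparameters.

```python
# Sketch: jointly tuning the soft-margin constant C and the RBF inverse-width gamma.
from sklearn.svm import SVC
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=300, n_features=20, random_state=42)

param_grid = {
    "C": [0.1, 1, 10, 100],           # larger C -> less tolerance for misclassification
    "gamma": [0.001, 0.01, 0.1, 1],   # larger gamma -> more flexible boundary
}
search = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=5)
search.fit(X, y)
print("best parameters:", search.best_params_,
      "cv accuracy:", round(search.best_score_, 3))
```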
ii. Multi Class Classification
SVMs are suitable only for binary classification. However, they can be easily extended to a multi-class problem, for example by utilizing Error-Correcting Output Codes. When dealing with multiple classes, an appropriate multi-class method is needed. Vapnik (1995) suggested comparing one class with the others taken together. This strategy generates n classifiers, where n is the number of classes. The final output is the class that corresponds to the SVM with the largest margin, as defined above. For multi-class problems one has to determine n hyperplanes. Thus, this method requires the solution of n QP optimisation problems, each of which separates one class from the remaining classes. A dichotomy is a two-class classifier that learns from data labelled with positive (+), negative (−), or don't care. Given any number of classes, we can re-label them with these three symbols and thus form a dichotomy. Different relabelings result in different two-class problems, each of which is learned independently. A multi-class classifier progresses through every selected dichotomy and chooses the class that is correctly classified by the maximum number of selected dichotomies. Exhaustive dichotomies represent the set of all possible ways of dividing and relabeling the dataset with the three defined symbols. A one-against-all classification scheme on an n-class problem considers n dichotomies, each of which re-labels one class as (+) and all other classes as (−).
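For concreteness, the one-against-all scheme can be sketched as follows (an illustrative implementation with scikit-learn's OneVsRestClassifier, not the authors' code): n binary SVMs are trained and the class whose SVM produces the largest decision value wins.

```python
# Sketch: one-vs-rest (1-v-r) multi-class SVM -- n binary classifiers, argmax of outputs.
import numpy as np
from sklearn.svm import SVC
from sklearn.datasets import load_iris
from sklearn.multiclass import OneVsRestClassifier

X, y = load_iris(return_X_y=True)                 # 3 classes -> 3 binary SVMs

ovr = OneVsRestClassifier(SVC(kernel="linear", C=1.0)).fit(X, y)
scores = ovr.decision_function(X)                 # shape (n_samples, n_classes)
pred = np.argmax(scores, axis=1)                  # class with the largest SVM output

assert np.array_equal(pred, ovr.predict(X))
print("number of binary classifiers:", len(ovr.estimators_))
```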
a. DAG – SVM
The problem of multiclass classification, especially for systems like SVMs, does not have a single obvious solution. The standard method for an N-class problem is to construct N SVMs. The i-th SVM is trained with all of the examples in the i-th class given positive labels and all other examples given negative labels. We refer to SVMs trained in this way as 1-v-r SVMs (short for one-versus-rest). The final output of the 1-v-r SVMs is the class that corresponds to the SVM with the highest output value. Unfortunately, there is no bound on the generalization error for the 1-v-r SVM, and the training time of the standard method scales linearly with N.
Another method for constructing N-class classifiers from SVMs is derived from previous research into combining two-class classifiers. Knerr suggested constructing all possible two-class classifiers from a training set of N classes, each classifier being trained on only two out of the N classes. There would thus be K = N(N−1)/2 classifiers. When applied to SVMs, we refer to this as 1-v-1 SVMs (short for one-versus-one).
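The sketch below (illustrative; it relies on scikit-learn's OneVsOneClassifier) constructs the K = N(N−1)/2 pairwise classifiers described here and confirms the count.

```python
# Sketch: one-vs-one (1-v-1) SVMs -- K = N(N-1)/2 pairwise classifiers.
from sklearn.svm import SVC
from sklearn.datasets import load_digits
from sklearn.multiclass import OneVsOneClassifier

X, y = load_digits(return_X_y=True)               # N = 10 classes
N = len(set(y))

ovo = OneVsOneClassifier(SVC(kernel="linear", C=1.0)).fit(X, y)
assert len(ovo.estimators_) == N * (N - 1) // 2   # 45 pairwise SVMs for 10 classes
print("pairwise classifiers:", len(ovo.estimators_))
```

Each pairwise SVM sees only the examples of its two classes, which is why the per-classifier training sets are so much smaller than in the one-versus-rest scheme.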
A Directed Acyclic Graph (DAG) is a graph whose edges have an orientation and no cycles. A Rooted DAG has a unique node such that it is the only node with no arcs pointing into it. A Rooted Binary DAG has nodes which have either 0 or 2 arcs leaving them. We will use Rooted Binary DAGs in order to define a class of functions to be used in classification tasks. The class of functions computed by Rooted Binary DAGs is formally defined as follows.
Definition 1: Decision DAGs (DDAGs).
Given a space X and a set of Boolean functions F = {f: X → {0, 1}}, the class DDAG(F) of Decision DAGs on N classes over F are functions which can be implemented using a rooted binary DAG with N leaves labelled by the classes, where each of the K = N(N−1)/2 internal nodes is labelled with an element of F. The nodes are arranged in a triangle with the single root node at the top, two nodes in the second layer, and so on until the final layer of N leaves. The i-th node in layer j < N is connected to the i-th and (i+1)-st nodes in the (j+1)-st layer.
To evaluate a particular DDAG on an input x ∈ X, starting at the root node, the binary function at a node is evaluated. The node is then exited via the left edge if the binary function is zero, or the right edge if the binary function is one. The next node's binary function is then evaluated. The value of the decision function D(x) is the value associated with the final leaf node. The path taken through the DDAG is known as the evaluation path. The input x reaches a node of the graph if that node is on the evaluation path for x. We refer to the decision node distinguishing classes i and j as the ij-node. Assuming that the number of a leaf is its class, this node is the i-th node in the (N−j+1)-th layer, provided i < j. Similarly, the j-nodes are those nodes involving class j, that is, the internal nodes on the two diagonals containing the leaf labelled by j.
The DDAG is equivalent to operating on a list, where each node eliminates one class from the list. The list is initialized with all classes. A test point is evaluated against the decision node that corresponds to the first and last elements of the list.
If the node prefers one of the two classes, the other class is eliminated from the list, and the DDAG proceeds to test the first and last elements of the new list. The DDAG terminates when only one class remains in the list. Thus, for a problem with N classes, N−1 decision nodes will be evaluated in order to derive an answer.
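This list-elimination procedure can be written down directly. The sketch below is a hypothetical implementation (the pairwise SVMs are trained with plain scikit-learn, and ddag_predict is an illustrative helper, not part of any library): it evaluates a DDAG over N classes using N−1 pairwise decisions per test point.

```python
# Sketch: DDAG evaluation -- keep a list of candidate classes and let the
# (first, last) pairwise classifier eliminate one class per step.
import numpy as np
from itertools import combinations
from sklearn.svm import SVC
from sklearn.datasets import load_iris

X, y = load_iris(return_X_y=True)
classes = sorted(set(y))

# Train one 1-v-1 SVM for every pair of classes.
pairwise = {}
for i, j in combinations(classes, 2):
    mask = np.isin(y, [i, j])
    pairwise[(i, j)] = SVC(kernel="linear", C=1.0).fit(X[mask], y[mask])

def ddag_predict(x):
    remaining = list(classes)
    while len(remaining) > 1:
        i, j = remaining[0], remaining[-1]             # decision node for (first, last)
        winner = pairwise[(min(i, j), max(i, j))].predict(x.reshape(1, -1))[0]
        remaining.remove(j if winner == i else i)      # discard the losing class
    return remaining[0]                                 # N-1 decisions were made

preds = np.array([ddag_predict(x) for x in X])
print("training accuracy of the DDAG:", round((preds == y).mean(), 3))
```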
The current state of the list is the total state of the system. Therefore, since a list state is reachable by more than one possible path through the system, the decision graph the algorithm traverses is a DAG, not simply a tree.
The DAGSVM [8] separates the individual classes with large margin. It is safe to discard the losing class at each 1-v-1 decision because, for the hard-margin case, all of the examples of the losing class are far away from the decision surface. The DAGSVM algorithm is superior to other multiclass SVM algorithms in both training and evaluation time. Empirically, SVM training is observed to scale super-linearly with the training set size m, according to a power law:
  T = c m^γ
where γ ≈ 2 for algorithms based on the decomposition method, with some proportionality constant c. For the standard 1-v-r multiclass SVM training algorithm, the entire training set is used to create all N classifiers.
Figure: The Decision DAG for finding the best class out of four classes
Hence the training time for 1-v-r is
  T(1-v-r) = c N m^γ
Assuming that the classes have the same number of examples, training each 1-v-1 SVM only requires 2m/N training examples. Thus, training all K 1-v-1 SVMs would require
  T(1-v-1) = c K (2m/N)^γ ≈ 2^(γ−1) c N^(2−γ) m^γ
For a typical case, where γ = 2, the amount of time required to train all of the 1-v-1 SVMs is independent of N, and is only about twice that of training a single 1-v-r SVM. Using 1-v-1 SVMs with a combination algorithm is thus preferred for training time.
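As a quick numerical check of this claim (a sketch; the constant c, the training-set size, and the class count are made up), the formulas above can be evaluated for γ = 2.

```python
# Sketch: comparing 1-v-r and 1-v-1 training-time estimates from the power law T = c * m^gamma.
c, gamma = 1.0, 2.0
m, N = 10_000, 10                       # hypothetical training-set size and class count

T_single = c * m ** gamma               # one SVM trained on the full training set
T_ovr = N * T_single                    # N one-vs-rest SVMs, each on all m examples
K = N * (N - 1) // 2
T_ovo = K * c * (2 * m / N) ** gamma    # K one-vs-one SVMs, each on 2m/N examples

print(f"1-v-r / single SVM = {T_ovr / T_single:.1f}")   # grows linearly with N
print(f"1-v-1 / single SVM = {T_ovo / T_single:.1f}")   # ~2 for gamma = 2, independent of N
```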