Feature Classification Using Support Vector Machines
The support vector machine (SVM) is a classification system based on statistical learning theory (Vapnik, 1995). Support vector machines are binary classifiers, popular for their ability to handle high-dimensional data, and are widely used in feature classification. The technique is said to be independent of the dimensionality of the feature space, since the main idea is to separate the classes with a surface that maximises the margin between them, using boundary pixels to create the decision surface. The data points that are closest to the hyperplane are termed "support vectors". Applying SVMs to any classification problem requires the determination of several user-defined parameters: the choice of a suitable multiclass approach, the choice of an appropriate kernel and its related parameters, the determination of a suitable value of the regularisation parameter (i.e. C), and a suitable optimisation technique.
In the case of a two-class pattern recognition problem in which the classes are linearly separable, the SVM selects, from among the infinite number of linear decision boundaries, the one that minimises the generalisation error. Thus, the selected decision boundary will be one
that leaves the greatest margin between the two classes, where margin is defined as the sum
of the distances to the hyperplane from the closest points of the two classes (Vapnik, 1995).
This problem of maximising the margin can be solved using standard Quadratic
Programming (QP) optimisation techniques. The data points that are closest to the hyperplane
are used to measure the margin; hence these data points are termed ‘support vectors’.
Consider a training data set {(x1, y1), (x2, y2), ..., (xn, yn)}, where the xi are the vectorized training images and yi ∈ {−1, +1} are the labels assigned to each image.
The SVM builds a hyperplane, w^T x − b = 0, that best separates the data points (by the widest margin), where w is the normal to the hyperplane and b is the bias, with b/||w|| being the perpendicular distance from the hyperplane to the origin.
Figure: Hyperplane that best separates the data
For the linearly separable case, the support vector algorithm simply looks for the separating hyperplane with the largest margin.
It does so by minimizing the following objective function:
  minimize   F(w) = (1/2) ||w||^2
  subject to   yi (w^T xi + b) ≥ 1   ∀i
The problem of optimization is simplified by using its dual representation:
  maximize   L(α) = Σi αi − (1/2) Σi Σj αi αj yi yj xi^T xj
  subject to   Σi αi yi = 0,   αi ≥ 0   ∀i
Here αi corresponds to the Lagrange multiplier of the i-th training constraint. The Karush-Kuhn-Tucker (KKT) conditions for the optimum of the constrained function are necessary and sufficient to find the maximum of this equation. The corresponding KKT complementarity conditions are
  αi [ yi (w^T xi + b) − 1 ] = 0   ∀i
The optimal solution is thus given by
  w = Σi αi yi xi
For non-separable data, the above objective function and inequality constraint are modified as:
  minimize   F(w, ξ) = (1/2) ||w||^2 + C Σi ξi
  subject to   yi (w^T xi + b) ≥ 1 − ξi,   ξi ≥ 0   ∀i
and the dual constraints become 0 ≤ αi ≤ C together with Σi αi yi = 0.
Here the ξi are slack variables that allow misclassification of data that are not linearly separable, and C is the penalizing constant.
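As an illustration of this soft-margin formulation, the following minimal sketch (not from the original text; it uses scikit-learn's SVC, which solves the same soft-margin QP, on made-up 2-D data) fits a linear SVM for several values of the penalizing constant C and inspects the support vectors and margin width.

```python
# Minimal sketch of a soft-margin linear SVM on hypothetical 2-D data.
import numpy as np
from sklearn.svm import SVC

rng = np.random.RandomState(0)
X = np.vstack([rng.randn(20, 2) + [2, 2],    # class +1 cluster
               rng.randn(20, 2) - [2, 2]])   # class -1 cluster
y = np.array([1] * 20 + [-1] * 20)

for C in (0.01, 1.0, 100.0):                 # larger C -> fewer margin violations tolerated
    clf = SVC(kernel="linear", C=C).fit(X, y)
    w = clf.coef_[0]                          # normal of the separating hyperplane
    print(f"C={C}: support vectors={len(clf.support_vectors_)}, "
          f"margin width={2.0 / np.linalg.norm(w):.3f}")
```

Larger C penalizes slack more heavily, typically yielding a narrower margin and fewer support vectors on this kind of separable toy data.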
i. Nonlinear Support Vector Machines
If the two classes are not linearly separable, the SVM tries to find the hyperplane that
maximises the margin while, at the same time, minimising a quantity proportional to the
number of misclassification errors. The trade-off between margin and misclassification error
is controlled by a user-defined constant (Cortes and Vapnik, 1995). Training an SVM finds
the large-margin hyperplane, i.e. it sets the parameters αi and b. The SVM has another set of parameters called hyperparameters: the soft-margin constant C and any parameters the kernel function may depend on (e.g. the width of a Gaussian kernel). The SVM can also be extended to handle non-linear decision surfaces. If the input data are not linearly separable in the input space but may be linearly separable in some higher dimensional space, the classification problem can be solved by simply mapping the input data to that higher dimensional space, x → ϕ(x).
Figure: Mapping of input data to higher dimensional data
The SVM performs an implicit mapping of the data into a higher (possibly infinite) dimensional feature space, and then finds a linear separating hyperplane with the maximal margin in this higher dimensional space.
The dual representation is thus given by
  maximize   L(α) = Σi αi − (1/2) Σi Σj αi αj yi yj ϕ(xi)^T ϕ(xj)
  subject to   Σi αi yi = 0,   0 ≤ αi ≤ C   ∀i
The problem with this approach is the very high computational complexity of working in the higher dimensional space. The use of kernel functions eliminates this problem.
A kernel function can be represented as:
  K(xi, xj) = ϕ(xi)^T ϕ(xj)
A number of kernels have been developed so far, but the most popular and promising kernels are:
  K(xi, xj) = xi^T xj                          (Linear kernel)
  K(xi, xj) = exp(−||xi − xj||^2 / (2σ^2))     (Radial basis kernel)
  K(xi, xj) = (1 + xi^T xj)^p                  (Polynomial kernel)
  K(xi, xj) = tanh(a xi^T xj + r)              (Sigmoid kernel)
A new test example x is classified by the following function:
  F(x) = sgn( Σi αi yi K(xi, x) + b )
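To make this decision function concrete, the sketch below (an illustrative setup, not the authors' code) recovers the products αi·yi from a fitted scikit-learn SVC, where they are exposed as dual_coef_, and evaluates sgn(Σi αi yi K(xi, x) + b) by hand.

```python
# Sketch: evaluating F(x) = sgn( sum_i alpha_i * y_i * K(x_i, x) + b ) explicitly.
import numpy as np
from sklearn.svm import SVC
from sklearn.datasets import make_blobs
from sklearn.metrics.pairwise import rbf_kernel

X, y = make_blobs(n_samples=60, centers=2, random_state=1)
y = np.where(y == 0, -1, 1)

gamma = 0.5
clf = SVC(kernel="rbf", gamma=gamma, C=1.0).fit(X, y)

alpha_y = clf.dual_coef_[0]                              # alpha_i * y_i for the support vectors
K = rbf_kernel(X, clf.support_vectors_, gamma=gamma)     # K(x, x_i) for every test point x
manual = np.sign(K @ alpha_y + clf.intercept_[0])

assert np.array_equal(manual, clf.predict(X))            # matches the built-in prediction
print("manual kernel decision function matches SVC.predict")
```

Only the support vectors enter the sum, which is why the trained model needs to store them (and their coefficients) but not the rest of the training set.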
a. The Behaviour of the Sigmoid Kernel
We consider the sigmoid kernel K(xi, xj) = tanh(a xi^T xj + r), which takes two parameters: a and r. For a > 0, we can view a as a scaling parameter of the input data and r as a shifting parameter that controls the threshold of the mapping. For a < 0, the dot product of the input data is not only scaled but reversed.
It can be concluded that the first case, a > 0 and r < 0, is more suitable for the sigmoid kernel.
  a   r   Behaviour
  +   −   K is conditionally positive definite (CPD) once r is sufficiently small; similar to the RBF kernel for small a
  +   +   in general not as good as the (+, −) case
  −   +   the dual objective value goes to −∞ once r is large enough
  −   −   the dual objective value easily goes to −∞
Table 1: Behaviour of the sigmoid kernel for different parameter combinations
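The sketch below (illustrative; the dataset and parameter values are made up) shows how the sigmoid kernel's a and r map onto scikit-learn's gamma and coef0, so the (a > 0, r < 0) region recommended by Table 1 can be compared directly against the (+, +) case.

```python
# Sketch: sigmoid kernel K(xi, xj) = tanh(a * xi^T xj + r).
# In scikit-learn, a corresponds to `gamma` and r to `coef0`.
from sklearn.svm import SVC
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=200, n_features=10, random_state=0)

for a, r in [(0.01, -1.0), (0.01, 1.0)]:     # the (+, -) and (+, +) cases from Table 1
    clf = SVC(kernel="sigmoid", gamma=a, coef0=r, C=1.0)
    score = cross_val_score(clf, X, y, cv=5).mean()
    print(f"a={a}, r={r}: cross-validated accuracy = {score:.3f}")
```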
b. Behaviour of polynomial kernel
The polynomial kernel, K(xi, xj) = (1 + xi^T xj)^p, is a non-stochastic kernel estimate with two parameters: the regularisation constant C and the polynomial degree p. Each data point xi in the training set has an influence on the kernel value at a test point xj, irrespective of its actual distance from xj [14]. It gives good classification accuracy with a minimum number of support vectors and low classification error.
Figure: The effect of the degree of a polynomial kernel.
Higher degree polynomial kernels allow a more flexible decision boundary
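A brief sketch of this effect follows (assumed toy data; in scikit-learn the polynomial kernel is (coef0 + gamma * xi^T xj)^degree, which reduces to the form above with coef0 = 1 and gamma = 1).

```python
# Sketch: polynomial kernel (1 + xi^T xj)^p for several degrees p.
from sklearn.svm import SVC
from sklearn.datasets import make_moons
from sklearn.model_selection import cross_val_score

X, y = make_moons(n_samples=200, noise=0.2, random_state=0)

for p in (1, 2, 3, 5):
    clf = SVC(kernel="poly", degree=p, coef0=1.0, gamma=1.0, C=1.0)
    acc = cross_val_score(clf, X, y, cv=5).mean()
    n_sv = clf.fit(X, y).n_support_.sum()
    print(f"degree={p}: cv accuracy={acc:.3f}, support vectors={n_sv}")
```

Degree 1 recovers an (affine) linear boundary, while higher degrees bend the boundary to follow the data more closely, at the risk of overfitting for very large p.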
c. Gaussian radial basis function
The Gaussian radial basis function kernel, K(xi, xj) = exp(−||xi − xj||^2 / (2σ^2)), deals with data whose conditional probability distribution approaches a Gaussian function. RBF kernels often perform better than the linear and polynomial kernels. However, it is difficult to find the optimum parameter σ and an equivalent C that give the best result for a given problem.
A radial basis function (RBF) is a function of two vectors that depends only on the distance between them, i.e. K(xi, xj) = f(||xi − xj||). The quantity ||xi − xj||^2 may be recognized as the squared Euclidean distance between the two feature vectors. The parameter σ is called the bandwidth.
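As a small check of this definition (a sketch with made-up vectors), the code below computes K(xi, xj) = exp(−||xi − xj||^2 / (2σ^2)) directly and compares it with scikit-learn's rbf_kernel, which is parameterised by gamma = 1/(2σ^2).

```python
# Sketch: the RBF kernel depends only on the squared Euclidean distance.
import numpy as np
from sklearn.metrics.pairwise import rbf_kernel

xi = np.array([1.0, 2.0, 3.0])
xj = np.array([2.0, 0.0, 1.0])
sigma = 1.5                                        # bandwidth

sq_dist = np.sum((xi - xj) ** 2)                   # squared Euclidean distance
k_manual = np.exp(-sq_dist / (2 * sigma ** 2))     # bandwidth parameterisation

gamma = 1.0 / (2 * sigma ** 2)                     # scikit-learn's equivalent parameter
k_sklearn = rbf_kernel(xi.reshape(1, -1), xj.reshape(1, -1), gamma=gamma)[0, 0]

assert np.isclose(k_manual, k_sklearn)
print("K(xi, xj) =", round(k_manual, 4))
```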
Figure: Circled points are support vectors. The two contour lines running through support
vectors are the nonlinear counterparts of the convex hulls. The thick black line is the
classifier. The lines in the image are contour lines of this surface. The classifier runs along
the bottom of the "valley" between the two classes. Smoothness of the contours is controlled
by σ
Kernel parameters also have a significant effect on the decision boundary. The width parameter of the Gaussian kernel controls the flexibility of the resulting classifier.
Figure: The effect of the inverse-width parameter of the Gaussian kernel (γ) for a fixed value of the soft-margin constant (left: gamma = 1; right: gamma = 100). The flexibility of the decision boundary increases with increasing gamma; large values of γ lead to overfitting (right).
Intuitively, the gamma parameter defines how far the influence of a single training example
reaches, with low values meaning ‘far’ and high values meaning ‘close’. The C parameter
trades off misclassification of training examples against simplicity of the decision surface.
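Because C and γ interact, they are usually tuned together. The sketch below (illustrative data and grid, not a prescription from the original text) uses a cross-validated grid search, which is one common way to pick both hyperparameters.

```python
# Sketch: jointly tuning the soft-margin constant C and the RBF inverse-width gamma.
from sklearn.svm import SVC
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=300, n_features=20, random_state=42)

param_grid = {
    "C": [0.1, 1, 10, 100],           # larger C -> less tolerance for misclassification
    "gamma": [0.001, 0.01, 0.1, 1],   # larger gamma -> more flexible boundary
}
search = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=5)
search.fit(X, y)
print("best parameters:", search.best_params_,
      "cv accuracy:", round(search.best_score_, 3))
```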
ii. Multi Class Classification
SVMs are suitable only for binary classification. However, they can be easily extended to a multi-class problem, for example by utilizing Error-Correcting Output Codes. When dealing with multiple classes, an appropriate multi-class method is needed. Vapnik (1995) suggested comparing one class with the others taken together. This strategy generates n classifiers, where n is the number of classes. The final output is the class that corresponds to the SVM with the largest margin, as defined above. For multi-class problems one has to determine n hyperplanes. Thus, this method requires the solution of n QP optimisation problems, each of which separates one class from the remaining classes. A dichotomy is a two-class classifier that learns from data labelled with positive (+), negative (−), or don't care. Given any number of classes, we can re-label them with these three symbols and thus form a dichotomy. Different relabelings result in different two-class problems, each of which is learned independently. A multi-class classifier progresses through every selected dichotomy and chooses the class that is correctly classified by the maximum number of selected dichotomies. Exhaustive dichotomies represent the set of all possible ways of dividing and relabeling the dataset with the three defined symbols. A one-against-all classification scheme on an n-class problem considers n dichotomies, each of which re-labels one class as (+) and all other classes as (−).
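For concreteness, the one-against-all scheme can be sketched as follows (an illustrative implementation with scikit-learn's OneVsRestClassifier, not the authors' code): n binary SVMs are trained and the class whose SVM produces the largest decision value wins.

```python
# Sketch: one-vs-rest (1-v-r) multi-class SVM -- n binary classifiers, argmax of outputs.
import numpy as np
from sklearn.svm import SVC
from sklearn.datasets import load_iris
from sklearn.multiclass import OneVsRestClassifier

X, y = load_iris(return_X_y=True)                 # 3 classes -> 3 binary SVMs

ovr = OneVsRestClassifier(SVC(kernel="linear", C=1.0)).fit(X, y)
scores = ovr.decision_function(X)                 # shape (n_samples, n_classes)
pred = np.argmax(scores, axis=1)                  # class with the largest SVM output

assert np.array_equal(pred, ovr.predict(X))
print("number of binary classifiers:", len(ovr.estimators_))
```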
a. DAG – SVM
The problem of multiclass classification, especially for systems like SVMs, does not have a single obvious solution. The standard method for an N-class problem is to construct N SVMs. The i-th SVM is trained with all of the examples in the i-th class given positive labels and all other examples given negative labels. We refer to SVMs trained in this way as 1-v-r SVMs (short for one-versus-rest). The final output of the 1-v-r SVMs is the class that corresponds to the SVM with the highest output value. Unfortunately, there is no bound on the generalization error for the 1-v-r SVM, and the training time of the standard method scales linearly with N.
Another method for constructing N-class classifiers from SVMs is derived from previous research into combining two-class classifiers. Knerr suggested constructing all possible two-class classifiers from a training set of N classes, each classifier being trained on only two out of the N classes. There would thus be K = N(N−1)/2 classifiers. When applied to SVMs, we refer to this as 1-v-1 SVMs (short for one-versus-one).
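The sketch below (illustrative; it relies on scikit-learn's OneVsOneClassifier) constructs the K = N(N−1)/2 pairwise classifiers described here and confirms the count.

```python
# Sketch: one-vs-one (1-v-1) SVMs -- K = N(N-1)/2 pairwise classifiers.
from sklearn.svm import SVC
from sklearn.datasets import load_digits
from sklearn.multiclass import OneVsOneClassifier

X, y = load_digits(return_X_y=True)               # N = 10 classes
N = len(set(y))

ovo = OneVsOneClassifier(SVC(kernel="linear", C=1.0)).fit(X, y)
assert len(ovo.estimators_) == N * (N - 1) // 2   # 45 pairwise SVMs for 10 classes
print("pairwise classifiers:", len(ovo.estimators_))
```

Each pairwise SVM sees only the examples of its two classes, which is why the per-classifier training sets are so much smaller than in the one-versus-rest scheme.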
A Directed Acyclic Graph (DAG) is a graph whose edges have an orientation and no cycles. A Rooted DAG has a unique node such that it is the only node with no arcs pointing into it. A Rooted Binary DAG has nodes which have either 0 or 2 arcs leaving them. We will use Rooted Binary DAGs in order to define a class of functions to be used in classification tasks. The class of functions computed by Rooted Binary DAGs is formally defined as follows.
Definition 1: Decision DAGs (DDAGs).
Given a space X and a set of Boolean functions F = {f: X → {0, 1}}, the class DDAG(F) of Decision DAGs on N classes over F are functions which can be implemented using a rooted binary DAG with N leaves labelled by the classes, where each of the K = N(N−1)/2 internal nodes is labelled with an element of F. The nodes are arranged in a triangle with the single root node at the top, two nodes in the second layer, and so on until the final layer of N leaves. The i-th node in layer j < N is connected to the i-th and (i+1)-st nodes in the (j+1)-st layer.
To evaluate a particular DDAG on an input x ∈ X, starting at the root node, the binary function at a node is evaluated. The node is then exited via the left edge if the binary function is zero, or the right edge if the binary function is one. The next node's binary function is then evaluated. The value of the decision function D(x) is the value associated with the final leaf node. The path taken through the DDAG is known as the evaluation path. The input x reaches a node of the graph if that node is on the evaluation path for x. We refer to the decision node distinguishing classes i and j as the ij-node. Assuming that the number of a leaf is its class, this node is the i-th node in the (N−j+1)-th layer, provided i < j. Similarly, the j-nodes are those nodes involving class j, that is, the internal nodes on the two diagonals containing the leaf labelled by j.
The DDAG is equivalent to operating on a list, where each node eliminates one class from the list. The list is initialized with all classes. A test point is evaluated against the decision node that corresponds to the first and last elements of the list.
If the node prefers one of the two classes, the other class is eliminated from the list, and the DDAG proceeds to test the first and last elements of the new list. The DDAG terminates when only one class remains in the list. Thus, for a problem with N classes, N−1 decision nodes will be evaluated in order to derive an answer.
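This list-elimination procedure can be written down directly. The sketch below is a hypothetical implementation (the pairwise SVMs are trained with plain scikit-learn, and ddag_predict is an illustrative helper, not part of any library): it evaluates a DDAG over N classes using N−1 pairwise decisions per test point.

```python
# Sketch: DDAG evaluation -- keep a list of candidate classes and let the
# (first, last) pairwise classifier eliminate one class per step.
import numpy as np
from itertools import combinations
from sklearn.svm import SVC
from sklearn.datasets import load_iris

X, y = load_iris(return_X_y=True)
classes = sorted(set(y))

# Train one 1-v-1 SVM for every pair of classes.
pairwise = {}
for i, j in combinations(classes, 2):
    mask = np.isin(y, [i, j])
    pairwise[(i, j)] = SVC(kernel="linear", C=1.0).fit(X[mask], y[mask])

def ddag_predict(x):
    remaining = list(classes)
    while len(remaining) > 1:
        i, j = remaining[0], remaining[-1]             # decision node for (first, last)
        winner = pairwise[(min(i, j), max(i, j))].predict(x.reshape(1, -1))[0]
        remaining.remove(j if winner == i else i)      # discard the losing class
    return remaining[0]                                 # N-1 decisions were made

preds = np.array([ddag_predict(x) for x in X])
print("training accuracy of the DDAG:", round((preds == y).mean(), 3))
```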
The current state of the list is the total state of the system. Therefore, since a list state is reachable by more than one possible path through the system, the decision graph the algorithm traverses is a DAG, not simply a tree.
The DAGSVM [8] separates the individual classes with large margin. It is safe to discard the losing class at each 1-v-1 decision because, for the hard-margin case, all of the examples of the losing class are far away from the decision surface. The DAGSVM algorithm is superior to other multiclass SVM algorithms in both training and evaluation time. Empirically, SVM training is observed to scale super-linearly with the training set size m, according to a power law:
  T = c m^γ
where γ ≈ 2 for algorithms based on the decomposition method, with some proportionality constant c. For the standard 1-v-r multiclass SVM training algorithm, the entire training set is used to create all N classifiers.
Figure: The Decision DAG for finding the best class out of four classes
Hence the training time for 1-v-r is
  T(1-v-r) = c N m^γ
Assuming that the classes have the same number of examples, training each 1-v-1 SVM only requires 2m/N training examples. Thus, training all K 1-v-1 SVMs would require
  T(1-v-1) = c K (2m/N)^γ ≈ 2^(γ−1) c N^(2−γ) m^γ
For a typical case, where γ = 2, the amount of time required to train all of the 1-v-1 SVMs is independent of N, and is only about twice that of training a single 1-v-r SVM. Using 1-v-1 SVMs with a combination algorithm is thus preferred for training time.
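As a quick numerical check of this claim (a sketch; the constant c, the training-set size, and the class count are made up), the formulas above can be evaluated for γ = 2.

```python
# Sketch: comparing 1-v-r and 1-v-1 training-time estimates from the power law T = c * m^gamma.
c, gamma = 1.0, 2.0
m, N = 10_000, 10                       # hypothetical training-set size and class count

T_single = c * m ** gamma               # one SVM trained on the full training set
T_ovr = N * T_single                    # N one-vs-rest SVMs, each on all m examples
K = N * (N - 1) // 2
T_ovo = K * c * (2 * m / N) ** gamma    # K one-vs-one SVMs, each on 2m/N examples

print(f"1-v-r / single SVM = {T_ovr / T_single:.1f}")   # grows linearly with N
print(f"1-v-1 / single SVM = {T_ovo / T_single:.1f}")   # ~2 for gamma = 2, independent of N
```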