16. Perceptron learning rule
One way to learn an acceptable weight vector is to begin
with random weights, then iteratively apply the perceptron
to each training example, modifying the perceptron weights
whenever it misclassifies an example.
This process is repeated, iterating through the training
examples as many times as needed until the perceptron
classifies all training examples correctly.
17. Weights are modified at each step according to the
perceptron training rule, which revises the weight wi
associated with input xi according to the rule
wi ← wi + Δwi, where Δwi = η(t − o)xi
Here, t is the target output for the current training
example, o is the output generated by the perceptron, and
η is a positive constant called the learning rate.
The role of the learning rate is to moderate the degree to
which weights are changed at each step.
It is usually set to some small value (e.g., 0.1).
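A minimal NumPy sketch of this procedure (function and variable names are illustrative; targets are taken in {-1, +1}, with a constant input x0 = 1 standing in for the bias):
```python
import numpy as np

def train_perceptron(X, t, eta=0.1, max_epochs=100):
    """Perceptron training rule: w_i <- w_i + eta * (t - o) * x_i.

    X: (n_examples, n_features) inputs; t: targets in {-1, +1}.
    A constant input x_0 = 1 is prepended so w[0] acts as the bias.
    """
    X = np.hstack([np.ones((X.shape[0], 1)), X])    # prepend x_0 = 1
    w = np.random.uniform(-0.05, 0.05, X.shape[1])  # small random weights
    for _ in range(max_epochs):
        errors = 0
        for x, target in zip(X, t):
            o = 1 if w @ x > 0 else -1              # thresholded output
            if o != target:                         # update only on mistakes
                w += eta * (target - o) * x
                errors += 1
        if errors == 0:                             # all examples classified correctly
            break
    return w

# Example: learn the (linearly separable) AND function with +/-1 labels.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
t = np.array([-1, -1, -1, 1])
w = train_perceptron(X, t)
```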
18. Gradient Descent and the Delta Rule
Although the perceptron rule finds a successful weight vector
when the training examples are linearly separable, it can fail
to converge if the examples are not linearly separable.
The delta rule is designed to overcome this difficulty.
If the training examples are not linearly separable, the delta
rule converges toward a best-fit approximation to the target
concept.
19. The key idea behind the delta rule is to use gradient descent
to search the hypothesis space of possible weight vectors to
find the weights that best fit the training examples.
The delta training rule is best understood by considering the
task of training an unthresholded perceptron; that is, a
linear unit for which the output o is given by
o = w·x
20. Training error
E(w) = ½ Σd∈D (td − od)²
where D is the set of training examples, td is the target
output for training example d, and od is the output of the
linear unit for training example d.
21. Since the gradient ∇E(w) specifies the direction of steepest increase
of E, the training rule for gradient descent is
w ← w + Δw, where Δw = −η ∇E(w)
22. Here η (the learning rate) is a positive constant which determines the
step size in the gradient descent search.
The negative sign is present because we want to move the
weight vector in the direction that decreases E.
This training rule can also be written in its component
form:
wi ← wi + Δwi, where Δwi = −η ∂E/∂wi
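A minimal batch gradient descent sketch for this linear unit (names are illustrative; the bias is again folded in as x0 = 1):
```python
import numpy as np

def gradient_descent_linear_unit(X, t, eta=0.01, epochs=1000):
    """Batch gradient descent on E(w) = 1/2 * sum_d (t_d - o_d)^2
    for a linear unit o = w . x (bias handled via x_0 = 1)."""
    X = np.hstack([np.ones((X.shape[0], 1)), X])
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        o = X @ w              # outputs for all training examples
        grad_E = -(t - o) @ X  # dE/dw_i = -sum_d (t_d - o_d) * x_id
        w -= eta * grad_E      # step against the gradient to decrease E
    return w
```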
28. Each layer is made up of units.
The inputs to the network correspond to the attributes
measured for each training tuple.
The inputs are fed simultaneously into the units making up
the input layer.
These inputs pass through the input layer and are then
weighted and fed simultaneously to a second layer of
“neuronlike” units, known as a hidden layer.
The outputs of the hidden layer units can be input to another
hidden layer, and so on.
The weighted outputs of the last hidden layer are input to
units making up the output layer, which emits the
network's prediction for given tuples.
29. The units in the input layer are called input units.
The units in the hidden layers and output layer are sometimes
referred to as neurodes, due to their symbolic biological
basis, or as output units.
A network containing two hidden layers is called a three-
layer neural network, and so on.
It is a feed-forward network since none of the weights cycles
back to an input unit or to a previous layer's output unit.
It is fully connected in that each unit provides input to
each unit in the next forward layer.
https://www.sciencedirect.com/topics/computer-science/backpropagation-algorithm
30. Each output unit takes, as input, a weighted sum of the
outputs from units in the previous layer.
It applies a nonlinear (activation) function to the weighted
input.
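As a concrete illustration of the forward pass described above, a minimal NumPy sketch of a fully connected feed-forward network (the helper names and the 3-4-2 shape are illustrative):
```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, layers):
    """Propagate one input tuple through a fully connected
    feed-forward network. `layers` is a list of (W, b) pairs,
    W of shape (n_out, n_in); each unit applies a nonlinear
    activation to its weighted input."""
    a = x
    for W, b in layers:
        a = sigmoid(W @ a + b)  # weighted sum, then activation
    return a

# A 3-4-2 network: 3 inputs, one hidden layer of 4 units, 2 output units.
rng = np.random.default_rng(0)
layers = [(rng.normal(size=(4, 3)), np.zeros(4)),
          (rng.normal(size=(2, 4)), np.zeros(2))]
y = forward(np.array([0.5, -1.0, 2.0]), layers)
```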
31. Compute the number of parameters for the given network.
32. The network has 4 + 2 = 6 neurons (not counting the
inputs), [3 x 4] + [4 x 2] = 20 weights and 4 + 2 = 6 biases,
for a total of 26 learnable parameters.
33. Compute the number of parameters for the given network.
34. The network has 4 + 4 + 1 = 9 neurons (not counting
inputs), [3 x 4] + [4 x 4] + [4 x 1] = 12 + 16 + 4 = 32
weights and 4 + 4 + 1 = 9 biases, for a total of 41 learnable
parameters.
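The same counting generalizes to any layer sizes; a small sketch (the function name is illustrative):
```python
def count_parameters(layer_sizes):
    """Count learnable parameters of a fully connected feed-forward net.
    layer_sizes = [inputs, hidden_1, ..., output]."""
    weights = sum(a * b for a, b in zip(layer_sizes, layer_sizes[1:]))
    biases = sum(layer_sizes[1:])  # one bias per non-input unit
    return weights + biases

print(count_parameters([3, 4, 2]))     # 20 weights + 6 biases = 26
print(count_parameters([3, 4, 4, 1]))  # 32 weights + 9 biases = 41
```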
38. Sigmoid function
Sigmoid outputs are not zero centered.
If the activation function of the network is not zero centered, the
output y = f(w·x) is always positive or always negative.
Thus, the output of a layer is always shifted toward either
positive or negative values.
As a result, the weight vector needs more updates to be trained
properly.
39. Tanh vs Sigmoid
The tanh function is a stretched and shifted version of the
sigmoid.
Both sigmoid and tanh functions belong to the S-
like functions that suppress the input value to a
bounded range.
This helps the network to keep its weights bounded and
prevents the exploding gradient problem where the value of
the gradients becomes very large.
https://www.baeldung.com/cs/sigmoid-vs-tanh-functions
40. The gradient of tanh is four times greater than the gradient of
the sigmoid function: since tanh(x) = 2·sigmoid(2x) − 1, it follows
that tanh′(x) = 4·sigmoid′(2x).
41. This means that using the tanh activation function results in
higher values of gradient during training and higher updates
in the weights of the network.
So, if we want strong gradients and big learning
steps, we should use the tanh activation function.
Another difference is that the output of tanh is
symmetric around zero leading to faster
convergence.
42. The output of tanh ranges from -1 to 1 and has equal
mass on both sides of the zero axis, so it is a zero-centered
function.
Thus, tanh overcomes the non-zero-centered issue of the
logistic activation function.
Hence optimization becomes comparatively easier than with the
logistic function, and tanh is generally preferred over it.
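A quick numeric check of these claims, as a minimal NumPy sketch (names are illustrative):
```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

z = np.linspace(-5, 5, 11)
d_sigmoid = sigmoid(z) * (1 - sigmoid(z))  # peaks at 0.25 when z = 0
d_tanh = 1 - np.tanh(z) ** 2               # peaks at 1.0 when z = 0

print(d_sigmoid.max(), d_tanh.max())  # ~0.25 vs 1.0: tanh gradient is 4x larger
# tanh is zero centered (outputs in (-1, 1)); sigmoid outputs stay in (0, 1).
print(np.tanh(z).min(), np.tanh(z).max(), sigmoid(z).min(), sigmoid(z).max())
```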
43. Comparison with ReLU
Sigmoid and tanh functions suffer from vanishing gradient
problem.
It is encountered while training artificial neural
networks with gradient-based learning
methods and backpropagation.
In such methods, during each iteration of training each of the
neural network's weights receives an update proportional to
the partial derivative of the error function with respect to the
current weight.
The problem is that in some cases, the gradient will be vanishingly
small, effectively preventing the weight from changing its value.
In the worst case, this may completely stop the neural network
from further training.
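A rough back-of-the-envelope illustration: the maximum derivative of the sigmoid is 0.25, so a product of such factors across many layers shrinks exponentially (a simplified sketch that ignores the weight terms also entering the product):
```python
# In backpropagation, the gradient reaching an early layer includes a
# product of activation derivatives from all later layers. With sigmoid,
# each factor is at most 0.25, so the product shrinks exponentially.
max_sigmoid_grad = 0.25
for depth in [2, 5, 10, 20]:
    print(depth, max_sigmoid_grad ** depth)
# depth 10 -> ~9.5e-07: updates to early-layer weights become vanishingly small
```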
46. A feedforward phase - where an input vector is applied and
the signal propagates through the network layers, modified
by the current weights and biases and by the
nonlinear activation functions.
Corresponding output values then emerge, and these can be
compared with the target outputs for the given input vector
using a loss function.
A feedback phase - the error signal is then fed back
(backpropagated) through the network layers to modify the
weights in a way that minimizes the error across the entire
training set, effectively minimizing the error surface in
weight-space.
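A minimal NumPy sketch of one such training step for a single-hidden-layer network with sigmoid units and squared-error loss (names and shapes are illustrative, not the only formulation):
```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_step(x, target, W1, b1, W2, b2, eta=0.5):
    """One feedforward + feedback (backpropagation) step."""
    # Feedforward phase: signal propagates through weights and activations.
    h = sigmoid(W1 @ x + b1)  # hidden layer outputs
    o = sigmoid(W2 @ h + b2)  # network outputs
    # Loss: E = 1/2 * ||target - o||^2, compared against the target outputs.
    # Feedback phase: error signal propagated back to modify the weights.
    delta_o = (o - target) * o * (1 - o)      # output-layer error term
    delta_h = (W2.T @ delta_o) * h * (1 - h)  # hidden-layer error term
    W2 -= eta * np.outer(delta_o, h); b2 -= eta * delta_o
    W1 -= eta * np.outer(delta_h, x); b1 -= eta * delta_h
    return 0.5 * np.sum((target - o) ** 2)
```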
54. A learned classifier (hyperplane) achieves maximum
separation between the classes.
The two planes parallel to the classifier and which pass
through one or more points in the dataset are called
bounding planes.
The distance between these bounding planes is called
margin.
By SVM learning, we mean finding a hyperplane which
maximizes this margin.
56. Linearly Separable SVM
The optimal hyperplane is given by
w.x + b = 0
where w={w1, w2, …, wn} is a weight vector and b a scalar (bias).
https://link.springer.com/content/pdf/10.1007/BF00994018.pdf
57. Maximum Margin
The distance between a point P(x0, y0, z0) and a plane
Ax + By + Cz + D = 0 is given by
|Ax0 + By0 + Cz0 + D| / √(A² + B² + C²).
Here we have two bounding planes,
w·x + b = 1 and w·x + b = −1
58. Distance of the bounding hyperplane w·x + b = 1 from the origin
= |1 − b| / ||w||
Distance of the bounding hyperplane w·x + b = −1 from the origin
= |−1 − b| / ||w||
Distance between the planes (which needs to be maximized)
= 2 / ||w||
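A quick numeric check of these distances, as a small NumPy sketch (the weight vector and bias are arbitrary illustration values):
```python
import numpy as np

def plane_distance(p, w, c):
    """Distance from point p to the hyperplane w . x + c = 0,
    i.e. |w . p + c| / ||w|| (the formula from the previous slide)."""
    return abs(w @ p + c) / np.linalg.norm(w)

w, b = np.array([3.0, 4.0]), 0.5  # ||w|| = 5
origin = np.zeros(2)
d_plus = plane_distance(origin, w, b - 1)   # plane w.x + b = 1  -> |1 - b| / ||w||
d_minus = plane_distance(origin, w, b + 1)  # plane w.x + b = -1 -> |-1 - b| / ||w||
# The two distances sum to the margin 2 / ||w|| = 0.4 here.
print(d_plus, d_minus, d_plus + d_minus, 2 / np.linalg.norm(w))
```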
59. Mathematics behind SVM
For the training data to be linearly separable:
w·xi + b ≥ 1, if yi = 1
w·xi + b ≤ −1, if yi = −1
Or,
yi(w·xi + b) ≥ 1, i = 1, 2, ..., n
60. Vectors xi for which yi(w·xi + b) = 1 (points which fall on
the bounding planes) are termed support vectors.
69. Linearly Separable SVM
The optimal hyperplane is given by
w·x + b = 0
where w = {w1, w2, …, wn} is a weight vector and b a scalar (bias).
The linear decision function I(x) is then given by
I(x) = sign(w·x + b)
https://link.springer.com/content/pdf/10.1007/BF00994018.pdf
74. Here, C is a hyperparameter that decides the trade-off
between maximizing the margin and minimizing the
mistakes.
When C is small, classification mistakes are given less
importance and focus is more on maximizing the margin,
whereas when C is large, the focus is more on avoiding
misclassification at the expense of keeping the margin small.
https://towardsdatascience.com/support-vector-machines-soft-margin-formulation-and-kernel-trick-4c9729dc8efe
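A small illustration of the trade-off, assuming scikit-learn is available (the data and C values are arbitrary):
```python
import numpy as np
from sklearn.svm import SVC

# Two overlapping Gaussian blobs, so some mistakes are unavoidable.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-1, 1, (50, 2)), rng.normal(1, 1, (50, 2))])
y = np.array([0] * 50 + [1] * 50)

for C in [0.01, 1.0, 100.0]:
    clf = SVC(kernel="linear", C=C).fit(X, y)
    # Small C -> mistakes weigh little, wider margin (2 / ||w||);
    # large C -> mistakes weigh heavily, narrower margin.
    print(C, clf.n_support_, 2 / np.linalg.norm(clf.coef_))
```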
91. Kernel Functions
Kernel functions are generalized functions that take two
vectors (of any dimension) as input and output a score that
denotes how similar the input vectors are.
An example is the dot product function: if the dot product is
small, we conclude that vectors are different and if the dot
product is large, we conclude that vectors are more similar.
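Two common examples in a minimal sketch (function names are illustrative):
```python
import numpy as np

def linear_kernel(u, v):
    return u @ v  # large when the vectors point in similar directions

def rbf_kernel(u, v, gamma=1.0):
    # Similarity in (0, 1]: equals 1 when u == v, decays with distance.
    return np.exp(-gamma * np.sum((u - v) ** 2))

a, b = np.array([1.0, 2.0]), np.array([1.1, 1.9])
print(linear_kernel(a, b), rbf_kernel(a, b))  # both report high similarity
```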
101. Kernel Trick
We can use any fancy kernel function in place of the dot product,
one that has the capability of measuring similarity in higher
dimensions (where it could be more accurate), without
increasing the computational costs much.
This is essentially known as the Kernel Trick.
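A sketch of why this works, using the degree-2 polynomial kernel (u·v)², whose value equals the dot product of explicit quadratic feature maps (the feature map phi shown is one standard choice):
```python
import numpy as np

def phi(x):
    """Explicit degree-2 feature map for 2-D input."""
    return np.array([x[0]**2, np.sqrt(2) * x[0] * x[1], x[1]**2])

def poly_kernel(u, v):
    """Computes the same similarity without ever forming phi(x)."""
    return (u @ v) ** 2

u, v = np.array([1.0, 2.0]), np.array([3.0, 0.5])
print(phi(u) @ phi(v))    # 16.0 -- dot product in the higher-dim space
print(poly_kernel(u, v))  # 16.0 -- identical score, at lower cost
```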
105. Why is SVM Effective on High Dimensional Data?
The complexity of the trained classifier is characterized by the number
of support vectors rather than the dimensionality of the data.
The support vectors are the essential or critical training examples —
they lie closest to the decision boundary (Maximum Margin
Hyperplane).
Thus, an SVM with a small number of support vectors can have good
generalization, even when the dimensionality of the data is high.
106. References
Dunham, M. H., "Data Mining: Introductory and Advanced
Topics", Pearson Education, New Delhi, 2003.
Jiawei Han, Micheline Kamber, "Data Mining: Concepts and
Techniques", Elsevier, 2006.
K. P. Soman, Shyam Diwakar, V. Ajay, "Insight into Data Mining:
Theory and Practice", PHI Pvt. Ltd., New Delhi, 2008.
https://hanj.cs.illinois.edu/bk3/bk3_slidesindex.htm