Module-3
(Neural Networks (NN) and Support Vector
Machines (SVM))
 Perceptron, Neural Network - Multilayer feed forward
network, Activation functions (Sigmoid, ReLU, Tanh),
Backpropagation algorithm.
 SVM - Introduction, Maximum Margin Classification,
Mathematics behind Maximum Margin Classification,
Maximum Margin linear separators, soft margin SVM
classifier, non-linear SVM, Kernels for learning non-linear
functions, polynomial kernel, Radial Basis Function (RBF).
Biological Neuron
Artificial Neuron
Perceptron
 The output of the perceptron can also be expressed as a dot
product: o(x) = sgn(w.x), where w.x = Σi wi xi
Net input function
Activation function
https://cs231n.github.io/neural-networks-1/
https://towardsdatascience.com/perceptron-the-artificial-neuron-4d8c70d5cc8d
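A minimal NumPy sketch of the net input (dot product) and threshold activation described above; the function and variable names are illustrative, not taken from the slides.

import numpy as np

def perceptron_output(w, x, b=0.0):
    # net input function: weighted sum of the inputs (dot product) plus bias
    net = np.dot(w, x) + b
    # activation function: threshold (sign) unit
    return 1 if net > 0 else -1

w = np.array([0.5, -0.4])
x = np.array([1.0, 2.0])
print(perceptron_output(w, x))   # net = 0.5 - 0.8 = -0.3, so the output is -1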
Perceptron learning rule
 One way to learn an acceptable weight vector is to begin
with random weights, then iteratively apply the perceptron
to each training example, modifying the perceptron weights
whenever it misclassifies an example.
 This process is repeated, iterating through the training
examples as many times as needed until the perceptron
classifies all training examples correctly.
 Weights are modified at each step according to the
perceptron training rule, which revises the weight wi
associated with input xi according to the rule
wi ← wi + Δwi, where Δwi = η(t - o) xi
 Here, t is the target output for the current training
example, o is the output generated by the perceptron, and
η is a positive constant called the learning rate.
 The role of the learning rate is to moderate the degree to
which weights are changed at each step.
 It is usually set to some small value (e.g., 0.1).
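A minimal sketch of the perceptron training rule Δwi = η(t - o)xi applied example by example; the toy data (the AND function with ±1 targets) and the names are illustrative assumptions.

import numpy as np

def train_perceptron(X, t, eta=0.1, epochs=100):
    # begin with zero weights, iterate over the training examples, and update
    # only when the perceptron misclassifies an example
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(epochs):
        errors = 0
        for x, target in zip(X, t):
            o = 1 if np.dot(w, x) + b > 0 else -1   # current perceptron output
            if o != target:                         # misclassified example
                w += eta * (target - o) * x         # w_i <- w_i + eta * (t - o) * x_i
                b += eta * (target - o)
                errors += 1
        if errors == 0:                             # every example classified correctly
            break
    return w, b

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
t = np.array([-1, -1, -1, 1])                       # AND, which is linearly separable
print(train_perceptron(X, t))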
Gradient Descent and the Delta Rule
 Although the perceptron rule finds a successful weight vector
when the training examples are linearly separable, it can fail
to converge if the examples are not linearly separable.
 The delta rule is designed to overcome this difficulty.
 If the training examples are not linearly separable, the delta
rule converges toward a best-fit approximation to the target
concept.
 The key idea behind the delta rule is to use gradient descent
to search the hypothesis space of possible weight vectors to
find the weights that best fit the training examples.
 The delta training rule is best understood by considering the
task of training an unthresholded perceptron; that is, a
linear unit for which the output o is given by o(x) = w.x
 Training error: E(w) = (1/2) Σd∈D (td - od)²
 where D is the set of training examples, td is the target
output for training example d, and od is the output of the
linear unit for training example d.
 Since the gradient specifies the direction of steepest increase
of E, the training rule for gradient descent is
w ← w + Δw, where Δw = -η ∇E(w)
 Here the learning rate η is a positive constant, which determines the
step size in the gradient descent search.
 The negative sign is present because we want to move the
weight vector in the direction that decreases E.
 This training rule can also be written in its component
form: Δwi = -η ∂E/∂wi
 That is, ∂E/∂wi = Σd∈D (td - od)(-xid)
 Therefore, Δwi = η Σd∈D (td - od) xid
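A minimal sketch of one batch gradient-descent step for the linear unit o = w.x with the squared-error E defined above; the helper name and the toy data are illustrative.

import numpy as np

def gradient_descent_step(w, X, t, eta=0.01):
    o = X @ w                    # linear-unit outputs o_d = w . x_d
    grad = -X.T @ (t - o)        # dE/dw_i = sum_d (t_d - o_d)(-x_id)
    return w - eta * grad        # step against the gradient to decrease E

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
t = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=50)   # noisy linear targets
w = np.zeros(3)
for _ in range(200):
    w = gradient_descent_step(w, X, t)
print(w)   # should approach [1.0, -2.0, 0.5]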
Multilayer Feed Forward Network
Feed forward neural network
 Each layer is made up of units.
 The inputs to the network correspond to the attributes
measured for each training tuple.
 The inputs are fed simultaneously into the units making up
the input layer.
 These inputs pass through the input layer and are then
weighted and fed simultaneously to a second layer of
“neuronlike” units, known as a hidden layer.
 The outputs of the hidden layer units can be input to another
hidden layer, and so on.
 The weighted outputs of the last hidden layer are input to
units making up the output layer, which emits the
network's prediction for given tuples.
 The units in the input layer are called input units.
 The units in the hidden layers and output layer are sometimes
referred to as neurodes, due to their symbolic biological
basis, or as output units.
 A network containing two hidden layers is called a three-
layer neural network, and so on.
 It is a feed-forward network since none of the weights cycles
back to an input unit or to a previous layer's output unit.
 It is fully connected in that each unit provides input to
each unit in the next forward layer.
https://www.sciencedirect.com/topics/computer-science/backpropagation-algorithm
 Each output unit takes, as input, a weighted sum of the
outputs from units in the previous layer.
 It applies a nonlinear (activation) function to the weighted
input.
 Compute the number of parameters for the given network.
 The network has 4 + 2 = 6 neurons (not counting the
inputs), [3 x 4] + [4 x 2] = 20 weights and 4 + 2 = 6 biases,
for a total of 26 learnable parameters.
 Compute the number of parameters for the given network.
 The network has 4 + 4 + 1 = 9 neurons (not counting
inputs), [3 x 4] + [4 x 4] + [4 x 1] = 12 + 16 + 4 = 32
weights and 4 + 4 + 1 = 9 biases, for a total of 41 learnable
parameters.
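A small helper that reproduces both counts for a fully connected feed-forward network (input layer excluded); the function name is illustrative.

def count_parameters(layer_sizes):
    # layer_sizes includes the input layer, e.g. [3, 4, 2]
    weights = sum(n_in * n_out for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:]))
    biases = sum(layer_sizes[1:])
    return weights + biases

print(count_parameters([3, 4, 2]))     # 20 weights + 6 biases = 26
print(count_parameters([3, 4, 4, 1]))  # 32 weights + 9 biases = 41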
Sigmoid function
ReLU Function
Tanh Function
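Minimal NumPy definitions of the three activation functions named above (a sketch, not the slides' exact notation).

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))    # squashes to (0, 1)

def relu(z):
    return np.maximum(0.0, z)          # 0 for negative inputs, identity for positive

def tanh(z):
    return np.tanh(z)                  # squashes to (-1, 1), zero centered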
Sigmoid function
 Sigmoid outputs are not zero centered.
 If the activation function of the network is not zero centered, the output
y = f(w.x + b) is always positive or always negative.
 Thus, the output of a layer is always pushed towards either the
positive values or the negative values.
 As a result, the weight vector needs more updates to be trained
properly.
Tanh vs Sigmoid
 The tanh function is a stretched and shifted version of the
sigmoid: tanh(z) = 2σ(2z) - 1.
 Both sigmoid and tanh functions belong to the S-
like functions that suppress the input value to a
bounded range.
 This helps the network to keep its weights bounded and
prevents the exploding gradient problem where the value of
the gradients becomes very large.
https://www.baeldung.com/cs/sigmoid-vs-tanh-functions
 The gradient of tanh is four times greater than the gradient of
the sigmoid function at the origin (tanh′(0) = 1, while σ′(0) = 0.25).
 This means that using the tanh activation function results in
higher values of gradient during training and higher updates
in the weights of the network.
 So, if we want strong gradients and big learning
steps, we should use the tanh activation function.
 Another difference is that the output of tanh is
symmetric around zero leading to faster
convergence.
 The output of tanh ranges from -1 to 1 and has equal
mass on both sides of the zero axis, so it is a zero-centered
function.
 So, tanh overcomes the non-zero-centered issue of the
logistic activation function.
 Hence optimization becomes comparatively easier than with the
logistic function, and tanh is generally preferred over it.
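A quick numeric check of the gradient comparison at z = 0 (an illustrative spot check):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

sigmoid_grad_at_0 = sigmoid(0.0) * (1.0 - sigmoid(0.0))   # 0.25
tanh_grad_at_0 = 1.0 - np.tanh(0.0) ** 2                  # 1.0
print(sigmoid_grad_at_0, tanh_grad_at_0)                  # tanh's gradient is 4x larger at the origin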
Comparison with ReLU
 Sigmoid and tanh functions suffer from vanishing gradient
problem.
 It is encountered while training artificial neural
networks with gradient-based learning
methods and backpropagation.
 In such methods, during each iteration of training each of the
neural network's weights receives an update proportional to
the partial derivative of the error function with respect to the
current weight.
 The problem is that in some cases, the gradient will be vanishingly
small, effectively preventing the weight from changing its value.
 In the worst case, this may completely stop the neural network
from further training.
 ReLU activation function can fix the vanishing gradient
problem.
Backpropagation
 A feedforward phase - where an input vector is applied and
the signal propagates through the network layers, modified
by the current weights and biases and by the
nonlinear activation functions.
 Corresponding output values then emerge, and these can be
compared with the target outputs for the given input vector
using a loss function.
 A feedback phase - the error signal is then fed back
(backpropagated) through the network layers to modify the
weights in a way that minimizes the error across the entire
training set, effectively minimizing the error surface in
weight-space.
Backpropagation Algorithm
(Stochastic gradient descent version)
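A minimal NumPy sketch of one stochastic-gradient-descent backpropagation update for a single-hidden-layer sigmoid network with squared-error loss; the names, shapes, and initial values are illustrative assumptions, not the slides' exact algorithm listing.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def backprop_update(x, t, W1, b1, W2, b2, eta=0.1):
    # feedforward phase
    h = sigmoid(W1 @ x + b1)                    # hidden activations
    o = sigmoid(W2 @ h + b2)                    # network outputs
    # feedback (backpropagation) phase
    delta_o = o * (1 - o) * (t - o)             # output-unit error terms
    delta_h = h * (1 - h) * (W2.T @ delta_o)    # hidden-unit error terms
    # weight updates: w <- w + eta * delta * input
    W2 += eta * np.outer(delta_o, h)
    b2 += eta * delta_o
    W1 += eta * np.outer(delta_h, x)
    b1 += eta * delta_h
    return W1, b1, W2, b2

rng = np.random.default_rng(0)
W1, b1 = rng.normal(scale=0.5, size=(2, 2)), np.zeros(2)
W2, b2 = rng.normal(scale=0.5, size=(1, 2)), np.zeros(1)
x, t = np.array([1.0, 0.0]), np.array([1.0])    # one training example
W1, b1, W2, b2 = backprop_update(x, t, W1, b1, W2, b2)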
 Determine the number of trainable parameters of the
following neural net:
 Input layer: 4 units.
 Hidden layer 1: 16 units.
 Hidden layer 2: 8 units.
 Hidden layer 3: 4 units.
 Output layer: 2 units.
262 trainable parameters.
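Worked out layer by layer: (4×16 + 16) + (16×8 + 8) + (8×4 + 4) + (4×2 + 2) = 80 + 136 + 36 + 10 = 262.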
Support Vector Machine
(Developed at AT&T Bell Laboratories by Vladimir Vapnik with colleagues in 1995)
Support Vector Machines—General Philosophy
Support vectors: small margin vs. large margin separating hyperplanes
http://image.diku.dk/imagecanon/material/cortes_vapnik95.pdf
 A learned classifier (hyperplane) achieves maximum
separation between the classes.
 The two planes parallel to the classifier and which pass
through one or more points in the dataset are called
bounding planes.
 The distance between these bounding planes is called
margin.
 By SVM learning, we mean finding a hyperplane which
maximizes this margin.
https://towardsdatascience.com/support-vector-machines-dual-formulation-quadratic-programming-sequential-minimal-optimization-57f4387ce4dd
Linearly Separable SVM
 The optimal hyperplane is given by
w.x + b = 0
where w={w1, w2, …, wn} is a weight vector and b a scalar (bias).
https://link.springer.com/content/pdf/10.1007/BF00994018.pdf
Maximum Margin
 The distance between a point P(x0, y0, z0) and a plane
Ax + By + Cz + D = 0 is given by
|Ax0 + By0 + Cz0 + D| / √(A² + B² + C²)
 Here we have two bounding planes:
w.x + b = 1 and w.x + b = -1
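For example, the distance from the point (1, 2, 3) to the plane x + 2y + 2z - 3 = 0 is |1 + 4 + 6 - 3| / √(1 + 4 + 4) = 8/3.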
Distance of the bounding hyperplane w.x + b = 1 from the origin
= |1 - b| / ||w||
Distance of the bounding hyperplane w.x + b = -1 from the origin
= |-1 - b| / ||w||
Distance between the two parallel planes (which needs to be maximized)
= |(1 - b) - (-1 - b)| / ||w|| = 2 / ||w||
Mathematics behind SVM
 For the training data to be linearly separable:
w.xi + b ≥ 1, if yi = +1
w.xi + b ≤ -1, if yi = -1
 Or, equivalently,
yi (w.xi + b) ≥ 1, i = 1, 2, ..., n
 Vectors xi for which yi (w.xi + b) = 1 (points which fall on
the bounding planes) are termed support vectors.
 Primal problem
minimize (1/2) ||w||² subject to yi (w.xi + b) ≥ 1, i = 1, 2, ..., n   (1)
Linearly Separable SVM
 The optimal hyperplane is given by
w.x + b = 0
where w = {w1, w2, …, wn} is a weight vector and b a scalar (bias).
The linear decision function is then I(x) = sign(w.x + b).
https://link.springer.com/content/pdf/10.1007/BF00994018.pdf
SVM – Soft Margin
 Here, C is a hyperparameter that decides the trade-off
between maximizing the margin and minimizing the
mistakes.
 When C is small, classification mistakes are given less
importance and focus is more on maximizing the margin,
whereas when C is large, the focus is more on avoiding
misclassification at the expense of keeping the margin small.
https://towardsdatascience.com/support-vector-machines-soft-margin-formulation-and-kernel-trick-4c9729dc8efe
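A minimal scikit-learn sketch of the effect of C; the toy data and the values of C are illustrative assumptions.

import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=-2, size=(20, 2)),    # two roughly separable clusters
               rng.normal(loc=+2, size=(20, 2))])
y = np.array([-1] * 20 + [1] * 20)

for C in (0.01, 1.0, 100.0):
    clf = SVC(kernel="linear", C=C).fit(X, y)
    margin = 2.0 / np.linalg.norm(clf.coef_[0])     # margin width = 2 / ||w||
    print(f"C={C}: margin width = {margin:.2f}, support vectors = {clf.n_support_.sum()}")

Smaller C typically gives a wider margin supported by more points; larger C trades margin width for fewer misclassifications.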
Mathematics behind Soft Margin SVM
The soft-margin primal problem:
minimize (1/2) ||w||² + C Σi ξi subject to yi (w.xi + b) ≥ 1 - ξi, ξi ≥ 0, i = 1, 2, ..., n,
where the ξi are slack variables.   (1)
Non-Linear SVM
XOR Problem
X  Y  X XOR Y
0  0  0
0  1  1
1  0  1
1  1  0
https://www.tech-quantum.com/solving-xor-problem-using-neural-network-c/
https://towardsdatascience.com/the-kernel-trick-c98cdbcaeb3f
SVM—Linearly Inseparable
 Transform the original input data into a higher dimensional space.
 Search for a linear separating hyperplane in the new space.
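A minimal sketch of this idea on the XOR data above: an explicit (assumed) feature map (x1, x2) → (x1, x2, x1·x2) makes the four points linearly separable, so a linear SVM in the new space classifies them correctly.

import numpy as np
from sklearn.svm import SVC

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([0, 1, 1, 0])                       # XOR labels, not linearly separable in 2-D

Phi = np.c_[X, X[:, 0] * X[:, 1]]                # map to 3-D: (x1, x2, x1*x2)
clf = SVC(kernel="linear", C=1e6).fit(Phi, y)    # large C approximates a hard margin
print(clf.predict(Phi))                          # expected: [0 1 1 0]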
Kernel Functions
 Kernel functions are generalized functions that take two
vectors (of any dimension) as input and output a score that
denotes how similar the input vectors are.
 An example is the dot product function: if the dot product is
small, we conclude that vectors are different and if the dot
product is large, we conclude that vectors are more similar.
Kernel Trick
 We can use any fancy kernel function in place of the dot product,
one that has the capability of measuring similarity in higher
dimensions (where it could be more accurate), without
increasing the computational cost much.
 This is essentially known as the Kernel Trick.
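A small check of the trick for the degree-2 kernel K(x, z) = (x.z)², whose implicit feature map in two dimensions is φ(x) = (x1², √2·x1·x2, x2²); the vectors are illustrative.

import numpy as np

x = np.array([1.0, 2.0])
z = np.array([3.0, 4.0])

def phi(v):   # explicit feature map corresponding to K(x, z) = (x . z)^2
    return np.array([v[0] ** 2, np.sqrt(2) * v[0] * v[1], v[1] ** 2])

print(phi(x) @ phi(z))   # dot product computed in the 3-D feature space: 121.0
print((x @ z) ** 2)      # same value from the kernel, computed directly in 2-D: 121.0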
Polynomial Kernel
 K(x, z) = (x.z + c)^d, where d is the degree of the polynomial and c ≥ 0 is a constant.
 Kernel Matrix: the n × n matrix K with entries K[i, j] = K(xi, xj), i.e. the kernel
evaluated on every pair of training points (also called the Gram matrix).
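A minimal sketch of building the kernel (Gram) matrix with a degree-2 polynomial kernel; the data points and constants are illustrative.

import numpy as np

X = np.array([[0.0, 0.0],
              [1.0, 1.0],
              [2.0, 0.0]])
degree, c = 2, 1.0

K = (X @ X.T + c) ** degree    # K[i, j] = (x_i . x_j + c)^degree
print(K)                       # symmetric 3 x 3 kernel matrix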
Why is SVM Effective on High Dimensional Data?
 The complexity of the trained classifier is characterized by the number of
support vectors rather than the dimensionality of the data.
 The support vectors are the essential or critical training examples —
they lie closest to the decision boundary (Maximum Margin
Hyperplane).
 Thus, an SVM with a small number of support vectors can have good
generalization, even when the dimensionality of the data is high.
References
 Dunham M. H., “Data Mining: Introductory and Advanced
Topics”, Pearson Education, New Delhi, 2003.
 Jiawei Han, Micheline Kamber, “Data Mining: Concepts and
Techniques”, Elsevier, 2006.
 K. P. Soman, Shyam Diwakar, V. Ajay, “Insight into Data Mining:
Theory and Practice”, PHI Pvt. Ltd., New Delhi, 2008.
 https://hanj.cs.illinois.edu/bk3/bk3_slidesindex.htm