16. Perceptron learning rule
One way to learn an acceptable weight vector is to begin
with random weights, then iteratively apply the perceptron
to each training example, modifying the perceptron weights
whenever it misclassifies an example.
This process is repeated, iterating through the training
examples as many times as needed until the perceptron
classifies all training examples correctly.
17. Weights are modified at each step according to the
perceptron training rule, which revises the weight wi
associated with input xi according to the rule
wi ← wi + Δwi, where Δwi = η(t − o)xi
Here, t is the target output for the current training
example, o is the output generated by the perceptron, and
η is a positive constant called the learning rate.
The role of the learning rate is to moderate the degree to
which weights are changed at each step.
It is usually set to some small value (e.g., 0.1).
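A minimal NumPy sketch of this procedure (function and variable names are illustrative; targets are taken in {-1, +1}, with a constant input x0 = 1 standing in for the bias):
```python
import numpy as np

def train_perceptron(X, t, eta=0.1, max_epochs=100):
    """Perceptron training rule: w_i <- w_i + eta * (t - o) * x_i.

    X: (n_examples, n_features) inputs; t: targets in {-1, +1}.
    A constant input x_0 = 1 is prepended so w[0] acts as the bias.
    """
    X = np.hstack([np.ones((X.shape[0], 1)), X])    # prepend x_0 = 1
    w = np.random.uniform(-0.05, 0.05, X.shape[1])  # small random weights
    for _ in range(max_epochs):
        errors = 0
        for x, target in zip(X, t):
            o = 1 if w @ x > 0 else -1              # thresholded output
            if o != target:                         # update only on mistakes
                w += eta * (target - o) * x
                errors += 1
        if errors == 0:                             # all examples classified correctly
            break
    return w

# Example: learn the (linearly separable) AND function with +/-1 labels.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
t = np.array([-1, -1, -1, 1])
w = train_perceptron(X, t)
```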
18. Gradient Descent and the Delta Rule
Although the perceptron rule finds a successful weight vector
when the training examples are linearly separable, it can fail
to converge if the examples are not linearly separable.
The delta rule is designed to overcome this difficulty.
If the training examples are not linearly separable, the delta
rule converges toward a best-fit approximation to the target
concept.
19. The key idea behind the delta rule is to use gradient descent
to search the hypothesis space of possible weight vectors to
find the weights that best fit the training examples.
The delta training rule is best understood by considering the
task of training an unthresholded perceptron; that is, a
linear unit for which the output o is given by
o = w·x
20. Training error
E(w) = ½ Σd∈D (td − od)²
where D is the set of training examples, td is the target
output for training example d, and od is the output of the
linear unit for training example d.
21. Since the gradient ∇E(w) specifies the direction of steepest increase
of E, the training rule for gradient descent is
w ← w + Δw, where Δw = −η ∇E(w)
22. Here η (the learning rate) is a positive constant which determines the
step size in the gradient descent search.
The negative sign is present because we want to move the
weight vector in the direction that decreases E.
This training rule can also be written in its component
form:
wi ← wi + Δwi, where Δwi = −η ∂E/∂wi
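A minimal batch gradient descent sketch for this linear unit (names are illustrative; the bias is again folded in as x0 = 1):
```python
import numpy as np

def gradient_descent_linear_unit(X, t, eta=0.01, epochs=1000):
    """Batch gradient descent on E(w) = 1/2 * sum_d (t_d - o_d)^2
    for a linear unit o = w . x (bias handled via x_0 = 1)."""
    X = np.hstack([np.ones((X.shape[0], 1)), X])
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        o = X @ w              # outputs for all training examples
        grad_E = -(t - o) @ X  # dE/dw_i = -sum_d (t_d - o_d) * x_id
        w -= eta * grad_E      # step against the gradient to decrease E
    return w
```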
28. Each layer is made up of units.
The inputs to the network correspond to the attributes
measured for each training tuple.
The inputs are fed simultaneously into the units making up
the input layer.
These inputs pass through the input layer and are then
weighted and fed simultaneously to a second layer of
“neuronlike” units, known as a hidden layer.
The outputs of the hidden layer units can be input to another
hidden layer, and so on.
The weighted outputs of the last hidden layer are input to
units making up the output layer, which emits the
network's prediction for given tuples.
29. The units in the input layer are called input units.
The units in the hidden layers and output layer are sometimes
referred to as neurodes, due to their symbolic biological
basis, or as output units.
A network containing two hidden layers is called a three-
layer neural network, and so on.
It is a feed-forward network since none of the weights cycles
back to an input unit or to a previous layer's output unit.
It is fully connected in that each unit provides input to
each unit in the next forward layer.
https://www.sciencedirect.com/topics/computer-science/backpropagation-algorithm
30. Each output unit takes, as input, a weighted sum of the
outputs from units in the previous layer.
It applies a nonlinear (activation) function to the weighted
input.
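As a concrete illustration of the forward pass described above, a minimal NumPy sketch of a fully connected feed-forward network (the helper names and the 3-4-2 shape are illustrative):
```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, layers):
    """Propagate one input tuple through a fully connected
    feed-forward network. `layers` is a list of (W, b) pairs,
    W of shape (n_out, n_in); each unit applies a nonlinear
    activation to its weighted input."""
    a = x
    for W, b in layers:
        a = sigmoid(W @ a + b)  # weighted sum, then activation
    return a

# A 3-4-2 network: 3 inputs, one hidden layer of 4 units, 2 output units.
rng = np.random.default_rng(0)
layers = [(rng.normal(size=(4, 3)), np.zeros(4)),
          (rng.normal(size=(2, 4)), np.zeros(2))]
y = forward(np.array([0.5, -1.0, 2.0]), layers)
```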
31. Compute the number of parameters for the given network.
32. The network has 4 + 2 = 6 neurons (not counting the
inputs), [3 x 4] + [4 x 2] = 20 weights and 4 + 2 = 6 biases,
for a total of 26 learnable parameters.
33. Compute the number of parameters for the given network.
34. The network has 4 + 4 + 1 = 9 neurons (not counting
inputs), [3 x 4] + [4 x 4] + [4 x 1] = 12 + 16 + 4 = 32
weights and 4 + 4 + 1 = 9 biases, for a total of 41 learnable
parameters.
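The same counting generalizes to any layer sizes; a small sketch (the function name is illustrative):
```python
def count_parameters(layer_sizes):
    """Count learnable parameters of a fully connected feed-forward net.
    layer_sizes = [inputs, hidden_1, ..., output]."""
    weights = sum(a * b for a, b in zip(layer_sizes, layer_sizes[1:]))
    biases = sum(layer_sizes[1:])  # one bias per non-input unit
    return weights + biases

print(count_parameters([3, 4, 2]))     # 20 weights + 6 biases = 26
print(count_parameters([3, 4, 4, 1]))  # 32 weights + 9 biases = 41
```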
38. Sigmoid function
Sigmoid outputs are not zero centered.
If the activation function of the network is not zero centered, the
output y = f(w·x) is always positive or always negative.
Thus, the output of a layer is always shifted toward either
positive or negative values.
As a result, the weight vector needs more updates to be trained
properly.
39. Tanh vs Sigmoid
The tanh function is a stretched and shifted version of the
sigmoid.
Both sigmoid and tanh functions belong to the S-
like functions that suppress the input value to a
bounded range.
This helps the network to keep its weights bounded and
prevents the exploding gradient problem where the value of
the gradients becomes very large.
https://www.baeldung.com/cs/sigmoid-vs-tanh-functions
40. The gradient of tanh is four times greater than the gradient of
the sigmoid function: since tanh(x) = 2·sigmoid(2x) − 1, it follows
that tanh′(x) = 4·sigmoid′(2x).
41. This means that using the tanh activation function results in
higher values of gradient during training and higher updates
in the weights of the network.
So, if we want strong gradients and big learning
steps, we should use the tanh activation function.
Another difference is that the output of tanh is
symmetric around zero leading to faster
convergence.
42. The output of tanh ranges from -1 to 1 and has equal
mass on both sides of the zero axis, so it is a zero-centered
function.
Thus, tanh overcomes the non-zero-centered issue of the
logistic activation function.
Hence optimization becomes comparatively easier than with the
logistic function, and tanh is generally preferred over it.
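A quick numeric check of these claims, as a minimal NumPy sketch (names are illustrative):
```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

z = np.linspace(-5, 5, 11)
d_sigmoid = sigmoid(z) * (1 - sigmoid(z))  # peaks at 0.25 when z = 0
d_tanh = 1 - np.tanh(z) ** 2               # peaks at 1.0 when z = 0

print(d_sigmoid.max(), d_tanh.max())  # ~0.25 vs 1.0: tanh gradient is 4x larger
# tanh is zero centered (outputs in (-1, 1)); sigmoid outputs stay in (0, 1).
print(np.tanh(z).min(), np.tanh(z).max(), sigmoid(z).min(), sigmoid(z).max())
```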
43. Comparison with ReLU
Sigmoid and tanh functions suffer from vanishing gradient
problem.
It is encountered while training artificial neural
networks with gradient-based learning
methods and backpropagation.
In such methods, during each iteration of training each of the
neural network's weights receives an update proportional to
the partial derivative of the error function with respect to the
current weight.
The problem is that in some cases, the gradient will be vanishingly
small, effectively preventing the weight from changing its value.
In the worst case, this may completely stop the neural network
from further training.
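A rough back-of-the-envelope illustration: the maximum derivative of the sigmoid is 0.25, so a product of such factors across many layers shrinks exponentially (a simplified sketch that ignores the weight terms also entering the product):
```python
# In backpropagation, the gradient reaching an early layer includes a
# product of activation derivatives from all later layers. With sigmoid,
# each factor is at most 0.25, so the product shrinks exponentially.
max_sigmoid_grad = 0.25
for depth in [2, 5, 10, 20]:
    print(depth, max_sigmoid_grad ** depth)
# depth 10 -> ~9.5e-07: updates to early-layer weights become vanishingly small
```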
46. A feedforward phase - where an input vector is applied and
the signal propagates through the network layers, modified
by the current weights and biases and by the
nonlinear activation functions.
Corresponding output values then emerge, and these can be
compared with the target outputs for the given input vector
using a loss function.
A feedback phase - the error signal is then fed back
(backpropagated) through the network layers to modify the
weights in a way that minimizes the error across the entire
training set, effectively minimizing the error surface in
weight-space.
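A minimal NumPy sketch of one such training step for a single-hidden-layer network with sigmoid units and squared-error loss (names and shapes are illustrative, not the only formulation):
```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_step(x, target, W1, b1, W2, b2, eta=0.5):
    """One feedforward + feedback (backpropagation) step."""
    # Feedforward phase: signal propagates through weights and activations.
    h = sigmoid(W1 @ x + b1)  # hidden layer outputs
    o = sigmoid(W2 @ h + b2)  # network outputs
    # Loss: E = 1/2 * ||target - o||^2, compared against the target outputs.
    # Feedback phase: error signal propagated back to modify the weights.
    delta_o = (o - target) * o * (1 - o)      # output-layer error term
    delta_h = (W2.T @ delta_o) * h * (1 - h)  # hidden-layer error term
    W2 -= eta * np.outer(delta_o, h); b2 -= eta * delta_o
    W1 -= eta * np.outer(delta_h, x); b1 -= eta * delta_h
    return 0.5 * np.sum((target - o) ** 2)
```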
54. A learned classifier (hyperplane) achieves maximum
separation between the classes.
The two planes parallel to the classifier and which pass
through one or more points in the dataset are called
bounding planes.
The distance between these bounding planes is called
margin.
By SVM learning, we mean finding a hyperplane which
maximizes this margin.
56. Linearly Separable SVM
The optimal hyperplane is given by
w.x + b = 0
where w={w1, w2, …, wn} is a weight vector and b a scalar (bias).
https://link.springer.com/content/pdf/10.1007/BF00994018.pdf
57. Maximum Margin
The distance between a point P(x0, y0, z0) and a plane
Ax + By + Cz + D = 0 is given by
|Ax0 + By0 + Cz0 + D| / √(A² + B² + C²).
Here we have two bounding planes,
w·x + b = 1 and w·x + b = −1
58. Distance of the bounding hyperplane w·x + b = 1 from the origin
= |1 − b| / ||w||
Distance of the bounding hyperplane w·x + b = −1 from the origin
= |−1 − b| / ||w||
Distance between the planes (which needs to be maximized)
= 2 / ||w||
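A quick numeric check of these distances, as a small NumPy sketch (the weight vector and bias are arbitrary illustration values):
```python
import numpy as np

def plane_distance(p, w, c):
    """Distance from point p to the hyperplane w . x + c = 0,
    i.e. |w . p + c| / ||w|| (the formula from the previous slide)."""
    return abs(w @ p + c) / np.linalg.norm(w)

w, b = np.array([3.0, 4.0]), 0.5  # ||w|| = 5
origin = np.zeros(2)
d_plus = plane_distance(origin, w, b - 1)   # plane w.x + b = 1  -> |1 - b| / ||w||
d_minus = plane_distance(origin, w, b + 1)  # plane w.x + b = -1 -> |-1 - b| / ||w||
# The two distances sum to the margin 2 / ||w|| = 0.4 here.
print(d_plus, d_minus, d_plus + d_minus, 2 / np.linalg.norm(w))
```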
59. Mathematics behind SVM
For the training data to be linearly separable:
w·xi + b ≥ 1, if yi = 1
w·xi + b ≤ −1, if yi = −1
Or,
yi(w·xi + b) ≥ 1, i = 1, 2, ..., n
60. Vectors xi for which yi(w·xi + b) = 1 (points which fall on
the bounding planes) are termed support vectors.
69. Linearly Separable SVM
The optimal hyperplane is given by
w·x + b = 0
where w = {w1, w2, …, wn} is a weight vector and b a scalar (bias).
The linear decision function I(x) is then given by
I(x) = sign(w·x + b)
https://link.springer.com/content/pdf/10.1007/BF00994018.pdf
74. Here, C is a hyperparameter that decides the trade-off
between maximizing the margin and minimizing the
mistakes.
When C is small, classification mistakes are given less
importance and focus is more on maximizing the margin,
whereas when C is large, the focus is more on avoiding
misclassification at the expense of keeping the margin small.
https://towardsdatascience.com/support-vector-machines-soft-margin-formulation-and-kernel-trick-4c9729dc8efe
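A small illustration of the trade-off, assuming scikit-learn is available (the data and C values are arbitrary):
```python
import numpy as np
from sklearn.svm import SVC

# Two overlapping Gaussian blobs, so some mistakes are unavoidable.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-1, 1, (50, 2)), rng.normal(1, 1, (50, 2))])
y = np.array([0] * 50 + [1] * 50)

for C in [0.01, 1.0, 100.0]:
    clf = SVC(kernel="linear", C=C).fit(X, y)
    # Small C -> mistakes weigh little, wider margin (2 / ||w||);
    # large C -> mistakes weigh heavily, narrower margin.
    print(C, clf.n_support_, 2 / np.linalg.norm(clf.coef_))
```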
91. Kernel Functions
Kernel functions are generalized functions that take two
vectors (of any dimension) as input and output a score that
denotes how similar the input vectors are.
An example is the dot product function: if the dot product is
small, we conclude that vectors are different and if the dot
product is large, we conclude that vectors are more similar.
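Two common examples in a minimal sketch (function names are illustrative):
```python
import numpy as np

def linear_kernel(u, v):
    return u @ v  # large when the vectors point in similar directions

def rbf_kernel(u, v, gamma=1.0):
    # Similarity in (0, 1]: equals 1 when u == v, decays with distance.
    return np.exp(-gamma * np.sum((u - v) ** 2))

a, b = np.array([1.0, 2.0]), np.array([1.1, 1.9])
print(linear_kernel(a, b), rbf_kernel(a, b))  # both report high similarity
```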
101. Kernel Trick
We can use any fancy kernel function in place of the dot product,
one that has the capability of measuring similarity in higher
dimensions (where it could be more accurate), without
increasing the computational costs much.
This is essentially known as the Kernel Trick.
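A sketch of why this works, using the degree-2 polynomial kernel (u·v)², whose value equals the dot product of explicit quadratic feature maps (the feature map phi shown is one standard choice):
```python
import numpy as np

def phi(x):
    """Explicit degree-2 feature map for 2-D input."""
    return np.array([x[0]**2, np.sqrt(2) * x[0] * x[1], x[1]**2])

def poly_kernel(u, v):
    """Computes the same similarity without ever forming phi(x)."""
    return (u @ v) ** 2

u, v = np.array([1.0, 2.0]), np.array([3.0, 0.5])
print(phi(u) @ phi(v))    # 16.0 -- dot product in the higher-dim space
print(poly_kernel(u, v))  # 16.0 -- identical score, at lower cost
```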
105. Why is SVM Effective on High Dimensional Data?
The complexity of the trained classifier is characterized by the number
of support vectors rather than the dimensionality of the data.
The support vectors are the essential or critical training examples —
they lie closest to the decision boundary (Maximum Margin
Hyperplane).
Thus, an SVM with a small number of support vectors can have good
generalization, even when the dimensionality of the data is high.
106. References
Dunham, M. H., "Data Mining: Introductory and Advanced
Topics", Pearson Education, New Delhi, 2003.
Jiawei Han, Micheline Kamber, "Data Mining: Concepts and
Techniques", Elsevier, 2006.
K. P. Soman, Shyam Diwakar, V. Ajay, "Insight into Data Mining:
Theory and Practice", PHI Pvt. Ltd., New Delhi, 2008.
https://hanj.cs.illinois.edu/bk3/bk3_slidesindex.htm