1. Dr. Selim Yılmaz Spring 2025
Lecture #2
Understanding Deep Networks
2. Today
• Gradient Descent Algorithm
• Computation Graph
• Multi-Layer Neural Network
• Activation Functions
• Loss Functions
• Deep Neural Network
Forward Propagation in a Deep Network
Why Deep Representations?
Building Blocks of Deep Neural Networks
Forward and Backward Propagation
Parameters and Hyperparameters
5. Algorithm
• Gradient Descent (or GD) is an algorithm that uses the gradient of a
given real-valued function.
• The gradient gives the direction and the magnitude of the slope of the
function.
• The direction of the negative gradient is a good direction to search if we
want to find a minimizer of the function.
6. Algorithm
• Gradient Descent is often used to find the minimum of the cost
function in linear or logistic regression (i.e., 𝐽(𝜃)).
Figure: cost 𝐽(𝜃) plotted against parameter 𝜃.
7. Derivatives
• The derivative represents the amount of vertical change with respect
to the horizontal change in the variable of the given function.
• Here 𝑑𝐽(𝜃) and 𝑑𝜃 are Leibniz notation and represent, respectively, a
very small change along the 𝐽(𝜃) axis and along the 𝜃 axis.
Figure: tangent line to 𝐽(𝜃) at 𝜃₀; the slope of the tangent line is the derivative:
slope = 𝑑𝐽(𝜃)/𝑑𝜃
10. Update Step
• As the parameters approach a local/global minimum, the step size
becomes smaller because the slope decreases around it.
Figure: update steps on 𝐽(𝜃). At a point with positive slope, 𝜃 moves left; at a point with negative slope, 𝜃 moves right:
𝜃₁ = 𝜃₁ − 𝛼(pos. value)
𝜃₂ = 𝜃₂ − 𝛼(neg. value)
11. Learning Rate Parameter
𝜃ⱼ = 𝜃ⱼ − 𝛼 · 𝑑𝐽(𝜃)/𝑑𝜃ⱼ
𝛼 is a learning rate:
• If it is too small, gradient descent can be slow
• If it is too large, gradient descent can overshoot
the minimum
Figure: with a too-small 𝛼, gradient descent takes many tiny steps toward the minimum; with a too-large 𝛼, it overshoots and can bounce across the minimum.
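The update rule above can be sketched in a few lines of code. This is a minimal illustration (not from the slides) using the assumed example cost 𝐽(𝜃) = 𝜃², whose derivative is 2𝜃:

```python
# Minimal gradient descent sketch on the example cost J(theta) = theta**2,
# whose derivative is dJ/dtheta = 2 * theta.

def gradient_descent(theta0, alpha, steps):
    theta = theta0
    for _ in range(steps):
        grad = 2 * theta                  # dJ/dtheta for J = theta**2
        theta = theta - alpha * grad      # update: theta := theta - alpha * dJ/dtheta
    return theta

# A well-chosen learning rate converges toward the minimizer theta = 0.
print(gradient_descent(theta0=5.0, alpha=0.1, steps=100))   # value very close to 0
# A too-large learning rate overshoots the minimum and diverges.
print(abs(gradient_descent(theta0=5.0, alpha=1.1, steps=20)) > 5.0)  # True
```

Note how the two runs reproduce the two failure modes discussed on this slide: slow convergence versus overshooting.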
13. Neural Network
• Logistic (or linear) regression can be viewed as a very basic neural
network structure.
• Computation in a neural network is organized as a forward
propagation step followed by a backward propagation step.
• In the forward step, the output value of the network is computed; in
the backward step, the cost (error) is propagated back to update the weights.
16. Computing Propagations - Illustration
• Assume that our cost function is
𝐽(𝑎, 𝑏, 𝑐) = 3(𝑎 + 𝑏𝑐)
where
𝑢 = 𝑏𝑐, 𝑣 = 𝑎 + 𝑢, 𝐽 = 3𝑣
• With 𝑎 = 5, 𝑏 = 3, 𝑐 = 2, the graph computes 𝑢 = 6, 𝑣 = 11, 𝐽 = 33.
left to right -> forward propagation
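The forward pass through this computation graph can be written directly in code; the sketch below mirrors the node values on the slide:

```python
# Forward propagation through the computation graph of J(a, b, c) = 3 * (a + b*c).
def forward(a, b, c):
    u = b * c      # first node:  u = bc
    v = a + u      # second node: v = a + u
    J = 3 * v      # output node: J = 3v (the cost)
    return u, v, J

u, v, J = forward(a=5, b=3, c=2)
print(u, v, J)  # 6 11 33, matching the slide's values
```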
17. Computing Propagations - Illustration
• Assume that our cost function is
𝐽(𝑎, 𝑏, 𝑐) = 3(𝑎 + 𝑏𝑐)
where
𝑢 = 𝑏𝑐, 𝑣 = 𝑎 + 𝑢, 𝐽 = 3𝑣
• Computation graph: 𝑎 = 5, 𝑏 = 3, 𝑐 = 2 → 𝑢 = 6, 𝑣 = 11, 𝐽 = 33.
right to left -> backward propagation
18. Computing Derivatives in Backward Pass
• To know how the cost function 𝐽 changes when a small change is made
to 𝑣, we calculate the derivative:
𝑑𝐽/𝑑𝑣
• Computation graph: 𝑎 = 5, 𝑏 = 3, 𝑐 = 2 → 𝑢 = 6, 𝑣 = 11, 𝐽 = 33.
right to left -> backward propagation
19. Computing Derivatives in Backward Pass
• To know how the cost function 𝐽 changes when a small change is made
to 𝑎, we calculate the derivative:
𝑑𝐽/𝑑𝑎 = (𝑑𝐽/𝑑𝑣)(𝑑𝑣/𝑑𝑎)
• Computation graph: 𝑎 = 5, 𝑏 = 3, 𝑐 = 2 → 𝑢 = 6, 𝑣 = 11, 𝐽 = 33.
right to left -> backward propagation
20. Computing Derivatives in Backward Pass
• Chain rule: the amount of change in 𝐽 is equal to the product of
• how much 𝑣 changes with 𝑎, and
• how much 𝐽 changes with 𝑣:
𝑑𝐽/𝑑𝑎 = (𝑑𝐽/𝑑𝑣)(𝑑𝑣/𝑑𝑎)
• Computation graph: 𝑎 = 5, 𝑏 = 3, 𝑐 = 2 → 𝑢 = 6, 𝑣 = 11, 𝐽 = 33.
right to left -> backward propagation
21. Computing Derivatives in Backward Pass
• To know how the cost function 𝐽 changes when a small change is made
to 𝑏, we calculate the derivative:
𝑑𝐽/𝑑𝑏 = (𝑑𝐽/𝑑𝑣)(𝑑𝑣/𝑑𝑢)(𝑑𝑢/𝑑𝑏)
• Computation graph: 𝑎 = 5, 𝑏 = 3, 𝑐 = 2 → 𝑢 = 6, 𝑣 = 11, 𝐽 = 33.
right to left -> backward propagation
22. Computing Derivatives in Backward Pass
• To know how the cost function 𝐽 changes when a small change is made
to 𝑐, we calculate the derivative:
𝑑𝐽/𝑑𝑐 = (𝑑𝐽/𝑑𝑣)(𝑑𝑣/𝑑𝑢)(𝑑𝑢/𝑑𝑐)
• Computation graph: 𝑎 = 5, 𝑏 = 3, 𝑐 = 2 → 𝑢 = 6, 𝑣 = 11, 𝐽 = 33.
right to left -> backward propagation
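The backward pass evaluates these derivatives right to left via the chain rule. A sketch on the same graph, using the local derivatives 𝑑𝐽/𝑑𝑣 = 3 (since 𝐽 = 3𝑣), 𝑑𝑣/𝑑𝑎 = 𝑑𝑣/𝑑𝑢 = 1, 𝑑𝑢/𝑑𝑏 = 𝑐, and 𝑑𝑢/𝑑𝑐 = 𝑏:

```python
# Backward propagation (chain rule) for J(a, b, c) = 3 * (a + b*c).
def backward(a, b, c):
    # local derivatives at each node of the graph
    dJ_dv = 3            # J = 3v
    dv_da = 1            # v = a + u
    dv_du = 1
    du_db = c            # u = b*c
    du_dc = b
    # chain rule, applied right to left
    dJ_da = dJ_dv * dv_da
    dJ_db = dJ_dv * dv_du * du_db
    dJ_dc = dJ_dv * dv_du * du_dc
    return dJ_da, dJ_db, dJ_dc

print(backward(a=5, b=3, c=2))  # (3, 6, 9)
```

With 𝑏 = 3 and 𝑐 = 2, the products come out to 𝑑𝐽/𝑑𝑎 = 3, 𝑑𝐽/𝑑𝑏 = 3·2 = 6, and 𝑑𝐽/𝑑𝑐 = 3·3 = 9.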
43. Variants of Gradient Descent
Batch gradient descent (BGD):
• All the training data is passed through the model before the parameters are updated.
• The gradient is averaged over all training examples for a single update.
44. Variants of Gradient Descent
Stochastic gradient descent (SGD):
• Only one sample of the training data is passed to update the model parameters.
• The gradient is calculated according to that single sample.
• Useful when the dataset is very large.
45. Variants of Gradient Descent
Mini-batch gradient descent (MBGD):
• SGD slows down the computation because the gradient is calculated
for each individual sample.
• Instead, MBGD uses a batch of size between 1 and the full dataset for each update.
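The three variants differ only in how many samples feed each parameter update. A sketch of just the batching step (the gradient computation itself is omitted; the helper name is my own):

```python
import random

# Split the training data into batches; the batch size alone distinguishes
# batch GD (all data), SGD (one sample), and mini-batch GD (in between).
def make_batches(data, batch_size):
    data = list(data)
    random.shuffle(data)          # visit samples in random order each epoch
    return [data[i:i + batch_size] for i in range(0, len(data), batch_size)]

data = list(range(10))
print(len(make_batches(data, len(data))))  # 1  update per epoch -> batch GD
print(len(make_batches(data, 1)))          # 10 updates per epoch -> SGD
print(len(make_batches(data, 4)))          # 3  updates per epoch -> mini-batch GD
```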
48. Terminologies in NN
Training set
Batch: # of training examples in a single split.
Iteration:
Epoch:
batch#1
batch#2
batch#3
batch#4
49. Terminologies in NN
Training set
Batch:
Iteration: # of steps to pass all batches
iteration = # of batches in an epoch
Epoch:
batch#1
batch#2
batch#3
batch#4
4 iterations are needed to pass all batches.
50. Terminologies in NN
Batch:
Iteration:
Epoch: # of passes of the entire dataset
through the algorithm.
Figure: each epoch passes batch#1 … batch#4 once (1st epoch, 2nd epoch, …, kth epoch).
51. Terminologies in NN
2000 instances in training dataset
Batch size: 400
Iterations per epoch: 2000 / 400 = 5
Epochs: any number.
Figure: every epoch passes all batches once, repeated for as many epochs as desired.
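The arithmetic relating these terms is worth making explicit. A small sketch (helper name is my own) reproducing the slide's example:

```python
import math

# Iterations per epoch = number of batches needed to pass the whole
# training set once (rounding up if the last batch is smaller).
def iterations_per_epoch(n_examples, batch_size):
    return math.ceil(n_examples / batch_size)

# The slide's example: 2000 training instances with a batch size of 400.
print(iterations_per_epoch(2000, 400))  # 5
```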
53. Activation Functions
• Most popular nonlinear activation functions:
• Sigmoid,
• Hyperbolic (tangent),
• Rectified Linear Unit (ReLU),
• Leaky Rectified Linear Unit (LReLU),
• Softmax.
54. Sigmoid
• It is often used at the final layer of ANNs to
squash the output into the range (0, 1) for 0/1 classification.
• Gradient information can be lost.
Vanishing gradient:
• A case in which gradient information is lost.
• Arises when the input is too large (in the
positive or negative direction), where the curve saturates.
• Prevents the algorithm from updating the weights.
𝑔(𝑥) = 1 / (1 + 𝑒^(−𝑥))
55. Hyperbolic (Tangent)
• It produces an output in the range (−1, 1).
• Often works better than the sigmoid function.
• Its derivative has a steeper curve.
• It has the vanishing gradient problem as well.
𝑔(𝑥) = (𝑒^𝑥 − 𝑒^(−𝑥)) / (𝑒^𝑥 + 𝑒^(−𝑥))
56. Rectified Linear Unit (ReLU)
• It behaves like a linear function
when 𝑥 > 0.
• No gradient information is
obtainable when 𝑥 < 0 (the derivative there is 0).
𝑔(𝑥) = max(0, 𝑥)
57. Rectified Linear Unit (ReLU)
Dying ReLU:
• ‘A ReLU neuron is “dead” if it’s stuck on the negative
side and always outputs 0. Because the slope of ReLU
in the negative range is also 0, once a neuron goes
negative, it’s unlikely to recover. Such neurons
play no role in discriminating the input and
are essentially useless. Over time you may end up
with a large part of your network doing nothing.’
𝑔(𝑥) = max(0, 𝑥)
58. Leaky ReLU
• It behaves like a linear function when
𝑥 > 0.
• Gradient information is still obtainable
when 𝑥 < 0, thanks to the small slope 𝜇.
𝑔(𝑥) = max(𝜇𝑥, 𝑥)
59. Softmax
• It is used at the output layer.
• It is a more general form of the logistic activation
function, used for multiclass
classification.
• Gives the probability of each neuron
being true for the corresponding class; the outputs sum to 1.
𝑔(𝑥ᵢ) = 𝑒^(𝑥ᵢ) / Σⱼ 𝑒^(𝑥ⱼ)
image credit: towardsdatascience.com
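The five activations above are easy to state in code. A minimal sketch (the Leaky-ReLU slope 𝜇 = 0.01 is an assumed value, not from the slides):

```python
import math

# The nonlinear activation functions discussed on the previous slides.
def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))   # output in (0, 1)

def tanh(x):
    return (math.exp(x) - math.exp(-x)) / (math.exp(x) + math.exp(-x))  # (-1, 1)

def relu(x):
    return max(0.0, x)                  # zero (and zero gradient) for x < 0

def leaky_relu(x, mu=0.01):
    return max(mu * x, x)               # small slope mu keeps gradient for x < 0

def softmax(xs):
    exps = [math.exp(x) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]    # class probabilities, summing to 1

print(sigmoid(0))                        # 0.5
print(relu(-3.0), leaky_relu(-3.0))      # 0.0 vs a small negative value
print(sum(softmax([1.0, 2.0, 3.0])))     # 1.0 (up to float rounding)
```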
61. Categories
• Regression
• Square Error/Quadratic Loss/L2 Loss
• Absolute Error/L1 Loss
• Bias Error
• Classification
• Hinge Loss/Multi-class SVM Loss
• Cross Entropy Loss/Negative Log Likelihood
• Binary Cross Entropy Loss/Log Loss
62. Square Error/Quadratic Loss/L2 Loss
• The ‘squared’ difference between prediction and actual observation.
ℒ(𝑎, 𝑦) = (𝑎 − 𝑦)²
63. Absolute Error/L1 Loss
• The ‘absolute’ difference between prediction and actual observation.
ℒ(𝑎, 𝑦) = |𝑎 − 𝑦|
64. Bias Error
• This is much less common in machine learning domain.
ℒ(𝑎, 𝑦) = 𝑎 − 𝑦
65. Hinge Loss/Multi class SVM Loss
• The score of the correct category should be greater than the score
of each incorrect category by some safety margin.
SVMLoss = Σ_(𝑗 ≠ 𝑦ᵢ) max(0, 𝑠ⱼ − 𝑠_(𝑦ᵢ) + 1)
66. Hinge Loss/Multi class SVM Loss
## 1st training example
• max(0, (1.49) - (-0.39) + 1) + max(0, (4.21) - (-0.39) + 1)
• max(0, 2.88) + max(0, 5.6)
• 2.88 + 5.6
• 8.48 (High loss as very wrong prediction)
67. Hinge Loss/Multi class SVM Loss
## 2nd training example
• max(0, (-4.61) - (3.28)+ 1) + max(0, (1.46) - (3.28)+ 1)
• max(0, -6.89) + max(0, -0.82)
• 0 + 0
• 0 (Zero loss as correct prediction)
68. Hinge Loss/Multi class SVM Loss
## 3rd training example
• max(0, (1.03) - (-2.27)+ 1) + max(0, (-2.37) - (-2.27)+ 1)
• max(0, 4.3) + max(0, 0.9)
• 4.3 + 0.9
• 5.2 (High loss as very wrong prediction)
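The three worked examples follow directly from the hinge-loss formula. A sketch that reproduces the first two (the ordering of scores in each list, with the correct class first, is my assumption about how the slides' figures were laid out):

```python
# Multi-class hinge (SVM) loss with safety margin 1:
# sum of max(0, s_j - s_correct + 1) over all incorrect classes j.
def hinge_loss(scores, correct_idx, margin=1.0):
    return sum(max(0.0, s - scores[correct_idx] + margin)
               for j, s in enumerate(scores) if j != correct_idx)

# 1st example: correct class scores -0.39, incorrect classes 1.49 and 4.21.
print(round(hinge_loss([-0.39, 1.49, 4.21], 0), 2))  # 8.48 (high loss)
# 2nd example: correct class scores 3.28, clearing the margin for both others.
print(round(hinge_loss([3.28, -4.61, 1.46], 0), 2))  # 0.0 (zero loss)
```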
69. Cross Entropy Loss/Log Loss/Log Likelihood
• Cross-entropy is a commonly used loss function for multi-class
classification tasks.
ℒ(𝑎, 𝑦) = − Σᵢ 𝑦ᵢ log 𝑎ᵢ
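For a one-hot label 𝑦, the sum collapses to −log of the probability assigned to the true class. A sketch with assumed example probabilities:

```python
import math

# Cross-entropy loss for a one-hot label y and predicted probabilities a.
def cross_entropy(a, y):
    return -sum(yi * math.log(ai) for ai, yi in zip(a, y))

# A confident, correct prediction gives a small loss;
# a confident, wrong prediction gives a large one.
good = cross_entropy([0.9, 0.05, 0.05], [1, 0, 0])
bad  = cross_entropy([0.05, 0.9, 0.05], [1, 0, 0])
print(round(good, 3), round(bad, 3))  # 0.105 2.996
```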
84. Deep Representation
• Boolean functions:
ØEvery Boolean function can be represented exactly by a neural network
ØThe number of hidden layers might need to grow with the number of inputs
• Continuous functions:
ØEvery bounded continuous function can be approximated with small error
with two layers
• Arbitrary functions:
Ø Three layers can approximate any arbitrary function.
• Cybenko, G. (1989). "Approximations by superpositions of sigmoidal functions." Mathematics of Control, Signals, and Systems, 2(4), 303-314.
• Hornik, K. (1991). "Approximation Capabilities of Multilayer Feedforward Networks." Neural Networks, 4(2), 251-257.
• Hornik, K., Stinchcombe, M., & White, H. (1989). "Multilayer feedforward networks are universal approximators." Neural Networks, 2(5), 359-366.
85. Deep Representation
Why go deeper if three layers is sufficient?
• Going deeper helps convergence in “big” problems.
• Going deeper in traditionally trained (“old-fashioned”) ANNs does not help much in
accuracy.
• Choromanska, A., Henaff, M., Mathieu, M., Arous, G. B., & LeCun, Y. (2015, February). The loss surfaces of multilayer networks. In Artificial Intelligence and
Statistics (pp. 192-204).
86. Deep Representation
More hidden neurons can represent more complicated functions.
Figure: https://cs231n.github.io/
87. Deep Representation
Several rules of thumb for the # of hidden units
• The number of hidden neurons should be between the size of the input
layer and the size of the output layer.
• The number of hidden neurons should be 2/3 the size of the input layer,
plus the size of the output layer.
• The number of hidden neurons should be less than twice the size of the
input layer.
• The number of hidden neurons is gradually decreased from the input layer
to the output layer.
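Two of these heuristics are simple formulas; a sketch with hypothetical helper names (these are rules of thumb, not guarantees):

```python
# Rule-of-thumb hidden-unit counts, where n_in is the input layer size
# and n_out is the output layer size.
def two_thirds_rule(n_in, n_out):
    return round(2 * n_in / 3) + n_out   # 2/3 of input size, plus output size

def upper_bound(n_in):
    return 2 * n_in                      # stay below twice the input size

# Example: a network with 30 inputs and 3 outputs.
print(two_thirds_rule(30, 3))  # 23
print(upper_bound(30))         # 60
```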
88. Deep Representation
# of hidden layers
• Depends on the nature of the problem:
• Linear classification? Then no hidden layers are needed.
• Non-linear classification?
• Trial and error is helpful.
• Watch the validation and training loss curves throughout the epochs:
• If the gap between the losses is small, you can increase the capacity (neurons and layers).
• If the training loss is much smaller than the validation loss, you should decrease the capacity.
94. Parameters and Hyperparameters
• Model Parameters: These are the parameters in the model that must be
determined using the training data set. These are the fitted parameters.
𝑊^[𝑙] and 𝑏^[𝑙] where 1 ≤ 𝑙 ≤ 𝐿
• Hyperparameters: These are adjustable parameters that must be tuned
in order to obtain a model with optimal performance.
learning rate (𝛼), # of iterations, # of hidden layers (𝐿), # of hidden units (𝑛^[𝑙]), choice of activation function, and many many more…