Dr. Selim Yılmaz Spring 2025
Lecture #2
Understanding Deep Networks
Today
• Gradient Descent Algorithm
• Computation Graph
• Multi-Layer Neural Network
• Activation Functions
• Loss Functions
• Deep Neural Network
Forward Propagation in a Deep Network
Why Deep Representations?
Building Blocks of Deep Neural Networks
Forward and Backward Propagation
Parameters and Hyperparameters
Gradient Descent
Minimization of Cost Function
• Recap:
$\hat{y} = \sigma(\theta^{\top}x), \qquad \sigma(z) = \frac{1}{1+e^{-z}}$
$J(\theta) = \frac{1}{m}\sum_{i=1}^{m}\mathcal{L}(\hat{y}^{(i)}, y^{(i)}) = -\frac{1}{m}\sum_{i=1}^{m}\left[\, y^{(i)}\log \hat{y}^{(i)} + (1-y^{(i)})\log(1-\hat{y}^{(i)}) \,\right]$
• Want to find $\theta$ that minimizes $J(\theta)$
Algorithm
• Gradient Descent (or GD) is an algorithm that uses the gradient of a given real-valued function.
• The gradient gives the direction and the magnitude of the slope of the function.
• The direction of the negative gradient is a good direction to search in if we want to find a minimizer of the function.
Algorithm
• Gradient Descent is often used to find the minimum of the cost function in linear or in logistic regression (i.e., $J(\theta)$).
(figure: bowl-shaped cost curve $J(\theta)$ plotted over $\theta$)
Derivatives
• The derivative represents the amount of vertical change with respect to the horizontal change in the variable of the given function.
• Here $dJ(\theta)$ and $d\theta$ are Leibniz notation and represent, respectively, a very small change along the $J(\theta)$ axis and along the $\theta$ axis.
(figure: tangent line to $J(\theta)$ at $\theta_0$; the slope of the tangent line is the derivative)
$\text{slope} = \frac{dJ(\theta)}{d\theta}$
Update Step
Repeat until convergence {
$\quad \theta_j := \theta_j - \alpha \frac{dJ(\theta)}{d\theta_j} = \theta_j - \alpha \frac{\partial}{\partial \theta_j} J(\theta)$ ← ‘partial derivative’
}
Important: Simultaneous update is a must!
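As a sketch, the loop above can be written in plain Python. The quadratic cost and its gradient below are illustrative assumptions, not from the slides; note how every component is updated from the same old $\theta$, which is the simultaneous update the slide insists on:

```python
def gradient_descent(grad, theta0, alpha=0.1, iters=200):
    """Repeat theta_j := theta_j - alpha * dJ/dtheta_j until the iteration budget runs out."""
    theta = list(theta0)
    for _ in range(iters):
        g = grad(theta)  # gradients evaluated at the CURRENT theta
        # simultaneous update: every theta_j uses the same old theta
        theta = [t - alpha * gi for t, gi in zip(theta, g)]
    return theta

# Example cost (our assumption): J(theta) = (theta_0 - 3)^2 + (theta_1 + 1)^2,
# so the gradient is (2(theta_0 - 3), 2(theta_1 + 1)) and the minimizer is (3, -1).
grad = lambda th: (2 * (th[0] - 3), 2 * (th[1] + 1))
theta = gradient_descent(grad, [0.0, 0.0], alpha=0.1)
print(theta)  # converges toward [3, -1]
```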
Update Step
• Update procedure in GD:
$\theta_0 := \theta_0 - \alpha \frac{\partial}{\partial \theta_0} J(\theta_0, \theta_1)$
$\theta_1 := \theta_1 - \alpha \frac{\partial}{\partial \theta_1} J(\theta_0, \theta_1)$
$\;\;\vdots$
$\theta_n := \theta_n - \alpha \frac{\partial}{\partial \theta_n} J(\theta_0, \theta_1, \dots, \theta_n)$
(figure: $J$ plotted over $\theta$, marking a point with positive slope and a point with negative slope)
Update Step
• As the parameters approach a local/global minimum, the step size becomes smaller because of the decreasing slope around it.
(figure: on a positive slope the update $\theta := \theta - \alpha(\textit{pos. value})$ moves $\theta$ left; on a negative slope the update $\theta := \theta - \alpha(\textit{neg. value})$ moves $\theta$ right)
Learning Rate Parameter
$\theta_j := \theta_j - \alpha \frac{dJ(\theta)}{d\theta_j}$
$\alpha$ is the learning rate:
• If it is too small, gradient descent can be slow
• If it is too large, gradient descent can overshoot the minimum
(figure: with a too-small $\alpha$ the iterates crawl slowly down $J$; with a too-large $\alpha$ they bounce back and forth across the minimum)
Computational Graph
Neural Network
• Logistic (or linear) regression can be viewed as a very basic neural
network structure.
• Computation in a neural network is organized as a forward propagation step followed by a backward propagation step.
• In the forward step, the output value of the network is computed; in the backward step, the cost (error) is propagated back to update the weights.
Neural Network
(figure: single-neuron network with inputs $x_0 = b = 1, x_1, x_2, x_3$, weights $\theta_0, \theta_1, \theta_2, \theta_3$, and a sigmoid unit producing $\hat{y}$)
$z = \theta^{\top}x, \qquad \hat{y} = \sigma(z) = \frac{1}{1+e^{-z}}$
Computing Propagations - Illustration
• Assume that our cost function is
$J(a, b, c) = 3(a + bc)$
where
$u = bc, \quad v = a + u, \quad J = 3v$
(graph: $a, b, c \;\rightarrow\; u = bc \;\rightarrow\; v = a + u \;\rightarrow\; J = 3v$)
Computing Propagations - Illustration
• Assume that our cost function is $J(a,b,c) = 3(a + bc)$, where $u = bc$, $v = a + u$, $J = 3v$.
(graph with $a = 5$, $b = 3$, $c = 2$: $u = 6$, $v = 11$, $J = 33$)
left to right → forward propagation
Computing Propagations - Illustration
• Assume that our cost function is $J(a,b,c) = 3(a + bc)$, where $u = bc$, $v = a + u$, $J = 3v$.
(graph with $a = 5$, $b = 3$, $c = 2$: $u = 6$, $v = 11$, $J = 33$)
right to left → backward propagation
Computing Derivatives in Backward Pass
• To know how the cost function $J$ changes when a little change is made to $v$, we calculate the derivative: $\frac{dJ}{dv}$
(graph with $a = 5$, $b = 3$, $c = 2$: $u = 6$, $v = 11$, $J = 33$; right to left → backward propagation)
Computing Derivatives in Backward Pass
• To know how the cost function $J$ changes when a little change is made to $a$, we calculate the derivative: $\frac{dJ}{da} = \frac{dJ}{dv}\frac{dv}{da}$
(graph with $a = 5$, $b = 3$, $c = 2$: $u = 6$, $v = 11$, $J = 33$; right to left → backward propagation)
Computing Derivatives in Backward Pass
• Chain rule: The amount of change in $J$ is equal to the product of
  • how much $v$ changes with $a$, and
  • how much $J$ changes with $v$:
$\frac{dJ}{da} = \frac{dJ}{dv}\frac{dv}{da}$
(graph with $a = 5$, $b = 3$, $c = 2$: $u = 6$, $v = 11$, $J = 33$; right to left → backward propagation)
Computing Derivatives in Backward Pass
• To know how the cost function $J$ changes when a little change is made to $b$, we calculate the derivative: $\frac{dJ}{db} = \frac{dJ}{dv}\frac{dv}{du}\frac{du}{db}$
(graph with $a = 5$, $b = 3$, $c = 2$: $u = 6$, $v = 11$, $J = 33$; right to left → backward propagation)
Computing Derivatives in Backward Pass
• To know how the cost function $J$ changes when a little change is made to $c$, we calculate the derivative: $\frac{dJ}{dc} = \frac{dJ}{dv}\frac{dv}{du}\frac{du}{dc}$
(graph with $a = 5$, $b = 3$, $c = 2$: $u = 6$, $v = 11$, $J = 33$; right to left → backward propagation)
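The forward and backward passes over this small graph can be traced in a few lines of Python, as a direct transcription of the slide's example:

```python
# Forward pass (left to right) through the graph u = b*c, v = a + u, J = 3*v
a, b, c = 5.0, 3.0, 2.0
u = b * c          # 6
v = a + u          # 11
J = 3 * v          # 33

# Backward pass (right to left), applying the chain rule at each node
dJ_dv = 3.0                 # J = 3v        ->  dJ/dv = 3
dJ_da = dJ_dv * 1.0         # v = a + u     ->  dv/da = 1
dJ_du = dJ_dv * 1.0         # v = a + u     ->  dv/du = 1
dJ_db = dJ_du * c           # u = b*c       ->  du/db = c
dJ_dc = dJ_du * b           # u = b*c       ->  du/dc = b
print(J, dJ_da, dJ_db, dJ_dc)  # 33.0 3.0 6.0 9.0
```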
Computation Graph for Logistic Regression
• Recap again that:
$z = \omega^{\top}x + b$
$\hat{y} = a = \sigma(z)$
$\mathcal{L}(a, y) = -\left[\, y \log a + (1-y)\log(1-a) \,\right]$
(graph: $x_1, \omega_1, x_2, \omega_2, b \;\rightarrow\; z = \omega_1 x_1 + \omega_2 x_2 + b \;\rightarrow\; \hat{y} = a = \sigma(z) \;\rightarrow\; \mathcal{L}(a, y)$)
Computation Graph for Logistic Regression
• Moving right to left through the same graph, the backward pass computes, in order:
$\frac{d\mathcal{L}(a,y)}{da}, \quad \frac{d\mathcal{L}(a,y)}{dz}, \quad \frac{d\mathcal{L}(a,y)}{d\omega_1}, \quad \frac{d\mathcal{L}(a,y)}{d\omega_2}, \quad \frac{d\mathcal{L}(a,y)}{db}$
Derivation on 𝑚 Examples
• Recap cost function:
$J(\omega, b) = \frac{1}{m}\sum_{i=1}^{m}\mathcal{L}(a^{(i)}, y^{(i)})$
• Update the $k$th weight:
$\frac{\partial}{\partial \omega_k} J(\omega, b) = \frac{1}{m}\sum_{i=1}^{m}\frac{\partial}{\partial \omega_k}\mathcal{L}(a^{(i)}, y^{(i)})$
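A minimal numpy sketch of this averaged gradient over $m$ examples. The tiny dataset and the `cost_and_grads` name are our own illustrations; the gradient formulas use the fact, derived earlier in this lecture, that $d\mathcal{L}/dz = a - y$ for the cross-entropy loss:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cost_and_grads(w, b, X, y):
    """X: (n, m) inputs stacked as columns, y: (m,) labels.
    Returns J(w, b) and the gradients averaged over the m examples."""
    m = X.shape[1]
    a = sigmoid(w @ X + b)                          # predictions for all m examples
    J = -np.mean(y * np.log(a) + (1 - y) * np.log(1 - a))
    dz = a - y                                      # dL/dz per example
    dw = (X @ dz) / m                               # (1/m) * sum_i dL/dw_k
    db = np.mean(dz)
    return J, dw, db

# Made-up dataset (2 features, 4 examples) just to exercise the formulas
X = np.array([[0.0, 1.0, 2.0, 3.0],
              [1.0, 0.0, 1.0, 0.0]])
y = np.array([0.0, 0.0, 1.0, 1.0])
J, dw, db = cost_and_grads(np.zeros(2), 0.0, X, y)
print(J)  # log(2) at w = 0, b = 0, since every prediction is 0.5
```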
Multi-layer Neural Network
Overview
(figure: inputs $x_1, x_2, x_3$ feeding a single neuron that outputs $\hat{y} = a$)
$z = w^{\top}x + b, \qquad a = \sigma(z), \qquad \mathcal{L}(a, y)$
Overview
(figure: inputs $x_1, x_2, x_3$ feeding a hidden layer [1] and an output layer [2], with $\hat{y} = a^{[2]}$)
$z^{[1]} = w^{[1]}x + b^{[1]}, \qquad a^{[1]} = \sigma(z^{[1]})$
$z^{[2]} = w^{[2]}a^{[1]} + b^{[2]}, \qquad a^{[2]} = \sigma(z^{[2]}), \qquad \mathcal{L}(a^{[2]}, y)$
(inside a neuron: $x, w^{[1]}, b^{[1]} \;\rightarrow\; z^{[1]} \;\rightarrow\; a^{[1]}$)
Representation – 2 Layer Neural Network
(figure: input layer $x_1, x_2, x_3$; hidden layer $a_1^{[1]}, a_2^{[1]}, a_3^{[1]}, a_4^{[1]}$; output layer $a^{[2]} = \hat{y}$)
$a^{[0]} = X, \qquad a^{[1]} = \left[ a_1^{[1]}, a_2^{[1]}, a_3^{[1]}, a_4^{[1]} \right]^{\top}$
Parameters: $W^{[1]}_{4\times 3}, \; b^{[1]}_{4\times 1}$ and $W^{[2]}_{1\times 4}, \; b^{[2]}_{1\times 1}$
Computing Output
(figure: the 2-layer network above, with hidden units $a_1^{[1]}, \dots, a_4^{[1]}$)
Per hidden unit ($i = 1, \dots, 4$):
$z_i^{[1]} = w_i^{[1]\top}x + b_i^{[1]}, \qquad a_i^{[1]} = \sigma(z_i^{[1]})$
Stacking the four units, with the row vectors $w_i^{[1]\top}$ as the rows of a matrix:
$z^{[1]} = \begin{bmatrix} z_1^{[1]} \\ z_2^{[1]} \\ z_3^{[1]} \\ z_4^{[1]} \end{bmatrix} = \begin{bmatrix} -\; w_1^{[1]\top} \; - \\ -\; w_2^{[1]\top} \; - \\ -\; w_3^{[1]\top} \; - \\ -\; w_4^{[1]\top} \; - \end{bmatrix} \begin{bmatrix} x_1 \\ x_2 \\ x_3 \end{bmatrix} + \begin{bmatrix} b_1^{[1]} \\ b_2^{[1]} \\ b_3^{[1]} \\ b_4^{[1]} \end{bmatrix}, \qquad a^{[1]} = \begin{bmatrix} a_1^{[1]} \\ a_2^{[1]} \\ a_3^{[1]} \\ a_4^{[1]} \end{bmatrix} = \sigma(z^{[1]})$
Computing Output
(figure: the same 2-layer network)
$z^{[1]} = W^{[1]}a^{[0]} + b^{[1]}, \qquad a^{[1]} = \sigma(z^{[1]})$
$z^{[2]} = W^{[2]}a^{[1]} + b^{[2]}, \qquad a^{[2]} = \sigma(z^{[2]})$
Vectorization
computing the output for $m$ inputs:
$x^{(1)} \rightarrow a^{[2](1)} = \hat{y}^{(1)}$
$x^{(2)} \rightarrow a^{[2](2)} = \hat{y}^{(2)}$
$\;\;\vdots$
$x^{(m)} \rightarrow a^{[2](m)} = \hat{y}^{(m)}$
Vectorization
for i = 1 to m:
$\quad z^{[1](i)} = W^{[1]}x^{(i)} + b^{[1]}, \qquad a^{[1](i)} = \sigma(z^{[1](i)})$
$\quad z^{[2](i)} = W^{[2]}a^{[1](i)} + b^{[2]}, \qquad a^{[2](i)} = \sigma(z^{[2](i)})$
Vectorization
Stacking the $m$ examples as columns removes the loop:
$X_{n\times m} = \begin{bmatrix} | & | & & | \\ x^{(1)} & x^{(2)} & \cdots & x^{(m)} \\ | & | & & | \end{bmatrix}, \qquad Z^{[1]} = \begin{bmatrix} | & & | \\ z^{[1](1)} & \cdots & z^{[1](m)} \\ | & & | \end{bmatrix}, \qquad A^{[1]} = \begin{bmatrix} | & & | \\ a^{[1](1)} & \cdots & a^{[1](m)} \\ | & & | \end{bmatrix}$
$Z^{[1]} = W^{[1]}X + b^{[1]}, \qquad A^{[1]} = \sigma(Z^{[1]})$
$Z^{[2]} = W^{[2]}A^{[1]} + b^{[2]}, \qquad A^{[2]} = \sigma(Z^{[2]})$
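A small numpy check that the per-example loop and the column-stacked version compute the same outputs. The weights here are random, purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

n_x, n_h, m = 3, 4, 5                      # 3 inputs, 4 hidden units, m = 5 examples
W1, b1 = rng.standard_normal((n_h, n_x)), rng.standard_normal((n_h, 1))
W2, b2 = rng.standard_normal((1, n_h)), rng.standard_normal((1, 1))
X = rng.standard_normal((n_x, m))          # examples stacked as columns

# Loop version: one column x^(i) at a time
A2_loop = np.zeros((1, m))
for i in range(m):
    x = X[:, i:i+1]
    a1 = sigmoid(W1 @ x + b1)
    A2_loop[:, i:i+1] = sigmoid(W2 @ a1 + b2)

# Vectorized version: all m columns at once
A1 = sigmoid(W1 @ X + b1)
A2 = sigmoid(W2 @ A1 + b2)
print(np.allclose(A2, A2_loop))  # True — the two computations agree
```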
Backpropagation in NN
(figure: Layer 1 — input layer with bias unit 1 and $x_1, x_2, x_3$; Layer 2 — hidden layer with bias unit 1 and $a_1^{(2)}, \dots, a_4^{(2)}$; Layer 3 — output layer with $a_1^{(3)} = h_\theta(a)$; the highlighted weight is an output-layer weight $\theta_{1j}^{(2)}$)
Error function: $E = (a_1^{(3)}, y)$
Remember: Gradient Descent is used to propagate the error backward.
Update equation for a weight $\theta_{1j}^{(2)}$:
$\theta_{1j}^{(2)} := \theta_{1j}^{(2)} - \eta \frac{\partial}{\partial \theta_{1j}^{(2)}} E$
Backpropagation in NN
(figure: the same 3-layer network, now highlighting a hidden-layer weight $\theta_{ij}^{(1)}$)
Error function: $E = (a_1^{(3)}, y)$
Remember: Gradient Descent is used to propagate the error backward. The same update equation applies to a hidden-layer weight $\theta_{ij}^{(1)}$:
$\theta_{ij}^{(1)} := \theta_{ij}^{(1)} - \eta \frac{\partial}{\partial \theta_{ij}^{(1)}} E$
Backpropagation in NN
(figure: the same 3-layer network)
Remember:
$a_1^{(3)} = \sigma(z), \qquad z = \theta A = \theta_{10}^{(2)}a_0^{(2)} + \theta_{11}^{(2)}a_1^{(2)} + \theta_{12}^{(2)}a_2^{(2)} + \theta_{13}^{(2)}a_3^{(2)}$
chain rule:
$\frac{\partial}{\partial \theta_{1j}^{(2)}} E = \frac{\partial E}{\partial a_1^{(3)}} \frac{\partial a_1^{(3)}}{\partial z} \frac{\partial z}{\partial \theta_{1j}^{(2)}}$
$\theta_{1j}^{(2)} := \theta_{1j}^{(2)} - \eta \frac{\partial}{\partial \theta_{1j}^{(2)}} E$
Backpropagation in NN
$E(a_1^{(3)}, y) = -\left( y \log a_1^{(3)} + (1-y)\log(1 - a_1^{(3)}) \right)$
Let:
$z = \theta A = \theta_{10}^{(2)}a_0^{(2)} + \theta_{11}^{(2)}a_1^{(2)} + \theta_{12}^{(2)}a_2^{(2)} + \theta_{13}^{(2)}a_3^{(2)}$
$\frac{\partial}{\partial \theta_{1j}^{(2)}} E = \frac{\partial E}{\partial a_1^{(3)}} \frac{\partial a_1^{(3)}}{\partial z} \frac{\partial z}{\partial \theta_{1j}^{(2)}}$
Piece by piece:
$\frac{\partial E}{\partial a_1^{(3)}} = -y \times \frac{1}{a_1^{(3)}} - (1-y) \times \frac{1}{1-a_1^{(3)}} \times (-1) = \frac{a_1^{(3)} - y}{a_1^{(3)}(1 - a_1^{(3)})}$
$h_\theta(z) = a_1^{(3)} = \frac{1}{1+e^{-z}} \;\Rightarrow\; \frac{\partial a_1^{(3)}}{\partial z} = a_1^{(3)}(1 - a_1^{(3)})$
$\frac{\partial z}{\partial \theta_{1j}^{(2)}} = a_j^{(2)}$
Putting it together:
$\frac{\partial}{\partial \theta_{1j}^{(2)}} E = \frac{a_1^{(3)} - y}{a_1^{(3)}(1 - a_1^{(3)})} \, a_1^{(3)}(1 - a_1^{(3)}) \, a_j^{(2)} = (a_1^{(3)} - y)\, a_j^{(2)}$
For the bias weight ($a_0^{(2)} = 1$):
$\frac{\partial}{\partial \theta_{10}^{(2)}} E = (a_1^{(3)} - y)\, a_0^{(2)} = a_1^{(3)} - y$
(figure: the same 3-layer network)
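The closed-form gradient $(a - y)\,a_j$ for an output-layer weight can be verified numerically with a central finite-difference check. The weights and hidden activations below are made-up values, and the `loss` helper is ours:

```python
import math

def loss(theta, acts, y):
    """E = -(y log a + (1-y) log(1-a)) with a = sigmoid(sum_j theta_j * acts_j)."""
    z = sum(t * a for t, a in zip(theta, acts))
    a = 1.0 / (1.0 + math.exp(-z))
    return -(y * math.log(a) + (1 - y) * math.log(1 - a))

theta = [0.2, -0.4, 0.7, 0.1]   # output-layer weights (illustrative values)
acts  = [1.0, 0.3, 0.5, 0.8]    # hidden activations; acts[0] = 1 is the bias unit
y = 1.0

z = sum(t * a for t, a in zip(theta, acts))
a_out = 1.0 / (1.0 + math.exp(-z))

for j in range(4):
    analytic = (a_out - y) * acts[j]            # (a - y) * a_j from the chain rule
    eps = 1e-6                                  # central finite difference
    tp = theta[:]; tp[j] += eps
    tm = theta[:]; tm[j] -= eps
    numeric = (loss(tp, acts, y) - loss(tm, acts, y)) / (2 * eps)
    assert abs(analytic - numeric) < 1e-6
print("analytic gradient matches finite differences")
```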
Variants of Gradient Descent
Batch gradient descent (BGD):
• All the training data is passed through the model before its parameters are updated.
• The gradients are averaged over the entire training set for each update.
Variants of Gradient Descent
Stochastic gradient descent (SGD):
• Only one sample from the training data is passed before updating the model parameters.
• The gradient is then calculated according to that single sample.
• Ideal when the data size is too large to process at once.
Variants of Gradient Descent
Mini-batch gradient descent (MBGD):
• SGD slows down the computation because the gradient must be calculated for each individual example.
• Instead, MBGD takes a batch of size between 1 and the full dataset into consideration for each update.
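The batching that distinguishes the three variants can be sketched in a few lines (the `minibatches` helper is ours): `batch_size = 1` gives SGD, `batch_size = len(data)` gives BGD, and anything in between gives mini-batch GD.

```python
import random

def minibatches(data, batch_size, seed=0):
    """Shuffle the indices once, then yield consecutive slices of batch_size."""
    idx = list(range(len(data)))
    random.Random(seed).shuffle(idx)   # seeded shuffle so the split is reproducible
    for start in range(0, len(idx), batch_size):
        yield [data[i] for i in idx[start:start + batch_size]]

data = list(range(10))
batches = list(minibatches(data, batch_size=4))
print([len(b) for b in batches])  # [4, 4, 2] — the last batch may be smaller
```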
Multi-output NN
(figure: network with bias units and inputs $x_1, x_2, x_3$; a hidden layer $a_1^{(2)}, a_2^{(2)}, a_3^{(2)}$; and an output layer $a_1^{(3)}, a_2^{(3)}, a_3^{(3)}$ producing $h_\theta(a^{(3)}) \in \mathbb{R}^3$)
Each class corresponds to one output unit, e.g.
$h_\theta(a^{(3)}) = \begin{bmatrix} 0 \\ 1 \\ 0 \end{bmatrix}, \qquad h_\theta(a^{(3)}) = \begin{bmatrix} 1 \\ 0 \\ 0 \end{bmatrix}, \qquad h_\theta(a^{(3)}) = \begin{bmatrix} 0 \\ 0 \\ 1 \end{bmatrix}$
Terminologies in NN
(figure: a training set split into batch#1, batch#2, batch#3, batch#4)
Batch: the # of training examples in a single split.
Iteration: the # of steps needed to pass all batches; # of iterations = # of batches in an epoch. (Here, 4 iterations are needed to pass all batches.)
Epoch: one pass in which the entire dataset is given to the algorithm. (The 1st, 2nd, …, kth epoch each go through batch#1 … batch#4 again.)
Terminologies in NN
2000 instances in the training dataset
Batch size: 400
Iterations per epoch: 2000 / 400 = 5
Epochs: any number.
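The arithmetic of the example generalizes to a one-liner (the function name is ours; `ceil` covers the case where the last batch is smaller):

```python
import math

def iterations_per_epoch(n_examples, batch_size):
    """Number of gradient steps needed to pass the whole training set once."""
    return math.ceil(n_examples / batch_size)

print(iterations_per_epoch(2000, 400))          # 5, matching the example above
total_steps = iterations_per_epoch(2000, 400) * 3  # running 3 epochs -> 15 update steps
print(total_steps)
```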
Activation Functions
Activation Functions
• Most popular nonlinear activation functions:
• Sigmoid,
• Hyperbolic tangent (tanh),
• Rectified Linear Unit (ReLU),
• Leaky Rectified Linear Unit (LReLU),
• Softmax.
Sigmoid
• It is often used at the final layer of ANNs to squash the output into the range (0, 1).
• Gradient information can be lost.
Vanishing gradient:
• A case in which gradient information is lost.
• Arises when the input is too large (in the positive or negative direction), where the curve saturates.
• Prevents the algorithm from updating the weights.
$g(x) = \frac{1}{1+e^{-x}}$
Hyperbolic (Tangent)
• It produces an output between −1 and 1.
• Often works better than the sigmoid function.
• Its derivative is steeper around zero.
• It has a vanishing gradient problem as well.
$g(x) = \frac{e^{x} - e^{-x}}{e^{x} + e^{-x}}$
Rectified Linear Unit (ReLU)
• It behaves like a linear function when $x > 0$.
• No derivative information is obtainable when $x < 0$ (the gradient there is 0).
$g(x) = \max(0, x)$
Rectified Linear Unit (ReLU)
Dying ReLU:
• ‘A ReLU neuron is “dead” if it’s stuck in the negative
side and always outputs 0. Because the slope of ReLU
in the negative range is also 0, once a neuron gets
negative, it’s unlikely for it to recover. Such neurons
are not playing any role in discriminating the input and
is essentially useless. Over the time you may end up
with a large part of your network doing nothing.’
𝑔(𝑥) = max(0, 𝑥)
Leaky ReLU
• It behaves like a linear function when $x > 0$.
• Derivative information is still obtainable when $x < 0$.
$g(x) = \max(\mu x, x)$, with a small slope $\mu$ (e.g., 0.01)
Softmax
• It is used at the output layer.
• It is a generalization of the logistic activation function, used for multiclass classification.
• Gives the probability of each neuron being true for the corresponding class.
$g(x_i) = \frac{e^{x_i}}{\sum_{j=1}^{K} e^{x_j}}$
image credit: towardsdatascience.com
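The five activations can be sketched in a few lines of numpy. The max-shift inside softmax is a standard numerical-stability trick, our addition rather than part of the slides:

```python
import numpy as np

def sigmoid(x):               # squashes into (0, 1)
    return 1.0 / (1.0 + np.exp(-x))

def tanh(x):                  # squashes into (-1, 1)
    return np.tanh(x)

def relu(x):                  # 0 for x < 0, identity for x > 0
    return np.maximum(0.0, x)

def leaky_relu(x, mu=0.01):   # small slope mu keeps a gradient alive for x < 0
    return np.maximum(mu * x, x)

def softmax(x):
    e = np.exp(x - np.max(x))  # subtracting the max avoids overflow
    return e / e.sum()

x = np.array([-2.0, 0.0, 3.0])
print(sigmoid(x), relu(x), leaky_relu(x))
p = softmax(x)
print(p, p.sum())  # class probabilities summing to 1
```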
Cost Functions
Categories
• Regression
• Square Error/Quadratic Loss/L2 Loss
• Absolute Error/L1 Loss
• Bias Error
• Classification
• Hinge Loss/Multi class SVM Loss
• Cross Entropy Loss/Negative Log Likelihood
• Binary Cross Entropy Loss/ Log Loss
Square Error/Quadratic Loss/L2 Loss
• The ‘squared’ difference between prediction and actual observation.
$\mathcal{L}(a, y) = (a - y)^2$
Absolute Error/L1 Loss
• The ‘absolute’ difference between prediction and actual observation.
$\mathcal{L}(a, y) = |a - y|$
Bias Error
• This is much less common in machine learning domain.
$\mathcal{L}(a, y) = a - y$
Hinge Loss/Multi class SVM Loss
• The score of the correct category should be greater than the sum of the scores of all incorrect categories by some safety margin.
$SVM\,Loss = \sum_{j \neq y_i} \max(0, \, s_j - s_{y_i} + 1)$
Hinge Loss/Multi class SVM Loss
## 1st training example
• max(0, (1.49) - (-0.39) + 1) + max(0, (4.21) - (-0.39) + 1)
• max(0, 2.88) + max(0, 5.6)
• 2.88 + 5.6
• 8.48 (High loss as very wrong prediction)
Hinge Loss/Multi class SVM Loss
## 2nd training example
• max(0, (-4.61) - (3.28)+ 1) + max(0, (1.46) - (3.28)+ 1)
• max(0, -6.89) + max(0, -0.82)
• 0 + 0
• 0 (Zero loss as correct prediction)
Hinge Loss/Multi class SVM Loss
## 3rd training example
• max(0, (1.03) - (-2.27)+ 1) + max(0, (-2.37) - (-2.27)+ 1)
• max(0, 4.3) + max(0, 0.9)
• 4.3 + 0.9
• 5.2 (High loss as very wrong prediction)
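The three worked examples can be reproduced with a direct transcription of the hinge-loss formula. By convention of this sketch, index 0 holds the correct class's score:

```python
def svm_loss(scores, correct):
    """Hinge loss: sum over incorrect classes of max(0, s_j - s_correct + 1)."""
    return sum(max(0.0, s - scores[correct] + 1.0)
               for j, s in enumerate(scores) if j != correct)

# The three training examples above, correct class score listed first
print(round(svm_loss([-0.39, 1.49, 4.21], 0), 2))  # 8.48 (high loss, very wrong)
print(round(svm_loss([3.28, -4.61, 1.46], 0), 2))  # 0.0  (zero loss, correct)
print(round(svm_loss([-2.27, 1.03, -2.37], 0), 2)) # 5.2  (high loss, very wrong)
```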
Cross Entropy Loss/Log Loss/Log Likelihood
• Cross-entropy is a commonly used loss function for multi-class
classification tasks.
$\mathcal{L}(a, y) = -\sum_{i} y_i \log a_i$
Cross Entropy Loss/Log Loss/Log Likelihood
$prediction\,(a) = \begin{bmatrix} 0.9 \\ 0.1 \\ 0.0 \end{bmatrix}, \qquad ground\ truth\,(y) = \begin{bmatrix} 1 \\ 0 \\ 0 \end{bmatrix}$
$\mathcal{L}(a, y) = -\left( y_1 \log a_1 + y_2 \log a_2 + y_3 \log a_3 \right)$
$\mathcal{L}(a, y) = -\left( 1 \log 0.9 + 0 \log 0.1 + 0 \log 0.0 \right)$
$\mathcal{L}(a, y) = -\left( 1(-0.105) + \mathbf{0}(-2.303) + \mathbf{0}(-\infty) \right)$
$\mathcal{L}(a, y) = 0.105$
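The same computation in code, using the natural logarithm; the small `eps` guarding log(0) for the zero-probability class is our addition:

```python
import math

def cross_entropy(a, y, eps=1e-12):
    """L(a, y) = -sum_i y_i log a_i; eps guards against log(0)."""
    return -sum(yi * math.log(ai + eps) for ai, yi in zip(a, y))

a = [0.9, 0.1, 0.0]   # predicted probabilities
y = [1, 0, 0]         # one-hot ground truth
print(round(cross_entropy(a, y), 3))  # 0.105 — only the true class's term survives
```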
Binary Cross Entropy Loss
• Log loss is a commonly used loss function for binary classification tasks.
$\mathcal{L}(a, y) = -\left[\, y \log a + (1-y)\log(1-a) \,\right]$
Deep Neural Network
Architecture
(figure: four networks over the same inputs $x_1, x_2, x_3$ and output $\hat{y}$: a perceptron, a 2 layer NN, a 3 layer NN, and an L layer NN)
Notations
(figure: 4 layer NN with $x = a^{[0]}$, activations $a^{[1]}, a^{[2]}, a^{[3]}, a^{[4]} = \hat{y}$, and $n^{[0]} = 3$, $n^{[1]} = 4$, $n^{[2]} = 4$, $n^{[3]} = 3$, $n^{[4]} = 1$)
$L = 4$ (# of layers)
$n^{[l]}$ = # of units in layer $l$
$a^{[l]}$ = the vector of activations in layer $l$, with $a^{[l]} = g^{[l]}(z^{[l]})$
$g^{[l]}$ = activation function in layer $l$
$W^{[l]}$ = weights for $z^{[l]}$
$b^{[l]}$ = bias weights for $z^{[l]}$
Forward Propagation
(figure: the 4 layer NN, $x = a^{[0]} \rightarrow a^{[1]} \rightarrow a^{[2]} \rightarrow a^{[3]} \rightarrow a^{[4]} = \hat{y}$)
$z^{[1]} = W^{[1]\top}a^{[0]} + b^{[1]}, \qquad a^{[1]} = g^{[1]}(z^{[1]})$
$z^{[2]} = W^{[2]\top}a^{[1]} + b^{[2]}, \qquad a^{[2]} = g^{[2]}(z^{[2]})$
$\;\;\vdots$
$z^{[4]} = W^{[4]\top}a^{[3]} + b^{[4]}, \qquad a^{[4]} = \hat{y} = g^{[4]}(z^{[4]})$
General Form: $z^{[l]} = W^{[l]\top}a^{[l-1]} + b^{[l]}, \qquad a^{[l]} = g^{[l]}(z^{[l]})$
Forward Propagation - Vectorization
(figure: the same 4 layer NN)
$Z^{[1]} = W^{[1]\top}A^{[0]} + b^{[1]}, \qquad A^{[1]} = g^{[1]}(Z^{[1]})$
$Z^{[2]} = W^{[2]\top}A^{[1]} + b^{[2]}, \qquad A^{[2]} = g^{[2]}(Z^{[2]})$
$\;\;\vdots$
$Z^{[4]} = W^{[4]\top}A^{[3]} + b^{[4]}, \qquad A^{[4]} = \hat{Y} = g^{[4]}(Z^{[4]})$
General Form: $Z^{[l]} = W^{[l]\top}A^{[l-1]} + b^{[l]}, \qquad A^{[l]} = g^{[l]}(Z^{[l]})$
Forward Propagation - Vectorization
(figure: the same 4 layer NN)
For $l = 1$ to $L$:
$\quad Z^{[l]} = W^{[l]\top}A^{[l-1]} + b^{[l]}$
$\quad A^{[l]} = g^{[l]}(Z^{[l]})$
Parameter Dimensions: for simplicity, take $W^{[l]\top} = W^{[l]}$ in what follows.
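The general-form loop translates almost line for line into numpy. Layer sizes follow the 4-layer example; taking $W^{[l]\top} = W^{[l]}$ as above, the random weights and the `forward` name are our own illustration:

```python
import numpy as np

sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

def forward(X, params, activations):
    """For l = 1..L: Z[l] = W[l] A[l-1] + b[l], A[l] = g[l](Z[l])."""
    A = X                                   # A[0] = X
    caches = []
    for (W, b), g in zip(params, activations):
        Z = W @ A + b                       # linear step for layer l
        caches.append(Z)                    # cached for the backward pass
        A = g(Z)
    return A, caches

rng = np.random.default_rng(1)
layer_sizes = [3, 4, 4, 3, 1]               # n[0]..n[4] from the 4-layer example
params = [(rng.standard_normal((n, p)) * 0.1, np.zeros((n, 1)))
          for p, n in zip(layer_sizes, layer_sizes[1:])]
X = rng.standard_normal((3, 5))             # 5 examples stacked as columns
Y_hat, _ = forward(X, params, [np.tanh, np.tanh, np.tanh, sigmoid])
print(Y_hat.shape)  # (1, 5)
```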
Weight (𝑾[𝒍]) and Bias (𝒃[𝒍]) Parameters
(figure: the 4 layer NN with $n^{[0]} = 3$, $n^{[1]} = 4$, $n^{[2]} = 4$, $n^{[3]} = 3$, $n^{[4]} = 1$)
$b^{[l]}, \; z^{[l]}, \; a^{[l]} : (n^{[l]}, 1)$
$W^{[l]} : (n^{[l]}, n^{[l-1]})$
Weight ($W^{[l]}$) and Bias ($b^{[l]}$) Parameters
(figure: the 4 layer NN with $n^{[0]} = 3$, $n^{[1]} = 4$, $n^{[2]} = 4$, $n^{[3]} = 3$, $n^{[4]} = 1$)
$z^{[1]} = W^{[1]}a^{[0]} + b^{[1]}$: $\quad (4,1) = (4,3)(3,1) + (4,1)$
$z^{[2]} = W^{[2]}a^{[1]} + b^{[2]}$: $\quad (4,1) = (4,4)(4,1) + (4,1)$
$z^{[3]} = W^{[3]}a^{[2]} + b^{[3]}$: $\quad (3,1) = (3,4)(4,1) + (3,1)$
$z^{[4]} = W^{[4]}a^{[3]} + b^{[4]}$: $\quad (1,1) = (1,3)(3,1) + (1,1)$
Weight (𝑾[𝒍]) and Bias (𝒃[𝒍]) Parameters
(figure: the same 4 layer NN)
Parameter Dimensions (vectorization):
$Z^{[l]}, \; A^{[l]} : (n^{[l]}, m)$
$Z^{[1]} = W^{[1]}A^{[0]} + b^{[1]}$: $\quad (4,m) = (4,3)(3,m) + (4,1)$, with $b^{[1]}$ broadcast to $(4,m)$
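The dimension rules can be spelled out as a quick bookkeeping check for the 4-layer example ($m = 7$ below is an arbitrary batch size of our choosing):

```python
# Dimension rules: W[l]: (n[l], n[l-1]), b[l]: (n[l], 1), Z[l], A[l]: (n[l], m)
n = [3, 4, 4, 3, 1]          # n[0]..n[4] from the 4-layer example
m = 7                        # any batch size

shapes_W = [(n[l], n[l - 1]) for l in range(1, len(n))]
shapes_b = [(n[l], 1) for l in range(1, len(n))]
shapes_Z = [(n[l], m) for l in range(1, len(n))]

print(shapes_W)  # [(4, 3), (4, 4), (3, 4), (1, 3)]
# In Z[l] = W[l] A[l-1] + b[l], the (n[l], 1) bias is broadcast across the m columns
print(shapes_b[0], shapes_Z[0])  # (4, 1) (4, 7)
```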
Deep Representation
• Boolean functions:
ØEvery Boolean function can be represented exactly by a neural network
ØThe number of hidden layers might need to grow with the number of inputs
• Continuous functions:
ØEvery bounded continuous function can be approximated with small error
with two layers
• Arbitrary functions:
Ø Three layers can approximate any arbitrary function.
• Cybenko, G. (1989) "Approximation by superpositions of a sigmoidal function", Mathematics of Control, Signals, and Systems, 2(4), 303–314.
• Hornik, K. (1991) "Approximation Capabilities of Multilayer Feedforward Networks", Neural Networks, 4(2), 251–257.
• Hornik, K., Stinchcombe, M., and White, H. (1989) "Multilayer feedforward networks are universal approximators", Neural Networks, 2(5), 359–366.
Deep Representation
Why go deeper if three layers is sufficient?
• Going deeper helps convergence in “big” problems.
• Going deeper in “old-fashioned trained” ANNs does not help much in accuracy.
• Choromanska, A., Henaff, M., Mathieu, M., Arous, G. B., & LeCun, Y. (2015, February). The loss surfaces of multilayer networks. In Artificial Intelligence and
Statistics (pp. 192-204).
Deep Representation
More hidden neurons can represent more complicated functions.
Figure: https://cs231n.github.io/
Deep Representation
Several rules of thumb for the # of hidden units:
• The number of hidden neurons should be between the size of the input
layer and the size of the output layer.
• The number of hidden neurons should be 2/3 the size of the input layer,
plus the size of the output layer.
• The number of hidden neurons should be less than twice the size of the
input layer.
• The size of hidden neurons is gradually decreased from input to output
layers.
Deep Representation
# of hidden layers
• Depends on the nature of the problem
• Linear classification? Then no hidden layers are needed.
• Non-linear classification?
• Trial and error is helpful.
• Watch the validation and training loss curves throughout the epochs:
• If the gap between the losses is small, you can increase the capacity (neurons and layers),
• If the training loss is much smaller than the validation loss, you should decrease the capacity.
Deep Representation
What do the layers
represent?
Deep Representation
(figure: Low-level Feature → Mid-level Feature → High-level Feature → Classifier → ‘car’; early layers learn simple features, e.g., edge orientation and size; later layers learn complex features as non-linear compositions of previous layer(s))
Building Blocks
(figure: a chain of layer blocks — the forward pass takes $x = a^{[0]}$ through $W^{[1]}, b^{[1]}$ up to $W^{[L]}, b^{[L]}$, caching $z^{[1]}, \dots, z^{[L]}$ and producing $a^{[1]}, \dots, a^{[L]} = \hat{y}$; the backward pass takes $da^{[L]}$ back through the cached values, producing $da^{[l-1]}$ and $dz^{[l]}, dW^{[l]}, db^{[l]}$ at every layer)
Update: $W^{[l]} := W^{[l]} - \alpha \, dW^{[l]}, \qquad b^{[l]} := b^{[l]} - \alpha \, db^{[l]}$
Forward Propagation for layer 𝑙
Input: $a^{[l-1]}$
Output: $a^{[l]}$, cache $z^{[l]}$ (i.e., $W^{[l]}a^{[l-1]} + b^{[l]}$)
$z^{[l]} = W^{[l]}a^{[l-1]} + b^{[l]}, \qquad a^{[l]} = g^{[l]}(z^{[l]})$
Vectorization for $m$ inputs: $Z^{[l]} = W^{[l]}A^{[l-1]} + b^{[l]}, \qquad A^{[l]} = g^{[l]}(Z^{[l]})$
Backward Propagation for layer 𝑙
Input: $da^{[l]}$
Output: $da^{[l-1]}$, $dW^{[l]}$, $db^{[l]}$
$dW^{[l]} = \frac{d\mathcal{L}}{dW^{[l]}} = \frac{d\mathcal{L}}{da^{[l]}} \frac{da^{[l]}}{dz^{[l]}} \frac{dz^{[l]}}{dW^{[l]}}$
$dz^{[l]} = da^{[l]} \times g^{[l]\prime}(z^{[l]})$
Recap: $dW^{[l]} = dz^{[l]} a^{[l-1]\top}, \qquad db^{[l]} = dz^{[l]}, \qquad da^{[l-1]} = W^{[l]\top} dz^{[l]}$
Vectorization for $m$ inputs:
$dZ^{[l]} = dA^{[l]} \times g^{[l]\prime}(Z^{[l]}), \qquad dA^{[l-1]} = W^{[l]\top} dZ^{[l]}$
$dW^{[l]} = \frac{1}{m} dZ^{[l]} A^{[l-1]\top}, \qquad db^{[l]} = \frac{1}{m} \sum_{i=1}^{m} dZ^{[l](i)}$
Parameters and Hyperparameters
• Model Parameters: the parameters of the model that are determined from the training data set; these are the fitted parameters:
$W^{[l]}$ and $b^{[l]}$, where $1 \le l \le L$
• Hyperparameters: adjustable parameters that must be tuned in order to obtain a model with optimal performance:
learning rate ($\alpha$), # of iterations, # of hidden layers ($L$), # of hidden units ($n^{[l]}$), choice of activation function, and many, many more…

More Related Content

PDF
Chapter 1 - What is a Function.pdf
PDF
مدخل إلى تعلم الآلة
PDF
Lecture 5 backpropagation
PDF
11_Học máy cơ bản_Hồi quy tuyến tính.pdf
PPTX
04 Multi-layer Feedforward Networks
PDF
"Incremental Lossless Graph Summarization", KDD 2020
PPTX
Solving Poisson Equation using Conjugate Gradient Method and its implementation
PPTX
Lec05.pptx
Chapter 1 - What is a Function.pdf
مدخل إلى تعلم الآلة
Lecture 5 backpropagation
11_Học máy cơ bản_Hồi quy tuyến tính.pdf
04 Multi-layer Feedforward Networks
"Incremental Lossless Graph Summarization", KDD 2020
Solving Poisson Equation using Conjugate Gradient Method and its implementation
Lec05.pptx

Similar to Week_2_Neural_Networks_Basichhhhhhhs.pdf (20)

PPTX
Direct solution of sparse network equations by optimally ordered triangular f...
PPTX
Machine learning introduction lecture notes
PDF
Deep Feed Forward Neural Networks and Regularization
PPTX
Coursera 2week
PDF
Backpropagation - Elisa Sayrol - UPC Barcelona 2018
PDF
is anyone_interest_in_auto-encoding_variational-bayes
PPTX
Learning a nonlinear embedding by preserving class neibourhood structure 최종
PPTX
MLU_DTE_Lecture_2.pptx
PDF
機械学習と自動微分
PPTX
Deep learning study 2
PDF
GAN in_kakao
PDF
Paper Study: OptNet: Differentiable Optimization as a Layer in Neural Networks
PDF
Paper Study: Melding the data decision pipeline
PPTX
A deep learning approach for twitter spam detection lijie zhou
PPTX
2Multi_armed_bandits.pptx
PDF
Prestation_ClydeShen
PDF
Lesson_8_DeepLearning.pdf
PPTX
Reinforcement Learning basics part1
PPTX
DeepLearningLecture.pptx
PDF
Calculus Review Session Brian Prest Duke University Nicholas School of the En...
Direct solution of sparse network equations by optimally ordered triangular f...
Machine learning introduction lecture notes
Deep Feed Forward Neural Networks and Regularization
Coursera 2week
Backpropagation - Elisa Sayrol - UPC Barcelona 2018
is anyone_interest_in_auto-encoding_variational-bayes
Learning a nonlinear embedding by preserving class neibourhood structure 최종
MLU_DTE_Lecture_2.pptx
機械学習と自動微分
Deep learning study 2
GAN in_kakao
Paper Study: OptNet: Differentiable Optimization as a Layer in Neural Networks
Paper Study: Melding the data decision pipeline
A deep learning approach for twitter spam detection lijie zhou
2Multi_armed_bandits.pptx
Prestation_ClydeShen
Lesson_8_DeepLearning.pdf
Reinforcement Learning basics part1
DeepLearningLecture.pptx
Calculus Review Session Brian Prest Duke University Nicholas School of the En...
Ad

Recently uploaded (20)

PDF
Optimise Shopper Experiences with a Strong Data Estate.pdf
PPTX
retention in jsjsksksksnbsndjddjdnFPD.pptx
PDF
Introduction to the R Programming Language
PPTX
Business_Capability_Map_Collection__pptx
PPT
ISS -ESG Data flows What is ESG and HowHow
DOCX
Factor Analysis Word Document Presentation
PPTX
Pilar Kemerdekaan dan Identi Bangsa.pptx
PPTX
STERILIZATION AND DISINFECTION-1.ppthhhbx
PPTX
QUANTUM_COMPUTING_AND_ITS_POTENTIAL_APPLICATIONS[2].pptx
PDF
Microsoft Core Cloud Services powerpoint
PPTX
FMIS 108 and AISlaudon_mis17_ppt_ch11.pptx
PDF
Microsoft 365 products and services descrption
PPTX
Introduction to Inferential Statistics.pptx
PPT
DU, AIS, Big Data and Data Analytics.ppt
PDF
Navigating the Thai Supplements Landscape.pdf
PDF
Business Analytics and business intelligence.pdf
PPTX
modul_python (1).pptx for professional and student
PPTX
New ISO 27001_2022 standard and the changes
PPTX
CYBER SECURITY the Next Warefare Tactics
PPTX
DS-40-Pre-Engagement and Kickoff deck - v8.0.pptx
Optimise Shopper Experiences with a Strong Data Estate.pdf
retention in jsjsksksksnbsndjddjdnFPD.pptx
Introduction to the R Programming Language
Business_Capability_Map_Collection__pptx
ISS -ESG Data flows What is ESG and HowHow
Factor Analysis Word Document Presentation
Pilar Kemerdekaan dan Identi Bangsa.pptx
STERILIZATION AND DISINFECTION-1.ppthhhbx
QUANTUM_COMPUTING_AND_ITS_POTENTIAL_APPLICATIONS[2].pptx
Microsoft Core Cloud Services powerpoint
FMIS 108 and AISlaudon_mis17_ppt_ch11.pptx
Microsoft 365 products and services descrption
Introduction to Inferential Statistics.pptx
DU, AIS, Big Data and Data Analytics.ppt
Navigating the Thai Supplements Landscape.pdf
Business Analytics and business intelligence.pdf
modul_python (1).pptx for professional and student
New ISO 27001_2022 standard and the changes
CYBER SECURITY the Next Warefare Tactics
DS-40-Pre-Engagement and Kickoff deck - v8.0.pptx
Ad

Week_2_Neural_Networks_Basichhhhhhhs.pdf

  • 1. Dr. Selim Yılmaz Spring 2025 Lecture #2 Understanding Deep Networks
  • 2. Today • Gradient Descent Algorithm • Computation Graph • Multi-Layer Neural Network • Activation Functions • Loss Functions • Deep Neural Network Forward Propagation in a Deep Network Why Deep Representations? Building Blocks of Deep Neural Networks Forward and Backward Propagation Parameters and Hyperparameters
  • 4. Minimization of Cost Function • Recap: ! 𝑦 = 𝜎 𝜃! 𝑥 , 𝜎 𝑧 = 1 1 + 𝑒"# 𝐽 𝜃 = − 1 𝑚 ' !"# $ ℒ ) 𝑦!, 𝑦! = − 1 𝑚 ' !"# $ 𝑦! log ) 𝑦! + 1 − 𝑦! log 1 − ) 𝑦! • Want to find 𝜃 that minimizes 𝐽 𝜃
  • 5. Algorithm • Gradient Descent (or GD) is an algorithm that uses the gradient of given real-valued function. • The gradient gives the direction and the magnitude of the slope of function. • The direction of negative gradient is a good direction to search if we want to find a function minimizer.
  • 6. Algorithm • Gradient Descent is often used to find the minimum of the cost function in linear or in logistic regression (i.e., 𝐽(𝜃)). 𝐽(𝜃) 𝜃
  • 7. Derivatives • Derivation represents the amount of vertical change with respect to the horizontal change in the variable of the function given. • Here 𝑑𝐽(𝜃) and 𝑑𝜃 are Leibniz notation and represent, respectively, very small change in the axis 𝐽(𝜃) and 𝜃. 𝜃 𝐽(𝜃) 𝜃! tangent line the slope the derivative slope = 𝑑𝐽(𝜃) 𝑑𝜃
  • 8. Update Step Repeat until convergence { 𝜃$ = 𝜃$ − 𝛼 𝑑𝐽 𝜃 𝑑𝜃$ = 𝜃$ − 𝛼 𝜕 𝜕𝜃$ 𝐽 𝜃 } Important: Simultaneous update is must! ‘partial derivation’
  • 9. Update Step • Update procedure in GD: 𝜃% = 𝜃% − 𝛼 𝜕 𝜕𝜃% 𝐽 𝜃%, 𝜃& 𝜃& = 𝜃& − 𝛼 𝜕 𝜕𝜃& 𝐽 𝜃%, 𝜃& 𝜃$ = 𝜃$ − 𝛼 𝜕 𝜕𝜃$ 𝐽 𝜃%, 𝜃&, … , 𝜃' 𝜃" 𝜃! 𝐽 𝐽 positive slope negative slope x x
  • 10. Update Step • As parameters approach to local/global minimum the step size becomes smaller because of the decreasing slope around there. 𝜃" 𝜃! 𝐽 𝐽 positive slope negative slope x x x’ 𝜃! = 𝜃! − 𝛼(𝑝𝑜𝑠. 𝑣𝑎𝑙𝑢𝑒) 𝜃" = 𝜃" − 𝛼(𝑛𝑒𝑔. 𝑣𝑎𝑙𝑢𝑒) x’
  • 11. Learning Rate Parameter 𝜃$ = 𝜃$ − 𝛼 𝑑𝐽 𝜃 𝑑𝜃$ 𝛼 is a learning rate: • If it is too small, gradient descent can be slow • If it is too large, gradient descent can overshoot the minimum 𝜃" 𝜃" 𝐽 𝐽 too small too large x x x x x x x x x x x
  • 13. Neural Network • Logistic (or linear) regression can be viewed as a very basic neural network structure. • Computation of a neural network is organized as forward propagation step followed by backward propogation step. • In forward step, the output value of the network is computed; while in the backward step, the cost (error) is propagated to weight update.
  • 14. Neural Network 𝑥" = 𝑏 = 1 𝑥# 𝑥3 𝑥4 𝜎 5 𝑦 ) 𝑦 = 𝜎(𝑧) = 1 1 + 𝑒78 𝜃" 𝜃! 𝜃# 𝜃$ 𝑧 = 𝜃9𝑥
  • 15. Computing Propagations - Illustration • Assume that our cost function is 𝐽 𝑎, 𝑏, 𝑐 = 3 𝑎 + 𝑏𝑐 where 𝑢 = 𝑏𝑐 𝑣 = 𝑎 + 𝑢 𝐽 = 3𝑣 𝑎 𝑏 𝑐 𝑢 = 𝑏𝑐 𝑣 = 𝑎 + 𝑢 𝐽 = 3𝑣
  • 16. Computing Propagations - Illustration • Assume that our cost function is 𝐽 𝑎, 𝑏, 𝑐 = 3 𝑎 + 𝑏𝑐 where 𝑢 = 𝑏𝑐 𝑣 = 𝑎 + 𝑢 𝐽 = 3𝑣 𝑎 = 5 𝑏 = 3 𝑐 = 2 𝑢 = 𝑏𝑐 𝑣 = 𝑎 + 𝑢 𝐽 = 3𝑣 6 11 33 left to right -> forward propagation
  • 17. Computing Propagations - Illustration • Assume that our cost function is 𝐽 𝑎, 𝑏, 𝑐 = 3 𝑎 + 𝑏𝑐 where 𝑢 = 𝑏𝑐 𝑣 = 𝑎 + 𝑢 𝐽 = 3𝑣 𝑎 = 5 𝑏 = 3 𝑐 = 2 𝑢 = 𝑏𝑐 𝑣 = 𝑎 + 𝑢 𝐽 = 3𝑣 6 11 33 right to left -> backward propagation
  • 18. Computing Derivatives in Backward Pass • To know how cost function 𝐽 is changed when a little change is made on 𝑣, we calculate derivative: 𝑑𝐽 𝑑𝑣 𝑎 = 5 𝑏 = 3 𝑐 = 2 𝑢 = 𝑏𝑐 𝑣 = 𝑎 + 𝑢 𝐽 = 3𝑣 6 11 33 right to left -> backward propagation
  • 19. Computing Derivatives in Backward Pass • To know how cost function 𝐽 is changed when a little change is made on 𝑎, we calculate derivative: 𝑑𝐽 𝑑𝑎 = 𝑑𝐽 𝑑𝑣 𝑑𝑣 𝑑𝑎 𝑎 = 5 𝑏 = 3 𝑐 = 2 𝑢 = 𝑏𝑐 𝑣 = 𝑎 + 𝑢 𝐽 = 3𝑣 6 11 33 right to left -> backward propagation
  • 20. Computing Derivatives in Backward Pass • Chain rule: The amount of change on 𝐽 is equal to the product of • how much 𝑣 changes by 𝑎 and • how much 𝐽 changes by 𝑣: 𝑑𝐽 𝑑𝑎 = 𝑑𝐽 𝑑𝑣 𝑑𝑣 𝑑𝑎 𝑎 = 5 𝑏 = 3 𝑐 = 2 𝑢 = 𝑏𝑐 𝑣 = 𝑎 + 𝑢 𝐽 = 3𝑣 6 11 33 right to left -> backward propagation
  • 21. Computing Derivatives in Backward Pass • To know how cost function 𝐽 is changed when a little change is made on 𝑏, we calculate derivative: 𝑑𝐽 𝑑𝑏 = 𝑑𝐽 𝑑𝑣 𝑑𝑣 𝑑𝑢 𝑑𝑢 𝑑𝑏 𝑎 = 5 𝑏 = 3 𝑐 = 2 𝑢 = 𝑏𝑐 𝑣 = 𝑎 + 𝑢 𝐽 = 3𝑣 6 11 33 right to left -> backward propagation
  • 22. Computing Derivatives in Backward Pass • To know how cost function 𝐽 is changed when a little change is made on 𝑐, we calculate derivative: 𝑑𝐽 𝑑𝑐 = 𝑑𝐽 𝑑𝑣 𝑑𝑣 𝑑𝑢 𝑑𝑢 𝑑𝑐 𝑎 = 5 𝑏 = 3 𝑐 = 2 𝑢 = 𝑏𝑐 𝑣 = 𝑎 + 𝑢 𝐽 = 3𝑣 6 11 33 right to left -> backward propagation
  • 23. Computation Graph for Logistic Regression • Recap again that: 𝑧 = 𝜔! 𝑥 + 𝑏 ! 𝑦 = 𝑎 = 𝜎(𝑧) ℒ 𝑎, 𝑦 = − 𝑦 log 𝑎 + (1 − 𝑦) log(1 − 𝑎) 𝑥! 𝑧 = 𝜔!𝑥! + 𝜔#𝑥# + 𝑏 𝜔! 𝑥# 𝜔# 𝑏 5 𝑦 = 𝑎 = 𝜎(𝑧) ℒ 𝑎, 𝑦
  • 24. Computation Graph for Logistic Regression • Recap again that: 𝑧 = 𝜔! 𝑥 + 𝑏 ! 𝑦 = 𝑎 = 𝜎(𝑧) ℒ 𝑎, 𝑦 = − 𝑦 log 𝑎 + (1 − 𝑦) log(1 − 𝑎) 𝑥! 𝑧 = 𝜔!𝑥! + 𝜔#𝑥# + 𝑏 𝜔! 𝑥# 𝜔# 𝑏 5 𝑦 = 𝑎 = 𝜎(𝑧) ℒ 𝑎, 𝑦 𝑑ℒ(𝑎, 𝑦) 𝑑𝑎
  • 25. Computation Graph for Logistic Regression • Recap again that: 𝑧 = 𝜔! 𝑥 + 𝑏 ! 𝑦 = 𝑎 = 𝜎(𝑧) ℒ 𝑎, 𝑦 = − 𝑦 log 𝑎 + (1 − 𝑦) log(1 − 𝑎) 𝑥! 𝑧 = 𝜔!𝑥! + 𝜔#𝑥# + 𝑏 𝜔! 𝑥# 𝜔# 𝑏 5 𝑦 = 𝑎 = 𝜎(𝑧) ℒ 𝑎, 𝑦 𝑑ℒ(𝑎, 𝑦) 𝑑𝑎 𝑑ℒ(𝑎, 𝑦) 𝑑𝑧
  • 26. Computation Graph for Logistic Regression • Recap again that: 𝑧 = 𝜔! 𝑥 + 𝑏 ! 𝑦 = 𝑎 = 𝜎(𝑧) ℒ 𝑎, 𝑦 = − 𝑦 log 𝑎 + (1 − 𝑦) log(1 − 𝑎) 𝑥! 𝑧 = 𝜔!𝑥! + 𝜔#𝑥# + 𝑏 𝜔! 𝑥# 𝜔# 𝑏 5 𝑦 = 𝑎 = 𝜎(𝑧) ℒ 𝑎, 𝑦 𝑑ℒ(𝑎, 𝑦) 𝑑𝑎 𝑑ℒ(𝑎, 𝑦) 𝑑𝑧 𝑑ℒ(𝑎, 𝑦) 𝑑𝜔!
  • 27. Computation Graph for Logistic Regression • Recap again that: 𝑧 = 𝜔! 𝑥 + 𝑏 ! 𝑦 = 𝑎 = 𝜎(𝑧) ℒ 𝑎, 𝑦 = − 𝑦 log 𝑎 + (1 − 𝑦) log(1 − 𝑎) 𝑥! 𝑧 = 𝜔!𝑥! + 𝜔#𝑥# + 𝑏 𝜔! 𝑥# 𝜔# 𝑏 5 𝑦 = 𝑎 = 𝜎(𝑧) ℒ 𝑎, 𝑦 𝑑ℒ(𝑎, 𝑦) 𝑑𝑎 𝑑ℒ(𝑎, 𝑦) 𝑑𝑧 𝑑ℒ(𝑎, 𝑦) 𝑑𝜔#
  • 28. Computation Graph for Logistic Regression • Recap again that: 𝑧 = 𝜔! 𝑥 + 𝑏 ! 𝑦 = 𝑎 = 𝜎(𝑧) ℒ 𝑎, 𝑦 = − 𝑦 log 𝑎 + (1 − 𝑦) log(1 − 𝑎) 𝑥! 𝑧 = 𝜔!𝑥! + 𝜔#𝑥# + 𝑏 𝜔! 𝑥# 𝜔# 𝑏 5 𝑦 = 𝑎 = 𝜎(𝑧) ℒ 𝑎, 𝑦 𝑑ℒ(𝑎, 𝑦) 𝑑𝑎 𝑑ℒ(𝑎, 𝑦) 𝑑𝑧 𝑑ℒ(𝑎, 𝑦) 𝑑𝑏
  • 29. Derivation on 𝑚 Examples • Recap cost function: 𝐽 𝜔, 𝑏 = 1 𝑚 @ ()& * ℒ 𝑎( , 𝑦( • Update 𝑘th weight: 𝜕 𝜕𝜔' 𝐽 𝜔, 𝑏 = 1 𝑚 @ ()& * 𝜕 𝜕𝜔' ℒ 𝑎( , 𝑦(
  • 31. Overview 𝑥! 𝑥# 𝑥$ 5 𝑦 = 𝑎 𝑥 𝑧 = 𝑤% 𝑥 + 𝑏 𝑤 𝑏 𝑎 = 𝜎(𝑧) ℒ 𝑎, 𝑦
  • 32. Overview 𝑥! 𝑥# 𝑥$ 5 𝑦 = 𝑎[#] 𝑥 𝑧["] = 𝑤["] 𝑥 + 𝑏["] 𝑤[!] 𝑏[!] 𝑎["] = 𝜎(𝑧["] ) ℒ 𝑎[%] , 𝑦 [1] [2] 𝑧[%] = 𝑤[%] 𝑎["] + 𝑏[%] 𝑎[%] = 𝜎(𝑧[%] ) 𝑤[#] 𝑏[#] 𝑧["] 𝑎["] 𝑥 𝑤[!] 𝑏[!] 𝑎[!] inside a neuron
  • 33. Representation – 2 Layer Neural Network 𝑎[#] 𝑥! 𝑥# 𝑥$ 5 𝑦 = 𝑎[#] 𝑎! ! 𝑎# ! 𝑎( ! 𝑎[!] 𝑎$ ! 𝑎["] = 𝑋 𝑎[#] Input layer Output layer Hidden layer 𝑎[!] = 𝑎! ! 𝑎# ! 𝑎$ ! 𝑎( ! 𝑊 (×$ ! % , 𝑏(×! ! 𝑊 !×( # % , 𝑏!×! #
  • 34. Computing Output 𝑎[#] 𝑥! 𝑥# 𝑥$ 5 𝑦 𝑎! ! 𝑎# ! 𝑎( ! 𝑎$ ! 𝑧! ! = 𝑤! ! % 𝑥 + 𝑏! ! , 𝑎! ! = 𝜎(𝑧! ! ) 𝑧# ! = 𝑤# ! % 𝑥 + 𝑏# ! , 𝑎# ! = 𝜎(𝑧# ! ) 𝑧$ ! = 𝑤$ ! % 𝑥 + 𝑏$ ! , 𝑎$ ! = 𝜎(𝑧$ ! ) 𝑧( ! = 𝑤( ! % 𝑥 + 𝑏( ! , 𝑎( ! = 𝜎(𝑧( ! ) 𝑧[!] = 𝑧! ! 𝑧# ! 𝑧$ ! 𝑧( ! = − − 𝑤! ! % 𝑤# ! % − − − − 𝑤$ ! % 𝑤( ! % − − 𝑥! 𝑥# 𝑥$ + 𝑏! ! 𝑏# ! 𝑏$ ! 𝑏( ! 𝑎[!] = 𝑎! ! 𝑎# ! 𝑎$ ! 𝑎( ! = 𝜎(𝑧[!])
  • 35. Computing Output 𝑎[#] 𝑥! 𝑥# 𝑥$ 5 𝑦 𝑎! ! 𝑎# ! 𝑎( ! 𝑎$ ! 𝑧[!] = 𝑊[!] 𝑎["] + 𝑏[!] 𝑧[!] = 𝑧! ! 𝑧# ! 𝑧$ ! 𝑧( ! = − − 𝑤! ! % 𝑤# ! % − − − − 𝑤$ ! % 𝑤( ! % − − 𝑥! 𝑥# 𝑥$ + 𝑏! ! 𝑏# ! 𝑏$ ! 𝑏( ! 𝑎[!] = 𝜎(𝑧 ! ) 𝑧[#] = 𝑊[#]𝑎[!] + 𝑏[#] 𝑎[#] = 𝜎(𝑧 # )
  • 36. Vectorization • Computing the output for $m$ inputs: $x^{(1)} \rightarrow a^{[2](1)} = \hat{y}^{(1)}$, $x^{(2)} \rightarrow a^{[2](2)} = \hat{y}^{(2)}$, …, $x^{(m)} \rightarrow a^{[2](m)} = \hat{y}^{(m)}$
  • 37. Vectorization • Naive loop over the examples: for $i = 1$ to $m$: $z^{[1](i)} = W^{[1]} x^{(i)} + b^{[1]}$, $a^{[1](i)} = \sigma(z^{[1](i)})$, $z^{[2](i)} = W^{[2]} a^{[1](i)} + b^{[2]}$, $a^{[2](i)} = \sigma(z^{[2](i)})$
  • 38. Vectorization • Stack the $m$ examples as columns: $X_{n\times m} = [x^{(1)} \; x^{(2)} \cdots x^{(m)}]$, $Z^{[1]} = [z^{[1](1)} \cdots z^{[1](m)}]$, $A^{[1]} = [a^{[1](1)} \cdots a^{[1](m)}]$ • Then the loop collapses to: $Z^{[1]} = W^{[1]} X + b^{[1]}$, $A^{[1]} = \sigma(Z^{[1]})$, $Z^{[2]} = W^{[2]} A^{[1]} + b^{[2]}$, $A^{[2]} = \sigma(Z^{[2]})$
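A toy, pure-Python sketch of the vectorized two-layer forward pass, treating matrices as lists of rows and examples as columns. All shapes and values are illustrative; a real implementation would use a matrix library such as NumPy:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def matmul(A, B):
    # (n x k) @ (k x m) -> (n x m)
    return [[sum(A[i][t] * B[t][j] for t in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

def layer_forward(W, b, A_prev):
    # Z = W A_prev + b (b broadcast across the m columns), A = sigma(Z)
    Z = matmul(W, A_prev)
    Z = [[Z[i][j] + b[i] for j in range(len(Z[0]))] for i in range(len(Z))]
    return [[sigmoid(z) for z in row] for row in Z]

# Toy example: 3 features, 4 hidden units, 1 output, m = 2 examples
X = [[1.0, 0.5], [0.0, -1.0], [2.0, 0.3]]           # shape (3, m)
W1 = [[0.1] * 3 for _ in range(4)]; b1 = [0.0] * 4  # shapes (4, 3), (4,)
W2 = [[0.25] * 4]; b2 = [0.0]                       # shapes (1, 4), (1,)

A1 = layer_forward(W1, b1, X)   # shape (4, m)
A2 = layer_forward(W2, b2, A1)  # shape (1, m)
```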
  • 39.–40. Backpropagation in NN • A 3-layer view: layer 1 is the input layer ($1, x_1, x_2, x_3$), layer 2 the hidden layer ($1, a^{(2)}_1, a^{(2)}_2, a^{(2)}_3$), layer 3 the output layer producing $h_\theta(a) = a^{(3)}_1$ • Error function: $E = \mathcal{L}(a^{(3)}_1, y)$ • Remember: Gradient Descent is used to propagate the error backward. • The update equation has the same form for every weight, whether it feeds the output layer or the hidden layer: $\theta^{(l)}_{jk} = \theta^{(l)}_{jk} - \eta \frac{\partial E}{\partial \theta^{(l)}_{jk}}$
  • 41. Backpropagation in NN • Remember: $a^{(3)}_1 = \sigma(z^{(3)})$ with $z^{(3)} = \theta^{(2)} A = \theta^{(2)}_{10} a^{(2)}_0 + \theta^{(2)}_{11} a^{(2)}_1 + \theta^{(2)}_{12} a^{(2)}_2 + \theta^{(2)}_{13} a^{(2)}_3$ • Chain rule: $\frac{\partial E}{\partial \theta^{(2)}_{1k}} = \frac{\partial E}{\partial a^{(3)}_1} \cdot \frac{\partial a^{(3)}_1}{\partial z^{(3)}} \cdot \frac{\partial z^{(3)}}{\partial \theta^{(2)}_{1k}}$
  • 42. Backpropagation in NN • $E(a^{(3)}_1, y) = -\big(y \log a^{(3)}_1 + (1-y) \log(1 - a^{(3)}_1)\big)$ • First factor: $\frac{\partial E}{\partial a^{(3)}_1} = -\frac{y}{a^{(3)}_1} + \frac{1-y}{1-a^{(3)}_1} = \frac{a^{(3)}_1 - y}{a^{(3)}_1 (1 - a^{(3)}_1)}$ • Second factor, from $a^{(3)}_1 = \sigma(z^{(3)}) = \frac{1}{1+e^{-z^{(3)}}}$: $\frac{\partial a^{(3)}_1}{\partial z^{(3)}} = a^{(3)}_1 (1 - a^{(3)}_1)$ • Third factor: $\frac{\partial z^{(3)}}{\partial \theta^{(2)}_{1k}} = a^{(2)}_k$ • Multiplying the three: $\frac{\partial E}{\partial \theta^{(2)}_{1k}} = (a^{(3)}_1 - y)\, a^{(2)}_k$, e.g. $\frac{\partial E}{\partial \theta^{(2)}_{10}} = (a^{(3)}_1 - y)\, a^{(2)}_0$
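One way to sanity-check the chain-rule result $\frac{\partial E}{\partial \theta_k} = (a - y)\, a^{(2)}_k$ is to compare it against a numerical finite-difference gradient of the loss. A minimal pure-Python sketch (the parameter and activation values are arbitrary illustrations):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def loss(theta, a_prev, y):
    # E = -(y ln a + (1 - y) ln(1 - a)) with a = sigma(theta . a_prev)
    z = sum(t * ap for t, ap in zip(theta, a_prev))
    a = sigmoid(z)
    return -(y * math.log(a) + (1 - y) * math.log(1 - a))

theta = [0.2, -0.4, 0.7]
a_prev = [1.0, 0.5, -1.5]   # activations from the previous layer
y = 1.0

# Analytic gradient from the chain rule: dE/dtheta_k = (a - y) * a_prev[k]
z = sum(t * ap for t, ap in zip(theta, a_prev))
a = sigmoid(z)
analytic = [(a - y) * ap for ap in a_prev]

# Numerical check with central differences
eps = 1e-6
numeric = []
for k in range(len(theta)):
    tp = theta[:]; tp[k] += eps
    tm = theta[:]; tm[k] -= eps
    numeric.append((loss(tp, a_prev, y) - loss(tm, a_prev, y)) / (2 * eps))
```

The two gradients should agree to several decimal places, which confirms each factor in the derivation above.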
  • 43. Variants of Gradient Descent Batch gradient descent (BGD): • The entire training set is passed through the model before each parameter update. • The gradients of all training examples are averaged into a single update.
  • 44. Variants of Gradient Descent Stochastic gradient descent (SGD): • Only one sample of the training data is used per parameter update. • The gradient is calculated from that single sample. • Useful when the dataset is too large to process at once.
  • 45. Variants of Gradient Descent Mini-batch gradient descent (MBGD): • SGD slows down the overall computation because a gradient must be calculated for every single example. • Instead, mini-batch GD uses a subset of the data (1 < size < full batch) for each update.
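The three variants differ only in how many examples feed each update. A pure-Python sketch on a made-up 1-D least-squares problem; `batch_size=1` gives SGD and `batch_size=len(data)` gives batch GD:

```python
import random

# Toy dataset: y = 2x exactly, so the optimal parameter is theta = 2
data = [(x, 2.0 * x) for x in [0.5 * i for i in range(1, 21)]]

def minibatch_gd(data, batch_size, lr=0.01, epochs=200, seed=0):
    rng = random.Random(seed)
    data = list(data)          # local copy so shuffling is side-effect free
    theta = 0.0
    for _ in range(epochs):
        rng.shuffle(data)
        # One parameter update per mini-batch
        for i in range(0, len(data), batch_size):
            batch = data[i:i + batch_size]
            # Squared loss (theta*x - y)^2, averaged over the mini-batch
            grad = sum(2 * (theta * x - y) * x for x, y in batch) / len(batch)
            theta -= lr * grad
    return theta

theta = minibatch_gd(data, batch_size=4)   # converges toward 2.0
```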
  • 46. Multi-output NN • Input layer ($1, x_1, x_2, x_3$), hidden layer ($1, a^{(2)}_1, a^{(2)}_2, a^{(2)}_3$), and an output layer with three units: $h_\theta(a^{(3)}) \in \mathbb{R}^3$ • The target is one-hot, e.g. $h_\theta(a^{(3)}) = [0, 1, 0]^T$, $[1, 0, 0]^T$, or $[0, 0, 1]^T$, one pattern per class.
  • 47. Terminologies in NN Training set Batch: Iteration: Epoch:
  • 48. Terminologies in NN Training set Batch: # of training examples in a single split. Iteration: Epoch: batch#1 batch#2 batch#3 batch#4
  • 49. Terminologies in NN Training set Batch: Iteration: # of steps needed to pass all batches; iterations = # of batches in an epoch Epoch: batch#1 batch#2 batch#3 batch#4 4 iterations are needed to pass all batches.
  • 50. Terminologies in NN Batch: Iteration: Epoch: # of passes that entire dataset is given to the algorithm. batch#1 batch#2 batch#3 batch#4 batch#1 batch#2 batch#3 batch#4 batch#1 batch#2 batch#3 batch#4 1st epoch 2nd epoch kth epoch …
  • 51. Terminologies in NN 2000 instances in the training dataset Batch size: 400 Iterations per epoch: 2000 / 400 = 5 Epochs: any number. batch#1 batch#2 batch#3 batch#4 batch#5 …
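The bookkeeping on this slide is just integer arithmetic. A small helper (the function name is my own) makes the relationships explicit, including the common case where the last batch is smaller:

```python
import math

def iterations_per_epoch(num_examples, batch_size):
    # Number of parameter-update steps needed to pass over the whole
    # training set once; the last batch may be smaller than batch_size.
    return math.ceil(num_examples / batch_size)

# Slide example: 2000 training instances with batch size 400 -> 5 iterations
assert iterations_per_epoch(2000, 400) == 5
total_updates = iterations_per_epoch(2000, 400) * 10   # over 10 epochs
```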
  • 53. Activation Functions • Most popular nonlinear activation functions: • Sigmoid, • Hyperbolic tangent (tanh), • Rectified Linear Unit (ReLU), • Leaky Rectified Linear Unit (LReLU), • Softmax.
  • 54. Sigmoid • $g(x) = \frac{1}{1 + e^{-x}}$ • Often used at the final layer of ANNs to squash the output into $(0, 1)$. • Gradient information can be lost. Vanishing gradient: • A case where gradient information is lost. • Arises when the input is too large in magnitude (in the positive or the negative direction), where the curve saturates. • Prevents the algorithm from updating the weights.
  • 55. Hyperbolic Tangent (tanh) • $g(x) = \frac{e^x - e^{-x}}{e^x + e^{-x}}$ • Produces outputs between $-1$ and $1$. • Often works better than the sigmoid function. • Steeper curve, so a larger derivative near zero. • It has a vanishing gradient problem as well.
  • 56. Rectified Linear Unit (ReLU) • $g(x) = \max(0, x)$ • It behaves like a linear function when $x > 0$. • No gradient information is obtainable when $x < 0$ (the derivative there is 0).
  • 57. Rectified Linear Unit (ReLU) • $g(x) = \max(0, x)$ Dying ReLU: • 'A ReLU neuron is "dead" if it's stuck in the negative side and always outputs 0. Because the slope of ReLU in the negative range is also 0, once a neuron goes negative, it's unlikely to recover. Such neurons play no role in discriminating the input and are essentially useless. Over time you may end up with a large part of your network doing nothing.'
  • 58. Leaky ReLU • $g(x) = \max(\mu x, x)$ with a small slope $\mu$ • It behaves like a linear function when $x > 0$. • Gradient information is still obtainable when $x < 0$.
  • 59. Softmax • $g(x_i) = \frac{e^{x_i}}{\sum_{j=1}^{K} e^{x_j}}$ • It is used at the output layer. • It is a generalization of the logistic activation function, used for multiclass classification. • Gives the probability of each output neuron's class being the true class. image credit: towardsdatascience.com
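The activations above fit in a few lines of standard-library Python. A small sketch; the leak slope `mu=0.01` is a common but arbitrary choice, and the max-shift in softmax is the usual numerical-stability trick:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def tanh(x):
    return math.tanh(x)

def relu(x):
    return max(0.0, x)

def leaky_relu(x, mu=0.01):
    # For x < 0 this returns mu*x, so a small gradient survives
    return max(mu * x, x)

def softmax(xs):
    # Shift by max(xs) for numerical stability; outputs sum to 1
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

probs = softmax([1.0, 2.0, 3.0])   # a valid probability distribution
```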
  • 61. Categories • Regression • Square Error/Quadratic Loss/L2 Loss • Absolute Error/L1 Loss • Bias Error • Classification • Hinge Loss/Multi class SVM Loss • Cross Entropy Loss/Negative Log Likelihood • Binary Cross Entropy Loss/ Log Loss
  • 62. Square Error/Quadratic Loss/L2 Loss • The 'squared' difference between prediction and actual observation: $\mathcal{L}(a, y) = (a - y)^2$
  • 63. Absolute Error/L1 Loss • The ‘absolute’ difference between prediction and actual observation. ℒ 𝑎, 𝑦 = |𝑎 − 𝑦|
  • 64. Bias Error • This is much less common in machine learning domain. ℒ 𝑎, 𝑦 = 𝑎 − 𝑦
  • 65. Hinge Loss/Multi-class SVM Loss • The score of the correct category should be greater than the score of every incorrect category by some safety margin: $\mathrm{SVMLoss} = \sum_{j \neq y_i} \max(0,\; s_j - s_{y_i} + 1)$
  • 66. Hinge Loss/Multi class SVM Loss ## 1st training example • max(0, (1.49) - (-0.39) + 1) + max(0, (4.21) - (-0.39) + 1) • max(0, 2.88) + max(0, 5.6) • 2.88 + 5.6 • 8.48 (High loss as very wrong prediction)
  • 67. Hinge Loss/Multi class SVM Loss ## 2nd training example • max(0, (-4.61) - (3.28)+ 1) + max(0, (1.46) - (3.28)+ 1) • max(0, -6.89) + max(0, -0.82) • 0 + 0 • 0 (Zero loss as correct prediction)
  • 68. Hinge Loss/Multi class SVM Loss ## 3rd training example • max(0, (1.03) - (-2.27)+ 1) + max(0, (-2.37) - (-2.27)+ 1) • max(0, 4.3) + max(0, 0.9) • 4.3 + 0.9 • 5.2 (High loss as very wrong prediction)
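The three worked examples can be checked mechanically. A small sketch, assuming (my convention, not the slides') that the correct class score is listed first in each score vector:

```python
def multiclass_hinge(scores, correct_idx, margin=1.0):
    # Sum, over the incorrect classes, of how far each one violates
    # the safety margin relative to the correct class score
    return sum(max(0.0, s - scores[correct_idx] + margin)
               for j, s in enumerate(scores) if j != correct_idx)

# Score vectors from the three slide examples (correct class first)
loss1 = multiclass_hinge([-0.39, 1.49, 4.21], correct_idx=0)   # slide: 8.48
loss2 = multiclass_hinge([3.28, -4.61, 1.46], correct_idx=0)   # slide: 0
loss3 = multiclass_hinge([-2.27, 1.03, -2.37], correct_idx=0)  # slide: 5.2
```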
  • 69. Cross Entropy Loss/Log Loss/Log Likelihood • Cross-entropy is a commonly used loss function for multi-class classification tasks: $\mathcal{L}(a, y) = -\sum_{i} y_i \log a_i$
  • 70. Cross Entropy Loss/Log Loss/Log Likelihood • Prediction: $a = [0.9, 0.1, 0.0]^T$, ground truth (one-hot): $y = [1, 0, 0]^T$ • $\mathcal{L}(a, y) = -(y_1 \log a_1 + y_2 \log a_2 + y_3 \log a_3) = -(1 \cdot \log 0.9 + 0 \cdot \log 0.1 + 0 \cdot \log 0.0)$ • With the natural logarithm: $\mathcal{L}(a, y) = -\log 0.9 \approx 0.105$ • The zero-weighted terms drop out, so the $\log 0$ term never contributes.
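A minimal implementation of the one-hot case from this example. It uses the natural logarithm and an `eps` clamp (my own guard, not from the slides) so the $a = 0.0$ entry cannot trigger $\log 0$:

```python
import math

def cross_entropy(pred, truth, eps=1e-12):
    # Only the true class contributes when `truth` is one-hot;
    # eps guards against log(0) for zero-probability predictions
    return -sum(y * math.log(max(a, eps)) for a, y in zip(pred, truth))

# Slide example: loss = -ln(0.9), roughly 0.105
loss = cross_entropy([0.9, 0.1, 0.0], [1, 0, 0])
```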
  • 71. Binary Cross Entropy Loss • Log loss is a commonly used loss function for binary classification tasks: $\mathcal{L}(a, y) = -\big(y \log a + (1 - y) \log(1 - a)\big)$
  • 74. Notations • A 4-layer NN: $L = 4$ (# of layers; the input layer is not counted) • $n^{[l]}$ = # of units in layer $l$: $n^{[0]} = 3$, $n^{[1]} = 4$, $n^{[2]} = 4$, $n^{[3]} = 3$, $n^{[4]} = 1$ • $a^{[l]}$ = activations in layer $l$, with $a^{[l]} = g^{[l]}(z^{[l]})$ • $g^{[l]}$ = activation function in layer $l$ • $W^{[l]}$ = weights for $z^{[l]}$, $b^{[l]}$ = bias weights for $z^{[l]}$ • $x = a^{[0]}$ and $\hat{y} = a^{[4]}$
  • 75. General Form: Forward Propagation • $z^{[1]} = W^{[1]} a^{[0]} + b^{[1]}$, $a^{[1]} = g^{[1]}(z^{[1]})$ • $z^{[2]} = W^{[2]} a^{[1]} + b^{[2]}$, $a^{[2]} = g^{[2]}(z^{[2]})$ • … • In general: $z^{[l]} = W^{[l]} a^{[l-1]} + b^{[l]}$, $a^{[l]} = g^{[l]}(z^{[l]})$ • Output: $\hat{y} = a^{[4]} = g^{[4]}(z^{[4]})$ with $z^{[4]} = W^{[4]} a^{[3]} + b^{[4]}$
  • 76. General Form: Forward Propagation – Vectorization • $Z^{[1]} = W^{[1]} A^{[0]} + b^{[1]}$, $A^{[1]} = g^{[1]}(Z^{[1]})$ • $Z^{[2]} = W^{[2]} A^{[1]} + b^{[2]}$, $A^{[2]} = g^{[2]}(Z^{[2]})$ • … • $Z^{[l]} = W^{[l]} A^{[l-1]} + b^{[l]}$, $A^{[l]} = g^{[l]}(Z^{[l]})$ • $A^{[4]} = \hat{Y} = g^{[4]}(Z^{[4]})$ with $Z^{[4]} = W^{[4]} A^{[3]} + b^{[4]}$
  • 77. General Form: Forward Propagation – Vectorization • For $l = 1$ to $L$: $Z^{[l]} = W^{[l]} A^{[l-1]} + b^{[l]}$, $A^{[l]} = g^{[l]}(Z^{[l]})$
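The general forward-propagation loop, sketched in pure Python for a single input vector. The all-zero toy parameters are only there to make the output easy to predict ($\sigma(0) = 0.5$ at every unit):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward(x, params, activation=sigmoid):
    # params: list of (W, b) per layer, where W has shape (n_l, n_{l-1});
    # implements a = g(W a_prev + b) for l = 1..L
    a = x
    for W, b in params:
        z = [sum(wk * ak for wk, ak in zip(row, a)) + bk
             for row, bk in zip(W, b)]
        a = [activation(zk) for zk in z]
    return a

# Toy 3-4-1 network with all-zero weights: every sigmoid outputs 0.5
params = [([[0.0] * 3 for _ in range(4)], [0.0] * 4),
          ([[0.0] * 4], [0.0])]
y_hat = forward([1.0, 2.0, 3.0], params)   # [0.5]
```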
  • 78. Weight ($W^{[l]}$) and Bias ($b^{[l]}$) Parameters • For simplicity, take $W^{[l]T} = W^{[l]}$ • Parameter dimensions: $b^{[l]}, z^{[l]}, a^{[l]} : (n^{[l]}, 1)$ and $W^{[l]} : (n^{[l]}, n^{[l-1]})$ • Layer sizes: $n^{[0]} = 3$, $n^{[1]} = 4$, $n^{[2]} = 4$, $n^{[3]} = 3$, $n^{[4]} = 1$
  • 79.–82. Weight ($W^{[l]}$) and Bias ($b^{[l]}$) Parameters • Checking the dimensions layer by layer: • $z^{[1]} = W^{[1]} a^{[0]} + b^{[1]}$: $(4,1) = (4,3)(3,1) + (4,1)$ • $z^{[2]} = W^{[2]} a^{[1]} + b^{[2]}$: $(4,1) = (4,4)(4,1) + (4,1)$ • $z^{[3]} = W^{[3]} a^{[2]} + b^{[3]}$: $(3,1) = (3,4)(4,1) + (3,1)$ • $z^{[4]} = W^{[4]} a^{[3]} + b^{[4]}$: $(1,1) = (1,3)(3,1) + (1,1)$
  • 83. Weight ($W^{[l]}$) and Bias ($b^{[l]}$) Parameters • Parameter dimensions (vectorization): $Z^{[l]}, A^{[l]} : (n^{[l]}, m)$ • $Z^{[1]} = W^{[1]} A^{[0]} + b^{[1]}$: $(4, m) = (4,3)(3, m) + (4,1)$, with $b^{[1]}$ broadcast to $(4, m)$
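The dimension rules $W^{[l]} : (n^{[l]}, n^{[l-1]})$ and $b^{[l]} : (n^{[l]}, 1)$ can be generated programmatically. A small sketch (the helper name is my own) using the slide's 3-4-4-3-1 network:

```python
def layer_shapes(layer_sizes):
    # layer_sizes = [n0, n1, ..., nL]; returns (W_shape, b_shape) per layer:
    # W[l]: (n_l, n_{l-1}), b[l]: (n_l, 1)
    return [((layer_sizes[l], layer_sizes[l - 1]), (layer_sizes[l], 1))
            for l in range(1, len(layer_sizes))]

shapes = layer_shapes([3, 4, 4, 3, 1])   # the 4-layer network on the slides
```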
  • 84. Deep Representation • Boolean functions: ØEvery Boolean function can be represented exactly by a neural network ØThe number of hidden units might need to grow with the number of inputs • Continuous functions: ØEvery bounded continuous function can be approximated with small error with two layers • Arbitrary functions: ØThree layers can approximate any arbitrary function. • Cybenko, G. (1989). "Approximations by superpositions of sigmoidal functions", Mathematics of Control, Signals, and Systems, 2(4), 303-314. • Hornik, K. (1991). "Approximation Capabilities of Multilayer Feedforward Networks", Neural Networks, 4(2), 251-257. • Hornik, K., Stinchcombe, M., & White, H. (1989). "Multilayer feedforward networks are universal approximators", Neural Networks, 2(5), 359-366.
  • 85. Deep Representation Why go deeper if three layers are sufficient? • Going deeper helps convergence in "big" problems. • Going deeper in "old-fashioned", classically trained ANNs does not help much in accuracy. • Choromanska, A., Henaff, M., Mathieu, M., Arous, G. B., & LeCun, Y. (2015). The loss surfaces of multilayer networks. In Artificial Intelligence and Statistics (pp. 192-204).
  • 86. Deep Representation More hidden neurons can represent more complicated functions. Figure: https://cs231n.github.io/
  • 87. Deep Representation Several rules of thumb for the # of hidden units • The number of hidden neurons should be between the size of the input layer and the size of the output layer. • The number of hidden neurons should be 2/3 the size of the input layer, plus the size of the output layer. • The number of hidden neurons should be less than twice the size of the input layer. • The number of hidden neurons should gradually decrease from the input to the output layers.
  • 88. Deep Representation # of hidden layers • Depends on the nature of the problem • Linear classification? Then no hidden layers are needed. • Non-linear classification? • Trial and error is helpful. • Watch the validation and training loss curves throughout the epochs: • If the gap between the losses is small, you can increase the capacity (neurons and layers). • If the training loss is much smaller than the validation loss, you should decrease the capacity.
  • 89. Deep Representation What do the layers represent?
  • 90. Deep Representation • Low-level features → mid-level features → high-level features → classifier ('car') • Early layers learn simple features, e.g. edge orientation and size; later layers learn complex features, each a non-linear composition of the previous layer(s).
  • 91. Building Blocks • Forward pass: for each layer $l = 1, \dots, L$, take $a^{[l-1]}$, compute $z^{[l]} = W^{[l]} a^{[l-1]} + b^{[l]}$ and $a^{[l]}$, caching $z^{[l]}$; the final output is $a^{[L]} = \hat{y}$ • Backward pass: for each layer $l = L, \dots, 1$, take $da^{[l]}$, use the cached $z^{[l]}$ to compute $dz^{[l]}$, and output $dW^{[l]}$, $db^{[l]}$, and $da^{[l-1]}$ • Update: $W^{[l]} = W^{[l]} - \alpha\, dW^{[l]}$, $b^{[l]} = b^{[l]} - \alpha\, db^{[l]}$ (note the minus sign: parameters step against the gradient)
  • 92. Forward Propagation for layer $l$ • Input: $a^{[l-1]}$ • Output: $a^{[l]}$, cache $z^{[l]}$ (i.e., $W^{[l]} a^{[l-1]} + b^{[l]}$) • $z^{[l]} = W^{[l]} a^{[l-1]} + b^{[l]}$, $a^{[l]} = g^{[l]}(z^{[l]})$ • Vectorization for $m$ inputs: $Z^{[l]} = W^{[l]} A^{[l-1]} + b^{[l]}$, $A^{[l]} = g^{[l]}(Z^{[l]})$
  • 93. Backward Propagation for layer $l$ • Input: $da^{[l]}$ • Output: $da^{[l-1]}$, $dW^{[l]}$, $db^{[l]}$ • Recap: $dW^{[l]} = \frac{d\mathcal{L}}{dW^{[l]}} = \frac{d\mathcal{L}}{da^{[l]}} \frac{da^{[l]}}{dz^{[l]}} \frac{dz^{[l]}}{dW^{[l]}}$ • $dz^{[l]} = da^{[l]} * g^{[l]\prime}(z^{[l]})$ (elementwise), $dW^{[l]} = dz^{[l]}\, a^{[l-1]T}$, $db^{[l]} = dz^{[l]}$, $da^{[l-1]} = W^{[l]T} dz^{[l]}$ • Vectorization for $m$ inputs: $dZ^{[l]} = dA^{[l]} * g^{[l]\prime}(Z^{[l]})$, $dW^{[l]} = \frac{1}{m}\, dZ^{[l]} A^{[l-1]T}$, $db^{[l]} = \frac{1}{m}\sum_{i=1}^{m} dZ^{[l](i)}$, $dA^{[l-1]} = W^{[l]T} dZ^{[l]}$
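The four backward recurrences can be sketched for a single example in pure Python. Layer sizes and values below are made up, and `sigmoid` stands in for the generic $g^{[l]}$:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def sigmoid_prime(z):
    s = sigmoid(z)
    return s * (1.0 - s)

def layer_backward(da, z, a_prev, W):
    # Single-example recurrences:
    #   dz = da * g'(z)        (elementwise)
    #   dW = dz a_prev^T       (outer product)
    #   db = dz
    #   da_prev = W^T dz
    dz = [d * sigmoid_prime(zk) for d, zk in zip(da, z)]
    dW = [[dzk * ap for ap in a_prev] for dzk in dz]
    db = dz[:]
    da_prev = [sum(W[i][j] * dz[i] for i in range(len(dz)))
               for j in range(len(a_prev))]
    return dW, db, da_prev

# Hypothetical 2-unit layer fed by 3 previous-layer activations
dW, db, da_prev = layer_backward(
    da=[1.0, -1.0], z=[0.0, 0.0], a_prev=[1.0, 2.0, 3.0],
    W=[[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]])
```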
  • 94. Parameters and Hyperparameters • Model Parameters: the parameters of the model that must be determined using the training data set; these are the fitted parameters: $W^{[l]}$ and $b^{[l]}$, where $1 \le l \le L$ • Hyperparameters: adjustable parameters that must be tuned in order to obtain a model with optimal performance: learning rate ($\alpha$), # of iterations, # of hidden layers ($L$), # of hidden units ($n^{[l]}$), choice of activation function, and many more…