1. Dr. Selim Yılmaz Spring 2025
Lecture #2
Understanding Deep Networks
2. Today
• Gradient Descent Algorithm
• Computation Graph
• Multi-Layer Neural Network
• Activation Functions
• Loss Functions
• Deep Neural Network
Forward Propagation in a Deep Network
Why Deep Representations?
Building Blocks of Deep Neural Networks
Forward and Backward Propagation
Parameters and Hyperparameters
5. Algorithm
• Gradient Descent (or GD) is an algorithm that uses the gradient of a
given real-valued function.
• The gradient gives the direction and the magnitude of the slope of the
function.
• The direction of the negative gradient is a good direction to search if we
want to find a minimizer of the function.
6. Algorithm
• Gradient Descent is often used to find the minimum of the cost
function in linear or logistic regression (i.e., 𝐽(𝜃)).
Figure: cost 𝐽(𝜃) plotted against parameter 𝜃.
7. Derivatives
• The derivative represents the amount of vertical change with respect
to the horizontal change in the variable of the given function.
• Here 𝑑𝐽(𝜃) and 𝑑𝜃 are Leibniz notation and represent, respectively, a
very small change along the 𝐽(𝜃) axis and along the 𝜃 axis.
Figure: tangent line to 𝐽(𝜃) at 𝜃₀; the slope of the tangent line is the derivative:
slope = 𝑑𝐽(𝜃)/𝑑𝜃
10. Update Step
• As the parameters approach a local/global minimum, the step size
becomes smaller because the slope decreases around it.
Figure: update steps on 𝐽(𝜃). At a point with positive slope, 𝜃 moves left; at a point with negative slope, 𝜃 moves right:
𝜃₁ = 𝜃₁ − 𝛼(pos. value)
𝜃₂ = 𝜃₂ − 𝛼(neg. value)
11. Learning Rate Parameter
𝜃ⱼ = 𝜃ⱼ − 𝛼 · 𝑑𝐽(𝜃)/𝑑𝜃ⱼ
𝛼 is a learning rate:
• If it is too small, gradient descent can be slow
• If it is too large, gradient descent can overshoot
the minimum
Figure: with a too-small 𝛼, gradient descent takes many tiny steps toward the minimum; with a too-large 𝛼, it overshoots and can bounce across the minimum.
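The update rule above can be sketched in a few lines of code. This is a minimal illustration (not from the slides) using the assumed example cost 𝐽(𝜃) = 𝜃², whose derivative is 2𝜃:

```python
# Minimal gradient descent sketch on the example cost J(theta) = theta**2,
# whose derivative is dJ/dtheta = 2 * theta.

def gradient_descent(theta0, alpha, steps):
    theta = theta0
    for _ in range(steps):
        grad = 2 * theta                  # dJ/dtheta for J = theta**2
        theta = theta - alpha * grad      # update: theta := theta - alpha * dJ/dtheta
    return theta

# A well-chosen learning rate converges toward the minimizer theta = 0.
print(gradient_descent(theta0=5.0, alpha=0.1, steps=100))   # value very close to 0
# A too-large learning rate overshoots the minimum and diverges.
print(abs(gradient_descent(theta0=5.0, alpha=1.1, steps=20)) > 5.0)  # True
```

Note how the two runs reproduce the two failure modes discussed on this slide: slow convergence versus overshooting.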
13. Neural Network
• Logistic (or linear) regression can be viewed as a very basic neural
network structure.
• Computation in a neural network is organized as a forward
propagation step followed by a backward propagation step.
• In the forward step, the output value of the network is computed; in
the backward step, the cost (error) is propagated back to update the weights.
16. Computing Propagations - Illustration
• Assume that our cost function is
𝐽(𝑎, 𝑏, 𝑐) = 3(𝑎 + 𝑏𝑐)
where
𝑢 = 𝑏𝑐, 𝑣 = 𝑎 + 𝑢, 𝐽 = 3𝑣
• With 𝑎 = 5, 𝑏 = 3, 𝑐 = 2, the graph computes 𝑢 = 6, 𝑣 = 11, 𝐽 = 33.
left to right -> forward propagation
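The forward pass through this computation graph can be written directly in code; the sketch below mirrors the node values on the slide:

```python
# Forward propagation through the computation graph of J(a, b, c) = 3 * (a + b*c).
def forward(a, b, c):
    u = b * c      # first node:  u = bc
    v = a + u      # second node: v = a + u
    J = 3 * v      # output node: J = 3v (the cost)
    return u, v, J

u, v, J = forward(a=5, b=3, c=2)
print(u, v, J)  # 6 11 33, matching the slide's values
```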
17. Computing Propagations - Illustration
• Assume that our cost function is
𝐽(𝑎, 𝑏, 𝑐) = 3(𝑎 + 𝑏𝑐)
where
𝑢 = 𝑏𝑐, 𝑣 = 𝑎 + 𝑢, 𝐽 = 3𝑣
• Computation graph: 𝑎 = 5, 𝑏 = 3, 𝑐 = 2 → 𝑢 = 6, 𝑣 = 11, 𝐽 = 33.
right to left -> backward propagation
18. Computing Derivatives in Backward Pass
• To know how the cost function 𝐽 changes when a small change is made
to 𝑣, we calculate the derivative:
𝑑𝐽/𝑑𝑣
• Computation graph: 𝑎 = 5, 𝑏 = 3, 𝑐 = 2 → 𝑢 = 6, 𝑣 = 11, 𝐽 = 33.
right to left -> backward propagation
19. Computing Derivatives in Backward Pass
• To know how the cost function 𝐽 changes when a small change is made
to 𝑎, we calculate the derivative:
𝑑𝐽/𝑑𝑎 = (𝑑𝐽/𝑑𝑣)(𝑑𝑣/𝑑𝑎)
• Computation graph: 𝑎 = 5, 𝑏 = 3, 𝑐 = 2 → 𝑢 = 6, 𝑣 = 11, 𝐽 = 33.
right to left -> backward propagation
20. Computing Derivatives in Backward Pass
• Chain rule: the amount of change in 𝐽 is equal to the product of
• how much 𝑣 changes with 𝑎, and
• how much 𝐽 changes with 𝑣:
𝑑𝐽/𝑑𝑎 = (𝑑𝐽/𝑑𝑣)(𝑑𝑣/𝑑𝑎)
• Computation graph: 𝑎 = 5, 𝑏 = 3, 𝑐 = 2 → 𝑢 = 6, 𝑣 = 11, 𝐽 = 33.
right to left -> backward propagation
21. Computing Derivatives in Backward Pass
• To know how the cost function 𝐽 changes when a small change is made
to 𝑏, we calculate the derivative:
𝑑𝐽/𝑑𝑏 = (𝑑𝐽/𝑑𝑣)(𝑑𝑣/𝑑𝑢)(𝑑𝑢/𝑑𝑏)
• Computation graph: 𝑎 = 5, 𝑏 = 3, 𝑐 = 2 → 𝑢 = 6, 𝑣 = 11, 𝐽 = 33.
right to left -> backward propagation
22. Computing Derivatives in Backward Pass
• To know how the cost function 𝐽 changes when a small change is made
to 𝑐, we calculate the derivative:
𝑑𝐽/𝑑𝑐 = (𝑑𝐽/𝑑𝑣)(𝑑𝑣/𝑑𝑢)(𝑑𝑢/𝑑𝑐)
• Computation graph: 𝑎 = 5, 𝑏 = 3, 𝑐 = 2 → 𝑢 = 6, 𝑣 = 11, 𝐽 = 33.
right to left -> backward propagation
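The backward pass evaluates these derivatives right to left via the chain rule. A sketch on the same graph, using the local derivatives 𝑑𝐽/𝑑𝑣 = 3 (since 𝐽 = 3𝑣), 𝑑𝑣/𝑑𝑎 = 𝑑𝑣/𝑑𝑢 = 1, 𝑑𝑢/𝑑𝑏 = 𝑐, and 𝑑𝑢/𝑑𝑐 = 𝑏:

```python
# Backward propagation (chain rule) for J(a, b, c) = 3 * (a + b*c).
def backward(a, b, c):
    # local derivatives at each node of the graph
    dJ_dv = 3            # J = 3v
    dv_da = 1            # v = a + u
    dv_du = 1
    du_db = c            # u = b*c
    du_dc = b
    # chain rule, applied right to left
    dJ_da = dJ_dv * dv_da
    dJ_db = dJ_dv * dv_du * du_db
    dJ_dc = dJ_dv * dv_du * du_dc
    return dJ_da, dJ_db, dJ_dc

print(backward(a=5, b=3, c=2))  # (3, 6, 9)
```

With 𝑏 = 3 and 𝑐 = 2, the products come out to 𝑑𝐽/𝑑𝑎 = 3, 𝑑𝐽/𝑑𝑏 = 3·2 = 6, and 𝑑𝐽/𝑑𝑐 = 3·3 = 9.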
43. Variants of Gradient Descent
Batch gradient descent (BGD):
• All the training data is passed through the model before the parameters are updated.
• The gradient is averaged over all training examples for a single update.
44. Variants of Gradient Descent
Stochastic gradient descent (SGD):
• Only one sample of the training data is passed to update the model parameters.
• The gradient is calculated according to that single sample.
• Useful when the dataset is very large.
45. Variants of Gradient Descent
Mini-batch gradient descent (MBGD):
• SGD slows down the computation because the gradient is calculated
for each individual sample.
• Instead, MBGD uses a batch of size between 1 and the full dataset for each update.
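The three variants differ only in how many samples feed each parameter update. A sketch of just the batching step (the gradient computation itself is omitted; the helper name is my own):

```python
import random

# Split the training data into batches; the batch size alone distinguishes
# batch GD (all data), SGD (one sample), and mini-batch GD (in between).
def make_batches(data, batch_size):
    data = list(data)
    random.shuffle(data)          # visit samples in random order each epoch
    return [data[i:i + batch_size] for i in range(0, len(data), batch_size)]

data = list(range(10))
print(len(make_batches(data, len(data))))  # 1  update per epoch -> batch GD
print(len(make_batches(data, 1)))          # 10 updates per epoch -> SGD
print(len(make_batches(data, 4)))          # 3  updates per epoch -> mini-batch GD
```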
48. Terminologies in NN
Training set
Batch: # of training examples in a single split.
Iteration:
Epoch:
batch#1
batch#2
batch#3
batch#4
49. Terminologies in NN
Training set
Batch:
Iteration: # of steps to pass all batches
iteration = # of batches in an epoch
Epoch:
batch#1
batch#2
batch#3
batch#4
4 iterations are needed to pass all batches.
50. Terminologies in NN
Batch:
Iteration:
Epoch: # of passes of the entire dataset
through the algorithm.
Figure: each epoch passes batch#1 … batch#4 once (1st epoch, 2nd epoch, …, kth epoch).
51. Terminologies in NN
2000 instances in training dataset
Batch size: 400
Iterations per epoch: 2000 / 400 = 5
Epochs: any number.
Figure: every epoch passes all batches once, repeated for as many epochs as desired.
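The arithmetic relating these terms is worth making explicit. A small sketch (helper name is my own) reproducing the slide's example:

```python
import math

# Iterations per epoch = number of batches needed to pass the whole
# training set once (rounding up if the last batch is smaller).
def iterations_per_epoch(n_examples, batch_size):
    return math.ceil(n_examples / batch_size)

# The slide's example: 2000 training instances with a batch size of 400.
print(iterations_per_epoch(2000, 400))  # 5
```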
53. Activation Functions
• Most popular nonlinear activation functions:
• Sigmoid,
• Hyperbolic (tangent),
• Rectified Linear Unit (ReLU),
• Leaky Rectified Linear Unit (LReLU),
• Softmax.
54. Sigmoid
• It is often used at the final layer of ANNs to
squash the output into the range (0, 1) for 0/1 classification.
• Gradient information can be lost.
Vanishing gradient:
• A case in which gradient information is lost.
• Arises when the input is too large (in the
positive or negative direction), where the curve saturates.
• Prevents the algorithm from updating the weights.
𝑔(𝑥) = 1 / (1 + 𝑒^(−𝑥))
55. Hyperbolic (Tangent)
• It produces an output in the range (−1, 1).
• Often works better than the sigmoid function.
• Its derivative has a steeper curve.
• It has the vanishing gradient problem as well.
𝑔(𝑥) = (𝑒^𝑥 − 𝑒^(−𝑥)) / (𝑒^𝑥 + 𝑒^(−𝑥))
56. Rectified Linear Unit (ReLU)
• It behaves like a linear function
when 𝑥 > 0.
• No gradient information is
obtainable when 𝑥 < 0 (the derivative there is 0).
𝑔(𝑥) = max(0, 𝑥)
57. Rectified Linear Unit (ReLU)
Dying ReLU:
• ‘A ReLU neuron is “dead” if it’s stuck on the negative
side and always outputs 0. Because the slope of ReLU
in the negative range is also 0, once a neuron goes
negative, it’s unlikely to recover. Such neurons
play no role in discriminating the input and
are essentially useless. Over time you may end up
with a large part of your network doing nothing.’
𝑔(𝑥) = max(0, 𝑥)
58. Leaky ReLU
• It behaves like a linear function when
𝑥 > 0.
• Gradient information is still obtainable
when 𝑥 < 0, thanks to the small slope 𝜇.
𝑔(𝑥) = max(𝜇𝑥, 𝑥)
59. Softmax
• It is used at the output layer.
• It is a more general form of the logistic activation
function, used for multiclass
classification.
• Gives the probability of each neuron
being true for the corresponding class; the outputs sum to 1.
𝑔(𝑥ᵢ) = 𝑒^(𝑥ᵢ) / Σⱼ 𝑒^(𝑥ⱼ)
image credit: towardsdatascience.com
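The five activations above are easy to state in code. A minimal sketch (the Leaky-ReLU slope 𝜇 = 0.01 is an assumed value, not from the slides):

```python
import math

# The nonlinear activation functions discussed on the previous slides.
def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))   # output in (0, 1)

def tanh(x):
    return (math.exp(x) - math.exp(-x)) / (math.exp(x) + math.exp(-x))  # (-1, 1)

def relu(x):
    return max(0.0, x)                  # zero (and zero gradient) for x < 0

def leaky_relu(x, mu=0.01):
    return max(mu * x, x)               # small slope mu keeps gradient for x < 0

def softmax(xs):
    exps = [math.exp(x) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]    # class probabilities, summing to 1

print(sigmoid(0))                        # 0.5
print(relu(-3.0), leaky_relu(-3.0))      # 0.0 vs a small negative value
print(sum(softmax([1.0, 2.0, 3.0])))     # 1.0 (up to float rounding)
```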
61. Categories
• Regression
• Square Error/Quadratic Loss/L2 Loss
• Absolute Error/L1 Loss
• Bias Error
• Classification
• Hinge Loss/Multi-class SVM Loss
• Cross Entropy Loss/Negative Log Likelihood
• Binary Cross Entropy Loss/Log Loss
62. Square Error/Quadratic Loss/L2 Loss
• The ‘squared’ difference between prediction and actual observation.
ℒ(𝑎, 𝑦) = (𝑎 − 𝑦)²
63. Absolute Error/L1 Loss
• The ‘absolute’ difference between prediction and actual observation.
ℒ(𝑎, 𝑦) = |𝑎 − 𝑦|
64. Bias Error
• This is much less common in machine learning domain.
ℒ(𝑎, 𝑦) = 𝑎 − 𝑦
65. Hinge Loss/Multi class SVM Loss
• The score of the correct category should be greater than the score
of each incorrect category by some safety margin.
SVMLoss = Σ_(𝑗 ≠ 𝑦ᵢ) max(0, 𝑠ⱼ − 𝑠_(𝑦ᵢ) + 1)
66. Hinge Loss/Multi class SVM Loss
## 1st training example
• max(0, (1.49) - (-0.39) + 1) + max(0, (4.21) - (-0.39) + 1)
• max(0, 2.88) + max(0, 5.6)
• 2.88 + 5.6
• 8.48 (High loss as very wrong prediction)
67. Hinge Loss/Multi class SVM Loss
## 2nd training example
• max(0, (-4.61) - (3.28)+ 1) + max(0, (1.46) - (3.28)+ 1)
• max(0, -6.89) + max(0, -0.82)
• 0 + 0
• 0 (Zero loss as correct prediction)
68. Hinge Loss/Multi class SVM Loss
## 3rd training example
• max(0, (1.03) - (-2.27)+ 1) + max(0, (-2.37) - (-2.27)+ 1)
• max(0, 4.3) + max(0, 0.9)
• 4.3 + 0.9
• 5.2 (High loss as very wrong prediction)
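The three worked examples follow directly from the hinge-loss formula. A sketch that reproduces the first two (the ordering of scores in each list, with the correct class first, is my assumption about how the slides' figures were laid out):

```python
# Multi-class hinge (SVM) loss with safety margin 1:
# sum of max(0, s_j - s_correct + 1) over all incorrect classes j.
def hinge_loss(scores, correct_idx, margin=1.0):
    return sum(max(0.0, s - scores[correct_idx] + margin)
               for j, s in enumerate(scores) if j != correct_idx)

# 1st example: correct class scores -0.39, incorrect classes 1.49 and 4.21.
print(round(hinge_loss([-0.39, 1.49, 4.21], 0), 2))  # 8.48 (high loss)
# 2nd example: correct class scores 3.28, clearing the margin for both others.
print(round(hinge_loss([3.28, -4.61, 1.46], 0), 2))  # 0.0 (zero loss)
```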
69. Cross Entropy Loss/Log Loss/Log Likelihood
• Cross-entropy is a commonly used loss function for multi-class
classification tasks.
ℒ(𝑎, 𝑦) = − Σᵢ 𝑦ᵢ log 𝑎ᵢ
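For a one-hot label 𝑦, the sum collapses to −log of the probability assigned to the true class. A sketch with assumed example probabilities:

```python
import math

# Cross-entropy loss for a one-hot label y and predicted probabilities a.
def cross_entropy(a, y):
    return -sum(yi * math.log(ai) for ai, yi in zip(a, y))

# A confident, correct prediction gives a small loss;
# a confident, wrong prediction gives a large one.
good = cross_entropy([0.9, 0.05, 0.05], [1, 0, 0])
bad  = cross_entropy([0.05, 0.9, 0.05], [1, 0, 0])
print(round(good, 3), round(bad, 3))  # 0.105 2.996
```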
84. Deep Representation
• Boolean functions:
ØEvery Boolean function can be represented exactly by a neural network
ØThe number of hidden layers might need to grow with the number of inputs
• Continuous functions:
ØEvery bounded continuous function can be approximated with small error
with two layers
• Arbitrary functions:
Ø Three layers can approximate any arbitrary function.
• Cybenko, G. (1989). "Approximations by superpositions of sigmoidal functions." Mathematics of Control, Signals, and Systems, 2(4), 303-314.
• Hornik, K. (1991). "Approximation Capabilities of Multilayer Feedforward Networks." Neural Networks, 4(2), 251-257.
• Hornik, K., Stinchcombe, M., & White, H. (1989). "Multilayer feedforward networks are universal approximators." Neural Networks, 2(5), 359-366.
85. Deep Representation
Why go deeper if three layers is sufficient?
• Going deeper helps convergence in “big” problems.
• Going deeper in traditionally trained (“old-fashioned”) ANNs does not help much in
accuracy.
• Choromanska, A., Henaff, M., Mathieu, M., Arous, G. B., & LeCun, Y. (2015, February). The loss surfaces of multilayer networks. In Artificial Intelligence and
Statistics (pp. 192-204).
86. Deep Representation
More hidden neurons can represent more complicated functions.
Figure: https://cs231n.github.io/
87. Deep Representation
Several rules of thumb for the # of hidden units
• The number of hidden neurons should be between the size of the input
layer and the size of the output layer.
• The number of hidden neurons should be 2/3 the size of the input layer,
plus the size of the output layer.
• The number of hidden neurons should be less than twice the size of the
input layer.
• The number of hidden neurons is gradually decreased from the input layer
to the output layer.
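Two of these heuristics are simple formulas; a sketch with hypothetical helper names (these are rules of thumb, not guarantees):

```python
# Rule-of-thumb hidden-unit counts, where n_in is the input layer size
# and n_out is the output layer size.
def two_thirds_rule(n_in, n_out):
    return round(2 * n_in / 3) + n_out   # 2/3 of input size, plus output size

def upper_bound(n_in):
    return 2 * n_in                      # stay below twice the input size

# Example: a network with 30 inputs and 3 outputs.
print(two_thirds_rule(30, 3))  # 23
print(upper_bound(30))         # 60
```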
88. Deep Representation
# of hidden layers
• Depends on the nature of the problem:
• Linear classification? Then no hidden layers are needed.
• Non-linear classification?
• Trial and error is helpful.
• Watch the validation and training loss curves throughout the epochs:
• If the gap between the losses is small, you can increase the capacity (neurons and layers).
• If the training loss is much smaller than the validation loss, you should decrease the capacity.
94. Parameters and Hyperparameters
• Model Parameters: These are the parameters in the model that must be
determined using the training data set. These are the fitted parameters.
𝑊^[𝑙] and 𝑏^[𝑙] where 1 ≤ 𝑙 ≤ 𝐿
• Hyperparameters: These are adjustable parameters that must be tuned
in order to obtain a model with optimal performance.
learning rate (𝛼), # of iterations, # of hidden layers (𝐿), # of hidden units (𝑛^[𝑙]), choice of activation function, and many many more…