Backpropagation
Day 2 Lecture 1
http://bit.ly/idl2020
Xavier Giro-i-Nieto
@DocXavi
xavier.giro@upc.edu
Associate Professor
Universitat Politècnica de Catalunya
Acknowledgements
Kevin McGuinness
kevin.mcguinness@dcu.ie
Research Fellow
Insight Centre for Data Analytics
Dublin City University
Elisa Sayrol
elisa.sayrol@upc.edu
Associate Professor
ETSETB TelecomBCN
Universitat Politècnica de Catalunya
Video lecture
Loss function - L(y, ŷ)
The loss function assesses the performance of our model by comparing its predictions (ŷ) to an expected value (y), typically coming from annotations.
Example: the predicted price (ŷ) and the one actually paid (y) could be compared with the Euclidean distance (also referred to as the L2 distance or Mean Square Error, MSE): L(y, ŷ) = (y − ŷ)².
[Figure: single neuron with inputs x1, x2, x3, weights w1, w2, w3, bias b, and unbounded output y = {-∞, ∞}]
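A minimal sketch of this loss in Python (the function name and the use of NumPy are assumptions for illustration, not part of the original slides):

import numpy as np

def l2_loss(y, y_hat):
    """Squared L2 distance (MSE over a batch) between targets y and predictions y_hat."""
    return np.mean((y - y_hat) ** 2)

# Example: predicted price vs. price actually paid
print(l2_loss(np.array([200_000.0]), np.array([180_000.0])))  # 400000000.0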
Loss function - L(y, ŷ)
Discussion: Consider the single-parameter model ŷ = x · w, and suppose that, given a pair (y, ŷ), we would like to update the current value w_t to a new value w_{t+1} based on the loss function L_w.
(a) Would you increase or decrease w_t?
(b) What operation could indicate which way to go?
(c) How much would you increase or decrease w_t?
Motivation for this lecture: if we had a way to estimate the gradient of the loss (∇L) with respect to the parameter(s), we could use gradient descent to optimize them.
Gradient Descent (GD)
w_{t+1} = w_t − η · ∂L/∂w_t
The minus sign makes the update descend along the loss curve L_w, and η is the learning rate (LR).
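A minimal gradient-descent step for the single-parameter model ŷ = x · w with a squared-error loss (the function name, the example values and the learning rate are assumptions for illustration):

def gd_step(w, x, y, lr=0.01):
    """One gradient-descent update for y_hat = x * w with loss L = (y - y_hat)**2."""
    y_hat = x * w
    grad = -2 * (y - y_hat) * x   # dL/dw, derived analytically
    return w - lr * grad          # descend: move against the gradient

w = 0.5
for _ in range(100):
    w = gd_step(w, x=2.0, y=6.0)
print(w)  # approaches 3.0, since 2.0 * 3.0 = 6.0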
Backpropagation will allow us to compute the gradients of the loss function with
respect to:
● all model parameters (w & b) - final goal during training
● input/intermediate data - visualization & interpretability purposes.
Gradients will “flow” from the output of the model towards the input (“back”).
Computational graph of a simple perceptron
[Figure: perceptron with inputs x1, x2, weights w1, w2, bias b, sigmoid activation σ, and binary target y = {0, 1}]
Question: What is the computational graph (operations & order) of this perceptron with a sigmoid activation?
Computational graph of a perceptron
The graph is built up operation by operation:
1. Products: w1 · x1 and w2 · x2
2. Sum: w1 · x1 + w2 · x2
3. Sum with the bias: w1 · x1 + w2 · x2 + b
4. Sigmoid: ŷ = σ(w1 · x1 + w2 · x2 + b)
Forward pass: propagating the inputs through the graph produces the prediction; in the figure's example, a pre-activation value of 1 gives an output of σ(1) ≈ 0.73.
Example adapted from “CS231n: Convolutional Neural Networks for Visual Recognition”, Stanford University.
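A minimal forward pass through this computational graph in Python (the function name and the example values are assumptions, chosen so that the pre-activation equals 1 and the output matches the 0.73 of the slides):

import math

def forward(x1, x2, w1, w2, b):
    """Forward pass of the perceptron, node by node, following the computational graph."""
    p1 = w1 * x1                          # product node
    p2 = w2 * x2                          # product node
    s = p1 + p2                           # sum node
    z = s + b                             # sum with the bias
    return 1.0 / (1.0 + math.exp(-z))     # sigmoid node

print(forward(x1=-1.0, x2=-2.0, w1=2.0, w2=-3.0, b=-3.0))  # ~0.73, since the pre-activation is 1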
Computational graph of a perceptron
Challenge: How can we compute the gradient of the loss function with respect to w1 or w2?
Gradients from composition (chain rule)
Forward pass: the prediction is obtained by composing the operations of the graph; each intermediate value is x_{i+1} = g_i(x_i), and the last one is the prediction ŷ.
Backward pass: how does a variation (“difference”) in the input affect the prediction?
A variation in x5 directly affects ŷ with a 1:1 factor, since x5 is the output of the graph: ∂ŷ/∂x5 = 1.
How does a variation in x4 affect the predicted ŷ? It corresponds to how a variation of x5 affects ŷ...
...multiplied by how a variation near the input x4 affects the output g4(x4).
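In symbols (a reconstruction of the equation shown as an image on the slide, using the x_{i+1} = g_i(x_i) notation above):

∂ŷ/∂x4 = (∂ŷ/∂x5) · (∂x5/∂x4) = (∂ŷ/∂x5) · g4'(x4)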
Backward pass: the same reasoning can be applied iteratively until reaching the input of the graph.
In order to compute the gradient at a node xi, we must:
1) Find the derivative function ➝ g'i(·)
2) Evaluate g'i(·) at xi ➝ g'i(xi)
3) Multiply g'i(xi) by the backpropagated gradient (δk).
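A minimal sketch of this three-step recipe for a chain of operations (the example chain and the variable names are assumptions for illustration):

import math

sigmoid = lambda x: 1.0 / (1.0 + math.exp(-x))

# Chain of operations x_{i+1} = g_i(x_i), each paired with its derivative function g_i'
ops = [
    (lambda x: 3 * x,    lambda x: 3.0),                              # g1
    (lambda x: x + 2,    lambda x: 1.0),                              # g2
    (sigmoid,            lambda x: sigmoid(x) * (1 - sigmoid(x))),    # g3
]

x = 0.5
xs = [x]
for g, _ in ops:                     # forward pass: keep the intermediate values
    x = g(x)
    xs.append(x)

delta = 1.0                          # dŷ/dŷ = 1 at the output
for (_, g_prime), xi in zip(reversed(ops), reversed(xs[:-1])):
    delta = g_prime(xi) * delta      # steps 1-3: derivative, evaluate at xi, multiply
print(delta)                         # dŷ/dx1 for the whole chain (~0.085 here)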
Backward pass: when training neural networks, we will actually compute the derivative of the loss function L(y, ŷ), not of the predicted value ŷ.
Question: What are the derivatives of the functions involved in the computational graph of a perceptron?
● SIGMOID (σ)
● SUM (+)
● PRODUCT (x)
Gradient backpropagation in a perceptron
We can now estimate the sensitivity of the output ŷ with respect to each input parameter wi and xi. The backward pass starts at the output (0.73 in the example) with dŷ/dŷ = 1.
Example extracted from Andrej Karpathy’s notes for CS231n from Stanford University.
Gradient weights for sigmoid σ
The local derivative of the sigmoid, dσ(x)/dx = e⁻ˣ / (1 + e⁻ˣ)², can be re-arranged as σ(x) · (1 − σ(x)).
Even more details: Arunava, “Derivative of the Sigmoid function” (2018). Figure: Andrej Karpathy.
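A quick numerical check of this identity (a sketch; the helper names are assumptions):

import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_grad(x):
    s = sigmoid(x)
    return s * (1.0 - s)   # re-arranged form of the derivative

x = 1.0
numeric = (sigmoid(x + 1e-6) - sigmoid(x - 1e-6)) / 2e-6  # finite-difference estimate
print(sigmoid_grad(x), numeric)  # both ~0.197, i.e. the ~0.2 used in the example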
Gradient backpropagation in a perceptron
Backward pass: starting from dŷ/dŷ = 1 at the output (0.73), the gradient after passing back through the sigmoid node becomes 0.73 · (1 − 0.73) ≈ 0.2.
Example extracted from Andrej Karpathy’s notes for CS231n from Stanford University.
Sum: distributes the gradient to both branches. The incoming gradient of 0.2 is copied to the bias (dŷ/db = 0.2) and to each of the two product branches.
Example extracted from Andrej Karpathy’s notes for CS231n from Stanford University.
Product: switches the gradient values. Each factor receives the incoming gradient multiplied by the value of the other factor. In the figure's example this yields dŷ/dw1 = -0.2, dŷ/dx1 = 0.4, dŷ/dw2 = -0.4, dŷ/dx2 = -0.6, and dŷ/db = 0.2.
Example extracted from Andrej Karpathy’s notes for CS231n from Stanford University.
Normally, we will be interested only in the weights (wi) and biases (b), not the inputs (xi). The weights are the parameters to learn in our models.
Example extracted from Andrej Karpathy’s notes for CS231n from Stanford University.
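A minimal manual backward pass for this perceptron, applying the sigmoid, sum, and product rules above (the variable names and the example input values are assumptions, chosen so the forward pass reproduces the 0.73 output):

import math

# Forward pass (values assumed so that the pre-activation equals 1, as in the example)
w1, x1, w2, x2, b = 2.0, -1.0, -3.0, -2.0, -3.0
z = w1 * x1 + w2 * x2 + b
y_hat = 1.0 / (1.0 + math.exp(-z))           # ~0.73

# Backward pass
d_yhat = 1.0                                 # dŷ/dŷ
d_z = d_yhat * y_hat * (1.0 - y_hat)         # sigmoid: ~0.2
d_b = d_z                                    # sum distributes the gradient
d_w1, d_x1 = d_z * x1, d_z * w1              # product switches the values
d_w2, d_x2 = d_z * x2, d_z * w2
print(round(d_w1, 1), round(d_x1, 1), round(d_w2, 1), round(d_x2, 1), round(d_b, 1))
# -0.2 0.4 -0.4 -0.6 0.2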
(bonus) Gradient weights for MAX & SPLIT
Max: routes the gradient only to the higher input branch (it is not sensitive to the lower branches); in the example, the incoming 0.2 flows entirely to the winning branch and the other branch receives 0.
Split: branches that split in the forward pass merge in the backward pass, adding their gradients.
(bonus) Gradient weights for ReLU
The ReLU passes the incoming gradient through unchanged where its input was positive and blocks it (gradient 0) where the input was negative.
Figures: Andrej Karpathy
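These local rules in code (a sketch; the function names are assumptions):

def relu_backward(x, grad_out):
    """ReLU: pass the gradient through only where the forward input was positive."""
    return grad_out if x > 0 else 0.0

def max_backward(a, b, grad_out):
    """Max: route the whole gradient to the branch that won the forward pass."""
    return (grad_out, 0.0) if a >= b else (0.0, grad_out)

def split_backward(grad_branch_1, grad_branch_2):
    """Split: gradients coming back from the two branches are added."""
    return grad_branch_1 + grad_branch_2

print(relu_backward(-1.5, 0.2))     # 0.0: the gradient is "killed"
print(max_backward(2.0, 1.0, 0.2))  # (0.2, 0.0)
print(split_backward(0.2, 0.3))     # 0.5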
Backpropagation across layers
Backward pass
Gradients can flow across stacked layers of neurons to estimate their parameters.
Watch more
● Gilbert Strang, “27. Backpropagation: Find Partial Derivatives”. MIT 18.065 (2018)
● Creative Commons, “Yoshua Bengio Extra Footage 1: Brainstorm with students” (2018)
Learn more
READ
● Chris Olah, “Calculus on Computational Graphs: Backpropagation” (2015).
● Andrej Karpathy, “Yes, you should understand backprop” (2016), and his course notes at Stanford University CS231n.
THREAD
Advanced discussion
Problem
Consider a perceptron with a ReLU as activation function, designed to process a single-dimensional input x.
a) Draw the computational graph of the perceptron, drawing a circle around the parameters that need to be estimated during training.
b) Compute the partial derivative of the output of the perceptron (y) with respect to each of its parameters for the input sample x=2. Consider that all the trainable parameters of the perceptron are initialized to 1.
c) Modify the results obtained in b) for the case in which all the trainable parameters of the perceptron are initialized to -1.
d) Briefly comment and compare the results obtained in b) and c).
Problem (solved)
a) Draw the computational graph of the perceptron, drawing a circle around the parameters that need to be estimated during training.
b) Compute the partial derivative of the output of the perceptron (y) with respect to each of its parameters for the input sample x=2. Consider that all the trainable parameters of the perceptron are initialized to 1.
Problem (solved)
c) Modify the results obtained in b) for the case in which all the trainable parameters of the perceptron are initialized to -1.
d) Briefly comment and compare the results obtained in b) and c).
While in case b) the gradients can flow to the trainable parameters w1 and b, in case c) the gradients are “killed” by the ReLU, whose input is negative and whose local derivative is therefore 0.
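A minimal sketch of the computation behind b) and c), assuming the perceptron is y = ReLU(w1 · x + b) (this form is implied by the problem but not spelled out in the text):

def relu_perceptron_grads(x, w1, b):
    """Partial derivatives of y = ReLU(w1 * x + b) with respect to w1 and b."""
    z = w1 * x + b
    relu_grad = 1.0 if z > 0 else 0.0   # local derivative of the ReLU
    return {"dy/dw1": relu_grad * x, "dy/db": relu_grad * 1.0}

print(relu_perceptron_grads(x=2.0, w1=1.0, b=1.0))    # b): {'dy/dw1': 2.0, 'dy/db': 1.0}
print(relu_perceptron_grads(x=2.0, w1=-1.0, b=-1.0))  # c): both 0.0, "killed" by the ReLU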