Introduction to Deep Learning
MIT 6.S191
Alexander Amini
January 28, 2019
6.S191 Introduction to Deep Learning
introtodeeplearning.com
1/28/19
The Rise of Deep Learning
6.S191 Introduction to Deep Learning
introtodeeplearning.com
1/28/19
What is Deep Learning?
ARTIFICIAL INTELLIGENCE: any technique that enables computers to mimic human behavior
MACHINE LEARNING: the ability to learn without explicitly being programmed
DEEP LEARNING: extract patterns from data using neural networks
Why Deep Learning and Why Now?
6.S191 Introduction to Deep Learning
introtodeeplearning.com
1/28/19
Why Deep Learning?
Hand-engineered features are time-consuming, brittle, and not scalable in practice
Can we learn the underlying features directly from data?
Low Level Features (Lines & Edges) → Mid Level Features (Eyes, Nose & Ears) → High Level Features (Facial Structure)
6.S191 Introduction to Deep Learning
introtodeeplearning.com
1/28/19
Why Now?
Neural networks date back decades, so why the resurgence?
Timeline: 1952 Stochastic Gradient Descent; 1958 Perceptron (learnable weights); 1986 Backpropagation (multi-layer perceptron); 1995 Deep Convolutional NN (digit recognition)
1. Big Data: larger datasets; easier collection & storage
2. Hardware: Graphics Processing Units (GPUs); massively parallelizable
3. Software: improved techniques; new models; toolboxes
The Perceptron
The structural building block of deep learning
6.S191 Introduction to Deep Learning
introtodeeplearning.com
1/28/19
The Perceptron: Forward Propagation
Inputs x_1, …, x_m are multiplied by weights w_1, …, w_m, summed, and passed through a non-linear activation function g:
\hat{y} = g\left( \sum_{i=1}^{m} x_i w_i \right)
Output \hat{y}: a non-linear activation function applied to a linear combination of the inputs.
Inputs → Weights → Sum → Non-Linearity → Output
6.S191 Introduction to Deep Learning
introtodeeplearning.com
1/28/19
Inputs Weights Sum Non-Linearity
Inputs x_1, …, x_m plus a constant input 1 for the bias w_0; weights w_1, …, w_m:
\hat{y} = g\left( w_0 + \sum_{i=1}^{m} x_i w_i \right)
Output \hat{y}: a non-linear activation function applied to the bias plus a linear combination of the inputs.
The Perceptron: Forward Propagation
6.S191 Introduction to Deep Learning
introtodeeplearning.com
1/28/19
Inputs Weights Sum Non-Linearity
Inputs x_1, …, x_m with bias input 1; weights w_0, w_1, …, w_m; sum; non-linearity g; output \hat{y}:
\hat{y} = g\left( w_0 + \sum_{i=1}^{m} x_i w_i \right)
The Perceptron: Forward Propagation
\hat{y} = g\left( w_0 + X^T W \right), where X = [x_1, \dots, x_m]^T and W = [w_1, \dots, w_m]^T
6.S191 Introduction to Deep Learning
introtodeeplearning.com
1/28/19
Inputs Weights Sum Non-Linearity
Inputs x_1, …, x_m with bias input 1; weights; sum; non-linearity g; output:
The Perceptron: Forward Propagation
\hat{y} = g\left( w_0 + X^T W \right)
Activation Functions
• Example: sigmoid function  g(z) = \sigma(z) = \frac{1}{1 + e^{-z}}
6.S191 Introduction to Deep Learning
introtodeeplearning.com
1/28/19
Common Activation Functions
NOTE: All activation functions are non-linear
! " =
1
1 + & ' (
Sigmoid Function
! ′ " = !(") 1 − !(")
! " =
& ( − & ' (
& ( + & ' (
HyperbolicTangent
! ′ " = 1 − !(")-
! " = max ( 0 , " )
Rectified Linear Unit (ReLU)
! ′ ( " ) = 3
1 , " > 0
0 , otherwise
tf.nn.sigmoid(z) tf.nn.tanh(z) tf.nn.relu(z)
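As a minimal NumPy sketch (not from the slides) of the three activations and their derivatives, written directly from the formulas above:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))       # g(z) = 1 / (1 + e^-z)

def sigmoid_prime(z):
    s = sigmoid(z)
    return s * (1.0 - s)                  # g'(z) = g(z)(1 - g(z))

def tanh(z):
    return np.tanh(z)                     # g(z) = (e^z - e^-z) / (e^z + e^-z)

def tanh_prime(z):
    return 1.0 - np.tanh(z) ** 2          # g'(z) = 1 - g(z)^2

def relu(z):
    return np.maximum(0.0, z)             # g(z) = max(0, z)

def relu_prime(z):
    return (z > 0).astype(float)          # 1 if z > 0, else 0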
6.S191 Introduction to Deep Learning
introtodeeplearning.com
1/28/19
Importance of Activation Functions
The purpose of activation functions is to introduce non-linearities into the network
What if we wanted to build a Neural Network to
distinguish green vs red points?
6.S191 Introduction to Deep Learning
introtodeeplearning.com
1/28/19
Importance of Activation Functions
The purpose of activation functions is to introduce non-linearities into the network
Linear Activation functions produce linear
decisions no matter the network size
6.S191 Introduction to Deep Learning
introtodeeplearning.com
1/28/19
Importance of Activation Functions
Linear Activation functions produce linear
decisions no matter the network size
Non-linearities allow us to approximate
arbitrarily complex functions
The purpose of activation functions is to introduce non-linearities into the network
6.S191 Introduction to Deep Learning
introtodeeplearning.com
1/28/19
The Perceptron: Example
Perceptron with bias w_0 = 1 and weights W = [3, -2]^T acting on inputs x_1, x_2:
\hat{y} = g\left( w_0 + X^T W \right) = g\left( 1 + [x_1 \; x_2] \begin{bmatrix} 3 \\ -2 \end{bmatrix} \right)
\hat{y} = g( 1 + 3x_1 - 2x_2 )
This is just a line in 2D!
6.S191 Introduction to Deep Learning
introtodeeplearning.com
1/28/19
The Perceptron: Example
Perceptron with w_0 = 1 and W = [3, -2]^T:
\hat{y} = g( 1 + 3x_1 - 2x_2 )
The decision boundary 1 + 3x_1 - 2x_2 = 0 is a line in the (x_1, x_2) plane.
6.S191 Introduction to Deep Learning
introtodeeplearning.com
1/28/19
The Perceptron: Example
\hat{y} = g( 1 + 3x_1 - 2x_2 )
Assume we have the input X = [-1, 2]^T:
\hat{y} = g( 1 + 3 \cdot (-1) - 2 \cdot 2 ) = g(-6) \approx 0.002
Decision boundary: 1 + 3x_1 - 2x_2 = 0 in the (x_1, x_2) plane.
6.S191 Introduction to Deep Learning
introtodeeplearning.com
1/28/19
The Perceptron: Example
\hat{y} = g( 1 + 3x_1 - 2x_2 ), with decision boundary 1 + 3x_1 - 2x_2 = 0:
when z = 1 + 3x_1 - 2x_2 < 0, \hat{y} < 0.5; when z > 0, \hat{y} > 0.5.
Building Neural Networks with Perceptrons
6.S191 Introduction to Deep Learning
introtodeeplearning.com
1/28/19
(Recall the full perceptron: inputs x_1, …, x_m and bias 1 → weights → sum → non-linearity → output \hat{y}.)
The Perceptron: Simplified
6.S191 Introduction to Deep Learning
introtodeeplearning.com
1/28/19
The Perceptron: Simplified
Inputs x_1, …, x_m feed a single unit z:
z = w_0 + \sum_{j=1}^{m} x_j w_j,    \hat{y} = g(z)
6.S191 Introduction to Deep Learning
introtodeeplearning.com
1/28/19
Multi Output Perceptron
Inputs x_1, …, x_m feed multiple output units z_1, z_2:
z_i = w_{0,i} + \sum_{j=1}^{m} x_j w_{j,i},    y_i = g(z_i)
6.S191 Introduction to Deep Learning
introtodeeplearning.com
1/28/19
Single Layer Neural Network
Inputs x_1, …, x_m;  Hidden units z_1, …, z_{d_1};  Final Outputs \hat{y}_1, \hat{y}_2
z_i = w_{0,i}^{(1)} + \sum_{j=1}^{m} x_j \, w_{j,i}^{(1)}
\hat{y}_i = g\left( w_{0,i}^{(2)} + \sum_{j=1}^{d_1} g(z_j) \, w_{j,i}^{(2)} \right)
W^{(1)} connects the inputs to the hidden layer; W^{(2)} connects the hidden layer to the outputs.
6.S191 Introduction to Deep Learning
introtodeeplearning.com
1/28/19
Single Layer Neural Network
For example, hidden unit z_2:
z_2 = w_{0,2}^{(1)} + \sum_{j=1}^{m} x_j \, w_{j,2}^{(1)}
    = w_{0,2}^{(1)} + x_1 w_{1,2}^{(1)} + x_2 w_{2,2}^{(1)} + \dots + x_m w_{m,2}^{(1)}
6.S191 Introduction to Deep Learning
introtodeeplearning.com
1/28/19
Multi Output Perceptron
Inputs x_1, …, x_m;  Hidden units z_1, …, z_{d_1};  Outputs \hat{y}_1, \hat{y}_2
import tensorflow as tf
from tensorflow.keras.layers import Input, Dense
from tensorflow.keras.models import Model

inputs = Input(shape=(m,))        # m input features
hidden = Dense(d1)(inputs)        # hidden layer with d1 units (add an activation, e.g. 'relu', in practice)
outputs = Dense(2)(hidden)        # 2 output units
model = Model(inputs, outputs)
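For a quick sanity check (a hedged sketch; the sizes m = 3 and d1 = 4 are example values, not from the slide), you could push a random batch through the model:

import numpy as np
m, d1 = 3, 4                                  # example sizes for the layers above
# ...build `inputs`, `hidden`, `outputs`, `model` as in the snippet above...
x = np.random.rand(8, m).astype("float32")    # a batch of 8 examples with m features
y = model(x)                                  # forward pass; y has shape (8, 2)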
6.S191 Introduction to Deep Learning
introtodeeplearning.com
1/28/19
Deep Neural Network
Inputs x_1, …, x_m;  stacked hidden layers of units z_{k,1}, …, z_{k,n_k};  Outputs \hat{y}_1, \hat{y}_2
z_{k,i} = w_{0,i}^{(k)} + \sum_{j=1}^{n_{k-1}} g(z_{k-1,j}) \, w_{j,i}^{(k)}
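A hedged tf.keras sketch of such a stack of layers (the hidden sizes n1, n2 and the 2-unit output are illustrative, not from the slide):

import tensorflow as tf

n1, n2 = 32, 32                                      # hidden layer sizes (placeholders)
model = tf.keras.Sequential([
    tf.keras.layers.Dense(n1, activation='relu'),    # hidden layer 1
    tf.keras.layers.Dense(n2, activation='relu'),    # hidden layer 2
    tf.keras.layers.Dense(2),                        # output layer
])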
Applying Neural Networks
6.S191 Introduction to Deep Learning
introtodeeplearning.com
1/28/19
Example Problem
Will I pass this class?
Let’s start with a simple two feature model
x_1 = Number of lectures you attend
x_2 = Hours spent on the final project
6.S191 Introduction to Deep Learning
introtodeeplearning.com
1/28/19
Example Problem: Will I pass this class?
x_2 = Hours spent on the final project (vertical axis)
x_1 = Number of lectures you attend (horizontal axis)
Pass
Fail
Legend
6.S191 Introduction to Deep Learning
introtodeeplearning.com
1/28/19
Example Problem: Will I pass this class?
x_2 = Hours spent on the final project (vertical axis)
x_1 = Number of lectures you attend (horizontal axis)
Legend: Pass / Fail
New point to classify: x = [4, 5]^T → ?
6.S191 Introduction to Deep Learning
introtodeeplearning.com
1/28/19
Example Problem: Will I pass this class?
Two-feature network: inputs x_1, x_2 → hidden units z_1, z_2, z_3 → output \hat{y}_1
x^{(1)} = [4, 5]^T    Predicted: 0.1
6.S191 Introduction to Deep Learning
introtodeeplearning.com
1/28/19
Example Problem: Will I pass this class?
(Same two-feature network: x_1, x_2 → z_1, z_2, z_3 → \hat{y}_1)
x^{(1)} = [4, 5]^T    Predicted: 0.1    Actual: 1
6.S191 Introduction to Deep Learning
introtodeeplearning.com
1/28/19
Quantifying Loss
For input x^{(1)} = [4, 5]^T:    Predicted: 0.1    Actual: 1
The loss of our network measures the cost incurred from incorrect predictions:
\mathcal{L}\big( f(x^{(i)}; W), \; y^{(i)} \big)   — predicted f(x^{(i)}; W) vs. actual y^{(i)}
6.S191 Introduction to Deep Learning
introtodeeplearning.com
1/28/19
Empirical Loss
The empirical loss measures the total loss over our entire dataset.
Inputs X = [[4, 5], [2, 1], [5, 8], …];  predictions f(X) = [0.1, 0.8, 0.6, …];  actual labels y = [1, 0, 1, …]
J(W) = \frac{1}{n} \sum_{i=1}^{n} \mathcal{L}\big( f(x^{(i)}; W), \; y^{(i)} \big)
Also known as:
• Objective function
• Cost function
• Empirical Risk
Binary Cross Entropy Loss
Cross entropy loss can be used with models that output a probability between 0 and 1.
Inputs X = [[4, 5], [2, 1], [5, 8], …];  predictions f(X) = [0.1, 0.8, 0.6, …];  actual labels y = [1, 0, 1, …]
J(W) = -\frac{1}{n} \sum_{i=1}^{n} \Big[ y^{(i)} \log f\big(x^{(i)}; W\big) + \big(1 - y^{(i)}\big) \log\big(1 - f\big(x^{(i)}; W\big)\big) \Big]
loss = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(labels=model.y, logits=model.pred))
Mean Squared Error Loss
Mean squared error loss can be used with regression models that output continuous real numbers.
Inputs X = [[4, 5], [2, 1], [5, 8], …];  predicted final grades f(X) = [30, 80, 85, …];  actual final grades y = [90, 20, 95, …] (percentage)
J(W) = \frac{1}{n} \sum_{i=1}^{n} \big( y^{(i)} - f\big(x^{(i)}; W\big) \big)^2
loss = tf.reduce_mean(tf.square(tf.subtract(model.y, model.pred)))
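As a hedged modern-TensorFlow sketch (model.y / model.pred above are TF1-style placeholders), the same two losses can be computed directly from labels and predictions:

import tensorflow as tf

y_true = tf.constant([1.0, 0.0, 1.0])          # binary labels
y_prob = tf.constant([0.1, 0.8, 0.6])          # predicted probabilities
bce = tf.keras.losses.binary_crossentropy(y_true, y_prob)    # binary cross entropy loss

grades_true = tf.constant([90.0, 20.0, 95.0])  # actual final grades
grades_pred = tf.constant([30.0, 80.0, 85.0])  # predicted final grades
mse = tf.reduce_mean(tf.square(grades_true - grades_pred))   # mean squared error loss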
Training Neural Networks
6.S191 Introduction to Deep Learning
introtodeeplearning.com
1/28/19
Loss Optimization
We want to find the network weights that achieve the lowest loss
W^* = \arg\min_W \frac{1}{n} \sum_{i=1}^{n} \mathcal{L}\big( f(x^{(i)}; W), \; y^{(i)} \big)
W^* = \arg\min_W J(W)
6.S191 Introduction to Deep Learning
introtodeeplearning.com
1/28/19
Loss Optimization
We want to find the network weights that achieve the lowest loss
W^* = \arg\min_W \frac{1}{n} \sum_{i=1}^{n} \mathcal{L}\big( f(x^{(i)}; W), \; y^{(i)} \big)
W^* = \arg\min_W J(W)
Remember:  W = \{ W^{(0)}, W^{(1)}, \dots \}
Loss Optimization
W^* = \arg\min_W J(W)
(Loss landscape J(w_0, w_1) plotted over two weights w_0, w_1)
Remember:
Our loss is a function of
the network weights!
Loss Optimization
Randomly pick an initial (w_0, w_1) on the loss surface J(w_0, w_1).
Loss Optimization
Compute gradient, ∂J(W)/∂W
Loss Optimization
Take small step in opposite direction of gradient
!(#$, #&)
#&
#$
Gradient Descent
Repeat until convergence
!(#$, #&)
#&
#$
6.S191 Introduction to Deep Learning
introtodeeplearning.com
1/28/19
Gradient Descent
Algorithm
1. Initialize weights randomly ~ \mathcal{N}(0, \sigma^2)
2. Loop until convergence:
3.     Compute gradient, ∂J(W)/∂W
4.     Update weights, W ← W − η ∂J(W)/∂W
5. Return weights
weights = tf.Variable(tf.random_normal(shape, stddev=sigma))   # must be a Variable to support assign()
grads = tf.gradients(ys=loss, xs=weights)                      # returns a list, one gradient per tensor in xs
weights_new = weights.assign(weights - lr * grads[0])
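For reference, a hedged TensorFlow 2 sketch of the same loop using tf.GradientTape (the toy loss and fixed step count are illustrative assumptions, not from the slide):

import tensorflow as tf

weights = tf.Variable(tf.random.normal([2], stddev=1.0))   # 1. initialize randomly ~ N(0, sigma^2)
lr = 0.1                                                   # learning rate (eta)
for _ in range(100):                                       # 2. loop (fixed steps instead of a convergence test)
    with tf.GradientTape() as tape:
        loss = tf.reduce_sum(weights ** 2)                 # toy loss J(W), just for illustration
    grads = tape.gradient(loss, weights)                   # 3. compute gradient dJ(W)/dW
    weights.assign_sub(lr * grads)                         # 4. update W <- W - eta * dJ(W)/dW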
6.S191 Introduction to Deep Learning
introtodeeplearning.com
1/28/19
Computing Gradients: Backpropagation
How does a small change in one weight (e.g., w_2) affect the final loss J(W)?
x —(w_1)→ z_1 —(w_2)→ \hat{y} → J(W)
6.S191 Introduction to Deep Learning
introtodeeplearning.com
1/28/19
Computing Gradients: Backpropagation
∂J(W)/∂w_2 = ?        (x —(w_1)→ z_1 —(w_2)→ \hat{y} → J(W))
Let’s use the chain rule!
6.S191 Introduction to Deep Learning
introtodeeplearning.com
1/28/19
Computing Gradients: Backpropagation
\frac{\partial J(W)}{\partial w_2} = \frac{\partial J(W)}{\partial \hat{y}} \cdot \frac{\partial \hat{y}}{\partial w_2}
6.S191 Introduction to Deep Learning
introtodeeplearning.com
1/28/19
Computing Gradients: Backpropagation
Now repeat for the earlier weight w_1:
\frac{\partial J(W)}{\partial w_1} = \frac{\partial J(W)}{\partial \hat{y}} \cdot \frac{\partial \hat{y}}{\partial w_1}
Apply the chain rule again!
6.S191 Introduction to Deep Learning
introtodeeplearning.com
1/28/19
Computing Gradients: Backpropagation
\frac{\partial J(W)}{\partial w_1} = \frac{\partial J(W)}{\partial \hat{y}} \cdot \frac{\partial \hat{y}}{\partial z_1} \cdot \frac{\partial z_1}{\partial w_1}
6.S191 Introduction to Deep Learning
introtodeeplearning.com
1/28/19
Computing Gradients: Backpropagation
\frac{\partial J(W)}{\partial w_1} = \frac{\partial J(W)}{\partial \hat{y}} \cdot \frac{\partial \hat{y}}{\partial z_1} \cdot \frac{\partial z_1}{\partial w_1}
Repeat this for every weight in the network using gradients from later layers
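A hedged NumPy sketch of these chain-rule products for the tiny network above (one input x, one linear hidden unit z_1, sigmoid output \hat{y}, squared-error loss; all numbers are made up for illustration):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

x, y = 0.5, 1.0              # input and true label (illustrative)
w1, w2 = 0.3, -0.4           # the two weights from the diagram

z1 = w1 * x                  # hidden pre-activation
y_hat = sigmoid(w2 * z1)     # output
J = (y_hat - y) ** 2         # squared-error loss J(W)

dJ_dyhat = 2 * (y_hat - y)                         # dJ / dy_hat
sig_prime = y_hat * (1 - y_hat)                    # sigmoid'(w2 * z1)
dJ_dw2 = dJ_dyhat * sig_prime * z1                 # chain rule: dJ/dw2
dJ_dw1 = dJ_dyhat * sig_prime * w2 * x             # chain rule: dJ/dy_hat * dy_hat/dz1 * dz1/dw1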
Neural Networks in Practice:
Optimization
6.S191 Introduction to Deep Learning
introtodeeplearning.com
1/28/19
Training Neural Networks is Difficult
“Visualizing the loss landscape
of neural nets”. Dec 2017.
6.S191 Introduction to Deep Learning
introtodeeplearning.com
1/28/19
Loss Functions Can Be Difficult to Optimize
Remember:
Optimization through gradient descent
W ← W − η ∂J(W)/∂W
6.S191 Introduction to Deep Learning
introtodeeplearning.com
1/28/19
Remember:
Optimization through gradient descent
W ← W − η ∂J(W)/∂W
How can we set the
learning rate?
Loss Functions Can Be Difficult to Optimize
6.S191 Introduction to Deep Learning
introtodeeplearning.com
1/28/19
Setting the Learning Rate
Small learning rate converges slowly and gets stuck in false local minima
Initial guess
(Plot: loss J(W) versus weight W)
6.S191 Introduction to Deep Learning
introtodeeplearning.com
1/28/19
Setting the Learning Rate
Large learning rates overshoot, become unstable and diverge
Initial guess
(Plot: loss J(W) versus weight W)
6.S191 Introduction to Deep Learning
introtodeeplearning.com
1/28/19
Setting the Learning Rate
Stable learning rates converge smoothly and avoid local minima
Initial guess
(Plot: loss J(W) versus weight W)
6.S191 Introduction to Deep Learning
introtodeeplearning.com
1/28/19
How to deal with this?
Idea 1:
Try lots of different learning rates and see what works “just right”
6.S191 Introduction to Deep Learning
introtodeeplearning.com
1/28/19
How to deal with this?
Idea 1:
Try lots of different learning rates and see what works “just right”
Idea 2:
Do something smarter!
Design an adaptive learning rate that “adapts” to the landscape
6.S191 Introduction to Deep Learning
introtodeeplearning.com
1/28/19
Adaptive Learning Rates
• Learning rates are no longer fixed
• Can be made larger or smaller depending on:
• how large gradient is
• how fast learning is happening
• size of particular weights
• etc...
6.S191 Introduction to Deep Learning
introtodeeplearning.com
1/28/19
Adaptive Learning Rate Algorithms
• Momentum
• Adagrad
• Adadelta
• Adam
• RMSProp
Additional details: http://ruder.io/optimizing-gradient-descent/
tf.train.MomentumOptimizer
tf.train.AdagradOptimizer
tf.train.AdadeltaOptimizer
tf.train.AdamOptimizer
tf.train.RMSPropOptimizer
Qian et al.“On the momentum term in gradient
descent learning algorithms.” 1999.
Duchi et al.“Adaptive Subgradient Methods for Online
Learning and Stochastic Optimization.” 2011.
Zeiler et al.“ADADELTA:An Adaptive Learning Rate
Method.” 2012.
Kingma et al.“Adam:A Method for Stochastic
Optimization.” 2014.
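The tf.train optimizers listed above are the TensorFlow 1.x API; as a hedged sketch, the equivalent tf.keras.optimizers classes look like this (learning rates are illustrative):

import tensorflow as tf

sgd_momentum = tf.keras.optimizers.SGD(learning_rate=0.01, momentum=0.9)   # momentum
adagrad = tf.keras.optimizers.Adagrad(learning_rate=0.01)
adadelta = tf.keras.optimizers.Adadelta(learning_rate=1.0)
adam = tf.keras.optimizers.Adam(learning_rate=0.001)
rmsprop = tf.keras.optimizers.RMSprop(learning_rate=0.001)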
Neural Networks in Practice:
Mini-batches
6.S191 Introduction to Deep Learning
introtodeeplearning.com
1/28/19
Gradient Descent
Algorithm
1. Initialize weights randomly ~ \mathcal{N}(0, \sigma^2)
2. Loop until convergence:
3.     Compute gradient, ∂J(W)/∂W
4.     Update weights, W ← W − η ∂J(W)/∂W
5. Return weights
6.S191 Introduction to Deep Learning
introtodeeplearning.com
1/28/19
Gradient Descent
Algorithm
1. Initialize weights randomly ~ \mathcal{N}(0, \sigma^2)
2. Loop until convergence:
3.     Compute gradient, ∂J(W)/∂W
4.     Update weights, W ← W − η ∂J(W)/∂W
5. Return weights
Can be very computationally expensive to compute!
6.S191 Introduction to Deep Learning
introtodeeplearning.com
1/28/19
Stochastic Gradient Descent
Algorithm
1. Initialize weights randomly ~ \mathcal{N}(0, \sigma^2)
2. Loop until convergence:
3.     Pick single data point i
4.     Compute gradient, ∂J_i(W)/∂W
5.     Update weights, W ← W − η ∂J_i(W)/∂W
6. Return weights
6.S191 Introduction to Deep Learning
introtodeeplearning.com
1/28/19
Stochastic Gradient Descent
Algorithm
1. Initialize weights randomly ~ \mathcal{N}(0, \sigma^2)
2. Loop until convergence:
3.     Pick single data point i
4.     Compute gradient, ∂J_i(W)/∂W
5.     Update weights, W ← W − η ∂J_i(W)/∂W
6. Return weights
Easy to compute but
very noisy
(stochastic)!
6.S191 Introduction to Deep Learning
introtodeeplearning.com
1/28/19
Stochastic Gradient Descent
Algorithm
1. Initialize weights randomly ~ \mathcal{N}(0, \sigma^2)
2. Loop until convergence:
3.     Pick batch of B data points
4.     Compute gradient, \frac{\partial J(W)}{\partial W} = \frac{1}{B} \sum_{k=1}^{B} \frac{\partial J_k(W)}{\partial W}
5.     Update weights, W ← W − η ∂J(W)/∂W
6. Return weights
6.S191 Introduction to Deep Learning
introtodeeplearning.com
1/28/19
Stochastic Gradient Descent
Algorithm
1. Initialize weights randomly ~ \mathcal{N}(0, \sigma^2)
2. Loop until convergence:
3.     Pick batch of B data points
4.     Compute gradient, \frac{\partial J(W)}{\partial W} = \frac{1}{B} \sum_{k=1}^{B} \frac{\partial J_k(W)}{\partial W}
5.     Update weights, W ← W − η ∂J(W)/∂W
6. Return weights
Fast to compute and a much better
estimate of the true gradient!
6.S191 Introduction to Deep Learning
introtodeeplearning.com
1/28/19
Mini-batches while training
More accurate estimation of gradient
Smoother convergence
Allows for larger learning rates
6.S191 Introduction to Deep Learning
introtodeeplearning.com
1/28/19
Mini-batches while training
More accurate estimation of gradient
Smoother convergence
Allows for larger learning rates
Mini-batches lead to fast training!
Can parallelize computation + achieve significant speed increases on GPUs (see the sketch below)
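A hedged NumPy sketch of the mini-batch loop on a toy linear-regression problem (the dataset, batch size B, and learning rate are illustrative assumptions):

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))              # toy dataset
y = X @ np.array([1.0, -2.0, 0.5])          # toy targets
W = rng.normal(size=3)                      # initialize weights randomly
lr, B = 0.1, 32                             # learning rate and batch size

for step in range(200):
    idx = rng.integers(0, len(X), size=B)   # pick a batch of B data points
    Xb, yb = X[idx], y[idx]
    grad = 2.0 / B * Xb.T @ (Xb @ W - yb)   # gradient of the mean squared error on the batch
    W -= lr * grad                          # update weights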
Neural Networks in Practice:
Overfitting
6.S191 Introduction to Deep Learning
introtodeeplearning.com
1/28/19
The Problem of Overfitting
Underfitting
Model does not have capacity
to fully learn the data
Ideal fit Overfitting
Too complex, extra parameters,
does not generalize well
6.S191 Introduction to Deep Learning
introtodeeplearning.com
1/28/19
Regularization
What is it?
Technique that constrains our optimization problem to discourage complex models
6.S191 Introduction to Deep Learning
introtodeeplearning.com
1/28/19
Regularization
What is it?
Technique that constrains our optimization problem to discourage complex models
Why do we need it?
Improve generalization of our model on unseen data
6.S191 Introduction to Deep Learning
introtodeeplearning.com
1/28/19
Regularization 1: Dropout
(Network diagram: inputs x_1, x_2, x_3 → two hidden layers z_{1,j}, z_{2,j} → outputs \hat{y}_1, \hat{y}_2, with some hidden units randomly dropped.)
• During training, randomly set some activations to 0
6.S191 Introduction to Deep Learning
introtodeeplearning.com
1/28/19
Regularization 1: Dropout
• During training, randomly set some activations to 0
• Typically ‘drop’ 50% of activations in layer
• Forces network to not rely on any 1 node
tf.keras.layers.Dropout(rate=0.5)
6.S191 Introduction to Deep Learning
introtodeeplearning.com
1/28/19
Regularization 2: Early Stopping
• Stop training before we have a chance to overfit
Training Iterations
Loss
6.S191 Introduction to Deep Learning
introtodeeplearning.com
1/28/19
Regularization 2: Early Stopping
• Stop training before we have a chance to overfit
Training Iterations
Loss Testing
Training
Legend
6.S191 Introduction to Deep Learning
introtodeeplearning.com
1/28/19
Regularization 2: Early Stopping
• Stop training before we have a chance to overfit
Training Iterations
Loss
Training
Legend
Testing
6.S191 Introduction to Deep Learning
introtodeeplearning.com
1/28/19
Regularization 2: Early Stopping
Training Iterations
Loss
Training
Legend
Stop training
here!
Testing
• Stop training before we have a chance to overfit
6.S191 Introduction to Deep Learning
introtodeeplearning.com
1/28/19
Regularization 2: Early Stopping
Training Iterations
Loss
Training
Legend
Stop training
here!
Over-fitting
Under-fitting
Testing
• Stop training before we have a chance to overfit
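In tf.keras this idea is available as a callback; a hedged sketch (the model and data in the commented fit call are placeholders, not from the slides):

import tensorflow as tf

early_stop = tf.keras.callbacks.EarlyStopping(
    monitor='val_loss',           # watch the held-out (validation/testing) loss
    patience=5,                   # stop after 5 epochs without improvement
    restore_best_weights=True)    # roll back to the weights at the best point

# model.fit(x_train, y_train, validation_data=(x_val, y_val),
#           epochs=100, callbacks=[early_stop])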
6.S191 Introduction to Deep Learning
introtodeeplearning.com
1/28/19
Core Foundation Review
The Perceptron: structural building blocks; nonlinear activation functions
Neural Networks: stacking perceptrons to form neural networks; optimization through backpropagation
Training in Practice: adaptive learning; batching; regularization
6.S191 Introduction to Deep Learning
introtodeeplearning.com
1/29/19
What Computers “See”
6.S191 Introduction to Deep Learning
introtodeeplearning.com
1/29/19
Images are Numbers
[1]
6.S191 Introduction to Deep Learning
introtodeeplearning.com
1/29/19
Images are Numbers
[1]
6.S191 Introduction to Deep Learning
introtodeeplearning.com
1/29/19
Images are Numbers
What the computer sees
An image is just a matrix of numbers [0,255]!
e.g., 1080x1080x3 for an RGB image
[1]
6.S191 Introduction to Deep Learning
introtodeeplearning.com
1/29/19
Tasks in Computer Vision
- Regression: output variable takes continuous value
- Classification: output variable takes class label. Can produce probability of belonging to a particular class
Input Image → Pixel Representation → classification:
Lincoln 0.8, Washington 0.1, Jefferson 0.05, Obama 0.05
6.S191 Introduction to Deep Learning
introtodeeplearning.com
1/29/19
High Level Feature Detection
Let’s identify key features in each image category
Wheels,
License Plate,
Headlights
Door,
Windows,
Steps
Nose,
Eyes,
Mouth
6.S191 Introduction to Deep Learning
introtodeeplearning.com
1/29/19
Manual Feature Extraction
Problems?
Define features
Domain knowledge
Detect features
to classify
6.S191 Introduction to Deep Learning
introtodeeplearning.com
1/29/19
Manual Feature Extraction
Define features
Domain knowledge
Detect features
to classify
[2]
6.S191 Introduction to Deep Learning
introtodeeplearning.com
1/29/19
Manual Feature Extraction
Define features
Domain knowledge
Detect features
to classify
[2]
6.S191 Introduction to Deep Learning
introtodeeplearning.com
1/29/19
Learning Feature Representations
Can we learn a hierarchy of features directly from the data
instead of hand engineering?
Low level features Mid level features High level features
Eyes, ears, nose
Edges, dark spots Facial structure
[3]
Learning Visual Features
6.S191 Introduction to Deep Learning
introtodeeplearning.com
1/29/19
Fully Connected Neural Network
6.S191 Introduction to Deep Learning
introtodeeplearning.com
1/29/19
Fully Connected Neural Network
Fully Connected:
• Connect neuron in hidden
layer to all neurons in input
layer
• No spatial information!
• And many, many parameters!
Input:
• 2D image
• Vector of pixel values
6.S191 Introduction to Deep Learning
introtodeeplearning.com
1/29/19
Fully Connected Neural Network
How can we use spatial structure in the input to inform the architecture of the network?
Fully Connected:
• Connect neuron in hidden
layer to all neurons in input
layer
• No spatial information!
• And many, many parameters!
Input:
• 2D image
• Vector of pixel values
6.S191 Introduction to Deep Learning
introtodeeplearning.com
1/29/19
Using Spatial Structure
Neuron connected to region of
input. Only “sees” these values.
Idea: connect patches of input
to neurons in hidden layer.
Input: 2D image.
Array of pixel values
6.S191 Introduction to Deep Learning
introtodeeplearning.com
1/29/19
Using Spatial Structure
Connect patch in input layer to a single neuron in subsequent layer.
Use a sliding window to define connections.
How can we weight the patch to detect particular features?
6.S191 Introduction to Deep Learning
introtodeeplearning.com
1/29/19
Applying Filters to Extract Features
1) Apply a set of weights – a filter – to extract local features
2) Use multiple filters to extract different features
3) Spatially share parameters of each filter
(features that matter in one part of the input should matter elsewhere)
6.S191 Introduction to Deep Learning
introtodeeplearning.com
1/29/19
Feature Extraction with Convolution
1) Apply a set of weights – a filter – to extract local features
2) Use multiple filters to extract different features
3) Spatially share parameters of each filter
- Filter of size 4x4 : 16 different weights
- Apply this same filter to 4x4 patches in input
- Shift by 2 pixels for next patch
This “patchy” operation is convolution
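A hedged NumPy sketch of this "patchy" operation (valid convolution with stride 1; the 5x5 image, 3x3 filter, and filter weights are illustrative):

import numpy as np

def conv2d(image, kernel):
    # Slide the kernel over the image, element-wise multiply, and add the outputs.
    H, W = image.shape
    kh, kw = kernel.shape
    out = np.zeros((H - kh + 1, W - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            patch = image[i:i + kh, j:j + kw]
            out[i, j] = np.sum(patch * kernel)   # element-wise multiply, then sum
    return out

image = np.random.rand(5, 5)           # 5x5 input image
kernel = np.ones((3, 3))               # 3x3 filter (illustrative weights)
feature_map = conv2d(image, kernel)    # 3x3 feature map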
Feature Extraction and Convolution
A Case Study
6.S191 Introduction to Deep Learning
introtodeeplearning.com
1/29/19
X or X?
Image is represented as matrix of pixel values… and computers are literal!
We want to be able to classify an X as an X even if it’s shifted, shrunk, rotated, deformed.
[4]
6.S191 Introduction to Deep Learning
introtodeeplearning.com
1/29/19
Features of X
[4]
6.S191 Introduction to Deep Learning
introtodeeplearning.com
1/29/19
Filters to Detect X Features
filters
[4]
6.S191 Introduction to Deep Learning
introtodeeplearning.com
1/29/19
The Convolution Operation
element-wise multiply, then add the outputs (e.g., 1 × 1 = 1 for each matching pixel)
For a 3×3 filter placed on a perfectly matching patch, the sum of the element-wise products = 9.
[4]
6.S191 Introduction to Deep Learning
introtodeeplearning.com
1/29/19
The Convolution Operation
Suppose we want to compute the convolution of a 5x5 image and a 3x3 filter:
We slide the 3x3 filter over the input image, element-wise multiply, and add the outputs…
image
filter
[5]
6.S191 Introduction to Deep Learning
introtodeeplearning.com
1/29/19
The Convolution Operation
filter feature map
We slide the 3x3 filter over the input image, element-wise multiply, and add the outputs:
[5]
6.S191 Introduction to Deep Learning
introtodeeplearning.com
1/29/19
Producing Feature Maps
Original Sharpen Edge Detect “Strong” Edge
Detect
6.S191 Introduction to Deep Learning
introtodeeplearning.com
1/29/19
Feature Extraction with Convolution
1) Apply a set of weights – a filter – to extract local features
2) Use multiple filters to extract different features
3) Spatially share parameters of each filter
Convolutional Neural Networks (CNNs)
6.S191 Introduction to Deep Learning
introtodeeplearning.com
1/29/19
CNNs for Classification
1. Convolution:Apply filters with learned weights to generate feature maps.
2. Non-linearity: Often ReLU.
3. Pooling: Downsampling operation on each feature map.
Train model with image data.
Learn weights of filters in convolutional layers.
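A hedged tf.keras sketch of that convolution → ReLU → pooling → fully-connected pattern (the layer sizes, input shape, and 10-class output are illustrative assumptions):

import tensorflow as tf
from tensorflow.keras import layers

model = tf.keras.Sequential([
    layers.Conv2D(32, (3, 3), activation='relu', input_shape=(28, 28, 1)),  # convolution + ReLU
    layers.MaxPooling2D((2, 2)),                                            # pooling (downsampling)
    layers.Conv2D(64, (3, 3), activation='relu'),
    layers.MaxPooling2D((2, 2)),
    layers.Flatten(),
    layers.Dense(10, activation='softmax'),   # class probabilities
])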
6.S191 Introduction to Deep Learning
introtodeeplearning.com
1/29/19
Convolutional Layers: Local Connectivity
For a neuron in hidden layer:
- Take inputs from patch
- Compute weighted sum
- Apply bias
6.S191 Introduction to Deep Learning
introtodeeplearning.com
1/29/19
Convolutional Layers: Local Connectivity
For a neuron in hidden layer:
- Take inputs from patch
- Compute weighted sum
- Apply bias
4x4 filter: matrix
of weights w_{ij}
\sum_{i=1}^{4} \sum_{j=1}^{4} w_{ij} \, x_{i+p, \, j+q} + b
for neuron (p,q) in hidden layer
1) applying a window of weights
2) computing linear combinations
3) activating with non-linear function
6.S191 Introduction to Deep Learning
introtodeeplearning.com
1/29/19
CNNs: Spatial Arrangement of OutputVolume
depth
width
height
Layer Dimensions:
h × w × d
where h and w are spatial dimensions
d (depth) = number of filters
Receptive Field:
Locations in input image that
a node is path connected to
Stride:
Filter step size
[3]
6.S191 Introduction to Deep Learning
introtodeeplearning.com
1/29/19
Introducing Non-Linearity
g(z) = \max(0, z)
Rectified Linear Unit (ReLU)
- Apply after every convolution operation (i.e., after
convolutional layers)
- ReLU: pixel-by-pixel operation that replaces all negative
values by zero. Non-linear operation!
[5]
6.S191 Introduction to Deep Learning
introtodeeplearning.com
1/29/19
Pooling
How else can we downsample and preserve spatial invariance?
1) Reduced dimensionality
2) Spatial invariance
[3]
6.S191 Introduction to Deep Learning
introtodeeplearning.com
1/29/19
Representation Learning in Deep CNNs
Mid level features
Eyes, ears, nose
Low level features
Edges, dark spots
High level features
Facial structure
Conv Layer 1 Conv Layer 2 Conv Layer 3
[3]
6.S191 Introduction to Deep Learning
introtodeeplearning.com
1/29/19
CNNs for Classification: Feature Learning
1. Learn features in input image through convolution
2. Introduce non-linearity through activation function (real-world data is non-linear!)
3. Reduce dimensionality and preserve spatial invariance with pooling
6.S191 Introduction to Deep Learning
introtodeeplearning.com
1/29/19
CNNs for Classification: Class Probabilities
- CONV and POOL layers output high-level features of input
- Fully connected layer uses these features for classifying input image
- Express output as probability of image belonging to a particular class
softmax(z)_i = \frac{e^{z_i}}{\sum_j e^{z_j}}
6.S191 Introduction to Deep Learning
introtodeeplearning.com
1/29/19
CNNs:Training with Backpropagation
Learn weights for convolutional filters and fully connected layers
J(W) = -\sum_i y^{(i)} \log \hat{y}^{(i)}
Backpropagation: cross-entropy loss
CNNs for Classification: ImageNet
6.S191 Introduction to Deep Learning
introtodeeplearning.com
1/29/19
ImageNet Dataset
Dataset of over 14 million images across 21,841 categories
1409 pictures of bananas.
“Elongated crescent-shaped yellow fruit with soft sweet flesh”
[6,7]
6.S191 Introduction to Deep Learning
introtodeeplearning.com
1/29/19
ImageNet Challenge
Classification task: produce a list of object categories present in image. 1000 categories.
“Top 5 error”: rate at which the model does not output correct label in top 5 predictions
Other tasks include:
single-object localization, object detection from video/image, scene classification, scene parsing
[6,7]
6.S191 Introduction to Deep Learning
introtodeeplearning.com
1/29/19
ImageNet Challenge: Classification Task
Top-5 classification error (%) by year: 2010: 28.2, 2011: 25.8, 2012: 16.4, 2013: 11.7, 2014: 6.7, 2015: 3.57; human: 5.1.
2012: AlexNet. First CNN to win.
- 8 layers, 61 million parameters
2013: ZFNet
- 8 layers, more filters
2014: VGG
- 19 layers
2014: GoogLeNet
- "Inception" modules
- 22 layers, 5 million parameters
2015: ResNet
- 152 layers
[6,7]
6.S191 Introduction to Deep Learning
introtodeeplearning.com
1/29/19
ImageNet Challenge: Classification Task
Top-5 classification error (%): 2010: 28.2, 2011: 25.8, 2012: 16.4, 2013: 11.7, 2014: 7.3 (VGG) and 6.7 (GoogLeNet), 2015: 3.57; human: 5.1.
Number of layers grew from 8 (AlexNet, 2012) to 152 (ResNet, 2015).
[6,7]
An Architecture for Many Applications
6.S191 Introduction to Deep Learning
introtodeeplearning.com
1/29/19
An Architecture for Many Applications
Object detection with R-CNNs
Segmentation with fully convolutional networks
Image captioning with RNNs
6.S191 Introduction to Deep Learning
introtodeeplearning.com
1/29/19
Beyond Classification
Object Detection
CAT, DOG, DUCK
Semantic Segmentation
CAT
Image Captioning
The cat is in the grass.
6.S191 Introduction to Deep Learning
introtodeeplearning.com
1/29/19
Semantic Segmentation: FCNs
FCN: Fully Convolutional Network.
Network designed with all convolutional layers,
with downsampling and upsampling operations
[3,8,9]
6.S191 Introduction to Deep Learning
introtodeeplearning.com
1/29/19
Driving Scene Segmentation
[10]
6.S191 Introduction to Deep Learning
introtodeeplearning.com
1/29/19
Driving Scene Segmentation
[11, 12]
6.S191 Introduction to Deep Learning
introtodeeplearning.com
1/29/19
Object Detection with R-CNNs
R-CNN: Find regions that we think have objects. Use CNN to classify.
[13]
6.S191 Introduction to Deep Learning
introtodeeplearning.com
1/29/19
Image Captioning using RNNs
[14,15]
6.S191 Introduction to Deep Learning
introtodeeplearning.com
1/29/19
Image Captioning using RNNs
[14,15]
Deep Learning for Computer Vision:
Impact and Summary
6.S191 Introduction to Deep Learning
introtodeeplearning.com
1/29/19
Data, Data, Data
MNIST: handwritten digits
places: natural scenes
ImageNet:
22K categories. 14M images.
CIFAR-10
Airplane
Automobile
Bird
Cat
Deer
Dog
Frog
Horse
Ship
Truck
6.S191 Introduction to Deep Learning
introtodeeplearning.com
1/29/19
Deep Learning for Computer Vision: Impact
6.S191 Introduction to Deep Learning
introtodeeplearning.com
1/29/19
Impact: Face Detection 6.S191 Lab!
6.S191 Introduction to Deep Learning
introtodeeplearning.com
1/29/19
Impact: Self-Driving Cars
[16]
6.S191 Introduction to Deep Learning
introtodeeplearning.com
1/29/19
Impact: Healthcare
[17]
Identifying facial phenotypes of genetic disorders using deep learning
Gurovich et al., Nature Med. 2019
6.S191 Introduction to Deep Learning
introtodeeplearning.com
1/29/19
Deep Learning for Computer Vision: Summary
Foundations: why computer vision?; representing images; convolutions for feature extraction
CNNs: CNN architecture; application to classification; ImageNet
Applications: segmentation, object detection, image captioning; visualization
6.S191 Introduction to Deep Learning
introtodeeplearning.com
1/29/19
Which face is fake?
[1]
6.S191 Introduction to Deep Learning
introtodeeplearning.com
1/29/19
Supervised vs unsupervised learning
Supervised Learning
Data: (x, y) — x is data, y is label
Goal: learn a function to map x → y
Examples: classification, regression, object detection, semantic segmentation, etc.
Unsupervised Learning
Data: x — no labels!
Goal: learn some hidden or underlying structure of the data
Examples: clustering, feature or dimensionality reduction, etc.
6.S191 Introduction to Deep Learning
introtodeeplearning.com
1/29/19
Generative modeling
Goal: Take as input training samples from some distribution
and learn a model that represents that distribution
Density Estimation Sample Generation
Input samples Generated samples
Training data ~ P_data(x)    Generated samples ~ P_model(x)
How can we learn P_model(x) similar to P_data(x)?
6.S191 Introduction to Deep Learning
introtodeeplearning.com
1/29/19
Why generative models? Debiasing
vs
Capable of uncovering underlying latent variables in a dataset
Homogeneous skin color, pose Diverse skin color, pose, illumination
How can we use latent distributions to create fair and representative datasets?
[2]
6.S191 Introduction to Deep Learning
introtodeeplearning.com
1/29/19
Why generative models? Outlier detection
95% of Driving Data:
(1) sunny, (2) highway, (3) straight road
Detect outliers to avoid unpredictable behavior when training
Edge Cases Harsh Weather Pedestrians
• Problem: How can we detect when
we encounter something new or rare?
• Strategy: Leverage generative
models, detect outliers in the
distribution
• Use outliers during training to
improve even more!
[3]
Latent variable models
Autoencoders and Variational
Autoencoders (VAEs)
Generative Adversarial
Networks (GANs)
6.S191 Introduction to Deep Learning
introtodeeplearning.com
1/29/19
What is a latent variable?
Myth of the Cave
[4]
6.S191 Introduction to Deep Learning
introtodeeplearning.com
1/29/19
What is a latent variable?
Can we learn the true explanatory factors, e.g. latent variables, from only observed data?
Autoencoders
6.S191 Introduction to Deep Learning
introtodeeplearning.com
1/29/19
Autoencoders: background
Unsupervised approach for learning a lower-dimensional feature
representation from unlabeled training data
x → z
"Encoder" learns a mapping from the data, x, to a low-dimensional latent space, z.
Why do we care about a low-dimensional z?
6.S191 Introduction to Deep Learning
introtodeeplearning.com
1/29/19
Autoencoders: background
How can we learn this latent space?
Train the model to use these features to reconstruct the original data
x → z → x̂
"Decoder" learns a mapping back from the latent space, z, to a reconstructed observation, x̂.
6.S191 Introduction to Deep Learning
introtodeeplearning.com
1/29/19
Autoencoders: background
How can we learn this latent space?
Train the model to use these features to reconstruct the original data
x → z → x̂
\mathcal{L}(x, \hat{x}) = \| x - \hat{x} \|^2    — the loss function doesn't use any labels!
6.S191 Introduction to Deep Learning
introtodeeplearning.com
1/29/19
Dimensionality of latent space → reconstruction quality
2D latent space vs. 5D latent space vs. ground truth
Autoencoding is a form of compression!
Smaller latent space will force a larger training bottleneck
[5]
6.S191 Introduction to Deep Learning
introtodeeplearning.com
1/29/19
Autoencoders for representation learning
Bottleneck hidden layer forces network to learn a compressed
latent representation
Reconstruction loss forces the latent representation to capture
(or encode) as much “information” about the data as possible
Autoencoding = Automatically encoding data
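A hedged tf.keras sketch of such an autoencoder (the 784-dimensional input, 256-unit hidden layers, and 2-dimensional latent space are illustrative assumptions):

import tensorflow as tf
from tensorflow.keras import layers

latent_dim = 2
encoder = tf.keras.Sequential([
    layers.Dense(256, activation='relu'),
    layers.Dense(latent_dim),            # z: compressed latent representation
])
decoder = tf.keras.Sequential([
    layers.Dense(256, activation='relu'),
    layers.Dense(784),                   # x_hat: reconstruction of the input
])

x = tf.random.uniform((8, 784))          # a fake batch of flattened images
x_hat = decoder(encoder(x))
loss = tf.reduce_mean(tf.square(x - x_hat))   # reconstruction loss ||x - x_hat||^2 (no labels)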
Variational Autoencoders (VAEs)
6.S191 Introduction to Deep Learning
introtodeeplearning.com
1/29/19
VAEs: key difference with traditional autoencoder
x → z → x̂
6.S191 Introduction to Deep Learning
introtodeeplearning.com
1/29/19
VAEs: key difference with traditional autoencoder
x → (μ, σ) → z → x̂
[6]
6.S191 Introduction to Deep Learning
introtodeeplearning.com
1/29/19
VAEs: key difference with traditional autoencoder
x → encoder → μ (mean vector), σ (standard deviation vector) → sample z → decoder → x̂
Variational autoencoders are a probabilistic twist on autoencoders!
Sample from the mean and standard dev. to compute latent sample
[6]
6.S191 Introduction to Deep Learning
introtodeeplearning.com
1/29/19
VAE optimization
x → z → x̂
Encoder computes: q_φ(z|x)    Decoder computes: p_θ(x|z)
[6]
6.S191 Introduction to Deep Learning
introtodeeplearning.com
1/29/19
VAE optimization
x → z → x̂
Encoder computes: q_φ(z|x)    Decoder computes: p_θ(x|z)
\mathcal{L}(\phi, \theta) = (reconstruction loss) + (regularization term)
[6]
6.S191 Introduction to Deep Learning
introtodeeplearning.com
1/29/19
VAE optimization
x → z → x̂
Encoder computes: q_φ(z|x)    Decoder computes: p_θ(x|z)
\mathcal{L}(\phi, \theta, x) = (reconstruction loss) + (regularization term)
[6]
6.S191 Introduction to Deep Learning
introtodeeplearning.com
1/29/19
VAE optimization
x → z → x̂
Encoder computes: q_φ(z|x)    Decoder computes: p_θ(x|z)
\mathcal{L}(\phi, \theta, x) = (reconstruction loss) + (regularization term)
Reconstruction loss, e.g., \| x - \hat{x} \|^2
[6]
6.S191 Introduction to Deep Learning
introtodeeplearning.com
1/29/19
VAE optimization
x → z → x̂
Encoder computes: q_φ(z|x)    Decoder computes: p_θ(x|z)
\mathcal{L}(\phi, \theta, x) = (reconstruction loss) + (regularization term)
Regularization term: D\big( q_\phi(z|x) \,\|\, p(z) \big) — inferred latent distribution vs. fixed prior on the latent distribution
[6]
6.S191 Introduction to Deep Learning
introtodeeplearning.com
1/29/19
Priors on the latent distribution
D\big( q_\phi(z|x) \,\|\, p(z) \big): inferred latent distribution vs. fixed prior on the latent distribution
Common choice of prior:  p(z) = \mathcal{N}(\mu = 0, \sigma^2 = 1)
• Encourages the network to distribute encodings evenly around the center of the latent space
• Penalizes the network when it tries to "cheat" by clustering points in specific regions (i.e., memorizing the data)
[7]
6.S191 Introduction to Deep Learning
introtodeeplearning.com
1/29/19
Priors on the latent distribution
D\big( q_\phi(z|x) \,\|\, p(z) \big)
Common choice of prior:  p(z) = \mathcal{N}(\mu = 0, \sigma^2 = 1)
• Encourages the network to distribute encodings evenly around the center of the latent space
• Penalizes the network when it tries to "cheat" by clustering points in specific regions (i.e., memorizing the data)
KL divergence between the two distributions:
D_{KL} = -\frac{1}{2} \sum_{j=0}^{k-1} \big( \sigma_j + \mu_j^2 - 1 - \log \sigma_j \big)
[7]
6.S191 Introduction to Deep Learning
introtodeeplearning.com
1/29/19
VAEs computation graph
x → z → x̂
Encoder computes: q_φ(z|x)    Decoder computes: p_θ(x|z)
\mathcal{L}(\phi, \theta, x) = (reconstruction loss) + (regularization term)
[6]
6.S191 Introduction to Deep Learning
introtodeeplearning.com
1/29/19
VAEs computation graph
x → z → x̂
Encoder computes: q_φ(z|x)    Decoder computes: p_θ(x|z)
\mathcal{L}(\phi, \theta, x) = (reconstruction loss) + (regularization term)
Problem: We cannot backpropagate gradients through sampling layers!
[6]
6.S191 Introduction to Deep Learning
introtodeeplearning.com
1/29/19
Reparametrizing the sampling layer
z ~ \mathcal{N}(\mu, \sigma^2)
Key Idea: consider the sampled latent vector z as a sum of
• a fixed μ vector, and
• a fixed σ vector, scaled by random constants drawn from the prior distribution
⇒ z = μ + σ ⊙ ε,  where ε ~ \mathcal{N}(0, 1)
[6]
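A hedged TensorFlow sketch of the reparameterization (mu and log_sigma would normally come from the encoder; here they are zero-valued placeholders):

import tensorflow as tf

mu = tf.zeros((8, 2))                      # mean vector from the encoder (placeholder)
log_sigma = tf.zeros((8, 2))               # log std-dev vector from the encoder (placeholder)
eps = tf.random.normal(tf.shape(mu))       # epsilon ~ N(0, 1), the only stochastic part
z = mu + tf.exp(log_sigma) * eps           # z = mu + sigma ⊙ eps; differentiable w.r.t. mu and sigma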
6.S191 Introduction to Deep Learning
introtodeeplearning.com
1/29/19
Reparametrizing the sampling layer
Original form: z ∼ q_φ(z|x) — z is a stochastic node (with deterministic parents φ and x), so backpropagation cannot flow through the sampling step.
[6]
6.S191 Introduction to Deep Learning
introtodeeplearning.com
1/29/19
Reparametrizing the sampling layer
Original form: z ∼ q_φ(z|x) — z is a stochastic node, blocking backpropagation.
Reparametrized form: z = g(μ, σ, ε) with ε ~ \mathcal{N}(0, 1) — z is now a deterministic function of μ, σ, and the noise ε, so the gradients ∂f/∂z, ∂z/∂μ, ∂z/∂σ can flow back through the network while the stochasticity is confined to ε.
[6]
6.S191 Introduction to Deep Learning
introtodeeplearning.com
1/29/19
VAEs: Latent perturbation
Slowly increase or decrease a single latent variable
Keep all other variables fixed
Head pose
Different dimensions of z encode different interpretable latent features
[8]
6.S191 Introduction to Deep Learning
introtodeeplearning.com
1/29/19
VAEs: Latent perturbation
Head pose
Smile
Ideally, we want latent variables that
are uncorrelated with each other
Enforce diagonal prior on the latent
variables to encourage
independence
Disentanglement
[8]
6.S191 Introduction to Deep Learning
introtodeeplearning.com
1/29/19
VAEs: Latent perturbation
Google BeatBlender
[9]
6.S191 Introduction to Deep Learning
introtodeeplearning.com
1/29/19
VAEs: Latent perturbation
[10]
6.S191 Introduction to Deep Learning
introtodeeplearning.com
1/29/19
VAE summary
1. Compress representation of world to something we can use to learn
x → z → x̂
6.S191 Introduction to Deep Learning
introtodeeplearning.com
1/29/19
VAE summary
1. Compress representation of world to something we can use to learn
2. Reconstruction allows for unsupervised learning (no labels!)
x → z → x̂
6.S191 Introduction to Deep Learning
introtodeeplearning.com
1/29/19
VAE summary
1. Compress representation of world to something we can use to learn
2. Reconstruction allows for unsupervised learning (no labels!)
3. Reparameterization trick to train end-to-end
x → z → x̂
6.S191 Introduction to Deep Learning
introtodeeplearning.com
1/29/19
VAE summary
1. Compress representation of world to something we can use to learn
2. Reconstruction allows for unsupervised learning (no labels!)
3. Reparameterization trick to train end-to-end
4. Interpret hidden latent variables using perturbation
x → z → x̂
6.S191 Introduction to Deep Learning
introtodeeplearning.com
1/29/19
VAE summary
1. Compress representation of world to something we can use to learn
2. Reconstruction allows for unsupervised learning (no labels!)
3. Reparameterization trick to train end-to-end
4. Interpret hidden latent variables using perturbation
5. Generating new examples
x → z → x̂
Generative Adversarial Networks (GANs)
6.S191 Introduction to Deep Learning
introtodeeplearning.com
1/29/19
What if we just want to sample?
Idea: don’t explicitly model density, and instead just sample to generate new instances.
Problem: want to sample from complex distribution – can’t do this directly!
Solution: sample from something simple (noise), learn a
transformation to the training distribution.
noise z → Generator network G → x̂ ("fake" sample from the training distribution)
[11]
6.S191 Introduction to Deep Learning
introtodeeplearning.com
1/29/19
Generative Adversarial Networks (GANs)
Generative Adversarial Networks (GANs) are a way to make a generative
model by having two neural networks compete with each other.
The discriminator tries to identify real
data from fakes created by the generator.
The generator turns noise into an imitation
of the data to try to trick the discriminator.
noise z → Generator G → X_fake;  X_real and X_fake → Discriminator D
[11]
6.S191 Introduction to Deep Learning
introtodeeplearning.com
1/29/19
Intuition behind GANs
Generator
Generator starts from noise to try to create an imitation of the data.
Fake data
6.S191 Introduction to Deep Learning
introtodeeplearning.com
1/29/19
Intuition behind GANs
Discriminator Generator
Discriminator looks at both real data and fake data created by the generator.
Fake data
6.S191 Introduction to Deep Learning
introtodeeplearning.com
1/29/19
Intuition behind GANs
Discriminator Generator
Discriminator looks at both real data and fake data created by the generator.
Real data Fake data
6.S191 Introduction to Deep Learning
introtodeeplearning.com
1/29/19
Intuition behind GANs
Discriminator Generator
Discriminator tries to predict what’s real and what’s fake.
Real data Fake data
6.S191 Introduction to Deep Learning
introtodeeplearning.com
1/29/19
Intuition behind GANs
Discriminator Generator
Generator tries to improve its imitation of the data.
Real data Fake data
6.S191 Introduction to Deep Learning
introtodeeplearning.com
1/29/19
Intuition behind GANs
Discriminator
Real data Fake data
Generator
Discriminator tries to identify real data from fakes created by the generator.
Generator tries to create imitations of data to trick the discriminator.
6.S191 Introduction to Deep Learning
introtodeeplearning.com
1/29/19
Training GANs
Discriminator tries to identify real data from fakes created by the generator.
Generator tries to create imitations of data to trick the discriminator.
Train GAN jointly via a minimax game:
\min_{\theta_g} \max_{\theta_d} \; \mathbb{E}_{x \sim p_{data}} \big[ \log D_{\theta_d}(x) \big] + \mathbb{E}_{z \sim p(z)} \big[ \log\big( 1 - D_{\theta_d}( G_{\theta_g}(z) ) \big) \big]
Discriminator wants to maximize the objective s.t. D(x) is close to 1 and D(G(z)) is close to 0.
Generator wants to minimize the objective s.t. D(G(z)) is close to 1.
[11]
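A hedged sketch of one step of this minimax game in TensorFlow 2 (the tiny generator/discriminator architectures and the "real" data batch are placeholders I'm assuming for illustration):

import tensorflow as tf
from tensorflow.keras import layers

generator = tf.keras.Sequential([layers.Dense(16, activation='relu'), layers.Dense(2)])
discriminator = tf.keras.Sequential([layers.Dense(16, activation='relu'), layers.Dense(1)])
g_opt = tf.keras.optimizers.Adam(1e-4)
d_opt = tf.keras.optimizers.Adam(1e-4)
bce = tf.keras.losses.BinaryCrossentropy(from_logits=True)

x_real = tf.random.normal((32, 2))          # placeholder "real" data batch
z = tf.random.normal((32, 8))               # noise input
with tf.GradientTape() as d_tape, tf.GradientTape() as g_tape:
    x_fake = generator(z)
    d_real, d_fake = discriminator(x_real), discriminator(x_fake)
    d_loss = bce(tf.ones_like(d_real), d_real) + bce(tf.zeros_like(d_fake), d_fake)  # D: real -> 1, fake -> 0
    g_loss = bce(tf.ones_like(d_fake), d_fake)                                       # G: wants D(G(z)) -> 1
d_opt.apply_gradients(zip(d_tape.gradient(d_loss, discriminator.trainable_variables),
                          discriminator.trainable_variables))
g_opt.apply_gradients(zip(g_tape.gradient(g_loss, generator.trainable_variables),
                          generator.trainable_variables))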
6.S191 Introduction to Deep Learning
introtodeeplearning.com
1/29/19
Why GANs?
A. Courville, 6S191 2018.
6.S191 Introduction to Deep Learning
introtodeeplearning.com
1/29/19
Why GANs?
A. Courville, 6S191 2018.
6.S191 Introduction to Deep Learning
introtodeeplearning.com
1/29/19
Generating new data with GANs
After training, use generator network to create new data that’s never been seen before.
noise z → Generator G → X_fake (new data samples)
GANs: Recent Advances
6.S191 Introduction to Deep Learning
introtodeeplearning.com
1/29/19
Progressive growing of GANs (NVIDIA)
Karras et al., ICLR 2018.
[12]
6.S191 Introduction to Deep Learning
introtodeeplearning.com
1/29/19
Progressive growing of GANs: results
Karras et al., ICLR 2018.
[12]
6.S191 Introduction to Deep Learning
introtodeeplearning.com
1/29/19
Style-based generator: results
Karras et al., arXiv 2018.
6.S191 Introduction to Deep Learning
introtodeeplearning.com
1/29/19
Style-based transfer: results
Karras et al., arXiv 2018.
6.S191 Introduction to Deep Learning
introtodeeplearning.com
1/29/19
CycleGAN: domain transformation
CycleGAN learns transformations across domains with unpaired data.
Zhu et al., ICCV 2017.
6.S191 Introduction to Deep Learning
introtodeeplearning.com
1/29/19
Deep Generative Modeling: Summary
Autoencoders and Variational Autoencoders (VAEs): learn a lower-dimensional latent space and sample from it to generate input reconstructions
Generative Adversarial Networks (GANs): competing generator and discriminator networks
6.S191 Introduction to Deep Learning
introtodeeplearning.com
1/30/19
T-shirts! Today!
6.S191 Introduction to Deep Learning
introtodeeplearning.com
1/30/19
Course Schedule
6.S191 Introduction to Deep Learning
introtodeeplearning.com
1/30/19
Final Class Project
• Judged by a panel of industry judges
• Top winners are awarded:
3x NVIDIA RTX 2080Ti
MSRP: $4000
4x Google Home
MSRP: $400
Option 1: Proposal Presentation
• Present a novel deep learning
research idea or application
• Groups of 1 welcome
• Listeners welcome
• Groups of 2 to 4 to be eligible
for prizes, incl. 1 for-credit student
• 3 minutes
• Proposal instructions:
goo.gl/JGJ5E7
6.S191 Introduction to Deep Learning
introtodeeplearning.com
1/30/19
Final Class Project
Option 1: Proposal Presentation
• Present a novel deep learning
research idea or application
• Groups of 1 welcome
• Listeners welcome
• Groups of 2 to 4 to be eligible
for prizes, incl. 1 for-credit student
• 3 minutes
• Proposal instructions:
goo.gl/JGJ5E7
Proposal Logistics
• >= 1 for-credit student to be eligible
for prizes
• Prepare slides on Google Slides
• Group submit by today 10pm:
goo.gl/rV6rLK
• In class project work: Thu, Jan 31
• Slide submit by Thu 11:59 pm:
goo.gl/7smL8w
• Presentations on Friday, Feb 1
6.S191 Introduction to Deep Learning
introtodeeplearning.com
1/30/19
Final Class Project
Option 2:Write a 1-page review
of a deep learning paper
• Grade is based on clarity of
writing and technical
communication of main ideas
• Due Friday 1:00pm (before
lecture)
Option 1: Proposal Presentation
• Present a novel deep learning
research idea or application
• Groups of 1 welcome
• Listeners welcome
• Groups of 2 to 4 to be eligible
for prizes, incl. 1 for-credit student
• 3 minutes
• Proposal instructions:
goo.gl/JGJ5E7
6.S191 Introduction to Deep Learning
introtodeeplearning.com
1/30/19
Thursday: Visualization in ML +
Biologically Inspired Learning
Fernanda Viegas,
Co-Director Google PAIR
DataVisualization for
Machine Learning
Dmitry Krotov,
MIT-IBM Watson AI Lab
Biologically Inspired Deep
Learning
Final project work
Ask us questions!
Open office hours!
Work with group members!
6.S191 Introduction to Deep Learning
introtodeeplearning.com
1/30/19
Friday: Learning and Perception +
Project Proposals + Awards + Pizza
Jan Kautz,
VP of Research
Learning and Perception
Project Proposals!
Judging and Awards!
Pizza Celebration!
So far in 6.S191…
6.S191 Introduction to Deep Learning
introtodeeplearning.com
1/30/19
The Rise of Deep Learning
6.S191 Introduction to Deep Learning
introtodeeplearning.com
1/30/19
So far in 6.S191…
Data
• Signals
• Images
• Sensors
…
6.S191 Introduction to Deep Learning
introtodeeplearning.com
1/30/19
So far in 6.S191…
Data
• Signals
• Images
• Sensors
…
Decision
• Prediction
• Detection
• Action
…
6.S191 Introduction to Deep Learning
introtodeeplearning.com
1/30/19
So far in 6.S191…
Data
• Signals
• Images
• Sensors
…
Decision
• Prediction
• Detection
• Action
…
6.S191 Introduction to Deep Learning
introtodeeplearning.com
1/30/19
Power of Neural Nets
Universal Approximation Theorem
A feedforward network with a single hidden layer is sufficient to approximate, to
an arbitrary precision, any continuous function.
Hornik et al. Neural Networks. (1989)
6.S191 Introduction to Deep Learning
introtodeeplearning.com
1/30/19
Power of Neural Nets
Caveats:
The number of
hidden units may
be infeasibly large
The resulting
model may not
generalize
Hornik et al. Neural Networks. (1989)
Universal Approximation Theorem
A feedforward network with a single hidden layer is sufficient to approximate, to
an arbitrary precision, any continuous function.
6.S191 Introduction to Deep Learning
introtodeeplearning.com
1/30/19
Artificial Intelligence “Hype”: Historical Perspective
Limitations
6.S191 Introduction to Deep Learning
introtodeeplearning.com
1/30/19
Rethinking Generalization
dog banana dog tree
“Understanding Deep Neural Networks Requires Rethinking Generalization”
Zhang et al. ICLR. (2017)
6.S191 Introduction to Deep Learning
introtodeeplearning.com
1/30/19
Rethinking Generalization
banana dog tree dog
dog banana dog tree
“Understanding Deep Neural Networks Requires Rethinking Generalization”
Zhang et al. ICLR. (2017)
6.S191 Introduction to Deep Learning
introtodeeplearning.com
1/30/19
Capacity of Deep Neural Networks
(Figure: training and testing accuracy, from 0% to 100%, as label randomization increases from the original labels to completely random labels.)
Zhang et al. ICLR. (2017)
6.S191 Introduction to Deep Learning
introtodeeplearning.com
1/30/19
Capacity of Deep Neural Networks
(Figure: training and testing accuracy, from 0% to 100%, as label randomization increases from the original labels to completely random labels.)
Modern deep networks can
perfectly fit to random data
Zhang et al. ICLR. (2017)
6.S191 Introduction to Deep Learning
introtodeeplearning.com
1/30/19
Neural Networks as Function Approximators
Neural networks are excellent function approximators
6.S191 Introduction to Deep Learning
introtodeeplearning.com
1/30/19
Neural Networks as Function Approximators
Neural networks are excellent function approximators
?
6.S191 Introduction to Deep Learning
introtodeeplearning.com
1/30/19
Neural Networks as Function Approximators
Neural networks are excellent function approximators
…when they have training data
How do we know when our
network doesn’t know?
6.S191 Introduction to Deep Learning
introtodeeplearning.com
1/30/19
Adversarial Attacks on Neural Networks
Despois. “Adversarial examples and their implications” (2017).
6.S191 Introduction to Deep Learning
introtodeeplearning.com
1/30/19
Adversarial Attacks on Neural Networks
Remember:
We train our networks with gradient descent
$W \leftarrow W - \eta \, \dfrac{\partial J(W, x, y)}{\partial W}$
“How does a small change in weights decrease our loss”
Fix your image $x$, and true label $y$
6.S191 Introduction to Deep Learning
introtodeeplearning.com
1/30/19
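As a sketch, one such gradient-descent step might look like the following, written in TensorFlow 2.x GradientTape style rather than the course's TF 1.x snippets; model, x, y, and the learning rate are assumed to be defined:

import tensorflow as tf

# One gradient-descent step on the weights: W <- W - lr * dJ(W, x, y)/dW,
# with the image x and true label y held fixed.
lr = 0.01
loss_fn = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)

with tf.GradientTape() as tape:
    logits = model(x, training=True)
    loss = loss_fn(y, logits)                               # J(W, x, y)

grads = tape.gradient(loss, model.trainable_variables)      # dJ/dW
for w, g in zip(model.trainable_variables, grads):
    w.assign_sub(lr * g)                                    # W <- W - lr * dJ/dW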
Adversarial Attacks on Neural Networks
Adversarial Image:
Modify image to increase error
$x \leftarrow x + \eta \, \dfrac{\partial J(W, x, y)}{\partial x}$
“How does a small change in the input increase our loss”
Goodfellow et al. NIPS (2014)
6.S191 Introduction to Deep Learning
introtodeeplearning.com
1/30/19
Adversarial Attacks on Neural Networks
Adversarial Image:
Modify image to increase error
$x \leftarrow x + \eta \, \dfrac{\partial J(W, x, y)}{\partial x}$
“How does a small change in the input increase our loss”
Fix your weights $W$, and true label $y$
6.S191 Introduction to Deep Learning
introtodeeplearning.com
1/30/19
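A hedged sketch of this idea, using a fast-gradient-sign-style step in the spirit of Goodfellow et al.; model, image, label, and the step size eps are assumed to be defined, and the sign of the gradient is used rather than the raw gradient from the update above:

import tensorflow as tf

# Adversarial perturbation: x <- x + eps * sign(dJ(W, x, y)/dx),
# with the weights W and the true label y held fixed.
loss_fn = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)
eps = 0.01                                        # illustrative step size

image = tf.convert_to_tensor(image)
with tf.GradientTape() as tape:
    tape.watch(image)                             # track gradients w.r.t. the input, not the weights
    logits = model(image[tf.newaxis, ...], training=False)
    loss = loss_fn([label], logits)               # J(W, x, y)

grad = tape.gradient(loss, image)                 # dJ/dx
adv_image = image + eps * tf.sign(grad)           # small input change that increases the loss
adv_image = tf.clip_by_value(adv_image, 0.0, 1.0) # keep pixel values valid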
Synthesizing Robust Adversarial Examples
Athalye et al. ICML. (2018)
6.S191 Introduction to Deep Learning
introtodeeplearning.com
1/30/19
Neural Network Limitations…
• Very data hungry (e.g., often millions of examples)
• Computationally intensive to train and deploy (requires GPUs to be tractable)
• Easily fooled by adversarial examples
• Can be subject to algorithmic bias
• Poor at representing uncertainty (how do you know what the model knows?)
• Uninterpretable black boxes, difficult to trust
• Finicky to optimize: non-convex, choice of architecture, learning parameters
• Often require expert knowledge to design and fine-tune architectures
New Frontiers 1:
Bayesian Deep Learning
6.S191 Introduction to Deep Learning
introtodeeplearning.com
1/30/19
Why Care About Uncertainty?
OR
ℙ(cat)
ℙ(dog)
6.S191 Introduction to Deep Learning
introtodeeplearning.com
1/30/19
Why Care About Uncertainty?
ℙ cat = 0.2
ℙ dog = 0.8
Remember: ℙ cat + ℙ dog = 1
6.S191 Introduction to Deep Learning
introtodeeplearning.com
1/30/19
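For instance, a two-class softmax output always sums to 1, even for inputs the model has never seen, which is why the raw output alone is not a measure of confidence; the logit values below are made up for illustration:

import tensorflow as tf

# Softmax turns arbitrary scores into probabilities that sum to 1.
logits = tf.constant([[-0.69, 0.69]])   # made-up scores for [cat, dog]
probs = tf.nn.softmax(logits)
print(probs.numpy())                     # ~[[0.2, 0.8]]; P(cat) + P(dog) = 1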
Bayesian Deep Learning for Uncertainty
The network tries to learn the output, $y$, directly from raw data, $X$:
find a mapping, $f$, parameterized by weights $W$, such that $\min_W \mathcal{L}(y, f(X; W))$.
Bayesian neural networks instead aim to learn a posterior over the weights, $\mathbb{P}(W \mid X, y)$:
$$\mathbb{P}(W \mid X, y) = \frac{\mathbb{P}(y \mid X, W)\,\mathbb{P}(W)}{\mathbb{P}(y \mid X)}$$
Computing this posterior exactly is intractable!
6.S191 Introduction to Deep Learning
introtodeeplearning.com
1/30/19
Elementwise Dropout for Uncertainty
Evaluate $T$ stochastic forward passes through the network, $\{W_t\}_{t=1}^{T}$
Dropout as a form of stochastic sampling: $\epsilon_{i,t} \sim \mathrm{Bernoulli}(p) \;\; \forall\, i \in W$
Unregularized kernel $W$ ⊙ Bernoulli dropout mask $\epsilon_{W,t}$ = stochastic sampled weights $W_t$
$$\mathbb{E}[y] \approx \frac{1}{T} \sum_{t=1}^{T} f(x; W_t)$$
$$\mathrm{Var}[y] \approx \frac{1}{T} \sum_{t=1}^{T} f(x; W_t)^2 - \mathbb{E}[y]^2$$
Amini, Soleimany, et al., NIPS Workshop on Bayesian Deep Learning, 2017.
Gal and Ghahramani, ICML, 2016.
6.S191 Introduction to Deep Learning
introtodeeplearning.com
1/30/19
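A minimal sketch of this Monte Carlo dropout procedure, assuming a tf.keras model that contains Dropout layers and a batch x; T = 20 is an illustrative choice:

import tensorflow as tf

# T stochastic forward passes with dropout left ON at test time:
# the sample mean approximates E[y] and the sample variance approximates Var[y].
T = 20
samples = tf.stack([model(x, training=True) for _ in range(T)], axis=0)  # training=True keeps dropout active

mean_prediction = tf.reduce_mean(samples, axis=0)        # ~ E[y]
uncertainty = tf.math.reduce_variance(samples, axis=0)   # ~ Var[y], per output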
Model Uncertainty Application
[Figure: input image, predicted depth, and model uncertainty for a depth-estimation network]
Kendall and Gal, NIPS, 2017.
6.S191 Introduction to Deep Learning
introtodeeplearning.com
1/30/19
Multi-Task Learning Using Uncertainty
Kendall, et al., CVPR, 2018.
6.S191 Introduction to Deep Learning
introtodeeplearning.com
1/30/19
New Frontiers II:
Learning to Learn
6.S191 Introduction to Deep Learning
introtodeeplearning.com
1/30/19
Motivation: Learning to Learn
Standard deep neural networks are optimized for a single task
Often require expert knowledge to build an architecture for a given task
As the complexity of models increases, there is a greater need for specialized engineers
Build a learning algorithm that learns which model to use to solve a given problem
AutoML
6.S191 Introduction to Deep Learning
introtodeeplearning.com
1/30/19
AutoML: Learning to Learn
Zoph and Le, ICLR 2017.
6.S191 Introduction to Deep Learning
introtodeeplearning.com
1/30/19
AutoML: Model Controller
At each step, the controller samples a brand-new network architecture
Zoph and Le, ICLR 2017.
6.S191 Introduction to Deep Learning
introtodeeplearning.com
1/30/19
AutoML: The Child Network
[Diagram: the child network sampled from the RNN controller is trained on the training data and produces predictions]
Compute its final accuracy on this dataset.
Update RNN controller based on the accuracy of the child network after training.
Zoph and Le, ICLR 2017.
6.S191 Introduction to Deep Learning
introtodeeplearning.com
1/30/19
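A toy, heavily simplified stand-in for this loop: random search over a tiny hand-made search space instead of an RNN controller, with MNIST and all other details chosen purely for illustration. This is not Zoph and Le's method, only the sample → train child → score pattern the slide describes:

import random
import tensorflow as tf

# Sample a child architecture, train it briefly, and use its validation
# accuracy as the "reward"; a real controller (an RNN trained with
# reinforcement learning) would be updated with this reward at each step.
(x_train, y_train), (x_val, y_val) = tf.keras.datasets.mnist.load_data()
x_train = x_train.reshape(-1, 784).astype("float32") / 255.0
x_val = x_val.reshape(-1, 784).astype("float32") / 255.0

search_space = {"units": [32, 64, 128], "layers": [1, 2], "activation": ["relu", "tanh"]}
history = []

for step in range(5):
    arch = {k: random.choice(v) for k, v in search_space.items()}    # "controller" samples a child
    layers = [tf.keras.layers.Dense(arch["units"], activation=arch["activation"], input_shape=(784,))]
    for _ in range(arch["layers"] - 1):
        layers.append(tf.keras.layers.Dense(arch["units"], activation=arch["activation"]))
    layers.append(tf.keras.layers.Dense(10, activation="softmax"))
    child = tf.keras.Sequential(layers)
    child.compile(optimizer="adam", loss="sparse_categorical_crossentropy", metrics=["accuracy"])
    child.fit(x_train, y_train, epochs=1, batch_size=128, verbose=0)

    reward = child.evaluate(x_val, y_val, verbose=0)[1]              # child accuracy = reward
    history.append((arch, reward))

best_arch, best_acc = max(history, key=lambda t: t[1])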
AutoML on the Cloud
Google Cloud.
• Design an AI algorithm that can build new models
capable of solving a task
• Reduces the need for experienced engineers to
design the networks
• Makes deep learning more accessible to the public
AutoML Spawns a Powerful Idea
Connection to
Artificial General Intelligence:
the ability to intelligently
reason about how we learn
  • 15. 6.S191 Introduction to Deep Learning introtodeeplearning.com 1/28/19 Importance of Activation Functions Linear Activation functions produce linear decisions no matter the network size Non-linearities allow us to approximate arbitrarily complex functions The purpose of activation functions is to introduce non-linearities into the network
  • 16. 6.S191 Introduction to Deep Learning introtodeeplearning.com 1/28/19 The Perceptron: Example 1 −2 3 Σ &' &( 1 ) * We have: +, = 1 and . = 3 − 2 ) * = / +, + 1 2 . = / 1 + &' &( 2 3 − 2 ) * = / (1 + 3 &' − 2 &( ) This is just a line in 2D!
  • 17. 6.S191 Introduction to Deep Learning introtodeeplearning.com 1/28/19 The Perceptron: Example 1 −2 3 Σ &' &( 1 ) * ) * = , (1 + 3 &' − 2 &( ) 1 + 3 & ' − 2 & ( = 0 &' & (
  • 18. 6.S191 Introduction to Deep Learning introtodeeplearning.com 1/28/19 The Perceptron: Example 1 −2 3 Σ &' &( 1 ) * ) * = , (1 + 3 &' − 2 &( ) Assume we have input: 0 = − 1 2 −1 2 ) * = , 1 + 3∗−1 − 2∗2 = , −6 ≈ 0.002 1 + 3 & ' − 2 & ( = 0 &' & (
  • 19. 6.S191 Introduction to Deep Learning introtodeeplearning.com 1/28/19 The Perceptron: Example 1 −2 3 Σ &' &( 1 ) * ) * = , (1 + 3 &' − 2 &( ) 1 + 3 & ' − 2 & ( = 0 &' & ( 1 < 0 * < 0.5 1 > 0 * > 0.5
  • 20. Building Neural Networks with Perceptrons
  • 21. 6.S191 Introduction to Deep Learning introtodeeplearning.com 1/28/19 Inputs Weights Sum Non-Linearity !" !# !$ !% &$ &# 1 &" Σ ) * Output The Perceptron: Simplified
  • 22. 6.S191 Introduction to Deep Learning introtodeeplearning.com 1/28/19 The Perceptron: Simplified !" !# !$ % & = ( % % = )* + , -.$ # !- )-
  • 23. 6.S191 Introduction to Deep Learning introtodeeplearning.com 1/28/19 Multi Output Perceptron !" !# !$ %" %$ &$ = ( %$ &" = ( %" %) = *+,) + . /0$ # !/ */,)
  • 24. 6.S191 Introduction to Deep Learning introtodeeplearning.com 1/28/19 Single Layer Neural Network Inputs !" !# !$ Hidden %& %" ' (" ' ($ Final Output %$ %)* %+ = -.,+ ($) + 3 45$ # !4 -4,+ ($) ' (+ = 6 -.,+ (") + 3 45$ )* %4 -4,+ (") 6 %$ 6 %" 6 %& 6 %)* 7 ($) 7 (")
  • 25. 6.S191 Introduction to Deep Learning introtodeeplearning.com 1/28/19 Single Layer Neural Network !" !# !$ %& %" ' (" ' ($ %$ %)* %" = ,-," ($) + 2 34$ # !3 ,3," ($) = ,-," ($) + !$ ,$," ($) + !" ,"," ($) + !# ,#," ($) 5$," ($) 5"," ($) 5#," ($)
  • 26. 6.S191 Introduction to Deep Learning introtodeeplearning.com 1/28/19 Multi Output Perceptron Inputs !" !# !$ Hidden %& %" ' (" ' ($ Output %$ %)* from tf.keras.layers import * inputs = Inputs(m) hidden = Dense(d1)(inputs) outputs = Dense(2)(hidden) model = Model(inputs, outputs)
  • 27. 6.S191 Introduction to Deep Learning introtodeeplearning.com 1/28/19 Deep Neural Network Inputs !" !# !$ Hidden %&,( %&," ) *" ) *$ Output %&,$ %&,+, %&,- = /0,- (&) + 4 56$ +,78 9(%&:$,5) /5,- (&) ⋯ ⋯
  • 29. 6.S191 Introduction to Deep Learning introtodeeplearning.com 1/28/19 Example Problem Will I pass this class? Let’s start with a simple two feature model !" = Number of lectures you attend !$ = Hours spent on the final project
  • 30. 6.S191 Introduction to Deep Learning introtodeeplearning.com 1/28/19 Example Problem: Will I pass this class? ! " = Hours spent on the final project !$ = Number of lectures you attend Pass Fail Legend
  • 31. 6.S191 Introduction to Deep Learning introtodeeplearning.com 1/28/19 Example Problem: Will I pass this class? ! " = Hours spent on the final project !$ = Number of lectures you attend Pass Fail Legend ? 4 5
  • 32. 6.S191 Introduction to Deep Learning introtodeeplearning.com 1/28/19 Example Problem: Will I pass this class? !" !# $% $" & '# $# ! # = 4 ,5 Predicted: 0.1
  • 33. 6.S191 Introduction to Deep Learning introtodeeplearning.com 1/28/19 Example Problem: Will I pass this class? !" !# $% $" & '# $# Predicted: 0.1 Actual: 1 ! # = 4 ,5
  • 34. 6.S191 Introduction to Deep Learning introtodeeplearning.com 1/28/19 Quantifying Loss !" !# $% $" & '# $# Predicted: 0.1 Actual: 1 The loss of our network measures the cost incurred from incorrect predictions ℒ , !(.) ; 1 , '(.) Predicted Actual ! # = 4 ,5
  • 35. 6.S191 Introduction to Deep Learning introtodeeplearning.com 1/28/19 Empirical Loss !" !# $% $" & '# $# 4, 2, 5, ⋮ The empirical loss measures the total loss over our entire dataset 5 1 8 ⋮ ) = 0.1 0.8 0.6 ⋮ +(!) 1 0 1 ⋮ ' . / = 1 1 2 34# 5 ℒ + !(3) ; / , '(3) Predicted Actual Also known as: • Objective function • Cost function • Empirical Risk
  • 36. Binary Cross Entropy Loss !" !# $% $" & '# $# 4, 2, 5, ⋮ Cross entropy loss can be used with models that output a probability between 0 and 1 5 1 8 ⋮ ) = 0.1 0.8 0.6 ⋮ +(!) 1 0 1 ⋮ ' . / = 1 1 2 34# 5 '(3) log + ! 3 ; / + (1 − '(3)) log 1 − + ! 3 ; / Predicted Actual Predicted Actual loss = tf.reduce_mean( tf.nn.softmax_cross_entropy_with_logits(model.y, model.pred) )
  • 37. Mean Squared Error Loss !" !# $% $" & '# $# 4, 2, 5, ⋮ Mean squared error loss can be used with regression models that output continuous real numbers 5 1 8 ⋮ ) = 30 80 85 ⋮ +(!) 90 20 95 ⋮ ' . / = 1 1 2 34# 5 ' 3 − + ! 3 ; / " Predicted Actual loss = tf.reduce_mean( tf.square(tf.subtract(model.y, model.pred) ) Final Grades (percentage)
  • 39. 6.S191 Introduction to Deep Learning introtodeeplearning.com 1/28/19 Loss Optimization We want to find the network weights that achieve the lowest loss !∗ = argmin ! 1 + , -./ 0 ℒ 2 3(-); ! , 8(-) !∗ = argmin ! 9(!)
  • 40. 6.S191 Introduction to Deep Learning introtodeeplearning.com 1/28/19 Loss Optimization We want to find the network weights that achieve the lowest loss !∗ = argmin ! 1 + , -./ 0 ℒ 2 3(-); ! , 8(-) !∗ = argmin ! 9(!) Remember: ! = !(:), !(/), ⋯
  • 41. Loss Optimization !∗ = argmin ! *(!) *(-., -0) -0 -. Remember: Our loss is a function of the network weights!
  • 42. Loss Optimization Randomly pick an initial ("#, "%) '("#, "%) "% "#
  • 44. Loss Optimization Take small step in opposite direction of gradient !(#$, #&) #& #$
  • 45. Gradient Descent Repeat until convergence !(#$, #&) #& #$
  • 46. 6.S191 Introduction to Deep Learning introtodeeplearning.com 1/28/19 Gradient Descent Algorithm 1. Initialize weights randomly ~"(0, &') 2. Loop until convergence: 3. Compute gradient, )*(+) )+ 4. Update weights, + ← + − . )*(+) )+ 5. Return weights weights = tf.random_normal(shape, stddev=sigma) grads = tf.gradients(ys=loss, xs=weights) weights_new = weights.assign(weights – lr * grads)
  • 47. 6.S191 Introduction to Deep Learning introtodeeplearning.com 1/28/19 Gradient Descent Algorithm 1. Initialize weights randomly ~"(0, &') 2. Loop until convergence: 3. Compute gradient, )*(+) )+ 4. Update weights, + ← + − . )*(+) )+ 5. Return weights weights = tf.random_normal(shape, stddev=sigma) grads = tf.gradients(ys=loss, xs=weights) weights_new = weights.assign(weights – lr * grads)
  • 48. 6.S191 Introduction to Deep Learning introtodeeplearning.com 1/28/19 Computing Gradients: Backpropagation How does a small change in one weight (ex. !") affect the final loss #(%)? ' () * + !) !" #(%)
  • 49. 6.S191 Introduction to Deep Learning introtodeeplearning.com 1/28/19 Computing Gradients: Backpropagation !"($) !&' = ) *+ , - &+ &' "($) Let’s use the chain rule!
  • 50. 6.S191 Introduction to Deep Learning introtodeeplearning.com 1/28/19 Computing Gradients: Backpropagation !"($) !&' = !"($) ! ) * ∗ ! ) * !&' , -. ) * &. &' "($)
  • 51. 6.S191 Introduction to Deep Learning introtodeeplearning.com 1/28/19 Computing Gradients: Backpropagation !"($) !&' = !"($) ! ) * ∗ ! ) * !&' , -' ) * &' &. "($) Apply chain rule! Apply chain rule!
  • 52. 6.S191 Introduction to Deep Learning introtodeeplearning.com 1/28/19 Computing Gradients: Backpropagation !"($) !&' = !"($) ! ) * ∗ ! ) * !,' - ,' ) * &' &. "($) ∗ !,' !&'
  • 53. 6.S191 Introduction to Deep Learning introtodeeplearning.com 1/28/19 Computing Gradients: Backpropagation !"($) !&' = !"($) ! ) * ∗ ! ) * !,' - ,' ) * &' &. "($) ∗ !,' !&' Repeat this for every weight in the network using gradients from later layers
  • 54. Neural Networks in Practice: Optimization
  • 55. 6.S191 Introduction to Deep Learning introtodeeplearning.com 1/28/19 Training Neural Networks is Difficult “Visualizing the loss landscape of neural nets”. Dec 2017.
  • 56. 6.S191 Introduction to Deep Learning introtodeeplearning.com 1/28/19 Loss Functions Can Be Difficult to Optimize Remember: Optimization through gradient descent ! ← ! − $ %&(!) %!
  • 57. 6.S191 Introduction to Deep Learning introtodeeplearning.com 1/28/19 Remember: Optimization through gradient descent ! ← ! − $ %&(!) %! How can we set the learning rate? Loss Functions Can Be Difficult to Optimize
  • 58. 6.S191 Introduction to Deep Learning introtodeeplearning.com 1/28/19 Setting the Learning Rate Small learning rate converges slowly and gets stuck in false local minima Initial guess ! "(!)
  • 59. 6.S191 Introduction to Deep Learning introtodeeplearning.com 1/28/19 Setting the Learning Rate Large learning rates overshoot, become unstable and diverge Initial guess ! "(!)
  • 60. 6.S191 Introduction to Deep Learning introtodeeplearning.com 1/28/19 Setting the Learning Rate Stable learning rates converge smoothly and avoid local minima Initial guess ! "($)
  • 61. 6.S191 Introduction to Deep Learning introtodeeplearning.com 1/28/19 How to deal with this? Idea 1: Try lots of different learning rates and see what works “just right”
  • 62. 6.S191 Introduction to Deep Learning introtodeeplearning.com 1/28/19 How to deal with this? Idea 1: Try lots of different learning rates and see what works “just right” Idea 2: Do something smarter! Design an adaptive learning rate that “adapts” to the landscape
  • 63. 6.S191 Introduction to Deep Learning introtodeeplearning.com 1/28/19 Adaptive Learning Rates • Learning rates are no longer fixed • Can be made larger or smaller depending on: • how large gradient is • how fast learning is happening • size of particular weights • etc...
  • 64. 6.S191 Introduction to Deep Learning introtodeeplearning.com 1/28/19 Adaptive Learning Rate Algorithms • Momentum • Adagrad • Adadelta • Adam • RMSProp Additional details: http://ruder.io/optimizing-gradient-descent/ tf.train.MomentumOptimizer tf.train.AdagradOptimizer tf.train.AdadeltaOptimizer tf.train.AdamOptimizer tf.train.RMSPropOptimizer Qian et al.“On the momentum term in gradient descent learning algorithms.” 1999. Duchi et al.“Adaptive Subgradient Methods for Online Learning and Stochastic Optimization.” 2011. Zeiler et al.“ADADELTA:An Adaptive Learning Rate Method.” 2012. Kingma et al.“Adam:A Method for Stochastic Optimization.” 2014.
  • 65. Neural Networks in Practice: Mini-batches
  • 66. 6.S191 Introduction to Deep Learning introtodeeplearning.com 1/28/19 Gradient Descent Algorithm 1. Initialize weights randomly ~"(0, &') 2. Loop until convergence: 3. Compute gradient, )*(+) )+ 4. Update weights, + ← + − . )*(+) )+ 5. Return weights
  • 67. 6.S191 Introduction to Deep Learning introtodeeplearning.com 1/28/19 Gradient Descent Algorithm 1. Initialize weights randomly ~"(0, &') 2. Loop until convergence: 3. Compute gradient, )*(+) )+ 4. Update weights, + ← + − . )*(+) )+ 5. Return weights Can be very computational to compute!
  • 68. 6.S191 Introduction to Deep Learning introtodeeplearning.com 1/28/19 Stochastic Gradient Descent Algorithm 1. Initialize weights randomly ~"(0, &') 2. Loop until convergence: 3. Pick single data point ) 4. Compute gradient, *+,(-) *- 5. Update weights, - ← - − 0 *+(-) *- 6. Return weights
  • 69. 6.S191 Introduction to Deep Learning introtodeeplearning.com 1/28/19 Stochastic Gradient Descent Algorithm 1. Initialize weights randomly ~"(0, &') 2. Loop until convergence: 3. Pick single data point ) 4. Compute gradient, *+,(-) *- 5. Update weights, - ← - − 0 *+(-) *- 6. Return weights Easy to compute but very noisy (stochastic)!
  • 70. 6.S191 Introduction to Deep Learning introtodeeplearning.com 1/28/19 Stochastic Gradient Descent Algorithm 1. Initialize weights randomly ~"(0, &') 2. Loop until convergence: 3. Pick batch of ) data points 4. Compute gradient, *+(,) *, = . / ∑12. / *+3(,) *, 5. Update weights, , ← , − 6 *+(,) *, 6. Return weights
  • 71. 6.S191 Introduction to Deep Learning introtodeeplearning.com 1/28/19 Stochastic Gradient Descent Algorithm 1. Initialize weights randomly ~"(0, &') 2. Loop until convergence: 3. Pick batch of ) data points 4. Compute gradient, *+(,) *, = . / ∑12. / *+3(,) *, 5. Update weights, , ← , − 6 *+(,) *, 6. Return weights Fast to compute and a much better estimate of the true gradient!
  • 72. 6.S191 Introduction to Deep Learning introtodeeplearning.com 1/28/19 Mini-batches while training More accurate estimation of gradient Smoother convergence Allows for larger learning rates
  • 73. 6.S191 Introduction to Deep Learning introtodeeplearning.com 1/28/19 Mini-batches while training More accurate estimation of gradient Smoother convergence Allows for larger learning rates Mini-batches lead to fast training! Can parallelize computation + achieve significant speed increases on GPU’s
  • 74. Neural Networks in Practice: Overfitting
  • 75. 6.S191 Introduction to Deep Learning introtodeeplearning.com 1/28/19 The Problem of Overfitting Underfitting Model does not have capacity to fully learn the data Ideal fit Overfitting Too complex, extra parameters, does not generalize well
  • 76. 6.S191 Introduction to Deep Learning introtodeeplearning.com 1/28/19 Regularization What is it? Technique that constrains our optimization problem to discourage complex models
  • 77. 6.S191 Introduction to Deep Learning introtodeeplearning.com 1/28/19 Regularization What is it? Technique that constrains our optimization problem to discourage complex models Why do we need it? Improve generalization of our model on unseen data
  • 78. 6.S191 Introduction to Deep Learning introtodeeplearning.com 1/28/19 Regularization 1: Dropout !" !# !$ % &" % &$ '$,# '$," '$,$ '$,) '",# '"," '",$ '",) • During training, randomly set some activations to 0
  • 79. 6.S191 Introduction to Deep Learning introtodeeplearning.com 1/28/19 Regularization 1: Dropout !" !# !$ % &" % &$ '$,# '$," '$,$ '$,) '",# '"," '",$ '",) • During training, randomly set some activations to 0 • Typically ‘drop’ 50% of activations in layer • Forces network to not rely on any 1 node tf.keras.layers.Dropout(p=0.5)
  • 80. 6.S191 Introduction to Deep Learning introtodeeplearning.com 1/28/19 Regularization 1: Dropout !" !# !$ % &" % &$ '$,# '$," '$,$ '$,) '",# '"," '",$ '",) • During training, randomly set some activations to 0 • Typically ‘drop’ 50% of activations in layer • Forces network to not rely on any 1 node tf.keras.layers.Dropout(p=0.5)
  • 81. 6.S191 Introduction to Deep Learning introtodeeplearning.com 1/28/19 Regularization 2: Early Stopping • Stop training before we have a chance to overfit Training Iterations Loss
  • 82. 6.S191 Introduction to Deep Learning introtodeeplearning.com 1/28/19 Regularization 2: Early Stopping • Stop training before we have a chance to overfit Training Iterations Loss Testing Training Legend
  • 83. 6.S191 Introduction to Deep Learning introtodeeplearning.com 1/28/19 Regularization 2: Early Stopping • Stop training before we have a chance to overfit Training Iterations Loss Training Legend Testing
  • 84. 6.S191 Introduction to Deep Learning introtodeeplearning.com 1/28/19 Regularization 2: Early Stopping • Stop training before we have a chance to overfit Training Iterations Loss Training Legend Testing
  • 85. 6.S191 Introduction to Deep Learning introtodeeplearning.com 1/28/19 Regularization 2: Early Stopping Training Iterations Loss Training Legend Testing • Stop training before we have a chance to overfit
  • 86. 6.S191 Introduction to Deep Learning introtodeeplearning.com 1/28/19 Regularization 2: Early Stopping Training Iterations Loss Training Legend Testing • Stop training before we have a chance to overfit
  • 87. 6.S191 Introduction to Deep Learning introtodeeplearning.com 1/28/19 Regularization 2: Early Stopping Training Iterations Loss Training Legend Stop training here! Testing • Stop training before we have a chance to overfit
  • 88. 6.S191 Introduction to Deep Learning introtodeeplearning.com 1/28/19 Regularization 2: Early Stopping Training Iterations Loss Training Legend Stop training here! Over-fitting Under-fitting Testing • Stop training before we have a chance to overfit
  • 89. 6.S191 Introduction to Deep Learning introtodeeplearning.com 1/28/19 Core Foundation Review • Structural building blocks • Nonlinear activation functions The Perceptron Neural Networks Training in Practice • Stacking Perceptrons to form neural networks • Optimization through backpropagation • Adaptive learning • Batching • Regularization Σ "# "$ "% & ' "# "$ "% (),+ (),# & '# & '% (),% (),,- ⋯ ⋯
  • 90. 6.S191 Introduction to Deep Learning introtodeeplearning.com 1/29/19
  • 92. 6.S191 Introduction to Deep Learning introtodeeplearning.com 1/29/19 Images are Numbers [1]
  • 93. 6.S191 Introduction to Deep Learning introtodeeplearning.com 1/29/19 Images are Numbers [1]
  • 94. 6.S191 Introduction to Deep Learning introtodeeplearning.com 1/29/19 Images are Numbers What the computer sees An image is just a matrix of numbers [0,255]! i.e., 1080x1080x3 for an RGB image [1]
  • 95. 6.S191 Introduction to Deep Learning introtodeeplearning.com 1/29/19 Tasks in ComputerVision - Regression: output variable takes continuous value - Classification: output variable takes class label. Can produce probability of belonging to a particular class Input Image classification Lincoln Washington Jefferson Obama Pixel Representation 0.8 0.1 0.05 0.05
  • 96. 6.S191 Introduction to Deep Learning introtodeeplearning.com 1/29/19 High Level Feature Detection Let’s identify key features in each image category Wheels, License Plate, Headlights Door, Windows, Steps Nose, Eyes, Mouth
  • 97. 6.S191 Introduction to Deep Learning introtodeeplearning.com 1/29/19 Manual Feature Extraction Problems? Define features Domain knowledge Detect features to classify
  • 98. 6.S191 Introduction to Deep Learning introtodeeplearning.com 1/29/19 Manual Feature Extraction Define features Domain knowledge Detect features to classify [2]
  • 99. 6.S191 Introduction to Deep Learning introtodeeplearning.com 1/29/19 Manual Feature Extraction Define features Domain knowledge Detect features to classify [2]
  • 100. 6.S191 Introduction to Deep Learning introtodeeplearning.com 1/29/19 Learning Feature Representations Can we learn a hierarchy of features directly from the data instead of hand engineering? Low level features Mid level features High level features Eyes, ears, nose Edges, dark spots Facial structure [3]
  • 102. 6.S191 Introduction to Deep Learning introtodeeplearning.com 1/29/19 Fully Connected Neural Network
  • 103. 6.S191 Introduction to Deep Learning introtodeeplearning.com 1/29/19 Fully Connected Neural Network Fully Connected: • Connect neuron in hidden layer to all neurons in input layer • No spatial information! • And many, many parameters! Input: • 2D image • Vector of pixel values
  • 104. 6.S191 Introduction to Deep Learning introtodeeplearning.com 1/29/19 Fully Connected Neural Network How can we use spatial structure in the input to inform the architecture of the network? Fully Connected: • Connect neuron in hidden layer to all neurons in input layer • No spatial information! • And many, many parameters! Input: • 2D image • Vector of pixel values
  • 105. 6.S191 Introduction to Deep Learning introtodeeplearning.com 1/29/19 Using Spatial Structure Neuron connected to region of input. Only “sees” these values. Idea: connect patches of input to neurons in hidden layer. Input: 2D image. Array of pixel values
  • 106. 6.S191 Introduction to Deep Learning introtodeeplearning.com 1/29/19 Using Spatial Structure Connect patch in input layer to a single neuron in subsequent layer. Use a sliding window to define connections. How can we weight the patch to detect particular features?
  • 107. 6.S191 Introduction to Deep Learning introtodeeplearning.com 1/29/19 Applying Filters to Extract Features 1) Apply a set of weights – a filter – to extract local features 2) Use multiple filters to extract different features 3) Spatially share parameters of each filter (features that matter in one part of the input should matter elsewhere)
  • 108. 6.S191 Introduction to Deep Learning introtodeeplearning.com 1/29/19 Feature Extraction with Convolution 1) Apply a set of weights – a filter – to extract local features 2) Use multiple filters to extract different features 3) Spatially share parameters of each filter - Filter of size 4x4 : 16 different weights - Apply this same filter to 4x4 patches in input - Shift by 2 pixels for next patch This “patchy” operation is convolution
  • 109. Feature Extraction and Convolution A Case Study
  • 110. 6.S191 Introduction to Deep Learning introtodeeplearning.com 1/29/19 X or X? Image is represented as matrix of pixel values… and computers are literal! We want to be able to classify an X as an X even if it’s shifted, shrunk, rotated, deformed. [4]
  • 111. 6.S191 Introduction to Deep Learning introtodeeplearning.com 1/29/19 Features of X [4]
  • 112. 6.S191 Introduction to Deep Learning introtodeeplearning.com 1/29/19 Filters to Detect X Features filters [4]
  • 113. 6.S191 Introduction to Deep Learning introtodeeplearning.com 1/29/19 The Convolution Operation element wise multiply add outputs 1 1 = 1 X = 9 [4]
  • 114. 6.S191 Introduction to Deep Learning introtodeeplearning.com 1/29/19 The Convolution Operation Suppose we want to compute the convolution of a 5x5 image and a 3x3 filter: We slide the 3x3 filter over the input image, element-wise multiply, and add the outputs… image filter [5]
  • 115. 6.S191 Introduction to Deep Learning introtodeeplearning.com 1/29/19 The Convolution Operation filter feature map We slide the 3x3 filter over the input image, element-wise multiply, and add the outputs: [5]
  • 116. 6.S191 Introduction to Deep Learning introtodeeplearning.com 1/29/19 The Convolution Operation We slide the 3x3 filter over the input image, element-wise multiply, and add the outputs: filter feature map [5]
  • 117. 6.S191 Introduction to Deep Learning introtodeeplearning.com 1/29/19 The Convolution Operation We slide the 3x3 filter over the input image, element-wise multiply, and add the outputs: filter feature map [5]
  • 118. 6.S191 Introduction to Deep Learning introtodeeplearning.com 1/29/19 The Convolution Operation We slide the 3x3 filter over the input image, element-wise multiply, and add the outputs: filter feature map [5]
  • 119. 6.S191 Introduction to Deep Learning introtodeeplearning.com 1/29/19 The Convolution Operation We slide the 3x3 filter over the input image, element-wise multiply, and add the outputs: filter feature map [5]
  • 120. 6.S191 Introduction to Deep Learning introtodeeplearning.com 1/29/19 The Convolution Operation We slide the 3x3 filter over the input image, element-wise multiply, and add the outputs: filter feature map [5]
  • 121. 6.S191 Introduction to Deep Learning introtodeeplearning.com 1/29/19 The Convolution Operation We slide the 3x3 filter over the input image, element-wise multiply, and add the outputs: [5] filter feature map
  • 122. 6.S191 Introduction to Deep Learning introtodeeplearning.com 1/29/19 The Convolution Operation We slide the 3x3 filter over the input image, element-wise multiply, and add the outputs: filter feature map [5]
  • 123. 6.S191 Introduction to Deep Learning introtodeeplearning.com 1/29/19 The Convolution Operation We slide the 3x3 filter over the input image, element-wise multiply, and add the outputs: filter feature map [5]
  • 124. 6.S191 Introduction to Deep Learning introtodeeplearning.com 1/29/19 Producing Feature Maps Original Sharpen Edge Detect “Strong” Edge Detect
  • 125. 6.S191 Introduction to Deep Learning introtodeeplearning.com 1/29/19 Feature Extraction with Convolution 1) Apply a set of weights – a filter – to extract local features 2) Use multiple filters to extract different features 3) Spatially share parameters of each filter
  • 127. 6.S191 Introduction to Deep Learning introtodeeplearning.com 1/29/19 CNNs for Classification 1. Convolution:Apply filters with learned weights to generate feature maps. 2. Non-linearity: Often ReLU. 3. Pooling: Downsampling operation on each feature map. Train model with image data. Learn weights of filters in convolutional layers.
  • 128. 6.S191 Introduction to Deep Learning introtodeeplearning.com 1/29/19 Convolutional Layers: Local Connectivity For a neuron in hidden layer: - Take inputs from patch - Compute weighted sum - Apply bias
  • 129. 6.S191 Introduction to Deep Learning introtodeeplearning.com 1/29/19 Convolutional Layers: Local Connectivity For a neuron in hidden layer: - Take inputs from patch - Compute weighted sum - Apply bias 4x4 filter: matrix of weights !"# $ "%& ' $ #%& ' !"# (")*,#), + . for neuron (p,q) in hidden layer 1) applying a window of weights 2) computing linear combinations 3) activating with non-linear function
  • 130. 6.S191 Introduction to Deep Learning introtodeeplearning.com 1/29/19 CNNs: Spatial Arrangement of OutputVolume depth width height Layer Dimensions: ℎ " # " $ where h and w are spatial dimensions d (depth) = number of filters Receptive Field: Locations in input image that a node is path connected to Stride: Filter step size [3]
  • 131. 6.S191 Introduction to Deep Learning introtodeeplearning.com 1/29/19 Introducing Non-Linearity ! " = max(0 , " ) Rectified Linear Unit (ReLU) - Apply after every convolution operation (i.e., after convolutional layers) - ReLU: pixel-by-pixel operation that replaces all negative values by zero. Non-linear operation! [5]
  • 132. 6.S191 Introduction to Deep Learning introtodeeplearning.com 1/29/19 Pooling How else can we downsample and preserve spatial invariance? 1) Reduced dimensionality 2) Spatial invariance [3]
  • 133. 6.S191 Introduction to Deep Learning introtodeeplearning.com 1/29/19 Representation Learning in Deep CNNs Mid level features Eyes, ears, nose Low level features Edges, dark spots High level features Facial structure Conv Layer 1 Conv Layer 2 Conv Layer 3 [3]
  • 134. 6.S191 Introduction to Deep Learning introtodeeplearning.com 1/29/19 CNNs for Classification: Feature Learning 1. Learn features in input image through convolution 2. Introduce non-linearity through activation function (real-world data is non-linear!) 3. Reduce dimensionality and preserve spatial invariance with pooling
  • 135. 6.S191 Introduction to Deep Learning introtodeeplearning.com 1/29/19 CNNs for Classification: Class Probabilities - CONV and POOL layers output high-level features of input - Fully connected layer uses these features for classifying input image - Express output as probability of image belonging to a particular class softmax () = +,- ∑/ +,0
  • 136. 6.S191 Introduction to Deep Learning introtodeeplearning.com 1/29/19 CNNs:Training with Backpropagation Learn weights for convolutional filters and fully connected layers ! " = $ % &(%) log , -(%) Backpropagation: cross-entropy loss
  • 138. 6.S191 Introduction to Deep Learning introtodeeplearning.com 1/29/19 ImageNet Dataset Dataset of over 14 million images across 21,841 categories 1409 pictures of bananas. “Elongated crescent-shaped yellow fruit with soft sweet flesh” [6,7]
  • 139. 6.S191 Introduction to Deep Learning introtodeeplearning.com 1/29/19 ImageNet Challenge Classification task: produce a list of object categories present in image. 1000 categories. “Top 5 error”: rate at which the model does not output correct label in top 5 predictions Other tasks include: single-object localization, object detection from video/image, scene classification, scene parsing [6,7]
  • 140. 6.S191 Introduction to Deep Learning introtodeeplearning.com 1/29/19 ImageNet Challenge: Classification Task 2 0 1 0 2 0 1 1 2 0 1 2 2 0 1 3 2 0 1 4 2 0 1 5 H u m a n 0 10 20 30 classification error % 28.2 25.8 16.4 11.7 6.7 3.57 5.1 2012:AlexNet. First CNN to win. - 8 layers, 61 million parameters 2013: ZFNet - 8 layers, more filters 2014:VGG - 19 layers 2014: GoogLeNet - “Inception” modules - 22 layers, 5million parameters 2015: ResNet - 152 layers [6,7]
  • 141. 6.S191 Introduction to Deep Learning introtodeeplearning.com 1/29/19 ImageNet Challenge: Classification Task 2 0 1 0 2 0 1 1 2 0 1 2 2 0 1 3 2 0 1 4 2 0 1 4 2 0 1 5 H u m a n 0 10 20 30 classification error % 28.2 25.8 16.4 11.7 6.7 3.57 5.1 7.3 2 0 1 0 2 0 1 1 2 0 1 2 2 0 1 3 2 0 1 4 2 0 1 4 2 0 1 5 0 50 100 150 number of layers [6,7]
  • 142. An Architecture for Many Applications
  • 143. 6.S191 Introduction to Deep Learning introtodeeplearning.com 1/29/19 An Architecture for Many Applications Object detection with R-CNNs Segmentation with fully convolutional networks Image captioning with RNNs
  • 144. 6.S191 Introduction to Deep Learning introtodeeplearning.com 1/29/19 Beyond Classification Object Detection CAT, DOG, DUCK Semantic Segmentation CAT Image Captioning The cat is in the grass.
  • 145. 6.S191 Introduction to Deep Learning introtodeeplearning.com 1/29/19 Semantic Segmentation: FCNs FCN: Fully Convolutional Network. Network designed with all convolutional layers, with downsampling and upsampling operations [3,8,9]
  • 146. 6.S191 Introduction to Deep Learning introtodeeplearning.com 1/29/19 Driving Scene Segmentation [10] Fix reference
  • 147. 6.S191 Introduction to Deep Learning introtodeeplearning.com 1/29/19 Driving Scene Segmentation [11, 12]
  • 148. 6.S191 Introduction to Deep Learning introtodeeplearning.com 1/29/19 Object Detection with R-CNNs R-CNN: Find regions that we think have objects. Use CNN to classify. [13]
  • 149. 6.S191 Introduction to Deep Learning introtodeeplearning.com 1/29/19 Image Captioning using RNNs [14,15]
  • 150. 6.S191 Introduction to Deep Learning introtodeeplearning.com 1/29/19 Image Captioning using RNNs [14,15]
  • 151. Deep Learning for ComputerVision: Impact and Summary
  • 152. 6.S191 Introduction to Deep Learning introtodeeplearning.com 1/29/19 Data, Data, Data MNIST: handwritten digits places: natural scenes ImageNet: 22K categories. 14M images. CIFAR-10 Airplane Automobile Bird Cat Deer Dog Frog Horse Ship Truck
  • 153. 6.S191 Introduction to Deep Learning introtodeeplearning.com 1/29/19 Deep Learning for ComputerVision: Impact
  • 154. 6.S191 Introduction to Deep Learning introtodeeplearning.com 1/29/19 Impact: Face Detection 6.S191 Lab!
  • 155. 6.S191 Introduction to Deep Learning introtodeeplearning.com 1/29/19 Impact: Self-Driving Cars [16]
  • 156. 6.S191 Introduction to Deep Learning introtodeeplearning.com 1/29/19 Impact: Healthcare [17] Identifying facial phenotypes of genetic disorders using deep learning Gurovich et al., Nature Med. 2019
  • 157. 6.S191 Introduction to Deep Learning introtodeeplearning.com 1/29/19 Deep Learning for ComputerVision: Summary • Why computer vision? • Representing images • Convolutions for feature extraction Foundations CNNs Applications • CNN architecture • Application to classification • ImageNet • Segmentation, object detection, image captioning • Visualization
  • 158. 6.S191 Introduction to Deep Learning introtodeeplearning.com 1/29/19 Which face is fake? [1]
  • 159. 6.S191 Introduction to Deep Learning introtodeeplearning.com 1/29/19 Supervised vs unsupervised learning Supervised Learning Data: (", $) " is data, $ is label Goal: Learn function to map " → $ Examples: Classification, regression, object detection, semantic segmentation, etc. Unsupervised Learning Data: " " is data, no labels! Goal: Learn some hidden or underlying structure of the data Examples: Clustering, feature or dimensionality reduction, etc.
  • 160. 6.S191 Introduction to Deep Learning introtodeeplearning.com 1/29/19 Supervised vs unsupervised learning Supervised Learning Data: (", $) " is data, $ is label Goal: Learn function to map " → $ Examples: Classification, regression, object detection, semantic segmentation, etc. Unsupervised Learning Data: " " is data, no labels! Goal: Learn some hidden or underlying structure of the data Examples: Clustering, feature or dimensionality reduction, etc.
  • 161. 6.S191 Introduction to Deep Learning introtodeeplearning.com 1/29/19 Generative modeling Goal: Take as input training samples from some distribution and learn a model that represents that distribution Density Estimation Sample Generation Input samples Generated samples Training data ~ "#$%$ & Generated ~ "'(#)* & How can we learn "'(#)* & similar to "#$%$ & ? samples
  • 162. 6.S191 Introduction to Deep Learning introtodeeplearning.com 1/29/19 Why generative models? Debiasing vs Capable of uncovering underlying latent variables in a dataset Homogeneous skin color, pose Diverse skin color, pose, illumination How can we use latent distributions to create fair and representative datasets? [2]
  • 163. 6.S191 Introduction to Deep Learning introtodeeplearning.com 1/29/19 Why generative models? Outlier detection 95% of Driving Data: (1) sunny, (2) highway, (3) straight road Detect outliers to avoid unpredictable behavior when training Edge Cases Harsh Weather Pedestrians • Problem: How can we detect when we encounter something new or rare? • Strategy: Leverage generative models, detect outliers in the distribution • Use outliers during training to improve even more! [3]
  • 164. Latent variable models Autoencoders andVariational Autoencoders (VAEs) Generative Adversarial Networks (GANs)
  • 165. 6.S191 Introduction to Deep Learning introtodeeplearning.com 1/29/19 What is a latent variable? Myth of the Cave [4]
  • 166. 6.S191 Introduction to Deep Learning introtodeeplearning.com 1/29/19 What is a latent variable? Can we learn the true explanatory factors, e.g. latent variables, from only observed data?
  • 168. 6.S191 Introduction to Deep Learning introtodeeplearning.com 1/29/19 Autoencoders: background Unsupervised approach for learning a lower-dimensional feature representation from unlabeled training data ! " “Encoder” learns mapping from the data, !, to a low-dimensional latent space, " Why do we care about a low-dimensional "?
  • 169. 6.S191 Introduction to Deep Learning introtodeeplearning.com 1/29/19 Autoencoders: background How can we learn this latent space? Train the model to use these features to reconstruct the original data ! " “Decoder” learns mapping back from latent, ", to a reconstructed observation, # ! # !
  • 170. 6.S191 Introduction to Deep Learning introtodeeplearning.com 1/29/19 Autoencoders: background How can we learn this latent space? Train the model to use these features to reconstruct the original data ! " # ! ℒ !, # ! = ! − # ! ( Loss function doesn’t use any labels!!
  • 171. 6.S191 Introduction to Deep Learning introtodeeplearning.com 1/29/19 Autoencoders: background How can we learn this latent space? Train the model to use these features to reconstruct the original data ! " # ! ℒ !, # ! = ! − # ! ( Loss function doesn’t use any labels!!
  • 172. 6.S191 Introduction to Deep Learning introtodeeplearning.com 1/29/19 Dimensionality of latent space à reconstruction quality 2D latent space 5D latent space GroundTruth Autoencoding is a form of compression! Smaller latent space will force a larger training bottleneck [5]
  • 173. 6.S191 Introduction to Deep Learning introtodeeplearning.com 1/29/19 Autoencoders for representation learning Bottleneck hidden layer forces network to learn a compressed latent representation Reconstruction loss forces the latent representation to capture (or encode) as much “information” about the data as possible Autoencoding = Automatically encoding data
  • 175. 6.S191 Introduction to Deep Learning introtodeeplearning.com 1/29/19 VAEs: key difference with traditional autoencoder ! " # !
  • 176. 6.S191 Introduction to Deep Learning introtodeeplearning.com 1/29/19 VAEs: key difference with traditional autoencoder ! " # ! $ % [6]
  • 177. 6.S191 Introduction to Deep Learning introtodeeplearning.com 1/29/19 VAEs: key difference with traditional autoencoder ! " # ! $ % mean vector standard deviation vector Variational autoencoders are a probabilistic twist on autoencoders! Sample from the mean and standard dev. to compute latent sample [6]
  • 178. 6.S191 Introduction to Deep Learning introtodeeplearning.com 1/29/19 VAE optimization ! " # ! $ % Encoder computes: &'(z|!) Decoder computes: ,-(x|") [6]
  • 179. 6.S191 Introduction to Deep Learning introtodeeplearning.com 1/29/19 VAE optimization ! " # ! $ % Encoder computes: &'(z|!) Decoder computes:,-(x|") ℒ ϕ, 2 = (reconstruction loss) + (regularization term) [6]
  • 180. 6.S191 Introduction to Deep Learning introtodeeplearning.com 1/29/19 VAE optimization ! " # ! $ % Encoder computes: &'(z|!) Decoder computes:,-(x|") ℒ ϕ, 2, ! = (reconstruction loss) + (regularization term) [6]
  • 181. 6.S191 Introduction to Deep Learning introtodeeplearning.com 1/29/19 VAE optimization ! " # ! $ % Encoder computes: &'(z|!) Decoder computes:,-(x|") ℒ ϕ, 2, ! = (reconstruction loss) + (regularization term) e.g. ! − # ! 5 [6]
  • 182. 6.S191 Introduction to Deep Learning introtodeeplearning.com 1/29/19 VAE optimization ! " # ! $ % Encoder computes: &'(z|!) Decoder computes:,-(x|") ℒ ϕ, 2, ! = (reconstruction loss) + (regularization term) 4 &' z ! ∥ & " Inferred latent distribution Fixed prior on latent distribution [6]
  • 183. 6.S191 Introduction to Deep Learning introtodeeplearning.com 1/29/19 Priors on the latent distribution ! "# z % ∥ " ' Inferred latent distribution Fixed prior on latent distribution Common choice of prior: " ' = ) * = 0, -. = 1 • Encourages encodings to distribute encodings evenly around the center of the latent space • Penalize the network when it tries to “cheat” by clustering points in specific regions (ie. memorizing the data) [7]
  • 184. 6.S191 Introduction to Deep Learning introtodeeplearning.com 1/29/19 Priors on the latent distribution ! "# z % ∥ " ' Common choice of prior: " ' = ) * = 0, -. = 1 • Encourages encodings to distribute encodings evenly around the center of the latent space • Penalize the network when it tries to “cheat” by clustering points in specific regions (ie. memorizing the data) = − 1 2 2 345 678 -3 + *3 . − 1 − log -3 KL-divergence between the two distributions [7]
  • 185. 6.S191 Introduction to Deep Learning introtodeeplearning.com 1/29/19 VAEs computation graph ! " # ! $ % Encoder computes: &'(z|!) Decoder computes:,-(x|") ℒ ϕ, 2, ! = (reconstruction loss) + (regularization term) [6]
  • 186. 6.S191 Introduction to Deep Learning introtodeeplearning.com 1/29/19 VAEs computation graph ! " # ! $ % Encoder computes: &'(z|!) Decoder computes:,-(x|") ℒ ϕ, 2, ! = (reconstruction loss) + (regularization term) Problem: We cannot backpropagate gradients through sampling layers! [6]
  • 187. 6.S191 Introduction to Deep Learning introtodeeplearning.com 1/29/19 Reparametrizing the sampling layer ! " # Key Idea: ! ~%(", #() Consider the sampled latent vector as a sum of • a fixed " vector, • and fixed # vector, scaled by random constants drawn from the prior distribution ⇒ ! = " + #⨀. where .~%(0,1) [6]
  • 188. 6.S191 Introduction to Deep Learning introtodeeplearning.com 1/29/19 Reparametrizing the sampling layer ! " " ∼ $%(z|)) + ) Deterministic node Stochastic node Original form Backprop [6]
  • 189. 6.S191 Introduction to Deep Learning introtodeeplearning.com 1/29/19 Reparametrizing the sampling layer ! " " ∼ $%(z|)) ! " " = ,(-, ), /) - ) / - ) Deterministic node Stochastic node Original form Reparametrized form 0! 0" 0! 0- Backprop ~2(0,1) [6]
  • 190. 6.S191 Introduction to Deep Learning introtodeeplearning.com 1/29/19 VAEs: Latent perturbation Slowly increase or decrease a single latent variable Keep all other variables fixed Head pose Different dimensions of ! encodes different interpretable latent features [8]
  • 191. 6.S191 Introduction to Deep Learning introtodeeplearning.com 1/29/19 VAEs: Latent perturbation Head pose Smile Ideally, we want latent variables that are uncorrelated with each other Enforce diagonal prior on the latent variables to encourage independence Disentanglement [8]
  • 192. 6.S191 Introduction to Deep Learning introtodeeplearning.com 1/29/19 VAEs: Latent perturbation Google BeatBlender [9]
  • 193. 6.S191 Introduction to Deep Learning introtodeeplearning.com 1/29/19 VAEs: Latent perturbation [10]
  • 194. 6.S191 Introduction to Deep Learning introtodeeplearning.com 1/29/19 VAE summary 1. Compress representation of world to something we can use to learn x → z → x̂
  • 195. 6.S191 Introduction to Deep Learning introtodeeplearning.com 1/29/19 VAE summary 1. Compress representation of world to something we can use to learn 2. Reconstruction allows for unsupervised learning (no labels!) x → z → x̂
  • 196. 6.S191 Introduction to Deep Learning introtodeeplearning.com 1/29/19 VAE summary 1. Compress representation of world to something we can use to learn 2. Reconstruction allows for unsupervised learning (no labels!) 3. Reparameterization trick to train end-to-end x → z → x̂
  • 197. 6.S191 Introduction to Deep Learning introtodeeplearning.com 1/29/19 VAE summary 1. Compress representation of world to something we can use to learn 2. Reconstruction allows for unsupervised learning (no labels!) 3. Reparameterization trick to train end-to-end 4. Interpret hidden latent variables using perturbation x → z → x̂
  • 198. 6.S191 Introduction to Deep Learning introtodeeplearning.com 1/29/19 VAE summary 1. Compress representation of world to something we can use to learn 2. Reconstruction allows for unsupervised learning (no labels!) 3. Reparameterization trick to train end-to-end 4. Interpret hidden latent variables using perturbation 5. Generating new examples x → z → x̂
  • 200. 6.S191 Introduction to Deep Learning introtodeeplearning.com 1/29/19 What if we just want to sample? Idea: don’t explicitly model density, and instead just sample to generate new instances. Problem: want to sample from complex distribution – can’t do this directly! Solution: sample from something simple (noise), learn a transformation to the training distribution. Generator network G: noise z → x̂, a “fake” sample from the training distribution [11]
  • 201. 6.S191 Introduction to Deep Learning introtodeeplearning.com 1/29/19 Generative Adversarial Networks (GANs) Generative Adversarial Networks (GANs) are a way to make a generative model by having two neural networks compete with each other. The discriminator tries to identify real data from fakes created by the generator. The generator turns noise into an imitation of the data to try to trick the discriminator. Noise z → Generator G → X_fake; X_fake and X_real → Discriminator D [11]
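A minimal sketch of the two competing networks in TensorFlow/Keras, shown here as a fully connected toy version; the layer sizes, latent_dim, and data_dim are assumptions for illustration, not architectures from the lecture.

import tensorflow as tf

latent_dim, data_dim = 100, 784  # assumed sizes, e.g. flattened 28x28 images

# Generator: turns noise z into an imitation of the data.
generator = tf.keras.Sequential([
    tf.keras.layers.Dense(256, activation="relu"),   # expects input of shape (batch, latent_dim)
    tf.keras.layers.Dense(data_dim, activation="tanh"),
])

# Discriminator: outputs a logit for "is this sample real?" given a data sample.
discriminator = tf.keras.Sequential([
    tf.keras.layers.Dense(256, activation="relu"),   # expects input of shape (batch, data_dim)
    tf.keras.layers.Dense(1),                        # raw logit; sigmoid is applied inside the loss
])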
  • 202. 6.S191 Introduction to Deep Learning introtodeeplearning.com 1/29/19 Intuition behind GANs Generator Generator starts from noise to try to create an imitation of the data. Fake data
  • 203. 6.S191 Introduction to Deep Learning introtodeeplearning.com 1/29/19 Intuition behind GANs Discriminator Generator Discriminator looks at both real data and fake data created by the generator. Fake data
  • 204. 6.S191 Introduction to Deep Learning introtodeeplearning.com 1/29/19 Intuition behind GANs Discriminator Generator Discriminator looks at both real data and fake data created by the generator. Real data Fake data
  • 205. 6.S191 Introduction to Deep Learning introtodeeplearning.com 1/29/19 Intuition behind GANs Discriminator Generator P(real) = 1 Discriminator tries to predict what’s real and what’s fake. Real data Fake data
  • 209. 6.S191 Introduction to Deep Learning introtodeeplearning.com 1/29/19 Intuition behind GANs Discriminator Generator P(real) = 1 Generator tries to improve its imitation of the data. Real data Fake data
  • 212. 6.S191 Introduction to Deep Learning introtodeeplearning.com 1/29/19 Intuition behind GANs Discriminator Generator P(real) = 1 Discriminator tries to predict what’s real and what’s fake. Real data Fake data
  • 216. 6.S191 Introduction to Deep Learning introtodeeplearning.com 1/29/19 Intuition behind GANs Discriminator Generator P(real) = 1 Generator tries to improve its imitation of the data. Real data Fake data
  • 219. 6.S191 Introduction to Deep Learning introtodeeplearning.com 1/29/19 Intuition behind GANs Discriminator Generator P(real) = 1 Real data Fake data Discriminator tries to identify real data from fakes created by the generator. Generator tries to create imitations of data to trick the discriminator.
  • 220. 6.S191 Introduction to Deep Learning introtodeeplearning.com 1/29/19 Training GANs Discriminator tries to identify real data from fakes created by the generator. Generator tries to create imitations of data to trick the discriminator. Train GAN jointly via minimax game: $\min_{\theta_g} \max_{\theta_d} \; \mathbb{E}_{x \sim p_{data}}\!\left[\log D_{\theta_d}(x)\right] + \mathbb{E}_{z \sim p(z)}\!\left[\log\left(1 - D_{\theta_d}(G_{\theta_g}(z))\right)\right]$ Discriminator wants to maximize the objective s.t. $D(x)$ is close to 1 and $D(G(z))$ is close to 0. Generator wants to minimize the objective s.t. $D(G(z))$ is close to 1. [11]
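A minimal sketch of one training step of this minimax game in TensorFlow, using the toy generator and discriminator sketched earlier. Note that, as is common in practice, the generator here is trained to maximize log D(G(z)) (the non-saturating variant) rather than to literally minimize log(1 - D(G(z))); optimizers and the latent dimension are illustrative choices.

import tensorflow as tf

cross_entropy = tf.keras.losses.BinaryCrossentropy(from_logits=True)
g_opt = tf.keras.optimizers.Adam(1e-4)
d_opt = tf.keras.optimizers.Adam(1e-4)

@tf.function
def train_step(x_real, latent_dim=100):
    z = tf.random.normal([tf.shape(x_real)[0], latent_dim])
    with tf.GradientTape() as g_tape, tf.GradientTape() as d_tape:
        x_fake = generator(z, training=True)
        logits_real = discriminator(x_real, training=True)
        logits_fake = discriminator(x_fake, training=True)
        # Discriminator: push D(x_real) toward 1 and D(G(z)) toward 0.
        d_loss = cross_entropy(tf.ones_like(logits_real), logits_real) + \
                 cross_entropy(tf.zeros_like(logits_fake), logits_fake)
        # Generator: push D(G(z)) toward 1 to fool the discriminator.
        g_loss = cross_entropy(tf.ones_like(logits_fake), logits_fake)
    d_opt.apply_gradients(zip(d_tape.gradient(d_loss, discriminator.trainable_variables),
                              discriminator.trainable_variables))
    g_opt.apply_gradients(zip(g_tape.gradient(g_loss, generator.trainable_variables),
                              generator.trainable_variables))
    return d_loss, g_loss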
  • 221. 6.S191 Introduction to Deep Learning introtodeeplearning.com 1/29/19 Why GANs? A. Courville, 6.S191 2018.
  • 223. 6.S191 Introduction to Deep Learning introtodeeplearning.com 1/29/19 Generating new data with GANs After training, use generator network to create new data that’s never been seen before. Noise z → Generator G → X_fake
  • 225. 6.S191 Introduction to Deep Learning introtodeeplearning.com 1/29/19 Progressive growing of GANs (NVIDIA) Karras et al., ICLR 2018. [12]
  • 226. 6.S191 Introduction to Deep Learning introtodeeplearning.com 1/29/19 Progressive growing of GANs: results Karras et al., ICLR 2018. [12]
  • 227. 6.S191 Introduction to Deep Learning introtodeeplearning.com 1/29/19 Style-based generator: results Karras et al., arXiv 2018.
  • 228. 6.S191 Introduction to Deep Learning introtodeeplearning.com 1/29/19 Style-based transfer: results Karras et al., arXiv 2018.
  • 229. 6.S191 Introduction to Deep Learning introtodeeplearning.com 1/29/19 CycleGAN: domain transformation CycleGAN learns transformations across domains with unpaired data. Zhu et al., ICCV 2017.
  • 230. 6.S191 Introduction to Deep Learning introtodeeplearning.com 1/29/19 Deep Generative Modeling: Summary Autoencoders and Variational Autoencoders (VAEs): learn a lower-dimensional latent space and sample from it to generate input reconstructions. Generative Adversarial Networks (GANs): competing generator and discriminator networks.
  • 231. 6.S191 Introduction to Deep Learning introtodeeplearning.com 1/30/19 T-shirts! Today!
  • 232. 6.S191 Introduction to Deep Learning introtodeeplearning.com 1/30/19 Course Schedule
  • 233. 6.S191 Introduction to Deep Learning introtodeeplearning.com 1/30/19 Final Class Project • Judged by a panel of industry judges • Top winners are awarded: 3x NVIDIA RTX 2080Ti MSRP: $4000 4x Google Home MSRP: $400 Option 1: Proposal Presentation • Present a novel deep learning research idea or application • Groups of 1 welcome • Listeners welcome • Groups of 2 to 4 to be eligible for prizes, incl. 1 for-credit student • 3 minutes • Proposal instructions: goo.gl/JGJ5E7
  • 234. 6.S191 Introduction to Deep Learning introtodeeplearning.com 1/30/19 Final Class Project Option 1: Proposal Presentation • Present a novel deep learning research idea or application • Groups of 1 welcome • Listeners welcome • Groups of 2 to 4 to be eligible for prizes, incl. 1 for-credit student • 3 minutes • Proposal instructions: goo.gl/JGJ5E7 Proposal Logistics • >= 1 for-credit student to be eligible for prizes • Prepare slides on Google Slides • Group submit by today 10pm: goo.gl/rV6rLK • In class project work: Thu, Jan 31 • Slide submit by Thu 11:59 pm: goo.gl/7smL8w • Presentations on Friday, Feb 1
  • 235. 6.S191 Introduction to Deep Learning introtodeeplearning.com 1/30/19 Final Class Project Option 2:Write a 1-page review of a deep learning paper • Grade is based on clarity of writing and technical communication of main ideas • Due Friday 1:00pm (before lecture) Option 1: Proposal Presentation • Present a novel deep learning research idea or application • Groups of 1 welcome • Listeners welcome • Groups of 2 to 4 to be eligible for prizes, incl. 1 for-credit student • 3 minutes • Proposal instructions: goo.gl/JGJ5E7
  • 236. 6.S191 Introduction to Deep Learning introtodeeplearning.com 1/30/19 Thursday: Visualization in ML + Biologically Inspired Learning Fernanda Viegas, Co-Director Google PAIR Data Visualization for Machine Learning Dmitry Krotov, MIT-IBM Watson AI Lab Biologically Inspired Deep Learning Final project work Ask us questions! Open office hours! Work with group members!
  • 237. 6.S191 Introduction to Deep Learning introtodeeplearning.com 1/30/19 Friday: Learning and Perception + Project Proposals + Awards + Pizza Jan Kautz, VP of Research Learning and Perception Project Proposals! Judging and Awards! Pizza Celebration!
  • 238. So far in 6.S191…
  • 239. 6.S191 Introduction to Deep Learning introtodeeplearning.com 1/30/19 The Rise of Deep Learning
  • 240. 6.S191 Introduction to Deep Learning introtodeeplearning.com 1/30/19 So far in 6.S191… Data • Signals • Images • Sensors …
  • 241. 6.S191 Introduction to Deep Learning introtodeeplearning.com 1/30/19 So far in 6.S191… Data • Signals • Images • Sensors … Decision • Prediction • Detection • Action …
  • 243. 6.S191 Introduction to Deep Learning introtodeeplearning.com 1/30/19 Power of Neural Nets Universal Approximation Theorem A feedforward network with a single hidden layer is sufficient to approximate, to an arbitrary precision, any continuous function. Hornik et al. Neural Networks. (1989)
  • 244. 6.S191 Introduction to Deep Learning introtodeeplearning.com 1/30/19 Power of Neural Nets Universal Approximation Theorem A feedforward network with a single hidden layer is sufficient to approximate, to an arbitrary precision, any continuous function. Caveats: The number of hidden units may be infeasibly large The resulting model may not generalize Hornik et al. Neural Networks. (1989)
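A toy illustration of the theorem in Keras: a single wide hidden layer fit to a 1-D function. The target function, hidden width, and training budget are arbitrary choices for illustration, and the approximation quality depends on them, which is exactly the caveat above.

import numpy as np
import tensorflow as tf

# A smooth 1-D target function to approximate.
x = np.linspace(-2 * np.pi, 2 * np.pi, 2000).reshape(-1, 1).astype("float32")
y = np.sin(x)

# One (wide) hidden layer followed by a linear output.
model = tf.keras.Sequential([
    tf.keras.layers.Dense(256, activation="relu"),  # the single hidden layer
    tf.keras.layers.Dense(1),
])
model.compile(optimizer="adam", loss="mse")
model.fit(x, y, epochs=200, batch_size=64, verbose=0)  # fit improves with more width / epochs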
  • 245. 6.S191 Introduction to Deep Learning introtodeeplearning.com 1/30/19 Artificial Intelligence “Hype”: Historical Perspective
  • 247. 6.S191 Introduction to Deep Learning introtodeeplearning.com 1/30/19 Rethinking Generalization dog banana dog tree “Understanding deep learning requires rethinking generalization” Zhang et al. ICLR. (2017)
  • 249. 6.S191 Introduction to Deep Learning introtodeeplearning.com 1/30/19 Rethinking Generalization dog banana dog tree → banana dog tree dog (labels reassigned at random) “Understanding deep learning requires rethinking generalization” Zhang et al. ICLR. (2017)
  • 251. 6.S191 Introduction to Deep Learning introtodeeplearning.com 1/30/19 Capacity of Deep Neural Networks [Plot: accuracy (0%–100%) vs. degree of label randomization (original labels → completely random), for the training set and the testing set] Zhang et al. ICLR. (2017)
  • 253. 6.S191 Introduction to Deep Learning introtodeeplearning.com 1/30/19 Capacity of Deep Neural Networks [Plot: accuracy (0%–100%) vs. degree of label randomization (original labels → completely random); training-set accuracy stays at 100% while testing-set accuracy drops] Modern deep networks can perfectly fit to random data Zhang et al. ICLR. (2017)
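A minimal sketch of this experiment: shuffle the CIFAR-10 labels and fit them with a ConvNet. The architecture and training budget below are arbitrary stand-ins (not the networks used in the paper), and reaching 100% training accuracy may require a larger model or many more epochs.

import numpy as np
import tensorflow as tf

(x_train, y_train), _ = tf.keras.datasets.cifar10.load_data()
x_train = x_train.astype("float32") / 255.0
y_random = np.random.permutation(y_train)  # destroy any relationship between images and labels

model = tf.keras.Sequential([
    tf.keras.layers.Conv2D(64, 3, activation="relu"),
    tf.keras.layers.MaxPooling2D(),
    tf.keras.layers.Conv2D(128, 3, activation="relu"),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(512, activation="relu"),
    tf.keras.layers.Dense(10),
])
model.compile(optimizer="adam",
              loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
              metrics=["accuracy"])
# With enough capacity and epochs, training accuracy on the random labels approaches 100%,
# while accuracy on held-out data stays at chance.
model.fit(x_train, y_random, epochs=50, batch_size=128)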
  • 254. 6.S191 Introduction to Deep Learning introtodeeplearning.com 1/30/19 Neural Networks as Function Approximators Neural networks are excellent function approximators
  • 256. 6.S191 Introduction to Deep Learning introtodeeplearning.com 1/30/19 Neural Networks as Function Approximators Neural networks are excellent function approximators ?
  • 259. 6.S191 Introduction to Deep Learning introtodeeplearning.com 1/30/19 Neural Networks as Function Approximators Neural networks are excellent function approximators …when they have training data How do we know when our network doesn’t know?
  • 260. 6.S191 Introduction to Deep Learning introtodeeplearning.com 1/30/19 Adversarial Attacks on Neural Networks Despois. “Adversarial examples and their implications” (2017).
  • 261. 6.S191 Introduction to Deep Learning introtodeeplearning.com 1/30/19 Adversarial Attacks on Neural Networks
  • 262. 6.S191 Introduction to Deep Learning introtodeeplearning.com 1/30/19 Adversarial Attacks on Neural Networks Remember: We train our networks with gradient descent $W \leftarrow W - \eta \, \frac{\partial J(W, x, y)}{\partial W}$ “How does a small change in weights decrease our loss”
  • 264. 6.S191 Introduction to Deep Learning introtodeeplearning.com 1/30/19 Adversarial Attacks on Neural Networks Remember: We train our networks with gradient descent $W \leftarrow W - \eta \, \frac{\partial J(W, x, y)}{\partial W}$ “How does a small change in weights decrease our loss” Fix your image x, and true label y
  • 265. 6.S191 Introduction to Deep Learning introtodeeplearning.com 1/30/19 Adversarial Attacks on Neural Networks Adversarial Image: Modify image to increase error $x \leftarrow x + \eta \, \frac{\partial J(W, x, y)}{\partial x}$ “How does a small change in the input increase our loss” Goodfellow et al. NIPS (2014)
  • 266. 6.S191 Introduction to Deep Learning introtodeeplearning.com 1/30/19 Adversarial Attacks on Neural Networks Adversarial Image: Modify image to increase error $x \leftarrow x + \eta \, \frac{\partial J(W, x, y)}{\partial x}$ “How does a small change in the input increase our loss”
  • 267. 6.S191 Introduction to Deep Learning introtodeeplearning.com 1/30/19 Adversarial Attacks on Neural Networks Adversarial Image: Modify image to increase error $x \leftarrow x + \eta \, \frac{\partial J(W, x, y)}{\partial x}$ “How does a small change in the input increase our loss” Fix your weights W, and true label y
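A minimal sketch of this gradient-ascent-on-the-input idea in TensorFlow, shown in the one-step, sign-of-gradient form popularized by Goodfellow et al.; the model, image, and integer label are assumed to exist already, and eps is an illustrative step size.

import tensorflow as tf

loss_fn = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)

def adversarial_example(model, image, label, eps=0.01):
    # Fix the weights and the true label; perturb the input so as to *increase* the loss.
    image = tf.convert_to_tensor(image)
    with tf.GradientTape() as tape:
        tape.watch(image)                       # track gradients w.r.t. the input, not the weights
        loss = loss_fn(label, model(image))
    grad = tape.gradient(loss, image)
    return image + eps * tf.sign(grad)          # small step in the direction that increases the loss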
  • 268. 6.S191 Introduction to Deep Learning introtodeeplearning.com 1/30/19 Synthesizing Robust Adversarial Examples Athalye et al. ICML. (2018)
  • 269. 6.S191 Introduction to Deep Learning introtodeeplearning.com 1/30/19 Neural Network Limitations… • Very data hungry (eg. often millions of examples) • Computationally intensive to train and deploy (tractably requires GPUs) • Easily fooled by adversarial examples • Can be subject to algorithmic bias • Poor at representing uncertainty (how do you know what the model knows?) • Uninterpretable black boxes, difficult to trust • Finicky to optimize: non-convex, choice of architecture, learning parameters • Often require expert knowledge to design, fine tune architectures
  • 271. New Frontiers 1: Bayesian Deep Learning
  • 272. 6.S191 Introduction to Deep Learning introtodeeplearning.com 1/30/19 Why Care About Uncertainty? OR ℙ(cat) ℙ(dog)
  • 273. 6.S191 Introduction to Deep Learning introtodeeplearning.com 1/30/19 Why Care About Uncertainty? ℙ(cat) = 0.2 ℙ(dog) = 0.8 Remember: ℙ(cat) + ℙ(dog) = 1
  • 274. 6.S191 Introduction to Deep Learning introtodeeplearning.com 1/30/19 Bayesian Deep Learning for Uncertainty Network tries to learn output, y, directly from raw data, x Find mapping, f, parameterized by weights W such that: $\min_W \mathcal{L}(y, f(x; W))$ Bayesian neural networks aim to learn a posterior over weights, $\mathbb{P}(W \mid x, y)$: $\mathbb{P}(W \mid x, y) = \frac{\mathbb{P}(y \mid x, W)\,\mathbb{P}(W)}{\mathbb{P}(y \mid x)}$
  • 275. 6.S191 Introduction to Deep Learning introtodeeplearning.com 1/30/19 Bayesian Deep Learning for Uncertainty Network tries to learn output, y, directly from raw data, x Find mapping, f, parameterized by weights W such that: $\min_W \mathcal{L}(y, f(x; W))$ Bayesian neural networks aim to learn a posterior over weights, $\mathbb{P}(W \mid x, y)$: $\mathbb{P}(W \mid x, y) = \frac{\mathbb{P}(y \mid x, W)\,\mathbb{P}(W)}{\mathbb{P}(y \mid x)}$ Intractable!
  • 276. 6.S191 Introduction to Deep Learning introtodeeplearning.com 1/30/19 Elementwise Dropout for Uncertainty Evaluate $T$ stochastic forward passes through the network, with sampled weights $\{W_t\}_{t=1}^{T}$ Dropout as a form of stochastic sampling: $z_{i,t} \sim \text{Bernoulli}(p) \;\; \forall i \in W$; unregularized kernel $W$ ⊙ Bernoulli dropout $z_t$ = stochastic sampled $W_t$ Predictive mean: $\mathbb{E}[y \mid x] \approx \frac{1}{T}\sum_{t=1}^{T} f(x; W_t)$ Predictive variance: $\mathrm{Var}[y \mid x] \approx \frac{1}{T}\sum_{t=1}^{T} f(x; W_t)^2 - \mathbb{E}[y \mid x]^2$ Amini, Soleimany, et al., NIPS Workshop on Bayesian Deep Learning, 2017. Gal and Ghahramani, ICML, 2016.
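A minimal sketch of Monte Carlo dropout at test time in Keras: keep dropout active by calling the model with training=True and aggregate T stochastic forward passes. The small regression network, dropout rate, and input size are illustrative assumptions, not a model from the lecture.

import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Dense(128, activation="relu"),
    tf.keras.layers.Dropout(0.5),   # a fresh Bernoulli mask is sampled on every call with training=True
    tf.keras.layers.Dense(1),
])

def mc_dropout_predict(model, x, T=100):
    # T stochastic forward passes, each corresponding to a sampled weight configuration W_t.
    preds = tf.stack([model(x, training=True) for _ in range(T)], axis=0)
    mean = tf.reduce_mean(preds, axis=0)                                   # predictive mean
    var = tf.reduce_mean(tf.square(preds), axis=0) - tf.square(mean)       # predictive variance (model uncertainty)
    return mean, var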
  • 277. 6.S191 Introduction to Deep Learning introtodeeplearning.com 1/30/19 Model Uncertainty Application Input image | Predicted depth | Model uncertainty Kendall, Gal, NIPS, 2017.
  • 278. 6.S191 Introduction to Deep Learning introtodeeplearning.com 1/30/19 Multi-Task Learning Using Uncertainty Kendall, et al., CVPR, 2018.
  • 282. 6.S191 Introduction to Deep Learning introtodeeplearning.com 1/30/19 Motivation: Learning to Learn Standard deep neural networks are optimized for a single task Often require expert knowledge to build an architecture for a given task Complexity of models increases Greater need for specialized engineers
  • 283. 6.S191 Introduction to Deep Learning introtodeeplearning.com 1/30/19 Motivation: Learning to Learn Standard deep neural networks are optimized for a single task Often require expert knowledge to build an architecture for a given task Complexity of models increases Greater need for specialized engineers Build a learning algorithm that learns which model to use to solve a given problem
  • 284. 6.S191 Introduction to Deep Learning introtodeeplearning.com 1/30/19 Motivation: Learning to Learn Standard deep neural networks are optimized for a single task Often require expert knowledge to build an architecture for a given task Complexity of models increases Greater need for specialized engineers Build a learning algorithm that learns which model to use to solve a given problem AutoML
  • 285. 6.S191 Introduction to Deep Learning introtodeeplearning.com 1/30/19 AutoML: Learning to Learn Zoph and Le, ICLR 2017.
  • 286. 6.S191 Introduction to Deep Learning introtodeeplearning.com 1/30/19 AutoML: Model Controller At each step, the model samples a brand new network Zoph and Le, ICLR 2017.
  • 287. 6.S191 Introduction to Deep Learning introtodeeplearning.com 1/30/19 AutoML: The Child Network Sampled network from RNN → train on the Training Data → Prediction → compute final accuracy on this dataset. Update RNN controller based on the accuracy of the child network after training. Zoph and Le, ICLR 2017.
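A schematic sketch of this controller/child loop in Python. The helpers controller.sample_architecture, build_child, and controller.update are hypothetical placeholders standing in for the RNN controller, child-network construction, and the policy-gradient update described by Zoph and Le; this is an outline of the loop, not their implementation.

def automl_search(controller, train_data, val_data, num_iterations=1000):
    for step in range(num_iterations):
        # 1. The controller (an RNN) samples a description of a brand new child network.
        architecture = controller.sample_architecture()   # hypothetical method
        child = build_child(architecture)                 # hypothetical builder

        # 2. Train the child network on the training data.
        child.fit(train_data)

        # 3. Evaluate the trained child; its accuracy on held-out data is the reward signal.
        reward = child.evaluate(val_data)

        # 4. Update the RNN controller so that architectures yielding higher accuracy
        #    become more likely to be sampled (e.g., via a policy-gradient update).
        controller.update(reward)                         # hypothetical update
    return controller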
  • 288. 6.S191 Introduction to Deep Learning introtodeeplearning.com 1/30/19 AutoML on the Cloud Google Cloud.
  • 289. AutoML Spawns a Powerful Idea • Design an AI algorithm that can build new models capable of solving a task • Reduces the need for experienced engineers to design the networks • Makes deep learning more accessible to the public Connection to Artificial General Intelligence: the ability to intelligently reason about how we learn