Copyright © 2017 DeepScale 1
A Shallow Dive into Training Deep
Neural Networks
Sammy Sidhu
May 2017
Copyright © 2017 DeepScale 2
• Perception systems for autonomous vehicles
• Focusing on enabling technologies for mass-produced autonomous
vehicles
• Working with a number of OEMs and automotive suppliers
• Open Source ☺
• Visit http://deepscale.ai
About DeepScale
Copyright © 2017 DeepScale 3
• Feature Engineering vs. Learned Features
• Neural Network Review
• Loss Function (Objective Function)
• Gradients
• Optimization Techniques
• Datasets
• Overfitting and Underfitting
Overview
Copyright © 2017 DeepScale 4
Feature Engineering vs. Learned Features
Example of hand-written features for face detection
Copyright © 2017 DeepScale 5
• Feature Engineering for computer vision can work well
• Very time-consuming to find useful features
• Requires BOTH domain expertise and programming know-how
• Hard to generalize to all cases (illumination, pose, and other variations in the domain)
• Can use generalized features like HOG/SIFT, but accuracy suffers
Feature Engineering vs. Learned Features (Cont’d.)
Copyright © 2017 DeepScale 6
Feature Engineering vs. Learned Features (Cont’d.)
Example of learned features of a CNN for facial
classification [DeepFace CVPR14]
Copyright © 2017 DeepScale 7
• Learned Features for computer vision can work extremely well
• Image Classification: 5.71% vs. 26.2% error [ResNet-152 vs. SIFT sparse]
• Only requires labeled data, deep learning expertise, and computing power
• “Training” the network is essentially learning features layer by layer
• The deeper you go, the more complex the features become
• Hard to perform validation beyond putting in data and seeing what happens
Feature Engineering vs. Learned Features (Cont’d.)
Copyright © 2017 DeepScale 8
y = f_w(x)
where w is a set of parameters we can learn and f is a nonlinear function
A neural network can be seen as a function approximator
Neural Networks — Quick Review
Typical nonlinear functions in DNNs
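As a concrete illustration of the nonlinear functions mentioned on this slide, here is a minimal NumPy sketch of three activations commonly used in DNNs (the specific functions shown are common examples, not ones taken from the slide):

```python
import numpy as np

def relu(x):
    # Rectified Linear Unit: pass positive values through, zero out negatives
    return np.maximum(0.0, x)

def sigmoid(x):
    # Squash any real number into (0, 1)
    return 1.0 / (1.0 + np.exp(-x))

def tanh(x):
    # Squash any real number into (-1, 1)
    return np.tanh(x)

x = np.linspace(-3, 3, 7)
print(relu(x))
print(sigmoid(x))
print(tanh(x))
```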
Copyright © 2017 DeepScale 9
• Take the example of linear regression
• Given data, we fit a line (y = mx + b) that minimizes the sum of squared differences (a squared-error / Euclidean distance loss function)
• This function that we minimize is the loss function
• An example would be predicting house value given square footage and median income
• f(sqft, income) --> value, where value is in [0, inf] dollars
• We want to minimize L(actual_value, predicted_value), where L is the loss function (a minimal sketch follows this slide)
Loss Function (Objective Function)
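To make the house-price example concrete, here is a rough sketch of a squared-error loss for a simple linear model; the data, weights, and exact form of L are illustrative assumptions, not values from the slides:

```python
import numpy as np

def squared_error_loss(actual_value, predicted_value):
    # Mean of 1/2 * (actual - predicted)^2 over all samples
    return 0.5 * np.mean((actual_value - predicted_value) ** 2)

# Hypothetical data: columns are [sqft, median income], targets are house values in dollars
X = np.array([[1000.0, 60_000.0],
              [1500.0, 80_000.0],
              [2000.0, 95_000.0]])
actual_value = np.array([300_000.0, 450_000.0, 600_000.0])

# A simple linear model: f(sqft, income) = w1*sqft + w2*income + b
w = np.array([200.0, 1.0])
b = 10_000.0
predicted_value = X @ w + b

print("loss:", squared_error_loss(actual_value, predicted_value))
```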
Copyright © 2017 DeepScale 10
Loss Function (Objective Function) (Cont’d.)
Copyright © 2017 DeepScale 11
Loss Function (Objective Function) (Cont’d.)
• Another loss function is the Softmax loss for classification
• This is useful when we want to predict the probability of an event
• For example: predict whether an image is of a cat or a dog (see the sketch below)
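A minimal sketch of a softmax loss for the cat-vs.-dog example; the scores (logits) and class indices below are made up for illustration:

```python
import numpy as np

def softmax(scores):
    # Subtract the max for numerical stability, then normalize to probabilities
    exp = np.exp(scores - np.max(scores))
    return exp / np.sum(exp)

def softmax_loss(scores, true_class):
    # Negative log-probability assigned to the correct class
    return -np.log(softmax(scores)[true_class])

# Hypothetical network outputs for the classes [cat, dog]
scores = np.array([2.0, 0.5])
print("P(cat), P(dog):", softmax(scores))
print("loss if the image is a cat:", softmax_loss(scores, true_class=0))
```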
Copyright © 2017 DeepScale 12
• Loss functions can be used for either classification or regression
• The goal is to pick a set of weights that makes this loss value as small as possible
• It is crucial to pick the right objective function for the task; e.g., one technically can use a squared loss to predict a probability, but it is a poor fit
Loss Function (Objective Function) (Cont’d.)
Copyright © 2017 DeepScale 13
• Now if we have a loss function and a neural network, how do we know
what part of the network is “responsible” for causing that error?
• Let’s go back to the simple linear regression!
Gradients
Copyright © 2017 DeepScale 14
• Let’s define the loss function
• L = (1/2)(Y − Ŷ)², where Ŷ is the predicted value
• Let’s then take the derivative to see how Ŷ contributes to the loss L
• dL/dŶ = −(Y − Ŷ) = Ŷ − Y
• We’re fitting a line
• Ŷ = mX + b
• Two weights to optimize (slope and bias)
• dŶ/dm = X, dŶ/db = 1
Gradients (Cont’d.)
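To sanity-check the derivative above, here is a small sketch that compares the analytic dL/dŶ = Ŷ − Y against a numerical finite-difference estimate; the values of Y and Ŷ are arbitrary:

```python
def loss(Y, Y_hat):
    # L = 1/2 * (Y - Y_hat)^2
    return 0.5 * (Y - Y_hat) ** 2

Y, Y_hat = 7.0, 4.0

# Analytic derivative from the slide
dL_dY_hat = Y_hat - Y

# Numerical finite-difference check
eps = 1e-6
dL_dY_hat_num = (loss(Y, Y_hat + eps) - loss(Y, Y_hat - eps)) / (2 * eps)

print(dL_dY_hat, dL_dY_hat_num)  # -3.0 and approximately -3.0
```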
Copyright © 2017 DeepScale 15
Gradients (Cont’d.)
Figure: line with noise to fit; surface of the loss w.r.t. slope and bias (m, b)
https://spin.atomicobject.com/2014/06/24/gradient-descent-linear-regression/
Copyright © 2017 DeepScale 16
• We know dL/dŶ = Ŷ − Y, dŶ/dm = X, and dŶ/db = 1
• To optimize our line [slope and bias] we use the chain rule!
• dL/dm = (dL/dŶ)(dŶ/dm) = X(Ŷ − Y) and dL/db = (dL/dŶ)(dŶ/db) = (Ŷ − Y)
• Together, these two derivatives make a gradient!
• We update our weights by stepping against the gradient
• m = m − α(dL/dm) and b = b − α(dL/db)
• where α is the learning rate (a minimal sketch follows after this slide)
Gradients (Cont’d.)
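Putting the chain rule and the update rule together, here is a minimal batch gradient-descent sketch for fitting a line; the synthetic data, learning rate, and number of steps are assumptions made for the example, not values from the slides:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic noisy line: Y = 2X + 1 + noise
X = rng.uniform(0, 10, size=100)
Y = 2.0 * X + 1.0 + rng.normal(0, 0.5, size=100)

m, b = 0.0, 0.0   # initial slope and bias
alpha = 0.02      # learning rate

for step in range(5000):
    Y_hat = m * X + b
    # Gradients averaged over the dataset (chain rule from the slide)
    dL_dm = np.mean(X * (Y_hat - Y))
    dL_db = np.mean(Y_hat - Y)
    # Step against the gradient
    m -= alpha * dL_dm
    b -= alpha * dL_db

print(m, b)  # should end up close to 2 and 1
```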
Copyright © 2017 DeepScale 17
• How to minimize loss?
• Walk down surface via gradient steps until you reach the minimum!
Gradients (Cont’d.)
https://github.com/mattnedrich/GradientDescentExample
Copyright © 2017 DeepScale 18
• Gradient descent is not limited to linear regression
• We can take derivatives with respect to any parameter in the neural network
• To avoid mathematical complexity and recomputation, we can use the chain rule again
• We can even do this through nonlinear functions that are not differentiable everywhere (e.g., ReLU); a small sketch follows this slide
Gradients (Cont’d.)
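To illustrate pushing the chain rule through a nonlinearity, here is a hypothetical one-hidden-unit example with a ReLU; the weights, input, and target are arbitrary, and the network is far smaller than anything used in practice:

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

# Tiny network: y_hat = w2 * relu(w1 * x), with a squared-error loss against y
x, y = 1.5, 2.0
w1, w2 = 0.8, -0.4

# Forward pass
h = w1 * x        # pre-activation
a = relu(h)       # activation
y_hat = w2 * a
L = 0.5 * (y - y_hat) ** 2

# Backward pass via the chain rule
dL_dy_hat = y_hat - y
dL_dw2 = dL_dy_hat * a
dL_da = dL_dy_hat * w2
dL_dh = dL_da * (1.0 if h > 0 else 0.0)  # ReLU derivative: 1 where h > 0, else 0
dL_dw1 = dL_dh * x

print(L, dL_dw1, dL_dw2)
```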
Copyright © 2017 DeepScale 19
Gradients (Cont’d.)
• This process of computing and applying gradient updates to a neural network, layer by layer, is called backpropagation
Copyright © 2017 DeepScale 20
• Given that we now have gradients and weights, what is the best way to apply the updates?
• In the previous linear regression example:
• Grab a random sample and apply updates to the slope and bias
• Repeat until convergence
• Known as Stochastic Gradient Descent (SGD); a minimal sketch follows this slide
• Can we do better at finding the best possible set of weights to minimize the loss? (Optimization)
Optimization Techniques
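A minimal sketch of the SGD procedure described on this slide, again on the line-fitting example; the data, learning rate, and step count are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.uniform(0, 10, size=200)
Y = 2.0 * X + 1.0 + rng.normal(0, 0.5, size=200)

m, b, alpha = 0.0, 0.0, 0.005

for step in range(20_000):
    i = rng.integers(len(X))             # grab one random sample
    y_hat = m * X[i] + b
    m -= alpha * X[i] * (y_hat - Y[i])   # single-sample gradient step for the slope
    b -= alpha * (y_hat - Y[i])          # and for the bias

print(m, b)  # roughly 2 and 1, noisier than a full-batch fit
```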
Copyright © 2017 DeepScale 21
• Momentum
• Keep a running average of previous updates and add it to each update (see the sketch below)
Optimization Techniques (Cont’d.)
Figure: steps without momentum vs. steps with momentum
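A sketch of the momentum update described above; the momentum coefficient (0.9) and toy objective are common illustrative choices, not values from the slide:

```python
def momentum_step(w, velocity, grad_fn, lr=0.05, mu=0.9):
    # Keep a running (exponentially decaying) average of past gradients
    # and step along that smoothed direction.
    g = grad_fn(w)
    velocity = mu * velocity + g
    w = w - lr * velocity
    return w, velocity

# Example: minimize L(w) = (w - 3)^2, so dL/dw = 2*(w - 3)
w, v = 0.0, 0.0
for _ in range(200):
    w, v = momentum_step(w, v, grad_fn=lambda w: 2.0 * (w - 3.0))
print(w)  # close to 3
```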
Copyright © 2017 DeepScale 22
• AdaGrad, AdaProp, RMSProp, ADAM
• Automatically tune the learning rate to reach convergence in fewer updates
• Great for fast convergence
• Sometimes finicky for reaching the lowest loss possible for a network
Optimization Techniques (Cont’d.)
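As a rough sketch of how an adaptive optimizer works, here is an Adam-style update; the hyperparameters shown are the commonly cited defaults, which the slides do not specify:

```python
import numpy as np

def adam_step(w, g, state, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    # Keep running averages of the gradient (m) and squared gradient (v),
    # then scale each step by 1/sqrt(v) so the effective learning rate adapts.
    m, v, t = state
    t += 1
    m = beta1 * m + (1 - beta1) * g
    v = beta2 * v + (1 - beta2) * g * g
    m_hat = m / (1 - beta1 ** t)   # bias correction for the running averages
    v_hat = v / (1 - beta2 ** t)
    w = w - lr * m_hat / (np.sqrt(v_hat) + eps)
    return w, (m, v, t)

# Example: minimize L(w) = (w - 3)^2
w = np.array(0.0)
state = (np.zeros_like(w), np.zeros_like(w), 0)
for _ in range(2000):
    g = 2.0 * (w - 3.0)
    w, state = adam_step(w, g, state, lr=0.01)
print(w)  # approaches 3
```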
Copyright © 2017 DeepScale 23
Optimization Techniques (Cont’d.)
Copyright © 2017 DeepScale 24
• When it comes to neural networks, you want a dataset that is diverse and large enough to train your network without overfitting (more on this later)
• You can also augment your data to generate more samples (see the sketch below)
• Rotations / reflections, when they make sense
• Added noise / hue / contrast changes
• This is extremely useful when you have rare sample classes
Datasets
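A minimal sketch of the kinds of augmentation listed above, applied to an image stored as a NumPy array; the specific transforms and magnitudes are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

def augment(image):
    # image: H x W x 3 float array with values in [0, 1]
    out = image.copy()
    if rng.random() < 0.5:                       # horizontal reflection, when it makes sense
        out = out[:, ::-1, :]
    out = out + rng.normal(0, 0.02, out.shape)   # small additive noise
    out = out * rng.uniform(0.8, 1.2)            # simple brightness/contrast jitter
    return np.clip(out, 0.0, 1.0)

image = rng.random((32, 32, 3))
augmented = [augment(image) for _ in range(4)]   # several new samples from one original
print(len(augmented), augmented[0].shape)
```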
Copyright © 2017 DeepScale 25
Datasets (Cont’d.)
MNIST
Copyright © 2017 DeepScale 26
Datasets (Cont’d.)
CIFAR-10
Copyright © 2017 DeepScale 27
Datasets (Cont’d.)
ImageNet
Copyright © 2017 DeepScale 28
• What is Overfitting?
• Fitting to the training data but not generalizing well
• What is Underfitting?
• The model does not capture the trends in the data
• How to tell?
Overfitting and Underfitting
Copyright © 2017 DeepScale 29
Overfitting and Underfitting (Cont’d.)
Copyright © 2017 DeepScale 30
• We can split our labeled data into 3 disjoint parts
• Training set, validation set, test set (see the sketch below)
• During training
• “Learn” via the training set
• Evaluate the model every epoch with the validation set
• After training
• Test the model with the test set, which the model hasn’t seen before
Overfitting and Underfitting (Cont’d.)
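A minimal sketch of the three-way split described above; the 80/10/10 proportions are a common convention, not something specified on the slide:

```python
import numpy as np

rng = np.random.default_rng(0)

def train_val_test_split(X, y, val_frac=0.1, test_frac=0.1):
    # Shuffle indices, then carve out three disjoint sets
    idx = rng.permutation(len(X))
    n_test = int(len(X) * test_frac)
    n_val = int(len(X) * val_frac)
    test_idx = idx[:n_test]
    val_idx = idx[n_test:n_test + n_val]
    train_idx = idx[n_test + n_val:]
    return ((X[train_idx], y[train_idx]),
            (X[val_idx], y[val_idx]),
            (X[test_idx], y[test_idx]))

X, y = np.arange(1000).reshape(-1, 1), np.arange(1000)
train, val, test = train_val_test_split(X, y)
print(len(train[0]), len(val[0]), len(test[0]))  # 800, 100, 100
```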
Copyright © 2017 DeepScale 31
Overfitting and Underfitting (Cont’d.)
• Overfitting occurs when
• Training loss is low but validation and test loss are high
Copyright © 2017 DeepScale 32
• How to combat overfitting? (sketches of the last two ideas below)
• More data
• Data augmentation
• Regularization (weight decay)
• Add the magnitude of the weights to the loss function
• Randomly drop some activations during training (Dropout)
• A simpler model?
Overfitting and Underfitting (Cont’d.)
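Minimal sketches of the last two ideas above: L2 weight decay added to the loss, and a dropout mask applied to activations during training. The decay strength and drop probability are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

def loss_with_weight_decay(data_loss, weights, decay=1e-4):
    # Add the (squared) magnitude of the weights to the loss
    return data_loss + decay * np.sum(weights ** 2)

def dropout(activations, p_drop=0.5, training=True):
    # Randomly zero out activations during training, scaling so the expected value is unchanged
    if not training:
        return activations
    mask = (rng.random(activations.shape) >= p_drop) / (1.0 - p_drop)
    return activations * mask

weights = rng.normal(size=100)
print(loss_with_weight_decay(data_loss=0.42, weights=weights))
print(dropout(np.ones(8)))
```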
Copyright © 2017 DeepScale 33
• Underfitting occurs when
• Training loss drops at first, then stops improving
• Training loss is still high
• Training loss tracks the validation loss
• Possible fixes: a more complex model, or turning down regularization
Overfitting and Underfitting (Cont’d.)
Copyright © 2017 DeepScale 34
• Neural nets are function approximators
• Deep learning can work surprisingly well
• Optimizing nets is an art that requires intuition
• Making good datasets is hard
• Overfitting makes it hard to generalize to real applications
• We can measure how robust our models are with held-out validation and test sets
Takeaways
Copyright © 2017 DeepScale 35
Thank you!
Questions?