Copyright © 2017 DeepScale 1
A Shallow Dive into Training Deep
Neural Networks
Sammy Sidhu
May 2017
Copyright © 2017 DeepScale 2
• Perception systems for autonomous vehicles
• Focusing on enabling technologies for mass-produced autonomous
vehicles
• Working with a number of OEMs and automotive suppliers
• Open Source ☺
• Visit http://deepscale.ai
About DeepScale
Copyright © 2017 DeepScale 3
• Feature Engineering vs. Learned Features
• Neural Network Review
• Loss Function (Objective Function)
• Gradients
• Optimization Techniques
• Datasets
• Overfitting and Underfitting
Overview
Copyright © 2017 DeepScale 4
Feature Engineering vs. Learned Features
Example of hand-written features for face detection
Copyright © 2017 DeepScale 5
• Feature Engineering for computer vision can work well
• Very time-consuming to find useful features
• Requires BOTH domain expertise and programming know-how
• Hard to generalize to all cases (illumination, pose, and other variations in the domain)
• Can use generalized features like HOG/SIFT, but accuracy suffers
Feature Engineering vs. Learned Features (Cont’d.)
Copyright © 2017 DeepScale 6
Feature Engineering vs. Learned Features (Cont’d.)
Example of learned features of a CNN for facial
classification [DeepFace CVPR14]
Copyright © 2017 DeepScale 7
• Learned Features for computer vision can work extremely well
• Image Classification: 5.71% vs. 26.2% error [ResNet-152 vs. SIFT sparse]
• Only requires labeled data, deep learning expertise, and computing power
• “Training” the network is essentially learning features layer by layer
• The deeper you go, the more complex the features become
• Hard to perform validation beyond putting in data and seeing what happens
Feature Engineering vs. Learned Features (Cont’d.)
Copyright © 2017 DeepScale 8
y = f_w(x)
where w is a set of parameters we can learn and f is a nonlinear function
A neural network can be seen as a function approximator
Neural Networks — Quick Review
Typical nonlinear functions in DNNs
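As a concrete illustration of the nonlinear functions mentioned on this slide, here is a minimal NumPy sketch of three activations commonly used in DNNs (the specific functions shown are common examples, not ones taken from the slide):

```python
import numpy as np

def relu(x):
    # Rectified Linear Unit: pass positive values through, zero out negatives
    return np.maximum(0.0, x)

def sigmoid(x):
    # Squash any real number into (0, 1)
    return 1.0 / (1.0 + np.exp(-x))

def tanh(x):
    # Squash any real number into (-1, 1)
    return np.tanh(x)

x = np.linspace(-3, 3, 7)
print(relu(x))
print(sigmoid(x))
print(tanh(x))
```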
Copyright © 2017 DeepScale 9
• Take the example of linear regression
• Given data, we fit a line (y = mx + b) that minimizes the sum of squared differences (a squared-error / Euclidean distance loss function)
• This function that we minimize is the loss function
• An example would be predicting house value given square footage and median income
• f(sqft, income) --> value, where value is in [0, inf] dollars
• We want to minimize L(actual_value, predicted_value), where L is the loss function (a minimal sketch follows this slide)
Loss Function (Objective Function)
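To make the house-price example concrete, here is a rough sketch of a squared-error loss for a simple linear model; the data, weights, and exact form of L are illustrative assumptions, not values from the slides:

```python
import numpy as np

def squared_error_loss(actual_value, predicted_value):
    # Mean of 1/2 * (actual - predicted)^2 over all samples
    return 0.5 * np.mean((actual_value - predicted_value) ** 2)

# Hypothetical data: columns are [sqft, median income], targets are house values in dollars
X = np.array([[1000.0, 60_000.0],
              [1500.0, 80_000.0],
              [2000.0, 95_000.0]])
actual_value = np.array([300_000.0, 450_000.0, 600_000.0])

# A simple linear model: f(sqft, income) = w1*sqft + w2*income + b
w = np.array([200.0, 1.0])
b = 10_000.0
predicted_value = X @ w + b

print("loss:", squared_error_loss(actual_value, predicted_value))
```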
Copyright © 2017 DeepScale 10
Loss Function (Objective Function) (Cont’d.)
Copyright © 2017 DeepScale 11
Loss Function (Objective Function) (Cont’d.)
• Another loss function is the Softmax loss for classification
• This is useful when we want to predict the probability of an event
• For example: predict whether an image is of a cat or a dog (see the sketch below)
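A minimal sketch of a softmax loss for the cat-vs.-dog example; the scores (logits) and class indices below are made up for illustration:

```python
import numpy as np

def softmax(scores):
    # Subtract the max for numerical stability, then normalize to probabilities
    exp = np.exp(scores - np.max(scores))
    return exp / np.sum(exp)

def softmax_loss(scores, true_class):
    # Negative log-probability assigned to the correct class
    return -np.log(softmax(scores)[true_class])

# Hypothetical network outputs for the classes [cat, dog]
scores = np.array([2.0, 0.5])
print("P(cat), P(dog):", softmax(scores))
print("loss if the image is a cat:", softmax_loss(scores, true_class=0))
```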
Copyright © 2017 DeepScale 12
• Loss functions can be used for either classification or regression
• The goal is to pick a set of weights that makes this loss value as small as possible
• It is crucial to pick the right objective function for the task; e.g., one technically can use a squared loss to predict a probability, but it is a poor fit
Loss Function (Objective Function) (Cont’d.)
Copyright © 2017 DeepScale 13
• Now if we have a loss function and a neural network, how do we know
what part of the network is “responsible” for causing that error?
• Let’s go back to the simple linear regression!
Gradients
Copyright © 2017 DeepScale 14
• Let’s define the loss function
• L = (1/2)(Y − Ŷ)², where Ŷ is the predicted value
• Let’s then take the derivative to see how Ŷ contributes to the loss L
• dL/dŶ = −(Y − Ŷ) = Ŷ − Y
• We’re fitting a line
• Ŷ = mX + b
• Two weights to optimize (slope and bias)
• dŶ/dm = X, dŶ/db = 1
Gradients (Cont’d.)
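To sanity-check the derivative above, here is a small sketch that compares the analytic dL/dŶ = Ŷ − Y against a numerical finite-difference estimate; the values of Y and Ŷ are arbitrary:

```python
def loss(Y, Y_hat):
    # L = 1/2 * (Y - Y_hat)^2
    return 0.5 * (Y - Y_hat) ** 2

Y, Y_hat = 7.0, 4.0

# Analytic derivative from the slide
dL_dY_hat = Y_hat - Y

# Numerical finite-difference check
eps = 1e-6
dL_dY_hat_num = (loss(Y, Y_hat + eps) - loss(Y, Y_hat - eps)) / (2 * eps)

print(dL_dY_hat, dL_dY_hat_num)  # -3.0 and approximately -3.0
```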
Copyright © 2017 DeepScale 15
Gradients (Cont’d.)
Figure: line with noise to fit; surface of the loss w.r.t. slope and bias (m, b)
https://spin.atomicobject.com/2014/06/24/gradient-descent-linear-regression/
Copyright © 2017 DeepScale 16
• We know dL/dŶ = Ŷ − Y, dŶ/dm = X, and dŶ/db = 1
• To optimize our line [slope and bias] we use the chain rule!
• dL/dm = (dL/dŶ)(dŶ/dm) = X(Ŷ − Y) and dL/db = (dL/dŶ)(dŶ/db) = (Ŷ − Y)
• Together, these two derivatives make a gradient!
• We update our weights by stepping against the gradient
• m = m − α(dL/dm) and b = b − α(dL/db)
• where α is the learning rate (a minimal sketch follows after this slide)
Gradients (Cont’d.)
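Putting the chain rule and the update rule together, here is a minimal batch gradient-descent sketch for fitting a line; the synthetic data, learning rate, and number of steps are assumptions made for the example, not values from the slides:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic noisy line: Y = 2X + 1 + noise
X = rng.uniform(0, 10, size=100)
Y = 2.0 * X + 1.0 + rng.normal(0, 0.5, size=100)

m, b = 0.0, 0.0   # initial slope and bias
alpha = 0.02      # learning rate

for step in range(5000):
    Y_hat = m * X + b
    # Gradients averaged over the dataset (chain rule from the slide)
    dL_dm = np.mean(X * (Y_hat - Y))
    dL_db = np.mean(Y_hat - Y)
    # Step against the gradient
    m -= alpha * dL_dm
    b -= alpha * dL_db

print(m, b)  # should end up close to 2 and 1
```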
Copyright © 2017 DeepScale 17
• How to minimize loss?
• Walk down surface via gradient steps until you reach the minimum!
Gradients (Cont’d.)
https://github.com/mattnedrich/GradientDescentExample
Copyright © 2017 DeepScale 18
• Gradient descent is not limited to linear regression
• We can take derivatives with respect to any parameter in the neural network
• To avoid mathematical complexity and recomputation, we can use the chain rule again
• We can even do this through nonlinear functions that are not differentiable everywhere (e.g., ReLU); a small sketch follows this slide
Gradients (Cont’d.)
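To illustrate pushing the chain rule through a nonlinearity, here is a hypothetical one-hidden-unit example with a ReLU; the weights, input, and target are arbitrary, and the network is far smaller than anything used in practice:

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

# Tiny network: y_hat = w2 * relu(w1 * x), with a squared-error loss against y
x, y = 1.5, 2.0
w1, w2 = 0.8, -0.4

# Forward pass
h = w1 * x        # pre-activation
a = relu(h)       # activation
y_hat = w2 * a
L = 0.5 * (y - y_hat) ** 2

# Backward pass via the chain rule
dL_dy_hat = y_hat - y
dL_dw2 = dL_dy_hat * a
dL_da = dL_dy_hat * w2
dL_dh = dL_da * (1.0 if h > 0 else 0.0)  # ReLU derivative: 1 where h > 0, else 0
dL_dw1 = dL_dh * x

print(L, dL_dw1, dL_dw2)
```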
Copyright © 2017 DeepScale 19
Gradients (Cont’d.)
• This process of computing and applying gradient updates to a neural network, layer by layer, is called backpropagation
Copyright © 2017 DeepScale 20
• Given that we now have gradients and weights, what is the best way to apply the updates?
• In the previous linear regression example:
• Grab a random sample and apply updates to the slope and bias
• Repeat until convergence
• Known as Stochastic Gradient Descent (SGD); a minimal sketch follows this slide
• Can we do better at finding the best possible set of weights to minimize the loss? (Optimization)
Optimization Techniques
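A minimal sketch of the SGD procedure described on this slide, again on the line-fitting example; the data, learning rate, and step count are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.uniform(0, 10, size=200)
Y = 2.0 * X + 1.0 + rng.normal(0, 0.5, size=200)

m, b, alpha = 0.0, 0.0, 0.005

for step in range(20_000):
    i = rng.integers(len(X))             # grab one random sample
    y_hat = m * X[i] + b
    m -= alpha * X[i] * (y_hat - Y[i])   # single-sample gradient step for the slope
    b -= alpha * (y_hat - Y[i])          # and for the bias

print(m, b)  # roughly 2 and 1, noisier than a full-batch fit
```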
Copyright © 2017 DeepScale 21
• Momentum
• Keep a running average of previous updates and add it to each update (see the sketch below)
Optimization Techniques (Cont’d.)
Figure: steps without momentum vs. steps with momentum
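A sketch of the momentum update described above; the momentum coefficient (0.9) and toy objective are common illustrative choices, not values from the slide:

```python
def momentum_step(w, velocity, grad_fn, lr=0.05, mu=0.9):
    # Keep a running (exponentially decaying) average of past gradients
    # and step along that smoothed direction.
    g = grad_fn(w)
    velocity = mu * velocity + g
    w = w - lr * velocity
    return w, velocity

# Example: minimize L(w) = (w - 3)^2, so dL/dw = 2*(w - 3)
w, v = 0.0, 0.0
for _ in range(200):
    w, v = momentum_step(w, v, grad_fn=lambda w: 2.0 * (w - 3.0))
print(w)  # close to 3
```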
Copyright © 2017 DeepScale 22
• AdaGrad, AdaProp, RMSProp, ADAM
• Automatically tune the learning rate to reach convergence in fewer updates
• Great for fast convergence
• Sometimes finicky for reaching the lowest loss possible for a network
Optimization Techniques (Cont’d.)
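As a rough sketch of how an adaptive optimizer works, here is an Adam-style update; the hyperparameters shown are the commonly cited defaults, which the slides do not specify:

```python
import numpy as np

def adam_step(w, g, state, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    # Keep running averages of the gradient (m) and squared gradient (v),
    # then scale each step by 1/sqrt(v) so the effective learning rate adapts.
    m, v, t = state
    t += 1
    m = beta1 * m + (1 - beta1) * g
    v = beta2 * v + (1 - beta2) * g * g
    m_hat = m / (1 - beta1 ** t)   # bias correction for the running averages
    v_hat = v / (1 - beta2 ** t)
    w = w - lr * m_hat / (np.sqrt(v_hat) + eps)
    return w, (m, v, t)

# Example: minimize L(w) = (w - 3)^2
w = np.array(0.0)
state = (np.zeros_like(w), np.zeros_like(w), 0)
for _ in range(2000):
    g = 2.0 * (w - 3.0)
    w, state = adam_step(w, g, state, lr=0.01)
print(w)  # approaches 3
```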
Copyright © 2017 DeepScale 23
Optimization Techniques (Cont’d.)
Copyright © 2017 DeepScale 24
• When it comes to neural networks, you want a dataset that is diverse and large enough to train your network without overfitting (more on this later)
• You can also augment your data to generate more samples (see the sketch below)
• Rotations / reflections, when they make sense
• Added noise / hue / contrast changes
• This is extremely useful when you have rare sample classes
Datasets
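A minimal sketch of the kinds of augmentation listed above, applied to an image stored as a NumPy array; the specific transforms and magnitudes are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

def augment(image):
    # image: H x W x 3 float array with values in [0, 1]
    out = image.copy()
    if rng.random() < 0.5:                       # horizontal reflection, when it makes sense
        out = out[:, ::-1, :]
    out = out + rng.normal(0, 0.02, out.shape)   # small additive noise
    out = out * rng.uniform(0.8, 1.2)            # simple brightness/contrast jitter
    return np.clip(out, 0.0, 1.0)

image = rng.random((32, 32, 3))
augmented = [augment(image) for _ in range(4)]   # several new samples from one original
print(len(augmented), augmented[0].shape)
```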
Copyright © 2017 DeepScale 25
Datasets (Cont’d.)
MNIST
Copyright © 2017 DeepScale 26
Datasets (Cont’d.)
CIFAR-10
Copyright © 2017 DeepScale 27
Datasets (Cont’d.)
ImageNet
Copyright © 2017 DeepScale 28
• What is Overfitting?
• Fitting to the training data but not generalizing well
• What is Underfitting?
• The model does not capture the trends in the data
• How to tell?
Overfitting and Underfitting
Copyright © 2017 DeepScale 29
Overfitting and Underfitting (Cont’d.)
Copyright © 2017 DeepScale 30
• We can split our labeled data into 3 disjoint parts
• Training set, validation set, test set (see the sketch below)
• During training
• “Learn” via the training set
• Evaluate the model every epoch with the validation set
• After training
• Test the model with the test set, which the model hasn’t seen before
Overfitting and Underfitting (Cont’d.)
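A minimal sketch of the three-way split described above; the 80/10/10 proportions are a common convention, not something specified on the slide:

```python
import numpy as np

rng = np.random.default_rng(0)

def train_val_test_split(X, y, val_frac=0.1, test_frac=0.1):
    # Shuffle indices, then carve out three disjoint sets
    idx = rng.permutation(len(X))
    n_test = int(len(X) * test_frac)
    n_val = int(len(X) * val_frac)
    test_idx = idx[:n_test]
    val_idx = idx[n_test:n_test + n_val]
    train_idx = idx[n_test + n_val:]
    return ((X[train_idx], y[train_idx]),
            (X[val_idx], y[val_idx]),
            (X[test_idx], y[test_idx]))

X, y = np.arange(1000).reshape(-1, 1), np.arange(1000)
train, val, test = train_val_test_split(X, y)
print(len(train[0]), len(val[0]), len(test[0]))  # 800, 100, 100
```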
Copyright © 2017 DeepScale 31
Overfitting and Underfitting (Cont’d.)
• Overfitting occurs when
• Training loss is low but validation and test loss are high
Copyright © 2017 DeepScale 32
• How to combat overfitting? (sketches of the last two ideas below)
• More data
• Data augmentation
• Regularization (weight decay)
• Add the magnitude of the weights to the loss function
• Randomly drop some activations during training (Dropout)
• A simpler model?
Overfitting and Underfitting (Cont’d.)
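Minimal sketches of the last two ideas above: L2 weight decay added to the loss, and a dropout mask applied to activations during training. The decay strength and drop probability are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

def loss_with_weight_decay(data_loss, weights, decay=1e-4):
    # Add the (squared) magnitude of the weights to the loss
    return data_loss + decay * np.sum(weights ** 2)

def dropout(activations, p_drop=0.5, training=True):
    # Randomly zero out activations during training, scaling so the expected value is unchanged
    if not training:
        return activations
    mask = (rng.random(activations.shape) >= p_drop) / (1.0 - p_drop)
    return activations * mask

weights = rng.normal(size=100)
print(loss_with_weight_decay(data_loss=0.42, weights=weights))
print(dropout(np.ones(8)))
```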
Copyright © 2017 DeepScale 33
• Underfitting occurs when
• Training loss drops at first, then stops improving
• Training loss is still high
• Training loss tracks the validation loss
• Possible fixes: a more complex model, or turning down regularization
Overfitting and Underfitting (Cont’d.)
Copyright © 2017 DeepScale 34
• Neural nets are function approximators
• Deep learning can work surprisingly well
• Optimizing nets is an art that requires intuition
• Making good datasets is hard
• Overfitting makes it hard to generalize to real applications
• We can measure how robust our models are with held-out validation and test sets
Takeaways
Copyright © 2017 DeepScale 35
Thank you!
Questions?