Introduction to Deep Learning
by
Vishal Gour
Prerequisites
• Python
• Machine Learning
• Statistics
History of Deep Learning
• Definition: Deep learning is a subfield of machine learning built on
artificial neural networks, which are inspired by the brain's structure and function.
• Origins: It began in the 1940s with the McCulloch-Pitts Neuron model,
which was foundational for later developments in neural networks.
• Key Development: In the 1980s, the backpropagation algorithm was
popularized, allowing for the training of multi-layered neural networks.
• Resurgence: The 2000s saw renewed interest in deep learning, leading to
significant breakthroughs.
• Notable Achievement: AlexNet's victory in the 2012 ImageNet
competition was a pivotal moment, sparking widespread interest and
rapid advancements in the field of deep learning.
Artificial Intelligence, Machine Learning and Deep Learning
Why do we need Deep Learning?
Applications of Deep Learning
Introduction to deep Learning Fundamentals
Perceptrons
• Perceptrons, introduced by Frank Rosenblatt in 1958, are the simplest
type of artificial neural network. They consist of input features, weights, a
bias, and an activation function.
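The four ingredients listed above can be sketched in a few lines of Python. This is a minimal illustration, not Rosenblatt's original formulation: the step activation, the function names, and the hand-picked weights and bias (chosen so the unit computes logical AND) are all assumptions made for the example.

```python
def step(z: float) -> int:
    """Step activation: fires (1) when the weighted sum is non-negative."""
    return 1 if z >= 0 else 0


def perceptron(inputs, weights, bias):
    """Weighted sum of the input features plus the bias, passed through the activation."""
    z = sum(w * x for w, x in zip(weights, inputs)) + bias
    return step(z)


# Hypothetical hand-picked parameters that make the perceptron act as logical AND.
AND_WEIGHTS = [1.0, 1.0]
AND_BIAS = -1.5

# Evaluate on all four binary input pairs: (0,0), (0,1), (1,0), (1,1).
and_outputs = [perceptron([a, b], AND_WEIGHTS, AND_BIAS) for a in (0, 1) for b in (0, 1)]
print(and_outputs)
```

Only the pair (1, 1) pushes the weighted sum above zero, so only that input fires.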
Biological Neuron vs Artificial Neuron
Multilayer Perceptrons (MLPs)
• MLPs are an extension of perceptrons that include one or more hidden
layers. These hidden layers enable MLPs to model complex, non-linear
functions.
FeedForward Neural Networks
• A Feedforward Neural Network (FNN) is a type of artificial neural network
in which information flows in one direction only: from the input layer,
through the hidden layers, to the output layer, without cycles. Each layer
applies weights, biases, and an activation function to the output of the
previous layer.
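As a rough sketch of a forward pass through a small MLP, the example below stacks two layers and evaluates them in order. The layer layout, the step activation, and the hand-picked weights are assumptions for illustration; they are chosen so the network computes XOR, a function that a single perceptron cannot represent but one hidden layer can.

```python
def step(z):
    """Step activation used by every neuron in this sketch."""
    return 1.0 if z >= 0 else 0.0


def dense(inputs, weights, biases, activation):
    """One layer: each row of `weights` holds the incoming weights of one neuron."""
    return [
        activation(sum(w * x for w, x in zip(row, inputs)) + b)
        for row, b in zip(weights, biases)
    ]


def forward(x, layers):
    """Pass the input through the layers in order; there are no cycles."""
    for weights, biases, activation in layers:
        x = dense(x, weights, biases, activation)
    return x


# Hypothetical hand-picked parameters (not learned):
# hidden neuron 1 acts as OR, hidden neuron 2 acts as AND,
# and the output neuron fires when OR is on and AND is off.
XOR_LAYERS = [
    ([[1.0, 1.0], [1.0, 1.0]], [-0.5, -1.5], step),
    ([[1.0, -1.0]], [-0.5], step),
]

xor_outputs = [forward([a, b], XOR_LAYERS)[0] for a in (0, 1) for b in (0, 1)]
print(xor_outputs)
```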
Backpropagation
Backpropagation is an algorithm used to train neural networks
by minimizing the error between predicted and actual outputs.
It involves two phases:
1. Forward Pass: The input is passed through the network to
compute the output.
2. Backward Pass: The error is propagated backward through the
network, using the chain rule to compute the gradient of the
loss function with respect to each parameter.
An optimizer such as gradient descent then uses these gradients
to update the parameters and reduce the error.
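The two phases can be traced by hand on a very small network. The sketch below assumes a network with one sigmoid hidden unit and one linear output unit, a squared-error loss, and arbitrary starting values; none of these choices come from the slides, they only keep the chain-rule steps short enough to read.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def forward(params, x):
    """Forward pass: x -> hidden activation h -> output y."""
    w1, b1, w2, b2 = params
    h = sigmoid(w1 * x + b1)
    y = w2 * h + b2
    return h, y


def loss(params, x, target):
    """Squared error between the network output and the target."""
    _, y = forward(params, x)
    return 0.5 * (y - target) ** 2


def backward(params, x, target):
    """Backward pass: apply the chain rule from the loss back to each parameter."""
    w1, b1, w2, b2 = params
    h, y = forward(params, x)
    dy = y - target            # dL/dy
    dw2 = dy * h               # dL/dw2
    db2 = dy                   # dL/db2
    dh = dy * w2               # dL/dh
    dz = dh * h * (1.0 - h)    # sigmoid'(z) = h * (1 - h)
    dw1 = dz * x               # dL/dw1
    db1 = dz                   # dL/db1
    return [dw1, db1, dw2, db2]


# Hypothetical starting point, training example, and learning rate.
initial = [0.5, 0.0, -0.3, 0.1]
x, target, lr = 1.0, 1.0, 0.1

params = list(initial)
loss_before = loss(params, x, target)
for _ in range(100):
    grads = backward(params, x, target)
    params = [p - lr * g for p, g in zip(params, grads)]  # gradient descent update
loss_after = loss(params, x, target)
print(loss_before, loss_after)
```

Frameworks automate the backward pass, but the gradients they compute follow the same chain-rule pattern as `backward` here.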
Activation Functions
• The Sigmoid function is a commonly used activation function
in deep learning, especially in the early days. It maps input
values to an output range between 0 and 1, making it useful
for normalizing outputs. However, it has several drawbacks:
1. Gradient Vanishing: The function's gradient becomes very small when
the input is far from zero, leading to poor weight updates during
backpropagation.
2. Non Zero-Centered Output: The function's output is not centered
around zero, which can slow down the training process.
3. Computationally Expensive: The function involves exponential
calculations, which are slower for computers.
• Advantages include a smooth gradient and a bounded
output, which helps prevent erratic behavior in neural
network predictions.
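To make the vanishing-gradient drawback concrete, the sketch below evaluates the Sigmoid and its derivative at a few inputs. The sample points are arbitrary, and the plain formula is an assumption that is not hardened against overflow for very large negative inputs.

```python
import math


def sigmoid(x):
    """Maps any real input into the open interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def sigmoid_grad(x):
    """Derivative of the sigmoid: s * (1 - s), largest at x = 0."""
    s = sigmoid(x)
    return s * (1.0 - s)


for x in (0.0, 2.0, 5.0, 10.0):
    print(f"x={x:5.1f}  sigmoid={sigmoid(x):.5f}  gradient={sigmoid_grad(x):.5f}")
```

The gradient peaks at 0.25 for an input of zero and shrinks rapidly as the input moves away from zero, which is why deep stacks of Sigmoid layers receive very small weight updates.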
The Tanh (hyperbolic tangent) function is similar to the Sigmoid function, but with key
differences. Both have small gradients for inputs of large magnitude, which can hinder
weight updates. However, Tanh's output ranges from -1 to 1 and is centered
around 0, which generally makes training more efficient than with Sigmoid.
In practice, Tanh is often used in hidden layers, while Sigmoid is used in output layers
for binary classification. The choice of activation function depends on the specific
problem and may require experimentation.
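The similarity between the two functions can be checked numerically: Tanh is a rescaled and shifted Sigmoid. The sketch below uses the standard identity tanh(x) = 2·sigmoid(2x) − 1; the helper names and sample inputs are assumptions for illustration.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def tanh_via_sigmoid(x):
    """Tanh written as a stretched, zero-centered sigmoid."""
    return 2.0 * sigmoid(2.0 * x) - 1.0


def tanh_grad(x):
    """Derivative of tanh: 1 - tanh(x)^2, which also saturates for large |x|."""
    return 1.0 - math.tanh(x) ** 2


for x in (-2.0, 0.0, 2.0):
    print(x, math.tanh(x), tanh_via_sigmoid(x), tanh_grad(x))
```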
The ReLU (Rectified Linear Unit) function is a popular activation function in deep learning. It is simple
and efficient, offering key advantages over Sigmoid and Tanh functions:
Advantages:
1. No Gradient Saturation for Positive Inputs: ReLU avoids the gradient vanishing problem for
positive inputs.
2. Faster Computation: ReLU involves simple linear operations, making it faster in both forward and
backward passes.
Disadvantages:
1. Inactive for Negative Inputs: ReLU outputs zero for negative inputs, leading to "dead neurons"
where gradients become zero during backpropagation.
2. Non Zero-Centered: The output is either 0 or a positive value, which is not centered around zero,
potentially slowing down convergence.
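The "dead neuron" effect can be reproduced with a single unit. In the sketch below, the weight, bias, and inputs are hypothetical values chosen so the pre-activation is negative for every input; the gradient at exactly zero is set to 0 by convention.

```python
def relu(x):
    """ReLU: passes positive inputs through, clamps negatives to zero."""
    return max(0.0, x)


def relu_grad(x):
    """Gradient of ReLU: 1 for positive inputs, 0 otherwise (0 at x == 0 by convention)."""
    return 1.0 if x > 0 else 0.0


# Hypothetical neuron whose pre-activation is negative for every input it sees.
inputs = [0.5, 1.0, 2.0]
w, b = -1.0, -0.1
pre_activations = [w * x + b for x in inputs]

# Gradient of the neuron's output with respect to w for each input: relu'(z) * x.
grads_w = [relu_grad(z) * x for z, x in zip(pre_activations, inputs)]
print(grads_w)
```

Every gradient is zero, so gradient descent cannot move `w` or `b`, and the neuron stays inactive on these inputs.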
To address the Dead ReLU Problem, Leaky ReLU replaces the zero output for negative inputs
with a small slope, 0.01x. A related parameter-based method is Parametric ReLU:
f(x) = max(alpha x, x), where alpha is learned through backpropagation. In theory, Leaky
ReLU has all the advantages of ReLU without the Dead ReLU problem, but in
practice it has not been shown that Leaky ReLU is always better than ReLU.
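A minimal sketch of both variants follows. The function names are illustrative, and the equivalence with max(alpha x, x) assumes 0 < alpha < 1; for Parametric ReLU the only addition is a gradient with respect to alpha, which is what lets backpropagation learn it.

```python
def leaky_relu(x, alpha=0.01):
    """Leaky ReLU: identity for positive inputs, small slope alpha for the rest."""
    return x if x > 0 else alpha * x


def leaky_relu_grad(x, alpha=0.01):
    """Gradient with respect to the input: never exactly zero, so the unit cannot die."""
    return 1.0 if x > 0 else alpha


def prelu_alpha_grad(x):
    """Gradient of f(x) = max(alpha * x, x) with respect to alpha (for 0 < alpha < 1)."""
    return 0.0 if x > 0 else x


for x in (-2.0, 0.0, 3.0):
    print(x, leaky_relu(x), leaky_relu_grad(x), prelu_alpha_grad(x))
```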
Softmax
The Softmax activation function is used in the output layer of neural networks,
particularly in multi-class classification problems. It converts raw scores (logits)
from the network into probabilities, ensuring that the sum of the probabilities
for each class is 1.
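A short sketch of the conversion from logits to probabilities is shown below. Subtracting the largest logit before exponentiating is a common numerical-stability trick and an addition of this example, not something stated on the slide; it does not change the result. The sample logits are arbitrary.

```python
import math


def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    m = max(logits)                        # shift for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


probs = softmax([2.0, 1.0, 0.1])
print(probs, sum(probs))
```

Larger logits receive larger probabilities, and the ordering of the classes is preserved.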
Loss Functions
• Mean Squared Error (MSE):
• Definition: MSE is the average of the squared differences between
the actual and predicted values. It penalizes larger errors more
heavily due to squaring, which makes it sensitive to outliers.
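As a small worked example of the definition, with made-up values:

```python
def mse(actual, predicted):
    """Mean of the squared differences between actual and predicted values."""
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)


# Hypothetical data: the errors are 1, 0, -2 and 0.
actual = [3.0, 5.0, 2.0, 7.0]
predicted = [2.0, 5.0, 4.0, 7.0]

mse_value = mse(actual, predicted)  # (1 + 0 + 4 + 0) / 4
print(mse_value)
```

The single error of size 2 contributes four times as much as the error of size 1, which is the squaring effect described above.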
• Mean Absolute Error (MAE):
• Definition: MAE is the average of the absolute differences
between the actual and predicted values. Unlike MSE, it weights each
error in proportion to its size rather than its square, so it is more robust to outliers.
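The difference in outlier sensitivity can be seen by adding one bad prediction to an otherwise good set. The data below is invented for the comparison.

```python
def mae(actual, predicted):
    """Mean of the absolute differences between actual and predicted values."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def mse(actual, predicted):
    """Mean of the squared differences, included here only for comparison."""
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)


# Hypothetical data: every clean prediction is off by 0.5;
# the second set replaces the last prediction with one that is off by 10.
actual = [1.0, 2.0, 3.0, 4.0]
clean = [1.5, 2.5, 3.5, 4.5]
with_outlier = [1.5, 2.5, 3.5, 14.0]

mae_ratio = mae(actual, with_outlier) / mae(actual, clean)
mse_ratio = mse(actual, with_outlier) / mse(actual, clean)
print(mae_ratio, mse_ratio)
```

The single outlier inflates MSE far more than it inflates MAE.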
• Huber Loss
• Definition: Huber Loss combines the advantages of MSE and MAE.
It behaves like MSE when the error is small and like MAE when the
error is large, making it robust to outliers while staying smooth
around zero error.
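A sketch of the usual piecewise form follows. The threshold `delta`, which separates the quadratic region from the linear one, is a hyperparameter; the default of 1.0 here is an assumption for illustration.

```python
def huber(error, delta=1.0):
    """Huber loss for one error: quadratic within delta of zero, linear beyond it."""
    a = abs(error)
    if a <= delta:
        return 0.5 * error ** 2          # quadratic region, like squared error
    return delta * (a - 0.5 * delta)     # linear region, like absolute error


for e in (0.5, 1.0, 3.0, 10.0):
    print(e, huber(e))
```

The two branches meet at |error| = delta with the same value and slope, so the loss has no kink there.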
• Binary Cross Entropy
• Used for binary classification tasks where there are two possible
outcomes (e.g., spam vs. not spam).
• The output node represents the probability of the positive class, typically using a
sigmoid activation function.
• Categorical Cross Entropy
• Used for multi-class classification tasks where there are more than two
classes.
• The network outputs a probability distribution over all classes, typically
using a softmax activation function.
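Both losses can be written in a few lines. The sketch below assumes the inputs are already probabilities (after sigmoid or softmax) and that the multi-class target is one-hot encoded; the small `eps` clamp that avoids log(0) is an implementation detail added for this example.

```python
import math


def binary_cross_entropy(y_true, p, eps=1e-12):
    """Loss for one example: y_true is 0 or 1, p is the predicted probability of class 1."""
    p = min(max(p, eps), 1.0 - eps)  # keep log() away from 0
    return -(y_true * math.log(p) + (1 - y_true) * math.log(1.0 - p))


def categorical_cross_entropy(one_hot, probs, eps=1e-12):
    """Loss for one example: one_hot marks the true class, probs sums to 1."""
    return -sum(t * math.log(max(q, eps)) for t, q in zip(one_hot, probs))


print(binary_cross_entropy(1, 0.9), binary_cross_entropy(1, 0.1))
print(categorical_cross_entropy([0, 1, 0], [0.2, 0.7, 0.1]))
```

In both cases a confident correct prediction gives a loss near zero, and a confident wrong prediction gives a large loss.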
Gradient Descent (GD)
Working of Gradient Descent
• Starting Point: The process begins at an arbitrary point in the
parameter space (i.e., initial values for weights and biases). This
starting point serves as a baseline for evaluating the model's
performance.
• Calculate the Slope: From this initial point, the derivative (or slope)
of the cost function is calculated. The slope is that of the tangent
line at the current point: its sign shows which direction reduces the
cost, and its steepness determines how much the
parameters are adjusted.
• Parameter Updates: Using the slope, gradient descent updates the
weights and biases. Initially, the slope is typically steep, leading to
more significant adjustments. As the process continues, the slope
flattens, indicating that the updates are getting smaller as the
algorithm approaches the minimum of the cost function.
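The three steps above can be sketched on a one-parameter cost function. The quadratic cost, the starting point, and the learning rate are assumptions picked so the behavior is easy to follow.

```python
def cost(w):
    """Hypothetical cost function with its minimum at w = 3."""
    return (w - 3.0) ** 2


def slope(w):
    """Derivative of the cost with respect to w."""
    return 2.0 * (w - 3.0)


w = 0.0                # starting point: an arbitrary initial value
learning_rate = 0.1
update_sizes = []

for _ in range(50):
    g = slope(w)                              # calculate the slope at the current point
    update_sizes.append(abs(learning_rate * g))
    w -= learning_rate * g                    # move against the slope

print(w, update_sizes[0], update_sizes[-1])
```

The first update is the largest because the slope is steepest far from the minimum; later updates shrink as the slope flattens.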
• Minimizing the Cost Function: The objective of gradient descent,
similar to finding the line of best fit in linear regression, is to minimize
the cost function. The cost function measures the difference (or error)
between the predicted output and the actual output.
• Direction and Learning Rate:
– Direction: Gradient descent moves in the direction of steepest
descent, the negative gradient, to reduce the cost function.
– Learning Rate (alpha, also written η): This determines the size of the steps taken towards
the minimum. A higher learning rate results in larger steps, which can
speed up convergence but risks overshooting the minimum. Conversely, a
lower learning rate takes smaller, more precise steps, but this can slow
down the convergence process, requiring more iterations and
computational resources.
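The trade-off can be seen by running the same number of steps with different learning rates. The cost function (w − 3)² and the three rates below are assumptions chosen to show a slow, a well-behaved, and a diverging run.

```python
def run_gradient_descent(learning_rate, steps=20, w=0.0):
    """Minimize the hypothetical cost (w - 3)^2 and return the final w."""
    for _ in range(steps):
        gradient = 2.0 * (w - 3.0)
        w -= learning_rate * gradient   # step along the negative gradient
    return w


results = {lr: run_gradient_descent(lr) for lr in (0.01, 0.4, 1.1)}
for lr, w_final in results.items():
    print(f"learning rate {lr}: final w = {w_final:.4f}, distance from minimum = {abs(w_final - 3.0):.4f}")
```

With the smallest rate the parameter is still far from 3 after 20 steps, the middle rate converges, and the largest rate overshoots further on every step and moves away from the minimum.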
• Convergence: As gradient descent iteratively updates the parameters, it
moves closer to a minimum of the cost function. The process continues
until the gradient is close to zero and the cost stops decreasing noticeably,
indicating that the error has been minimized; the minimum cost itself need not
be zero. At this point, the model has effectively "learned" its parameters,
although for non-convex cost functions these may correspond to a local rather
than a global minimum.
Stochastic Gradient Descent (SGD)
Mini Batch Gradient Descent
SGD with Momentum
AdaGrad (Adaptive Gradient Descent)
RMSprop
Adam (Adaptive Moment Estimation)
• Adam is one of the most widely used
gradient descent optimization
algorithms.
• Adam combines the advantages of both
momentum and RMSprop. It maintains two
exponential moving averages: one of the gradients
(as in momentum) and one of the squared
gradients (as in RMSprop).
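The two moving averages can be sketched for a single parameter as follows. The hyperparameter values are the commonly quoted defaults apart from the learning rate, which is raised to 0.1 for this toy problem; the cost function (w − 3)² and the function name are assumptions for illustration.

```python
import math


def adam_minimize(grad_fn, w, lr=0.1, beta1=0.9, beta2=0.999, eps=1e-8, steps=500):
    """Run Adam on a single scalar parameter and return its final value."""
    m = 0.0  # moving average of gradients (the momentum part)
    v = 0.0  # moving average of squared gradients (the RMSprop part)
    for t in range(1, steps + 1):
        g = grad_fn(w)
        m = beta1 * m + (1.0 - beta1) * g
        v = beta2 * v + (1.0 - beta2) * g * g
        m_hat = m / (1.0 - beta1 ** t)   # bias correction for the zero initialization
        v_hat = v / (1.0 - beta2 ** t)
        w -= lr * m_hat / (math.sqrt(v_hat) + eps)
    return w


def grad(w):
    """Gradient of the hypothetical cost (w - 3)^2."""
    return 2.0 * (w - 3.0)


w_adam = adam_minimize(grad, w=0.0)
print(w_adam)
```

Dividing by the square root of the second average normalizes the step, so the very first update has a size close to the learning rate regardless of how large the raw gradient is.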
Source: https://musstafa0804.medium.com/optimizers-in-deep-learning-7bf81fed78a0
