Introduction to Deep Learning Fundamentals

Introduction to Deep Learning
by
Vishal Gour

Prerequisites
• Python
• Machine Learning
• Statistics

History of Deep Learning
• Definition: Deep learning is a subfield of machine learning built on
artificial neural networks, which are inspired by the brain's structure
and function.
• Origins: It began in the 1940s with the McCulloch-Pitts Neuron model,
which was foundational for later developments in neural networks.
• Key Development: In the 1980s, the backpropagation algorithm was
popularized, allowing for the training of multi-layered neural networks.
• Resurgence: The 2000s saw renewed interest in deep learning, leading to
significant breakthroughs.
• Notable Achievement: AlexNet's victory in the 2012 ImageNet
competition was a pivotal moment, sparking widespread interest and
rapid advancements in the field of deep learning.

Artificial Intelligence, Machine Learning and
Deep Learning

Perceptrons
• Perceptrons, introduced by Frank Rosenblatt in 1958, are the simplest
type of artificial neural network. They consist of input features, weights, a
bias, and an activation function.
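A minimal sketch of a single perceptron, assuming a step activation and hand-picked weights (the names `AND_WEIGHTS` and `AND_BIAS` are illustrative, not from the slides). With these values the perceptron behaves like a logical AND gate:

```python
def step(z):
    """Step activation: 1 if the weighted sum is non-negative, else 0."""
    return 1 if z >= 0 else 0


def perceptron(inputs, weights, bias):
    """Weighted sum of the inputs plus the bias, passed through the activation."""
    z = sum(x * w for x, w in zip(inputs, weights)) + bias
    return step(z)


# Hypothetical hand-picked parameters that make the perceptron act as AND.
AND_WEIGHTS = [1.0, 1.0]
AND_BIAS = -1.5

for a in (0, 1):
    for b in (0, 1):
        print(a, b, perceptron([a, b], AND_WEIGHTS, AND_BIAS))
```

In a trained perceptron the weights and bias would be learned from data rather than chosen by hand.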

Biological Neuron vs Artificial Neuron

Multilayer Perceptrons (MLPs)
• MLPs are an extension of perceptrons that include one or more hidden
layers. These hidden layers enable MLPs to model complex, non-linear
functions.

FeedForward Neural Networks
• A Feedforward Neural Network (FNN) is a type of artificial neural network
where connections between nodes move in one direction—from the input
layer, through hidden layers, to the output layer, without cycles. It
processes inputs to produce an output by applying weights, biases, and
activation functions in each layer.
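As an illustration of a forward pass through one hidden layer, the sketch below wires up a tiny network by hand so that it computes XOR, a function a single perceptron cannot represent. The weights, the step activation, and the helper names `dense` and `forward` are assumptions chosen for the example:

```python
def step(z):
    """Step activation used in every layer of this toy network."""
    return 1.0 if z >= 0 else 0.0


def dense(inputs, weights, biases, activation):
    """One fully connected layer: each row of `weights` feeds one neuron."""
    return [
        activation(sum(x * w for x, w in zip(inputs, row)) + b)
        for row, b in zip(weights, biases)
    ]


def forward(x):
    """Input layer -> hidden layer (2 neurons) -> output layer (1 neuron)."""
    # Hidden neuron 1 acts as OR, hidden neuron 2 acts as AND.
    hidden = dense(x, [[1.0, 1.0], [1.0, 1.0]], [-0.5, -1.5], step)
    # Output fires when OR is on and AND is off, which is XOR.
    output = dense(hidden, [[1.0, -1.0]], [-0.5], step)
    return output[0]


for a in (0.0, 1.0):
    for b in (0.0, 1.0):
        print(a, b, forward([a, b]))
```

Data flows strictly from input to output; no layer feeds back into an earlier one.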

Backpropagation
Backpropagation is an algorithm used to train neural networks
by minimizing the error between predicted and actual outputs.
It involves two phases:
1. Forward Pass: The input is passed through the network to
compute the output.
2. Backward Pass: The error is propagated backward through the
network to update the weights.
Backpropagation computes the gradient of the loss function with
respect to each parameter. An optimizer such as gradient descent
then uses these gradients to update the parameters and reduce the error.
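The two phases can be sketched on a very small network: one input, one sigmoid hidden neuron, and one linear output neuron, trained on a single example with a squared-error loss. The network shape, starting parameters, and learning rate are assumptions made for illustration:

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train_step(params, x, y, lr):
    """One forward pass, one backward pass, and one gradient descent update."""
    w1, b1, w2, b2 = params

    # Forward pass: compute the prediction and the loss.
    z1 = w1 * x + b1
    h = sigmoid(z1)
    y_hat = w2 * h + b2
    loss = 0.5 * (y_hat - y) ** 2

    # Backward pass: apply the chain rule from the loss back to each parameter.
    d_yhat = y_hat - y
    d_w2 = d_yhat * h
    d_b2 = d_yhat
    d_h = d_yhat * w2
    d_z1 = d_h * h * (1.0 - h)  # derivative of sigmoid is h * (1 - h)
    d_w1 = d_z1 * x
    d_b1 = d_z1

    # Gradient descent update.
    new_params = (w1 - lr * d_w1, b1 - lr * d_b1, w2 - lr * d_w2, b2 - lr * d_b2)
    return new_params, loss


params = (0.5, 0.0, 0.5, 0.0)  # hypothetical starting values
losses = []
for _ in range(200):
    params, loss = train_step(params, x=1.0, y=1.0, lr=0.5)
    losses.append(loss)

print(losses[0], losses[-1])
```

The loss recorded at each step is measured before that step's update, so the first entry reflects the untrained network.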

• The Sigmoid function is a commonly used activation function
in deep learning, especially in the early days. It maps input
values to an output range between 0 and 1, making it useful
for normalizing outputs. However, it has several drawbacks:
1. Gradient Vanishing: The function's gradient becomes very small when
the input is far from zero, leading to poor weight updates during
backpropagation.
2. Non Zero-Centered Output: The function's output is not centered
around zero, which can slow down the training process.
3. Computationally Expensive: The function involves an exponential,
which is slower to compute than simpler operations such as thresholding.
• Advantages include a smooth gradient and a clear, bounded
output, which helps prevent erratic behavior in neural
network predictions.
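A short sketch of the Sigmoid function and its derivative makes the vanishing gradient visible: the gradient peaks at 0.25 for an input of zero and shrinks toward zero as the input moves away from it. The function names are illustrative:

```python
import math


def sigmoid(x):
    """Maps any real input to the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def sigmoid_grad(x):
    """Derivative of sigmoid: s * (1 - s), which is at most 0.25."""
    s = sigmoid(x)
    return s * (1.0 - s)


for x in (-10.0, 0.0, 10.0):
    print(x, sigmoid(x), sigmoid_grad(x))
```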

The Tanh (hyperbolic tangent) function is similar to the Sigmoid function, but with key
differences. Both have small gradients for large or small inputs, which can hinder
weight updates. However, Tanh's output range is between -1 and 1, and it is centered
around 0, making it better for training efficiency compared to Sigmoid.
In practice, Tanh is often used in hidden layers, while Sigmoid is used in output layers
for binary classification. The choice of activation function depends on the specific
problem and may require experimentation.
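The relationship between the two functions can be checked numerically: Tanh is a rescaled, shifted Sigmoid, tanh(x) = 2 * sigmoid(2x) - 1. The sketch below, with illustrative helper names, also shows that Tanh's gradient still shrinks for large inputs:

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def tanh_grad(x):
    """Derivative of tanh: 1 - tanh(x)^2, which equals 1 at x = 0."""
    return 1.0 - math.tanh(x) ** 2


for x in (-3.0, 0.0, 3.0):
    # Output is symmetric around zero, unlike sigmoid.
    print(x, math.tanh(x), 2.0 * sigmoid(2.0 * x) - 1.0, tanh_grad(x))
```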

The ReLU (Rectified Linear Unit) function is a popular activation function in deep learning. It is simple
and efficient, offering key advantages over Sigmoid and Tanh functions:
Advantages:
1. No Gradient Saturation for Positive Inputs: ReLU avoids the vanishing gradient problem for
positive inputs.
2. Faster Computation: ReLU only compares its input with zero, making it fast in both forward and
backward passes.
Disadvantages:
1. Inactive for Negative Inputs: ReLU outputs zero for negative inputs, so its gradient there is also
zero. A neuron that only receives negative inputs stops updating and becomes a "dead neuron".
2. Non Zero-Centered: The output is either 0 or a positive value, which is not centered around zero,
potentially slowing down convergence.
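A minimal sketch of ReLU and its gradient, with illustrative function names. The gradient at exactly zero is undefined mathematically; treating it as 0 here is a common convention, not something the slides specify:

```python
def relu(x):
    """Returns x for positive inputs and 0 otherwise."""
    return max(0.0, x)


def relu_grad(x):
    """Gradient is 1 for positive inputs and 0 for negative inputs."""
    return 1.0 if x > 0 else 0.0


for x in (-2.0, 0.0, 3.0):
    print(x, relu(x), relu_grad(x))
```

The zero gradient for negative inputs is what causes the dead neuron behaviour: no gradient flows back, so the weights feeding that neuron receive no update.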

To address the dead ReLU problem, Leaky ReLU replaces the zero output for negative inputs with a
small slope, 0.01x: f(x) = max(0.01x, x). A related, parameter-based method is Parametric ReLU
(PReLU): f(x) = max(alpha x, x), where alpha is learned through backpropagation. In theory, Leaky
ReLU has all the advantages of ReLU without the dead ReLU problem, but in practice it has not been
shown to be consistently better than ReLU.
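A sketch of Leaky ReLU and the gradients involved, assuming 0 < alpha < 1 so that max(alpha x, x) selects the right branch. The function names are illustrative. For PReLU the same forward formula applies, and the extra function shows the gradient with respect to alpha that backpropagation would use to learn it:

```python
def leaky_relu(x, alpha=0.01):
    """f(x) = max(alpha * x, x); valid when 0 < alpha < 1."""
    return max(alpha * x, x)


def leaky_relu_grad(x, alpha=0.01):
    """Gradient with respect to x: 1 for positive inputs, alpha otherwise."""
    return 1.0 if x > 0 else alpha


def prelu_grad_alpha(x):
    """Gradient of max(alpha * x, x) with respect to alpha (used by PReLU)."""
    return x if x < 0 else 0.0


for x in (-2.0, 3.0):
    print(x, leaky_relu(x), leaky_relu_grad(x), prelu_grad_alpha(x))
```

Because the gradient for negative inputs is alpha rather than zero, a neuron can still recover after receiving only negative inputs.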

Softmax
The Softmax activation function is used in the output layer of neural networks,
particularly in multi-class classification problems. It converts raw scores (logits)
from the network into probabilities, ensuring that the sum of the probabilities
for each class is 1.
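A minimal sketch of Softmax on a list of logits. Subtracting the largest logit before exponentiating is a standard numerical-stability trick that does not change the result; it is an addition here, not something the slides mention:

```python
import math


def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    m = max(logits)  # shift for numerical stability; the output is unchanged
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


probs = softmax([2.0, 1.0, 0.1])
print(probs, sum(probs))
```

Larger logits receive larger probabilities, and the ordering of the classes is preserved.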

Loss Functions
• Mean Squared Error (MSE):
• Definition: MSE is the average of the squared differences between
the actual and predicted values. It penalizes larger errors more
heavily due to squaring, which makes it sensitive to outliers.
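A small sketch of MSE on made-up values shows how a single large error dominates the result. The function name `mse` and the data are illustrative:

```python
def mse(y_true, y_pred):
    """Average of the squared differences between actual and predicted values."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


# Two perfect predictions and one that is off by 3: the squared error is 9.
print(mse([1.0, 2.0, 3.0], [1.0, 2.0, 6.0]))
```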

Loss Functions
• Mean Absolute Error (MAE):
• Definition: MAE is the average of the absolute differences
between the actual and predicted values. Unlike MSE, it treats all
errors equally and is more robust to outliers.
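A matching sketch for MAE, using an illustrative function name and made-up data in which one prediction is off by 3. The error contributes in proportion to its size rather than its square:

```python
def mae(y_true, y_pred):
    """Average of the absolute differences between actual and predicted values."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


# Two perfect predictions and one that is off by 3: the absolute error is 3.
print(mae([1.0, 2.0, 3.0], [1.0, 2.0, 6.0]))
```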

Loss Functions
• Huber Loss
• Definition: Huber Loss combines the advantages of MSE and MAE.
It behaves like MSE when the error is small and like MAE when the
error is large, making it robust to outliers while remaining smooth
and differentiable near zero error.
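A sketch of Huber Loss under its standard definition: quadratic for absolute errors up to a threshold and linear beyond it. The threshold name `delta` and its default of 1.0 are assumptions for the example:

```python
def huber(y_true, y_pred, delta=1.0):
    """Quadratic for |error| <= delta, linear for larger errors."""
    total = 0.0
    for t, p in zip(y_true, y_pred):
        err = abs(t - p)
        if err <= delta:
            total += 0.5 * err ** 2              # MSE-like region
        else:
            total += delta * (err - 0.5 * delta)  # MAE-like region
    return total / len(y_true)


print(huber([0.0], [0.5]))  # small error, quadratic branch
print(huber([0.0], [3.0]))  # large error, linear branch
```

The two branches meet at an error of exactly `delta`, so the loss has no jump or kink there.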

Loss Functions
• Binary Cross Entropy
• Used for binary classification tasks where there are two possible
outcomes (e.g., spam vs. not spam).
• The output node represents the probability of the positive class, typically
using a sigmoid activation function.
• Categorical Cross Entropy
• Used for multi-class classification tasks where there are more than two
classes.
• The network outputs a probability distribution over all classes, typically
using a softmax activation function.
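Both cross-entropy losses can be sketched in a few lines. The sketch assumes the predictions are already probabilities (from a sigmoid or softmax output) and that multi-class targets are one-hot encoded; the clipping value `eps` is an assumption added to avoid taking log(0):

```python
import math


def binary_cross_entropy(y_true, p_pred, eps=1e-12):
    """y_true holds 0/1 labels; p_pred holds predicted probabilities of class 1."""
    total = 0.0
    for y, p in zip(y_true, p_pred):
        p = min(max(p, eps), 1.0 - eps)  # keep p away from exactly 0 or 1
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(y_true)


def categorical_cross_entropy(y_true, p_pred, eps=1e-12):
    """y_true rows are one-hot targets; p_pred rows are probability distributions."""
    total = 0.0
    for target, probs in zip(y_true, p_pred):
        total += -sum(t * math.log(max(p, eps)) for t, p in zip(target, probs))
    return total / len(y_true)


print(binary_cross_entropy([1, 0], [0.9, 0.2]))
print(categorical_cross_entropy([[0, 1, 0]], [[0.2, 0.5, 0.3]]))
```

In both cases the loss is small when the probability assigned to the correct class is close to 1 and grows quickly as that probability approaches 0.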

Working of Gradient Descent
• Starting Point: The process begins at an arbitrary point in the
parameter space (i.e., initial values for weights and biases). This
starting point serves as a baseline for evaluating the model's
performance.
• Calculate the Slope: From this initial point, the derivative (or slope)
of the cost function is calculated. The slope is derived from a tangent
line at the current point, which provides insight into how steep the
slope is. The steepness of this slope indicates how much the
parameters need to be adjusted.
• Parameter Updates: Using the slope, gradient descent updates the
weights and biases. Initially, the slope is typically steep, leading to
more significant adjustments. As the process continues, the slope
flattens, indicating that the updates are getting smaller as the
algorithm approaches the minimum of the cost function.

• Minimizing the Cost Function: The objective of gradient descent,
similar to finding the line of best fit in linear regression, is to minimize
the cost function. The cost function measures the difference (or error)
between the predicted output and the actual output.
• Direction and Learning Rate:
– Direction: Gradient descent moves in the direction of steepest
descent, the negative gradient, to reduce the cost function.
– Learning Rate (alpha, also written η): This determines the size of the steps
taken towards the minimum. A higher learning rate results in larger steps,
which can speed up convergence but risks overshooting the minimum.
Conversely, a lower learning rate takes smaller, more precise steps, but
this can slow down the convergence process, requiring more iterations and
computational resources.

• Convergence: As gradient descent iteratively updates the parameters, it
moves closer to a minimum of the cost function. The process continues
until the cost stops decreasing meaningfully (the gradient is close to zero)
or an iteration limit is reached. The minimum cost need not be zero. At this
point, the model has effectively "learned" parameters that are at least
locally optimal.
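The update loop can be sketched on a one-parameter cost function. The cost J(w) = (w - 3)^2, the starting point, and the three learning rates are assumptions chosen to show slow convergence, good convergence, and overshooting:

```python
def gradient_descent(grad, start, learning_rate, steps):
    """Repeatedly step against the gradient; returns every visited point."""
    w = start
    history = [w]
    for _ in range(steps):
        w = w - learning_rate * grad(w)
        history.append(w)
    return history


def cost_grad(w):
    """Gradient of the hypothetical cost J(w) = (w - 3)^2, minimized at w = 3."""
    return 2.0 * (w - 3.0)


slow = gradient_descent(cost_grad, start=0.0, learning_rate=0.01, steps=50)
good = gradient_descent(cost_grad, start=0.0, learning_rate=0.1, steps=50)
diverging = gradient_descent(cost_grad, start=0.0, learning_rate=1.1, steps=50)

print(slow[-1], good[-1], diverging[-1])
```

With the small rate the parameter is still far from 3 after 50 steps, with the moderate rate it is very close, and with the large rate each step overshoots by more than the last and the parameter moves away from the minimum.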

Stochastic Gradient Descent (SGD)

AdaGrad (Adaptive Gradient Descent)

Adam (Adaptive Moment Estimation)
• Adam is one of the most popular gradient descent optimization
algorithms.
• Adam combines the advantages of both momentum and RMSprop. It
maintains two exponential moving averages: one of the gradients
(momentum) and one of the squared gradients (RMSprop).
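A single-parameter sketch of the Adam update, including the bias correction applied to both moving averages. The hyperparameter defaults for `beta1`, `beta2`, and `eps` follow commonly used values; the learning rate and the cost function J(w) = (w - 3)^2 are assumptions for the example:

```python
import math


def adam_minimize(grad, start, steps, lr=0.1, beta1=0.9, beta2=0.999, eps=1e-8):
    """Minimize a one-parameter function given its gradient, using Adam."""
    w = start
    m = 0.0  # moving average of gradients (momentum term)
    v = 0.0  # moving average of squared gradients (RMSprop term)
    for t in range(1, steps + 1):
        g = grad(w)
        m = beta1 * m + (1 - beta1) * g
        v = beta2 * v + (1 - beta2) * g * g
        # Bias correction: both averages start at zero and are too small early on.
        m_hat = m / (1 - beta1 ** t)
        v_hat = v / (1 - beta2 ** t)
        w = w - lr * m_hat / (math.sqrt(v_hat) + eps)
    return w


def cost_grad(w):
    """Gradient of the hypothetical cost J(w) = (w - 3)^2, minimized at w = 3."""
    return 2.0 * (w - 3.0)


result = adam_minimize(cost_grad, start=0.0, steps=500)
print(result)
```

Dividing by the square root of the squared-gradient average scales each step to the recent gradient magnitude, which is why Adam is less sensitive to the raw size of the gradient than plain gradient descent.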

Source: https://musstafa0804.medium.com/optimizers-in-deep-learning-7bf81fed78a0
