Gradient Descent
Dr. M. Ramesh
Prof. & HOD CSE - Cyber Security
The Idea Behind Gradient Descent
● Purpose: Gradient Descent is used to minimize a function, typically the loss or
cost function in machine learning models. The goal is to find the optimal
parameters (e.g., weights in a neural network) that minimize the loss.
● Key Insight: The direction of steepest descent (the negative gradient) tells us how to
update the parameters to reduce the loss. By iteratively adjusting the parameters
in small steps, we move step by step toward a minimum of the loss function.
● Example: Consider a simple linear regression problem with a loss function that
measures the difference between predicted values and actual values (Mean
Squared Error). Gradient descent helps adjust the line's slope and intercept until
this loss is minimized.
What is the Gradient?
The gradient is a vector that points in the direction of the steepest ascent of the
loss function. The negative of this vector points in the direction of the steepest
descent.
Estimating the Gradient:
● In simple cases (e.g., linear regression), the gradient can be computed
analytically.
● In more complex scenarios (e.g., neural networks), backpropagation is used to
calculate gradients.
● In cases with large datasets, mini-batches or stochastic techniques are used to
estimate the gradient over small subsets of data.
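To make both routes concrete, here is a minimal NumPy sketch (the quadratic loss, synthetic data, and finite-difference step size are illustrative assumptions, not part of the slides) that computes the analytic gradient of an MSE loss and checks it against a numerical finite-difference estimate:

```python
import numpy as np

def loss(w, X, y):
    # Mean squared error of a linear model y_hat = X @ w
    return np.mean((X @ w - y) ** 2)

def analytic_grad(w, X, y):
    # Closed-form gradient of the MSE loss: (2/n) * X^T (X w - y)
    return 2 * X.T @ (X @ w - y) / len(y)

def numerical_grad(w, X, y, eps=1e-6):
    # Central finite differences, one coordinate at a time
    g = np.zeros_like(w)
    for i in range(len(w)):
        step = np.zeros_like(w)
        step[i] = eps
        g[i] = (loss(w + step, X, y) - loss(w - step, X, y)) / (2 * eps)
    return g

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=100)
w = np.zeros(3)
print(analytic_grad(w, X, y))    # the two estimates should agree closely
print(numerical_grad(w, X, y))
```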
Using the Gradient
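Every variant applies the same update rule: move the parameters a small step against the gradient, θ ← θ − η · ∇L(θ), where η is the learning rate. A minimal one-dimensional sketch (the toy function and learning rate are illustrative choices):

```python
def f(x):
    return (x - 3) ** 2              # toy loss with its minimum at x = 3

def grad_f(x):
    return 2 * (x - 3)               # derivative of f

x = 0.0                              # initial guess
learning_rate = 0.1
for step in range(50):
    x = x - learning_rate * grad_f(x)    # step against the gradient
print(x)                             # converges toward 3.0
```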
Choosing the Right Step Size (Learning Rate)
Importance of Step Size:
● Too Large: If the step size (learning rate) is too large, the algorithm might
overshoot the minimum, causing divergence (i.e., the loss increases).
● Too Small: If the step size is too small, convergence will be slow,
requiring many iterations to reach the minimum.
● Optimal Step Size: Selecting the right learning rate is crucial for efficient
training. Methods such as learning rate schedules (reducing the learning
rate over time) or adaptive learning rates (e.g., Adam, RMSprop) can
help.
Heuristics:
● Use cross-validation to experiment with different learning
rates.
● Start with a higher learning rate and gradually reduce it
(learning rate annealing).
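One simple way to apply the annealing heuristic is to shrink the learning rate by a fixed factor each epoch. A sketch of such a schedule (the initial rate and decay factor below are illustrative, not prescribed):

```python
initial_lr = 0.1
decay = 0.95                         # shrink the learning rate by 5% per epoch

for epoch in range(20):
    lr = initial_lr * (decay ** epoch)   # learning-rate annealing
    # ... run one epoch of gradient descent with this lr ...
    print(f"epoch {epoch:2d}  lr = {lr:.4f}")
```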
Using Gradient Descent to Fit Models
Example 1: Linear Regression
● Loss Function: Mean Squared Error (MSE).
● Gradient Descent: Adjusts the slope and intercept of the regression line until the error between
predicted and actual values is minimized.
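A minimal NumPy sketch of Example 1, fitting the slope and intercept by full-batch gradient descent on the MSE loss (the synthetic data, learning rate, and iteration count are assumptions for illustration):

```python
import numpy as np

rng = np.random.default_rng(42)
x = rng.uniform(0, 10, size=200)
y = 2.5 * x + 1.0 + rng.normal(scale=1.0, size=200)   # true slope 2.5, intercept 1.0

slope, intercept = 0.0, 0.0
lr = 0.01
for _ in range(2000):
    error = slope * x + intercept - y
    grad_slope = 2 * np.mean(error * x)       # d(MSE)/d(slope)
    grad_intercept = 2 * np.mean(error)       # d(MSE)/d(intercept)
    slope -= lr * grad_slope
    intercept -= lr * grad_intercept

print(slope, intercept)   # should approach 2.5 and 1.0
```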
Example 2: Logistic Regression
● Loss Function: Binary Cross-Entropy.
● Gradient Descent: Finds the optimal decision boundary by adjusting weights to minimize
classification error.
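For Example 2, the binary cross-entropy loss with a sigmoid output has the simple gradient X^T(p − y)/n, used in the sketch below (the toy data and hyperparameters are illustrative):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(float)     # toy binary labels

w, b, lr = np.zeros(2), 0.0, 0.5
for _ in range(1000):
    p = sigmoid(X @ w + b)                    # predicted probabilities
    grad_w = X.T @ (p - y) / len(y)           # gradient of binary cross-entropy
    grad_b = np.mean(p - y)
    w -= lr * grad_w
    b -= lr * grad_b

print(w, b)   # the learned weights define the decision boundary
```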
Example 3: Neural Networks
● Loss Functions: Cross-Entropy for classification, Mean Squared Error for regression tasks.
● Gradient Descent: Using backpropagation, the gradients are propagated backward through the
network to update the weights in each layer.
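For Example 3, deep learning libraries compute these gradients by backpropagation automatically. A minimal sketch assuming PyTorch (the tiny architecture, random data, and hyperparameters are placeholders):

```python
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(4, 16), nn.ReLU(), nn.Linear(16, 1))
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
loss_fn = nn.MSELoss()

X = torch.randn(128, 4)              # placeholder inputs
y = torch.randn(128, 1)              # placeholder targets

for epoch in range(100):
    optimizer.zero_grad()            # clear gradients from the previous step
    loss = loss_fn(model(X), y)
    loss.backward()                  # backpropagation fills in all gradients
    optimizer.step()                 # gradient descent update of every weight
```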
Mini-Batch Gradient Descent:
● Instead of using the entire dataset, mini-batch gradient descent computes the gradient
on small batches (subsets) of data.
● Advantages:
○ Computationally efficient.
○ Strikes a balance between the accuracy of Batch Gradient Descent and the
noisy updates of SGD.
○ Common batch sizes: 32, 64, 128.
● Use Case: Widely used in deep learning frameworks (e.g., TensorFlow, PyTorch) as it
optimizes memory usage and allows for faster computation on GPUs.
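A sketch of the mini-batch loop itself, here with a batch size of 32 on a small linear-regression problem (the data and hyperparameters are illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(1000, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=1000)

w = np.zeros(3)
lr, batch_size = 0.05, 32
for epoch in range(20):
    order = rng.permutation(len(X))                 # shuffle once per epoch
    for start in range(0, len(X), batch_size):
        batch = order[start:start + batch_size]     # indices of one mini-batch
        Xb, yb = X[batch], y[batch]
        grad = 2 * Xb.T @ (Xb @ w - yb) / len(yb)   # gradient on the mini-batch
        w -= lr * grad

print(w)   # approaches the true coefficients [1.0, -2.0, 0.5]
```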
Stochastic Gradient Descent (SGD):
● In each iteration, SGD computes the gradient using a single randomly chosen data point.
● Advantages:
○ Very fast as it only processes one example per iteration.
○ Introduces noise in the gradient updates, which can help the algorithm escape local
minima and saddle points.
● Disadvantages:
○ Noisy updates can cause oscillations around the minimum rather than exact
convergence.
● Use Case: Suitable for large-scale problems where using the entire dataset is computationally
expensive.
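SGD is the same loop with a batch size of one: each update uses a single randomly chosen example, so the steps are cheap but noisy (again, the data and hyperparameters below are illustrative):

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(1000, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=1000)

w = np.zeros(3)
lr = 0.01
for epoch in range(10):
    for i in rng.permutation(len(X)):      # visit examples in random order
        xi, yi = X[i], y[i]
        grad = 2 * xi * (xi @ w - yi)      # gradient from a single example
        w -= lr * grad                     # noisy but very cheap update

print(w)   # hovers near [1.0, -2.0, 0.5] rather than converging exactly
```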
Comparison of Gradient Descent Methods
● Batch Gradient Descent: Uses the entire dataset for each update;
more accurate but computationally intensive.
● Mini-Batch Gradient Descent: Trades off between stability and
computational efficiency by using small batches; very popular in
practice.
● Stochastic Gradient Descent (SGD): Fast, noisy updates; useful for
large datasets and online learning.
Takeaways
● Gradient Descent is a versatile and widely used optimization
algorithm for training machine learning models.
● Proper tuning of the learning rate and choosing the right variant
(batch, mini-batch, or stochastic) are key to achieving efficient and
effective optimization.
