17. Large Scale Machine Learning
-- Machine learning algorithms tend to work better when a large amount of data is fed to them, so the algorithms themselves have to be efficient enough to handle such large datasets.
Prediction accuracy generally grows as the amount of training data increases, for each type of algorithm used.
With a training set of 100 million examples, a single step of gradient descent has to compute a sum over 100 million terms.
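For reference, the batch gradient descent update for linear regression from the earlier lectures is not written out in these notes; a minimal statement of it (assuming the usual squared-error cost) is:

\theta_j := \theta_j - \alpha \, \frac{1}{m} \sum_{i=1}^{m} \left( h_\theta\!\left(x^{(i)}\right) - y^{(i)} \right) x_j^{(i)}

With m = 100,000,000, every single update of every parameter θ_j therefore sums 100 million terms.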
Sanity check: train on just a small subset of examples first and check whether increasing the number of examples is actually likely to benefit the model.
Draw the learning curves (training error and cross-validation error against training set size); a code sketch of this check follows the two cases below.
If the learning curve looks like a high-variance problem (a persistent gap between the training error and the cross-validation error), then increasing the number of examples is likely to help.
If the learning curve looks like a high-bias problem (training and cross-validation error both high and close together), then increasing the number of examples is not very likely to benefit the model; what mainly helps here is adding extra features (or extra hidden units, in a neural network).
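A minimal sketch of this sanity check, assuming hypothetical train_fn / cost_fn helpers (e.g. a linear regression fit and a squared-error cost) that are not part of the original notes:

def learning_curves(train_fn, cost_fn, X, y, X_cv, y_cv, subset_sizes):
    # Train on growing subsets of the data and record training vs. cross-validation error;
    # a persistent gap suggests high variance, two high flat curves suggest high bias.
    train_err, cv_err = [], []
    for m_sub in subset_sizes:
        theta = train_fn(X[:m_sub], y[:m_sub])
        train_err.append(cost_fn(theta, X[:m_sub], y[:m_sub]))
        cv_err.append(cost_fn(theta, X_cv, y_cv))
    return train_err, cv_err  # plot both against subset_sizes to get the learning curves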
STOCHASTIC GRADIENT DESCENT: makes each step much less expensive than batch gradient descent.
What (batch) gradient descent does: on every step it sweeps over the entire training set, moving the parameters down the cost contours toward the minimum.
➢It is called batch gradient descent because each update uses the whole batch of m examples.
What stochastic gradient descent does:
➢Each time the inner for-loop runs, it fits the parameters a little better for that one particular example (see the sketch below).
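A minimal sketch of stochastic gradient descent for linear regression (written here in Python/numpy rather than the course's Octave):

import numpy as np

def sgd_linear_regression(X, y, alpha=0.01, epochs=1):
    # X: (m, n) feature matrix, y: (m,) targets
    m, n = X.shape
    theta = np.zeros(n)
    for _ in range(epochs):                  # outer loop, typically run 1-10 times
        for i in np.random.permutation(m):   # randomly shuffle, then one example at a time
            error = X[i] @ theta - y[i]      # h_theta(x^(i)) - y^(i)
            theta -= alpha * error * X[i]    # update using only this single example
    return theta

Each inner-loop iteration touches only one example, so one update costs O(n) instead of O(m * n).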
Batch gradient descent converges to the optimum along a reasonably direct path.
SGD does not follow such a path; each individual step moves in a somewhat random direction, but overall it converges to a region near the optimum.
It never actually reaches the optimum and keeps wandering around it, but that is a perfectly fine fit, since any parameters in the region near the optimum are good enough.
The outer loop is typically run 1 to 10 times, depending on the size of m.
If m is very large, running the outer loop just once can already give a reasonably good hypothesis, whereas with batch gradient descent a single pass over all the examples only moves the parameters one step closer to the optimum.
MINI-BATCH GRADIENT DESCENT:
Use a small batch of b examples per update (with 1 < b < m), in between SGD (b = 1) and batch gradient descent (b = m). Using b examples at a time lets a vectorized implementation exploit parallel computation, as in the sketch below.
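A minimal mini-batch sketch under the same assumptions as above; the gradient over each batch of b examples is computed with a single vectorized matrix product:

import numpy as np

def minibatch_gd(X, y, alpha=0.01, b=10, epochs=10):
    m, n = X.shape
    theta = np.zeros(n)
    for _ in range(epochs):
        idx = np.random.permutation(m)
        for start in range(0, m, b):
            batch = idx[start:start + b]
            # vectorized gradient over the b examples in this mini-batch
            grad = X[batch].T @ (X[batch] @ theta - y[batch]) / len(batch)
            theta -= alpha * grad
    return theta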
SGD CONVERGENCE:
To monitor convergence, compute the cost on each example just before updating the parameters, and plot that cost averaged over the last 1000 examples.
If the plot shows two decreasing curves (red: smaller α; blue: larger α): the curves are noisy because the cost is only averaged over 1000 examples.
With a smaller learning rate the algorithm initially learns more slowly, but it eventually gives a slightly better estimate: SGD does not truly converge, the parameters oscillate around the optimum, and a smaller learning rate makes those oscillations smaller.
If we average over 5000 examples instead, the curve smoothens, because we only plot one data point for every 5000 examples.
If the plot looks flat or very noisy:
Blue – the number of examples averaged over is very low, so the curve is too noisy to read.
Red – increasing the number of examples averaged over gives a smoother curve and reveals that the cost is in fact decreasing.
Pink – if the curve still looks flat after increasing the number of examples averaged over, something is wrong: the algorithm is not learning and we have to change some of its parameters (the learning rate, the features, etc.).
If the plot shows the cost increasing, the algorithm is diverging; use a smaller learning rate α.
How to make SGD actually converge: slowly decrease the learning rate over time, e.g. α = const1 / (iterationNumber + const2), so that the oscillations around the optimum shrink. A sketch of the convergence check described above follows.
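A minimal sketch of that convergence check, with an assumed decay schedule alpha0 / (1 + t / const2) standing in for the const1 / (iterationNumber + const2) idea:

import numpy as np

def sgd_with_monitoring(X, y, alpha0=0.01, const2=1000.0, avg_window=1000, epochs=1):
    m, n = X.shape
    theta = np.zeros(n)
    recent, averaged, t = [], [], 0
    for _ in range(epochs):
        for i in np.random.permutation(m):
            error = X[i] @ theta - y[i]
            recent.append(0.5 * error ** 2)         # cost on this example, before the update
            alpha = alpha0 / (1.0 + t / const2)     # slowly decreasing learning rate
            theta -= alpha * error * X[i]
            t += 1
            if len(recent) == avg_window:
                averaged.append(float(np.mean(recent)))  # one plotted point per avg_window examples
                recent = []
    return theta, averaged  # plot `averaged` to get the curves described above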
ONLINE LEARNING: for settings with a continuous stream of incoming data, such as websites that track users' actions.
➢It can adapt to changing user preferences, even if the entire pool of users changes over time.
➢After updating the parameters with the current example, we can just throw that example away.
≫ If the data arrives as a continuous stream, this approach works well (a sketch follows this list).
≫ If the amount of data is small, it is better to save it and train an ordinary batch model such as logistic regression on the full dataset.
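A minimal sketch of online learning with logistic regression, where each (x, y) pair triggers one update and is then discarded:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

class OnlineLogisticRegression:
    def __init__(self, n_features, alpha=0.1):
        self.theta = np.zeros(n_features)
        self.alpha = alpha

    def update(self, x, y):
        # one stochastic gradient step on this single example; x can be thrown away afterwards
        error = sigmoid(x @ self.theta) - y
        self.theta -= self.alpha * error * x

    def predict_proba(self, x):
        # estimated probability that y = 1 (e.g. that the user clicks)
        return sigmoid(x @ self.theta)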
CLICK-THROUGH RATE (CTR) PREDICTION PROBLEM – an online learning example.
Say the shop has 100 phones and we want to return the top 10 results to the user.
Each time the user clicks (or ignores) a shown link, we collect an (x, y) pair to learn from, and we show users the products they are most likely to click on, as sketched below.
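Continuing the OnlineLogisticRegression sketch above: ranking the phones could look roughly like this, where phone_features is a hypothetical helper that builds the feature vector x for a (user query, phone) pair and is not part of the original notes:

def top_10_phones(model, query, phones, phone_features):
    # score every phone by its predicted click probability and keep the 10 best
    scored = [(model.predict_proba(phone_features(query, p)), p) for p in phones]
    scored.sort(key=lambda item: item[0], reverse=True)
    return [p for _, p in scored[:10]]

# For each of the 10 shown phones we later observe y (1 if clicked, 0 if not)
# and call model.update(x, y); the (x, y) pair can then be discarded.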
MAP-REDUCE:
When an ML problem is too big for a single machine, we can use map-reduce:
➢We split the training set and send each part to a different computer.
➢Each computer processes its part of the data and sends its partial result to a centralized controller.
➢The centralized controller combines the partial results and produces the model (the updated parameters).
When to use map-reduce:
Ask yourself: can the algorithm be expressed as a sum of functions over the training set?
▪ If yes: use map-reduce (a sketch of the idea follows).
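A minimal single-machine sketch of the map-reduce idea for one batch gradient descent step, using Python's multiprocessing to stand in for the separate computers (run it under an `if __name__ == "__main__":` guard):

import numpy as np
from multiprocessing import Pool

def partial_gradient(args):
    # "map" step: gradient contribution from one shard of the training set
    X_shard, y_shard, theta = args
    return X_shard.T @ (X_shard @ theta - y_shard)

def mapreduce_gradient_step(X, y, theta, alpha=0.01, n_shards=4):
    m = len(y)
    shards = list(zip(np.array_split(X, n_shards),
                      np.array_split(y, n_shards),
                      [theta] * n_shards))
    with Pool(n_shards) as pool:
        partials = pool.map(partial_gradient, shards)  # each worker handles one shard
    grad = sum(partials) / m                           # "reduce": combine the partial sums
    return theta - alpha * grad                        # centralized controller updates theta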
DATA PARALLELISM:
We can also apply the map-reduce idea on a single machine with multiple cores:
➢This way we don't have to worry about network latencies.
Some linear algebra libraries parallelize across cores automatically, so a well-vectorized implementation can get this benefit for free (see the sketch below).
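For the single-machine case, a fully vectorized gradient often gets multi-core parallelism for free; a minimal sketch, assuming numpy is linked against a multi-threaded BLAS (as many builds are):

import numpy as np

def vectorized_gradient(X, y, theta):
    # one matrix product over the whole training set; a multi-threaded BLAS
    # backend spreads this across CPU cores without any explicit parallel code
    return X.T @ (X @ theta - y) / len(y)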