Machine Learning on Big Data
Lessons Learned from Google Projects

Max Lin
Software Engineer | Google Research

Massively Parallel Computing | Harvard CS 264
Guest Lecture | March 29th, 2011
Outline

• Machine Learning intro
• Scaling machine learning algorithms up
• Design choices of large scale ML systems
“Machine Learning is the study of computer algorithms that improve automatically through experience.” (Tom Mitchell)
Training (Input X → Output Y), learn a model f(x):

    “The quick brown fox jumped over the lazy dog.” → English
    “To err is human, but to really foul things up you need a computer.” → English
    “No hay mal que por bien no venga.” → Spanish
    “La tercera es la vencida.” → Spanish

Testing, apply f(x’) = y’ to unseen input:

    “To be or not to be -- that is the question” → ?
    “La fe mueve montañas.” → ?
Linear Classifier
       The quick brown fox jumped over the lazy dog.

     ‘a’ ... ‘aardvark’ ... ‘dog’ ... ‘the’ ... ‘montañas’ ...
x [  0,  ...     0,     ...   1,  ...   1,  ...      0,     ... ]
w [ 0.1, ...   132,     ... 150,  ... 200,  ...   -153,     ... ]

    f(x) = w · x = Σ_{p=1}^{P} w_p ∗ x_p
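A minimal Python sketch of this scoring step, assuming a binary bag-of-words representation and a dict as a sparse weight vector (the weight values are the illustrative ones above, not real model parameters):

    import re

    def score(document, weights):
        # Binary bag of words: x_p = 1 iff token p appears in the document.
        tokens = set(re.findall(r"\w+", document.lower()))
        # f(x) = w · x reduces to summing the weights of the present tokens.
        return sum(weights.get(t, 0.0) for t in tokens)

    weights = {'dog': 150.0, 'the': 200.0, 'montañas': -153.0}
    print(score("The quick brown fox jumped over the lazy dog.", weights))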
Training Data

Input X is an N × P matrix (N examples as rows, P features as columns); Output Y is an N × 1 vector of labels.
Typical machine learning
data at Google
(mean / median)

N: 100 billion / 1 billion examples
P: 1 billion / 10 million features




                              http://www.flickr.com/photos/mr_t_in_dc/5469563053
Classifier Training


• Training: given {(x, y)} and f, minimize the
  following objective function (L is a loss function, R a regularizer):

    arg min_w Σ_{i=1}^{N} L(y_i, f(x_i; w)) + R(w)
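A minimal sketch of evaluating this objective, assuming squared loss and an L2 regularizer (the slides leave L and R unspecified):

    def objective(examples, f, w, lam=0.1):
        # Data term: sum of per-example losses L(y_i, f(x_i; w)).
        loss = sum((y - f(x, w)) ** 2 for x, y in examples)
        # Regularizer R(w): L2 penalty on the weights.
        reg = lam * sum(wj * wj for wj in w)
        return loss + reg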
Use Newton’s method?

    w^{t+1} ← w^t − H(w^t)^{-1} ∇J(w^t)

With P on the order of a billion, forming and inverting the P × P Hessian H is infeasible.


                    http://www.flickr.com/photos/visitfinland/5424369765/
Outline

• Machine Learning intro
• Scaling machine learning algorithms up
• Design choices of large scale ML systems
Scaling Up

• Why big data?
• Parallelize machine learning algorithms
 • Embarrassingly parallel
 • Parallelize sub-routines
 • Distributed learning
Subsampling

Reduce N: split the Big Data into Shard 1, Shard 2, Shard 3, ..., Shard M, then train on a single machine using only one shard to produce the model.
Why not Small Data?

Classifier accuracy keeps climbing as the training set grows by orders of magnitude.

                [Banko and Brill, 2001]
Scaling Up

• Why big data?
• Parallelize machine learning algorithms
 • Embarrassingly parallel
 • Parallelize sub-routines
 • Distributed learning
Parallelize Estimates
• Naive Bayes Classifier

    arg min_w − Π_{i=1}^{N} P(y_i; w) Π_{p=1}^{P} P(x_p^i | y_i; w)

• Maximum Likelihood Estimates

    w_{the|EN} = ( Σ_{i=1}^{N} 1_{EN,the}(x^i) ) / ( Σ_{i=1}^{N} 1_{EN}(x^i) )
Word Counting

Map: X: “The quick brown fox ...”, Y: EN → emit (‘the|EN’, 1), (‘quick|EN’, 1), (‘brown|EN’, 1), ...

Reduce: [ (‘the|EN’, 1), (‘the|EN’, 1), (‘the|EN’, 1) ] → C(‘the’|EN) = SUM of values = 3

    w_{the|EN} = C(‘the’|EN) / C(EN)
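A single-process Python sketch of this map/reduce pair (illustrative only; in the real system mappers and reducers run on separate machines):

    from collections import defaultdict

    def mapper(text, label):
        # Emit ('word|label', 1) for every token in the example.
        for word in text.lower().split():
            yield (word + '|' + label, 1)

    def reducer(pairs):
        # Sum the values for each key: C(word|label).
        counts = defaultdict(int)
        for key, value in pairs:
            counts[key] += value
        return counts

    print(reducer(mapper("the quick brown fox jumped over the lazy dog", "EN")))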
Word Counting

Map: Mappers 1..M read Shard 1, Shard 2, Shard 3, ..., Shard M of the Big Data and emit pairs such as (‘the’ | EN, 1), (‘fox’ | EN, 1), ..., (‘montañas’ | ES, 1).

Reduce: a reducer tallies the counts and updates w, producing the Model.
Parallelize Optimization
• Maximum Entropy Classifiers

    arg min_w − Π_{i=1}^{N} exp(Σ_{p=1}^{P} w_p ∗ x_p^i)^{y_i} / (1 + exp(Σ_{p=1}^{P} w_p ∗ x_p^i))

• Good: J(w) is concave (no local optima)
• Bad: no closed-form solution like NB
• Ugly: large N
Gradient Descent




        http://www.cs.cmu.edu/~epxing/Class/10701/Lecture/lecture7.pdf
Gradient Descent
• w is initialized as zero
• for t in 1 to T:
 • calculate the gradient ∇J(w^t)
 • w^{t+1} ← w^t − η ∇J(w^t)

    ∇J(w) = Σ_{i=1}^{N} P(w, x_i, y_i)

(the gradient decomposes into a sum of per-example terms)
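A minimal logistic-regression instance of this loop in Python, with dense feature lists (the function names, η, and T are illustrative, not from the talk):

    import math

    def grad_J(w, data):
        # ∇J(w): sum of per-example gradient terms of the negative log-likelihood.
        g = [0.0] * len(w)
        for x, y in data:
            margin = sum(wp * xp for wp, xp in zip(w, x))
            p = 1.0 / (1.0 + math.exp(-margin))
            for j, xj in enumerate(x):
                g[j] += (p - y) * xj
        return g

    def gradient_descent(data, P, T=100, eta=0.1):
        w = [0.0] * P                                    # w initialized as zero
        for _ in range(T):                               # for t in 1 to T
            g = grad_J(w, data)
            w = [wj - eta * gj for wj, gj in zip(w, g)]  # w ← w − η∇J(w)
        return w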
Distribute Gradient
• w is initialized as zero
• for t in 1 to T:
 • calculate the gradient in parallel across M machines
 • w^{t+1} ← w^t − η ∇J(w^t)

• Training CPU: O(TPN) to O(TPN / M)
Distribute Gradient

Map: Machines 1..M each read one shard (Shard 1, ..., Shard M) of the Big Data and emit (dummy key, partial gradient sum).

Reduce: sum the partial gradients and update w.

Repeat the Map/Reduce rounds until convergence; output the Model.
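A sketch of one such iteration, simulated in-process (reusing grad_J from the sketch above; in the real system each partial gradient is computed on its own machine):

    def distributed_step(w, shards, eta=0.1):
        # Map: each "machine" computes a partial gradient sum over its shard.
        partials = [grad_J(w, shard) for shard in shards]
        # Reduce: sum the partial gradients coordinate-wise.
        total = [sum(gs) for gs in zip(*partials)]
        # Update w with the combined full-batch gradient.
        return [wj - eta * gj for wj, gj in zip(w, total)]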
Scaling Up

• Why big data?
• Parallelize machine learning algorithms
 • Embarrassingly parallel
 • Parallelize sub-routines
 • Distributed learning
Parallelize Subroutines
• Support Vector Machines

    arg min_{w,b,ζ} (1/2) ||w||₂² + C Σ_{i=1}^{n} ζ_i
    s.t. 1 − y_i (w · φ(x_i) + b) ≤ ζ_i,  ζ_i ≥ 0

• Solve the dual problem

    arg min_α (1/2) αᵀ Q α − αᵀ 1
    s.t. 0 ≤ α ≤ C,  yᵀ α = 0
The computational cost of the Primal-Dual Interior Point Method is O(n^3) in time and O(n^2) in memory.




http://www.flickr.com/photos/sea-turtle/198445204/
Parallel SVM                [Chang et al., 2007]

•   Parallel, row-wise incomplete Cholesky
    factorization (ICF) for Q
•   Parallel interior point method
    •   Time O(n^3) becomes O(n^2 / M)
    •   Memory O(n^2) becomes O(n√N / M)
•   Parallel Support Vector Machines (psvm): http://code.google.com/p/psvm/
    •   Implemented in MPI
Parallel ICF
• Distribute Q by row across M machines:

    Machine 1: rows 1, 2    Machine 2: rows 3, 4    Machine 3: rows 5, 6    ...

• For each of the √N dimensions:
  • each worker sends its local pivot to the master
  • the master selects the largest of the local pivots and
    broadcasts the global pivot to the workers
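A sketch of one pivot-selection round, simulated with plain Python lists rather than MPI (the worker/master structure and names are assumptions for illustration):

    def local_pivot(rows):
        # Each worker proposes its best local pivot: the row with the
        # largest remaining diagonal value.
        return max(rows, key=lambda r: r["diag"])

    def global_pivot(workers):
        # Workers send their local pivots to the master ...
        candidates = [local_pivot(rows) for rows in workers]
        # ... and the master selects the largest to broadcast back.
        return max(candidates, key=lambda r: r["diag"])

    workers = [[{"row": 1, "diag": 0.9}, {"row": 2, "diag": 0.4}],
               [{"row": 3, "diag": 1.3}, {"row": 4, "diag": 0.2}]]
    print(global_pivot(workers))  # {'row': 3, 'diag': 1.3}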
Scaling Up

• Why big data?
• Parallelize machine learning algorithms
 • Embarrassingly parallel
 • Parallelize sub-routines
 • Distributed learning
Majority Vote

Map: Machines 1..M each train on one shard (Shard 1, ..., Shard M) of the Big Data.

Output: Model 1, Model 2, Model 3, ..., Model M.
Majority Vote

• Train individual classifiers independently
• Predict by taking majority votes
• Training CPU: O(TPN) to O(TPN / M)
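A minimal sketch of the prediction side (each model is any callable mapping x to a label; the names are illustrative):

    from collections import Counter

    def majority_vote(models, x):
        # Each independently trained classifier votes; return the top label.
        votes = Counter(model(x) for model in models)
        return votes.most_common(1)[0][0]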
Parameter Mixture                          [Mann et al., 2009]

Map: Machines 1..M each train on one shard (Shard 1, ..., Shard M) of the Big Data and emit (dummy key, w_m).

Reduce: average the per-shard weight vectors w_1, w_2, ... to produce the Model.
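A sketch of the reduce step, with weight vectors as dense Python lists:

    def average_weights(weight_vectors):
        # Coordinate-wise mean of the per-shard weight vectors w_1..w_M.
        M = len(weight_vectors)
        return [sum(ws) / M for ws in zip(*weight_vectors)]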
Much less network usage than distributed gradient descent: O(MN) vs. O(MNT).




http://www.flickr.com/photos/annamatic3000/127945652/
Iterative Param Mixture                       [McDonald et al., 2010]

Map: Machines 1..M each train for one epoch on their shard (Shard 1, ..., Shard M) and emit (dummy key, w_m).

Reduce (after each epoch): average w, redistribute it, and repeat; output the Model.
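A sketch of that training loop (reusing average_weights above; train_epoch is an assumed helper that runs one pass over a shard starting from the given weights):

    def iterative_param_mixture(shards, P, epochs, train_epoch):
        w = [0.0] * P
        for _ in range(epochs):
            # Map: one epoch of training per shard, all starting from w.
            per_shard = [train_epoch(list(w), shard) for shard in shards]
            # Reduce: average the resulting weight vectors after each epoch.
            w = average_weights(per_shard)
        return w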
Outline

• Machine Learning intro
• Scaling machine learning algorithms up
• Design choices of large scale ML systems
Scalable



           http://www.flickr.com/photos/mr_t_in_dc/5469563053
Parallel



http://www.flickr.com/photos/aloshbennett/3209564747/
Accuracy
http://www.flickr.com/photos/wanderlinse/4367261825/
http://www.flickr.com/photos/imagelink/4006753760/
Binary Classification

http://www.flickr.com/photos/brenderous/4532934181/
Automatic Feature Discovery


   http://www.flickr.com/photos/mararie/2340572508/
Fast Response

http://www.flickr.com/photos/prunejuice/3687192643/
Memory is the new hard disk.




http://www.flickr.com/photos/jepoirrier/840415676/
Algorithm + Infrastructure

http://www.flickr.com/photos/neubie/854242030/
Design for Multicores
             http://www.flickr.com/photos/geektechnique/2344029370/
Combiner
Multi-shard Combiner




[Chandra et al., 2010]
Machine Learning on Big Data
Parallelize ML Algorithms

• Embarrassingly parallel
• Parallelize sub-routines
• Distributed learning
Parallel · Accuracy · Fast Response
Google APIs
•   Prediction API
    •   machine learning service on the cloud
    •   http://code.google.com/apis/predict


•   BigQuery
    •   interactive analysis of massive data on the cloud
    •   http://code.google.com/apis/bigquery
