Machine Learning on Big Data
Lessons Learned from Google Projects

Max Lin
Software Engineer | Google Research

Massively Parallel Computing | Harvard CS 264
Guest Lecture | March 29th, 2011
Outline

• Machine Learning intro
• Scaling machine learning algorithms up
• Design choices of large scale ML systems
“Machine Learning is a study of computer algorithms that
improve automatically through experience.”
Training: Input X and Output Y are used to fit Model f(x)

  The quick brown fox jumped over the lazy dog.                        English
  To err is human, but to really foul things up you need a computer.   English
  No hay mal que por bien no venga.                                    Spanish
  La tercera es la vencida.                                            Spanish

Testing: f(x’) = y’

  To be or not to be -- that is the question                           ?
  La fe mueve montañas.                                                ?
Linear Classifier
        The quick brown fox jumped over the lazy dog.

      ‘a’ ... ‘aardvark’ ... ‘dog’ ... ‘the’ ... ‘montañas’ ...
x = [ 0,   ...  0,       ...  1,   ...  1,   ...  0,        ... ]
w = [ 0.1, ...  132,     ...  150, ...  200, ... -153,      ... ]

  $f(x) = w \cdot x = \sum_{p=1}^{P} w_p x_p$
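
As a rough illustration of the linear classifier above, the sketch below builds a binary bag-of-words vector x over a tiny vocabulary and computes f(x) = w · x. The five-word vocabulary, the `featurize` and `score` helper names, and the tokenization rule are assumptions made for this example; the weights simply reuse the illustrative numbers on the slide.

```python
def featurize(text, vocabulary):
    """Return a binary bag-of-words vector x over a fixed vocabulary."""
    # Assumed tokenization: split on whitespace, strip punctuation, lowercase.
    tokens = {tok.strip(".,!?").lower() for tok in text.split()}
    return [1 if word in tokens else 0 for word in vocabulary]


def score(w, x):
    """Linear classifier score f(x) = w . x = sum over p of w_p * x_p."""
    return sum(w_p * x_p for w_p, x_p in zip(w, x))


# Toy vocabulary and weights (illustrative values only).
vocabulary = ["a", "aardvark", "dog", "the", "montañas"]
w = [0.1, 132, 150, 200, -153]

x = featurize("The quick brown fox jumped over the lazy dog.", vocabulary)
f_x = score(w, x)  # only 'dog' and 'the' are active, so f(x) = 150 + 200
```

In a real system the vocabulary has P entries (millions or more), so x would be stored sparsely rather than as a dense list.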
Training Data
[Figure: Input X as an N × P matrix (N examples as rows, P features as columns) alongside a column of N values for Output Y]
Typical machine learning
data at Google

N: 100 billion / 1 billion
P: 1 billion / 10 million
(mean / median)




                              http://www.flickr.com/photos/mr_t_in_dc/5469563053
Classifier Training

• Training: Given {(x, y)} and f, minimize the
  following objective function

  $\arg\min_{w} \sum_{i=1}^{N} L(y_i, f(x_i; w)) + R(w)$
Use Newton’s method?

  $w^{t+1} \leftarrow w^{t} - H(w^{t})^{-1} \nabla J(w^{t})$




                     http://www.flickr.com/photos/visitfinland/5424369765/
Outline

• Machine Learning intro
• Scaling machine learning algorithms up
• Design choices of large scale ML systems
Scaling Up

• Why big data?
• Parallelize machine learning algorithms
 • Embarrassingly parallel
 • Parallelize sub-routines
 • Distributed learning
Subsampling
[Diagram: Big Data is split into shards; N is reduced by keeping only Shard 1, which fits on a single Machine that trains the Model]
Why not Small Data?




                [Banko and Brill, 2001]
Scaling Up

• Why big data?
• Parallelize machine learning algorithms
 • Embarrassingly parallel
 • Parallelize sub-routines
 • Distributed learning
Parallelize Estimates
• Naive Bayes Classifier

  $\arg\min_{w} -\prod_{i=1}^{N} \prod_{p=1}^{P} P(x_p^i \mid y_i; w) \, P(y_i; w)$

• Maximum Likelihood Estimates

  $w_{\text{the}|EN} = \frac{\sum_{i=1}^{N} 1_{EN,\text{the}}(x^i)}{\sum_{i=1}^{N} 1_{EN}(x^i)}$
Word Counting

Map      X: “The quick brown fox ...”      (‘the|EN’, 1)
         Y: EN                             (‘quick|EN’, 1)
                                           (‘brown|EN’, 1)

Reduce   [ (‘the|EN’, 1), (‘the|EN’, 1), (‘the|EN’, 1) ]
         C(‘the’|EN) = SUM of values = 3

         $w_{\text{the}|EN} = \frac{C(\text{the}|EN)}{C(EN)}$
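
The map and reduce steps above can be sketched in a few lines of single-process Python. This is only an illustration, not the MapReduce API: `map_fn`, `reduce_fn`, and the extra per-label key used to obtain C(EN) are assumptions of this sketch. Following the indicator functions in the maximum likelihood estimate, each word is counted at most once per document.

```python
from collections import defaultdict


def map_fn(text, label):
    """Map: emit ('word|label', 1) once per distinct word, plus (label, 1)."""
    words = {tok.strip(".,!?").lower() for tok in text.split()}
    for word in sorted(words - {""}):
        yield (f"{word}|{label}", 1)
    yield (label, 1)  # assumed extra key so the reducer can compute C(label)


def reduce_fn(pairs):
    """Reduce: sum the values emitted for each key."""
    counts = defaultdict(int)
    for key, value in pairs:
        counts[key] += value
    return dict(counts)


docs = [
    ("The quick brown fox jumped over the lazy dog.", "EN"),
    ("To err is human, but to really foul things up you need a computer.", "EN"),
    ("La fe mueve montañas.", "ES"),
]

# In a real job each shard runs map_fn on its own machine; here it is one loop.
pairs = [pair for text, label in docs for pair in map_fn(text, label)]
counts = reduce_fn(pairs)

# w_{the|EN} = C(the|EN) / C(EN)
w_the_en = counts["the|EN"] / counts["EN"]
```

Because the reducer only adds integers, the counts do not depend on how the documents are split across shards.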
Word Counting
                                      Big Data

             Mapper 1   Mapper 2    Mapper 3             Mapper M

 Map          Shard 1    Shard 2      Shard 3      ...    Shard M



         (‘the’ | EN, 1) (‘fox’ | EN, 1) ... (‘montañas’ | ES, 1)

                                     Reducer
Reduce                              Tally counts
                                   and update w


                                      Model
Parallelize Optimization
• Maximum Entropy Classifiers

  $\arg\min_{w} \prod_{i=1}^{N} \frac{\exp\left(\sum_{p=1}^{P} w_p x_p^i\right)^{y_i}}{1 + \exp\left(\sum_{p=1}^{P} w_p x_p^i\right)}$

• Good: J(w) is concave
• Bad: no closed-form solution like NB
• Ugly: Large N
Gradient Descent




        http://www.cs.cmu.edu/~epxing/Class/10701/Lecture/lecture7.pdf
Gradient Descent
• w is initialized as zero
• for t in 1 to T
 • Calculate gradients $\nabla J(w)$
 • $w^{t+1} \leftarrow w^{t} - \eta \nabla J(w)$

  $\nabla J(w) = \sum_{i=1}^{N} P(w, x_i, y_i)$
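
The loop above might look like the following minimal sketch. It assumes J(w) is the negative log-likelihood of a binary logistic model, so each per-example term of the gradient is (sigmoid(w · x_i) − y_i) x_i; the toy data, the step size η = 0.1, and T = 200 are illustrative choices rather than values from the lecture.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def gradient(w, data):
    """Gradient of J(w), written as a sum of per-example terms."""
    grad = [0.0] * len(w)
    for x, y in data:
        p = sigmoid(sum(w_p * x_p for w_p, x_p in zip(w, x)))
        for j, x_j in enumerate(x):
            grad[j] += (p - y) * x_j
    return grad


def gradient_descent(data, num_features, eta=0.1, T=200):
    w = [0.0] * num_features          # w is initialized as zero
    for _ in range(T):                # for t in 1 to T
        g = gradient(w, data)         # calculate gradients
        w = [w_p - eta * g_p for w_p, g_p in zip(w, g)]  # w <- w - eta * grad
    return w


# Toy, linearly separable data: (features, label in {0, 1}).
data = [([1.0, 0.0], 1), ([1.0, 1.0], 1), ([0.0, 1.0], 0), ([0.0, 2.0], 0)]
w = gradient_descent(data, num_features=2)
predictions = [
    int(sigmoid(sum(a * b for a, b in zip(w, x))) > 0.5) for x, _ in data
]
```

Each iteration touches all N examples and all P features, which is where the O(TPN) training cost comes from.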
Distribute Gradient
• w is initialized as zero
• for t in 1 to T
 • Calculate gradients in parallel
 • $w^{t+1} \leftarrow w^{t} - \eta \nabla J(w)$

• Training CPU: O(TPN) to O(TPN / M)
Distribute Gradient
                                      Big Data

          Machine 1     Machine 2   Machine 3          Machine M

 Map       Shard 1       Shard 2     Shard 3     ...    Shard M



                     (dummy key, partial gradient sum)


Reduce                               Sum and
                                     Update w


           Repeat M/R
         until convergence             Model
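
A single-process sketch of this map/reduce loop is shown below. The mappers run one after another here, where a real job would run them on M machines. The logistic-loss gradient, the helper names, and the fixed number of rounds T (used in place of a convergence check) are assumptions of this example.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def partial_gradient(w, shard):
    """Map: emit (dummy key, gradient sum over this shard only)."""
    grad = [0.0] * len(w)
    for x, y in shard:
        p = sigmoid(sum(w_p * x_p for w_p, x_p in zip(w, x)))
        for j, x_j in enumerate(x):
            grad[j] += (p - y) * x_j
    return ("dummy", grad)


def reduce_and_update(w, emitted, eta):
    """Reduce: sum the partial gradients, then take one descent step."""
    total = [sum(parts) for parts in zip(*(grad for _, grad in emitted))]
    return [w_p - eta * g_p for w_p, g_p in zip(w, total)]


def train(shards, num_features, eta=0.1, T=200):
    w = [0.0] * num_features
    for _ in range(T):  # one map/reduce round per iteration
        emitted = [partial_gradient(w, shard) for shard in shards]
        w = reduce_and_update(w, emitted, eta)
    return w


shards = [
    [([1.0, 0.0], 1), ([0.0, 1.0], 0)],   # Shard 1
    [([1.0, 1.0], 1), ([0.0, 2.0], 0)],   # Shard 2
]
w = train(shards, num_features=2)
```

Since the gradient is a sum over examples, the reduced result equals the single-machine gradient; the price is that the current w must be shipped to every mapper in each of the T rounds.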
Scaling Up

• Why big data?
• Parallelize machine learning algorithms
 • Embarrassingly parallel
 • Parallelize sub-routines
 • Distributed learning
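Parallelize Subroutines
• Support Vector Machines

  $\arg\min_{w, b, \zeta} \frac{1}{2} \|w\|_2^2 + C \sum_{i=1}^{n} \zeta_i$
  s.t. $1 - y_i (w \cdot \phi(x_i) + b) \le \zeta_i, \quad \zeta_i \ge 0$

• Solve the dual problem

  $\arg\min_{\alpha} \frac{1}{2} \alpha^T Q \alpha - \alpha^T \mathbf{1}$
  s.t. $0 \le \alpha \le C, \quad y^T \alpha = 0$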
The computational cost for the Primal-Dual Interior Point Method
is O(n^3) in time and O(n^2) in memory




http://www.flickr.com/photos/sea-turtle/198445204/
Parallel SVM                [Chang et al, 2007]

•   Parallel, row-wise incomplete Cholesky
    Factorization for Q
•   Parallel interior point method
    •   Time O(n^3) becomes O(n^2 / M)
    •   Memory O(n^2) becomes O(n√N / M)
•   Parallel Support Vector Machines (psvm) http://code.google.com/p/psvm/
    •   Implemented in MPI
Parallel ICF
• Distribute Q by row into M machines
    Machine 1     Machine 2   Machine 3

      row 1        row 3       row 5      ...
      row 2        row 4       row 6


• For each dimension n < √N
  • Send local pivots to master
  • Master selects the largest local pivot and
    broadcasts the global pivot to workers
Scaling Up

• Why big data?
• Parallelize machine learning algorithms
 • Embarrassingly parallel
 • Parallelize sub-routines
 • Distributed learning
Majority Vote
                                Big Data

      Machine 1   Machine 2   Machine 3          Machine M

Map    Shard 1     Shard 2     Shard 3     ...    Shard M




      Model 1     Model 2      Model 3           Model M
Majority Vote

• Train individual classifiers independently
• Predict by taking majority votes
• Training CPU: O(TPN) to O(TPN / M)
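
As a hedged sketch of majority voting, the example below trains one classifier per shard with no communication between them and predicts with the most common vote. The perceptron base learner, the ±1 labels, and the toy shards are assumptions chosen for brevity; any classifier could take the perceptron's place.

```python
from collections import Counter


def dot(w, x):
    return sum(a * b for a, b in zip(w, x))


def train_perceptron(shard, num_features, epochs=10):
    """Train one classifier on one shard; labels y are +1 or -1."""
    w = [0.0] * num_features
    for _ in range(epochs):
        for x, y in shard:
            if y * dot(w, x) <= 0:  # misclassified: move w toward y * x
                w = [a + y * b for a, b in zip(w, x)]
    return w


def predict(w, x):
    return 1 if dot(w, x) > 0 else -1


def majority_vote(models, x):
    """Predict with every model and return the most common label."""
    votes = Counter(predict(w, x) for w in models)
    return votes.most_common(1)[0][0]


shards = [
    [([1.0, 0.0], 1), ([0.0, 1.0], -1)],
    [([2.0, 1.0], 1), ([1.0, 2.0], -1)],
    [([1.0, 0.5], 1), ([0.0, 2.0], -1)],
]
# Each model depends only on its own shard, so training parallelizes trivially.
models = [train_perceptron(shard, num_features=2) for shard in shards]
label = majority_vote(models, [3.0, 1.0])
```

Using an odd number of models avoids ties in the binary case.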
Parameter Mixture                          [Mann et al, 2009]

                                   Big Data

         Machine 1   Machine 2   Machine 3                   Machine M

 Map      Shard 1     Shard 2     Shard 3     ...             Shard M




             (dummy key, w1) (dummy key, w2) ...

Reduce                            Average w




                                    Model
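
The parameter mixture diagram can be sketched as follows: every shard trains its own weight vector to completion, emits it under a dummy key, and a single reducer averages the vectors. The perceptron trainer is swapped in here only to keep the example short, and the uniform average is an assumption of this sketch.

```python
def dot(w, x):
    return sum(a * b for a, b in zip(w, x))


def train_perceptron(shard, num_features, epochs=10):
    """Map: train independently on one shard; labels y are +1 or -1."""
    w = [0.0] * num_features
    for _ in range(epochs):
        for x, y in shard:
            if y * dot(w, x) <= 0:
                w = [a + y * b for a, b in zip(w, x)]
    return w


def average(weight_vectors):
    """Reduce: uniform average of the per-shard weight vectors."""
    return [sum(col) / len(weight_vectors) for col in zip(*weight_vectors)]


shards = [
    [([1.0, 0.0], 1), ([0.0, 1.0], -1)],
    [([2.0, 1.0], 1), ([1.0, 2.0], -1)],
    [([1.0, 0.5], 1), ([0.0, 2.0], -1)],
]
emitted = [("dummy", train_perceptron(shard, num_features=2)) for shard in shards]
w = average([w_m for _, w_m in emitted])
```

Only one weight vector per machine crosses the network, and only once, at the end of training.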
Much less network usage than distributed gradient descent
O(MN) vs. O(MNT)

http://www.flickr.com/photos/annamatic3000/127945652/
Iterative Param Mixture                       [McDonald et al., 2010]

                                       Big Data

             Machine 1   Machine 2   Machine 3                Machine M

  Map         Shard 1     Shard 2     Shard 3     ...           Shard M




                 (dummy key, w1) (dummy key, w2) ...
 Reduce after
 each epoch                           Average w
                                        Model
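
A minimal sketch of iterative parameter mixing, assuming a perceptron update and a uniform average: in every epoch each shard starts from the current mixed weights, makes one pass over its own data, and the reducer averages the results. The function names, the toy shards, and the fixed epoch count are illustrative.

```python
def dot(w, x):
    return sum(a * b for a, b in zip(w, x))


def perceptron_epoch(w, shard):
    """Map: one pass over a shard, starting from the shared weights w."""
    w = list(w)
    for x, y in shard:  # labels y are +1 or -1
        if y * dot(w, x) <= 0:
            w = [a + y * b for a, b in zip(w, x)]
    return w


def iterative_parameter_mixing(shards, num_features, epochs=5):
    w = [0.0] * num_features
    for _ in range(epochs):
        emitted = [perceptron_epoch(w, shard) for shard in shards]
        # Reduce after each epoch: average, then send w back to every mapper.
        w = [sum(col) / len(emitted) for col in zip(*emitted)]
    return w


shards = [
    [([1.0, 0.0], 1), ([0.0, 1.0], -1)],
    [([2.0, 1.0], 1), ([1.0, 2.0], -1)],
    [([1.0, 0.5], 1), ([0.0, 2.0], -1)],
]
w = iterative_parameter_mixing(shards, num_features=2)
```

Compared with a one-shot parameter mixture, this costs one exchange of weights per epoch, while still communicating far less often than a per-step gradient sum.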
Outline

• Machine Learning intro
• Scaling machine learning algorithms up
• Design choices of large scale ML systems
Scalable



           http://www.flickr.com/photos/mr_t_in_dc/5469563053
Parallel



http://www.flickr.com/photos/aloshbennett/3209564747/
Accuracy
http://www.flickr.com/photos/wanderlinse/4367261825/
http://www.flickr.com/photos/imagelink/4006753760/
Binary Classification
http://www.flickr.com/photos/brenderous/4532934181/
Automatic Feature Discovery


   http://www.flickr.com/photos/mararie/2340572508/
Fast Response

http://www.flickr.com/photos/prunejuice/3687192643/
Memory is the new hard disk.




http://www.flickr.com/photos/jepoirrier/840415676/
Algorithm + Infrastructure

http://www.flickr.com/photos/neubie/854242030/
Design for Multicores
             http://www.flickr.com/photos/geektechnique/2344029370/
Combiner
Multi-shard Combiner




[Chandra et al., 2010]
Machine Learning on Big Data
Parallelize ML
         Algorithms

• Embarrassingly parallel
• Parallelize sub-routines
• Distributed learning
Parallel   Accuracy


  Fast
Response
Google APIs
•   Prediction API
    •   machine learning service on the cloud
    •   http://code.google.com/apis/predict

•   BigQuery
    •   interactive analysis of massive data on the cloud
    •   http://code.google.com/apis/bigquery

More Related Content

PDF
501 synonyms and antonyms
PPT
Agile Software Development Scrum Vs Lean
PDF
Big data and machine learning for Businesses
PDF
Deep Learning through Examples
PDF

Machine Learning on Big Data

  • 1. Machine Learning on Big Data Lessons Learned from Google Projects Max Lin Software Engineer | Google Research Massively Parallel Computing | Harvard CS 264 Guest Lecture | March 29th, 2011
  • 2. Outline • Machine Learning intro • Scaling machine learning algorithms up • Design choices of large scale ML systems
  • 4. “Machine Learning is a study of computer algorithms that improve automatically through experience.”
  • 31. Training (Input X → Output Y) • “The quick brown fox jumped over the lazy dog.” → English • “To err is human, but to really foul things up you need a computer.” → English • “No hay mal que por bien no venga.” → Spanish • “La tercera es la vencida.” → Spanish • Model f(x) • Testing (f(x’) = y’) • “To be or not to be -- that is the question” → ? • “La fe mueve montañas.” → ?
  • 59. Linear Classifier • “The quick brown fox jumped over the lazy dog.” • Features: [ ‘a’, ... ‘aardvark’, ... ‘dog’, ... ‘the’, ... ‘montañas’, ... ] • x = [ 0, ... 0, ... 1, ... 1, ... 0, ... ] • w = [ 0.1, ... 132, ... 150, ... 200, ... -153, ... ] • $f(x) = w \cdot x = \sum_{p=1}^{P} w_p x_p$
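The dot product on slide 59 can be sketched in a few lines. This is a minimal illustration, not the system described in the talk: the five-word vocabulary reuses the weights shown on the slide, while the helper names `featurize`, `score`, and `classify`, and the rule that a positive score means English, are assumptions made for the example.

```python
import re

# Toy vocabulary and weights echoing slide 59; a real model may have ~1e9 features.
WEIGHTS = {"a": 0.1, "aardvark": 132.0, "dog": 150.0, "the": 200.0, "montañas": -153.0}

def featurize(text):
    """Binary bag of words: the set of vocabulary words present in the text."""
    tokens = re.findall(r"\w+", text.lower())
    return {t for t in tokens if t in WEIGHTS}

def score(text):
    """f(x) = w . x, summing only over the non-zero entries of x."""
    return sum(WEIGHTS[t] for t in featurize(text))

def classify(text):
    # Assumption: positive scores mean English, non-positive scores mean Spanish.
    return "English" if score(text) > 0 else "Spanish"
```

Because x is binary and sparse, the sum only touches the words that actually occur, which is what keeps scoring cheap even when P is very large.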
  • 60. Training Data • Input X: N examples (rows) × P features (columns) • Output Y: N labels
  • 61. Typical machine learning data at Google (mean / median) • N: 100 billion / 1 billion • P: 1 billion / 10 million • http://www.flickr.com/photos/mr_t_in_dc/5469563053
  • 62. Classifier Training • Training: Given {(x, y)} and f, minimize the following objective function: $\arg\min_{w} \sum_{i=1}^{N} L(y_i, f(x_i; w)) + R(w)$
  • 63. Use Newton’s method? $w^{t+1} \leftarrow w^{t} - H(w^{t})^{-1} \nabla J(w^{t})$ http://www.flickr.com/photos/visitfinland/5424369765/
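A rough back-of-the-envelope check suggests why slide 63 poses Newton's method as a question. The sketch below assumes the Hessian H is stored densely in 8-byte floats, which the talk does not state; the function name `dense_hessian_bytes` is hypothetical, and the feature counts are the median and mean P from slide 61.

```python
def dense_hessian_bytes(num_features, bytes_per_entry=8):
    """Bytes needed to store a dense P x P Hessian (assumes 8-byte entries)."""
    return num_features ** 2 * bytes_per_entry

median_p = 10_000_000       # slide 61: median P
mean_p = 1_000_000_000      # slide 61: mean P

print(dense_hessian_bytes(median_p) / 1e12, "TB")  # 800.0 TB
print(dense_hessian_bytes(mean_p) / 1e18, "EB")    # 8.0 EB
```

Under these assumptions, storing H alone is out of reach before its inverse is even computed, which motivates the first-order methods in the next section.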
  • 64. Outline • Machine Learning intro • Scaling machine learning algorithms up • Design choices of large scale ML systems
  • 70. Scaling Up • Why big data? • Parallelize machine learning algorithms • Embarrassingly parallel • Parallelize sub-routines • Distributed learning
  • 73. Subsampling • Big Data → Shard 1, Shard 2, Shard 3, ..., Shard M
  • 77. Subsampling • Reduce N: Shard 1 → Machine → Model
  • 78. Why not Small Data? [Banko and Brill, 2001]
  • 79. Scaling Up • Why big data? • Parallelize machine learning algorithms • Embarrassingly parallel • Parallelize sub-routines • Distributed learning
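  • 80. Parallelize Estimates • Naive Bayes Classifier: $\arg\min_{w} -\prod_{i=1}^{N} \prod_{p=1}^{P} P(x_p^i \mid y_i; w)\, P(y_i; w)$ • Maximum Likelihood Estimates: $w_{\text{the}|\text{EN}} = \frac{\sum_{i=1}^{N} 1_{\text{EN},\text{the}}(x^i)}{\sum_{i=1}^{N} 1_{\text{EN}}(x^i)}$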
  • 90. Word Counting • Map: X: “The quick brown fox ...”, Y: EN → (‘the|EN’, 1), (‘quick|EN’, 1), (‘brown|EN’, 1) • Reduce: [ (‘the|EN’, 1), (‘the|EN’, 1), (‘the|EN’, 1) ] → C(‘the’|EN) = SUM of values = 3 • $w_{\text{the}|\text{EN}} = \frac{C(\text{the}|\text{EN})}{C(\text{EN})}$
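The map and reduce steps of the word-counting slides can be sketched in plain Python. This is an in-process illustration under stated assumptions, not a MapReduce job: the function names are hypothetical, the extra `(label, 1)` pair is an assumed way to obtain C(EN), and each word is emitted once per example so that the counts match the indicator functions in the maximum likelihood estimate.

```python
import re
from collections import defaultdict

def map_example(text, label):
    """Map: emit (key, 1) pairs for one labeled example."""
    yield (label, 1)  # assumption: extra key so the reducer can tally C(label)
    for word in set(re.findall(r"\w+", text.lower())):
        yield (f"{word}|{label}", 1)  # once per example, like the indicator 1_{EN,the}

def reduce_counts(pairs):
    """Reduce: sum the values emitted for each key."""
    counts = defaultdict(int)
    for key, value in pairs:
        counts[key] += value
    return dict(counts)

def word_count(shards):
    """Map every example of every shard, then reduce all emitted pairs."""
    emitted = [pair for shard in shards for text, label in shard
               for pair in map_example(text, label)]
    return reduce_counts(emitted)

def weight(counts, word, label):
    """w_{word|label} = C(word|label) / C(label)."""
    return counts.get(f"{word}|{label}", 0) / counts[label]
```

Each shard is mapped independently of the others, so on a cluster the map calls could run on separate machines with only the small (key, count) pairs crossing the network.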
  • 96. Word Counting • Big Data → Shard 1, Shard 2, Shard 3, ..., Shard M • Map: Mapper 1, Mapper 2, Mapper 3, ..., Mapper M emit (‘the’ | EN, 1), (‘fox’ | EN, 1), ..., (‘montañas’ | ES, 1) • Reduce: Reducer tallies counts and updates w • Model
  • 103. Parallelize Optimization • Maximum Entropy Classifiers: $\arg\min_{w} -\prod_{i=1}^{N} \frac{\exp(\sum_{p=1}^{P} w_p x_p^i)^{y_i}}{1 + \exp(\sum_{p=1}^{P} w_p x_p^i)}$ • Good: J(w), the negative log-likelihood, is convex (the log-likelihood is concave) • Bad: no closed-form solution like NB • Ugly: Large N
  • 104. Gradient Descent http://www.cs.cmu.edu/~epxing/Class/10701/Lecture/lecture7.pdf
  • 109. Gradient Descent • w is initialized as zero • for t in 1 to T • Calculate gradients $\nabla J(w)$ • $w^{t+1} \leftarrow w^{t} - \eta \nabla J(w)$ • $\nabla J(w) = \sum_{i=1}^{N} P(w, x_i, y_i)$
  • 111. Distribute Gradient • w is initialized as zero • for t in 1 to T • Calculate gradients in parallel • $w^{t+1} \leftarrow w^{t} - \eta \nabla J(w)$ • Training CPU: O(TPN) to O(TPN / M)
  • 117. Distribute Gradient • Big Data → Shard 1, Shard 2, Shard 3, ..., Shard M • Map: Machine 1, Machine 2, Machine 3, ..., Machine M each emit (dummy key, partial gradient sum) • Reduce: Sum and Update w • Repeat MapReduce until convergence • Model
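The distributed gradient loop can be sketched as follows. This is a single-process illustration under stated assumptions: the loss is the logistic (maximum entropy) negative log-likelihood with labels in {0, 1}, the step size `eta` and step count are arbitrary, and the function names are hypothetical. On a cluster, each `partial_gradient` call would run on the machine holding that shard.

```python
import math

def partial_gradient(w, shard):
    """Map: sum of per-example logistic-loss gradients over one shard."""
    grad = [0.0] * len(w)
    for x, y in shard:  # y is 0 or 1
        z = sum(wp * xp for wp, xp in zip(w, x))
        p = 1.0 / (1.0 + math.exp(-z))
        for j, xj in enumerate(x):
            grad[j] += (p - y) * xj
    return grad

def distributed_gradient_descent(shards, num_features, eta=0.1, steps=100):
    w = [0.0] * num_features                      # w is initialized as zero
    for _ in range(steps):                        # for t in 1 to T
        partials = [partial_gradient(w, s) for s in shards]   # map
        total = [sum(col) for col in zip(*partials)]          # reduce: sum
        w = [wj - eta * gj for wj, gj in zip(w, total)]       # update w
    return w
```

Because the gradient is a sum over examples, summing the per-shard partial sums gives the same update as a single pass over all the data, which is why the CPU cost per machine drops from O(TPN) to O(TPN / M).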
  • 118. Scaling Up • Why big data? • Parallelize machine learning algorithms • Embarrassingly parallel • Parallelize sub-routines • Distributed learning
  • 119. Parallelize Subroutines • Support Vector Machines: $\arg\min_{w,b,\zeta} \frac{1}{2}\|w\|_2^2 + C \sum_{i=1}^{n} \zeta_i$ s.t. $1 - y_i (w \cdot \phi(x_i) + b) \le \zeta_i,\ \zeta_i \ge 0$ • Solve the dual problem: $\arg\min_{\alpha} \frac{1}{2} \alpha^T Q \alpha - \alpha^T \mathbf{1}$ s.t. $0 \le \alpha \le C,\ y^T \alpha = 0$
  • 120. The computational cost for the Primal-Dual Interior Point Method is O(n^3) in time and O(n^2) in memory http://www.flickr.com/photos/sea-turtle/198445204/
  • 124. Parallel SVM [Chang et al, 2007] • Parallel, row-wise incomplete Cholesky Factorization for Q • Parallel interior point method • Time O(n^3) becomes O(n^2 / M) • Memory O(n^2) becomes O(n√N / M) • Parallel Support Vector Machines (psvm) http://code.google.com/p/psvm/ • Implemented in MPI
  • 125. Parallel ICF • Distribute Q by row into M machines: Machine 1 (row 1, row 2), Machine 2 (row 3, row 4), Machine 3 (row 5, row 6), ... • For each dimension n < √N • Send local pivots to master • Master selects the largest local pivot and broadcasts it as the global pivot to workers
  • 127. Scaling Up • Why big data? • Parallelize machine learning algorithms • Embarrassingly parallel • Parallelize sub-routines • Distributed learning
  • 132. Majority Vote • Big Data → Shard 1, Shard 2, Shard 3, ..., Shard M • Map: Machine 1, Machine 2, Machine 3, ..., Machine M • Model 1, Model 2, Model 3, Model 4
  • 133. Majority Vote • Train individual classifiers independently • Predict by taking majority votes • Training CPU: O(TPN) to O(TPN / M)
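Prediction by majority vote can be sketched as below. The sketch assumes each independently trained classifier is a callable that returns a label; the function name `majority_vote` and the tie-breaking behavior (the label seen first among tied labels wins) are assumptions of this example, not part of the talk.

```python
from collections import Counter

def majority_vote(models, x):
    """Return the label predicted by the most models for input x."""
    votes = Counter(model(x) for model in models)
    return votes.most_common(1)[0][0]
```

For example, with three per-shard models of which two predict "EN" and one predicts "ES", `majority_vote` returns "EN". Training needs no communication between machines; only the M predictions are combined at test time.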
  • 139. Parameter Mixture [Mann et al, 2009] • Big Data → Shard 1, Shard 2, Shard 3, ..., Shard M • Map: Machine 1, Machine 2, Machine 3, ..., Machine M emit (dummy key, w1), (dummy key, w2), ... • Reduce: Average w • Model
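A parameter mixture can be sketched as one round of local training followed by an average. The local learner here is an assumption: stochastic gradient descent for logistic regression with labels in {0, 1}, a fixed step size, and a fixed number of epochs. The uniform average and the names `train_local`, `average`, and `parameter_mixture` are likewise illustrative choices.

```python
import math

def train_local(shard, num_features, eta=0.1, epochs=50):
    """Map: train a logistic-regression weight vector on one shard only."""
    w = [0.0] * num_features
    for _ in range(epochs):
        for x, y in shard:  # y is 0 or 1
            z = sum(wp * xp for wp, xp in zip(w, x))
            p = 1.0 / (1.0 + math.exp(-z))
            w = [wp - eta * (p - y) * xp for wp, xp in zip(w, x)]
    return w

def average(weight_vectors):
    """Reduce: uniform average of the per-shard weight vectors."""
    m = len(weight_vectors)
    return [sum(col) / m for col in zip(*weight_vectors)]

def parameter_mixture(shards, num_features):
    local = [train_local(shard, num_features) for shard in shards]  # map
    return average(local)                       # reduce: one communication round
```

Each machine sends its weight vector exactly once, in contrast to distributed gradient descent, where gradients cross the network on every one of the T iterations.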
  • 140. Much less network usage than distributed gradient descent: O(MN) vs. O(MNT) http://www.flickr.com/photos/annamatic3000/127945652/
  • 149. Iterative Param Mixture [McDonald et al., 2010] • Big Data → Shard 1, Shard 2, Shard 3, ..., Shard M • Map: Machine 1, Machine 2, Machine 3, ..., Machine M emit (dummy key, w1), (dummy key, w2), ... • Reduce after each epoch: Average w • Model
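The iterative variant averages after every epoch and restarts each machine from the averaged weights. The sketch below uses a binary perceptron with labels in {-1, +1} as the local learner, which is a simplifying assumption for illustration; the function names and the fixed number of rounds are also hypothetical.

```python
def perceptron_epoch(w, shard):
    """Map: one perceptron pass over a shard, starting from the shared w."""
    w = list(w)
    for x, y in shard:  # y is -1 or +1
        if y * sum(wp * xp for wp, xp in zip(w, x)) <= 0:   # mistake
            w = [wp + y * xp for wp, xp in zip(w, x)]
    return w

def iterative_parameter_mixture(shards, num_features, rounds=10):
    w = [0.0] * num_features
    for _ in range(rounds):
        local = [perceptron_epoch(w, shard) for shard in shards]   # map
        w = [sum(col) / len(local) for col in zip(*local)]          # reduce: average
    return w
```

Compared with a single final average, the machines resynchronize once per epoch, trading one extra round of communication per epoch for local models that do not drift far apart.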
  • 151. Outline • Machine Learning intro • Scaling machine learning algorithms up • Design choices of large scale ML systems
  • 152. Scalable http://www.flickr.com/photos/mr_t_in_dc/5469563053
  • 156. Binary Classification http://www.flickr.com/photos/brenderous/4532934181/
  • 157. Automatic Feature Discovery http://www.flickr.com/photos/mararie/2340572508/
  • 158. Fast Response http://www.flickr.com/photos/prunejuice/3687192643/
  • 159. Memory is new hard disk. http://www.flickr.com/photos/jepoirrier/840415676/
  • 160. Algorithm + Infrastructure http://www.flickr.com/photos/neubie/854242030/
  • 161. Design for Multicores http://www.flickr.com/photos/geektechnique/2344029370/
  • 169. Parallelize ML Algorithms • Embarrassingly parallel • Parallelize sub-routines • Distributed learning
  • 174. Parallel • Accuracy • Fast Response
  • 177. Google APIs • Prediction API • machine learning service on the cloud • http://code.google.com/apis/predict • BigQuery • interactive analysis of massive data on the cloud • http://code.google.com/apis/bigquery

Editor's Notes