Map-Reduce for Machine Learning on Multicore


         Cheng-Tao Chu ∗                   Sang Kyun Kim ∗                      Yi-An Lin ∗
    chengtao@stanford.edu              skkim38@stanford.edu                ianl@stanford.edu

         YuanYuan Yu ∗                     Gary Bradski ∗†                  Andrew Y. Ng ∗
    yuanyuan@stanford.edu               garybradski@gmail               ang@cs.stanford.edu

                                        Kunle Olukotun ∗
                                   kunle@cs.stanford.edu
                       ∗ CS Department, Stanford University, 353 Serra Mall,
                           Stanford, CA 94305-9025.
                       † Rexee Inc.



                                              Abstract

         We are at the beginning of the multicore era. Computers will have increasingly
         many cores (processors), but there is still no good programming framework for
         these architectures, and thus no simple and unified way for machine learning to
         take advantage of the potential speed up. In this paper, we develop a broadly ap-
         plicable parallel programming method, one that is easily applied to many different
         learning algorithms. Our work is in distinct contrast to the tradition in machine
         learning of designing (often ingenious) ways to speed up a single algorithm at a
         time. Specifically, we show that algorithms that fit the Statistical Query model [15]
         can be written in a certain “summation form,” which allows them to be easily par-
         allelized on multicore computers. We adapt Google’s map-reduce [7] paradigm to
         demonstrate this parallel speed up technique on a variety of learning algorithms
         including locally weighted linear regression (LWLR), k-means, logistic regres-
         sion (LR), naive Bayes (NB), SVM, ICA, PCA, gaussian discriminant analysis
         (GDA), EM, and backpropagation (NN). Our experimental results show basically
         linear speedup with an increasing number of processors.


1    Introduction

Frequency scaling on silicon—the ability to drive chips at ever higher clock rates—is beginning to
hit a power limit as device geometries shrink due to leakage, and simply because CMOS consumes
power every time it changes state [9, 10]. Yet Moore’s law [20], the density of circuits doubling
every generation, is projected to last between 10 and 20 more years for silicon based circuits [10].
By keeping clock frequency fixed, but doubling the number of processing cores on a chip, one can
maintain lower power while doubling the speed of many applications. This has forced an industry-
wide shift to multicore.
We thus approach an era of increasing numbers of cores per chip, but there is as yet no good frame-
work for machine learning to take advantage of massive numbers of cores. There are many parallel
programming languages such as Orca, Occam, ABCL, SNOW, MPI and PARLOG, but none of these
approaches make it obvious how to parallelize a particular algorithm. There is a vast literature on
distributed learning and data mining [18], but very little of this literature focuses on our goal: A gen-
eral means of programming machine learning on multicore. Much of this literature contains a long
and distinguished tradition of developing (often ingenious) ways to speed up or parallelize individ-
ual learning algorithms, for instance cascaded SVMs [11]. But these yield no general parallelization
technique for machine learning and, more pragmatically, specialized implementations of popular
algorithms rarely lead to widespread use. Some examples of more general papers are: Caragea et
al. [5] give some general data distribution conditions for parallelizing machine learning, but restrict
the focus to decision trees; Jin and Agrawal [14] give a general machine learning programming ap-
proach, but only for shared-memory machines. This does not fit the architecture of cellular or grid
type multiprocessors, where cores have local cache, even if it can be dynamically reallocated.
In this paper, we focus on developing a general and exact technique for parallel programming
of a large class of machine learning algorithms for multicore processors. The central idea of this
approach is to allow a future programmer or user to speed up machine learning applications by
“throwing more cores” at the problem rather than searching for specialized optimizations. This paper’s
contributions are:
(i) We show that any algorithm fitting the Statistical Query Model may be written in a certain “sum-
mation form.” This form does not change the underlying algorithm and so is not an approximation,
but is instead an exact implementation. (ii) The summation form does not depend on, but can be
easily expressed in a map-reduce [7] framework which is easy to program in. (iii) This technique
achieves basically linear speed-up with the number of cores.
We attempt to develop a pragmatic and general framework. What we do not claim:
(i) We make no claim that our technique will necessarily run faster than a specialized, one-off so-
lution. Here we achieve linear speedup which in fact often does beat specific solutions such as
cascaded SVM [11] (see section 5; however, they do handle kernels, which we have not addressed).
(ii) We make no claim that following our framework (for a specific algorithm) always leads to a
novel parallelization undiscovered by others. What is novel is the larger, broadly applicable frame-
work, together with a pragmatic programming paradigm, map-reduce. (iii) We focus here on exact
implementation of machine learning algorithms, not on parallel approximations to algorithms (a
worthy topic, but one which is beyond this paper’s scope).
In section 2 we discuss the Statistical Query Model, our summation form framework and an example
of its application. In section 3 we describe how our framework may be implemented in a Google-
like map-reduce paradigm. In section 4 we choose 10 frequently used machine learning algorithms
as examples of what can be coded in this framework. This is followed by experimental runs on 10
moderately large data sets in section 5, where we show a good match to our theoretical computational
complexity results. Basically, we often achieve linear speedup in the number of cores. Section 6
concludes the paper.


2 Statistical Query and Summation Form

For multicore systems, Sutter and Larus [25] point out that multicore mostly benefits concurrent
applications, meaning ones where there is little communication between cores. The best match is
thus if the data is subdivided and stays local to the cores. To achieve this, we look to Kearns’
Statistical Query Model [15].
The Statistical Query Model is sometimes posed as a restriction on the Valiant PAC model [26],
in which we permit the learning algorithm to access the learning problem only through a statistical
query oracle. Given a function f (x, y) over instances, the statistical query oracle returns an estimate
of the expectation of f (x, y) (averaged over the training/test distribution). Algorithms that calculate
sufficient statistics or gradients fit this model, and since these calculations may be batched, they
are expressible as a sum over data points. This class of algorithms is large; we show 10 popular
algorithms in section 4 below. An example that does not fit is learning an XOR over a subset
of bits [16, 15]. However, when an algorithm does sum over the data, we can easily distribute the
calculations over multiple cores: We just divide the data set into as many pieces as there are cores,
give each core its share of the data to sum the equations over, and aggregate the results at the end.
We call this form of the algorithm the “summation form.”
As an example, consider ordinary least squares (linear regression), which fits a model of the form
$y = \theta^T x$ by solving $\theta^* = \arg\min_\theta \sum_{i=1}^m (\theta^T x_i - y_i)^2$. The parameter $\theta$ is typically solved for by
[Figure 1 diagram: the data input (step 0) flows to the engine; each run (step 1) is coordinated by a master (step 1.1) that maps the split data to the mappers (step 1.1.1), collects the intermediate data (step 1.1.2), invokes the reducer (step 1.1.3), and returns the result (step 1.1.4); the mappers and the reducer call back to the algorithm through query_info (steps 1.1.1.1 and 1.1.3.1).]

                                         Figure 1: Multicore map-reduce framework


defining the design matrix $X \in \mathbb{R}^{m \times n}$ to be a matrix whose rows contain the training instances
$x_1, \ldots, x_m$, letting $y = [y_1, \ldots, y_m]^T$ be the vector of target labels, and solving the normal equations
to obtain $\theta^* = (X^T X)^{-1} X^T y$.
To put this computation into summation form, we reformulate it into a two-phase algorithm where
we first compute sufficient statistics by summing over the data, and then aggregate those statistics
and solve to get $\theta^* = A^{-1} b$. Concretely, we compute $A = X^T X$ and $b = X^T y$ as follows:
$A = \sum_{i=1}^m (x_i x_i^T)$ and $b = \sum_{i=1}^m (x_i y_i)$. The computation of $A$ and $b$ can now be divided into
equal-size pieces and distributed among the cores. We next discuss an architecture that lends itself
to the summation form: Map-reduce.
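
For concreteness, here is a minimal sketch of the two-phase summation form for ordinary least squares. It is not from the paper; it uses plain NumPy, the per-chunk work is shown serially where the framework would hand each chunk to a separate core, and the function names are illustrative.

```python
import numpy as np

def partial_stats(X_chunk, y_chunk):
    """Mapper: sufficient statistics for one chunk of rows."""
    A_part = X_chunk.T @ X_chunk      # sum_i x_i x_i^T over this chunk
    b_part = X_chunk.T @ y_chunk      # sum_i x_i y_i over this chunk
    return A_part, b_part

def ols_summation_form(X, y, n_cores=4):
    """Split rows into one chunk per core, compute partial sums, then reduce."""
    chunks = zip(np.array_split(X, n_cores), np.array_split(y, n_cores))
    parts = [partial_stats(Xc, yc) for Xc, yc in chunks]   # in the framework, one mapper per chunk
    A = sum(p[0] for p in parts)                           # reducer: aggregate the partial sums
    b = sum(p[1] for p in parts)
    return np.linalg.solve(A, b)                           # theta* = A^{-1} b

# Example: recovers theta close to [1, 2, 3] from synthetic data.
rng = np.random.default_rng(0)
X = rng.normal(size=(10000, 3))
y = X @ np.array([1.0, 2.0, 3.0]) + 0.01 * rng.normal(size=10000)
print(ols_summation_form(X, y))
```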

3   Architecture
Many programming frameworks are possible for the summation form, but inspired by Google’s
success in adapting a functional programming construct, map-reduce [7], for widespread parallel
programming use inside their company, we adapted this same construct for multicore use. Google’s
map-reduce is specialized for use over clusters that have unreliable communication and where indi-
vidual computers may go down. These are issues that multicores do not have; thus, we were able to
develop a much lighter-weight architecture for multicores, shown in Figure 1.
Figure 1 shows a high level view of our architecture and how it processes the data. In step 0, the
map-reduce engine is responsible for splitting the data by training examples (rows). The engine then
caches the split data for the subsequent map-reduce invocations. Every algorithm has its own engine
instance, and every map-reduce task will be delegated to its engine (step 1). Similar to the original
map-reduce architecture, the engine will run a master (step 1.1) which coordinates the mappers
and the reducers. The master is responsible for assigning the split data to different mappers and
then collecting the processed intermediate data from the mappers (steps 1.1.1 and 1.1.2). After the
intermediate data is collected, the master will in turn invoke the reducer to process it (step 1.1.3) and
return the final results (step 1.1.4). Note that some mapper and reducer operations require additional
scalar information from the algorithms. To support these operations, the mapper/reducer
can obtain this information through the query_info interface, which can be customized for each
algorithm (steps 1.1.1.1 and 1.1.3.2).
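
The control flow of Figure 1 can be summarized with a small skeleton. This is an illustrative sketch, not the authors' implementation; the class and method names (Engine, MeanAlgorithm, mapper, reducer, query_info) are hypothetical stand-ins for the interfaces described above.

```python
import numpy as np
from concurrent.futures import ProcessPoolExecutor

def _run_mapper(args):
    """Module-level helper so an (algorithm, split, info) task can be sent to a worker process."""
    algorithm, split, info = args
    return algorithm.mapper(split, info)

class MeanAlgorithm:
    """Toy 'algorithm' plugin: computes the mean of the rows in summation form."""
    def query_info(self):
        return {}                                  # no extra scalar state needed here

    def mapper(self, split, info):
        return split.sum(axis=0), len(split)       # partial sum and count for this split

    def reducer(self, intermediate, info):
        total = sum(s for s, _ in intermediate)
        count = sum(c for _, c in intermediate)
        return total / count

class Engine:
    """Caches the row-wise split and delegates each map-reduce run (steps 0 and 1 in Figure 1)."""
    def __init__(self, data, algorithm, n_cores):
        self.splits = np.array_split(data, n_cores)    # step 0: split by training examples (rows)
        self.algorithm = algorithm
        self.n_cores = n_cores

    def run(self):
        info = self.algorithm.query_info()             # steps 1.1.1.1 / 1.1.3.2: algorithm-specific state
        tasks = [(self.algorithm, split, info) for split in self.splits]
        with ProcessPoolExecutor(self.n_cores) as pool:     # step 1.1.1: map over the splits
            intermediate = list(pool.map(_run_mapper, tasks))
        return self.algorithm.reducer(intermediate, info)   # steps 1.1.3 / 1.1.4: reduce and return

if __name__ == "__main__":
    data = np.random.default_rng(0).normal(size=(100000, 3))
    print(Engine(data, MeanAlgorithm(), n_cores=4).run())
```

A concrete algorithm plugs in only its mapper, reducer, and query_info; the engine handles splitting, scheduling, and collection.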

4   Adopted Algorithms
In this section, we will briefly discuss the algorithms we have implemented based on our framework.
These algorithms were chosen partly for their popularity in NIPS papers, and our goal is
to illustrate how each algorithm can be expressed in summation form. We defer the discussion
of the theoretical improvement that can be achieved by this parallelization to Section 4.1. In the
following, $x$ or $x_i$ denotes a training vector and $y$ or $y_i$ denotes a training label.
• Locally Weighted Linear Regression (LWLR) LWLR [28, 3] is solved by finding the solution of the normal equations $A\theta = b$, where $A = \sum_{i=1}^m w_i (x_i x_i^T)$ and $b = \sum_{i=1}^m w_i (x_i y_i)$. For the summation form, we divide the computation among different mappers. In this case, one set of mappers is used to compute $\sum_{\text{subgroup}} w_i (x_i x_i^T)$ and another set to compute $\sum_{\text{subgroup}} w_i (x_i y_i)$. Two reducers respectively sum up the partial values for $A$ and $b$, and the algorithm finally computes the solution $\theta = A^{-1} b$. Note that if $w_i = 1$, the algorithm reduces to the case of ordinary least squares (linear regression).
• Naive Bayes (NB) In NB [17, 21], we have to estimate $P(x_j = k \mid y = 1)$, $P(x_j = k \mid y = 0)$, and $P(y)$ from the training data. In order to do so, we need to sum over $x_j = k$ for each $y$ label in the training data to calculate $P(x \mid y)$. We specify different sets of mappers to calculate the following: $\sum_{\text{subgroup}} 1\{x_j = k \mid y = 1\}$, $\sum_{\text{subgroup}} 1\{x_j = k \mid y = 0\}$, $\sum_{\text{subgroup}} 1\{y = 1\}$ and $\sum_{\text{subgroup}} 1\{y = 0\}$. The reducer then sums up the intermediate results to get the final result for the parameters (a counting sketch follows this list).
• Gaussian Discriminative Analysis (GDA) The classic GDA algorithm [13] needs to learn the following four statistics: $P(y)$, $\mu_0$, $\mu_1$ and $\Sigma$. For all the summation forms involved in these computations, we may leverage the map-reduce framework to parallelize the process. Each mapper will handle the summation (i.e., $\sum 1\{y_i = 1\}$, $\sum 1\{y_i = 0\}$, $\sum 1\{y_i = 0\} x_i$, etc.) for a subgroup of the training samples. Finally, the reducer will aggregate the intermediate sums and calculate the final result for the parameters.
• k-means In k-means [12], the operation of computing the Euclidean distance between the sample vectors and the centroids can be parallelized by splitting the data into individual subgroups and clustering the samples in each subgroup separately (by the mapper). In recalculating the new centroid vectors, we divide the sample vectors into subgroups, compute the sum of vectors in each subgroup in parallel, and finally the reducer adds up the partial sums and computes the new centroids (see the sketch after this list).
• Logistic Regression (LR) For logistic regression [23], we choose the hypothesis $h_\theta(x) = g(\theta^T x) = 1/(1 + \exp(-\theta^T x))$. Learning is done by fitting $\theta$ to the training data, where the likelihood function can be optimized by using Newton-Raphson to update $\theta := \theta - H^{-1} \nabla_\theta \ell(\theta)$. Here $\nabla_\theta \ell(\theta)$ is the gradient, which can be computed in parallel by mappers summing up $\sum_{\text{subgroup}} (y^{(i)} - h_\theta(x^{(i)}))\, x_j^{(i)}$ at each NR step. The computation of the Hessian matrix can also be written in summation form, $H(j,k) := H(j,k) + h_\theta(x^{(i)})(h_\theta(x^{(i)}) - 1)\, x_j^{(i)} x_k^{(i)}$, for the mappers. The reducer will then sum up the values for the gradient and Hessian to perform the update for $\theta$ (a sketch of one Newton-Raphson step follows this list).
• Neural Network (NN) We focus on backpropagation [6]. By defining a network structure (we use a three-layer network with two output neurons classifying the data into two categories), each mapper propagates its set of data through the network. For each training example, the error is back-propagated to calculate the partial gradient for each of the weights in the network. The reducer then sums the partial gradients from each mapper and does a batch gradient descent step to update the weights of the network.
• Principal Components Analysis (PCA) PCA [29] computes the principal eigenvectors of the covariance matrix $\Sigma = \frac{1}{m} \sum_{i=1}^m x_i x_i^T - \mu\mu^T$ over the data. In the definition for $\Sigma$, the term $\sum_{i=1}^m x_i x_i^T$ is already expressed in summation form. Further, we can also express the mean vector $\mu$ as a sum, $\mu = \frac{1}{m} \sum_{i=1}^m x_i$. The sums can be mapped to separate cores, and then the reducer will sum up the partial results to produce the final empirical covariance matrix.
• Independent Component Analysis (ICA) ICA [1] tries to identify the independent source vectors based on the assumption that the observed data are linearly transformed from the source data. In ICA, the main goal is to compute the unmixing matrix $W$. We implement batch gradient ascent to optimize $W$'s likelihood. In this scheme, we can independently calculate the expression $\bigl[\, 1 - 2g(w_1^T x^{(i)}),\ \ldots\ \bigr]^T x^{(i)T}$ in the mappers and sum them up in the reducer.
• Expectation Maximization (EM) For EM [8] we use Mixture of Gaussians as the underlying model, as per [19]. For parallelization: In the E-step, every mapper processes its subset of the training data and computes the corresponding $w_j^{(i)}$ (expected pseudo count). In the M-step, three sets of parameters need to be updated: $p(y)$, $\mu$, and $\Sigma$. For $p(y)$, every mapper will compute $\sum_{\text{subgroup}} w_j^{(i)}$, and the reducer will sum up the partial results and divide by $m$. For $\mu$, each mapper will compute $\sum_{\text{subgroup}} w_j^{(i)} x^{(i)}$ and $\sum_{\text{subgroup}} w_j^{(i)}$, and the reducer will sum up the partial results and divide them. For $\Sigma$, every mapper will compute $\sum_{\text{subgroup}} w_j^{(i)} (x^{(i)} - \mu_j)(x^{(i)} - \mu_j)^T$ and $\sum_{\text{subgroup}} w_j^{(i)}$, and the reducer will again sum up the partial results and divide them.
• Support Vector Machine (SVM) Linear SVM's [27, 22] primary goal is to optimize the following primal problem: $\min_{w,b} \|w\|^2 + C \sum_{i:\xi_i > 0} \xi_i^p$ s.t. $y^{(i)}(w^T x^{(i)} + b) \ge 1 - \xi_i$, where $p$ is either 1 (hinge loss) or 2 (quadratic loss). [2] has shown that the primal problem for quadratic loss can be solved using the following gradient and Hessian, where $sv$ denotes the set of support vectors: $\nabla = 2w + 2C \sum_{i \in sv} (w \cdot x_i - y_i) x_i$ and $H = I + C \sum_{i \in sv} x_i x_i^T$. We perform batch gradient descent to optimize the objective function. The mappers will calculate the partial gradient $\sum_{\text{subgroup}(i \in sv)} (w \cdot x_i - y_i) x_i$, and the reducer will sum up the partial results to update the $w$ vector.
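
The counting step for Naive Bayes referenced above can be sketched as follows. This is not the authors' code; it is a minimal NumPy sketch that assumes binary features (the paper handles general $x_j = k$), and the Laplace smoothing parameter alpha is an added assumption.

```python
import numpy as np

def nb_mapper(X_chunk, y_chunk):
    """Mapper: count feature occurrences per class and class frequencies within one chunk."""
    counts = {}
    for label in (0, 1):
        rows = X_chunk[y_chunk == label]
        counts[("feat", label)] = rows.sum(axis=0)     # sum_i 1{x_ij = 1 and y_i = label}
        counts[("label", label)] = len(rows)           # sum_i 1{y_i = label}
    return counts

def nb_reducer(partial_counts, alpha=1.0):
    """Reducer: sum partial counts and turn them into (Laplace-smoothed) probability estimates."""
    total = {}
    for c in partial_counts:
        for k, v in c.items():
            total[k] = total.get(k, 0) + v
    m = total[("label", 0)] + total[("label", 1)]
    p_y1 = total[("label", 1)] / m
    p_x_given_y = {
        label: (total[("feat", label)] + alpha) / (total[("label", label)] + 2 * alpha)
        for label in (0, 1)
    }
    return p_y1, p_x_given_y
```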
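
Likewise, a minimal sketch of the k-means mapper/reducer referenced above; the handling of empty clusters (keeping the old centroid) is an added assumption, and the names are illustrative.

```python
import numpy as np

def kmeans_mapper(X_chunk, centroids):
    """Mapper: assign each sample in the chunk to its nearest centroid; return partial sums and counts."""
    d = ((X_chunk[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    assign = d.argmin(axis=1)
    k, n = centroids.shape
    sums, counts = np.zeros((k, n)), np.zeros(k)
    for j in range(k):
        members = X_chunk[assign == j]
        sums[j] = members.sum(axis=0)
        counts[j] = len(members)
    return sums, counts

def kmeans_reducer(parts, old_centroids):
    """Reducer: add up the partial sums and recompute the centroids."""
    sums = sum(s for s, _ in parts)
    counts = sum(c for _, c in parts)
    new = old_centroids.copy()
    nonempty = counts > 0
    new[nonempty] = sums[nonempty] / counts[nonempty][:, None]   # keep old centroid if cluster is empty
    return new
```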
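
And a sketch of one Newton-Raphson step for logistic regression as referenced above. The mappers are shown as a serial loop over chunks; in the framework each chunk would be handled by a separate core.

```python
import numpy as np

def lr_mapper(X_chunk, y_chunk, theta):
    """Mapper: partial gradient and Hessian of the log-likelihood for one chunk."""
    h = 1.0 / (1.0 + np.exp(-X_chunk @ theta))            # h_theta(x) for every row
    grad = X_chunk.T @ (y_chunk - h)                      # sum_i (y_i - h_i) x_i
    H = -(X_chunk * (h * (1 - h))[:, None]).T @ X_chunk   # sum_i h_i (h_i - 1) x_i x_i^T
    return grad, H

def lr_newton_step(chunks, theta):
    """Reducer: sum the partial gradients/Hessians and apply one Newton-Raphson update."""
    parts = [lr_mapper(X, y, theta) for X, y in chunks]   # mappers (run in parallel in practice)
    grad = sum(g for g, _ in parts)
    H = sum(h for _, h in parts)
    return theta - np.linalg.solve(H, grad)               # theta := theta - H^{-1} grad
```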

Some implementations of machine learning algorithms, such as ICA, are commonly done with
stochastic gradient ascent, which poses a challenge to parallelization. The problem is that in ev-
ery step of gradient ascent, the algorithm updates a common set of parameters (e.g. the unmixing
W matrix in ICA). When one gradient ascent step (involving one training sample) is updating W , it
has to lock down this matrix, read it, compute the gradient, update W , and finally release the lock.
This “lock-release” block creates a bottleneck for parallelization; thus, instead of stochastic gradient
ascent, our algorithms above were implemented using batch gradient ascent.
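
To make the contrast concrete, here is a minimal sketch (not the authors' code) of a batch gradient-ascent step for the ICA unmixing matrix: each chunk contributes a partial gradient with no shared state to lock, and $W$ is updated once per pass. The $m (W^T)^{-1}$ term of the standard maximum-likelihood gradient is included for completeness, and the learning rate is an arbitrary placeholder.

```python
import numpy as np

def ica_partial_gradient(W, X_chunk):
    """Partial batch gradient of the ICA log-likelihood for one chunk (no shared state)."""
    S = X_chunk @ W.T                        # w_k^T x^(i) for every sample and every row of W
    G = 1.0 - 2.0 / (1.0 + np.exp(-S))       # 1 - 2 g(w_k^T x^(i)), with logistic g
    return G.T @ X_chunk                     # sum_i [1 - 2 g(W x^(i))] x^(i)T

def ica_batch_step(W, chunks, learning_rate=1e-4):
    """Reducer: sum the partial gradients, then apply a single update to the shared W."""
    m = sum(len(c) for c in chunks)
    grad = sum(ica_partial_gradient(W, c) for c in chunks) + m * np.linalg.inv(W.T)
    return W + learning_rate * grad
```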

4.1 Algorithm Time Complexity Analysis

Table 1 shows the theoretical complexity analysis for the ten algorithms we implemented on top of
our framework. We assume that the dimension of the inputs is n (i.e., x ∈ Rn ), that we have m
training examples, and that there are P cores. The complexity of iterative algorithms is analyzed
for one iteration, and so their actual running time may be slower.1 A few algorithms require matrix
inversion or an eigen-decomposition of an n-by-n matrix; we did not parallelize these steps in our
experiments, because for us m >> n, and so their cost is small. However, there is extensive research
in numerical linear algebra on parallelizing these numerical operations [4], and in the complexity
analysis shown in the table, we have assumed that matrix inversion and eigen-decompositions can be
sped up by a factor of $P'$ on $P$ cores. (In practice, we expect $P' \approx P$.) In our own software imple-
mentation, we had $P' = 1$. Further, the reduce phase can minimize communication by combining
data as it is passed back; this accounts for the $\log(P)$ factor.
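
The $\log(P)$ terms in Table 1 assume that the reducer combines the $P$ partial results pairwise in a tree rather than funneling all of them through a single sequential sum; a schematic sketch (illustrative only, not the authors' implementation):

```python
def tree_reduce(parts, combine):
    """Combine P partial results in about log2(P) pairwise rounds instead of P - 1 sequential steps."""
    while len(parts) > 1:
        paired = []
        for i in range(0, len(parts) - 1, 2):
            paired.append(combine(parts[i], parts[i + 1]))   # one round of pairwise merges
        if len(parts) % 2 == 1:
            paired.append(parts[-1])                         # odd element carries over to the next round
        parts = paired
    return parts[0]

# Example: summing 8 partial n-by-n matrices in 3 rounds instead of 7 sequential additions.
# A = tree_reduce(partial_As, lambda a, b: a + b)
```
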
As an example of our running-time analysis, for single-core LWLR we have to compute $A = \sum_{i=1}^m w_i (x_i x_i^T)$, which gives us the $mn^2$ term. This matrix must then be inverted, for $n^3$; also, the reduce step incurs a covariance-matrix communication cost of $n^2$.
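
To make the accounting explicit, the LWLR row of Table 1 decomposes as follows (a restatement of the analysis above in display form, not an additional result):

```latex
% Worked decomposition of the LWLR entry in Table 1 (illustration only).
\underbrace{O(mn^2)}_{\text{form } A = \sum_i w_i x_i x_i^T}
  + \underbrace{O(n^3)}_{\text{invert } A}
\quad\longrightarrow\quad
\underbrace{O\!\left(\tfrac{mn^2}{P}\right)}_{\text{mappers: } m/P \text{ rows each}}
  + \underbrace{O\!\left(\tfrac{n^3}{P'}\right)}_{\text{inversion (}P' = 1\text{ in our runs)}}
  + \underbrace{O\!\left(n^2 \log P\right)}_{\text{tree-reducing } P \text{ partial } n \times n \text{ sums}}
```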

5       Experiments

To provide fair comparisons, each algorithm had two different versions: One running map-reduce,
and the other a serial implementation without the framework. We conducted an extensive series of
experiments to compare the speedup on data sets of various sizes (Table 2): eight commonly used
machine learning data sets from the UCI Machine Learning Repository and two others from a
[anonymous] research group (Helicopter Control and sensor data). Note that not all the experiments
make sense from an output point of view (e.g., regression on categorical data), but our purpose was
to test speedup, so we ran every algorithm over all the data.
The first environment we conducted experiments on was an Intel X86 PC with two Pentium-III 700
MHz CPUs and 1GB physical memory. The operating system was Linux RedHat 8.0 Kernel 2.4.20-
    1
    If, for example, the number of iterations required grows with m. However, this would affect single- and
multi-core implementations equally.
                                          single                multi
                      LWLR           O(mn^2 + n^3)       O(mn^2/P + n^3/P' + n^2 log(P))
                      LR             O(mn^2 + n^3)       O(mn^2/P + n^3/P' + n^2 log(P))
                      NB             O(mn + nc)          O(mn/P + nc log(P))
                      NN             O(mn + nc)          O(mn/P + nc log(P))
                      GDA            O(mn^2 + n^3)       O(mn^2/P + n^3/P' + n^2 log(P))
                      PCA            O(mn^2 + n^3)       O(mn^2/P + n^3/P' + n^2 log(P))
                      ICA            O(mn^2 + n^3)       O(mn^2/P + n^3/P' + n^2 log(P))
                      k-means        O(mnc)              O(mnc/P + mn log(P))
                      EM             O(mn^2 + n^3)       O(mn^2/P + n^3/P' + n^2 log(P))
                      SVM            O(m^2 n)            O(m^2 n/P + n log(P))

                                     Table 1: Time complexity analysis

                             Data Sets                           samples (m)      features (n)
                             Adult                                   30162             14
                             Helicopter Control                      44170             21
                             Corel Image Features                    68040             32
                             IPUMS Census                            88443             61
                             Synthetic Time Series                  100001             10
                             Census Income                          199523             40
                             ACIP Sensor                            229564              8
                             KDD Cup 99                             494021             41
                             Forest Cover Type                      581012             55
                             1990 US Census                        2458285             68

                                      Table 2: Data set sizes (number of samples m and features n)


8smp. In addition, we also ran extensive comparison experiments on a 16-way Sun Enterprise 6000
running Solaris 10; here, we compared results using 1, 2, 4, 8, and 16 cores.

5.1 Results and Discussion

Table 3 shows the speedup on dual processors over all the algorithms on all the data sets. As can be
seen from the table, most of the algorithms achieve more than a 1.9x performance improvement.
For some of the experiments, e.g. gda/covertype, ica/ipums, nn/colorhistogram, etc., we obtain a
greater than 2x speedup. This is because the original algorithms do not utilize all the CPU cycles
efficiently, but do better when we distribute the tasks to separate threads/processes.
Figure 2 shows the speedup of the algorithms over all the data sets for 2, 4, 8 and 16 processing cores.
In the figure, the thick lines show the average speedup, the error bars show the maximum and
minimum speedups, and the dashed lines show the variance. Speedup is basically linear with the number

                               lwlr    gda      nb    logistic     pca     ica     svm     nn     kmeans     em
            Adult             1.922   1.801   1.844    1.962      1.809   1.857   1.643   1.825    1.947   1.854
            Helicopter        1.93    2.155   1.924     1.92      1.791   1.856   1.744   1.847    1.857    1.86
            Corel Image       1.96    1.876   2.002    1.929       1.97   1.936   1.754   2.018    1.921   1.832
            IPUMS             1.963   2.23    1.965    1.938      1.965   2.025   1.799   1.974    1.957   1.984
            Synthetic         1.909   1.964   1.972     1.92      1.842   1.907    1.76   1.902    1.888   1.804
            Census Income     1.975   2.179   1.967    1.941      2.019   1.941    1.88   1.896    1.961    1.99
            Sensor            1.927   1.853    2.01    1.913      1.955   1.893   1.803   1.914    1.953   1.949
            KDD               1.969   2.216   1.848    1.927      2.012   1.998   1.946   1.899    1.973   1.979
            Cover Type        1.961   2.232   1.951    1.935      2.007   2.029   1.906   1.887    1.963   1.991
            Census            2.327   2.292   2.008    1.906      1.997   2.001   1.959   1.883    1.946   1.977
            avg.              1.985   2.080   1.950   1.930       1.937   1.944   1.819   1.905   1.937    1.922


Table 3: Speedups achieved on a dual-processor machine, without load time. Numbers reported are
single-core time / dual-core time. Super-linear speedup sometimes occurs due to a reduction in
processor idle time with multiple threads.
[Figure 2: Panels (a)-(i) show the speedup from 1 to 16 processors of all the algorithms over all the data sets. The bold line is the average, the error bars are the max and min speedups, and the dashed lines are the variance.]


of cores, but with a slope < 1.0. The reason for the sub-unity slope is increasing communication
overhead. For simplicity, and because the number of data points $m$ typically dominates the reduction-phase
communication costs (typically a factor of $n^2$, but $n \ll m$), we did not parallelize the reduce
phase, where we could have combined data on the way back. Even so, our simple SVM approach
gets about a 13.6x speedup on average over 16 cores, whereas the specialized SVM cascade [11]
averages only about 4x.
Finally, the above are runs on multiprocessor machines. We finish by reporting some confirming
results and higher performance on a proprietary multicore simulator over the sensor dataset.2 NN
speedup was [16 cores, 15.5x], [32 cores, 29x], [64 cores, 54x]. LR speedup was [16 cores, 15x],
[32 cores, 29.5x], [64 cores, 53x]. Multicore machines are generally faster than multiprocessor
machines because communication internal to the chip is much less costly.

6       Conclusion

As the Intel and AMD product roadmaps indicate [24], the number of processing cores on a chip
will be doubling several times over the next decade, even as individual cores cease to get sig-
nificantly faster. For machine learning to continue reaping the bounty of Moore’s law and apply to
ever larger datasets and problems, it is important to adopt a programming architecture which takes
advantage of multicore. In this paper, by taking advantage of the summation form in a map-reduce
framework, we could parallelize a wide range of machine learning algorithms and achieve speedups
ranging from 1.9x on a dual processor up to 54x on 64 cores. These results are in
line with the complexity analysis in Table 1. We note that the speedups achieved here involved no
special optimizations of the algorithms themselves. We have demonstrated a simple programming
framework where in the future we can just “throw cores” at the problem of speeding up machine
learning code.
    2
      This work was done in collaboration with Intel Corporation.

Acknowledgments
We would like to thank Skip Macy from Intel for sharing his valuable experience with the VTune perfor-
mance analyzer. Yirong Shen, Anya Petrovskaya, and Su-In Lee from Stanford University helped us
in preparing various data sets used in our experiments. This research was sponsored in part by the
Defense Advanced Research Projects Agency (DARPA) under the ACIP program and grant number
NBCH104009.

References
 [1] A. J. Bell and T. J. Sejnowski. An information-maximization approach to blind separation and blind deconvolution. Neural Computation, 1995.
 [2] O. Chapelle. Training a support vector machine in the primal. Journal of Machine Learning Research (submitted), 2006.
 [3] W. S. Cleveland and S. J. Devlin. Locally weighted regression: An approach to regression analysis by local fitting. J. Amer. Statist. Assoc., 83:596–610, 1988.
 [4] L. Csanky. Fast parallel matrix inversion algorithms. SIAM J. Comput., 5(4):618–623, 1976.
 [5] D. Caragea, A. Silvescu, and V. Honavar. A framework for learning from distributed data using sufficient statistics and its application to learning decision trees. International Journal of Hybrid Intelligent Systems, 2003.
 [6] D. E. Rumelhart, G. E. Hinton, and R. J. Williams. Learning representations by back-propagating errors. Nature, 323:533–536, 1986.
 [7] J. Dean and S. Ghemawat. MapReduce: Simplified data processing on large clusters. In Operating Systems Design and Implementation, pages 137–149, 2004.
 [8] A. P. Dempster, N. M. Laird, and D. B. Rubin. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, Series B, 39(1):1–38, 1977.
 [9] D. J. Frank. Power-constrained CMOS scaling limits. IBM Journal of Research and Development, 46, 2002.
[10] P. Gelsinger. Microprocessors for the new millennium: Challenges, opportunities and new frontiers. In ISSCC Tech. Digest, pages 22–25, 2001.
[11] H. P. Graf, E. Cosatto, L. Bottou, I. Durdanovic, and V. Vapnik. Parallel support vector machines: The cascade SVM. In NIPS, 2004.
[12] J. Hartigan. Clustering Algorithms. Wiley, 1975.
[13] T. Hastie and R. Tibshirani. Discriminant analysis by Gaussian mixtures. Journal of the Royal Statistical Society B, pages 155–176, 1996.
[14] R. Jin and G. Agrawal. Shared memory parallelization of data mining algorithms: Techniques, programming interface, and performance. In Second SIAM International Conference on Data Mining, 2002.
[15] M. Kearns. Efficient noise-tolerant learning from statistical queries. pages 392–401, 1999.
[16] M. Kearns and U. V. Vazirani. An Introduction to Computational Learning Theory. MIT Press, 1994.
[17] D. Lewis. Naive (Bayes) at forty: The independence assumption in information retrieval. In ECML-98: Tenth European Conference on Machine Learning, 1998.
[18] K. Liu and H. Kargupta. Distributed data mining bibliography. http://www.cs.umbc.edu/~hillol/DDMBIB/, 2006.
[19] T. K. Moon. The expectation-maximization algorithm. IEEE Trans. Signal Process., pages 47–59, 1996.
[20] G. Moore. Progress in digital integrated electronics. In IEDM Tech. Digest, pages 11–13, 1975.
[21] P. Langley, W. Iba, and K. Thompson. An analysis of Bayesian classifiers. In AAAI, 1992.
[22] J. C. Platt. Fast training of support vector machines using sequential minimal optimization. pages 185–208, 1999.
[23] D. Pregibon. Logistic regression diagnostics. The Annals of Statistics, 9:705–724, 1981.
[24] T. Studt. There's a multicore in your future. http://tinyurl.com/ohd2m, 2006.
[25] H. Sutter and J. Larus. Software and the concurrency revolution. Queue, 3(7):54–62, 2005.
[26] L. G. Valiant. A theory of the learnable. Communications of the ACM, 27(11):1134–1142, 1984.
[27] V. Vapnik. Estimation of Dependencies Based on Empirical Data. Springer Verlag, 1982.
[28] R. E. Welsch and E. Kuh. Linear regression diagnostics. Working Paper 173, Nat. Bur. Econ. Res. Inc., 1977.
[29] S. Wold, K. Esbensen, and P. Geladi. Principal component analysis. Chemometrics and Intelligent Laboratory Systems, 1987.

More Related Content

PDF
IRJET-An Effective Strategy for Defense & Medical Pictures Security by Singul...
PDF
Kernels in convolution
PDF
Report_NLNN
PDF
APPLIED MACHINE LEARNING
PDF
Svm map reduce_slides
PDF
STATE SPACE GENERATION FRAMEWORK BASED ON BINARY DECISION DIAGRAM FOR DISTRIB...
PDF
INVERSIONOF MAGNETIC ANOMALIES DUE TO 2-D CYLINDRICAL STRUCTURES –BY AN ARTIF...
PDF
Background Estimation Using Principal Component Analysis Based on Limited Mem...
IRJET-An Effective Strategy for Defense & Medical Pictures Security by Singul...
Kernels in convolution
Report_NLNN
APPLIED MACHINE LEARNING
Svm map reduce_slides
STATE SPACE GENERATION FRAMEWORK BASED ON BINARY DECISION DIAGRAM FOR DISTRIB...
INVERSIONOF MAGNETIC ANOMALIES DUE TO 2-D CYLINDRICAL STRUCTURES –BY AN ARTIF...
Background Estimation Using Principal Component Analysis Based on Limited Mem...

What's hot (18)

PPTX
Hardware Implementation of Cascade SVM
PPTX
Image recogonization
PDF
SupportVectorRegression
PDF
GPUFish_technical_report
PPTX
Deep Learning for AI (2)
PPTX
Deep learning algorithms
PPTX
Deep neural networks & computational graphs
PDF
MapReduce based SVM
PDF
Lecture 7: Data-Intensive Computing for Text Analysis (Fall 2011)
PPTX
Seminar_New -CESG
PDF
Pixel RNN to Pixel CNN++
PPTX
03 image transformations_i
PDF
widely-linear-minimum (1)
DOCX
Dynamic clustering algorithm using fuzzy c means
PPTX
04 image transformations_ii
PDF
Accelerated Particle Swarm Optimization and Support Vector Machine for Busine...
PDF
Bistablecamnets
PDF
High speed multiplier using vedic mathematics
Hardware Implementation of Cascade SVM
Image recogonization
SupportVectorRegression
GPUFish_technical_report
Deep Learning for AI (2)
Deep learning algorithms
Deep neural networks & computational graphs
MapReduce based SVM
Lecture 7: Data-Intensive Computing for Text Analysis (Fall 2011)
Seminar_New -CESG
Pixel RNN to Pixel CNN++
03 image transformations_i
widely-linear-minimum (1)
Dynamic clustering algorithm using fuzzy c means
04 image transformations_ii
Accelerated Particle Swarm Optimization and Support Vector Machine for Busine...
Bistablecamnets
High speed multiplier using vedic mathematics
Ad

Similar to Map-Reduce for Machine Learning on Multicore (20)

PDF
Terascale Learning
PPTX
Oxford 05-oct-2012
DOCX
mapReduce for machine learning
PPTX
Nearest Neighbor Customer Insight
PPTX
ACM 2013-02-25
PDF
[Harvard CS264] 09 - Machine Learning on Big Data: Lessons Learned from Googl...
PPTX
Big Practical Recommendations with Alternating Least Squares
PDF
Paper on experimental setup for verifying - &quot;Slow Learners are Fast&quot;
PDF
Machine Learning - What, Where and How
PDF
Modeling with Hadoop kdd2011
PDF
Arvindsujeeth scaladays12
PDF
An introduction to Machine Learning
PDF
Statistically adaptive learning for a general class of..
PDF
Statistically adaptive learning for a general class of..
PDF
Instance-based learning (aka Case-based or Memory-based or non-parametric)
PDF
Massive MapReduce Matrix Computations & Multicore Graph Algorithms
PDF
Data-Intensive Computing for Competent Genetic Algorithms: A Pilot Study us...
PDF
[Harvard CS264] 02 - Parallel Thinking, Architecture, Theory & Patterns
PPTX
Paris data-geeks-2013-03-28
PDF
Open Machine Learning
Terascale Learning
Oxford 05-oct-2012
mapReduce for machine learning
Nearest Neighbor Customer Insight
ACM 2013-02-25
[Harvard CS264] 09 - Machine Learning on Big Data: Lessons Learned from Googl...
Big Practical Recommendations with Alternating Least Squares
Paper on experimental setup for verifying - &quot;Slow Learners are Fast&quot;
Machine Learning - What, Where and How
Modeling with Hadoop kdd2011
Arvindsujeeth scaladays12
An introduction to Machine Learning
Statistically adaptive learning for a general class of..
Statistically adaptive learning for a general class of..
Instance-based learning (aka Case-based or Memory-based or non-parametric)
Massive MapReduce Matrix Computations & Multicore Graph Algorithms
Data-Intensive Computing for Competent Genetic Algorithms: A Pilot Study us...
[Harvard CS264] 02 - Parallel Thinking, Architecture, Theory & Patterns
Paris data-geeks-2013-03-28
Open Machine Learning
Ad

Recently uploaded (20)

PPTX
A Presentation on Artificial Intelligence
PDF
Dropbox Q2 2025 Financial Results & Investor Presentation
PPTX
Big Data Technologies - Introduction.pptx
PDF
TokAI - TikTok AI Agent : The First AI Application That Analyzes 10,000+ Vira...
PPTX
MYSQL Presentation for SQL database connectivity
PDF
Agricultural_Statistics_at_a_Glance_2022_0.pdf
PDF
cuic standard and advanced reporting.pdf
PDF
Review of recent advances in non-invasive hemoglobin estimation
PPTX
Cloud computing and distributed systems.
PPTX
KOM of Painting work and Equipment Insulation REV00 update 25-dec.pptx
PDF
Chapter 3 Spatial Domain Image Processing.pdf
PPTX
PA Analog/Digital System: The Backbone of Modern Surveillance and Communication
PDF
Architecting across the Boundaries of two Complex Domains - Healthcare & Tech...
PDF
Electronic commerce courselecture one. Pdf
PDF
Reach Out and Touch Someone: Haptics and Empathic Computing
PDF
The Rise and Fall of 3GPP – Time for a Sabbatical?
PDF
KodekX | Application Modernization Development
PDF
How UI/UX Design Impacts User Retention in Mobile Apps.pdf
PPTX
20250228 LYD VKU AI Blended-Learning.pptx
PDF
Building Integrated photovoltaic BIPV_UPV.pdf
A Presentation on Artificial Intelligence
Dropbox Q2 2025 Financial Results & Investor Presentation
Big Data Technologies - Introduction.pptx
TokAI - TikTok AI Agent : The First AI Application That Analyzes 10,000+ Vira...
MYSQL Presentation for SQL database connectivity
Agricultural_Statistics_at_a_Glance_2022_0.pdf
cuic standard and advanced reporting.pdf
Review of recent advances in non-invasive hemoglobin estimation
Cloud computing and distributed systems.
KOM of Painting work and Equipment Insulation REV00 update 25-dec.pptx
Chapter 3 Spatial Domain Image Processing.pdf
PA Analog/Digital System: The Backbone of Modern Surveillance and Communication
Architecting across the Boundaries of two Complex Domains - Healthcare & Tech...
Electronic commerce courselecture one. Pdf
Reach Out and Touch Someone: Haptics and Empathic Computing
The Rise and Fall of 3GPP – Time for a Sabbatical?
KodekX | Application Modernization Development
How UI/UX Design Impacts User Retention in Mobile Apps.pdf
20250228 LYD VKU AI Blended-Learning.pptx
Building Integrated photovoltaic BIPV_UPV.pdf

Map-Reduce for Machine Learning on Multicore

  • 1. Map-Reduce for Machine Learning on Multicore Cheng-Tao Chu ∗ Sang Kyun Kim ∗ Yi-An Lin ∗ chengtao@stanford.edu skkim38@stanford.edu ianl@stanford.edu YuanYuan Yu ∗ Gary Bradski ∗† Andrew Y. Ng ∗ yuanyuan@stanford.edu garybradski@gmail ang@cs.stanford.edu Kunle Olukotun ∗ kunle@cs.stanford.edu ∗ . CS. Department, Stanford University 353 Serra Mall, Stanford University, Stanford CA 94305-9025. † . Rexee Inc. Abstract We are at the beginning of the multicore era. Computers will have increasingly many cores (processors), but there is still no good programming framework for these architectures, and thus no simple and unified way for machine learning to take advantage of the potential speed up. In this paper, we develop a broadly ap- plicable parallel programming method, one that is easily applied to many different learning algorithms. Our work is in distinct contrast to the tradition in machine learning of designing (often ingenious) ways to speed up a single algorithm at a time. Specifically, we show that algorithms that fit the Statistical Query model [15] can be written in a certain “summation form,” which allows them to be easily par- allelized on multicore computers. We adapt Google’s map-reduce [7] paradigm to demonstrate this parallel speed up technique on a variety of learning algorithms including locally weighted linear regression (LWLR), k-means, logistic regres- sion (LR), naive Bayes (NB), SVM, ICA, PCA, gaussian discriminant analysis (GDA), EM, and backpropagation (NN). Our experimental results show basically linear speedup with an increasing number of processors. 1 Introduction Frequency scaling on silicon—the ability to drive chips at ever higher clock rates—is beginning to hit a power limit as device geometries shrink due to leakage, and simply because CMOS consumes power every time it changes state [9, 10]. Yet Moore’s law [20], the density of circuits doubling every generation, is projected to last between 10 and 20 more years for silicon based circuits [10]. By keeping clock frequency fixed, but doubling the number of processing cores on a chip, one can maintain lower power while doubling the speed of many applications. This has forced an industry- wide shift to multicore. We thus approach an era of increasing numbers of cores per chip, but there is as yet no good frame- work for machine learning to take advantage of massive numbers of cores. There are many parallel programming languages such as Orca, Occam ABCL, SNOW, MPI and PARLOG, but none of these approaches make it obvious how to parallelize a particular algorithm. There is a vast literature on distributed learning and data mining [18], but very little of this literature focuses on our goal: A gen- eral means of programming machine learning on multicore. Much of this literature contains a long
  • 2. and distinguished tradition of developing (often ingenious) ways to speed up or parallelize individ- ual learning algorithms, for instance cascaded SVMs [11]. But these yield no general parallelization technique for machine learning and, more pragmatically, specialized implementations of popular algorithms rarely lead to widespread use. Some examples of more general papers are: Caregea et. al. [5] give some general data distribution conditions for parallelizing machine learning, but restrict the focus to decision trees; Jin and Agrawal [14] give a general machine learning programming ap- proach, but only for shared memory machines. This doesn’t fit the architecture of cellular or grid type multiprocessors where cores have local cache, even if it can be dynamically reallocated. In this paper, we focuses on developing a general and exact technique for parallel programming of a large class of machine learning algorithms for multicore processors. The central idea of this approach is to allow a future programmer or user to speed up machine learning applications by ”throwing more cores” at the problem rather than search for specialized optimizations. This paper’s contributions are: (i) We show that any algorithm fitting the Statistical Query Model may be written in a certain “sum- mation form.” This form does not change the underlying algorithm and so is not an approximation, but is instead an exact implementation. (ii) The summation form does not depend on, but can be easily expressed in a map-reduce [7] framework which is easy to program in. (iii) This technique achieves basically linear speed-up with the number of cores. We attempt to develop a pragmatic and general framework. What we do not claim: (i) We make no claim that our technique will necessarily run faster than a specialized, one-off so- lution. Here we achieve linear speedup which in fact often does beat specific solutions such as cascaded SVM [11] (see section 5; however, they do handle kernels, which we have not addressed). (ii) We make no claim that following our framework (for a specific algorithm) always leads to a novel parallelization undiscovered by others. What is novel is the larger, broadly applicable frame- work, together with a pragmatic programming paradigm, map-reduce. (iii) We focus here on exact implementation of machine learning algorithms, not on parallel approximations to algorithms (a worthy topic, but one which is beyond this paper’s scope). In section 2 we discuss the Statistical Query Model, our summation form framework and an example of its application. In section 3 we describe how our framework may be implemented in a Google- like map-reduce paradigm. In section 4 we choose 10 frequently used machine learning algorithms as examples of what can be coded in this framework. This is followed by experimental runs on 10 moderately large data sets in section 5, where we show a good match to our theoretical computational complexity results. Basically, we often achieve linear speedup in the number of cores. Section 6 concludes the paper. 2 Statistical Query and Summation Form For multicore systems, Sutter and Larus [25] point out that multicore mostly benefits concurrent applications, meaning ones where there is little communication between cores. The best match is thus if the data is subdivided and stays local to the cores. To achieve this, we look to Kearns’ Statistical Query Model [15]. 
The Statistical Query Model is sometimes posed as a restriction on the Valiant PAC model [26], in which we permit the learning algorithm to access the learning problem only through a statistical query oracle. Given a function f (x, y) over instances, the statistical query oracle returns an estimate of the expectation of f (x, y) (averaged over the training/test distribution). Algorithms that calculate sufficient statistics or gradients fit this model, and since these calculations may be batched, they are expressible as a sum over data points. This class of algorithms is large; We show 10 popular algorithms in section 4 below. An example that does not fit is that of learning an XOR over a subset of bits. [16, 15]. However, when an algorithm does sum over the data, we can easily distribute the calculations over multiple cores: We just divide the data set into as many pieces as there are cores, give each core its share of the data to sum the equations over, and aggregate the results at the end. We call this form of the algorithm the “summation form.” As an example, consider ordinary least squares (linear regression), which fits a model of the form m y = θT x by solving: θ∗ = minθ i=1 (θT xi − yi )2 The parameter θ is typically solved for by
  • 3. 1.1.3.2 Algorithm 1.1.1.2 1.1.3.1: query_info 2 1: run 0: data input Data Engine 1.2 1.1: run 1.1.3: reduce Master Reducer 1.1.4: result 1.1.2: intermediate data 1.1.1: map (split data) Mapper Mapper Mapper Mapper 1.1.1.1: query_info Figure 1: Multicore map-reduce framework defining the design matrix X ∈ Rm×n to be a matrix whose rows contain the training instances x1 , . . . , xm , letting y = [y1 , . . . , ym ]m be the vector of target labels, and solving the normal equa- tions to obtain θ∗ = (X T X)−1 X T y. To put this computation into summation form, we reformulate it into a two phase algorithm where we first compute sufficient statistics by summing over the data, and then aggregate those statistics and solve to get θ∗ = A−1 b. Concretely, we compute A = X T X and b = X T y as follows: m m A = i=1 (xi xT ) and b = i=1 (xi yi ). The computation of A and b can now be divided into i equal size pieces and distributed among the cores. We next discuss an architecture that lends itself to the summation form: Map-reduce. 3 Architecture Many programming frameworks are possible for the summation form, but inspired by Google’s success in adapting a functional programming construct, map-reduce [7], for wide spread parallel programming use inside their company, we adapted this same construct for multicore use. Google’s map-reduce is specialized for use over clusters that have unreliable communication and where indi- vidual computers may go down. These are issues that multicores do not have; thus, we were able to developed a much lighter weight architecture for multicores, shown in Figure 1. Figure 1 shows a high level view of our architecture and how it processes the data. In step 0, the map-reduce engine is responsible for splitting the data by training examples (rows). The engine then caches the split data for the subsequent map-reduce invocations. Every algorithm has its own engine instance, and every map-reduce task will be delegated to its engine (step 1). Similar to the original map-reduce architecture, the engine will run a master (step 1.1) which coordinates the mappers and the reducers. The master is responsible for assigning the split data to different mappers, and then collects the processed intermediate data from the mappers (step 1.1.1 and 1.1.2). After the intermediate data is collected, the master will in turn invoke the reducer to process it (step 1.1.3) and return final results (step 1.1.4). Note that some mapper and reducer operations require additional scalar information from the algorithms. In order to support these operations, the mapper/reducer can obtain this information through the query info interface, which can be customized for each different algorithm (step 1.1.1.1 and 1.1.3.2). 4 Adopted Algorithms In this section, we will briefly discuss the algorithms we have implemented based on our framework. These algorithms were chosen partly by their popularity of use in NIPS papers, and our goal will be to illustrate how each algorithm can be expressed in summation form. We will defer the discussion of the theoretical improvement that can be achieved by this parallelization to Section 4.1. In the following, x or xi denotes a training vector and y or yi denotes a training label.
  • 4. • Locally Weighted Linear Regression (LWLR) LWLR [28, 3] is solved by finding m T the solution of the normal equations Aθ = b, where A = i=1 wi (xi xi ) and b = m i=1 wi (xi yi ). For the summation form, we divide the computation among different map- pers. In this case, one set of mappers is used to compute subgroup wi (xi xT ) and another i set to compute subgroup wi (xi yi ). Two reducers respectively sum up the partial values for A and b, and the algorithm finally computes the solution θ = A−1 b. Note that if wi = 1, the algorithm reduces to the case of ordinary least squares (linear regression). • Naive Bayes (NB) In NB [17, 21], we have to estimate P (xj = k|y = 1), P (xj = k|y = 0), and P (y) from the training data. In order to do so, we need to sum over xj = k for each y label in the training data to calculate P (x|y). We specify different sets of mappers to calculate the following: subgroup 1{xj = k|y = 1}, subgroup 1{xj = k|y = 0}, subgroup 1{y = 1} and subgroup 1{y = 0}. The reducer then sums up intermediate results to get the final result for the parameters. • Gaussian Discriminative Analysis (GDA) The classic GDA algorithm [13] needs to learn the following four statistics P (y), µ0 , µ1 and Σ. For all the summation forms involved in these computations, we may leverage the map-reduce framework to parallelize the process. Each mapper will handle the summation (i.e. Σ 1{yi = 1}, Σ 1{yi = 0}, Σ 1{yi = 0}xi , etc) for a subgroup of the training samples. Finally, the reducer will aggregate the intermediate sums and calculate the final result for the parameters. • k-means In k-means [12], it is clear that the operation of computing the Euclidean distance between the sample vectors and the centroids can be parallelized by splitting the data into individual subgroups and clustering samples in each subgroup separately (by the mapper). In recalculating new centroid vectors, we divide the sample vectors into subgroups, com- pute the sum of vectors in each subgroup in parallel, and finally the reducer will add up the partial sums and compute the new centroids. • Logistic Regression (LR) For logistic regression [23], we choose the form of hypothesis as hθ (x) = g(θT x) = 1/(1 + exp(−θT x)) Learning is done by fitting θ to the training data where the likelihood function can be optimized by using Newton-Raphson to update θ := θ − H −1 θ (θ). θ (θ) is the gradient, which can be computed in parallel by (i) mappers summing up subgroup (y (i) − hθ (x(i) ))xj each NR step i. The computation of the hessian matrix can be also written in a summation form of H(j, k) := H(j, k) + (i) (i) hθ (x(i) )(hθ (x(i) ) − 1)xj xk for the mappers. The reducer will then sum up the values for gradient and hessian to perform the update for θ. • Neural Network (NN) We focus on backpropagation [6] By defining a network struc- ture (we use a three layer network with two output neurons classifying the data into two categories), each mapper propagates its set of data through the network. For each train- ing example, the error is back propagated to calculate the partial gradient for each of the weights in the network. The reducer then sums the partial gradient from each mapper and does a batch gradient descent to update the weights of the network. • Principal Components Analysis (PCA) PCA [29] computes the principle eigenvectors of 1 m T the covariance matrix Σ = m i=1 xi xi − µµT over the data. In the definition for m T Σ, the term i=1 xi xi is already expressed in summation form. 
Further, we can also 1 m express the mean vector µ as a sum, µ = m i=1 xi . The sums can be mapped to separate cores, and then the reducer will sum up the partial results to produce the final empirical covariance matrix. • Independent Component Analysis (ICA) ICA [1] tries to identify the independent source vectors based on the assumption that the observed data are linearly transformed from the source data. In ICA, the main goal is to compute the unmixing matrix W. We implement batch gradient ascent to optimize the W ’s likelihood. In this scheme, we can independently 1 − 2g(w1 x(i) ) T T calculate the expression . x(i) in the mappers and sum them up in the . . reducer. • Expectation Maximization (EM) For EM [8] we use Mixture of Gaussian as the underly- ing model as per [19]. For parallelization: In the E-step, every mapper processes its subset
  • 5. (i) of the training data and computes the corresponding wj (expected pseudo count). In M- phase, three sets of parameters need to be updated: p(y), µ, and Σ. For p(y), every mapper (i) will compute subgroup (wj ), and the reducer will sum up the partial result and divide it (i) (i) by m. For µ, each mapper will compute subgroup (wj ∗ x(i) ) and subgroup (wj ), and the reducer will sum up the partial result and divide them. For Σ, every mapper will com- (i) (i) pute subgroup (wj ∗ (x(i) − µj ) ∗ (x(i) − µj )T ) and subgroup (wj ), and the reducer will again sum up the partial result and divide them. • Support Vector Machine (SVM) Linear SVM’s [27, 22] primary goal is to optimize the p following primal problem minw,b w 2 + C i:ξi >0 ξi s.t. y (i) (wT x(i) + b) ≥ 1 − ξi where p is either 1 (hinge loss) or 2 (quadratic loss). [2] has shown that the primal problem for quadratic loss can be solved using the following formula where sv are the support vectors: = 2w + 2C i∈sv (w · xi − yi )xi & Hessian H = I + C i∈sv xi xT i We perform batch gradient descent to optimize the objective function. The mappers will calculate the partial gradient subgroup(i∈sv) (w · xi − yi )xi and the reducer will sum up the partial results to update w vector. Some implementations of machine learning algorithms, such as ICA, are commonly done with stochastic gradient ascent, which poses a challenge to parallelization. The problem is that in ev- ery step of gradient ascent, the algorithm updates a common set of parameters (e.g. the unmixing W matrix in ICA). When one gradient ascent step (involving one training sample) is updating W , it has to lock down this matrix, read it, compute the gradient, update W , and finally release the lock. This “lock-release” block creates a bottleneck for parallelization; thus, instead of stochastic gradient ascent, our algorithms above were implemented using batch gradient ascent. 4.1 Algorithm Time Complexity Analysis Table 1 shows the theoretical complexity analysis for the ten algorithms we implemented on top of our framework. We assume that the dimension of the inputs is n (i.e., x ∈ Rn ), that we have m training examples, and that there are P cores. The complexity of iterative algorithms is analyzed for one iteration, and so their actual running time may be slower.1 A few algorithms require matrix inversion or an eigen-decomposition of an n-by-n matrix; we did not parallelize these steps in our experiments, because for us m >> n, and so their cost is small. However, there is extensive research in numerical linear algebra on parallelizing these numerical operations [4], and in the complexity analysis shown in the table, we have assumed that matrix inversion and eigen-decompositions can be sped up by a factor of P on P cores. (In practice, we expect P ≈ P .) In our own software imple- mentation, we had P = 1. Further, the reduce phase can minimize communication by combining data as it’s passed back; this accounts for the log(P ) factor. As an example of our running-time analysis, for single-core LWLR we have to compute A = m T 2 3 i=1 wi (xi xi ), which gives us the mn term. This matrix must be inverted for n ; also, the 2 reduce step incurs a covariance matrix communication cost of n . 5 Experiments To provide fair comparisons, each algorithm had two different versions: One running map-reduce, and the other a serial implementation without the framework. 
5 Experiments

To provide fair comparisons, each algorithm had two different versions: one running map-reduce, and the other a serial implementation without the framework. We conducted an extensive series of experiments to compare the speedups on data sets of various sizes (Table 2): eight commonly used machine learning data sets from the UCI Machine Learning Repository and two others from a [anonymous] research group (Helicopter Control and sensor data). Note that not all of the experiments make sense from an output point of view (e.g., regression on categorical data), but our purpose was to test speedup, so we ran every algorithm over all the data sets.
The first environment we conducted experiments on was an Intel x86 PC with two Pentium-III 700 MHz CPUs and 1 GB of physical memory. The operating system was Linux RedHat 8.0 with kernel 2.4.20-8smp.
Algorithm    single           multi
LWLR         O(mn² + n³)      O(mn²/P + n³/P′ + n² log(P))
LR           O(mn² + n³)      O(mn²/P + n³/P′ + n² log(P))
NB           O(mn + nc)       O(mn/P + nc log(P))
NN           O(mn + nc)       O(mn/P + nc log(P))
GDA          O(mn² + n³)      O(mn²/P + n³/P′ + n² log(P))
PCA          O(mn² + n³)      O(mn²/P + n³/P′ + n² log(P))
ICA          O(mn² + n³)      O(mn²/P + n³/P′ + n² log(P))
k-means      O(mnc)           O(mnc/P + mn log(P))
EM           O(mn² + n³)      O(mn²/P + n³/P′ + n² log(P))
SVM          O(m²n)           O(m²n/P + n log(P))

Table 1: Time complexity analysis. Here c denotes the number of output classes (or clusters, for k-means).

Data Set                 samples (m)    features (n)
Adult                    30162          14
Helicopter Control       44170          21
Corel Image Features     68040          32
IPUMS Census             88443          61
Synthetic Time Series    100001         10
Census Income            199523         40
ACIP Sensor              229564         8
KDD Cup 99               494021         41
Forest Cover Type        581012         55
1990 US Census           2458285        68

Table 2: Data set sizes.

In addition, we also ran extensive comparison experiments on a 16-way Sun Enterprise 6000 running Solaris 10; here, we compared results using 1, 2, 4, 8, and 16 cores.

5.1 Results and Discussion

Table 3 shows the speedup on dual processors over all of the algorithms on all of the data sets. As can be seen from the table, most of the algorithms achieve more than a 1.9x performance improvement. For some of the experiments, e.g. gda/covertype, ica/ipums, nn/colorhistogram, etc., we obtain a greater than 2x speedup. This is because the original algorithms do not utilize all of the CPU cycles efficiently, but do better when we distribute the tasks to separate threads/processes. Figure 2 shows the speedup of the algorithms over all the data sets for 2, 4, 8, and 16 processing cores. In the figure, the thick lines show the average speedup, the error bars show the maximum and minimum speedups, and the dashed lines show the variance.

Data Set        lwlr    gda     nb      logistic pca     ica     svm     nn      kmeans  em
Adult           1.922   1.801   1.844   1.962    1.809   1.857   1.643   1.825   1.947   1.854
Helicopter      1.93    2.155   1.924   1.92     1.791   1.856   1.744   1.847   1.857   1.86
Corel Image     1.96    1.876   2.002   1.929    1.97    1.936   1.754   2.018   1.921   1.832
IPUMS           1.963   2.23    1.965   1.938    1.965   2.025   1.799   1.974   1.957   1.984
Synthetic       1.909   1.964   1.972   1.92     1.842   1.907   1.76    1.902   1.888   1.804
Census Income   1.975   2.179   1.967   1.941    2.019   1.941   1.88    1.896   1.961   1.99
Sensor          1.927   1.853   2.01    1.913    1.955   1.893   1.803   1.914   1.953   1.949
KDD             1.969   2.216   1.848   1.927    2.012   1.998   1.946   1.899   1.973   1.979
Cover Type      1.961   2.232   1.951   1.935    2.007   2.029   1.906   1.887   1.963   1.991
Census          2.327   2.292   2.008   1.906    1.997   2.001   1.959   1.883   1.946   1.977
avg.            1.985   2.080   1.950   1.930    1.937   1.944   1.819   1.905   1.937   1.922

Table 3: Speedups achieved on a dual-core processor, excluding load time. Numbers reported are single-core time / dual-core time. Super-linear speedup sometimes occurs due to a reduction in processor idle time when running multiple threads.
Figure 2: (a)-(i) show the speedup from 1 to 16 processors of all the algorithms over all the data sets. The bold line is the average, the error bars are the max and min speedups, and the dashed lines are the variance.

Speedup is basically linear with the number of cores, but with a slope < 1.0. The reason for the sub-unity slope is increasing communication overhead. For simplicity, and because the number of data points m typically dominates the reduction-phase communication cost (typically a factor of n², but n << m), we did not parallelize the reduce phase, where we could have combined data on the way back. Even so, our simple SVM approach gets about a 13.6% speedup on average over 16 cores, whereas the specialized SVM cascade [11] averages only 4%.
Finally, the above are runs on multiprocessor machines. We finish by reporting some confirming results, and higher performance, on a proprietary multicore simulator over the sensor data set.² NN speedup was [16 cores, 15.5x], [32 cores, 29x], [64 cores, 54x]. LR speedup was [16 cores, 15x], [32 cores, 29.5x], [64 cores, 53x]. Multicore machines are generally faster than multiprocessor machines because communication internal to the chip is much less costly.

² This work was done in collaboration with Intel Corporation.

6 Conclusion

As the Intel and AMD product roadmaps indicate [24], the number of processing cores on a chip will be doubling several times over the next decade, even as individual cores cease to become significantly faster. For machine learning to continue reaping the bounty of Moore's law and apply to ever larger datasets and problems, it is important to adopt a programming architecture that takes advantage of multicore. In this paper, by taking advantage of the summation form in a map-reduce
framework, we could parallelize a wide range of machine learning algorithms and achieve a 1.9x speedup on a dual processor and up to a 54x speedup on 64 cores. These results are in line with the complexity analysis in Table 1. We note that the speedups achieved here involved no special optimizations of the algorithms themselves. We have demonstrated a simple programming framework where in the future we can just “throw cores” at the problem of speeding up machine learning code.

Acknowledgments

We would like to thank Skip Macy from Intel for sharing his valuable experience with the VTune performance analyzer. Yirong Shen, Anya Petrovskaya, and Su-In Lee from Stanford University helped us in preparing various data sets used in our experiments. This research was sponsored in part by the Defense Advanced Research Projects Agency (DARPA) under the ACIP program and grant number NBCH104009.

References

[1] A.J. Bell and T.J. Sejnowski. An information-maximization approach to blind separation and blind deconvolution. Neural Computation, 1995.
[2] O. Chapelle. Training a support vector machine in the primal. Journal of Machine Learning Research (submitted), 2006.
[3] W. S. Cleveland and S. J. Devlin. Locally weighted regression: An approach to regression analysis by local fitting. J. Amer. Statist. Assoc., 83:596-610, 1988.
[4] L. Csanky. Fast parallel matrix inversion algorithms. SIAM J. Comput., 5(4):618-623, 1976.
[5] D. Caragea, A. Silvescu, and V. Honavar. A framework for learning from distributed data using sufficient statistics and its application to learning decision trees. International Journal of Hybrid Intelligent Systems, 2003.
[6] D. E. Rumelhart, G. E. Hinton, and R. J. Williams. Learning representations by back-propagating errors. Nature, 323:533-536, 1986.
[7] J. Dean and S. Ghemawat. MapReduce: Simplified data processing on large clusters. In Operating Systems Design and Implementation, pages 137-149, 2004.
[8] A.P. Dempster, N.M. Laird, and D.B. Rubin. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, Series B, 39(1):1-38, 1977.
[9] D.J. Frank. Power-constrained CMOS scaling limits. IBM Journal of Research and Development, 46, 2002.
[10] P. Gelsinger. Microprocessors for the new millennium: Challenges, opportunities and new frontiers. In ISSCC Tech. Digest, pages 22-25, 2001.
[11] Hans Peter Graf, Eric Cosatto, Leon Bottou, Igor Durdanovic, and Vladimir Vapnik. Parallel support vector machines: The cascade SVM. In NIPS, 2004.
[12] J. Hartigan. Clustering Algorithms. Wiley, 1975.
[13] T. Hastie and R. Tibshirani. Discriminant analysis by Gaussian mixtures. Journal of the Royal Statistical Society B, pages 155-176, 1996.
[14] R. Jin and G. Agrawal. Shared memory parallelization of data mining algorithms: Techniques, programming interface, and performance. In Second SIAM International Conference on Data Mining, 2002.
[15] M. Kearns. Efficient noise-tolerant learning from statistical queries. pages 392-401, 1999.
[16] Michael Kearns and Umesh V. Vazirani. An Introduction to Computational Learning Theory. MIT Press, 1994.
[17] David Lewis. Naive (Bayes) at forty: The independence assumption in information retrieval. In ECML98: Tenth European Conference on Machine Learning, 1998.
[18] Kun Liu and Hillol Kargupta. Distributed data mining bibliography. http://www.cs.umbc.edu/~hillol/DDMBIB/, 2006.
[19] T. K. Moon. The expectation-maximization algorithm. IEEE Trans. Signal Process., pages 47-59, 1996.
[20] G. Moore. Progress in digital integrated electronics. In IEDM Tech. Digest, pages 11-13, 1975.
[21] Pat Langley, Wayne Iba, and Kevin Thompson. An analysis of Bayesian classifiers. In AAAI, 1992.
[22] John C. Platt. Fast training of support vector machines using sequential minimal optimization. pages 185-208, 1999.
[23] Daryl Pregibon. Logistic regression diagnostics. The Annals of Statistics, 9:705-724, 1981.
[24] T. Studt. There's a multicore in your future. http://tinyurl.com/ohd2m, 2006.
[25] Herb Sutter and James Larus. Software and the concurrency revolution. Queue, 3(7):54-62, 2005.
[26] L.G. Valiant. A theory of the learnable. Communications of the ACM, 27(11):1134-1142, 1984.
[27] V. Vapnik. Estimation of Dependencies Based on Empirical Data. Springer Verlag, 1982.
[28] R. E. Welsch and E. Kuh. Linear regression diagnostics. Working Paper 173, Nat. Bur. Econ. Res. Inc., 1977.
[29] S. Wold, K. Esbensen, and P. Geladi. Principal component analysis. Chemometrics and Intelligent Laboratory Systems, 1987.