MLbase
Evan Sparks and Ameet Talwalkar
UC Berkeley
Three Converging Trends
✦ Big Data
✦ Distributed Computing
✦ Machine Learning
MLbase
Vision
MLlib
MLI
ML Optimizer
Release Plan
Problem: Scalable implementations difficult for ML Developers…
[Diagram: an ML Developer supplies an ML Contract + Code]
Problem: ML is difficult for End Users…
✦ Too many algorithms…
✦ Too many knobs…
✦ Difficult to debug…
✦ Doesn't scale…
[Image labels: Reliable, Fast, Accurate, Provable]
MLbase (ML Experts + Systems Experts)
1. Easy scalable ML development (ML Developers)
2. User-friendly ML at scale (End Users)
Along the way, we gain insight into data-intensive computing
Matlab Stack
[Stack diagram: Matlab Interface, Lapack, Single Machine]
✦ Lapack: low-level Fortran linear algebra library
✦ Matlab Interface
  ✦ Higher-level abstractions for data access / processing
  ✦ More extensive functionality than Lapack
  ✦ Leverages Lapack whenever possible
✦ Similar stories for R and Python
MLbase Stack
[Stack diagram: the Matlab stack (Matlab Interface, Lapack, Single Machine) beside the MLbase stack (ML Optimizer, MLI, MLlib, Spark, Runtime(s))]
Spark: cluster computing system designed for iterative computation
MLlib: low-level ML library in Spark
✦ Callable from Scala, Java
MLI: API / platform for feature extraction and algorithm development
✦ Includes higher-level functionality with faster dev cycle than MLlib
ML Optimizer: automates model selection
✦ Solves a search problem over feature extractors and algorithms in MLI
MLbase Stack Status
[Stack diagram: End User, ML Optimizer, MLI, MLlib, Spark]
Goal 1: Summer Release
Goal 2: Winter Release
[Architecture diagram labels: User, Declarative ML Task, ML Developer, ML Contract + Code, Master Server (Parser, Optimizer, Executor/Monitoring, ML Library, Meta-Data, Statistics, LLP, PLP), Slaves (four DMX Runtimes), result (e.g., fn-model & summary)]
Example: MLlib
✦ Goal: Classification of text file
✦ Featurize data manually
✦ Calls MLlib's LR function

    def main(args: Array[String]) {
      val sc = new SparkContext("local", "SparkLR")

      //Load data from HDFS
      val data = sc.textFile(args(0)) //RDD[String]

      //User is responsible for formatting/featurizing/normalizing their RDD!
      //processData is a user-written function; it is not provided by MLlib.
      val featurizedData: RDD[(Double,Array[Double])] = processData(data)

      //Train the model using MLlib.
      val model = new LogisticRegressionLocalRandomSGD()
        .setStepSize(0.1)
        .setNumIterations(50)
        .train(featurizedData)
    }
Example: MLI
✦ Use built-in feature extraction functionality
✦ MLI Logistic Regression leverages MLlib
✦ Extensions:
  ✦ Embed in cross-validation routine
  ✦ Use different feature extractors / algorithms or write new ones

    def main(args: Array[String]) {
      val mc = new MLContext("local", "MLILR")

      //Read in file from HDFS
      val rawTextTable = mc.csvFile(args(0), Seq("class","text"))

      //Run feature extraction: TF-IDF over the top 30000 bigrams of the "text" column
      val classes = rawTextTable(??, "class")
      val ngrams = tfIdf(nGrams(rawTextTable(??, "text"), n=2, top=30000))
      val featurizedTable = classes.zip(ngrams)

      //Classify the data using Logistic Regression.
      val lrModel = LogisticRegression(featurizedTable, stepSize=0.1, numIter=12)
    }
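The feature-extraction step in the MLI example (bigrams, keep the most frequent ones, TF-IDF weighting) can be sketched with plain Scala collections. This is only an illustration under stated assumptions, not MLI's implementation: `bigrams`, `topBigrams`, and `tfIdf` are hypothetical local helpers, and the weighting used here (raw count times log(N / df)) is one common TF-IDF variant chosen for the sketch.

```scala
object BigramTfIdfSketch {
  // Lower-case, split on whitespace, and pair adjacent tokens into bigrams.
  def bigrams(text: String): Seq[String] = {
    val tokens = text.toLowerCase.split("\\s+").filter(_.nonEmpty).toSeq
    if (tokens.length < 2) Seq.empty
    else tokens.sliding(2).map(_.mkString(" ")).toSeq
  }

  // Keep the `top` most frequent bigrams in the corpus (ties broken alphabetically).
  def topBigrams(docs: Seq[String], top: Int): Seq[String] = {
    val counts = docs.flatMap(bigrams).groupBy(identity).map { case (g, occ) => (g, occ.size) }
    counts.toSeq.sortBy { case (g, c) => (-c, g) }.take(top).map(_._1)
  }

  // One dense row per document: term count * log(N / document frequency).
  def tfIdf(docs: Seq[String], vocab: Seq[String]): Seq[Array[Double]] = {
    val perDoc = docs.map(d => bigrams(d).groupBy(identity).map { case (g, occ) => (g, occ.size) })
    val n = docs.size.toDouble
    val idf = vocab.map { g =>
      val df = perDoc.count(_.contains(g))
      if (df == 0) 0.0 else math.log(n / df)
    }
    perDoc.map { counts =>
      vocab.zip(idf).map { case (g, w) => counts.getOrElse(g, 0) * w }.toArray
    }
  }

  def main(args: Array[String]): Unit = {
    val docs = Seq("the cat sat", "the cat ran", "a dog ran")
    val vocab = topBigrams(docs, top = 3)
    tfIdf(docs, vocab).foreach(row => println(row.mkString(", ")))
  }
}
```

The point of the MLI version is that these steps come built in and run on a distributed table; the sketch only shows what the two calls compute on a toy corpus.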
Example: ML Optimizer

    var X = load("text_file", 2 to 10)
    var y = load("text_file", 1)
    var (fn-model, summary) = doClassify(X, y)

✦ User declaratively specifies task
✦ ML Optimizer searches through MLI
Vision
MLlib
MLI
ML Optimizer
Release Plan
Lay of the Land
[Plot: systems placed along two axes, Ease of use and Performance / Scalability. Points: Matlab, R; Mahout; GraphLab, VW; MLlib]
MLlib
Classification:           Logistic Regression, Linear SVM (+L1, L2)
Regression:               Linear Regression (+Lasso, Ridge)
Collaborative Filtering:  Alternating Least Squares
Clustering:               K-Means
Optimization Primitives:  SGD, Parallel Gradient
Included within Spark codebase
✦ Unlike Mahout/Hadoop
✦ Part of Spark 0.8 release
✦ Continued support via Spark project
MLlib Performance
✦ Walltime: elapsed time to execute task
✦ Weak scaling
  ✦ fix problem size per processor
  ✦ ideally: constant walltime as we grow cluster
✦ Strong scaling
  ✦ fix total problem size
  ✦ ideally: linear speed up as we grow cluster
✦ EC2 Experiments
  ✦ m2.4xlarge instances, up to 32 machine clusters
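The two scaling notions above reduce to simple ratios of walltimes. The sketch below is a minimal illustration; the function names are hypothetical and the walltimes in `main` are made-up numbers, not measurements from these experiments.

```scala
object ScalingMetrics {
  // Weak scaling: problem size per machine is fixed, so the ideal relative walltime is 1.0.
  def relativeWalltime(baseline: Double, walltime: Double): Double = walltime / baseline

  // Strong scaling: total problem size is fixed, so the ideal speedup on p machines is p.
  def speedup(baseline: Double, walltime: Double): Double = baseline / walltime

  // Fraction of the ideal linear speedup achieved on `machines` machines.
  def efficiency(baseline: Double, walltime: Double, machines: Int): Double =
    speedup(baseline, walltime) / machines

  def main(args: Array[String]): Unit = {
    // Hypothetical walltimes in seconds keyed by cluster size (illustrative only).
    val strong = Seq(1 -> 1200.0, 2 -> 640.0, 4 -> 350.0, 8 -> 200.0)
    val base = strong.head._2
    strong.foreach { case (p, t) =>
      println(f"$p%2d machines: speedup ${speedup(base, t)}%.2f, efficiency ${efficiency(base, t, p)}%.2f")
    }
  }
}
```

The "Ideal" curves in the scaling plots correspond to a relative walltime of 1.0 (weak scaling) and a speedup equal to the number of machines (strong scaling).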
Logistic Regression - Weak Scaling
✦ Full dataset: 200K images, 160K dense features
✦ Similar weak scaling
✦ MLlib within a factor of 2 of VW's walltime
[Figure: walltime (s) for MLbase, VW, and Matlab at n=6K, 12.5K, 25K, 50K, 100K, and 200K (d=160K)]
[Fig. 6: Weak scaling for logistic regression. Relative walltime vs. # machines for MLbase, VW, and Ideal]
Logistic Regression - Strong Scaling
✦ Fixed Dataset: 50K images, 160K dense features
✦ MLlib exhibits better scaling properties
✦ MLlib faster than VW with 16 and 32 machines
[Fig. 7: Walltime for strong scaling for logistic regression. Walltime (s) for MLbase, VW, and Matlab on 1, 2, 4, 8, 16, and 32 machines]
[Fig. 8: Strong scaling for logistic regression. Speedup vs. # machines for MLbase, VW, and Ideal]
ALS - Walltime
✦ Dataset: Scaled version of Netflix data (9X in size)
✦ Cluster: 9 machines
✦ MLlib roughly an order of magnitude faster than Mahout
✦ MLlib within factor of 2 of GraphLab
System     Walltime (seconds)
Matlab     15443
Mahout     4206
GraphLab   291
MLlib      481
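The two comparative claims can be checked directly from the table. The sketch below only divides the reported walltimes; the object and function names are hypothetical.

```scala
object AlsWalltimeRatios {
  // Walltimes in seconds, copied from the ALS walltime table.
  val walltime: Map[String, Double] = Map(
    "Matlab" -> 15443.0,
    "Mahout" -> 4206.0,
    "GraphLab" -> 291.0,
    "MLlib" -> 481.0)

  // How many times slower `slow` is than `fast`.
  def ratio(slow: String, fast: String): Double = walltime(slow) / walltime(fast)

  def main(args: Array[String]): Unit = {
    val mahoutVsMllib = ratio("Mahout", "MLlib")
    val mllibVsGraphlab = ratio("MLlib", "GraphLab")
    println(f"Mahout / MLlib   = $mahoutVsMllib%.2f")
    println(f"MLlib / GraphLab = $mllibVsGraphlab%.2f")
  }
}
```

Mahout takes about 8.7 times as long as MLlib, and MLlib about 1.65 times as long as GraphLab, which is consistent with "roughly an order of magnitude" and "within a factor of 2".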
Deployment Considerations
Vowpal Wabbit, GraphLab
✦ Data preparation specific to each program
✦ Non-trivial setup on cluster
✦ No fault tolerance
MLlib
✦ Reads files from HDFS
✦ Launch/compile/run on cluster with a few commands
✦ RDDs are fault tolerant
Vision
MLlib
MLI
ML Optimizer
Release Plan
Lay of the Land
[Plot: systems placed along two axes, Ease of use and Performance / Scalability. Points: Matlab, R; Mahout; GraphLab, VW; MLlib; MLI]
Current Options
+ Easy (Resembles math, limited / no set up cost)
+ Sufficient for prototyping / writing papers
- Ad-hoc, non-scalable scripts
- Loss of translation upon re-implementation

+ Scalable and (sometimes) fast
+ Existing open-source library of ML algorithms
- Difficult to set up, extend
Examples
'Distributed' Divide-Factor-Combine (DFC)
✦ Initial studies in MATLAB (Not distributed)
✦ Distributed prototype involving compiled MATLAB
Mahout ALS with Early Stopping
✦ Theory: simple if-statement (3 lines of code)
✦ Practice: sift through 7 files, nearly 1K lines of code
Insight: Programming Abstractions
✦ Shield ML Developers from low-level details: provide familiar mathematical operators in distributed setting
✦ ML Developer API (MLI)
  ✦ Table Computation: MLTable
  ✦ Linear Algebra: MLSubMatrix
  ✦ Optimization Primitives: MLSolve
✦ MLI Examples:
  ✦ DFC: ~50 lines of code
  ✦ ALS: early stopping in 3 lines; < 40 lines total
Lines of Code

Logistic Regression
System          Lines of Code
Matlab          11
Vowpal Wabbit   721
MLI             55

Alternating Least Squares
System          Lines of Code
Matlab          20
Mahout          865
GraphLab        383
MLI             32
MLI Details
OLD
    val x: RDD[Array[Double]]
    val x: RDD[spark.util.Vector]
    val x: RDD[breeze.linalg.Vector]
    val x: RDD[BIDMat.SMat]
NEW
    val x: MLTable
✦ Generic interface for feature extraction
✦ Common interface to support an optimizer
✦ Abstract interface for arbitrary backends
MLTable
✦ Flexibility when loading data
  ✦ e.g., CSV, JSON, XML
✦ Heterogeneous data across columns
✦ Missing Data
✦ Feature extraction
✦ Common Interface
✦ Supports MapReduce and Relational Operators
✦ Inspired by DataFrames (R) and Pandas (Python)
Feature Extraction
[Fig. 2: MLTable API Illustration (table cropped on the slide; visible rows: matrixBatchMap, numRows, numCols)]

    def main(args: Array[String]) {
      val mc = new MLContext("local")

      //Read in table from file on HDFS.
      val rawTextTable = mc.textFile(args(0))

      //Run feature extraction on the raw text - get the top 30000 bigrams.
      val featurizedTable = tfIdf(nGrams(rawTextTable, n=2, top=30000))

      //Cluster the data using K-Means.
      val kMeansModel = KMeans(featurizedTable, k=50)
    }

Fig. 3: Loading, featurizing, and learning clusters on a corpus…
[MLSubMatrix API table (cropped on the slide). Shape family: dims(mat), mat.numRows, mat.numCols; returns Int or (Int,Int)]
MLSubMatrix
✦ Linear algebra on local partitions
  ✦ E.g., matrix-vector operations for mini-batch logistic regression
  ✦ E.g., solving linear system of equations for Alternating Least Squares
✦ Sparse and Dense Matrix Support
Alternating Least Squares

MATLAB (only the final lines of the listing are visible on the slide):

    parfor q=1:n
      Uq = U(Uinds{q},:);
      V(q,:) = (Uq'*Uq + lambI) \ (Uq' * M(Uinds{q},q));
    end
    end
    end

MLI:

    object BroadcastALS extends Algorithm {
      def train(trainData: MLNumericTable, trainDataTrans: MLNumericTable,
          m: Int, n: Int, k: Int, lambda: Int, maxIter: Int): ALSModel = {
        val lambI = MLSubMatrix.eye(k).mul(lambda)
        var U = MLSubMatrix.rand(m, k)
        var V = MLSubMatrix.rand(n, k)
        var U_b = trainData.context.broadcast(U)
        var V_b = trainData.context.broadcast(V)
        for (iter <- 0 until maxIter) {
          // Update U with V held fixed, then V with U held fixed.
          U = trainData.matrixBatchMap(localALS(_, V_b.value, lambI, k))
          U_b = trainData.context.broadcast(U)
          V = trainDataTrans.matrixBatchMap(localALS(_, U_b.value, lambI, k))
          V_b = trainData.context.broadcast(V)
        }
        new ALSModel(U, V)
      }

      // Solves one regularized least-squares problem per row of the local partition.
      def localALS(trainDataPart: MLSubMatrix, Y: MLSubMatrix, lambI: MLSubMatrix, k: Int): MLSubMatrix = {
        var localX = MLSubMatrix.zeros(trainDataPart.numRows, k)
        for (i <- 0 until trainDataPart.numRows) {
          val q = trainDataPart.rowID(i)
          val nz_inds = trainDataPart.nzCols(q)
          val Yq = Y(nz_inds, ??)
          localX(i, ??) = ((Yq.transpose times Yq) + lambI)
            .solve(Yq.transpose times trainDataPart(q, nz_inds).transpose)
        }
        return localX
      }
    }
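To make the alternating structure concrete without any linear algebra library, the sketch below restricts ALS to rank k = 1, where the per-row solve `(Yq' Yq + lambda I) \ (Yq' r)` collapses to a scalar division. It is a single-machine illustration under that assumption, with hypothetical names (`solveSide`, `train`), and is not the MLI implementation.

```scala
object RankOneAlsSketch {
  // Observed entries as (row, col, value); unobserved entries are simply absent.
  type Rating = (Int, Int, Double)

  // Solve for one side's factors with the other side held fixed.
  // For rank 1: x_i = sum_j(other_j * r_ij) / (sum_j(other_j^2) + lambda).
  def solveSide(n: Int, ratings: Seq[Rating], other: Array[Double], lambda: Double): Array[Double] = {
    val byRow = ratings.groupBy(_._1)
    Array.tabulate(n) { i =>
      val obs = byRow.getOrElse(i, Seq.empty)
      val num = obs.map { case (_, j, r) => other(j) * r }.sum
      val den = obs.map { case (_, j, _) => other(j) * other(j) }.sum + lambda
      num / den
    }
  }

  def train(m: Int, n: Int, ratings: Seq[Rating], lambda: Double, maxIter: Int): (Array[Double], Array[Double]) = {
    // The transposed data plays the role of trainDataTrans in the listing above.
    val transposed = ratings.map { case (i, j, r) => (j, i, r) }
    var u = Array.fill(m)(1.0)
    var v = Array.fill(n)(1.0)
    for (_ <- 0 until maxIter) {
      u = solveSide(m, ratings, v, lambda)     // update U with V fixed
      v = solveSide(n, transposed, u, lambda)  // update V with U fixed
    }
    (u, v)
  }

  def main(args: Array[String]): Unit = {
    // Fully observed rank-1 matrix [[1, 3], [2, 6]].
    val ratings = Seq((0, 0, 1.0), (0, 1, 3.0), (1, 0, 2.0), (1, 1, 6.0))
    val (u, v) = train(2, 2, ratings, lambda = 1e-9, maxIter = 10)
    ratings.foreach { case (i, j, r) => println(s"r=$r approx=${u(i) * v(j)}") }
  }
}
```

In the distributed version each partition runs the same per-row solve locally, and the fixed factor matrix is shipped to the workers by broadcast.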
MLSolve
✦ Distributed implementations of common optimization patterns
  ✦ E.g., Stochastic Gradient Descent: Applicable to summable ML losses
  ✦ E.g., LBFGS: An approximate 2nd-order optimization method
  ✦ E.g., ADMM: Decomposition / coordination procedure
Logistic Regression

MATLAB (only the final lines of the listing are visible on the slide):

    grad = X' * (sigmoid(X * w) - y);
    w = w - learning_rate * grad;
    end
    end

    % applies sigmoid function component-wise on the vector x
    function s = sigmoid(x)
      s = 1 ./ (1 + exp(-1 .* x));
    end

MLI:

    object LogisticRegression extends Algorithm {
      def sigmoid(z: Scalar) = 1.0 / (1.0 + exp(-1.0*z))

      // Per-example gradient of the logistic loss.
      def gradientFunction(w: MLSubMatrix, x: MLSubMatrix, y: Scalar): MLSubMatrix = {
        x.transpose * (sigmoid(x dot w) - y)
      }

      def train(data: MLNumericTable, p: LogRegParams): LogRegModel = {
        val d = data.numCols
        val params = SGDParams(initweights = MLSubMatrix.zeros(d, 1),
          maxIterations = p.maxIter, learningRate = p.learningRate,
          gradientFunction = gradientFunction)
        val weights = SGD(data, params)
        new LogRegModel(weights)
      }
    }

MLI optimizer (listing is cut off on the slide):

    object StochasticGradientDescent extends Optimizer {

      def localSGD(x: MLSubMatrix, weights: MLSubMatrix, n: Index, lambda: Scalar,
          gradientFunction: (MLSubMatrix, MLSubMatrix, Scalar) => MLSubMatrix): MLSubMatr…
        var localWeights = weights
        for (i <- 0 until x.numRows) {
        …
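Because the SGD listing is cut off, the following self-contained sketch shows the same idea end to end on plain arrays: a per-example logistic gradient plugged into a simple SGD loop. It is an assumption-laden stand-in, not MLI's `SGD` or `localSGD`; all names and the toy data are hypothetical.

```scala
object LogisticRegressionSgdSketch {
  def sigmoid(z: Double): Double = 1.0 / (1.0 + math.exp(-z))

  def dot(a: Array[Double], b: Array[Double]): Double =
    a.zip(b).map { case (x, y) => x * y }.sum

  // Per-example gradient of the logistic loss: x * (sigmoid(x . w) - y).
  def gradient(w: Array[Double], x: Array[Double], y: Double): Array[Double] = {
    val err = sigmoid(dot(x, w)) - y
    x.map(_ * err)
  }

  // Plain SGD: one weight update per example, repeated for maxIter passes.
  def train(data: Seq[(Double, Array[Double])], learningRate: Double, maxIter: Int): Array[Double] = {
    val d = data.head._2.length
    var w = Array.fill(d)(0.0)
    for (_ <- 0 until maxIter; (y, x) <- data) {
      val g = gradient(w, x, y)
      w = w.zip(g).map { case (wi, gi) => wi - learningRate * gi }
    }
    w
  }

  def predict(w: Array[Double], x: Array[Double]): Double =
    if (sigmoid(dot(x, w)) >= 0.5) 1.0 else 0.0

  def main(args: Array[String]): Unit = {
    // Tiny separable toy set; the first feature is a constant bias term.
    val data = Seq(
      (0.0, Array(1.0, -2.0)), (0.0, Array(1.0, -1.0)),
      (1.0, Array(1.0, 1.0)), (1.0, Array(1.0, 2.0)))
    val w = train(data, learningRate = 0.1, maxIter = 100)
    data.foreach { case (y, x) => println(s"label=$y predicted=${predict(w, x)}") }
  }
}
```

The separation mirrors the MLI listing: the algorithm supplies only `gradient`, and the optimizer loop is generic over any loss that sums over examples.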
MLI Functionality
Regression:               Linear Regression (+Lasso, Ridge)
Collaborative Filtering:  Alternating Least Squares, [DFC]
Clustering:               K-Means, [DP-Means]
Classification:           Logistic Regression, Linear SVM (+L1, L2), Multinomial Regression, [Naive Bayes, Decision Trees]
Optimization Primitives:  SGD, Parallel Gradient, Local SGD, [L-BFGS, ADMM, Adagrad]
Feature Extraction:       Principal Component Analysis (PCA), N-grams, feature cleaning / normalization
ML Tools:                 Cross Validation, Evaluation Metrics
Vision
MLlib
MLI
ML Optimizer
Release Plan
What you want to do: Build a Classifier for X
What you have to do:
✦ Learn the internals of ML classification algorithms, sampling, feature selection, X-validation, ….
✦ Potentially learn Spark/Hadoop/…
✦ Implement 3-4 algorithms
✦ Implement grid-search to find the right algorithm parameters
✦ Implement validation algorithms
✦ Experiment with different sampling-sizes, algorithms, features
✦ ….
and in the end: Ask For Help
Insight: A Declarative Approach
✦ End Users tell the system what they want, not how to get it
[Diagram: SQL → Result, MQL → Model]
Example: Supervised Classification

    var X = load("als_clinical", 2 to 10)
    var y = load("als_clinical", 1)
    var (fn-model, summary) = doClassify(X, y)

Algorithm Independent
ML Optimizer: A Search Problem
✦ System is responsible for searching through model space
✦ Opportunities for physical optimization
[Diagram labels: 5min, Boosting, SVM]
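A search through model space can be sketched as: enumerate candidate models, fit each on training data, score each on held-out data, and keep the best. The sketch below is a deliberately tiny, exhaustive version of that loop. The candidate families (`thresholdCandidate`, `majority`) and all names are hypothetical stand-ins chosen for the illustration; the slides do not specify the optimizer's actual search strategy.

```scala
object ModelSearchSketch {
  type Example = (Double, Double)   // (feature, label in {0.0, 1.0})
  type Model = Double => Double     // maps a feature to a predicted label

  // A candidate is a named recipe for turning training data into a model.
  final case class Candidate(name: String, fit: Seq[Example] => Model)

  // Hypothetical model family: predict 1.0 when the feature exceeds a threshold.
  def thresholdCandidate(t: Double): Candidate =
    Candidate(s"threshold=$t", _ => (x: Double) => if (x > t) 1.0 else 0.0)

  // Hypothetical baseline: always predict the most common training label.
  val majority: Candidate = Candidate("majority", train => {
    val ones = train.count(_._2 == 1.0)
    val label = if (2 * ones >= train.size) 1.0 else 0.0
    (_: Double) => label
  })

  def accuracy(model: Model, data: Seq[Example]): Double =
    data.count { case (x, y) => model(x) == y }.toDouble / data.size

  // Exhaustive search: fit every candidate, score it on held-out data, keep the best.
  def search(candidates: Seq[Candidate], train: Seq[Example], valid: Seq[Example]): (Candidate, Double) =
    candidates.map(c => (c, accuracy(c.fit(train), valid))).maxBy(_._2)

  def main(args: Array[String]): Unit = {
    val train = Seq((0.1, 0.0), (0.4, 0.0), (0.6, 1.0), (0.9, 1.0))
    val valid = Seq((0.2, 0.0), (0.3, 0.0), (0.45, 0.0), (0.7, 1.0), (0.8, 1.0))
    val candidates = Seq(0.0, 0.5, 1.0).map(thresholdCandidate) :+ majority
    val (best, acc) = search(candidates, train, valid)
    println(s"best=${best.name} accuracy=$acc")
  }
}
```

The end user only supplies data and a task; which candidates exist and how the space is explored is the system's concern, which is what leaves room for physical optimization.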
Systems	
  Op2miza2on	
  of	
  Model	
  
Search
35
Observation:
We tend to be I/O bound during model training.
A B C
1 a Dog
1 b Cat
2 c Cat
2 d Cat
3 e Dog
3 f Horse
4 g Monkey
Systems	
  Op2miza2on	
  of	
  Model	
  
Search
✦ Idea	
  from	
  databases	
  –	
  
shared	
  cursor!
35
Observation:
We tend to be I/O bound during model training.
A B C
1 a Dog
1 b Cat
2 c Cat
2 d Cat
3 e Dog
3 f Horse
4 g Monkey
Systems	
  Op2miza2on	
  of	
  Model	
  
Search
✦ Idea	
  from	
  databases	
  –	
  
shared	
  cursor!
35
Observation:
We tend to be I/O bound during model training.
A B C
1 a Dog
1 b Cat
2 c Cat
2 d Cat
3 e Dog
3 f Horse
4 g Monkey
QueryA
Systems	
  Op2miza2on	
  of	
  Model	
  
Search
✦ Idea	
  from	
  databases	
  –	
  
shared	
  cursor!
35
Observation:
We tend to be I/O bound during model training.
A B C
1 a Dog
1 b Cat
2 c Cat
2 d Cat
3 e Dog
3 f Horse
4 g Monkey
QueryA
QueryB
Systems	
  Op2miza2on	
  of	
  Model	
  
Search
✦ Idea	
  from	
  databases	
  –	
  
shared	
  cursor!
35
Observation:
We tend to be I/O bound during model training.
A B C
1 a Dog
1 b Cat
2 c Cat
2 d Cat
3 e Dog
3 f Horse
4 g Monkey
QueryA
QueryB
✦ Single	
  pass	
  over	
  the	
  
data,	
  many	
  models	
  
trained
Systems Optimization of Model Search
Observation: We tend to be I/O bound during model training.
✦ Idea from databases: shared cursor!
✦ Single pass over the data, many models trained
✦ Example: Logistic Regression via SGD

A  B  C
1  a  Dog
1  b  Cat
2  c  Cat
2  d  Cat
3  e  Dog
3  f  Horse
4  g  Monkey
[Figure labels: QueryA, QueryB]
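The shared-cursor idea can be sketched in plain Scala: each record is read once per pass and used to update several logistic-regression models, so the I/O cost is paid once rather than once per model. This is an illustrative sketch under assumptions, not the ML Optimizer's implementation; SharedScanSketch, Example, trainAll, and the choice of varying only the step size are all hypothetical.

```scala
// Hypothetical sketch: one scan of the data feeds several SGD learners at once.
object SharedScanSketch {
  final case class Example(label: Double, features: Array[Double]) // label is 0.0 or 1.0

  private def sigmoid(z: Double): Double = 1.0 / (1.0 + math.exp(-z))

  private def dot(w: Array[Double], x: Array[Double]): Double = {
    var s = 0.0
    var i = 0
    while (i < w.length) { s += w(i) * x(i); i += 1 }
    s
  }

  // Trains one logistic-regression model per step size.
  // Each record is read once per epoch and applied to every model.
  def trainAll(data: Iterable[Example], dim: Int,
               stepSizes: Seq[Double], epochs: Int): Seq[Array[Double]] = {
    val steps = stepSizes.toVector
    val weights = steps.map(_ => Array.fill(dim)(0.0))
    for (_ <- 1 to epochs; ex <- data) { // single cursor over the data
      var m = 0
      while (m < steps.length) { // update every model from this record
        val w = weights(m)
        // Gradient of the logistic loss for one example: (prediction - label) * x
        val err = sigmoid(dot(w, ex.features)) - ex.label
        var j = 0
        while (j < dim) { w(j) -= steps(m) * err * ex.features(j); j += 1 }
        m += 1
      }
    }
    weights
  }

  // Predicted probability that the label is 1.
  def predict(w: Array[Double], x: Array[Double]): Double = sigmoid(dot(w, x))
}
```

With k candidate models, the data is scanned `epochs` times instead of `k * epochs` times, which matters when training is I/O bound as the observation above states.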
Relationship with MLI
[Figure: MLbase architecture diagram. Labels: End User, User, Declarative ML Task, MQL, result (e.g., fn-model & summary), ML Developer, ML Contract + Code, Master Server, Parser, Optimizer, Executor/Monitoring, ML Library, Meta-Data, Statistics, DMX Runtime (x4), LLP, PLP, Master, Slaves, Spark, MLlib, MLI, ML Optimizer]
✦ MLI provides common interface for all algorithms
✦ Contracts: Meta-data for algorithms written against MLI
    ✦ Type (e.g., classification)
    ✦ Parameters
    ✦ Runtime (e.g., O(n))
    ✦ Input-Specification
    ✦ Output-Specification
    ✦ …
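To make the contract idea concrete, here is one possible way to represent such metadata in Scala. It is a sketch under assumptions: the slides name the kinds of fields (type, parameters, runtime, input and output specification) but not their representation, so ContractSketch, Contract, its field names, and candidatesFor are hypothetical.

```scala
// Hypothetical sketch of a contract as plain metadata; names are illustrative only.
object ContractSketch {
  sealed trait TaskType
  case object Classification extends TaskType
  case object Regression extends TaskType

  final case class Contract(
    algorithm: String,               // name of the algorithm written against MLI
    taskType: TaskType,              // Type (e.g., classification)
    parameters: Map[String, String], // Parameters: name -> default value
    runtime: String,                 // Runtime (e.g., "O(n)")
    inputSpec: String,               // Input-Specification
    outputSpec: String               // Output-Specification
  )

  // An optimizer could use the metadata to narrow its search to matching algorithms.
  def candidatesFor(task: TaskType, library: Seq[Contract]): Seq[String] =
    library.filter(_.taskType == task).map(_.algorithm)
}
```

For instance, given contracts for a logistic-regression classifier and a linear-regression model, `candidatesFor(Classification, library)` would return only the classifier, which is the kind of pruning the optimizer's search needs before it trains anything.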
Vision
MLlib
MLI
ML Optimizer
Release Plan
Contributors
✦ John Duchi
✦ Michael Franklin
✦ Joseph Gonzalez
✦ Rean Griffith
✦ Michael Jordan
✦ Tim Kraska
✦ Xinghao Pan
✦ Virginia Smith
✦ Shivaram Venkataraman
✦ Matei Zaharia
First Release (Summer)
✦ MLlib: low-level ML library and underlying kernels
    ✦ Callable from Scala, Java
    ✦ Included as part of Spark
✦ MLI: API for feature extraction and ML algorithms
    ✦ Platform for ML development
    ✦ Includes a more extensive library, with a faster dev cycle than MLlib
[Figure: MLbase architecture diagram (Spark, MLlib, MLI, ML Optimizer)]
Second Release (Winter)
✦ ML Optimizer: automated model selection
    ✦ Search problem over feature extractors and algorithms in MLI
    ✦ Contracts
    ✦ Restricted query language (MQL)
✦ Feature extraction for image data
[Figure: MLbase architecture diagram (Spark, MLlib, MLI, ML Optimizer)]
Future Directions
✦ Identify minimal set of ML operators
✦ Expose internals of ML algorithms to optimizer
✦ Unified language for End Users and ML Developers
✦ Plug-ins to Python, R
✦ Visualization for unsupervised learning and exploration
✦ Advanced ML capabilities
    ✦ Time-series algorithms
    ✦ Graphical models
    ✦ Advanced Optimization (e.g., asynchronous computation)
    ✦ Online updates
    ✦ Sampling for efficiency
Contributions encouraged!
Berkeley, CA
August 29-30
www.mlbase.org

MLlib sparkmeetup_8_6_13_final_reduced

  • 1. Evan  Sparks  and  Ameet  Talwalkar UC  Berkeley UC Berkeley baseML baseML M ML M
  • 5. Distributed   Compu2ng Big  Data Three  Converging  Trends Machine   Learning
  • 6. Distributed   Compu2ng Big  Data Three  Converging  Trends Machine   Learning MLbase
  • 8. Problem:  Scalable  implementaAons   difficult  for  ML  Developers… Me S ML Contract + Code ML
  • 9. Problem:  Scalable  implementaAons   difficult  for  ML  Developers… Me S ML Contract + Code ML
  • 10. Problem:  Scalable  implementaAons   difficult  for  ML  Developers… Me S ML Contract + Code ML
  • 11. Too  many   algorithms… Problem:  ML  is  difficult for  End  Users…
  • 12. Too  many   algorithms… Too  many   knobs… Problem:  ML  is  difficult for  End  Users…
  • 13. Too  many   algorithms… Too  many   knobs… Problem:  ML  is  difficult for  End  Users… Difficult  to   debug…
  • 14. Too  many   algorithms… Too  many   knobs… Problem:  ML  is  difficult for  End  Users… Difficult  to   debug… Doesn’t  scale…
  • 15. Too  many   algorithms… Too  many   knobs… Problem:  ML  is  difficult for  End  Users… Difficult  to   debug… Reliable Fast Accurate Provable Doesn’t  scale…
  • 16. ML  Experts Systems  ExpertsMLbase
  • 17. 1. Easy  scalable  ML  development  (ML  Developers) 2. User-­‐friendly  ML  at  scale  (End  Users) ML  Experts Systems  ExpertsMLbase
  • 18. 1. Easy  scalable  ML  development  (ML  Developers) 2. User-­‐friendly  ML  at  scale  (End  Users) Along  the  way,  we  gain  insight  into  data  intensive   compu2ng ML  Experts Systems  ExpertsMLbase
  • 21. Lapack Matlab  Stack Single Machine ✦ Lapack:  low-­‐level  Fortran  linear  algebra  library
  • 22. Lapack Matlab Interface Matlab  Stack Single Machine ✦ Lapack:  low-­‐level  Fortran  linear  algebra  library ✦ Matlab  Interface ✦ Higher-­‐level  abstrac2ons  for  data  access  /  processing ✦ More  extensive  func2onality  than  Lapack ✦ Leverages  Lapack  whenever  possible
  • 23. Lapack Matlab Interface Matlab  Stack Single Machine ✦ Lapack:  low-­‐level  Fortran  linear  algebra  library ✦ Matlab  Interface ✦ Higher-­‐level  abstrac2ons  for  data  access  /  processing ✦ More  extensive  func2onality  than  Lapack ✦ Leverages  Lapack  whenever  possible ✦ Similar  stories  for  R  and  Python
  • 26. MLbase  Stack Runtime(s)Spark Spark:  cluster  compu=ng  system  designed  for  itera=ve  computa=on Lapack Matlab Interface Single Machine
  • 27. MLbase  Stack Runtime(s) MLlib Spark Spark:  cluster  compu=ng  system  designed  for  itera=ve  computa=on MLlib:  low-­‐level  ML  library  in  Spark ✦ Callable  from  Scala,  Java Lapack Matlab Interface Single Machine
  • 28. MLbase  Stack Runtime(s) MLlib MLI Spark Spark:  cluster  compu=ng  system  designed  for  itera=ve  computa=on MLlib:  low-­‐level  ML  library  in  Spark ✦ Callable  from  Scala,  Java MLI:  API  /  plaHorm  for  feature  extrac=on  and  algorithm  development ✦ Includes  higher-­‐level  func2onality  with  faster  dev  cycle  than  MLlib Lapack Matlab Interface Single Machine
  • 29. MLbase  Stack Runtime(s) MLlib MLI ML Optimizer Spark Spark:  cluster  compu=ng  system  designed  for  itera=ve  computa=on MLlib:  low-­‐level  ML  library  in  Spark ✦ Callable  from  Scala,  Java MLI:  API  /  plaHorm  for  feature  extrac=on  and  algorithm  development ✦ Includes  higher-­‐level  func2onality  with  faster  dev  cycle  than  MLlib ML  OpAmizer:  automates  model  selec=on ✦ Solves  a  search  problem  over  feature  extractors  and  algorithms  in  MLI Lapack Matlab Interface Single Machine
  • 30. MLlib MLI ML Optimizer End  User MLbase  Stack  Status Spark ML Developer Meta-Data Statistics User Declarative ML Task ML Contract + Code Master Server …. result (e.g., fn-model & summary) Optimizer Parser Executor/Monitoring ML Library DMX Runtime DMX Runtime DMX Runtime DMX Runtime LLP PLP MasterSlaves
  • 31. MLlib MLI ML Optimizer End  User MLbase  Stack  Status Spark ML Developer Meta-Data Statistics User Declarative ML Task ML Contract + Code Master Server …. result (e.g., fn-model & summary) Optimizer Parser Executor/Monitoring ML Library DMX Runtime DMX Runtime DMX Runtime DMX Runtime LLP PLP MasterSlaves
  • 32. MLlib MLI ML Optimizer End  User MLbase  Stack  Status Goal 1: Summer Release Spark ML Developer Meta-Data Statistics User Declarative ML Task ML Contract + Code Master Server …. result (e.g., fn-model & summary) Optimizer Parser Executor/Monitoring ML Library DMX Runtime DMX Runtime DMX Runtime DMX Runtime LLP PLP MasterSlaves
  • 33. MLlib MLI ML Optimizer End  User MLbase  Stack  Status Goal 1: Summer Release Goal 2: Winter Release Spark ML Developer Meta-Data Statistics User Declarative ML Task ML Contract + Code Master Server …. result (e.g., fn-model & summary) Optimizer Parser Executor/Monitoring ML Library DMX Runtime DMX Runtime DMX Runtime DMX Runtime LLP PLP MasterSlaves
  • 35. Example:  MLlib ✦ Goal:  Classifica2on  of  text  file
  • 36. Example:  MLlib ✦ Goal:  Classifica2on  of  text  file ✦ Featurize  data  manually 8 val classes = rawTextTable(??, "class") 9 val ngrams = tfIdf(nGrams(rawTextTable(??, "text"), n=2, top=30000)) 10 val featureizedTable = classes.zip(ngrams) 11 12 //Classify the data using Logistic Regression. 13 val lrModel = LogisticRegression(featurizedTable, stepSize=0.1, numIter=1 14 } 1 def main(args: Array[String]) { 2 val sc = new SparkContext("local", "SparkLR") 3 4 //Load data from HDFS 5 val data = sc.textFile(args(0)) //RDD[String] 6 7 //User is responsible for formatting/featurizing/normalizing their RDD! 8 val featurizedData: RDD[(Double,Array[Double])] = processData(data) 9 10 //Train the model using MLlib. 11 val model = new LogisticRegressionLocalRandomSGD() 12 .setStepSize(0.1) 13 .setNumIterations(50) 14 .train(featurizedData) 15 } Fig. 15: Matrix Factorization via ALS code in MATLAB (top) and ML
  • 37. Example:  MLlib ✦ Goal:  Classifica2on  of  text  file ✦ Featurize  data  manually ✦ Calls  MLlib’s  LR  func2on 8 val classes = rawTextTable(??, "class") 9 val ngrams = tfIdf(nGrams(rawTextTable(??, "text"), n=2, top=30000)) 10 val featureizedTable = classes.zip(ngrams) 11 12 //Classify the data using Logistic Regression. 13 val lrModel = LogisticRegression(featurizedTable, stepSize=0.1, numIter=1 14 } 1 def main(args: Array[String]) { 2 val sc = new SparkContext("local", "SparkLR") 3 4 //Load data from HDFS 5 val data = sc.textFile(args(0)) //RDD[String] 6 7 //User is responsible for formatting/featurizing/normalizing their RDD! 8 val featurizedData: RDD[(Double,Array[Double])] = processData(data) 9 10 //Train the model using MLlib. 11 val model = new LogisticRegressionLocalRandomSGD() 12 .setStepSize(0.1) 13 .setNumIterations(50) 14 .train(featurizedData) 15 } Fig. 15: Matrix Factorization via ALS code in MATLAB (top) and ML
  • 39. Example:  MLI ✦ Use  built-­‐in  feature  extrac2on  func2onality 1 def main(args: Array[String]) { 2 val mc = new MLContext("local", "MLILR") 3 4 //Read in file from HDFS 5 val rawTextTable = mc.csvFile(args(0), Seq("class","text")) 6 7 //Run feature extraction 8 val classes = rawTextTable(??, "class") 9 val ngrams = tfIdf(nGrams(rawTextTable(??, "text"), n=2, top=30000)) 10 val featureizedTable = classes.zip(ngrams) 11 12 //Classify the data using Logistic Regression. 13 val lrModel = LogisticRegression(featurizedTable, stepSize=0.1, numIter=12) 14 } 1 def main(args: Array[String]) { 2 val sc = new SparkContext("local", "SparkLR") 3 4 //Load data from HDFS 5 val data = sc.textFile(args(0)) //RDD[String] 6 7 //User is responsible for formatting/featurizing/normalizing their RDD! 8 val featurizedData: RDD[(Double,Array[Double])] = processData(data) 9 10 //Train the model using MLlib. 11 val model = new LogisticRegressionLocalRandomSGD() 12 .setStepSize(0.1) 13 .setNumIterations(50)
  • 40. Example:  MLI ✦ Use  built-­‐in  feature  extrac2on  func2onality ✦ MLI  Logis2c  Regression  leverages  MLlib 1 def main(args: Array[String]) { 2 val mc = new MLContext("local", "MLILR") 3 4 //Read in file from HDFS 5 val rawTextTable = mc.csvFile(args(0), Seq("class","text")) 6 7 //Run feature extraction 8 val classes = rawTextTable(??, "class") 9 val ngrams = tfIdf(nGrams(rawTextTable(??, "text"), n=2, top=30000)) 10 val featureizedTable = classes.zip(ngrams) 11 12 //Classify the data using Logistic Regression. 13 val lrModel = LogisticRegression(featurizedTable, stepSize=0.1, numIter=12) 14 } 1 def main(args: Array[String]) { 2 val sc = new SparkContext("local", "SparkLR") 3 4 //Load data from HDFS 5 val data = sc.textFile(args(0)) //RDD[String] 6 7 //User is responsible for formatting/featurizing/normalizing their RDD! 8 val featurizedData: RDD[(Double,Array[Double])] = processData(data) 9 10 //Train the model using MLlib. 11 val model = new LogisticRegressionLocalRandomSGD() 12 .setStepSize(0.1) 13 .setNumIterations(50)
  • 41. Example:  MLI ✦ Use  built-­‐in  feature  extrac2on  func2onality ✦ MLI  Logis2c  Regression  leverages  MLlib ✦ Extensions: ✦ Embed  in  cross-­‐valida2on  rou2ne ✦ Use  different  feature  extractors  /  algorithms  or   write  new  ones 1 def main(args: Array[String]) { 2 val mc = new MLContext("local", "MLILR") 3 4 //Read in file from HDFS 5 val rawTextTable = mc.csvFile(args(0), Seq("class","text")) 6 7 //Run feature extraction 8 val classes = rawTextTable(??, "class") 9 val ngrams = tfIdf(nGrams(rawTextTable(??, "text"), n=2, top=30000)) 10 val featureizedTable = classes.zip(ngrams) 11 12 //Classify the data using Logistic Regression. 13 val lrModel = LogisticRegression(featurizedTable, stepSize=0.1, numIter=12) 14 } 1 def main(args: Array[String]) { 2 val sc = new SparkContext("local", "SparkLR") 3 4 //Load data from HDFS 5 val data = sc.textFile(args(0)) //RDD[String] 6 7 //User is responsible for formatting/featurizing/normalizing their RDD! 8 val featurizedData: RDD[(Double,Array[Double])] = processData(data) 9 10 //Train the model using MLlib. 11 val model = new LogisticRegressionLocalRandomSGD() 12 .setStepSize(0.1) 13 .setNumIterations(50)
  • 42. Example:  ML  Op2mizer var  X  =  load(”text_file”,  2  to  10) var  y  =  load(”text_file”,  1) var  (fn-­‐model,  summary)  =  doClassify(X,  y) ✦ User  declara2vely  specifies  task ✦ ML  Op2mizer  searches  through  MLI
  • 44. Ease  of  use Performance,   Scalability Lay  of  the  Land
  • 45. Matlab,  R x Ease  of  use Performance,   Scalability Lay  of  the  Land
  • 46. Matlab,  R x Ease  of  use Performance,   Scalability Mahout x Lay  of  the  Land
  • 47. Matlab,  R x Ease  of  use Performance,   Scalability GraphLab,  VW x Mahout x Lay  of  the  Land
  • 48. Matlab,  R x Ease  of  use Performance,   Scalability GraphLab,  VW x Mahout x Lay  of  the  Land MLlib x
  • 49. Logis2c  Regression,  Linear  SVM  (+L1,  L2) Linear  Regression  (+Lasso,  Ridge) Alterna2ng  Least  Squares K-­‐Means SGD,  Parallel  Gradient MLlib ClassificaAon: Regression: CollaboraAve  Filtering: Clustering: OpAmizaAon  PrimiAves:
  • 50. Logis2c  Regression,  Linear  SVM  (+L1,  L2) Linear  Regression  (+Lasso,  Ridge) Alterna2ng  Least  Squares K-­‐Means SGD,  Parallel  Gradient MLlib ClassificaAon: Regression: CollaboraAve  Filtering: Clustering: OpAmizaAon  PrimiAves: Included  within  Spark  codebase ✦ Unlike  Mahout/Hadoop ✦ Part  of  Spark  0.8  release ✦ Con2nued  support  via  Spark  project
  • 52. ✦ WallAme:  elapsed  2me  to  execute  task MLlib  Performance
  • 53. ✦ WallAme:  elapsed  2me  to  execute  task ✦ Weak  scaling ✦ fix  problem  size  per  processor ✦ ideally:  constant  wall2me  as  we  grow  cluster MLlib  Performance
  • 54. ✦ WallAme:  elapsed  2me  to  execute  task ✦ Weak  scaling ✦ fix  problem  size  per  processor ✦ ideally:  constant  wall2me  as  we  grow  cluster ✦ Strong  scaling ✦ fix  total  problem  size ✦ ideally:  linear  speed  up  as  we  grow  cluster MLlib  Performance
  • 55. ✦ WallAme:  elapsed  2me  to  execute  task ✦ Weak  scaling ✦ fix  problem  size  per  processor ✦ ideally:  constant  wall2me  as  we  grow  cluster ✦ Strong  scaling ✦ fix  total  problem  size ✦ ideally:  linear  speed  up  as  we  grow  cluster ✦ EC2  Experiments ✦ m2.4xlarge  instances,  up  to  32  machine  clusters MLlib  Performance
  • 56. Logis2c  Regression  -­‐  Weak  Scaling
  • 57. Logis2c  Regression  -­‐  Weak  Scaling ✦ Full  dataset:  200K  images,  160K  dense  features
  • 58. Logis2c  Regression  -­‐  Weak  Scaling ✦ Full  dataset:  200K  images,  160K  dense  features ✦ Similar  weak  scaling 0 5 10 15 20 25 30 0 2 4 6 8 10 relativewalltime # machines MLbase VW Ideal Fig. 6: Weak scaling for logistic regression 15 20 25 30 35 speedup MLbase VW Ideal MLlib
  • 59. Logis2c  Regression  -­‐  Weak  Scaling ✦ Full  dataset:  200K  images,  160K  dense  features ✦ Similar  weak  scaling ✦ MLlib  within  a  factor  of  2  of  VW’s  wall=me MLbase VW Matlab 0 1000 2000 3000 4000 walltime(s) n=6K, d=160K n=12.5K, d=160K n=25K, d=160K n=50K, d=160K n=100K, d=160K n=200K, d=160K MLlib0 5 10 15 20 25 30 0 2 4 6 8 10 relativewalltime # machines MLbase VW Ideal Fig. 6: Weak scaling for logistic regression 15 20 25 30 35 speedup MLbase VW Ideal MLlib
  • 60. Logis2c  Regression  -­‐  Strong  Scaling
  • 61. Logis2c  Regression  -­‐  Strong  Scaling ✦ Fixed  Dataset:  50K  images,  160K  dense  features
  • 62. Logis2c  Regression  -­‐  Strong  Scaling ✦ Fixed  Dataset:  50K  images,  160K  dense  features ✦ MLlib  exhibits  beTer  scaling  proper=es 0 5 10 15 20 25 30 0 # machines ig. 6: Weak scaling for logistic regression 0 5 10 15 20 25 30 0 5 10 15 20 25 30 35 # machines speedup MLbase VW Ideal 8: Strong scaling for logistic regression System Lines of Code MLbase 32 GraphLab 383 MLlib
  • 63. Logis2c  Regression  -­‐  Strong  Scaling ✦ Fixed  Dataset:  50K  images,  160K  dense  features ✦ MLlib  exhibits  beTer  scaling  proper=es ✦ MLlib  faster  than  VW  with  16  and  32  machines MLbase VW Matlab 0 1000 wa Fig. 5: Walltime for weak scaling for logistic regressi MLbase VW Matlab 0 200 400 600 800 1000 1200 1400 walltime(s) 1 Machine 2 Machines 4 Machines 8 Machines 16 Machines 32 Machines Fig. 7: Walltime for strong scaling for logistic regress with respect to computation. In practice, we see comp scaling results as more machines are added. In MATLAB, we implement gradient descent inste SGD, as gradient descent requires roughly the same nu of numeric operations as SGD but does not require an loop to pass over the data. It can thus be implemented MLlib 0 5 10 15 20 25 30 0 # machines ig. 6: Weak scaling for logistic regression 0 5 10 15 20 25 30 0 5 10 15 20 25 30 35 # machines speedup MLbase VW Ideal 8: Strong scaling for logistic regression System Lines of Code MLbase 32 GraphLab 383 MLlib
  • 65. ALS  -­‐  Wall2me ✦ Dataset:  Scaled  version  of  NeHlix  data  (9X  in  size) ✦ Cluster:  9  machines
  • 66. ALS  -­‐  Wall2me ✦ Dataset:  Scaled  version  of  NeHlix  data  (9X  in  size) ✦ Cluster:  9  machines System WallAme  (seconds) Matlab 15443 Mahout 4206 GraphLab 291 MLlib 481
  • 67. ALS  -­‐  Wall2me ✦ Dataset:  Scaled  version  of  NeHlix  data  (9X  in  size) ✦ Cluster:  9  machines System WallAme  (seconds) Matlab 15443 Mahout 4206 GraphLab 291 MLlib 481
  • 68. ALS  -­‐  Wall2me ✦ Dataset:  Scaled  version  of  NeHlix  data  (9X  in  size) ✦ Cluster:  9  machines ✦ MLlib  an  order  of  magnitude  faster  than  Mahout ✦ MLlib  within  factor  of  2  of  GraphLab System WallAme  (seconds) Matlab 15443 Mahout 4206 GraphLab 291 MLlib 481
  • 70. Deployment  Considera2ons Vowpal  Wabbit,  GraphLab ✦ Data  prepara=on  specific  to  each  program ✦ Non-­‐trivial  setup  on  cluster ✦ No  fault  tolerance
  • 71. Deployment  Considera2ons Vowpal  Wabbit,  GraphLab ✦ Data  prepara=on  specific  to  each  program ✦ Non-­‐trivial  setup  on  cluster ✦ No  fault  tolerance MLlib ✦ Reads  files  from  HDFS ✦ Launch/compile/run  on  cluster  with  a  few  commands ✦ RDD’s  are  fault  tolerance
  • 73. Matlab,  R x Ease  of  use Performance,   Scalability GraphLab,  VW x Mahout x Lay  of  the  Land MLlib x
  • 74. Matlab,  R x Ease  of  use Performance,   Scalability GraphLab,  VW x MLI x Mahout x Lay  of  the  Land MLlib x
  • 76. Current  Op2ons  +      Easy  (Resembles  math,  limited  /  no  set  up  cost)  +      Sufficient  for  prototyping  /  wri2ng  papers —    Ad-­‐hoc,  non-­‐scalable  scripts —    Loss  of  transla2on  upon  re-­‐implementa2on
  • 77. Current  Op2ons  +      Easy  (Resembles  math,  limited  /  no  set  up  cost)  +      Sufficient  for  prototyping  /  wri2ng  papers —    Ad-­‐hoc,  non-­‐scalable  scripts —    Loss  of  transla2on  upon  re-­‐implementa2on
  • 78. Current  Op2ons  +      Easy  (Resembles  math,  limited  /  no  set  up  cost)  +      Sufficient  for  prototyping  /  wri2ng  papers —    Ad-­‐hoc,  non-­‐scalable  scripts —    Loss  of  transla2on  upon  re-­‐implementa2on  +      Scalable  and  (some2mes)  fast  +      Exis2ng  open-­‐source  library  of  ML  algorithms —    Difficult  to  set  up,  extend
  • 80. Examples ML Developer Code ‘Distributed’  Divide-­‐Factor-­‐Combine  (DFC) ✦ Ini2al  studies  in  MATLAB  (Not  distributed) ✦ Distributed  prototype  involving  compiled  MATLAB
  • 81. Examples ML Developer Code ‘Distributed’  Divide-­‐Factor-­‐Combine  (DFC) ✦ Ini2al  studies  in  MATLAB  (Not  distributed) ✦ Distributed  prototype  involving  compiled  MATLAB Mahout  ALS  with  Early  Stopping ✦ Theory:  simple  if-­‐statement  (3  lines  of  code)
  • 82. Examples ML Developer Code ‘Distributed’  Divide-­‐Factor-­‐Combine  (DFC) ✦ Ini2al  studies  in  MATLAB  (Not  distributed) ✦ Distributed  prototype  involving  compiled  MATLAB Mahout  ALS  with  Early  Stopping ✦ Theory:  simple  if-­‐statement  (3  lines  of  code) ✦ Prac2ce:  sih  through  7  files,  nearly  1K  lines  of  code
  • 87. Insight: Programming Abstractions
    ✦ Shield ML Developers from low-level details: provide familiar mathematical operators in a distributed setting
    ✦ ML Developer API (MLI)
      ✦ Table Computation: MLTable
      ✦ Linear Algebra: MLSubMatrix
      ✦ Optimization Primitives: MLSolve
    ✦ MLI Examples:
      ✦ DFC: ~50 lines of code
      ✦ ALS: early stopping in 3 lines; < 40 lines total
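    The "early stopping in 3 lines" point can be illustrated with a generic iterative loop. This is a minimal sketch in plain Scala, not MLI code: `iterate`, `step`, `loss`, and `tol` are hypothetical names chosen for the example, and the stopping rule (loss change below a tolerance) is one common choice, assumed here.

```scala
object EarlyStopping {
  /** Runs `step` up to maxIter times, stopping once the loss changes by less than `tol`.
    * Returns the final state and the number of iterations actually run. */
  def iterate[S](init: S, maxIter: Int, tol: Double)(step: S => S, loss: S => Double): (S, Int) = {
    var state = init
    var prevLoss = loss(state)
    var iters = 0
    while (iters < maxIter) {
      state = step(state)
      iters += 1
      val curLoss = loss(state)
      // The early-stopping check: the "simple if-statement" in question.
      if (math.abs(prevLoss - curLoss) < tol) return (state, iters)
      prevLoss = curLoss
    }
    (state, iters)
  }
}
```

    For example, `EarlyStopping.iterate(1.0, 100, 1e-3)(_ / 2, x => x * x)` halves the state each step and stops well before the 100-iteration cap, once the squared value stops changing noticeably.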
  • 89. Lines of Code

    Logistic Regression
    System          Lines of Code
    Matlab          11
    Vowpal Wabbit   721
    MLI             55

    Alternating Least Squares
    System          Lines of Code
    Matlab          20
    Mahout          865
    GraphLab        383
    MLI             32
  • 99. MLI Details
    OLD
        val x: RDD[Array[Double]]
        val x: RDD[spark.util.Vector]
        val x: RDD[breeze.linalg.Vector]
        val x: RDD[BIDMat.SMat]
    NEW
        val x: MLTable
    ✦ Generic interface for feature extraction
    ✦ Common interface to support an optimizer
    ✦ Abstract interface for arbitrary backends
  • 100. MLTable
    ✦ Flexibility when loading data
    ✦ e.g., CSV, JSON, XML
    ✦ Heterogeneous data across columns
    ✦ Missing data
    ✦ Feature extraction
    ✦ Common interface
    ✦ Supports MapReduce and relational operators
    ✦ Inspired by DataFrames (R) and Pandas (Python)
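    To make the combination of heterogeneous columns, missing data, and MapReduce plus relational operators concrete, here is a toy in-memory table in plain Scala. It is only a sketch of the idea: `ToyTable`, `Cell`, `filter`, `project`, and `mean` are hypothetical names for this example and are not the MLTable API.

```scala
object ToyTable {
  // A cell is numeric, textual, or missing, so columns may mix types.
  sealed trait Cell
  final case class Num(value: Double) extends Cell
  final case class Str(value: String) extends Cell
  case object Missing extends Cell

  type Row = Vector[Cell]

  final case class Table(columns: Vector[String], rows: Vector[Row]) {
    // Assumes the column name exists; a real interface would validate it.
    private def idx(name: String): Int = columns.indexOf(name)

    // Relational-style operators: row selection and column projection.
    def filter(p: Row => Boolean): Table = copy(rows = rows.filter(p))
    def project(names: String*): Table = {
      val keep = names.map(n => idx(n)).toVector
      Table(names.toVector, rows.map(r => keep.map(i => r(i))))
    }

    // MapReduce-style operator: mean of a numeric column, skipping missing cells.
    def mean(name: String): Option[Double] = {
      val i = idx(name)
      val (sum, count) = rows
        .map(r => r(i) match { case Num(v) => (v, 1); case _ => (0.0, 0) })
        .foldLeft((0.0, 0)) { case ((s, c), (v, n)) => (s + v, c + n) }
      if (count == 0) None else Some(sum / count)
    }
  }
}
```

    The `mean` method maps each row to a (value, count) pair and then reduces the pairs, which is the same map-then-combine shape a distributed table would use across partitions.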
  • 101. Feature Extraction

    [Fig. 2: MLTable API Illustration. This table captures core operations of th… Recoverable rows: matrixBatchMap (MLSubMatrix => MLSubMatrix, returns MLNumericTable), numRows (returns Long), numCols (returns Long).]

        def main(args: Array[String]) {
          val mc = new MLContext("local")

          // Read in table from file on HDFS.
          val rawTextTable = mc.textFile(args(0))

          // Run feature extraction on the raw text - get the top 30000 bigrams.
          val featurizedTable = tfIdf(nGrams(rawTextTable, n=2, top=30000))

          // Cluster the data using K-Means.
          val kMeansModel = KMeans(featurizedTable, k=50)
        }

    Fig. 3: Loading, featurizing, and learning clusters on a corpu…

    Family   Example Uses                          Returns
    Shape    dims(mat), mat.numRows, mat.numCols   Int or (Int,Int)
  • 102. MLSubMatrix
    ✦ Linear algebra on local partitions
    ✦ E.g., matrix-vector operations for mini-batch logistic regression
    ✦ E.g., solving linear system of equations for Alternating Least Squares
    ✦ Sparse and dense matrix support
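    The ALS use case above amounts to solving a small regularized linear system on each partition, x = (YᵀY + λI)⁻¹ Yᵀ r. The sketch below does this on plain arrays with Gaussian elimination. It is an illustration under stated assumptions, not MLSubMatrix code: `LocalSolve`, `solve`, and `ridge` are hypothetical names, and a production library would call an optimized solver instead.

```scala
object LocalSolve {
  type Matrix = Array[Array[Double]]

  /** Solves A x = b by Gaussian elimination with partial pivoting (A is n x n). */
  def solve(a: Matrix, b: Array[Double]): Array[Double] = {
    val n = b.length
    // Augmented copy [A | b], so the inputs are left untouched.
    val m = Array.tabulate(n)(i => a(i) :+ b(i))
    for (col <- 0 until n) {
      // Swap in the row with the largest pivot for numerical stability.
      val pivot = (col until n).maxBy(r => math.abs(m(r)(col)))
      val tmp = m(col); m(col) = m(pivot); m(pivot) = tmp
      for (r <- col + 1 until n) {
        val f = m(r)(col) / m(col)(col)
        for (c <- col to n) m(r)(c) -= f * m(col)(c)
      }
    }
    // Back substitution on the upper-triangular system.
    val x = new Array[Double](n)
    for (i <- n - 1 to 0 by -1) {
      val s = (i + 1 until n).map(j => m(i)(j) * x(j)).sum
      x(i) = (m(i)(n) - s) / m(i)(i)
    }
    x
  }

  /** Regularized normal equations: solves (Y^T Y + lambda I) x = Y^T r. */
  def ridge(y: Matrix, r: Array[Double], lambda: Double): Array[Double] = {
    val k = y.head.length
    val gram = Array.tabulate(k, k) { (i, j) =>
      y.map(row => row(i) * row(j)).sum + (if (i == j) lambda else 0.0)
    }
    val rhs = Array.tabulate(k)(i => y.indices.map(n => y(n)(i) * r(n)).sum)
    solve(gram, rhs)
  }
}
```

    In ALS, `y` would hold the factor rows for the items a user rated and `r` the corresponding ratings, so each call produces one updated factor row.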
  • 103. Alternating Least Squares

    MATLAB (excerpt):

        …
            parfor q=1:n
              Uq = U(Uinds{q},:);
              V(q,:) = (Uq'*Uq + lambI) \ (Uq' * M(Uinds{q},q));
            end
          end
        end

    MLI (Scala):

        object BroadcastALS extends Algorithm {
          def train(trainData: MLNumericTable, trainDataTrans: MLNumericTable,
              m: Int, n: Int, k: Int, lambda: Int, maxIter: Int): ALSModel = {
            val lambI = MLSubMatrix.eye(k).mul(lambda)
            var U = MLSubMatrix.rand(m, k)
            var V = MLSubMatrix.rand(n, k)
            var U_b = trainData.context.broadcast(U)
            var V_b = trainData.context.broadcast(V)
            for (iter <- 0 until maxIter) {
              // Update U against the broadcast V, then V against the new U.
              U = trainData.matrixBatchMap(localALS(_, V_b.value, lambI, k))
              U_b = trainData.context.broadcast(U)
              V = trainDataTrans.matrixBatchMap(localALS(_, U_b.value, lambI, k))
              V_b = trainData.context.broadcast(V)
            }
            new ALSModel(U, V)
          }

          // Solves one regularized least-squares problem per row of the partition.
          def localALS(trainDataPart: MLSubMatrix, Y: MLSubMatrix,
              lambI: MLSubMatrix, k: Int): MLSubMatrix = {
            var localX = MLSubMatrix.zeros(trainDataPart.numRows, k)
            for (i <- 0 until trainDataPart.numRows) {
              val q = trainDataPart.rowID(i)
              val nz_inds = trainDataPart.nzCols(q)
              val Yq = Y(nz_inds, ::)
              localX(i, ::) = ((Yq.transpose times Yq) + lambI)
                .solve(Yq.transpose times trainDataPart(q, nz_inds).transpose)
            }
            localX
          }
        }
  • 104. MLSolve
    ✦ Distributed implementations of common optimization patterns
    ✦ E.g., Stochastic Gradient Descent: applicable to summable ML losses
    ✦ E.g., L-BFGS: an approximate 2nd-order optimization method
    ✦ E.g., ADMM: decomposition / coordination procedure
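    "Summable" means the loss is a sum of per-example terms, so the gradient can be followed one example at a time. The sketch below shows that pattern for logistic loss in plain, single-machine Scala. It is an assumption-laden illustration rather than the MLSolve implementation: `SgdSketch`, `gradient`, and `sgd` are hypothetical names, and labels are assumed to be 0 or 1.

```scala
object SgdSketch {
  def sigmoid(z: Double): Double = 1.0 / (1.0 + math.exp(-z))

  def dot(a: Array[Double], b: Array[Double]): Double =
    a.indices.map(i => a(i) * b(i)).sum

  /** Gradient of the logistic loss on one example (x, y), with y in {0, 1}. */
  def gradient(w: Array[Double], x: Array[Double], y: Double): Array[Double] = {
    val err = sigmoid(dot(x, w)) - y
    x.map(_ * err)
  }

  /** Plain SGD: one weight update per example, `epochs` passes over the data. */
  def sgd(data: Seq[(Array[Double], Double)], dim: Int,
      learningRate: Double, epochs: Int): Array[Double] = {
    val w = new Array[Double](dim) // weights start at zero
    for (_ <- 0 until epochs; (x, y) <- data) {
      val g = gradient(w, x, y)
      for (i <- w.indices) w(i) -= learningRate * g(i)
    }
    w
  }
}
```

    Because only `gradient` is specific to logistic regression, swapping in a different per-example gradient function reuses the same loop, which is the design point behind treating SGD as an optimization primitive.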
  • 105. Logistic Regression

    MATLAB (excerpt):

        …
            grad = X' * (sigmoid(X * w) - y);
            w = w - learning_rate * grad;
          end
        end

        % applies sigmoid function component-wise on the vector x
        function s = sigmoid(x)
          s = 1 ./ (1 + exp(-1 .* x));
        end

    MLI (Scala):

        object LogisticRegression extends Algorithm {
          def sigmoid(z: Scalar) = 1.0 / (1.0 + exp(-1.0*z))

          def gradientFunction(w: MLSubMatrix, x: MLSubMatrix, y: Scalar): MLSubMatrix = {
            x.transpose * (sigmoid(x dot w) - y)
          }

          def train(data: MLNumericTable, p: LogRegParams): LogRegModel = {
            val d = data.numCols
            val params = SGDParams(initweights = MLSubMatrix.zeros(d, 1),
              maxIterations = p.maxIter, learningRate = p.learningRate,
              gradientFunction = gradientFunction)
            val weights = SGD(data, params)
            new LogRegModel(weights)
          }
        }

        object StochasticGradientDescent extends Optimizer {

          def localSGD(x: MLSubMatrix, weights: MLSubMatrix, n: Index, lambda: Scalar,
              gradientFunction: (MLSubMatrix, MLSubMatrix, Scalar) => MLSubMatrix): MLSubMatr…
            var localWeights = weights
            for (i <- 0 to x.numRows) {
            …
  • 106. MLI Functionality
    Regression: Linear Regression (+Lasso, Ridge)
    Collaborative Filtering: Alternating Least Squares, [DFC]
    Clustering: K-Means, [DP-Means]
    Classification: Logistic Regression, Linear SVM (+L1, L2), Multinomial Regression, [Naive Bayes, Decision Trees]
    Optimization Primitives: SGD, Parallel Gradient, Local SGD, [L-BFGS, ADMM, Adagrad]
    Feature Extraction: Principal Component Analysis (PCA), N-grams, feature cleaning / normalization
    ML Tools: Cross Validation, Evaluation Metrics
  • 111. Build a Classifier for X
    What you want to do: build a classifier for X
    What you have to do:
    ✦ Learn the internals of ML classification algorithms, sampling, feature selection, X-validation, …
    ✦ Potentially learn Spark/Hadoop/…
    ✦ Implement 3-4 algorithms
    ✦ Implement grid-search to find the right algorithm parameters
    ✦ Implement validation algorithms
    ✦ Experiment with different sampling-sizes, algorithms, features
    ✦ …
    and in the end: Ask For Help
  • 113. Insight: A Declarative Approach
    ✦ End Users tell the system what they want, not how to get it
    SQL -> Result
    MQL -> Model
  • 115. Insight: A Declarative Approach
    ✦ End Users tell the system what they want, not how to get it
    Example: Supervised Classification (algorithm independent)
        var X = load("als_clinical", 2 to 10)
        var y = load("als_clinical", 1)
        var (fn-model, summary) = doClassify(X, y)
  • 116. ML Optimizer: A Search Problem
    ✦ System is responsible for searching through model space
    ✦ Opportunities for physical optimization
    [Figure labels: 5min, Boosting, SVM]
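    Searching through model space can be pictured as training a set of candidates and keeping the one that validates best. The following is a minimal sketch of that loop in plain Scala, assuming exhaustive search scored by validation accuracy. `ModelSearch`, `Candidate`, and `search` are hypothetical names, and nothing here reflects how the ML Optimizer actually prunes or schedules the search.

```scala
object ModelSearch {
  // A candidate pairs a label with a training routine that yields a predictor.
  final case class Candidate[X](name: String, train: Seq[(X, Int)] => (X => Int))

  /** Fraction of examples the model labels correctly. */
  def accuracy[X](model: X => Int, data: Seq[(X, Int)]): Double =
    data.count { case (x, y) => model(x) == y }.toDouble / data.size

  /** Trains every candidate; returns the name and validation accuracy of the best one. */
  def search[X](cands: Seq[Candidate[X]], train: Seq[(X, Int)],
      valid: Seq[(X, Int)]): (String, Double) =
    cands
      .map(c => (c.name, accuracy(c.train(train), valid)))
      .maxBy(_._2)
}
```

    A candidate here could be a different algorithm or the same algorithm with a different parameter setting; the caller only states the data and receives the winner, which mirrors the declarative `doClassify(X, y)` style.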
  • 122. Systems Optimization of Model Search
    Observation: We tend to be I/O bound during model training.
    ✦ Idea from databases: shared cursor!
    ✦ Single pass over the data, many models trained
    ✦ Example: Logistic Regression via SGD
    [Figure: QueryA and QueryB scanning the same table]
        A  B  C
        1  a  Dog
        1  b  Cat
        2  c  Cat
        2  d  Cat
        3  e  Dog
        3  f  Horse
        4  g  Monkey
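    The shared-cursor idea can be sketched as one scan of the data that updates several models at once. The example below trains one logistic-regression model per learning rate in a single pass, in plain Scala. `SharedScan` and `trainAll` are hypothetical names, and varying only the learning rate is an assumption made to keep the sketch short.

```scala
object SharedScan {
  def sigmoid(z: Double): Double = 1.0 / (1.0 + math.exp(-z))

  /** One SGD model per learning rate, all updated during the same scan of `data`.
    * `data` is an Iterator to stress that each record is read only once. */
  def trainAll(data: Iterator[(Array[Double], Double)], dim: Int,
      rates: Seq[Double]): Seq[Array[Double]] = {
    val models = rates.map(_ => new Array[Double](dim))
    for ((x, y) <- data) {                    // single pass: the I/O cost is paid here
      for ((w, rate) <- models.zip(rates)) {  // every model reuses the record in memory
        val z = x.indices.map(i => x(i) * w(i)).sum
        val err = sigmoid(z) - y
        for (i <- w.indices) w(i) -= rate * err * x(i)
      }
    }
    models
  }
}
```

    When training is I/O bound, the inner loop is cheap relative to reading the record, so adding models to the scan costs far less than rescanning the data once per model.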
  • 126. Relationship with MLI
    ✦ MLI provides common interface for all algorithms
    ✦ Contracts: meta-data for algorithms written against MLI
      ✦ Type (e.g., classification)
      ✦ Parameters
      ✦ Runtime (e.g., O(n))
      ✦ Input-Specification
      ✦ Output-Specification
      ✦ …
    [Figure: MLbase architecture. Layers: Spark, MLlib, MLI, ML Optimizer. The End User submits a declarative ML task (MQL) and receives a result (e.g., fn-model & summary); the ML Developer supplies ML contract + code. Master server: parser, optimizer, executor/monitoring, ML library, meta-data and statistics. Slaves: DMX runtimes.]
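    One way to picture a contract is as a small metadata record that an optimizer can filter and enumerate without running the algorithm. The sketch below is hypothetical throughout: the `Contract` fields simply mirror the bullet list above, and `eligible` and `gridSize` are illustrative helpers, not part of MLI or the ML Optimizer.

```scala
object Contracts {
  // Hypothetical shape of the metadata a contract might carry.
  final case class Contract(
      algorithm: String,
      taskType: String,                     // e.g., "classification"
      parameters: Map[String, Seq[Double]], // candidate values per tunable parameter
      runtime: String,                      // e.g., "O(n)"
      inputSpec: String,
      outputSpec: String)

  /** Algorithms whose contract matches the requested task type. */
  def eligible(all: Seq[Contract], taskType: String): Seq[String] =
    all.filter(_.taskType == taskType).map(_.algorithm)

  /** Number of parameter settings a grid search over one contract would visit. */
  def gridSize(c: Contract): Int =
    c.parameters.values.map(_.size).product
}
```

    With records like these, a declarative request for a classifier can be narrowed to the matching algorithms, and the size of the resulting search can be estimated before any training starts.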
  • 128. Contributors
    ✦ John Duchi
    ✦ Michael Franklin
    ✦ Joseph Gonzalez
    ✦ Rean Griffith
    ✦ Michael Jordan
    ✦ Tim Kraska
    ✦ Xinghao Pan
    ✦ Virginia Smith
    ✦ Shivaram Venkataraman
    ✦ Matei Zaharia
  • 131. First Release (Summer)
    ✦ MLlib: low-level ML library and underlying kernels
      ✦ Callable from Scala, Java
      ✦ Included as part of Spark
    ✦ MLI: API for feature extraction and ML algorithms
      ✦ Platform for ML development
      ✦ Includes a more extensive library, with a faster dev-cycle than MLlib
    [Figure: MLbase architecture diagram (End User, ML Developer, ML Optimizer, MLI, MLlib, Spark)]
  • 134. Second Release (Winter)
    ✦ ML Optimizer: automated model selection
      ✦ Search problem over feature extractors and algorithms in MLI
      ✦ Contracts
      ✦ Restricted query language (MQL)
    ✦ Feature extraction for image data
    [Figure: MLbase architecture diagram (End User, ML Developer, ML Optimizer, MLI, MLlib, Spark)]
  • 140. Future Directions
    ✦ Identify minimal set of ML operators
      ✦ Expose internals of ML algorithms to optimizer
    ✦ Unified language for End Users and ML Developers
    ✦ Plug-ins to Python, R
    ✦ Visualization for unsupervised learning and exploration
    ✦ Advanced ML capabilities
      ✦ Time-series algorithms
      ✦ Graphical models
      ✦ Advanced optimization (e.g., asynchronous computation)
      ✦ Online updates
      ✦ Sampling for efficiency
  • 141. Contributions encouraged!
    Berkeley, CA, August 29-30
    www.mlbase.org
    MLbase