Overview of Apache SystemML by Berthold Reinwald and Nakul Jindal

Scalable Machine Learning
with Apache SystemML
Berthold Reinwald, Nakul Jindal
IBM
June 21st, 2016

Agenda
• What is Apache SystemML
• How to implement SystemML algorithms → data scientist
• How to run SystemML algorithms → user
• How does SystemML work → SystemML developer

What is Apache SystemML
• In a nutshell
• a language for data scientists to implement scalable ML algorithms
• 2 language variants: R-like and Python-like syntax
• Strong foundation of linear algebra operations and statistical functions
• Comes with approx. 20+ algorithms pre-implemented
• Cost-based optimizer to compile execution plans
• Depending on data characteristics (tall/skinny, short/wide; dense/sparse)
and cluster characteristics
• ranging from single node to clusters (MapReduce, Spark); hybrid plans
• APIs & Tools
• Command line: hadoop jar, spark-submit, standalone Java app
• JMLC: embed as library
• Spark MLContext: Scala, Python, and Java
• Tools
• REPL (Scala Spark and pyspark)
• Spark ML pipeline

Big Data Analytics - Characteristics
• Large number of models
• Large number of data points
• Large number of features
• Sparse data
• Large number/size of intermediates
• Large number of pairs
• Custom analytics

SystemML – Declarative ML
• Analytics language for data scientists
(“The SQL for analytics”)
• Algorithms expressed in a declarative,
high-level language DML with R-like syntax
• Productivity of data scientists
• Enable
• Solutions development
• Tools
• Compiler
• Cost-based optimizer to generate
execution plans and to parallelize
• based on data characteristics
• based on cluster and machine characteristics
• Physical operators for in-memory single node
and cluster execution
• Performance & Scalability

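High-Level SystemML Architecture
[Figure: layered architecture. DML scripts, written in DML (Declarative Machine Learning Language), pass through the Language, Compiler, and Runtime layers and execute either on an in-memory single node (scale-up) or on a Hadoop or Spark cluster (scale-out).]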

Apache SystemML Incubator Project
• June, 2015: SystemML open source announced at
Spark Summit
• Sep., 2015: public github
• Oct., 2015: 1st open source binary release (0.8.0)
• Nov., 2015: Enter Apache incubation
• http://systemml.apache.org/
• https://github.com/apache/incubator-systemml
• Jan., 2016: SystemML 0.9.0 (1st Apache release)
• June, 2016: SystemML 0.10.0 release

Apache SystemML Incubator
http://systemml.apache.org/
• Get SystemML
• Documentation
• DML Reference Guide
• Algorithms Guide
• Running
• Community
• JIRA server
• GitHub

DML Language Reference Guide
https://apache.github.io/incubator-systemml/dml-language-reference.html

Sample Code
A = 1.0 # A is a double
X <- matrix("4 3 2 5 7 8", rows=3, cols=2) # X = matrix of size 3,2; '<-' is assignment
Y = matrix(1, rows=3, cols=2) # Y = matrix of size 3,2 with all 1s
b <- t(X) %*% Y # %*% is matrix multiply, t(X) is transpose
S = "hello world"
i = 0
while(i < max_iteration) {
  H = (H * (t(W) %*% (V/(W%*%H))))/t(colSums(W)) # * is element by element mult
  W = (W * ((V/(W%*%H)) %*% t(H)))/t(rowSums(H))
  i = i + 1; # i is an integer
}
print (toString(H)) # toString converts a matrix to a string
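The two update lines inside the while loop can be traced by hand on small matrices. The following is a minimal plain-Python sketch of the same multiplicative updates, assuming V, W, and H are strictly positive lists of lists; the helper names (matmul, transpose, pnmf_step, kl_divergence) and the random test data are illustrative and not part of SystemML.

```python
import math
import random

def matmul(A, B):
    """Matrix product of two list-of-lists matrices (DML: A %*% B)."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def transpose(A):
    """Transpose of a list-of-lists matrix (DML: t(A))."""
    return [list(col) for col in zip(*A)]

def ratio(V, WH):
    """Element-wise division V / WH."""
    return [[v / wh for v, wh in zip(rv, rw)] for rv, rw in zip(V, WH)]

def kl_divergence(V, WH):
    """Generalized KL divergence between V and WH (all entries positive)."""
    return sum(v * math.log(v / wh) - v + wh
               for rv, rw in zip(V, WH) for v, wh in zip(rv, rw))

def pnmf_step(V, W, H):
    """One pass of the two update lines from the slide."""
    # H = (H * (t(W) %*% (V/(W%*%H)))) / t(colSums(W))
    numer = matmul(transpose(W), ratio(V, matmul(W, H)))   # k x n
    col_sums_w = [sum(col) for col in zip(*W)]             # one value per factor
    H = [[H[k][j] * numer[k][j] / col_sums_w[k] for j in range(len(H[0]))]
         for k in range(len(H))]
    # W = (W * ((V/(W%*%H)) %*% t(H))) / t(rowSums(H))
    numer = matmul(ratio(V, matmul(W, H)), transpose(H))   # m x k
    row_sums_h = [sum(row) for row in H]                   # one value per factor
    W = [[W[i][k] * numer[i][k] / row_sums_h[k] for k in range(len(W[0]))]
         for i in range(len(W))]
    return W, H

random.seed(0)
m, n, k = 6, 5, 2                       # illustrative sizes
V = [[random.uniform(0.5, 5.0) for _ in range(n)] for _ in range(m)]
W = [[random.uniform(0.1, 1.0) for _ in range(k)] for _ in range(m)]
H = [[random.uniform(0.1, 1.0) for _ in range(n)] for _ in range(k)]

initial = kl_divergence(V, matmul(W, H))
for _ in range(50):
    W, H = pnmf_step(V, W, H)
final = kl_divergence(V, matmul(W, H))
print(initial, final)
```

The sketch only mirrors the arithmetic; it says nothing about how SystemML compiles or distributes these operations.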

Sample Code
source("nn/layers/affine.dml") as affine # import a file in the "affine" namespace
[W, b] = affine::init(D, M) # calls the init function; multiple return values
parfor (i in 1:nrow(X)) { # i iterates over 1 through num rows in X in parallel
  for (j in 1:ncol(X)) { # j iterates over 1 through num cols in X
    # Computation ...
  }
}
write (M, fileM, format="text") # M=matrix, fileM=file, also writes to HDFS
X = read (fileX) # fileX=file, also reads from HDFS
if (ncol (A) > 1) {
  # Matrix A is being sliced by a given range of columns
  A[,1:(ncol (A) - 1)] = A[,1:(ncol (A) - 1)] - A[,2:ncol (A)];
}

Sample Code
interpSpline = function(
    double x, matrix[double] X, matrix[double] Y, matrix[double] K) return (double q) {
  i = as.integer(nrow(X) - sum(ppred(X, x, ">=")) + 1)
  # misc computation …
  q = as.scalar(qm)
}
eigen = externalFunction(Matrix[Double] A)
  return(Matrix[Double] eval, Matrix[Double] evec)
  implemented in (classname="org.apache.sysml.udf.lib.EigenWrapper", exectype="mem")

Sample Code (From LinearRegDS.dml*)
A = t(X) %*% X
b = t(X) %*% y
if (intercept_status == 2) {
A = t(diag (scale_X) %*% A + shift_X %*% A [m_ext, ])
A = diag (scale_X) %*% A + shift_X %*% A [m_ext, ]
b = diag (scale_X) %*% b + shift_X %*% b [m_ext, ]
}
A = A + diag (lambda)
print ("Calling the Direct Solver...")
beta_unscaled = solve (A, b)
*https://github.com/apache/incubator-systemml/blob/master/scripts/algorithms/LinearRegDS.dml#L133
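The core of the direct solver above is the regularized normal equations: build A = t(X) %*% X and b = t(X) %*% y, add lambda to the diagonal, and solve. Below is a minimal plain-Python sketch of that path. It assumes a scalar lambda and skips the intercept_status == 2 rescaling; the function names and the toy data are illustrative, and the small Gaussian-elimination routine merely stands in for DML's built-in solve.

```python
def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(A)
    M = [row[:] + [bi] for row, bi in zip(A, b)]   # augmented matrix [A | b]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        if abs(M[pivot][col]) < 1e-12:
            raise ValueError("matrix is singular to working precision")
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):                 # back substitution
        s = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - s) / M[i][i]
    return x

def linreg_ds(X, y, lam=0.0):
    """Direct-solve linear regression via the normal equations."""
    n_feat = len(X[0])
    # A = t(X) %*% X
    A = [[sum(row[i] * row[j] for row in X) for j in range(n_feat)]
         for i in range(n_feat)]
    # b = t(X) %*% y
    b = [sum(row[i] * yi for row, yi in zip(X, y)) for i in range(n_feat)]
    # A = A + diag(lambda), with the same lambda for every feature (assumption)
    for i in range(n_feat):
        A[i][i] += lam
    return solve(A, b)

# Toy data generated from y = 1 + 2*x; the first column of X is the intercept.
X = [[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]]
y = [1.0, 3.0, 5.0, 7.0]
beta = linreg_ds(X, y)
print(beta)
```

The direct solve is attractive when the number of features is small, because A is only features x features regardless of the number of rows.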

DML Editor Support
• Very rudimentary editor support
• Bit of shameless self-promotion:
• Atom – Hackable Text editor
• Install package - https://atom.io/packages/language-dml
• From GUI - http://flight-manual.atom.io/using-atom/sections/atom-packages/
• Or from command line – apm install language-dml
• Rudimentary snippet-based completion of builtin functions
• Vim
• Install package - https://github.com/nakul02/vim-dml
• Works with Vundle (vim package manager)
• There is an experimental Zeppelin Notebook integration with DML –
• https://issues.apache.org/jira/browse/SYSTEMML-542
• Available as a docker image to play with - https://hub.docker.com/r/nakul02/incubator-zeppelin/
• Please send feedback when using these, requests for features, bugs
• I’ll work on them when I can

SystemML Algorithms
Category | Description
Descriptive Statistics | Univariate; Bivariate; Stratified Bivariate
Classification | Logistic Regression (multinomial); Multi-Class SVM; Naïve Bayes (multinomial); Decision Trees; Random Forest
Clustering | k-Means
Regression | Linear Regression: system of equations; CG (conjugate gradient descent)
Regression | Generalized Linear Models (GLM). Distributions: Gaussian, Poisson, Gamma, InverseGaussian, Binomial, Bernoulli. Links for all distributions: identity, log, sq. root, inverse, 1/μ^2. Links for Binomial / Bernoulli: logit, probit, cloglog, cauchit
Regression | Stepwise: Linear; GLM
Dimension Reduction | PCA
Matrix Factorization | ALS: direct solve; CG (conjugate gradient descent)
Survival Models | Kaplan Meier Estimate; Cox Proportional Hazard Regression
Predict | Algorithm-specific scoring
Transformation (native) | Recoding, dummy coding, binning, scaling, missing value imputation
Documentation: https://apache.github.io/incubator-systemml/algorithms-reference.html
Scripts: /usr/SystemML/systemml-0.10.0-incubating/scripts/algorithms/

Running / Invoking SystemML
• Command line
• Standalone (Java application in single JVM, in bin folder)
• Spark (spark-submit, in scripts folder)
• hadoop command line
• APIs (MLContext)
• Scala, e.g. run from Spark shell
• Python, e.g. run from PySpark
• Java
• In-Memory

MLContext API – Example Usage
val ml = new MLContext(sc)
val X_train = sc.textFile("amazon0601.txt")
  .filter(!_.startsWith("#"))
  .map(_.split("\t") match { case Array(prod1, prod2) => (prod1.toInt, prod2.toInt, 1.0) })
  .toDF("prod_i", "prod_j", "x_ij")
  .filter("prod_i < 5000 AND prod_j < 5000") // Change to smaller number
  .cache()

val pnmf =
"""
# data & args
X = read($X)
rank = as.integer($rank)
# Computation ....
write(negloglik, $negloglikout)
write(W, $Wout)
write(H, $Hout)
"""
ml.registerInput("X", X_train)
ml.registerOutput("W")
ml.registerOutput("H")
ml.registerOutput("negloglik")
val outputs = ml.executeScript(pnmf, Map("maxiter" -> "100", "rank" -> "10"))
val negloglik = getScalarDouble(outputs, "negloglik")

Run LinReg CG from Spark Shell (MLContext)

Run SystemML in ML Pipeline

End-to-end on Spark … in Code
import org.apache.spark.sql._
val ctx = new org.apache.spark.sql.SQLContext(sc)
val tweets = ctx.jsonFile("hdfs:/twitter/decahose")
tweets.registerAsTable("tweetTable")
ctx.sql("SELECT text FROM tweetTable LIMIT 5").collect.foreach(println)
ctx.sql("SELECT lang, COUNT(*) AS cnt FROM tweetTable GROUP BY lang ORDER BY cnt DESC LIMIT 10").collect.foreach(println)
val texts = ctx.sql("SELECT text FROM tweetTable").map(_.head.toString)
def featurize(str: String): Vector = { ... }
val vectors = texts.map(featurize).toDF.cache()
val mcV = new MatrixCharacteristics(vectors.count, vocabSize, 1000,1000)
val V = RDDConvertUtilsExt(sc, vectors, mcV, false, "_1")
val ml = new com.ibm.bi.dml.api.MLContext(sc)
ml.registerInput("V", V, mcV)
ml.registerOutput("W")
ml.registerOutput("H")
val args = Array(numTopics, numGNMFIter)
val out = ml.execute("GNMF.dml", args)
val W = out.getDF("W")
val H = out.getDF("H")
def getWords(r: Row): Array[(String, Double)] = { ... }
val topics = H.rdd.map(getWords)
[Slide callouts on the code above, in order: Twitter Data; Explore Data in SQL; Data Set; Training Set; Topic Modeling; SQL, ML; Get Topics]

SystemML Architecture
Language
• R-like syntax
• Linear algebra, statistical functions, control structures, etc.
• User-defined & external functions
• Parsing
• Statement blocks & statements
• Program analysis, type inference, dead code elimination
High-Level Operator (HOP) Component
• Dataflow in DAGs of operations on matrices, frames, and scalars
• Choosing from alternative execution plans based on memory and cost estimates: operator ordering & selection; hybrid plans
Low-Level Operator (LOP) Component
• Low-level physical execution plan (LOP DAGs) over key-value pairs
• “Piggybacking” operations into a minimal number of Map-Reduce jobs
Runtime
• Hybrid Runtime
• CP: single machine operations & orchestrate jobs
• MR: generic Map-Reduce jobs & operations
• SP: Spark Jobs
• Numerically stable operators
• Dense / sparse matrix representation
• Multi-level buffer pool (caching) to evict in-memory objects
• Dynamic recompilation for initial unknowns
[Figure: SystemML architecture stack. APIs: Command Line, JMLC, Spark MLContext. Compiler: Parser/Language, High-Level Operators, Low-Level Operators, with cost-based optimizations. Runtime: Control Program, Runtime Program, Recompiler, ParFor Optimizer/Runtime, Buffer Pool, CP / Spark / MR instructions, generic MR jobs, MatrixBlock Library (single/multi-threaded), Mem/FS IO and DFS IO.]

SystemML Compilation Chain
CP + b sb _mVar1
SPARK mapmm X.MATRIX.DOUBLE _mVar1.MATRIX.DOUBLE _mVar2.MATRIX.DOUBLE RIGHT false NONE
CP * y _mVar2 _mVar3

Selected Algebraic Simplification Rewrites
Name | Dynamic Pattern
Remove Unnecessary Indexing | X[a:b,c:d] = Y → X = Y iff dims(X)=dims(Y); X = Y[, 1] → X = Y iff ncol(Y)=1
Remove Empty Matrix Multiply | X%*%Y → matrix(0,nrow(X),ncol(Y)) iff nnz(X)=0|nnz(Y)=0
Remove Unnecessary Outer Product | X*(Y%*%matrix(1,...)) → X*Y iff ncol(Y)=1
Simplify Diag Aggregates | sum(diag(X)) → trace(X) iff ncol(X)=1
Simplify Matrix Mult Diag | diag(X)%*%Y → X*Y iff ncol(X)=1&ncol(Y)=1
Simplify Diag Matrix Mult | diag(X%*%Y) → rowSums(X*t(Y)) iff ncol(Y)>1
Simplify Dot Product Sum | sum(X^2) → t(X)%*%X iff ncol(X)=1
Name | Static Pattern
Remove Unnecessary Operations | t(t(X)), X/1, X*1, X-0 → X; matrix(1,)/X → 1/X; rand(,min=-1,max=1)*7 → rand(,min=-7,max=7)
Binary to Unary | X+X → 2*X; X*X → X^2; X-X*Y → X*(1-Y)
Simplify Diag Aggregates | trace(X%*%Y) → sum(X*t(Y))
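Several of these rewrites are plain algebraic identities that avoid materializing a large intermediate (for example the full product X%*%Y when only its diagonal is needed). The following plain-Python sketch checks a few of them numerically on small matrices; the helper names and the test matrices are illustrative only and unrelated to SystemML's rewrite code.

```python
def matmul(A, B):
    """Matrix product (DML: A %*% B)."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def transpose(A):
    """Transpose (DML: t(A))."""
    return [list(col) for col in zip(*A)]

def hadamard(A, B):
    """Element-wise product (DML: A * B)."""
    return [[a * b for a, b in zip(ra, rb)] for ra, rb in zip(A, B)]

def total(A):
    """Sum of all cells (DML: sum(A))."""
    return sum(sum(row) for row in A)

def trace(A):
    """Sum of the diagonal of a square matrix."""
    return sum(A[i][i] for i in range(len(A)))

X = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]          # 2 x 3
Y = [[7.0, 8.0], [9.0, 10.0], [11.0, 12.0]]     # 3 x 2
XY = matmul(X, Y)                                # 2 x 2

# trace(X%*%Y) -> sum(X*t(Y))
lhs_trace = trace(XY)
rhs_trace = total(hadamard(X, transpose(Y)))

# diag(X%*%Y) -> rowSums(X*t(Y))
lhs_diag = [XY[i][i] for i in range(len(XY))]
rhs_diag = [sum(row) for row in hadamard(X, transpose(Y))]

# sum(v^2) -> t(v)%*%v for a column vector v
v = [[1.0], [2.0], [3.0]]
lhs_dot = total(hadamard(v, v))
rhs_dot = matmul(transpose(v), v)[0][0]

# X - X*Z -> X*(1-Z), element-wise, with Z shaped like X
Z = transpose(Y)
lhs_bin = [[x - x * z for x, z in zip(rx, rz)] for rx, rz in zip(X, Z)]
rhs_bin = [[x * (1 - z) for x, z in zip(rx, rz)] for rx, rz in zip(X, Z)]

print(lhs_trace, rhs_trace, lhs_diag, rhs_diag, lhs_dot, rhs_dot)
```

In each pair the right-hand side touches only cell-wise products, which is why the rewritten form is cheaper when X and Y are large.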

A Data Scientist – Linear Regression
[Figure: X w ≈ y, where X holds the explanatory/independent variables, w is the model, and y is the predicted/dependent variable.]
Optimization Problem: $w = \arg\min_w \|Xw - y\|^2 + \lambda \|w\|^2$
Conjugate Gradient Method:
• Start off with the (negative) gradient
• For each step
1. Move to the optimal point along the chosen direction;
2. Recompute the gradient;
3. Project it onto the subspace conjugate* to all prior directions;
4. Use this as the next direction
(* conjugate = orthogonal given A as the metric, with $A = X^\top X + \lambda I$)
[Figure: script outline. Initialize, initial direction, iterate until convergence (step size, update w, next direction), accuracy measures.]
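The conjugate gradient steps listed on this slide can be sketched in a few lines. The following plain-Python version applies CG to the regularized normal equations (t(X) X + lambda I) w = t(X) y without ever forming t(X) X; it is an illustrative sketch, not the SystemML LinearRegCG script, and the function names, tolerance, and toy data are assumptions.

```python
def matvec(X, v):
    """X %*% v for a list-of-lists matrix X and a list v."""
    return [sum(x * vi for x, vi in zip(row, v)) for row in X]

def t_matvec(X, v):
    """t(X) %*% v without materializing the transpose."""
    out = [0.0] * len(X[0])
    for row, vi in zip(X, v):
        for j, x in enumerate(row):
            out[j] += x * vi
    return out

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def linreg_cg(X, y, lam=0.0, max_iter=100, tol=1e-12):
    """Conjugate gradient for min ||Xw - y||^2 + lam * ||w||^2."""
    w = [0.0] * len(X[0])                    # initialize
    r = [-g for g in t_matvec(X, y)]         # gradient at w = 0 (up to a factor 2)
    p = [-ri for ri in r]                    # initial direction: negative gradient
    norm_r2 = dot(r, r)
    for _ in range(max_iter):
        if norm_r2 <= tol:                   # converged
            break
        # q = (t(X) %*% X + lam * I) %*% p, computed as two matrix-vector products
        q = [a + lam * pi for a, pi in zip(t_matvec(X, matvec(X, p)), p)]
        alpha = norm_r2 / dot(p, q)          # step size: optimal point along p
        w = [wi + alpha * pi for wi, pi in zip(w, p)]      # update w
        r = [ri + alpha * qi for ri, qi in zip(r, q)]      # recompute the gradient
        old_norm_r2 = norm_r2
        norm_r2 = dot(r, r)
        beta = norm_r2 / old_norm_r2
        p = [-ri + beta * pi for ri, pi in zip(r, p)]      # next conjugate direction
    return w

# Toy data generated from y = 1 + 2*x; the first column of X is the intercept.
X = [[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]]
y = [1.0, 3.0, 5.0, 7.0]
w = linreg_cg(X, y)
print(w)
```

Each iteration needs only two matrix-vector products with X, which is the part that benefits from caching X when the data is large.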

SystemML – Run LinReg CG on Spark
[Figure: execution plans chosen for LinReg CG as X grows (100M rows in every case; y is 100M x 1). Cluster: 20 GB driver on 16c, 6 x 55 GB executors.]
• 8 GB (X: 100M x 10): multithreaded single node
• 80 GB (X: 100M x 100): hybrid plan with RDD caching and fused operator
• 800 GB (X: 100M x 1,000): hybrid plan with RDD out-of-core and fused operator (spilling)
• 8 TB (X: 100M x 10,000): hybrid plan with RDD out-of-core and different operators
Plan fragment shown for the two fused-operator plans (tMMv on the executors over the RDD cache of X):
x.persist();
...
X.mapValues(tMMp).reduce()
Plan fragment shown for the plan with different operators (Mv and tvM on the executors over the RDD cache of X):
x.persist();
...
// 2 MxV mult with broadcast, mapToPair, and reduceByKey

LinReg CG for varying Data
Execution time in secs (the chart uses a log scale from 10 to 100,000):
Plan | 8 GB (100M x 10) | 80 GB (100M x 100) | 800 GB (100M x 1K) | 8 TB (100M x 10K)
CP+Spark | 21 | 92 | 2,065 | 40,395
Spark | 76 | 124 | 2,159 | 40,130
CP+MR | 24 | 277 | 2,613 | 41,006
Chart annotations, by data size:
• 8 GB: single node MT avoids Spark Ctx & distributed ops (3.6x)
• 80 GB: hybrid plan & RDD caching (3x)
• 800 GB: out of core (1.2x)
• 8 TB: fully utilized
Note
• Driver with 20 GB, 16c
• 6 Executors each 55 GB, 24c
• Convergence in 3-4 iterations
• SystemML as of 10/2015
• Cost-based optimization is important
• Hybrid execution plans benefit especially medium-sized data sets
• Aggregated in-memory data sets are sweet spot for Spark esp. for iterative algorithms
• Graceful degradation for out-of-core
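The data sizes quoted on this slide follow from simple arithmetic, and comparing them with the cluster memory gives a rough feel for why different plans are chosen. The sketch below assumes dense storage at 8 bytes per cell and decimal units (1 GB = 10^9 bytes); the placement function is a crude illustration and not SystemML's cost model, which works with memory budgets rather than raw heap sizes.

```python
BYTES_PER_DOUBLE = 8                     # assumption: dense double-precision cells

def dense_size_bytes(rows, cols):
    """Size of a dense rows x cols matrix of doubles, ignoring any overhead."""
    return rows * cols * BYTES_PER_DOUBLE

DRIVER_MEM = 20 * 10**9                  # 20 GB driver (from the slide)
EXECUTOR_MEM = 6 * 55 * 10**9            # 6 executors x 55 GB (from the slide)

def placement(size_bytes):
    """Crude illustration: where could a matrix of this size be held?"""
    if size_bytes <= DRIVER_MEM:
        return "driver"
    if size_bytes <= EXECUTOR_MEM:
        return "executors"
    return "out-of-core"

ROWS = 100_000_000
sizes = {cols: dense_size_bytes(ROWS, cols) for cols in (10, 100, 1000, 10000)}
for cols, size in sizes.items():
    print(f"100M x {cols}: {size / 10**9:,.0f} GB -> {placement(size)}")
```

Under these assumptions the 8 GB case fits in the driver, the 80 GB case fits in aggregate executor memory, and the two larger cases exceed it, which lines up with the single-node, RDD-caching, and out-of-core annotations on the chart.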

Apache SystemML - Summary
• Cost-based compilation of machine learning algorithms generates execution plans
• for single-node in-memory, cluster, and hybrid execution
• for varying data characteristics:
• varying number of observations (1,000s to 10s of billions)
• varying number of variables (10s to 10s of millions)
• dense and sparse data
• for varying cluster characteristics (memory configurations, degree of parallelism)
• Out-of-the-box, scalable machine learning algorithms
• e.g. descriptive statistics, regression, clustering, and classification
• "Roll-your-own" algorithms
• Enable programmer productivity (no worry about scalability, numeric stability, and
optimizations)
• Fast turn-around for new algorithms
• Higher-level language shields algorithm development investment from platform
progression
• Yarn for resource negotiation and elasticity
• Spark for in-memory, iterative processing

Roadmap
• Algorithms
• kNN, word2vec, non-linear SVM, etc.
• Deep learning
• Engine
• Compressed Linear Algebra
• Code Gen
• Extensions for Deep Learning
• GPU backend
• Usability
• DML notebook
• Language integration
• API cleanup

Research Papers
• Ahmed Elgohary, Matthias Boehm, Peter J. Haas, Frederick R. Reiss, Berthold Reinwald: Compressed Linear Algebra for Large Scale Machine Learning. Conditional Accept at VLDB 2016
• Matthias Boehm, Michael W. Dusenberry, Deron Eriksson, Alexandre V. Evfimievski, Faraz Makari Manshadi, Niketan Pansare, Berthold Reinwald, Frederick R. Reiss, Prithviraj Sen, Arvind C. Surve, Shirish Tatikonda: SystemML: Declarative Machine Learning on Spark. VLDB 2016
• Botong Huang, Matthias Boehm, Yuanyuan Tian, Berthold Reinwald, Shirish Tatikonda, Frederick R. Reiss: Resource Elasticity for Large-Scale Machine Learning. SIGMOD Conference 2015:137-152
• Arash Ashari, Shirish Tatikonda, Matthias Boehm, Berthold Reinwald, Keith Campbell, John Keenleyside, P. Sadayappan: On optimizing machine learning workloads via kernel fusion. PPOPP 2015:173-182
• Sebastian Schelter, Juan Soto, Volker Markl, Douglas Burdick, Berthold Reinwald, Alexandre V. Evfimievski: Efficient sample generation for scalable meta learning. ICDE 2015:1191-1202
• Matthias Boehm, Douglas R. Burdick, Alexandre V. Evfimievski, Berthold Reinwald, Frederick R. Reiss, Prithviraj Sen, Shirish Tatikonda, Yuanyuan Tian: SystemML's Optimizer: Plan Generation for Large-Scale Machine Learning Programs. IEEE Data Eng. Bull. 37(3):52-62 (2014)
• Matthias Boehm, Shirish Tatikonda, Berthold Reinwald, Prithviraj Sen, Yuanyuan Tian, Douglas Burdick, Shivakumar Vaithyanathan: Hybrid Parallelization Strategies for Large-Scale Machine Learning in SystemML. PVLDB 7(7): 553-564 (2014)
• Peter D. Kirchner, Matthias Boehm, Berthold Reinwald, Daby M. Sow, Michael Schmidt, Deepak S. Turaga, Alain Biem: Large Scale Discriminative Metric Learning. IPDPS Workshops 2014:1656-1663
• Yuanyuan Tian, Shirish Tatikonda, Berthold Reinwald: Scalable and Numerically Stable Descriptive Statistics in SystemML. ICDE 2012:1351-1359
• Amol Ghoting, Rajasekar Krishnamurthy, Edwin P. D. Pednault, Berthold Reinwald, Vikas Sindhwani, Shirish Tatikonda, Yuanyuan Tian, Shivakumar Vaithyanathan: SystemML: Declarative machine learning on MapReduce. ICDE 2011:231-242
[Slide callouts tagging the papers by topic: Compression; 1st paper on Spark; Resource Elasticity; GPU; Sampling; Optimizer; Task Parallelism; Custom Algorithm; Numeric Stability]