Inside Apache SystemML
Fred Reiss
Chief Architect, IBM Spark Technology Center
Member of the IBM Academy of Technology
Origins of the SystemML Project
• 2007-2008: Multiple projects at IBM Research – Almaden involving machine learning on Hadoop.
• 2009: We create a dedicated team for scalable ML.
• 2009-2010: Through engagements with customers, we observe how data scientists create machine learning algorithms.
State-of-the-Art: Small Data
[Diagram: a Data Scientist works in R or Python on a Personal Computer; Data goes in, Results come out]
State-of-the-Art: Big Data
[Diagram: the Data Scientist's R or Python algorithm is handed to a Systems Programmer, who reimplements it in Scala to produce Results]
State-of-the-Art: Big Data
[Same diagram as above]
😞 Days or weeks per iteration
😞 Errors while translating algorithms
The SystemML Vision
[Diagram: the Data Scientist's R or Python algorithm goes directly to SystemML, which produces Results]
The SystemML Vision
[Same diagram as above]
😃 Fast iteration
😃 Same answer
Running Example: Alternating Least Squares
• Problem: Recommend products to customers
[Diagram: a sparse customers × products matrix, in which entry (i, j) is nonzero when customer i bought product j, is factored into a Customers factor and a Products factor. Multiplying the two factors produces a less-sparse matrix; the new nonzero values become product suggestions.]
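In symbols (a standard formulation of what the script below computes; the deck states it only in prose), with X the customers × products matrix, W the 0/1 indicator of observed entries, and factors U (one row per customer) and V (one column per product), alternating least squares minimizes

\min_{U,V}\ \tfrac{1}{2}\,\lVert W \circ (UV - X)\rVert_F^2 \;+\; \tfrac{\lambda}{2}\,\bigl(\lVert U\rVert_F^2 + \lVert V\rVert_F^2\bigr)

by repeatedly holding one factor fixed and solving for the other.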
Alternating Least Squares (in R)
U = rand(nrow(X), r, min = -1.0, max = 1.0);
V = rand(r, ncol(X), min = -1.0, max = 1.0);
while(i < mi) {
i = i + 1; ii = 1;
if (is_U)
G = (W * (U %*% V - X)) %*% t(V) + lambda * U;
else
G = t(U) %*% (W * (U %*% V - X)) + lambda * V;
norm_G2 = sum(G ^ 2); norm_R2 = norm_G2;
R = -G; S = R;
while(norm_R2 > 10E-9 * norm_G2 & ii <= mii) {
if (is_U) {
HS = (W * (S %*% V)) %*% t(V) + lambda * S;
alpha = norm_R2 / sum (S * HS);
U = U + alpha * S;
} else {
HS = t(U) %*% (W * (U %*% S)) + lambda * S;
alpha = norm_R2 / sum (S * HS);
V = V + alpha * S;
}
R = R - alpha * HS;
old_norm_R2 = norm_R2; norm_R2 = sum(R ^ 2);
S = R + (norm_R2 / old_norm_R2) * S;
ii = ii + 1;
}
is_U = ! is_U;
}
Alternating Least Squares (in R)
[Same script as above, annotated with the numbered steps below]
1. Start with random factors.
2. Hold the Products factor constant and find the best value for the Customers factor (the value that most closely approximates the original matrix).
3. Hold the Customers factor constant and find the best value for the Products factor.
4. Repeat steps 2-3 until convergence.
Every line has a clear purpose!
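Concretely, each pass of the outer loop solves one regularized least-squares subproblem with a few conjugate-gradient steps, and the script's variables map directly onto the calculus (shown for the U-update, assuming W is the 0/1 indicator of observed entries):

G = \nabla_U f = (W \circ (UV - X))\,V^\top + \lambda U
HS = (W \circ (S V))\,V^\top + \lambda S \quad \text{(the Hessian applied to the search direction } S\text{)}
\alpha = \lVert R\rVert_F^2 \,/\, \langle S, HS\rangle

These are exactly the G, HS, and alpha lines in the is_U branch of the script.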
Alternating Least Squares (spark.ml)
[Several source-code screenshot slides, not captured in this transcript]
• 25 lines’ worth of algorithm…
• …mixed with 800 lines of performance code
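For context, calling the built-in spark.ml ALS estimator is itself short (a sketch below, assuming a ratings DataFrame is already loaded; the column names are illustrative). The 25-versus-800-line contrast above refers to implementing or modifying the algorithm inside that library, where the linear algebra is interleaved with partitioning, caching, and serialization code.

import org.apache.spark.ml.recommendation.ALS
import org.apache.spark.sql.DataFrame

// ratings: DataFrame with columns "userId", "productId", "rating" (assumed to exist)
def trainALS(ratings: DataFrame) = {
  val als = new ALS()
    .setRank(50)          // r in the script above
    .setMaxIter(10)
    .setRegParam(0.01)    // lambda
    .setUserCol("userId")
    .setItemCol("productId")
    .setRatingCol("rating")
  als.fit(ratings)        // returns an ALSModel holding the two factor matrices
}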
Alternating Least Squares (in R)
[Same script as shown earlier]
(in SystemML’s subset of R)
• SystemML can compile and run this algorithm at scale
• No additional performance code needed!
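As a sketch of what "no additional performance code" looks like in practice, the DML script can be handed to SystemML from a Spark program via the MLContext API. The class and method names below follow later Apache SystemML releases and may differ slightly in 0.9; the script path and input handling are illustrative.

import org.apache.sysml.api.mlcontext.MLContext
import org.apache.sysml.api.mlcontext.ScriptFactory.dmlFromFile

val ml = new MLContext(sc)                 // sc: the existing SparkContext

// X: the sparse customers x products matrix, already available as a DataFrame;
// the remaining script inputs (W, lambda, r, mi, mii) are elided here.
val als = dmlFromFile("als-cg.dml")        // hypothetical file containing the script shown above
  .in("X", X)
  .out("U", "V")

val results = ml.execute(als)              // SystemML parses, optimizes, and runs the script
val customersFactor = results.getMatrix("U")
val productsFactor  = results.getMatrix("V")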
How fast does it run?
Running time comparisons between machine learning algorithms are problematic
– Different, equally-valid answers
– Different convergence rates on different data
– But we’ll do one anyway
Performance Comparison: ALS
[Bar chart: running time in seconds (0 to 20,000) for R, MLlib, and SystemML on 1.2GB (sparse binary), 12GB, and 120GB inputs; some R and MLlib runs exceed 24 hours or fail with out-of-memory errors]
Details: Synthetic data, 0.01 sparsity, 10^5 products × {10^5, 10^6, 10^7} users. Data generated by multiplying two rank-50 matrices of normally-distributed data, sampling from the resulting product, then adding Gaussian noise. Cluster of 6 servers with 12 cores and 96GB of memory per server. Number of iterations tuned so that all algorithms produce comparable result quality.
Takeaway Points
• SystemML runs the R script in parallel
– Same answer as the original R script
– Performance is comparable to a low-level RDD-based implementation
• How does SystemML achieve this result?
Performance Comparison: ALS
Several factors at play
• Subtly different
algorithms
• Adaptive execution
strategies
• Runtime differences
Questions We’ll Focus On
[Same ALS performance chart as above; SystemML runs no distributed jobs in the 1.2GB case]
• How does SystemML know it’s better to run on one machine?
• Why is SystemML so much faster than single-node R?
[Diagram: High-Level Algorithm → SystemML Optimizer → Parallel Spark Program]
The SystemML Optimizer Stack
Abstract Syntax Tree Layer
[Input: the ALS script shown earlier]
• Parsing
• Live variable analysis
• Validation
The SystemML Optimizer Stack
High-Level Operations Layer
HS = t(U) %*% (W * (U %*% S)) + lambda * S;
alpha = norm_R2 / sum (S * HS);
V = V + alpha * S;
[HOP DAG: leaf nodes U, S, W, and lambda feed %*%, *, t(), and + operator nodes, ending in write(HS)]
• Construct graph of High-Level Operations (HOPs)
The SystemML Optimizer Stack
High-Level Operations Layer
HS = t(U) %*% (W * (U %*% S)) + lambda * S;
[Same HOP DAG as above]
• Construct HOPs
The SystemML Optimizer Stack
High-Level Operations Layer
HS = t(U) %*% (W * (U %*% S))
[HOP DAG annotated with size estimates: W is 1.2GB sparse; U, S, t(U), and the output are ~800MB dense each; the intermediates U %*% S and W * (U %*% S) are ~80GB dense]
• Construct HOPs
• Propagate statistics
• Determine distributed operations
All operands fit into heap → use one node
Questions We’ll Focus On
[Same ALS performance chart as above; SystemML runs no distributed jobs in the 1.2GB case]
• How does SystemML know it’s better to run on one machine?
• Why is SystemML so much faster than single-node R?
The SystemML Optimizer Stack
High-Level Operations Layer
HS = t(U) %*% (W * (U %*% S))
[Same HOP DAG and size estimates as above]
All operands fit into heap → use one node
• Construct HOPs
• Propagate stats
• Determine distributed operations
• Rewrites
Example Rewrite: wdivmm
t(U) %*% (W * (U %*% S))
[Diagram: evaluated naively, the expression materializes U %*% S, a large dense intermediate; the fused wdivmm operator computes the result directly from U, S, and W]
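The payoff of the fused operator comes from the sparsity of W. Writing H = t(U) %*% (W * (U %*% S)), each entry of H only needs dot products at the nonzero positions of W (a sketch of the algebra behind the rewrite, not SystemML's exact kernel):

H_{kj} = \sum_i U_{ik}\,W_{ij}\,(US)_{ij} = \sum_{i\,:\,W_{ij}\neq 0} U_{ik}\,W_{ij}\,\bigl(u_i^\top s_j\bigr)

where u_i is row i of U and s_j is column j of S, so the large dense intermediate U %*% S is never materialized.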
The SystemML Optimizer Stack
High-Level Operations Layer
HS = t(U) %*% (W * (U %*% S))
[HOP DAG after the rewrite: a single wdivmm node over U, S, and W; W is 1.2GB sparse, and U, S, and the output are ~800MB dense; the 80GB intermediates are gone]
• Construct HOPs
• Propagate stats
• Determine distributed operations
• Rewrites
The SystemML Optimizer Stack
Low-Level Operations Layer
HS = t(U) %*% (W * (U %*% S))
[Same wdivmm HOP DAG with size annotations]
• Convert HOPs to Low-Level Operations (LOPs)
The SystemML Optimizer Stack
Low-Level Operations Layer
HS = t(U) %*% (W * (U %*% S))
[LOP: a Single-Node WDivMM operator over U, S, and W]
• Convert HOPs to Low-Level Operations (LOPs)
The SystemML Optimizer Stack
Low-Level Operations Layer
HS = t(U) %*% (W * (U %*% S))
[LOP: a Single-Node WDivMM operator over U, S, and W]
• Convert HOPs to Low-Level Operations (LOPs)
• Generate runtime instructions
To SystemML Runtime
The SystemML Runtime for Spark
• Automates critical performance decisions
– Distributed or local computation?
– How to partition the data?
– To persist or not to persist?
The SystemML Runtime for Spark
• Distributed vs local: Hybrid runtime
– Multithreaded computation in Spark Driver
– Distributed computation in Spark Executors
– Optimizer makes a cost-based choice
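A simplified illustration of the kind of cost-based choice made per operator (hypothetical names and thresholds; the real cost model also accounts for CPU and I/O costs):

// Hypothetical sketch of the hybrid CP-vs-Spark decision.
case class SizeEstimate(rows: Long, cols: Long, sparsity: Double) {
  // Worst-case estimate: 8 bytes per dense cell, roughly 16 bytes per sparse nonzero.
  def bytes: Double = math.min(8.0 * rows * cols, 16.0 * rows * cols * sparsity)
}

def chooseExecType(inputs: Seq[SizeEstimate], output: SizeEstimate,
                   driverMemBudgetBytes: Double): String = {
  val total = (inputs :+ output).map(_.bytes).sum
  if (total <= driverMemBudgetBytes) "CP"   // multithreaded, in the Spark driver
  else "SPARK"                              // distributed, in the executors
}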
The SystemML Runtime for Spark
Efficient Linear Algebra
• Binary block matrices
(JavaPairRDD<MatrixIndexes, MatrixBlock>)
• Adaptive block storage formats: Dense, Sparse,
Ultra-Sparse, Empty
• Efficient kernels for all combinations of block
types
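A minimal sketch of the binary-blocked representation, using hypothetical case classes standing in for SystemML's MatrixIndexes and MatrixBlock types (edge blocks and the adaptive dense/sparse formats are ignored for brevity):

import org.apache.spark.rdd.RDD

case class BlockIndex(rowBlock: Long, colBlock: Long)          // stand-in for MatrixIndexes
case class Block(rows: Int, cols: Int, values: Array[Double])  // stand-in for MatrixBlock (dense only)

// Split a matrix, given as (i, j, value) cells, into blen x blen blocks.
def toBlockedMatrix(cells: RDD[(Long, Long, Double)], blen: Int = 1000): RDD[(BlockIndex, Block)] =
  cells
    .map { case (i, j, v) => (BlockIndex(i / blen, j / blen), (i % blen, j % blen, v)) }
    .groupByKey()
    .mapValues { cellsInBlock =>
      val b = Block(blen, blen, new Array[Double](blen * blen))
      cellsInBlock.foreach { case (bi, bj, v) => b.values((bi * blen + bj).toInt) = v }
      b
    }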
Automated RDD
Caching
• Lineage tracking for
RDDs/broadcasts
• Guarded RDD collect/parallelize
• Partitioned Broadcast variables
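"Guarded" collect/parallelize means checking a size estimate before moving a matrix between the executors and the driver; a sketch of the idea (not SystemML's actual code):

import org.apache.spark.rdd.RDD

// Hypothetical guard: only collect a blocked matrix into the driver when its
// estimated size fits the driver's memory budget; otherwise keep it distributed.
def guardedCollect[T](rdd: RDD[T], estimatedBytes: Long,
                      driverBudgetBytes: Long): Either[RDD[T], Array[T]] =
  if (estimatedBytes <= driverBudgetBytes) Right(rdd.collect())
  else Left(rdd)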
[Diagrams: Logical Blocking (with block size Bc = 1,000); Physical Blocking and Partitioning (with Bc = 1,000)]
Recap
Questions
• How does SystemML know it’s better to run on one machine?
• Why is SystemML so much faster than single-node R?
Answers
• Live variable analysis
• Propagation of statistics
• Advanced rewrites
• Efficient runtime
But wait, there’s more!
• Many other rewrites
• Cost-based selection of physical operators
• Dynamic recompilation for accurate stats
• Parallel FOR (ParFor) optimizer
• Direct operations on RDD partitions
• YARN and MapReduce support
• SystemML is open source!
– Announced in June 2015
– Available on Github since September 1
– First open-source binary release (0.8.0) in October 2015
– Entered Apache incubation in November 2015
– First Apache open-source binary release (0.9) available now
• We are actively seeking contributors and users!
http://systemml.apache.org/
Open-Sourcing SystemML
THANK YOU.
For more information, go to
http://systemml.apache.org/