Inside Apache SystemML
Fred Reiss
Chief Architect, IBM Spark Technology Center
Member of the IBM Academy of Technology
Origins of the SystemML Project
• 2007-2008: Multiple projects at IBM Research – Almaden involving machine learning on Hadoop.
• 2009: We create a dedicated team for scalable ML.
• 2009-2010: Through engagements with customers, we observe how data scientists create machine learning algorithms.
State-of-the-Art: Small Data
[Diagram: a Data Scientist works in R or Python on a Personal Computer; Data goes in, Results come out]
State-of-the-Art: Big Data
[Diagram: the Data Scientist's R or Python algorithm is handed to a Systems Programmer, who reimplements it in Scala to produce Results]
State-of-the-Art: Big Data
[Same diagram as above]
😞 Days or weeks per iteration
😞 Errors while translating algorithms
The SystemML Vision
[Diagram: the Data Scientist's R or Python algorithm goes directly to SystemML, which produces Results]
The SystemML Vision
[Same diagram as above]
😃 Fast iteration
😃 Same answer
Running Example: Alternating Least Squares
• Problem: Recommend products to customers
[Diagram: a sparse customers × products matrix, in which entry (i, j) is nonzero when customer i bought product j, is factored into a Customers factor and a Products factor. Multiplying the two factors produces a less-sparse matrix; the new nonzero values become product suggestions.]
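In symbols (a standard formulation of what the script below computes; the deck states it only in prose), with X the customers × products matrix, W the 0/1 indicator of observed entries, and factors U (one row per customer) and V (one column per product), alternating least squares minimizes

\min_{U,V}\ \tfrac{1}{2}\,\lVert W \circ (UV - X)\rVert_F^2 \;+\; \tfrac{\lambda}{2}\,\bigl(\lVert U\rVert_F^2 + \lVert V\rVert_F^2\bigr)

by repeatedly holding one factor fixed and solving for the other.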
Alternating Least Squares (in R)
U = rand(nrow(X), r, min = -1.0, max = 1.0);
V = rand(r, ncol(X), min = -1.0, max = 1.0);
while(i < mi) {
i = i + 1; ii = 1;
if (is_U)
G = (W * (U %*% V - X)) %*% t(V) + lambda * U;
else
G = t(U) %*% (W * (U %*% V - X)) + lambda * V;
norm_G2 = sum(G ^ 2); norm_R2 = norm_G2;
R = -G; S = R;
while(norm_R2 > 10E-9 * norm_G2 & ii <= mii) {
if (is_U) {
HS = (W * (S %*% V)) %*% t(V) + lambda * S;
alpha = norm_R2 / sum (S * HS);
U = U + alpha * S;
} else {
HS = t(U) %*% (W * (U %*% S)) + lambda * S;
alpha = norm_R2 / sum (S * HS);
V = V + alpha * S;
}
R = R - alpha * HS;
old_norm_R2 = norm_R2; norm_R2 = sum(R ^ 2);
S = R + (norm_R2 / old_norm_R2) * S;
ii = ii + 1;
}
is_U = ! is_U;
}
Alternating Least Squares (in R)
[Same script as above, annotated with the numbered steps below]
1. Start with random factors.
2. Hold the Products factor constant and find the best value for the Customers factor (the value that most closely approximates the original matrix).
3. Hold the Customers factor constant and find the best value for the Products factor.
4. Repeat steps 2-3 until convergence.
Every line has a clear purpose!
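Concretely, each pass of the outer loop solves one regularized least-squares subproblem with a few conjugate-gradient steps, and the script's variables map directly onto the calculus (shown for the U-update, assuming W is the 0/1 indicator of observed entries):

G = \nabla_U f = (W \circ (UV - X))\,V^\top + \lambda U
HS = (W \circ (S V))\,V^\top + \lambda S \quad \text{(the Hessian applied to the search direction } S\text{)}
\alpha = \lVert R\rVert_F^2 \,/\, \langle S, HS\rangle

These are exactly the G, HS, and alpha lines in the is_U branch of the script.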
Alternating Least Squares (spark.ml)
[Several source-code screenshot slides, not captured in this transcript]
• 25 lines’ worth of algorithm…
• …mixed with 800 lines of performance code
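For context, calling the built-in spark.ml ALS estimator is itself short (a sketch below, assuming a ratings DataFrame is already loaded; the column names are illustrative). The 25-versus-800-line contrast above refers to implementing or modifying the algorithm inside that library, where the linear algebra is interleaved with partitioning, caching, and serialization code.

import org.apache.spark.ml.recommendation.ALS
import org.apache.spark.sql.DataFrame

// ratings: DataFrame with columns "userId", "productId", "rating" (assumed to exist)
def trainALS(ratings: DataFrame) = {
  val als = new ALS()
    .setRank(50)          // r in the script above
    .setMaxIter(10)
    .setRegParam(0.01)    // lambda
    .setUserCol("userId")
    .setItemCol("productId")
    .setRatingCol("rating")
  als.fit(ratings)        // returns an ALSModel holding the two factor matrices
}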
Alternating Least Squares (in R)
[Same script as shown earlier]
(in SystemML’s subset of R)
• SystemML can compile and run this algorithm at scale
• No additional performance code needed!
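As a sketch of what "no additional performance code" looks like in practice, the DML script can be handed to SystemML from a Spark program via the MLContext API. The class and method names below follow later Apache SystemML releases and may differ slightly in 0.9; the script path and input handling are illustrative.

import org.apache.sysml.api.mlcontext.MLContext
import org.apache.sysml.api.mlcontext.ScriptFactory.dmlFromFile

val ml = new MLContext(sc)                 // sc: the existing SparkContext

// X: the sparse customers x products matrix, already available as a DataFrame;
// the remaining script inputs (W, lambda, r, mi, mii) are elided here.
val als = dmlFromFile("als-cg.dml")        // hypothetical file containing the script shown above
  .in("X", X)
  .out("U", "V")

val results = ml.execute(als)              // SystemML parses, optimizes, and runs the script
val customersFactor = results.getMatrix("U")
val productsFactor  = results.getMatrix("V")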
How fast does it run?
Running time comparisons between machine learning algorithms are problematic
– Different, equally-valid answers
– Different convergence rates on different data
– But we’ll do one anyway
Performance Comparison: ALS
[Bar chart: running time in seconds (0 to 20,000) for R, MLlib, and SystemML on 1.2GB (sparse binary), 12GB, and 120GB inputs; some R and MLlib runs exceed 24 hours or fail with out-of-memory errors]
Details: Synthetic data, 0.01 sparsity, 10^5 products × {10^5, 10^6, 10^7} users. Data generated by multiplying two rank-50 matrices of normally-distributed data, sampling from the resulting product, then adding Gaussian noise. Cluster of 6 servers with 12 cores and 96GB of memory per server. Number of iterations tuned so that all algorithms produce comparable result quality.
Takeaway Points
• SystemML runs the R script in parallel
– Same answer as the original R script
– Performance is comparable to a low-level RDD-based implementation
• How does SystemML achieve this result?
Performance Comparison: ALS
Several factors at play
• Subtly different
algorithms
• Adaptive execution
strategies
• Runtime differences
Questions We’ll Focus On
[Same ALS performance chart as above; SystemML runs no distributed jobs in the 1.2GB case]
• How does SystemML know it’s better to run on one machine?
• Why is SystemML so much faster than single-node R?
[Diagram: High-Level Algorithm → SystemML Optimizer → Parallel Spark Program]
The SystemML Optimizer Stack
Abstract Syntax Tree Layer
[Input: the ALS script shown earlier]
• Parsing
• Live variable analysis
• Validation
The SystemML Optimizer Stack
High-Level Operations Layer
HS = t(U) %*% (W * (U %*% S)) + lambda * S;
alpha = norm_R2 / sum (S * HS);
V = V + alpha * S;
[HOP DAG: leaf nodes U, S, W, and lambda feed %*%, *, t(), and + operator nodes, ending in write(HS)]
• Construct graph of High-Level Operations (HOPs)
The SystemML Optimizer Stack
High-Level Operations Layer
HS = t(U) %*% (W * (U %*% S)) + lambda * S;
[Same HOP DAG as above]
• Construct HOPs
The SystemML Optimizer Stack
High-Level Operations Layer
HS = t(U) %*% (W * (U %*% S))
[HOP DAG annotated with size estimates: W is 1.2GB sparse; U, S, t(U), and the output are ~800MB dense each; the intermediates U %*% S and W * (U %*% S) are ~80GB dense]
• Construct HOPs
• Propagate statistics
• Determine distributed operations
All operands fit into heap → use one node
Questions We’ll Focus On
[Same ALS performance chart as above; SystemML runs no distributed jobs in the 1.2GB case]
• How does SystemML know it’s better to run on one machine?
• Why is SystemML so much faster than single-node R?
The SystemML Optimizer Stack
High-Level Operations Layer
HS = t(U) %*% (W * (U %*% S))
[Same HOP DAG and size estimates as above]
All operands fit into heap → use one node
• Construct HOPs
• Propagate stats
• Determine distributed operations
• Rewrites
Example Rewrite: wdivmm
t(U) %*% (W * (U %*% S))
[Diagram: evaluated naively, the expression materializes U %*% S, a large dense intermediate; the fused wdivmm operator computes the result directly from U, S, and W]
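The payoff of the fused operator comes from the sparsity of W. Writing H = t(U) %*% (W * (U %*% S)), each entry of H only needs dot products at the nonzero positions of W (a sketch of the algebra behind the rewrite, not SystemML's exact kernel):

H_{kj} = \sum_i U_{ik}\,W_{ij}\,(US)_{ij} = \sum_{i\,:\,W_{ij}\neq 0} U_{ik}\,W_{ij}\,\bigl(u_i^\top s_j\bigr)

where u_i is row i of U and s_j is column j of S, so the large dense intermediate U %*% S is never materialized.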
The SystemML Optimizer Stack
High-Level Operations Layer
HS = t(U) %*% (W * (U %*% S))
[HOP DAG after the rewrite: a single wdivmm node over U, S, and W; W is 1.2GB sparse, and U, S, and the output are ~800MB dense; the 80GB intermediates are gone]
• Construct HOPs
• Propagate stats
• Determine distributed operations
• Rewrites
The SystemML Optimizer Stack
Low-Level Operations Layer
HS = t(U) %*% (W * (U %*% S))
[Same wdivmm HOP DAG with size annotations]
• Convert HOPs to Low-Level Operations (LOPs)
The SystemML Optimizer Stack
Low-Level Operations Layer
HS = t(U) %*% (W * (U %*% S))
[LOP: a Single-Node WDivMM operator over U, S, and W]
• Convert HOPs to Low-Level Operations (LOPs)
The SystemML Optimizer Stack
Low-Level Operations Layer
HS = t(U) %*% (W * (U %*% S))
[LOP: a Single-Node WDivMM operator over U, S, and W]
• Convert HOPs to Low-Level Operations (LOPs)
• Generate runtime instructions
To SystemML Runtime
The SystemML Runtime for Spark
• Automates critical performance decisions
– Distributed or local computation?
– How to partition the data?
– To persist or not to persist?
The SystemML Runtime for Spark
• Distributed vs local: Hybrid runtime
– Multithreaded computation in Spark Driver
– Distributed computation in Spark Executors
– Optimizer makes a cost-based choice
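A simplified illustration of the kind of cost-based choice made per operator (hypothetical names and thresholds; the real cost model also accounts for CPU and I/O costs):

// Hypothetical sketch of the hybrid CP-vs-Spark decision.
case class SizeEstimate(rows: Long, cols: Long, sparsity: Double) {
  // Worst-case estimate: 8 bytes per dense cell, roughly 16 bytes per sparse nonzero.
  def bytes: Double = math.min(8.0 * rows * cols, 16.0 * rows * cols * sparsity)
}

def chooseExecType(inputs: Seq[SizeEstimate], output: SizeEstimate,
                   driverMemBudgetBytes: Double): String = {
  val total = (inputs :+ output).map(_.bytes).sum
  if (total <= driverMemBudgetBytes) "CP"   // multithreaded, in the Spark driver
  else "SPARK"                              // distributed, in the executors
}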
The SystemML Runtime for Spark
Efficient Linear Algebra
• Binary block matrices
(JavaPairRDD<MatrixIndexes, MatrixBlock>)
• Adaptive block storage formats: Dense, Sparse,
Ultra-Sparse, Empty
• Efficient kernels for all combinations of block
types
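A minimal sketch of the binary-blocked representation, using hypothetical case classes standing in for SystemML's MatrixIndexes and MatrixBlock types (edge blocks and the adaptive dense/sparse formats are ignored for brevity):

import org.apache.spark.rdd.RDD

case class BlockIndex(rowBlock: Long, colBlock: Long)          // stand-in for MatrixIndexes
case class Block(rows: Int, cols: Int, values: Array[Double])  // stand-in for MatrixBlock (dense only)

// Split a matrix, given as (i, j, value) cells, into blen x blen blocks.
def toBlockedMatrix(cells: RDD[(Long, Long, Double)], blen: Int = 1000): RDD[(BlockIndex, Block)] =
  cells
    .map { case (i, j, v) => (BlockIndex(i / blen, j / blen), (i % blen, j % blen, v)) }
    .groupByKey()
    .mapValues { cellsInBlock =>
      val b = Block(blen, blen, new Array[Double](blen * blen))
      cellsInBlock.foreach { case (bi, bj, v) => b.values((bi * blen + bj).toInt) = v }
      b
    }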
Automated RDD
Caching
• Lineage tracking for
RDDs/broadcasts
• Guarded RDD collect/parallelize
• Partitioned Broadcast variables
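"Guarded" collect/parallelize means checking a size estimate before moving a matrix between the executors and the driver; a sketch of the idea (not SystemML's actual code):

import org.apache.spark.rdd.RDD

// Hypothetical guard: only collect a blocked matrix into the driver when its
// estimated size fits the driver's memory budget; otherwise keep it distributed.
def guardedCollect[T](rdd: RDD[T], estimatedBytes: Long,
                      driverBudgetBytes: Long): Either[RDD[T], Array[T]] =
  if (estimatedBytes <= driverBudgetBytes) Right(rdd.collect())
  else Left(rdd)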
[Diagrams: Logical Blocking (with block size Bc = 1,000); Physical Blocking and Partitioning (with Bc = 1,000)]
Recap
Questions
• How does SystemML know it’s better to run on one machine?
• Why is SystemML so much faster than single-node R?
Answers
• Live variable analysis
• Propagation of statistics
• Advanced rewrites
• Efficient runtime
But wait, there’s more!
• Many other rewrites
• Cost-based selection of physical operators
• Dynamic recompilation for accurate stats
• Parallel FOR (ParFor) optimizer
• Direct operations on RDD partitions
• YARN and MapReduce support
• SystemML is open source!
– Announced in June 2015
– Available on Github since September 1
– First open-source binary release (0.8.0) in October 2015
– Entered Apache incubation in November 2015
– First Apache open-source binary release (0.9) available now
• We are actively seeking contributors and users!
http://systemml.apache.org/
Open-Sourcing SystemML
THANK YOU.
For more information, go to
http://systemml.apache.org/