OPTIMIZING TERASCALE MACHINE
LEARNING PIPELINES WITH KEYSTONEML
Evan R. Sparks, UC Berkeley AMPLab
with Shivaram Venkataraman, Tomer Kaftan, Michael Franklin, Benjamin Recht
WHAT’S A MACHINE
LEARNING PIPELINE?
A STANDARD MACHINE LEARNING PIPELINE
Right?
[Diagram: Data, Train Classifier, Model]
A STANDARD MACHINE LEARNING PIPELINE
That’s more like it!
[Diagram: Data, Feature Extraction, Train Linear Classifier, Model, Test Data, Predictions]
A REAL PIPELINE FOR
IMAGE CLASSIFICATION
Inspired by Coates & Ng, 2012
[Pipeline diagram. Stages: Data, Image Parser, Normalizer, Convolver, Symmetric Rectifier, Global Pooling, Pooler, Zipper, Patch Extractor, Patch Whitener, KMeans Clusterer, Feature Extractor, Label Extractor, Linear Solver, Linear Mapper, Model, Test Data, Error Computer, Test Error. Parameter labels on the diagram: sqrt,mean; ident,abs; ident,mean]
[The same pipeline diagram, with its stages grouped into three categories: Embarrassingly Parallel, Requires Coordination, Tricky to Scale]
ABOUT KEYSTONEML
• Software framework for building scalable end-to-end machine
learning pipelines on Apache Spark.
• Helps us understand what it means to build systems for robust,
scalable, end-to-end advanced analytics workloads and the patterns
that emerge.
• Example pipelines that achieve state-of-the-art results on large scale
datasets in computer vision, NLP, and speech - fast.
• Open source software, available at: http://keystone-ml.org/
SIMPLE EXAMPLE:
TEXT CLASSIFICATION
Training data: 20 Newsgroups
Pipeline stages before .fit( ): Trim, Tokenize, Bigrams, Top Features, Naive Bayes, Max Classifier
Pipeline stages after .fit( ): Trim, Tokenize, Bigrams, Top Features Transformer, Naive Bayes Model, Max Classifier
Once estimated, apply these steps to your production data in an online or batch fashion.
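The fit-once, apply-many pattern on this slide can be sketched in plain Scala, without Spark or KeystoneML. Everything here is an illustrative assumption: the helper names (`trim`, `tokenize`, `bigrams`, `fitTopFeatures`) are hypothetical stand-ins for the slide's stages, the Naive Bayes and Max Classifier stages are left out, and "Top Features" is approximated by keeping the k most frequent bigrams.

```scala
// Stateless stages are plain functions, composed with Function1.andThen.
val trim: String => String = _.trim
val tokenize: String => Seq[String] = _.toLowerCase.split("\\s+").toSeq.filter(_.nonEmpty)
val bigrams: Seq[String] => Seq[String] =
  tokens => tokens.sliding(2).filter(_.size == 2).map(_.mkString(" ")).toSeq

val preprocess: String => Seq[String] = trim andThen tokenize andThen bigrams

// "Top Features" has to be estimated from data: keep the k most frequent bigrams.
// The result is a fitted stage that maps a document's bigrams to a count vector.
def fitTopFeatures(docs: Seq[Seq[String]], k: Int): Seq[String] => Vector[Int] = {
  val vocabulary = docs.flatten
    .groupBy(identity)
    .toSeq
    .map { case (gram, hits) => (gram, hits.size) }
    .sortBy { case (gram, count) => (-count, gram) } // frequency first, then alphabetical
    .take(k)
    .map(_._1)
    .toVector
  grams => vocabulary.map(v => grams.count(_ == v))
}

// Fit once on (toy) training data.
val training = Seq("  the cat sat ", "the cat ran", "a dog ran")
val topFeatures = fitTopFeatures(training.map(preprocess), k = 2)
val featurize: String => Vector[Int] = preprocess andThen topFeatures

// Apply many times: to one record (online) or to a collection (batch).
val online: Vector[Int] = featurize("the cat sat on the cat")
val batch: Seq[Vector[Int]] = Seq("a dog ran", "the cat").map(featurize)
```

The fitted `featurize` function carries the learned vocabulary with it, so the same object serves both the online and the batch case.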
NOT SO SIMPLE EXAMPLE:
IMAGE CLASSIFICATION
Training data: Images (VOC2007)
Pipeline stages before .fit( ): Resize, Grayscale, SIFT, PCA, FisherVector, Linear Regression, MaxClassifier
Pipeline stages after .fit( ): Resize, Grayscale, SIFT, PCA Map, Fisher Encoder, Linear Model, MaxClassifier
Achieves performance
of Chatfield et al., 2011
Pleasantly parallel
featurization and evaluation.
7 minutes on a modest cluster.
5,000 examples, 40,000
features, 20 classes
EVEN LESS SIMPLE: IMAGENET
[Pipeline diagram. Resize, then three feature branches. Edges: Grayscale, SIFT, PCA, FisherVector. Color: LCS, PCA, FisherVector. Texture: Gabor Wavelets, PCA, FisherVector. The branches feed a Block Linear Solver and a Top 5 Classifier]
<100 SLOC
Upgrading the solver
for higher precision
means changing 1 LOC (Block Linear Solver to Weighted Block Linear Solver).
Adding 100,000 more
texture features is easy.
1000 class classification.
1,200,000 examples
64,000 features.
90 minutes on 100 nodes.
OPTIMIZING KEYSTONEML PIPELINES
High-level API enables rich space of optimizations
Automated ML operator selection.
[Diagram: a Linear Solver stage with candidate implementations L-BFGS, Iterative SGD, Direct Solver]
[Pipeline diagram. Stages: Training Data, Grayscaler, SIFT Extractor, Reduce Dimensions, Fisher Vector, Normalize, Linear Map, Predictions. Other boxes: Column Sampler, Distributed PCA, Column Sampler, Local GMM, Least Sq. L-BFGS, Training Labels]
Auto-caching for iterative workloads.
KEYSTONEML OPTIMIZER
• Sampling-based cost model
projects resource usage
• CPU, Memory, Network
• Utilization tracked through
pipeline.
• Decisions made to minimize
total cost of execution.
• Catalyst-based optimizer does
the heavy lifting.
Stage              n     d              size (GB)
Input              5000  1m pixel JPEG  0.4
Resize             5000  260k pixels    3.6
Grayscale          5000  260k pixels    1.2
SIFT               5000  65000x128      309
PCA                5000  65000x80       154
FV                 5000  256x64x2       1.2
Linear Regression  5000  20             0.0007
Max Classifier     5000  1              0.00009
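Several rows of this table can be approximated with back-of-the-envelope arithmetic: records times elements per record times bytes per element. The sketch below is not the optimizer's cost model (the slide says that one is sampling-based). The element widths and the GiB unit are assumptions chosen because they land close to the Resize, Grayscale, SIFT, and FV rows; not every row of the table follows from them.

```scala
// Estimated size of a dense stage output, in GiB (2^30 bytes).
// Assumption: the slide's "GB" column is close to GiB; the deck does not say.
def sizeGiB(n: Long, elementsPerRecord: Long, bytesPerElement: Long): Double =
  (n * elementsPerRecord * bytesPerElement).toDouble / (1L << 30)

val n = 5000L

// Element widths below are assumptions, not values stated in the deck.
val resize    = sizeGiB(n, 260000L, 3)        // 3 bytes per pixel (colour image)
val grayscale = sizeGiB(n, 260000L, 1)        // 1 byte per pixel
val sift      = sizeGiB(n, 65000L * 128, 8)   // 8-byte doubles, 65000 descriptors x 128 dims
val fisher    = sizeGiB(n, 256L * 64 * 2, 8)  // 8-byte doubles, 256 x 64 x 2 values

println(f"Resize $resize%.2f, Grayscale $grayscale%.2f, SIFT $sift%.1f, FV $fisher%.2f")
```

The point of the exercise is the one the slide makes: intermediate state can be hundreds of times larger than the input, so the optimizer has to track it stage by stage.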
CHOOSING A SOLVER
• Datasets have a number of
interesting degrees of freedom.
• Problem size (n, d, k)
• sparsity (nnz)
• condition number
• Platform has degrees of freedom:
• Memory, CPU, Network, Nodes
• Solvers are predictable!
Where:
$A \in \mathbb{R}^{n \times d}$, $X \in \mathbb{R}^{d \times k}$, $B \in \mathbb{R}^{n \times k}$
Objective:
$\min_X \; \|AX - B\|_2^2 + \lambda \|X\|_2^2$
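For small d, a regularized least-squares objective of this kind can be solved directly through the normal equations, (AᵀA + λI) X = AᵀB. The sketch below is a minimal plain-Scala illustration of that idea, not KeystoneML's solver: `solveRidge`, `transposeTimes`, and the regularization weight `lambda` are names introduced here, and matrices are dense arrays of rows.

```scala
type Matrix = Array[Array[Double]]

// Computes a^T * b for a (n x d) and b (n x k), giving a d x k matrix.
def transposeTimes(a: Matrix, b: Matrix): Matrix = {
  val d = a.head.length
  val k = b.head.length
  Array.tabulate(d, k)((i, j) => a.indices.map(r => a(r)(i) * b(r)(j)).sum)
}

// Solves (A^T A + lambda I) X = A^T B by Gauss-Jordan elimination with partial pivoting.
// Assumes the regularized system is non-singular (true whenever lambda > 0).
def solveRidge(a: Matrix, b: Matrix, lambda: Double): Matrix = {
  val d = a.head.length
  val k = b.head.length
  val gram = transposeTimes(a, a)
  for (i <- 0 until d) gram(i)(i) += lambda
  val rhs = transposeTimes(a, b)
  // Augmented system [gram | rhs]; after reduction the right block holds X.
  val aug = Array.tabulate(d)(i => gram(i) ++ rhs(i))
  for (col <- 0 until d) {
    val pivot = (col until d).maxBy(r => math.abs(aug(r)(col)))
    val tmp = aug(col); aug(col) = aug(pivot); aug(pivot) = tmp
    val p = aug(col)(col)
    for (j <- 0 until d + k) aug(col)(j) /= p
    for (r <- 0 until d if r != col) {
      val f = aug(r)(col)
      for (j <- 0 until d + k) aug(r)(j) -= f * aug(col)(j)
    }
  }
  aug.map(_.drop(d))
}

// With A = I and lambda = 1 the system is 2X = B, so X is approximately [[1.0], [2.0]].
val x = solveRidge(Array(Array(1.0, 0.0), Array(0.0, 1.0)), Array(Array(2.0), Array(4.0)), 1.0)
```

The d x d system costs roughly O(nd² + d³) to form and solve, which is why a direct solve of this kind is attractive only when the number of features is small.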
CHOOSING A SOLVER
• Three Solvers
• Exact, Block, LBFGS
• Two datasets
• Amazon - >99% sparse, n=65m
• TIMIT - dense, n=2m
• Exact solve works well for small # features.
• Use LBFGS for sparse problems.
• Block solver scales well to big dense
problems.
• Hundreds of thousands of features.
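The three rules of thumb above can be written down as a tiny decision function. This is only an illustration of the bullets: the deck describes the actual choice as cost-based, and the thresholds, the rule ordering, the `Problem` fields, and the example `k` values are all hypothetical.

```scala
sealed trait Solver
case object Exact extends Solver
case object Block extends Solver
case object LBFGS extends Solver

// Hypothetical problem description; sparsity is the fraction of zero entries in A.
final case class Problem(n: Long, d: Int, k: Int, sparsity: Double)

// Illustrative cutoffs only. A real optimizer would compare projected costs instead.
def chooseSolver(p: Problem, sparseCutoff: Double = 0.99, smallD: Int = 4096): Solver =
  if (p.sparsity >= sparseCutoff) LBFGS // "Use LBFGS for sparse problems."
  else if (p.d <= smallD) Exact         // "Exact solve works well for small # features."
  else Block                            // "Block solver scales well to big dense problems."

val amazonLike = chooseSolver(Problem(n = 65000000L, d = 100000, k = 2, sparsity = 0.995))
val smallDense = chooseSolver(Problem(n = 2000000L, d = 1024, k = 2, sparsity = 0.0))
val largeDense = chooseSolver(Problem(n = 2000000L, d = 500000, k = 2, sparsity = 0.0))
```

A fixed rule like this cannot account for cluster size or memory, which is the gap a cost model over both dataset and platform parameters is meant to close.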
[Figure: solve time in seconds (log scale) vs. number of features (1024 to 16384), with Amazon and TIMIT panels, for the Exact, Block, and LBFGS solvers]
SOLVER PERFORMANCE
• Compared KeystoneML with:
• Vowpal Wabbit - specialized system for
large, sparse problems.
• SystemML - general purpose, optimizing
ML system.
• Two problems:
• Amazon - Sparse text features.
• Binary TIMIT - Dense phoneme data.
• High Order Bit:
• KeystoneML pipelines featurization and
adapts to workload changes.
[Figures: solve time in seconds vs. number of features (1024 to 16384) on Amazon and Binary TIMIT. One pair of panels compares KeystoneML with SystemML, the other compares KeystoneML with Vowpal Wabbit]
DECIDING WHAT TO SAVE
• Pipelines generate lots of
intermediate state.
• E.g. SIFT features blow up a
0.42GB VOC dataset to 300GB.
• Iterative algorithms -> state
needed many times.
• How do we determine what to save
for later and what to reuse, given
a fixed resource budget?
• Can we adapt to workload changes?
[Diagram: Resize, Grayscale, SIFT, PCA, FisherVector, Linear Regression, MaxClassifier]
CACHING PROBLEM
• Output is computed via depth-
first execution of DAG.
• Caching “truncates” a path
after first visit.
• Want to minimize execution
time.
• Subject to memory
constraints.
• Picking optimal set is hard!
[DAG diagram: nodes A, B, C, D, E feeding Output. Node annotations (time, output size): 60s/50g, 40s/200g, 20s/40g, 15s/40g, 5s/10g]
Cache set  Time  Memory
ABCDE      140s  340g
B          140s  200g
A          180s  50g
{}         240s  0g
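The cache-set table can be reproduced with a short simulation of depth-first execution. The edges of the slide's DAG did not survive extraction, so the graph below is an assumption: A feeds B, B feeds both C and D, and both feed E. The per-node times and sizes are likewise assigned to A through E in the order they appear on the slide. Under those assumptions the simulation matches all four rows of the table. The brute-force search is for illustration only and is not the optimizer's algorithm.

```scala
// Assumed DAG shape; only the node costs come from the slide.
final case class Node(name: String, seconds: Int, memGb: Int, inputs: List[String])

val dag: Map[String, Node] = List(
  Node("A", 60, 50, Nil),
  Node("B", 40, 200, List("A")),
  Node("C", 20, 40, List("B")),
  Node("D", 15, 40, List("B")),
  Node("E", 5, 10, List("C", "D"))
).map(n => n.name -> n).toMap

// Depth-first execution: an uncached node is recomputed on every visit,
// while a cached node is computed on the first visit and free afterwards.
def executionTime(output: String, cached: Set[String]): Int = {
  val materialized = scala.collection.mutable.Set.empty[String]
  def visit(name: String): Int =
    if (materialized(name)) 0
    else {
      val node = dag(name)
      val upstream = node.inputs.map(visit).sum
      if (cached(name)) materialized += name
      upstream + node.seconds
    }
  visit(output)
}

// Exhaustive search over cache sets that fit the memory budget.
// Exponential in the number of nodes, so only usable on toy graphs.
def bestCacheSet(output: String, budgetGb: Int): (Set[String], Int) =
  dag.keySet.subsets()
    .filter(s => s.toList.map(dag(_).memGb).sum <= budgetGb)
    .map(s => (s, executionTime(output, s)))
    .minBy { case (s, t) => (t, s.size) }

val noCache = executionTime("E", Set.empty)
val tightBudget = bestCacheSet("E", budgetGb = 100)
```

With a 100g budget the search picks {A}, and with 200g it picks {B}, which is as fast as caching everything. The exhaustive search grows exponentially with the number of nodes, in line with the slide's remark that picking the optimal set is hard.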
END-TO-END PERFORMANCE
Dataset   Training Examples  Features       Raw Size (GB)  Feature Size (GB)
Amazon    65 million         100k (sparse)  14             89
TIMIT     2.25 million       528k           7.5            8800
ImageNet  1.28 million       262k           74             2500
VOC       5000               40k            0.43           1.5
END-TO-END PERFORMANCE
Dataset   KeystoneML Accuracy  Reported Accuracy  KeystoneML Time (m)  Reported Time (m)  Speedup over Reported
Amazon    91.6%                N/A                3.3                  N/A                N/A
TIMIT     66.1%                66.3%              138                  120                0.87x
ImageNet  67.4%                66.6%              270                  5760               21x
VOC       57.2%                59.2%              7                    87                 12x
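The speedup column is the reported time divided by the KeystoneML time, so a value below 1 (TIMIT) means the KeystoneML run was slower than the reported one. A quick check of that reading, using only the numbers in the table:

```scala
// (dataset, KeystoneML minutes, reported minutes), copied from the table.
val runs = Seq(("TIMIT", 138.0, 120.0), ("ImageNet", 270.0, 5760.0), ("VOC", 7.0, 87.0))

// Speedup over reported = reported time / KeystoneML time.
val speedups: Map[String, Double] =
  runs.map { case (name, ours, reported) => name -> reported / ours }.toMap

speedups.foreach { case (name, s) => println(f"$name%-8s $s%.2fx") }
```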
END-TO-END PERFORMANCE
[Figure: time in minutes vs. cluster size (8 to 128 nodes) for Amazon, TIMIT, and ImageNet, broken down by stage: Loading Train Data, Featurization, Model Solve, Loading Test Data, Model Eval]
[Figure: speedup over 8 nodes vs. cluster size (8 to 128 nodes) for Amazon, TIMIT, and ImageNet]
END-TO-END PERFORMANCE
• Tested three levels of
optimization
• None
• Auto-caching only
• Auto-caching and
operator-selection.
• 7x to 15x speedup
[Figure: speedup by workload (Amazon, TIMIT, VOC) at each optimization level (None, Whole-Pipeline, All)]
QUESTIONS?
Project Page: http://keystone-ml.org/
Code: http://github.com/amplab/keystone
Training: http://goo.gl/axbkkc
BACKUP SLIDES
SOFTWARE FEATURES
• Data Loaders
• CSV, CIFAR, ImageNet, VOC, TIMIT, 20 Newsgroups
• Transformers
• NLP - Tokenization, n-grams, term frequency, NER*, parsing*
• Images - Convolution, Grayscaling, LCS, SIFT*, FisherVector*, Pooling, Windowing, HOG, Daisy
• Speech - MFCCs*
• Stats - Random Features, Normalization, Scaling*, Signed Hellinger Mapping, FFT
• Utility/misc - Caching, Top-K classifier, indicator label mapping, sparse/dense encoding transformers.
• Estimators
• Learning - Block linear models, Linear Discriminant Analysis, PCA, ZCA Whitening, Naive Bayes*, GMM*
• Example Pipelines
• NLP - Amazon Product Review Classification, 20 Newsgroups, Wikipedia Language model
• Images - MNIST, CIFAR, VOC, ImageNet
• Speech - TIMIT
• Evaluation Metrics
• Binary Classification
• Multiclass Classification
• Multilabel Classification
* - Links to external library
Just 11k Lines of Code,
5k of which are Tests or JavaDoc.
KEY API CONCEPTS
TRANSFORMERS
[Diagram: Input -> Transformer -> Output]
abstract class Transformer[In, Out] {
  def apply(in: In): Out
  def apply(in: RDD[In]): RDD[Out] = in.map(apply)
  …
}
TYPE SAFETY HELPS ENSURE ROBUSTNESS
ESTIMATORS
[Diagram: RDD[Input] -> Estimator -> .fit() -> Transformer]
abstract class Estimator[In, Out] {
  def fit(in: RDD[In]): Transformer[In, Out]
  …
}
CHAINING
[Diagram: String -> NGrams(2) -> Bigrams -> Vectorizer -> Vector]
val featurizer: Transformer[String, Vector] = NGrams(2) then Vectorizer
[Diagram: the chain above = String -> featurizer -> Vector]
COMPLEX PIPELINES
val pipeline = (featurizer thenLabelEstimator LinearModel).fit(data, labels)
[Diagram: String -> featurizer -> Vector -> Linear Model -> Prediction. After .fit(data, labels): String -> featurizer -> Vector -> Linear Map -> Prediction = String -> pipeline -> Prediction]
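The API concepts in these backup slides can be tried out locally with a stripped-down stand-in. This is a sketch under several assumptions, not KeystoneML's code: `Seq` replaces Spark's `RDD`, the batch method is called `applyBatch` rather than an overloaded `apply`, chaining uses a method named `andThen`, and `Trim`, `TokenCount`, `ThresholdModel`, and `fitPipeline` are hypothetical names, with `ThresholdModel` a toy replacement for the slide's LinearModel.

```scala
// Local stand-in for the slide's Transformer: Seq replaces RDD so this runs anywhere.
abstract class Transformer[In, Out] { self =>
  def apply(in: In): Out
  def applyBatch(in: Seq[In]): Seq[Out] = in.map(x => apply(x))
  // Chaining is type-checked: this stage's Out must be the next stage's In.
  def andThen[Next](next: Transformer[Out, Next]): Transformer[In, Next] =
    new Transformer[In, Next] { def apply(in: In): Next = next(self(in)) }
}

// An estimator that needs labels; fitting it yields a Transformer.
abstract class LabelEstimator[In, Out, Label] {
  def fit(data: Seq[In], labels: Seq[Label]): Transformer[In, Out]
}

object Trim extends Transformer[String, String] {
  def apply(in: String): String = in.trim
}

object TokenCount extends Transformer[String, Double] {
  def apply(in: String): Double = in.split("\\s+").count(_.nonEmpty).toDouble
}

// Toy classifier: learns a threshold halfway between the two class means.
// Assumes both classes are present and the positive class has the larger mean.
object ThresholdModel extends LabelEstimator[Double, Boolean, Boolean] {
  def fit(data: Seq[Double], labels: Seq[Boolean]): Transformer[Double, Boolean] = {
    val (pos, neg) = data.zip(labels).partition(_._2)
    val cut = (pos.map(_._1).sum / pos.size + neg.map(_._1).sum / neg.size) / 2
    new Transformer[Double, Boolean] { def apply(in: Double): Boolean = in > cut }
  }
}

// Featurize the training data, fit the estimator on the features, then chain the result.
def fitPipeline[In, Mid, Out, L](featurizer: Transformer[In, Mid],
                                 estimator: LabelEstimator[Mid, Out, L],
                                 data: Seq[In], labels: Seq[L]): Transformer[In, Out] =
  featurizer.andThen(estimator.fit(featurizer.applyBatch(data), labels))

val featurizer: Transformer[String, Double] = Trim andThen TokenCount
val data = Seq("a", "b c", "a b c d e", "a b c d e f")
val labels = Seq(false, false, true, true)
val pipeline: Transformer[String, Boolean] = fitPipeline(featurizer, ThresholdModel, data, labels)
```

Because each stage carries its input and output types, chaining a `Transformer[String, Double]` into an estimator that expects some other input type is rejected at compile time, which is the robustness argument the Transformers slide makes.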
