Anne Holler, Michael Mui
Uber
Using Spark MLlib Models in a
Production Training and Serving
Platform: Experiences and Extensions
Introduction
● Michelangelo is Uber’s Machine Learning Platform
○ Supports training, evaluation, and serving of ML models in production
○ Uses Spark MLlib for training and serving at scale
● Michelangelo's use of Spark MLlib has evolved over time
○ Initially used as part of a monolithic training and serving platform, with
hardcoded model pipeline stages saved/loaded from protobuf
○ Initially customized to support online serving, with
online serving APIs added to Transformers ad hoc
What are Spark Pipelines: Estimators and Transformers
Estimator: Spark abstraction of a learning algorithm, or any algorithm that fits or trains on data
Transformer: Spark abstraction of an ML model stage; includes feature transforms and predictors
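To make the abstractions concrete, here is a minimal sketch of fitting a pipeline. It assumes a hypothetical trainingDF DataFrame with category, amount, and label columns; the stage names are illustrative, not from the talk.

import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.feature.{StringIndexer, VectorAssembler}

// Estimator: fit() learns a category-to-index mapping from the data
val indexer = new StringIndexer()
  .setInputCol("category").setOutputCol("categoryIdx")

// Transformer: a pure feature transform with nothing to fit
val assembler = new VectorAssembler()
  .setInputCols(Array("categoryIdx", "amount")).setOutputCol("features")

// Estimator: fit() trains the predictor
val lr = new LogisticRegression()
  .setFeaturesCol("features").setLabelCol("label")

// Pipeline.fit runs each Estimator in order, producing a PipelineModel:
// a sequence of fitted Transformers that can be saved, loaded, and served.
val pipelineModel = new Pipeline()
  .setStages(Array(indexer, assembler, lr))
  .fit(trainingDF) // trainingDF: hypothetical training DataFrame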
Pipeline Models Encode Operational Steps
Pipeline Models Enforce Consistency
● Both Training and Serving involve pre- and post-transform stages, in addition to the raw
fitting and inference of the ML model, and all of these need to be consistent:
○ Data Transformations
○ Feature Extraction and Pre-Processing
○ ML Model Raw Predictions
○ Post-Prediction Transformations
ML Workflow In Practice
Pipeline Models Encapsulate Complexity
Complexity arises from Different Workflow Needs
Complexity arises from Different User Needs
● Research Scientists / Data Scientists / Research/ML Engineers
● Data Analysts / Data Engineers / Software Engineers
● ML Engineers / Production Engineers
Evolution Goal: Retain Performance and Consistency
● Requirement 1: Performant distributed batch serving, using the
DataFrame-based execution model on top of Spark’s SQL engine
● Requirement 2: Low-latency (P99 latency <10ms), high throughput
solution for real-time serving
● Requirement 3: Support consistency in batch and real-time prediction
accuracy by running through common code paths whenever practical
Evolution Goal: Increase Flexibility and Velocity
● Requirement 1: Flexibility in model definitions: libraries, frameworks
○ Allow users to define model pipelines (custom Estimator/Transformer)
○ Train and serve those models efficiently
● Requirement 2: Flexibility in Michelangelo use
○ Decouple its monolithic structure into components
○ Allow interoperability with non-Michelangelo components / pipelines
● Requirement 3: Faster / Easier Spark upgrade path
○ Replace custom protobuf model representation
○ Formalize online serving APIs
Evolve: Replacing Protobuf Model Representation
● Considered MLeap, PMML, PFA, Spark PipelineModel: all supported in Spark MLlib
○ MLeap: non-standard format, impacting interoperability with Spark-compliant ser/de
○ MLeap, PMML, PFA: Lag in supporting new Spark Transformers
○ MLeap, PMML, PFA: Risk of inconsistent model training/serving behavior
● Chose the Spark PipelineModel representation for Michelangelo models
○ Avoids the above shortcomings
○ Provides a simple interface for adding estimators/transformers
○ But has challenges in Online Serving (see Pentreath’s Spark Summit 2018 talk)
■ Spark MLlib PipelineModel load latency too large
■ Spark MLlib serving APIs too slow for online serving
Spark PipelineModel Representation
● Spark PipelineModel format example file structure
├── 0_strIdx_9ec54829bd7c
│   ├── data/
│   │   └── part-00000-a9f31485-4200-4845-8977-8aec7fa03157.snappy.parquet
│   └── metadata/
│       └── part-00000
├── 1_strIdx_5547304a5d3d
│   ├── data/
│   │   └── part-00000-163942b9-a194-4023-b477-a5bfba236eb0.snappy.parquet
│   └── metadata/
│       └── part-00000
├── 2_vecAssembler_29b5569f2d98
│   └── metadata/
│       └── part-00000
├── 3_glm_0b885f8f0843
│   ├── data/
│   │   └── part-00000-0ead8860-f596-475f-96f3-5b10515f075e.snappy.parquet
│   └── metadata/
│       └── part-00000
└── 4_idxToStr_968f207b70f2
    └── metadata/
        └── part-00000
● Format Read/Written by Spark MLReadable/MLWritable
// Implemented by companion objects that can load a saved stage/model from a path
trait MLReadable[T] {
  def read: org.apache.spark.ml.util.MLReader[T]
  def load(path: String): T
}

// Implemented by stages/models that can persist themselves to a path
trait MLWritable {
  def write: org.apache.spark.ml.util.MLWriter
  def save(path: String): Unit
}
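PipelineModel mixes in both traits, so the directory layout above is produced and consumed with a simple round trip. A minimal sketch, assuming the pipelineModel from the earlier example and an illustrative path:

import org.apache.spark.ml.PipelineModel

// Persist a fitted pipeline in the on-disk format shown above
pipelineModel.write.overwrite().save("/models/example")

// Reload it; this load path is what online serving needs to make fast
val loaded = PipelineModel.load("/models/example")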
Challenge: Spark PipelineModel Load Latency
● Zipped Spark Pipeline and
protobuf files were comparable
sizes (up to 10s of MBs)
● Spark Pipeline load latency was
very high relative to custom
protobuf load latency
● Impacts online serving resource
agility and health monitoring
Pipeline Model Type          Spark Pipeline / Protobuf Load
GBDT Regression              21.22x
GBDT Binary Classification   28.63x
Linear Regression            29.94x
Logistic Regression          43.97x
RF Binary Classification     8.05x
RF Regression                12.16x
Tuning Load Latency: Part 1
Replaced sc.textFile with a local metadata read
● DefaultParamsReadable.load uses sc.textFile
● Forming an RDD of strings for a small one-line file was slower than a simple load
● Replaced it with Java I/O for the local-file case, which was much faster (sketched below)
○ Updated loadMetadata method in
mllib/src/main/scala/org/apache/spark/ml/util/ReadWrite.scala
● Big reduction in latency of metadata read
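A hedged sketch of the fast path; loadMetadataFast is illustrative, not the actual patch, and the real change lives inside DefaultParamsReader.loadMetadata:

import java.nio.file.{Files, Paths}
import scala.collection.JavaConverters._
import org.apache.spark.SparkContext

// Read the one-line metadata file with plain Java I/O when it is on the local
// filesystem, instead of forming an RDD with sc.textFile.
def loadMetadataFast(metadataDir: String, sc: SparkContext): String = {
  val localFile = Paths.get(metadataDir, "part-00000")
  if (Files.exists(localFile)) {
    Files.readAllLines(localFile).asScala.mkString("\n") // direct local read
  } else {
    sc.textFile(metadataDir, 1).first() // fall back to the distributed read
  }
}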
Tuning Load Latency: Part 2
Replaced sparkSession.read.parquet with ParquetUtil.read
● Spark distributed read/select for small Transformer data was very slow
● Replaced with direct parquet read/getRecord, which was much faster
○ Relevant to Transformers like LogisticRegression,
StringIndexer, LinearRegression
● Significant reduction in latency of Transformer data read
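ParquetUtil.read is Michelangelo-internal; a rough approximation of the idea using parquet-avro's direct reader, which reads a small Transformer data file without going through Spark's distributed read/select path (names and paths illustrative):

import org.apache.avro.generic.GenericRecord
import org.apache.hadoop.fs.Path
import org.apache.parquet.avro.AvroParquetReader

// Read all records from a small Parquet file directly, bypassing the
// SparkSession entirely.
def readRecords(dataPath: String): Vector[GenericRecord] = {
  val reader = AvroParquetReader.builder[GenericRecord](new Path(dataPath)).build()
  try Iterator.continually(reader.read()).takeWhile(_ != null).toVector
  finally reader.close()
}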
Tuning Load Latency: Part 3
Updated Tree Ensemble model data save and load to use Parquet directly
● Coalesced tree-ensemble node and metadata-weights DataFrames at save
time to avoid writing a large number of small files that are slow to read
● Loading tree ensemble models invoked a groupByKey/sortByKey
○ The Spark distributed read/select/sort/collect was very slow
● Replaced with direct parquet read/getRecord, which was much faster
● Significant reduction in latency of tree ensemble data read
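A hedged sketch of the save-time side, assuming hypothetical nodeDataDF and treeWeightsDF DataFrames holding the ensemble's node data and tree weights (the actual paths and DataFrame names in the patch may differ):

// Coalesce before writing so the load path reads a few reasonably sized
// Parquet files instead of many tiny ones.
nodeDataDF.coalesce(1).write.parquet(s"$modelPath/data")
treeWeightsDF.coalesce(1).write.parquet(s"$modelPath/treesMetadata")

// At load time, a direct Parquet read (as in the Part 2 sketch) replaces the
// distributed read/select/sort/collect, with grouping/sorting done locally.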
Before and After: Tuned Pipeline Load Latency
Greatly improved MLlib load latency, while retaining the current on-disk format!

Pipeline Model Type          Spark Pipeline /   Tuned Spark Pipeline /
                             Protobuf Load      Protobuf Load
GBDT Regression              21.22x             2.05x
GBDT Binary Classification   28.63x             2.50x
Linear Regression            29.94x             2.03x
Logistic Regression          43.97x             2.88x
RF Binary Classification     8.05x              3.14x
RF Regression                12.16x             3.01x
Challenge: SparkContext Cleaner Performance
● Michelangelo online serving creates a local SparkContext to handle loading of
any unoptimized Transformers
● Periodic context-cleaner runs induced non-trivial latency in serving request
responses
● Solution: stop the SparkContext when models are not actively being loaded
○ Model load only happens at service startup or when new models are
deployed into production online serving
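A minimal sketch of that lifecycle, assuming a hypothetical loadUnoptimizedTransformers helper; the point is that the context (and its ContextCleaner) only exists for the duration of a load:

import org.apache.spark.{SparkConf, SparkContext}

// Bring up a local SparkContext just long enough to load models, then stop it
// so the periodic ContextCleaner cannot add latency to serving requests.
def loadModelsWithScopedContext(modelPaths: Seq[String]): Unit = {
  val conf = new SparkConf().setMaster("local[1]").setAppName("model-loader")
  val sc = new SparkContext(conf)
  try {
    modelPaths.foreach(loadUnoptimizedTransformers(sc, _)) // hypothetical helper
  } finally {
    sc.stop() // no live context between loads => no cleaner-induced latency
  }
}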
Challenge: Serving APIs too slow for online serving
● Added an OnlineTransformer trait to Transformers to be served online
○ Single-instance and small-list APIs that leverage low-level Spark predict methods
○ Injected at Transformer load time, so pipeline models trained outside of
Michelangelo can be served online by Michelangelo
trait OnlineTransformer {
  // Score a small batch of instances; each instance maps column name -> value
  def scoreInstances(instances: List[Map[String, Any]]): List[Map[String, Any]]
  // Score a single instance without DataFrame overhead
  def scoreInstance(instance: Map[String, Any]): Map[String, Any]
}
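A minimal usage sketch, reusing the illustrative feature names from the earlier pipeline example; whether the trait is exposed on the whole PipelineModel or per stage is a detail of the patch, and the cast below is shown on the model for brevity:

import org.apache.spark.ml.PipelineModel

// Load a pipeline; OnlineTransformer is injected into it on load
val model = PipelineModel.load("/models/example")

// Score one instance as a plain map, bypassing DataFrame creation
val prediction = model.asInstanceOf[OnlineTransformer]
  .scoreInstance(Map("category" -> "red", "amount" -> 12.5))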
Michelangelo Use of Spark MLlib Evolution Outcome
● Michelangelo is using the updated Spark MLlib interface in production
○ Spark PipelineModel on-disk representation
○ Optimized Transformer loads to support online serving
○ OnlineTransformer trait to provide online serving APIs
Example Use Cases Enabled by Evolved MA MLlib
● Flexible Pipeline Model Definition
○ Model Pipeline including TFTransformer
● Flexible Use of Michelangelo
○ Train Model in Notebook, Serve Model in Michelangelo
Flexible Pipeline Model Definition
● Interoperability with non-Michelangelo components / pipelines
○ Cross framework, system, language support via Estimators /
Transformers
● Allow customizability of PipelineModel, Estimators, and Transformers while
remaining fully integrated with Michelangelo’s Training and Serving infrastructure
○ Combines Spark’s data processing with training using custom
libraries, e.g., XGBoost, TensorFlow
Flexible Pipeline Definition Example: TFTransformer
● Serving TensorFlow models with TFTransformer:
https://eng.uber.com/cota-v2/
○ The Spark Pipeline built from training contains both data-processing
transformers and TensorFlow transformations (TFTransformer)
○ P95 serving latency < 10ms
○ Combines Spark’s distributed computation and low-latency CPU serving
with GPU-accelerated DL training
Serving TF Models using TFTransformer
Flexible Use Example: Train in DSW, Serve in MA
● Decouple Michelangelo into functional components
● Consolidate custom data processing, feature engineering, model definition,
training, and serving around notebook environments (DSW, Uber’s Data
Science Workbench)
Experiment in DSW, Serve in Michelangelo
Key Learnings in Evolving Michelangelo
● Pipeline representation of models is powerful
○ Encodes all steps in operational modeling
○ Enforces consistency between training and serving
● Pipeline representation of models needs to be flexible
○ Model pipeline can encapsulate complex stages
○ Complexity stems from differing workflow and user needs
Conclusion
● Michelangelo’s updated use of Spark MLlib is working well in production
● Proposing to open source our changes to Spark MLlib
○ Submitted the Spark MLlib Online Serving SPIP
■ https://issues.apache.org/jira/browse/SPARK-26247
○ Posted 2 patches
■ Patch to reduce Spark pipeline load latency
■ Patch to add the OnlineTransformer trait for online serving APIs
Questions?