DBG / Apr 19, 2018 / © 2018 IBM Corporation
Productionizing
Spark ML Pipelines with the
Portable Format for Analytics
—
Nick Pentreath
Principal Engineer, IBM
@MLnick
About
@MLnick on Twitter & GitHub
Principal Engineer, IBM
CODAIT - Center for Open-Source Data & AI Technologies
Machine Learning & AI
Apache Spark committer & PMC
Author of Machine Learning with Spark
Various conferences & meetups
Agenda
The Machine Learning Workflow
Challenges of ML Deployment
Portable Format for Analytics
PFA for Spark ML
Performance Comparisons
Summary and Future Directions
Perception
The Machine Learning Workflow
In reality the workflow spans teams …
The Machine Learning Workflow
… and tools …
The Machine Learning Workflow
… and is a small (but critical!)
piece of the puzzle
The Machine Learning Workflow
*Source: Hidden Technical Debt in Machine Learning Systems
Challenges
Machine Learning Deployment
• Need to manage and bridge many different:
• Languages - Python, R, Notebooks, Scala / Java / C
• Frameworks – too many to count!
• Dependencies
• Versions
• Performance characteristics can be highly
variable across these dimensions
• Lack of standardization leads to custom
solutions
• Where standards exist, limitations lead to
custom extensions, eliminating the benefits
• Friction between teams
• Data scientists & researchers – latest & greatest
• Production – stability, control, minimize changes,
performance
• Business – metrics, business impact, product must
always work!
• Note:
• “Deployment” in this context is different from
“deployment” in the purely devops sense
• e.g. containers are useful but incomplete solutions
Challenges specific to Spark
Machine Learning Deployment
• Tight coupling to Spark runtime
• Introduces complex dependencies
• Managing version & compatibility issues
• Scoring models in Spark is slow
• Overhead of DataFrames, especially query
planning
• Overhead of task scheduling, even locally
• Optimized for batch scoring (includes
streaming “micro-batch” settings)
• Spark is not suitable for real-time scoring
(latency under a few hundred ms)
• Currently, in order to use trained models
(pipelines) outside of Spark, users must:
• Write custom readers for Spark’s native format; or
• Create their own custom format; or
• Export to a standard format (not currently supported
within Spark, hence requiring a custom solution)
• To score models outside of Spark, users must also write
their own custom translation between Spark ML
components and an existing (or custom) ML library
Everything is custom!
Overview
Portable Format for Analytics
• PFA is being championed by the Data Mining
Group (IBM is a founding member)
• DMG previously created PMML (Predictive
Model Markup Language), arguably the only
viable open standard currently
• PMML has many limitations
• PFA was created specifically to address these
shortcomings
• PFA consists of:
• JSON serialization format
• Avro schemas for data types
• Encodes functions (actions) that are applied to inputs
to create outputs with a set of built-in functions and
language constructs (e.g. control-flow, conditionals)
• Essentially a mini functional math language + schema
specification
• Type and function system means PFA can be
fully & statically verified on load and run by any
compliant execution engine
• => true portability across languages,
frameworks, run times and versions
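To make the "JSON + Avro schemas + action" structure concrete, here is a minimal sketch of a PFA document built as a Python dictionary and serialized to JSON. The add-one action is an illustrative choice, not taken from the talk. The `toy_run` helper is hypothetical: it understands only the `+` function and stands in for a real engine such as Hadrian, which would also verify the types on load.

```python
import json

# A minimal PFA document: "input" and "output" are Avro schemas,
# and "action" is a list of expressions built from built-in functions.
pfa_document = {
    "input": "double",
    "output": "double",
    "action": [{"+": ["input", 1]}],
}

# The document is plain JSON, so any language can produce or consume it.
pfa_json = json.dumps(pfa_document, indent=2)
print(pfa_json)


def toy_run(document, datum):
    """Toy stand-in for a PFA engine (assumption: supports only "+")."""

    def evaluate(expr):
        if expr == "input":
            return datum
        if isinstance(expr, (int, float)):
            return expr
        if isinstance(expr, dict) and list(expr) == ["+"]:
            left, right = expr["+"]
            return evaluate(left) + evaluate(right)
        raise ValueError(f"unsupported expression: {expr!r}")

    result = None
    for expr in document["action"]:
        result = evaluate(expr)  # the last expression is the output
    return result


print(toy_run(pfa_document, 2.0))
```

Because the producer only emits JSON, the consumer needs no knowledge of the framework that trained the model; it needs only a compliant engine.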
A Simple Example
Portable Format for Analytics
• Example – multi-class logistic regression
• Specify input and output types using Avro
schemas
• Specify the action to perform (typically on input)
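The slide's own example is not preserved in this transcript, so the following is a reconstruction under assumptions rather than the author's document. It sketches what a multi-class logistic regression could look like in PFA, using the function names `model.reg.linear`, `m.link.softmax` and `a.argmax` as I recall them from the PFA function library (worth checking against the specification). The coefficients are made up. A plain-Python `predict` function computes the same quantity for comparison.

```python
import json
import math

# Hypothetical 3-class model over 2 features (made-up numbers).
coeff = [[1.0, -1.0], [0.5, 0.5], [-1.0, 2.0]]
const = [0.0, 0.1, -0.2]

array_of_double = {"type": "array", "items": "double"}

pfa_document = {
    # Input: a feature vector. Output: the index of the predicted class.
    "input": array_of_double,
    "output": "int",
    # Model parameters are stored in a cell, typed by an Avro record schema.
    "cells": {
        "model": {
            "type": {
                "type": "record",
                "name": "Model",
                "fields": [
                    {"name": "coeff", "type": {"type": "array", "items": array_of_double}},
                    {"name": "const", "type": array_of_double},
                ],
            },
            "init": {"coeff": coeff, "const": const},
        }
    },
    # Action: linear margins -> softmax -> index of the largest probability.
    "action": [
        {"a.argmax": [
            {"m.link.softmax": [
                {"model.reg.linear": ["input", {"cell": "model"}]}
            ]}
        ]}
    ],
}

print(json.dumps(pfa_document, indent=2))


def predict(features):
    """Plain-Python reference for the computation the action describes."""
    margins = [
        sum(w * x for w, x in zip(row, features)) + b
        for row, b in zip(coeff, const)
    ]
    peak = max(margins)  # subtract the max for numerical stability
    exps = [math.exp(m - peak) for m in margins]
    total = sum(exps)
    probs = [e / total for e in exps]
    return probs.index(max(probs))
```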
Managing State
Portable Format for Analytics
• Data storage specified by cells
• A cell is a named value acting as a global variable
• Typically used to store state (such as model
coefficients, vocabulary mappings, etc)
• Types specified with Avro schemas
• Cell values are mutable within an action, but
immutable between action executions of a given PFA
document
• Persistent storage specified by pools
• Closer in concept to a database
• Pool values are mutable across action executions
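As a sketch of how such state is declared, the document below defines two cells and one pool, each typed with an Avro schema and given an initial value. All names and values are hypothetical. The `cell` form with a `path` is written from my reading of the PFA specification and should be verified there. The `lookup` helper is a plain-Python analogue of the action, not an engine.

```python
import json

# Cell and pool declarations; every name and value here is hypothetical.
pfa_document = {
    "input": "string",
    "output": "int",
    "cells": {
        # Model coefficients: an array of doubles.
        "coefficients": {
            "type": {"type": "array", "items": "double"},
            "init": [0.5, -1.2, 3.0],
        },
        # Vocabulary mapping: token -> index.
        "vocabulary": {
            "type": {"type": "map", "values": "int"},
            "init": {"spark": 0, "pfa": 1},
        },
    },
    # A pool is keyed storage; "type" describes each stored item.
    "pools": {"perUserCounts": {"type": "int", "init": {}}},
    # Look up the input token's index in the vocabulary cell.
    "action": [{"cell": "vocabulary", "path": ["input"]}],
}

print(json.dumps(pfa_document, indent=2))


def lookup(document, token):
    """Plain-Python analogue of the action: read the cell's initial value."""
    return document["cells"]["vocabulary"]["init"][token]


print(lookup(pfa_document, "pfa"))
```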
Other Features
Portable Format for Analytics
• Special forms
• Control structure – conditionals & loops
• Creating and manipulating local variables
• User-defined functions including lambdas
• Casts
• Null checks
• (Very) basic try-catch, user-defined errors and logs
• Comprehensive built-in function library
• Math, strings, arrays, maps, stats, linear algebra
• Built-in support for some common models - decision
tree, clustering, linear models
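A few of these special forms can be sketched in one small document: a user-defined function under `fcns`, a local variable via `let`, and an `if`/`then`/`else` conditional. The example logic (square the input, keep it only if it exceeds 4) is made up for illustration. The `toy_run` interpreter is hypothetical and covers only the forms used here; a real engine would also type-check the document.

```python
import json

pfa_document = {
    "input": "double",
    "output": "double",
    # User-defined function, called as "u.square".
    "fcns": {
        "square": {
            "params": [{"x": "double"}],
            "ret": "double",
            "do": [{"*": ["x", "x"]}],
        }
    },
    "action": [
        # Local variable.
        {"let": {"y": {"u.square": ["input"]}}},
        # Conditional; the last expression is the action's output.
        {"if": {">": ["y", 4.0]}, "then": "y", "else": 0.0},
    ],
}

print(json.dumps(pfa_document, indent=2))


def toy_run(document, datum):
    """Toy interpreter for the handful of forms used above; not a PFA engine."""

    def evaluate(expr, scope):
        if isinstance(expr, (int, float)):
            return expr
        if isinstance(expr, str):
            return scope[expr]  # a bare string is a variable reference
        if "let" in expr:
            for name, value in expr["let"].items():
                scope[name] = evaluate(value, scope)
            return None
        if "if" in expr:
            branch = "then" if evaluate(expr["if"], scope) else "else"
            return evaluate(expr[branch], scope)
        # Otherwise: a single-key object naming a function and its arguments.
        (name, args), = expr.items()
        values = [evaluate(arg, scope) for arg in args]
        if name == "*":
            return values[0] * values[1]
        if name == ">":
            return values[0] > values[1]
        if name.startswith("u."):
            fcn = document["fcns"][name[2:]]
            params = [next(iter(p)) for p in fcn["params"]]
            return run_body(fcn["do"], dict(zip(params, values)))
        raise ValueError(f"unsupported function: {name}")

    def run_body(body, scope):
        result = None
        for expr in body:
            result = evaluate(expr, scope)
        return result

    return run_body(document["action"], {"input": datum})


print(toy_run(pfa_document, 3.0))
print(toy_run(pfa_document, 1.0))
```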
Aardpfark
PFA and Spark ML
• PFA export for Spark ML pipelines
• aardpfark-core – Scala DSL for creating PFA
documents
• avro4s to generate schemas from case classes; json4s to
serialize PFA document to JSON
• aardpfark-sparkml – uses DSL to export Spark
ML components and pipelines to PFA
• Coverage
• Almost all predictors (ML models)
• Most feature transformers
• Pipeline support
• Equivalence tests Spark <-> PFA
Aardpfark - Challenges
PFA and Spark ML
• Spark ML Model has no schema knowledge
• E.g. Binarizer can operate on numeric or vector
columns
• Need to use Avro union types for standalone PFA
components and handle all cases in the action logic
• Combining components into a pipeline
• Trying to match Spark’s DataFrame-based
input/output behavior (typically appending columns)
• Each component is wrapped as a user-defined
function in the PFA document
• Current approach mimics passing a Row (i.e. Avro
record) from function to function, adding fields
• Missing features in PFA
• Generic vector support (mixed dense/sparse)
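The first two challenges can be illustrated with a plain-Python analogue. This is a sketch of the idea only, not Aardpfark's code: the function names, column names and thresholds are hypothetical. The Avro union schema lets a standalone Binarizer accept either a double or an array of doubles, and each pipeline stage takes a record and returns it with one field added, mimicking how Spark appends columns.

```python
# Avro union schema: the input column holds a double or an array of doubles.
binarizer_input_schema = ["double", {"type": "array", "items": "double"}]


def binarize(value, threshold):
    """Handle both union branches explicitly, as the action logic must."""
    if isinstance(value, list):
        return [1.0 if v > threshold else 0.0 for v in value]
    return 1.0 if value > threshold else 0.0


def binarizer_stage(row, input_col, output_col, threshold):
    """Return a copy of the record with one appended field."""
    return {**row, output_col: binarize(row[input_col], threshold)}


def run_pipeline(row, stages):
    """Pass the record from function to function, each adding fields."""
    for stage in stages:
        row = stage(row)
    return row


# Hypothetical two-stage pipeline over a numeric and a vector column.
stages = [
    lambda r: binarizer_stage(r, "age", "age_bin", 30.0),
    lambda r: binarizer_stage(r, "scores", "scores_bin", 0.5),
]
result = run_pipeline({"age": 42.0, "scores": [0.2, 0.9]}, stages)
print(result)
```

Each stage leaves its input fields untouched, so later stages can still read any earlier column, which is the behavior the Row-passing approach is trying to preserve.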
Similar projects
Standards for Machine Learning Deployment
• PMML
• Predecessor to PFA
• Model interchange format in XML with operators
• Widely used and supported; open standard
• Spark support lacking natively but 3rd party projects
available: jpmml-sparkml
• Comprehensive support for Spark ML components
(perhaps surprisingly!)
• Watch SPARK-11237
• Shortcomings of PMML as previously discussed
• Works very well for supported models and
operators
Similar projects
Standards for Machine Learning Deployment
• MLeap
• Created by Combust.ML, a startup focused on ML
model serving
• Model interchange format in JSON / Protobuf
• Components implemented in Scala code
• Initially focused on Spark ML. Offers almost complete
support for Spark ML components
• Recently added some sklearn; working on TensorFlow
• “Open” format, but not a “standard”
• No concept of well-defined operators / functions
• Effectively forces a tight coupling between versions of
model producer / consumer
Similar projects
Standards for Machine Learning Deployment
• Open Neural Network Exchange (ONNX)
• Championed by Facebook & Microsoft
• Protobuf serialization format
• Describes computation graph (including operators)
• Similar to PFA in that the serialized graph is
“self-describing”
• More focused on Deep Learning / tensor operations
• No or poor support for more “traditional” ML or
language constructs (currently)
• Tree-based models & ensembles
• String / categorical processing
• Control flow
• Intermediate variables
Scoring Performance Comparison
Performance
• Comparing scoring performance of PFA with
Spark and MLeap
• PFA uses Hadrian reference implementation for
JVM
• Test dataset of ~80,000 records
• String indexing of 47 categorical columns
• Vector assembling the 47 categorical indices together
with 27 numerical columns
• Linear regression predictor
• Note: Spark time is 1.9 s / record (1901 ms), not
shown on the chart
[Chart: Average execution time, elapsed time / record (ms), MLeap vs PFA; y-axis from 0 to 1.2 ms]
Summary
Summary and Future Directions
• PFA provides an open standard for serialization
and deployment of analytic workflows
• Portability across languages, frameworks, runtimes
and versions
• Execution environment is independent of the producer
(R, scikit-learn, Spark ML, weka, etc)
• Solves a significant pain point for the Spark ML
ecosystem
• Also benefits the wider ML ecosystem (e.g.
many currently use PMML for exporting models
from R, scikit-learn, XGBoost, LightGBM, etc)
• However there are risks
• PFA is still young and needs to gain adoption
• Performance in production, at scale, is relatively
untested
• Tests indicate PFA reference engines need some
work on robustness and performance
• What about Deep Learning / comparison to ONNX?
• Limitations of PFA
• A standard can move slowly in terms of new features,
fixes and enhancements
Future directions
Summary and Future Directions
• Open source release of Aardpfark
• Initially focused on Spark ML pipelines
• Later add support for scikit-learn pipelines, XGBoost,
LightGBM, etc
• (Support for many R models exists already in the
Hadrian project)
• Further performance testing in progress vs Spark &
MLeap
• More automated translation (Scala -> PFA, ASTs etc)
• Propose improvements to PFA
• Generic vector (tensor) support
• Less cumbersome schema definitions
• Performance improvements to scoring engine
• PFA for Deep Learning?
• Comparing to ONNX and other emerging standards
• Better suited for the more general pre-processing
steps of DL pipelines
• Requires all the various DL-specific operators
• Requires tensor schema and better tensor support
built-in to the PFA spec
• Should have GPU support
Thank you!
Nick Pentreath
Principal Engineer
—
nickp@za.ibm.com
@MLnick
ibm.com
Links & References
Portable Format for Analytics
PMML
Spark MLlib – Saving and Loading Pipelines
Hadrian – Reference Implementation of PFA Engines for JVM, Python, R
jpmml-sparkml
MLeap
Open Neural Network Exchange