Productionizing Spark ML pipelines with the portable format for analytics

DBG / Apr 19, 2018 / © 2018 IBM Corporation
Productionizing Spark ML Pipelines with the Portable Format for Analytics
Nick Pentreath
Principal Engineer, IBM
@MLnick

About
@MLnick on Twitter & GitHub
Principal Engineer, IBM
CODAIT - Center for Open-Source Data & AI
Technologies
Machine Learning & AI
Apache Spark committer & PMC
Author of Machine Learning with Spark
Various conferences & meetups

Agenda
The Machine Learning Workflow
Challenges of ML Deployment
PFA for Spark ML
Performance Comparisons
Summary and Future Directions

Perception

In reality the workflow spans teams …

… and tools …

… and is a small (but critical!)
piece of the puzzle
*Source: Hidden Technical Debt in Machine Learning Systems

Challenges
Machine Learning Deployment
• Need to manage and bridge many different:
• Languages - Python, R, Notebooks, Scala / Java / C
• Frameworks – too many to count!
• Dependencies
• Versions
• Performance characteristics can be highly
variable across these dimensions
• Lack of standardization leads to custom
solutions
• Where standards exist, limitations lead to
custom extensions, eliminating the benefits
• Friction between teams
• Data scientists & researchers – latest & greatest
• Production – stability, control, minimize changes,
performance
• Business – metrics, business impact, product must
always work!
• Note:
• “Deployment” in this context is different from
“deployment” in the purely devops sense
• e.g. containers are useful but incomplete solutions

Challenges specific to Spark
Machine Learning Deployment
• Tight coupling to Spark runtime
• Introduces complex dependencies
• Managing version & compatibility issues
• Scoring models in Spark is slow
• Overhead of DataFrames, especially query
planning
• Overhead of task scheduling, even locally
• Optimized for batch scoring (includes
streaming “micro-batch” settings)
• Spark is not suitable for real-time scoring (latency
under a few hundred ms)
• Currently, in order to use trained models
(pipelines) outside of Spark, users must:
• Write custom readers for Spark’s native format; or
• Create their own custom format; or
• Export to a standard format (not currently supported
within Spark, hence requiring a custom solution)
• To score models outside of Spark, users must also write
their own custom translation between Spark ML
components and an existing (or custom) ML library
Everything is custom!

Overview
• PFA is being championed by the Data Mining
Group (IBM is a founding member)
• DMG previously created PMML (Predictive
Model Markup Language), arguably the only
viable open standard currently
• PMML has many limitations
• PFA was created specifically to address these
shortcomings
• PFA consists of:
• JSON serialization format
• Avro schemas for data types
• Encodes functions (actions) that are applied to inputs
to create outputs with a set of built-in functions and
language constructs (e.g. control-flow, conditionals)
• Essentially a mini functional math language + schema
specification
• Type and function system means PFA can be
fully & statically verified on load and run by any
compliant execution engine
• => true portability across languages,
frameworks, run times and versions

A Simple Example
• Example – multi-class logistic regression
• Specify input and output types using Avro
schemas
• Specify the action to perform (typically on input)
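
As a rough sketch of what such a document might look like, the Python snippet below builds a PFA document for a three-class logistic regression as a plain dictionary and serializes it to JSON. The function names `model.reg.linear`, `m.link.softmax` and `a.argmax` are recalled from the PFA function library and should be checked against the specification; the cell name `model`, the record name `LogisticModel` and all coefficient values are made up for illustration. The `reference_score` helper is not a PFA engine, only a plain-Python mirror of what the action is meant to compute.

```python
import json
import math

# Illustrative PFA document (assumed layout): the input is a feature vector,
# the output is the index of the most probable class.
pfa_doc = {
    "input": {"type": "array", "items": "double"},
    "output": "int",
    "cells": {
        "model": {
            # Avro record schema describing the stored model state.
            "type": {
                "type": "record",
                "name": "LogisticModel",  # hypothetical record name
                "fields": [
                    {"name": "coeff",
                     "type": {"type": "array",
                              "items": {"type": "array", "items": "double"}}},
                    {"name": "const",
                     "type": {"type": "array", "items": "double"}},
                ],
            },
            # Made-up coefficients: one row per class, one column per feature.
            "init": {
                "coeff": [[1.0, -1.0], [0.0, 0.5], [-1.0, 1.0]],
                "const": [0.0, 0.1, 0.0],
            },
        }
    },
    # Linear margins -> softmax -> index of the largest probability.
    "action": [
        {"a.argmax": [
            {"m.link.softmax": [
                {"model.reg.linear": ["input", {"cell": "model"}]}
            ]}
        ]}
    ],
}


def reference_score(doc, features):
    """Plain-Python mirror of the action above (not a PFA engine)."""
    model = doc["cells"]["model"]["init"]
    margins = [
        sum(w * x for w, x in zip(row, features)) + b
        for row, b in zip(model["coeff"], model["const"])
    ]
    peak = max(margins)  # subtract the max for numerical stability
    exps = [math.exp(m - peak) for m in margins]
    total = sum(exps)
    probs = [e / total for e in exps]
    return probs.index(max(probs))


# The JSON text is what would be handed to a PFA execution engine.
pfa_json = json.dumps(pfa_doc, indent=2)
```

Because the document carries both its schemas and its action, a compliant engine can check the types when it loads the JSON, before any record is scored.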

Managing State
• Data storage specified by cells
• A cell is a named value acting as a global variable
• Typically used to store state (such as model
coefficients, vocabulary mappings, etc)
• Types specified with Avro schemas
• Cell values are mutable within an action, but
immutable between action executions of a given PFA
document
• Persistent storage specified by pools
• Closer in concept to a database
• Pool values are mutable across action executions
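
To make the distinction concrete, here is a minimal sketch of how cells and pools might be declared and accessed, again written as a Python dictionary that serializes to PFA JSON. The names `vocabulary` and `seen` and their initial values are hypothetical, and the exact form of the pool update (`to` plus `init`) is an assumption that should be verified against the PFA specification.

```python
import json

# Illustrative PFA document (assumed layout) with one cell and one pool.
pfa_state_doc = {
    "input": "string",
    "output": "int",
    "cells": {
        # A cell: a single named value, here a term -> index mapping.
        "vocabulary": {
            "type": {"type": "map", "values": "int"},
            "init": {"spark": 0, "pfa": 1},
        },
    },
    "pools": {
        # A pool: a keyed store whose entries each have the declared type.
        "seen": {"type": "int", "init": {}},
    },
    "action": [
        # Update the pool entry keyed by the input term; `init` is the
        # value used when the key is not present yet (assumed semantics).
        {"pool": "seen", "path": ["input"],
         "to": {"params": [{"old": "int"}], "ret": "int",
                "do": {"+": ["old", 1]}},
         "init": 0},
        # Read the index of the input term from the cell.
        {"cell": "vocabulary", "path": ["input"]},
    ],
}

# Names of the declared state, grouped by storage kind.
state_names = {kind: sorted(pfa_state_doc[kind]) for kind in ("cells", "pools")}

pfa_state_json = json.dumps(pfa_state_doc, indent=2)
```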

Other Features
• Special forms
• Control structure – conditionals & loops
• Creating and manipulating local variables
• User-defined functions including lambdas
• Casts
• Null checks
• (Very) basic try-catch, user-defined errors and logs
• Comprehensive built-in function library
• Math, strings, arrays, maps, stats, linear algebra
• Built-in support for some common models - decision
tree, clustering, linear models
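
A few of these special forms can be sketched in one small document: a user-defined function, a conditional, and a null check on a nullable input. The function name `clip` and the fallback value are hypothetical; the `reference_action` helper is a plain-Python mirror of the action, not a PFA engine.

```python
import json

# Illustrative PFA document (assumed layout): nullable double in, double out.
pfa_forms_doc = {
    "input": ["null", "double"],  # Avro union: the value may be missing
    "output": "double",
    "fcns": {
        # User-defined function, called from the action as "u.clip".
        "clip": {
            "params": [{"x": "double"}],
            "ret": "double",
            "do": {"if": {"<": ["x", 0.0]}, "then": 0.0, "else": "x"},
        }
    },
    "action": [
        # Null check: bind the non-null value to x, else fall back to 0.0.
        {"ifnotnull": {"x": "input"},
         "then": {"u.clip": ["x"]},
         "else": 0.0}
    ],
}


def reference_action(value):
    """Plain-Python mirror of the action above (not a PFA engine)."""
    if value is None:
        return 0.0
    return 0.0 if value < 0.0 else value


pfa_forms_json = json.dumps(pfa_forms_doc, indent=2)
```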

Aardpfark
PFA and Spark ML
• PFA export for Spark ML pipelines
• aardpfark-core – Scala DSL for creating PFA
documents
• avro4s to generate schemas from case classes; json4s to
serialize PFA document to JSON
• aardpfark-sparkml – uses DSL to export Spark
ML components and pipelines to PFA
• Coverage
• Almost all predictors (ML models)
• Most feature transformers
• Pipeline support
• Equivalence tests Spark <-> PFA

Aardpfark - Challenges
PFA and Spark ML
• Spark ML Model has no schema knowledge
• E.g. Binarizer can operate on numeric or vector
columns
• Need to use Avro union types for standalone PFA
components and handle all cases in the action logic
• Combining components into a pipeline
• Trying to match Spark’s DataFrame-based
input/output behavior (typically appending columns)
• Each component is wrapped as a user-defined
function in the PFA document
• Current approach mimics passing a Row (i.e. Avro
record) from function to function, adding fields
• Missing features in PFA
• Generic vector support (mixed dense/sparse)
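
The Binarizer case above can be sketched as follows. This is not Aardpfark's actual output, only an illustration of the idea under stated assumptions: the input is an Avro union of a double and an array of doubles, and a `cast` form handles each case separately. The cell name `threshold` and its value are hypothetical, and `reference_binarize` is a plain-Python mirror of the action rather than a PFA engine.

```python
import json

# Avro union: a Binarizer input may be a single number or a dense vector.
NUMERIC_OR_VECTOR = ["double", {"type": "array", "items": "double"}]

# Shared conditional: 1.0 if x is above the threshold cell, else 0.0.
BINARIZE_X = {
    "if": {">": ["x", {"cell": "threshold"}]},
    "then": 1.0,
    "else": 0.0,
}

# Illustrative standalone PFA document (assumed layout).
pfa_binarizer_doc = {
    "input": NUMERIC_OR_VECTOR,
    "output": NUMERIC_OR_VECTOR,
    "cells": {"threshold": {"type": "double", "init": 0.5}},
    "action": [
        {"cast": "input", "cases": [
            # Case 1: scalar input, binarize it directly.
            {"as": "double", "named": "x", "do": BINARIZE_X},
            # Case 2: vector input, binarize each element with a lambda.
            {"as": {"type": "array", "items": "double"}, "named": "v",
             "do": {"a.map": ["v", {"params": [{"x": "double"}],
                                    "ret": "double",
                                    "do": BINARIZE_X}]}},
        ]}
    ],
}


def reference_binarize(value, threshold=0.5):
    """Plain-Python mirror of the action above (not a PFA engine)."""
    if isinstance(value, list):
        return [1.0 if x > threshold else 0.0 for x in value]
    return 1.0 if value > threshold else 0.0


pfa_binarizer_json = json.dumps(pfa_binarizer_doc, indent=2)
```

Handling every member of the union inside the action is what makes the component usable on its own, at the cost of more verbose documents than a schema-aware exporter would need.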

Similar projects
Standards for Machine Learning Deployment
• PMML
• Predecessor to PFA
• Model interchange format in XML with operators
• Widely used and supported; open standard
• Spark support lacking natively but 3rd party projects
available: jpmml-sparkml
• Comprehensive support for Spark ML components
(perhaps surprisingly!)
• Watch SPARK-11237
• Shortcomings of PMML as previously discussed
• Works very well for supported models and
operators

Similar projects
• MLeap
• Created by Combust.ML, a startup focused on ML
model serving
• Model interchange format in JSON / Protobuf
• Components implemented in Scala code
• Initially focused on Spark ML. Offers almost complete
support for Spark ML components
• Recently added some sklearn; working on TensorFlow
• “Open” format, but not a “standard”
• No concept of well-defined operators / functions
• Effectively forces a tight coupling between versions of
model producer / consumer

Similar projects
• Open Neural Network Exchange (ONNX)
• Championed by Facebook & Microsoft
• Protobuf serialization format
• Describes computation graph (including operators)
• In this way it is similar to PFA in the sense that the serialized
graph is “self-describing”
• More focused on Deep Learning / tensor operations
• No or poor support for more “traditional” ML or
language constructs (currently)
• Tree-based models & ensembles
• String / categorical processing
• Control flow
• Intermediate variables

Scoring Performance Comparison
Performance
• Comparing scoring performance of PFA with
Spark and MLeap
• PFA uses Hadrian reference implementation for
JVM
• Test dataset of ~80,000 records
• String indexing of 47 categorical columns
• Vector assembling the 47 categorical indices together
with 27 numerical columns
• Linear regression predictor
• Note: Spark time is 1.9 s / record (1901 ms); not
shown on the chart
[Chart: Average execution time, elapsed time / record (ms), MLeap vs PFA]

Summary
• PFA provides an open standard for serialization
and deployment of analytic workflows
• Portability across languages, frameworks, runtimes
and versions
• Execution environment is independent of the producer
(R, scikit-learn, Spark ML, weka, etc)
• Solves a significant pain point for the Spark ML
ecosystem
• Also benefits the wider ML ecosystem (e.g.
many currently use PMML for exporting models
from R, scikit-learn, XGBoost, LightGBM, etc)
• However there are risks
• PFA is still young and needs to gain adoption
• Performance in production, at scale, is relatively
untested
• Tests indicate PFA reference engines need some
work on robustness and performance
• What about Deep Learning / comparison to ONNX?
• Limitations of PFA
• A standard can move slowly in terms of new features,
fixes and enhancements

Future directions
• Open source release of Aardpfark
• Initially focused on Spark ML pipelines
• Later add support for scikit-learn pipelines, XGBoost,
LightGBM, etc
• (Support for many R models exists already in the
Hadrian project)
• Further performance testing in progress vs Spark &
MLeap
• More automated translation (Scala -> PFA, ASTs etc)
• Propose improvements to PFA
• Generic vector (tensor) support
• Less cumbersome schema definitions
• Performance improvements to scoring engine
• PFA for Deep Learning?
• Comparing to ONNX and other emerging standards
• Better suited for the more general pre-processing
steps of DL pipelines
• Requires all the various DL-specific operators
• Requires tensor schema and better tensor support
built-in to the PFA spec
• Should have GPU support

Thank you!
Nick Pentreath
Principal Engineer
—
nickp@za.ibm.com
@MLnick
ibm.com

Links & References
PMML
Spark MLlib – Saving and Loading Pipelines
Hadrian – Reference Implementation of PFA Engines for JVM, Python, R
jpmml-sparkml
MLeap
Open Neural Network Exchange
