On the representation and reuse of
machine learning models
Villu Ruusmann
Openscoring OÜ
https://github.com/jpmml
2
Def: "Model"
Output = func(Input)
3
Def: "Representation"
Generic: data structure
Specific: application code
4
The problem
"Train once, deploy anywhere"
5
A solution
Matching model representation (MR) with the task at hand:
1. Storing a generic and stable MR
2. Generating a wide variety of more specific and volatile
MRs upon request
6
The Predictive Model Markup Language (PMML)
● XML dialect for marking up models and associated data
transformations
● Version 1.0 in 1999, version 4.3 in 2016
● "Conventions over configuration"
● 17 top-level model types + ensembling
http://dmg.org/
http://dmg.org/pmml/pmml-v4-3.html
http://dmg.org/pmml/products.html
7
A continuum from black to white boxes
Introducing transparency in the form of rich, easy to use,
well-documented APIs:
1. Unmarshalling and marshalling
2. Static analyses. Ex: schema querying
3. Dynamic analyses. Ex: scoring
4. Tracing and explaining individual predictions
8
The Zen of Machine Learning
"Making the model requires large data and many cpus.
Using it does not"
--darren
https://www.mail-archive.com/user@spark.apache.org/msg40636.html
9
Model training workflow
Real-world feature space → ML-platform feature space → ML-platform model
10
Model deployment workflow
Real-world feature space → ML-platform feature space → ML-platform model
vs.
Real-world feature space → Real-world model
11
Model resources
Training: R code, Scikit-Learn code, Apache Spark ML code
Versioned storage: Original PMML markup
Deployment: Java code, Python code, Optimized PMML markup
12
Comparison of model persistence options
Criterion | R | Scikit-Learn | Apache Spark ML
Model data structure stability | Fair to excellent | Fair | Poor
Native serialization data format | RDS (binary) | Pickle (binary) | SER (binary) and JSON (text)
Export to PMML | Few external | N/A | Built-in trait PMMLWritable
Import from PMML | Few external | N/A | N/A
JPMML projects | JPMML-R and r2pmml | JPMML-SkLearn and sklearn2pmml | JPMML-SparkML (-Package)
13
PMML production: R
library("r2pmml")
auto <- read.csv("Auto.csv")
auto$origin <- as.factor(auto$origin)
auto.formula <- formula(mpg ~
(.) ^ 2 + # simple features and their two way interactions
I(displacement / cylinders) + I(log(weight))) # derived features
auto.lm <- lm(auto.formula, data = auto)
r2pmml(auto.lm, "auto_lm.pmml", dataset = auto)
auto.glm <- glm(auto.formula, data = auto, family = "gaussian")
r2pmml(auto.glm, "auto_glm.pmml", dataset = auto)
14
R quirks
● No pipeline concept. Some workflow standardization
efforts by third parties. Ex: caret package
● Many (equally right) ways of doing the same thing.
Ex: "formula interface" vs. "matrix interface"
● High variance in the design and quality of packages.
Ex: academia vs. industry
● Model objects may enclose the training data set
15
PMML production: Scikit-Learn
import pandas
from sklearn.preprocessing import LabelBinarizer, LabelEncoder
from sklearn.tree import DecisionTreeClassifier
from sklearn_pandas import DataFrameMapper
from sklearn2pmml import sklearn2pmml
from sklearn2pmml.decoration import CategoricalDomain, ContinuousDomain

audit_df = pandas.read_csv("Audit.csv")
# Map named DataFrame columns to a positional feature matrix
audit_mapper = DataFrameMapper([
    (["Age", "Income", "Hours"], ContinuousDomain()),
    (["Employment", "Education", "Marital", "Occupation"], [CategoricalDomain(), LabelBinarizer()]),
    (["Gender", "Deductions"], [CategoricalDomain(), LabelEncoder()]),
    ("Adjusted", None)])
audit = audit_mapper.fit_transform(audit_df)
audit_classifier = DecisionTreeClassifier(min_samples_split = 10)
# Columns 0 through 47 hold the features; column 48 holds the "Adjusted" label
audit_classifier.fit(audit[:, 0:48], audit[:, 48].astype(int))
sklearn2pmml(audit_classifier, audit_mapper, "audit_tree.pmml")
16
Scikit-Learn quirks
● Completely schema-less at algorithm level. Ex: no
identification of columns, no tracking of column groups
● Very limited, simple data structures. Mix of Python and C
● No built-in persistence mechanism. Serialization in
generic pickle data format. Upon de-serialization, hope
that class definitions haven't changed in the meantime.
17
PMML production: Apache Spark ML
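// $ spark-shell --packages org.jpmml:jpmml-sparkml-package:1.0-SNAPSHOT ..
import java.nio.file.{Files, Paths}
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.feature.RFormula
import org.apache.spark.ml.regression.DecisionTreeRegressor
import org.jpmml.sparkml.ConverterUtil

val df = spark.read.option("header", "true").option("inferSchema", "true").csv("Wine.csv")
// RFormula assembles the feature vector; the regressor is fitted on top of it
val formula = new RFormula().setFormula("quality ~ .")
val regressor = new DecisionTreeRegressor()
val pipeline = new Pipeline().setStages(Array(formula, regressor))
val pipelineModel = pipeline.fit(df)
// The converter needs both the dataset schema and the fitted pipeline
val pmmlBytes = ConverterUtil.toPMMLByteArray(df.schema, pipelineModel)
Files.write(Paths.get("wine_tree.pmml"), pmmlBytes)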
18
Apache Spark ML quirks
● Split schema. Static def via Dataset#schema(),
dynamic def via Dataset column metadata
● Models make predictions in transformed output space
● High internal complexity, overhead. Ex: temporary
Dataset columns for feature transformation
● Built-in PMML export capabilities leak the JPMML-Model
library to application classpath
19
PMML consumption: Apache Spark ML
// $ spark-submit --packages org.jpmml:jpmml-spark:1.0-SNAPSHOT ..
import java.io.File;
import java.util.Arrays;
import org.apache.spark.ml.Transformer;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.jpmml.evaluator.Evaluator;
import org.jpmml.spark.EvaluatorUtil;
import org.jpmml.spark.TransformerBuilder;

Evaluator evaluator = EvaluatorUtil.createEvaluator(new File("audit_tree.pmml"));
TransformerBuilder pmmlTransformerBuilder = new TransformerBuilder(evaluator)
    .withLabelCol("Adjusted") // String column
    .withProbabilityCol("Adjusted_probability", Arrays.asList("0", "1")) // Vector column
    .exploded(true);
Transformer pmmlTransformer = pmmlTransformerBuilder.build();
Dataset<Row> input = ...;
Dataset<Row> output = pmmlTransformer.transform(input);
20
Comparison of feature spaces
Criterion | R | Scikit-Learn | Apache Spark ML
Feature identification | Named | Positional | Pseudo-named
Feature data type | Any | Float, Double | Double
Feature operational type | Continuous, Categorical, Ordinal | Continuous | Continuous, pseudo-categorical
Dataset abstraction | List<Map<String,?>> | float[][] or double[][] | List<double[]>
Effect of transformations on dataset size | Low | Medium (sparse) to high (dense) | Medium (sparse) to high (dense)
21
Feature declaration
<DataField name="Age" dataType="float" optype="continuous">
<Interval closure="closedClosed" leftMargin="17.0" rightMargin="83.0"/>
</DataField>
<DataField name="Gender" dataType="string" optype="categorical">
<Value value="Male"/>
<Value value="Female"/>
<Value value="N/A" property="missing"/>
</DataField>
http://dmg.org/pmml/v4-3/DataDictionary.html
<MiningField name="Age" outliers="asExtremeValues" lowValue="18.0" highValue="75.0"/>
<MiningField name="Gender"/>
http://dmg.org/pmml/v4-3/MiningSchema.html
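The two declarations above can be read back with nothing more than an XML parser, which is one way to picture the "static analysis" use case. The following is a minimal sketch, not the JPMML API: the wrapper elements, the function names `query_schema` and `treat_outliers`, and the omission of the PMML XML namespace are all assumptions made for illustration. The clamping follows the `asExtremeValues` outlier treatment, where values outside the range are replaced by `lowValue` or `highValue`.

```python
import xml.etree.ElementTree as ET

# Markup copied from the slide, wrapped in root elements so that it parses.
# Real PMML documents also carry an XML namespace, which is left out here.
DATA_DICTIONARY = """
<DataDictionary>
  <DataField name="Age" dataType="float" optype="continuous">
    <Interval closure="closedClosed" leftMargin="17.0" rightMargin="83.0"/>
  </DataField>
  <DataField name="Gender" dataType="string" optype="categorical">
    <Value value="Male"/>
    <Value value="Female"/>
    <Value value="N/A" property="missing"/>
  </DataField>
</DataDictionary>
"""

MINING_SCHEMA = """
<MiningSchema>
  <MiningField name="Age" outliers="asExtremeValues" lowValue="18.0" highValue="75.0"/>
  <MiningField name="Gender"/>
</MiningSchema>
"""


def query_schema(xml_text):
    """Return {field name: (dataType, optype, valid values, missing-value markers)}."""
    schema = {}
    for field in ET.fromstring(xml_text).iter("DataField"):
        valid, missing = [], []
        for value in field.iter("Value"):
            # A Value element without a property attribute is a valid category.
            target = missing if value.get("property") == "missing" else valid
            target.append(value.get("value"))
        schema[field.get("name")] = (field.get("dataType"), field.get("optype"), valid, missing)
    return schema


def treat_outliers(xml_text, name, value):
    """Clamp value into [lowValue, highValue] when the field declares asExtremeValues."""
    for field in ET.fromstring(xml_text).iter("MiningField"):
        if field.get("name") == name and field.get("outliers") == "asExtremeValues":
            low, high = float(field.get("lowValue")), float(field.get("highValue"))
            return min(max(value, low), high)
    return value  # no outlier treatment declared: pass the value through


schema = query_schema(DATA_DICTIONARY)
print(schema["Gender"])
print(treat_outliers(MINING_SCHEMA, "Age", 80.0))
```

Note that the DataField interval (17 to 83) describes what the training data looked like, while the MiningField range (18 to 75) describes what the model is willing to score.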
22
Feature statistics
<UnivariateStats field="Age">
<NumericInfo mean="38.30279" standardDeviation="13.01375" median="3.70"/>
<ContStats>
<Interval closure="openClosed" leftMargin="17.0" rightMargin="23.6"/>
<!-- Intervals 2 through 9 omitted for clarity -->
<Interval closure="openClosed" leftMargin="76.4" rightMargin="83.0"/>
<Array type="int">261 360 297 340 280 156 135 51 13 6</Array>
</ContStats>
</UnivariateStats>
<UnivariateStats field="Gender">
<Counts totalFreq="1899" missingFreq="0" invalidFreq="0"/>
<DiscrStats>
<Array type="string">Male Female</Array>
<Array type="int">1307 592</Array>
</DiscrStats>
</UnivariateStats>
http://dmg.org/pmml/v4-3/Statistics.html
23
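Embedded statistics also make cheap sanity checks possible. The sketch below copies the numbers from the markup above and checks that both histograms account for every record. The equal-width binning is an assumption inferred from the first and last intervals only (the slide omits intervals 2 through 9), and `bin_edges` is a hypothetical helper.

```python
# Values copied from the UnivariateStats markup on the slide.
age_low, age_high = 17.0, 83.0
age_counts = [261, 360, 297, 340, 280, 156, 135, 51, 13, 6]
gender_counts = {"Male": 1307, "Female": 592}
total_freq = 1899


def bin_edges(low, high, n_bins):
    """Edges of n_bins equal-width bins over [low, high] (assumed binning scheme)."""
    width = (high - low) / n_bins
    # Rounding hides floating-point noise such as 23.599999999999998.
    return [round(low + i * width, 10) for i in range(n_bins + 1)]


edges = bin_edges(age_low, age_high, len(age_counts))

# Both histograms should account for every record.
print(sum(age_counts) == total_freq, sum(gender_counts.values()) == total_freq)
# Under the equal-width assumption these match the two intervals shown on the slide.
print(edges[1], edges[-2])
```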
Comparison of tree models
Criterion | R | Scikit-Learn | Apache Spark ML
Algorithms | No built-in, many external | Few built-in | Single built-in
Split type(s) | Binary or multi-way; simple and derived features | Binary; simple features | Binary; simple features
Continuous features | Rel. op. (<, <=) | Rel. op. (<=) | Rel. op. (<=)
Categorical features | Set op. (%in%) | Pseudo-rel. op. (==) | Pseudo-set op.
Reuse | Hard | Easy to medium | Easy to medium
24
Tree model declaration
<TreeModel functionName="classification" splitCharacteristic="binarySplit">
<Node id="1" recordCount="165">
<True/>
<Node id="2" score="1" recordCount="35">
<SimplePredicate field="Education" operator="equal" value="Master"/>
<ScoreDistribution value="1" recordCount="25"/>
<ScoreDistribution value="0" recordCount="10"/>
</Node>
<Node id="3" score="0" recordCount="130">
<SimplePredicate field="Education" operator="notEqual" value="Master"/>
<ScoreDistribution value="1" recordCount="20"/>
<ScoreDistribution value="0" recordCount="110"/>
</Node>
</Node>
</TreeModel>
http://dmg.org/pmml/v4-3/TreeModel.html
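To make the "dynamic analysis" (scoring) use case concrete, here is a minimal sketch of how the tree declared above could be interpreted directly from its markup. It is not the JPMML-Evaluator implementation: the function names are hypothetical, only the `equal` and `notEqual` operators from this example are handled, the XML namespace is omitted, and missing values are not treated the way the PMML specification requires.

```python
import xml.etree.ElementTree as ET

# Tree model markup copied from the slide.
TREE_MODEL = """
<TreeModel functionName="classification" splitCharacteristic="binarySplit">
  <Node id="1" recordCount="165">
    <True/>
    <Node id="2" score="1" recordCount="35">
      <SimplePredicate field="Education" operator="equal" value="Master"/>
      <ScoreDistribution value="1" recordCount="25"/>
      <ScoreDistribution value="0" recordCount="10"/>
    </Node>
    <Node id="3" score="0" recordCount="130">
      <SimplePredicate field="Education" operator="notEqual" value="Master"/>
      <ScoreDistribution value="1" recordCount="20"/>
      <ScoreDistribution value="0" recordCount="110"/>
    </Node>
  </Node>
</TreeModel>
"""

# Only the two operators used in this example are covered.
OPERATORS = {
    "equal": lambda actual, expected: actual == expected,
    "notEqual": lambda actual, expected: actual != expected,
}


def matches(node, record):
    """Evaluate the predicate of a node against a record (a plain dict)."""
    if node.find("True") is not None:
        return True
    predicate = node.find("SimplePredicate")
    operator = OPERATORS[predicate.get("operator")]
    return operator(record.get(predicate.get("field")), predicate.get("value"))


def score(node, record):
    """Descend into the first child whose predicate holds; stop where none does."""
    for child in node.findall("Node"):
        if matches(child, record):
            return score(child, record)
    counts = {d.get("value"): float(d.get("recordCount")) for d in node.findall("ScoreDistribution")}
    total = sum(counts.values())
    probabilities = {label: count / total for label, count in counts.items()} if total else {}
    return node.get("score"), probabilities


root = ET.fromstring(TREE_MODEL).find("Node")
print(score(root, {"Education": "Master"}))
print(score(root, {"Education": "College"}))
```

The class probabilities fall out of the ScoreDistribution record counts, which is part of what makes the markup richer than a bare prediction function.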
25
Optimization (1/3)
Replacing "deep" binary splits with "shallow" multi-splits:
Before:
<Node>
  SimplePredicate: Gender != "Male"
  <Node>
    SimplePredicate: Age <= 34.5
  </Node>
  <Node>
    SimplePredicate: Age > 34.5
  </Node>
</Node>
<Node>
  SimplePredicate: Gender == "Male"
</Node>
After:
<Node>
  SimplePredicate: Gender == "Male"
</Node>
<Node>
  SimplePredicate: Age <= 34.5
</Node>
<Node>
  SimplePredicate: Age > 34.5
</Node>
26
Optimization (2/3)
Cutting the number of terminal nodes in half:
Before:
<TreeModel
  noTrueChildStrategy="returnNullPrediction"
>
  <Node>
    <True/>
    <Node score="4.333333333333333">
      SimplePredicate: pH <= 2.93
    </Node>
    <Node score="5.483870967741935">
      SimplePredicate: pH > 2.93
    </Node>
  </Node>
</TreeModel>
After:
<TreeModel
  noTrueChildStrategy="returnLastPrediction"
>
  <Node score="5.483870967741935">
    <True/>
    <Node score="4.333333333333333">
      SimplePredicate: pH <= 2.93
    </Node>
  </Node>
</TreeModel>
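The equivalence of the two trees on this slide rests on what happens when no child predicate holds. The sketch below models both trees as plain dictionaries (a hypothetical stand-in for the markup, with predicates as Python callables) and shows that, for non-missing pH values, both yield the same score. Missing values are deliberately out of scope here.

```python
def evaluate(node, record, strategy):
    """Score a record; strategy plays the role of noTrueChildStrategy."""
    children = node.get("children", [])
    if not children:
        return node.get("score")
    for child in children:
        if child["predicate"](record):
            return evaluate(child, record, strategy)
    # No child predicate evaluated to true.
    if strategy == "returnLastPrediction":
        return node.get("score")  # fall back to the score of the parent, if any
    return None  # returnNullPrediction


# The tree with two explicit terminal nodes.
two_leaves = {
    "predicate": lambda r: True,
    "children": [
        {"predicate": lambda r: r["pH"] <= 2.93, "score": 4.333333333333333},
        {"predicate": lambda r: r["pH"] > 2.93, "score": 5.483870967741935},
    ],
}

# The tree where the second leaf has been folded into its parent.
one_leaf = {
    "predicate": lambda r: True,
    "score": 5.483870967741935,
    "children": [
        {"predicate": lambda r: r["pH"] <= 2.93, "score": 4.333333333333333},
    ],
}

for ph in (2.5, 2.93, 3.4):
    record = {"pH": ph}
    print(ph, evaluate(two_leaves, record, "returnNullPrediction"),
          evaluate(one_leaf, record, "returnLastPrediction"))
```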
27
Optimization (3/3)
Removing split levels that don't affect the prediction:
Before:
<Node>
  SimplePredicate: Age > 35.5
  <Node score="0" recordCount="40">
    SimplePredicate: Income <= 194386
    ScoreDistribution:
    "0" = 40, "1" = 0
  </Node>
  <Node score="0" recordCount="8">
    SimplePredicate: Income > 194386
    ScoreDistribution:
    "0" = 7, "1" = 1
  </Node>
</Node>
After:
<Node score="0" recordCount="48">
  SimplePredicate: Age > 35.5
  ScoreDistribution:
  "0" = 47, "1" = 1
</Node>
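A bottom-up pruning pass of this kind might look as follows. This is a sketch under stated assumptions: nodes are plain dictionaries rather than PMML markup, `prune` is a hypothetical name, and the child predicates are assumed to be exhaustive, so that every record reaching the parent also reaches one of its children. Merging preserves the predicted label but not the per-leaf probabilities (40/0 and 7/1 become 47/1).

```python
def prune(node):
    """Collapse any split whose children are all leaves predicting the same score."""
    if not node.get("children"):
        return node
    # Prune bottom-up, so that a collapsed subtree can enable a collapse one level up.
    children = [prune(child) for child in node["children"]]
    scores = {child.get("score") for child in children}
    if any(child.get("children") for child in children) or len(scores) != 1:
        return {**node, "children": children}
    # Sum the per-class record counts of the merged leaves.
    merged = {}
    for child in children:
        for label, count in child["distribution"].items():
            merged[label] = merged.get(label, 0) + count
    leaf = {key: value for key, value in node.items() if key != "children"}
    leaf.update(score=scores.pop(), recordCount=sum(merged.values()), distribution=merged)
    return leaf


# The "before" subtree from the slide.
split = {
    "predicate": "Age > 35.5",
    "children": [
        {"predicate": "Income <= 194386", "score": "0", "recordCount": 40,
         "distribution": {"0": 40, "1": 0}},
        {"predicate": "Income > 194386", "score": "0", "recordCount": 8,
         "distribution": {"0": 7, "1": 1}},
    ],
}

print(prune(split))
```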
28
Code generation
<TreeModel
missingValueStrategy="defaultChild"
>
<Node id="1" defaultChild="3">
<True/>
<Node id="2" score="0">
SimplePredicate: Age <= 52
ScoreDistribution:
"0" = 7, "1" = 0
</Node>
<Node id="3" score="1">
SimplePredicate: Age > 52
ScoreDistribution:
"0" = 1, "1" = 2
</Node>
</Node>
</TreeModel>
Object[] node_1(FieldValue age, ...){
if(age != null && age.asFloat() <= 52f){
return node_2(...);
} else {
return node_3(...);
}
}
Object[] node_2(...){
return new Object[]{"0", 7d, 0d};
}
Object[] node_3(...){
return new Object[]{"1", 1d, 2d};
}
Bad Idea!
29
Common pitfalls
Not treating splits on continuous features with the required
precision (tolerance < 0.5 ULP):
● Truncating values. Ex: "1.5000(..)1" → "1.50"
● Changing data type. Ex: float ↔ double
● Changing arithmetic expressions. Ex: (x1 / x2) ↔ (1 / x2) * x1
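Each of the three pitfalls can be reproduced in a few lines. The sketch below uses Python floats (IEEE 754 doubles) and emulates single precision with the `struct` module. The concrete numbers are illustrative choices, not values taken from any model in this deck.

```python
import struct


def to_float32(x):
    """Round a double to the nearest single-precision value, returned as a double."""
    return struct.unpack("f", struct.pack("f", x))[0]


# 1. Truncating a split value moves inputs near the threshold to the other branch.
x = 1.50000005
full_threshold, truncated_threshold = 1.5000001, 1.50
print(x <= full_threshold, x <= truncated_threshold)

# 2. Changing the data type: the single-precision value nearest to 0.1 is slightly
#    larger than the double-precision one, so the same comparison gives two answers.
y = 0.10000000001
print(y <= 0.1, y <= to_float32(0.1))

# 3. Changing the arithmetic expression: algebraically equal, numerically different.
x1, x2 = 49.0, 49.0
print(x1 / x2, (1 / x2) * x1)
```

In all three cases the discrepancy is tiny, but a tree split turns a tiny numeric difference into a different branch and therefore a potentially different prediction.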
30
Comparison of regression models
Criterion | R | Scikit-Learn | Apache Spark ML
Algorithms | Few built-in, many external | Many built-in | Few built-in
Term type(s) | Simple and derived features; interactions | Simple features | Simple features
Reuse | Hard | Easy | Easy
31
Regression model declaration
<RegressionModel functionName="regression" normalizationMethod="none">
<RegressionTable intercept="15.50143741004145">
<!-- Simple continuous feature -->
<NumericPredictor name="cylinders" coefficient="2.1609496766686194"/>
<!-- Simple categorical feature -->
<CategoricalPredictor name="origin" value="2" coefficient="-35.87525051244351"/>
<CategoricalPredictor name="origin" value="3" coefficient="-38.206750156693424"/>
<!-- Interaction -->
<PredictorTerm coefficient="-0.007734946028064237">
<FieldRef field="cylinders"/>
<FieldRef field="displacement"/>
</PredictorTerm>
<!-- Derived feature; I(log(weight)) is backed by a DerivedField element -->
<NumericPredictor name="I(log(weight))" coefficient="4.874500863508498"/>
</RegressionTable>
</RegressionModel>
http://dmg.org/pmml/v4-3/Regression.html
32
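The regression table above can be scored by walking its terms, which is the interpretive counterpart to generating code from it. The following sketch is an illustration only: `predict` and `with_derived_fields` are hypothetical names, the XML namespace is omitted, the optional `exponent` attribute of NumericPredictor is ignored, and the DerivedField that backs I(log(weight)) is replaced by a plain Python helper.

```python
import math
import xml.etree.ElementTree as ET

# Regression table copied from the slide (comments removed).
REGRESSION_TABLE = """
<RegressionTable intercept="15.50143741004145">
  <NumericPredictor name="cylinders" coefficient="2.1609496766686194"/>
  <CategoricalPredictor name="origin" value="2" coefficient="-35.87525051244351"/>
  <CategoricalPredictor name="origin" value="3" coefficient="-38.206750156693424"/>
  <PredictorTerm coefficient="-0.007734946028064237">
    <FieldRef field="cylinders"/>
    <FieldRef field="displacement"/>
  </PredictorTerm>
  <NumericPredictor name="I(log(weight))" coefficient="4.874500863508498"/>
</RegressionTable>
"""


def with_derived_fields(record):
    """Stand-in for the DerivedField element behind I(log(weight))."""
    return {**record, "I(log(weight))": math.log(record["weight"])}


def predict(table_xml, record):
    """Sum the intercept and the contribution of every term in the table."""
    table = ET.fromstring(table_xml)
    result = float(table.get("intercept"))
    for term in table:
        coefficient = float(term.get("coefficient"))
        if term.tag == "NumericPredictor":
            result += coefficient * record[term.get("name")]
        elif term.tag == "CategoricalPredictor":
            # Contributes only when the field equals the listed category;
            # the baseline category (here "1") has no element and adds nothing.
            if str(record[term.get("name")]) == term.get("value"):
                result += coefficient
        elif term.tag == "PredictorTerm":
            # An interaction is the product of all referenced fields.
            result += coefficient * math.prod(record[ref.get("field")] for ref in term.findall("FieldRef"))
    return result


car = with_derived_fields({"cylinders": 4, "displacement": 100.0, "weight": 1.0, "origin": "1"})
print(predict(REGRESSION_TABLE, car))
```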
Generalized regression model declaration
<GeneralRegressionModel functionName="regression" linkFunction="identity">
<ParameterList>
<Parameter name="p0" label="(intercept)"/>
<Parameter name="p11" label="cylinders:displacement"/>
</ParameterList>
<PPMatrix>
<PPCell value="1" predictorName="cylinders" parameterName="p11"/>
<PPCell value="1" predictorName="displacement" parameterName="p11"/>
</PPMatrix>
<ParamMatrix>
<PCell parameterName="p0" beta="15.50143741004145"/>
<PCell parameterName="p11" beta="-0.007734946028064237"/>
</ParamMatrix>
</GeneralRegressionModel>
http://dmg.org/pmml/v4-3/GeneralRegression.html
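The matrix-style encoding is less obvious than the regression table, so a small sketch may help. For continuous predictors (covariates), each PPCell says which predictor enters which parameter and with what exponent, and the ParamMatrix supplies the coefficient of each parameter. The code below covers only the two parameters shown on the slide, treats both predictors as covariates, and relies on the identity link function. The data layout and the name `linear_predictor` are assumptions made for illustration.

```python
import math

# ParamMatrix: parameter name -> beta (only the two parameters shown on the slide).
param_matrix = {"p0": 15.50143741004145, "p11": -0.007734946028064237}

# PPMatrix: (predictor name, parameter name, exponent) for covariates.
pp_matrix = [
    ("cylinders", "p11", 1.0),
    ("displacement", "p11", 1.0),
]


def linear_predictor(record):
    """Sum over parameters of beta times the product of predictor ** exponent."""
    total = 0.0
    for parameter, beta in param_matrix.items():
        cells = [(name, exponent) for name, owner, exponent in pp_matrix if owner == parameter]
        # A parameter without PPCell rows (the intercept p0) multiplies the empty
        # product, which math.prod defines as 1.
        total += beta * math.prod(record[name] ** exponent for name, exponent in cells)
    return total


# With linkFunction="identity" the prediction equals the linear predictor.
print(linear_predictor({"cylinders": 4, "displacement": 100.0}))
```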
33
Code generation
double mpg(FieldValue cylinders, FieldValue displacement, FieldValue weight, FieldValue origin){
    return 15.50143741004145d + // intercept
        2.1609496766686194d * cylinders.asDouble() + // simple continuous feature
        origin(origin.asString()) + // simple categorical feature
        -0.007734946028064237d * (cylinders.asDouble() * displacement.asDouble()) + // interaction
        4.874500863508498d * Math.log(weight.asDouble()); // derived feature; Math.log is the natural logarithm
}
double origin(String origin){
    switch(origin){
        case "1": return 0d; // baseline
        case "2": return -35.87525051244351d;
        case "3": return -38.206750156693424d;
    }
    throw new IllegalArgumentException(origin);
}
34
Grand summary
Criterion | R | Scikit-Learn | Apache Spark ML
ML-platform model information density | Medium to high | Low | Low to medium
ML-platform model interpretability | Good | Bad | Bad
PMML markup optimizability | Low | High | Medium
ML-platform to PMML learning and migration effort | Medium | Low to medium | Low
35
Q&A
villu@openscoring.io
https://github.com/jpmml
https://github.com/openscoring
https://groups.google.com/forum/#!forum/jpmml
PMML vs. PFA
https://xkcd.com/927/
37