Converting R to PMML
Villu Ruusmann
Openscoring OÜ
"Train once, deploy anywhere"
R challenge
"Fat code around thin model objects":
● Many packages to solve the same problem, many ways of
doing and/or representing the same thing
● The unit of reusability is R script
● Model object is tightly coupled to the R script that
produced it. Incomplete/split schemas
● No (widely adopted) pipeline formalization
(J)PMML solution
Evaluator evaluator = ...;
List<InputField> argumentFields = evaluator.getInputFields();
List<ResultField> resultFields =
Lists.union(evaluator.getTargetFields(), evaluator.getOutputFields());
Map<FieldName, ?> arguments = readRecord(argumentFields);
Map<FieldName, ?> result = evaluator.evaluate(arguments);
writeRecord(result, resultFields);
● The R script (data pre- and post-processing, model) is
represented using standardized PMML data structures
● Well-defined data entry and exit interfaces
Workflow
R challenge
Training:
audit_train = read.csv("Audit.csv")
audit_train$HourlyIncome = (audit_train$Income / (audit_train$Hours * 52))
audit.model = model(Adjusted ~ ., data = audit_train)
saveRDS(audit.model, "Audit.rds")
Deployment:
audit_test = read.csv("Audit.csv")
audit.model = readRDS("Audit.rds")
# "Error in eval(expr, envir, enclos) : object 'HourlyIncome' not found"
predict(audit.model, newdata = audit_test)
R solution
Matrix interface:
label = audit$Adjusted
features = audit[, !(colnames(audit) %in% c("Adjusted"))]
# Returns a `model` object
audit.model = model(x = features, y = label)
Formula interface:
# Returns a `model.formula` object
audit.model = model(Adjusted ~ . + I(Income / (Hours * 52)), data = audit)
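A minimal sketch of the difference, using base R `glm` in place of the generic `model` placeholder. The data frame is synthetic and only borrows the Audit column names; the label rule and all values are made up for illustration.

```r
# Synthetic stand-in for the Audit dataset (assumption: not the real data)
set.seed(42)
audit = data.frame(
  Income = runif(200, 10000, 200000),
  Hours = sample(10:80, 200, replace = TRUE)
)
# Hypothetical binary label, loosely tied to hourly income
audit$Adjusted = as.integer(audit$Income / (audit$Hours * 52) + rnorm(200, sd = 20) > 40)

# Approach 1: the derived column is created by the training script
audit_train = audit
audit_train$HourlyIncome = audit_train$Income / (audit_train$Hours * 52)
script.model = glm(Adjusted ~ HourlyIncome, data = audit_train, family = "binomial")

# Scoring raw data fails, because the model object does not know how to derive HourlyIncome
script.error = tryCatch(
  predict(script.model, newdata = audit),
  error = function(e) conditionMessage(e)
)

# Approach 2: the derived feature is part of the formula
formula.model = glm(Adjusted ~ I(Income / (Hours * 52)), data = audit, family = "binomial")

# Scoring raw data works, because predict() re-evaluates the I() term on newdata
formula.pred = predict(formula.model, newdata = audit, type = "response")
```

With the formula interface, the feature definition travels with the model object instead of staying behind in the training script.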
From `pmml` to `r2pmml`
library("pmml")
library("r2pmml")
audit = read.csv("Audit.csv")
audit$Adjusted = as.factor(audit$Adjusted)
audit.glm = glm(Adjusted ~ . + I(Income / (Hours * 52)),
data = audit, family = "binomial")
saveXML(pmml(audit.glm), "Audit.pmml")
r2pmml(audit.glm, "Audit.pmml")
Package `pmml`
<PMML>
<DataDictionary>
<DataField name="I(Income/(Hours * 52))" optype="continuous" dataType="double"/>
</DataDictionary>
<GeneralRegressionModel>
<MiningSchema>
<MiningField name="I(Income/(Hours * 52))"/>
</MiningSchema>
</GeneralRegressionModel>
</PMML>
Package `r2pmml`
<PMML>
<DataDictionary>
<DataField name="Income" optype="continuous" dataType="double"/>
<DataField name="Hours" optype="continuous" dataType="double"/>
</DataDictionary>
<TransformationDictionary>
<DerivedField name="Income/(Hours * 52)" optype="continuous" dataType="double">
<Apply function="/">
<FieldRef field="Income"/>
<Apply function="*">
<FieldRef field="Hours"/>
<Constant>52</Constant>
</Apply>
</Apply>
</DerivedField>
</TransformationDictionary>
<GeneralRegressionModel>
<MiningSchema>
<MiningField name="Income"/>
<MiningField name="Hours"/>
</MiningSchema>
</GeneralRegressionModel>
</PMML>
Manual conversion (1/2)
[Diagram: r2pmml::r2pmml spans both sides. The R side: r2pmml::decorate, then base::saveRDS, producing an RDS file. The Java side: the JPMML-R library reads the RDS file.]
Manual conversion (2/2)
The R side:
audit.model = model(Adjusted ~ ., data = audit)
# Decorate the model object with supporting information
audit.model = r2pmml::decorate(audit.model, dataset = audit)
saveRDS(audit.model, "Audit.rds")
The Java side:
$ git clone https://github.com/jpmml/jpmml-r.git
$ cd jpmml-r; mvn clean package
$ java -jar target/converter-executable-1.2-SNAPSHOT.jar --rds-input \
    /path/to/Audit.rds --pmml-output /path/to/Audit.pmml
In-formula feature engineering
Sanitization (1/2)
?base::ifelse
# Static replacement value
as.formula(Adjusted ~ ifelse(is.na(Hours), 0, Hours))
# Dynamic replacement value
# na.rm = TRUE keeps the mean from becoming NA when Hours has missing values
as.formula(paste("Adjusted ~ ifelse((is.na(Hours) | Hours < 0 | Hours > 168), ",
mean(audit$Hours, na.rm = TRUE), ", Hours)", sep = ""))
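A minimal sketch of the dynamic replacement pattern in base R. The toy data frame is made up for illustration, and computing the mean over valid observations only is an assumption of this sketch, not something the slides prescribe.

```r
# Toy data with one missing and two out-of-range values (made up for illustration)
audit = data.frame(
  Adjusted = c(0, 1, 0, 1, 1, 0),
  Hours = c(40, NA, 35, -5, 200, 45)
)

# Compute the replacement value from valid observations only
valid = !is.na(audit$Hours) & audit$Hours >= 0 & audit$Hours <= 168
replacement = mean(audit$Hours[valid])

# Embed the value into the formula text as a literal constant
sanitize.formula = as.formula(paste(
  "Adjusted ~ ifelse(is.na(Hours) | Hours < 0 | Hours > 168, ",
  replacement, ", Hours)", sep = ""
))

# model.frame() shows the feature column that an estimator would receive
sanitized = model.frame(sanitize.formula, data = audit, na.action = na.pass)[[2]]
```

Because the constant is pasted into the formula text, the formula no longer depends on the training data frame being available at scoring time.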
Sanitization (2/2)
<DerivedField name="ifelse(is.na(Hours))" optype="continuous" dataType="double">
<Extension>ifelse(is.na(Hours), 0, Hours)</Extension>
<Apply function="if">
<Apply function="isMissing">
<FieldRef field="Hours"/>
</Apply>
<Constant>0</Constant>
<FieldRef field="Hours"/>
</Apply>
</DerivedField>
Binarization (1/2)
?base::ifelse
as.formula(Adjusted ~ ifelse(Hours >= 40, "full-time", "part-time"))
Binarization (2/2)
<DerivedField name="ifelse(Hours &gt;= 40)" optype="categorical" dataType="string">
<Extension>ifelse(Hours &gt;= 40, "full-time", "part-time")</Extension>
<Apply function="if">
<Apply function="greaterOrEqual">
<FieldRef field="Hours"/>
<Constant>40</Constant>
</Apply>
<Constant>full-time</Constant>
<Constant>part-time</Constant>
</Apply>
</DerivedField>
Bucketization (1/2)
?base::cut
# Static intervals
as.formula(Adjusted ~ cut(Income, breaks = 3))
# Dynamic intervals
as.formula(Adjusted ~ cut(log10(Income), breaks = quantile(log10(Income))))
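A small base R sketch of how the two kinds of breaks behave. The income values are made up. One detail worth knowing about quantile-based breaks: `cut` intervals are left-open by default, so the minimum observation falls outside the first bin unless `include.lowest = TRUE` is passed. Whether the converter accepts that extra argument is not stated in the slides.

```r
# Made-up income values for illustration
income = c(1000, 20000, 35000, 50000, 120000)

# Static intervals: three equal-width bins over the observed range
static.bins = cut(income, breaks = 3)

# Dynamic intervals: quartile boundaries computed from the data
breaks = quantile(income)
dynamic.bins = cut(income, breaks = breaks)

# The minimum equals the first break, and intervals are (left, right],
# so the smallest observation is not assigned to any bin
lowest.is.missing = is.na(dynamic.bins[1])

# include.lowest = TRUE closes the first interval on the left
closed.bins = cut(income, breaks = breaks, include.lowest = TRUE)
```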
Bucketization (2/2)
<DerivedField name="cut(Income)" optype="categorical" dataType="string">
<Extension>cut(Income, breaks = 3)</Extension>
<Discretize field="Income">
<DiscretizeBin binValue="(129,1.61e+05]">
<Interval closure="openClosed" leftMargin="129.0" rightMargin="161000.0"/>
</DiscretizeBin>
<DiscretizeBin binValue="(1.61e+05,3.21e+05]">
<Interval closure="openClosed" leftMargin="161000.0" rightMargin="321000.0"/>
</DiscretizeBin>
<DiscretizeBin binValue="(3.21e+05,4.82e+05]">
<Interval closure="openClosed" leftMargin="321000.0" rightMargin="482000.0"/>
</DiscretizeBin>
</Discretize>
</DerivedField>
Mathematical expressions (1/2)
?base::I
as.formula(Adjusted ~ I(floor(Income / (Hours * 52))))
Supported constructs:
● Arithmetic operators `+`, `-`, `*`, `/`, `^` and `**`
● Relational operators `==`, `!=`, `<`, `<=`, `>=` and `>`
● Logical operators `!`, `&` and `|`
● Functions `abs`, `ceiling`, `exp`, `floor`, `is.na`, `log`, `log10`, `round`
and `sqrt`
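A short base R sketch of why the `I()` wrapper matters. Inside a formula, operators such as `/` and `*` have model-algebra meaning; `I()` keeps them arithmetic. The two-row data frame is made up for illustration.

```r
# Without I(), `/` is the formula nesting operator: a / b means a + a:b
plain.terms = attr(terms(~ Income / Hours), "term.labels")

# With I(), the whole expression stays one arithmetic term
wrapped.terms = attr(terms(~ I(floor(Income / (Hours * 52)))), "term.labels")

# Evaluate the wrapped term on made-up data
audit = data.frame(Income = c(52000, 104000), Hours = c(40, 25))
hourly = model.frame(~ I(floor(Income / (Hours * 52))), data = audit)[[1]]
```

Here `plain.terms` holds two terms (`Income` and `Income:Hours`), while `wrapped.terms` holds a single term.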
Mathematical expressions (2/2)
<DerivedField name="floor(Income/(Hours * 52))" optype="continuous" dataType="double">
<Extension>I(floor(Income/(Hours * 52)))</Extension>
<Apply function="floor">
<Apply function="/">
<FieldRef field="Income"/>
<Apply function="*">
<FieldRef field="Hours"/>
<Constant>52</Constant>
</Apply>
</Apply>
</Apply>
</DerivedField>
Renaming and regrouping categories (1/2)
library("plyr")
?plyr::revalue
?plyr::mapvalues
as.formula(Adjusted ~ plyr::revalue(Employment, c("PSFederal" = "Public",
"PSState" = "Public", "PSLocal" = "Public")))
as.formula(Adjusted ~ plyr::revalue(Education, c("Yr1t4" = "Yr1t6",
"Yr5t6" = "Yr1t6", "Yr7t8" = "Yr7t9", "Yr9" = "Yr7t9", "Yr10" = "Yr10t12",
"Yr11" = "Yr10t12", "Yr12" = "Yr10t12")))
as.formula(Adjusted ~ plyr::mapvalues(Gender, c("Male", "Female"), c(0, 1)))
Renaming and regrouping categories (2/2)
<DerivedField name="revalue(Employment)" optype="categorical" dataType="string">
<Extension>plyr::revalue(Employment, c(PSFederal = "Public", PSState = "Public",
PSLocal = "Public"))</Extension>
<MapValues outputColumn="to">
<FieldColumnPair field="Employment" column="from"/>
<InlineTable>
<row><from>PSFederal</from><to>Public</to></row>
<row><from>PSState</from><to>Public</to></row>
<row><from>PSLocal</from><to>Public</to></row>
<row><from>Consultant</from><to>Consultant</to></row>
<row><from>Private</from><to>Private</to></row>
<row><from>SelfEmp</from><to>SelfEmp</to></row>
<row><from>Volunteer</from><to>Volunteer</to></row>
</InlineTable>
</MapValues>
</DerivedField>
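The lookup semantics of the table above can be sketched in base R with a named vector. This is a stand-in for illustration, not how `plyr::revalue` or the converter is implemented; the sample values are a subset of the Employment categories listed in the table.

```r
# Sample values (subset of the categories in the InlineTable above)
employment = c("PSFederal", "Private", "PSLocal", "Consultant", "PSState")

# Same mapping as the first plyr::revalue() call in the slides
mapping = c("PSFederal" = "Public", "PSState" = "Public", "PSLocal" = "Public")

# Look up each value; values without a mapping come back as NA
regrouped = unname(mapping[employment])

# Unmapped values pass through unchanged, like the identity rows in the table
unmapped = is.na(regrouped)
regrouped[unmapped] = employment[unmapped]
```

The PMML table has to spell out the identity rows explicitly, which is why it lists seven rows for a three-entry mapping.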
Interactions
as.formula(Adjusted ~
# Continuous vs. continuous
Age:Income +
# Continuous vs. transformed continuous
Age:I(log10(Income)) +
# Continuous vs. categorical
Gender:Income +
# Categorical vs. categorical
(Gender + Education) ^ 2
)
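To see what these interaction terms expand to, the right-hand side can be inspected with base R `terms` and `model.matrix`. A minimal sketch on a made-up six-row data frame; the Education levels here are hypothetical placeholders.

```r
# Made-up data for illustration; Education levels are placeholders
audit = data.frame(
  Age = c(25, 40, 31, 58, 46, 37),
  Income = c(20000, 85000, 40000, 120000, 60000, 52000),
  Gender = factor(c("Male", "Female", "Female", "Male", "Male", "Female")),
  Education = factor(c("College", "Master", "College", "Master", "Master", "College"))
)

# Right-hand side only, since no label is needed to inspect the design matrix
interactions = ~ Age:Income + Age:I(log10(Income)) + Gender:Income + (Gender + Education) ^ 2

# (Gender + Education) ^ 2 expands to both main effects plus their pairwise interaction
term.labels = attr(terms(interactions), "term.labels")

# One column per continuous product or per dummy-coded factor combination
design = model.matrix(interactions, data = audit)
colnames(design)
```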
Estimator fitting
Alternative encodings (1/2)
# Continuous label ~ Continuous and categorical features
audit.glm = glm(Income ~ Age + Employment + Education + Marital +
Occupation + Gender + Hours, data = audit, family = "gaussian")
# Encoded as GeneralRegressionModel element
r2pmml(audit.glm, "Audit.pmml")
# Encoded as RegressionModel element
r2pmml(audit.glm, "Audit.pmml", converter = "org.jpmml.rexp.LMConverter")
Alternative encodings (2/2)
# Binary label ~ Categorical features
audit.glm = glm(Adjusted ~ cut(Age, breaks = c(0, 21, 65, 100)) +
Employment + Education + Marital + Occupation + cut(Income, breaks =
quantile(Income)) + Gender + ifelse(Hours >= 40, "full-time",
"part-time"), data = audit, family = "binomial")
audit.scorecard = r2pmml::as.scorecard(audit.glm)
# Encoded as Scorecard element
r2pmml(audit.scorecard, "Audit.pmml")
Q&A
villu@openscoring.io
https://github.com/jpmml
https://github.com/openscoring
https://groups.google.com/forum/#!forum/jpmml
Software (Nov 2017)
● The R side:
○ base 3.3(.1)
○ pmml 1.5(.2)
○ r2pmml 0.15(.0)
○ plyr 1.8(.4)
● The Java side:
○ JPMML-R 1.2(.20)
○ JPMML-Evaluator 1.3(.10)
