Operationalize Apache Spark Analytics
Ivan Nardini
Sr. Associate Customer Advisor, SAS Institute | CI & Analytics | ModelOps | Decisioning
Artem Glazkov
Sr. Consultant, SAS Institute | Decisioning | ModelOps | Customer Advisory
Operationalize Apache Spark Analytics
Ivan Nardini
SAS Governance options with Apache® Spark Analytics
▪ Govern Spark Models – PMML
▪ Orchestrate Spark Models – Livy
Artem Glazkov
Managing Spark ML model lifecycle demo scenario:
▪ Code-agnostic model repository
▪ BPM tool for model governance
▪ Capturing model performance over time
ModelOps Challenges
Model performance decay
▪ Change in customer behavior
▪ Internal and external environment changes
▪ Track performance for models with long and short target actualization

Decisioning
▪ Role-based approach
▪ Elaborate a clear action plan for the model
▪ Combine business rules, scripts, and user expertise in the governance process

Retrain automation
▪ Orchestrate repetitive procedures
▪ Reduce the time gap between model development and deployment stages
▪ Identify the right model at the right moment for retraining
How we meet ModelOps challenges
using SAS Model Manager and SAS Workflow Manager

▪ Integration with engines: two built-in scoring engines (CAS and MAS) plus external engines
▪ Orchestration: automate all repetitive model management tasks
▪ Openness: GUI + code to govern SAS and open-source models
▪ Repository: one place to store all models
▪ Reporting: built-in and customized model quality assessment
Why we should track model performance decay

[Chart: predictive power of the model over time (t1 to t4), showing the deployed model's decay, an alerting trigger, and the additional value captured by a retrained and redeployed model.]
How do you operationalize
Spark Models?
SAS Governance options with Apache Spark Analytics
Govern Spark Models using SAS – PMML
PMML is one of the leading standards for
statistical and data mining models.
PMML enables developing a model on one
system with one application and deploying
the model on another system with another
application, simply by transmitting an XML
configuration file.
Govern Spark models – Spark PMML
The JPMML-SparkML library converts Apache Spark ML pipelines to the PMML data
format. It is written in Java, but the JPMML family includes Python (and R) wrapper
libraries for JPMML-SparkML.
For Python, the pyspark2pmml package works with the official PySpark interface:
• The pyspark2pmml.PMMLBuilder Python class is an API clone of the org.jpmml.sparkml.PMMLBuilder Java class.
• The Apache Spark connection is typically available in a PySpark session as the sc variable. The SparkContext class
has an _jvm attribute, which gives Python users direct access to JPMML-SparkML functionality via the Py4J
gateway.
In your Spark session, you fit your pipeline and then use PMMLBuilder to export it as
a PMML file.
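A minimal sketch of that export step, assuming a live Spark session (sc, a SparkSession) with pyspark2pmml installed and the JPMML-SparkML jar on the Spark classpath; the column names and the training DataFrame train_df are hypothetical:

```python
from pyspark.ml import Pipeline
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.feature import VectorAssembler
from pyspark2pmml import PMMLBuilder

# Hypothetical feature/label columns; train_df is a Spark DataFrame with them.
assembler = VectorAssembler(inputCols=["x1", "x2"], outputCol="features")
lr = LogisticRegression(featuresCol="features", labelCol="label")
pipeline = Pipeline(stages=[assembler, lr])

model = pipeline.fit(train_df)

# PMMLBuilder mirrors org.jpmml.sparkml.PMMLBuilder; sc is the SparkContext.
PMMLBuilder(sc, train_df, model).buildFile("churn_model.pmml")
```

The resulting churn_model.pmml file is the XML artifact that gets registered into the governance environment.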
Govern Spark models: SAS Model Manager and PMML

[Architecture diagram: a PySpark MLlib model is registered from the Spark development environment into SAS Model Manager (GUI/REST API) in the SAS Viya governance environment; SAS Workflow Manager and the SAS Data Connector drive scoring of new data in the Spark production environment through the SAS In-Database Process for Spark, called over REST API.]
Govern Spark models: the "PMML" workflow

In this scenario we translate the open-source
model score code to SAS and utilize the
SAS Embedded Process for Hadoop.
We use built-in SAS Viya capabilities
to create SAS Model Manager
reports based on the scored data
produced by running the Embedded
Process.
PMML approach: pros and cons

Govern Spark models (PMML)
PROs:
• SAS In-Database technology
(Scoring Accelerator)
CONs:
• Technology bottlenecks
(PMML supports a limited set of
algorithms)
Orchestrate Spark Models – Apache Livy
Orchestrate Spark models – What’s Apache Livy?
Apache Livy is a service that enables easy submission of Spark jobs or snippets of
Spark code, synchronous or asynchronous result retrieval, and Spark context
management, all via a simple REST interface or an RPC client library.
Govern Spark models – Apache Livy
As with Python scikit-learn models, we register the Parquet version of the Spark MLlib model and (optionally) the
scoring code:
• The Parquet model contains the model metadata needed to score new data in the Hadoop/Spark ecosystem.
• The scoring code is a REST API recipe that is submitted from the Livy Server to the Spark cluster to load the
model and return scores.
Then we use SAS Workflow Manager capabilities (job execution and the REST API service task) to:
1. Submit the scoring REST API call
2. Get back the scoring data
3. Generate performance monitoring reports
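The submission in step 1 can be sketched as a POST to Livy's /batches endpoint. This is an illustrative payload only: the Livy host, HDFS paths, and job name below are assumptions, not part of the original deck.

```python
import json

# Hypothetical Livy endpoint; adjust to your cluster.
LIVY_URL = "http://livy-server:8998"

def build_scoring_batch(script_path, model_path, output_path):
    """Build the JSON payload for a POST to Livy's /batches endpoint."""
    return {
        "file": script_path,                # PySpark scoring script on HDFS
        "args": [model_path, output_path],  # Parquet MLlib model in, scores out
        "name": "score-new-data",
        "conf": {"spark.submit.deployMode": "cluster"},
    }

payload = build_scoring_batch(
    "hdfs:///scripts/score.py",
    "hdfs:///models/spark_pipeline_model",
    "hdfs:///scores/latest",
)
body = json.dumps(payload)
# The workflow service task would then submit it, e.g. with requests:
# requests.post(LIVY_URL + "/batches", data=body,
#               headers={"Content-Type": "application/json"})
```

Livy responds with a batch id that the workflow can poll to detect completion before pulling the scored data back (steps 2 and 3).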
Govern Spark models: SAS Model Manager and Apache Livy

[Architecture diagram: a PySpark MLlib model is registered from the Spark development environment into SAS Model Manager (GUI/REST API) in the SAS Viya governance environment; SAS Workflow Manager calls Apache Livy over REST API to score new data in the Spark production environment.]
Govern Spark models: the "Apache Livy" workflow

In this scenario SAS Model Manager and
SAS Workflow Manager act more as
orchestrators of service tasks and user
reviews.
We use built-in SAS Viya capabilities
to create Model Manager reports
based on the scored data provided by
native Spark.
PMML and Livy approaches: pros and cons

Govern Spark models (PMML)
PROs:
• SAS In-Database technology
(Scoring Accelerator)
CONs:
• Technology bottlenecks
(PMML supports a limited set of algorithms)

Orchestrate Spark Models (Livy)
PROs:
• Native integration (no score code
manipulation or conversion)
CONs:
• Configuration needed (Livy server)
Demo
Feedback
Your feedback is important to us.
Don’t forget to rate and
review the sessions.