Operationalize Apache Spark Analytics
Ivan Nardini
Sr. Associate Customer Advisor, SAS Institute | CI & Analytics | ModelOps | Decisioning
Artem Glazkov
Sr. Consultant, SAS Institute | Decisioning | ModelOps | Customer Advisory
Operationalize Apache Spark Analytics
Ivan Nardini
SAS Governance options with Apache® Spark Analytics
▪ Govern Spark Models – PMML
▪ Orchestrate Spark Models – Livy
Artem Glazkov
Managing Spark ML model lifecycle demo scenario:
▪ Code-agnostic model repository
▪ BPM tool for model governance
▪ Capturing model performance over time
ModelOps Challenges
Model performance decay
▪ Change in customer behavior
▪ Internal and external environment changes
▪ Track performance for models with long and short target actualization

Decisioning
▪ Role-based approach
▪ Elaborate a clear action plan for the model
▪ Combine business rules, scripts, and user expertise in the governance process

Retrain automation
▪ Orchestrate repetitive procedures
▪ Reduce the time gap between model development and deployment stages
▪ Identify the right model at the right moment for retraining
How we meet ModelOps challenges
using SAS Model Manager and SAS Workflow Manager

▪ Integration with engines: two built-in scoring engines (CAS and MAS) plus external engines
▪ Orchestration: automate all repetitive model management tasks
▪ Openness: GUI + code to govern SAS and open-source models
▪ Repository: one place to store all models
▪ Reporting: built-in and customized model quality assessment
Why we should track model performance decay

[Chart: predictive power of the model over time (t1 to t4), showing the deployed model's decay, an alerting trigger, and the additional value captured by a retrained and redeployed model.]
How do you operationalize
Spark Models?
SAS Governance options with Apache Spark Analytics
Govern Spark Models using SAS – PMML
PMML is one of the leading standards for
statistical and data mining models.
PMML enables developing a model on one
system with one application and deploying
the model on another system with another
application, simply by transmitting an XML
configuration file.
Govern Spark models – Spark PMML
The JPMML-SparkML library converts Apache Spark ML pipelines to the PMML data
format. It is written in Java, but the JPMML family includes Python (and R) wrapper
libraries for JPMML-SparkML.
For Python, the pyspark2pmml package works with the official PySpark interface:
• The pyspark2pmml.PMMLBuilder Python class is an API clone of the org.jpmml.sparkml.PMMLBuilder Java class.
• The Apache Spark connection is typically available in a PySpark session as the sc variable. The SparkContext class
has an _jvm attribute, which gives Python users direct access to JPMML-SparkML functionality via the Py4J
gateway.
In your Spark session, you fit your pipeline and then use PMMLBuilder to export it as
a PMML file.
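A minimal sketch of that export step, assuming a live Spark session (sc, a SparkSession) with pyspark2pmml installed and the JPMML-SparkML jar on the Spark classpath; the column names and the training DataFrame train_df are hypothetical:

```python
from pyspark.ml import Pipeline
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.feature import VectorAssembler
from pyspark2pmml import PMMLBuilder

# Hypothetical feature/label columns; train_df is a Spark DataFrame with them.
assembler = VectorAssembler(inputCols=["x1", "x2"], outputCol="features")
lr = LogisticRegression(featuresCol="features", labelCol="label")
pipeline = Pipeline(stages=[assembler, lr])

model = pipeline.fit(train_df)

# PMMLBuilder mirrors org.jpmml.sparkml.PMMLBuilder; sc is the SparkContext.
PMMLBuilder(sc, train_df, model).buildFile("churn_model.pmml")
```

The resulting churn_model.pmml file is the XML artifact that gets registered into the governance environment.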
Govern Spark models: SAS Model Manager and PMML

[Architecture diagram: a PySpark MLlib model is registered from the Spark development environment into SAS Model Manager (GUI/REST API) in the SAS Viya governance environment; SAS Workflow Manager and the SAS Data Connector drive scoring of new data in the Spark production environment through the SAS In-Database Process for Spark, called over REST API.]
Govern Spark models: the "PMML" workflow

In this scenario we translate the open-source
model score code to SAS and utilize the
SAS Embedded Process for Hadoop.
We use built-in SAS Viya capabilities
to create SAS Model Manager
reports based on the scored data
produced by running the Embedded
Process.
PMML approach: pros and cons

Govern Spark models (PMML)
PROs:
• SAS In-Database technology
(Scoring Accelerator)
CONs:
• Technology bottlenecks
(PMML supports a limited set of
algorithms)
Orchestrate Spark Models – Apache Livy
Orchestrate Spark models – What’s Apache Livy?
Apache Livy is a service that enables easy submission of Spark jobs or snippets of
Spark code, synchronous or asynchronous result retrieval, and Spark context
management, all via a simple REST interface or an RPC client library.
Govern Spark models – Apache Livy
As with Python scikit-learn models, we register the Parquet version of the Spark MLlib model and (optionally) the
scoring code:
• The Parquet model contains the model metadata needed to score new data in the Hadoop/Spark ecosystem.
• The scoring code is a REST API recipe that is submitted from the Livy Server to the Spark cluster to load the
model and return scores.
Then we use SAS Workflow Manager capabilities (job execution and the REST API service task) to:
1. Submit the scoring REST API call
2. Get back the scoring data
3. Generate performance monitoring reports
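The submission in step 1 can be sketched as a POST to Livy's /batches endpoint. This is an illustrative payload only: the Livy host, HDFS paths, and job name below are assumptions, not part of the original deck.

```python
import json

# Hypothetical Livy endpoint; adjust to your cluster.
LIVY_URL = "http://livy-server:8998"

def build_scoring_batch(script_path, model_path, output_path):
    """Build the JSON payload for a POST to Livy's /batches endpoint."""
    return {
        "file": script_path,                # PySpark scoring script on HDFS
        "args": [model_path, output_path],  # Parquet MLlib model in, scores out
        "name": "score-new-data",
        "conf": {"spark.submit.deployMode": "cluster"},
    }

payload = build_scoring_batch(
    "hdfs:///scripts/score.py",
    "hdfs:///models/spark_pipeline_model",
    "hdfs:///scores/latest",
)
body = json.dumps(payload)
# The workflow service task would then submit it, e.g. with requests:
# requests.post(LIVY_URL + "/batches", data=body,
#               headers={"Content-Type": "application/json"})
```

Livy responds with a batch id that the workflow can poll to detect completion before pulling the scored data back (steps 2 and 3).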
Govern Spark models: SAS Model Manager and Apache Livy

[Architecture diagram: a PySpark MLlib model is registered from the Spark development environment into SAS Model Manager (GUI/REST API) in the SAS Viya governance environment; SAS Workflow Manager calls Apache Livy over REST API to score new data in the Spark production environment.]
Govern Spark models: the "Apache Livy" workflow

In this scenario SAS Model Manager and
SAS Workflow Manager act more as
orchestrators of service tasks and user
reviews.
We use built-in SAS Viya capabilities
to create Model Manager reports
based on the scored data provided by
native Spark.
PMML and Livy approaches: pros and cons

Govern Spark models (PMML)
PROs:
• SAS In-Database technology
(Scoring Accelerator)
CONs:
• Technology bottlenecks
(PMML supports a limited set of algorithms)

Orchestrate Spark Models (Livy)
PROs:
• Native integration (no score code
manipulation or conversion)
CONs:
• Configuration needed (Livy server)
Demo
Feedback
Your feedback is important to us.
Don’t forget to rate and
review the sessions.