Apache Spark(™)
Model Deployment
Bay Area Spark Meetup – June 30, 2016
Richard Garris – Big Data Solutions Architect Focused on Advanced Analytics
About Me
Richard L Garris
• rlgarris@databricks.com
• @rlgarris [Twitter]
Big Data Solutions Architect @ Databricks
12+ years designing Enterprise Data Solutions for everyone from
startups to Global 2000
Prior work experience: PwC, Google, Skytree
Ohio State Buckeye and CMU alumnus
About Apache Spark MLlib
Started at Berkeley AMPLab
(Apache Spark 0.8)
Now (Apache Spark 2.0)
• Contributions from 75+ orgs, ~250 individuals
• Development driven by Databricks: roadmap + 50% of
PRs
• Growing coverage of distributed algorithms
Spark
SparkSQL Streaming MLlib GraphFrames
MLlib Goals
General Machine Learning library for big data
• Scalable & robust
• Coverage of common algorithms
• Leverages Apache Spark
Tools for practical workflows
Integration with existing data science tools
Apache Spark MLlib
• spark.mllib
• The original API (before Spark 1.4)
• A lower-level library built on Spark RDDs
• Uses LabeledPoint, Vectors and Tuples
• In maintenance mode as of Spark 2.x
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.{LabeledPoint, LinearRegressionWithSGD}

// Load and parse the data: each line is "label,feature1 feature2 ..."
val data = sc.textFile("data/mllib/ridge-data/lpsa.data")
val parsedData = data.map { line =>
  val parts = line.split(',')
  LabeledPoint(parts(0).toDouble, Vectors.dense(parts(1).split(' ').map(_.toDouble)))
}.cache()
// Build the model
val numIterations = 100
val stepSize = 0.00000001
val model = LinearRegressionWithSGD.train(parsedData, numIterations, stepSize)
// Evaluate the model on the training examples and compute the training error
val valuesAndPreds = parsedData.map { point =>
  val prediction = model.predict(point.features)
  (point.label, prediction)
}
Apache Spark – ML Pipelines
• spark.ml
• Spark 1.4 and later
• spark.ml Pipelines can express more complex models
• Integrated with DataFrames
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.regression.LinearRegression

// Initialize the linear regression learner
val lr = new LinearRegression()
// Set the parameters for the method
lr.setPredictionCol("Predicted_PE")
  .setLabelCol("PE").setMaxIter(100).setRegParam(0.1)
// Use the spark.ml Pipeline API. If you have worked with scikit-learn,
// this will look very familiar.
// `vectorizer` (a feature stage) and `trainingSet` (a DataFrame) are
// assumed to be defined earlier.
val lrPipeline = new Pipeline()
lrPipeline.setStages(Array(vectorizer, lr))
// First train on the entire dataset to see what we get
val lrModel = lrPipeline.fit(trainingSet)
The Agile Modeling Process
Set Business
Goals
Understand
Your Data
Create
Hypothesis
Devise
Experiment
Prepare Data
Train-Tune-Test
Model
Deploy Model
Measure /
Evaluate Results
The Agile Modeling Process
Set Business
Goals
Understand
Your Data
Create
Hypothesis
Devise
Experiment
Prepare Data
Train-Tune-Test
Model
Deploy Model
Measure /
Evaluate Results
Focus of this
talk
What is a Model?
But What Really is a Model?
A model is a complex pipeline of components
• Data Sources
• Joins
• Featurization Logic
• Algorithm(s)
• Transformers
• Estimators
• Tuning Parameters
ML Pipelines
Train model
Evaluate
Load data
Extract features
A very simple pipeline
ML Pipelines
Train model 1
Evaluate
Datasource 1
Datasource 2
Datasource 3
Extract features
Extract features
Feature transform 1
Feature transform 2
Feature transform 3
Train model 2
Ensemble
A real pipeline
Why ML persistence?
Data
Science
Software
Engineering
Prototype (Python/R)
Create model
Re-implement model for
production (Java)
Deploy model
Why ML persistence?
Data
Science
Software
Engineering
Prototype (Python/R)
Create Pipeline
• Extract raw features
• Transform features
• Select key features
• Fit multiple models
• Combine results to
make prediction
• Extra implementation work
• Different code paths
• Synchronization overhead
Re-implement Pipeline for
production (Java)
Deploy Pipeline
With ML persistence...
Data
Science
Software
Engineering
Prototype (Python/R)
Create Pipeline
Persist model or Pipeline:
model.save("s3n://...")
Load Pipeline (Scala/Java)
Model.load("s3n://…")
Deploy in production
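A minimal sketch of the persist-then-load step, assuming Spark 2.0+ on the classpath. The toy data, the column names (x1, x2, label) and the PersistenceSketch.roundTrip helper are illustrative assumptions, not part of the talk; the path is supplied by the caller.

```scala
import org.apache.spark.ml.{Pipeline, PipelineModel}
import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.ml.regression.LinearRegression
import org.apache.spark.sql.SparkSession

object PersistenceSketch {
  // Fits a tiny two-stage pipeline, saves it, reloads it and scores with the
  // reloaded copy. Returns the number of stages in the reloaded pipeline.
  def roundTrip(spark: SparkSession, path: String): Int = {
    import spark.implicits._

    // Toy training data (hypothetical columns).
    val training = Seq((1.0, 2.0, 5.0), (2.0, 1.0, 4.0), (3.0, 3.0, 9.0), (4.0, 1.0, 6.0))
      .toDF("x1", "x2", "label")

    val assembler = new VectorAssembler()
      .setInputCols(Array("x1", "x2"))
      .setOutputCol("features")
    val lr = new LinearRegression().setMaxIter(10)

    // "Data science" side: fit and persist the whole pipeline.
    val fitted: PipelineModel = new Pipeline().setStages(Array(assembler, lr)).fit(training)
    fitted.write.overwrite().save(path)

    // "Engineering" side: load the same pipeline and score with it.
    val reloaded = PipelineModel.load(path)
    reloaded.transform(training).select("prediction").count()
    reloaded.stages.length
  }
}
```

Because the whole fitted pipeline is saved, featurization and model travel together, so production does not have to re-implement the feature logic.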
Demo
Model Serialization in Apache Spark 2.0 using Parquet
What are the Requirements
for a Robust Model
Deployment System?
Your Model Scoring Environment
Customer SLAs
• Response time
• Throughput (predictions per second)
• Uptime / Reliability
Tech Stack
• C / C++
• Legacy (mainframe)
• Java
• Docker
Model Scoring Offline vs Online
Offline
• Internal use (batch)
• Emails, notifications (batch)
• Schedule-based or event-trigger-based
Online
• Customer waiting on the response (human real-time)
• Very low latency with a fixed response window (transactional fraud, ad bidding)
Not All Models Return a Yes / No
Model Scoring Considerations
Example: Login Bot Detector
Different behavior depending on
probability score
0.0 to 0.4 ☞ Allow login
0.4 to 0.6 ☞ Challenge Question
0.6 to 0.75 ☞ Send SMS
0.75 to 0.9 ☞ Refer to Agent
0.9 to 1.0 ☞ Block
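The score bands above can be sketched as a plain function. The slide's ranges share endpoints, so this sketch assumes each band includes its lower bound and excludes its upper bound, with 1.0 falling in Block; the LoginBotPolicy name and the Action types are illustrative.

```scala
object LoginBotPolicy {
  sealed trait Action
  case object AllowLogin extends Action
  case object ChallengeQuestion extends Action
  case object SendSms extends Action
  case object ReferToAgent extends Action
  case object Block extends Action

  // Maps a bot-probability score in [0.0, 1.0] to an action.
  // Assumption: band boundaries belong to the higher band.
  def actionFor(score: Double): Action = {
    require(score >= 0.0 && score <= 1.0, s"score out of range: $score")
    if (score < 0.4) AllowLogin
    else if (score < 0.6) ChallengeQuestion
    else if (score < 0.75) SendSms
    else if (score < 0.9) ReferToAgent
    else Block
  }
}
```

Keeping the thresholds outside the model lets the business tune them without retraining.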
Example: Item Recommendations
Output is a ranking of the top n items
API – send user ID + number of items
Return sorted set of items to recommend
Optional – pass context sensitive information
to tailor results
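One possible shape for such a recommendation call, as a sketch only. The RecommendationRequest name, the ScoredItem type and the `exclude` parameter (standing in for context-sensitive information, e.g. items already in the cart) are assumptions for illustration.

```scala
object RecommendationRequest {
  final case class ScoredItem(itemId: String, score: Double)

  // Returns the IDs of the top n items for a user, best first.
  // `exclude` is a hypothetical context hint: items to leave out of the result.
  def recommend(
      scoredByUser: Map[String, Seq[ScoredItem]],
      userId: String,
      n: Int,
      exclude: Set[String] = Set.empty
  ): Seq[String] =
    scoredByUser.getOrElse(userId, Seq.empty)
      .filterNot(item => exclude.contains(item.itemId))
      .sortBy(item => -item.score)
      .take(n)
      .map(_.itemId)
}
```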
Model Updates and Versioning
• Model Update Frequency
(nightly, weekly, monthly, quarterly)
• Model Version Tracking
• Model Release Process
• Dev ‣ Test ‣ Staging ‣ Production
• Model update process
• Benchmark (or Shadow Models)
• Phase-In (20% traffic)
• Big Bang
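A phase-in can be sketched as deterministic routing on a user ID, so a given user always sees the same model version. The PhaseInRouter name and the hash-bucket scheme are assumptions for illustration, not something the talk prescribes.

```scala
object PhaseInRouter {
  // Returns true when this user should be scored by the new model.
  // Users are assigned to one of 100 buckets by hashing their ID; the new
  // model serves buckets below `candidatePercent` (e.g. 20 for a 20% phase-in).
  def useCandidate(userId: String, candidatePercent: Int): Boolean = {
    require(candidatePercent >= 0 && candidatePercent <= 100, "percent must be in 0..100")
    val bucket = Math.floorMod(userId.hashCode, 100)
    bucket < candidatePercent
  }
}
```

Raising the percentage only adds users to the new model; nobody already routed to it is moved back.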
Model Governance Considerations
• Models can carry both reward and risk for the business
– Well-designed models prevent fraud, reduce churn, increase sales
– Poorly designed models can increase fraud, damage the company’s brand,
cause compliance violations or create other risks
• Models should be governed by the company's policies and procedures,
laws and regulations and the organization's management goals
• Models have to be transparent, explainable, traceable and interpretable for
auditors / regulators
• Models may need reason codes for rejections (e.g. why was this applicant declined credit?)
• Models should have an approval and release process
• Models cannot violate discrimination laws or use features that could be
traced to attributes such as religion, gender or ethnicity
Model A/B Testing
Set Business
Goals
Understand
Your Data
Create
Hypothesis
Devise
Experiment
Prepare Data
Train-Tune-Test
Model
Deploy Model
Measure /
Evaluate
Results
• A/B testing – comparing two
versions to see which performs
better
• Historical data works for
evaluating models in testing, but
production experiments are required
to validate the model hypothesis
• Model update process
• Benchmark (or Shadow Models)
• Phase-In (20% traffic)
• Big Bang
A/B Framework should support these steps
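The benchmark (shadow model) step could look roughly like this: the current model's score is served, while the candidate is scored on the same input and only logged. The ShadowScoring name, the function-typed models and the log callback are assumptions for illustration.

```scala
import scala.util.Try

object ShadowScoring {
  // One log entry per request: what was served and what the shadow model said.
  final case class ShadowRecord(features: Vector[Double], served: Double, shadow: Option[Double])

  // Scores with the current model and returns its result. The candidate runs
  // on the same features; a failure in the candidate never affects the caller.
  def score(
      current: Vector[Double] => Double,
      candidate: Vector[Double] => Double,
      log: ShadowRecord => Unit
  )(features: Vector[Double]): Double = {
    val served = current(features)
    val shadow = Try(candidate(features)).toOption
    log(ShadowRecord(features, served, shadow))
    served
  }
}
```

Comparing the logged pairs offline gives a production benchmark before any traffic is phased in.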
Model Monitoring
• Monitoring is the process of
observing the model’s
performance, logging its
behavior and alerting when the
model degrades
• Logging should capture exactly the
data fed into the model at the
time of scoring
• Model alerting is critical to
detect unusual or unexpected
behaviors
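A small sketch of what a scoring log entry and a degradation check might look like. The ModelMonitor name, the ScoringEvent fields and the mean-shift rule are illustrative assumptions; a real system would pick metrics that suit the model.

```scala
object ModelMonitor {
  // Logged at scoring time: the exact features the model saw, plus its output.
  final case class ScoringEvent(
      modelVersion: String,
      timestampMillis: Long,
      features: Vector[Double],
      prediction: Double
  )

  // Flags degradation when the mean recent prediction drifts from the
  // baseline mean by more than `tolerance`. Empty windows never alert.
  def degraded(recent: Seq[ScoringEvent], baselineMean: Double, tolerance: Double): Boolean =
    recent.nonEmpty && {
      val mean = recent.map(_.prediction).sum / recent.size
      math.abs(mean - baselineMean) > tolerance
    }
}
```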
Open Loop vs Closed Loop
• Open Loop – human being involved
• Closed Loop – no human involved
Model Scoring – almost always closed loop, some models alert
agents or customer service
Model Training – usually open loop with a data scientist in the
loop to update the model
Online Learning
• Closed-loop, entirely machine-driven modeling is risky
• Proper model monitoring and safeguards are needed to prevent abuse / sensitivity to noise
• MLlib supports online learning through streaming models (k-means and logistic regression support online updates)
• Alternative – use a more complex model to better fit new data rather than using online learning
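To make the online-learning idea concrete, here is a plain-Scala sketch of a decayed cluster-center update in the style described in the Spark documentation for streaming k-means. It is not the Spark API; the OnlineCenterUpdate name and the Center type are assumptions for illustration.

```scala
object OnlineCenterUpdate {
  // A cluster center and the weight (number of points) behind it.
  final case class Center(position: Vector[Double], weight: Double)

  // Folds one batch of points into the center.
  // decay = 1.0 treats all history equally; decay = 0.0 keeps only the new batch.
  def update(center: Center, batch: Seq[Vector[Double]], decay: Double): Center =
    if (batch.isEmpty) center
    else {
      val m = batch.size.toDouble
      // Per-dimension mean of the new batch.
      val batchMean = batch.transpose.map(_.sum / m).toVector
      val discounted = center.weight * decay
      val newPosition = center.position.zip(batchMean).map { case (c, x) =>
        (c * discounted + x * m) / (discounted + m)
      }
      Center(newPosition, center.weight + m)
    }
}
```

A lower decay adapts faster to new data but is also more sensitive to noise, which is where the monitoring and safeguards come in.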
Model Deployment
Architectures
Architecture #1
Offline Recommendations
Train ALS Model
Send Offers to Customers
Save Offers to NoSQL
Ranked Offers
Display Ranked Offers in
Web / Mobile
Nightly Batch
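The nightly batch step can be sketched as: rank each user's predicted scores, keep the top n, and write them to a key-value store that the web / mobile tier reads. The OfferStore trait and its in-memory implementation stand in for a NoSQL store and are assumptions for illustration, as are the other names.

```scala
import scala.collection.mutable

object OfflineRecommendations {
  final case class Prediction(userId: String, itemId: String, score: Double)

  // Stand-in for a NoSQL store: user ID -> ranked item IDs.
  trait OfferStore {
    def put(userId: String, rankedItems: Seq[String]): Unit
    def get(userId: String): Seq[String]
  }

  final class InMemoryOfferStore extends OfferStore {
    private val data = mutable.Map.empty[String, Seq[String]]
    def put(userId: String, rankedItems: Seq[String]): Unit = data.update(userId, rankedItems)
    def get(userId: String): Seq[String] = data.getOrElse(userId, Seq.empty)
  }

  // Nightly batch: rank each user's predictions and save the top n offers.
  def saveRankedOffers(predictions: Seq[Prediction], n: Int, store: OfferStore): Unit =
    predictions.groupBy(_.userId).foreach { case (userId, preds) =>
      store.put(userId, preds.sortBy(p => -p.score).take(n).map(_.itemId))
    }
}
```

Serving then becomes a single key lookup, so the online path needs no Spark at all.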
Architecture #2
Precomputed Features with Streaming
Web Logs
Kill User’s Login Session
Pre-compute Features
Features
Spark Streaming
Architecture #3
Local Apache Spark(™)
Train Model in Spark
Save Model to S3 / HDFS
New Data
Copy Model to Production
Predictions
Run Spark Local
Demo
• Example of Offline Recommendations using ALS and
Redis as a NoSQL Cache
Try Databricks Community Edition
2016 Apache Spark Survey
Spark Summit EU
Brussels
October 25-27
The CFP closes at 11:59pm on July 1st
For more information and to submit:
https://spark-summit.org/eu-2016/