How to Productionize Your Machine Learning Models Using Apache Spark MLlib 2.x with Richard Garris

Richard Garris (Principal Solutions Architect)
Apache Spark™ MLlib 2.x: How to Productionize your Machine Learning Models

VISION
Empower anyone to innovate faster with big data.
WHO WE ARE
Founded by the creators of Apache Spark. Contributes 75% of the open source code, 10x more than any other company.
PRODUCT
A fully-managed data processing platform for the enterprise, powered by Apache Spark.

CLUSTER TUNING & MANAGEMENT
INTERACTIVE WORKSPACE
PRODUCTION PIPELINE AUTOMATION
OPTIMIZED DATA ACCESS
DATABRICKS ENTERPRISE SECURITY
YOUR TEAMS
Data Science
Data Engineering
BI Analysts
Many others…
YOUR DATA
Cloud Storage
Data Warehouses
Data Lake
VIRTUAL ANALYTICS PLATFORM

About Me
• Richard L Garris
• rlgarris@databricks.com
• Twitter @rlgarris
• Principal Data Solutions Architect @ Databricks
• 12+ years designing Enterprise Data Solutions for everyone from startups to Global 2000
• Prior Work Experience: PwC, Google and Skytree – the Machine Learning Company
• Ohio State Buckeye and Masters from CMU

Outline
• Spark MLlib 2.X
• Model Serialization
• Model Scoring System Requirements
• Model Scoring Architectures
• Databricks Model Scoring

About Apache Spark™ MLlib
• Started with Spark 0.8 in the AMPLab in 2013
• Migration to Spark DataFrames started with Spark 1.3 with feature parity within 2.X
• Contributions by 75+ orgs, ~250 individuals
• Distributed algorithms that scale linearly with the data

MLlib’s Goals
• General purpose machine learning library optimized for big data
• Linearly scalable = 2x more machines, runtime theoretically cut in half
• Fault tolerant = resilient to the failure of nodes
• Covers the most common algorithms with distributed implementations
• Built around the concept of a Data Science Pipeline (scikit-learn)
• Written entirely using Apache Spark™
• Integrates well with the Agile Modeling Process

A Model is a Mathematical Function
• A model is a function: $f(x)$
• Linear regression: $y = b_0 + b_1 x_1 + b_2 x_2$
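To make the "model is a function" idea concrete, here is a minimal Python sketch of the two-feature linear regression above. The coefficient values are hypothetical and chosen only for illustration; a trained model would supply its own.

```python
def linear_model(x1, x2, b0=1.0, b1=2.0, b2=-0.5):
    """Evaluate y = b0 + b1*x1 + b2*x2.

    The default coefficients are hypothetical, not taken from any trained model.
    """
    return b0 + b1 * x1 + b2 * x2


# Scoring is just a function call on the feature values.
y = linear_model(3.0, 4.0)  # 1.0 + 2.0*3.0 - 0.5*4.0
```

Seen this way, "deploying a model" amounts to shipping the coefficients plus the small amount of code that evaluates the function.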

ML Pipelines
A very simple pipeline
Load data → Extract features → Train model → Evaluate

ML Pipelines
Train model 1
Evaluate
Datasource 1
Datasource 2
Datasource 3
Extract features
Extract features
Feature transform 1
Feature transform 2
Feature transform 3
Train model 2
Ensemble
A real pipeline!

Productionizing Models Today
Data Science: Develop Prototype Model using Python/R
Data Engineering: Re-implement model for production (Java)

Problems with Productionizing Models
Develop Prototype Model using Python/R
Re-implement model for production (Java)
- Extra work
- Different code paths
- Data science does not translate to production
- Slow to update models

MLlib 2.X Model Serialization
Develop Prototype Model using Python/R
Persist model or Pipeline:
model.save("s3n://...")
Load Pipeline (Scala/Java)
Model.load("s3n://…")
Deploy in production

MLlib 2.X Model Serialization Snippet
Scala
val lrModel = lrPipeline.fit(dataset)
// Save the Model
lrModel.write.save("/models/lr")
Python
lrModel = lrPipeline.fit(dataset)
# Save the Model (in PySpark, write() is a method call)
lrModel.write().save("/models/lr")

Model Serialization Output
Code
// List Contents of the Model Dir
dbutils.fs.ls("/models/lr")
Output
Remember this is a pipeline model and these are the stages!

Transformer Stage (StringIndexer)
Code
// Cat the contents of the Metadata dir
dbutils.fs.head("/models/lr/stages/00_strIdx_bb9728f85745/metadata/part-00000")
// Display the Parquet File in the Data dir
display(spark.read.parquet("/models/lr/stages/00_strIdx_bb9728f85745/data/"))
Output
{
"class":"org.apache.spark.ml.feature.StringIndexerModel",
"timestamp":1488120411719,
"sparkVersion":"2.1.0",
"uid":"strIdx_bb9728f85745",
"paramMap":{
"outputCol":"workclassIdx",
"inputCol":"workclass",
"handleInvalid":"error"
}
}
Metadata and params
Data (Hashmap)
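The "Data (Hashmap)" part of this stage can be pictured as a label-to-index lookup. The sketch below is plain Python, not the MLlib implementation; the label values are hypothetical, and it assumes the saved data is an ordered list of labels whose position is the index.

```python
class StringIndexerSketch:
    """Toy stand-in for a fitted StringIndexer: maps labels to float indices."""

    def __init__(self, labels, handle_invalid="error"):
        # Position in the list is the index, stored as a float like MLlib output columns.
        self.index = {label: float(i) for i, label in enumerate(labels)}
        self.handle_invalid = handle_invalid

    def transform(self, value):
        if value in self.index:
            return self.index[value]
        if self.handle_invalid == "error":
            # Mirrors "handleInvalid":"error" in the metadata above.
            raise ValueError(f"Unseen label: {value!r}")
        return None  # hypothetical behaviour for any other setting


# Hypothetical workclass labels, for illustration only.
indexer = StringIndexerSketch(["Private", "Self-employed", "Government"])
workclass_idx = indexer.transform("Private")
```

With "handleInvalid" set to "error", a label unseen at training time fails loudly at scoring time, which is worth planning for in production.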

Estimator Stage (LogisticRegression)
Code
// Cat the contents of the Metadata dir
dbutils.fs.head("/models/lr/stages/18_logreg_325fa760f925/metadata/part-00000")
display(spark.read.parquet("/models/lr/stages/18_logreg_325fa760f925/data/"))
Output
Model params
Intercept + Coefficients
{
"class":"org.apache.spark.ml.classification.LogisticRegressionModel",
"timestamp":1488120446324,
"sparkVersion":"2.1.0",
"uid":"logreg_325fa760f925",
"paramMap":{
"predictionCol":"prediction",
"standardization":true,
"probabilityCol":"probability",
"maxIter":100,
"elasticNetParam":0.0,
"family":"auto",
"regParam":0.0,
"threshold":0.5,
"fitIntercept":true,
"labelCol":"label"
}
}

Estimator Stage (DecisionTree)
Code
display(spark.read.parquet("/models/dt/stages/18_dtc_3d614bcb3ff825/data/"))
// Re-save as JSON
spark.read.parquet("/models/dt/stages/18_dtc_3d614bcb3ff825/data/").write.json("/models/json/dt")
Output
Decision Tree Splits

Visualize Stage (DecisionTree)
Visualization of the Tree in Databricks

What are the Requirements for a Robust Model Deployment System?

Model Scoring Environment Examples
• In Web Applications / Ecommerce Portals
• Mainframe / Batch Processing Systems
• Real-Time Processing Systems / Middleware
• Via API / Microservice
• Embedded in Devices (Mobile Phones, Medical Devices, Autos)

Hidden Technical Debt in ML Systems
“Hidden Technical Debt in Machine Learning Systems”, Google, NIPS 2015

Agile Modeling Process
Set Business Goals
Understand Your
Data
Create Hypothesis
Devise Experiment
Prepare Data
Train-Tune-Test
Model
Deploy Model
Measure/Evaluate
Results

Agile Modeling Process
Set Business Goals
Understand Your
Data
Create Hypothesis
Devise Experiment
Prepare Data
Train-Tune-Test
Model
Deploy Model
Measure/Evaluate
Results
Focus of this
talk

Deployment Should be Agile
• Deployment needs to support A/B testing and experiments
• Deployment should support measuring and evaluating model performance
• Deployment should be fast and adaptive to business needs

Model A/B Testing, Monitoring, Updates
• A/B testing – comparing two versions to see what performs better
• Monitoring is the process of observing the model’s performance, logging its behavior and alerting when the model degrades
• Logging should record exactly the data fed into the model at the time of scoring
• Model update process
• Benchmark (or Shadow Models)
• Phase-In (20% traffic)
• Avoid Big Bang
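One way to phase in a new model on a fixed share of traffic is deterministic bucketing by user ID. The sketch below is an illustration under assumptions: the function names, the use of SHA-256, and the 100-bucket scheme are all hypothetical choices, not something prescribed by the talk.

```python
import hashlib


def bucket(user_id, buckets=100):
    """Map a user ID to a stable bucket in [0, buckets)."""
    digest = hashlib.sha256(user_id.encode("utf-8")).hexdigest()
    return int(digest, 16) % buckets


def choose_model(user_id, phase_in_percent=20):
    """Route a stable share of users to the candidate model (hypothetical names)."""
    return "candidate" if bucket(user_id) < phase_in_percent else "current"


# Roughly 20% of a synthetic user population lands on the candidate model.
users = [f"user-{i}" for i in range(1000)]
share = sum(choose_model(u) == "candidate" for u in users) / len(users)
```

Because the bucket depends only on the user ID, a given user always sees the same model, which keeps A/B comparisons clean and lets the percentage be raised gradually instead of switching everyone at once.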

Consider the Scoring Environment
Customer SLAs
• Response time
• Throughput (predictions per second)
• Uptime / Reliability
Tech Stack
– C / C++
– Legacy (mainframe)
– Java

Scoring in Batch vs Real-Time
Real-Time
• Synchronous
• Could be Seconds:
– Customer is waiting (human real-time)
• Subsecond:
– High Frequency Trading
– Fraud Detection on the Swipe
Batch
• Asynchronous
• Internal Use
• Triggers can be event-based or time-based
• Used for Email Campaigns, Notifications

Open / Closed Loop Online Learning
Open Loop – human being involved
Closed Loop – no human involved
• Model Scoring – almost always closed loop, some open loop e.g. alert agents or customer service
• Model Training – usually open loop with a data scientist in the loop to update the model
Online Learning and Open / Closed Loop
• Online is closed loop, entirely machine driven, but modeling is risky
• Need to have proper model monitoring and safeguards to prevent abuse / sensitivity to noise
• MLlib supports online learning through streaming models (k-means and logistic regression support online)
• Alternative – use a more complex model to better fit new data rather than using online learning
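To illustrate what "online" means here, the following plain-Python sketch applies one stochastic gradient step of logistic regression per incoming example. It is not the MLlib streaming API; the step size and the example data are hypothetical.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def sgd_update(weights, features, label, step_size=0.1):
    """One online update: move the weights along the log-loss gradient."""
    p = sigmoid(sum(w * x for w, x in zip(weights, features)))
    return [w - step_size * (p - label) * x for w, x in zip(weights, features)]


# The model changes after every example, with no human in the loop.
weights = [0.0, 0.0]
stream = [([1.0, 2.0], 1), ([1.0, -1.0], 0)]  # hypothetical (features, label) pairs
for features, label in stream:
    weights = sgd_update(weights, features, label)
```

Every example shifts the weights immediately, which is why a few noisy or adversarial inputs can move the model and why monitoring and safeguards matter for closed-loop training.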

Model Scoring – Bot Detection
Not All Models Return a Boolean (e.g. a Yes / No)
Example: Login Bot Detector
Different behavior depending on the probability that the user is a bot
0.0 to 0.4 ☞ Allow login
0.4 to 0.6 ☞ Send Challenge Question
0.6 to 0.75 ☞ Send SMS Code
0.75 to 0.9 ☞ Refer to Agent
0.9 to 1.0 ☞ Block
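The probability bands above translate directly into a small decision function. This is a sketch: the slide does not say which band owns a boundary value, so the half-open intervals (lower bound inclusive) and the action names are assumptions.

```python
def login_action(p_bot):
    """Map the bot probability to a login action using the bands from the slide."""
    if not 0.0 <= p_bot <= 1.0:
        raise ValueError("probability must be between 0 and 1")
    if p_bot < 0.4:
        return "allow_login"
    if p_bot < 0.6:
        return "send_challenge_question"
    if p_bot < 0.75:
        return "send_sms_code"
    if p_bot < 0.9:
        return "refer_to_agent"
    return "block"


action = login_action(0.65)
```

Keeping the thresholds outside the model means the business can tune the bands without retraining.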

Model Scoring – Recommendations
Output is a ranking of the top n items
API – send user ID + number of items
Return sorted set of items to recommend
Optional – pass context-sensitive information to tailor results
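A minimal sketch of such an API call, assuming per-user item scores have already been computed and are held in a dictionary. The function name, the data layout, and the sample scores are hypothetical.

```python
import heapq


def recommend(scores_by_user, user_id, n):
    """Return the n highest-scoring item IDs for a user, best first."""
    scores = scores_by_user.get(user_id, {})
    return heapq.nlargest(n, scores, key=scores.get)


# Hypothetical precomputed scores: user ID -> {item ID: score}.
scores_by_user = {"u1": {"item-a": 0.1, "item-b": 0.9, "item-c": 0.5}}
top_two = recommend(scores_by_user, "u1", 2)
```

An unknown user simply gets an empty list here; a real service would more likely fall back to popular items.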

Architecture Option A
Precompute Predictions using Spark and Serve from Database
Train ALS Model
Send Email Offers to Customers
Save Offers to NoSQL
Ranked Offers
Display Ranked Offers in Web / Mobile
Recurring Batch

Architecture Option B
Spark Stream and Score using an API with Cached Predictions
Web Activity Logs
Compute Features
Run Prediction
Cache Predictions
API Check
Kill User’s Login Session
Streaming
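The split between the streaming side and the API side can be sketched as follows. Everything here is a stand-in: the in-memory dictionary plays the role of the cache, and the feature and scoring rules are hypothetical placeholders for a real Spark Streaming job and trained model.

```python
# Hypothetical in-memory stand-in for the prediction cache.
prediction_cache = {}


def compute_features(event):
    """Hypothetical feature: requests per second in the event window."""
    return event["requests"] / event["window_seconds"]


def run_prediction(requests_per_second):
    """Hypothetical scoring rule standing in for a trained model."""
    return min(1.0, requests_per_second / 50.0)


def process_event(event):
    """Streaming side: score the event and cache the result by session."""
    score = run_prediction(compute_features(event))
    prediction_cache[event["session_id"]] = score
    return score


def should_kill_session(session_id, threshold=0.9):
    """API side: read the cached score; no model runs on the request path."""
    return prediction_cache.get(session_id, 0.0) >= threshold


process_event({"session_id": "s1", "requests": 500, "window_seconds": 10})
process_event({"session_id": "s2", "requests": 20, "window_seconds": 10})
```

The API check stays fast because it is only a cache lookup; the cost is that a decision can lag behind the latest activity by one streaming interval.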

Architecture Option C
Train with Spark and Score Outside of Spark
Train Model in Spark
Save Model to S3 / HDFS
Copy Model to Production
Load coefficients and intercept from file
New Data
Predictions
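Scoring outside Spark can be as small as reading the intercept and coefficients and evaluating the logistic function. The sketch below assumes a hypothetical JSON file layout with "intercept" and "coefficients" keys; the format Spark actually writes is Parquet, so an export step is assumed to have produced this file.

```python
import json
import math
import os
import tempfile


def load_model(path):
    """Read intercept and coefficients from a JSON file (hypothetical layout)."""
    with open(path) as f:
        model = json.load(f)
    return model["intercept"], model["coefficients"]


def predict_proba(intercept, coefficients, features):
    """Logistic regression probability for one feature vector."""
    margin = intercept + sum(c * x for c, x in zip(coefficients, features))
    return 1.0 / (1.0 + math.exp(-margin))


# Write a tiny hypothetical model file, then load and score with it.
with tempfile.TemporaryDirectory() as model_dir:
    model_path = os.path.join(model_dir, "lr.json")
    with open(model_path, "w") as f:
        json.dump({"intercept": 0.0, "coefficients": [1.0, -1.0]}, f)
    intercept, coefficients = load_model(model_path)

probability = predict_proba(intercept, coefficients, [2.0, 2.0])
```

Any feature transformations from the training pipeline (indexing, scaling, encoding) must be reproduced in the same way before this function is called, or the predictions will not match.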

Databricks Model Scoring
• Based on Architecture Option C
• Goal: Deploy MLlib model outside of Apache Spark and Databricks.
• Easy to Embed in Existing Environments
• Low Latency and Complexity
• Low Overhead

Databricks Model Scoring
• Train Model in Databricks
– Call Fit on Pipeline
– Save Model as JSON
• Deploy model in external system
– Add dependency on “dbml-local” package (without Spark)
– Load model from JSON at startup
– Make predictions in real time
Code
// Fit and Export the Model in Databricks
val lrModel = lrPipeline.fit(dataset)
ModelExporter.export(lrModel, "/models/db")
// In Your Application (Scala)
import com.databricks.ml.local.ModelImport
val lrModel = ModelImport.import("s3a:/...")
val jsonInput = ...
val jsonOutput = lrModel.transform(jsonInput)

Databricks Model Scoring Private Beta
• Private Beta Available for Databricks Customers
• Available on Databricks using Apache Spark 2.1
• Only logistic regression available now
• Additional Estimators and Transformers in Progress

Demo Model Scoring
https://community.cloud.databricks.com/?o=1526931011080774#notebook/1904316851197504

Thank You.
Questions?
Happy Sparking
richard@databricks.com
