Part 3: Models in Production: A Look From Beginning to End

© Cloudera, Inc. All rights reserved.
Models in Production: A Look From Beginning to End
Sean Owen – Director of Data Science, Cloudera
Sean Anderson – Product Marketing, Cloudera

What does a Data Scientist Do?
Data Preparation → Data Modeling → Model Deployment (maybe)

Types of data science

Exploratory (discover and quantify opportunities)
• Team: Data scientists and analysts
• Goal: Understand data, develop and improve models, share insights
• Data: New and changing; often sampled
• Environment: Local machine, sandbox cluster
• Tools: R, Python, SAS/SPSS, SQL; notebooks; data wrangling/discovery tools, …
• End State: Reports, dashboards, PDF, MS Office

Operational (deploy production systems)
• Team: Data engineers, developers, SREs
• Goal: Build and maintain applications, improve model performance, manage models in production
• Data: Known data; full scale
• Environment: Production clusters
• Tools: Java/Scala, C++; IDEs; continuous integration, source control, …
• End State: Online/production applications

Typical data science workflow
[Figure: workflow diagram. Phases: Data Engineering, Data Science (Exploratory), Production (Operational). Steps shown: Acquisition, Processing, Data Wrangling, Visualization and Analysis, Model Training & Testing, Production Data Pipelines, Batch Scoring, Online Scoring, Serving, Data Governance.]

Common Limitations
Access
Secured clusters are often hard for data science professionals to connect to, either because they don’t have the right permissions or because resources are too scarce to afford them access. In addition, popular frameworks and libraries don’t read Hadoop data formats out of the box.

Scale
Notebook environments seldom have enough data storage for medium data, let alone big data. Data scientists are often relegated to sample data and constrained when working on distributed systems. Popular frameworks and libraries don’t easily parallelize across the cluster.

Developer Experience
Popular notebooks don’t work well with access engines like Spark, and package deployment and dependency management across multiple software versions are often hard to manage. Once a model is built, there is no easy path from model development to production.

Introducing Cloudera Data Science Workbench
Self-service data science for the enterprise
Accelerates data science from development to production with:
• Secure self-service environments for data scientists to work against Cloudera clusters
• Support for Python, R, and Scala, plus project dependency isolation for multiple library versions
• Workflow automation, version control, collaboration and sharing

Solving Data Science is a Full-Stack Problem
• Leverage Big Data
  ✓ Hadoop
• Enable real-time use cases
  ✓ Kafka, Spark Streaming
• Provide sufficient toolset for the Data Analysts
  ✓ Spark, Hive, Hue
• Provide sufficient toolset for the Data Scientists + Data Engineers
  ✓ Data Science Workbench
• Provide standard data governance capabilities
  ✓ Navigator + Partners
• Provide standard security across the stack
  ✓ Kerberos, Sentry, Record Service, KMS/KTS
• Provide flexible deployment options
  ✓ Cloudera Director
• Integrate with partner tools
  ✓ Rich Ecosystem
• Provide management tools that make it easy for IT to deploy/maintain
  ✓ Cloudera Manager/Director

ACME Occupancy Detection
Predicting-room-occupancy-from-environmental-sensors-As A Service
github.com/srowen/cdsw-simple-serving

Three Key Roles
Data Engineering
Ingest sensor data at scale. Store and secure data. Clean and transform data for analysis.

Data Science
Explore data and build predictive model, offline. Evaluate and tune the model. Develop modeling pipeline and deliver models.

Model Deployment
Verify and approve model for deployment. Create and maintain model APIs. Update models in production.

• Manages ingest of raw CSV data to HDFS
• Writes Scala Spark code to ETL the data
• Uses an IDE
• Checks code into git
• Adds code to Maven project

"date","Temperature","Humidity","Light","CO2","HumidityRatio","Occupancy"
"1","2015-02-04 17:51:00",23.18,27.272,426,721.25,0.00479298817650529,1
"2","2015-02-04 17:51:59",23.15,27.2675,429.5,714,0.00478344094931065,1
"3","2015-02-04 17:53:00",23.15,27.245,426,713.5,0.00477946352442199,1
"4","2015-02-04 17:54:00",23.15,27.2,426,708.25,0.00477150882608175,1
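import spark.implicits._  // String encoder needed by map on a Dataset[String]

// Raw data lines carry a leading row-number column that the header lacks;
// strip it so every line has the same seven columns.
spark.read.textFile(rawInput).
  map { line =>
    if (line.startsWith("\"date\"")) {
      line  // header line: keep as is
    } else {
      line.substring(line.indexOf(',') + 1)  // drop the row-number field
    }
  }.
  repartition(1).
  write.text(csvInput)

// Re-read the cleaned file as CSV with an inferred schema and drop the timestamp.
spark.read.
  option("inferSchema", true).
  option("header", true).
  csv(csvInput).
  drop("date")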
Temperature  Humidity  Light  CO2     HumidityRatio  Occupancy
23.18        27.272    426    721.25  0.00479        1
23.15        27.2675   429.5  714     0.00478        1
23.15        27.245    426    713.5   0.00477        1
23.15        27.2      426    708.25  0.00477        1
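
The line-level cleanup in the ETL step can be tried without a cluster. The sketch below applies the same rule to two of the sample lines in plain Scala; the object name `StripRowNumber` and the method name `clean` are illustrative and not from the original project.

```scala
object StripRowNumber {
  // Header lines start with the quoted column name "date" and are kept.
  // Data lines start with a quoted row number, which is dropped.
  def clean(line: String): String =
    if (line.startsWith("\"date\"")) line
    else line.substring(line.indexOf(',') + 1)

  def main(args: Array[String]): Unit = {
    val raw = Seq(
      "\"date\",\"Temperature\",\"Humidity\",\"Light\",\"CO2\",\"HumidityRatio\",\"Occupancy\"",
      "\"1\",\"2015-02-04 17:51:00\",23.18,27.272,426,721.25,0.00479298817650529,1"
    )
    // After cleaning, header and data lines have the same number of fields.
    raw.map(clean).foreach(println)
  }
}
```

The header is checked by prefix instead of by position, so the same function can be mapped over every line of a distributed dataset in any order.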

• Builds, evaluates and tunes predictive models
• Builds visualizations
• Writes Scala, Python or R Spark code to model using MLlib, etc.
• Uses Cloudera Data Science Workbench or similar
• Checks code, PMML model into git

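import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.feature.VectorAssembler

// Combine every column except the label into a single feature vector.
val assembler = new VectorAssembler().
  setInputCols(training.columns.filter(_ != "Occupancy")).
  setOutputCol("featureVec")
// Logistic regression that predicts Occupancy from the assembled features.
val lr = new LogisticRegression().
  setFeaturesCol("featureVec").
  setLabelCol("Occupancy").
  setRawPredictionCol("rawPrediction")
// Chain feature assembly and the classifier into one pipeline.
val pipeline =
  new Pipeline().setStages(Array(assembler, lr))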
LogisticRegression
[regParam=0.01]

(Demo)

• Validates PMML model and deploys to production
• Uses continuous integration like Travis CI
• Maintains REST API via OpenScoring
• Uses an IDE
• Checks code into git
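
One kind of check a CI job could run before deployment is a sanity test that the PMML file parses and lists the expected input features. The sketch below is a minimal example using only the JDK XML parser. The object name `PmmlSanityCheck` and its members are hypothetical. The embedded sample is a trimmed fragment built from the coefficients shown in this deck, not a schema-complete PMML document. It is not necessarily the validation the demo project performs.

```scala
import java.io.ByteArrayInputStream
import java.nio.charset.StandardCharsets
import javax.xml.parsers.DocumentBuilderFactory

object PmmlSanityCheck {
  // Feature columns the deployed model is expected to consume.
  val expectedFeatures: Set[String] =
    Set("Temperature", "Humidity", "Light", "CO2", "HumidityRatio")

  // Trimmed, illustrative fragment (assumption: real files carry more elements).
  val sample: String =
    """<PMML version="4.3" xmlns="http://www.dmg.org/PMML-4_3">
      |  <RegressionModel functionName="classification" normalizationMethod="softmax">
      |    <RegressionTable intercept="16.121752149952" targetCategory="1">
      |      <NumericPredictor name="Temperature" coefficient="-1.239411520229105"/>
      |      <NumericPredictor name="Humidity" coefficient="0.040079547154413746"/>
      |      <NumericPredictor name="Light" coefficient="0.020182888698828436"/>
      |      <NumericPredictor name="CO2" coefficient="0.0060762157896669"/>
      |      <NumericPredictor name="HumidityRatio" coefficient="-500.42306896474247"/>
      |    </RegressionTable>
      |  </RegressionModel>
      |</PMML>""".stripMargin

  // Parses the XML (throws on malformed input) and collects predictor names.
  def predictorNames(pmml: String): Set[String] = {
    val builder = DocumentBuilderFactory.newInstance().newDocumentBuilder()
    val doc = builder.parse(
      new ByteArrayInputStream(pmml.getBytes(StandardCharsets.UTF_8)))
    val nodes = doc.getElementsByTagName("NumericPredictor")
    (0 until nodes.getLength).map { i =>
      nodes.item(i).getAttributes.getNamedItem("name").getNodeValue
    }.toSet
  }

  // True when the model uses exactly the expected features.
  def isValid(pmml: String): Boolean =
    predictorNames(pmml) == expectedFeatures
}
```

A CI job could call `PmmlSanityCheck.isValid` on the checked-in model and fail the build when it returns false.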

<PMML version="4.3" xmlns="http://www.dmg.org/PMML-4_3">
  …
  <RegressionModel functionName="classification" normalizationMethod="softmax">
    …
    <RegressionTable intercept="16.121752149952" targetCategory="1">
      <NumericPredictor name="Temperature" coefficient="-1.239411520229105"/>
      <NumericPredictor name="Humidity" coefficient="0.040079547154413746"/>
      <NumericPredictor name="Light" coefficient="0.020182888698828436"/>
      <NumericPredictor name="CO2" coefficient="0.0060762157896669"/>
      <NumericPredictor name="HumidityRatio" coefficient="-500.42306896474247"/>
    </RegressionTable>
    …
  </RegressionModel>
</PMML>
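
To see what the regression table encodes, the sketch below evaluates it by hand in plain Scala. It assumes the elided part of the model contains a second table for category "0" with a score of zero, in which case the softmax over the two categories reduces to the logistic function of the linear score. That assumption, and the names `OccupancyScore`, `linearScore` and `probabilityOccupied`, are not taken from the original slides.

```scala
object OccupancyScore {
  // Intercept and coefficients copied from the RegressionTable for targetCategory="1".
  val intercept: Double = 16.121752149952
  val coefficients: Map[String, Double] = Map(
    "Temperature"   -> -1.239411520229105,
    "Humidity"      -> 0.040079547154413746,
    "Light"         -> 0.020182888698828436,
    "CO2"           -> 0.0060762157896669,
    "HumidityRatio" -> -500.42306896474247
  )

  // Linear score: intercept plus the weighted sum of the input features.
  def linearScore(features: Map[String, Double]): Double =
    intercept + coefficients.iterator.map { case (name, c) => c * features(name) }.sum

  // Assumption: the other category scores 0, so softmax reduces to the sigmoid.
  def probabilityOccupied(features: Map[String, Double]): Double =
    1.0 / (1.0 + math.exp(-linearScore(features)))

  def main(args: Array[String]): Unit = {
    // First sample row from the sensor data.
    val row = Map(
      "Temperature" -> 23.18, "Humidity" -> 27.272, "Light" -> 426.0,
      "CO2" -> 721.25, "HumidityRatio" -> 0.00479298817650529
    )
    println(probabilityOccupied(row))
  }
}
```

The positive Light and CO2 coefficients mean that brighter rooms and rooms with more CO2 score as more likely to be occupied, with the other inputs held fixed.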
POST /model/occupancy
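
A client call to this endpoint might be assembled as in the sketch below, using the JDK 11+ HTTP client. The base URL, the record id, and the `{"id": ..., "arguments": {...}}` body shape are assumptions about the Openscoring deployment, so check them against the Openscoring documentation for the version in use. The names `ScoreRequest`, `jsonBody` and `build` are illustrative.

```scala
import java.net.URI
import java.net.http.HttpRequest

object ScoreRequest {
  // Builds a JSON body of the assumed form {"id": "...", "arguments": {...}}.
  // Feature names and ids are expected to need no JSON escaping.
  def jsonBody(id: String, features: Seq[(String, Double)]): String = {
    val args = features
      .map { case (name, value) => "\"" + name + "\": " + value }
      .mkString(", ")
    "{\"id\": \"" + id + "\", \"arguments\": {" + args + "}}"
  }

  // Builds (but does not send) a POST to <baseUrl>/model/<modelId>.
  def build(baseUrl: String, modelId: String, body: String): HttpRequest =
    HttpRequest.newBuilder(URI.create(baseUrl + "/model/" + modelId))
      .header("Content-Type", "application/json")
      .POST(HttpRequest.BodyPublishers.ofString(body))
      .build()

  def main(args: Array[String]): Unit = {
    val body = jsonBody("row-1", Seq(
      "Temperature" -> 23.18, "Humidity" -> 27.272, "Light" -> 426.0,
      "CO2" -> 721.25, "HumidityRatio" -> 0.00479298817650529))
    // Hypothetical local server address.
    val request = build("http://localhost:8080/openscoring", "occupancy", body)
    println(request.method + " " + request.uri)
    println(body)
  }
}
```

To send the request, pass it to `java.net.http.HttpClient.newHttpClient().send(request, java.net.http.HttpResponse.BodyHandlers.ofString())` against a running server.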

(Demo)

github.com/srowen/cdsw-simple-serving

A conference for and by practicing data scientists
Save the Date: July 20th at the Chapel
Wrangle is a one-day, single-track community event that hosts the best and brightest in the Bay Area talking about the principles, practice, and application of Data Science across multiple data-rich industries. Join Cloudera to discuss future trends, how they can be predicted, and, most importantly, how they can be anticipated.
wrangleconf.com

Thank you
