1© Cloudera, Inc. All rights reserved.
Models in Production: A Look From Beginning to End
Sean Owen – Director of Data Science, Cloudera
Sean Anderson – Product Marketing, Cloudera
What does a Data Scientist Do?
• Data Preparation
• Data Modeling
• Model Deployment (maybe)
Types of data science
Exploratory (discover and quantify opportunities)
• Team: Data scientists and analysts
• Goal: Understand data, develop and improve models, share insights
• Data: New and changing; often sampled
• Environment: Local machine, sandbox cluster
• Tools: R, Python, SAS/SPSS, SQL; notebooks; data wrangling/discovery tools, …
• End State: Reports, dashboards, PDF, MS Office
Operational (deploy production systems)
• Team: Data engineers, developers, SREs
• Goal: Build and maintain applications, improve model performance, manage models in production
• Data: Known data; full scale
• Environment: Production clusters
• Tools: Java/Scala, C++; IDEs; continuous integration, source control, …
• End State: Online/production applications
Typical data science workflow
Stages: Data Engineering; Data Science (Exploratory); Production (Operational)
Steps: Acquisition, Processing, Data Wrangling, Visualization and Analysis, Model Training & Testing, Production Data Pipelines, Batch Scoring, Online Scoring, Serving
Governance: Data Governance
Common Limitations
Access
Secured clusters are often hard for data science professionals to connect to, either because they don’t have the right permissions or because resources are too scarce to afford them access. In addition, popular frameworks and libraries don’t read Hadoop data formats out of the box.
Scale
Notebook environments seldom have enough data storage for medium data, let alone big data. Data scientists are often relegated to sample data and constrained when working on distributed systems. Popular frameworks and libraries don’t easily parallelize across the cluster.
Developer Experience
Popular notebooks don’t work well with access engines like Spark, and package deployment and dependency management across multiple software versions are often hard to manage. Then, once a model is built, there is no easy path from model development to production.
Introducing Cloudera Data Science Workbench
Self-service data science for the enterprise
Accelerates data science from development to production with:
• Secure self-service environments for data scientists to work against Cloudera clusters
• Support for Python, R, and Scala, plus project dependency isolation for multiple library versions
• Workflow automation, version control, collaboration and sharing
Solving Data Science is a Full-Stack Problem
• Leverage Big Data  ✓ Hadoop
• Enable real-time use cases  ✓ Kafka, Spark Streaming
• Provide sufficient toolset for the Data Analysts  ✓ Spark, Hive, Hue
• Provide sufficient toolset for the Data Scientists + Data Engineers  ✓ Data Science Workbench
• Provide standard data governance capabilities  ✓ Navigator + Partners
• Provide standard security across the stack  ✓ Kerberos, Sentry, Record Service, KMS/KTS
• Provide flexible deployment options  ✓ Cloudera Director
• Integrate with partner tools  ✓ Rich Ecosystem
• Provide management tools that make it easy for IT to deploy/maintain  ✓ Cloudera Manager/Director
ACME Occupancy Detection
Predicting-room-occupancy-from-environmental-sensors-As A Service
github.com/srowen/cdsw-simple-serving
Three Key Roles
Data Engineering: Ingest sensor data at scale. Store and secure data. Clean and transform data for analysis.
Data Science: Explore data and build a predictive model, offline. Evaluate and tune the model. Develop the modeling pipeline and deliver models.
Model Deployment: Verify and approve the model for deployment. Create and maintain model APIs. Update models in production.
• Manages ingest of raw CSV data to HDFS
• Writes Scala Spark code to ETL the data
• Uses an IDE
• Checks code into git
• Adds code to Maven project
"date","Temperature","Humidity","Light","CO2","HumidityRatio","Occupancy"
"1","2015-02-04 17:51:00",23.18,27.272,426,721.25,0.00479298817650529,1
"2","2015-02-04 17:51:59",23.15,27.2675,429.5,714,0.00478344094931065,1
"3","2015-02-04 17:53:00",23.15,27.245,426,713.5,0.00477946352442199,1
"4","2015-02-04 17:54:00",23.15,27.2,426,708.25,0.00477150882608175,1
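// Encoders for Dataset[String].map (already in scope in spark-shell)
import spark.implicits._

// Data rows carry an extra leading row-index field that the header lacks.
// Strip it from every data row; keep the header line unchanged.
spark.read.textFile(rawInput).
  map { line =>
    if (line.startsWith("\"date\"")) {
      line
    } else {
      line.substring(line.indexOf(',') + 1)
    }
  }.
  repartition(1).
  write.text(csvInput)

// Parse the cleaned CSV with inferred column types and drop the timestamp.
spark.read.
  option("inferSchema", true).
  option("header", true).
  csv(csvInput).
  drop("date")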
Temperature  Humidity  Light  CO2     HumidityRatio  Occupancy
23.18        27.272    426    721.25  0.00479        1
23.15        27.2675   429.5  714     0.00478        1
23.15        27.245    426    713.5   0.00477        1
23.15        27.2      426    708.25  0.00477        1
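The cleaning step above can be tried without a cluster. The following is a minimal sketch in plain Scala (no Spark) of the same per-line logic; the object name `CleanCsv` and the helper `fields` are hypothetical names introduced here for illustration and are not taken from the slides or the repository.

```scala
object CleanCsv {
  /** Drops the leading row-index field from a data line.
    * The header line (which starts with "date") is returned unchanged. */
  def stripIndex(line: String): String =
    if (line.startsWith("\"date\"")) line
    else line.substring(line.indexOf(',') + 1)

  /** Splits a line on commas and removes surrounding double quotes.
    * Assumes no field contains an embedded comma, as in the sample rows. */
  def fields(line: String): Array[String] =
    line.split(",").map(_.stripPrefix("\"").stripSuffix("\""))
}
```

After stripping, a data row has the same number of fields as the header, which is what allows the CSV reader to line the columns up.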
• Builds, evaluates and tunes predictive models
• Builds visualizations
• Writes Scala, Python or R Spark code to model using MLlib, etc.
• Uses Cloudera Data Science Workbench or similar
• Checks code, PMML model into git
Temperature  Humidity  Light  CO2     HumidityRatio  Occupancy
23.18        27.272    426    721.25  0.00479        1
23.15        27.2675   429.5  714     0.00478        1
23.15        27.245    426    713.5   0.00477        1
23.15        27.2      426    708.25  0.00477        1
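import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.feature.VectorAssembler

// Combine every column except the label into a single feature vector.
val assembler = new VectorAssembler().
  setInputCols(training.columns.filter(_ != "Occupancy")).
  setOutputCol("featureVec")

// Logistic regression predicting the Occupancy label from the feature vector.
val lr = new LogisticRegression().
  setFeaturesCol("featureVec").
  setLabelCol("Occupancy").
  setRawPredictionCol("rawPrediction")

// Chain feature assembly and the classifier into one pipeline.
val pipeline =
  new Pipeline().setStages(Array(assembler, lr))

LogisticRegression [regParam=0.01]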
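To make the assembler stage concrete, here is a small sketch in plain Scala (no Spark) of what "all columns except the label become the feature vector" means for a single row. `AssembleFeatures` and its members are hypothetical names for illustration, and the `Vector[Double]` below is the Scala collection, not Spark's ML vector type.

```scala
object AssembleFeatures {
  /** Column names, in order, as they appear after the ETL step. */
  val columns: Seq[String] =
    Seq("Temperature", "Humidity", "Light", "CO2", "HumidityRatio", "Occupancy")

  /** Feature columns are all columns except the label column. */
  def featureColumns(label: String): Seq[String] =
    columns.filter(_ != label)

  /** Splits one row into (features in column order, label value). */
  def assemble(row: Map[String, Double], label: String): (Vector[Double], Double) =
    (featureColumns(label).map(row).toVector, row(label))
}
```

Keeping the feature order tied to the column order matters later, because the exported model's coefficients are matched to features by name and position.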
(Demo)
• Validates PMML model and deploys to production
• Uses continuous integration like Travis CI
• Maintains REST API via OpenScoring
• Uses an IDE
• Checks code into git
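One way a continuous integration job might validate a checked-in PMML file is to confirm that it parses and names exactly the expected input features. The sketch below uses only the JDK's XML parser; `PmmlCheck`, `predictorNames`, `hasExactly`, and the trimmed `sample` document with placeholder coefficients are all hypothetical, and the actual project may validate its model differently (for example with a PMML evaluator library).

```scala
import java.io.ByteArrayInputStream
import java.nio.charset.StandardCharsets
import javax.xml.parsers.DocumentBuilderFactory

object PmmlCheck {
  /** Returns the name attribute of every NumericPredictor element, in document order.
    * Throws if the text is not well-formed XML. */
  def predictorNames(pmml: String): Seq[String] = {
    val builder = DocumentBuilderFactory.newInstance().newDocumentBuilder()
    val doc = builder.parse(new ByteArrayInputStream(pmml.getBytes(StandardCharsets.UTF_8)))
    val nodes = doc.getElementsByTagName("NumericPredictor")
    (0 until nodes.getLength).map { i =>
      nodes.item(i).getAttributes.getNamedItem("name").getNodeValue
    }
  }

  /** True when the model's predictors are exactly the expected feature names. */
  def hasExactly(pmml: String, expected: Set[String]): Boolean =
    predictorNames(pmml).toSet == expected

  /** Trimmed, hypothetical PMML document with placeholder values, for illustration only. */
  val sample: String =
    """<PMML version="4.3" xmlns="http://www.dmg.org/PMML-4_3">
      |  <RegressionModel functionName="classification">
      |    <RegressionTable intercept="0.0" targetCategory="1">
      |      <NumericPredictor name="Temperature" coefficient="1.0"/>
      |      <NumericPredictor name="Light" coefficient="1.0"/>
      |    </RegressionTable>
      |  </RegressionModel>
      |</PMML>""".stripMargin
}
```

A check like this fails fast when a retrained model silently gains or loses a feature, before anything reaches the serving layer.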
Temperature  Humidity  Light  CO2     HumidityRatio  Occupancy
23.18        27.272    426    721.25  0.00479        1
23.15        27.2675   429.5  714     0.00478        1
23.15        27.245    426    713.5   0.00477        1
23.15        27.2      426    708.25  0.00477        1
<PMML version="4.3" xmlns="http://www.dmg.org/PMML-4_3">
  …
  <RegressionModel functionName="classification" normalizationMethod="softmax">
    …
    <RegressionTable intercept="16.121752149952" targetCategory="1">
      <NumericPredictor name="Temperature" coefficient="-1.239411520229105"/>
      <NumericPredictor name="Humidity" coefficient="0.040079547154413746"/>
      <NumericPredictor name="Light" coefficient="0.020182888698828436"/>
      <NumericPredictor name="CO2" coefficient="0.0060762157896669"/>
      <NumericPredictor name="HumidityRatio" coefficient="-500.42306896474247"/>
    </RegressionTable>
    …
  </RegressionModel>
</PMML>
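The regression table above can be read as a plain formula: a linear score (intercept plus coefficient times feature) passed through a normalization. The sketch below computes that score in plain Scala. It rests on an assumption: the elided parts of the document are taken to define the other category with a linear score of 0, in which case softmax over the two categories reduces to the logistic function. `OccupancyScore` and its members are hypothetical names for illustration.

```scala
object OccupancyScore {
  // Intercept and coefficients copied from the RegressionTable for targetCategory="1".
  val intercept: Double = 16.121752149952
  val coefficients: Seq[(String, Double)] = Seq(
    "Temperature"   -> -1.239411520229105,
    "Humidity"      -> 0.040079547154413746,
    "Light"         -> 0.020182888698828436,
    "CO2"           -> 0.0060762157896669,
    "HumidityRatio" -> -500.42306896474247
  )

  /** Linear score: intercept + sum of coefficient * feature value. */
  def linearScore(features: Map[String, Double]): Double =
    intercept + coefficients.map { case (name, c) => c * features(name) }.sum

  /** Probability of category "1".
    * ASSUMPTION: the other category's score is 0, so softmax equals the logistic function. */
  def probabilityOccupied(features: Map[String, Double]): Double =
    1.0 / (1.0 + math.exp(-linearScore(features)))
}
```

Because the model is just a handful of numbers, any runtime that can read PMML can reproduce the score, which is what makes the hand-off from modeling to serving language-neutral.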
POST /model/occupancy
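A client calling this endpoint needs to send the feature values in the request body. The following sketch builds such a body as a JSON string in plain Scala. The layout with an "id" and an "arguments" object is an assumption based on how Openscoring evaluation requests are commonly shown, and `ScoringRequest`, `jsonObject`, and `body` are hypothetical names; the exact payload should be confirmed against the deployed service.

```scala
object ScoringRequest {
  /** Renders numeric fields as a flat JSON object. Field names are assumed to need no escaping. */
  def jsonObject(fields: Seq[(String, Double)]): String =
    fields.map { case (k, v) => "\"" + k + "\":" + v }.mkString("{", ",", "}")

  /** Builds a request body with an "id" and an "arguments" object (assumed layout).
    * The id is assumed to be a plain string with no characters that require escaping. */
  def body(id: String, features: Seq[(String, Double)]): String =
    "{\"id\":\"" + id + "\",\"arguments\":" + jsonObject(features) + "}"
}
```

The argument names must match the predictor names in the PMML document, since the scoring service looks inputs up by name.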
(Demo)
github.com/srowen/cdsw-simple-serving
A conference for and by practicing data scientists
Save the Date: July 20th at the Chapel
Wrangle is a one-day, single-track community event that hosts the best and brightest in the Bay Area talking about the principles, practice, and application of Data Science across multiple data-rich industries. Join Cloudera to discuss future trends, how they can be predicted, and, most importantly, how they can be anticipated.
wrangleconf.com
Thank you

Part 3: Models in Production: A Look From Beginning to End
