Towards Data Science Principles
@joerg_schad
tl;dr
(Why) Do we need Data Science Engineering
Principles?
Software Engineering
The application of a systematic, disciplined,
quantifiable approach to the development,
operation, and maintenance of software
IEEE Standard Glossary of Software Engineering
Terminology
Jörg Schad
Technical Lead/Engineer
Deep Learning
● Core Mesos developer at Mesosphere
● Twitter: @joerg_schad
Why is machine learning taking off?
What you want to be doing
Data (clean)
Write awesome ML code
Train (once)
Deploy (not me)
© 2018 Mesosphere, Inc. All Rights Reserved.
Deep Learning: The Challenges
1. Data Preparation & Model Engineering
2. Model Training
3. Monitoring
4. Debugging
5. Model Serving
Public Cloud Pipeline
1. Data Preparation using Spark
7. Streaming of requests
...
DIY Open Source Pipeline
1. Data Preparation using Spark
7. Kafka stream of requests
Sculley, D., Holt, G., Golovin, D. et al. Hidden Technical Debt in Machine Learning Systems
What you’re actually doing
Data Science Principles
Challenge: Repeatable Builds
• Regulatory Requirements
• Dependencies
• CI/CD
• Git
• MLflow
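One way to make the repeatable-build idea concrete is to record everything a training run depends on next to the model. The sketch below is only an illustration under stated assumptions: the helper names (`fingerprint_dataset`, `build_manifest`, `train`), the dependency version and the commit id are hypothetical, and the seeded shuffle merely stands in for real training. It does not use Git or MLflow; it only shows the kind of metadata such tools would track.

```python
import hashlib
import json
import platform
import random


def fingerprint_dataset(rows):
    """Return a SHA-256 hex digest over the rows, in order."""
    digest = hashlib.sha256()
    for row in rows:
        digest.update(json.dumps(row, sort_keys=True).encode("utf-8"))
    return digest.hexdigest()


def build_manifest(rows, seed, dependencies, git_commit):
    """Collect what is needed to repeat a training run later."""
    return {
        "data_sha256": fingerprint_dataset(rows),
        "seed": seed,
        "dependencies": dict(sorted(dependencies.items())),
        "git_commit": git_commit,
        "python": platform.python_version(),
    }


def train(rows, seed):
    """Stand-in for training: a seeded shuffle keeps the run deterministic."""
    rng = random.Random(seed)
    order = list(range(len(rows)))
    rng.shuffle(order)
    return order


rows = [
    {"x": 1.0, "label": "dog"},
    {"x": 2.0, "label": "cat"},
    {"x": 3.0, "label": "dog"},
]
# Hypothetical dependency pin and commit id, for illustration only.
manifest = build_manifest(
    rows, seed=42, dependencies={"tensorflow": "1.9.0"}, git_commit="abc1234"
)
print(json.dumps(manifest, indent=2))
```

Storing the manifest with the trained model lets a later run check that data, seed, code and dependencies match before comparing results.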
Step 1: Training (In Data Center - Over Hours/Days/Weeks)
Input: Lots of Labeled Data (e.g. "Dog")
Deep neural network model
Output: Trained Model
Data Science CI/CD
Train Model(s)
Optimize Model(s)
Test Model(s)
Build Serving Container(s)
Deploy
Model Engineering
Continuous Integration
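The CI/CD stages above can be sketched as a chain of functions in which a failed model test stops the run before anything is built or deployed. This is a toy sketch, not a real CI system: the stage functions, the majority-label "model", the `min_accuracy` gate and the image name are all assumptions made for illustration.

```python
def train_model(state):
    # Stand-in "model": always predict the majority training label.
    labels = [label for _, label in state["train"]]
    state["model"] = max(sorted(set(labels)), key=labels.count)
    return state


def optimize_model(state):
    # Placeholder for a real optimization step.
    state["optimized"] = True
    return state


def check_model(state):
    # Gate: refuse to continue if held-out accuracy is too low.
    hits = sum(1 for _, label in state["test"] if label == state["model"])
    state["accuracy"] = hits / len(state["test"])
    if state["accuracy"] < state["min_accuracy"]:
        raise RuntimeError("accuracy %.2f below gate" % state["accuracy"])
    return state


def build_serving_container(state):
    # Hypothetical image tag; no container is actually built here.
    state["image"] = "model-server:" + state["version"]
    return state


def deploy(state):
    state["deployed"] = True
    return state


STAGES = [train_model, optimize_model, check_model, build_serving_container, deploy]


def run_pipeline(state):
    """Run the stages in order; any exception aborts the remaining stages."""
    for stage in STAGES:
        state = stage(state)
    return state


result = run_pipeline({
    "train": [(1, "dog"), (2, "dog"), (3, "cat")],
    "test": [(4, "dog"), (5, "dog"), (6, "cat"), (7, "dog")],
    "min_accuracy": 0.7,
    "version": "1",
})
print(result["accuracy"], result["image"], result["deployed"])
```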
Challenge: Requirements Engineering
• What dataset(s)?
• What target/serving environment?
• What model architecture?
• Pre-trained model available?
• How many training resources?
Challenge: Data Quality
Challenges
• Data is typically not ready to be consumed by ML job*
– Data Cleaning
• Missing/incorrect labels
– Data Preparation
• Same Format
• Same Distribution
* Demo datasets are a fortunate exception :)
Don’t forget about the serving environment!!
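A minimal sketch of two of the checks named above: finding missing or incorrect labels, and comparing the training distribution with what the serving environment sees. The function names, the sample data and the use of a mean shift measured in training standard deviations are assumptions chosen for brevity; real pipelines would likely use richer statistics.

```python
from statistics import mean, pstdev


def find_bad_labels(rows, allowed):
    """Return indices of rows whose label is missing or not an allowed value."""
    return [i for i, row in enumerate(rows) if row.get("label") not in allowed]


def mean_shift(train_values, serving_values):
    """Shift of the serving mean, in units of the training standard deviation."""
    spread = pstdev(train_values)
    if spread == 0:
        raise ValueError("training values have zero spread")
    return abs(mean(serving_values) - mean(train_values)) / spread


rows = [
    {"x": 1.0, "label": "dog"},
    {"x": 2.0, "label": None},   # missing label
    {"x": 3.0, "label": "cat"},
    {"x": 4.0, "label": "dgo"},  # incorrect label
    {"x": 5.0},                  # label field absent
]
bad = find_bad_labels(rows, allowed={"dog", "cat"})
print("rows needing cleaning:", bad)

train_x = [row["x"] for row in rows]
serving_x = [6.0, 7.0, 8.0]  # hypothetical values seen at serving time
print("shift:", mean_shift(train_x, serving_x))
```

A large shift is a hint that the serving environment no longer matches the training data, which is the point of the reminder above.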
“Libraries”
• Different Frameworks
• Existing architectures
• Pretrained models
TensorFlow Hub
https://www.tensorflow.org/hub/
Writing Distributed Model Functions
Testing
• Training/Test/Validation Datasets
• Unit Tests?
• Different factors
– Accuracy
– Serving performance
– ….
• A/B Testing with live Data
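The "Unit Tests?" bullet and the different test factors can be illustrated with two small checks, one for accuracy on held-out data and one for serving latency. Everything here is a hypothetical stand-in: `predict` is a threshold rule rather than a trained model, and the thresholds (0.75 accuracy, 10 ms per call) are arbitrary example budgets.

```python
import time


def predict(x):
    """Hypothetical stand-in model: classify by a fixed threshold."""
    return "dog" if x >= 0.5 else "cat"


def accuracy(model, examples):
    """Fraction of (input, label) pairs the model gets right."""
    hits = sum(1 for x, label in examples if model(x) == label)
    return hits / len(examples)


def mean_latency_seconds(model, inputs, repeats=100):
    """Average wall-clock time per prediction."""
    start = time.perf_counter()
    for _ in range(repeats):
        for x in inputs:
            model(x)
    return (time.perf_counter() - start) / (repeats * len(inputs))


HELD_OUT = [(0.9, "dog"), (0.1, "cat"), (0.7, "dog"), (0.4, "dog")]


def test_accuracy_on_held_out_data():
    assert accuracy(predict, HELD_OUT) >= 0.75


def test_serving_latency_budget():
    assert mean_latency_seconds(predict, [0.1, 0.5, 0.9]) < 0.01


test_accuracy_on_held_out_data()
test_serving_latency_budget()
print("model checks passed")
```

Checks like these run offline; A/B testing with live data still needs the serving infrastructure.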
Debugging
https://www.tensorflow.org/programmers_guide/debugger
Profiling
https://www.tensorflow.org/performance/performance_guide
Monitoring
Challenges
• Understand {...}
• Debug
• Model Quality
– Accuracy
– Training Time
– …
• Overall Architecture
– Availability
– Latencies
– ...
Solutions
• TensorBoard
• Traditional Cluster Monitoring Tool
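For the architecture-level metrics (availability, latencies), a small sketch of what a monitoring tool computes from request logs may help. The record layout (`status`, `latency_ms`), the sample values and the nearest-rank percentile are assumptions; this is not how TensorBoard or any particular cluster monitoring tool is implemented.

```python
import math


def percentile(values, pct):
    """Nearest-rank percentile of a non-empty list."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def summarize(requests):
    """Availability and latency percentiles for a batch of request records."""
    latencies = [r["latency_ms"] for r in requests]
    ok = sum(1 for r in requests if r["status"] < 500)
    return {
        "availability": ok / len(requests),
        "p50_ms": percentile(latencies, 50),
        "p99_ms": percentile(latencies, 99),
    }


# Hypothetical request log: one slow, failed request among ten.
latencies_ms = [10, 12, 11, 13, 12, 250, 11, 10, 12, 14]
requests = [
    {"status": 500 if ms == 250 else 200, "latency_ms": ms} for ms in latencies_ms
]
summary = summarize(requests)
print(summary)
```

The median hides the slow request while the 99th percentile exposes it, which is why tail latencies are usually tracked alongside averages.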
Serving
Challenges
• How to Deploy Models?
– Zero Downtime
– Canary
Solutions
• TensorFlow Serving
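The canary idea can be sketched independently of any serving system: send a small, stable fraction of requests to the new model version and the rest to the current one. The `route` function and the request-id scheme below are hypothetical and are not part of TensorFlow Serving; they only illustrate the traffic-splitting logic.

```python
import hashlib


def route(request_id, canary_percent):
    """Pick a model version for a request.

    Hashing the request id keeps the choice stable, so the same id
    always lands on the same version for a given percentage.
    """
    digest = hashlib.sha256(request_id.encode("utf-8")).hexdigest()
    bucket = int(digest, 16) % 100
    return "canary" if bucket < canary_percent else "stable"


ids = ["request-%d" % i for i in range(10000)]
canary_share = sum(1 for i in ids if route(i, 10) == "canary") / len(ids)
print("share of traffic on canary:", canary_share)
```

Raising `canary_percent` step by step to 100 moves all traffic to the new version without a restart, and setting it back to 0 is the rollback.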
Serving
Challenges
• Small/Fast model without losing too much performance
• 500 KB models….
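One common way to shrink a model is to store weights in 8 bits instead of 32. The slides do not say which technique is meant, so the linear quantization below is an assumption, shown only to make the size/quality trade-off tangible; `quantize`, `dequantize` and the sample weights are illustrative.

```python
def quantize(weights):
    """Map floats linearly onto 0..255, one byte per weight."""
    lo, hi = min(weights), max(weights)
    scale = (hi - lo) / 255 or 1.0  # avoid division by zero for constant weights
    codes = bytes(round((w - lo) / scale) for w in weights)
    return codes, lo, scale


def dequantize(codes, lo, scale):
    """Recover approximate float weights from the byte codes."""
    return [lo + c * scale for c in codes]


weights = [-1.0, -0.25, 0.0, 0.5, 1.0]
codes, lo, scale = quantize(weights)
restored = dequantize(codes, lo, scale)
max_error = max(abs(a - b) for a, b in zip(weights, restored))
print(len(codes), "bytes, max error", max_error)
```

Each weight now takes one byte instead of four, and the rounding error per weight stays within half a quantization step; whether model quality survives that has to be measured on held-out data.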
Data Science Pipeline
• Data & Streaming: Distributed Data Storage and Streaming
• Model Engineering: Data Preparation and Analysis
• Model Training: Distributed Training using Machine Learning Frameworks
• Model Management: Storage of trained Models and Metadata
• Model Serving: Use trained Model for Inference
Continuous Integration
Monitoring & Operations
Management
THANK YOU! ANY QUESTIONS?
@dcos
users@dcos.io
/groups/8295652
/dcos
/dcos/examples
/dcos/demos
chat.dcos.io
https://mesosphere.com/resources/building-data-science-platform/
Make it insanely easy to build and scale world-changing technology

Data Con LA 2018 - Towards Data Science Engineering Principles by Joerg Schad