Data Con LA 2018 - Towards Data Science Engineering Principles by Joerg Schad

Towards Data Science Principles
@joerg_schad

tl;dr
(Why) Do we need Data Science Engineering Principles?
Software Engineering
"The application of a systematic, disciplined, quantifiable approach to the development, operation, and maintenance of software"
IEEE Standard Glossary of Software Engineering Terminology

Jörg Schad
Technical Lead/Engineer Deep Learning
● Core Mesos developer at Mesosphere
● Twitter: @joerg_schad

Why is machine learning taking off?

What you want to be doing
• Data (clean)
• Write awesome ML code
• Train (once)
• Deploy (not me)

© 2018 Mesosphere, Inc. All Rights Reserved.
Deep Learning: The Challenges
1. Data Preparation & Model Engineering
2. Model Training
3. Monitoring
4. Debugging
5. Model Serving

Public Cloud Pipeline
1. Data Preparation using Spark
7. Streaming of requests
...

DIY Open Source Pipeline
1. Data Preparation using Spark
7. Kafka stream of requests

What you’re actually doing
Source: Sculley, D., Holt, G., Golovin, D. et al. Hidden Technical Debt in Machine Learning Systems

Challenge: Repeatable Builds
• Regulatory Requirements
• Dependencies
• CI/CD
• Git
• MLflow
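A repeatable build needs a record of everything that determines its result. The sketch below is a minimal illustration, not a Git or MLflow API: `build_manifest` is a hypothetical helper that fingerprints the training data, hyperparameters, pinned dependencies, and code revision, so that two runs with identical inputs get the same run ID. All example values are placeholders.

```python
import hashlib
import json


def build_manifest(data_bytes, params, dependencies, git_commit):
    """Describe one training run (hypothetical helper, for illustration only).

    data_bytes   -- raw bytes of the training dataset
    params       -- dict of hyperparameters, including the random seed
    dependencies -- dict of package name -> pinned version
    git_commit   -- commit hash of the training code
    """
    manifest = {
        "data_sha256": hashlib.sha256(data_bytes).hexdigest(),
        "params": params,
        "dependencies": dependencies,
        "git_commit": git_commit,
    }
    # Serialize with sorted keys so identical inputs always hash identically.
    canonical = json.dumps(manifest, sort_keys=True).encode("utf-8")
    manifest["run_id"] = hashlib.sha256(canonical).hexdigest()[:12]
    return manifest


example = build_manifest(
    data_bytes=b"label,value\ndog,0.1\n",        # placeholder dataset
    params={"seed": 42, "learning_rate": 0.01},  # placeholder hyperparameters
    dependencies={"tensorflow": "1.10.0"},       # placeholder pinned version
    git_commit="abc1234",                        # placeholder commit hash
)
print(example["run_id"])
```

Storing such a manifest next to each trained model makes it possible to answer later which data, code, and dependency versions produced it.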
Step 1: Training (In Data Center - Over Hours/Days/Weeks)
Input: Lots of Labeled Data (example label: "Dog")
Deep neural network model
Output: Trained Model

Data Science CI/CD
• Train Model(s)
• Optimize Model(s)
• Test Model(s)
• Build Serving Container(s)
• Deploy
• Model Engineering
• Continuous Integration
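The stages named on this slide (train, optimize, test, build serving container, deploy) can be chained so that a failing stage stops everything after it. The following is a minimal sketch under that assumption. The stage functions, the `accuracy` and `min_accuracy` keys, and the image name are hypothetical placeholders, not part of any CI system.

```python
def run_pipeline(stages, context):
    """Run (name, function) stages in order; stop at the first one that raises."""
    completed = []
    for name, stage in stages:
        try:
            context = stage(context)
        except Exception as error:
            return {"ok": False, "failed": name,
                    "completed": completed, "error": str(error)}
        completed.append(name)
    return {"ok": True, "failed": None,
            "completed": completed, "context": context}


# Placeholder stages: each takes the pipeline context and returns an updated copy.
def train_models(ctx):
    return {**ctx, "model": "trained"}


def optimize_models(ctx):
    return {**ctx, "model": "optimized"}


def check_models(ctx):
    # Quality gate: refuse to continue if the model is not good enough.
    if ctx["accuracy"] < ctx["min_accuracy"]:
        raise ValueError("accuracy below threshold")
    return ctx


def build_serving_container(ctx):
    return {**ctx, "image": "model-server:candidate"}


def deploy_model(ctx):
    return {**ctx, "deployed": True}


STAGES = [
    ("train", train_models),
    ("optimize", optimize_models),
    ("test", check_models),
    ("build", build_serving_container),
    ("deploy", deploy_model),
]

result = run_pipeline(STAGES, {"accuracy": 0.93, "min_accuracy": 0.9})
print(result["ok"], result["completed"])
```

With an `accuracy` below `min_accuracy`, the run stops at the test stage and no container is built or deployed.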

Challenge: Requirements Engineering
• What dataset(s)?
• What target/serving environment?
• What model architecture?
• Pre-trained model available?
• How many training resources?

Challenge: Data Quality
• Data is typically not ready to be consumed by an ML job*
  – Data Cleaning
    • Missing/incorrect labels
  – Data Preparation
    • Same Format
    • Same Distribution
* Demo datasets are a fortunate exception :)
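Two of the checks above lend themselves to a small sketch: finding missing or incorrect labels, and testing whether two datasets share the same label distribution. The helper names, the record layout (a dict with a `label` key), and the use of total variation distance as the comparison metric are assumptions made for illustration.

```python
from collections import Counter


def find_bad_labels(records, allowed_labels):
    """Return indices of records whose label is missing or not allowed."""
    return [
        index
        for index, record in enumerate(records)
        if record.get("label") not in allowed_labels
    ]


def label_distribution(records):
    """Return the relative frequency of each label."""
    counts = Counter(record["label"] for record in records)
    total = sum(counts.values())
    return {label: count / total for label, count in counts.items()}


def total_variation_distance(dist_a, dist_b):
    """Half the L1 distance between two discrete distributions (0 = identical)."""
    labels = set(dist_a) | set(dist_b)
    return 0.5 * sum(
        abs(dist_a.get(label, 0.0) - dist_b.get(label, 0.0)) for label in labels
    )


raw = [{"label": "dog"}, {"label": "cat"}, {"label": None}, {"label": "dgo"}]
bad = find_bad_labels(raw, allowed_labels={"dog", "cat"})
print(bad)  # [2, 3]

# Drop the flagged records, then compare against a second (placeholder) dataset.
clean = [record for index, record in enumerate(raw) if index not in bad]
other = [{"label": "dog"}] * 3 + [{"label": "cat"}]
print(total_variation_distance(label_distribution(clean),
                               label_distribution(other)))
```

A distance near 0 suggests the two label distributions match; a large value is a signal to investigate before training or serving on that data.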

Challenge: Data Quality
Don’t forget about the serving environment!!

“Libraries”
• Different Frameworks
• Existing architectures
• Pretrained models

TensorFlow Hub
https://www.tensorflow.org/hub/

Writing Distributed Model Functions

Testing
• Training/Test/Validation Datasets
• Unit Tests?
• Different factors
  – Accuracy
  – Serving performance
  – ….
• A/B Testing with live Data
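A reproducible split into training, validation, and test datasets, plus an accuracy check that can run as a unit test, might look like the sketch below. The function names, default fractions, and the toy parity "model" are hypothetical. A real project would substitute its own model and thresholds.

```python
import random


def split_dataset(examples, seed, validation_fraction=0.1, test_fraction=0.1):
    """Shuffle with a fixed seed and split into (training, validation, test)."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)  # seeded, so the split is repeatable
    n_test = int(len(shuffled) * test_fraction)
    n_validation = int(len(shuffled) * validation_fraction)
    test = shuffled[:n_test]
    validation = shuffled[n_test:n_test + n_validation]
    training = shuffled[n_test + n_validation:]
    return training, validation, test


def accuracy(predict, examples):
    """Fraction of (features, label) pairs that predict() classifies correctly."""
    correct = sum(1 for features, label in examples if predict(features) == label)
    return correct / len(examples)


# Toy data: the label is the parity of the feature.
examples = [(n, n % 2) for n in range(100)]
training, validation, test = split_dataset(examples, seed=7)


def parity_model(features):
    # Stand-in for a trained model.
    return features % 2


# A unit-test style gate: fail the build if held-out accuracy drops too low.
assert accuracy(parity_model, test) >= 0.9
print(len(training), len(validation), len(test))
```

Keeping the seed fixed means a change in test accuracy reflects a change in the model or data, not a different split.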

Debugging
https://www.tensorflow.org/programmers_guide/debugger

Profiling
https://www.tensorflow.org/performance/performance_guide

Monitoring
Challenges
• Understand {...}
• Debug
• Model Quality
  – Accuracy
  – Training Time
  – …
• Overall Architecture
  – Availability
  – Latencies
  – ...
Solutions
• TensorBoard
• Traditional Cluster Monitoring Tool
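For the architecture-level metrics listed above, availability and latencies can be derived from a request log. This is a minimal sketch, assuming each request is recorded as a dict with hypothetical `ok` and `latency_ms` fields. It is not tied to TensorBoard or any specific monitoring tool.

```python
import math


def percentile(values, q):
    """Nearest-rank percentile of a non-empty list, for 0 < q <= 100."""
    ordered = sorted(values)
    rank = math.ceil(q * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


def summarize(requests):
    """Compute availability and latency percentiles from request records."""
    latencies = [request["latency_ms"] for request in requests]
    successes = sum(1 for request in requests if request["ok"])
    return {
        "availability": successes / len(requests),
        "p50_ms": percentile(latencies, 50),
        "p99_ms": percentile(latencies, 99),
    }


# Placeholder request log.
log = [
    {"ok": True, "latency_ms": 10},
    {"ok": True, "latency_ms": 20},
    {"ok": True, "latency_ms": 30},
    {"ok": False, "latency_ms": 400},
]
print(summarize(log))
```

Tail percentiles such as p99 expose slow requests that an average would hide.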

Serving
Challenges
• How to Deploy Models?
  – Zero Downtime
  – Canary
Solutions
• TensorFlow Serving
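A canary rollout sends a small share of traffic to the new model while the stable one keeps serving the rest. The sketch below shows one possible routing rule, assuming each request carries an ID. `choose_model` is a hypothetical helper and not part of the TensorFlow Serving API.

```python
import hashlib


def choose_model(request_id, canary_percent):
    """Route a request to "canary" or "stable" based on a hash of its ID.

    Hashing makes the decision deterministic: the same request ID always
    reaches the same model version for a given canary_percent.
    """
    digest = hashlib.sha256(request_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:4], "big") % 100  # bucket in 0..99
    return "canary" if bucket < canary_percent else "stable"


# Send roughly 10% of requests to the new model version.
routes = [choose_model(f"request-{n}", canary_percent=10) for n in range(1000)]
print(routes.count("canary"), routes.count("stable"))
```

Raising `canary_percent` step by step to 100 moves all traffic to the new model without taking the old one offline first, which is one way to approach the zero-downtime goal.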

Serving
Challenges
• Small/Fast model without losing too much performance
• 500 KB models….

Data Science Pipeline
• Data & Streaming: Distributed Data Storage and Streaming
• Model Engineering: Data Preparation and Analysis
• Model Training: Distributed Training using Machine Learning Frameworks
• Model Management: Storage of trained Models and Metadata
• Model Serving: Use trained Model for Inference
Across all stages: Continuous Integration, Monitoring & Operations, Management

THANK YOU! ANY QUESTIONS?
@dcos
users@dcos.io
/groups/8295652
/dcos
/dcos/examples
/dcos/demos
chat.dcos.io
https://mesosphere.com/resources/building-data-science-platform/

Make it insanely easy to build and scale world-changing technology
