Improving How We Deliver Machine Learning Models (XCONF 2019)

23 April 2019
IMPROVING HOW WE DELIVER MACHINE LEARNING MODELS

WHO ARE WE?
David Tan
Developer @ ThoughtWorks
Jonathan Heng
Developer @ ThoughtWorks

THE PLAN TODAY
Why do we need to improve the ML workflow?
What are some better practices?
How can we practice these practices?

TEMPERATURE CHECK
Who has...
● Trained an ML model before?
● Deployed an ML model for fun?
● Deployed an ML model at work?
● Deployed a model using an automated CI pipeline?

OBSERVATION
One of these is not like the others
Continuous delivery practices can help us change this.

JOURNEY ON AN ML PROJECT
[Diagram: Jupyter Notebooks → ??? → Production]

IT’S NEVER JUST ML
We got 99 problems and machine learning ain’t one
Potentially unethical outcomes
How can we help people do difficult things?
Source: Machine Learning: The High Interest Credit Card of Technical Debt (Google, 2015)

HELPING PEOPLE DO DIFFICULT THINGS
Sensible defaults
● Two reference repos:
○ github.com/ThoughtWorksInc/ml-cd-starter-kit
○ github.com/ThoughtWorksInc/ml-app-template
● Better ways of working
○ 6 common problems and suggested solutions

LET’S GO
6 common problems and suggested solutions

1. WORKS ON MY MACHINE
To start, simply:
• docker build …
• docker run ...

1. WORKS ON MY MACHINE
Demo

[Diagram labels: Business problem, ML model, Feature engineering, Data]

2. NO DATA / DATA SUCKS
Mitigation measures
● Think about data access before starting project
● Collect better data with every release (more on this a few slides from now)
● “Wizard of Oz” / Fake it till we make it
○ Provide interface but without ML implementation
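
The "Wizard of Oz" measure above could look roughly like the sketch below. It is an illustration only, not code from the talk or the reference repos: the `PricePredictor` interface, the `HeuristicPredictor` class, and the `rooms` feature are all hypothetical names. The idea is that callers depend only on the interface, so a trained model can later replace the hand-written rule without changing the calling code.

```python
from typing import Mapping, Protocol


class PricePredictor(Protocol):
    """Hypothetical interface that the rest of the application depends on."""

    def predict(self, features: Mapping[str, float]) -> float:
        ...


class HeuristicPredictor:
    """Stand-in with no ML: a hand-written rule behind the same interface."""

    def __init__(self, base_price: float, price_per_room: float) -> None:
        self.base_price = base_price
        self.price_per_room = price_per_room

    def predict(self, features: Mapping[str, float]) -> float:
        # A missing "rooms" feature falls back to 0 so the service always answers.
        rooms = features.get("rooms", 0.0)
        return self.base_price + self.price_per_room * rooms


def handle_request(predictor: PricePredictor, features: Mapping[str, float]) -> dict:
    """Service entry point; it does not know whether ML sits behind the interface."""
    return {"prediction": predictor.predict(features)}


if __name__ == "__main__":
    predictor = HeuristicPredictor(base_price=100.0, price_per_room=50.0)
    print(handle_request(predictor, {"rooms": 3.0}))
```

Because the service is deployable before any model exists, it can start collecting real inputs for later training.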

3. DEPLOYMENTS ARE COMPLICATED

3. DEPLOYMENTS ARE COMPLICATED
Mitigation measures
● CI pipeline
● Deploy early and often
● Tracer bullet
● Bring the pain forward

3. DEPLOY EARLY AND OFTEN
Source: Continuous Delivery (Jez Humble, Dave Farley)
[Pipeline diagram: Local env → push → Source code repository → trigger → Run unit tests, with feedback to Local env]

[Pipeline diagram: Local env → push → Source code repository → trigger → Run unit tests → Train and evaluate model, with feedback to Local env]

[Pipeline diagram: Local env → push → Source code repository → trigger → Run unit tests → Train and evaluate model → Deploy candidate model to staging, with feedback to Local env. Also shown: Artifact repository]

[Pipeline diagram: Local env → push → Source code repository → trigger → Run unit tests → Train and evaluate model → Deploy candidate model to staging → Deploy model to production. Also shown: Artifact repository]

4. HOW DO WE CHOOSE BETWEEN CANDIDATE MODELS?

4. CHOOSING BUILDS
Include model evaluation metrics in CI pipeline
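
One way to include evaluation metrics in a CI pipeline is a stage that computes them and fails when a threshold is missed. The sketch below is an assumption about how that could look, not the pipeline from the talk: the function names and the `max_rmse` / `min_r2` thresholds are hypothetical, and the metrics are computed with the standard library only to keep the example self-contained.

```python
import math
from typing import Sequence


def rmse(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Root mean squared error."""
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


def r_squared(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Coefficient of determination: 1 - SS_res / SS_tot.

    Assumes y_true is not constant (otherwise SS_tot is zero).
    """
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


def check_metrics(
    y_true: Sequence[float],
    y_pred: Sequence[float],
    max_rmse: float,
    min_r2: float,
) -> dict:
    """Return the metrics, or raise ValueError so the CI stage fails."""
    metrics = {"rmse": rmse(y_true, y_pred), "r2": r_squared(y_true, y_pred)}
    if metrics["rmse"] > max_rmse:
        raise ValueError(f"rmse {metrics['rmse']:.3f} exceeds {max_rmse}")
    if metrics["r2"] < min_r2:
        raise ValueError(f"r2 {metrics['r2']:.3f} is below {min_r2}")
    return metrics
```

A test runner or a small script in the pipeline could call `check_metrics` on a held-out set after training; an uncaught exception gives a non-zero exit code, which most CI servers treat as a failed stage.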

5. HOW’S THE MODEL DOING IN THE WILD?

5. OBSERVE!
Monitoring service usage
Benefit #1: Feedback on production model

5. OBSERVE!
Monitoring model output

5. OBSERVE!
Monitoring model inputs
● Could help identify training-serving skew
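
A rough check for training-serving skew is to compare summary statistics of logged serving inputs against the training data. The sketch below is one simple heuristic under stated assumptions, not the method used in the talk: it measures how far each serving mean has moved from the training mean, in units of the training standard deviation, and the alert threshold of 3.0 is an arbitrary illustrative value.

```python
import statistics
from typing import Dict, List, Mapping, Sequence


def feature_shift(
    training: Mapping[str, Sequence[float]],
    serving: Mapping[str, Sequence[float]],
) -> Dict[str, float]:
    """Shift of each serving mean from the training mean, in training std units."""
    shifts = {}
    for name, train_values in training.items():
        train_mean = statistics.fmean(train_values)
        train_std = statistics.pstdev(train_values)
        serve_mean = statistics.fmean(serving[name])
        if train_std == 0:
            # A constant training feature has zero std; report infinity if it moved.
            shifts[name] = 0.0 if serve_mean == train_mean else float("inf")
        else:
            shifts[name] = abs(serve_mean - train_mean) / train_std
    return shifts


def skewed_features(shifts: Mapping[str, float], threshold: float = 3.0) -> List[str]:
    """Names of features whose shift exceeds the (assumed) alert threshold."""
    return sorted(name for name, shift in shifts.items() if shift > threshold)


if __name__ == "__main__":
    training = {"age": [20, 30, 40], "income": [1, 2, 3]}
    serving = {"age": [29, 31], "income": [10, 12]}
    print(skewed_features(feature_shift(training, serving)))
```

Comparing means catches only coarse drift; distribution-level tests would be needed for subtler changes.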

5. OBSERVE!
Benefit #2: Interpretability of predictions

5. OBSERVE!
Benefit #3: Closing the data collection loop
[Diagram: data collection loop. Stages shown: Data turking; Data / feature repository; Train & test model; Evaluate models; Deploy model; Model Service; Logs. Legend: flow of data, flow of model]

5. OBSERVE!
Benefit #4: Ability to measure goodness of any model
[Pipeline diagram, triggered by (git push). Stages shown: build_and_test; evaluate_model; deploy_staging; deploy_prod; evaluate_model_w_new_data. Annotations: model = my-image:$BUILD_ID, r_2 = 0.7, rmse = 42]

5. OBSERVE!
Benefit #4: Ability to measure goodness of any model
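
Measuring the goodness of any model on newly collected data could be sketched as below. This is a hypothetical illustration: the build IDs, the `rank_builds` function, and the representation of a model as a plain callable are assumptions, not taken from the talk. Each build is scored on the same fresh labelled data (for example, data recovered from production logs) and the builds are ranked by RMSE.

```python
import math
from typing import Callable, List, Mapping, Sequence, Tuple

# For this sketch a "model" is any callable mapping one input to one prediction.
Model = Callable[[float], float]


def rmse(model: Model, data: Sequence[Tuple[float, float]]) -> float:
    """Root mean squared error of a model on (input, label) pairs."""
    errors = [(model(x) - y) ** 2 for x, y in data]
    return math.sqrt(sum(errors) / len(errors))


def rank_builds(
    builds: Mapping[str, Model], new_data: Sequence[Tuple[float, float]]
) -> List[Tuple[str, float]]:
    """Return (build_id, rmse) pairs, lowest RMSE first."""
    scores = [(build_id, rmse(model, new_data)) for build_id, model in builds.items()]
    return sorted(scores, key=lambda pair: pair[1])


if __name__ == "__main__":
    builds = {"build-41": lambda x: 2 * x, "build-42": lambda x: 2 * x + 1}
    print(rank_builds(builds, [(1.0, 3.0), (2.0, 5.0)]))
```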

5. HOW’S THE MODEL IN THE WILD?
OBSERVE!
Summing up
● Mitigation measures
○ Logging + Monitoring
● Benefits
○ Feedback on production models
○ Interpretability (how did the model decide on this particular prediction?)
○ Better data for training
○ Better (unseen) data for evaluating candidate/champion models
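
The logging side of this summary could be as simple as emitting one structured record per prediction. The sketch below is a minimal example under assumptions: the field names and the `model_service` logger name are illustrative, and it presumes some log shipper (the demo stack names Fluentd) collects the JSON lines. It uses only the Python standard library.

```python
import json
import logging
import time
import uuid
from typing import Mapping

logger = logging.getLogger("model_service")


def build_prediction_record(
    model_version: str, features: Mapping[str, float], prediction: float
) -> dict:
    """Assemble one log record; the field names are illustrative assumptions."""
    return {
        "request_id": str(uuid.uuid4()),
        "timestamp": time.time(),
        "model_version": model_version,
        "inputs": dict(features),
        "prediction": prediction,
    }


def log_prediction(
    model_version: str, features: Mapping[str, float], prediction: float
) -> dict:
    """Emit the record as a single JSON line so it is easy to parse downstream."""
    record = build_prediction_record(model_version, features, prediction)
    logger.info(json.dumps(record, sort_keys=True))
    return record


if __name__ == "__main__":
    logging.basicConfig(level=logging.INFO, format="%(message)s")
    log_prediction("build-123", {"rooms": 3.0}, 250.0)
```

Logging inputs alongside outputs and the model version is what makes the later benefits possible: the same records can feed monitoring dashboards, explanations of individual predictions, and new training or evaluation data.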

Demo
GoCD
MLFlow
Kubernetes + Helm
ElasticSearch
Fluentd
Kibana
Grafana

Demo

6. HARMFUL MODELS IN PRODUCTION

6. HARMFUL MODELS IN PRODUCTION
● PredPol algorithm reinforces racial biases in policing data
● Recruiting tool shows bias against women
Actual news headlines
Image source: I’m an AI researcher, and here’s what scares me about AI (Rachel Thomas)

Mitigation measures
● Discuss and define what “bad” looks like in our context
● “Black mirror” retros
● Measure unfairness
○ Make fairness a measurable fitness function
● Data ethics checklist (link)
● Human-in-the-loop / appeal processes
● Ability to recover from harmful models
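
Making fairness a measurable fitness function could, for instance, mean computing a fairness metric in the pipeline and failing the build when it exceeds an agreed limit. The sketch below uses demographic parity difference, which is only one of several fairness metrics and is chosen here purely for illustration; it is not named in the talk. The `fairness_gate` function and the `max_gap` limit are hypothetical.

```python
from typing import Sequence


def positive_rate(predictions: Sequence[int]) -> float:
    """Share of positive (1) predictions within one group."""
    return sum(predictions) / len(predictions)


def demographic_parity_difference(
    group_a: Sequence[int], group_b: Sequence[int]
) -> float:
    """Absolute gap in positive-prediction rates between two groups."""
    return abs(positive_rate(group_a) - positive_rate(group_b))


def fairness_gate(
    group_a: Sequence[int], group_b: Sequence[int], max_gap: float
) -> float:
    """Return the gap, or raise ValueError so a pipeline stage can fail."""
    gap = demographic_parity_difference(group_a, group_b)
    if gap > max_gap:
        raise ValueError(f"parity gap {gap:.3f} exceeds limit {max_gap}")
    return gap


if __name__ == "__main__":
    # Binary predictions (1 = positive outcome) split by a protected attribute.
    print(fairness_gate([1, 1, 0, 0], [1, 0, 0, 0], max_gap=0.3))
```

Which metric and limit are appropriate depends on the context, which is why the first bullet (defining what "bad" looks like) comes before measurement.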

Demo: rollback to last good build

SUMMING UP
How can we make it easier to do the right thing?

MAKE IT EASIER TO DO THE RIGHT THING
● Better ways of working
○ Environment management
○ Closing the data collection loop
○ Deploy early and often
○ Automated tracking of hyperparameters and metrics
○ Logging and monitoring
○ Do no harm
● Two reference repos:
○ github.com/ThoughtWorksInc/ml-cd-starter-kit
○ github.com/ThoughtWorksInc/ml-app-template

● github.com/ThoughtWorksInc/ml-cd-starter-kit: provision and configure cross-cutting services
○ GoCD
○ EFKG (ElasticSearch, Fluentd, Kibana, Grafana)
○ MLFlow
● github.com/ThoughtWorksInc/ml-app-template: project boilerplate template
○ Unit tests
○ Train model
○ Test model metrics
○ Dockerised setup
○ Store CI pipeline as code
○ Track hyperparameters and metrics of each training run on CI
○ Logging (predictions, inputs, explanatory variables)

SUMMING UP
[Diagram: Notebook / playground → commit and push → Continuous Delivery loop (Experiment / Develop, Test, Deploy, Monitor) → PROD (maybe)]

FURTHER READING
● https://www.thoughtworks.com/intelligent-empowerment
● www.continuousdelivery.com

David Tan / Jonathan Heng
davidtan+jonheng@thoughtworks.com
THANK YOU.
