@anandsampat
Version Control for Machine
Learning + AI
Workshop
Stanford
Before we begin:
Datmo installation:* datmo.com/get-started
*Having trouble? Install VirtualBox and follow along instead:
https://docs.datmo.com/guides/using-datmo-on-virtualbox.html
Anand Sampat
Co-founder, Datmo
Workshop Outline
1. Conventional version control
2. The curious case of QoDs
3. How Datmo bridges the gap
4. Iris dataset model example
5. Reproduce + use the model
What is Version Control?
“The management of changes to
documents, computer programs, large
web sites, and other collections of
information.”
*AKA `Source Control`
https://www.ctl.io/developers/assets/images/blog/scmhistory.png
Version Control Timeline
mercurial
You’ve probably heard of Git.
Git is a version control system for tracking
changes in computer files and
coordinating work on those files among
multiple people. It is primarily used
for source code management in software
development, but it can be used to keep
track of changes in any set of files.
So, GitHub, right?
(Yes, and no.)
Git(Hub) Revolutionized
Software Development
GitHub = SCM + Hosting + Much More
For developers: For enterprises:
• Self-managed SCM servers
became a thing of the past
• Developers could leverage
industry best practices for their
own personal work
• Community of knowledge
built around a known standard
• Collaboration on Open Source
Software
• Advent of continuous
integration / deployment
• Removed need for external
code issue tracking tool
• Consolidation of code storage
and versioning tool

• Pull Requests, code review,
documentation through
ReadMe
Workshop Outline
1. Conventional version control
2. The curious case of QoDs
3. How Datmo bridges the gap
4. Iris dataset model example
5. Reproduce + use the model
QoDs == Quantitative Oriented Developers
Artificial Intelligence, Data Science, Machine Learning
https://blog.datmo.io/demystifying-the-ml-ai-and-data-science-development-ecosystem-part-1-build-76c6d4911d07
+ Deployment!

+ Post-Deployment!
(DevOps!)
It’s time to talk about MLOps
https://papers.nips.cc/paper/5656-hidden-technical-debt-in-machine-learning-systems.pdf
MLOps: The Elephant in the Room
https://papers.nips.cc/paper/5656-hidden-technical-debt-in-machine-learning-systems.pdf
“ML systems have a special capacity for incurring
technical debt, because they have all of the
maintenance problems of traditional code plus an
additional set of ML-specific issues. This debt may be
difficult to detect because it exists at the system level.”
- Google (Sculley et al., 2015)
“Typical methods for paying down code level
technical debt are not sufficient to address
ML-specific technical debt at the system level.”
- Google (Sculley et al., 2015)
http://eng.uber.com/wp-content/uploads/2017/09/image8.png
Here’s where traditional tools fall short
https://eng.uber.com/michelangelo/
https://code.facebook.com/posts/1072626246134461/introducing-fblearner-flow-facebook-s-ai-backbone/
As for everyone else?
Workshop Outline
1. Conventional version control
2. The curious case of QoDs
3. How Datmo bridges the gap
4. Iris dataset model example
5. Reproduce + use the model
What is Datmo?
Datmo is a workflow tool for ML, AI,
and Data Science developers. It helps
with versioning models, handling
environments, and reproducing
results through the power of snapshots.
What are Datmo Snapshots?
Code
Environment
Configuration
Files*
Metrics
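To make the five components concrete, here is a minimal sketch of a snapshot-like record. It is not Datmo's implementation: the `Snapshot` class, its field names, and the `snapshot_id` method are hypothetical, and the values below are placeholders. It only shows how bundling the five pieces gives each experiment its own identity.

```python
import hashlib
import json
from dataclasses import asdict, dataclass, field


@dataclass(frozen=True)
class Snapshot:
    """Hypothetical record bundling the five components listed above."""
    code_ref: str                                  # e.g. a Git commit hash
    environment: str                               # e.g. a container image tag
    config: dict = field(default_factory=dict)     # hyperparameters, settings
    files: tuple = ()                              # model or output file paths
    metrics: dict = field(default_factory=dict)    # results such as accuracy

    def snapshot_id(self) -> str:
        # Sorted keys make the serialization stable, so equal content
        # always produces the same identifier.
        payload = json.dumps(asdict(self), sort_keys=True)
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()[:12]


# Placeholder values: same code, different configuration.
run_a = Snapshot("abc123", "python:3.6", {"C": 1.0}, ("model.pkl",), {"accuracy": 0.9})
run_b = Snapshot("abc123", "python:3.6", {"C": 10.0}, ("model.pkl",), {"accuracy": 0.9})

print(run_a.code_ref == run_b.code_ref)            # True: identical code
print(run_a.snapshot_id() == run_b.snapshot_id())  # False: different experiments
```

The two runs share a code reference but get different identifiers, which is the gap a code-only history leaves open.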
Why are they important?
Git Commits: Code, Files*
Datmo Snapshots: Code, Files*, Environment, Configuration, Metrics
How will it help?
Datmo leverages containers to quickly
spin up reproducible developer
environments. It tracks this
environment, along with model
metadata, inside of snapshots.
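As a rough illustration of what "tracking the environment" can mean, the sketch below records the interpreter version, platform, and installed packages of the current Python process. It assumes a plain Python process rather than a container, and `describe_environment` is a hypothetical helper, not a Datmo function.

```python
import platform
from importlib import metadata


def describe_environment() -> dict:
    """Collect a minimal description of the running Python environment."""
    packages = sorted(
        f"{dist.metadata.get('Name')}=={dist.version}"
        for dist in metadata.distributions()
        if dist.metadata.get("Name")  # skip entries with missing metadata
    )
    return {
        "python": platform.python_version(),
        "platform": platform.platform(),
        "packages": packages,
    }


env = describe_environment()
print(env["python"], env["platform"], len(env["packages"]), "packages")
```

Storing a description like this next to the model metadata makes it possible to tell later whether two results came from the same setup. A container image goes further by pinning the operating system and system libraries as well.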
From a broad perspective:
Make ML Ops and workflows
manageable and simple, not
completely abstracted away.
Reduce the amount of glue code
so that people can have more
robust pipelines.
GitHub = SCM + Hosting + More
Datmo = Model Versioning +
Environments + Deployment + More
Workshop Outline
1. Conventional version control
2. The curious case of QoDs
3. How Datmo bridges the gap
4. Iris dataset model example
5. Reproduce + use the model
Datmo in today’s example
We’re going to use Datmo to show how we can
quickly iterate on our model and streamline our
workflow.
We’ll go through using snapshots for A/B testing,
saving our tasks, and enabling you all to reproduce
my results/make your own changes to the model.
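The A/B testing idea can be sketched without any tooling: if every snapshot carries its metrics, choosing between two experiments becomes a lookup. The snapshot dictionaries, their keys, and the metric values below are placeholders for illustration, not output from the workshop model or from Datmo.

```python
def best_snapshot(snapshots, metric):
    """Return the snapshot with the highest value for the given metric."""
    scored = [s for s in snapshots if metric in s["metrics"]]
    if not scored:
        raise ValueError(f"no snapshot reports {metric!r}")
    return max(scored, key=lambda s: s["metrics"][metric])


# Placeholder records for two experiments on the same problem.
candidates = [
    {"id": "snapshot-a", "config": {"max_depth": 2}, "metrics": {"accuracy": 0.90}},
    {"id": "snapshot-b", "config": {"max_depth": 4}, "metrics": {"accuracy": 0.95}},
]

winner = best_snapshot(candidates, "accuracy")
print(winner["id"], winner["config"])
```

Because the configuration travels with the metric, the winning run can be reproduced rather than just reported.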
Problem:
Multiclass Classification of Flower Species
Dataset: Fisher’s Iris Flower
http://archive.ics.uci.edu/ml/datasets/Iris
At a glance:
- 4 Features
- 3 Classes
- 150 Rows (50 per class)
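To show what a 4-feature, 3-class problem looks like in code, here is a small sketch using a nearest-centroid classifier in plain Python. This is a stand-in technique chosen for brevity, not the model used in the live demo, and the nine hard-coded measurements stand in for the full 150-row dataset.

```python
import math
from collections import defaultdict

# A few Iris-style measurements, hard-coded for illustration:
# (sepal length, sepal width, petal length, petal width) in cm.
SAMPLES = [
    ((5.1, 3.5, 1.4, 0.2), "setosa"),
    ((4.9, 3.0, 1.4, 0.2), "setosa"),
    ((4.7, 3.2, 1.3, 0.2), "setosa"),
    ((7.0, 3.2, 4.7, 1.4), "versicolor"),
    ((6.4, 3.2, 4.5, 1.5), "versicolor"),
    ((6.9, 3.1, 4.9, 1.5), "versicolor"),
    ((6.3, 3.3, 6.0, 2.5), "virginica"),
    ((5.8, 2.7, 5.1, 1.9), "virginica"),
    ((7.1, 3.0, 5.9, 2.1), "virginica"),
]


def fit_centroids(samples):
    """Average the feature vectors of each class."""
    sums = defaultdict(lambda: [0.0] * 4)
    counts = defaultdict(int)
    for features, label in samples:
        counts[label] += 1
        for i, value in enumerate(features):
            sums[label][i] += value
    return {
        label: tuple(total / counts[label] for total in totals)
        for label, totals in sums.items()
    }


def predict(centroids, features):
    """Pick the class whose centroid is closest in Euclidean distance."""
    return min(centroids, key=lambda label: math.dist(features, centroids[label]))


centroids = fit_centroids(SAMPLES)
print(predict(centroids, (5.0, 3.4, 1.5, 0.2)))  # setosa
```

Even a toy model like this has a configuration (the distance measure), an environment (the Python version), and a metric worth recording, which is what the rest of the workshop tracks.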
Model Experimentation
Live Demo
Workshop Outline
1. Conventional version control
2. The curious case of QoDs
3. How Datmo bridges the gap
4. Iris dataset model example
5. Reproduce + use the model
Reproducing the Model
Ensure you are signed up on Datmo:
https://datmo.com/signup
One-time initial setup:
$ [sudo] datmo setup
Connect GitHub:
https://datmo.com/settings/integration
Fork the model
Fork from the web platform GUI (top right corner):
https://datmo.com/anands/workshop-iris-classification
Fetch your model from Datmo
Clone the Datmo model:
$ datmo clone [YOUR-USERNAME]/workshop-iris-classification
Jump into this directory:
$ cd workshop-iris-classification
Check out an existing snapshot
View all model snapshots
$ datmo snapshot ls
Check out a particular snapshot
$ datmo snapshot checkout --id ______
Create your own snapshot
Track Snapshots
https://datmo.com/anands/workshop-iris-classification/snapshots?grid=1
Run the Task
$ datmo task run "python3 classifier.py"
We want our Python file to be run
inside of the container. Why?
Create a Snapshot from Task output
$ datmo snapshot task --id _________
What just happened?
Command: Result
• datmo clone: Datmo cloned the model from the platform,
bringing all of the necessary resources to your local machine.
• datmo snapshot checkout: Datmo set your current code to the
state of the desired snapshot.
• datmo task run: Datmo built the environment inside of a
container, executed the task inside of the container,
and logged the results.
• datmo snapshot task: Datmo combined the task output files,
environment, code, configs, and metrics into a snapshot.
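The "execute the task and log the results" step can be sketched in plain Python as follows. This is only an illustration under stated assumptions: `run_task` is a hypothetical helper, the `name: value` line format for metrics is an invented convention rather than Datmo's log format, and the command runs directly on the host instead of inside a container.

```python
import re
import subprocess
import sys

# Assumed convention: the task prints metrics as lines like "accuracy: 0.5".
METRIC_LINE = re.compile(r"^(\w+):\s*([-+]?\d*\.?\d+)\s*$")


def run_task(command):
    """Run a command, capture its output, and parse metric lines from it."""
    result = subprocess.run(command, capture_output=True, text=True)
    metrics = {}
    for line in result.stdout.splitlines():
        match = METRIC_LINE.match(line.strip())
        if match:
            metrics[match.group(1)] = float(match.group(2))
    return {
        "returncode": result.returncode,
        "log": result.stdout,
        "metrics": metrics,
    }


# A stand-in for "python3 classifier.py" that prints one placeholder metric.
task = run_task([sys.executable, "-c", "print('training...'); print('accuracy: 0.5')"])
print(task["returncode"], task["metrics"])
```

Keeping the raw log next to the parsed metrics means a later snapshot can store both the numbers and the evidence for them.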
Key Takeaways
1. Traditional source control isn’t enough for QoDs
(Data Science, ML, and AI)
2. Think about ML Ops before you’re “in too deep”
3. Just as GitHub revolutionized software
engineering, Datmo does the same for QoDs
Code Available at:
https://datmo.com/anands/workshop-iris-classification
Full Slides Available at:
https://bit.ly/stanford-version-control
Going Forward
Next Steps
1. Check out example workflows in our docs to
create your own Datmo project
2. Learn more about ML and browse more content
at our blog: https://blog.datmo.com
3. Interested in updates? You’ll be signed up for our
weekly newsletter if you signed up today.
4. Stay tuned for our open source library this
month. It’ll be at https://github.com/datmo/datmo
Thank You!
References
Nuts and Bolts of Source Control:
http://ericsink.com/scm/source_control.html
2015 NIPS Paper from Google:
https://papers.nips.cc/paper/5656-hidden-technical-debt-in-machine-learning-systems.pdf
