Data versioning in machine learning projects
Dmitry Petrov
dmitry@iterative.ai
PyData Berlin 2018
Agenda
1. What makes ML process special?
2. Data files
3. ML pipelines and reproducibility
4. Data science workflow
5. Beyond the Horizon
Hello
Dmitry Petrov
Twitter: @FullStackML
PhD in Computer Science
Co-Founder & CEO at Iterative.AI. San Francisco, USA
ex-Data Scientist at Microsoft (BingAds). Seattle, USA
ex-Head of Lab at St. Petersburg Electrotechnical University, Russia
Chapter I
What makes ML process special?
Data files hell problem
1. Data files are not in your repository.
2. Tons of data file versions:
- model.pkl
- model_L7_e120.pkl
- model_vgg16_L5tune_e120.pkl
- model_L7_e160_cleansed.pkl
- model_vgg16_L45tune_e120.pkl
- model_vgg16_L45tune_e160_noempty.pkl
- …
3. Data files are not connected to code files.
- $ git checkout finetune_head # creates even more mess
Data files hell in a team
1. How to create a reproducible ML project?
2. How to scale the ML process in a team:
a. feature extraction
b. tuning the current model
c. experimenting with new models
3. How to hand an ML model over to DevOps for deployment, or revert a model
Methodology mismatch
“Data science as different from software as software was different from hardware”
https://dominodatalab.wistia.com/medias/fq0l4152sh
Agile development methodology should cover data science.
            | Hardware  | Software    | Data science/ML
Methodology | Waterfall | Agile/Scrum | Agile/?
What is special about Data Science?
New artifacts to manage:
● Experiment: Code + Data files.
● Metrics.
● ML pipelines and reproducibility.
Different process:
● R&D-like. A lot of trial and error; progress should be measured in a different way.
● Ephemerality. Hard to communicate and track the progress.
* The image from: https://www.customsigns.com/experiment-fail-learn-repeat-poster-sign-18-x-24
DVC project motivation
Data Version Control (DVC) is an open source tool for managing ML projects: http://dvc.org
DVC is a data science platform on top of an open source stack.
GitHub repo: https://github.com/iterative/dvc
Download binaries (Mac, Linux, Windows) or $ pip install dvc
It extends Git with the commands dvc add, dvc run, dvc repro, dvc remote.
What does DVC do?
● Experiment as commit/branch: Code + Data files.
● Large data files:
○ Local cache.
○ Optimized for 1 GB - 100 GB file size.
○ Data remotes: S3, GCP, SSH.
● Metrics per experiment.
● ML pipelines.
● Reproducibility.
Chapter II
Data files
Existing solutions
                              | Git-LFS                                      | What is required
A single file size            | < 2 GB                                       | 1 GB - 100 GB
Workspace size (all files)    | Slow if 5 GB+                                | Unlimited
No garbage collector for data | 20 experiments of 5 GB each ~= 100 GB        | Remove data files from some experiments
Data storage                  | Proprietary and paid: only GitHub and GitLab | S3, GCP or custom server (rsync, SFTP)
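
The garbage-collection requirement can be illustrated with a small sketch. It is not how DVC or Git-LFS work internally; the function `collect_garbage`, the experiment names and the checksums are all hypothetical, and sizes are plain integers in GB. It only shows the arithmetic from the table: 20 experiments of 5 GB each fill 100 GB, and dropping the data of experiments you no longer need frees most of it.

```python
from typing import Dict, Iterable, List


def collect_garbage(cache: Dict[str, int],
                    experiments: Dict[str, List[str]],
                    keep: Iterable[str]) -> int:
    """Drop cached objects that no kept experiment references; return GB freed."""
    # Checksums still referenced by the experiments we want to keep.
    live = {checksum for name in keep for checksum in experiments[name]}
    freed = 0
    for checksum in list(cache):
        if checksum not in live:
            freed += cache.pop(checksum)
    return freed


# Hypothetical data: 20 experiments, each with its own 5 GB data file.
experiments = {f"exp{i:02d}": [f"md5-{i:02d}"] for i in range(20)}
cache = {f"md5-{i:02d}": 5 for i in range(20)}

print(sum(cache.values()))  # 100
freed = collect_garbage(cache, experiments, keep=["exp17", "exp18", "exp19"])
print(freed, sum(cache.values()))  # 85 15
```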
DVC: add data files
How data file works
Commit data file
Add data remote
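
The slides for these four steps are screenshots, so here is a rough Python sketch of the idea behind them, under stated assumptions. It is not DVC's implementation or file format: `add_data_file`, `push` and the `.pointer.json` file are hypothetical names, and a local directory stands in for a real remote such as S3. The point is that the large file goes into a checksum-addressed cache, while only a small pointer file would be committed to Git.

```python
import hashlib
import json
import shutil
from pathlib import Path


def file_md5(path: Path) -> str:
    """Hash a file in 1 MiB chunks so large files never sit in memory."""
    digest = hashlib.md5()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def add_data_file(path: Path, cache_dir: Path) -> Path:
    """Store `path` in the cache under its checksum and write a small pointer file."""
    md5 = file_md5(path)
    cache_dir.mkdir(parents=True, exist_ok=True)
    cached = cache_dir / md5
    if not cached.exists():
        shutil.copy2(path, cached)
    # Hypothetical pointer format; this small file is what Git would track.
    pointer = path.with_name(path.name + ".pointer.json")
    pointer.write_text(json.dumps({"path": path.name, "md5": md5}, indent=2))
    return pointer


def push(cache_dir: Path, remote_dir: Path) -> int:
    """Copy cached objects missing from the remote; return how many were uploaded."""
    remote_dir.mkdir(parents=True, exist_ok=True)
    uploaded = 0
    for obj in cache_dir.iterdir():
        target = remote_dir / obj.name
        if not target.exists():
            shutil.copy2(obj, target)
            uploaded += 1
    return uploaded
```

Because objects are named by checksum, pushing the same cache twice uploads nothing the second time.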
DVC: checkout and optimization
Optimizations:
1. No data file copying: hardlinks are used instead of copies.
2. Checksum caching and timestamp tracking.
3. Supports reflinks (CoW, copy-on-write) on modern file systems: BTRFS, ReFS, XFS.
As a result, checking out a 100 GB data file is near-instantaneous.
Chapter III
ML pipelines and reproducibility
A simple pipeline
Pipeline: images.zip → images/ → model.p → plots.jpg
Specify: input (-d), output (-o) and command.
Reproducibility
DVC reproduces an ML pipeline in a single command: dvc repro.
Any DAG (directed acyclic graph) is supported.
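
To make the idea of reproducing a DAG concrete, here is a minimal Python sketch under stated assumptions. It is not DVC's algorithm: `Stage` and `repro` are hypothetical names, files are modeled as an in-memory dict of bytes, and the stage "commands" are plain functions. Stages declare inputs and outputs, the dependency order is derived from them, and a stage is re-run only when an output is missing or the checksums of its inputs have changed. It needs Python 3.9+ for `graphlib`.

```python
import hashlib
from dataclasses import dataclass
from graphlib import TopologicalSorter
from typing import Callable, Dict, List


@dataclass
class Stage:
    name: str
    deps: List[str]                          # inputs, like -d
    outs: List[str]                          # outputs, like -o
    run: Callable[[Dict[str, bytes]], None]  # stands in for the command


def repro(stages: List[Stage], files: Dict[str, bytes], state: Dict[str, dict]) -> List[str]:
    """Re-run stale stages in dependency order; return the names that were executed."""
    by_name = {s.name: s for s in stages}
    producer = {out: s.name for s in stages for out in s.outs}
    # Map each stage to the stages that produce its inputs (its predecessors).
    graph = {s.name: {producer[d] for d in s.deps if d in producer} for s in stages}
    executed = []
    for name in TopologicalSorter(graph).static_order():  # raises CycleError on cycles
        stage = by_name[name]
        dep_sums = {d: hashlib.md5(files[d]).hexdigest() for d in stage.deps}
        missing_output = any(o not in files for o in stage.outs)
        if missing_output or state.get(name) != dep_sums:
            stage.run(files)
            state[name] = dep_sums
            executed.append(name)
    return executed


def unzip(files):
    files["images/"] = b"unzipped:" + files["images.zip"]


def train(files):
    files["model.p"] = b"model:" + files["images/"]


def plot(files):
    files["plots.jpg"] = b"plot:" + files["model.p"]


# The pipeline from the slide: images.zip -> images/ -> model.p -> plots.jpg
pipeline = [
    Stage("plot", ["model.p"], ["plots.jpg"], plot),
    Stage("unzip", ["images.zip"], ["images/"], unzip),
    Stage("train", ["images/"], ["model.p"], train),
]
```

Calling `repro(pipeline, files, state)` a second time with unchanged inputs executes nothing, which is the property that makes a single reproduce command cheap to run.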
Chapter IV
Data science workflow
Workflow change is needed
Methodology → Workflows → Tools
Git is flexible: you can define your workflow.
[Figure: Git branch diagram with the branches master and new_feature]
Git workflows: from software to ML
[Figure, two panels. "Gitflow: feature driven": branches master and new_feature. "Data science flow: metrics driven": branches master, increase_beta, tune_L4 and alpha_change, with commits labeled by metric values ranging from .721 to .832]
Data science flow: why new workflow?
Different people can work on different ideas.
Collaborate without waiting.
Data science flow: hints
1. Do not forget to create branches:
$ git checkout master
$ git checkout -b alpha_change master
2. Keep failed experiments:
$ git checkout alpha_change
$ git push origin alpha_change
3. Clean up unimportant experiments.
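
A metrics-driven flow implies comparing branches by their metric rather than by feature. The sketch below is a hypothetical helper, not part of DVC or Git. The branch names come from the slides, but the pairing of branches with metric values is made up for illustration. It splits experiment branches into those that beat the baseline branch and those that do not.

```python
from typing import Dict, List, Tuple


def rank_experiments(metrics: Dict[str, float], baseline: str) -> Tuple[List[str], List[str]]:
    """Split branches into those beating the baseline and the rest, best first."""
    base = metrics[baseline]
    others = sorted((b for b in metrics if b != baseline),
                    key=lambda b: metrics[b], reverse=True)
    better = [b for b in others if metrics[b] > base]
    not_better = [b for b in others if metrics[b] <= base]
    return better, not_better


# Hypothetical pairing of branch names and metric values.
metrics = {"master": 0.736, "increase_beta": 0.721, "tune_L4": 0.832, "alpha_change": 0.810}
better, not_better = rank_experiments(metrics, baseline="master")
print("merge candidates:", better)
print("keep for the record or clean up:", not_better)
```

Branches in the second list are the failed experiments from hint 2: they are still worth pushing, and later become candidates for the cleanup in hint 3.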
Chapter V
Beyond the Horizon
Special DVC scenarios
1. Tracking data files: like Git-LFS, but with S3/GCP/SSH backends.
2. ML model deployment tool.
3. Experimentation on HDFS/Apache Spark.
When do you need DVC?
DVC is a data science platform on top of an open source stack.
It borrows some ideas from existing data science platforms but uses an open source stack and Git as its foundation.
Data science platforms help when creating ML projects in teams (3+ members).
Thank you!
Questions
Twitter: @FullStackML
Email: dmitry@iterative.ai
Discuss: discuss.dvc.org
Actions
Visit dvc.org
Star github.com/iterative/dvc
