SigOpt. Confidential.
Machine Learning Infrastructure
Alexandra Johnson
alexandra@sigopt.com @alexandraj777
Alexandra Johnson
Tech Lead, Platform Team
Let's Build a Data Science Team!
• Who do we hire?
• What do we ask them to do?
• What does success look like?
Let's Build a Data Science Team!
Call out your answers!
• Who do we hire?
  • Statisticians
  • PhDs in science / math related fields
  • People interested in building models!
• What do we ask them to do?
  • Gather data
  • Build models
  • Extract insights
• What does success look like?
  • ML models driving business decisions
A Data Scientist Wants to Build a Model
1. Gather data
2. Feature extraction
3. Pick ML framework
4. Train model
5. Analyze results
A typical model-building workflow for a data scientist working in a local development environment, such as their work laptop
A Data Scientist Wants to Build a Model
1. Gather data
2. Feature extraction
3. Pick ML framework
4. Train model
5. Analyze results
Out of memory!
Out-of-memory errors can occur for a number of reasons, including:
• data set too large
• features too large
• model too large
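As a rough illustration of the first two causes, the sketch below estimates the size of a dense float64 feature matrix and compares it with the RAM on a laptop. The row and column counts, the 16 GiB figure, and the 50% headroom rule are illustrative assumptions, not numbers from the talk.

```python
GIB = 1024 ** 3


def dense_matrix_bytes(n_rows: int, n_cols: int, bytes_per_value: int = 8) -> int:
    """Size of a dense numeric matrix, ignoring any container overhead."""
    return n_rows * n_cols * bytes_per_value


def fits_in_memory(n_rows: int, n_cols: int, ram_bytes: int, headroom: float = 0.5) -> bool:
    """True if the matrix uses at most `headroom` (a fraction) of the available RAM."""
    return dense_matrix_bytes(n_rows, n_cols) <= ram_bytes * headroom


laptop_ram = 16 * GIB  # assumed laptop memory
size = dense_matrix_bytes(10_000_000, 500)  # 10M rows x 500 float64 features
print(f"{size / GIB:.1f} GiB needed, fits: {fits_in_memory(10_000_000, 500, laptop_ram)}")
```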
In addition to memory concerns, here are some other reasons why a data scientist might not be able to train their model in their local development environment:
• High degree of parallelism
• Specialized hardware (GPUs)
• Don't want to monopolize laptop resources
New Model Building Workflow
1. Gather data
2. Feature extraction
3. Pick ML framework
4. Spin up AWS EC2 instance
5. Set up machine
6. Launch training job
7. Analyze results
New Model Building Workflow
1. Gather data
2. Feature extraction
3. Pick ML framework
4. Spin up AWS EC2 instance
5. Set up machine
6. Launch training job
7. Analyze results
Data science work
In the new workflow, only half of the work relates to the data scientist's specialty
New Model Building Workflow
1. Gather data
2. Feature extraction
3. Pick ML framework
4. Spin up AWS EC2 instance
5. Set up machine
6. Launch training job
7. Analyze results
Infrastructure work
Half of the work here is infrastructure work, which is a separate field of engineering
Writing code to spin up AWS EC2 instances is very different from the team's original goal of "ML models driving business decisions"
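To give a sense of what this infrastructure work looks like, here is a minimal sketch that builds an AWS CLI command for launching training machines. The AMI ID, instance type, and key name are placeholders, and the sketch only prints the command rather than calling AWS.

```python
import shlex


def run_instances_command(image_id, instance_type, count, key_name):
    """Build (but do not run) an `aws ec2 run-instances` command."""
    return [
        "aws", "ec2", "run-instances",
        "--image-id", image_id,
        "--instance-type", instance_type,
        "--count", str(count),
        "--key-name", key_name,
    ]


# All four values below are hypothetical placeholders.
cmd = run_instances_command("ami-EXAMPLE", "p3.2xlarge", 6, "my-key")
print(shlex.join(cmd))
```

None of this code touches the model itself, which is the point of the slide.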
When the Need for Infrastructure Scales Up
i.e., is it really a big deal that a data scientist is ssh'ing into one EC2 instance?
• Want to close your laptop without accidentally stopping your model training
• Large datasets / features / models
• Specialized hardware (GPUs)
• High degree of parallelism helps projects finish faster
• Large teams pool access to compute resources to save money
Who is responsible for spinning up and managing the data scientist's infrastructure?
Traditional Infrastructure Teams
• Who do we hire?
• What do we ask them to do?
• What does success look like?
Traditional Infrastructure Teams
Call out your answers!
• Who do we hire?
  • Systems experts
  • Backend engineers
  • People who love reliability and scalability!
• What do we ask them to do?
  • Reliability
  • Scalability
  • Performance
• What does success look like?
  • 99.99% uptime of API
  • 99.99% uptime of website
  • No data loss
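To make the uptime target concrete, the short calculation below converts an uptime percentage into the downtime it allows. The one-year and 30-day windows are assumptions chosen for illustration.

```python
def allowed_downtime_minutes(uptime_pct: float, period_days: int = 365) -> float:
    """Minutes of downtime permitted by an uptime target over a period."""
    total_minutes = period_days * 24 * 60
    return total_minutes * (1 - uptime_pct / 100)


per_year = allowed_downtime_minutes(99.99)        # about 52.6 minutes
per_month = allowed_downtime_minutes(99.99, 30)   # about 4.3 minutes
print(f"{per_year:.1f} min/year, {per_month:.1f} min/30 days")
```

A team held to under an hour of downtime per year has little slack for ad hoc requests from other teams.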
The data science team feels the pain, but the infrastructure team has pre-existing objectives
Machine Learning Infrastructure
Data science users / workloads + Infrastructure / devops tools = Machine learning infrastructure
Case Studies
Example: Hyperparameter Optimization
What is hyperparameter optimization?
• Every model has hyperparameters, aka configurations that you set before you train the model
• Different settings of hyperparameters produce different levels of model performance
• Hyperparameter optimization (HPO) is the search for the set of hyperparameters that produces the best model performance
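The search described above can be sketched with random search, one simple HPO strategy (not necessarily the one any particular tool uses). The function `train_and_score` is a hypothetical toy stand-in for training a model and measuring its performance, and the search ranges are assumptions.

```python
import random


def train_and_score(learning_rate: float, num_layers: int) -> float:
    """Toy stand-in for 'train a model, return its validation score'.

    Hypothetical objective: best at learning_rate=0.1, num_layers=3.
    """
    return -((learning_rate - 0.1) ** 2) - 0.01 * (num_layers - 3) ** 2


def random_search(n_trials: int, seed: int = 0):
    """Try n_trials random configurations and keep the best one."""
    rng = random.Random(seed)
    best_params, best_score = None, float("-inf")
    for _ in range(n_trials):
        params = {
            "learning_rate": 10 ** rng.uniform(-4, 0),  # log-uniform sample
            "num_layers": rng.randint(1, 8),
        }
        score = train_and_score(**params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score


best_params, best_score = random_search(n_trials=100)
print(best_params, best_score)
```

Each call to `train_and_score` is independent of the others, which is what makes this kind of search easy to spread across machines.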
Example: Hyperparameter Optimization
Example hyperparameters
• Random Forest (sklearn)
  • Number of trees in a forest
  • Maximum depth per tree
• Elastic Net (sklearn)
  • Regularization coefficient
  • Weight of the l1 norm term
• Deep Learning Models (MXNet, TensorFlow, PyTorch)
  • Learning rate
  • Number of hidden layers
Example: Hyperparameter Optimization
Parallelism reduces wall clock time
• One machine: 100 configurations of hyperparameters x 1 hour of training time = 100 hours ≈ 4 days
  • Start job Monday at noon, check results Friday at noon
  • On the order of one week
• Six machines: 100 configurations of hyperparameters / 6 machines x 1 hour of training time ≈ 17 hours
  • Start job Monday at noon, check results Tuesday morning
  • On the order of one day
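The arithmetic on this slide can be checked with a few lines. This sketch assumes every configuration takes the same time to train and that configurations can be evaluated independently, one per machine at a time.

```python
import math


def wall_clock_hours(n_configs: int, hours_per_config: float, n_machines: int = 1) -> float:
    """Wall clock time when each machine trains one configuration at a time."""
    rounds = math.ceil(n_configs / n_machines)  # the last round may be partly idle
    return rounds * hours_per_config


serial = wall_clock_hours(100, 1)                   # 100 hours
parallel = wall_clock_hours(100, 1, n_machines=6)   # 17 hours
print(f"serial: {serial} h ({serial / 24:.1f} days), parallel: {parallel} h")
```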
Case Study: Data Scientists Build Incrementally
• In 2017, every new machine learning project at SigOpt produced a new machine learning infrastructure tool
• Code to launch HPO projects was never the primary focus of the project
• Case studies here cover common architecture choices seen among at least four tools
Data Scientist: Set Up Machines
Problem: Set up code and dependencies on each remote machine
Solution: Use scp to send data, code, and a setup script from the local environment to every remote machine
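A minimal sketch of this approach, assuming hypothetical host addresses, file names, a remote user named `ubuntu`, and a remote directory `~/job`. It only builds and prints the scp commands; the `subprocess.run` call is left commented out.

```python
def scp_commands(hosts, local_paths, remote_dir, user="ubuntu"):
    """One `scp -r` command per remote host, copying every local path."""
    return [
        ["scp", "-r", *local_paths, f"{user}@{host}:{remote_dir}"]
        for host in hosts
    ]


hosts = ["10.0.0.1", "10.0.0.2"]  # hypothetical remote machines
commands = scp_commands(hosts, ["data/", "train.py", "setup.sh"], "~/job")
for cmd in commands:
    print(" ".join(cmd))
    # import subprocess; subprocess.run(cmd, check=True)  # uncomment to copy for real
```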
Data Scientist: Launch Job
Problem: Start training ML model on each remote instance
Solution: Use ssh to run commands on remote instances
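As a sketch of the ssh approach, the code below builds one ssh command per machine. The host addresses, the `ubuntu` user, and the script names `setup.sh` and `train.py` are hypothetical.

```python
def ssh_command(host, remote_command, user="ubuntu"):
    """Command that runs `remote_command` on `host` over ssh."""
    return ["ssh", f"{user}@{host}", remote_command]


def launch_commands(hosts, remote_dir="~/job"):
    """One launch command per host: install dependencies, then train."""
    remote = f"cd {remote_dir} && bash setup.sh && python train.py"
    return [ssh_command(host, remote) for host in hosts]


cmds = launch_commands(["10.0.0.1", "10.0.0.2"])  # hypothetical hosts
for cmd in cmds:
    print(cmd)
```

Run this way, the training process is tied to the ssh session, so a dropped connection can stop the job.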
Data Scientist: View Progress and Debug
Problem: View the status of a job at a glance
Solution: Rely on third-party APIs to track metadata, run ML training processes in tmux so logs can be viewed later
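The tmux part of this solution might look like the sketch below: start training in a detached session, then print that session's pane contents later. The session name `hpo`, the host, and the training command are assumptions.

```python
import shlex


def tmux_start(session, command):
    """Remote shell command that starts `command` in a detached tmux session."""
    return f"tmux new-session -d -s {shlex.quote(session)} {shlex.quote(command)}"


def tmux_logs(session):
    """Remote shell command that prints the visible contents of the session's pane."""
    return f"tmux capture-pane -p -t {shlex.quote(session)}"


def over_ssh(host, remote_command, user="ubuntu"):
    """Wrap a remote shell command in an ssh invocation."""
    return ["ssh", f"{user}@{host}", remote_command]


start = over_ssh("10.0.0.1", tmux_start("hpo", "python train.py"))
check = over_ssh("10.0.0.1", tmux_logs("hpo"))
print(start)
print(check)
```

Because the session is detached, the training process keeps running after the ssh connection closes, but the logs still live on each machine separately.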
Data Science Solution: Pros and Cons
Pros
• Simple design
• Data scientist has full understanding of their tool
• Data scientist has full control over their tool
• No external dependencies to build features or fix bugs
Cons
• Few debugging tools
• Decentralized logs
• Not scalable
• Closing laptop during long-running commands loses progress
• Difficult to set organization-level standards
"Creating shared services also creates dependencies and can impinge on autonomy"
- Marty Cagan, Inspired
Case Study: Infrastructure Engineer Overhaul
• Infrastructure engineers started a dedicated effort to build tools for launching HPO jobs in 2018
• Viewed as an overhaul of previous infrastructure management tools
• Resulting product was SigOpt Orchestrate
Infrastructure Engineer: Set Up Machines
Problem: Set up code and dependencies on each remote machine
Solution: Use Docker to containerize the model development environment
[Diagram: Registry]
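One way to picture this step is a build-and-push to a container registry, after which any machine can pull the same environment. The registry host, image name, and tag below are hypothetical, and this sketch is not taken from SigOpt Orchestrate; it only builds the docker commands.

```python
def docker_commands(registry, image, tag, context="."):
    """Return the image reference plus the build and push commands for it."""
    ref = f"{registry}/{image}:{tag}"
    return ref, [
        ["docker", "build", "-t", ref, context],  # package code and dependencies
        ["docker", "push", ref],                  # publish to the registry
    ]


ref, cmds = docker_commands("registry.example.com", "hpo-model", "v1")
for cmd in cmds:
    print(" ".join(cmd))
```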
Infrastructure Engineer: Launch Job
Problem: Start training ML model on each remote instance
Solution: Use Kubernetes to provide a uniform interface to the cluster
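As an illustration of that uniform interface, the sketch below generates a Kubernetes Job manifest that runs 100 training pods, six at a time. The job name, image reference, and `python train.py` command are hypothetical, and this is not necessarily how SigOpt Orchestrate structures its jobs.

```python
import json


def hpo_job_manifest(name, image, n_configs, n_machines):
    """Kubernetes batch/v1 Job: n_configs pod runs, at most n_machines at once."""
    return {
        "apiVersion": "batch/v1",
        "kind": "Job",
        "metadata": {"name": name},
        "spec": {
            "completions": n_configs,
            "parallelism": n_machines,
            "template": {
                "spec": {
                    "containers": [
                        {"name": "train", "image": image, "command": ["python", "train.py"]}
                    ],
                    "restartPolicy": "Never",
                }
            },
        },
    }


manifest = hpo_job_manifest("hpo-demo", "registry.example.com/hpo-model:v1", 100, 6)
print(json.dumps(manifest, indent=2))
```

The JSON output could be saved to a file and submitted with `kubectl apply -f`, the same way regardless of which machines make up the cluster.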
Infrastructure Engineer: View Progress and Debug
Problem: View the status of a job at a glance
Solution: Build a command line interface (CLI) that abstracts away infrastructure tools
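A CLI of this kind can be a thin layer that translates user-friendly subcommands into the underlying tool's commands. The program name `mltool` and its `status` and `logs` subcommands are hypothetical and are not the actual SigOpt Orchestrate interface; the sketch maps them to kubectl commands without running them.

```python
import argparse


def build_parser():
    """Parser for a hypothetical `mltool` CLI with two subcommands."""
    parser = argparse.ArgumentParser(prog="mltool")
    sub = parser.add_subparsers(dest="action", required=True)
    for action in ("status", "logs"):
        p = sub.add_parser(action)
        p.add_argument("job", help="name of the training job")
    return parser


def to_kubectl(args):
    """Translate a parsed CLI call into the kubectl command it hides."""
    if args.action == "status":
        return ["kubectl", "get", "job", args.job]
    return ["kubectl", "logs", f"job/{args.job}"]


args = build_parser().parse_args(["status", "hpo-demo"])
print(to_kubectl(args))
```

The user types a job name and never sees Kubernetes, which is convenient until something underneath breaks.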
Infrastructure Engineer Solution: Pros and Cons
Pros
• Pre-existing APIs lead to rapid feature development
• Debugging tools
• Highly scalable
• User can close laptop and job still runs
• Easy to install
Cons
• Data scientist may not understand underlying technologies (Docker and Kubernetes)
• External dependency on infrastructure team to build new features and fix bugs
• Difficult to onboard
Looking Forwards
Machine Learning Infrastructure requires a tight user feedback loop
ML Infrastructure Within Large Companies
• Google's Borg
• Uber's Michelangelo
• Airbnb's Bighead
• Lyft's ML Platform
Open Source ML Infrastructure Projects
• Polyaxon
• Kubeflow
• MLflow
Further Reading
• Paper: Orchestrate: Infrastructure for Enabling Parallelism during Hyperparameter Optimization
  https://arxiv.org/abs/1812.07751
• Blog Post: Machine Learning Infrastructure Tools for Hyperparameter Optimization
  https://sigopt.com/blog/machine-learning-infrastructure-tools-for-hyperparameter-optimization/
• Talk: Reducing Operational Barriers to Model Training
  https://mlconf.com/sessions/reducing-operational-barriers-to-model-training/
Takeaways
• Data scientists built tools that were brittle, but allowed for great freedom
• Infrastructure engineers built tools that suffered usability issues
• Successful teams will have a tight feedback loop between infrastructure engineers and data science users
I Want to Learn From You!
I'm around Ann Arbor until about 5pm tomorrow!
I'd love to stop by your office and learn about your work in data science / ML
Email alexandra@sigopt.com or talk to me right after this to set up a time
Thank you!
Any questions?
Alexandra Johnson
alexandra@sigopt.com @alexandraj777
