What are the Unique Challenges and
Opportunities in Systems for ML?
Matei Zaharia
AI Researcher 😀 to Systems Researcher 🙂: AI is going to change all of computing!
AI Researcher: It’s intelligent and you don’t need to program anymore and you just differentiate things...
AI Researcher: How does it affect your research field?
Systems Researcher: Umm, I figured out a way to shave off some system calls!
AI Researcher, to a Networking Researcher: How does it affect your research field?
Networking Researcher: I came up with a new congestion control scheme.
AI Researcher: 😐
Motivation
ML workloads can certainly influence a lot of systems,
but what are the unique research challenges they raise?
Turns out there are a lot! ML is very different from
traditional software, and we should look at how those differences change the systems we build.
My Perspective
Stanford DAWN: research lab focused on infrastructure for
usable machine learning
Databricks: data & ML platform for 2000+ orgs
How Does ML Differ from Traditional Software?
Traditional Software
Goal: meet a functional
specification
Quality depends only on
application code
Mostly deterministic
Machine Learning
Goal: optimize a metric
(e.g. accuracy)
Quality depends on input data
and tuning parameters
Stochastic
Some Interesting Opportunities
ML Platforms: software for managing and productionizing ML
Data-oriented model training, QA and debugging tools
Optimizations leveraging the stochastic nature of ML
ML-Aware System Optimization:
NoScope & BlazeIt
The ML Inference Bottleneck
Inference cost is often 100x higher than training
overall, and greatly limits deployments
Example: processing 1 video
stream in real time with CNNs
requires a $1000 GPU
Inference Optimization in NoScope
Idea: optimize execution of ML models for a specific
application or query
• Model specialization: train a small DNN to recognize the
specific class in the dataset (e.g. “buses in street video”)
• Query optimization: tune a cascade of specialized and
target models to achieve a target accuracy (see the sketch below)
[Diagram: a user query over a dataset is served by a specialized model that falls back to the target model]
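To make the cascade idea concrete, here is a minimal Python sketch under assumed names (specialized_model, full_model, and the LOW/HIGH thresholds are illustrative, not NoScope’s actual API); the thresholds are what the query optimizer tunes to hit the target accuracy:

LOW, HIGH = 0.1, 0.9   # illustrative per-query thresholds chosen offline

def cascade_predict(frame, specialized_model, full_model):
    # Cheap, query-specific score, e.g. P("bus appears in this frame")
    score = specialized_model(frame)
    if score >= HIGH:
        return True              # confident positive: skip the expensive model
    if score <= LOW:
        return False             # confident negative: skip the expensive model
    return full_model(frame)     # uncertain region: fall back to the target model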
NoScope Results
VLDB ‘17, github.com/stanford-futuredata/noscope
Optimizing ML + SQL in BlazeIt
[Kang et al, CIDR 2019]
[Diagram: a SQL query over frames from video is compiled into a query plan that uses specialized DNNs in place of the full object detection DNN (ResNet-50)]
BlazeIt Optimizations
Aggregation queries: accelerate approximate queries by using
the specialized model’s output as a control variate for sampling.
E.g.: find the average # of cars per frame.
Limit queries: use specialized models to sort frames by likelihood
of matching the query, then run the full model.
E.g.: SELECT * FROM frames WHERE #(red buses) > 3 LIMIT 5
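A rough sketch of the control-variate trick for the aggregation example, assuming the specialized model is cheap enough to run on every frame while the full model runs only on a sample (function and variable names here are illustrative, not BlazeIt’s API):

import numpy as np

def estimate_avg_cars(frames, specialized_count, full_count, sample_size=1000):
    # Cheap proxy on ALL frames, so its exact mean is known.
    proxy = np.array([specialized_count(f) for f in frames])
    proxy_mean = proxy.mean()

    # Expensive full model only on a random sample of frames.
    idx = np.random.choice(len(frames), sample_size, replace=False)
    full = np.array([full_count(frames[i]) for i in idx])
    proxy_s = proxy[idx]

    # Control-variate estimator: correct the sampled mean of the full model
    # by how far the proxy's sample mean drifted from its known true mean.
    cov = np.cov(full, proxy_s)
    c = cov[0, 1] / cov[1, 1]
    return full.mean() - c * (proxy_s.mean() - proxy_mean)

The more correlated the specialized counts are with the full model’s counts, the larger the variance reduction, so fewer expensive frames are needed for a given error bound.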
BlazeIt Results
[Results charts: aggregation queries and limit queries]
Quality Assurance for ML with
Model Assertions
Motivation
ML applications fail in complex, hard-to-debug ways
• Tesla cars crashing into lane dividers
• Gender classification that is incorrect depending on race
How can we test and improve quality of ML apps?
Model Assertions
Predicates on input/output of an ML application
(similar to software assertions)
[Kang, Raghavan et al, NeurIPS MLSys 2018]
Example: assert(cars should not flicker in and out) over consecutive video frames
[Diagram: a failing flicker assertion across Frames 1–3 feeds both improved training (data selection & weak supervision) and runtime monitoring]
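A toy encoding of that flicker assertion as a predicate over per-frame detections (this is my own illustrative formulation, not the paper’s API):

def flicker_failures(detections):
    # detections[t] is the set of tracked object IDs present in frame t.
    # An object present in frames t-1 and t+1 but missing in frame t
    # "flickers", which violates the assertion.
    failures = []
    for t in range(1, len(detections) - 1):
        flickered = (detections[t - 1] & detections[t + 1]) - detections[t]
        for obj in flickered:
            failures.append((t, obj))   # feed into monitoring or retraining
    return failures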
Example Assertions
Video analytics: objects should not flicker in and out across frames
Autonomous vehicles: LIDAR and video object detectors should agree
Heart rhythm classification: output class should not change frequently
Using Model Assertions
Inference time
» Runtime monitoring
» Corrective action
Training time
» Active learning
» Weak supervision via
correction rules
Active Learning with Assertions:
Can assertions help select data to label & train on?
Key idea: new active learning algorithm samples data that
is most likely to reduce # failing assertions
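As a rough sketch (an illustrative greedy scoring rule, not the paper’s exact algorithm), the selector can rank unlabeled inputs by how many assertions their predictions violate and send the worst offenders to labelers:

def select_for_labeling(unlabeled, model, assertions, budget=2000):
    scored = []
    for x in unlabeled:
        y = model.predict(x)
        # Count how many assertions this prediction violates.
        violations = sum(1 for a in assertions if not a(x, y))
        scored.append((violations, x))
    # Label the inputs that trip the most assertions first.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [x for _, x in scored[:budget]]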
Using assertions for active learning improves model quality.
[Chart: mAP for each selection method after labeling 2000 new examples]
Weak Supervision with Assertions:
Can assertions improve quality without human labeling?
Key idea: a consistency constraints API lets developers specify which
attributes should stay constant across outputs in a dataset
E.g. “each tracked object should always have same class”,
“each person should have consistent detected gender”
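A toy sketch of the first correction rule (“each tracked object should always have the same class”), assuming predictions are already grouped by track ID (the names here are illustrative, not the paper’s API):

from collections import Counter

def enforce_consistent_class(track_predictions):
    # track_predictions: {track_id: [predicted class for each frame in the track]}
    # Replace every prediction in a track with the track's majority class,
    # yielding corrected "weak" labels for retraining with no human labeling.
    weak_labels = {}
    for track_id, classes in track_predictions.items():
        majority = Counter(classes).most_common(1)[0][0]
        weak_labels[track_id] = [majority] * len(classes)
    return weak_labels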
Model Quality After Retraining
Task: Pretrained → Weakly Supervised
AV perception (mAP): 10.6 → 14.1 (+33%)
Object detection (mAP): 34.4 → 49.9 (+45%)
ECG (% accuracy): 70.7 → 72.1 (+2%)
[Images: detections from the original SSD model vs. the retrained SSD model]
[Kang, Raghavan et al, NeurIPS MLSys 2018]
ML Platforms: Programming and
Deployment Systems for ML
ML at Industrial Scale
Today, ML development is ad-hoc:
• Hard to track experiments & metrics: users do it best-effort
• Hard to reproduce results: won’t happen by default
• Hard to share & deploy models: different dev & deploy stacks
Each app takes months to build, and then needs to be
continuously maintained!
ML Platforms
A new class of systems to manage the ML lifecycle
Pioneered by company-specific platforms: Facebook
FBLearner, Uber Michelangelo, Google TFX, etc
+ Standardize the data prep / training / deploy cycle:
if you work with the platform, you get these steps out of the box
– Limited to a few algorithms or frameworks
– Tied to one company’s infrastructure
MLflow from Databricks
Open source, open-interface ML platform (mlflow.org)
• Works with any existing ML library and deployment service
[Diagram: MLflow’s three components: Reproducible Projects (a project spec with dependencies and parameters around your_code.py, which calls log_param(“alpha”, 0.5), log_metric(“rmse”, 0.2), log_model(my_model)), Experiment Tracking (a tracking server exposing a UI and REST API), and Deployment Targets (inference code, bulk scoring, cloud serving tools)]
MLflow Projects: Reproducible Runs

my_project/
├── MLproject
├── conda.yaml
├── main.py
└── model.py

MLproject file:
conda_env: conda.yaml
entry_points:
main:
parameters:
training_data: path
lr: {type: float, default: 0.1}
command: python main.py {training_data} {lr}
$ mlflow run git://<my_project>
mlflow.run("git://<my_project>", ...)
Simple packaging format for code + dependencies
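For instance, the parameters declared above can be overridden when launching a run; the data path and learning rate below are made-up example values:

$ mlflow run git://<my_project> -P training_data=data/train.csv -P lr=0.01

import mlflow
mlflow.run("git://<my_project>", parameters={"training_data": "data/train.csv", "lr": 0.01})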
Composing Projects
# Conceptual composition of runs, as on the slide: branch the workflow on an earlier run's result.
r1 = mlflow.run("ProjectA", params)
if r1 > 0:
    r2 = mlflow.run("ProjectB", ...)
else:
    r2 = mlflow.run("ProjectC", ...)
r3 = mlflow.run("ProjectD", r2)
MLflow Tracking: Logging for ML
[Diagram: notebooks, local apps, and cloud jobs all log to a tracking server that exposes a UI and a REST API]

mlflow.log_param("alpha", 0.5)
mlflow.log_metric("accuracy", 0.9)
...
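Put together, a minimal end-to-end logging script looks roughly like this (the metric value and artifact path are illustrative; by default runs land in a local ./mlruns directory unless a tracking server is configured):

import mlflow

# Optional: log to a remote tracking server instead of local ./mlruns
# mlflow.set_tracking_uri("http://my-tracking-server:5000")

with mlflow.start_run():
    mlflow.log_param("alpha", 0.5)
    mlflow.log_metric("accuracy", 0.9)        # illustrative value
    mlflow.log_artifact("model_summary.txt")  # any local file or directory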
Tracking UI: Inspecting Runs
MLflow Models: Packaging Models
Packages arbitrary code (not just model weights)
[Diagram: a model packaging format whose “flavors” (e.g. Python flavor, ONNX flavor) wrap the model logic and are consumed by batch inference, REST serving, and testing & debug tools such as LIME and TCAV]
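A small sketch of the flavor idea, assuming scikit-learn is installed: the model is logged with the sklearn flavor and reloaded through the generic python_function flavor, which any downstream serving tool can use:

import mlflow
import mlflow.sklearn
import mlflow.pyfunc
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=200).fit(X, y)

with mlflow.start_run() as run:
    # Saves the model with both the sklearn and python_function flavors.
    mlflow.sklearn.log_model(model, "model")

# Any tool that understands the python_function flavor can load and score it.
loaded = mlflow.pyfunc.load_model(f"runs:/{run.info.run_id}/model")
print(loaded.predict(X[:5]))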
MLflow Community Growth
140 contributors from >50 companies since June 2018
850K downloads/month
Major external contributions:
• Docker & Kubernetes execution
• R API
• Integrations with PyTorch, H2O, HDFS, GCS, …
• Plugin system
Other ML-Specific Research Opportunities
Data validation and monitoring (e.g. TFX Data Validation)
Supervision-oriented systems (e.g. Snorkel, Overton)
Leveraging the numeric nature of ML for optimization,
security, etc (e.g. TASO, HogWild, SSP, federated ML)
Conclusion
Many systems problems specific to ML are not
heavily studied in research
• App lifecycle, data quality & monitoring, model QA, etc
These are also major problems in practice!
Follow DAWN’s research at dawn.cs.stanford.edu