What are the Unique Challenges and
Opportunities in Systems for ML?
Matei Zaharia
AI Researcher 😀 to Systems Researcher 🙂: AI is going to change all of computing!
AI Researcher: It’s intelligent and you don’t need to program anymore and you just differentiate things...
AI Researcher: How does it affect your research field?
Systems Researcher: Umm, I figured out a way to shave off some system calls!
AI Researcher, to a Networking Researcher: How does it affect your research field?
Networking Researcher: I came up with a new congestion control scheme.
AI Researcher: 😐
Motivation
ML workloads can certainly influence a lot of systems,
but what are the unique research challenges they raise?
Turns out there are a lot! ML is very different from
traditional software, and we should look at how those differences change the systems we build.
My Perspective
Stanford DAWN: research lab focused on infrastructure for
usable machine learning
Databricks: data & ML platform for 2000+ orgs
How Does ML Differ from Traditional Software?
Traditional Software
Goal: meet a functional
specification
Quality depends only on
application code
Mostly deterministic
Machine Learning
Goal: optimize a metric
(e.g. accuracy)
Quality depends on input data
and tuning parameters
Stochastic
Some Interesting Opportunities
ML Platforms: software for managing and productionizing ML
Data-oriented model training, QA and debugging tools
Optimizations leveraging the stochastic nature of ML
ML-Aware System Optimization:
NoScope & BlazeIt
The ML Inference Bottleneck
Inference cost is often 100x higher than training
overall, and greatly limits deployments
Example: processing 1 video
stream in real time with CNNs
requires a $1000 GPU
Inference Optimization in NoScope
Idea: optimize execution of ML models for a specific
application or query
• Model specialization: train a small DNN to recognize the
specific class in the dataset (e.g. “buses in street video”)
• Query optimization: tune a cascade of specialized and
target models to achieve a target accuracy (see the sketch below)
[Diagram: a user query over a dataset is served by a specialized model that falls back to the target model]
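To make the cascade idea concrete, here is a minimal Python sketch under assumed names (specialized_model, full_model, and the LOW/HIGH thresholds are illustrative, not NoScope’s actual API); the thresholds are what the query optimizer tunes to hit the target accuracy:

LOW, HIGH = 0.1, 0.9   # illustrative per-query thresholds chosen offline

def cascade_predict(frame, specialized_model, full_model):
    # Cheap, query-specific score, e.g. P("bus appears in this frame")
    score = specialized_model(frame)
    if score >= HIGH:
        return True              # confident positive: skip the expensive model
    if score <= LOW:
        return False             # confident negative: skip the expensive model
    return full_model(frame)     # uncertain region: fall back to the target model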
NoScope Results
VLDB ‘17, github.com/stanford-futuredata/noscope
Optimizing ML + SQL in BlazeIt
[Kang et al, CIDR 2019]
[Diagram: a SQL query over frames from video is compiled into a query plan that uses specialized DNNs in place of the full object detection DNN (ResNet-50)]
BlazeIt Optimizations
Aggregation queries: accelerate approximate queries by using
the specialized model’s output as a control variate for sampling.
E.g.: find the average # of cars per frame.
Limit queries: use specialized models to sort frames by likelihood
of matching the query, then run the full model.
E.g.: SELECT * FROM frames WHERE #(red buses) > 3 LIMIT 5
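A rough sketch of the control-variate trick for the aggregation example, assuming the specialized model is cheap enough to run on every frame while the full model runs only on a sample (function and variable names here are illustrative, not BlazeIt’s API):

import numpy as np

def estimate_avg_cars(frames, specialized_count, full_count, sample_size=1000):
    # Cheap proxy on ALL frames, so its exact mean is known.
    proxy = np.array([specialized_count(f) for f in frames])
    proxy_mean = proxy.mean()

    # Expensive full model only on a random sample of frames.
    idx = np.random.choice(len(frames), sample_size, replace=False)
    full = np.array([full_count(frames[i]) for i in idx])
    proxy_s = proxy[idx]

    # Control-variate estimator: correct the sampled mean of the full model
    # by how far the proxy's sample mean drifted from its known true mean.
    cov = np.cov(full, proxy_s)
    c = cov[0, 1] / cov[1, 1]
    return full.mean() - c * (proxy_s.mean() - proxy_mean)

The more correlated the specialized counts are with the full model’s counts, the larger the variance reduction, so fewer expensive frames are needed for a given error bound.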
BlazeIt Results
[Results charts: aggregation queries and limit queries]
Quality Assurance for ML with
Model Assertions
Motivation
ML applications fail in complex, hard-to-debug ways
• Tesla cars crashing into lane dividers
• Gender classification that is incorrect depending on race
How can we test and improve quality of ML apps?
Model Assertions
Predicates on input/output of an ML application
(similar to software assertions)
[Kang, Raghavan et al, NeurIPS MLSys 2018]
Example: assert(cars should not flicker in and out) over consecutive video frames
[Diagram: a failing flicker assertion across Frames 1–3 feeds both improved training (data selection & weak supervision) and runtime monitoring]
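A toy encoding of that flicker assertion as a predicate over per-frame detections (this is my own illustrative formulation, not the paper’s API):

def flicker_failures(detections):
    # detections[t] is the set of tracked object IDs present in frame t.
    # An object present in frames t-1 and t+1 but missing in frame t
    # "flickers", which violates the assertion.
    failures = []
    for t in range(1, len(detections) - 1):
        flickered = (detections[t - 1] & detections[t + 1]) - detections[t]
        for obj in flickered:
            failures.append((t, obj))   # feed into monitoring or retraining
    return failures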
Example Assertions
Video analytics: objects should not flicker in and out across frames
Autonomous vehicles: LIDAR and video object detectors should agree
Heart rhythm classification: output class should not change frequently
Using Model Assertions
Inference time
» Runtime monitoring
» Corrective action
Training time
» Active learning
» Weak supervision via
correction rules
Active Learning with Assertions:
Can assertions help select data to label & train on?
Key idea: new active learning algorithm samples data that
is most likely to reduce # failing assertions
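As a rough sketch (an illustrative greedy scoring rule, not the paper’s exact algorithm), the selector can rank unlabeled inputs by how many assertions their predictions violate and send the worst offenders to labelers:

def select_for_labeling(unlabeled, model, assertions, budget=2000):
    scored = []
    for x in unlabeled:
        y = model.predict(x)
        # Count how many assertions this prediction violates.
        violations = sum(1 for a in assertions if not a(x, y))
        scored.append((violations, x))
    # Label the inputs that trip the most assertions first.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [x for _, x in scored[:budget]]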
Using assertions for active learning improves model quality.
[Chart: mAP for each selection method after labeling 2000 new examples]
Weak Supervision with Assertions:
Can assertions improve quality without human labeling?
Key idea: a consistency constraints API lets developers specify which
attributes should stay constant across outputs in a dataset
E.g. “each tracked object should always have same class”,
“each person should have consistent detected gender”
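A toy sketch of the first correction rule (“each tracked object should always have the same class”), assuming predictions are already grouped by track ID (the names here are illustrative, not the paper’s API):

from collections import Counter

def enforce_consistent_class(track_predictions):
    # track_predictions: {track_id: [predicted class for each frame in the track]}
    # Replace every prediction in a track with the track's majority class,
    # yielding corrected "weak" labels for retraining with no human labeling.
    weak_labels = {}
    for track_id, classes in track_predictions.items():
        majority = Counter(classes).most_common(1)[0][0]
        weak_labels[track_id] = [majority] * len(classes)
    return weak_labels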
Model Quality After Retraining
Task: Pretrained → Weakly Supervised
AV perception (mAP): 10.6 → 14.1 (+33%)
Object detection (mAP): 34.4 → 49.9 (+45%)
ECG (% accuracy): 70.7 → 72.1 (+2%)
[Images: detections from the original SSD model vs. the retrained SSD model]
[Kang, Raghavan et al, NeurIPS MLSys 2018]
ML Platforms: Programming and
Deployment Systems for ML
ML at Industrial Scale
Today, ML development is ad-hoc:
• Hard to track experiments & metrics: users do it best-effort
• Hard to reproduce results: won’t happen by default
• Hard to share & deploy models: different dev & deploy stacks
Each app takes months to build, and then needs to be
continuously maintained!
ML Platforms
A new class of systems to manage the ML lifecycle
Pioneered by company-specific platforms: Facebook
FBLearner, Uber Michelangelo, Google TFX, etc
+ Standardize the data prep / training / deploy cycle:
if you work with the platform, you get these steps out of the box
– Limited to a few algorithms or frameworks
– Tied to one company’s infrastructure
MLflow from Databricks
Open source, open-interface ML platform (mlflow.org)
• Works with any existing ML library and deployment service
[Diagram: MLflow’s three components: Reproducible Projects (a project spec with dependencies and parameters around your_code.py, which calls log_param(“alpha”, 0.5), log_metric(“rmse”, 0.2), log_model(my_model)), Experiment Tracking (a tracking server exposing a UI and REST API), and Deployment Targets (inference code, bulk scoring, cloud serving tools)]
MLflow Projects: Reproducible Runs

my_project/
├── MLproject
├── conda.yaml
├── main.py
└── model.py

MLproject file:
conda_env: conda.yaml
entry_points:
main:
parameters:
training_data: path
lr: {type: float, default: 0.1}
command: python main.py {training_data} {lr}
$ mlflow run git://<my_project>
mlflow.run("git://<my_project>", ...)
Simple packaging format for code + dependencies
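For instance, the parameters declared above can be overridden when launching a run; the data path and learning rate below are made-up example values:

$ mlflow run git://<my_project> -P training_data=data/train.csv -P lr=0.01

import mlflow
mlflow.run("git://<my_project>", parameters={"training_data": "data/train.csv", "lr": 0.01})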
Composing Projects
# Conceptual composition of runs, as on the slide: branch the workflow on an earlier run's result.
r1 = mlflow.run("ProjectA", params)
if r1 > 0:
    r2 = mlflow.run("ProjectB", ...)
else:
    r2 = mlflow.run("ProjectC", ...)
r3 = mlflow.run("ProjectD", r2)
MLflow Tracking: Logging for ML
[Diagram: notebooks, local apps, and cloud jobs all log to a tracking server that exposes a UI and a REST API]

mlflow.log_param("alpha", 0.5)
mlflow.log_metric("accuracy", 0.9)
...
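Put together, a minimal end-to-end logging script looks roughly like this (the metric value and artifact path are illustrative; by default runs land in a local ./mlruns directory unless a tracking server is configured):

import mlflow

# Optional: log to a remote tracking server instead of local ./mlruns
# mlflow.set_tracking_uri("http://my-tracking-server:5000")

with mlflow.start_run():
    mlflow.log_param("alpha", 0.5)
    mlflow.log_metric("accuracy", 0.9)        # illustrative value
    mlflow.log_artifact("model_summary.txt")  # any local file or directory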
Tracking UI: Inspecting Runs
MLflow Models: Packaging Models
Packages arbitrary code (not just model weights)
[Diagram: a model packaging format whose “flavors” (e.g. Python flavor, ONNX flavor) wrap the model logic and are consumed by batch inference, REST serving, and testing & debug tools such as LIME and TCAV]
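A small sketch of the flavor idea, assuming scikit-learn is installed: the model is logged with the sklearn flavor and reloaded through the generic python_function flavor, which any downstream serving tool can use:

import mlflow
import mlflow.sklearn
import mlflow.pyfunc
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=200).fit(X, y)

with mlflow.start_run() as run:
    # Saves the model with both the sklearn and python_function flavors.
    mlflow.sklearn.log_model(model, "model")

# Any tool that understands the python_function flavor can load and score it.
loaded = mlflow.pyfunc.load_model(f"runs:/{run.info.run_id}/model")
print(loaded.predict(X[:5]))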
MLflow Community Growth
140 contributors from >50 companies since June 2018
850K downloads/month
Major external contributions:
• Docker & Kubernetes execution
• R API
• Integrations with PyTorch, H2O, HDFS, GCS, …
• Plugin system
Other ML-Specific Research Opportunities
Data validation and monitoring (e.g. TFX Data Validation)
Supervision-oriented systems (e.g. Snorkel, Overton)
Leveraging the numeric nature of ML for optimization,
security, etc (e.g. TASO, HogWild, SSP, federated ML)
Conclusion
Many systems problems specific to ML are not
heavily studied in research
• App lifecycle, data quality & monitoring, model QA, etc
These are also major problems in practice!
Follow DAWN’s research at dawn.cs.stanford.edu