Democratizing AI Across Clouds:
Low-Cost, Easy-to-Deploy
Machine Learning
Deep learning defines the future
Healthcare
Logistics
Banking
● 76% of enterprises prioritize AI and machine learning in their 2021 IT budgets
● 70% of these companies are small-to-medium-sized businesses that rely on public clouds to run AI jobs
● They typically focus on business logic, not infrastructure management (e.g., which resources to use and how expensive they are)
https://www.forbes.com/sites/louiscolumbus/2021/01/17/76-of-enterprises-prioritize-ai--machine-learning-in-2021-it-budgets/?sh=378288e5618a
Problem 1: Extremely High Costs
Newer models = higher accuracy (competitive advantage) and higher costs
e.g., GPT-2 ($43K) => GPT-3 ($12M)
70% of AI businesses use the cloud, and 40% are concerned about the expense
6- to 8-digit dollar amounts per year spent on cloud infrastructure for ML tasks
AI Costs are massive…
…and they are only getting worse
A Short-term Goal: Cheap and Reliable AI for All
● Goals
○ Lower costs
○ Zero developer effort, compatible with existing
jobs/pipelines
○ Guaranteed accuracy and performance SLAs
● Affordable variants of resources available
● Certain GPUs better for certain models
● Multi-tenant smart networking
● Serverless threads
● Spot instances
Our Product: ML Platform as a Service
[Figure: platform architecture diagram. Labels: Popular Platforms, AIOps Tools, Spot, Demand, Breeze Runtime, Schedule, Breeze Virtual Cloud, Framework Interface, Scheduler, Lower Tier]
Checkpointing Cannot Handle Many Failures
Takeaway: Less than 50%
of time spent doing useful
work (blue)
Blue: training progressing
Orange: cluster made
progress but was wasted
Red: cluster restarting
from checkpoint
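The three colors in the legend can be reproduced with a toy model of checkpoint-and-restart. The sketch below is only an illustration under stated assumptions: checkpoints are free, they are taken at a fixed interval counted from each resume point, and every failure costs a fixed restart time. The function name and all numbers are hypothetical and are not the measurements behind the slide.

```python
def checkpoint_timeline(failure_times, total_time, checkpoint_interval, restart_time):
    """Split wall-clock time into (useful, wasted, restart) time.

    Assumptions (hypothetical model, not the slide's data):
    - checkpoints cost nothing and happen every checkpoint_interval
      units of training, counted from the last resume point;
    - a failure discards all progress since the last checkpoint;
    - each failure is followed by restart_time units of reloading.
    """
    useful = wasted = restart = 0.0
    now = 0.0
    for fail in sorted(t for t in failure_times if t < total_time):
        if fail < now:
            continue  # failure during a restart: ignored in this sketch
        run = fail - now
        # Only whole checkpoint intervals survive the failure.
        saved = (run // checkpoint_interval) * checkpoint_interval
        useful += saved          # "blue": training progressing
        wasted += run - saved    # "orange": progress made but lost
        down = min(restart_time, total_time - fail)
        restart += down          # "red": restarting from checkpoint
        now = fail + down
    useful += total_time - now   # the final stretch runs to completion
    return useful, wasted, restart
```

For example, `checkpoint_timeline([25, 60], 100, 10, 5)` returns `(85.0, 5.0, 10.0)`. Shorter gaps between failures and longer restarts shrink the useful share.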
How Can We Provide Fast Recovery And
Accuracy?
Introduce redundancy to the pipeline
● Feasible because we are using discounted resources
● Slightly over-provision to maintain performance
Can we do it more intelligently?
● Duplicate each layer in every pipeline so that at least two copies of its weights always exist within the system
● Replicate it on the previous node to exploit data locality
Redundancy Provides Resilience
Redundant stages provide redundancy more quickly than checkpointing ✅
High performance and memory overheads if done naively! ❌
Pipeline has Bubbles
Each mini-batch split up
into micro-batches
Accumulate micro-batch
gradients to get full batch
gradients
Bubble
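The micro-batch accumulation described on this slide can be checked on a tiny example. The sketch below uses a one-parameter model y = w * x with mean squared error, chosen only to keep the arithmetic visible. It assumes equal-sized micro-batches, and the function names are hypothetical.

```python
def grad_mse(w, xs, ys):
    """Gradient of mean squared error for the 1-D model y = w * x."""
    n = len(xs)
    return sum(2.0 * (w * x - y) * x for x, y in zip(xs, ys)) / n


def accumulated_grad(w, xs, ys, num_micro):
    """Split a mini-batch into equal micro-batches and accumulate gradients."""
    if len(xs) % num_micro != 0:
        raise ValueError("mini-batch size must be divisible by num_micro")
    size = len(xs) // num_micro
    total = 0.0
    for i in range(num_micro):
        lo, hi = i * size, (i + 1) * size
        # Scale each micro-batch gradient by 1/num_micro so the sum
        # equals the mean gradient over the full mini-batch.
        total += grad_mse(w, xs[lo:hi], ys[lo:hi]) / num_micro
    return total
```

With `xs = [1.0, 2.0, 3.0, 4.0]`, `ys = [2.0, 4.0, 6.0, 8.0]`, and `w = 0.5`, both `grad_mse(w, xs, ys)` and `accumulated_grad(w, xs, ys, 2)` give -22.5. The weight update therefore waits until all micro-batches have passed through the pipeline.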
Using Pipeline Bubbles to Hide Overhead
Improvements Provided by BreezeML
● 2-3x more cost-effective for popular training jobs
● Red line: using on-demand AWS
Problem 2: Highly Diverse Software and Hardware
[Figure: diagram with labels On-prem Cloud, AWS Cloud, GCP Cloud; Team 1 to Team 5; TF, PyTorch, JAX; Breeze Cloud-Neutral AI Platform; TPU, CPU, GPU, Graviton CPU]
Breeze Multi-cloud Platform
● Users create a dataflow graph and annotate tasks
● We partition the dataflow and run tasks on the most appropriate resources across clouds to satisfy users’ constraints
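As a rough illustration of annotate-then-place, the sketch below assigns each annotated task to the cheapest cloud that offers the requested device. Everything here is an assumption made for the example: the catalog, its prices, the task names, and the greedy per-task rule. It ignores the edges of the dataflow graph and cross-cloud transfer costs, which a real partitioner would have to weigh, and it is not BreezeML's scheduler.

```python
# Hypothetical catalog: (cloud, device) -> hourly price in dollars.
CATALOG = {
    ("aws", "gpu"): 3.0,
    ("gcp", "gpu"): 2.5,
    ("gcp", "tpu"): 4.0,
    ("aws", "cpu"): 0.4,
    ("gcp", "cpu"): 0.5,
}


def place_tasks(tasks, catalog=CATALOG):
    """Assign each annotated task to the cheapest cloud offering its device.

    tasks maps a task name to the device type the user annotated it with.
    """
    placement = {}
    for name, device in tasks.items():
        offers = [
            (price, cloud)
            for (cloud, dev), price in catalog.items()
            if dev == device
        ]
        if not offers:
            raise ValueError(f"no cloud offers device {device!r} for task {name!r}")
        _, cloud = min(offers)  # lowest price wins
        placement[name] = cloud
    return placement
```

With this catalog, `place_tasks({"preprocess": "cpu", "train": "gpu", "serve": "tpu"})` places `preprocess` on `aws` and the other two tasks on `gcp`.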
Years of Research and Development
[NSDI 2023] Thorpe et al., “Bamboo: Making Preemptible Instances Resilient for Affordable Training of Large DNNs”
[NSDI 2022] Cangialosi et al., “Privid: Practical, Privacy-Preserving Video Analytics Queries”
[ICLR 2022] Zhang et al., “GradSign: Model Performance Inference with Theoretical Insights”
[MLSys 2022] Dogga et al., “Revelio: ML-Generated Debugging Queries for Distributed Systems”
[OSDI 2021] Thorpe et al., “Dorylus: Affordable, Scalable, and Accurate GNN Training with Distributed CPU Servers and Serverless Threads”
[OSDI 2021] Wang et al., “PET: Optimizing Tensor Programs with Partially Equivalent Transformations and Automated Corrections”
[MLSys 2021] Ding et al., “IOS: Inter-Operator Scheduler for CNN Acceleration”
[SIGCOMM 2020] Li et al., “Reducto: On-Camera Filtering for Resource-Efficient Real-Time Video Analytics”
[SoCC 2019] Dogga et al., “A System-Wide Debugging Assistant Powered by Natural Language Processing”
[SOSP 2019] Jia et al., “TASO: Optimizing Deep Learning Computation with Automated Generation of Graph Substitutions”
GTM and Customer Integration
Windmill API Server
● We control the backend
http://windmill.breezeml.ai/apis/
Free trial, academics, and small businesses
On-site Deployment
● PyTorch/TensorFlow/Ray plugin
● K8S plugin
● Deploy at user’s site
Enterprises that control backend themselves
Partner with Cloud Providers
● AWS
● GCP
● Azure
● Oracle Cloud
Enterprises on special deals with cloud providers
Monetization Schemes
Licensing
Deploy our system on customers’ cloud environment
License Fee
● Fixed-price per-year license
Cloud Service
Subscribe to BreezeML’s cloud service:
Service charge: $1000/year
Cut from savings: $ amount in savings per job * # jobs * 20%
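The cloud-service pricing can be written out as a short calculation. The sketch below assumes the flat service charge and the 20% cut are added together for a yearly total, which the slide does not state explicitly. The function name and the example figures are hypothetical.

```python
def yearly_cloud_service_fee(savings_per_job, num_jobs,
                             service_charge=1000.0, cut=0.20):
    """Yearly fee = flat service charge + cut of the customer's savings.

    Assumption: the two charges on the slide are summed; savings_per_job
    is the dollar amount saved on each job.
    """
    return service_charge + savings_per_job * num_jobs * cut
```

For a customer saving $500 on each of 100 jobs in a year, this gives $1000 + $500 * 100 * 20% = $11,000.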
BreezeML Enables Low-Cost, Cross-Cloud AI
● Reduce the burden of running on low-cost spot instances while maintaining high performance and reliability
● Allow developers to leverage the increasingly heterogeneous cloud environment
Thank you!
http://breezeml.ai

Data Con LA 2022 - Democratizing AI Across Clouds: Low-Cost, Easy-to-Deploy Machine Learning
