The power of Ray in the era of
LLM and multi-modality AI
Current: NVIDIA (DGX Cloud)
‘20 ~ ‘24: Anyscale Head of Ray / OSS
Before: Cloudera, LinkedIn; OSS: Hadoop, Spark, etc.
Ray is a popular OSS project
[Chart: Ray's OSS popularity compared with Apache Spark and MLflow]
AI/ML Infra Meetup | The power of Ray in the era of LLM and multi-modality AI
ChatGPT / GPT-4 Trained on Ray!
Fast iterations at Hyper-scale!
This Talk
Adoption Highlights
What is Ray
How is Ray Used?
Future Outlook
Ray: a short history
2016: Started in UC Berkeley (same lab as Spark / vLLM)
2019: Anyscale founded (company behind Ray)
2020: Ray v1.0 release; Ray Serve released
2022: Ray v2.0 release (KubeRay, Ray Data, etc.)
2023 / 2024: a lot of focus on LLMs
A Typical ML Pipeline
Ray: Holistically addresses AI/LLM challenges
Unified Framework for Scaling AI Workloads
Ray Core
“Operating System” for heterogeneous distributed computing
Ray AI Libraries
Data, Train, Tune, Serve, Reinforcement Learning
Minimalist API
ray.init(): Initialize the Ray context.
@ray.remote: Function or class decorator specifying that the function will be executed as a task, or the class instantiated as an actor, in a different process.
.remote: Postfix to every remote function, remote class declaration, or invocation of a remote class method. Remote operations are asynchronous.
ray.put(): Store an object in the object store and return its ID. This ID can be used to pass the object as an argument to any remote function or method call. This is a synchronous operation.
ray.get(): Return an object or list of objects for the given object ID or list of object IDs. This is a synchronous (i.e., blocking) operation.
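The task/future semantics behind this API can be emulated in plain Python. The sketch below is an illustrative analogy using the standard library's concurrent.futures, not Ray's actual implementation: `remote` plays the role of `@ray.remote` plus `.remote()`, and `get` plays the role of `ray.get()`.

```python
from concurrent.futures import ThreadPoolExecutor

# Illustrative stand-ins for Ray's task/future semantics (not Ray itself).
_pool = ThreadPoolExecutor(max_workers=4)

def remote(fn):
    """Decorator: fn.remote(...) submits fn asynchronously and
    immediately returns a future (like Ray's .remote())."""
    fn.remote = lambda *args, **kwargs: _pool.submit(fn, *args, **kwargs)
    return fn

def get(future):
    """Block until the future's result is available (like ray.get())."""
    return future.result()

@remote
def add(a, b):
    return a + b

fut = add.remote(1, 2)   # returns immediately with a future
print(get(fut))          # blocks until the result is ready; prints 3
```

The key point the slides make is visible here: submitting work returns a handle right away, so independent tasks can overlap, and blocking happens only at `get`.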
def read_array(file):
    # read ndarray "a" from "file"
    return a

def add(a, b):
    return np.add(a, b)

a = read_array(file1)
b = read_array(file2)
sum = add(a, b)

Function

class Counter(object):
    def __init__(self):
        self.value = 0

    def inc(self):
        self.value += 1
        return self.value

c = Counter()
c.inc()
c.inc()

Class
@ray.remote
def read_array(file):
    # read ndarray "a" from "file"
    return a

@ray.remote
def add(a, b):
    return np.add(a, b)

a = read_array(file1)
b = read_array(file2)
sum = add(a, b)

@ray.remote
class Counter(object):
    def __init__(self):
        self.value = 0

    def inc(self):
        self.value += 1
        return self.value

c = Counter()
c.inc()
c.inc()

Function → Task, Class → Actor
@ray.remote
def read_array(file):
    # read ndarray "a" from "file"
    return a

@ray.remote
def add(a, b):
    return np.add(a, b)

id1 = read_array.remote(file1)
id2 = read_array.remote(file2)
id = add.remote(id1, id2)
sum = ray.get(id)

@ray.remote
class Counter(object):
    def __init__(self):
        self.value = 0

    def inc(self):
        self.value += 1
        return self.value

c = Counter.remote()
id4 = c.inc.remote()
id5 = c.inc.remote()

Function → Task, Class → Actor
@ray.remote
def read_array(file):
    # read ndarray "a" from "file"
    return a

@ray.remote
def add(a, b):
    return np.add(a, b)

id1 = read_array.remote(file1)
id2 = read_array.remote(file2)
id = add.remote(id1, id2)
sum = ray.get(id)

[Diagram: read_array(file1) is launched on Node 1; the call returns the distributed future (object ID) id1 before read_array() finishes.]

Task API
@ray.remote
def read_array(file):
    # read ndarray "a" from "file"
    return a

@ray.remote
def add(a, b):
    return np.add(a, b)

id1 = read_array.remote(file1)
id2 = read_array.remote(file2)
id = add.remote(id1, id2)
sum = ray.get(id)

[Diagram: both read_array tasks run in parallel on Node 1 and Node 2, producing futures id1 and id2; the dynamic task graph is built at runtime.]

Task API
@ray.remote
def read_array(file):
    # read ndarray "a" from "file"
    return a

@ray.remote
def add(a, b):
    return np.add(a, b)

id1 = read_array.remote(file1)
id2 = read_array.remote(file2)
id = add.remote(id1, id2)
sum = ray.get(id)

[Diagram: add(id1, id2) is submitted to Node 3; every task has been submitted, but none has finished yet.]

Task API
@ray.remote
def read_array(file):
    # read ndarray "a" from "file"
    return a

@ray.remote
def add(a, b):
    return np.add(a, b)

id1 = read_array.remote(file1)
id2 = read_array.remote(file2)
id = add.remote(id1, id2)
sum = ray.get(id)

[Diagram: ray.get(id) blocks until the result is available.]

Task API
@ray.remote
def read_array(file):
    # read ndarray "a" from "file"
    return a

@ray.remote
def add(a, b):
    return np.add(a, b)

id1 = read_array.remote(file1)
id2 = read_array.remote(file2)
id = add.remote(id1, id2)
sum = ray.get(id)

[Diagram: the full task graph is executed across Nodes 1–3 to compute sum.]

Task API
@ray.remote
def read_array(file):
    # read ndarray "a" from "file"
    return a

@ray.remote
def add(a, b):
    return np.add(a, b)

id1 = read_array.remote(file1)
id2 = read_array.remote(file2)
id = add.remote(id1, id2)
sum = ray.get(id)

@ray.remote(num_cpus=2, num_gpus=1)
class Counter(object):
    def __init__(self):
        self.value = 0

    def inc(self):
        self.value += 1
        return self.value

c = Counter.remote()
id1 = c.inc.remote()
id2 = c.inc.remote()
val = ray.get(id2)

Function → Task, Class → Actor
@ray.remote can specify resource demands, supporting heterogeneous hardware.
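The resource-demand idea can be sketched with a toy placement function (a hypothetical sketch; Ray's real scheduler is far more sophisticated): each task declares how many CPUs/GPUs it needs, and the scheduler places it on a node with enough free capacity.

```python
# Toy resource-aware placement, illustrating how declared demands
# (num_cpus, num_gpus) drive scheduling on heterogeneous nodes.
# Hypothetical sketch, not Ray's actual scheduling algorithm.

nodes = [
    {"name": "cpu-node", "cpus": 8, "gpus": 0},
    {"name": "gpu-node", "cpus": 4, "gpus": 1},
]

def place(task_cpus, task_gpus):
    """Return the first node with enough free CPUs/GPUs and reserve them."""
    for node in nodes:
        if node["cpus"] >= task_cpus and node["gpus"] >= task_gpus:
            node["cpus"] -= task_cpus
            node["gpus"] -= task_gpus
            return node["name"]
    return None  # no capacity: the task would queue until resources free up

print(place(2, 1))  # GPU task lands on "gpu-node"
print(place(2, 0))  # CPU-only task lands on "cpu-node"
```

Declaring demands up front is what lets one cluster mix CPU-only and GPU tasks without the user pinning work to machines by hand.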
The “No B.S.” Slide 🤓
At the end of the day, “RPC is all you need”
[Diagram: a client asks a machine to “Run func(x)” and gets back “Return y”.]
At the end of the day, “RPC is all you need”
[Diagram: a client dispatches work to a pool of workers: “Harry: do this”, “Ron: do that”.]
[Diagram: a client sends queries (“SELECT * …”) to a cluster.]
At the end of the day, “RPC is all you need”
[Diagram: a client declares resource demands: “I need someone to do this (w/ 2 CPUs)”, “I need someone to do that (w/ 1 H100 GPU)”.]
At the end of the day, “RPC is all you need”
“No abstraction; it’s like SSH”
“Do things my way (queries)”
“It’s still Python code”
At the end of the day, “RPC is all you need”
A compelling example
This Talk
Adoption Highlights
What is Ray
How is Ray Used?
Future Outlook
Most Successful Use Cases
Model Training
Unstructured Data (text, image, video)
LLM Inference / Fine-tuning
Reinforcement Learning
Graph Computing
…
Model Training – Pinterest
Model Training – Pinterest
Model Training – Pinterest
Model Training – Uber
Model Training – Uber
Unstructured Data – Benchmark
Cost to process 1M images (batch inference):
● Ray: $3.5
● Leading open-source framework: $7.3
● Leading commercial ML platform: $57

Pipeline on Ray Core: Load (CPU) → Pre-processing (CPU) → Inference (GPU) → Save
● Use the most cost-effective hardware for each stage
● Independently scale every stage
https://www.anyscale.com/blog/offline-batch-inference-comparing-ray-apache-spark-and-sagemaker
Unstructured Data – Benchmark
Pipeline on Ray Core: Load (CPU) → Pre-processing (CPU) → Inference (GPU) → Save
● Use the most cost-effective hardware for each stage
● Independently scale every stage
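The staged design can be sketched in plain Python (an illustrative analogy, not Ray Data's actual API): each stage owns its own worker pool, so its parallelism is tuned independently, e.g. many workers for CPU-bound load/pre-processing and few workers for the scarce GPU inference stage.

```python
from concurrent.futures import ThreadPoolExecutor

# Toy staged pipeline: each stage has its own pool, so stages scale
# independently. The stage functions are hypothetical placeholders.
def load(path):        return f"raw({path})"
def preprocess(raw):   return f"tensor({raw})"
def infer(tensor):     return f"pred({tensor})"

load_pool  = ThreadPoolExecutor(max_workers=8)  # CPU-heavy stage
prep_pool  = ThreadPoolExecutor(max_workers=8)  # CPU-heavy stage
infer_pool = ThreadPoolExecutor(max_workers=2)  # scarce "GPU" stage

def run_pipeline(paths):
    raws = load_pool.map(load, paths)            # stage 1
    tensors = prep_pool.map(preprocess, raws)    # stage 2
    return list(infer_pool.map(infer, tensors))  # stage 3

results = run_pipeline(["img0.jpg", "img1.jpg"])
print(results)
```

In the real system the stages would also run on different hardware (CPU nodes feeding GPU nodes), which is what makes per-stage right-sizing pay off in the cost comparison above.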
Unstructured Data – Niantic
Unstructured Data – Niantic
Unstructured Data – Niantic
This Talk
Adoption Highlights
What is Ray
How is Ray Used?
Future Outlook
Zhe’s Takes on ML Infra
“No abstraction; it’s like SSH”
“Do things my way (queries)”
“It’s still Python code”
Zhe’s Takes on ML Infra
How structured is the problem?
- ML is a much more unstructured problem than data at this point
- It could become structured at some point
- Most people still settle for the SSH + command-line approach (e.g., torchrun on Slurm)
Zhe’s Takes on ML Infra
The rise of multi-modality AI
- Far more data-intensive
- More development / trial-and-error
- It becomes more meaningful to “upgrade your gear” (try Ray)