Accelerate Distributed PyTorch/Ray Workloads in the Cloud
Chunxu Tang, Siyuan Sheng @ Alluxio
Agenda
01 ML Workloads in the Cloud
02 Accelerating PyTorch Workloads
03 Accelerating Ray Workloads
ML Workloads in the Cloud
[Figure: Data Access Patterns]
Hybrid/Multi-Cloud ML Platforms
[Diagram: an online ML platform (serving cluster) and an offline training platform (training cluster) spanning DC/Cloud A and DC/Cloud B, with training data and models exchanged in numbered steps 1-3.]
Challenges
● I/O Bottlenecks
● Performance
○ Significant latency for remote data retrieval
○ Repeated data retrieval
● Cost
○ High expenses incurred from remote storage access
○ Underutilization of GPU resources
Benefits of Data Locality
● Performance Gain
○ Faster access to your data compared to remote storage
○ Less time spent on data-intensive applications
● Cost Saving
○ Fewer API calls to cloud storage (data & metadata)
○ Higher utilization of GPU
Solutions
● Read from remote storage (no locality)
○ Pros: easy to set up
○ Cons: performance and cost issues due to I/O bottlenecks
● Copy data to local before training
○ Pros: data is local; easy to set up
○ Cons: hard to manage; limited cache space
● Local cache layer (S3FS-FUSE, Alluxio-FUSE)
○ Pros: data is local; convenient interface
○ Cons: hard to manage; limited cache space
● Distributed data access layer
○ Pros: data is local or adjacent; central data management; scalable cache space
○ Cons: hard to build
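To make the "limited cache space" trade-off concrete, here is a minimal sketch of a read-through cache with least-recently-used (LRU) eviction. It is an illustration only: the class, the `fetch_remote` callable, and the byte sizes are hypothetical and do not describe how S3FS-FUSE or Alluxio-FUSE are implemented.

```python
from collections import OrderedDict
from typing import Callable, Dict


class ReadThroughCache:
    """Byte-bounded LRU cache in front of a slow `fetch_remote` callable."""

    def __init__(self, fetch_remote: Callable[[str], bytes], capacity_bytes: int):
        self._fetch_remote = fetch_remote
        self._capacity = capacity_bytes
        self._entries: "OrderedDict[str, bytes]" = OrderedDict()
        self._used = 0
        self.remote_reads = 0  # number of times remote storage was hit

    def read(self, key: str) -> bytes:
        if key in self._entries:
            self._entries.move_to_end(key)  # mark as most recently used
            return self._entries[key]
        data = self._fetch_remote(key)
        self.remote_reads += 1
        self._admit(key, data)
        return data

    def _admit(self, key: str, data: bytes) -> None:
        if len(data) > self._capacity:
            return  # too large to cache; serve it without storing
        # Evict least recently used entries until the new object fits.
        while self._used + len(data) > self._capacity:
            _, evicted = self._entries.popitem(last=False)
            self._used -= len(evicted)
        self._entries[key] = data
        self._used += len(data)


def remote_reads_for(capacity_bytes: int, files: Dict[str, bytes], epochs: int) -> int:
    """Count remote fetches when every epoch reads every file once, in order."""
    cache = ReadThroughCache(lambda name: files[name], capacity_bytes)
    for _ in range(epochs):
        for name in files:
            cache.read(name)
    return cache.remote_reads
```

With four 10-byte files read for three epochs, a 40-byte cache needs 4 remote reads, while a 30-byte cache needs all 12: a sequential scan over a dataset slightly larger than an LRU cache evicts each file just before it is needed again. This is one reason a scalable cache space matters for repeated epoch reads.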
Unified Data Access for ML Platforms
[Diagram: the hybrid/multi-cloud layout with an Alluxio layer in both the online ML platform (serving cluster) and the offline training platform (training cluster) across DC/Cloud A and DC/Cloud B; training data and models move in numbered steps 1-5.]
Accelerating PyTorch Workloads
Integration with PyTorch Training (Alluxio)
[Diagram: a training node runs PyTorch on top of the Alluxio client. The cache client gets cluster info from the service registry and finds worker(s) using an affinity block location policy with client-side load balancing. Cache workers in the cache cluster get task info, execute the task, and send the result. On a cache miss, an under storage task reads from the under storage. The figure numbers these steps 1-5.]
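The diagram names an affinity block location policy and client-side load balancing but does not say how workers are chosen. One common way to get this kind of affinity is consistent hashing. The sketch below assumes that technique, which the slides do not confirm as Alluxio's implementation; the class name, worker names, and path are hypothetical.

```python
import bisect
import hashlib
from typing import List


def _hash(value: str) -> int:
    """Stable 64-bit hash (Python's built-in hash() is randomized per process)."""
    return int.from_bytes(hashlib.md5(value.encode("utf-8")).digest()[:8], "big")


class WorkerRing:
    """Consistent-hash ring mapping a file path to preferred cache workers."""

    def __init__(self, workers: List[str], virtual_nodes: int = 100):
        if not workers:
            raise ValueError("at least one worker is required")
        # Each worker owns many points on the ring to even out the load.
        self._ring = sorted(
            (_hash(f"{worker}#{i}"), worker)
            for worker in workers
            for i in range(virtual_nodes)
        )
        self._points = [point for point, _ in self._ring]

    def find_workers(self, path: str, count: int = 1) -> List[str]:
        """Return up to `count` distinct workers for `path`, in preference order."""
        start = bisect.bisect_right(self._points, _hash(path))
        chosen: List[str] = []
        for offset in range(len(self._ring)):
            worker = self._ring[(start + offset) % len(self._ring)][1]
            if worker not in chosen:
                chosen.append(worker)
                if len(chosen) == count:
                    break
        return chosen
```

Because every client computes the same ring from the same worker list, repeated reads of one path land on the same worker without a central lookup, and removing a worker only remaps the paths that worker owned.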
Data Loading Performance
[Figures: data loading performance on an ImageNet subset and on the Yelp review dataset]
GPU Utilization Improvement
Training directly from storage (S3-FUSE)
- More than 80% of total time is spent in the DataLoader
- Results in a low GPU utilization rate (<20%)
Training with Alluxio-FUSE
- DataLoader share of total time reduced from 82% to 1% (82X)
- GPU utilization rate increased from 17% to 93% (5X)
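The share of time spent in the DataLoader can be estimated by timing how long the training loop waits for each batch. The helper below is a minimal sketch of that measurement, not the profiling method used for the slides; the function name and the injectable `clock` parameter are assumptions made for illustration, and it works with any iterable of batches.

```python
import time
from typing import Any, Callable, Iterable


def data_wait_fraction(
    batches: Iterable[Any],
    train_step: Callable[[Any], None],
    clock: Callable[[], float] = time.perf_counter,
) -> float:
    """Return the fraction of wall time spent waiting for the next batch."""
    wait = 0.0
    start = clock()
    iterator = iter(batches)
    while True:
        t0 = clock()
        try:
            batch = next(iterator)  # blocks while data is being loaded
        except StopIteration:
            break
        wait += clock() - t0
        train_step(batch)  # compute time is everything outside the wait
    total = clock() - start
    return wait / total if total > 0 else 0.0
```

A result near 0.8 corresponds to the S3-FUSE case above, and a result near 0.01 to the Alluxio-FUSE case. The ratios on the slide follow directly: 82 / 1 = 82 and 93 / 17 is about 5.5. If `train_step` launches asynchronous GPU work, it needs to synchronize before returning for the split to be meaningful.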
Accelerating Ray Workloads
Ray is Designed for Distributed Cloud Training
● Ray uses a distributed scheduler to dispatch training jobs to available workers (CPUs/GPUs)
● Enables seamless horizontal scaling of training jobs across multiple nodes
● Provides a streaming data abstraction for parallel and distributed preprocessing in ML training
Performance & Cost Issues Reported by the Ray Community
● You might load the entire dataset again and again for each epoch
● You cannot automatically cache the hottest data across multiple training jobs
● You might suffer from a cold start every time
Alluxio - Ray Integration
[Diagram: in Ray, the Ray dataloader performs PyArrow dataset loading through the Alluxio fsspec implementation, which uses the Alluxio Python client. Alluxio workers, each with a REST API server, register with etcd, and the client gets worker addresses from etcd.]
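The registration and "get worker addresses" arrows follow a service-registry pattern: workers announce themselves with a lease, and clients list the workers whose lease is still valid. The sketch below illustrates that pattern with an in-memory stand-in; it does not use the etcd API or the Alluxio client, and every name in it is hypothetical.

```python
import time
from typing import Callable, Dict, List


class InMemoryRegistry:
    """Stand-in for a registry such as etcd: entries expire unless refreshed."""

    def __init__(self, clock: Callable[[], float] = time.monotonic):
        self._clock = clock
        self._expiry: Dict[str, float] = {}

    def register(self, worker_address: str, ttl_seconds: float) -> None:
        """Called by a worker at startup and again as a periodic heartbeat."""
        self._expiry[worker_address] = self._clock() + ttl_seconds

    def live_workers(self) -> List[str]:
        """Called by a client to get the addresses of workers still alive."""
        now = self._clock()
        return sorted(
            address for address, expires in self._expiry.items() if expires > now
        )
```

A client would call `live_workers()` before choosing where to send a read, so a worker that stops sending heartbeats drops out of the list once its lease expires.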
Alluxiofs - fsspec with Ray Usage
# Import Ray, fsspec, and the Alluxio fsspec implementation
import fsspec
import ray
from alluxiofs import AlluxioFileSystem
# Create the Alluxio filesystem (alluxiofs is used here instead of s3fs).
# args.etcd_host is the etcd address, parsed elsewhere (e.g. with argparse).
# Assumption: AlluxioFileSystem has been registered with fsspec for this
# protocol; by default fsspec resolves "s3" to s3fs.
alluxio = fsspec.filesystem("s3", etcd_host=args.etcd_host)
# Ray reads data through Alluxio using the original S3 URL
ds = ray.data.read_images("s3://ai-ref-arch/imagenet-full/train", filesystem=alluxio)
See more in: https://github.com/fsspec/alluxiofs
Alluxio+Ray Benchmark – Small Files
● Dataset
○ 130 GB ImageNet dataset
● Process Settings
○ 4 train workers
○ 9 reading processes
● Active Object Store Memory
○ 400-500 MiB
Alluxio+Ray Benchmark – Large Parquet Files
● Dataset
○ 200 MiB files, 60 GiB in total
● Process Settings
○ 28 train workers
○ 28 reading processes
● Active Object Store Memory
○ 20-30 GiB
Cost Saving – Egress/Data Transfer Fees
Cost Saving – API Calls/S3 Operations (List, Get)
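Both cost items scale with how often the same data is pulled from remote storage. The sketch below is a back-of-the-envelope model under stated assumptions, not the calculation behind the slides: it assumes one GET request per file per pass, a cache large enough to hold the whole dataset, and it deliberately leaves out prices. The function name, file count, and epoch count are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RemoteAccessEstimate:
    gib_transferred: float  # data pulled from remote storage
    get_requests: int       # object GET calls issued to remote storage


def estimate_remote_access(
    dataset_gib: float, num_files: int, epochs: int, cached: bool
) -> RemoteAccessEstimate:
    """Estimate remote traffic for `epochs` full passes over a dataset.

    Without a cache every epoch re-reads everything from remote storage;
    with a cache that holds the full dataset only the first pass does.
    """
    remote_passes = 1 if cached else epochs
    return RemoteAccessEstimate(
        gib_transferred=dataset_gib * remote_passes,
        get_requests=num_files * remote_passes,
    )
```

For a 60 GiB dataset like the large Parquet benchmark, split into an assumed 300 files and read for an assumed 10 epochs, the model gives 600 GiB and 3,000 GETs without a cache versus 60 GiB and 300 GETs with one. Multiplying by a provider's own egress and request prices turns this into a cost estimate.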
Website: www.alluxio.io
Slack: https://alluxio.io/slack
GitHub: https://github.com/Alluxio
Social Media: Twitter.com/alluxio, Linkedin.com/alluxio
Chunxu Tang: www.linkedin.com/in/chunxu-tang
Siyuan Sheng: www.linkedin.com/in/siyuan-sheng

Data Infra Meetup | Accelerate Distributed PyTorch/Ray Workloads in the Cloud
