Enhancing Python Data Loading in the Cloud for AI/ML
Bin Fan, Chief Architect & VP of Open Source @ Alluxio
The Evolution of the Modern Data Stack
● 10yr ago: Tightly-coupled Hadoop & HDFS | On-prem HDFS | Single region & single cloud
● Today: Compute-storage separation | Cloud data lake | Multi-region/hybrid/multi-cloud
More elastic, cheaper, more scalable
The Evolution of the Modern Data Stack
● Today: Compute-storage separation | Cloud data lake | Multi-region/hybrid/multi-cloud
Data is remote from compute; locality is missing → I/O challenges
I/O Challenges
Performance
● Analytics SQL: high query latency because of retrieving remote data
● Model Training: training is slow because of loading remote data in each epoch (LISTing lots of small files is particularly slow)
Cost
● GET/PUT operation costs add up quickly
● Cross-region data transfer (egress) fees
● GPU cycles are wasted waiting for data
Reliability
● Job failures
● Amazon S3 errors: 503 Slow Down, 503 Service Unavailable
10% of your data is hot data (Source: Alluxio)
→ Add a Data Caching Layer between compute & storage
Reduce Latency
Without cache: every run alternates between I/O (retrieving remote data) and compute.
With cache: only the first access pays the remote-I/O cost; later reads are served from the cache, so total job run time is reduced.
Increase GPU Utilization
Without cache: the GPU is idle during every data-loading phase.
With cache: only the first epoch loads remote data; after that the GPU is busy most of the time, so GPU utilization is greatly increased.
Reduce Cloud Storage Cost
Without cache: compute clusters frequently retrieve data from AWS S3 (e.g. across us-east-1 and us-west-1) = high GET/PUT operation costs & data transfer costs.
With cache: fast access with hot data cached; data is only retrieved from S3 when necessary = lower S3 costs.
Improve Reliability
A data caching layer prevents network congestion, relieves overloaded storage, and prevents job failures like “503 Service Unavailable”.
Observations So Far …
● The evolution of the modern data stack poses challenges for data locality
● You should care about I/O in the data lake because it greatly impacts the performance, cost & reliability of your data platform
● A data caching layer between compute and storage can solve these I/O challenges
● A cache can be used for both analytics and AI workloads
Accessing Data and Models in the Cloud
Hybrid/Multi-Cloud ML Platforms
Separation of compute and storage across data centers/clouds:
● Offline training platform (training cluster, DC/Cloud A): (1) reads training data, (2) writes trained models
● Online ML platform (serving cluster, DC/Cloud B): (3) pulls the models for serving
Existing Solutions
Data access:
1. Read data directly from cloud storage
2. Copy data from cloud to local before training
3. Local cache layer for data reuse
4. Distributed cache system
Model access:
1. Pull models directly from cloud storage
Option 1: Read From Cloud Storage
● Easy to set up
● Performance is not ideal
■ Model access: models are repeatedly pulled from cloud storage
■ Data access: reading data can take more time than the actual training (in the profiled run, 82% of the time was spent by the DataLoader)
Option 2: Copy Data To Local Before Training
● Data is now local
■ Faster access + less cost
● Management is hard
■ Must manually delete training data after use
● Local storage space is limited
■ If the dataset is huge, the benefits are limited
Option 3: Local Cache for Data Reuse
Examples: S3FS built-in local cache, Alluxio FUSE SDK
● Reused data is local
■ Faster access + less cost
● The cache layer handles data management
■ No manual deletion/supervision
● Cache space is limited
■ If the dataset is huge, the benefits are limited
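A local cache layer of this kind can be tried with off-the-shelf tools. Below is a minimal sketch using fsspec's `filecache` protocol, which wraps any remote filesystem with an on-disk cache; the bucket and paths are placeholders, and the runnable part uses fsspec's in-memory filesystem as a stand-in for S3:

```python
import tempfile

import fsspec

# In production this would target S3 via s3fs, e.g. (placeholder bucket):
#   fsspec.open("filecache::s3://my-bucket/train/part-0.parquet",
#               filecache={"cache_storage": "/tmp/fsspec-cache"})
# For a self-contained demo, use the in-memory filesystem as the "remote":
remote = fsspec.filesystem("memory")
with remote.open("/train/part-0.bin", "wb") as f:
    f.write(b"training bytes")

cache_dir = tempfile.mkdtemp()

# The first open pulls from the "remote" and stores a local copy;
# opens in later epochs are served from cache_dir instead.
with fsspec.open(
    "filecache::memory://train/part-0.bin",
    filecache={"cache_storage": cache_dir},
) as f:
    data = f.read()
```

As the slide notes, this helps only as far as the local cache space goes; a huge dataset still spills back to remote reads.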
Option 4: Distributed Cache System
Clients connect to a pool of cache workers.
● Training data and trained models can be kept in the cache, distributed across workers
● Typically comes with data management functionality
Challenges
1. Performance
● Pulling data from cloud storage hurts training/serving speed.
2. Cost
● Repeatedly requesting data from cloud storage is costly.
3. Reliability
● Availability is key for every service in the cloud.
4. Usability
● Manual data management is unfavorable.
Alluxio as an Example
Consistent Hashing for Caching
Clients talk directly to a pool of workers (no masters on the data path):
● Consistent hashing is used to cache both data and metadata on workers
● Worker nodes have plenty of space for cache; training data and models only need to be pulled once from cloud storage. Cost --
● No more single point of failure. Reliability ++
● No more performance bottleneck on masters. Performance ++
● Built-in data management system
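The worker-selection idea can be sketched with a toy consistent-hash ring (illustrative only, not Alluxio's actual implementation; the hash function and virtual-node count are arbitrary choices):

```python
import bisect
import hashlib

class ConsistentHashRing:
    """Map file paths to cache workers so that adding or removing a
    worker only remaps a small fraction of paths."""

    def __init__(self, workers, vnodes=100):
        # Place each worker at many virtual positions on the ring
        # for a more even key distribution.
        self._ring = sorted(
            (self._hash(f"{w}#{i}"), w)
            for w in workers
            for i in range(vnodes)
        )
        self._keys = [h for h, _ in self._ring]

    @staticmethod
    def _hash(key):
        return int(hashlib.md5(key.encode()).hexdigest(), 16)

    def worker_for(self, path):
        # Walk clockwise to the first virtual node at or after the
        # path's hash, wrapping around the ring.
        idx = bisect.bisect(self._keys, self._hash(path)) % len(self._ring)
        return self._ring[idx][1]
```

Because placement is a pure function of the path and the worker set, any client can locate the right cache worker on its own, which is what removes the need for a master in the data path.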
By the Numbers
● High Scalability
■ One worker supports 30-50 million files
■ Scales linearly, making it easy to support 10+ billion files
● High Availability
■ 99.99% uptime
■ No single point of failure
● High Performance
■ Faster data loading
● Cloud-native K8s Operator and CSI-FUSE for data access management
Alluxio FUSE
● Exposes the Alluxio file system as a local file system.
● Cloud storage can be accessed just like local storage:
○ cat, ls
○ f = open("a.txt", "r")
● Very low impact on end users
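Because the mount behaves like a local directory, training code needs no SDK changes. A minimal sketch (the `/mnt/alluxio` mount point is an example, not a fixed path):

```python
import os

def list_dataset(mount, subdir):
    # Equivalent of `ls` on the cached remote directory
    return sorted(os.listdir(os.path.join(mount, subdir)))

def read_sample(mount, relpath):
    # Plain open() works exactly as it would on local disk
    with open(os.path.join(mount, relpath), "rb") as f:
        return f.read()

# e.g. with Alluxio FUSE mounted at /mnt/alluxio:
#   files = list_dataset("/mnt/alluxio", "imagenet/train")
#   img = read_sample("/mnt/alluxio", "imagenet/train/" + files[0])
```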
Alluxio CSI x Alluxio FUSE for Data Access
● FUSE: turns a remote dataset in the cloud into a local folder for training
● CSI: launches the Alluxio FUSE pod only when the dataset is needed
On the host machine, the application container mounts a persistent volume + claim, which is in turn mounted by the FUSE container in the Alluxio FUSE pod.
Data Access Management for PyTorch
Integration with PyTorch Training (Alluxio)
On the training node, PyTorch reads through the Alluxio client, which talks to the cache cluster:
1. Get cluster info from the service registry
2. Find worker(s) using the affinity block location policy (client-side load balancing)
3. Get task info
4. Execute the task on the chosen cache worker (on a cache miss, the worker runs an under-storage task against the under store)
5. Send the result back to the client
Data Loading Performance
Benchmark datasets: ImageNet (subset), Yelp review
GPU Utilization Improvement
Training directly from storage (S3-FUSE):
- >80% of total time is spent in the DataLoader
- Results in a low GPU utilization rate (<20%)
Training with Alluxio-FUSE:
- Reduced the DataLoader's share of time from 82% to 1% (82x)
- Increased the GPU utilization rate from 17% to 93% (5x)
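Numbers like these can be reproduced by timing the two phases of the training loop. A framework-agnostic sketch (the loader and step function stand in for whatever your training code already uses):

```python
import time

def profile_epoch(loader, train_step):
    """Return (dataloader_fraction, compute_fraction) of one epoch's wall time."""
    load_s = compute_s = 0.0
    t0 = time.perf_counter()
    for batch in loader:            # time spent here is data loading
        t1 = time.perf_counter()
        load_s += t1 - t0
        train_step(batch)           # time spent here is GPU/compute work
        t0 = time.perf_counter()
        compute_s += t0 - t1
    total = load_s + compute_s
    return load_s / total, compute_s / total
```

On a GPU, call torch.cuda.synchronize() before each timestamp so asynchronous kernels are attributed to the compute phase rather than the next load.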
How to Enable Python Applications
Use the Alluxio-Ray Integration as an Example
● Ray's dataloader loads data via PyArrow Dataset loading, which goes through the fsspec Alluxio implementation and the Alluxio Python client
● Alluxio workers (each exposing a REST API server) register with etcd; the client gets worker addresses from etcd
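Concretely, the integration hinges on fsspec: Ray and PyArrow accept any fsspec-compatible filesystem object, so the Alluxio Python client only needs to expose one (the `alluxiofs` package provides such an implementation; the Ray call below is illustrative and not executed here). The runnable part uses fsspec's in-memory filesystem as a stand-in:

```python
import fsspec

# Stand-in for an Alluxio-backed fsspec filesystem:
fs = fsspec.filesystem("memory")
with fs.open("/datasets/part-0.parquet", "wb") as f:
    f.write(b"parquet bytes")

# Any fsspec filesystem supports the same listing/reading interface
# that PyArrow Dataset loading drives under the hood:
names = fs.ls("/datasets", detail=False)

# Illustrative Ray usage (assumes ray plus an Alluxio fsspec
# implementation, e.g. alluxiofs):
#   import ray
#   ds = ray.data.read_parquet("/datasets", filesystem=alluxio_fs)
```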
Alluxio+Ray Benchmark – Small Files
● Dataset: 130 GB ImageNet dataset
● Process settings: 4 train workers, 9 reading processes
● Active object store memory: 400-500 MiB
Alluxio+Ray Benchmark – Large Parquet Files
● Dataset: 200 MiB files, adding up to 60 GiB
● Process settings: 28 train workers, 28 reading processes
● Active object store memory: 20-30 GiB
Cost Saving – Egress/Data Transfer Fees
Cost Saving – API Calls/S3 Operations (List, Get)
● List/Get API calls only hit Alluxio, not the underlying S3
Use Cases
Use Case: Autonomous Driving – Alluxio Benefits
● 30-50%: reduces 30%+ of the time compared with consuming directly from cloud object storage; avoids 50%+ of data copies
● 90%+: stable GPU utilization no matter where you start the GPU cluster
● Manages the ongoing training dataset from cold storage; Alluxio serves data to GPUs with advanced caching capability
● Provides a virtual layer over different storage systems
THANKS
Any Questions?
Scan the QR code for a Linktree including great learning resources, exciting meetups & a community of data & AI infra experts!
Big Data Bellevue Meetup | Enhancing Python Data Loading in the Cloud for AI/ML