Optimize, Don't Overspend:
Data Caching Strategy for AI Workloads
Sep, 2024
Alluxio makes it easy to share and
manage data from any storage to any
compute engine in any environment,
with high performance and low cost.
Open Source Started From UC
Berkeley AMPLab in 2014
JOIN THE CONVERSATION
ON SLACK
ALLUXIO.IO/SLACK
1,200+
contributors &
growing
10,000+
Slack Community
Members
Top 10
Most Critical Java
Based Open
Source Project
Top 100
Most Valuable
Repositories Out of 96
Million on GitHub
Case Studies
Zhihu
TELCO & MEDIA
E-COMMERCE
FINANCIAL SERVICES
TECH & INTERNET
OTHERS
Leverage GPUs
Anywhere
Run AI workloads wherever
GPUs are available without
data locality concerns
Alluxio AI Offering
Critical infrastructure barriers to
effective AI/ML adoption
LOW PERFORMANCE
COST MANAGEMENT
High performance caching
for model training &
distribution
GPU SCARCITY
Multi-region/cloud data
serving capability
Shorten time-to-production
Higher GPU Utilization
Avoid copying across data lakes
Utilize NVMe directly on the GPU
cluster
I/O Performance for AI Training
and GPU Utilization
1. HPC Performance on Existing Data Lakes
● Achieve up to 8 GB/s throughput & 200K IOPS for a single client
● Improvements compared to 2.x: 35% for hot sequential reads, 20x for hot random reads, 4x for cold reads
2. GPU Saturation
● Fully saturate 8 A100 GPUs, showing over 97% GPU utilization in MLPerf Storage language processing benchmarks
● Customer production data show GPU utilization improving from 40% to 60% for search/recommendation models and from 50% to 95% for LLMs
3. Checkpoint Optimization
● New checkpoint read/write support optimizes training with write caching capabilities
● Alluxio 3.2: Achieved a bandwidth of 2081 MiB/s (1 thread) to 8183 MiB/s (32 threads) with a single client, the highest of the three systems tested.
● JuiceFS: Recorded a bandwidth of 1886 MiB/s (1 thread) to 6207 MiB/s, 9.3% to 24.1% slower than Alluxio 3.2.
● FSx Lustre: Recorded a bandwidth of 185 MiB/s (1 thread) to 3992 MiB/s, 91.1% to 51.2% slower than Alluxio 3.2.
● Observations: Alluxio 3.2 delivers the highest sequential read bandwidth in this test.
Comparison against other vendors | FIO - Sequential Read
Setup
● Alluxio:
1 Alluxio worker (i3en.metal)
1 Alluxio fuse client (c5n.metal)
● AWS FSx Lustre (12TB capacity)
● JuiceFS (SaaS)
Note: the Alluxio fuse client co-located with
training servers is responsible for POSIX
API access to Alluxio Workers which
actually cache the data
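The "slower than Alluxio 3.2" percentages quoted for this FIO test can be re-derived from the bandwidth figures. The sketch below is only an arithmetic check; the function names are illustrative, and it assumes each pair of numbers is compared at the same end of the reported range.

```python
# Bandwidth figures in MiB/s as quoted on the slide: (low end, high end).
ALLUXIO = (2081, 8183)
OTHERS = {
    "JuiceFS": (1886, 6207),
    "FSx Lustre": (185, 3992),
}


def percent_slower(baseline: float, other: float) -> float:
    """Return how much slower `other` is than `baseline`, in percent."""
    return (baseline - other) / baseline * 100.0


def slowdown_table() -> dict:
    """Map each system to its (low end, high end) slowdown, rounded to 0.1%."""
    return {
        name: tuple(round(percent_slower(a, o), 1) for a, o in zip(ALLUXIO, bw))
        for name, bw in OTHERS.items()
    }


if __name__ == "__main__":
    for name, (low, high) in slowdown_table().items():
        print(f"{name}: {low}% and {high}% slower than Alluxio 3.2")
```

The JuiceFS single-thread gap works out to about 9.37%, so the quoted 9.3% appears to be truncated rather than rounded; the other three figures match to one decimal place.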
Alluxio Proprietary and Confidential
Comparison against other vendors | MLPerf Storage
Setup
● Alluxio
1 fuse client (c6in.metal)
2 workers (i3en.metal)
Note: DDN with 12 GPUs and Weka
with 20 GPUs are the available data
points published on MLPerf website.
New Architecture
Decentralized Object Repository Architecture (DORA)
Motivation & Benefits
Scalability
● Master as the bottleneck
● Unlimited scalability
● Support tens of billions of small files with a single Alluxio cluster
Reliability
● Fault tolerance
● Automatic fallback to under file system
● More friendly to Kubernetes and cloud
Performance
● Zero-copy network transmission with Netty
● High concurrent read
Data Governance
● Multi-tenant & quota management
● Pluggable security management
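One way to picture the "automatic fallback to under file system" item is a read path that tries the cache first and goes to the backing store when no worker can serve the request. This is a minimal sketch under stated assumptions, not Alluxio's client code: `WorkerUnavailableError`, `read_with_fallback`, and both reader callables are hypothetical names introduced for illustration.

```python
class WorkerUnavailableError(Exception):
    """Hypothetical error raised when a cache worker cannot serve a read."""


def read_with_fallback(path, read_from_cache, read_from_under_storage):
    """Try the cache first; fall back to the under file system on failure.

    Both reader arguments are caller-supplied callables taking a path and
    returning bytes. Returns (data, source) so callers can see which path ran.
    """
    try:
        return read_from_cache(path), "cache"
    except WorkerUnavailableError:
        return read_from_under_storage(path), "under-storage"


if __name__ == "__main__":
    # In-memory stand-in for the under file system (illustrative only).
    under_storage = {"/data/a.bin": b"payload"}

    def failing_cache(path):
        raise WorkerUnavailableError(f"no worker available for {path}")

    data, source = read_with_fallback(
        "/data/a.bin", failing_cache, under_storage.__getitem__
    )
    print(source, data)
```

The point of the pattern is that a cache outage degrades read performance rather than failing the training job.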
Architecture
[Figure: DORA architecture diagram. Components: Training Node with AI/Analytics Applications and the Alluxio Client (Affinity Block Location Policy, Client Consistent Hash); Service Registry; Alluxio Cluster with Alluxio Workers; Under Storage. Arrow labels: Get Cluster Info, Find Worker(s), Get Task Info, Send Result, Execute Task, Cache miss, Under storage task. Steps are numbered 1 to 5.]
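The "Client Consistent Hash" label suggests the client picks a worker for each file by hashing, without asking a central master. A generic consistent-hash ring can be sketched as follows. This is a textbook version with virtual nodes, not Alluxio's implementation; the class name, the worker names, and the choice of SHA-256 are assumptions.

```python
import bisect
import hashlib


class ConsistentHashRing:
    """Map keys (for example file paths) to workers on a hash ring."""

    def __init__(self, workers, virtual_nodes=100):
        # Each worker owns several points on the ring to even out the load.
        ring = []
        for worker in workers:
            for i in range(virtual_nodes):
                ring.append((self._hash(f"{worker}#{i}"), worker))
        ring.sort()
        self._ring = ring
        self._hashes = [h for h, _ in ring]

    @staticmethod
    def _hash(value: str) -> int:
        # First 8 bytes of SHA-256 as an unsigned 64-bit ring position.
        digest = hashlib.sha256(value.encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big")

    def worker_for(self, path: str) -> str:
        """Return the worker owning the first ring point after the key's hash."""
        if not self._ring:
            raise ValueError("ring has no workers")
        idx = bisect.bisect_right(self._hashes, self._hash(path))
        return self._ring[idx % len(self._ring)][1]


if __name__ == "__main__":
    ring = ConsistentHashRing(["worker-a", "worker-b", "worker-c"])
    for path in ["/train/part-0001", "/train/part-0002", "/ckpt/step-100"]:
        print(path, "->", ring.worker_for(path))
```

With this scheme, removing one worker only remaps the keys that worker owned; keys assigned to the remaining workers stay where they are, which keeps most cached data reachable after a membership change.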
Read Optimization
High Concurrent Position Read
● Solve up to 150X read amplification issue
● Improve unstructured file parallel read up to 9X
● Improve structured file position read 2 - 15X
Zero-copy Data Transmission
● Improve memory efficiency
● Improve large file sequential streaming read performance by 30% - 50%
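A position read fetches a byte range at a given offset without moving a shared file cursor, which is what lets many readers hit the same file at once. The sketch below shows the generic POSIX pattern with `os.pread` (Unix only) plus a simple read amplification ratio. It illustrates the access pattern only; `read_ranges` and `read_amplification` are illustrative names, and the 1 MiB / 4 KiB figures in the demo are made up rather than taken from the slide.

```python
import os
import tempfile
from concurrent.futures import ThreadPoolExecutor


def read_ranges(path, ranges, max_workers=8):
    """Read (offset, length) ranges from one file concurrently.

    os.pread takes an explicit offset and leaves the descriptor's file
    position untouched, so all threads can safely share one descriptor.
    A range that runs past end-of-file returns fewer bytes than requested.
    """
    fd = os.open(path, os.O_RDONLY)
    try:
        with ThreadPoolExecutor(max_workers=max_workers) as pool:
            return list(pool.map(lambda r: os.pread(fd, r[1], r[0]), ranges))
    finally:
        os.close(fd)


def read_amplification(bytes_fetched, bytes_requested):
    """Ratio of bytes pulled from storage to bytes the application asked for."""
    return bytes_fetched / bytes_requested


if __name__ == "__main__":
    with tempfile.NamedTemporaryFile(delete=False) as f:
        f.write(bytes(range(256)) * 4)  # 1024-byte sample file
        name = f.name
    try:
        print(read_ranges(name, [(0, 4), (256, 2), (1020, 10)]))
    finally:
        os.unlink(name)
    # Hypothetical case: a 4 KiB request served by fetching a whole 1 MiB chunk.
    print(read_amplification(1024 * 1024, 4096))
```

Fetching only the requested range, instead of a whole chunk per request, is what drives the amplification ratio toward 1.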
Example Use Case
Zhihu CASE STUDY:
High Performance AI Platform for LLM
TECH BENEFIT: Increase GPU utilization from 50% to 93%
BUSINESS BENEFIT: 2 - 4X faster time-to-market
[Figure: data flow across Training Clouds, Offline Cloud, and Online Cloud. Labels: HDFS, Training Data, Models, Model Training, Model Deployment, Model Inference, Model Update, Downstream Applications.]
Try Alluxio For Free in 30 min!
Try the fully deployed Alluxio AI cluster for FREE!
● Explore the potential performance benefits of Alluxio by
running FIO benchmarks
● Simplify the deployment process with preconfigured
template clusters
● Maintain full control of your data with Alluxio deployed
within your AWS account
● User-friendly web UI: deploy with just a few clicks in under 40 minutes
Blog with sign up link and tutorial
Introducing Rapid Alluxio Deployer (RAD) in AWS!
Example
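For the FIO benchmarks mentioned above, a sequential-read run against a mounted path could be assembled as in the sketch below. The helper only builds a command line from standard fio options; the mount path `/mnt/alluxio-fuse`, the block size, the file size, and the runtime are assumptions chosen for illustration, not the settings behind the published numbers.

```python
import shlex


def fio_seq_read_cmd(directory, num_jobs=1, block_size="1M", size="1G", runtime_s=60):
    """Build a fio command line for a sequential-read test.

    `directory` is where fio places its test files, for example a FUSE
    mount point. `num_jobs` sets the number of parallel reader jobs.
    """
    return [
        "fio",
        "--name=seq-read",
        f"--directory={directory}",
        "--rw=read",             # sequential reads
        f"--bs={block_size}",
        f"--size={size}",
        f"--numjobs={num_jobs}",
        "--ioengine=psync",      # synchronous pread-based I/O
        "--time_based",
        f"--runtime={runtime_s}",
        "--group_reporting",     # aggregate results across jobs
    ]


if __name__ == "__main__":
    # Hypothetical mount point; replace with the path used in your cluster.
    for jobs in (1, 32):
        print(shlex.join(fio_seq_read_cmd("/mnt/alluxio-fuse", num_jobs=jobs)))
```

Running the printed command once with 1 job and once with 32 jobs mirrors the thread range used in the sequential-read comparison.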
Thank you!
Join the conversation on Slack
alluxio.io/slack
Sign up for RAD at https://signup.alluxio-rad.io/
and send us a screenshot of the cluster you
created to get a chance to win a $50 Amazon
gift card!

Alluxio Webinar | Optimize, Don't Overspend: Data Caching Strategy for AI Workloads
