Accelerate Model Training with
Alluxio Enterprise AI
Adit Madan
adit@alluxio.com
Alluxio Data Platform
High Performance data access, unified global view
Alluxio Technology Journey
Open source, started at UC Berkeley AMPLab in 2014
Timeline: 2014, 2019, 2023
● EXPLOSION OF DATA: rise of big data & analytics
● CLOUD ADOPTION: single to hybrid cloud, multi-cloud, cross region
● DEEP LEARNING & AI: large-scale model training and deployment
● 1000+ nodes in the largest deployment, by Baidu
● 1 billion files supported by Alluxio with the 2.0 release
● 7/10 top Internet companies powered by Alluxio
● AliPay: 80% Model Training
● Zhihu: LLM model training served by Alluxio
● 1000+ open source contributors
● 1000+ attendees at the Data Orchestration Summit
● 100% of Presto @ Meta fully on-boarded to Alluxio
● 9/10 top Internet companies powered by Alluxio
AI Infra Day | Accelerate Your Model Training and Serving with Distributed Caching
Critical infrastructure barriers to effective AI/ML adoption
● LOW PERFORMANCE: inefficient data I/O
● COST MANAGEMENT: specialized storage comes at a premium
● GPU SCARCITY: ability to leverage GPUs anywhere
What's New: Alluxio Enterprise AI
1. High-performance I/O over commodity storage
○ New distributed system architecture called DORA (Decentralized Object Repository Architecture)
2. Accelerating end-to-end ML pipelines (LLM, NLP & computer vision)
○ Optimized performance for model training and model serving
Decentralized Object Repository Architecture (DORA)
● No single point of failure: a new architecture that scales out horizontally without any central management
● Automatic fallback to data lake storage, masking failures due to capacity or other reasons
● Performance
○ Revamped single-node storage with 50 million objects per node
○ Workload-specific optimizations for ML training & analytics
Design goals: extremely stable, low maintenance overhead, scalability for ML
Alluxio Platform: Revolutionary New Architecture
[Diagram labels: Alluxio Client; Affinity Location Policy; Consistent Hash (Decentralized)]
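The slide names consistent hashing as the client-side mechanism for locating data without central management. The following is a minimal sketch of that general technique, not Alluxio's implementation: the class name, worker names, virtual-node count, and sample paths are all hypothetical. It shows the property that matters for a cache: when a worker leaves, only the paths that worker owned are reassigned.

```python
import bisect
import hashlib


def _hash(key: str) -> int:
    """Map a string to a stable 64-bit integer (sha256 is used only to spread keys)."""
    return int.from_bytes(hashlib.sha256(key.encode("utf-8")).digest()[:8], "big")


class ConsistentHashRing:
    """Toy consistent-hash ring with virtual nodes (hypothetical, for illustration)."""

    def __init__(self, workers, vnodes=100):
        # Each worker is placed on the ring at `vnodes` pseudo-random positions.
        self._ring = sorted((_hash(f"{w}#{i}"), w) for w in workers for i in range(vnodes))
        self._hashes = [h for h, _ in self._ring]

    def worker_for(self, path: str) -> str:
        """Return the first worker clockwise from the path's position on the ring."""
        idx = bisect.bisect_right(self._hashes, _hash(path)) % len(self._ring)
        return self._ring[idx][1]


workers = ["worker-1", "worker-2", "worker-3"]
ring = ConsistentHashRing(workers)
paths = [f"s3://bucket/train/part-{i:05d}.parquet" for i in range(1000)]
before = {p: ring.worker_for(p) for p in paths}

# Remove one worker and rebuild the ring: only paths it owned should move.
smaller = ConsistentHashRing(["worker-1", "worker-2"])
after = {p: smaller.worker_for(p) for p in paths}
moved = [p for p in paths if before[p] != after[p]]
print(f"{len(moved)} of {len(paths)} paths changed owner")
```

Because every client can compute the same mapping from the worker list alone, no central lookup service sits on the read path, which is consistent with the "no single point of failure" goal stated above.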
Alluxio Enterprise AI
What's New on the Alluxio Platform for AI
Model Training
Scale to 10 billion+ objects to handle the demands of AI
POSIX & REST API for Python
● 2-8x performance improvement over commodity S3
● 1.5-2x over specialized storage systems with the POSIX API
● Up to 95% API cost savings compared to direct access
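A POSIX interface means training code can read data with ordinary file I/O rather than an object-store SDK. The sketch below illustrates that idea under stated assumptions: the mount path is hypothetical, a temporary directory stands in for it so the example runs anywhere, and `iter_samples` is an illustrative helper, not an Alluxio API.

```python
import os
import tempfile
from pathlib import Path


def iter_samples(mount_root):
    """Yield (relative_path, bytes) for every file under a POSIX mount point."""
    root = Path(mount_root)
    for path in sorted(root.rglob("*")):
        if path.is_file():
            yield path.relative_to(root).as_posix(), path.read_bytes()


# Stand-in for a real mount point such as /mnt/alluxio (hypothetical path).
with tempfile.TemporaryDirectory() as mount_root:
    os.makedirs(os.path.join(mount_root, "train"))
    for i in range(3):
        Path(mount_root, "train", f"sample-{i}.bin").write_bytes(bytes([i]) * 4)
    samples = list(iter_samples(mount_root))

print([name for name, _ in samples])
```

The same loop could feed a data loader unchanged whether the directory is local disk or a mounted cache, which is the practical appeal of exposing POSIX semantics.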
Alluxio Enterprise AI
What's New on the Alluxio Platform for AI
Model Serving
Extreme concurrency for model serving, from training to inference clusters
Data preloading based on usage pattern
● 2-3x reduced deployment times in production
APAC Quora CASE STUDY: High Performance AI Platform for LLM
BUSINESS BENEFIT: 2-4x faster time-to-market
TECH BENEFIT: Increase GPU utilization from 50% to 93%
[Diagram labels: Training Clouds; Offline Cloud; Online Cloud; File System; Training Data; Models; Model Training; Model Deployment; Model Inference; Model Update; Downstream Applications]
Model Training: Increase GPU utilization with existing data lake
[Diagram labels: On Prem; Data Lake (Source of Truth); Object Store; Training Data; Checkpoints; Training Cluster]
● Increase utilization up to 90%
● Faster model training with more accurate, fresher models
● Save on API costs
● Runs on standard low-cost storage
Alluxio vs Directly Accessing S3
                                  Alluxio    S3
Total training time (3 epochs)    17 min     85 min
GPU utilization (TensorBoard)     93%        17%
Alluxio is 5 times faster than S3
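The "5 times faster" figure follows directly from the two training times. As a rough cross-check, the sketch below also multiplies wall-clock time by average GPU utilization to estimate GPU-busy minutes in each run; treating utilization as a simple time fraction is an assumption made here for illustration, not something the slide states.

```python
# Figures from the Alluxio vs S3 comparison.
alluxio_minutes, alluxio_gpu_util = 17, 0.93
s3_minutes, s3_gpu_util = 85, 0.17

speedup = s3_minutes / alluxio_minutes

# Rough GPU-busy time: wall-clock time multiplied by average utilization.
alluxio_busy = alluxio_minutes * alluxio_gpu_util
s3_busy = s3_minutes * s3_gpu_util

print(f"speedup: {speedup:.1f}x")
print(f"GPU-busy minutes: Alluxio {alluxio_busy:.2f}, S3 {s3_busy:.2f}")
print(f"GPU-idle minutes: Alluxio {alluxio_minutes - alluxio_busy:.2f}, "
      f"S3 {s3_minutes - s3_busy:.2f}")
```

Under that assumption the GPU-busy time is similar in both runs (roughly 14 to 16 minutes), which is consistent with the extra time in the S3 run being spent waiting on data rather than computing.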
Model Training: Eliminate the cost and complexity of data copies
[Diagram labels: On Prem; Data Lake (Source of Truth); Object Store; Training Data; Checkpoints; Training Cluster]
● Automatically load data from existing data lake
● Faster access to training data
● Increased data engineering productivity
Model Training: Spin up GPUs where available
[Diagram labels: On Prem; Data Lake (Source of Truth); Object Store; Training Data; Checkpoints; Training Cluster; REMOTE TRAINING CLUSTER with its own Training Data, Checkpoints, and Training Cluster]
● Deploy GPUs anywhere based on availability and cost
● Eliminate data copies
● Unified access for all training data
● Reduced network and egress costs
Appendix
[Diagram labels: Training Cluster; Offline Training Platform; Inference Cluster; Online ML Platform; Training Data; Models; numbered steps 1 to 5]
Consumer is the data scientist, who focuses on building models without having to worry about scaling to multiple servers or platform complexity
Data sources are in the same or a different region / cloud as the AI/ML infrastructure
Decentralized Object Repository Architecture
AI Reference Architecture
[Diagram labels: Training Cluster; Offline Training Platform; Inference Cluster; Online ML Platform; Training Data; Models; numbered steps 1 to 5]
What’s New: Alluxio Enterprise AI - under embargo until Wed, Oct 18 at 8:00 am ET
Before using Alluxio
> 80% of total time is spent in DataLoader
Results in a low GPU utilization rate (< 20%)
Workload: ResNet-50, 3 epochs, S3 Fuse

GPU Summary
Name                              Tesla T4
Memory                            14.62GB
Compute Capability                7.5
GPU Utilization                   16.96%
Est. SM Efficiency                16.91%
Est. Achieved Occupancy           68.75%
Kernel Time using Tensor Cores    0.0%

Category             Time Duration (us)    Percentage (%)
Average Step Time    1,763,649,145         100
Kernel               299,168,905           16.96
Memcpy               10,521,722            0.6
Memset               39,459                0
Runtime              3,043,169             0.17
DataLoader           1,446,068,956         81.99
CPU Exec             1,570,076             0.09
Other                3,245,858             0.18
After using Alluxio
Reduce DataLoader share of step time from 82% to 1%
Increase GPU utilization rate from 17% to 93%
Workload: ResNet-50, 3 epochs, S3 Fuse

GPU Summary
Name                              Tesla T4
Memory                            14.62GB
Compute Capability                7.5
GPU Utilization                   93.29%
Est. SM Efficiency                92.98%
Est. Achieved Occupancy           68.03%
Kernel Time using Tensor Cores    0.0%

Category             Time Duration (us)    Percentage (%)
Average Step Time    334,274,946           100
Kernel               311,847,023           93.29
Memcpy               10,500,126            3.14
Memset               43,946                0.01
Runtime              3,899,241             1.17
DataLoader           3,343,301             1
CPU Exec             1,648,391             0.49
Other                2,992,918             0.9
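The percentages in the before and after profiler tables can be recomputed from the raw time durations. The sketch below does that for the Kernel and DataLoader rows; the numbers are copied from the two tables, and the dictionary layout and helper name are illustrative choices.

```python
# Time durations (us) copied from the before/after profiler tables.
before = {"total": 1_763_649_145, "kernel": 299_168_905, "dataloader": 1_446_068_956}
after = {"total": 334_274_946, "kernel": 311_847_023, "dataloader": 3_343_301}


def share(profile, category):
    """Percentage of total step time spent in one category."""
    return 100.0 * profile[category] / profile["total"]


step_time_ratio = before["total"] / after["total"]
dataloader_ratio = before["dataloader"] / after["dataloader"]
kernel_ratio = after["kernel"] / before["kernel"]

for label, profile in (("before", before), ("after", after)):
    print(f"{label}: kernel {share(profile, 'kernel'):.2f}%, "
          f"DataLoader {share(profile, 'dataloader'):.2f}%")
print(f"step time ratio: {step_time_ratio:.2f}x")
print(f"DataLoader time ratio: {dataloader_ratio:.0f}x")
print(f"kernel time ratio: {kernel_ratio:.3f}x")
```

Kernel time differs by only a few percent between the two runs while DataLoader time shrinks by a factor of several hundred, so the drop in total step time (a bit over 5x) comes almost entirely from data loading.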
Model Serving: Faster model deployment times
[Diagram labels: On Premise; Data Lake (Source of Truth); Object Store or HDFS; Training Data; Checkpoints; Training Cluster; two REGIONAL INFERENCE CLUSTERS]
● Deploy models to remote inference sites in minutes
● Reduced network bandwidth
● Offload underlying object store or HDFS
Alluxio System Architecture
[Diagram components: Training Node; AI/Analytics Applications; Alluxio Client; Affinity Block Location Policy; Client Consistent Hash (Task Info); Service Registry; Alluxio Cluster with three Alluxio Workers; Under Storage]
[Diagram actions, numbered 1 to 5: Get Cluster Info; Find Worker(s); Get Task Info; Execute Task; Send Result; on a cache miss, an under storage task]
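The diagram describes a read path in which the client learns cluster membership, picks a worker, and the worker goes to under storage on a cache miss. The sketch below is a simplified in-memory model of that flow plus the fallback to the data lake mentioned elsewhere in the deck. Every class and name here is hypothetical, and worker selection uses plain modulo hashing for brevity rather than a true consistent hash.

```python
import hashlib


class UnderStorage:
    """Stand-in for the data lake (for example an object store); counts reads."""

    def __init__(self, objects):
        self.objects = dict(objects)
        self.reads = 0

    def read(self, path):
        self.reads += 1
        return self.objects[path]


class Worker:
    """Caches objects in memory and loads from under storage on a miss."""

    def __init__(self, name, under_storage):
        self.name = name
        self.ufs = under_storage
        self.cache = {}
        self.healthy = True

    def read(self, path):
        if not self.healthy:
            raise ConnectionError(f"{self.name} is unavailable")
        if path not in self.cache:  # cache miss: fetch from under storage
            self.cache[path] = self.ufs.read(path)
        return self.cache[path]


class Client:
    """Picks a worker from the registry's membership list, with fallback."""

    def __init__(self, registry, under_storage):
        self.registry = registry  # worker name -> Worker (the cluster info)
        self.ufs = under_storage

    def _find_worker(self, path):
        names = sorted(self.registry)
        digest = hashlib.sha256(path.encode("utf-8")).digest()
        return self.registry[names[int.from_bytes(digest[:8], "big") % len(names)]]

    def read(self, path):
        try:
            return self._find_worker(path).read(path)
        except ConnectionError:
            return self.ufs.read(path)  # fall back to the data lake


ufs = UnderStorage({"models/resnet50.pt": b"weights"})
registry = {f"worker-{i}": Worker(f"worker-{i}", ufs) for i in range(3)}
client = Client(registry, ufs)

first = client.read("models/resnet50.pt")   # miss: loaded from under storage
second = client.read("models/resnet50.pt")  # hit: served from the worker cache
reads_after_two = ufs.reads

for w in registry.values():                 # simulate every worker failing
    w.healthy = False
third = client.read("models/resnet50.pt")   # served directly from under storage
```

In this model the second read never touches under storage, and a worker outage degrades to direct data lake access instead of an error, which is the behaviour the fallback bullet describes.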
