Alluxio Monthly Webinar | Cloud-Native Model Training on Distributed Data

Cloud-Native Model Training
on Distributed Data
Shawn Sun, Cloud Native Tech Lead @ Alluxio - shawn.sun@alluxio.com
ChanChan Mao, Developer Advocate @ Alluxio - chanchan.mao@alluxio.com

Speakers
Shawn Sun, Cloud Native Tech Lead @ Alluxio
ChanChan Mao, Developer Advocate @ Alluxio

The Evolution of the Modern Data Stack
10 Years Ago
● Tightly-coupled Hadoop & HDFS
● On-prem HDFS
● Single region & single cloud
Today
● Compute-storage separation
● Cloud data lake
● Multi-region / hybrid / multi-cloud
More Elastic, Cheaper, More Scalable

I/O Challenges
Today
● Compute-storage separation
● Cloud data lake
● Multi-region / hybrid / multi-cloud
Data is Remote from Compute; Locality is Missing

I/O Challenges
Performance
● Analytics SQL: High query latency because of retrieving remote data
● Model Training: Training is slow because of loading remote data in each epoch (LISTing lots of small files is particularly slow)
Cost
● GET/PUT operation costs add up quickly
● Cross-region data transfer (egress) fees
● GPU cycles are wasted waiting for data
Reliability
● Job failures
● Amazon S3 errors: 503 Slow Down, 503 Service Unavailable

Add a Data Caching Layer between compute & storage
10% of your data is hot data
Source: Alluxio

Reduce Latency
[Figure: job timelines of alternating I/O and compute phases. Without cache: every compute phase waits on remote I/O. With cache: remote data is retrieved only the first time.]
Total job run time is reduced
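The timeline comparison above can be turned into a back-of-the-envelope calculation. The sketch below is illustrative only: the function name `job_run_time` and all durations are hypothetical, not measurements from the talk. It assumes a job is a sequence of (I/O, compute) phases and that, with a cache, only the first I/O phase pays the remote-storage cost.

```python
def job_run_time(steps, compute_s, remote_io_s, cached_io_s=None):
    """Total run time of `steps` sequential (I/O, compute) phases.

    If `cached_io_s` is given, only the first I/O phase pays the remote
    cost; every later I/O phase is served from the cache.
    """
    if steps <= 0:
        return 0.0
    if cached_io_s is None:
        io_total = steps * remote_io_s
    else:
        io_total = remote_io_s + (steps - 1) * cached_io_s
    return io_total + steps * compute_s


# Hypothetical numbers: 3 steps, 10 s compute, 8 s remote I/O, 1 s cached I/O.
without_cache = job_run_time(3, compute_s=10, remote_io_s=8)
with_cache = job_run_time(3, compute_s=10, remote_io_s=8, cached_io_s=1)
print(without_cache, with_cache)
```

The gap between the two totals grows with the number of steps, since every step after the first avoids the remote read.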

Increase GPU Utilization
[Figure: training timelines of alternating I/O (data loading) and training phases. Without cache: the GPU is idle during every I/O phase. With cache: the GPU is idle only while remote data is loaded for the first time and is busy most of the time.]
GPU utilization is greatly increased

Reduce Cloud Storage Cost
[Figure: compute reading from AWS S3 in us-east-1 and us-west-1, without a cache and with a data cache in between.]
● Without cache: Frequently retrieving data = high GET/PUT operation costs & data transfer costs
● With cache: Fast access with hot data cached; only retrieve data when necessary = lower S3 costs
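A rough cost model makes the slide's point concrete. This is a minimal sketch under stated assumptions: the function `storage_bill` and every price in it are hypothetical placeholders, not actual AWS list prices, and it assumes only cache misses reach cloud storage.

```python
def storage_bill(reads, object_gb, hit_ratio, price_per_1k_get, egress_per_gb):
    """Estimate request + transfer cost when only cache misses hit cloud storage."""
    misses = reads * (1.0 - hit_ratio)
    request_cost = misses / 1000 * price_per_1k_get
    transfer_cost = misses * object_gb * egress_per_gb
    return request_cost + transfer_cost


# Hypothetical workload: 1M reads of 1 MB objects; placeholder prices.
params = dict(reads=1_000_000, object_gb=0.001,
              price_per_1k_get=0.0004, egress_per_gb=0.02)
no_cache = storage_bill(hit_ratio=0.0, **params)
with_cache = storage_bill(hit_ratio=0.9, **params)
print(f"no cache: {no_cache:.2f}, 90% hit ratio: {with_cache:.2f}")
```

In this model the bill scales with the miss ratio, so the savings depend directly on how much of the working set is hot and stays cached.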

Improve Reliability
● Prevent network congestion
● Relieve overloaded storage
● Prevent job failures like “503 Service Unavailable” …

Observations So Far …
● The evolution of the modern data stack poses challenges for data locality
● You should care about I/O in the data lake because it greatly impacts the performance, cost & reliability of your data platform
● Having a data caching layer between compute and storage can solve the I/O challenges
● You can use a cache for both analytics and AI workloads
[Figure: a DATA CACHING LAYER sits between COMPUTE and STORAGE]

Accessing Data and Models in the Cloud

Hybrid/Multi-Cloud ML Platforms
Separation of compute and storage
[Figure: an offline training platform (training cluster) and an online ML platform (serving cluster) located in separate data centers or clouds (DC/Cloud A, DC/Cloud B); three numbered steps (1-3) involve training data and models.]

Existing Solutions
1. Read data directly from cloud storage
2. Copy data from cloud to local before training
3. Local cache layer for data reuse
4. Distributed cache system

Option 1: Read From Cloud Storage
● Easy to set up
● Performance is not ideal
■ Model access: Models are repeatedly pulled from cloud storage
■ Data access: Reading data can take more time than actual training
[Figure: 82% of the time is spent by DataLoader]

Option 2: Copy Data To Local Before Training
● Data is now local
■ Faster access + lower cost
● Management is hard
■ Must manually delete training data after use
● Local storage space is limited
■ When the dataset is huge, the benefits are limited

Option 3: Local Cache for Data Reuse
Examples: S3FS built-in local cache, Alluxio Fuse SDK
● Reused data is local
■ Faster access + lower cost
● The cache layer provider helps with data management
■ No manual deletion/supervision
● Cache space is limited
■ When the dataset is huge, the benefits are limited
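The trade-off of a local cache (automatic management, but limited space) can be sketched with a small read-through cache. This is an illustration, not how S3FS or the Alluxio Fuse SDK is implemented: the class `LocalReadCache`, its LRU eviction policy, and the `fetch` callable standing in for a remote read are all assumptions made for the example.

```python
from collections import OrderedDict


class LocalReadCache:
    """Read-through cache with a byte budget and LRU eviction (assumed policy)."""

    def __init__(self, capacity_bytes, fetch):
        self.capacity_bytes = capacity_bytes
        self.fetch = fetch          # callable: key -> bytes (stand-in for a remote read)
        self._items = OrderedDict() # least recently used entry first
        self._used = 0
        self.misses = 0

    def read(self, key):
        if key in self._items:
            self._items.move_to_end(key)  # mark as most recently used
            return self._items[key]
        self.misses += 1
        data = self.fetch(key)
        if len(data) <= self.capacity_bytes:
            self._items[key] = data
            self._used += len(data)
            # Evict old entries automatically; no manual cleanup by the user.
            while self._used > self.capacity_bytes:
                _, evicted = self._items.popitem(last=False)
                self._used -= len(evicted)
        return data

    def cached_keys(self):
        return list(self._items)


# Hypothetical "remote" dataset of 12 bytes with only 8 bytes of cache space.
remote = {"a": b"x" * 4, "b": b"y" * 4, "c": b"z" * 4}
cache = LocalReadCache(capacity_bytes=8, fetch=remote.__getitem__)
for key in ["a", "b", "a", "c", "a", "b"]:
    cache.read(key)
print(cache.misses, cache.cached_keys())
```

Because the dataset is larger than the cache, some reads keep going back to remote storage, which is the "limited benefits" case on the slide.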

Option 4: Distributed Cache System
[Figure: clients reading from multiple cache workers]
● Training data and trained models can be kept in a distributed cache.
● Typically comes with data management functionality.

Challenges
1. Performance
● Pulling data from cloud storage slows down training and serving.
2. Cost
● Repeatedly requesting data from cloud storage is costly.
3. Reliability
● Availability is key for every service in the cloud.
4. Usability
● Manual data management is undesirable.

Alluxio as an example

Consistent Hashing for caching
[Figure: clients, masters, and multiple workers]
● Use consistent hashing to cache both data and metadata on workers.
● Worker nodes have plenty of space for cache. Training data and models only need to be pulled once from cloud storage. Cost --
● No more single point of failure. Reliability ++
● No more performance bottleneck on masters. Performance ++
● Data management system.
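To illustrate the idea of consistent hashing named on this slide, here is a minimal hash ring with virtual nodes. It is a generic textbook sketch, not Alluxio's actual implementation: the class `HashRing`, the worker names, the SHA-256 hash, and the `vnodes` count are all assumptions chosen for the example.

```python
import bisect
import hashlib


class HashRing:
    """Generic consistent-hash ring mapping file paths to cache workers."""

    def __init__(self, workers, vnodes=100):
        # Each worker owns `vnodes` points on the ring to even out the load.
        self._ring = sorted(
            (self._hash(f"{worker}#{i}"), worker)
            for worker in workers
            for i in range(vnodes)
        )
        self._points = [point for point, _ in self._ring]

    @staticmethod
    def _hash(text):
        return int.from_bytes(hashlib.sha256(text.encode()).digest()[:8], "big")

    def worker_for(self, path):
        # A path belongs to the first ring point clockwise from its hash.
        idx = bisect.bisect_right(self._points, self._hash(path)) % len(self._ring)
        return self._ring[idx][1]


# Hypothetical paths and worker names.
paths = [f"/dataset/img_{i}.jpg" for i in range(1000)]
before = HashRing(["worker-1", "worker-2", "worker-3"])
after = HashRing(["worker-1", "worker-2", "worker-3", "worker-4"])
moved = [p for p in paths if before.worker_for(p) != after.worker_for(p)]
print(f"{len(moved)} of {len(paths)} paths moved after adding worker-4")
```

Every client can compute the owner of a path locally, and adding a worker only moves the paths that the new worker takes over, so most cached data stays where it is.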

By the numbers
● High Scalability
■ One worker supports 30 - 50 million files
■ Scales linearly - easy to support 10 billion files
● High Availability
■ 99.99% uptime
■ No single point of failure
● High Performance
■ Faster data loading
● Cloud-native K8s Operator and CSI-FUSE for data access management

Alluxio Operator
Manages the life cycle of Alluxio clusters and datasets
[Figure: on Kubernetes running on cloud VMs, the Alluxio Operator manages an Alluxio cluster that serves the training frameworks, with cloud storage underneath.]

Alluxio Cluster CRD
Alluxio Operator follows the Kubernetes Operator pattern
1. User creates AlluxioCluster and Dataset CRs
2. K8s API server informs the Alluxio Operator of the CRs
3. Alluxio Operator manages k8s resources (Alluxio Cluster, Dataset)
4. Reconcile
● Zero-downtime upgrade
● High availability
● Auto-scaling
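The reconcile step of the Operator pattern can be sketched in plain Python. This is a conceptual illustration only: it does not use the Kubernetes API or any Alluxio Operator code, and the functions `reconcile` and `apply_actions`, the pod names, and the replica counts are hypothetical.

```python
def reconcile(desired_replicas, running_pods):
    """One reconcile pass: compare desired and observed state, return actions."""
    actions = []
    missing = desired_replicas - len(running_pods)
    if missing > 0:
        existing = set(running_pods)
        index = 0
        while missing > 0:
            name = f"worker-{index}"
            if name not in existing:
                actions.append(("create", name))
                missing -= 1
            index += 1
    elif missing < 0:
        # Scale down by removing the highest-named pods.
        for name in sorted(running_pods)[missing:]:
            actions.append(("delete", name))
    return actions


def apply_actions(actions, running_pods):
    """Apply actions to the observed state (stand-in for real k8s calls)."""
    pods = set(running_pods)
    for verb, name in actions:
        if verb == "create":
            pods.add(name)
        else:
            pods.discard(name)
    return sorted(pods)


pods = ["worker-0"]
pods = apply_actions(reconcile(3, pods), pods)   # scale up to 3 workers
print(pods, reconcile(3, pods))                  # a second pass has nothing to do
```

The key property is that reconciling is idempotent: once the observed state matches the desired state, further passes produce no actions.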

Alluxio FUSE
● Exposes the Alluxio file system as a local file system.
● Cloud storage can be accessed just like local storage.
○ cat, ls
○ f = open("a.txt", "r")
● Very low impact for end users
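Since a FUSE mount looks like a local directory, application code needs nothing beyond ordinary file APIs. The sketch below shows that shape; it uses a temporary directory as a stand-in for the mount point, because the real mount path depends on the deployment, and the helper `load_samples` is a hypothetical name.

```python
import os
import tempfile


def load_samples(mount_dir):
    """Read every file under `mount_dir` using only standard file APIs."""
    samples = {}
    for name in sorted(os.listdir(mount_dir)):
        with open(os.path.join(mount_dir, name), "r") as f:
            samples[name] = f.read()
    return samples


# A temporary directory stands in for the FUSE mount point (assumption).
with tempfile.TemporaryDirectory() as mount_dir:
    with open(os.path.join(mount_dir, "a.txt"), "w") as f:
        f.write("hello")
    samples = load_samples(mount_dir)

print(samples)
```

The same function would work unchanged against a real mount, which is why the impact on end users is low.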

Alluxio CSI on K8s x Alluxio FUSE for Data Access
● FUSE: Turn a remote dataset in the cloud into a local folder for training
● CSI: Launch the Alluxio FUSE pod only when the dataset is needed
[Figure: on the host machine, the Alluxio FUSE pod (FUSE container) and the application pod (application container) each mount the same persistent volume + claim.]

Data Access Management for PyTorch

Integration with PyTorch Training (Alluxio)
[Figure: a training node (PyTorch with the Alluxio client / cache client) talks to a cache cluster (service registry and cache workers) backed by under storage. The slide numbers five steps (1-5); the labeled interactions are Get Cluster Info, Find Worker(s) (affinity block location policy, client-side load balance), Get Task Info, Execute Task, and Send Result. Cache miss: under storage task.]
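The read path in the diagram (look up the cluster, find a worker, fall back to under storage on a cache miss) can be sketched as follows. This is a simplified model, not the Alluxio client: the classes `CacheWorker` and `CacheClient`, the registry callable, the bucket paths, and the plain modulo placement (used here instead of the affinity policy on the slide) are all assumptions.

```python
import hashlib


class CacheWorker:
    """Serves reads from memory and loads from under storage on a miss."""

    def __init__(self, name, under_storage):
        self.name = name
        self.under_storage = under_storage  # dict standing in for cloud storage
        self.cache = {}
        self.under_storage_reads = 0

    def read(self, path):
        if path not in self.cache:          # cache miss: under storage task
            self.under_storage_reads += 1
            self.cache[path] = self.under_storage[path]
        return self.cache[path]


class CacheClient:
    """Picks a worker per path on the client side, then reads through it."""

    def __init__(self, registry):
        self.registry = registry            # callable returning the live workers

    def find_worker(self, path):
        workers = self.registry()           # get cluster info
        digest = hashlib.sha256(path.encode()).digest()
        return workers[int.from_bytes(digest[:8], "big") % len(workers)]

    def read(self, path):
        return self.find_worker(path).read(path)


# Hypothetical dataset and a three-worker cluster.
storage = {f"s3://bucket/img_{i}.jpg": bytes([i]) for i in range(6)}
workers = [CacheWorker(f"worker-{i}", storage) for i in range(3)]
client = CacheClient(lambda: workers)

for epoch in range(2):                      # two training epochs over the same data
    batch = [client.read(path) for path in storage]

total_remote_reads = sum(w.under_storage_reads for w in workers)
print(total_remote_reads)
```

Each file is fetched from under storage once in the first epoch; the second epoch is served entirely from the workers' caches.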

Data Loading Performance
[Figure: data loading benchmark results for two datasets, ImageNet (subset) and Yelp review]

GPU Utilization Improvement
Training Directly from Storage (S3-FUSE)
- > 80% of total time is spent in DataLoader
- Results in a low GPU utilization rate (<20%)

GPU Utilization Improvement
Training with Alluxio-FUSE
- Reduced the DataLoader share of total time from 82% to 1% (82X)
- Increased the GPU utilization rate from 17% to 93% (5X)
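The improvement factors quoted above follow directly from the before and after percentages. A quick check, using only the numbers on the slide (the variable names are ours):

```python
# Percentages reported on the slide.
dataloader_before, dataloader_after = 82, 1
gpu_before, gpu_after = 17, 93

dataloader_reduction = dataloader_before / dataloader_after
gpu_gain = gpu_after / gpu_before

print(f"DataLoader share reduced {dataloader_reduction:.0f}X")
print(f"GPU utilization increased {gpu_gain:.2f}X")
```

The GPU utilization ratio works out to roughly 5.5, which the slide rounds down to 5X.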

How to enable Python Applications

Use Alluxio - Ray Integration as an example
[Figure: in Ray, the Ray Dataloader (PyArrow Dataset loading) goes through the fsspec Alluxio implementation and the Alluxio Python client. Alluxio workers, each with a REST API server, register with etcd, and the client gets worker addresses from etcd.]

Alluxio+Ray Benchmark – Small Files
● Dataset
○ 130GB ImageNet dataset
● Process Settings
○ 4 train workers
○ 9 reading processes
● Active Object Store Memory
○ 400-500 MiB

Alluxio+Ray Benchmark – Large Parquet files
● Dataset
○ 200MiB files, adding up to 60GiB
● Process Settings
○ 28 train workers
○ 28 reading processes
● Active Object Store Memory
○ 20-30 GiB

Cost Saving – Egress/Data Transfer Fees

Cost Saving – API Calls/S3 Operations (List, Get)
List/Get API calls only access Alluxio

Any Questions?
Scan the QR code for a Linktree including great learning resources, exciting meetups & a community of data & AI infra experts!

Thank you!
Up Next:
AI/ML Infra Meetup Thur May 9 @ Uber Sunnyvale
https://lu.ma/AIMLinfra
Speak at an Alluxio event:
https://forms.gle/iJX9GTMaAVQdzKc28
