bilibili Coeus ML Platform with Alluxio
Lei Li, AI Platform Lead, bilibili
Zifan Ni, Sr. SWE, bilibili
$whoami
Lei Li
• Worked in
• Working in
• Wide interest in cloud-native, Kubernetes, MLOps, serverless, etc.
• ✈ ♠ 🏃
$whoami
Zifan Ni
• Worked in
• Working in
• Lovely Puppy, Da Vinci.
• 👨🍳 🍹 🎮
CONTENTS
1 bilibili Coeus
2 Alluxio Meets Coeus
3 Best Practice
4 Alluxio Perf
bilibili
• Leading video community
• MAU reached 230 million in the last financial report
• Aims to enrich the everyday life of the young generation in China
bilibili Coeus
• Cloud-native AI platform
• Supporting
  ✓ Ads
  ✓ CV
  ✓ NLP
  ✓ Speech
  ✓ E-commerce
  ✓ etc.
bilibili Coeus
[Architecture diagram. Pipeline stages: model dev, model training, model storage, model serving. Components shown: VPA, Hawkeye, Cloud-Native Observability System of bilibili, Alluxio + Fluid, OSS, HDFS, Volcano.]
Coeus without Alluxio
Scenario 1: Container Crash
• The training container downloads its data from OSS / HDFS.
• After a container crash, the data has to be downloaded again…
Coeus without Alluxio
Scenario 2: Data Is Too Large
The data is too large to fit on a single machine, so users have to:
• (Re)write complex, storage-specific API code to consume data as a stream.
• Handle retry, reconnection logic, etc.
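To make the cost of Scenario 2 concrete, here is a minimal sketch of the kind of retry and reconnect boilerplate each user would otherwise write around a streaming read. It is an illustration, not code from Coeus: `read_with_retry`, `make_flaky_source`, and their parameters are hypothetical names, and the "remote storage" is an in-memory stand-in that simulates disconnects.

```python
import time
from typing import Callable, Iterator, Set


def read_with_retry(
    open_stream: Callable[[int], Iterator[bytes]],
    max_attempts: int = 5,
    base_sleep_s: float = 0.1,
) -> Iterator[bytes]:
    """Yield chunks from a stream, reconnecting at the last good offset on failure."""
    offset = 0    # bytes already delivered to the caller
    failures = 0  # consecutive failed attempts without progress
    while True:
        try:
            for chunk in open_stream(offset):
                offset += len(chunk)
                failures = 0  # progress was made, so reset the counter
                yield chunk
            return  # stream ended normally
        except ConnectionError:
            failures += 1
            if failures >= max_attempts:
                raise
            # Exponential backoff before reconnecting.
            time.sleep(base_sleep_s * 2 ** (failures - 1))


def make_flaky_source(
    data: bytes, chunk_size: int, fail_on_calls: Set[int]
) -> Callable[[int], Iterator[bytes]]:
    """Hypothetical stand-in for a remote store: selected calls drop after one chunk."""
    calls = {"n": 0}

    def open_stream(offset: int) -> Iterator[bytes]:
        calls["n"] += 1
        call_no = calls["n"]

        def chunks() -> Iterator[bytes]:
            pos = offset
            while pos < len(data):
                if call_no in fail_on_calls and pos > offset:
                    raise ConnectionError("simulated disconnect")
                yield data[pos:pos + chunk_size]
                pos += chunk_size

        return chunks()

    return open_stream
```

Every training job that streams directly from OSS or HDFS needs some variant of this logic, once per storage backend, which is the burden the next slides aim to remove.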
Alluxio meets Coeus
Cache Wanted!
• Distributed
  • Holds huge data in distributed workers
• Hides difficult details of reading data!
  • Fuse simplifies data reading (os.open())
Alluxio + Fluid
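As a rough illustration of what "Fuse simplifies data reading" means for training code: once the dataset is exposed as a mounted directory, reading it needs nothing beyond the ordinary file API. The sketch below assumes a hypothetical mount path and a `.wav` suffix; neither comes from the slides, and `iter_samples` is an illustrative helper name.

```python
from pathlib import Path
from typing import Iterator, Tuple


def iter_samples(dataset_root: str, suffix: str = ".wav") -> Iterator[Tuple[str, bytes]]:
    """Walk a mounted dataset and yield (relative_path, file_bytes) in sorted order."""
    root = Path(dataset_root)
    for path in sorted(root.rglob(f"*{suffix}")):
        if path.is_file():
            # Plain POSIX read: no storage SDK, no retry or reconnect code here.
            with open(path, "rb") as f:
                yield path.relative_to(root).as_posix(), f.read()


# Usage, assuming the cache is mounted at a hypothetical path:
#   for name, blob in iter_samples("/data/my-dataset"):
#       ...decode and feed to the data loader...
```

The same function works unchanged against a local directory, which makes it easy to test training code without a cluster.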
Serverless Deployment
1. Why Serverless
[Diagram, built up over three slides, labelled "Serverless Fuse": Pod0 and Pod1, GPU0 and GPU1, Alluxio Master0, Alluxio Worker0, Alluxio Fuse. In the last step one pod is shown as "Pending…".]
2. Serverless Solution
[Diagram, labelled "Serverless Fuse": Pod0, Pod1, Alluxio Master0, Alluxio Worker0, Alluxio Fuse, GPU0, GPU1.]
3. Serverless Details
[Diagram: Fluid. Data Panel: Worker, Worker. Control Panel: Master, Runtime, Runtime Controller, Sidecar Control Panel. Pod: Application, Fuse Sidecar. Arrows labelled "watch" and "mount".]
Alluxio Tuning
Problem 1: Master stop-the-world GC, 30s / 30m
Solution:
• Java: MaxGCPauseMillis 👇, ParallelGCThreads 👆
• Alluxio Master: memory request 👆
Problem 2: Fuse frequently throws I/O exceptions
Solution:
• alluxio.user.rpc.retry.max.duration 👆
• alluxio.user.rpc.retry.base.sleep 👆
Problem 3: Poor performance when reading large numbers of small files
Solution:
• Upgrade to Alluxio >= 2.6.2
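The tuning knobs above can be written down as plain settings. The sketch below only renders them as JVM flags and `key=value` property lines; the slides give directions (raise or lower), not numbers, so every value here is an illustrative placeholder, and where the lines are applied (master JVM options, client or Fuse properties) depends on the deployment.

```python
from typing import Dict, List

# Hypothetical values: the talk states only the direction of each change.
MASTER_JVM_TUNING: Dict[str, int] = {
    "MaxGCPauseMillis": 100,   # lowered: request shorter GC pauses
    "ParallelGCThreads": 16,   # raised: more threads for parallel GC phases
}

CLIENT_RETRY_TUNING: Dict[str, str] = {
    "alluxio.user.rpc.retry.max.duration": "5min",  # raised (placeholder value)
    "alluxio.user.rpc.retry.base.sleep": "100ms",   # raised (placeholder value)
}


def jvm_flags(options: Dict[str, int]) -> List[str]:
    """Render HotSpot -XX:Name=value flags."""
    return [f"-XX:{name}={value}" for name, value in options.items()]


def site_properties(props: Dict[str, str]) -> List[str]:
    """Render key=value lines in properties-file style."""
    return [f"{key}={value}" for key, value in props.items()]


print(" ".join(jvm_flags(MASTER_JVM_TUNING)))
print("\n".join(site_properties(CLIENT_RETRY_TUNING)))
```

Keeping the settings in one place like this makes it easier to review which direction each knob was moved and why.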
Alluxio Performance
Case 1: Audio Language Recognition Model
Framework         PyTorch
Network           TDNN neural network model
Total Epochs      20
Number of Files   ~2.55 million
File Size         ~300 KiB/file; ~800 GiB total
GPU               4 × NVIDIA V100 (16 GB)
Alluxio           2 × 500 GiB workers
Speed Comparison
                  OSS S3Fuse   Local SSD   OSS Alluxio Cache
Completion Time   242.56h      63.48h      64.17h
Speed-up Ratio    1            3.82        3.78
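The speed-up ratios follow directly from the completion times, taking OSS S3Fuse as the baseline. A short check of the arithmetic (the variable names are illustrative):

```python
# Completion times in hours, as reported in the comparison table.
completion_hours = {
    "OSS S3Fuse": 242.56,
    "Local SSD": 63.48,
    "OSS Alluxio Cache": 64.17,
}

baseline = completion_hours["OSS S3Fuse"]

# Speed-up ratio = baseline time / time for each setup.
speed_up = {name: round(baseline / hours, 2) for name, hours in completion_hours.items()}

# Extra time of the Alluxio-cached run relative to local SSD, in percent.
alluxio_vs_ssd_pct = round(
    (completion_hours["OSS Alluxio Cache"] / completion_hours["Local SSD"] - 1) * 100, 1
)

print(speed_up)            # {'OSS S3Fuse': 1.0, 'Local SSD': 3.82, 'OSS Alluxio Cache': 3.78}
print(alluxio_vs_ssd_pct)  # 1.1
```

In other words, the Alluxio-cached run took about 1.1% longer than reading from local SSD in this case.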
Alluxio Performance
Case 2: Video Portrait Matting
Framework         PyTorch
Network           TDNN neural network model
Total Epochs      50
Number of Files   ~20 million
File Size         ~100K/file; ~2 Ti in total
GPU               4 × NVIDIA V100 (32 GB)
Alluxio           4 × 600 GiB workers

Without Alluxio
• Not enough disks to support large-scale training on local storage
• S3Fuse: poor performance and availability with large metadata

With Alluxio
• Each epoch completed stably in ~18 hours
• Quality of the trained model significantly improved
• Intersection over union (IoU) of portrait matting improved by about 2%
Q&A
Lei Li, lilei06@bilibili.com
Zifan Ni, nizifan@bilibili.com
Thanks to
Lu Qiu @ Alluxio
Baoyou Chen, Xingui Zeng @ bilibili
Yang Che @ Alibaba

Building an Efficient AI Training Platform at bilibili with Alluxio