What’s new in Alluxio 2: from seamless operations to structured data management
Bin Fan & Calvin Jia | Founding Engineers | Alluxio
Alluxio 2 Directions
§ Seamless Operations
§ Hyper-scale Architecture
§ Advanced Data Management
Seamless Operations
Cloud Native on AWS: AMI, CFT, EMR
[Diagram labels: Presto; Hive; Cluster; Metadata & Data cache; Compute-driven; Continuous sync]
§ Alluxio AMI in the Marketplace
§ Alluxio CloudFormation Template for cluster deployment
§ AWS EMR with Alluxio via bootstrap script
Enables one-click deployment of Alluxio on AWS
Cloud Native on Google Cloud: Dataproc
[Diagram labels: Google Dataproc Cluster; Presto; Hive; Metadata & Data cache; Compute-driven; Continuous sync]
§ Google Dataproc with Alluxio (init action integration available)
Enables one-click deployment of Alluxio on Google Cloud
Native Deployment with Kubernetes
[Diagram labels: Kubernetes Cluster; Host Machine; Alluxio Master; Journal Volume; Alluxio Worker; Application]
Self-Managed Quorum
Available in 2.0.0
[Diagram, external dependencies: Alluxio Master and Alluxio Standby Master, with Distributed Quorum (ZooKeeper) and Distributed Storage (e.g., HDFS)]
[Diagram, self-managed: Alluxio Master and two Alluxio Standby Masters, coordinated with Raft]
No major external dependencies
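As a rough illustration of the quorum idea on this slide, the sketch below computes how many masters form a majority and how many can fail while a majority remains reachable. This is generic majority arithmetic for Raft-style consensus, not Alluxio code, and the function names are hypothetical.

```python
def majority(n_masters: int) -> int:
    """Smallest number of masters that forms a strict majority."""
    if n_masters < 1:
        raise ValueError("need at least one master")
    return n_masters // 2 + 1


def tolerated_failures(n_masters: int) -> int:
    """Masters that can fail while a majority is still reachable."""
    return n_masters - majority(n_masters)


for n in (1, 3, 5):
    # columns: masters, majority size, tolerated failures
    print(n, majority(n), tolerated_failures(n))
```

Under this arithmetic, going from three to four masters does not raise the number of tolerated failures, which is why odd-sized quorums are the usual choice.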
Hyper-scale architecture
Scaling to 1 Billion+ Files
§ Challenge:
• Each file's metadata takes about 1 KB of on-heap storage
• 1 billion files would take about 1 TB of heap space, and GC becomes a big problem
§ Solution:
• Add a new tier with embedded RocksDB to store the inode tree
• Keep an in-memory cache of frequently used inodes
Scale to one billion files and beyond, with performance comparable to the previous on-heap implementation
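The heap estimate on this slide can be checked with a few lines of arithmetic. The 1 KB per file and 1 billion files come from the slide; the 1% cache fraction is a made-up number used only to show how an on-heap cache shrinks the footprint.

```python
BYTES_PER_INODE = 1_000        # ~1 KB of metadata per file (figure from the slide)
NUM_FILES = 1_000_000_000      # 1 billion files


def heap_bytes(num_files: int, bytes_per_inode: int = BYTES_PER_INODE) -> int:
    """Heap needed if every inode is kept in memory."""
    return num_files * bytes_per_inode


total = heap_bytes(NUM_FILES)
print(f"{total / 1e12:.1f} TB")    # 1.0 TB

# Hypothetical: cache only 1% of inodes on heap, keep the rest on disk.
cached_files = NUM_FILES // 100
cached = heap_bytes(cached_files)
print(f"{cached / 1e9:.1f} GB")    # 10.0 GB
```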
Scaling to 1 Billion+ Files
Available in 2.0.0
Alluxio Master
§ Local Disk: RocksDB (Embedded)
● Inode Table
● Edge Table
● Block Table
● Block to Worker Table
● Worker to Block Table
§ On Heap
● Inode Cache
● Mount Table
● Locks

Inode Table
Inode ID    Metadata (Binary)
12392       010101101101
12393       110110110100
…           …

Edge Table
Edge (ID, name)    Inode ID
12392,foo          12393
…                  …
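To make the two tables concrete, here is a minimal sketch of how a path could be resolved through an edge table and how an on-heap cache could sit in front of an inode table. Plain dictionaries stand in for RocksDB. The class, the `ROOT_ID` value, and the LRU eviction policy are assumptions made for illustration, not Alluxio's implementation.

```python
from collections import OrderedDict

ROOT_ID = 0  # hypothetical inode ID for "/"


class InodeStore:
    def __init__(self, cache_size: int = 2):
        self.inode_table = {}        # inode ID -> metadata bytes (stand-in for RocksDB)
        self.edge_table = {}         # (parent ID, child name) -> child inode ID
        self.cache = OrderedDict()   # on-heap LRU cache of inode metadata
        self.cache_size = cache_size
        self.disk_reads = 0          # counts lookups that miss the cache

    def put(self, parent_id: int, name: str, inode_id: int, metadata: bytes) -> None:
        self.edge_table[(parent_id, name)] = inode_id
        self.inode_table[inode_id] = metadata

    def resolve(self, path: str) -> int:
        """Walk the edge table one path component at a time."""
        inode_id = ROOT_ID
        for name in filter(None, path.split("/")):
            inode_id = self.edge_table[(inode_id, name)]  # KeyError if missing
        return inode_id

    def get_inode(self, inode_id: int) -> bytes:
        if inode_id in self.cache:
            self.cache.move_to_end(inode_id)      # mark as recently used
            return self.cache[inode_id]
        self.disk_reads += 1
        metadata = self.inode_table[inode_id]
        self.cache[inode_id] = metadata
        if len(self.cache) > self.cache_size:
            self.cache.popitem(last=False)        # evict least recently used
        return metadata


store = InodeStore()
store.put(ROOT_ID, "data", 12392, b"dir")
store.put(12392, "foo", 12393, b"file")   # the slide's edge: (12392, foo) -> 12393

inode_id = store.resolve("/data/foo")
print(inode_id)                           # 12393
store.get_inode(inode_id)
store.get_inode(inode_id)                 # second read is served from the cache
print(store.disk_reads)                   # 1
```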
Efficient cluster communication with gRPC
Available in 2.0.0
[Diagram, before: Alluxio Client, Alluxio Master, and Alluxio Workers connected by Thrift (Metadata) and Netty (IO)]
[Diagram, now: Alluxio Client, Alluxio Master, and Alluxio Workers connected by gRPC (Metadata + IO)]
Advanced Data Management
Deeper Integration with Presto
§ New Alluxio Catalog Service
• Provides the abstraction of structured data
• Attach a Hive Metastore like mounting a file system
• Understands and serves the schema of files or objects
§ New Alluxio Data Transformation Service
• Transform CSV → Parquet
• Compact many files → fewer files
§ Presto Alluxio Connector, based on the Hive Connector
Now available as Developer Preview
Policy Driven Data Management
Available in 2.0.0
[Diagram labels: Application; Alluxio Master; Alluxio Policy Engine]
§ Example policy: move files older than 90 days from HDFS to S3
§ Apps access the same path regardless of where the actual data is stored
Decouple the logical file system namespace from physical storage systems
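The example policy can be sketched as a small rule evaluated over file entries. Everything here is hypothetical: the `FileEntry` type, the use of last-modified time to decide what "older than 90 days" means, and the `apply_age_policy` function are illustrative assumptions, not the Alluxio Policy Engine API. The point it shows is that the storage location changes while the logical path does not.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class FileEntry:
    logical_path: str         # path applications use; the policy never changes it
    storage: str              # where the bytes currently live: "hdfs" or "s3"
    last_modified: datetime   # assumed basis for "older than 90 days"


def apply_age_policy(files, now, max_age_days=90, source="hdfs", target="s3"):
    """Move files older than max_age_days from source to target storage."""
    cutoff = now - timedelta(days=max_age_days)
    moved = []
    for f in files:
        if f.storage == source and f.last_modified < cutoff:
            f.storage = target            # physical location changes
            moved.append(f.logical_path)  # logical path stays the same
    return moved


now = datetime(2019, 7, 1)
files = [
    FileEntry("/logs/jan.csv", "hdfs", datetime(2019, 1, 15)),
    FileEntry("/logs/jun.csv", "hdfs", datetime(2019, 6, 20)),
]
moved = apply_age_policy(files, now)
print(moved)   # ['/logs/jan.csv']
```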
Replicated Asynchronous Writes
Available in 2.0.0
[Diagram labels: Application; Alluxio Client; two Alluxio Workers (RAM / SSD / HDD); Under Store; Network Speed Write of Data]
Fast and reliable writes to Alluxio, with data persisted in the background
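A minimal sketch of the write pattern this slide describes, assuming a write returns once the data is replicated to workers and a background thread later persists it to the under store. Dictionaries stand in for workers and the under store, and the class and method names are hypothetical, not Alluxio APIs.

```python
import queue
import threading


class AsyncWriteSketch:
    def __init__(self, num_workers: int = 2):
        self.workers = [dict() for _ in range(num_workers)]  # stand-ins for Alluxio workers
        self.under_store = {}                                # stand-in for the under store
        self._pending = queue.Queue()                        # paths waiting to be persisted
        threading.Thread(target=self._persist_loop, daemon=True).start()

    def write(self, path: str, data: bytes) -> None:
        """Replicate to every worker, then return without waiting for persistence."""
        for worker in self.workers:
            worker[path] = data
        self._pending.put(path)

    def _persist_loop(self) -> None:
        """Background thread: copy queued paths from a worker to the under store."""
        while True:
            path = self._pending.get()
            self.under_store[path] = self.workers[0][path]
            self._pending.task_done()

    def wait_for_persist(self) -> None:
        """Block until every queued path has been persisted."""
        self._pending.join()


fs = AsyncWriteSketch()
fs.write("/out/part-0", b"hello")
replicas = [w["/out/part-0"] for w in fs.workers]   # available right after write()
fs.wait_for_persist()
print(fs.under_store["/out/part-0"])
```

Replicating before returning is what keeps the write reliable if one worker is lost before the background persist completes.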
Questions?
