Meet in the Middle for a 1,000X Performance Boost
Querying Parquet Files on Petabyte-Scale Data Lakes
David Zhu, Engineering Manager
david@alluxio.com
Twitter/X: @davidyuzhu
Data Explosion
[Timeline figure. Years shown: 2014, 2019, 2023, 2024. Eras labeled: Big Data Analytics, Cloud Adoption, Generative AI.]
AI Data Life Cycle
Stages: Data Collection, Data Preprocessing, Model Training, Model Verification, Model Loading, Inference, Data Archiving
Access patterns labeled on the diagram:
● High concurrency, high throughput model distribution
● High concurrency, high throughput data read/write (compute engines like Spark)
● High concurrency, high throughput data reading; checkpoint write, async upload
● High concurrency and fast feature lookup
Data is everywhere in every stage, and it needs to be accessed fast.
[Diagram callout: Focus of This Talk]
Alluxio Confidential
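Distributed Caching for High Throughput
[Architecture diagram: Big Data Query, Big Data ETL, and Model Training workloads read from Alluxio Worker 1, Alluxio Worker 2, ..., Alluxio Worker n. Items labeled A, B, and C from s3://bucket/file1 and s3://bucket/file2 are distributed across the workers.]
Select worker based on consistent hashing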
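As a rough illustration of worker selection by consistent hashing, here is a minimal hash-ring sketch. The class name, the worker names, the use of MD5, and the virtual-node count are all assumptions made for this example; the slides do not describe Alluxio's actual implementation.

```python
import bisect
import hashlib


class ConsistentHashRing:
    """Maps keys (for example file paths) to workers on a hash ring.

    Hypothetical sketch: names and parameters are illustrative only.
    """

    def __init__(self, workers, vnodes=100):
        # Each worker is placed on the ring at `vnodes` pseudo-random points.
        self._ring = sorted(
            (self._hash(f"{worker}#{i}"), worker)
            for worker in workers
            for i in range(vnodes)
        )
        self._points = [point for point, _ in self._ring]

    @staticmethod
    def _hash(key):
        return int.from_bytes(hashlib.md5(key.encode()).digest()[:8], "big")

    def worker_for(self, key):
        # The first ring point clockwise from the key's hash owns the key.
        idx = bisect.bisect_right(self._points, self._hash(key)) % len(self._ring)
        return self._ring[idx][1]


ring = ConsistentHashRing(["worker-1", "worker-2", "worker-3"])
paths = [f"s3://bucket/file{i}" for i in range(1000)]
before = {p: ring.worker_for(p) for p in paths}

# Adding a worker only reassigns the keys that now belong to the new worker.
bigger = ConsistentHashRing(["worker-1", "worker-2", "worker-3", "worker-4"])
moved = [p for p in paths if bigger.worker_for(p) != before[p]]
print(f"{len(moved)} of {len(paths)} paths moved after adding a worker")
```

The property that matters for a cache is visible in the last lines: growing the cluster moves only a fraction of the keys, so most cached data stays where it already is.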
New Architecture in EE 3.x
[Architecture diagram. Components: AI/Analytics Applications, Alluxio Client (Affinity Block Location Policy, Client Consistent Hash over Task Info), Service Registry, Alluxio Workers, Training Node, Alluxio Cluster, Under Storage. Step labels (numbered 1 to 6 on the diagram): Get Task Info, Get Cluster Info, Find Worker(s), Execute Task, Cache miss: under storage task, Send Result.]
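The read path suggested by the diagram labels (get cluster info, find a worker, execute the task, fall back to under storage on a cache miss) can be sketched as follows. Every class and function name here is hypothetical, and rendezvous hashing is used as a compact stand-in for the client-side consistent hash; this is not Alluxio's API.

```python
import hashlib


class UnderStorage:
    """Stand-in for S3: a dict of path -> bytes that counts reads."""

    def __init__(self, objects):
        self.objects = dict(objects)
        self.reads = 0

    def read(self, path):
        self.reads += 1
        return self.objects[path]


class Worker:
    """Hypothetical cache worker with a read-through cache."""

    def __init__(self, name, under_storage):
        self.name = name
        self.under_storage = under_storage
        self.cache = {}

    def execute(self, path):
        if path not in self.cache:
            # Cache miss: fetch from under storage and keep a copy.
            self.cache[path] = self.under_storage.read(path)
        return self.cache[path]


def _score(worker_name, path):
    digest = hashlib.sha256(f"{worker_name}|{path}".encode()).digest()
    return int.from_bytes(digest[:8], "big")


class Client:
    """Hypothetical client; `get_cluster_info` plays the service registry."""

    def __init__(self, get_cluster_info):
        self.get_cluster_info = get_cluster_info

    def find_worker(self, path):
        # Rendezvous hashing: the worker with the highest score owns the path.
        return max(self.get_cluster_info(), key=lambda w: _score(w.name, path))

    def read(self, path):
        return self.find_worker(path).execute(path)


s3 = UnderStorage({"s3://bucket/file1": b"hello", "s3://bucket/file2": b"world"})
workers = [Worker(f"worker-{i}", s3) for i in (1, 2, 3)]
client = Client(lambda: workers)

first = client.read("s3://bucket/file1")   # miss: goes to under storage
second = client.read("s3://bucket/file1")  # hit: served by the same worker
print(first, second, "under storage reads:", s3.reads)
```

Because the client computes the owner itself, no central metadata service sits on the read path, which matches the "no single point of failure" goal stated later in the deck.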
Holy Grail of Storage Systems
Source: https://jack-vanlightly.com/blog/2023/11/29/s3-express-one-zone-not-quite-what-i-hoped-for
● Cheap:
○ S3 Express One Zone costs 5x as much as S3 Standard, even with the recent price reduction
○ An Alluxio cache spends money only on the hot data and leaves the rest at S3 Standard cost
● Low Latency:
○ Achieve sub-millisecond or single-digit-millisecond latency for fast responses
● Scaling Linearly in Capacity:
○ Seamlessly scale to support tens of billions of objects and files
● High Availability:
○ No centralized metadata service, no single point of failure
○ Caching across multiple AZs and multiple regions, always backed by S3
Low Latency File Access
● To achieve low-latency data access, we first need low-latency file access
● Asynchronous Event Loop:
Each Alluxio worker is built on a high-performance, asynchronous I/O framework. This enables non-blocking I/O with minimal context switching and thread contention, two major contributors to latency in traditional blocking I/O systems. Its event-driven model allows one worker instance to scale to thousands of concurrent connections while maintaining sub-millisecond responsiveness.
● Off-Heap Page Storage on NVMe:
Alluxio leverages NVMe SSDs to store cached pages off-heap. This design allows for significantly higher storage density without overwhelming memory resources, offering a favorable balance between cost and access latency.
● Zero-Copy I/O:
To avoid unnecessary memory copies and reduce CPU load, Alluxio employs zero-copy I/O techniques using sendfile() and mmap(). These allow cached pages to be read directly from NVMe and transmitted over the network stack without being copied through user space, improving both throughput and latency.
Result: 1 ms file access for small (~1 KB) positioned reads from cache
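To make the positioned-read and zero-copy ideas concrete, here is a small sketch using Python's standard library on a Unix-like system. `os.pread` reads a byte range without moving a shared file cursor, and `socket.sendfile` uses the `sendfile()` system call where the platform supports it. The page size, file contents, and helper names are assumptions for the demo; this says nothing about how Alluxio workers are actually written.

```python
import os
import socket
import tempfile

PAGE_SIZE = 4096  # assumed page size, for illustration only


def positioned_read(path, offset, length):
    """Read `length` bytes at `offset` without seeking a shared cursor."""
    fd = os.open(path, os.O_RDONLY)
    try:
        return os.pread(fd, length, offset)
    finally:
        os.close(fd)


def send_range(sock, path, offset, length):
    """Send a byte range of a file to a socket.

    socket.sendfile() hands the copy to the kernel via os.sendfile() when
    available, so the bytes do not pass through a user-space buffer.
    """
    with open(path, "rb") as f:
        return sock.sendfile(f, offset=offset, count=length)


# Demo: an 8 KiB stand-in for a cached file, then a 1 KB read from page 1.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(bytes(range(256)) * 32)
    cache_file = tmp.name

local = positioned_read(cache_file, PAGE_SIZE, 1024)

sender, receiver = socket.socketpair()
sent = send_range(sender, cache_file, PAGE_SIZE, 1024)
sender.close()

received = b""
while len(received) < sent:
    chunk = receiver.recv(4096)
    if not chunk:
        break
    received += chunk
receiver.close()
os.unlink(cache_file)

print(f"read {len(local)} bytes locally, sent {sent} bytes over the socket")
```

The local read and the socket transfer return the same bytes; the difference is only in who performs the copy.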
Low Latency Parquet API
● Goal:
○ Achieve sub-millisecond latency for single-field, single-row point lookups on files stored in S3 and cached in Alluxio
○ Driven by AI inference workloads, search applications, etc.
● Builds on previous work:
○ Achieved sub-millisecond 1 KB reads from a file cached in Alluxio
○ Using ParquetReader to query a field gives 46 ms latency, between S3 Express (<10 ms) and S3 (300-400 ms)
● Assumptions:
○ Point query: select col1, col2 where id = x;
○ id is a primary key, and the returned fields are not large enough to make latency network-bound (<20 KB)
○ Column id is sorted; min/max statistics on row groups are available; the column index and offset index on pages are available
○ ParquetReader is generally too heavy for this
Key Ideas
● Cache Parquet metadata in Alluxio (reduce pointer chasing and lookups)
○ Cache the Parquet footer (file path -> footer)
○ Cache the column index and offset index (keyed by file path and column)
● Offload processing to the client (reduce CPU workload on the caching node)
○ Use a small page size and send back entire pages with an offset rather than decoding on the storage node (trading some extra network transfer for throughput)
○ Return protobuf raw bytes
● Push predicates and projections down to the leaf cache node (reduce network traffic to a minimum)
○ Usually these are pushed down to the compute node workers (Spark worker, Trino worker), but never to the storage
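The two cache keys named above (file path for the footer, file path plus column for the page indexes) can be sketched with a pair of memoized lookups. The loader functions below are hypothetical placeholders that only count calls; a real loader would read and parse the Parquet footer and page index structures.

```python
from functools import lru_cache

LOAD_COUNTS = {"footer": 0, "page_index": 0}


def _load_footer(path):
    """Hypothetical loader; a real one would parse the file's footer."""
    LOAD_COUNTS["footer"] += 1
    return {"path": path, "row_groups": []}


def _load_page_indexes(path, column):
    """Hypothetical loader for one column's column index and offset index."""
    LOAD_COUNTS["page_index"] += 1
    return {"path": path, "column": column, "column_index": [], "offset_index": []}


@lru_cache(maxsize=10_000)
def get_footer(path):
    # Cache key: file path -> footer.
    return _load_footer(path)


@lru_cache(maxsize=100_000)
def get_page_indexes(path, column):
    # Cache key: (file path, column) -> column index and offset index.
    return _load_page_indexes(path, column)


for _ in range(3):
    get_footer("s3://bucket/file1")
    get_page_indexes("s3://bucket/file1", "id")
    get_page_indexes("s3://bucket/file1", "col1")

print(LOAD_COUNTS)
```

After the first request, repeated lookups against the same file and columns touch only the in-memory metadata, which is the pointer chasing the slide aims to avoid.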
Background on Parquet Format
● Footer: holds the min and max of id for each row group, so we can quickly binary search for the right row group
● Column Index: within each row group, locates the page containing the right id; from there we find the row number in that page and in the row group
● Offset Index: quickly finds the pages of the other columns that hold the same row number
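The three-step lookup (footer statistics, then column index, then offset index) can be sketched over a simplified in-memory model. The dataclasses below are assumptions for illustration: real Parquet pages hold encoded, compressed bytes, and the real index structures are Thrift objects, not Python lists.

```python
import bisect
from dataclasses import dataclass


@dataclass
class Page:
    first_row: int  # row index of the page's first value within the row group
    values: list    # decoded values (a real page stores encoded bytes)


@dataclass
class ColumnChunk:
    pages: list     # pages ordered by first_row


@dataclass
class RowGroup:
    min_id: int
    max_id: int
    columns: dict   # column name -> ColumnChunk


def find_row_group(row_groups, key):
    """Footer step: binary search over sorted, non-overlapping id ranges."""
    # A real implementation would keep this array precomputed.
    maxes = [rg.max_id for rg in row_groups]
    i = bisect.bisect_left(maxes, key)
    if i < len(row_groups) and row_groups[i].min_id <= key:
        return row_groups[i]
    return None


def find_row(id_chunk, key):
    """Column index step: find the page covering key, then the row in it."""
    page_maxes = [p.values[-1] for p in id_chunk.pages]
    p = bisect.bisect_left(page_maxes, key)
    if p == len(id_chunk.pages):
        return None
    page = id_chunk.pages[p]
    j = bisect.bisect_left(page.values, key)
    if j == len(page.values) or page.values[j] != key:
        return None
    return page.first_row + j


def read_value(chunk, row):
    """Offset index step: find the page of another column holding `row`."""
    starts = [p.first_row for p in chunk.pages]
    page = chunk.pages[bisect.bisect_right(starts, row) - 1]
    return page.values[row - page.first_row]


def point_query(row_groups, key, columns):
    """select <columns> where id = key, over the simplified model."""
    rg = find_row_group(row_groups, key)
    if rg is None:
        return None
    row = find_row(rg.columns["id"], key)
    if row is None:
        return None
    return {c: read_value(rg.columns[c], row) for c in columns}


row_groups = [
    RowGroup(1, 6, {
        "id": ColumnChunk([Page(0, [1, 2, 3]), Page(3, [4, 5, 6])]),
        "col1": ColumnChunk([Page(0, ["a", "b"]), Page(2, ["c", "d", "e", "f"])]),
    }),
    RowGroup(10, 13, {
        "id": ColumnChunk([Page(0, [10, 11]), Page(2, [12, 13])]),
        "col1": ColumnChunk([Page(0, ["w", "x", "y", "z"])]),
    }),
]

print(point_query(row_groups, 5, ["col1"]))
print(point_query(row_groups, 12, ["col1"]))
print(point_query(row_groups, 8, ["col1"]))
```

Note that the id column and col1 use different page boundaries in the demo data; the row number is what ties them together, which is the role the offset index plays.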
Summary and Next Steps
● We brought latency from 46 ms on a cached Alluxio file down to 0.4 ms using a specialized interface
● Throughput: 20K QPS per 8-core storage worker node (i4i.2xlarge)
● Next step: we are looking to integrate with upper layers, query engines, and compute frameworks to bring this low latency to applications.
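The ratios behind these numbers are simple to work out from the figures quoted in the deck (46 ms with ParquetReader, 300-400 ms for S3, 0.4 ms with the specialized interface). The mapping of the title's 1,000X to the upper end of the S3 range is an inference from this arithmetic, not something the slides state. Latencies are kept in integer microseconds so the ratios are exact.

```python
# Latencies from the slides, in microseconds.
NEW_LATENCY_US = 400  # 0.4 ms with the specialized interface

baselines_us = {
    "ParquetReader on a cached Alluxio file": 46_000,
    "S3 (low end of 300-400 ms)": 300_000,
    "S3 (high end of 300-400 ms)": 400_000,
}

speedups = {name: us / NEW_LATENCY_US for name, us in baselines_us.items()}
for name, factor in speedups.items():
    print(f"{name}: {factor:.0f}x")

# Throughput from the slides: 20K QPS on an 8-core worker.
qps_per_core = 20_000 / 8
print(f"{qps_per_core:.0f} QPS per core")
```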
Cost Analysis of Alluxio vs S3 Express One Zone
                              S3 Express One Zone   EC2: i3en.metal   S3 Standard
List Price/TB/Month           $110*                 $132**            $23***
Example Data Set Size in TB   500                   500               500
% of Data Set Stored          100%                  20%               100%
Actual Cost/Month             $55,000               $13,200           $11,500
Latency                       <1 ms                 <1 ms             100+ ms
* At the time of writing, S3 Express One Zone has a list price of $110/TB/Month.
** At the time of writing, on-demand pricing for EC2 i3en.12xlarge instances with 30 TB of NVMe capacity was $5.42/hour, which works out to $132/TB/Month.
*** At the time of writing, S3 Standard has a list price of $23/TB/Month.
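The figures in the cost table can be reproduced with a few lines of arithmetic. The 730 hours per month is an assumption (the slide does not say which month length it used), and the combined "cache plus S3 Standard" total at the end is an extra line not shown in the table, added on the assumption that the full data set also stays in S3 Standard as the deck describes.

```python
HOURS_PER_MONTH = 730  # assumption: average number of hours in a month

# Footnote **: i3en.12xlarge at $5.42/hour with 30 TB of NVMe.
ec2_per_tb_month = 5.42 * HOURS_PER_MONTH / 30
print(f"EC2 NVMe: ${ec2_per_tb_month:.2f}/TB/month")

DATA_SET_TB = 500

# name -> (list price in $/TB/month, percent of the data set stored)
options = {
    "S3 Express One Zone": (110, 100),
    "EC2 NVMe cache (hot data only)": (132, 20),
    "S3 Standard": (23, 100),
}

monthly_cost = {
    name: price * DATA_SET_TB * percent // 100
    for name, (price, percent) in options.items()
}
for name, cost in monthly_cost.items():
    print(f"{name}: ${cost:,}/month")

# Not in the table: cache for the hot 20% plus S3 Standard for everything.
cache_plus_s3 = (
    monthly_cost["EC2 NVMe cache (hot data only)"] + monthly_cost["S3 Standard"]
)
print(f"Cache + S3 Standard: ${cache_plus_s3:,}/month")
```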
Meet-in-the-Middle Philosophy
● A long-running debate: move the data to the compute, or move the compute to the data?
● Why not both?
● The caching layer is where they meet
● Instead of an application-specific cache, this is a data-specific cache
● It can be shared by many applications

Meet You in the Middle: 1000x Performance for Parquet Queries on PB-Scale Data Lakes
