Meet You in the Middle: 1000x Performance for Parquet Queries on PB-Scale Data Lakes

Meet in the Middle for a 1,000X Performance Boost Querying Parquet Files on Petabyte-Scale Data Lakes
David Zhu, Engineering Manager
david@alluxio.com
Twitter/X: @davidyuzhu

Data Explosion
[Timeline figure, 2014 to 2024: Big Data Analytics, Cloud Adoption, Generative AI]

AI Data Life Cycle
[Pipeline diagram: Data Collection, Data Preprocessing, Model Training, Model Verification, Model Loading, Inference, Data Archiving. One part of the pipeline is marked "Focus of This Talk".]
Annotations on the stages:
● High concurrency, high throughput data read/write
● Compute engines like Spark
● High concurrency, high throughput data reading
● Checkpoint write, async upload
● High concurrency, high throughput model distribution
● High concurrency and fast feature lookup
Data is everywhere in every stage, and needs to be accessed fast

Distributed Caching for High Throughput
[Diagram: Big Data Query, Big Data ETL, and Model Training workloads on top of Alluxio Worker 1, Alluxio Worker 2, ..., Alluxio Worker n. Blocks A, B, and C of s3://bucket/file1 and s3://bucket/file2 are spread across the workers.]
Select worker based on consistent hashing
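A minimal sketch of worker selection by consistent hashing, assuming a hash ring with virtual nodes. The class name `ConsistentHashRing`, the `vnodes` parameter, the worker names, and the choice of MD5 are hypothetical and only illustrate the general technique; they are not taken from Alluxio's client.

```python
import bisect
import hashlib


class ConsistentHashRing:
    """Maps a file path to a worker using a hash ring with virtual nodes."""

    def __init__(self, workers, vnodes=100):
        # Each worker is placed on the ring `vnodes` times to even out load.
        self._ring = sorted(
            (self._hash(f"{worker}#{i}"), worker)
            for worker in workers
            for i in range(vnodes)
        )
        self._keys = [h for h, _ in self._ring]

    @staticmethod
    def _hash(key):
        # Any stable hash works; MD5 is used here only for determinism.
        return int.from_bytes(hashlib.md5(key.encode()).digest()[:8], "big")

    def worker_for(self, path):
        # First ring position clockwise from the path's hash (wraps around).
        idx = bisect.bisect_right(self._keys, self._hash(path)) % len(self._keys)
        return self._ring[idx][1]


workers = ["worker-1", "worker-2", "worker-3"]  # hypothetical worker names
ring = ConsistentHashRing(workers)
chosen = ring.worker_for("s3://bucket/file1")
```

Because every client computes the same ring from the same worker list, no central lookup is needed, and removing a worker only remaps the paths that worker owned.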

New Architecture in EE 3.x
[Architecture diagram. Components: AI/Analytics Applications and Alluxio Client on a Training Node; Affinity Block Location Policy; Client Consistent Hash (Task Info); Service Registry; Alluxio Cluster with three Alluxio Workers; Under Storage. Flow labels (numbered 1 to 6 in the figure): Get Task Info, Get Cluster Info, Find Worker(s), Execute Task, Cache miss: Under storage task, Send Result.]

Holy Grail of Storage Systems
Source: https://jack-vanlightly.com/blog/2023/11/29/s3-express-one-zone-not-quite-what-i-hoped-for
● Cheap:
○ S3 Express One Zone costs about 5x as much as S3 Standard, even after the recent price reduction
○ An Alluxio cache spends money only on the hot data, leaving the rest at S3 Standard cost
● Low Latency:
○ Achieve sub-millisecond or single-digit millisecond latency for fast
responses
● Scaling Linearly in Capacity:
○ Seamlessly scale to support tens of billions of objects and files.
● High Availability:
○ No centralized metadata service, no single point of failure.
○ Caching in Multi-AZs, Multiple Regions, always backed up by S3

Low Latency File Access
● To achieve low-latency data access, we first need low-latency file access
● Asynchronous Event Loop:
Each Alluxio worker is built on a high-performance, asynchronous I/O framework. This enables non-blocking I/O with minimal context switching and thread contention, two major contributors to latency in traditional blocking I/O systems. Its event-driven model allows one worker instance to scale to thousands of concurrent connections while maintaining sub-millisecond responsiveness.
● Off-Heap Page Storage on NVMe:
Alluxio leverages NVMe SSDs to store cached pages off-heap. This design allows for significantly higher storage density without overwhelming memory resources, offering a favorable balance between cost and access latency.
● Zero-Copy I/O:
To avoid unnecessary memory copies and to reduce CPU load, Alluxio employs zero-copy I/O techniques using sendfile() and mmap(). These allow cached pages to be read directly from NVMe and transmitted over the network stack without copying through user space, improving both throughput and latency.
Result: 1 ms file access for small positioned reads from cache (~1 KB)
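The zero-copy idea can be illustrated with a small positioned read sent over a socket. This is a sketch only, not Alluxio's worker code: it uses Python's `socket.sendfile()`, which relies on `os.sendfile()` where the platform supports it and otherwise falls back to plain `send()`. The helper names `send_cached_range` and `demo`, and the temporary file standing in for a cached page file, are hypothetical.

```python
import os
import socket
import tempfile

READ_SIZE = 1024  # small positioned read (~1 KB), as in the result above


def send_cached_range(sock, path, offset, length):
    """Send `length` bytes of the file at `path`, starting at `offset`.

    Where os.sendfile() is available, the kernel moves the bytes from the
    file to the socket without copying them through user space.
    """
    with open(path, "rb") as f:
        return sock.sendfile(f, offset=offset, count=length)


def demo():
    # A 4 KB temporary file stands in for a cached page file on NVMe.
    with tempfile.NamedTemporaryFile(delete=False) as tmp:
        tmp.write(bytes(range(256)) * 16)
        path = tmp.name
    server, client = socket.socketpair()
    try:
        sent = send_cached_range(server, path, offset=1024, length=READ_SIZE)
        received = b""
        while len(received) < sent:
            received += client.recv(sent - len(received))
        return sent, received
    finally:
        server.close()
        client.close()
        os.unlink(path)
```

Calling `demo()` returns the number of bytes sent and the bytes the peer received for the requested range.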

Low Latency Parquet API
● Goal:
○ Achieve sub-millisecond latency for single-field, single-row point lookups on files stored in S3 and cached in Alluxio
○ Driven by AI inference workloads, search applications, etc.
● Builds on previous work:
○ Achieved sub-millisecond 1 KB reads from a file cached in Alluxio
○ Using ParquetReader to query a field gives 46 ms latency, between S3 Express (<10 ms) and S3 (300-400 ms)
● Assumptions:
○ Point query: select col1, col2 where id = x;
○ id is a primary key; returned fields are small enough (<20 KB) that latency is not network-bound
○ Column id is sorted; min/max statistics on row groups are available; column index and offset index on pages are available
○ ParquetReader is generally too heavy for this

Key Ideas
● Cache Parquet Metadata in Alluxio (Reduce pointer chasing and lookups)
○ Cache the parquet footer (file path -> footer)
○ Cache the column index and offset index (file path, column)
● Offload processing to the client (reduce CPU workload on the caching node)
○ Use a small page size and send back entire pages with offsets rather than decoding on the storage node (trading some network transfer for throughput)
○ Return protobuf raw bytes
● Push down predicates and projections to the leaf cache node (reduce network traffic to a minimum)
○ Usually this is pushed to the compute node workers (Spark worker, Trino worker), but never to the storage
Background on Parquet Format
Footer: for each row group, the min and max of id, so we can quickly binary search for the right row group
Column Index: within each row group, we can locate the page containing the right id, and find the row number in that page and in the row group
Offset Index: find other columns with the same row number quickly
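The three-step lookup (footer, column index, offset index) can be sketched over simplified in-memory metadata. The structures `IdPage` and `RowGroup`, the function `locate`, and the `read_id_page` callback are hypothetical stand-ins, not the Parquet library's types; the sketch assumes the id column is sorted, as stated in the assumptions.

```python
import bisect
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass
class IdPage:
    min_id: int     # from the column index
    max_id: int     # from the column index
    first_row: int  # from the offset index: first row of the page in its row group


@dataclass
class RowGroup:
    min_id: int                  # from the footer statistics
    max_id: int
    id_pages: List[IdPage]
    other_first_rows: List[int]  # offset index of the projected column


def _search(mins: List[int], maxes: List[int], target: int) -> Optional[int]:
    # Binary search for the first range whose max >= target, then check its min.
    i = bisect.bisect_left(maxes, target)
    return i if i < len(maxes) and mins[i] <= target else None


def locate(row_groups: List[RowGroup],
           read_id_page: Callable[[int, int], List[int]],
           target: int) -> Optional[Tuple[int, int, int, int]]:
    """Return (row group, row in group, page of other column, row in that page)."""
    rg_idx = _search([g.min_id for g in row_groups],
                     [g.max_id for g in row_groups], target)
    if rg_idx is None:
        return None
    rg = row_groups[rg_idx]
    page_idx = _search([p.min_id for p in rg.id_pages],
                       [p.max_id for p in rg.id_pages], target)
    if page_idx is None:
        return None
    ids = read_id_page(rg_idx, page_idx)  # decoded, sorted ids of that one page
    pos = bisect.bisect_left(ids, target)
    if pos == len(ids) or ids[pos] != target:
        return None
    row = rg.id_pages[page_idx].first_row + pos
    # Offset index of the other column: last page whose first row is <= row.
    other_page = bisect.bisect_right(rg.other_first_rows, row) - 1
    return rg_idx, row, other_page, row - rg.other_first_rows[other_page]


# Tiny illustrative data set: two row groups, two id pages each.
ID_PAGES = {(0, 0): [0, 1, 2], (0, 1): [3, 4, 5],
            (1, 0): [10, 11, 12], (1, 1): [13, 14, 15]}
ROW_GROUPS = [
    RowGroup(0, 5, [IdPage(0, 2, 0), IdPage(3, 5, 3)], other_first_rows=[0, 2, 4]),
    RowGroup(10, 15, [IdPage(10, 12, 0), IdPage(13, 15, 3)], other_first_rows=[0, 4]),
]
hit = locate(ROW_GROUPS, lambda g, p: ID_PAGES[(g, p)], 14)
miss = locate(ROW_GROUPS, lambda g, p: ID_PAGES[(g, p)], 7)
```

Only one page of the id column is decoded per query; every other step is a binary search over metadata that is already cached.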

Summary and Next Steps
● We brought latency from 46 ms on a cached Alluxio file down to 0.4 ms using a specialized interface
● Throughput: 20K QPS per 8-core storage worker node (i4i.2xlarge)
● Next step: we are looking to integrate with upper layers (query engines, compute frameworks) to bring this low latency to applications
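The figures quoted in the talk can be put side by side with a few lines of arithmetic. The latencies below are the ones stated on the slides; reading the 1,000X in the title as the ratio against the upper end of the S3 range is an assumption, not something the slides state.

```python
# Latencies quoted in the talk, in milliseconds.
specialized_ms = 0.4
baselines_ms = {
    "ParquetReader on a cached Alluxio file": 46.0,
    "S3 (low end)": 300.0,
    "S3 (high end)": 400.0,
}

# Speedup of the specialized interface over each baseline.
speedups = {name: round(ms / specialized_ms) for name, ms in baselines_ms.items()}

# Throughput per core on the 8-core worker node.
qps_per_core = 20_000 / 8

for name, factor in speedups.items():
    print(f"{name}: about {factor}x")
print(f"QPS per core: {qps_per_core:.0f}")
```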

Cost Analysis of Alluxio vs S3 Express One Zone

                               S3 Express One Zone   EC2: i3en.metal   S3 Standard
List Price/TB/Month            $110*                 $132**            $23***
Example Data Set Size in TB    500                   500               500
% of Data Set Stored           100%                  20%               100%
Actual Cost/Month              $55,000               $13,200           $11,500
Latency                        <1 ms                 <1 ms             100+ ms

* At the time of writing, S3 Express One Zone has a list price of $110/TB/Month.
** At the time of writing, on-demand pricing for EC2 i3en.12xlarge instances with 30 TB of NVMe capacity was $5.42/hour, which works out to $132/TB/Month.
*** At the time of writing, S3 Standard has a list price of $23/TB/Month.
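The numbers in the cost analysis can be reproduced with the arithmetic below. The 730 hours per month is an assumption (the slides do not say which figure was used); the prices, data set size, and stored fractions are the ones quoted in the table and footnotes.

```python
HOURS_PER_MONTH = 730  # assumption: average number of hours in a month


def price_per_tb_month(hourly_usd, capacity_tb, hours=HOURS_PER_MONTH):
    """Convert an instance's hourly price into dollars per TB per month."""
    return hourly_usd * hours / capacity_tb


def monthly_cost(dataset_tb, fraction_stored, price_per_tb):
    """Monthly cost of keeping a fraction of the data set in a given tier."""
    return dataset_tb * fraction_stored * price_per_tb


# $5.42/hour for 30 TB of NVMe, as in the footnote.
ec2_price = round(price_per_tb_month(5.42, 30))

costs = {
    "S3 Express One Zone": monthly_cost(500, 1.00, 110),
    "EC2 NVMe cache (20% hot data)": monthly_cost(500, 0.20, ec2_price),
    "S3 Standard": monthly_cost(500, 1.00, 23),
}
for tier, usd in costs.items():
    print(f"{tier}: ${usd:,.0f} per month")
```

The cache column is cheaper than S3 Express One Zone because only the hot 20% of the data set is held on NVMe.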

Meet-in-the-Middle Philosophy
● Long debate: move the data to the compute, or move the compute to the data?
● Why not both?
● The caching layer is where they meet
● Instead of an application-specific cache, this is a data-specific cache
● It can be shared by many applications
