Optimizing Latency-Sensitive Queries for Presto at Facebook: A Collaboration Between Presto & Alluxio

Rohit Jain (Facebook), James Sun (Facebook), Bin Fan (Alluxio)
05/07/2020
Q&A: www.alluxio.io/slack

• Overview
• Architecture and Problems
• Re-architecture and solution
• Performance
• Introduce Alluxio Local Cache
• Timeline

A Distributed SQL (Compute) Engine
[Diagram: a SQL query (SELECT * FROM A, B WHERE A.k = B.k) goes to the engine, which reads from storage and returns the result.]

Presto @ Facebook Scale
• 40K servers
• ~1 EB of data scanned per day
• > 80% of new ETL

Interactive Use Cases @ Facebook
[Diagram: data sources Hive, Raptor, MySQL, ...; joins are allowed across heterogeneous data sources]
• presto-hive (Presto)
  • General-purpose dashboarding and ad hoc queries
• presto-raptor (Raptor)
  • Low-latency dashboarding and A/B testing
  • Usually 10X faster than Presto
• The two use cases have similar fleet sizes

Difference between Presto and Raptor: How Presto Works
[Diagram. Components: Planner/Optimizer, Scheduler, Workers (each running Drivers), Hive Metastore, HDFS. Edge labels: SQL, getPartitions, getFiles, openFiles, read/writeBlock, workload balanced, result.]

Difference between Presto and Raptor: How Raptor Works
[Diagram. Components: Planner/Optimizer, Scheduler, Workers (each running Drivers), Local SSD, KV Store, Raptor Metastore, Background Job. Edge labels: SQL, getFiles, read/writeBlock, hard affinity, write thru., backup, compaction / cleaning, result.]

Pros and Cons between Presto and Raptor
       | Presto                                | Raptor
Pros   | Large-scale storage (EB)              | Low latency (sub-second)
       | Independent storage and compute scale | Refined metastore (file-level)
Cons   | High latency (sub-minute)             | Mid-scale storage (PB)
       | Coarse metastore (partition-level)    | Coupled storage and compute

New Architecture to Unify Presto and Raptor
[Diagram. Components: Planner/Optimizer, Scheduler, Workers (each running Drivers), local data cache on Local SSD, Hive Metastore, HDFS, KV store (file location/stats). Edge labels: getFiles, openFile/footer cache, read/writeBlock, soft affinity. Annotations: Caching, Low-overhead coordinator.]

Affinity scheduling
• Random Node Scheduler
• Best-effort assignment of the same split to the same worker
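The best-effort ("soft") affinity idea can be sketched roughly as below. This is only an illustration, not Presto's scheduler: the function name `pick_worker`, the `max_pending` busy threshold, and the hash-modulo mapping are all assumptions made for the example.

```python
import hashlib
import random


def pick_worker(split_path, workers, pending, max_pending=2, rng=random):
    """Pick a worker for a split, preferring the same worker every time.

    split_path  -- identifier of the split (hypothetical: a file path)
    workers     -- ordered list of worker names
    pending     -- dict mapping worker name -> number of queued splits
    max_pending -- hypothetical threshold above which a worker counts as busy
    """
    # A stable hash maps the same split to the same worker across queries,
    # so that worker's local cache is likely to already hold the data.
    digest = hashlib.md5(split_path.encode("utf-8")).hexdigest()
    preferred = workers[int(digest, 16) % len(workers)]
    if pending.get(preferred, 0) < max_pending:
        return preferred
    # Affinity is best effort only: if the preferred worker is busy,
    # fall back to a random other worker instead of waiting.
    others = [w for w in workers if w != preferred]
    return rng.choice(others) if others else preferred
```

With an idle cluster, repeated calls for the same split return the same worker; once that worker's queue reaches `max_pending`, the split goes elsewhere and simply misses the cache.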

Data Caching
• A common optimization technique is to cache the working dataset closer to the compute node.
• Fewer trips to remote storage should reduce latency and IO.

Various Options
• Facebook internal caching libraries
• Open source solutions
• Build our own

File Merge Caching
• Naïve solution
• Copying files from remote storage to local storage
• Merging files in local storage to keep the file count low

Learnings
• Java-based OSS library
• Segment-based data caching: reading, writing, and evicting in smaller units
• Asynchronous operations
• Pluggable eviction policies
• Semantic-aware caching
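To make "reading, writing, and evicting in smaller units" concrete, here is a small sketch of how a byte-range read could be split into fixed-size segments. The function name and the tuple layout are hypothetical; the 1 MiB default is an assumption based on the 1 MB size mentioned elsewhere in the talk.

```python
def segments_for_read(offset, length, segment_size=1 << 20):
    """Return (segment_index, offset_in_segment, bytes_to_read) tuples
    covering the byte range [offset, offset + length)."""
    if offset < 0 or length < 0 or segment_size <= 0:
        raise ValueError("offset/length must be >= 0 and segment_size > 0")
    parts = []
    end = offset + length
    pos = offset
    while pos < end:
        index = pos // segment_size      # which segment holds this byte
        in_seg = pos % segment_size      # where the read starts inside it
        n = min(segment_size - in_seg, end - pos)
        parts.append((index, in_seg, n))
        pos += n
    return parts
```

A 10-byte read that starts 6 bytes before a segment boundary touches two segments, so only those two units need to be fetched or cached, not the whole file.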

Collaboration with Alluxio
• A Java-based OSS library
• Segment-based data caching
• Pluggable eviction policies
• Configuration of various aspects: sizes, resource usage, eviction policies, etc.
• Detailed stats on cache usage
• Caching should not become a point of failure
• Asynchronous operations
• File management at the disk level
• Flash throughput limiter to avoid endurance issues
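One way to picture a flash throughput limiter that also keeps caching off the failure path is a non-blocking token bucket: when the write budget is exhausted, the segment is simply not cached. This is a sketch under assumptions, not the Alluxio implementation; the class name, parameters, and the token-bucket choice are all hypothetical.

```python
import time


class WriteLimiter:
    """Token bucket that caps the bytes written to flash per second."""

    def __init__(self, bytes_per_sec, burst_bytes, clock=time.monotonic):
        self.rate = bytes_per_sec
        self.capacity = burst_bytes
        self.tokens = burst_bytes
        self.clock = clock            # injectable so the limiter can be tested
        self.last = clock()

    def try_acquire(self, nbytes):
        """Return True if nbytes may be written now; never blocks."""
        now = self.clock()
        # Refill according to elapsed time, capped at the burst size.
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.rate)
        self.last = now
        if nbytes <= self.tokens:
            self.tokens -= nbytes
            return True
        # Over budget: the caller skips the cache write and still serves
        # the data it already read from remote storage.
        return False
```

Because `try_acquire` never waits, a saturated limiter lowers the cache fill rate but does not add latency to the query.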

Benchmark
• Two full days' worth of queries from the production cluster were shadowed to the test cluster.
• Query count: 17320
• 600-node cluster
• 460 GB per node was configured for data caching.
• LRU eviction policy.
• 1 MB block size, meaning data is read, stored, and evicted in 1 MB units.
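A quick back-of-the-envelope calculation from the setup above. Decimal units (1 TB = 1000 GB, 1 GB = 1000 MB) are an assumption; the slides do not say which convention applies.

```python
NODES = 600
CACHE_GB_PER_NODE = 460
BLOCK_MB = 1
QUERIES = 17320
DAYS = 2

# Aggregate cache capacity across the test cluster, in decimal TB.
total_cache_tb = NODES * CACHE_GB_PER_NODE / 1000

# Number of 1 MB blocks a single node's cache can hold (decimal units).
blocks_per_node = CACHE_GB_PER_NODE * 1000 // BLOCK_MB

# Average shadowed queries per day.
queries_per_day = QUERIES // DAYS

print(total_cache_tb, blocks_per_node, queries_per_day)
```

Under these assumptions the cluster holds 276 TB of cache in total, and each node tracks up to 460,000 blocks.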

Benchmark result (cont.)
• IO savings
  • Data read in the master branch run: 582 TB
  • Data read in the caching branch run: 251 TB
  • Savings in scans: 57%
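The 57% figure follows directly from the two scan volumes; the helper name below is just for illustration.

```python
def scan_savings(baseline_tb, cached_tb):
    """Fraction of remote scan volume avoided by caching."""
    return (baseline_tb - cached_tb) / baseline_tb


savings = scan_savings(582, 251)   # (582 - 251) / 582
print(f"{savings:.1%}")            # 56.9%, reported as 57% on the slide
```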

Benchmark result (cont.)
• Cache hit rate

Confidential Use Only – Do Not Share
Alluxio Overview
• Open source data orchestration
• Commonly used for data analytics such as OLAP on Hadoop
• Started as the research project “Tachyon” at UC Berkeley

Fast Growing Open Source Community
[Chart: contributors by release]
• v0.1 (Dec ‘12): 1
• v0.2 (Apr ‘13): 3
• v0.6 (Mar ‘15): 70
• v1.0 (Feb ‘16): 210
• v1.8 (Jul ‘18): 750
• v2.1 (Nov ‘19): 1080
Over 1000 GitHub Contributors
Latest release: 2.2.0 in March 2020

Deployed in Hundreds of Companies
[Logo wall by industry: Consumer, Travel & Transportation, Telco & Media, Technology, Financial Services, Retail & Entertainment, Data & Analytics Services]

Alluxio Local Cache: Architecture
[Diagram. Inside the Presto Server JVM: Presto Worker, Alluxio Caching File System, Alluxio Cache Manager, External File System. Outside: local cache storage, External Storage. Edge labels: HDFS API Calls, On Cache Hit, On Cache Miss.]
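The hit/miss paths in the diagram amount to a read-through cache. A minimal sketch, assuming an in-memory dict in place of local cache storage and a caller-supplied `read_external` function in place of the external file system (both are stand-ins, not Alluxio APIs):

```python
class ReadThroughCache:
    """Serve segments from a local store; fetch from external storage on a miss."""

    def __init__(self, read_external):
        # read_external(path, segment_index) -> bytes; hypothetical callback
        self._read_external = read_external
        self._store = {}          # stands in for local cache storage
        self.hits = 0
        self.misses = 0

    def read_segment(self, path, index):
        key = (path, index)
        data = self._store.get(key)
        if data is not None:      # on cache hit: no trip to external storage
            self.hits += 1
            return data
        self.misses += 1          # on cache miss: read remotely, then populate
        data = self._read_external(path, index)
        self._store[key] = data
        return data
```

The `hits` and `misses` counters are the raw inputs for a cache hit rate metric.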

I/O Challenges & Implementation
• Read pattern is seek-heavy (non-sequential)
  • Segment-based caching (1 MB segments by default, rather than whole files)
• Presto server is highly concurrent by design
  • Lightweight, fine-grained locking across segments
• Queries are bursty
  • Optional async write to cache
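Fine-grained locking across segments can be illustrated with lock striping: each segment key hashes to one of a fixed set of locks, so readers of different segments rarely contend and a given segment is loaded only once. This is a sketch of the general technique, not the Alluxio code; the class name, the `loader` callback, and the stripe count are assumptions.

```python
import threading


class StripedSegmentCache:
    """Segment cache guarded by a fixed array of locks (lock striping)."""

    def __init__(self, loader, stripes=64):
        self._loader = loader     # loader(path, index) -> bytes; hypothetical
        self._locks = [threading.Lock() for _ in range(stripes)]
        self._data = {}

    def get(self, path, index):
        key = (path, index)
        data = self._data.get(key)        # fast path without taking a lock
        if data is not None:
            return data
        # Only threads whose keys map to the same stripe wait on each other.
        lock = self._locks[hash(key) % len(self._locks)]
        with lock:
            data = self._data.get(key)    # re-check: another thread may have loaded it
            if data is None:
                data = self._loader(path, index)
                self._data[key] = data
            return data
```

The lock-free fast path relies on single dict reads and writes being atomic in CPython; a port to another runtime would need a concurrent map.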

Cache Configuration
• Pluggable cache replacement policies:
  • LRU, LFU
• Pluggable cache storage options:
  • Local file system: each segment -> one file
  • RocksDB
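"Pluggable" here means the cache delegates the choice of victim to a policy object. A toy sketch with LRU and LFU policies follows; the `touch` / `remove` / `victim` interface is invented for this example and is not the Alluxio API.

```python
from collections import Counter, OrderedDict


class LRUPolicy:
    """Evict the least recently used key."""

    def __init__(self):
        self._order = OrderedDict()

    def touch(self, key):
        self._order[key] = True
        self._order.move_to_end(key)      # most recently used goes last

    def remove(self, key):
        self._order.pop(key, None)

    def victim(self):
        return next(iter(self._order))    # oldest entry


class LFUPolicy:
    """Evict the least frequently used key."""

    def __init__(self):
        self._counts = Counter()

    def touch(self, key):
        self._counts[key] += 1

    def remove(self, key):
        self._counts.pop(key, None)

    def victim(self):
        return min(self._counts, key=self._counts.get)


class SegmentCache:
    """Fixed-capacity cache (capacity >= 1) with a pluggable eviction policy."""

    def __init__(self, capacity, policy):
        self.capacity = capacity
        self.policy = policy
        self.data = {}

    def get(self, key):
        if key in self.data:
            self.policy.touch(key)
            return self.data[key]
        return None

    def put(self, key, value):
        if key not in self.data and len(self.data) >= self.capacity:
            victim = self.policy.victim()   # the policy decides what to drop
            self.policy.remove(victim)
            del self.data[victim]
        self.data[key] = value
        self.policy.touch(key)
```

Swapping `LRUPolicy()` for `LFUPolicy()` changes which segment is dropped without touching the cache code itself.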

Alluxio Local Cache vs. Alluxio System
• Alluxio Local Cache is an embedded library
  • Shipped with the Alluxio client jar since v2.2.0
  • No extra server daemon required
  • Can be easily used in other JVM applications
• The Alluxio system supports the full feature set
  • Data policies: free, pin, TTL, etc.
  • Metadata caching and syncing
  • Familiar file system CLIs
  • Transformation service

Timeline & Future Work
• Enable Presto + Alluxio Local Cache:
  • Edit etc/catalog/hive.properties:
      cache.enabled=true
      cache.type=ALLUXIO
      cache.base-directory=/tmp/alluxio-cache
  • Available in the next Presto release
• Future work
  • Semantics-aware metadata caching
  • Performance tuning: CPU vs. memory

Q & A
Recording of this talk will be available soon
