Optimizing Latency-sensitive queries for Presto at Facebook: A Collaboration between Presto & Alluxio

Optimizing Latency-Sensitive Queries at Facebook
with Presto & Alluxio
Ke Wang (Facebook)
Bin Fan (Alluxio)
December 2020

• Overview
• Architecture and Problems
• Re-architecture and Solutions
• Performance
• Alluxio Deep-dive

• 40K servers
• ~1 EB data scan per day
• > 80% new ETL
Presto @ Facebook Scale


[Diagram. Components: planner/optimizer, scheduler, three workers each running a driver, HDFS, Hive Metastore. Labeled flows: SQL in and result out; getPartitions; getFiles; openFiles; read/writeBlock; workload balanced across workers.]
How Presto Works


• Metadata cache at various levels
  • Schemas
  • ACLs
  • Partition info
• HDFS
  • File handle caching: avoids file open calls
  • File stripe/footer caching: avoids redundant RPC calls to HDFS
  • File data caching: avoids network or HDFS latency
• Compute
  • Plan
  • Partial result
Caching

• One optimization technique is to cache the working dataset closer to the compute node.
• Fewer trips to remote storage should help with latency and IO.
Data Caching

[Diagram. Components: planner/optimizer, scheduler, three workers each running a driver, HDFS, Hive Metastore. Caching labels: metadata caching; low-overhead coordinator; KV store; file location/stats; getFiles; soft affinity; openFile/footer cache; data cache on local SSD; read/writeBlock.]
Presto with Data Caching

• Random Node Scheduler
• Best effort to assign the same split to the same worker
Affinity Scheduling

• Preferred worker blocked --> secondary preference --> least busy worker
Soft Affinity
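
The fallback order on this slide can be sketched in a few lines. This is a minimal illustration, not Presto's scheduler: the hash-based choice of the preferred worker, the "next worker in the list" rule for the secondary preference, the `max_pending` threshold used to decide that a worker is blocked, and all names are assumptions made for the example.

```python
import hashlib
from dataclasses import dataclass
from typing import List


@dataclass
class Worker:
    name: str
    pending_splits: int = 0  # hypothetical load metric


def _stable_hash(key: str) -> int:
    # Python's built-in hash() is salted per process, so use a stable digest
    # to make the same split map to the same worker across runs.
    return int.from_bytes(hashlib.md5(key.encode()).digest()[:8], "big")


def pick_worker(split_path: str, workers: List[Worker], max_pending: int = 2) -> Worker:
    """Preferred worker, then secondary preference, then least busy worker."""
    n = len(workers)
    primary = _stable_hash(split_path) % n
    secondary = (primary + 1) % n  # assumption: neighbor in the worker list
    for idx in (primary, secondary):
        if workers[idx].pending_splits < max_pending:
            return workers[idx]
    # Both preferred workers are blocked: give up cache locality.
    return min(workers, key=lambda w: w.pending_splits)


def assign(split_path: str, workers: List[Worker], max_pending: int = 2) -> Worker:
    worker = pick_worker(split_path, workers, max_pending)
    worker.pending_splits += 1
    return worker
```

Repeated calls to `assign` with the same split path return the same worker until it reaches `max_pending`, which is what keeps that worker's local cache warm for the split.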

• Facebook internal caching libraries
• Open source solutions
• Build our own
Various Options

• Naïve solution
• Copying files from remote storage to local storage
• Merging files in the local storage to keep file count low
File Merge Caching

• Segment Based data caching
• Pluggable eviction policies
• Configuration of various aspects such as sizes, resource usage, eviction policies, etc.
• A Java-based OSS library
• Provides detailed stats regarding cache usage.
• Caching should not become a point of failure.
• Asynchronous operations.
• File management at the disk level.
• Flash throughput limiter to avoid endurance issues.
Learnings & Alluxio Collaboration
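
The slides do not say how the flash throughput limiter works. One common way to cap write throughput is a token bucket, sketched below under that assumption. The class name, parameters, and the decision to skip caching (rather than block the read) when the budget is exhausted are all hypothetical; skipping is chosen here because it also keeps the cache from becoming a point of failure.

```python
import time


class WriteThrottle:
    """Token bucket: admit a cache write only if the byte budget allows it."""

    def __init__(self, bytes_per_sec, burst_bytes, clock=time.monotonic):
        self.rate = bytes_per_sec
        self.capacity = burst_bytes
        self.tokens = float(burst_bytes)
        self.clock = clock  # injectable so the behavior can be tested
        self.last = clock()

    def try_acquire(self, nbytes):
        now = self.clock()
        # Refill in proportion to elapsed time, capped at the burst size.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if nbytes <= self.tokens:
            self.tokens -= nbytes
            return True
        # Over budget: the caller skips caching this page; the read still succeeds.
        return False
```

A caller would check `try_acquire(len(page))` before writing a page to SSD and simply serve the data uncached when it returns `False`.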


• Two full days' worth of queries from the production cluster were shadowed to the test cluster.
• Query count: 17,320
• 600-node cluster
• 460 GB per node was configured for data caching.
• LRU eviction policy.
• 1 MB block size, meaning data is read, stored, and evicted in 1 MB units.
Benchmark Configuration
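
The aggregate cache size implied by this configuration follows from simple arithmetic. The sketch below assumes decimal units (1 TB = 1000 GB, 1 GB = 1000 MB), which the slide does not specify; the variable names are illustrative.

```python
NODES = 600
CACHE_PER_NODE_GB = 460
BLOCK_SIZE_MB = 1

# Assumption: decimal units (1 TB = 1000 GB, 1 GB = 1000 MB).
total_cache_tb = NODES * CACHE_PER_NODE_GB / 1000
blocks_per_node = CACHE_PER_NODE_GB * 1000 // BLOCK_SIZE_MB

print(f"aggregate cache: {total_cache_tb:.0f} TB")  # aggregate cache: 276 TB
print(f"blocks per node: {blocks_per_node}")        # blocks per node: 460000
```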

Benchmark Results
Query Execution Time

• Data size read for master branch run: 582 TB
• Data size read for caching branch run: 251 TB
• Savings in scans: 57%
Benchmark Results
IO Savings
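
The 57% figure can be reproduced from the two data sizes on the slide. The variable names below are illustrative.

```python
master_tb = 582   # data read by the master branch run
caching_tb = 251  # data read by the caching branch run

savings = (master_tb - caching_tb) / master_tb
print(f"{savings:.1%}")  # 56.9%, which rounds to the 57% on the slide
```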

Benchmark Results
Cache hit rate


Alluxio Overview
Translate access to optimal storage APIs over a slow network
Data Orchestration for the Cloud
Interfaces: Java File API, HDFS Interface, S3 Interface, REST API, POSIX Interface
Drivers: HDFS Driver, Swift Driver, S3 Driver, NFS Driver

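[Diagram: inside the Presto server JVM, the Presto worker issues HDFS API calls to the Alluxio caching file system, which works with the Alluxio cache manager. On a cache hit, data comes from local cache storage; on a cache miss, it is read from external storage through the external file system.]
Presto & Alluxio Local Cache Architecture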

• Cache files in fixed-size segments (called pages)
  • Configurable, 1 MB by default
• Store pages off-heap
  • Pages live on SSD instead of consuming JVM memory
• Highly concurrent and thread-safe
  • Lightweight, fine-grained locking
if cacheManager.hasPage(pageId):
    page = cacheManager.readPage(pageId)
else:
    page = readFromExternalFS(offset, len)
    cacheManager.writePage(pageId, page)
Implementation & Optimization
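
The read path on this slide can be turned into a small runnable sketch. This is not Alluxio's API: an in-memory `OrderedDict` stands in for SSD-backed page storage, LRU is used as the eviction policy, and `PageCache`, `read`, and `read_external` are hypothetical names. The sketch shows how a byte range maps to fixed-size pages and how a miss populates the cache.

```python
from collections import OrderedDict

PAGE_SIZE = 1 << 20  # 1 MB, the default page size mentioned on the slide


class PageCache:
    """In-memory stand-in for a local page cache with LRU eviction."""

    def __init__(self, max_pages, page_size=PAGE_SIZE):
        self.max_pages = max_pages
        self.page_size = page_size
        self.pages = OrderedDict()  # (path, page_index) -> bytes
        self.hits = 0
        self.misses = 0

    def _get_page(self, path, index, read_external):
        key = (path, index)
        if key in self.pages:  # cache hit
            self.hits += 1
            self.pages.move_to_end(key)  # mark as most recently used
            return self.pages[key]
        # Cache miss: read the whole page from external storage, then cache it.
        self.misses += 1
        page = read_external(path, index * self.page_size, self.page_size)
        self.pages[key] = page
        if len(self.pages) > self.max_pages:
            self.pages.popitem(last=False)  # evict the least recently used page
        return page

    def read(self, path, offset, length, read_external):
        if length <= 0:
            return b""
        first = offset // self.page_size
        last = (offset + length - 1) // self.page_size
        data = b"".join(
            self._get_page(path, i, read_external) for i in range(first, last + 1)
        )
        start = offset - first * self.page_size
        return data[start:start + length]
```

Here `read_external(path, offset, length)` is any callable that returns bytes from the remote file system. Reading whole pages on a miss trades some extra remote IO on the first access for cheap local reads afterwards.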

• Pluggable cache replacement policies:
  • LRU, LFU
• Pluggable cache storage options:
  • Local file system store: each page -> one file
  • RocksDB store: page -> one value associated with pageId
• Async cache writes
  • To handle bursty cache write ops, queue writes in the background
• Failure recovery
  • Disks are expected to fail when running at Facebook scale
Implementation & Optimization
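
The async write and failure handling ideas above can be sketched with a bounded queue and a background thread. This is an illustration only: the class name, the dict-like `store`, and the policy of dropping a write when the queue is full are assumptions, not Alluxio's implementation.

```python
import queue
import threading


class AsyncCacheWriter:
    """Queue cache writes and apply them on a background thread."""

    def __init__(self, store, max_pending=1024):
        self.store = store  # any dict-like page store (hypothetical)
        self.pending = queue.Queue(maxsize=max_pending)
        self.dropped = 0
        self.worker = threading.Thread(target=self._drain, daemon=True)
        self.worker.start()

    def submit(self, page_id, page):
        """Never blocks the read path; returns False if the write was dropped."""
        try:
            self.pending.put_nowait((page_id, page))
            return True
        except queue.Full:
            self.dropped += 1  # assumption: shed load under a write burst
            return False

    def _drain(self):
        while True:
            page_id, page = self.pending.get()
            try:
                self.store[page_id] = page
            except Exception:
                pass  # a failed cache write (e.g. a bad disk) must not fail the query
            finally:
                self.pending.task_done()

    def flush(self):
        """Wait until all queued writes have been attempted."""
        self.pending.join()
```

Because a dropped or failed write only costs a later cache miss, the query path never depends on the cache being healthy.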

• (WIP) Support Schema/Table/Partition level Cache Quota
• (WIP) Performance optimizations for small files
• (Future work) Semantics-aware caching
Ongoing Development

• Edit etc/catalog/hive.properties
• More details in the blog
cache.enabled=true
cache.type=ALLUXIO
cache.base-directory=/tmp/alluxio-cache
cache.alluxio.max-cache-size=500GB
hive.node-selection-strategy=SOFT_AFFINITY
Enable Alluxio Local Cache w/ Presto
https://prestodb.io/blog/2020/06/16/alluxio-datacaching

• Fine-grained control on working set
• free / pin data in cache, set data TTL in cache etc
• Metadata caching and syncing
• Automatically sync data between Alluxio cache and persisted data
• Data Transformation Services
• e.g., convert CSV files into Parquet format in cache
• Data Migration services
• e.g., migrate from HDFS to S3 based on access time policy
• Familiar Filesystem CLIs
• e.g., alluxio fs ls /my/path
Alluxio File System Enhancements
Alluxio Doc: https://docs.alluxio.io/os/user/stable/en/Overview.html

Twitter.com/alluxio
Linkedin.com/alluxio
Website
www.alluxio.io
Slack
http://slackin.alluxio.io/
Social Media
A recording of this talk will be available soon
Q & A
www.prestodb.io
https://prestodb.io/blog/2020/06/16/alluxio-datacaching
https://prestodb.slack.com
https://alluxio.io/slack
