Building Fast SQL Analytics on Anything with Presto, Alluxio

Bin Fan | Founding Engineer @ Alluxio
2019/08/20

Alluxio is Open-Source Data Orchestration
Data Orchestration for the Cloud
Application interfaces: Java File API, HDFS Interface, S3 Interface, REST API, POSIX Interface
Under store drivers: HDFS Driver, GCS Driver, S3 Driver, Azure Driver

The Alluxio Story
2013: Originated as the Tachyon project at UC Berkeley AMPLab by Ph.D. student Haoyuan (H.Y.) Li, now Alluxio CTO
2015: Open source project established & company to commercialize Alluxio founded
2018
2019
Goal: Orchestrate Data at Memory Speed for the Cloud for data-driven apps such as Big Data Analytics, ML and AI.
2019 Top 10 Big Data
2019 Top 10 Cloud Software

Fast-growing Open Source Community
4000+ GitHub Stars
1000+ Contributors
Apache 2.0 Licensed
Join the community on Slack: alluxio.io/slack
Contribute to source code: github.com/alluxio/alluxio

Community Across Industries
Consumer
Travel & Transportation
Telco & Media
Healthcare
Technology
Financial Services
Retail & Entertainment
Data & Analytics Services

Data Locality via Intelligent Multi-tiering
§ Local performance from remote data using multi-tier storage
§ Tiers: RAM (hot), SSD (warm), HDD (cold)
§ Read & write buffering, transparent to the app
§ Policies for pinning, promotion/demotion, TTL
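The pinning, promotion/demotion, and TTL policies named on this slide can be illustrated with a toy model. This is a minimal sketch, not Alluxio code: the class `TieredStore`, its methods, the fixed per-tier capacity, and the least-recently-used demotion rule are all assumptions made for illustration.

```python
import time
from collections import OrderedDict

TIERS = ["MEM", "SSD", "HDD"]  # hot, warm, cold


class TieredStore:
    """Toy multi-tier block store (hypothetical; not the Alluxio API)."""

    def __init__(self, capacity_per_tier):
        self.capacity = capacity_per_tier
        # One dict per tier, ordered from least to most recently used.
        # Each maps block_id -> expiry timestamp (None means no TTL).
        self.tiers = [OrderedDict() for _ in TIERS]
        self.pinned = set()

    def tier_of(self, block_id):
        """Return the name of the tier holding the block, or None."""
        for name, tier in zip(TIERS, self.tiers):
            if block_id in tier:
                return name
        return None

    def pin(self, block_id):
        """Pinned blocks are never demoted in this toy model."""
        self.pinned.add(block_id)

    def put(self, block_id, ttl_seconds=None, now=None):
        """Write a block into the top tier, optionally with a TTL."""
        now = time.time() if now is None else now
        expiry = None if ttl_seconds is None else now + ttl_seconds
        for tier in self.tiers:
            tier.pop(block_id, None)
        self._insert(0, block_id, expiry)

    def read(self, block_id, now=None):
        """Return the tier the block was found in and promote it to MEM.

        Returns None if the block is absent or its TTL has expired.
        """
        now = time.time() if now is None else now
        for name, tier in zip(TIERS, self.tiers):
            if block_id in tier:
                expiry = tier.pop(block_id)
                if expiry is not None and expiry <= now:
                    return None  # TTL expired: the block is dropped
                self._insert(0, block_id, expiry)  # promotion
                return name
        return None

    def _insert(self, level, block_id, expiry):
        if level >= len(self.tiers):
            return  # fell off the coldest tier: evicted
        tier = self.tiers[level]
        tier[block_id] = expiry
        while len(tier) > self.capacity:
            # Demote the least recently used block that is not pinned.
            victim = next((b for b in tier if b not in self.pinned), None)
            if victim is None:
                break  # everything is pinned; the toy allows overflow
            self._insert(level + 1, victim, tier.pop(victim))


store = TieredStore(capacity_per_tier=2)
for block in ("a", "b", "c"):
    store.put(block, now=0)
print(store.tier_of("a"))      # "a" was demoted when "c" arrived
print(store.read("a", now=1))  # found in the warm tier, then promoted
print(store.tier_of("a"))
```

Passing `now` explicitly keeps the example deterministic; a real system would use its own clock and far richer eviction policies.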

Data Accessibility via popular APIs
Spark
> rdd = sc.textFile("alluxio://master:19998/myInput")
Presto
> CREATE SCHEMA hive.web
> WITH (location = 'alluxio://master:19998/my-table/')
Bash
~$ cat /mnt/alluxio/myInput
Tensorflow
~$ python classify_image.py --model_dir /mnt/fuse/imagenet/
Java
FileSystem fs = FileSystem.Factory.get();
FileInStream in = fs.openFile(new AlluxioURI("/myInput"));

Data Abstraction via Unified Namespace
Enables effective data management across different under stores
$ ./bin/alluxio fs mount /Data s3://bucket/directory
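To make the mount command above concrete, here is a minimal sketch of how a unified namespace could translate a path under a mount point into an under store URI. It is an illustration only: `MountTable`, the longest-prefix matching rule, and the second (HDFS) mount are assumptions, not Alluxio's implementation.

```python
class MountTable:
    """Toy mount table (hypothetical; not the Alluxio implementation)."""

    def __init__(self):
        self._mounts = {}  # namespace path -> under store URI

    def mount(self, alluxio_path, ufs_uri):
        """Attach an under store location at a path in the namespace."""
        key = alluxio_path.rstrip("/") or "/"
        self._mounts[key] = ufs_uri.rstrip("/")

    def resolve(self, path):
        """Map a namespace path to its under store URI.

        Uses the longest matching mount point, so nested mounts win
        over the mounts that contain them.
        """
        best = None
        for mount_point in self._mounts:
            prefix = mount_point if mount_point.endswith("/") else mount_point + "/"
            if path == mount_point or path.startswith(prefix):
                if best is None or len(mount_point) > len(best):
                    best = mount_point
        if best is None:
            raise KeyError(f"no mount point covers {path}")
        suffix = path[len(best):].lstrip("/")
        base = self._mounts[best]
        return f"{base}/{suffix}" if suffix else base


table = MountTable()
# Mirrors the slide: ./bin/alluxio fs mount /Data s3://bucket/directory
table.mount("/Data", "s3://bucket/directory")
# Hypothetical second under store, added only to show nesting.
table.mount("/Data/archive", "hdfs://namenode:8020/archive")

print(table.resolve("/Data/2019/part-0.parquet"))
print(table.resolve("/Data/archive/old.csv"))
```

Applications only ever see paths such as `/Data/...`; which storage system backs each subtree is decided by the mount table.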

Typical Use Cases
Cloud Analytics Caching
Get in-memory data access for Spark, Presto,
or any analytics framework on Cloud storage
Hybrid Cloud Analytics
Get in-memory data access for Spark, Presto,
or any analytics framework on Cloud storage

Deployment Approaches
Option 1: Co-locate Alluxio Workers with Spark for optimal I/O performance
Stack: Spark / Alluxio / Storage; same instance / container; any cloud
Option 2: Deploy Alluxio as standalone cluster between Spark and Storage
Stack: Spark / Alluxio / Storage; same data center / region; any cloud
Presto

Use Case | On-premise Caching for Presto
NetEase Games
Leading Online Game Company in China
§ Before (Presto on HDFS): large query variance during peak hours
§ After (Presto on Alluxio on HDFS): Alluxio brings data local to Presto to reduce the latency during peak hours
https://www.alluxio.io/blog/presto-on-alluxio-how-netease-games-leveraged-alluxio-to-boost-ad-hoc-sql-on-hdfs/

Architecture: Colocate Alluxio with Presto
• Black/Red line – Large Query variance without Alluxio
• Green line - Stable query time with Alluxio

JD.com | $70B e-commerce retailer
Hadoop Offload Use Case
Project:
• Offload HDFS with separate clusters of Presto and Spark
Problem:
• HDFS cluster is compute and network bound
• Performance is inconsistent
Alluxio solution:
• Alluxio offloads the network I/O as well as the compute
Result:
• Teams can run additional workloads without taxing the existing HDFS cluster
Diagram (before): Separate Compute (PRESTO, SPARK) reads from the 3000 Node HDFS in the Datacenter
Diagram (after): Separate Compute (PRESTO, SPARK) with ALLUXIO in front of the 3000 Node HDFS in the Datacenter
https://www.slideshare.net/Alluxio/alluxio-in-jd

Performance Evaluation
• Yellow line: stable query time with Alluxio, < 1 sec after the first query (cold read)
• Green line: JD Presto without Alluxio, > 10 sec

Alluxio Reference Architecture
Applications: Application … Application
Masters: Alluxio Master and Standby Master (Zookeeper / RAFT)
Workers: Alluxio Worker … Alluxio Worker
Under stores: Under Store 1, Under Store 2

Read data in Alluxio, on same node as client
Memory Speed Read of Data
Components: Application with Alluxio Client, Alluxio Master, Alluxio Worker (RAM / SSD / HDD)

Read data not in Alluxio
Network / Disk Speed Read of Data
Components: Application with Alluxio Client, Alluxio Master, Alluxio Worker (RAM / SSD / HDD), Under Store
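The two read paths above (data already in Alluxio versus data only in the under store) can be sketched as a read-through cache. This is a simplified illustration under stated assumptions: `UnderStore` and `CachingWorker` are hypothetical names, the master's metadata lookup is left out, and caching the data on the first read is assumed here rather than taken from the slides.

```python
class UnderStore:
    """Stand-in for remote storage such as HDFS or S3 (hypothetical)."""

    def __init__(self, files):
        self.files = dict(files)
        self.reads = 0  # counts how often remote storage is touched

    def read(self, path):
        self.reads += 1
        return self.files[path]


class CachingWorker:
    """Toy worker that serves reads locally when it can (hypothetical)."""

    def __init__(self, under_store):
        self.under_store = under_store
        self.cache = {}

    def read(self, path):
        """Return (data, source) where source names who served the read."""
        if path in self.cache:
            # Data is in the worker: served at memory speed.
            return self.cache[path], "alluxio"
        # Data is not in the worker: fetch at network / disk speed,
        # then keep a copy so later reads are local.
        data = self.under_store.read(path)
        self.cache[path] = data
        return data, "under_store"


ufs = UnderStore({"/myInput": b"hello"})
worker = CachingWorker(ufs)
print(worker.read("/myInput"))  # cold read, served from the under store
print(worker.read("/myInput"))  # warm read, served from the worker
print(ufs.reads)                # the under store was read only once
```

Only the first access pays the remote-read cost, which is the pattern behind the cold-read note on the Performance Evaluation slide.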

Write data only to Alluxio on same node as client
Memory Speed Write of Data
Components: Application with Alluxio Client, Alluxio Master, Alluxio Worker (RAM / SSD / HDD)

Write data to Alluxio and Under Store synchronously
Network / Disk Speed Write of Data
Components: Application with Alluxio Client, Alluxio Master, Alluxio Worker (RAM / SSD / HDD), Under Store
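The difference between the two write paths above can be sketched with a toy worker. The enum members borrow the names of Alluxio's MUST_CACHE and CACHE_THROUGH write types, but everything else (`WritingWorker`, the dict-based stores, the return value) is a hypothetical simplification, not the Alluxio client API.

```python
from enum import Enum


class WriteType(Enum):
    MUST_CACHE = "must_cache"        # write only to Alluxio
    CACHE_THROUGH = "cache_through"  # write to Alluxio and the under store


class WritingWorker:
    """Toy worker with a local cache and an under store (hypothetical)."""

    def __init__(self):
        self.cache = {}        # stands in for RAM / SSD / HDD on the worker
        self.under_store = {}  # stands in for remote persistent storage

    def write(self, path, data, write_type):
        """Store the data and report whether it was persisted."""
        self.cache[path] = data
        if write_type is WriteType.CACHE_THROUGH:
            # Synchronous persist: the write returns only after the
            # under store has the data, so it runs at network / disk speed.
            self.under_store[path] = data
        return path in self.under_store


worker = WritingWorker()
print(worker.write("/tmp/scratch", b"x", WriteType.MUST_CACHE))
print(worker.write("/out/result", b"y", WriteType.CACHE_THROUGH))
```

The trade-off is speed versus durability: a cache-only write is fast but lives only in the worker, while a synchronous write-through survives loss of the worker at the cost of under store latency.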

Alluxio 2.0 & Coming in 2.1 Release
§ Alluxio 2.0: Released in July
  § Metadata scales to 1 billion files or more (based on RocksDB)
  § Self-managed metadata service based on quorum
  § Async writes, distributed load
  § Many more: https://www.alluxio.io/download/releases/alluxio-2-0-0-release/
§ Alluxio 2.1: Scheduled for September
  § A Presto-Alluxio connector with Iceberg integration
  § Use Alluxio as a caching layer without modifying HMS
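For the quorum-based metadata service listed under Alluxio 2.0, the usual arithmetic behind a majority quorum is worth spelling out. The sketch below assumes a Raft-style majority rule; the slides do not state the exact rule Alluxio uses, and the function names are hypothetical.

```python
def quorum_size(num_masters):
    """Smallest number of masters that forms a strict majority."""
    if num_masters < 1:
        raise ValueError("need at least one master")
    return num_masters // 2 + 1


def tolerated_failures(num_masters):
    """How many masters can fail while a majority is still reachable."""
    return num_masters - quorum_size(num_masters)


for n in (1, 3, 4, 5):
    print(n, quorum_size(n), tolerated_failures(n))
```

Under this rule an even-sized group tolerates no more failures than the odd-sized group one smaller, which is why such deployments typically use an odd number of masters.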

Next steps - Try it out!
• Getting Started
• Try the 10-minute Alluxio & Presto tutorial on a laptop
• Try the 10-minute Alluxio & Presto tutorial on AWS
• Top 5 performance tips for running Presto on Alluxio
Questions or suggestions? Engage with us at alluxio.io/slack!

Questions
Slides will be available in the Slack channel (https://alluxio.io/slack)
