Building Fast SQL Analytics on Anything with Presto, Alluxio
Bin Fan | Founding Engineer @ Alluxio
2019/08/20
Alluxio Overview
Alluxio is Open-Source Data Orchestration
Data Orchestration for the Cloud
Java File API, HDFS Interface, S3 Interface, REST API, POSIX Interface
HDFS Driver, GCS Driver, S3 Driver, Azure Driver
The Alluxio Story
2013: Originated as the Tachyon project at UC Berkeley AMPLab by Ph.D. student Haoyuan (H.Y.) Li, now Alluxio CTO
2015: Open Source project established & company to commercialize Alluxio founded
Goal: Orchestrate Data at Memory Speed for the Cloud
for data driven apps such as Big Data Analytics, ML and AI.
2018
2019
2019 Top 10 Big Data
2019 Top 10 Cloud Software
Fast-growing Open Source Community
4000+ GitHub Stars, 1000+ Contributors
Join the community on Slack
alluxio.io/slack
Apache 2.0 Licensed
Contribute to source code
github.com/alluxio/alluxio
Community Across Industries
Consumer, Travel & Transportation, Telco & Media, Healthcare, Technology, Financial Services, Retail & Entertainment, Data & Analytics Services
Data Locality via Intelligent Multi-tiering
§ Local performance from remote data using multi-tier storage
RAM (Hot), SSD (Warm), HDD (Cold)
Read & Write Buffering
Transparent to App
Policies for pinning,
promotion/demotion, TTL
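
The tiering policies named on this slide can be illustrated with a small toy model. The sketch below is not Alluxio's implementation: the class `TieredCache`, its methods, and the block-count capacities are all hypothetical names chosen for illustration. It assumes least-recently-used demotion, promotion to the hottest tier on every read, pinned blocks that are skipped when choosing a block to demote, and a per-block TTL.

```python
import time
from collections import OrderedDict


class TieredCache:
    """Toy multi-tier block store: tier 0 is the hottest (e.g. RAM), the last is the coldest."""

    def __init__(self, capacities):
        # One LRU-ordered dict per tier; capacities are counted in blocks (an assumption).
        self.capacities = list(capacities)
        self.tiers = [OrderedDict() for _ in self.capacities]
        self.pinned = set()
        self.expiry = {}  # block id -> absolute expiry time in seconds

    def put(self, block_id, data, ttl=None, now=None):
        now = time.time() if now is None else now
        self._remove(block_id)
        self.expiry.pop(block_id, None)
        if ttl is not None:
            self.expiry[block_id] = now + ttl
        self._insert(0, block_id, data)

    def get(self, block_id, now=None):
        now = time.time() if now is None else now
        if block_id in self.expiry and now >= self.expiry[block_id]:
            # TTL elapsed: drop the block from every tier.
            self._remove(block_id)
            self.expiry.pop(block_id, None)
            return None
        for tier in self.tiers:
            if block_id in tier:
                data = tier.pop(block_id)
                self._insert(0, block_id, data)  # promote to the hottest tier on access
                return data
        return None

    def pin(self, block_id):
        self.pinned.add(block_id)

    def tier_of(self, block_id):
        for level, tier in enumerate(self.tiers):
            if block_id in tier:
                return level
        return None

    def _remove(self, block_id):
        for tier in self.tiers:
            tier.pop(block_id, None)

    def _insert(self, level, block_id, data):
        tier = self.tiers[level]
        tier[block_id] = data
        if len(tier) <= self.capacities[level]:
            return
        # Over capacity: demote the least recently used unpinned block.
        # If every block in the tier is pinned, the incoming block moves down instead.
        candidates = [b for b in tier if b not in self.pinned] or [block_id]
        victim = candidates[0]
        victim_data = tier.pop(victim)
        if level + 1 < len(self.tiers):
            self._insert(level + 1, victim, victim_data)
        else:
            self.expiry.pop(victim, None)  # fell off the coldest tier: evicted
```

With `TieredCache([1, 2])`, writing blocks `a` then `b` pushes `a` down one tier, and reading `a` again brings it back to tier 0 while `b` moves down.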
Data Accessibility via popular APIs
Spark
> rdd = sc.textFile("alluxio://master:19998/myInput")
Presto
> CREATE SCHEMA hive.web
> WITH (location = 'alluxio://master:19998/my-table/')
Bash
~$ cat /mnt/alluxio/myInput
Tensorflow
~$ python classify_image.py --model_dir /mnt/fuse/imagenet/
Java
FileSystem fs = FileSystem.Factory.get();
FileInStream in = fs.openFile(new AlluxioURI("/myInput"));
Data Abstraction via Unified Namespace
Enables effective data management across different under stores
$ ./bin/alluxio fs mount /Data s3://bucket/directory
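
One way to picture what a mount does to the namespace is a mount table that maps an Alluxio path to an under store URI by longest matching mount point. The sketch below is a simplified model under that assumption, not Alluxio's code; `MountTable` and its methods are hypothetical, and the root under store address used in the usage note is made up.

```python
class MountTable:
    """Toy mount table: maps Alluxio paths to under store URIs."""

    def __init__(self, root_ufs):
        # The root mount backs every path that no other mount point covers.
        self.mounts = {"/": root_ufs.rstrip("/")}

    def mount(self, alluxio_path, ufs_uri):
        self.mounts[alluxio_path.rstrip("/") or "/"] = ufs_uri.rstrip("/")

    def resolve(self, alluxio_path):
        # The longest mount point that is a whole-component prefix of the path wins.
        best = "/"
        for point in self.mounts:
            if point == "/":
                continue
            if alluxio_path == point or alluxio_path.startswith(point + "/"):
                if len(point) > len(best):
                    best = point
        suffix = alluxio_path if best == "/" else alluxio_path[len(best):]
        suffix = suffix.lstrip("/")
        base = self.mounts[best]
        return base + "/" + suffix if suffix else base
```

For example, after `table = MountTable("hdfs://namenode:9000/warehouse")` and `table.mount("/Data", "s3://bucket/directory")`, the path `/Data/2019/part-0` resolves into the S3 bucket, while `/logs/a` still resolves under the root HDFS location.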
Use Cases
Typical Use Cases
Cloud Analytics Caching
Get in-memory data access for Spark, Presto,
or any analytics framework on Cloud storage
Hybrid Cloud Analytics
Get in-memory data access for Spark, Presto,
or any analytics framework on Cloud storage
Deployment Approaches
Option 1: Co-locate Alluxio Workers with Spark for optimal I/O performance
Spark and Alluxio on the same instance / container; Storage on any cloud
Option 2: Deploy Alluxio as standalone cluster between Spark and Storage
Spark and Alluxio in the same data center / region; Storage on any cloud
Presto
Use Case | On-premise Caching for Presto
NetEase Games
Leading Online Game Company in China
§ Before Alluxio: large query variance during peak hours
§ Alluxio brings data local to Presto to reduce the latency during peak hours
[Diagram: Presto reading from HDFS directly, and Presto with Alluxio in front of HDFS]
https://www.alluxio.io/blog/presto-on-alluxio-how-netease-games-leveraged-alluxio-to-boost-ad-hoc-sql-on-hdfs/
Architecture: Colocate Alluxio with Presto
• Black/Red line – Large Query variance without Alluxio
• Green line - Stable query time with Alluxio
JD.com | $70B e-commerce retailer
Hadoop Offload Use Case
Project:
• Offload HDFS with separate clusters of Presto and Spark
Problem:
• HDFS cluster is compute and network bound
• Performance is inconsistent
Alluxio solution:
• Alluxio offloads the network I/O as well as the compute
Result:
• Teams can run additional workloads without taxing the existing HDFS cluster
[Diagram: separate Presto and Spark compute reading a 3000-node HDFS in the datacenter, shown without Alluxio and with Alluxio in between]
https://www.slideshare.net/Alluxio/alluxio-in-jd
Performance Evaluation
• Yellow line - Stable query time with Alluxio
• < 1sec after first query (cold read)
• Green line – JD Presto without Alluxio : > 10sec
Alluxio Reference Architecture
Applications: Application, Application, …
Alluxio: Alluxio Master, Standby Master (Zookeeper / RAFT), Alluxio Worker, Alluxio Worker, …
Under stores: Under Store 1, Under Store 2
Read data in Alluxio, on same node as client
Memory Speed Read of Data
Components: Application, Alluxio Client, Alluxio Master, Alluxio Worker (RAM / SSD / HDD)
Read data not in Alluxio
Network / Disk Speed Read of Data
Components: Application, Alluxio Client, Alluxio Master, Alluxio Worker (RAM / SSD / HDD), Under Store
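
The two read paths above amount to a read-through cache: a hit is served from worker storage, a miss goes to the under store and the result is kept for next time. The sketch below is a minimal model of that idea only; `UnderStore` and `Worker` are hypothetical classes, and the master's metadata lookup and block-level granularity are deliberately left out.

```python
class UnderStore:
    """Stand-in for remote storage such as HDFS or S3; counts how often it is read."""

    def __init__(self, files):
        self.files = dict(files)
        self.reads = 0

    def read(self, path):
        self.reads += 1
        return self.files[path]


class Worker:
    """Toy worker that serves cached data and fills its cache on a miss."""

    def __init__(self, under_store):
        self.under_store = under_store
        self.cache = {}  # path -> data held in worker storage (RAM / SSD / HDD)

    def read(self, path):
        if path in self.cache:
            # Data already in worker storage: no under store access needed.
            return self.cache[path], "alluxio"
        # Cold read: fetch from the under store, then cache for later reads.
        data = self.under_store.read(path)
        self.cache[path] = data
        return data, "under_store"
```

Reading the same path twice through one `Worker` touches the under store only once; the second read is reported as served from `"alluxio"`.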
Write data only to Alluxio on same node as client
Memory Speed Write of Data
Components: Application, Alluxio Client, Alluxio Master, Alluxio Worker (RAM / SSD / HDD)
Write data to Alluxio and Under Store synchronously
Network / Disk Speed Write of Data
Components: Application, Alluxio Client, Alluxio Master, Alluxio Worker (RAM / SSD / HDD), Under Store
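
The difference between the two write paths can be sketched as a choice of write type. The example below is a toy model, not Alluxio's client API: `WritePath` is hypothetical, and the enum member names are chosen to echo Alluxio's write type names as an assumption for illustration. The only behavior shown is whether the write also reaches the under store before returning.

```python
from enum import Enum


class WriteType(Enum):
    MUST_CACHE = "must_cache"        # write to Alluxio worker storage only
    CACHE_THROUGH = "cache_through"  # write to Alluxio and the under store synchronously


class WritePath:
    """Toy model of the two write paths; both stores are plain dicts."""

    def __init__(self):
        self.worker_storage = {}
        self.under_store = {}

    def write(self, path, data, write_type):
        # Every write lands in worker storage first.
        self.worker_storage[path] = data
        if write_type is WriteType.CACHE_THROUGH:
            # Synchronous persist: the call returns only after the under store has the data.
            self.under_store[path] = data
        # True when the data has also been persisted to the under store.
        return path in self.under_store
```

In this model a `MUST_CACHE` write returns `False` because nothing was persisted, which is the trade-off the slides describe: memory-speed writes versus network or disk speed writes that survive the loss of a worker.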
Alluxio 2.0 & Coming in 2.1 Release
§ Alluxio 2.0: Released in July
§ Metadata scales to 1 billion files or more (based on RocksDB)
§ Self-managed metadata service based on quorum
§ Async writes, distributed load
§ Many more: https://www.alluxio.io/download/releases/alluxio-2-0-0-release/
§ Alluxio 2.1: Scheduled in Sept
§ A Presto-Alluxio Connector with Iceberg Integration
§ Use Alluxio as a caching layer without modifying HMS
Next steps - Try it out!
• Getting Started
• Try the 10-minute Alluxio & Presto tutorial on a laptop
• Try the 10-minute Alluxio & Presto tutorial on AWS
• Top 5 performance tips for running Presto on Alluxio
Questions or Suggestions? Engage with us at alluxio.io/slack!
Questions
Slides will be available in the Slack channel (https://alluxio.io/slack)
