Alluxio Open Source Community Office Hour:
Burst Presto & Spark to AWS EMR
Open source project started at UC Berkeley's AMPLab, with strong momentum and a growing community
▪ 1,000+ contributors and growing
▪ 4,000+ GitHub stars
▪ Apache 2.0 licensed
▪ Millions of downloads
Alluxio Use Cases
▪ Burst big data workloads in hybrid cloud environments (Presto with Alluxio in the public cloud, data on-premise)
▪ Dramatically speed up big data on object stores on-premise (Presto with Alluxio in the same container or machine)
▪ Accelerate big data frameworks on the public cloud (Spark with Alluxio in the same instance or container)
Intro to EMR
▪ AWS Provided and Managed Hadoop Services
▪ Spark, HDFS, Presto, Hive
▪ Easy to configure and onboard
▪ Does the work for you
▪ Elastic and Flexible
EMR Service Integration: Bootstrap Actions
▪ EMR hooks into the main configuration files for Hadoop services:
  ▪ hive-site.xml, core-site.xml, hadoop-env.sh, hive.properties
▪ Bootstrap Actions
  ▪ Alluxio can be deployed using a bootstrap action
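
As a rough illustration of the bootstrap-action hook, the sketch below builds the keyword arguments that boto3's EMR `run_job_flow` call accepts, with one bootstrap action that would run an Alluxio install script on each node. Nothing is sent to AWS here. The cluster name, release label, S3 script path, and the argument handed to the script are placeholders and not values from the talk; the Alluxio EMR guide listed under Additional Resources gives the actual script location and arguments.

```python
def build_emr_request(bootstrap_script, root_ufs_uri):
    """Build keyword arguments for boto3's emr.run_job_flow (not sent here)."""
    return {
        "Name": "alluxio-burst-demo",        # hypothetical cluster name
        "ReleaseLabel": "emr-5.29.0",        # assumed release label
        "Applications": [{"Name": n} for n in ("Spark", "Presto", "Hive")],
        "Instances": {
            "InstanceGroups": [
                {"InstanceRole": "MASTER", "InstanceType": "r5.4xlarge", "InstanceCount": 1},
                {"InstanceRole": "CORE", "InstanceType": "r5.4xlarge", "InstanceCount": 10},
            ],
            "KeepJobFlowAliveWhenNoSteps": True,
        },
        # The bootstrap action runs on every node before the applications start.
        "BootstrapActions": [
            {
                "Name": "Install Alluxio",
                "ScriptBootstrapAction": {
                    "Path": bootstrap_script,   # placeholder script location
                    "Args": [root_ufs_uri],     # assumed: script takes the under-store URI
                },
            }
        ],
        "JobFlowRole": "EMR_EC2_DefaultRole",
        "ServiceRole": "EMR_DefaultRole",
    }


request = build_emr_request("s3://my-bucket/alluxio-emr.sh", "hdfs://ns/")
print(request["BootstrapActions"][0]["ScriptBootstrapAction"]["Path"])
```

With boto3 installed and credentials configured, the dictionary could be passed as `boto3.client("emr").run_job_flow(**request)`. The instance layout mirrors the 10+1 r5.4xlarge setup quoted in the benchmark slides.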
Hybrid Cloud Architecture
Burst workloads to EMR without manual copies and synchronization
A. Meta-data Locality with Active Sync
Synchronize Metadata for On-premise Mutations
▪ HDFS iNotify-based metadata synchronization
▪ Mutation on-premise: old file at path /file1 -> new file at path /file1
▪ Alluxio Master updates its metadata for /file1 through the sync
▪ Policies for pinning, promotion/demotion, TTL
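
The idea of event-driven metadata synchronization can be sketched with a toy cache. This is a conceptual model only, not Alluxio's Active Sync implementation: the `MetadataCache` class, the event names, and the `fetch` callback are all invented for illustration. A change notification marks a path as stale, and the next access reloads its metadata from the on-premise store.

```python
from dataclasses import dataclass


@dataclass
class FileMeta:
    path: str
    length: int
    mtime: int


class MetadataCache:
    """Toy stand-in for a master's cached view of on-premise file metadata."""

    def __init__(self):
        self.entries = {}   # path -> FileMeta
        self.stale = set()  # paths whose cached metadata may be outdated

    def on_event(self, kind, path):
        """Apply one change notification (hypothetical event names)."""
        if kind == "delete":
            self.entries.pop(path, None)
            self.stale.discard(path)
        elif kind in ("create", "append", "close"):
            # Metadata changed on-premise: refresh lazily on next access.
            self.stale.add(path)

    def get(self, path, fetch):
        """Return metadata, calling fetch(path) only when missing or stale."""
        if path in self.stale or path not in self.entries:
            self.entries[path] = fetch(path)
            self.stale.discard(path)
        return self.entries[path]


# On-premise namespace, modeled as a plain dict.
on_prem = {"/file1": FileMeta("/file1", length=100, mtime=1)}
cache = MetadataCache()
old = cache.get("/file1", on_prem.__getitem__)

# /file1 is overwritten on-premise and a notification arrives.
on_prem["/file1"] = FileMeta("/file1", length=250, mtime=2)
cache.on_event("close", "/file1")
new = cache.get("/file1", on_prem.__getitem__)
print(old.length, new.length)
```

Without the notification, the second `get` would keep returning the cached 100-byte entry, which is the staleness problem that synchronization addresses.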
B. Data Locality with Intelligent Multi-tiering
Local performance from remote data using multi-tier storage
▪ Hot (RAM), Warm (SSD), Cold (HDD)
▪ Read & write buffering
▪ Transparent to the application
▪ Policies for pinning, promotion/demotion, TTL
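
A minimal sketch of promotion and demotion across tiers, assuming a simple least-recently-used policy. The `TieredStore` class and its methods are hypothetical and far simpler than a real tiered store; TTL is left out. Reads promote a block to the hottest tier, and overflow demotes the least recently used unpinned block one tier down.

```python
from collections import OrderedDict


class TieredStore:
    """Toy multi-tier block store, hottest tier first, LRU demotion."""

    def __init__(self, capacities):
        # capacities: list of (tier name, max number of blocks), hottest first
        self.tiers = [(name, cap, OrderedDict()) for name, cap in capacities]
        self.pinned = set()

    def pin(self, block_id):
        """Pinned blocks are never chosen for demotion."""
        self.pinned.add(block_id)

    def _insert(self, level, block_id, data):
        if level >= len(self.tiers):
            return  # evicted from the coldest tier; a later read refetches it
        _, cap, blocks = self.tiers[level]
        blocks[block_id] = data
        blocks.move_to_end(block_id)
        while len(blocks) > cap:
            victim = next((b for b in blocks if b not in self.pinned), None)
            if victim is None:
                break  # everything is pinned; the toy tolerates the overflow
            self._insert(level + 1, victim, blocks.pop(victim))

    def read(self, block_id, fetch_remote):
        """Return block data, promoting it to the hottest tier."""
        for _, _, blocks in self.tiers:
            if block_id in blocks:
                data = blocks.pop(block_id)
                break
        else:
            data = fetch_remote(block_id)  # miss in every tier
        self._insert(0, block_id, data)
        return data

    def tier_of(self, block_id):
        for name, _, blocks in self.tiers:
            if block_id in blocks:
                return name
        return None


remote_reads = []

def fetch_remote(block_id):
    remote_reads.append(block_id)
    return f"data-{block_id}"

store = TieredStore([("RAM", 1), ("SSD", 1), ("HDD", 1)])
for block in ("a", "b", "c"):
    store.read(block, fetch_remote)
store.read("a", fetch_remote)  # served from HDD, then promoted to RAM
print(store.tier_of("a"), store.tier_of("c"), store.tier_of("b"), len(remote_reads))
```

The fourth read does not touch the remote store, which is the sense in which the tiers give local performance on remote data.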
C. No Hive Table Redefinitions in the Public Cloud
Feature Highlight - Use Transparent URI
• Begin with data and metadata on-premises
  • HDFS has data on-premises
  • Hive Metastore has metadata with location hdfs://ns/table
• Launch a cluster in the public cloud
  • Presto catalog on EMR points to the Hive Metastore on-premises
  • Configure the catalog to use Alluxio Transparent URI
  • Alluxio intercepts Presto calls to hdfs://ns/table
• Start querying in the public cloud
  • Accesses to HDFS on-premises are now served by Alluxio
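
The effect of a transparent URI can be shown with a conceptual sketch: the caller keeps using the `hdfs://ns/table` location stored in the Hive Metastore, and a shim in front of the remote cluster answers repeated reads from a local cache. The `RemoteHdfs` and `CachingShim` classes are invented for this illustration and do not reflect how Alluxio implements the feature or how it is configured.

```python
from urllib.parse import urlparse


class RemoteHdfs:
    """Stand-in for the on-premise HDFS cluster; counts remote reads."""

    def __init__(self, files):
        self.files = files
        self.reads = 0

    def open(self, path):
        self.reads += 1
        return self.files[path]


class CachingShim:
    """Accepts unchanged hdfs:// URIs and caches what it reads remotely."""

    def __init__(self, clusters):
        self.clusters = clusters  # authority (e.g. "ns") -> RemoteHdfs
        self.cache = {}

    def open(self, uri):
        parsed = urlparse(uri)
        if parsed.scheme != "hdfs":
            raise ValueError(f"unsupported scheme: {parsed.scheme!r}")
        key = (parsed.netloc, parsed.path)
        if key not in self.cache:
            # First access crosses the network to the on-premise cluster.
            self.cache[key] = self.clusters[parsed.netloc].open(parsed.path)
        return self.cache[key]


on_prem = RemoteHdfs({"/table/part-0": b"rows"})
fs = CachingShim({"ns": on_prem})

# The table location is exactly what the Hive Metastore stores; no redefinition.
first = fs.open("hdfs://ns/table/part-0")
second = fs.open("hdfs://ns/table/part-0")
print(on_prem.reads)
```

The table location never changes, so the Hive Metastore entry stays valid for both the on-premise and the cloud cluster.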
Benchmark Report - TPC-DS
▪ Setup
  ▪ 10+1 r5.4xlarge instances in both clusters
  ▪ Latency: 175ms
▪ Average Improvement [chart]
▪ Maximum Improvement - All Queries
  ▪ q9 (7.1x)
▪ Maximum Improvement - By Class
  ▪ Reporting: q27 (3.1x)
  ▪ Interactive: q73 (3.9x)
  ▪ Deep Analytics: q34 (4.2x)
Additional Resources
▪ “Zero-Copy” Hybrid Cloud for Data Analytics - Strategy, Architecture and Benchmark Report
https://www.alluxio.io/resources/whitepapers/zero-copy-hybrid-cloud-for-data-analytics-strategy-architecture-and-benchmark-report/
▪ Running Presto with Alluxio
https://docs.alluxio.io/os/user/stable/en/compute/Presto.html
▪ Using Transparent URI
https://docs.alluxio.io/ee/user/stable/en/operation/Transparent-Uri.html
▪ Top 5 performance tips running Presto with Alluxio
https://www.alluxio.io/blog/top-5-performance-tuning-tips-for-running-presto-on-alluxio-1
▪ Getting Started with EMR and Alluxio
https://docs.alluxio.io/os/user/stable/en/cloud/AWS-EMR.html
https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-gs.html
Questions?
How are you using EMR?
Join the Alluxio Open Source Community!
www.alluxio.io | @alluxio

Burst Presto & Spark workloads to AWS EMR with no data copies
