Powering interactive analytics
with Alluxio and Presto
Dmytro Dermanskyi
DATA ORCHESTRATION SUMMIT 2020
© All Rights Reserved
WalkMe: Digital Adoption Platform
Insights: WalkMe’s web analytics solution
Funnels analysis
Funnel example: “Submit support ticket” journey
User must perform these steps to submit a support ticket:
1. Click “Contact support” link
2. Visit “Submit a ticket” page
3. Fill in the details and open the ticket
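A minimal sketch of how such a funnel could be counted, assuming a hypothetical event log of (user, timestamp, event) tuples. The event names and the `funnel_counts` helper are illustrative only; they are not WalkMe's schema, and this is not the SQL used in the talk.

```python
from collections import defaultdict

# Hypothetical event names for the three steps of the funnel.
FUNNEL_STEPS = ["click_contact_support", "visit_submit_ticket", "open_ticket"]


def funnel_counts(events, steps=FUNNEL_STEPS):
    """Count users who reached each funnel step, in order.

    `events` is an iterable of (user_id, timestamp, event_name) tuples.
    A user reaches step N only after reaching steps 1..N-1 earlier in time.
    """
    per_user = defaultdict(list)
    for user_id, ts, name in events:
        per_user[user_id].append((ts, name))

    counts = [0] * len(steps)
    for user_events in per_user.values():
        next_step = 0
        for _, name in sorted(user_events):
            if next_step < len(steps) and name == steps[next_step]:
                counts[next_step] += 1
                next_step += 1
    return dict(zip(steps, counts))


events = [
    ("u1", 1, "click_contact_support"),
    ("u1", 2, "visit_submit_ticket"),
    ("u1", 3, "open_ticket"),
    ("u2", 1, "click_contact_support"),
    ("u2", 2, "visit_submit_ticket"),
    ("u3", 1, "visit_submit_ticket"),  # skipped step 1, so not counted
]
print(funnel_counts(events))
```

In this sample two users reach the first two steps and one completes the journey; the drop-off between steps is what a funnel report visualizes.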
Funnel example: UI
Funnel example: SQL behind the scenes
Initial architecture
Challenges on the way to Alluxio
1. Glue table data location paths are fixed to the s3:// “filesystem”.
2. Alluxio metadata needs to be refreshed periodically to sync changes in
S3, since some S3 files are not written through Alluxio.
3. In a collocated deployment we need to carefully distribute memory
allocations between Alluxio and Presto processes. By default EMR gives
Presto JVMs most of the available RAM.
Problem 1: Pointing Glue tables to Alluxio
Solution:
1. Export the Glue catalog into a classic Hive metastore.
2. During the export, rewrite table location paths to use alluxio:// instead of
s3://.
3. Connect EMR Presto to the Hive metastore instead of the Glue catalog.
4. Repeat steps 1 and 2 periodically to sync changes in Glue (e.g. new
partitions, new tables) to the Hive metastore.
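The path rewrite in step 2 could look roughly like the sketch below. It assumes a hypothetical Alluxio master address and assumes each bucket is mounted in Alluxio under `/<bucket-name>`; neither detail comes from the slides, and the actual export job is not shown in the talk.

```python
from urllib.parse import urlparse

# Assumption: hypothetical master host; 19998 is Alluxio's default RPC port.
ALLUXIO_AUTHORITY = "alluxio-master:19998"


def to_alluxio_location(location, authority=ALLUXIO_AUTHORITY):
    """Rewrite an s3:// table location to alluxio://, leaving other schemes untouched.

    Assumes each S3 bucket is mounted in Alluxio under /<bucket-name>.
    """
    parsed = urlparse(location)
    if parsed.scheme not in ("s3", "s3a", "s3n"):
        return location
    return f"alluxio://{authority}/{parsed.netloc}{parsed.path}"


print(to_alluxio_location("s3://my-bucket/warehouse/events"))
```

The same function would be applied to every table and partition location while copying entries from Glue into the Hive metastore.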
Problem 1: Pointing Glue tables to Alluxio
Disadvantages of the solution:
● The Glue-to-Hive export job is complex.
● It is necessary to keep two table catalogs: Glue and the Hive metastore.
Resolving and syncing schema changes is difficult.
● The export job depends on the proprietary Glue API, which can change over
time and break the job.
Problem 1: Pointing Glue tables to Alluxio
Possible alternative solution: Alluxio Catalog Service
Just delegate the problem to Alluxio
https://docs.alluxio.io/os/user/stable/en/core-services/Catalog.html
Problem 2: Alluxio metadata sync for S3 changes
Solution:
alluxio fs ls [-R|-f]
Can be executed:
● Upon file change events from S3
● Periodically with crontab
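A small sketch of a wrapper that a cron job or an S3 event handler could call. It only shells out to the `alluxio fs ls` command shown above; the function names are hypothetical, and whether to pass `-R`, `-f`, or both is left configurable because the slide lists them as options.

```python
import subprocess


def build_sync_command(path, recursive=True, force=True):
    """Build the `alluxio fs ls` invocation from the slide for a given Alluxio path."""
    cmd = ["alluxio", "fs", "ls"]
    if recursive:
        cmd.append("-R")
    if force:
        cmd.append("-f")
    cmd.append(path)
    return cmd


def refresh_metadata(path):
    """Run the listing; output is discarded because only the metadata load matters."""
    subprocess.run(build_sync_command(path), check=True, stdout=subprocess.DEVNULL)


print(build_sync_command("/my-bucket/events"))
```

Calling `refresh_metadata` requires the `alluxio` CLI on the PATH; scoping the path to the changed prefix keeps each refresh cheap.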
Problem 3: Presto vs Alluxio memory contention
By default EMR gives Presto JVM processes most of the available RAM:

EC2 node type   RAM     EMR Presto -Xmx
c5.2xlarge      16GB    12GB
r5.4xlarge      128GB   100GB
c5.4xlarge      32GB    24.5GB
c4.8xlarge      60GB    47GB
c5.9xlarge      72GB    55GB

The remaining memory may not be enough for Alluxio processes, especially if
you want Alluxio to cache data in RAM.
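To make the contention concrete, the sketch below subtracts the default `-Xmx` from total RAM for each node type in the table. The numbers are the ones on the slide; the `headroom_gb` helper is only illustrative, and it ignores JVM off-heap usage, so the real headroom is smaller.

```python
# (node type, total RAM in GB, default EMR Presto -Xmx in GB), as listed on the slide.
NODES = [
    ("c5.2xlarge", 16, 12),
    ("r5.4xlarge", 128, 100),
    ("c5.4xlarge", 32, 24.5),
    ("c4.8xlarge", 60, 47),
    ("c5.9xlarge", 72, 55),
]


def headroom_gb(total_gb, presto_xmx_gb):
    """Memory left for the OS, Alluxio and everything else on the node."""
    return total_gb - presto_xmx_gb


for node, total, xmx in NODES:
    left = headroom_gb(total, xmx)
    print(f"{node}: {left} GB left ({left / total:.0%} of RAM)")
```

On every listed node type only about a fifth to a quarter of RAM remains for the operating system, the Alluxio worker, and any in-memory cache tier.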
Problem 3: Presto vs Alluxio memory contention
EMR doesn’t provide a means to adjust the default Presto JVM config, but we need
such control to distribute memory appropriately between the processes.
Problem 3: Presto vs Alluxio memory contention
Solution 1:
Don’t use EMR Presto; deploy our own Presto distribution instead:
● This brings additional complexity and more work.
● It also gives more flexibility in the choice of Presto version and
configuration.
Problem 3: Presto vs Alluxio memory contention
Solution 2:
Use an EMR bootstrap script to set up a cron job that fixes the EMR Presto JVM
config as soon as it is provisioned by EMR.
NOTE: EMR runs bootstrap scripts before installing Presto, so the bootstrap
script cannot edit the config directly.
● Allows reusing the EMR Presto distribution (no need to deploy our own Presto).
● “Hacky” solution.
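The script used in the talk is not reproduced in this transcript. As a rough sketch of the core step, the hypothetical `set_max_heap` helper below rewrites the `-Xmx` line in the text of a Presto `jvm.config`; locating the file on the EMR node, waiting for it to appear, and restarting Presto are assumptions left out of the example.

```python
import re


def set_max_heap(jvm_config_text, new_xmx):
    """Replace the -Xmx line in a Presto jvm.config, or append one if missing."""
    pattern = re.compile(r"^-Xmx\S+$", re.MULTILINE)
    new_line = f"-Xmx{new_xmx}"
    if pattern.search(jvm_config_text):
        return pattern.sub(new_line, jvm_config_text)
    return jvm_config_text.rstrip("\n") + "\n" + new_line + "\n"


# Illustrative config contents, not the actual EMR defaults.
sample = "-server\n-Xmx12G\n-XX:+UseG1GC\n"
print(set_max_heap(sample, "8G"))
```

A cron job would call something like this once the config file exists, then remove itself so that the change is applied only once.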
Alluxio-based architecture
Performance improvement
Average query execution time dropped from ~20 sec to ~1 sec.
Lessons learned: Memory cannot be overcommitted
● Don’t use the default EMR Presto JVM config, since it might not leave
enough memory for Alluxio processes. Otherwise, beware of OOM
crashes.
● Choose an appropriate caching medium for Alluxio:
○ To cache in RAM, use memory-optimized instances.
○ Consider caching on SSD. Both NVMe and EBS gp2 disks can give a
significant performance boost.
Lessons learned: Data locality matters
We tried deploying Alluxio as a cluster separate from Presto:
+ This design decouples storage (S3), the caching layer (Alluxio) and
compute (Presto). You can give Alluxio as many resources (RAM/SSD) as
necessary.
- Query performance is at least 2-3 times worse compared to the collocated
deployment.
Lessons learned: Alluxio metastore can grow large
When the number of files in the UFS becomes huge (e.g. hundreds of
thousands), consider the following:
● The Alluxio journal will grow to gigabytes in size. Make sure the disk where it’s
kept is large enough, or use the UFS for it (though that is not so simple).
● It may be worth switching from the HEAP to the RocksDB-based metastore. This
way gigabytes of master RAM taken by the metastore can be freed up.
Lessons learned: Monitor everything!
What can go wrong:
● Presto workers can crash
● Alluxio workers can crash
● Presto/Alluxio masters can crash
● EC2 instances can run out of disk space
● Queries can start to fail for many different reasons
● Queries can start to hang and time out
● You name it ...
Thank you!
Any questions?
