Building Cloud-native Analytical Pipelines on AWS
Irene Cai
Software Engineer
2019.08.20
Outline
- Background
- Challenges
○ Performance
○ Spiky Load Handling
○ Expensive rename operation on S3
- Solution with Alluxio
- Ideas for Critical Future Improvement in Alluxio
Background
As cloud offerings have matured and become cost-efficient, we decided to move our
data pipelines, which process hundreds of TBs of data daily, from a team-managed
Hadoop cluster to AWS, with S3 for storage and EMR for compute.
This lets us leverage the elastic compute that EMR offers and offload cluster
operational work to AWS. With S3 as the storage layer, data is also easily
shareable across teams and forms a single data lake.
Challenges
- I/O Performance
○ Read and write performance to S3
- Spiky Load Handling
○ Read and write load gets spiky (e.g. when jobs are in the final stage of writing results),
and different storage systems handle spiky load differently.
- Expensive rename operation on S3
I/O Performance - S3 Bottleneck
Read and write performance to S3 is one of the bottlenecks we
encounter. Remote reads and writes are very expensive and need to be
performed multiple times in each pipeline job.
Alluxio as a cache buffer
- Reading
- Alluxio acts as a shared read buffer and lets us reduce the number of remote reads needed in
each pipeline job.
- Writing
- Transient Output
Suitable for output that can be deleted after use and is cheap to recompute if lost. This is a great
performance boost, as such output can be written to Alluxio only and never written to S3. In
our benchmark test, we could write ~100 GB to Alluxio in 1 minute.
- Persistent Output
Output that will be consumed later or is expensive to recompute. This type of output needs
to be persisted to persistent storage such as HDFS/S3. Alluxio helps accelerate and simplify
writing to the persistent store at the application layer.
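One way to express the transient/persistent split in application code is to pick an Alluxio write type per output. The sketch below is illustrative and is not the configuration used in this deck: `OutputKind`, `write_type_for`, and `client_properties` are hypothetical names, while `MUST_CACHE`, `ASYNC_THROUGH`, and `CACHE_THROUGH` are Alluxio write types that are normally selected through the `alluxio.user.file.writetype.default` client property.

```python
from enum import Enum


class OutputKind(Enum):
    TRANSIENT = "transient"    # cheap to recompute, deleted after use
    PERSISTENT = "persistent"  # consumed later or expensive to recompute


def write_type_for(kind: OutputKind, wait_for_persist: bool = False) -> str:
    """Map a class of output to an Alluxio write type name (hypothetical helper)."""
    if kind is OutputKind.TRANSIENT:
        # Written to Alluxio storage only, never to the under store (S3).
        return "MUST_CACHE"
    # CACHE_THROUGH persists synchronously; ASYNC_THROUGH persists in the background.
    return "CACHE_THROUGH" if wait_for_persist else "ASYNC_THROUGH"


def client_properties(kind: OutputKind, wait_for_persist: bool = False) -> dict:
    """Build the client property a job would set for this class of output."""
    return {"alluxio.user.file.writetype.default": write_type_for(kind, wait_for_persist)}


print(client_properties(OutputKind.TRANSIENT))
print(client_properties(OutputKind.PERSISTENT))
```

Keeping the decision in one helper means a pipeline stage only has to declare whether its output is transient or persistent, not how the storage layer should treat it.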
Alluxio Performance
As an in-memory file system, Alluxio has excellent read and write performance.
In our experiments, we were able to write ~100 GB of data to Alluxio in 1 minute and
persist it from Alluxio to S3 in 7 minutes, whereas writing directly to S3 often gets
throttled or takes much longer.
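To put these figures in perspective, the arithmetic below converts them to aggregate throughput. It assumes "~100G" means roughly 100 GB (decimal) and that the same data set is written and then persisted. The function name is illustrative.

```python
def throughput_gb_per_s(size_gb: float, minutes: float) -> float:
    """Average throughput in GB/s for moving size_gb in the given number of minutes."""
    return size_gb / (minutes * 60.0)


alluxio_write = throughput_gb_per_s(100, 1)  # write into Alluxio
s3_persist = throughput_gb_per_s(100, 7)     # persist from Alluxio to S3

print(f"Alluxio write: {alluxio_write:.2f} GB/s")       # prints "Alluxio write: 1.67 GB/s"
print(f"Persist to S3: {s3_persist:.2f} GB/s")          # prints "Persist to S3: 0.24 GB/s"
print(f"Ratio: {alluxio_write / s3_persist:.0f}x")      # prints "Ratio: 7x"
```

The job itself only waits for the first number. The slower persist step can proceed in the background at a pace S3 tolerates.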
Storage System Response to Spiky Load
- S3 Throttling
S3 doesn’t have a hard limit on request rate. However, it throttles requests
when the request rate increases dramatically.
- HDFS NameNode slowness
HDFS handles requests sequentially, but responses get increasingly
slower when the NameNode is under stress.
Handling Spiky Load
- Complicated for applications
- Data engines such as Hadoop handle throttling poorly.
- Data applications do not have mechanisms to tune their output pace.
- To cope with throttling, pipelines have to retry, which is very
expensive because of the repeated compute costs.
- Alluxio as a simplifying solution
Alluxio handles this problem by serving as a buffer that smooths out the I/O
stream and avoids throttling or slowdowns.
Pipeline applications can simply write to Alluxio without explicitly handling
spiky loads in the application logic.
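The buffering effect described above can be illustrated with a small simulation. It is a toy model under stated assumptions, not a model of Alluxio internals: time is divided into equal ticks, `bursts_gb` is the amount the jobs write in each tick, and `drain_gb_per_tick` is a hypothetical cap on how fast the buffer forwards data to the downstream store.

```python
def smooth(bursts_gb, drain_gb_per_tick):
    """Absorb bursty writes in a buffer and forward them at a capped rate.

    Returns the amount sent downstream in each tick and the peak buffer occupancy.
    """
    if drain_gb_per_tick <= 0:
        raise ValueError("drain rate must be positive")
    buffered = 0.0
    peak_buffer = 0.0
    downstream = []
    for incoming in bursts_gb:
        buffered += incoming
        sent = min(buffered, drain_gb_per_tick)
        buffered -= sent
        downstream.append(sent)
        peak_buffer = max(peak_buffer, buffered)
    # Keep draining after the jobs finish until the buffer is empty.
    while buffered > 0:
        sent = min(buffered, drain_gb_per_tick)
        buffered -= sent
        downstream.append(sent)
    return downstream, peak_buffer


# Jobs are idle, then flush 90 GB and 10 GB in their final stage.
downstream, peak = smooth([0, 0, 90, 10, 0], drain_gb_per_tick=20)
print(max(downstream), peak, sum(downstream))
```

In this run the downstream store never sees more than 20 GB per tick even though the jobs emit 90 GB in a single tick. The cost is buffer capacity (70 GB at the peak here) and a longer tail before everything is persisted.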
How Alluxio helps with Spiky Load
- Offers users a mechanism to control the output pace to S3 and avoid throttling
- We used hadoop distcp with Alluxio 1.8 and tuned the pace of the copy to S3 with it.
- Persisting to the underlying file system can be asynchronous, so from the application’s
view the files are available for use once they are written to Alluxio.
- Alluxio is memory-based and avoids blocking replication, which provides much
better performance than EMRFS
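The slides do not say which distcp settings were tuned. As an assumption, the sketch below uses the two standard DistCp pacing options: `-m` (maximum number of simultaneous copy tasks) and `-bandwidth` (MB/s per copy task). The helper name, host name, bucket, and paths are hypothetical, and reading from `alluxio://` with distcp additionally requires the Alluxio client jar on the Hadoop classpath.

```python
import shlex


def distcp_command(src: str, dst: str, max_maps: int, bandwidth_mb_per_map: int) -> list:
    """Build a paced `hadoop distcp` command line (hypothetical helper)."""
    if max_maps < 1 or bandwidth_mb_per_map < 1:
        raise ValueError("max_maps and bandwidth_mb_per_map must be at least 1")
    return [
        "hadoop", "distcp",
        "-m", str(max_maps),                      # cap on concurrent copy tasks
        "-bandwidth", str(bandwidth_mb_per_map),  # MB/s allowed per copy task
        src, dst,
    ]


cmd = distcp_command(
    "alluxio://alluxio-master:19998/pipeline/output",  # hypothetical source path
    "s3a://example-bucket/pipeline/output",            # hypothetical destination
    max_maps=20,
    bandwidth_mb_per_map=50,
)
print(shlex.join(cmd))
# Rough aggregate ceiling: 20 tasks * 50 MB/s = 1000 MB/s.
```

The command is only built and printed here; a scheduler step would pass the list to `subprocess.run` on a node with the Hadoop client installed. Lowering either value trades copy time for a gentler request rate on S3.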
Rename Operations on S3
A move (rename) operation is very expensive on S3 because it creates a new object and
deletes the old one.
However, Spark/Hive typically write into a temp directory and move the result to the final
destination when computation finishes. This creates unnecessary stress on S3.
With Alluxio as a middle layer, we persist only the final results to S3.
Alternatively, users can use EMRFS as the middle layer. However, its performance is
worse than Alluxio's.
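The cost difference can be made concrete with a toy in-memory model of an object store. This is not the S3 API: `FakeObjectStore` and the key names are hypothetical, and the model only counts requests to show why "write to a temp directory, then rename" is more expensive than persisting final results only.

```python
class FakeObjectStore:
    """Toy in-memory model of an object store that counts requests."""

    def __init__(self):
        self.objects = {}
        self.requests = {"PUT": 0, "COPY": 0, "DELETE": 0}

    def put(self, key, data):
        self.objects[key] = data
        self.requests["PUT"] += 1

    def rename_prefix(self, src_prefix, dst_prefix):
        # No native rename: every object is copied to a new key, then the old key is deleted.
        for key in sorted(k for k in self.objects if k.startswith(src_prefix)):
            self.objects[dst_prefix + key[len(src_prefix):]] = self.objects[key]
            self.requests["COPY"] += 1
            del self.objects[key]
            self.requests["DELETE"] += 1


def commit_via_temp_dir(store, parts):
    """Write parts to a temp prefix, then move them to the final prefix."""
    for name, data in parts.items():
        store.put("job/_temporary/" + name, data)
    store.rename_prefix("job/_temporary/", "job/final/")


def write_final_only(store, parts):
    """Persist only the final results, as when the temp stage lives in a middle layer."""
    for name, data in parts.items():
        store.put("job/final/" + name, data)


parts = {f"part-{i:05d}": b"rows" for i in range(3)}
via_temp, direct = FakeObjectStore(), FakeObjectStore()
commit_via_temp_dir(via_temp, parts)
write_final_only(direct, parts)
print(sum(via_temp.requests.values()), sum(direct.requests.values()))  # prints "9 3"
```

Both stores end up with the same final objects, but the rename path issues three times as many requests in this model, and each copy also moves the object's bytes a second time.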
Ideas for Critical Future Improvement in Alluxio
- Data completeness on node failure
- This is currently mitigated by data replication.
- In-memory replication is costly.
- Replication to persistent storage is slow and could still result in corrupt data if the data has not been
replicated to disk before the node fails.
- When data is corrupted, pipelines need to rerun the job to regenerate the complete output. It
would be great if Alluxio could manage the data loss automatically by launching only the tasks
required to recompute the missing blocks.
- Stronger guarantee of successful writes to persistent storage
- Alluxio currently gives users a way to tune the output to reduce throttling. It would be great if
Alluxio had a built-in mechanism to provide better guarantees of successfully persisting data
to the underlying storage without user intervention.
- Deeper data engine integration
- Accessing Alluxio from various compute engines is very easy. It would be great to see such
integrations get deeper to provide a stronger guarantee when compute engines write to
Alluxio. For example, Alluxio could help avoid recomputing previously successful tasks with the
same input, reducing the cost of job failures.
Takeaway
Introducing Alluxio into the pipeline gives significant performance advantages and
helps address multiple challenges that stem from underlying file system behavior.
As a memory-based shared cache, Alluxio provides a large performance advantage.
It also provides a good abstraction, so applications don’t need to handle the underlying
storage systems when working with them.
Thank you!

