Efficient Data Engineering with
Apache Spark, Hive, and Alluxio on S3
August 14th, 2019
Inaugural Cloud, Data & Orchestration Meetup!
● Welcome!
● First Meetup
● Looking for future presenters in Data Engineering/Ops Community
● Let us know on the Meetup group or talk to Bin, Thai, & Tim
Your Hosts:
● Thai Bui - Senior Staff Big Data Engineer, Bazaarvoice
● Bin Fan - VP, Founding Member, Alluxio
● Tim Kelly - Engineering Manager, Bazaarvoice
● Amelia Wong - Co-Founder, Alluxio
About Bazaarvoice
Creating the World’s Smartest Network of Consumers, Brands & Retailers
Core Product:
SaaS Ratings & Reviews
How to Build a Cloud Native Stack for Analytics with Spark, Hive, and Alluxio on S3
2018 Holiday Product Pageviews
2018 Holiday Shoppers in the Bazaarvoice Network
Confidential and Proprietary. © 2017 Bazaarvoice, Inc.
Tools
Architecture
Cloud Compute Tech
Bazaarvoice Data Lake Stats:
AWS S3 with Parquet & ORC
● Registered Data on S3: 500 TB
● Clickstream Data: 250 TB
● Number of Clients: 5700
● Targetable Shoppers: 214 M
● Active Products: 125 M
● Total Reviews: 900 M
Accelerating S3 with ZFS,
tiered-storage & Alluxio
Thai Bui
AWS S3 : The Good
An object storage service provided by Amazon
● Really cheap
● Highly available
● Fully-managed service
● Scales really well
● Integrates with virtually all tools
AWS S3 : The Bad
When you have 100s of TB of data and millions of files
● Even listing objects is slow
● Download speed is limited by network bandwidth
● No concept of cache
● No concept of data locality
AWS S3 : The Need For Speed
● Add tiered-storage to S3
○ Hot, warm, cold storage (fastest, fast, and not so fast)
○ Metadata cache
○ Data cache
● Keep data local
○ On the same machine, not over the network
● Compatible with existing services
○ Hadoop, Spark, Hive, Presto, etc.
● Adaptive & highly configurable
○ Symlink for S3
Overview
[Diagram: Hive and Spark (with Hive metastores) on top of Alluxio and ZFS as the hot & warm tiers, with S3 as the cold tier]
● Alluxio
○ Compatibility layer
○ Tiered storage layer
● ZFS
○ OS-level file system
○ Volume manager
○ Acceleration layer
● Both are open source
Alluxio : The tiered-storage layer
● Support for native filesystem and Hadoop filesystem
● Distributed, and can be installed on every node
○ Provides data locality
● Mount S3, HDFS, etc. to Alluxio
○ Think symlink. No data movement.
● Use RAM, NVMe, SSD, and HDD to define hot, warm, and cold data tiers
● LRU, LFU policies for caching data at different layers
● Not enough space -> evict or move least used files to the next tier
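The evict-or-move behaviour described in the bullets above can be sketched as a toy model. This is not Alluxio's implementation or API; the `Tier` and `TieredCache` classes, the tier names, and the capacities are all hypothetical, and the only policy modelled is LRU with overflow cascading to the next tier.

```python
from collections import OrderedDict


class Tier:
    """One storage tier with a fixed capacity (in blocks), kept in LRU order."""

    def __init__(self, name, capacity):
        self.name = name
        self.capacity = capacity
        self.blocks = OrderedDict()  # block_id -> data, least recently used first


class TieredCache:
    """Toy tiered cache: tiers are ordered fastest to slowest."""

    def __init__(self, tiers):
        self.tiers = tiers

    def _insert(self, level, block_id, data):
        if level >= len(self.tiers):
            return  # fell off the last tier: evicted (still in the backing store)
        tier = self.tiers[level]
        tier.blocks[block_id] = data
        tier.blocks.move_to_end(block_id)
        while len(tier.blocks) > tier.capacity:
            # Not enough space: move the least recently used block down a tier.
            victim_id, victim = tier.blocks.popitem(last=False)
            self._insert(level + 1, victim_id, victim)

    def put(self, block_id, data):
        for tier in self.tiers:
            tier.blocks.pop(block_id, None)  # drop any stale copy first
        self._insert(0, block_id, data)

    def get(self, block_id):
        for tier in self.tiers:
            if block_id in tier.blocks:
                data = tier.blocks.pop(block_id)
                self._insert(0, block_id, data)  # promote to the hot tier on access
                return data
        return None  # miss: a real system would fetch from S3 here

    def location(self, block_id):
        for tier in self.tiers:
            if block_id in tier.blocks:
                return tier.name
        return None


cache = TieredCache([Tier("MEM", 2), Tier("SSD", 2)])
for block in ["a", "b", "c"]:
    cache.put(block, b"...")
print(cache.location("a"))  # SSD: "a" was least recently used, so it was demoted
```

Reading a block from a lower tier promotes it back to the top, which is why repeatedly scanned data tends to stay hot in this model.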
ZFS : The acceleration layer
● Both a filesystem & a volume manager
○ Works with RAM to accelerate read/write
○ Auto promote/demote blocks between RAM and other storage
○ Used with local NVMe SSD if data is not in RAM
○ Mirror write to 2 SSDs -> 2x read speed
● Works in kernel space
● Extremely reliable
○ Automatic block checksum & repair
ZFS + NVMe: Micro benchmark
i3.4xlarge, up to 10 Gbit network, 2 x 1.9 TB NVMe SSD, single-threaded
● Baseline w/ EBS
○ 135 MB/s write (dd if=/dev/zero of=/tmp/test1.img bs=1G count=1 oflag=dsync)
○ 157 MB/s read (dd if=/tmp/test1.img of=/dev/zero bs=8k)
● ZFS + 2 mirrored NVMe SSD
○ 820 MB/s write (dd if=/dev/zero of=/alluxio/fs/test1.img bs=1G count=1)
○ 1.7 GB/s read (dd if=/alluxio/fs/test1.img of=/dev/zero bs=1G count=1)
● 4x write, 10x read compared to EBS
● 8-14x compared to S3 (120 MB/s both read/write)
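For readers who want to repeat a comparable single-threaded measurement without `dd`, here is a rough Python analogue. It is only a sketch: the file name and the default sizes are assumptions, the final `fsync` is closer in spirit to `dd` with `conv=fsync` than to `oflag=dsync`, and the read figure can be inflated by the OS page cache (or the ZFS ARC) if the file was just written.

```python
import os
import tempfile
import time


def write_throughput_mb_s(path, size_mb=64, chunk_mb=8):
    """Write size_mb of zeros to path, fsync once, and return MB/s.

    size_mb should be a multiple of chunk_mb.
    """
    chunk = bytes(chunk_mb * 1024 * 1024)
    start = time.perf_counter()
    with open(path, "wb") as f:
        for _ in range(size_mb // chunk_mb):
            f.write(chunk)
        f.flush()
        os.fsync(f.fileno())  # make sure the data reached the device
    return size_mb / (time.perf_counter() - start)


def read_throughput_mb_s(path, chunk_mb=8):
    """Read path sequentially and return MB/s."""
    chunk_size = chunk_mb * 1024 * 1024
    total = 0
    start = time.perf_counter()
    with open(path, "rb") as f:
        while True:
            data = f.read(chunk_size)
            if not data:
                break
            total += len(data)
    return total / (1024 * 1024) / (time.perf_counter() - start)


if __name__ == "__main__":
    # Hypothetical target: replace with the mount point under test.
    target_dir = tempfile.gettempdir()
    path = os.path.join(target_dir, "throughput_test.img")
    try:
        print(f"write: {write_throughput_mb_s(path):.0f} MB/s")
        print(f"read:  {read_throughput_mb_s(path):.0f} MB/s")
    finally:
        if os.path.exists(path):
            os.remove(path)
```

Running it once against an EBS-backed directory and once against the ZFS mount gives numbers that can be compared in the same way as the figures on this slide.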
All together
[Diagram: user space: Alluxio, exposing the native/Hadoop filesystem API; kernel space: ZFS, promoting and demoting between RAM (hot) and NVMe SSD (warm)]
With Hive
[Diagram: the Hive Metastore points the last 30 days of data at Alluxio (hot & warm) and data older than 30 days at S3 (cold)]
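One way the Hive slide's split between recent data in Alluxio and older data in S3 could be expressed is by setting each partition's location in the metastore. The sketch below only generates Hive `ALTER TABLE ... PARTITION ... SET LOCATION` statements. The table name, the Alluxio master address, the bucket, and the `year`/`month`/`day` partition layout are all hypothetical, and it assumes the Alluxio path is mounted onto the same S3 prefix so that both URIs resolve to the same files.

```python
from datetime import date

# Hypothetical roots: both are assumed to point at the same underlying data.
HOT_ROOT = "alluxio://alluxio-master:19998/warehouse/events"
COLD_ROOT = "s3a://example-bucket/warehouse/events"


def partition_location(day, today, hot_days=30):
    """Return the Alluxio URI for partitions newer than hot_days, else the S3 URI."""
    age_days = (today - day).days
    root = HOT_ROOT if age_days < hot_days else COLD_ROOT
    return f"{root}/year={day.year}/month={day.month}/day={day.day}"


def alter_statement(table, day, today):
    """Build the Hive DDL that repoints one daily partition."""
    return (
        f"ALTER TABLE {table} PARTITION "
        f"(year={day.year}, month={day.month}, day={day.day}) "
        f"SET LOCATION '{partition_location(day, today)}';"
    )


today = date(2019, 8, 14)
print(alter_statement("events", date(2019, 8, 1), today))   # recent: Alluxio
print(alter_statement("events", date(2019, 1, 15), today))  # old: S3
```

A scheduled job could emit these statements daily so that partitions age out of the hot tier without any change to the queries themselves.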
Example query
SELECT ..
FROM ..
WHERE year = 2019
AND month = 1
GROUP BY ..
ORDER BY ..
LIMIT 100;
Example query plan
[Figure: Hive query plan for the example query above]
Without tiered-storage
● 50s for split calculations
○ Listing files on S3
○ Sub-dividing the tasks amongst workers
● 12s for scanning data on S3
● 70s to complete the query
With tiered-storage
● 1.7s for split calculations
○ 30x improvement
● 3s for scanning data on tiered-storage
○ 3x improvement
● 6s to complete the query
○ 10x improvement overall
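As a quick arithmetic check, the ratios can be recomputed from the raw timings on these two slides. The multipliers quoted on the slide are rounded, so the computed values differ slightly; the dictionary keys are just labels chosen for this sketch.

```python
# Timings in seconds, copied from the two slides above.
without_tiers = {"split calculation": 50.0, "scan": 12.0, "total query": 70.0}
with_tiers = {"split calculation": 1.7, "scan": 3.0, "total query": 6.0}


def speedups(before, after):
    """Return before/after ratio for every stage present in `before`."""
    return {stage: before[stage] / after[stage] for stage in before}


for stage, ratio in speedups(without_tiers, with_tiers).items():
    print(f"{stage}: {ratio:.1f}x")
# split calculation: 29.4x
# scan: 4.0x
# total query: 11.7x
```

Most of the gain comes from split calculation, which is dominated by file listing and therefore benefits from cached metadata rather than from faster data reads.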
Result
● 5-10x read improvement in Hive
○ Worker can short-circuit and read directly from ZFS instead of S3
○ Move compute to the data
● Should give similar results in Apache Spark
● Good for iterating over the same data set multiple times
○ Machine learning
○ Exploratory analysis
● Gives us control over S3
○ More recent data should be faster to access
Questions?

