How to Build a Cloud Native Stack for Analytics with Spark, Hive, and Alluxio on S3

Efficient Data Engineering with
Apache Spark, Hive, and Alluxio on S3
August 14th, 2019

Inaugural Cloud, Data & Orchestration Meetup!
● Welcome!
● First Meetup
● Looking for future presenters in the Data Engineering/Ops community
● Let us know on the Meetup group or talk to Bin, Thai, & Tim

Your Hosts:
● Thai Bui - Senior Staff Big Data Engineer, Bazaarvoice
● Bin Fan - VP, Founding Member, Alluxio
● Tim Kelly - Engineering Manager, Bazaarvoice
● Amelia Wong - Co-Founder, Alluxio

Creating the World’s Smartest
Network of Consumers, Brands
& Retailers

Core Product:
SaaS Ratings & Reviews

[Figure: 2018 holiday product pageviews]

[Figure: 2018 holiday shoppers in the Bazaarvoice network]

Confidential and Proprietary. © 2017 Bazaarvoice, Inc.
Tools

Architecture
Confidential and proprietary. Copyright 2015 Bazaarvoice.

Cloud Compute
Tech

Bazaarvoice Data Lake Stats:
AWS S3 with Parquet & ORC
● Registered data on S3: 500 TB
● Clickstream data: 250 TB
● Number of clients: 5700
● Targetable shoppers: 214 M
● Active products: 125 M
● Total reviews: 900 M

Accelerating S3 with ZFS,
tiered-storage & Alluxio
Thai Bui

AWS S3 : The Good
An object storage service provided by Amazon
● Really cheap
● Highly available
● Fully-managed service
● Scales really well
● Integrates with virtually all tools

AWS S3 : The Bad
When you have 100s of TB of data and millions of files
● Just object listing is slow
● Download speed is limited by network bandwidth
● No concept of cache
● No concept of data locality
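
To illustrate why listing alone becomes slow at this scale: S3's list API returns at most 1,000 keys per request, and pages under a prefix are fetched one after another. The sketch below estimates the number of round trips; the object count and per-request latency are assumptions for illustration, not measurements from this talk.

```python
import math

# S3 ListObjectsV2 returns at most 1,000 keys per request.
MAX_KEYS_PER_LIST_CALL = 1000


def list_requests_needed(num_objects: int, page_size: int = MAX_KEYS_PER_LIST_CALL) -> int:
    """Number of paginated list calls needed to enumerate num_objects keys."""
    if num_objects < 0:
        raise ValueError("num_objects must be non-negative")
    return math.ceil(num_objects / page_size)


def listing_time_seconds(num_objects: int, seconds_per_request: float) -> float:
    """Rough lower bound when pages are fetched sequentially under one prefix."""
    return list_requests_needed(num_objects) * seconds_per_request


if __name__ == "__main__":
    objects = 5_000_000          # assumed object count
    latency = 0.1                # assumed seconds per list request
    print(list_requests_needed(objects), "list requests")
    print(f"about {listing_time_seconds(objects, latency):.0f} s if issued sequentially")
```

A metadata cache avoids paying this cost on every query, which is the motivation for the tiered-storage approach that follows.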

AWS S3 : The Need For Speed
● Add tiered-storage to S3
○ Hot, warm, cold storage (fastest, fast, and not so fast)
○ Metadata cache
○ Data cache
● Keep data local
○ In the same machine, not via the network cable
● Compatible with existing services
○ Hadoop, Spark, Hive, Presto, etc.
● Adaptive & highly configurable
○ Symlink for S3

Overview
[Diagram: Hive and Spark (with metastores) on top of Alluxio and ZFS as the hot & warm tiers, with S3 as the cold tier]
● Alluxio
○ Compatibility layer
○ Tiered storage layer
● ZFS
○ OS-level file system
○ Volume manager
○ Acceleration layer
● Both are open-source

Alluxio : The tiered-storage layer
● Supports both the native filesystem API and the Hadoop filesystem API
● Distributed, and can be installed on every node
○ Provides data locality
● Mount S3, HDFS, etc. to Alluxio
○ Think symlink. No data movement.
● Use RAM, NVMe, SSD, HDD to define hot, warm, cold data tier
● LRU, LFU policies for caching data at different layers
● Not enough space -> evict or move least used files to the next tier
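
The eviction behaviour described above can be sketched as a toy model. This is not Alluxio's implementation; the class name, the capacities counted in blocks, and the promote-on-read rule are assumptions made for illustration.

```python
from collections import OrderedDict


class TieredLRUCache:
    """Toy model of ordered tiers (e.g. RAM, then SSD) with LRU demotion."""

    def __init__(self, capacities):
        # capacities[i] is the number of blocks tier i can hold.
        self.capacities = list(capacities)
        self.tiers = [OrderedDict() for _ in self.capacities]

    def _insert(self, level, key, value):
        if level >= len(self.tiers):
            return  # fell off the last tier: evicted, the copy in S3 remains
        tier = self.tiers[level]
        tier[key] = value
        tier.move_to_end(key)  # mark as most recently used
        if len(tier) > self.capacities[level]:
            # Not enough space: move the least recently used block down a tier.
            old_key, old_value = tier.popitem(last=False)
            self._insert(level + 1, old_key, old_value)

    def put(self, key, value):
        for tier in self.tiers:
            tier.pop(key, None)  # drop any stale copy first
        self._insert(0, key, value)

    def get(self, key):
        for tier in self.tiers:
            if key in tier:
                value = tier.pop(key)
                self._insert(0, key, value)  # promote to the hottest tier
                return value
        return None  # cache miss: the caller would read from S3

    def tier_of(self, key):
        for level, tier in enumerate(self.tiers):
            if key in tier:
                return level
        return None


if __name__ == "__main__":
    cache = TieredLRUCache([2, 2])
    for name in ("a", "b", "c"):
        cache.put(name, name.upper())
    print(cache.tier_of("a"))  # "a" was least recently used, so it was demoted
```

An LFU policy would differ only in how the victim is chosen: by access count instead of recency.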

ZFS : The acceleration layer
● Both a filesystem & a volume manager
○ Works with RAM to accelerate read/write
○ Auto promote/demote blocks from RAM to other storage
○ Used with local NVMe SSD if data is not in RAM
○ Mirror write to 2 SSDs -> 2x read speed
● Runs in kernel space
● Extremely reliable
○ Automatic block checksum & repair

ZFS + NVMe: Micro benchmark
i3.4xlarge, up to 10 Gbit network, 2 x 1.9 TB NVMe SSD, single-threaded
● Baseline w/ EBS
○ 135 MB/s write (dd if=/dev/zero of=/tmp/test1.img bs=1G count=1 oflag=dsync)
○ 157 MB/s read (dd if=/tmp/test1.img of=/dev/zero bs=8k)
● ZFS + 2 mirrored NVMe SSD
○ 820 MB/s write (dd if=/dev/zero of=/alluxio/fs/test1.img bs=1G count=1)
○ 1.7 GB/s read (dd if=/alluxio/fs/test1.img of=/dev/zero bs=1G count=1)
● Roughly 6x write, 11x read compared to EBS
● 7-14x compared to S3 (120 MB/s both read/write)
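
The ratios can be recomputed from the throughput figures listed above. A minimal sketch, assuming 1.7 GB/s means 1,700 MB/s:

```python
# Throughput in MB/s, taken from the micro benchmark above.
EBS = {"write": 135, "read": 157}
ZFS_NVME = {"write": 820, "read": 1700}  # assumes 1.7 GB/s == 1,700 MB/s
S3 = {"write": 120, "read": 120}


def speedup(new, baseline):
    """Throughput ratio of new over baseline, per operation."""
    return {op: new[op] / baseline[op] for op in new}


if __name__ == "__main__":
    for name, baseline in (("EBS", EBS), ("S3", S3)):
        ratios = speedup(ZFS_NVME, baseline)
        print(f"vs {name}: {ratios['write']:.1f}x write, {ratios['read']:.1f}x read")
```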

All together
[Diagram: Alluxio (user-space) exposes the Native/Hadoop Filesystem API on top of ZFS (kernel-space); ZFS promotes and demotes blocks between the hot tier (RAM) and the warm tier (NVMe SSD)]

With Hive
[Diagram: Hive Metastore; data from the last 30 days is served from Alluxio (hot & warm), data older than 30 days from S3 (cold)]
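
One way the 30-day split could be wired up is by pointing each Hive partition at either an Alluxio or an S3 location, depending on its age. The sketch below only generates the DDL text. The table name, hostnames, bucket, and the `day` partition column are hypothetical; `ALTER TABLE ... PARTITION ... SET LOCATION` is standard Hive DDL, but the talk does not say this is the exact mechanism used.

```python
from datetime import date

HOT_WINDOW_DAYS = 30
# Hypothetical roots; 19998 is the default Alluxio master RPC port.
ALLUXIO_ROOT = "alluxio://alluxio-master:19998/warehouse/events"
S3_ROOT = "s3a://example-bucket/warehouse/events"


def partition_location(day: date, today: date) -> str:
    """Recent partitions go through Alluxio; older ones are read from S3."""
    age_days = (today - day).days
    root = ALLUXIO_ROOT if age_days <= HOT_WINDOW_DAYS else S3_ROOT
    return f"{root}/year={day.year}/month={day.month}/day={day.day}"


def alter_partition_ddl(table: str, day: date, today: date) -> str:
    """Hive statement that repoints one partition at its current tier."""
    return (
        f"ALTER TABLE {table} "
        f"PARTITION (year={day.year}, month={day.month}, day={day.day}) "
        f"SET LOCATION '{partition_location(day, today)}';"
    )


if __name__ == "__main__":
    today = date(2019, 8, 14)
    print(alter_partition_ddl("events", date(2019, 8, 1), today))
    print(alter_partition_ddl("events", date(2019, 1, 5), today))
```

Because Alluxio mounts S3 without moving data, the same files stay in S3 either way; only the access path changes.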

Example query
SELECT ..
FROM ..
WHERE year = 2019
AND month = 1
GROUP BY ..
ORDER BY ..
LIMIT 100;

Example query plan
SELECT ..
FROM ..
WHERE year = 2019
AND month = 1
GROUP BY .. ORDER BY ..
LIMIT 100

Without tiered-storage
● 50s for split calculations
○ Listing files on S3
○ Sub-dividing the tasks amongst workers
● 12s for scanning data on S3
● 70s to complete the query

With tiered-storage
● 1.7s for split calculations
○ 30x improvement
● 3s for scanning data on tiered-storage
○ 4x improvement
● 6s to complete the query
○ Over 10x improvement overall
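
A short sketch that breaks down the timings above by stage. The numbers come from the two lists; treating the remainder of the total as an "other" stage is an assumption, since the slides do not itemize it.

```python
# Seconds per stage, from the lists above.
WITHOUT_TIERING = {"split_calculation": 50.0, "scan": 12.0, "total": 70.0}
WITH_TIERING = {"split_calculation": 1.7, "scan": 3.0, "total": 6.0}


def stage_speedups(before, after):
    """How many times faster each stage became."""
    return {stage: before[stage] / after[stage] for stage in before}


def share_of_total(timings, stage):
    """Fraction of the total query time spent in one stage."""
    return timings[stage] / timings["total"]


def other_time(timings):
    """Time not attributed to split calculation or scanning (assumed 'other')."""
    return timings["total"] - timings["split_calculation"] - timings["scan"]


if __name__ == "__main__":
    for stage, ratio in stage_speedups(WITHOUT_TIERING, WITH_TIERING).items():
        print(f"{stage}: {ratio:.1f}x faster")
    before = share_of_total(WITHOUT_TIERING, "split_calculation")
    after = share_of_total(WITH_TIERING, "split_calculation")
    print(f"split calculation share of query time: {before:.0%} -> {after:.0%}")
```

The breakdown shows that most of the untiered query time goes to split calculation, that is, to listing files on S3 rather than reading data.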

Result
● 5-10X read improvement in Hive
○ Worker can short-circuit and read directly from ZFS instead of S3
○ Move compute to the data
● Should give the same result in Apache Spark
● Good for iterating over the same data set multiple times
○ Machine learning
○ Exploratory analysis
● Gives us control over S3
○ More recent data should be faster to access
