Accelerate and Scale Big Data Analytics with Disaggregated Compute and Storage
Copyright © 2019, Intel Corporation. All rights reserved.
*Other names and brands may be claimed as the property of others.
Optimization Notice
Agenda
• Background and motivation
• Big data analytics on the cloud: challenges & optimizations
• Accelerating big data analytics with Alluxio
• Summary
Challenges of scaling Hadoop* storage
Bounded storage and compute resources on Hadoop nodes bring challenges:
• Data capacity and upgrade cost
• Space, power, and utilization
• Multiple storage silos
• Inadequate performance
Typical challenge areas: costs, provisioning and configuration, performance & efficiency, data capacity silos.
4 big trends driving the need for a new architecture
• Separation of compute & storage
• Hybrid and multi-cloud environments
• Self-service data across the enterprise
• Rise of the object store
Discontinuity in big data infrastructure calls for different solutions
• Single large cluster: get a bigger cluster for many teams to share.
• Multiple small clusters: give each team its own dedicated cluster, each with a copy of PBs of data.
• On-demand analytic clusters: give teams the ability to spin up/spin down clusters that can share data sets.
Storage Disaggregation architecture
• Replace HDFS with a shared data lake
• Enables independent scaling of compute and storage
• But does this architecture work?
[Diagram: compute layer (batch, streaming, interactive, machine learning, graph analytics) on top of a shared data lake storage layer]
Functionality challenges: troubleshooting configurations
[Chart: 1TB query success % (54 TPC-DS queries) for hive-parquet, spark-parquet, presto-parquet (untuned), and presto-parquet (tuned)]
[Chart: 1TB & 10TB query success % (54 TPC-DS queries) for spark-parquet, spark-orc, and presto-parquet (untuned vs. tuned)]
[Chart: count of issue types encountered: Ceph issues, compatibility issues, deployment issues, improper default configuration, middleware issues, runtime issues, S3A driver issues]
• Lots of tuning and troubleshooting is required to achieve a 100% success ratio for the selected TPC-DS queries:
• Improper default configuration
• Wrong middleware configuration
• Improper Hadoop/Spark configuration for different data sizes and formats
Deployment Architecture challenges: Multiple choices
Deployment architecture depends on detailed HW configuration, application, and cost requirements.
1. Dedicated load balancer
2. Round-robin DNS and dedicated gateway
3. Round-robin DNS, gateway co-located with the storage node
4. Fully disaggregated architecture, multiple storage solutions on the disaggregated storage node
5. Alluxio cache layer deployed on the compute node
Cloud adaptor challenges: S3A tuning as an example
S3A performance is dramatically worse than HDFS.
[Charts: Spark stage timelines (stage 0 / stage > 0) for S3A Ceph and remote HDFS; the slow S3A run took 820 secs at a bandwidth of 120MB/s]
• From the disk I/O and network I/O data, we can see that read bandwidth on Ceph is extremely low: about 100MB/s vs. 3GB/s on HDFS.
• Based on our experience, Ceph is capable of driving disk bandwidth to more than 2GB/s.
• The S3A adaptor is the bottleneck.
<property>
<name>fs.s3a.readahead.range</name>
<value>1024K</value>
</property>
<property>
<name>fs.s3a.experimental.input.fadvise</name>
<value>random</value>
</property>
Tuning up the readahead range decreases the number of connections S3A opens. With these tunings, S3A performance improved by 11.5x.
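As a minimal sketch, the same two properties shown in the core-site.xml snippet above can also be passed through Spark's Hadoop configuration passthrough at session creation; the application name and bucket path below are placeholders, not part of the original test setup.

// Sketch: apply the S3A readahead/fadvise tunings from this slide via Spark.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("s3a-tuning-example")                                      // placeholder name
  .config("spark.hadoop.fs.s3a.readahead.range", "1024K")             // same value as above
  .config("spark.hadoop.fs.s3a.experimental.input.fadvise", "random") // same value as above
  .getOrCreate()

// Subsequent reads against the object store pick up the tuned S3A adaptor.
val sales = spark.read.parquet("s3a://example-bucket/tpcds/store_sales")
println(sales.count())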
Performance gaps: usage cases
[Diagram: compute resource pool with ingest, ETL/transformation, batch query, interactive query, and machine learning clusters over disaggregated storage]
Simulating typical usage cases:
Simple read/write
§ Terasort: a popular benchmark that measures the amount of time to sort one terabyte of randomly distributed data on a given computer system.
TPC-DS derived tests: batch analytics
§ Consistently executing analytical processing over large data sets.
§ UC11: leveraging 54 queries derived from TPC-DS*, with intensive reads across objects in different buckets.
§ I/O intensive queries: 9 selected I/O-intensive queries from TPC-DS.
K-means (see the sketch below)
§ K-means is one of the most commonly used clustering algorithms; it clusters the data points into a predefined number of clusters.
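For illustration only, a minimal Spark MLlib K-means sketch of the kind of workload described above; the input path, feature columns, and k value are placeholders rather than the benchmark's actual settings.

// Sketch: cluster points read from the disaggregated store into k clusters.
import org.apache.spark.ml.clustering.KMeans
import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("kmeans-example").getOrCreate()

// Illustrative input location; the test in this deck used a ~374GB data set.
val raw = spark.read.parquet("s3a://example-bucket/kmeans-input")

// Assemble numeric columns into a feature vector (column names are placeholders).
val points = new VectorAssembler()
  .setInputCols(Array("x1", "x2", "x3"))
  .setOutputCol("features")
  .transform(raw)

// Fit K-means with a predefined number of clusters, as described above.
val model = new KMeans().setK(8).setSeed(1L).fit(points)
model.clusterCenters.foreach(println)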
Performance gaps with storage disaggregation
• Storage disaggregation leads to a performance regression
• Up to 10% for remote HDFS; Terasort performance is actually higher because usable memory increased
• Up to 60% for S3 object storage (optimized results; tunings already give up to an 11.5x boost over default parameters)
• One important cause of the performance gap: S3A does not support transactional writes
• Most big data software (Spark, Hive) relies on HDFS's atomic rename feature to implement atomic writes
• At job submission, a commit protocol specifies how results are written at the end of the job: tasks first stage output into temporary locations, and data is only moved (renamed) to the final location upon task or job completion (see the sketch below)
• S3A implements this rename with COPY+DELETE+HEAD+POST requests
• Although there are ongoing efforts to optimize the S3A adaptor, there is no near-term solution for the performance gap
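To make the rename-based commit concrete, here is a small sketch using the Hadoop FileSystem API (paths and the NameNode address are illustrative). On HDFS the final rename is a near-instant metadata operation; on S3A the same rename is emulated with per-object COPY and DELETE requests, which is where much of the gap comes from.

// Sketch: task writes to a temporary location, commit publishes it via rename.
import java.net.URI
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

val fs = FileSystem.get(new URI("hdfs://namenode:8020"), new Configuration())

// 1. The task attempt stages its output under a temporary directory.
val tmp = new Path("/warehouse/sales/_temporary/attempt_0001/part-00000")
val out = fs.create(tmp)
out.writeBytes("...task output...")
out.close()

// 2. On commit, the output is published with a single rename.
//    Atomic and cheap on HDFS; emulated with COPY + DELETE on S3A.
fs.rename(tmp, new Path("/warehouse/sales/part-00000"))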
Performance comparison of disaggregated analytics storage (normalized to local HDFS; higher is better):

                                  Batch Query     IO Intensive    Terasort    K-means
                                  (54 queries)    (7 queries)     1TB         374GB
spark(yarn) + Local HDFS (HDD)    1.0             1.0             1.0         1.0
spark(yarn) + Remote HDFS (HDD)   0.9             0.9             1.1         0.9
spark(yarn) + S3 (HDD)            0.7             0.6             0.4         0.5
Need to close the performance gap!
Alluxio-based in-memory data accelerator (IMDA)
[Diagram 1: replace HDFS with a disaggregated S3 object storage data lake; compute frameworks (batch, streaming, interactive, machine learning, graph analytics) access the shared data lake through s3a]
[Diagram 2: an Alluxio-based in-memory acceleration layer sits between the provisioned compute pool and the shared data lake]
Data Orchestration for the Cloud
Client-side interfaces: Java File API, HDFS interface, S3 interface, REST API, POSIX interface
Under-store drivers: HDFS driver, Swift driver, S3 driver, NFS driver
Independent scaling of compute & storage
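As a small sketch of what this looks like from the application side, the same Spark job can switch from reading the object store directly to reading through Alluxio just by changing the URI scheme (assuming the Alluxio client library is on Spark's classpath; host, port, and paths are placeholders, with 19998 being Alluxio's default master RPC port).

// Sketch: identical reads through S3A directly and through the Alluxio namespace.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("alluxio-read-example").getOrCreate()

// Direct path: every read goes to the S3-compatible object store.
val direct = spark.read.parquet("s3a://example-bucket/tpcds/store_sales")

// Through Alluxio: hot data is served from the worker's cache tiers instead.
val cached = spark.read.parquet("alluxio://alluxio-master:19998/tpcds/store_sales")

println(direct.count() == cached.count())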
Persistent Memory and RDMA
Persistent Memory:
• PMEM represents a new class of memory and storage technology architected specifically for data center usage
• A combination of high capacity, affordability, and persistence
RDMA: Remote Direct Memory Access
• Accessing (i.e. reading from or writing to) memory on a
remote machine without interrupting the processing of the
CPU(s) on that system.
• Zero-copy: applications perform data transfers without involving the network software stack; data is sent and received directly to/from application buffers without being copied between network layers.
• Kernel bypass: applications perform data transfers directly from userspace, with no context switches.
• No CPU involvement: applications can access remote memory without consuming any CPU cycles on the remote machine.
Picture source: https://software.intel.com/en-us/blogs/2018/10/30/intel-optane-dc-persistent-memory-a-major-advance-in-memory-and-storage-architecture
Persistent Memory Operations Mode
DCPMM* DIMM characteristics (attached to the integrated memory controllers of a Cascade Lake CPU):
• 128, 256, 512GB DIMM capacity
• 2666 MT/sec speed
• 3TB capacity per CPU (not including DRAM)
• DDR4 electrical & physical interface, close to DRAM latency, cache-line-size access
Memory mode: DCPMM provides large memory at lower cost, with DDR4 DRAM acting as a cache.
App Direct mode: flexible, usage-specific partitions over a non-volatile memory pool; low-latency persistent memory and persistent data for rapid recovery, plus fast direct-attach storage via Storage over App Direct.
* DIMM population shown as an example only.
Leveraging the in-memory data accelerator to accelerate intermediate data access
[Diagram: big data stack with batch, streaming, interactive, machine learning, and graph analytics applications (Spark Core with SQL*, Streaming*, MLlib*, GraphX*, DataFrame, ML Pipelines, SparkR*, plus MR*, Storm*, Flink*, Giraph*, Parquet*, Avro*) over resource management & coordination (YARN*, ZooKeeper*, k8s*) and disaggregated storage (HDFS*, Ceph*, HBase*, OSS*), with an Alluxio* acceleration layer in between, connected by high-speed networking]
• Leverage new HW technologies & products that deliver significant performance improvements: persistent memory, RDMA
• Use the Alluxio-based in-memory data accelerator layer to accelerate ephemeral data access
• Caching hot data in Alluxio shortens the I/O stack
• Unifies the underlying filesystems
• Fully leveraging these technologies and HWs to address the bottlenecks requires a storage and network co-design
• Optimized libraries to bypass the filesystem and avoid user-space/kernel-space context switches
In memory data accelerator (IMDA) architecture
Enable Alluxio with state-of-the-art HW technology
§ Alluxio as a lightweight, user-space-I/O-based distributed data store
– For ephemeral data access like cache, shuffle, spill
– Tiered storage: DRAM, persistent memory, and SSD
§ Persistent memory to enlarge compute-side storage with high performance and low cost
§ RDMA to avoid context switches and bypass the kernel
– Persistent memory mmap address used as the RDMA buffer to avoid memory copies
– Persistent memory as off-heap memory to improve GC behavior
§ Long term: customized shuffler to place shuffle and spill data in Alluxio IMDA
[Diagram: provisioned compute pool (batch, streaming, interactive, machine learning, graph analytics) with an in-memory data accelerator for ephemeral data built on DRAM, persistent memory, and NVMe SSD over an RDMA-enabled network, on top of a shared data lake with s3a object storage]
Alluxio IMDA system configuration
5x Compute Node
Hardware:
• Intel® Xeon® Gold 6140 processor @ 2.3GHz, 384GB memory
• 1x 82599 10Gb NIC
• 5x P4500 SSD (2 for spark-shuffle)
Software:
• Hadoop 2.8.1
• Spark 2.2.0
• Hive 2.2.1
• RHEL7.3
5x Storage Node
• Intel® Xeon® Gold 6140 processor @ 2.30GHz, 192GB memory
• 2x 82599 10Gb NIC
• 7x 1TB HDD for Ceph bluestore or HDFS
namenode and datanode
Software:
• Hadoop 2.8.1
• Ceph 12.2.7
• RHEL7.3
[Diagram: 5 compute nodes, each running Hadoop, Hive, Spark, and an Alluxio worker (one node also provides DNS), connected over the 1x10Gb NIC to 5 storage nodes, each running a Ceph RGW plus OSDs and a remote HDFS DataNode (one storage node also hosts the Ceph MON and the HDFS NameNode)]
Alluxio Acceleration Layer
• 200GB memory for memory mode
Software:
• Alluxio 1.7.0
Alluxio IMDA Performance
Using Alluxio IMDA as a cache:
• For Terasort: 3.4x speedup over S3 object storage, 1.36x speedup over local HDFS.
• For the TPC-DS tests: up to 1.56x speedup for I/O-intensive queries, slightly lower than local HDFS.
• For the K-means test: 1.62x speedup over S3 object storage, 14% lower compared with local HDFS.
• K-means is a CPU-intensive workload.
Alluxio acceleration of disaggregated analytics storage (normalized to local HDFS; higher is better):

                                        Batch Query     IO Intensive    Terasort    K-means
                                        (54 queries)    (7 queries)     1TB         374GB
spark(yarn) + Local HDFS (HDD)          1.00            1.00            1.00        1.00
spark(yarn) + S3 (HDD)                  0.70            0.62            0.40        0.53
spark(yarn) + Alluxio (MEM) + S3 (HDD)  0.96            0.97            1.36        0.86
Using the Alluxio IMDA cache improves performance most for I/O-intensive workloads.
Alluxio DCPMM Tier architecture
Alluxio PMEM tier
• A new PMEM tier introduced to provide higher performance at lower cost
• Large capacity -> cache more data
• Higher performance compared with NVMe SSD
• Leverages the PMDK library to bypass filesystem overhead and context switches
• Delivers dedicated SLAs to mission-critical applications
[Diagram: applications use the Alluxio client to talk to the Alluxio master and workers; each worker manages tiered storage across DRAM, DCPMM, SSD, and HDD on top of the under storage]
New PMEM tier for Alluxio
Two modes to support different usage scenarios:
• Storage over App Direct (SoAD) mode
• No code changes
• Bypasses the page cache
• PMDK App Direct (AD) mode
• Bypasses the page cache with no context switches
• Better cache load performance
[Diagram: data paths on DCPMM. A regular POSIX filesystem goes through the page cache with context switches between client and worker; SoAD mode uses a DAX filesystem that bypasses the page cache but still incurs POSIX context switches; PMDK-based AD mode performs userspace load/store on a memory-mapped file via JNI, avoiding both the page cache and context switches]
Use Cases Alluxio Enables
• Burst big data workloads in hybrid cloud environments (Alluxio deployed in the same instance/container as the compute framework)
• Accelerate big data frameworks on the public cloud (same instance/container)
• Dramatically speed up big data on object stores on premise (same container/machine)
[Diagram: Spark, Hive, and Presto each co-located with Alluxio in these deployments]
Alluxio – Key innovations
• Data elasticity with a unified namespace: abstract data silos & storage systems to independently scale data on-demand with compute
• Data accessibility for popular APIs & API translation: run Spark, Hive, Presto, ML workloads on your data located anywhere
• Data locality with intelligent multi-tiering: accelerate big data workloads with transparent tiered local data
Data Locality with Intelligent Multi-tiering
Local performance from remote data using multi-tier storage
• Tiers: hot (RAM), warm (SSD), cold (HDD)
• Read & write buffering, transparent to the application
• Policies for pinning, promotion/demotion, TTL
Data Accessibility via popular APIs and API Translation
Convert from Client-side Interface to native Storage Interface
Client-side interfaces: Java File API, HDFS interface, S3 interface, REST API, POSIX interface
Native storage drivers: HDFS driver, Swift driver, S3 driver, NFS driver
Data Elasticity via Unified Namespace
Enables effective data management across different under stores
- Uses mounting with transparent naming (see the sketch below)
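A minimal sketch of what mounting looks like with the Alluxio Java client, assuming the 1.x/2.x FileSystem API; the bucket and mount point are placeholders, and exact method signatures may differ across Alluxio versions.

// Sketch: mount an S3 bucket into the unified Alluxio namespace.
import alluxio.AlluxioURI
import alluxio.client.file.FileSystem

val fs = FileSystem.Factory.get()   // client for the configured Alluxio master

// Files in the bucket then appear under /s3/... with transparent naming,
// alongside paths mounted from other under stores.
fs.mount(new AlluxioURI("/s3"), new AlluxioURI("s3a://example-bucket/data"))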
Alluxio Reference Architecture
[Diagram: applications use the Alluxio client (over the WAN if needed) to reach the Alluxio master, with a standby master coordinated via Zookeeper/RAFT; Alluxio workers manage RAM/SSD/HDD tiers and persist data to multiple under stores (Under Store 1, Under Store 2)]
Enterprises moving towards independent compute & storage
Learn more
Incredible Open Source Momentum with growing community
1000+ contributors &
growing
4000+ Git Stars
Apache 2.0 Licensed
Hundreds of thousands
of downloads
Join the conversation on Slack
alluxio.org/slack
Questions?
Join the Alluxio Community
www.alluxio.org | www.alluxio.com | @alluxio
Call to action
• Stay tuned for further updates
• More details
• Speeding Big Data Analytics on the Cloud with an In-Memory Data Accelerator
• https://www.alluxio.io/blog/speeding-big-data-analytics-on-the-cloud-with-in-memory-data-accelerator/
Legal Information: Benchmark and Performance Disclaimers
Performance results are based on testing as of Feb. 2019 and may not reflect all publicly available security updates. See
configuration disclosure for details. No product can be absolutely secure.
Software and workloads used in performance tests may have been optimized for performance only on Intel
microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems,
components, software, operations and functions. Any change to any of those factors may cause the results to vary. You
should consult other information and performance tests to assist you in fully evaluating your contemplated purchases,
including the performance of that product when combined with other products. For more information, see Performance
Benchmark Test Disclosure.
Configurations: see performance benchmark test configurations.