Accelerate and Scale Big Data Analytics with Disaggregated Compute and Storage
Copyright © 2019, Intel Corporation. All rights reserved.
*Other names and brands may be claimed as the property of others.
Optimization Notice
Agenda
• Background and motivation
• Big data analytics on the cloud: challenges & optimizations
• Accelerating big data analytics with Alluxio
• Summary
Challenges of scaling Hadoop* storage
Bounded storage and compute resources on Hadoop nodes bring challenges:
• Data capacity and upgrade cost
• Space, power, and utilization
• Multiple storage silos
• Inadequate performance
Typical challenge areas: costs, provisioning and configuration, performance & efficiency, data capacity silos.
4 big trends driving the need for a new architecture
• Separation of compute & storage
• Hybrid and multi-cloud environments
• Self-service data across the enterprise
• Rise of the object store
Discontinuity in big data infrastructure calls for different solutions
• Single large cluster: get a bigger cluster for many teams to share.
• Multiple small clusters: give each team its own dedicated cluster, each with a copy of PBs of data.
• On-demand analytic clusters: give teams the ability to spin up/spin down clusters that can share data sets.
Storage Disaggregation architecture
• Replace HDFS with a shared data lake
• Enables independent scaling of compute and storage
• But does this architecture work?
[Diagram: compute layer (batch, streaming, interactive, machine learning, graph analytics) on top of a shared data lake storage layer]
Functionality challenges: troubleshooting configurations
[Chart: 1TB query success % (54 TPC-DS queries) for hive-parquet, spark-parquet, presto-parquet (untuned), and presto-parquet (tuned)]
[Chart: 1TB & 10TB query success % (54 TPC-DS queries) for spark-parquet, spark-orc, and presto-parquet (untuned vs. tuned)]
[Chart: count of issue types encountered: Ceph issues, compatibility issues, deployment issues, improper default configuration, middleware issues, runtime issues, S3A driver issues]
• Lots of tuning and troubleshooting is required to achieve a 100% success ratio for the selected TPC-DS queries:
• Improper default configuration
• Wrong middleware configuration
• Improper Hadoop/Spark configuration for different data sizes and formats
Deployment Architecture challenges: Multiple choices
Deployment architecture depends on detailed HW configuration, application, and cost requirements.
1. Dedicated load balancer
2. Round-robin DNS and dedicated gateway
3. Round-robin DNS, gateway co-located with the storage node
4. Fully disaggregated architecture, multiple storage solutions on the disaggregated storage node
5. Alluxio cache layer deployed on the compute node
Cloud adaptor challenges: S3A tuning as an example
S3A performance is dramatically worse than HDFS.
[Charts: Spark stage timelines (stage 0 / stage > 0) for S3A Ceph and remote HDFS; the slow S3A run took 820 secs at a bandwidth of 120MB/s]
• From the disk I/O and network I/O data, we can see that read bandwidth on Ceph is extremely low: about 100MB/s vs. 3GB/s on HDFS.
• Based on our experience, Ceph is capable of driving disk bandwidth to more than 2GB/s.
• The S3A adaptor is the bottleneck.
<property>
<name>fs.s3a.readahead.range</name>
<value>1024K</value>
</property>
<property>
<name>fs.s3a.experimental.input.fadvise</name>
<value>random</value>
</property>
Tuning up the readahead range decreases the number of connections S3A opens. With these tunings, S3A performance improved by 11.5x.
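As a minimal sketch, the same two properties shown in the core-site.xml snippet above can also be passed through Spark's Hadoop configuration passthrough at session creation; the application name and bucket path below are placeholders, not part of the original test setup.

// Sketch: apply the S3A readahead/fadvise tunings from this slide via Spark.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("s3a-tuning-example")                                      // placeholder name
  .config("spark.hadoop.fs.s3a.readahead.range", "1024K")             // same value as above
  .config("spark.hadoop.fs.s3a.experimental.input.fadvise", "random") // same value as above
  .getOrCreate()

// Subsequent reads against the object store pick up the tuned S3A adaptor.
val sales = spark.read.parquet("s3a://example-bucket/tpcds/store_sales")
println(sales.count())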
Performance gaps: usage cases
[Diagram: compute resource pool with ingest, ETL/transformation, batch query, interactive query, and machine learning clusters over disaggregated storage]
Simulating typical usage cases:
Simple read/write
§ Terasort: a popular benchmark that measures the amount of time to sort one terabyte of randomly distributed data on a given computer system.
TPC-DS derived tests: batch analytics
§ Consistently executing analytical processing over large data sets.
§ UC11: leveraging 54 queries derived from TPC-DS*, with intensive reads across objects in different buckets.
§ I/O intensive queries: 9 selected I/O-intensive queries from TPC-DS.
K-means (see the sketch below)
§ K-means is one of the most commonly used clustering algorithms; it clusters the data points into a predefined number of clusters.
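For illustration only, a minimal Spark MLlib K-means sketch of the kind of workload described above; the input path, feature columns, and k value are placeholders rather than the benchmark's actual settings.

// Sketch: cluster points read from the disaggregated store into k clusters.
import org.apache.spark.ml.clustering.KMeans
import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("kmeans-example").getOrCreate()

// Illustrative input location; the test in this deck used a ~374GB data set.
val raw = spark.read.parquet("s3a://example-bucket/kmeans-input")

// Assemble numeric columns into a feature vector (column names are placeholders).
val points = new VectorAssembler()
  .setInputCols(Array("x1", "x2", "x3"))
  .setOutputCol("features")
  .transform(raw)

// Fit K-means with a predefined number of clusters, as described above.
val model = new KMeans().setK(8).setSeed(1L).fit(points)
model.clusterCenters.foreach(println)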
Performance gaps with storage disaggregation
• Storage disaggregation leads to a performance regression
• Up to 10% for remote HDFS; Terasort performance is actually higher because usable memory increased
• Up to 60% for S3 object storage (optimized results; tunings already give up to an 11.5x boost over default parameters)
• One important cause of the performance gap: S3A does not support transactional writes
• Most big data software (Spark, Hive) relies on HDFS's atomic rename feature to implement atomic writes
• At job submission, a commit protocol specifies how results are written at the end of the job: tasks first stage output into temporary locations, and data is only moved (renamed) to the final location upon task or job completion (see the sketch below)
• S3A implements this rename with COPY+DELETE+HEAD+POST requests
• Although there are ongoing efforts to optimize the S3A adaptor, there is no near-term solution for the performance gap
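To make the rename-based commit concrete, here is a small sketch using the Hadoop FileSystem API (paths and the NameNode address are illustrative). On HDFS the final rename is a near-instant metadata operation; on S3A the same rename is emulated with per-object COPY and DELETE requests, which is where much of the gap comes from.

// Sketch: task writes to a temporary location, commit publishes it via rename.
import java.net.URI
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

val fs = FileSystem.get(new URI("hdfs://namenode:8020"), new Configuration())

// 1. The task attempt stages its output under a temporary directory.
val tmp = new Path("/warehouse/sales/_temporary/attempt_0001/part-00000")
val out = fs.create(tmp)
out.writeBytes("...task output...")
out.close()

// 2. On commit, the output is published with a single rename.
//    Atomic and cheap on HDFS; emulated with COPY + DELETE on S3A.
fs.rename(tmp, new Path("/warehouse/sales/part-00000"))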
Performance comparison of disaggregated analytics storage (normalized to local HDFS; higher is better):

                                  Batch Query     IO Intensive    Terasort    K-means
                                  (54 queries)    (7 queries)     1TB         374GB
spark(yarn) + Local HDFS (HDD)    1.0             1.0             1.0         1.0
spark(yarn) + Remote HDFS (HDD)   0.9             0.9             1.1         0.9
spark(yarn) + S3 (HDD)            0.7             0.6             0.4         0.5
Need to close the performance gap!
Alluxio-based in-memory data accelerator (IMDA)
[Diagram 1: replace HDFS with a disaggregated S3 object storage data lake; compute frameworks (batch, streaming, interactive, machine learning, graph analytics) access the shared data lake through s3a]
[Diagram 2: an Alluxio-based in-memory acceleration layer sits between the provisioned compute pool and the shared data lake]
Data Orchestration for the Cloud
Client-side interfaces: Java File API, HDFS interface, S3 interface, REST API, POSIX interface
Under-store drivers: HDFS driver, Swift driver, S3 driver, NFS driver
Independent scaling of compute & storage
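As a small sketch of what this looks like from the application side, the same Spark job can switch from reading the object store directly to reading through Alluxio just by changing the URI scheme (assuming the Alluxio client library is on Spark's classpath; host, port, and paths are placeholders, with 19998 being Alluxio's default master RPC port).

// Sketch: identical reads through S3A directly and through the Alluxio namespace.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("alluxio-read-example").getOrCreate()

// Direct path: every read goes to the S3-compatible object store.
val direct = spark.read.parquet("s3a://example-bucket/tpcds/store_sales")

// Through Alluxio: hot data is served from the worker's cache tiers instead.
val cached = spark.read.parquet("alluxio://alluxio-master:19998/tpcds/store_sales")

println(direct.count() == cached.count())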
Persistent Memory and RDMA
Persistent Memory:
• PMEM represents a new class of memory and storage technology architected specifically for data center usage
• A combination of high capacity, affordability, and persistence
RDMA: Remote Direct Memory Access
• Accessing (i.e. reading from or writing to) memory on a
remote machine without interrupting the processing of the
CPU(s) on that system.
• Zero-copy: applications perform data transfers without involving the network software stack; data is sent and received directly to/from application buffers without being copied between network layers.
• Kernel bypass: applications perform data transfers directly from userspace, with no context switches.
• No CPU involvement: applications can access remote memory without consuming any CPU cycles on the remote machine.
Picture source: https://software.intel.com/en-us/blogs/2018/10/30/intel-optane-dc-persistent-memory-a-major-advance-in-memory-and-storage-architecture
Persistent Memory Operations Mode
DCPMM* DIMM characteristics (attached to the integrated memory controllers of a Cascade Lake CPU):
• 128, 256, 512GB DIMM capacity
• 2666 MT/sec speed
• 3TB capacity per CPU (not including DRAM)
• DDR4 electrical & physical interface, close to DRAM latency, cache-line-size access
Memory mode: DCPMM provides large memory at lower cost, with DDR4 DRAM acting as a cache.
App Direct mode: flexible, usage-specific partitions over a non-volatile memory pool; low-latency persistent memory and persistent data for rapid recovery, plus fast direct-attach storage via Storage over App Direct.
* DIMM population shown as an example only.
Leveraging the in-memory data accelerator to accelerate intermediate data access
[Diagram: big data stack with batch, streaming, interactive, machine learning, and graph analytics applications (Spark Core with SQL*, Streaming*, MLlib*, GraphX*, DataFrame, ML Pipelines, SparkR*, plus MR*, Storm*, Flink*, Giraph*, Parquet*, Avro*) over resource management & coordination (YARN*, ZooKeeper*, k8s*) and disaggregated storage (HDFS*, Ceph*, HBase*, OSS*), with an Alluxio* acceleration layer in between, connected by high-speed networking]
• Leverage new HW technologies & products that deliver significant performance improvements: persistent memory, RDMA
• Use the Alluxio-based in-memory data accelerator layer to accelerate ephemeral data access
• Caching hot data in Alluxio shortens the I/O stack
• Unifies the underlying filesystems
• Fully leveraging these technologies and HWs to address the bottlenecks requires a storage and network co-design
• Optimized libraries to bypass the filesystem and avoid user-space/kernel-space context switches
In memory data accelerator (IMDA) architecture
Enable Alluxio with state-of-the-art HW technology
§ Alluxio as a lightweight, user-space-I/O-based distributed data store
– For ephemeral data access like cache, shuffle, spill
– Tiered storage: DRAM, persistent memory, and SSD
§ Persistent memory to enlarge compute-side storage with high performance and low cost
§ RDMA to avoid context switches and bypass the kernel
– Persistent memory mmap address used as the RDMA buffer to avoid memory copies
– Persistent memory as off-heap memory to improve GC behavior
§ Long term: customized shuffler to place shuffle and spill data in Alluxio IMDA
[Diagram: provisioned compute pool (batch, streaming, interactive, machine learning, graph analytics) with an in-memory data accelerator for ephemeral data built on DRAM, persistent memory, and NVMe SSD over an RDMA-enabled network, on top of a shared data lake with s3a object storage]
Alluxio IMDA system configuration
5x Compute Node
Hardware:
• Intel® Xeon® Gold 6140 processor @ 2.3GHz, 384GB memory
• 1x 82599 10Gb NIC
• 5x P4500 SSD (2 for spark-shuffle)
Software:
• Hadoop 2.8.1
• Spark 2.2.0
• Hive 2.2.1
• RHEL7.3
5x Storage Node
• Intel® Xeon® Gold 6140 processor @ 2.30GHz, 192GB memory
• 2x 82599 10Gb NIC
• 7x 1TB HDD for Ceph bluestore or HDFS
namenode and datanode
Software:
• Hadoop 2.8.1
• Ceph 12.2.7
• RHEL7.3
[Diagram: 5 compute nodes, each running Hadoop, Hive, Spark, and an Alluxio worker (one node also provides DNS), connected over the 1x10Gb NIC to 5 storage nodes, each running a Ceph RGW plus OSDs and a remote HDFS DataNode (one storage node also hosts the Ceph MON and the HDFS NameNode)]
Alluxio Acceleration Layer
• 200GB memory for memory mode
Software:
• Alluxio 1.7.0
Alluxio IMDA Performance
Using Alluxio IMDA as a cache:
• For Terasort: 3.4x speedup over S3 object storage, 1.36x speedup over local HDFS.
• For the TPC-DS tests: up to 1.56x speedup for I/O-intensive queries, slightly lower than local HDFS.
• For the K-means test: 1.62x speedup over S3 object storage, 14% lower compared with local HDFS.
• K-means is a CPU-intensive workload.
Alluxio acceleration of disaggregated analytics storage (normalized to local HDFS; higher is better):

                                        Batch Query     IO Intensive    Terasort    K-means
                                        (54 queries)    (7 queries)     1TB         374GB
spark(yarn) + Local HDFS (HDD)          1.00            1.00            1.00        1.00
spark(yarn) + S3 (HDD)                  0.70            0.62            0.40        0.53
spark(yarn) + Alluxio (MEM) + S3 (HDD)  0.96            0.97            1.36        0.86
Using the Alluxio IMDA cache improves performance most for I/O-intensive workloads.
Alluxio DCPMM Tier architecture
Alluxio PMEM tier
• A new PMEM tier introduced to provide higher performance at lower cost
• Large capacity -> cache more data
• Higher performance compared with NVMe SSD
• Leverages the PMDK library to bypass filesystem overhead and context switches
• Delivers dedicated SLAs to mission-critical applications
[Diagram: applications use the Alluxio client to talk to the Alluxio master and workers; each worker manages tiered storage across DRAM, DCPMM, SSD, and HDD on top of the under storage]
New PMEM tier for Alluxio
Two modes to support different usage scenarios:
• Storage over App Direct (SoAD) mode
• No code changes
• Bypasses the page cache
• PMDK App Direct (AD) mode
• Bypasses the page cache with no context switches
• Better cache load performance
[Diagram: data paths on DCPMM. A regular POSIX filesystem goes through the page cache with context switches between client and worker; SoAD mode uses a DAX filesystem that bypasses the page cache but still incurs POSIX context switches; PMDK-based AD mode performs userspace load/store on a memory-mapped file via JNI, avoiding both the page cache and context switches]
Use Cases Alluxio Enables
• Burst big data workloads in hybrid cloud environments (Alluxio deployed in the same instance/container as the compute framework)
• Accelerate big data frameworks on the public cloud (same instance/container)
• Dramatically speed up big data on object stores on premise (same container/machine)
[Diagram: Spark, Hive, and Presto each co-located with Alluxio in these deployments]
Alluxio – Key innovations
• Data elasticity with a unified namespace: abstract data silos & storage systems to independently scale data on-demand with compute
• Data accessibility for popular APIs & API translation: run Spark, Hive, Presto, ML workloads on your data located anywhere
• Data locality with intelligent multi-tiering: accelerate big data workloads with transparent tiered local data
Data Locality with Intelligent Multi-tiering
Local performance from remote data using multi-tier storage
• Tiers: hot (RAM), warm (SSD), cold (HDD)
• Read & write buffering, transparent to the application
• Policies for pinning, promotion/demotion, TTL
Data Accessibility via popular APIs and API Translation
Convert from Client-side Interface to native Storage Interface
Client-side interfaces: Java File API, HDFS interface, S3 interface, REST API, POSIX interface
Native storage drivers: HDFS driver, Swift driver, S3 driver, NFS driver
Data Elasticity via Unified Namespace
Enables effective data management across different under stores
- Uses mounting with transparent naming (see the sketch below)
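A minimal sketch of what mounting looks like with the Alluxio Java client, assuming the 1.x/2.x FileSystem API; the bucket and mount point are placeholders, and exact method signatures may differ across Alluxio versions.

// Sketch: mount an S3 bucket into the unified Alluxio namespace.
import alluxio.AlluxioURI
import alluxio.client.file.FileSystem

val fs = FileSystem.Factory.get()   // client for the configured Alluxio master

// Files in the bucket then appear under /s3/... with transparent naming,
// alongside paths mounted from other under stores.
fs.mount(new AlluxioURI("/s3"), new AlluxioURI("s3a://example-bucket/data"))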
Alluxio Reference Architecture
[Diagram: applications use the Alluxio client (over the WAN if needed) to reach the Alluxio master, with a standby master coordinated via Zookeeper/RAFT; Alluxio workers manage RAM/SSD/HDD tiers and persist data to multiple under stores (Under Store 1, Under Store 2)]
Enterprises moving towards independent compute & storage
Learn more
Incredible Open Source Momentum with growing community
1000+ contributors &
growing
4000+ Git Stars
Apache 2.0 Licensed
Hundreds of thousands
of downloads
Join the conversation on Slack
alluxio.org/slack
Questions?
Join the Alluxio Community
www.alluxio.org | www.alluxio.com | @alluxio
Call to action
• Stay tuned for further updates
• More details
• Speeding Big Data Analytics on the Cloud with an In-Memory Data Accelerator
• https://www.alluxio.io/blog/speeding-big-data-analytics-on-the-cloud-with-in-memory-data-accelerator/
Legal Information: Benchmark and Performance Disclaimers
Performance results are based on testing as of Feb. 2019 and may not reflect all publicly available security updates. See
configuration disclosure for details. No product can be absolutely secure.
Software and workloads used in performance tests may have been optimized for performance only on Intel
microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems,
components, software, operations and functions. Any change to any of those factors may cause the results to vary. You
should consult other information and performance tests to assist you in fully evaluating your contemplated purchases,
including the performance of that product when combined with other products. For more information, see Performance
Benchmark Test Disclosure.
Configurations: see performance benchmark test configurations.