Zhen Fan, JD.com
Yue Li, MemVerge
Optimizing Computing Cluster
Resource Utilization
with
Disaggregated Persistent Memory
#UnifiedAnalytics #SparkAISummit
Agenda
Motivation
• Production Environment — tens of thousands of servers
• Uneven computing resource utilization in production data center
• Independent scaling of compute and storage
Extending Spark with external storage
• Current: JD’s remote shuffle service (RSS)
• The next generation: disaggregated persistent memory
Performance evaluation
Conclusion
Computing Resource Utilization of a Production Cluster
[Charts: allocated vs. total VCores (CPU cores) and allocated vs. total memory (TB), sampled from 8:30 to 12:15]
• 3,700 servers from 8:30 AM to 12:30 PM
• Memory utilization stays at a high level the entire time
Uneven Computing Resource Utilization
[Chart: MemoryOverhead (%) per day over a 31-day month]
Memory overhead of a production computing cluster with 3,700 servers
- MemoryOverhead = MemoryUtilization(%) – CPUUtilization(%)
• Uneven computing resource utilization between CPU and memory
• Thousands of servers – replacing hardware at this scale is impractical
• Spark-application-centric data center – more memory is needed
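The MemoryOverhead metric above is simple enough to state in code. A tiny sketch with made-up utilization samples (illustrative only, not the measured JD numbers):

```python
# Illustration of the slide's metric:
# MemoryOverhead = MemoryUtilization(%) - CPUUtilization(%)

def memory_overhead(mem_util_pct, cpu_util_pct):
    """Gap between memory and CPU utilization, in percentage points.

    A large positive value means memory is the scarce resource while
    CPU cores sit idle: the imbalance observed in the JD cluster.
    """
    return mem_util_pct - cpu_util_pct

# Example daily samples (made-up numbers)
samples = [(80.0, 35.0), (75.0, 40.0), (90.0, 45.0)]
overheads = [memory_overhead(m, c) for m, c in samples]
print(overheads)  # [45.0, 35.0, 45.0]
```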
High Memory Demand from Spark Tasks
[Scatter plot: execution time (s) vs. memory utilization (GB)]
• Spark performance depends heavily on the capacity of available memory
• Machine learning and iterative jobs
• Cache and execution memory
Too Much Shuffle Data to Store
• Shuffle-heavy applications are very common in production
• Local HDD
− Slow under high I/O pressure
− Intolerable random I/O with heavy shuffle data
− Prone to failure in production, with no replicas
Too Costly to Fail
• Stage recomputation is a disaster for SLAs
• We need shuffle storage that is manageable and highly available
Current Solution
• Related discussion
– SPARK-25299 (use remote storage for persisting shuffle data)
• Separate storage from compute
– Shuffle to external storage
• The JD remote shuffle service (RSS)
− More effective and more stable for shuffle-heavy applications
Workflow of JD Remote Shuffle Service
• RemoteShuffleManager
Writer & RSS Implementation
Shuffle Writer:
• Add a metadata header and encode outgoing blocks with protobuf
• Group partitions so the RSS can merge them efficiently
• Acks guarantee that blocks are saved to HDFS
• The data path can introduce dirty data; deduplication happens on the reducer side
RSS:
• Merge blocks of the same partition group in memory
• Trade-off between merge buffer size and how long blocks linger in the buffer
• Flush the merged buffer synchronously
• Many other details around controlling the data flow
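The writer-side ideas above (a metadata header on each block, partition groups merged inside the RSS) can be sketched roughly as follows. The names, the fixed struct header standing in for the protobuf encoding, and the grouping rule are all illustrative assumptions, not JD's actual implementation:

```python
import struct
from collections import defaultdict

# Assumed header fields; the real service encodes richer metadata with protobuf.
HEADER = struct.Struct(">IIQ")  # (map_id, partition_id, attempt)

def encode_block(map_id, partition_id, attempt, payload: bytes) -> bytes:
    """Prepend a metadata header so the RSS and reducer can identify the block."""
    return HEADER.pack(map_id, partition_id, attempt) + payload

def merge_by_partition_group(blocks, group_size=4):
    """Buffer blocks of the same partition group so the RSS can flush them
    into one file, trading merge-buffer size against how long blocks linger."""
    groups = defaultdict(list)
    for block in blocks:
        map_id, partition_id, attempt = HEADER.unpack_from(block)
        groups[partition_id // group_size].append(block)
    return groups

# Two map tasks, eight partitions -> two partition groups of four partitions each
blocks = [encode_block(m, p, 0, b"data") for m in range(2) for p in range(8)]
merged = merge_by_partition_group(blocks)
print(sorted(merged))  # [0, 1]
```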
Reducer Side Implementation
• The reduce task fetches the related HDFS file(s)
- If one RSS instance crashes, another creates a new file for the partition group
• Extract the key info from the driver's mapStatuses
- Used for deduplication
• Skip partitions that are not relevant
• Deduplicate
- Using the metadata in blocks and the mapStatus
• Throttle the input streams …
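Reducer-side deduplication can be sketched as follows (illustrative only; the data shapes and names are hypothetical). The idea from the slide is that the driver's mapStatuses tell the reducer which map-task attempt actually succeeded, so dirty or duplicate blocks are dropped while streaming:

```python
def dedup_blocks(blocks, map_statuses):
    """Keep one block per map task, and only from the winning attempt.

    blocks: iterable of (map_id, attempt, payload)
    map_statuses: {map_id: winning_attempt}, extracted from the driver
    """
    seen = set()
    for map_id, attempt, payload in blocks:
        if map_statuses.get(map_id) != attempt:
            continue  # dirty data from a failed or retried attempt
        if map_id in seen:
            continue  # duplicate caused by resent blocks
        seen.add(map_id)
        yield payload

# Map task 0 sent its block twice; map task 1's attempt 0 failed and was retried
blocks = [(0, 0, b"a"), (0, 0, b"a"), (1, 1, b"b"), (1, 0, b"stale")]
print(list(dedup_blocks(blocks, {0: 0, 1: 1})))  # [b'a', b'b']
```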
Use case – JD Data Warehouse Workload
To Improve Performance …
• Storage
– HDFS might not be performant enough?
• Network
– Netty might introduce a bottleneck?
• We need better tools that help us explore
– Different storage backend
– Different network transports
The Splash Shuffle Manager
• A flexible shuffle manager
• Supports user-defined storage backend
and network transport for shuffle
• Works with vanilla Spark
• JD-RSS uses HDFS-based backend
• Open source
• https://github.com/MemVerge/splash
• Sending shuffle states to external storage
• Shuffle index
• Data partition
• Spill
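The bullets above describe a pluggable design: the shuffle manager writes shuffle indexes, data partitions, and spills through one storage abstraction, so different backends (HDFS, local disk, a remote PMEM pool) can be swapped in without touching Spark. A conceptual sketch of such an interface, not the actual Splash API, with all class and method names hypothetical:

```python
from abc import ABC, abstractmethod

class ShuffleStorage(ABC):
    """Hypothetical storage backend contract for shuffle state."""
    @abstractmethod
    def write(self, path: str, data: bytes) -> None: ...
    @abstractmethod
    def read(self, path: str) -> bytes: ...

class InMemoryStorage(ShuffleStorage):
    """Stand-in backend; a real one would target HDFS or remote PMEM."""
    def __init__(self):
        self._files = {}
    def write(self, path, data):
        self._files[path] = data
    def read(self, path):
        return self._files[path]

def write_shuffle_state(storage: ShuffleStorage, shuffle_id, index, partitions):
    """Persist the shuffle index and each data partition through the backend."""
    storage.write(f"shuffle_{shuffle_id}.index", index)
    for pid, data in partitions.items():
        storage.write(f"shuffle_{shuffle_id}_part_{pid}.data", data)

backend = InMemoryStorage()
write_shuffle_state(backend, 7, b"idx", {0: b"p0", 1: b"p1"})
print(backend.read("shuffle_7_part_1.data"))  # b'p1'
```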
The Next Generation Architecture
• JD remote shuffle service (RSS)
– Only supports shuffle workload
– Uses external storage that is relatively slow (HDFS)
• A more general framework that separates compute and storage
– Shuffle
– RDD caching
– Goal: faster, more reliable, and elastic
• Our solution
• Memory extension via an external memory pool
Extending Spark Memory via Remote Memory
RDD caching
Shuffle
Spark Memory Management Model
Image source: https://0x0fff.com/spark-memory-management/
External Memory Pool
• High capacity
• High performance
• High endurance
• Affordable
Intel Optane DC Persistent Memory: An Excellent Candidate
• Jointly developed by Intel and Micron
• High density (up to 6TB per 2-socket server)
• High endurance
• Low latency: ns level
• Byte-addressable, can be used as main memory
• Non-volatile, can be used as primary storage
• Half the price of DRAM
Spark with Disaggregated Persistent Memory
[Diagram: Spark compute nodes #1..#N, each with DDR4 memory and Spark executors, connect over RDMA through a TOR switch (fast network) to an external extended-memory node with DDR4 and PMEM that stores persistence, shuffle, and spill data]
Summary
Spark with disaggregated persistent memory
• A dedicated remote PMEM pool
• Easier to manage
• Affordable
• Minimal changes
• No change for existing computing nodes
• No change for user application
• Minimal at Spark level
• Highly elastic
• Computing nodes become stateless
• Highly performant
Workloads
TeraSort
• A synthetic, shuffle-intensive benchmark well suited to evaluating I/O performance
Core service of data warehouse application
• Spark-SQL
• A core data warehouse task at JD.com, supporting business decisions
• I/O-intensive, based on select, insert, and full outer join operations
Anti-Price-Crawling Service
• A core analytic task of JD.com that defends against coordinated price crawling
• Both I/O-intensive and compute-intensive
Workload Characteristics
                  TeraSort    Data Warehouse    Anti-Price-Crawling
Input data size   600 GB      200 GB            726.9 GB
Cached RDD size   N/A         N/A               349 GB
Shuffle size      334.2 GB    234.7 GB          57.6 GB
Experimental Setup
• 10 Spark computing nodes
• 1 external persistent memory node
• Spark 2.3 in standalone mode
• HDFS 2.7
[Diagram: Spark computing nodes connected through an Ethernet switch to the PMEM pool]
Performance – TeraSort
Low executor memory scenario:
• Execution time ⇓300-400s (31%)
Medium executor memory scenario:
• Execution time ⇓300-500s (38%)
Large executor memory scenario:
• Execution time ⇓450s (35%)
[Chart: execution time (s) vs. memory utilization (GB) for Spark, Optimal Spark, and Spark with remote PMEM, across the low, medium, and large executor memory scenarios]
Performance – Data Warehouse
Low executor memory scenario:
• Execution time ⇓700-800s (58%)
Medium executor memory scenario:
• Execution time ⇓550-650s (54%)
Large executor memory scenario:
• Execution time ⇓350-400s (42%)
[Chart: execution time (s) vs. memory utilization (GB) for Spark, Optimal Spark, and Spark with remote PMEM, across the low, medium, and large executor memory scenarios]
Performance – Anti-Price-Crawling
Low executor memory scenario:
• Execution time ⇓500s (17%)
Medium executor memory scenario:
• Execution time ⇓1200s (40%)
Large executor memory scenario:
• Execution time ⇑0-60s (up to 4% slower)
• Baseline: RDDs cached in local process memory
• Spark w/ remote PMEM: RDDs cached in the remote PMEM pool
[Chart: execution time (s) vs. memory utilization (GB) for Spark, Optimal Spark, and Spark with remote PMEM, across the low, medium, and large executor memory scenarios]
Conclusion
• The separation of compute and storage
– JD-RSS
• Shuffle to external storage pool based on HDFS
– Disaggregated persistent memory pool
• Storage memory extension for shuffle and RDD caching
– Unified under MemVerge Splash Shuffle Manager
• Better elasticity and reliability
• High capacity, high performance and affordable
• Persistent memory will bring fundamental changes to Big Data
infrastructure
DON’T FORGET TO RATE
AND REVIEW THE SESSIONS
SEARCH SPARK + AI SUMMIT