Costin Iancu, Khaled Ibrahim – LBNL
Nicholas Chaimov – U. Oregon
Spark on Supercomputers:
A Tale of the Storage Hierarchy
Apache Spark
• Developed for cloud environments
• Specialized runtime provides for
– Performance ☺, Elastic parallelism, Resilience
• Programming productivity through
– HLL front-ends (Scala, R, SQL), multiple domain-specific libraries:
Streaming, SparkSQL, SparkR, GraphX, Splash, MLLib, Velox
• HPC has huge datasets, yet Spark has little penetration there
Apache Spark
• In-memory Map-Reduce framework
• Central abstraction is the Resilient Distributed Dataset.
• Data movement is important
– Lazy, on-demand
– Horizontal (node-to-node) – shuffle/Reduce
– Vertical (node-to-storage) – Map/Reduce
[Figure: RDD lineage for JOB 0. Partitions p1–p3 flow through textFile → flatMap → map → reduceByKey (local) within STAGE 0, then shuffle into reduceByKey (global) in STAGE 1.]
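A minimal word-count job produces exactly this lineage; a sketch (the input/output paths are placeholders):

```scala
import org.apache.spark.{SparkConf, SparkContext}

object WordCount {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("WordCount"))
    val counts = sc.textFile("hdfs:///input/corpus.txt") // placeholder path; nothing executes yet (lazy)
      .flatMap(line => line.split("\\s+"))               // narrow dependency: stays in STAGE 0
      .map(word => (word, 1))                            // narrow dependency: stays in STAGE 0
      .reduceByKey(_ + _)                                // combines locally, then shuffles into STAGE 1
    counts.saveAsTextFile("hdfs:///output/counts")       // the action triggers JOB 0
    sc.stop()
  }
}
```

The shuffle at reduceByKey is where the horizontal (node-to-node) data movement happens; textFile and saveAsTextFile are the vertical (node-to-storage) movement.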
Data Centers/Clouds
• Node-local storage, assumes all disk operations are equal
• Disk I/O optimized for latency
• Network optimized for bandwidth

HPC
• Global file system, asymmetry expected
• Disk I/O optimized for bandwidth
• Network optimized for latency
[Figure: three node architectures, each a mix of CPU, memory, NIC, and HDD/SSD]
• Cloud: commodity CPU, memory, HDD/SSD, NIC
• Data appliance: server CPU, large fast memory, fast SSD
• HPC: server CPU, fast memory, combo of fast (intermediate) and slower (backend) storage
[Figure: Comet and Cori node architectures with intermediate and backend storage]

Comet (DELL)
• 2.5 GHz Intel Haswell, 24 cores
• 128 GB / 1.5 TB DDR4
• 320 GB local SSD (intermediate storage)
• 56 Gbps FDR InfiniBand

Cori (Cray XC40)
• 2.3 GHz Intel Haswell, 32 cores
• 128 GB DDR4
• Cray DataWarp: 1.8 PB at 1.7 TB/s (intermediate storage)
• Sonexion Lustre: 30 PB (backend storage)
• Cray Aries interconnect
Scaling Spark on Cray XC40
(It’s all about file system metadata)
Not ALL I/O is Created Equal
[Chart: GroupByTest, I/O components on Cori. Time per operation (microseconds) vs. nodes (1–16) for Open/Read/Write on Lustre, Burst Buffer Striped, and Burst Buffer Private.]
• # shuffle opens = # shuffle reads = O(cores²): 9,216 opens at 1 node, growing through 36,864; 147,456; and 589,824 to 2,359,296 at 16 nodes
• Time per open increases with scale, unlike read/write
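These counts are consistent with every reduce task opening every map task's output. Assuming (our inference, not stated on the slide) m map tasks and r reduce tasks with m = r = 96n on n nodes, i.e. 3 partitions per core on 32-core nodes:

```latex
\[
  \#\text{opens} = m \cdot r = (96n)^2 = O(\text{cores}^2),
\]
\[
  n = 1:\ 96^2 = 9{,}216, \qquad
  n = 16:\ 1536^2 = 2{,}359{,}296 .
\]
```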
I/O Variability is HIGH
fopen is a problem:
• Mean time is 23X larger than on SSD
• Variability is 14,000X
[Chart panels: READ vs. fopen latency distributions]
Improving I/O Performance
Eliminate file metadata operations
1. Keep files open (cache fopen); see the sketch after this list
• Surprising 10%–20% improvement on the data appliance
• Argues for user-level file systems; gets rid of serialized system calls
2. Use file system backed by single Lustre file for shuffle
• This should also help on systems with local SSDs
3. Use containers
• Speeds up startup, up to 20% end-to-end performance improvement
• Solutions need to be used in conjunction
– E.g. fopen from Parquet reader
Plenty of details in "Scaling Spark on HPC Systems," HPDC 2016
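A minimal sketch of idea 1; the cache structure and names are ours for illustration, not Spark's internal API:

```scala
import java.io.RandomAccessFile
import java.util.concurrent.ConcurrentHashMap
import java.util.function.{Function => JFunction}

// Hypothetical fopen cache: repeated reads of the same shuffle file pay
// the metadata-heavy open() cost only once per process.
object OpenFileCache {
  private val cache = new ConcurrentHashMap[String, RandomAccessFile]()

  private val opener = new JFunction[String, RandomAccessFile] {
    override def apply(path: String): RandomAccessFile =
      new RandomAccessFile(path, "r") // the only real open(); the Lustre metadata hit happens here
  }

  // First caller opens the file; later callers reuse the cached handle.
  // (A real implementation would also coordinate positioned reads across
  // threads and bound the number of descriptors; omitted for brevity.)
  def get(path: String): RandomAccessFile =
    cache.computeIfAbsent(path, opener)

  // Release all descriptors, e.g. at executor shutdown.
  def closeAll(): Unit = {
    val it = cache.values.iterator()
    while (it.hasNext) it.next().close()
    cache.clear()
  }
}
```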
[Chart: Cori, GroupBy, weak scaling. Time to job completion (s) vs. cores (32 to 10,240) for Ramdisk, Mounted File, and Lustre; slowdown annotations of 6x, 12x, 14x, 19x, 33x, and 61x at successive scales.]
• At 10,240 cores, only 1.6x slower than RAMdisk (in-memory execution)
• We scaled Spark from O(100) up to O(10,000) cores
File-Backed Filesystems
• NERSC Shifter (container infrastructure for HPC)
– Compatible with Docker images
– Integrated with Slurm scheduler
– Can control mounting of filesystems within container
• Per-Node Cache
– File-backed filesystem mounted within each node's container instance at a common path (/mnt)
– --volume=$SCRATCH/backingFile:/mnt:perNodeCache=size=100G
– A backing file for each node is created and stored on the backend Lustre filesystem
– Single file open — intermediate data file opens are kept local
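In practice this might look like the batch script below; a sketch, assuming a placeholder image and backing path (Slurm does not expand $SCRATCH in #SBATCH lines, so the path is spelled out) and assuming shuffle files are redirected via spark.local.dir:

```bash
#!/bin/bash
#SBATCH --nodes=16
#SBATCH --image=docker:example/spark:2.0.2
#SBATCH --volume="/global/cscratch1/sd/user/backingFile:/mnt:perNodeCache=size=100G"

# Each node gets its own 100 GB file-backed filesystem at /mnt: the backing
# file lives on Lustre, but intermediate-file opens stay node-local.
# Point Spark's scratch space at the per-node cache mount.
shifter spark-submit \
  --conf spark.local.dir=/mnt \
  --class org.apache.spark.examples.GroupByTest \
  /opt/spark/examples/jars/spark-examples.jar
```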
Now the fun part ☺
Architectural Performance Considerations
The Supercomputer (Cori) vs. The Data Appliance (Comet)
[Figure: Comet and Cori architecture and specifications, repeated from the earlier hardware slide]
CPU, Memory, Network, Disk?
• Multiple extensions to Blocked Time Analysis (Ousterhout, 2015)
• BTA indicated that CPU dominates
– Network 2%, disk 19%
• Concentrate on scaling out, weak scaling studies
– Spark-perf, BigDataBenchmark, TPC-DS, TeraSort
• Interested in determining the right ratio and machine balance for
– CPU, memory, network, disk …
• Spark 2.0.2 & Spark-RDMA 0.9.4 from Ohio State University,
Hadoop 2.6
Storage hierarchy and performance
Global Storage Matches Local Storage
[Chart: Cray XC40, TeraSort (100 GB/node). Time (ms) vs. nodes (1, 5, 20; 32 cores each) for Lustre, Mount+Pool, and SSD+IB, broken into App, JVM, RW Input, RW Shuffle, Open Input, and Open Shuffle; annotations: Disk+Network Latency/BW, Metadata Overhead.]
• Variability matters more than advertised latency and bandwidth numbers
• Storage performance is obscured/mitigated by the network, due to the client/server design of the BlockManager
• At small scale, local storage is slightly faster
• At large scale, global storage is faster
Global Storage Matches Local Storage
[Chart: average across MLLib benchmarks, normalized time at 1 and 16 nodes for Comet RDMA Singularity (24 cores) and Cori Shifter (24 cores), broken into App, Fetch, and JVM; Fetch is 11.8% and 12.5% of time, respectively.]
Intermediate Storage Hurts Performance
[Chart: TPC-DS, weak scaling. Time (s) vs. nodes (1–64) for Cori Shifter on Lustre, Burst Buffer Striped, and Burst Buffer Private, broken into App, Fetch, and JVM.]
• Burst Buffer Striped: 19.4% slower on average; Burst Buffer Private: 86.8% slower on average
• (Without our optimizations, intermediate storage scaled better)
Networking performance
[Chart: Singular Value Decomposition. Time (s) vs. nodes (1–64) for Comet Singularity, Comet RDMA Singularity, and Cori Shifter (24 cores), broken into App, Fetch, and JVM.]

Latency or Bandwidth?
• Even with 10X differences in bandwidth, latency differences matter
• Spark can hide 2X differences
• Average message size for spark-perf is 43 B
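A back-of-the-envelope check with the standard first-order cost model (the bandwidth value is illustrative, not a measurement):

```latex
\[
  t_{\text{msg}} \approx L + \frac{S}{B},
  \qquad S = 43\,\text{B},\; B = 1\,\text{GB/s}
  \;\Rightarrow\; \frac{S}{B} \approx 43\,\text{ns} .
\]
```

At such message sizes the transfer term S/B is negligible, so per-message cost is essentially the latency L; this is why a 10X bandwidth advantage buys little here.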
Network Matters at Scale
[Chart: average across benchmarks. Time (s) vs. nodes (1–64) for Cori Shifter at 24 cores and at 32 cores, broken into App, Fetch, and JVM; annotation: 44%.]
CPU
More cores or better memory?
• Need more cores to hide disk and network latency at scale
• Preliminary experiences with Intel KNL are bad:
– Too much concurrency
– Not enough integer throughput
• Execution does not seem to be memory-bandwidth limited
[Chart: average across benchmarks. Time (s) vs. nodes (1–64) for Cori Shifter at 24 cores and at 32 cores, broken into App, Fetch, and JVM.]
Summary/Conclusions
• Latency and bandwidth are important, but not dominant
– Variability more important than marketing numbers
• Network time dominates at scale
– Network and disk time is mis-attributed as CPU
• Comet matches Cori up to 512 cores, Cori twice as fast at
2048 cores
– Spark can run well on global storage
• Global storage opens the possibility of a global name space, with no more client/server
Acknowledgement
Work partially supported by
Intel Parallel Computing Center: Big Data Support
for HPC
Thank You.
Questions, collaborations, free software
cciancu@lbl.gov
kzibrahim@lbl.gov
nchaimov@uoregon.edu
Burst Buffer Setup
• Cray XC30 at NERSC (Edison): 2.4 GHz IvyBridge - Global
• Cray XC40 at NERSC (Cori): 2.3 GHz Haswell + Cray
DataWarp
• Comet at SDSC: 2.5 GHz Haswell, InfiniBand FDR, 320 GB SSD, 1.5 TB memory - LOCAL