Increasing cluster performance by combining rCUDA with Slurm
Federico Silla
Technical University of Valencia, Spain
HPC Advisory Council Switzerland Conference 2016

Outline
rCUDA … what’s that?
Basics of CUDA
[Figure: a CUDA application running on a node with a local GPU]

rCUDA … remote CUDA
[Figure: the same CUDA application running on a node with no local GPU, using a GPU located in another node]

rCUDA is a software technology that enables a more flexible use of GPUs in computing facilities.
Basics of rCUDA
[Figures: rCUDA client/server architecture; the application node runs the rCUDA client library in place of the CUDA runtime, and an rCUDA server runs on the node that hosts the physical GPU]
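Because rCUDA sits below the CUDA API, applications need no source changes to use a remote GPU. As a rough sketch (not taken from the slides), the plain CUDA program below runs unchanged under rCUDA: the runtime calls are serviced by the rCUDA client library and executed on the remote GPU, which is typically selected through rCUDA environment variables rather than in code (the exact configuration mechanism is an assumption here; see the rCUDA documentation).

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Plain CUDA kernel: element-wise vector addition.
__global__ void vecAdd(const float* a, const float* b, float* c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) c[i] = a[i] + b[i];
}

int main() {
    const int n = 1 << 20;
    const size_t bytes = n * sizeof(float);

    float* h_a = (float*)malloc(bytes);
    float* h_b = (float*)malloc(bytes);
    float* h_c = (float*)malloc(bytes);
    for (int i = 0; i < n; ++i) { h_a[i] = 1.0f; h_b[i] = 2.0f; }

    // These calls go to the regular CUDA runtime when a local GPU is present.
    // Under rCUDA, the very same calls are forwarded over the network to a
    // remote GPU; the source code does not change.
    float *d_a, *d_b, *d_c;
    cudaMalloc(&d_a, bytes); cudaMalloc(&d_b, bytes); cudaMalloc(&d_c, bytes);
    cudaMemcpy(d_a, h_a, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(d_b, h_b, bytes, cudaMemcpyHostToDevice);

    vecAdd<<<(n + 255) / 256, 256>>>(d_a, d_b, d_c, n);
    cudaMemcpy(h_c, d_c, bytes, cudaMemcpyDeviceToHost);

    printf("c[0] = %f\n", h_c[0]);

    cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);
    free(h_a); free(h_b); free(h_c);
    return 0;
}
```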
Envisioning the cluster with rCUDA
 rCUDA allows a new vision of a GPU deployment, moving from the usual cluster configuration, in which every node owns the GPUs attached to its PCIe bus, to a configuration in which all the GPUs of the cluster are logically detached from the nodes and can be reached from any node through logical connections over the interconnection network.
[Figure: physical configuration (each node contains CPUs, main memory, a network adapter and PCIe-attached GPUs with their memory, joined by the interconnection network) versus the logical configuration seen by applications when rCUDA is used (a pool of GPUs accessible from every node)]
Outline
Two questions:
• Why would we need rCUDA?
• rCUDA … slower CUDA?
Concern with rCUDA
The main concern with rCUDA is the reduced bandwidth to the remote GPU.
[Figure: a node with no local GPU accessing a remote GPU across the network]
Using InfiniBand networks
Initial transfers within rCUDA
[Charts: host-to-device (H2D) and device-to-host (D2H) bandwidth for pageable and pinned memory, comparing the original (Orig) and optimized (Opt) rCUDA implementations over FDR and EDR InfiniBand]
Performance depending on network
 CUDASW++: bioinformatics software for Smith-Waterman protein database searches
[Chart: execution time (s) and rCUDA overhead (%) versus protein sequence length (144 to 5478), for CUDA and for rCUDA over FDR InfiniBand, QDR InfiniBand and Gigabit Ethernet]
Optimized transfers within rCUDA
[Charts: H2D and D2H bandwidth for pageable and pinned memory with the original and optimized rCUDA implementations over FDR and EDR InfiniBand; with pinned memory, the optimized version reaches almost 100% of the available bandwidth]
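The pageable/pinned distinction in these plots follows directly from how the transfers are issued on the CUDA side. As a minimal sketch under stated assumptions (buffer size and timing method are illustrative, not taken from the slides), the following standard CUDA code measures both paths; the same code runs under rCUDA, where the achievable rate is additionally bounded by the network.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Compare host-to-device bandwidth from a pageable and from a pinned buffer.
int main() {
    const size_t bytes = 256UL << 20;             // 256 MB per transfer
    float* pageable = (float*)malloc(bytes);      // ordinary (pageable) memory
    float* pinned = nullptr;
    cudaMallocHost(&pinned, bytes);               // page-locked (pinned) memory

    float* dev = nullptr;
    cudaMalloc(&dev, bytes);

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    float ms = 0.0f;

    // Pageable copy: the runtime stages data through an internal pinned buffer.
    cudaEventRecord(start);
    cudaMemcpy(dev, pageable, bytes, cudaMemcpyHostToDevice);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);
    cudaEventElapsedTime(&ms, start, stop);
    printf("H2D pageable: %.2f GB/s\n", (bytes / 1e9) / (ms / 1e3));

    // Pinned copy: no staging, so it usually gets much closer to the link rate.
    cudaEventRecord(start);
    cudaMemcpy(dev, pinned, bytes, cudaMemcpyHostToDevice);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);
    cudaEventElapsedTime(&ms, start, stop);
    printf("H2D pinned:   %.2f GB/s\n", (bytes / 1e9) / (ms / 1e3));

    cudaFree(dev);
    cudaFreeHost(pinned);
    free(pageable);
    return 0;
}
```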
rCUDA optimizations on applications
• Several applications executed with CUDA and rCUDA
• K20 GPU and FDR InfiniBand
• K40 GPU and EDR InfiniBand
[Chart: execution time of rCUDA relative to CUDA for each application; lower is better]
Outline
Two questions:
• Why would we need rCUDA?
• rCUDA … slower CUDA?
Outline
rCUDA improves cluster performance
Test bench for studying rCUDA+Slurm
 Dual-socket Intel Xeon E5-2620 v2 nodes + 32 GB RAM + K20 GPU
 FDR InfiniBand based cluster
 Three cluster sizes: 4+1, 8+1 and 16+1 GPU nodes, where the additional node runs the Slurm scheduler
Applications for studying rCUDA+Slurm
 Applications used for tests:
 GPU-Blast (21 seconds; 1 GPU; 1599 MB)
 LAMMPS (15 seconds; 4 GPUs; 876 MB)
 mCUDA-MEME (165 seconds; 4 GPUs; 151 MB)
 GROMACS (2 nodes) (167 seconds)
 NAMD (4 nodes) (11 minutes)
 BarraCUDA (10 minutes; 1 GPU; 3319 MB)
 GPU-LIBSVM (5 minutes; 1 GPU; 145 MB)
 MUMmerGPU (5 minutes; 1 GPU; 2804 MB)
The applications are labeled as non-GPU, short execution time or long execution time, and grouped into Set 1 and Set 2.
 Three workloads:
 Set 1
 Set 2
 Set 1 + Set 2
Workloads for studying rCUDA+Slurm (I)
[Chart]
Performance of rCUDA+Slurm (I)
[Chart]
Workloads for studying rCUDA+Slurm (II)
[Chart]
Performance of rCUDA+Slurm (II)
[Chart]
Outline
Why does rCUDA improve cluster performance?
1st reason for improved performance
• Non-accelerated applications keep GPUs idle in the nodes where they use all the cores.
• Hybrid MPI + shared-memory non-accelerated applications usually span all the cores in a node (across n nodes). A CPU-only application spreading over these nodes will make their GPUs unavailable for accelerated applications.
[Figure: nodes 1 to n, each with CPUs, RAM, a network adapter and a PCIe-attached GPU, joined by the interconnection network]
2nd reason for improved performance (I)
• Accelerated applications keep CPUs idle in the nodes where they execute.
• An accelerated application using just one CPU core may prevent other jobs from being dispatched to that node, because hybrid MPI + shared-memory non-accelerated applications usually span all the cores in a node.
[Figure: the same cluster of nodes]
2nd reason for improved performance (II)
• Accelerated applications keep CPUs idle in the nodes where they execute.
• An accelerated MPI application using just one CPU core per node may keep part of the cluster busy, since hybrid MPI + shared-memory non-accelerated applications usually span all the cores in a node (across n nodes).
[Figure: the same cluster of nodes]
3rd reason for improved performance
• Do applications completely squeeze the GPUs available in the cluster?
• When a GPU is assigned to an application, the computational resources inside the GPU may not be fully used:
• the application presents a low level of parallelism
• CPU code is being executed (GPU assigned ≠ GPU working)
• GPU cores stall due to lack of data
• etc.
[Figure: the same cluster of nodes]
GPU usage of GPU-Blast
[Chart: GPU utilization trace of GPU-Blast; during several intervals the GPU is assigned but not used]
GPU usage of CUDA-MEME
[Chart: GPU utilization trace of CUDA-MEME; GPU utilization stays far away from its maximum]
GPU usage of LAMMPS
[Chart: GPU utilization trace of LAMMPS; part of the time the GPU is assigned but not used]
GPU allocation vs GPU utilization
[Chart: GPU allocation compared with actual GPU utilization over time; GPUs are frequently assigned but not used]
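Utilization traces like the ones behind these slides can be gathered by sampling the GPU with NVML on the node that physically hosts it. The slides do not state which tool was used, so the sketch below is only one plausible way to obtain such a trace.

```cuda
#include <cstdio>
#include <unistd.h>
#include <nvml.h>

// Sample GPU and memory-controller utilization once per second using NVML.
// Build (assumption): nvcc gpu_trace.cpp -lnvidia-ml -o gpu_trace
int main() {
    if (nvmlInit() != NVML_SUCCESS) {
        fprintf(stderr, "NVML initialization failed\n");
        return 1;
    }

    nvmlDevice_t dev;
    nvmlDeviceGetHandleByIndex(0, &dev);        // first GPU in this node

    for (int s = 0; s < 60; ++s) {              // one minute of samples
        nvmlUtilization_t util;
        if (nvmlDeviceGetUtilizationRates(dev, &util) == NVML_SUCCESS)
            printf("%3d s  GPU: %3u%%  memory: %3u%%\n", s, util.gpu, util.memory);
        sleep(1);
    }

    nvmlShutdown();
    return 0;
}
```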
Sharing a GPU among jobs: GPU-Blast
[Charts: GPU utilization when two concurrent instances of GPU-Blast share the same GPU; one instance alone required about 51 seconds, and the traces show when the first and the second instance actually use the GPU]
Sharing a GPU among jobs
GPU memory footprints on a K20 GPU:
• LAMMPS: 876 MB
• mCUDA-MEME: 151 MB
• BarraCUDA: 3319 MB
• MUMmerGPU: 2104 MB
• GPU-LIBSVM: 145 MB
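Whether several of these jobs fit together on one K20 comes down to their combined memory footprint. A small illustrative sketch (not part of rCUDA or Slurm; the admission logic they actually use is not described in these slides) that queries the remaining device memory with the standard CUDA runtime:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Check whether another job with a known GPU memory footprint would still fit
// on the currently selected (possibly remote) GPU.
int main() {
    size_t freeBytes = 0, totalBytes = 0;
    cudaMemGetInfo(&freeBytes, &totalBytes);

    // 876 MB is the LAMMPS footprint quoted on the slide, used here as an example.
    const size_t nextJobBytes = 876ULL << 20;

    printf("GPU memory: %zu MB free of %zu MB total\n",
           freeBytes >> 20, totalBytes >> 20);

    if (freeBytes >= nextJobBytes)
        printf("A job with a %zu MB footprint would still fit.\n",
               nextJobBytes >> 20);
    else
        printf("Not enough free memory for another %zu MB job.\n",
               nextJobBytes >> 20);
    return 0;
}
```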
Outline
Other reasons for using rCUDA?
Cheaper cluster upgrade
• Let's suppose that a cluster without GPUs needs to be upgraded to use GPUs.
• GPUs require large power supplies: are the power supplies already installed in the nodes large enough?
• GPUs require large amounts of space: does the current form factor of the nodes allow GPUs to be installed?
• The answer to both questions is usually "NO".
[Figure: cluster of nodes without GPUs]
Cheaper cluster upgrade
Approach 1: augment the cluster with some CUDA GPU-enabled nodes  only those GPU-enabled nodes can execute accelerated applications.
[Figure: the original nodes without GPUs plus a few added GPU-enabled nodes]
Cheaper cluster upgrade
Approach 2: augment the cluster with some rCUDA servers  all nodes can execute accelerated applications.
[Figure: the original nodes plus a few GPU-enabled nodes acting as rCUDA servers]
Cheaper cluster upgrade
 Dual-socket Intel Xeon E5-2620 v2 nodes + 32 GB RAM + K20 GPU
 FDR InfiniBand based cluster
 16 nodes without GPU + 1 node with 4 GPUs
More workloads for studying rCUDA+Slurm
Performance
[Charts annotated with -68%, -60%, -63% and -56% on some bars and +131% and +119% on others]
Outline
Additional reasons for using rCUDA?
#1: More GPUs for a single application
[Figure: a single application using 64 GPUs at the same time]
#1: More GPUs for a single application
 MonteCarlo Multi-GPU (from the NVIDIA samples), run over FDR InfiniBand with NVIDIA Tesla K20 GPUs
[Charts: execution time (lower is better) and throughput (higher is better) as more GPUs are used]
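The key enabler here is that, under rCUDA, the number of devices reported by the CUDA runtime is the number of remote GPUs the client has been granted, not the number physically installed in the node. A minimal multi-GPU sketch in plain CUDA (the device loop is illustrative; the MonteCarlo sample distributes its work across all visible devices in a similar spirit):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Enumerate every GPU the CUDA runtime exposes and touch each one.
// With plain CUDA the count is limited to the GPUs inside the node; with
// rCUDA it reflects however many remote GPUs the client is configured to see.
int main() {
    int count = 0;
    cudaGetDeviceCount(&count);
    printf("Visible GPUs: %d\n", count);

    for (int dev = 0; dev < count; ++dev) {
        cudaSetDevice(dev);

        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, dev);

        // Allocate a small buffer on each device to confirm it is usable.
        void* d = nullptr;
        cudaMalloc(&d, 1 << 20);
        printf("  GPU %d: %s, %zu MB\n", dev, prop.name,
               prop.totalGlobalMem >> 20);
        cudaFree(d);
    }
    return 0;
}
```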
#2: Virtual machines can share GPUs
• With PCI passthrough, the GPU is assigned exclusively to a single virtual machine.
• Concurrent usage of the GPU is not possible.
#2: Virtual machines can share GPUs
[Figures: with rCUDA, the virtual machines in a node share GPUs both when a high performance network is available and when only a low performance network is available]
#3: GPU task migration
 Box A has 4 GPUs but only one is busy.
 Box B has 8 GPUs but only two are busy.
1. Move the jobs from Box B to Box A and switch off Box B.
2. Migration should be transparent to applications (it is decided by the global scheduler).
Migration is performed at GPU granularity.
[Figure: Box A and Box B before and after consolidating the GPU jobs]
#3: GPU task migration
Job granularity instead of GPU granularity.
[Figure: migration example timeline]
Outline
… in summary …
Pros and cons of rCUDA
• Pros:
1. Many GPUs for a single application
2. Concurrent GPU access to virtual machines
3. Increased cluster throughput
4. Similar performance with smaller investment
5. Easier (cheaper) cluster upgrade
6. Migration of GPU jobs
7. Reduced energy consumption
8. Increased GPU utilization
• Cons:
1. Reduced bandwidth to remote GPU (really a concern??)
Get a free copy of rCUDA at http://www.rcuda.net
@rcuda_
More than 650 requests worldwide
rCUDA is a development by Technical University of Valencia
Thanks!
Questions?
rCUDA is a development by Technical University of Valencia
