Addressing Emerging Challenges in Designing HPC Runtimes: Energy-Awareness, Accelerators and Virtualization
Dhabaleswar K. (DK) Panda
The Ohio State University
E-mail: panda@cse.ohio-state.edu
http://www.cse.ohio-state.edu/~panda
Talk at HPCAC-Switzerland (Mar '16)
Overview of A Few Challenges being Addressed by the MVAPICH2 Project for Exascale
•  Scalability for million to billion processors
•  Collective communication
•  Unified Runtime for Hybrid MPI+PGAS programming (MPI + OpenSHMEM, MPI + UPC, CAF, …)
•  InfiniBand Network Analysis and Monitoring (INAM)
•  Integrated Support for GPGPUs
•  Integrated Support for MICs
•  Virtualization (SR-IOV and Containers)
•  Energy-Awareness
•  Best Practice: Set of Tunings for Common Applications
Overview of A Few Challenges being Addressed by the MVAPICH2 Project for Exascale
•  Integrated Support for GPGPUs
   –  CUDA-Aware MPI
   –  GPUDirect RDMA (GDR) Support
   –  CUDA-aware Non-blocking Collectives
   –  Support for Managed Memory
   –  Efficient Datatype Processing
   –  Supporting Streaming Applications with GDR
   –  Efficient Deep Learning with MVAPICH2-GDR
•  Integrated Support for MICs
•  Virtualization (SR-IOV and Containers)
•  Energy-Awareness
•  Best Practice: Set of Tunings for Common Applications
MPI + CUDA - Naive
[Figure: sender and receiver data paths between GPU and NIC through CPU and PCIe, across the switch]
At Sender:
  cudaMemcpy(s_hostbuf, s_devbuf, . . .);
  MPI_Send(s_hostbuf, size, . . .);
At Receiver:
  MPI_Recv(r_hostbuf, size, . . .);
  cudaMemcpy(r_devbuf, r_hostbuf, . . .);
•  Data movement in applications with standard MPI and CUDA interfaces
High Productivity and Low Performance
MPI + CUDA - Advanced
[Figure: pipelined sender and receiver data paths between GPU and NIC through CPU and PCIe, across the switch]
At Sender:
  for (j = 0; j < pipeline_len; j++)
    cudaMemcpyAsync(s_hostbuf + j * blk_sz, s_devbuf + j * blk_sz, …);
  for (j = 0; j < pipeline_len; j++) {
    while (result != cudaSuccess) {
      result = cudaStreamQuery(…);
      if (j > 0) MPI_Test(…);
    }
    MPI_Isend(s_hostbuf + j * blk_sz, blk_sz, . . .);
  }
  MPI_Waitall(…);
<<Similar at receiver>>
•  Pipelining at user level with non-blocking MPI and CUDA interfaces (a fuller sketch follows below)
Low Productivity and High Performance
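The pipelining pseudocode above elides several details. As a rough illustration only, here is a self-contained sender-side sketch in plain MPI + CUDA; the block count NBLOCKS, block size BLK_SZ, destination rank dest, per-block streams and tags are illustrative assumptions, not part of any MVAPICH2 interface.

  /* Sketch of sender-side host-staged pipelining with CUDA + MPI.
   * Assumptions (not from the slide): fixed block size BLK_SZ, NBLOCKS blocks,
   * destination rank `dest`, one CUDA stream per block, tag = block index. */
  #include <mpi.h>
  #include <cuda_runtime.h>

  #define NBLOCKS 8
  #define BLK_SZ  (1 << 20)   /* 1 MB per pipeline block */

  void pipelined_send(const char *s_devbuf, int dest, MPI_Comm comm)
  {
      char *s_hostbuf;
      cudaStream_t streams[NBLOCKS];
      MPI_Request reqs[NBLOCKS];

      /* Pinned staging buffer so the D2H copies can be truly asynchronous */
      cudaMallocHost((void **)&s_hostbuf, (size_t)NBLOCKS * BLK_SZ);

      for (int j = 0; j < NBLOCKS; j++) {
          cudaStreamCreate(&streams[j]);
          cudaMemcpyAsync(s_hostbuf + (size_t)j * BLK_SZ, s_devbuf + (size_t)j * BLK_SZ,
                          BLK_SZ, cudaMemcpyDeviceToHost, streams[j]);
      }

      for (int j = 0; j < NBLOCKS; j++) {
          /* Wait for block j's copy while progressing earlier sends */
          while (cudaStreamQuery(streams[j]) != cudaSuccess) {
              if (j > 0) {
                  int flag;
                  MPI_Test(&reqs[j - 1], &flag, MPI_STATUS_IGNORE);
              }
          }
          MPI_Isend(s_hostbuf + (size_t)j * BLK_SZ, BLK_SZ, MPI_CHAR, dest, j, comm, &reqs[j]);
      }
      MPI_Waitall(NBLOCKS, reqs, MPI_STATUSES_IGNORE);

      for (int j = 0; j < NBLOCKS; j++)
          cudaStreamDestroy(streams[j]);
      cudaFreeHost(s_hostbuf);
  }

The receiver mirrors this pattern (post the receives, then drain each block to the GPU as it arrives), which is exactly the per-application bookkeeping that motivates pushing the pipeline inside the MPI library.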
GPU-Aware MPI Library: MVAPICH2-GPU
At Sender:
  MPI_Send(s_devbuf, size, …);    // data movement handled inside MVAPICH2
At Receiver:
  MPI_Recv(r_devbuf, size, …);    // data movement handled inside MVAPICH2
•  Standard MPI interfaces used for unified data movement (see the sketch below)
•  Takes advantage of Unified Virtual Addressing (>= CUDA 4.0)
•  Overlaps data movement from GPU with RDMA transfers
High Performance and High Productivity
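For comparison with the naive and pipelined versions, a minimal sketch of the same transfer through a CUDA-aware MPI library such as MVAPICH2-GPU is shown below; the message size, tag and two-rank layout are illustrative assumptions.

  /* Sketch: GPU-to-GPU transfer with a CUDA-aware MPI library.
   * Device pointers are passed straight to MPI; staging, pipelining and
   * GPUDirect RDMA happen inside the library. Sizes and ranks are illustrative. */
  #include <mpi.h>
  #include <cuda_runtime.h>

  int main(int argc, char **argv)
  {
      const size_t size = 4 << 20;            /* 4 MB message */
      int rank;
      char *devbuf;

      MPI_Init(&argc, &argv);
      MPI_Comm_rank(MPI_COMM_WORLD, &rank);
      cudaMalloc((void **)&devbuf, size);

      if (rank == 0)
          MPI_Send(devbuf, (int)size, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
      else if (rank == 1)
          MPI_Recv(devbuf, (int)size, MPI_CHAR, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);

      cudaFree(devbuf);
      MPI_Finalize();
      return 0;
  }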
GPU-Direct RDMA (GDR) with CUDA
•  OFED with support for GPUDirect RDMA is developed by NVIDIA and Mellanox
•  OSU has a design of MVAPICH2 using GPUDirect RDMA
   –  Hybrid design using GPU-Direct RDMA
      •  GPUDirect RDMA and host-based pipelining
      •  Alleviates P2P bandwidth bottlenecks on SandyBridge and IvyBridge
   –  Support for communication using multi-rail
   –  Support for Mellanox Connect-IB and ConnectX VPI adapters
   –  Support for RoCE with Mellanox ConnectX VPI adapters
[Figure: connectivity among IB adapter, system memory, GPU memory, GPU, CPU and chipset]
•  SNB E5-2670: P2P write 5.2 GB/s, P2P read < 1.0 GB/s
•  IVB E5-2680V2: P2P write 6.4 GB/s, P2P read 3.5 GB/s
CUDA-Aware MPI: MVAPICH2-GDR 1.8-2.2 Releases
•  Support for MPI communication from NVIDIA GPU device memory
•  High performance RDMA-based inter-node point-to-point communication (GPU-GPU, GPU-Host and Host-GPU)
•  High performance intra-node point-to-point communication for multi-GPU adapters/node (GPU-GPU, GPU-Host and Host-GPU)
•  Taking advantage of CUDA IPC (available since CUDA 4.1) in intra-node communication for multiple GPU adapters/node
•  Optimized and tuned collectives for GPU device buffers
•  MPI datatype support for point-to-point and collective communication from GPU device buffers
Performance of MVAPICH2-GPU with GPU-Direct RDMA (GDR)
[Figure: GPU-GPU internode latency, bandwidth and bi-bandwidth vs. message size for MV2-GDR 2.2b, MV2-GDR 2.0b and MV2 without GDR; ~2.18 us small-message latency and up to 11X bandwidth improvement over MV2 without GDR]
Test platform:
  MVAPICH2-GDR-2.2b
  Intel Ivy Bridge (E5-2680 v2) node - 20 cores
  NVIDIA Tesla K40c GPU
  Mellanox Connect-IB Dual-FDR HCA
  CUDA 7
  Mellanox OFED 2.4 with GPU-Direct-RDMA
Application-Level Evaluation (HOOMD-blue)
[Figure: average time steps per second (TPS) vs. number of processes for 64K and 256K particles; MV2+GDR achieves ~2X the TPS of MV2]
•  Platform: Wilkes (Intel Ivy Bridge + NVIDIA Tesla K20c + Mellanox Connect-IB)
•  HoomdBlue Version 1.0.5
•  GDRCOPY enabled: MV2_USE_CUDA=1 MV2_IBA_HCA=mlx5_0 MV2_IBA_EAGER_THRESHOLD=32768 MV2_VBUF_TOTAL_SIZE=32768 MV2_USE_GPUDIRECT_LOOPBACK_LIMIT=32768 MV2_USE_GPUDIRECT_GDRCOPY=1 MV2_USE_GPUDIRECT_GDRCOPY_LIMIT=16384
CUDA-Aware Non-Blocking Collectives
[Figure: medium/large message overlap (%) vs. message size on 64 GPU nodes for Ialltoall and Igather, with 1 process/node and 2 processes/node (1 process/GPU)]
Platform: Wilkes: Intel Ivy Bridge + NVIDIA Tesla K20c + Mellanox Connect-IB
Available since MVAPICH2-GDR 2.2a (an overlap sketch follows below)
A. Venkatesh, K. Hamidouche, H. Subramoni, and D. K. Panda, Offloaded GPU Collectives using CORE-Direct and CUDA Capabilities on IB Clusters, HiPC, 2015
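To make the measured overlap concrete, the sketch below posts an MPI_Ialltoall directly on device buffers and performs independent work before waiting; the per-peer count and the commented-out do_local_compute() placeholder are illustrative assumptions.

  /* Sketch: overlapping computation with a CUDA-aware non-blocking collective.
   * Device buffers are passed directly; do_local_compute() is a placeholder. */
  #include <mpi.h>
  #include <cuda_runtime.h>

  void overlapped_alltoall(int nprocs, MPI_Comm comm)
  {
      const int count = 64 * 1024;             /* elements per peer (illustrative) */
      float *d_send, *d_recv;
      MPI_Request req;

      cudaMalloc((void **)&d_send, (size_t)count * nprocs * sizeof(float));
      cudaMalloc((void **)&d_recv, (size_t)count * nprocs * sizeof(float));

      MPI_Ialltoall(d_send, count, MPI_FLOAT, d_recv, count, MPI_FLOAT, comm, &req);

      /* Independent work (e.g., a CUDA kernel on other data) overlaps with the
         collective, which progresses inside the library / offload engine. */
      /* do_local_compute(); */

      MPI_Wait(&req, MPI_STATUS_IGNORE);

      cudaFree(d_send);
      cudaFree(d_recv);
  }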
Communication Runtime with GPU Managed Memory
●  In CUDA 6.0, NVIDIA introduced CUDA Managed (or Unified) Memory, allowing a common memory allocation for GPU or CPU through the cudaMallocManaged() call
●  Significant productivity benefits due to abstraction of explicit allocation and cudaMemcpy()
●  Extended MVAPICH2 to perform communication directly from managed buffers (available in MVAPICH2-GDR 2.2b; a usage sketch follows below)
●  OSU Micro-benchmarks extended to evaluate the performance of point-to-point and collective communication using managed buffers
●  Available in OMB 5.2
D. S. Banerjee, K. Hamidouche, D. K. Panda, Designing High Performance Communication Runtime for GPU Managed Memory: Early Experiences, GPGPU-9 Workshop held in conjunction with PPoPP 2016, Barcelona, Spain
[Figure: latency (us) for host-host (H-H) vs. managed-host (MH-MH) and bandwidth (MB/s) for device-device (D-D) vs. managed-device (MD-MD) transfers vs. message size]
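As a usage illustration, the sketch below communicates directly from a managed allocation, assuming a CUDA-aware MPI library that accepts managed pointers (as MVAPICH2-GDR 2.2b does per the slide); the sizes, ranks and two-process layout are illustrative.

  /* Sketch: MPI communication directly from CUDA managed (unified) memory.
   * No explicit cudaMemcpy(); the same pointer is valid on host and device. */
  #include <mpi.h>
  #include <cuda_runtime.h>

  int main(int argc, char **argv)
  {
      const size_t n = 1 << 20;     /* 1M floats (illustrative) */
      int rank;
      float *buf;

      MPI_Init(&argc, &argv);
      MPI_Comm_rank(MPI_COMM_WORLD, &rank);

      cudaMallocManaged((void **)&buf, n * sizeof(float), cudaMemAttachGlobal);

      /* A GPU kernel could populate buf here; omitted for brevity. */

      if (rank == 0)
          MPI_Send(buf, (int)n, MPI_FLOAT, 1, 0, MPI_COMM_WORLD);
      else if (rank == 1)
          MPI_Recv(buf, (int)n, MPI_FLOAT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);

      cudaFree(buf);
      MPI_Finalize();
      return 0;
  }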
MPI Datatype Processing (Communication Optimization)
Common Scenario (waste of computing resources on CPU and GPU); a datatype example follows below:
  MPI_Isend(A, .. Datatype, …)
  MPI_Isend(B, .. Datatype, …)
  MPI_Isend(C, .. Datatype, …)
  MPI_Isend(D, .. Datatype, …)
  …
  MPI_Waitall(…);
(Buf1, Buf2, … contain non-contiguous MPI datatypes)
[Figure: timeline of existing vs. proposed design — in the existing design the CPU initiates the packing kernel, waits for the kernel (WFK) and then starts the send for each Isend in turn; the proposed design initiates the kernels for all Isends up front, overlapping kernels on streams with the sends, so the proposed design finishes earlier than the existing one]
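For a concrete non-contiguous case, the sketch below builds a strided column datatype for a halo region and sends it from a GPU buffer; the grid dimensions, neighbor rank and tag are illustrative assumptions, and the packing strategy itself lives inside the MPI library.

  /* Sketch: sending a non-contiguous (strided) GPU region with an MPI datatype.
   * d_grid is assumed to be a device pointer; a CUDA-aware MPI library
   * packs/unpacks the datatype, ideally overlapping packing kernels with sends. */
  #include <mpi.h>

  void send_halo_column(double *d_grid, int nx, int ny, int neighbor, MPI_Comm comm)
  {
      MPI_Datatype column;
      MPI_Request req;

      /* One column of an nx x ny row-major grid: ny blocks of 1 element, stride nx */
      MPI_Type_vector(ny, 1, nx, MPI_DOUBLE, &column);
      MPI_Type_commit(&column);

      MPI_Isend(d_grid, 1, column, neighbor, 0, comm, &req);
      MPI_Wait(&req, MPI_STATUS_IGNORE);

      MPI_Type_free(&column);
  }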
Application-Level Evaluation (HaloExchange - Cosmo)
[Figure: normalized execution time vs. number of GPUs for Default, Callback-based and Event-based designs on the Wilkes and CSCS GPU clusters]
•  2X improvement on 32 GPU nodes
•  30% improvement on 96 GPU nodes (8 GPUs/node)
C. Chu, K. Hamidouche, A. Venkatesh, D. Banerjee, H. Subramoni, and D. K. Panda, Exploiting Maximal Overlap for Non-Contiguous Data Movement Processing on Modern GPU-enabled Systems, IPDPS'16
Nature of Streaming Applications
•  Pipelined data-parallel compute phases that form the crux of streaming applications lend themselves to GPGPUs
•  Data distribution to GPGPU sites occurs over PCIe within the node and over InfiniBand interconnects across nodes
•  The broadcast operation is a key dictator of the throughput of streaming applications
•  The current broadcast operation on GPU clusters does not take advantage of
   •  IB hardware MCAST
   •  GPUDirect RDMA
Courtesy: Agarwalla, Bikash, et al. "Streamline: A scheduling heuristic for streaming applications on the grid." Electronic Imaging 2006
SGL-based Design for Efficient Broadcast Operation on GPU Systems
•  The current design is limited by the expensive copies from/to GPUs
•  Proposed several alternative designs to avoid the overhead of the copy
   •  Loopback, GDRCOPY and hybrid
   •  High performance and scalability
   •  Still uses PCIe resources for Host-GPU copies
•  Proposed SGL-based design (a usage sketch follows below)
   •  Combines IB MCAST and GPUDirect RDMA features
   •  High performance and scalability for D-D broadcast
   •  Direct code path between HCA and GPU
   •  Frees PCIe resources
   •  3X improvement in latency
A. Venkatesh, H. Subramoni, K. Hamidouche, and D. K. Panda, A High Performance Broadcast Design with Hardware Multicast and GPUDirect RDMA for Streaming Applications on InfiniBand Clusters, IEEE Int'l Conf. on High Performance Computing (HiPC '14)
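At the application level, such a streaming broadcast is just an MPI_Bcast on a device buffer; the sketch below shows that usage under the assumption of a CUDA-aware MPI library that may internally map the call onto IB hardware MCAST plus GPUDirect RDMA, as in the design above. The frame size, root rank and loop structure are illustrative.

  /* Sketch: broadcasting a data block to GPU memory on all nodes.
   * With a CUDA-aware MPI, the device pointer is passed directly and the
   * library may use IB MCAST + GPUDirect RDMA internally. */
  #include <mpi.h>
  #include <cuda_runtime.h>

  void broadcast_frame(unsigned char *d_frame, size_t bytes, int root, MPI_Comm comm)
  {
      /* Every rank (including the root) supplies its device buffer. */
      MPI_Bcast(d_frame, (int)bytes, MPI_BYTE, root, comm);
  }

  /* Illustrative streaming loop:
   *   for each incoming frame:
   *       if (rank == root) copy the frame into d_frame;
   *       broadcast_frame(d_frame, FRAME_BYTES, root, MPI_COMM_WORLD);
   *       launch the processing kernel on d_frame;
   */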
Accelerating Deep Learning with MVAPICH2-GDR
•  Caffe: a flexible and layered Deep Learning framework
•  Benefits and weaknesses
   –  Multi-GPU training within a single node
   –  Performance degradation for GPUs across different sockets
•  Can we enhance Caffe with MVAPICH2-GDR?
   –  Caffe-Enhanced: a CUDA-Aware MPI version
   –  Enables scale-up (within a node) and scale-out (across multi-GPU nodes)
   –  Initial evaluation suggests up to 8X reduction in training time on the CIFAR-10 dataset
Overview of A Few Challenges being Addressed by the MVAPICH2 Project for Exascale
•  Integrated Support for GPGPUs
•  Integrated Support for MICs
•  Virtualization (SR-IOV and Containers)
•  Energy-Awareness
•  Best Practice: Set of Tunings for Common Applications
MPI Applications on MIC Clusters
•  Flexibility in launching MPI jobs on clusters with Xeon Phi
[Figure: spectrum of execution modes from multi-core centric to many-core centric — Host-only (MPI program on the Xeon only), Offload / reverse offload (MPI program on the Xeon with offloaded computation on the Xeon Phi), Symmetric (MPI programs on both Xeon and Xeon Phi), Coprocessor-only (MPI program on the Xeon Phi only)]
MVAPICH2-MIC 2.0 Design for Clusters with IB and MIC
•  Offload Mode
•  Intranode Communication
   •  Coprocessor-only and Symmetric Mode
•  Internode Communication
   •  Coprocessor-only and Symmetric Mode
•  Multi-MIC Node Configurations
•  Running on three major systems
   •  Stampede, Blueridge (Virginia Tech) and Beacon (UTK)
MIC-Remote-MIC P2P Communication with Proxy-based Communication
[Figure: intra-socket and inter-socket MIC-remote-MIC point-to-point performance — large-message latency (usec) and bandwidth (MB/sec) vs. message size; peak bandwidths of about 5236 and 5594 MB/sec]
Optimized MPI Collectives for MIC Clusters (Allgather & Alltoall)
[Figure: 32-node Allgather small-message latency (16H + 16M), 32-node Allgather and Alltoall large-message latency (8H + 8M), and P3DFFT communication/computation time on 32 nodes (8H + 8M, size 2Kx2Kx1K) for MV2-MIC vs. MV2-MIC-Opt; improvements of roughly 76%, 58% and 55%]
A. Venkatesh, S. Potluri, R. Rajachandrasekar, M. Luo, K. Hamidouche and D. K. Panda, High Performance Alltoall and Allgather Designs for InfiniBand MIC Clusters; IPDPS'14, May 2014
Overview of A Few Challenges being Addressed by the MVAPICH2 Project for Exascale
•  Integrated Support for GPGPUs
•  Integrated Support for MICs
•  Virtualization (SR-IOV and Containers)
•  Energy-Awareness
•  Best Practice: Set of Tunings for Common Applications
Can HPC and Virtualization be Combined?
•  Virtualization has many benefits
   –  Fault-tolerance
   –  Job migration
   –  Compaction
•  It has not been very popular in HPC due to the overhead associated with virtualization
•  New SR-IOV (Single Root - IO Virtualization) support available with Mellanox InfiniBand adapters changes the field
•  Enhanced MVAPICH2 support for SR-IOV
•  MVAPICH2-Virt 2.1 (with and without OpenStack) is publicly available
•  How about Containers support?
J. Zhang, X. Lu, J. Jose, R. Shi and D. K. Panda, Can Inter-VM Shmem Benefit MPI Applications on SR-IOV based Virtualized InfiniBand Clusters? EuroPar'14
J. Zhang, X. Lu, J. Jose, M. Li, R. Shi and D. K. Panda, High Performance MPI Library over SR-IOV enabled InfiniBand Clusters, HiPC'14
J. Zhang, X. Lu, M. Arnold and D. K. Panda, MVAPICH2 Over OpenStack with SR-IOV: An Efficient Approach to Build HPC Clouds, CCGrid'15
Overview of MVAPICH2-Virt with SR-IOV and IVSHMEM
•  Redesign MVAPICH2 to make it virtual machine aware
   –  SR-IOV shows near-to-native performance for inter-node point-to-point communication
   –  IVSHMEM offers zero-copy access to data on shared memory of co-resident VMs
   –  Locality Detector: maintains the locality information of co-resident virtual machines
   –  Communication Coordinator: selects the communication channel (SR-IOV, IVSHMEM) adaptively
[Figure: host environment with two guest VMs — each guest's MPI process reaches the InfiniBand adapter through a VF driver and a virtual function (SR-IOV channel) and shares a /dev/shm-backed IV-SHM region (IV-Shmem channel); the hypervisor holds the PF driver for the physical function]
J. Zhang, X. Lu, J. Jose, R. Shi, D. K. Panda. Can Inter-VM Shmem Benefit MPI Applications on SR-IOV based Virtualized InfiniBand Clusters? Euro-Par, 2014.
J. Zhang, X. Lu, J. Jose, R. Shi, M. Li, D. K. Panda. High Performance MPI Library over SR-IOV Enabled InfiniBand Clusters. HiPC, 2014.
MVAPICH2-Virt with SR-IOV and IVSHMEM over OpenStack
•  OpenStack is one of the most popular open-source solutions to build clouds and manage virtual machines
•  Deployment with OpenStack
   –  Supporting SR-IOV configuration
   –  Supporting IVSHMEM configuration
   –  Virtual machine aware design of MVAPICH2 with SR-IOV
•  An efficient approach to build HPC Clouds with MVAPICH2-Virt and OpenStack
[Figure: OpenStack services around the VM — Nova (provisions), Glance (provides images), Neutron (provides network), Swift (stores images and backup volumes), Keystone (provides auth), Cinder (provides volumes), Heat (orchestrates cloud), Ceilometer (monitors), Horizon (provides UI)]
J. Zhang, X. Lu, M. Arnold, D. K. Panda. MVAPICH2 over OpenStack with SR-IOV: An Efficient Approach to Build HPC Clouds. CCGrid, 2015.
Application-Level Performance on Chameleon (SPEC MPI2007 and Graph500)
[Figure: execution time for SPEC MPI2007 applications (milc, leslie3d, pop2, GAPgeofem, zeusmp2, lu) and for Graph500 problem sizes (scale, edgefactor) with MV2-SR-IOV-Def, MV2-SR-IOV-Opt and MV2-Native]
•  32 VMs, 6 cores/VM
•  Compared to Native, 2-5% overhead for Graph500 with 128 procs
•  Compared to Native, 1-9.5% overhead for SPEC MPI2007 with 128 procs
NSF Chameleon Cloud: A Powerful and Flexible Experimental Instrument
•  Large-scale instrument
   –  Targeting Big Data, Big Compute, Big Instrument research
   –  ~650 nodes (~14,500 cores), 5 PB disk over two sites, 2 sites connected with 100G network
•  Reconfigurable instrument
   –  Bare metal reconfiguration, operated as single instrument, graduated approach for ease-of-use
•  Connected instrument
   –  Workload and Trace Archive
   –  Partnerships with production clouds: CERN, OSDC, Rackspace, Google, and others
   –  Partnerships with users
•  Complementary instrument
   –  Complementing GENI, Grid'5000, and other testbeds
•  Sustainable instrument
   –  Industry connections
http://www.chameleoncloud.org/
Containers Support: MVAPICH2 Intra-node Point-to-Point Performance on Chameleon
[Figure: intra-node inter-container latency (us) and bandwidth (MBps) vs. message size for Container-Def, Container-Opt and Native]
•  Intra-node inter-container
•  Compared to Container-Def, up to 81% and 191% improvement in latency and bandwidth
•  Compared to Native, minor overhead in latency and bandwidth
Containers Support: Application-Level Performance on Chameleon (NAS and Graph 500)
[Figure: NAS class D benchmarks (MG, FT, EP, LU, CG) execution time and Graph 500 execution time vs. problem size (scale, edgefactor) for Container-Def, Container-Opt and Native]
•  64 containers across 16 nodes, pinning 4 cores per container
•  Compared to Container-Def, up to 11% and 16% execution time reduction for NAS and Graph 500
•  Compared to Native, less than 9% and 4% overhead for NAS and Graph 500
•  Optimized container support will be available with the next release of MVAPICH2-Virt
Overview of A Few Challenges being Addressed by the MVAPICH2 Project for Exascale
•  Integrated Support for GPGPUs
•  Integrated Support for MICs
•  Virtualization (SR-IOV and Containers)
•  Energy-Awareness
•  Best Practice: Set of Tunings for Common Applications
Designing Energy-Aware (EA) MPI Runtime
[Figure: overall application energy expenditure split into energy spent in computation routines and in communication routines (point-to-point, collective and RMA routines); MVAPICH2-EA designs target MPI two-sided and collectives (ex: MVAPICH2) and MPI-3 RMA implementations (ex: MVAPICH2), with impact on one-sided runtimes (ex: ComEx) and other PGAS implementations (ex: OSHMPI)]
Energy-Aware MVAPICH2 & OSU Energy Management Tool (OEMT)
•  MVAPICH2-EA 2.1 (Energy-Aware)
   •  A white-box approach
   •  New energy-efficient communication protocols for pt-pt and collective operations
   •  Intelligently apply the appropriate energy saving techniques
   •  Application-oblivious energy saving
•  OEMT
   •  A library utility to measure energy consumption for MPI applications
   •  Works with all MPI runtimes
   •  PRELOAD option for precompiled applications
   •  Does not require ROOT permission:
      •  A safe kernel module to read only a subset of MSRs
MVAPICH2-EA: Application-Oblivious Energy-Aware MPI (EAM)
•  An energy-efficient runtime that provides energy savings without application knowledge
•  Uses the best energy lever automatically and transparently
•  Provides guarantees on maximum degradation, with 5-41% savings at <= 5% degradation
•  Pessimistic MPI applies the energy reduction lever to each MPI call
A. Venkatesh, A. Vishnu, K. Hamidouche, N. Tallent, D. K. Panda, D. Kerbyson, and A. Hoisie, A Case for Application-Oblivious Energy-Efficient MPI Runtime, Supercomputing '15, Nov 2015 [Best Student Paper Finalist]
MPI-3 RMA Energy Savings with Proxy-Applications (Graph500)
[Figure: Graph500 execution time (seconds) and energy usage (Joules) at 128, 256 and 512 processes for optimistic, pessimistic and EAM-RMA runtimes; up to 46% energy savings]
•  MPI_Win_fence dominates application execution time in Graph500 (a fence-epoch sketch follows below)
•  Between 128 and 512 processes, EAM-RMA yields between 31% and 46% savings with no degradation in execution time in comparison with the default optimistic MPI runtime
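For context, the RMA pattern these savings apply to is a fence-synchronized epoch like the sketch below; the window size, target rank and displacement are illustrative assumptions, and the energy levers are applied inside the MPI library rather than in application code.

  /* Sketch: fence-synchronized MPI-3 RMA epoch (the pattern that dominates
   * Graph500-style codes). The EAM-RMA runtime applies energy levers inside
   * MPI_Win_fence/MPI_Put; the application code is unchanged. */
  #include <mpi.h>

  void rma_epoch(int target, MPI_Comm comm)
  {
      const int n = 1024;                       /* illustrative window size */
      long *base;
      MPI_Win win;

      MPI_Win_allocate(n * sizeof(long), sizeof(long), MPI_INFO_NULL, comm, &base, &win);

      MPI_Win_fence(0, win);                    /* open access/exposure epoch */
      long update = 42;
      MPI_Put(&update, 1, MPI_LONG, target, /*disp=*/0, 1, MPI_LONG, win);
      MPI_Win_fence(0, win);                    /* close epoch; the Put is complete */

      MPI_Win_free(&win);
  }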
MPI-3 RMA Energy Savings with Proxy-Applications (SCF)
[Figure: SCF execution time (seconds) and energy usage (Joules) at 128, 256 and 512 processes for optimistic, pessimistic and EAM-RMA runtimes; up to 42% energy savings]
•  The SCF (self-consistent field) calculation spends nearly 75% of total time in the MPI_Win_unlock call
•  With 256 and 512 processes, EAM-RMA yields 42% and 36% savings at 11% degradation (close to the permitted degradation ρ = 10%)
•  128 processes is an exception due to 2-sided and 1-sided interaction
•  MPI-3 RMA energy-efficient support will be available in an upcoming MVAPICH2-EA release
Overview of A Few Challenges being Addressed by the MVAPICH2 Project for Exascale
•  Integrated Support for GPGPUs
•  Integrated Support for MICs
•  Virtualization (SR-IOV and Containers)
•  Energy-Awareness
•  Best Practice: Set of Tunings for Common Applications
Applications-Level Tuning: Compilation of Best Practices
•  The MPI runtime has many parameters
•  Tuning a set of parameters can help you extract higher performance
•  Compiled a list of such contributions through the MVAPICH website
   –  http://mvapich.cse.ohio-state.edu/best_practices/
•  Initial list of applications
   –  Amber
   –  HoomdBlue
   –  HPCG
   –  Lulesh
   –  MILC
   –  MiniAMR
   –  Neuron
   –  SMG2000
•  Soliciting additional contributions; send your results to mvapich-help at cse.ohio-state.edu and we will link these results with credits to you.
MVAPICH2 - Plans for Exascale
•  Performance and memory scalability toward 1M cores
•  Hybrid programming (MPI + OpenSHMEM, MPI + UPC, MPI + CAF, …)
•  Support for task-based parallelism (UPC++)*
•  Enhanced optimization for GPU support and accelerators
•  Taking advantage of advanced features of Mellanox InfiniBand
   •  On-Demand Paging (ODP)
   •  Switch-IB2 SHArP
   •  GID-based support
•  Enhanced inter-node and intra-node communication schemes for upcoming architectures
   •  OpenPower*
   •  OmniPath-PSM2*
   •  Knights Landing
•  Extended topology-aware collectives
•  Extended energy-aware designs and virtualization support
•  Extended support for MPI Tools Interface (as in MPI 3.0)
•  Extended Checkpoint-Restart and migration support with SCR
•  Support for * features will be available in MVAPICH2-2.2 RC1
Looking into the Future ….
•  Exascale systems will be constrained by
   –  Power
   –  Memory per core
   –  Data movement cost
   –  Faults
•  Programming models and runtimes for HPC need to be designed for
   –  Scalability
   –  Performance
   –  Fault-resilience
   –  Energy-awareness
   –  Programmability
   –  Productivity
•  Highlighted some of the issues and challenges
•  Need continuous innovation on all these fronts
Funding	Acknowledgments	
Funding	Support	by	
Equipment	Support	by
Personnel Acknowledgments
Current Students
–  A. Augustine (M.S.), A. Awan (Ph.D.), S. Chakraborthy (Ph.D.), C.-H. Chu (Ph.D.), N. Islam (Ph.D.), M. Li (Ph.D.), K. Kulkarni (M.S.), M. Rahman (Ph.D.), D. Shankar (Ph.D.), A. Venkatesh (Ph.D.), J. Zhang (Ph.D.)
Past Students
–  P. Balaji (Ph.D.), S. Bhagvat (M.S.), A. Bhat (M.S.), D. Buntinas (Ph.D.), L. Chai (Ph.D.), B. Chandrasekharan (M.S.), N. Dandapanthula (M.S.), V. Dhanraj (M.S.), T. Gangadharappa (M.S.), K. Gopalakrishnan (M.S.), W. Huang (Ph.D.), W. Jiang (M.S.), J. Jose (Ph.D.), S. Kini (M.S.), M. Koop (Ph.D.), R. Kumar (M.S.), S. Krishnamoorthy (M.S.), K. Kandalla (Ph.D.), P. Lai (M.S.), J. Liu (Ph.D.), M. Luo (Ph.D.), A. Mamidala (Ph.D.), G. Marsh (M.S.), V. Meshram (M.S.), A. Moody (M.S.), S. Naravula (Ph.D.), R. Noronha (Ph.D.), X. Ouyang (Ph.D.), S. Pai (M.S.), S. Potluri (Ph.D.), R. Rajachandrasekar (Ph.D.), G. Santhanaraman (Ph.D.), A. Singh (Ph.D.), J. Sridhar (M.S.), S. Sur (Ph.D.), H. Subramoni (Ph.D.), K. Vaidyanathan (Ph.D.), A. Vishnu (Ph.D.), J. Wu (Ph.D.), W. Yu (Ph.D.)
Past Research Scientist
–  S. Sur
Current Research Scientists
–  H. Subramoni, X. Lu
Current Senior Research Associate
–  K. Hamidouche
Current Post-Docs
–  J. Lin, D. Banerjee
Past Post-Docs
–  H. Wang, X. Besseron, H.-W. Jin, M. Luo, E. Mancini, S. Marcarelli, J. Vienne
Current Programmer
–  J. Perkins
Past Programmers
–  D. Bureddy
Current Research Specialist
–  M. Arnold
International Workshop on Communication Architectures at Extreme Scale (ExaComm)
ExaComm 2015 was held with the Int'l Supercomputing Conference (ISC '15), at Frankfurt, Germany, on Thursday, July 16th, 2015
One Keynote Talk: John M. Shalf, CTO, LBL/NERSC
Four Invited Talks: Dror Goldenberg (Mellanox); Martin Schulz (LLNL); Cyriel Minkenberg (IBM-Zurich); Arthur (Barney) Maccabe (ORNL)
Panel: Ron Brightwell (Sandia)
Two Research Papers
ExaComm 2016 will be held in conjunction with ISC '16
http://web.cse.ohio-state.edu/~subramon/ExaComm16/exacomm16.html
Technical Paper Submission Deadline: Friday, April 15, 2016
Thank You!
panda@cse.ohio-state.edu
Network-Based Computing Laboratory
http://nowlab.cse.ohio-state.edu/
The MVAPICH2 Project
http://mvapich.cse.ohio-state.edu/
The High-Performance Big Data Project
http://hibd.cse.ohio-state.edu/