Addressing Emerging Challenges in Designing HPC Runtimes: Energy-Awareness, Accelerators and Virtualization
Dhabaleswar K. (DK) Panda
The Ohio State University
E-mail: panda@cse.ohio-state.edu
http://www.cse.ohio-state.edu/~panda
Talk at HPCAC-Switzerland (Mar '16)
Overview of A Few Challenges being Addressed by the MVAPICH2 Project for Exascale
•  Scalability for million to billion processors
•  Collective communication
•  Unified Runtime for Hybrid MPI+PGAS programming (MPI + OpenSHMEM, MPI + UPC, CAF, …)
•  InfiniBand Network Analysis and Monitoring (INAM)
•  Integrated Support for GPGPUs
•  Integrated Support for MICs
•  Virtualization (SR-IOV and Containers)
•  Energy-Awareness
•  Best Practice: Set of Tunings for Common Applications
Overview of A Few Challenges being Addressed by the MVAPICH2 Project for Exascale
•  Integrated Support for GPGPUs
   –  CUDA-Aware MPI
   –  GPUDirect RDMA (GDR) Support
   –  CUDA-aware Non-blocking Collectives
   –  Support for Managed Memory
   –  Efficient Datatype Processing
   –  Supporting Streaming Applications with GDR
   –  Efficient Deep Learning with MVAPICH2-GDR
•  Integrated Support for MICs
•  Virtualization (SR-IOV and Containers)
•  Energy-Awareness
•  Best Practice: Set of Tunings for Common Applications
MPI + CUDA - Naive
[Figure: sender and receiver data paths between GPU and NIC through CPU and PCIe, across the switch]
At Sender:
  cudaMemcpy(s_hostbuf, s_devbuf, . . .);
  MPI_Send(s_hostbuf, size, . . .);
At Receiver:
  MPI_Recv(r_hostbuf, size, . . .);
  cudaMemcpy(r_devbuf, r_hostbuf, . . .);
•  Data movement in applications with standard MPI and CUDA interfaces
High Productivity and Low Performance
MPI + CUDA - Advanced
[Figure: pipelined sender and receiver data paths between GPU and NIC through CPU and PCIe, across the switch]
At Sender:
  for (j = 0; j < pipeline_len; j++)
    cudaMemcpyAsync(s_hostbuf + j * blk_sz, s_devbuf + j * blk_sz, …);
  for (j = 0; j < pipeline_len; j++) {
    while (result != cudaSuccess) {
      result = cudaStreamQuery(…);
      if (j > 0) MPI_Test(…);
    }
    MPI_Isend(s_hostbuf + j * blk_sz, blk_sz, . . .);
  }
  MPI_Waitall(…);
<<Similar at receiver>>
•  Pipelining at user level with non-blocking MPI and CUDA interfaces (a fuller sketch follows below)
Low Productivity and High Performance
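The pipelining pseudocode above elides several details. As a rough illustration only, here is a self-contained sender-side sketch in plain MPI + CUDA; the block count NBLOCKS, block size BLK_SZ, destination rank dest, per-block streams and tags are illustrative assumptions, not part of any MVAPICH2 interface.

  /* Sketch of sender-side host-staged pipelining with CUDA + MPI.
   * Assumptions (not from the slide): fixed block size BLK_SZ, NBLOCKS blocks,
   * destination rank `dest`, one CUDA stream per block, tag = block index. */
  #include <mpi.h>
  #include <cuda_runtime.h>

  #define NBLOCKS 8
  #define BLK_SZ  (1 << 20)   /* 1 MB per pipeline block */

  void pipelined_send(const char *s_devbuf, int dest, MPI_Comm comm)
  {
      char *s_hostbuf;
      cudaStream_t streams[NBLOCKS];
      MPI_Request reqs[NBLOCKS];

      /* Pinned staging buffer so the D2H copies can be truly asynchronous */
      cudaMallocHost((void **)&s_hostbuf, (size_t)NBLOCKS * BLK_SZ);

      for (int j = 0; j < NBLOCKS; j++) {
          cudaStreamCreate(&streams[j]);
          cudaMemcpyAsync(s_hostbuf + (size_t)j * BLK_SZ, s_devbuf + (size_t)j * BLK_SZ,
                          BLK_SZ, cudaMemcpyDeviceToHost, streams[j]);
      }

      for (int j = 0; j < NBLOCKS; j++) {
          /* Wait for block j's copy while progressing earlier sends */
          while (cudaStreamQuery(streams[j]) != cudaSuccess) {
              if (j > 0) {
                  int flag;
                  MPI_Test(&reqs[j - 1], &flag, MPI_STATUS_IGNORE);
              }
          }
          MPI_Isend(s_hostbuf + (size_t)j * BLK_SZ, BLK_SZ, MPI_CHAR, dest, j, comm, &reqs[j]);
      }
      MPI_Waitall(NBLOCKS, reqs, MPI_STATUSES_IGNORE);

      for (int j = 0; j < NBLOCKS; j++)
          cudaStreamDestroy(streams[j]);
      cudaFreeHost(s_hostbuf);
  }

The receiver mirrors this pattern (post the receives, then drain each block to the GPU as it arrives), which is exactly the per-application bookkeeping that motivates pushing the pipeline inside the MPI library.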
GPU-Aware MPI Library: MVAPICH2-GPU
At Sender:
  MPI_Send(s_devbuf, size, …);    // data movement handled inside MVAPICH2
At Receiver:
  MPI_Recv(r_devbuf, size, …);    // data movement handled inside MVAPICH2
•  Standard MPI interfaces used for unified data movement (see the sketch below)
•  Takes advantage of Unified Virtual Addressing (>= CUDA 4.0)
•  Overlaps data movement from GPU with RDMA transfers
High Performance and High Productivity
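For comparison with the naive and pipelined versions, a minimal sketch of the same transfer through a CUDA-aware MPI library such as MVAPICH2-GPU is shown below; the message size, tag and two-rank layout are illustrative assumptions.

  /* Sketch: GPU-to-GPU transfer with a CUDA-aware MPI library.
   * Device pointers are passed straight to MPI; staging, pipelining and
   * GPUDirect RDMA happen inside the library. Sizes and ranks are illustrative. */
  #include <mpi.h>
  #include <cuda_runtime.h>

  int main(int argc, char **argv)
  {
      const size_t size = 4 << 20;            /* 4 MB message */
      int rank;
      char *devbuf;

      MPI_Init(&argc, &argv);
      MPI_Comm_rank(MPI_COMM_WORLD, &rank);
      cudaMalloc((void **)&devbuf, size);

      if (rank == 0)
          MPI_Send(devbuf, (int)size, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
      else if (rank == 1)
          MPI_Recv(devbuf, (int)size, MPI_CHAR, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);

      cudaFree(devbuf);
      MPI_Finalize();
      return 0;
  }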
GPU-Direct RDMA (GDR) with CUDA
•  OFED with support for GPUDirect RDMA is developed by NVIDIA and Mellanox
•  OSU has a design of MVAPICH2 using GPUDirect RDMA
   –  Hybrid design using GPU-Direct RDMA
      •  GPUDirect RDMA and host-based pipelining
      •  Alleviates P2P bandwidth bottlenecks on SandyBridge and IvyBridge
   –  Support for communication using multi-rail
   –  Support for Mellanox Connect-IB and ConnectX VPI adapters
   –  Support for RoCE with Mellanox ConnectX VPI adapters
[Figure: connectivity among IB adapter, system memory, GPU memory, GPU, CPU and chipset]
•  SNB E5-2670: P2P write 5.2 GB/s, P2P read < 1.0 GB/s
•  IVB E5-2680V2: P2P write 6.4 GB/s, P2P read 3.5 GB/s
CUDA-Aware MPI: MVAPICH2-GDR 1.8-2.2 Releases
•  Support for MPI communication from NVIDIA GPU device memory
•  High performance RDMA-based inter-node point-to-point communication (GPU-GPU, GPU-Host and Host-GPU)
•  High performance intra-node point-to-point communication for multi-GPU adapters/node (GPU-GPU, GPU-Host and Host-GPU)
•  Taking advantage of CUDA IPC (available since CUDA 4.1) in intra-node communication for multiple GPU adapters/node
•  Optimized and tuned collectives for GPU device buffers
•  MPI datatype support for point-to-point and collective communication from GPU device buffers
Performance of MVAPICH2-GPU with GPU-Direct RDMA (GDR)
[Figure: GPU-GPU internode latency, bandwidth and bi-bandwidth vs. message size for MV2-GDR 2.2b, MV2-GDR 2.0b and MV2 without GDR; ~2.18 us small-message latency and up to 11X bandwidth improvement over MV2 without GDR]
Test platform:
  MVAPICH2-GDR-2.2b
  Intel Ivy Bridge (E5-2680 v2) node - 20 cores
  NVIDIA Tesla K40c GPU
  Mellanox Connect-IB Dual-FDR HCA
  CUDA 7
  Mellanox OFED 2.4 with GPU-Direct-RDMA
Application-Level Evaluation (HOOMD-blue)
[Figure: average time steps per second (TPS) vs. number of processes for 64K and 256K particles; MV2+GDR achieves ~2X the TPS of MV2]
•  Platform: Wilkes (Intel Ivy Bridge + NVIDIA Tesla K20c + Mellanox Connect-IB)
•  HoomdBlue Version 1.0.5
•  GDRCOPY enabled: MV2_USE_CUDA=1 MV2_IBA_HCA=mlx5_0 MV2_IBA_EAGER_THRESHOLD=32768 MV2_VBUF_TOTAL_SIZE=32768 MV2_USE_GPUDIRECT_LOOPBACK_LIMIT=32768 MV2_USE_GPUDIRECT_GDRCOPY=1 MV2_USE_GPUDIRECT_GDRCOPY_LIMIT=16384
CUDA-Aware Non-Blocking Collectives
[Figure: medium/large message overlap (%) vs. message size on 64 GPU nodes for Ialltoall and Igather, with 1 process/node and 2 processes/node (1 process/GPU)]
Platform: Wilkes: Intel Ivy Bridge + NVIDIA Tesla K20c + Mellanox Connect-IB
Available since MVAPICH2-GDR 2.2a (an overlap sketch follows below)
A. Venkatesh, K. Hamidouche, H. Subramoni, and D. K. Panda, Offloaded GPU Collectives using CORE-Direct and CUDA Capabilities on IB Clusters, HiPC, 2015
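To make the measured overlap concrete, the sketch below posts an MPI_Ialltoall directly on device buffers and performs independent work before waiting; the per-peer count and the commented-out do_local_compute() placeholder are illustrative assumptions.

  /* Sketch: overlapping computation with a CUDA-aware non-blocking collective.
   * Device buffers are passed directly; do_local_compute() is a placeholder. */
  #include <mpi.h>
  #include <cuda_runtime.h>

  void overlapped_alltoall(int nprocs, MPI_Comm comm)
  {
      const int count = 64 * 1024;             /* elements per peer (illustrative) */
      float *d_send, *d_recv;
      MPI_Request req;

      cudaMalloc((void **)&d_send, (size_t)count * nprocs * sizeof(float));
      cudaMalloc((void **)&d_recv, (size_t)count * nprocs * sizeof(float));

      MPI_Ialltoall(d_send, count, MPI_FLOAT, d_recv, count, MPI_FLOAT, comm, &req);

      /* Independent work (e.g., a CUDA kernel on other data) overlaps with the
         collective, which progresses inside the library / offload engine. */
      /* do_local_compute(); */

      MPI_Wait(&req, MPI_STATUS_IGNORE);

      cudaFree(d_send);
      cudaFree(d_recv);
  }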
Communication Runtime with GPU Managed Memory
●  In CUDA 6.0, NVIDIA introduced CUDA Managed (or Unified) Memory, allowing a common memory allocation for GPU or CPU through the cudaMallocManaged() call
●  Significant productivity benefits due to abstraction of explicit allocation and cudaMemcpy()
●  Extended MVAPICH2 to perform communication directly from managed buffers (available in MVAPICH2-GDR 2.2b; a usage sketch follows below)
●  OSU Micro-benchmarks extended to evaluate the performance of point-to-point and collective communication using managed buffers
●  Available in OMB 5.2
D. S. Banerjee, K. Hamidouche, D. K. Panda, Designing High Performance Communication Runtime for GPU Managed Memory: Early Experiences, GPGPU-9 Workshop held in conjunction with PPoPP 2016, Barcelona, Spain
[Figure: latency (us) for host-host (H-H) vs. managed-host (MH-MH) and bandwidth (MB/s) for device-device (D-D) vs. managed-device (MD-MD) transfers vs. message size]
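As a usage illustration, the sketch below communicates directly from a managed allocation, assuming a CUDA-aware MPI library that accepts managed pointers (as MVAPICH2-GDR 2.2b does per the slide); the sizes, ranks and two-process layout are illustrative.

  /* Sketch: MPI communication directly from CUDA managed (unified) memory.
   * No explicit cudaMemcpy(); the same pointer is valid on host and device. */
  #include <mpi.h>
  #include <cuda_runtime.h>

  int main(int argc, char **argv)
  {
      const size_t n = 1 << 20;     /* 1M floats (illustrative) */
      int rank;
      float *buf;

      MPI_Init(&argc, &argv);
      MPI_Comm_rank(MPI_COMM_WORLD, &rank);

      cudaMallocManaged((void **)&buf, n * sizeof(float), cudaMemAttachGlobal);

      /* A GPU kernel could populate buf here; omitted for brevity. */

      if (rank == 0)
          MPI_Send(buf, (int)n, MPI_FLOAT, 1, 0, MPI_COMM_WORLD);
      else if (rank == 1)
          MPI_Recv(buf, (int)n, MPI_FLOAT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);

      cudaFree(buf);
      MPI_Finalize();
      return 0;
  }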
MPI Datatype Processing (Communication Optimization)
Common Scenario (waste of computing resources on CPU and GPU); a datatype example follows below:
  MPI_Isend(A, .. Datatype, …)
  MPI_Isend(B, .. Datatype, …)
  MPI_Isend(C, .. Datatype, …)
  MPI_Isend(D, .. Datatype, …)
  …
  MPI_Waitall(…);
(Buf1, Buf2, … contain non-contiguous MPI datatypes)
[Figure: timeline of existing vs. proposed design — in the existing design the CPU initiates the packing kernel, waits for the kernel (WFK) and then starts the send for each Isend in turn; the proposed design initiates the kernels for all Isends up front, overlapping kernels on streams with the sends, so the proposed design finishes earlier than the existing one]
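For a concrete non-contiguous case, the sketch below builds a strided column datatype for a halo region and sends it from a GPU buffer; the grid dimensions, neighbor rank and tag are illustrative assumptions, and the packing strategy itself lives inside the MPI library.

  /* Sketch: sending a non-contiguous (strided) GPU region with an MPI datatype.
   * d_grid is assumed to be a device pointer; a CUDA-aware MPI library
   * packs/unpacks the datatype, ideally overlapping packing kernels with sends. */
  #include <mpi.h>

  void send_halo_column(double *d_grid, int nx, int ny, int neighbor, MPI_Comm comm)
  {
      MPI_Datatype column;
      MPI_Request req;

      /* One column of an nx x ny row-major grid: ny blocks of 1 element, stride nx */
      MPI_Type_vector(ny, 1, nx, MPI_DOUBLE, &column);
      MPI_Type_commit(&column);

      MPI_Isend(d_grid, 1, column, neighbor, 0, comm, &req);
      MPI_Wait(&req, MPI_STATUS_IGNORE);

      MPI_Type_free(&column);
  }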
Application-Level Evaluation (HaloExchange - Cosmo)
[Figure: normalized execution time vs. number of GPUs for Default, Callback-based and Event-based designs on the Wilkes and CSCS GPU clusters]
•  2X improvement on 32 GPU nodes
•  30% improvement on 96 GPU nodes (8 GPUs/node)
C. Chu, K. Hamidouche, A. Venkatesh, D. Banerjee, H. Subramoni, and D. K. Panda, Exploiting Maximal Overlap for Non-Contiguous Data Movement Processing on Modern GPU-enabled Systems, IPDPS'16
Nature of Streaming Applications
•  Pipelined data-parallel compute phases that form the crux of streaming applications lend themselves to GPGPUs
•  Data distribution to GPGPU sites occurs over PCIe within the node and over InfiniBand interconnects across nodes
•  The broadcast operation is a key dictator of the throughput of streaming applications
•  The current broadcast operation on GPU clusters does not take advantage of
   •  IB hardware MCAST
   •  GPUDirect RDMA
Courtesy: Agarwalla, Bikash, et al. "Streamline: A scheduling heuristic for streaming applications on the grid." Electronic Imaging 2006
SGL-based Design for Efficient Broadcast Operation on GPU Systems
•  The current design is limited by the expensive copies from/to GPUs
•  Proposed several alternative designs to avoid the overhead of the copy
   •  Loopback, GDRCOPY and hybrid
   •  High performance and scalability
   •  Still uses PCIe resources for Host-GPU copies
•  Proposed SGL-based design (a usage sketch follows below)
   •  Combines IB MCAST and GPUDirect RDMA features
   •  High performance and scalability for D-D broadcast
   •  Direct code path between HCA and GPU
   •  Frees PCIe resources
   •  3X improvement in latency
A. Venkatesh, H. Subramoni, K. Hamidouche, and D. K. Panda, A High Performance Broadcast Design with Hardware Multicast and GPUDirect RDMA for Streaming Applications on InfiniBand Clusters, IEEE Int'l Conf. on High Performance Computing (HiPC '14)
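At the application level, such a streaming broadcast is just an MPI_Bcast on a device buffer; the sketch below shows that usage under the assumption of a CUDA-aware MPI library that may internally map the call onto IB hardware MCAST plus GPUDirect RDMA, as in the design above. The frame size, root rank and loop structure are illustrative.

  /* Sketch: broadcasting a data block to GPU memory on all nodes.
   * With a CUDA-aware MPI, the device pointer is passed directly and the
   * library may use IB MCAST + GPUDirect RDMA internally. */
  #include <mpi.h>
  #include <cuda_runtime.h>

  void broadcast_frame(unsigned char *d_frame, size_t bytes, int root, MPI_Comm comm)
  {
      /* Every rank (including the root) supplies its device buffer. */
      MPI_Bcast(d_frame, (int)bytes, MPI_BYTE, root, comm);
  }

  /* Illustrative streaming loop:
   *   for each incoming frame:
   *       if (rank == root) copy the frame into d_frame;
   *       broadcast_frame(d_frame, FRAME_BYTES, root, MPI_COMM_WORLD);
   *       launch the processing kernel on d_frame;
   */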
Accelerating Deep Learning with MVAPICH2-GDR
•  Caffe: a flexible and layered Deep Learning framework
•  Benefits and weaknesses
   –  Multi-GPU training within a single node
   –  Performance degradation for GPUs across different sockets
•  Can we enhance Caffe with MVAPICH2-GDR?
   –  Caffe-Enhanced: a CUDA-Aware MPI version
   –  Enables scale-up (within a node) and scale-out (across multi-GPU nodes)
   –  Initial evaluation suggests up to 8X reduction in training time on the CIFAR-10 dataset
Overview of A Few Challenges being Addressed by the MVAPICH2 Project for Exascale
•  Integrated Support for GPGPUs
•  Integrated Support for MICs
•  Virtualization (SR-IOV and Containers)
•  Energy-Awareness
•  Best Practice: Set of Tunings for Common Applications
MPI Applications on MIC Clusters
•  Flexibility in launching MPI jobs on clusters with Xeon Phi
[Figure: spectrum of execution modes from multi-core centric to many-core centric — Host-only (MPI program on the Xeon only), Offload / reverse offload (MPI program on the Xeon with offloaded computation on the Xeon Phi), Symmetric (MPI programs on both Xeon and Xeon Phi), Coprocessor-only (MPI program on the Xeon Phi only)]
MVAPICH2-MIC 2.0 Design for Clusters with IB and MIC
•  Offload Mode
•  Intranode Communication
   •  Coprocessor-only and Symmetric Mode
•  Internode Communication
   •  Coprocessor-only and Symmetric Mode
•  Multi-MIC Node Configurations
•  Running on three major systems
   •  Stampede, Blueridge (Virginia Tech) and Beacon (UTK)
MIC-Remote-MIC P2P Communication with Proxy-based Communication
[Figure: intra-socket and inter-socket MIC-remote-MIC point-to-point performance — large-message latency (usec) and bandwidth (MB/sec) vs. message size; peak bandwidths of about 5236 and 5594 MB/sec]
Optimized MPI Collectives for MIC Clusters (Allgather & Alltoall)
[Figure: 32-node Allgather small-message latency (16H + 16M), 32-node Allgather and Alltoall large-message latency (8H + 8M), and P3DFFT communication/computation time on 32 nodes (8H + 8M, size 2Kx2Kx1K) for MV2-MIC vs. MV2-MIC-Opt; improvements of roughly 76%, 58% and 55%]
A. Venkatesh, S. Potluri, R. Rajachandrasekar, M. Luo, K. Hamidouche and D. K. Panda, High Performance Alltoall and Allgather Designs for InfiniBand MIC Clusters; IPDPS'14, May 2014
Overview of A Few Challenges being Addressed by the MVAPICH2 Project for Exascale
•  Integrated Support for GPGPUs
•  Integrated Support for MICs
•  Virtualization (SR-IOV and Containers)
•  Energy-Awareness
•  Best Practice: Set of Tunings for Common Applications
Can HPC and Virtualization be Combined?
•  Virtualization has many benefits
   –  Fault-tolerance
   –  Job migration
   –  Compaction
•  It has not been very popular in HPC due to the overhead associated with virtualization
•  New SR-IOV (Single Root - IO Virtualization) support available with Mellanox InfiniBand adapters changes the field
•  Enhanced MVAPICH2 support for SR-IOV
•  MVAPICH2-Virt 2.1 (with and without OpenStack) is publicly available
•  How about Containers support?
J. Zhang, X. Lu, J. Jose, R. Shi and D. K. Panda, Can Inter-VM Shmem Benefit MPI Applications on SR-IOV based Virtualized InfiniBand Clusters? EuroPar'14
J. Zhang, X. Lu, J. Jose, M. Li, R. Shi and D. K. Panda, High Performance MPI Library over SR-IOV enabled InfiniBand Clusters, HiPC'14
J. Zhang, X. Lu, M. Arnold and D. K. Panda, MVAPICH2 Over OpenStack with SR-IOV: An Efficient Approach to Build HPC Clouds, CCGrid'15
Overview of MVAPICH2-Virt with SR-IOV and IVSHMEM
•  Redesign MVAPICH2 to make it virtual machine aware
   –  SR-IOV shows near-to-native performance for inter-node point-to-point communication
   –  IVSHMEM offers zero-copy access to data on shared memory of co-resident VMs
   –  Locality Detector: maintains the locality information of co-resident virtual machines
   –  Communication Coordinator: selects the communication channel (SR-IOV, IVSHMEM) adaptively
[Figure: host environment with two guest VMs — each guest's MPI process reaches the InfiniBand adapter through a VF driver and a virtual function (SR-IOV channel) and shares a /dev/shm-backed IV-SHM region (IV-Shmem channel); the hypervisor holds the PF driver for the physical function]
J. Zhang, X. Lu, J. Jose, R. Shi, D. K. Panda. Can Inter-VM Shmem Benefit MPI Applications on SR-IOV based Virtualized InfiniBand Clusters? Euro-Par, 2014.
J. Zhang, X. Lu, J. Jose, R. Shi, M. Li, D. K. Panda. High Performance MPI Library over SR-IOV Enabled InfiniBand Clusters. HiPC, 2014.
MVAPICH2-Virt with SR-IOV and IVSHMEM over OpenStack
•  OpenStack is one of the most popular open-source solutions to build clouds and manage virtual machines
•  Deployment with OpenStack
   –  Supporting SR-IOV configuration
   –  Supporting IVSHMEM configuration
   –  Virtual machine aware design of MVAPICH2 with SR-IOV
•  An efficient approach to build HPC Clouds with MVAPICH2-Virt and OpenStack
[Figure: OpenStack services around the VM — Nova (provisions), Glance (provides images), Neutron (provides network), Swift (stores images and backup volumes), Keystone (provides auth), Cinder (provides volumes), Heat (orchestrates cloud), Ceilometer (monitors), Horizon (provides UI)]
J. Zhang, X. Lu, M. Arnold, D. K. Panda. MVAPICH2 over OpenStack with SR-IOV: An Efficient Approach to Build HPC Clouds. CCGrid, 2015.
Application-Level Performance on Chameleon (SPEC MPI2007 and Graph500)
[Figure: execution time for SPEC MPI2007 applications (milc, leslie3d, pop2, GAPgeofem, zeusmp2, lu) and for Graph500 problem sizes (scale, edgefactor) with MV2-SR-IOV-Def, MV2-SR-IOV-Opt and MV2-Native]
•  32 VMs, 6 cores/VM
•  Compared to Native, 2-5% overhead for Graph500 with 128 procs
•  Compared to Native, 1-9.5% overhead for SPEC MPI2007 with 128 procs
NSF Chameleon Cloud: A Powerful and Flexible Experimental Instrument
•  Large-scale instrument
   –  Targeting Big Data, Big Compute, Big Instrument research
   –  ~650 nodes (~14,500 cores), 5 PB disk over two sites, 2 sites connected with 100G network
•  Reconfigurable instrument
   –  Bare metal reconfiguration, operated as single instrument, graduated approach for ease-of-use
•  Connected instrument
   –  Workload and Trace Archive
   –  Partnerships with production clouds: CERN, OSDC, Rackspace, Google, and others
   –  Partnerships with users
•  Complementary instrument
   –  Complementing GENI, Grid'5000, and other testbeds
•  Sustainable instrument
   –  Industry connections
http://www.chameleoncloud.org/
Containers Support: MVAPICH2 Intra-node Point-to-Point Performance on Chameleon
[Figure: intra-node inter-container latency (us) and bandwidth (MBps) vs. message size for Container-Def, Container-Opt and Native]
•  Intra-node inter-container
•  Compared to Container-Def, up to 81% and 191% improvement in latency and bandwidth
•  Compared to Native, minor overhead in latency and bandwidth
Containers Support: Application-Level Performance on Chameleon (NAS and Graph 500)
[Figure: NAS class D benchmarks (MG, FT, EP, LU, CG) execution time and Graph 500 execution time vs. problem size (scale, edgefactor) for Container-Def, Container-Opt and Native]
•  64 containers across 16 nodes, pinning 4 cores per container
•  Compared to Container-Def, up to 11% and 16% execution time reduction for NAS and Graph 500
•  Compared to Native, less than 9% and 4% overhead for NAS and Graph 500
•  Optimized container support will be available with the next release of MVAPICH2-Virt
Overview of A Few Challenges being Addressed by the MVAPICH2 Project for Exascale
•  Integrated Support for GPGPUs
•  Integrated Support for MICs
•  Virtualization (SR-IOV and Containers)
•  Energy-Awareness
•  Best Practice: Set of Tunings for Common Applications
Designing Energy-Aware (EA) MPI Runtime
[Figure: overall application energy expenditure split into energy spent in computation routines and in communication routines (point-to-point, collective and RMA routines); MVAPICH2-EA designs target MPI two-sided and collectives (ex: MVAPICH2) and MPI-3 RMA implementations (ex: MVAPICH2), with impact on one-sided runtimes (ex: ComEx) and other PGAS implementations (ex: OSHMPI)]
Energy-Aware MVAPICH2 & OSU Energy Management Tool (OEMT)
•  MVAPICH2-EA 2.1 (Energy-Aware)
   •  A white-box approach
   •  New energy-efficient communication protocols for pt-pt and collective operations
   •  Intelligently apply the appropriate energy saving techniques
   •  Application-oblivious energy saving
•  OEMT
   •  A library utility to measure energy consumption for MPI applications
   •  Works with all MPI runtimes
   •  PRELOAD option for precompiled applications
   •  Does not require ROOT permission:
      •  A safe kernel module to read only a subset of MSRs
MVAPICH2-EA: Application-Oblivious Energy-Aware MPI (EAM)
•  An energy-efficient runtime that provides energy savings without application knowledge
•  Uses the best energy lever automatically and transparently
•  Provides guarantees on maximum degradation, with 5-41% savings at <= 5% degradation
•  Pessimistic MPI applies the energy reduction lever to each MPI call
A. Venkatesh, A. Vishnu, K. Hamidouche, N. Tallent, D. K. Panda, D. Kerbyson, and A. Hoisie, A Case for Application-Oblivious Energy-Efficient MPI Runtime, Supercomputing '15, Nov 2015 [Best Student Paper Finalist]
MPI-3 RMA Energy Savings with Proxy-Applications (Graph500)
[Figure: Graph500 execution time (seconds) and energy usage (Joules) at 128, 256 and 512 processes for optimistic, pessimistic and EAM-RMA runtimes; up to 46% energy savings]
•  MPI_Win_fence dominates application execution time in Graph500 (a fence-epoch sketch follows below)
•  Between 128 and 512 processes, EAM-RMA yields between 31% and 46% savings with no degradation in execution time in comparison with the default optimistic MPI runtime
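For context, the RMA pattern these savings apply to is a fence-synchronized epoch like the sketch below; the window size, target rank and displacement are illustrative assumptions, and the energy levers are applied inside the MPI library rather than in application code.

  /* Sketch: fence-synchronized MPI-3 RMA epoch (the pattern that dominates
   * Graph500-style codes). The EAM-RMA runtime applies energy levers inside
   * MPI_Win_fence/MPI_Put; the application code is unchanged. */
  #include <mpi.h>

  void rma_epoch(int target, MPI_Comm comm)
  {
      const int n = 1024;                       /* illustrative window size */
      long *base;
      MPI_Win win;

      MPI_Win_allocate(n * sizeof(long), sizeof(long), MPI_INFO_NULL, comm, &base, &win);

      MPI_Win_fence(0, win);                    /* open access/exposure epoch */
      long update = 42;
      MPI_Put(&update, 1, MPI_LONG, target, /*disp=*/0, 1, MPI_LONG, win);
      MPI_Win_fence(0, win);                    /* close epoch; the Put is complete */

      MPI_Win_free(&win);
  }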
MPI-3 RMA Energy Savings with Proxy-Applications (SCF)
[Figure: SCF execution time (seconds) and energy usage (Joules) at 128, 256 and 512 processes for optimistic, pessimistic and EAM-RMA runtimes; up to 42% energy savings]
•  The SCF (self-consistent field) calculation spends nearly 75% of total time in the MPI_Win_unlock call
•  With 256 and 512 processes, EAM-RMA yields 42% and 36% savings at 11% degradation (close to the permitted degradation ρ = 10%)
•  128 processes is an exception due to 2-sided and 1-sided interaction
•  MPI-3 RMA energy-efficient support will be available in an upcoming MVAPICH2-EA release
Overview of A Few Challenges being Addressed by the MVAPICH2 Project for Exascale
•  Integrated Support for GPGPUs
•  Integrated Support for MICs
•  Virtualization (SR-IOV and Containers)
•  Energy-Awareness
•  Best Practice: Set of Tunings for Common Applications
Applications-Level Tuning: Compilation of Best Practices
•  The MPI runtime has many parameters
•  Tuning a set of parameters can help you extract higher performance
•  Compiled a list of such contributions through the MVAPICH website
   –  http://mvapich.cse.ohio-state.edu/best_practices/
•  Initial list of applications
   –  Amber
   –  HoomdBlue
   –  HPCG
   –  Lulesh
   –  MILC
   –  MiniAMR
   –  Neuron
   –  SMG2000
•  Soliciting additional contributions; send your results to mvapich-help at cse.ohio-state.edu and we will link these results with credits to you.
MVAPICH2 - Plans for Exascale
•  Performance and memory scalability toward 1M cores
•  Hybrid programming (MPI + OpenSHMEM, MPI + UPC, MPI + CAF, …)
•  Support for task-based parallelism (UPC++)*
•  Enhanced optimization for GPU support and accelerators
•  Taking advantage of advanced features of Mellanox InfiniBand
   •  On-Demand Paging (ODP)
   •  Switch-IB2 SHArP
   •  GID-based support
•  Enhanced inter-node and intra-node communication schemes for upcoming architectures
   •  OpenPower*
   •  OmniPath-PSM2*
   •  Knights Landing
•  Extended topology-aware collectives
•  Extended energy-aware designs and virtualization support
•  Extended support for MPI Tools Interface (as in MPI 3.0)
•  Extended Checkpoint-Restart and migration support with SCR
•  Support for * features will be available in MVAPICH2-2.2 RC1
Looking into the Future ….
•  Exascale systems will be constrained by
   –  Power
   –  Memory per core
   –  Data movement cost
   –  Faults
•  Programming models and runtimes for HPC need to be designed for
   –  Scalability
   –  Performance
   –  Fault-resilience
   –  Energy-awareness
   –  Programmability
   –  Productivity
•  Highlighted some of the issues and challenges
•  Need continuous innovation on all these fronts
Funding	Acknowledgments	
Funding	Support	by	
Equipment	Support	by
Personnel Acknowledgments
Current Students
–  A. Augustine (M.S.), A. Awan (Ph.D.), S. Chakraborthy (Ph.D.), C.-H. Chu (Ph.D.), N. Islam (Ph.D.), M. Li (Ph.D.), K. Kulkarni (M.S.), M. Rahman (Ph.D.), D. Shankar (Ph.D.), A. Venkatesh (Ph.D.), J. Zhang (Ph.D.)
Past Students
–  P. Balaji (Ph.D.), S. Bhagvat (M.S.), A. Bhat (M.S.), D. Buntinas (Ph.D.), L. Chai (Ph.D.), B. Chandrasekharan (M.S.), N. Dandapanthula (M.S.), V. Dhanraj (M.S.), T. Gangadharappa (M.S.), K. Gopalakrishnan (M.S.), W. Huang (Ph.D.), W. Jiang (M.S.), J. Jose (Ph.D.), S. Kini (M.S.), M. Koop (Ph.D.), R. Kumar (M.S.), S. Krishnamoorthy (M.S.), K. Kandalla (Ph.D.), P. Lai (M.S.), J. Liu (Ph.D.), M. Luo (Ph.D.), A. Mamidala (Ph.D.), G. Marsh (M.S.), V. Meshram (M.S.), A. Moody (M.S.), S. Naravula (Ph.D.), R. Noronha (Ph.D.), X. Ouyang (Ph.D.), S. Pai (M.S.), S. Potluri (Ph.D.), R. Rajachandrasekar (Ph.D.), G. Santhanaraman (Ph.D.), A. Singh (Ph.D.), J. Sridhar (M.S.), S. Sur (Ph.D.), H. Subramoni (Ph.D.), K. Vaidyanathan (Ph.D.), A. Vishnu (Ph.D.), J. Wu (Ph.D.), W. Yu (Ph.D.)
Past Research Scientist
–  S. Sur
Current Research Scientists
–  H. Subramoni, X. Lu
Current Senior Research Associate
–  K. Hamidouche
Current Post-Docs
–  J. Lin, D. Banerjee
Past Post-Docs
–  H. Wang, X. Besseron, H.-W. Jin, M. Luo, E. Mancini, S. Marcarelli, J. Vienne
Current Programmer
–  J. Perkins
Past Programmers
–  D. Bureddy
Current Research Specialist
–  M. Arnold
International Workshop on Communication Architectures at Extreme Scale (ExaComm)
ExaComm 2015 was held with the Int'l Supercomputing Conference (ISC '15), at Frankfurt, Germany, on Thursday, July 16th, 2015
One Keynote Talk: John M. Shalf, CTO, LBL/NERSC
Four Invited Talks: Dror Goldenberg (Mellanox); Martin Schulz (LLNL); Cyriel Minkenberg (IBM-Zurich); Arthur (Barney) Maccabe (ORNL)
Panel: Ron Brightwell (Sandia)
Two Research Papers
ExaComm 2016 will be held in conjunction with ISC '16
http://web.cse.ohio-state.edu/~subramon/ExaComm16/exacomm16.html
Technical Paper Submission Deadline: Friday, April 15, 2016
Thank You!
panda@cse.ohio-state.edu
Network-Based Computing Laboratory
http://nowlab.cse.ohio-state.edu/
The MVAPICH2 Project
http://mvapich.cse.ohio-state.edu/
The High-Performance Big Data Project
http://hibd.cse.ohio-state.edu/