Next Generation MPICH: What to
Expect – Lightweight communication
and much more!
Ken Raffenetti
Software Development Specialist
Argonne National Laboratory
Email: raffenet@mcs.anl.gov
Web: http://www.mcs.anl.gov/~raffenet
Outline
§ Current	MPICH
§ MPICH-3.3	and	beyond
– Lightweight	communication	overhead
– Memory	scalability
– Multi-threading
• MPICH	+	User-level	threads
– ROMIO	data	logging
– MPI	derived	datatypes
§ Summary
MPICH Today
§ MPICH is a high-performance and widely portable open-source implementation of MPI
§ It provides all features of MPI that have been defined so far (up to and including MPI-3.1)
§ Active development led by Argonne National Laboratory and the University of Illinois at Urbana-Champaign
– Several close collaborators who contribute features, bug fixes, testing for quality assurance, etc.
• IBM, Microsoft, Cray, Intel, Ohio State University, Queen's University, Mellanox, RIKEN AICS and others
§ Current	stable	release	is	MPICH-3.2
§ www.mpich.org
3
MPICH: Goals and Philosophy
§ MPICH aims to be the preferred MPI implementation on the top machines in the world
§ Our philosophy is to create an "MPICH Ecosystem"
[Ecosystem diagram: MPICH at the center, surrounded by derived MPI implementations (Intel MPI, IBM MPI, Cray MPI, Microsoft MPI, MVAPICH, Tianhe MPI, Lenovo MPI, Mellanox MPICH-MXM) and by software that runs on top of MPI (MPE, PETSc, MathWorks, HPCToolkit, TAU, TotalView, DDT, ADLB, ANSYS, GA-MPI, CAF-MPI, OpenSHMEM-MPI)]
4
MPICH-3.2
§ MPICH-3.2	is	the	latest	major	release	series	of	MPICH
§ Primary	focus	areas	for	mpich-3.2
– Support	for	MPI-3.1	functionality	(nonblocking collective	I/O	and	
others)
– Fortran	2008	bindings
– Support	for	the	Mellanox MXM	interface		(thanks	to	Mellanox)
– Support	for	the	Mellanox HCOLL	interface		(thanks	to	Mellanox)
– Support	for	the	LLC	interface	for	IB	and	Tofu		(thanks	to	RIKEN)
– Support	for	the	OFI	interface	(thanks	to	Intel)
– Improvements	to	MPICH/Portals	4
– MPI-4	Fault	Tolerance	(ULFM – experimental)
– Major	improvements	to	the	RMA	infrastructure
MPICH-3.2
[Architecture diagram: MPI on top of the ADI, implemented by the CH3 device; the Channel Interface selects the Nemesis channel (intranode shared memory), which drives the netmods through the Netmod Interface: TCP, MXM, LLC, Portals 4, and OFI (libfabric)]
OFI Netmod in CH3
§ All	of	MPI	over	fi_tagged
– Hardware	Send/Recv
– MPI	RMA	emulation	using	MPICH	packet	headers
– MPICH	control	messages
§ Where	to	improve?
– MPI	RMA	with	fi_rma
– Collectives	with	fi_trigger
– Would	require	major	infrastructure	changes	to	CH3
• Step	back	and	look	at	CH3	as	a	whole…
Outline
§ Current	MPICH
§ Next	Generation	MPICH
– Lightweight	communication	overhead
– Memory	scalability
– Multi-threading
• MPICH	+	User-level	threads
– ROMIO	data	logging
– MPI	derived	datatypes
§ Summary
CH3 Shortcomings
Non-scalable "Virtual Connections"
§ 480 bytes * 1 million procs = 480MB(!) of VCs per process
§ Connection-less networks are emerging
– VCs and their associated fields are overkill
Singular Shared Memory Support
§ Performant shared memory communication is centrally managed by Nemesis
§ Network library shared memory implementations are not well supported
– Inhibits collective offload
Active Message First Design
§ All communication involves a packet header + message payload
– Requires a non-contiguous memory access for all messages
§ A Send/Recv override exists, but was a somewhat clunky add-in
Function Pointers Not Optimized By Compiler
if (vc->comm_ops && vc->comm_ops->isend) {
    mpi_errno = vc->comm_ops->isend(vc, buf, count, ...);
    goto fn_exit;
}
Netmod API
• Passes down limited information and functionality to the network layer
• SendContig
• SendNoncontig
• iSendContig
• iStartContigMsg
• ...
Overheads
§ With MPI features baked into next-generation hardware, we anticipate network library overheads will drop dramatically.
§ Message rates will then be dominated by MPICH overheads
[Chart: per-message instruction counts (0 to 900) split into Application, MPICH, and Libfabric, comparing today's isend|ch3|dynamic path with a projected future implementation]
MPI on OFI
§ Point-to-point	data	movement
– Closely	maps	to	fi_tsend/trecv functionality
– How	can	MPICH	get	out	of	the	way?
MPI_Isend(buf, count, datatype, dest, tag, comm, &req)
fi_tsend(gl_data.endpoint, /* Local endpoint */
send_buffer, /* Packed or user */
data_sz, /* Size of the send */
gl_data.mr, /* Dynamic memory region */
to_addr(comm,dest), /* Destination fabric address */
match_bits(comm,tag), /* Match bits */
&req->ctx); /* Context */
Addressing CH3's shortcomings
High-Level API
• Give more control to lower layers
• netmod_send
• netmod_recv
• netmod_put
• netmod_get
• Fall back to Active Message based communication when necessary
• Operations not supported by the network
"Netmod Direct"
• Support two modes
• Multiple netmods
• Retains function pointers for flexibility
• Single netmod with inlining into the device layer
• No function pointer (see the sketch after this slide)
No Virtual Connection data structure
• Global address table (still O(p))
• Contains all process addresses
• Index into the global table by translating (rank+comm)
• VCs can still be defined at the lower layers
More configurable shared memory in CH4
§ Involve the network layer in the decision
– Support SHM-aware algorithms
§ One or more SHM transports (POSIX, XPMEM, CMA)
[Architecture diagrams: MPI over CH4 dispatching to netmods (OFI, UCX, Portals 4); and MPI over CH4/Netmod Direct calling the network library directly]
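To make the "Netmod Direct" mode concrete, here is a minimal sketch of compile-time netmod selection versus function-pointer dispatch. The names (netmod_funcs_t, nm_send, MPIDI_NM_send, netmod_ofi_direct.h) are illustrative assumptions, not the actual CH4 headers.

#include <stddef.h>

/* Mode 1: multiple netmods, dispatched through a per-netmod function table. */
typedef struct {
    int (*send)(const void *buf, size_t len, int rank, int tag);
    int (*recv)(void *buf, size_t len, int rank, int tag);
} netmod_funcs_t;

extern const netmod_funcs_t *active_netmod;   /* chosen at init time */

static inline int nm_send_dynamic(const void *buf, size_t len, int rank, int tag)
{
    return active_netmod->send(buf, len, rank, tag);   /* indirect call */
}

/* Mode 2: "Netmod Direct": a single netmod selected at build time. Its
 * implementation header is included directly, so the call inlines into the
 * device layer and the compiler sees straight-line code with no pointer. */
#ifdef NETMOD_DIRECT_OFI
#include "netmod_ofi_direct.h"   /* hypothetical header defining an inline MPIDI_NM_send */
#define nm_send(buf, len, rank, tag)  MPIDI_NM_send(buf, len, rank, tag)
#else
#define nm_send(buf, len, rank, tag)  nm_send_dynamic(buf, len, rank, tag)
#endif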
MPI_Isend total instruction counts by configuration:

                   ch3-netmod-  ch4-netmod-  ch4-netmod-  ch4-    ch4-shm-  ch4-thread-
                   dynamic      base         inline       static  disabled  single
application-pre    13           13           13           61      55        53
mpi-pre            202          133          110          0       0         0
mpi-post           32           34           24           0       0         0
application-post   3            3            3            22      19        17
Outline
§ Current	MPICH	Status
§ MPICH-3.3
– Lightweight	communication	overhead
– Memory	scalability
– Multi-threading
• MPICH	+	User-level	threads
– ROMIO	data	logging
– MPI	derived	datatypes
§ Summary
Overview of the New LPID/GPID Design
(Replacement for VC)
§ Compressing	VC	(480Bytes	->	12Bytes)
– Compressing	Multi-transport	Functionality
• Function	pointers	are	moved	to	a	separate	array
– Deprioritizing	Dynamic	Processes
• Process	group	information	moved	to	COMM
§ Regular/Irregular	Rank	Mapping	Models
– DIRECT/DIRECT_INTRA
– OFFSET/OFFSET_INTRA
– STRIDE/STRIDE_INTRA
– STRIDE_BLOCK/STRIDE_BLOCK_INTRA
– LUT/LUT_INTRA
– MLUT
§ Rank-Address Translation
– (comm, rank) -> (avtid, lpid)
All *_INTRA (avtid==0) models use MPIDI_CH4I_av_table0 to save one memory reference during translation
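A minimal sketch of the (comm, rank) -> lpid translation with the regular and irregular mapping models listed above; the structure and field names are illustrative, not the actual MPIDI_CH4 code.

/* Illustrative rank-mapping descriptor, not the real MPICH structure. */
typedef enum { MAP_DIRECT, MAP_OFFSET, MAP_STRIDE, MAP_LUT } map_mode_t;

typedef struct {
    map_mode_t mode;
    int        avtid;    /* address-vector table id (0 for *_INTRA) */
    int        offset;   /* used by MAP_OFFSET / MAP_STRIDE */
    int        stride;   /* used by MAP_STRIDE */
    const int *lut;      /* used by MAP_LUT (irregular mappings) */
} rankmap_t;

/* Translate a communicator-relative rank to a local process id (lpid),
 * i.e. an index into the global address table. */
static inline int rank_to_lpid(const rankmap_t *map, int rank)
{
    switch (map->mode) {
    case MAP_DIRECT: return rank;                         /* identity  */
    case MAP_OFFSET: return rank + map->offset;           /* shifted   */
    case MAP_STRIDE: return rank * map->stride + map->offset;
    case MAP_LUT:    return map->lut[rank];               /* irregular */
    }
    return -1;
}

The regular models need only a few integers per communicator; a per-rank lookup table is paid for only by genuinely irregular communicators, which is where the memory savings over per-rank VCs come from.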
Memory Usage Reduction
[Chart: memory usage in bytes (2K up to ~1000M) vs. number of processes (128 to 768K) for VC-VCRT and AV-Rankmap, each with 10 and 100 communicators created via MPI_Comm_split]
CH3 runs out of memory at 768K procs, 100 COMMs
L1D Cache Misses
§ Compressing the VC structure reduces cache misses during communication
§ The reduction in L1D cache misses compensates for the overhead of the additional instructions
[Chart: number of L1D cache misses per 32M lookups (log scale, 1 to 1M) by rank mapping model (DIRECT, OFFSET, STRIDE, LUT) for VC-VCRT, AV-Rankmap-switch, and AV-Rankmap-hybrid]
Outline
§ Current	MPICH	Status
§ MPICH-3.3
– Lightweight	communication	overhead
– Memory	scalability
– Multi-threading
• MPICH	+	User-level	threads
– ROMIO	data	logging
– MPI	derived	datatypes
§ Summary
Multithreaded MPI Work-Queue Model
§ Context
– Existing	lock-based	MPI	implementations	
unconditionally acquire	locks
– Nonblocking operations	may	block for	a	lock	
acquisition
• Not truly	nonblocking!
§ Consequences
– Nonblocking operations may be slowed by blocking ones from other threads
– Pipeline stalls: higher latencies, lower throughput, and less communication-computation overlap
§ Work-Queue	Model
– One	or	multiple	work-queues	per	endpoint
– Decouple blocking	and	nonblocking operations
– Nonblocking operations	enqueue work	descriptors	
and	leave	if	critical	section	held
– Threads	issue	work	on	behalf	of	other	threads	when	
acquiring	a	critical	section
– Nonblocking operations	are	truly	nonblocking
MPI_Send(...)
{
    CS_TRY_ENTER;
    if (!success) {
        CS_ENTER;
    }
    flush_workq();
    Wait_Progress();
    CS_EXIT;
}
[Diagram: one work-queue per communication context; nonblocking operations enqueue work descriptors, blocking operations dequeue them and issue to the hardware.
Chart: nonblocking Irecv issuing rate (requests/s) vs. number of threads per rank between two Haswell + Mellanox FDR nodes, comparing CLH and CLH-LPW locks]
Work-Queue Model Through 3 Steps
§ Step	1:	Single	Endpoint
– Current	MPICH
– Single	endpoint	per	MPI	process
– Worst	case	contention
§ Step	2:	Multiple	User-Transparent	
Endpoints
– Multiple	internal	endpoints	(BG/Q	
style)
– Transparent	to	the	user
– E.g.:	one	endpoint	per	comm,	per	
neighbor	process	(regular	apps)
§ Step	3:	Multiple	User-Visible	Endpoints
– MPI-4	Endpoints	proposal
– Multiple	endpoints	managed	by	
the	user
[Diagrams: Step 1: a single communication context behind one user endpoint per process. Step 2: one user endpoint fanned out over multiple internal contexts. Step 3: multiple user-visible endpoints, each with its own context, on top of the hardware]
Current Work-Queue Model Implementation
MPI_Put(void *org_buf, ...)
{
    CS_TRYENTER(&success);
    if (!success) {
        /* Enqueue my work */
        elem = {PUT, org_buf, ...};
        enqueue(&work_queue, elem);
    }
    else {
        /* Flush the work queue */
        while (!empty(work_queue)) {
            elem = dequeue(&work_queue);
            switch (elem.op) {
            case PUT:
                MPID_Put(elem.org_buf, ...);
                ...
            }
        }
        /* Issue my own op */
        MPID_Put(org_buf, ...);
    }
    if (success)
        CS_EXIT;
}
§ MPIR	Layer
– All	communication	devices	can	take	advantage
– Single	endpoint	(endpoints	have	not	been	exposed	yet)
§ Progress	semantics:
– Nonblocking calls:	flush	queue	if	lock	acquired
– Blocking	calls:
• Flush	work-queue	at	entry
• Flush		work-queue	within	the	progress	engine
§ Unlimited	work-queue
§ Locked	queue	implementation
§ Pthread mutex used	for	the	global	MPICH	lock
§ Work-queue: multiple implementations
– Mutex-locked queue
– Michael-Scott lock-free queue
– New multi-producer/single-consumer (MPSC) lock-free queues (sketched below)
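As a sketch of the last item, here is a minimal Vyukov-style multi-producer/single-consumer queue of the kind such a work-queue could use; the types and names are illustrative, not the MPICH implementation.

#include <stdatomic.h>
#include <stdlib.h>

typedef struct wq_elem {
    _Atomic(struct wq_elem *) next;
    int   op;      /* e.g. PUT, GET, ISEND */
    void *buf;     /* origin buffer of the deferred operation */
} wq_elem;

typedef struct {
    _Atomic(wq_elem *) tail;   /* producer end: swung with an atomic exchange */
    wq_elem *head;             /* consumer end: touched only under the CS     */
} wq_t;

static void wq_init(wq_t *q)
{
    wq_elem *dummy = calloc(1, sizeof(*dummy));  /* queue always holds one node */
    atomic_store(&q->tail, dummy);
    q->head = dummy;
}

/* Producer side: any thread, no lock needed. Takes ownership of 'e'. */
static void wq_enqueue(wq_t *q, wq_elem *e)
{
    atomic_store(&e->next, NULL);
    wq_elem *prev = atomic_exchange(&q->tail, e);  /* claim the slot   */
    atomic_store(&prev->next, e);                  /* publish the link */
}

/* Consumer side: only the thread holding the MPICH critical section.
 * Returns a node the caller now owns (and may free), or NULL if no work
 * is visible yet (possibly transiently, while a producer is mid-publish). */
static wq_elem *wq_dequeue(wq_t *q)
{
    wq_elem *head = q->head;
    wq_elem *next = atomic_load(&head->next);
    if (next == NULL)
        return NULL;
    head->op  = next->op;    /* move the payload into the outgoing node */
    head->buf = next->buf;
    q->head   = next;        /* 'next' stays behind as the new dummy    */
    return head;
}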
Data Transfer Rate with Threaded MPI RMA
§ Transfer	data	domain	between	two	
processes
§ Stencil-like	halo	exchange	(actual	
domain	exchange,	not	like	OSU	
benchmarks)
§ Each	thread	gets	a	subdomain
§ Transfer	unit	is	a	chunk
§ Passive	target	synchronization
– Master thread does Lock
– All threads Put chunks
– All threads do Flush every window_size
– Master thread does Unlock
[Diagram: a data domain split into per-thread subdomains; each subdomain is transferred in chunks from P0 to P1, one window at a time, using Put + Lock [+ Flush]]
Put + Lock with a Mutex Work-queue (CH3+MXM)
No Concurrent Waiting Threads
[Charts: transfer rate (chunks/s) vs. chunk size (1 B to 64 KB), original vs. work-queue, for Core 0 (single threaded), NUMA Node 0 (9 cores), Socket 0 (18 cores), and all 36 cores]
Put + Flush (w=64) + Lock with a Mutex Work-Queue
Concurrent Waiting Threads
[Charts: transfer rate (chunks/s) vs. chunk size (1 B to 64 KB), original vs. work-queue, for Core 0 (single threaded), NUMA Node 0 (9 cores), Socket 0 (18 cores), and all 36 cores]
Breakdown Analysis
[Charts: fraction of time spent in WQ_enq, WQ_deq, and Other vs. number of threads (2 to 36); one chart uses a log scale, the other a linear scale]
§ Put	+	Lock
– Queuing	work	is	the	major	
bottleneck!
– Currently debugging a faster lock-free queue
– Goal	~	0	overhead
§ Put	+	Flush	+	Lock
– Queuing/Dequeuing work	is	
negligible
– Bottleneck	somewhere	else
– Hypothesis:	all	threads	waiting	
for	completion	without	issuing	
(next	slide)
Point-to-Point Message Rate with a Mutex Work-Queue
[Charts: message rate (chunks/s) vs. chunk size (1 B to 32 KB), original vs. work-queue, for Core 0, NUMA Node 0 (9 cores), Socket 0 (18 cores), and all 36 cores]
Results with a Lock-Free Work-Queue
§ Put	+	Lock
– Michael	Scott’s	lock-free	queue	(MS-WorkQ)
– Linked	to	TCMalloc
– Still	significantly	below	single-threaded
– Working	on	faster	lock-free	queues
[Chart: transfer rate (chunks/s) vs. chunk size, NUMA Node 0 (9 cores), comparing Single-threaded, Mutex-WorkQ, MS-WorkQ, and Original]
Outline
§ Current	MPICH
§ MPICH-3.3	and	beyond
– Lightweight	communication	overhead
– Memory	scalability
– Multi-threading
• MPICH	+	User-level	threads
– ROMIO	data	logging
– MPI	derived	datatypes
§ Summary
Supporting User-level Threads in MPICH
(Argobots)
§ Motivation
– Traditional	MPI	implementations	are	only	aware	
of	kernel	threads
– Thread synchronization to ensure thread safety and MPI's progress requirements is costly
– Wasted	resources	if	a	kernel	thread	blocks	for	
MPI	communication
§ Argobots-aware	MPICH
– Supports	Argobots as	another	threading	model
– Lightweight	context	switching	to	overlap	costly	
blocking	operations
• Communication,	locks,	etc.
– Reduced	thread-synchronization	opportunities
• Guaranteed consistency within an ES without locks or memory barriers
[Diagram: MPI+Argobots execution model, with user-level threads (ULTs) multiplexed over execution streams (ESes) on top of MPI. Timeline: ULT1 does computation and starts an MPI send; context switch to ULT2; ULT1's communication proceeds in the background; context switch back to ULT1; ULT2's communication proceeds in the background]
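A minimal end-to-end sketch of the execution model above, assuming an MPICH build configured to use Argobots as its thread package (that build setup is an assumption here); the ABT_* calls are the standard Argobots API. Run with exactly two ranks.

#include <abt.h>
#include <mpi.h>

/* Each ULT posts a nonblocking exchange and waits. With an Argobots-aware
 * MPICH, the wait can context-switch to another ULT on the same ES instead
 * of blocking the whole execution stream. */
static void exchange_ult(void *arg)
{
    int tag = *(int *)arg;
    int rank, peer;
    double sbuf[256] = {0}, rbuf[256];
    MPI_Request reqs[2];

    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    peer = 1 - rank;   /* two-rank exchange */

    MPI_Isend(sbuf, 256, MPI_DOUBLE, peer, tag, MPI_COMM_WORLD, &reqs[0]);
    MPI_Irecv(rbuf, 256, MPI_DOUBLE, peer, tag, MPI_COMM_WORLD, &reqs[1]);
    MPI_Waitall(2, reqs, MPI_STATUSES_IGNORE);
}

int main(int argc, char **argv)
{
    int provided, tags[2] = {0, 1};
    ABT_xstream xstream;
    ABT_pool pool;
    ABT_thread ults[2];

    MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provided);
    ABT_init(argc, argv);

    ABT_xstream_self(&xstream);
    ABT_xstream_get_main_pools(xstream, 1, &pool);

    /* Two ULTs on the same execution stream; their waits overlap. */
    for (int i = 0; i < 2; i++)
        ABT_thread_create(pool, exchange_ult, &tags[i],
                          ABT_THREAD_ATTR_NULL, &ults[i]);
    for (int i = 0; i < 2; i++) {
        ABT_thread_join(ults[i]);
        ABT_thread_free(&ults[i]);
    }

    ABT_finalize();
    MPI_Finalize();
    return 0;
}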
Outline
§ Current	MPICH	Status
§ MPICH-3.3
– Lightweight	communication	overhead
– Memory	scalability
– Multi-threading
• MPICH	+	User-level	threads
– ROMIO	data	logging
– MPI	derived	datatypes
§ Summary
Wesley Bland
Senior Software Developer, Intel Corporation
Intel HPC Developers Conference
November 12, 2016
MPICH-OFI*
Open-source implementation based on MPICH
• Uses the new CH4 infrastructure
• Co-designed with MPICH community
• Targets existing and new fabrics via next-gen Open Fabrics Interface (OFI)
• Ethernet/sockets, Intel® Omni-Path, Cray Aries*, IBM BG/Q*, InfiniBand*
• Will be the default implementation available on the Aurora Supercomputer
at Argonne National Laboratory
33
MPICH-OFI* Developments in 2016
Multiple hackathons with Argonne and internal development work to expand
CH4 feature set:
• Capability sets
• Improved RMA support
• Reduce instruction overhead
• Support for improved internal concurrency
34
Capability Sets
Allows the user to select, at compile time, a set of OFI features so that lookups later in the execution can be optimized.
Optimized for the best performance for each OFI provider.
Can enable runtime configuration to make things more flexible, if desired.
35
PSM2 Capability Set (subset)
ENABLE_DATA ON
ENABLE_AV_TABLE ON
ENABLE_SCALABLE_ENDPOINTS OFF
Sockets Capability Set (subset)
ENABLE_DATA ON
ENABLE_AV_TABLE ON
ENABLE_SCALABLE_ENDPOINTS ON
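A minimal sketch of why compile-time capability sets help: a constant capability macro lets the compiler delete the untaken path. The macro name mirrors the table above, but the helper itself is an illustrative assumption, not the actual MPICH-OFI code.

#include <stdint.h>

/* Selected at configure/compile time by the chosen capability set,
 * e.g. ON for the PSM2 set shown above. */
#define ENABLE_AV_TABLE 1

typedef uint64_t fi_addr_t;   /* matches libfabric's typedef */

/* Translate a global rank index to a fabric address. Because the branch is
 * a compile-time constant, the compiler drops the untaken path entirely:
 * no runtime check of which address-vector mode is in use. */
static inline fi_addr_t rank_to_fi_addr(uint64_t idx, const fi_addr_t *av_map)
{
#if ENABLE_AV_TABLE
    return idx;          /* FI_AV_TABLE: the index is the address        */
#else
    return av_map[idx];  /* FI_AV_MAP: indirect through a stored mapping */
#endif
}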
Improved RMA Support
Map MPI functionality directly to OFI features as much as possible
• OFI has direct support for Put, Get, Accumulate, etc.
• This may reduce software overhead and utilize underlying communication
fabric better.
36
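In the same spirit as the earlier MPI_Isend to fi_tsend mapping, a hedged sketch of how an MPI_Put could map onto a single fi_write; the gl_data/to_addr helpers and the way the remote offset and key are obtained are illustrative assumptions, not the actual MPICH-OFI code.

MPI_Put(org_buf, org_count, org_datatype, target_rank, target_disp,
        target_count, target_datatype, win)

fi_write(gl_data.endpoint,                   /* Local endpoint             */
         org_buf,                            /* Origin buffer (contiguous) */
         data_sz,                            /* Bytes to write             */
         gl_data.mr_desc,                    /* Local memory descriptor    */
         to_addr(win_comm, target_rank),     /* Target fabric address      */
         win_base + target_disp * disp_unit, /* Remote offset/address      */
         win_mr_key,                         /* Remote memory region key   */
         &req->ctx);                         /* Context                    */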
Reduced Instruction Overhead
As low as 43 instructions from application to OFI with all optimizations on
• Reduce branching as much as possible
• Reduce memory footprint
[Chart: MPI_Send (OFI/CH4) software overhead; stacked instruction counts (App, MPI, CH4, and OFI, pre and post) for configurations ranging from All Dynamic through Ticket Locks, Inline CH4, Inline OFI, and No Locks to All Static, 0 to 250 instructions]
Support for improved internal concurrency
Parallel packing and unpacking using derived datatypes
§ The approach shares threads between MPI and OpenMP*
– MPI can steal application threads that are idle
– MPI creates tasks that application threads can execute when idle
§ MPI doesn’t create additional threads.
– No oversubscription.
§ This model maps well to other runtimes,
such as Intel® TBB or Cilk™ Plus
38
[Diagram: an MPI call sharing threads with the application; tasks created inside MPI execute on idle application threads]
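A minimal sketch of the task-based parallel packing described above, where tasks created inside an MPI call are executed by idle application threads; the function and its parameters are illustrative, not MPICH internals.

#include <string.h>
#include <stddef.h>

/* Pack 'count' blocks of 'blocklen' bytes, spaced 'stride' bytes apart,
 * into a contiguous buffer. Each chunk of iterations becomes an OpenMP
 * task that any idle thread in the enclosing parallel region can run;
 * no extra threads are created, so there is no oversubscription. */
void parallel_pack_strided(char *dst, const char *src,
                           size_t blocklen, size_t stride, size_t count)
{
    #pragma omp taskloop grainsize(64)
    for (size_t i = 0; i < count; i++)
        memcpy(dst + i * blocklen, src + i * stride, blocklen);
}

/* Typical use, from one thread of an existing application parallel region:
 *   #pragma omp parallel
 *   #pragma omp single
 *   parallel_pack_strided(packbuf, userbuf, 8, 64, n);
 */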
Collective Selection
Improved Shared Memory Support
Collective Selection
Current MPICH
• Static selection of algorithms based on message size and communicator size
at initialization
Proposal
• Introduce intelligent selection to determine optimal collective algorithm
• Could be picked from a static configuration, runtime selection, etc.
• Can use default algorithms or device/netmod/etc. specific algorithms
40
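A minimal sketch of what such a selection table could look like; the thresholds, types, and names are illustrative assumptions, not the MPICH tuning interface.

#include <stddef.h>

/* Illustrative rule table: pick a broadcast algorithm from the
 * communicator size and message size. */
typedef enum { BCAST_BINOMIAL, BCAST_SCATTER_ALLGATHER, BCAST_NETMOD_OFFLOAD } bcast_alg_t;

typedef struct {
    int         max_comm_size;  /* rule applies up to this many ranks (0 = no limit) */
    size_t      max_msg_size;   /* ... and up to this many bytes      (0 = no limit) */
    bcast_alg_t alg;
} bcast_rule_t;

/* Could come from a static table, a configuration file, or runtime tuning. */
static const bcast_rule_t bcast_rules[] = {
    {   64,  2048, BCAST_BINOMIAL          },
    { 4096, 65536, BCAST_SCATTER_ALLGATHER },
    {    0,     0, BCAST_NETMOD_OFFLOAD    },  /* fallback: device/netmod specific */
};

static bcast_alg_t select_bcast(int comm_size, size_t msg_size)
{
    for (size_t i = 0; i < sizeof bcast_rules / sizeof bcast_rules[0]; i++) {
        const bcast_rule_t *r = &bcast_rules[i];
        if ((r->max_comm_size == 0 || comm_size <= r->max_comm_size) &&
            (r->max_msg_size  == 0 || msg_size  <= r->max_msg_size))
            return r->alg;
    }
    return BCAST_BINOMIAL;  /* defensive default */
}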
Improved Shared Memory Support
- Support multiple shmmods (pronounced shmem-mods)
- Might implement a subset of the API and fall back to active messages for
default support
- Similar architecture to current netmod design
41
[Diagram: CH4 dispatching to multiple netmods (NM 1, NM 2) and shmmods (SHM 1, SHM 2), with an active message (AM) fallback]
Legal Disclaimer & Optimization Notice
INFORMATION IN THIS DOCUMENT IS PROVIDED “AS IS”. NO LICENSE, EXPRESS OR IMPLIED, BY ESTOPPEL OR OTHERWISE, TO
ANY INTELLECTUAL PROPERTY RIGHTS IS GRANTED BY THIS DOCUMENT. INTEL ASSUMES NO LIABILITY WHATSOEVER AND
INTEL DISCLAIMS ANY EXPRESS OR IMPLIED WARRANTY, RELATING TO THIS INFORMATION INCLUDING LIABILITY OR
WARRANTIES RELATING TO FITNESS FOR A PARTICULAR PURPOSE, MERCHANTABILITY, OR INFRINGEMENT OF ANY PATENT,
COPYRIGHT OR OTHER INTELLECTUAL PROPERTY RIGHT.
Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors.
Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software,
operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information
and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product
when combined with other products.
Copyright © 2016, Intel Corporation. All rights reserved. Intel, Pentium, Xeon, Xeon Phi, Core, VTune, Cilk, and the Intel logo are
trademarks of Intel Corporation in the U.S. and other countries.
Optimization Notice
Intel’s compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel
microprocessors. These optimizations include SSE2, SSE3, and SSSE3 instruction sets and other optimizations. Intel does not guarantee the
availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent
optimizations in this product are intended for use with Intel microprocessors. Certain optimizations not specific to Intel microarchitecture
are reserved for Intel microprocessors. Please refer to the applicable product User and Reference Guides for more information regarding the
specific instruction sets covered by this notice.
Notice revision #20110804
42
ROMIO data logging
§ WHO:	Rob	Latham	(ANL)
§ PROBLEM:	How	to	make	use	of	new	
layers	in	storage	hierarchy?
§ SOLUTION:	“ad_logfs”	maintains	a	log-
structured	record	of	all	write	activities
– Sits	below	MPI-IO	routines:	transparent	
save	for	‘logfs:’	prefix
– Maintains	one	set	of	files	per	MPI	process
• Metadata,	data,	and	global	state
– Can	replay	on	close,	explicit	sync,	or	upon	
first	read
§ Intent:	log	all	I/O	to	NVRAM	or	SSD,	defer	
replay	to	parallel	file	system
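A hedged usage sketch: since ad_logfs sits below the MPI-IO layer, the only application-visible change is the filename prefix. The "logfs:" prefix comes from the slide; the path and file layout are illustrative assumptions.

#include <mpi.h>

int main(int argc, char **argv)
{
    MPI_File fh;
    int buf[1024] = {0};

    MPI_Init(&argc, &argv);

    /* The only application-visible change: prepend "logfs:" so writes are
     * logged (e.g. to node-local NVRAM/SSD) and replayed to the parallel
     * file system later (on close, explicit sync, or first read). */
    MPI_File_open(MPI_COMM_WORLD, "logfs:/pfs/output.dat",
                  MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);
    MPI_File_write_at_all(fh, 0, buf, 1024, MPI_INT, MPI_STATUS_IGNORE);
    MPI_File_close(&fh);   /* replay of the logged writes may happen here */

    MPI_Finalize();
    return 0;
}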
Outline
§ Current	MPICH	Status
§ MPICH-3.3
– Lightweight	communication	overhead
– Memory	scalability
– Multi-threading
• MPICH	+	User-level	threads
– ROMIO	data	logging
– MPI	derived	datatypes
§ Summary
DAME: a new engine for derived datatypes
§ Who: Tarun Prabu,	Bill	Gropp (UIUC)
§ Why:	DAME	is	an	improved	engine	for	derived-datatypes
– The Dataloop code (today's type processing) is effective, but requires many function calls (the "piece
functions") for each "leaf type"
– Piece	Functions	(function	pointers)	are	difficult	for	most	(all?)	compilers	to	inline,	even	with	things	
like	link-time	optimizations
§ What:	DAME	implements	a	new	description	of	the	MPI	datatype,	then	transforms	that	
description	into	efficient	memory	operations
§ Design	Principles:
– Low	processing	overhead
– Maximize	ability	of	compiler	to	optimize	code
– Simplify	partial	packing
– Enable	memory	access	optimizations
§ Optimizations:
– Memory	access	optimizations	can	be	done	by	shuffling	primitives	as	desired.	This	is	done	at	
“commit”	time.
– Other	optimizations	such	as	normalization	(e.g.	an	indexed	with	identical	stride	between	elements),	
displacement	sorting	and	merging	can	also	easily	be	performed	at	commit-time.
§ More	information	at:	https://wiki.mpich.org/mpich/index.php/DAME
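For context, a short example of the kind of derived datatype such an engine has to traverse; this is plain MPI, not DAME internals.

#include <mpi.h>

/* Pack every other double from a 2*count array: a single vector "leaf type"
 * that the datatype engine turns into strided memory copies. */
void pack_strided(double *src, int count, void *outbuf,
                  int outsize, MPI_Comm comm)
{
    MPI_Datatype vec;
    int position = 0;

    MPI_Type_vector(count, 1, 2, MPI_DOUBLE, &vec);  /* count blocks, stride 2 */
    MPI_Type_commit(&vec);   /* commit time is where DAME applies its
                              * normalization/sorting/merging optimizations */
    MPI_Pack(src, 1, vec, outbuf, outsize, &position, comm);
    MPI_Type_free(&vec);
}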
Outline
§ Current	MPICH
§ MPICH-3.3	and	beyond
– Lightweight	communication	overhead
– Memory	scalability
– Multi-threading
• MPICH	+	User-level	threads
– ROMIO	data	logging
– MPI	derived	datatypes
§ Summary
MPICH-3.3 Next Major Release
§ The	CH4	device
– Replacement	for	CH3
• CH3	still	supported	and	maintained	for	the	time-being
– Primary	objectives
• Lightweight	communication	overhead
– Ability	to	support	high-level	network	APIs	(OFI,	UCX,	Portals	4)
– E.g.,	tag-matching	in	hardware,	direct	PUT/GET	communication
– Low	memory	footprint
• Support	for	very	high	thread	concurrency
– Improvements	to	message	rates	in	highly	threaded	environments	
(MPI_THREAD_MULTIPLE)
– Support	for	multiple	network	endpoints	(THREAD_MULTIPLE	or	not)
MPICH-3.3 Timeline
§ CH4	code	in	main	MPICH	repo	(recently	moved	to	GitHub)	
http://github.com/pmodels/mpich
– Some	work-in-progress	features	in	Pull	Request	branches
§ MPICH-3.3a2	release	out	this	week
– Subsequent	pre-releases	as	the	code	is	stabilized,	features	added
§ GA	Release	mid-2017
Thank you
§ Questions?