How to make MPI Awesome:
MPI Sessions
Follow-on to Jeff’s crazy thoughts discussed in Bordeaux
Random group of people who have been talking about this stuff:
Wesley Bland, Ryan Grant, Dan Holmes, Kathryn Mohror,
Martin Schulz, Anthony Skjellum, Jeff Squyres
What we want
• Any thread (e.g., library) can use MPI any time it wants
• But still be able to totally clean up MPI if/when desired
• New parameters to initialize the MPI API
MPI Process
// Library 1
MPI_Init(…);
// Library 2
MPI_Init(…);
// Library 3
MPI_Init(…);
// Library 4
MPI_Init(…);
// Library 5
MPI_Init(…);
// Library 6
MPI_Init(…);
// Library 7
MPI_Init(…);
// Library 8
MPI_Init(…);
// Library 9
MPI_Init(…);
// Library 10
MPI_Init(…);
// Library 11
MPI_Init(…);
// Library 12
MPI_Init(…);
Before MPI-3.1, this could be erroneous
int my_thread1_main(void *context) {
int flag;
MPI_Initialized(&flag);
// …
}
int my_thread2_main(void *context) {
int flag;
MPI_Initialized(&flag);
// …
}
int main(int argc, char **argv) {
MPI_Init_thread(…, MPI_THREAD_FUNNELED, …);
pthread_create(…, my_thread1_main, NULL);
pthread_create(…, my_thread2_main, NULL);
// …
}
These might
run at the same time (!)
The MPI-3.1 solution
• MPI_INITIALIZED (and friends) are allowed to
be called at any time
– …even by multiple threads
– …regardless of MPI_THREAD_* level
• This is a simple, easy-to-explain solution
– And probably what most applications do, anyway 
• But many other paths were investigated
MPI-3.1 MPI_INIT / FINALIZE limitations
• Cannot init MPI from different entities within a process without
a priori knowledge / coordination
– I.e.: MPI-3.1 (intentionally) still did not solve the underlying problem 
MPI Process
// Library 1 (thread)
MPI_Initialized(&flag);
if (!flag) MPI_Init(…);
// Library 2 (thread)
MPI_Initialized(&flag);
if (!flag) MPI_Init(…);
THIS IS INSUFFICIENT /
POTENTIALLY ERRONEOUS
(More of) What we want
• Fix MPI-3.1 limitations:
– Cannot init MPI from different entities within a
process without a priori knowledge / coordination
– Cannot initialize MPI more than once
– Cannot set error behavior of MPI initialization
– Cannot re-initialize MPI after it has been finalized
All these things overlap
Still be able to
finalize MPI
Any thread can
use MPI any time
Re-initialize MPI
Affect MPI
initialization error
behavior
How do we get those things?
KEEP
CALM
AND
LISTEN TO
THE ENTIRE
PROPOSAL
New concept: “session”
• A local handle to the MPI library
– Implementation intent: lightweight / uses very few
resources
– Can also cache some local state
• Can have multiple sessions in an MPI process
– MPI_Session_init(…, &session);
– MPI_Session_finalize(…, &session);
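As a sketch of the intended usage, two independent libraries in the same process might each hold their own session. (Illustrative only: the MPI_Session_* calls are proposed API, not MPI-3.1, so this does not compile against any current MPI; the errhandler choices anticipate the example two slides down.)

```c
/* Illustrative only: MPI_Session_* is proposed API, not MPI-3.1. */
void ocean_lib_init(void) {
    MPI_Session ocean_session;
    /* Info reserved for future use; errors are returned to the caller */
    MPI_Session_init(MPI_INFO_NULL, MPI_ERRORS_RETURN,
                     &ocean_session);
    /* ... later: MPI_Session_finalize(&ocean_session); ... */
}

void atmosphere_lib_init(void) {
    MPI_Session atmos_session;  /* fully independent of the ocean's */
    MPI_Session_init(MPI_INFO_NULL, MPI_ERRORS_ARE_FATAL,
                     &atmos_session);
}
```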
MPI Session
MPI Process
ocean library
MPI_SESSION_INIT(…)
atmosphere library
MPI_SESSION_INIT(…)
MPI library
MPI Session
MPI Process
ocean library atmosphere library
MPI library
ocean session
atmosphere session
Unique handles to the underlying MPI library
Initialize / finalize a session
• MPI_Session_init(
– IN MPI_Info info,
– IN MPI_Errhandler errhandler,
– OUT MPI_Session *session)
• MPI_Session_finalize(
– INOUT MPI_Session *session)
• Parameters described in next slides…
Session init params
• Info: For future expansion
• Errhandler: to be invoked if
MPI_SESSION_INIT errors
– Likely need a new type of errhandler
• …or a generic errhandler
• The FT working group is discussing exactly this topic
MPI Session
MPI Process
ocean library atmosphere library
MPI library
ocean session: errors return
atmosphere session: errors abort
Unique errhandlers, info, local state, etc.
Great. I have a session.
Now what?
Fair warning
• The MPI runtime has long been treated as a bastard stepchild
– Barely acknowledged in
the standard
– Mainly in the form of
non-normative
suggestions
• It’s time to change that
Overview
• General scheme:
– Query the underlying
run-time system
• Get a “set” of processes
– Determine the processes
you want
• Create an MPI_Group
– Create a communicator
with just those processes
• Create an MPI_Comm
MPI_Session → query runtime for set of processes → MPI_Group → MPI_Comm
Runtime concepts
• Expose 2 concepts to MPI from the runtime:
1. Static sets of processes
2. Each set caches (key,value) string tuples
These slides only discuss static sets
(unchanged for the life of the process).
However, there are several useful scenarios that
involve dynamic membership of sets over time.
More discussion needs to occur for these scenarios.
For the purposes of these slides,
just consider static sets.
Static sets of processes
• Sets are identified by string name
• Two sets are mandated
– “mpi://WORLD”
– “mpi://SELF”
• Other sets can be defined by the system:
– “location://rack/19”
– “network://leaf-switch/37”
– “arch://x86_64”
– “job://12942”
– … etc.
• Processes can be in more than one set
These names are implementation-dependent
Examples of sets
MPI process 0 MPI process 1 MPI process 2 MPI process 3
mpi://WORLD
Examples of sets
MPI process 0 MPI process 1 MPI process 2 MPI process 3
mpi://WORLD
arch://x86_64
Examples of sets
MPI process 0 MPI process 1 MPI process 2 MPI process 3
mpi://WORLD
job://12942
arch://x86_64
Examples of sets
MPI process 0 MPI process 1 MPI process 2 MPI process 3
mpi://SELF mpi://SELF mpi://SELF mpi://SELF
Examples of sets
MPI process 0 MPI process 1 MPI process 2 MPI process 3
location://rack/self location://rack/self
location://rack/17 location://rack/23
Examples of sets
MPI process 0 MPI process 1 MPI process 2 MPI process 3
user://ocean user://atmosphere
mpiexec \
--np 2 --set user://ocean ocean.exe : \
--np 2 --set user://atmosphere atmosphere.exe
Querying the run-time
• MPI_Session_get_names(
– IN MPI_Session session,
– OUT char **set_names)
• Returns argv-style list of
0-terminated names
– Must be freed by caller
Example list of set names returned
mpi://WORLD
mpi://SELF
arch://x86_64
location://rack/17
job://12942
user://ocean
Values in sets
• Each set has an associated MPI_Info object
• One mandated key in each info:
– “size”: number of processes in this set
• Runtime may also provide other keys
– Implementation-dependent
Querying the run-time
• MPI_Session_get_info(
– IN MPI_Session session,
– IN const char *set_name,
– OUT MPI_Info *info)
• Use existing MPI_Info functions to retrieve
(key,value) tuples
Example
MPI_Info info;
MPI_Session_get_info(session, "mpi://WORLD", &info);
char size_str[MPI_MAX_INFO_VAL];
MPI_Info_get(info, "size", …, size_str, …);
int size = atoi(size_str);
Ummmm… great.
What’s the point of that?
Make MPI_Groups!
• MPI_Group_create_from_session(
– IN MPI_Session session,
– IN const char *set_name,
– OUT MPI_Group *group);
Advice to implementors:
This MPI_Group can still be a
lightweight object (even if there are
a large number of processes in it)
Example
// Make a group of procs from "location://rack/self"
MPI_Create_group_from_session_name(
session, "location://rack/self", &group);
// Use just the even procs
MPI_Group_size(group, &size);
int ranges[1][3];
ranges[0][0] = 0;        // first rank
ranges[0][1] = size - 1; // last rank (inclusive; must be a valid rank)
ranges[0][2] = 2;        // stride: every other rank
MPI_Group_range_incl(group, 1, ranges, &group_of_evens);
Make a communicator from that group
• MPI_Create_comm_from_group(
– IN MPI_Group group,
– IN const char *tag, // for matching (see next slide)
– IN MPI_Info info,
– IN MPI_Errhandler errhandler,
– OUT MPI_Comm *comm)
Note: this is different than
the existing function
MPI_Comm_create_group(
oldcomm, group, (int) tag,
&newcomm)
Might need a better name
for this new function…?
String tag is used to match concurrent
creations by different entities
MPI Process
ocean library
atmosphere
library
MPI Process
ocean library
atmosphere
library
MPI Process
ocean library
atmosphere
library
MPI_Create_comm_from_group(…, tag = "gov.anl.ocean", …)
MPI_Create_comm_from_group(…, tag = "gov.llnl.atmosphere", …)
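In code, each library would pass its own tag so the concurrent creations cannot cross-match. (Illustrative only: this is the proposed API, not MPI-3.1; the group handles are assumed to have been created from each library's session as on the previous slides.)

```c
/* Illustrative only: proposed API. Reverse-DNS-style tags keep the
   two concurrent communicator creations from matching each other. */
MPI_Comm ocean_comm, atmos_comm;
MPI_Create_comm_from_group(ocean_group, "gov.anl.ocean",
                           MPI_INFO_NULL, MPI_ERRORS_RETURN,
                           &ocean_comm);
MPI_Create_comm_from_group(atmos_group, "gov.llnl.atmosphere",
                           MPI_INFO_NULL, MPI_ERRORS_RETURN,
                           &atmos_comm);
```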
Make any kind of communicator
• MPI_Create_cart_comm_from_group(
– IN MPI_Group group,
– IN const char *tag,
– IN MPI_Info info,
– IN MPI_Errhandler errhandler,
– IN int ndims,
– IN const int dims[],
– IN const int periods[],
– IN int reorder,
– OUT MPI_Comm *comm)
Make any kind of communicator
• MPI_Create_graph_comm_from_group(…)
• MPI_Create_dist_graph_comm_from_group(…)
• MPI_Create_dist_graph_adjacent_comm_from_group(…)
Run-time static sets across different
sessions in the same process
• Making communicators from the same static
set will always result in the same local rank
– Even if created from different sessions
See example in the
next slide…
Run-time static sets across different
sessions in the same process
// Session, group, and communicator 1
MPI_Create_group_from_session_name(session_1,
"mpi://WORLD", &group1);
MPI_Create_comm_from_group(group1, "ocean", …, &comm1);
MPI_Comm_rank(comm1, &rank1);
// Session, group, and communicator 2
MPI_Create_group_from_session_name(session_2,
"mpi://WORLD", &group2);
MPI_Create_comm_from_group(group2, "atmosphere", …,
&comm2);
MPI_Comm_rank(comm2, &rank2);
// Ranks are guaranteed to be the same
assert(rank1 == rank2);
Law of Least
Astonishment
Mixing requests from different
sessions: disallowed
// Session, group, and communicator 1
MPI_Create_group_from_session_name(session_1,
"mpi://WORLD", &group1);
MPI_Create_comm_from_group(group1, "ocean", …, &comm1);
MPI_Isend(…, &req[0]);
// Session, group, and communicator 2
MPI_Create_group_from_session_name(session_2,
"mpi://WORLD", &group2);
MPI_Create_comm_from_group(group2, "atmosphere", …,
&comm2);
MPI_Isend(…, &req[1]);
// Mixing requests from different
// sessions is disallowed
MPI_Waitall(2, req, …);
Rationale: this is difficult to
optimize, particularly if a session
maps to hardware resources
MPI_Session_finalize
• Analogous to MPI_FINALIZE
– Can block waiting for the destruction of the
objects derived from that session
• Communicators, Windows, Files, … etc.
– Each session that is initialized must be finalized
Well, that all sounds great.
…but who calls MPI_INIT?
And what session does
MPI_COMM_WORLD /
MPI_COMM_SELF belong to?
New concept: no longer require
MPI_INIT / MPI_FINALIZE
• WHAT?!
• When will MPI initialize itself?
• How will MPI finalize itself?
– It is still (very) desirable to allow MPI to clean
itself up so that MPI processes can be “valgrind
clean” when they exit
Split MPI APIs into two sets
Performance doesn’t
matter (as much)
• Functions that create / query /
destroy:
– MPI_Comm
– MPI_File
– MPI_Win
– MPI_Info
– MPI_Op
– MPI_Errhandler
– MPI_Datatype
– MPI_Group
– MPI_Session
– Attributes
– Processes
• MPI_T
Performance
absolutely matters
• Point to point
• Collectives
• I/O
• RMA
• Test/Wait
• Handle language xfer
Ensure that MPI is
initialized (and/or
finalized) by these
functions
These functions still can’t
be used unless MPI is
initialized
These functions init / finalize
MPI transparently
These functions can’t be called
without a handle created from
the left-hand column
MPI_COMM_WORLD and
MPI_COMM_SELF are notable
exceptions.
…I’ll address this shortly.
Example
int main() {
// Create a datatype – initializes MPI
MPI_Type_contiguous(2, MPI_INT, &mytype);
The creation of the first user-defined MPI object initializes MPI
Initialization can be a local action!
Example
int main() {
// Create a datatype – initializes MPI
MPI_Type_contiguous(2, MPI_INT, &mytype);
// Free the datatype – finalizes MPI
MPI_Type_free(&mytype);
// Valgrind clean
return 0;
}
The destruction of the last user-defined MPI object finalizes / cleans up MPI. This is guaranteed.
There are some
corner cases
described on the
following slides.
Example
int main() {
// Create a datatype – initializes MPI
MPI_Type_contiguous(2, MPI_INT, &mytype);
// Free the datatype – finalizes MPI
MPI_Type_free(&mytype);
// Re-initialize MPI!
MPI_Type_dup(MPI_INT, &mytype);
We can also re-initialize MPI!
(it’s transparent to the user – so why not?)
Example
int main() {
// Create a datatype – initializes MPI
MPI_Type_contiguous(2, MPI_INT, &mytype);
// Free the datatype – finalizes MPI
MPI_Type_free(&mytype);
// Re-initialize MPI!
MPI_Type_dup(MPI_INT, &mytype);
return 0;
}
(Sometimes) Not an error to exit
the process with MPI still initialized
The overall theme
• Just use MPI functions whenever you want
– MPI will initialize as it needs to
– Initialization essentially becomes an
implementation detail
• Finalization will occur whenever all user-
defined handles are destroyed
Wait a minute –
What about MPI_COMM_WORLD?
int main() {
// Can’t I do this?
MPI_Send(…, MPI_COMM_WORLD);
This would be calling a
“performance matters”
function before a
“performance doesn’t
matter” function
I.e., MPI has not been initialized yet
Wait a minute –
What about MPI_COMM_WORLD?
int main() {
// This is valid
MPI_Init(NULL, NULL);
MPI_Send(…, MPI_COMM_WORLD);
Re-define MPI_INIT and MPI_FINALIZE:
constructor and destructor for
MPI_COMM_WORLD and MPI_COMM_SELF
INIT and FINALIZE
int main() {
MPI_Init(NULL, NULL);
MPI_Send(…, MPI_COMM_WORLD);
MPI_Finalize();
}
INIT and FINALIZE continue to exist for two reasons:
1. Backwards compatibility
2. Convenience
So let’s keep them as close to MPI-3.1 as possible:
• If you call INIT, you have to call FINALIZE
• You can only call INIT / FINALIZE once
• INITIALIZED / FINALIZED only refer to INIT / FINALIZE (not sessions)
If you want different behavior, use sessions
INIT and FINALIZE
• INIT/FINALIZE create an implicit session
– You cannot extract an MPI_Session handle for the
implicit session created by MPI_INIT[_THREAD]
• Yes, you can use INIT/FINALIZE in the same
MPI process as other sessions
Backwards compatibility:
INITIALIZED and FINALIZED behavior
int main() {
MPI_Initialized(&flag); assert(flag == false);
MPI_Finalized(&flag); assert(flag == false);
MPI_Session_create(…, &session1);
MPI_Initialized(&flag); assert(flag == false);
MPI_Finalized(&flag); assert(flag == false);
MPI_Init(NULL, NULL);
MPI_Initialized(&flag); assert(flag == true);
MPI_Finalized(&flag); assert(flag == false);
MPI_Session_free(…, &session1);
MPI_Initialized(&flag); assert(flag == true);
MPI_Finalized(&flag); assert(flag == false);
MPI_Session_create(…, &session2);
MPI_Initialized(&flag); assert(flag == true);
MPI_Finalized(&flag); assert(flag == false);
MPI_Finalize();
MPI_Initialized(&flag); assert(flag == true);
MPI_Finalized(&flag); assert(flag == true);
MPI_Session_free(…, &session2);
MPI_Initialized(&flag); assert(flag == true);
MPI_Finalized(&flag); assert(flag == true);
}
Short version:
INITIALIZED,
FINALIZED,
IS_THREAD_MAIN
all still refer to
INIT / FINALIZE
FIN
(for the main part of the proposal)
Items that still need more
discussion
Issues that still need more discussion
• Dynamic runtime sets
– Temporal
– Membership
• Covered in other proposals:
– Thread concurrent vs. non-concurrent
– Generic error handlers
Issues that still need more discussion
• If COMM_WORLD|SELF are not available by
default:
– Do we need new destruction hooks to replace SELF
attribute callbacks on FINALIZE?
– What is the default error handler behavior for
functions without comm/file/win?
• Do we need syntactic sugar to get a comm from
mpi://WORLD?
• How do tools hook into MPI initialization and
finalization?
Session queries
• Query session handle equality
– MPI_Session_query(handle1, handle1_type,
handle2, handle2_type, bool *are_they_equal)
– Not 100% sure we need this…?
Session thread support
• Associate thread level support with sessions
• Three options:
1. Similar to MPI-3.1: “first” initialization picks
thread level
2. Let each session pick its own thread level (via
info key in SESSION_CREATE)
3. Just make MPI always be THREAD_MULTIPLE

MPI Sessions: a proposal to the MPI Forum

  • 1. How to make MPI Awesome: MPI Sessions Follow-on to Jeff’s crazy thoughts discussed in Bordeaux Random group of people who have been talking about this stuff: Wesley Bland, Ryan Grant, Dan Holmes, Kathryn Mohror, Martin Schulz, Anthony Skjellum, Jeff Squyres
  • 2. What we want • Any thread (e.g., library) can use MPI any time it wants • But still be able to totally clean up MPI if/when desired • New parameters to initialize the MPI API MPI Process // Library 1 MPI_Init(…); // Library 2 MPI_Init(…); // Library 3 MPI_Init(…); // Library 4 MPI_Init(…); // Library 5 MPI_Init(…); // Library 6 MPI_Init(…);// Library 7 MPI_Init(…); // Library 8 MPI_Init(…); // Library 9 MPI_Init(…); // Library 10 MPI_Init(…); // Library 11 MPI_Init(…); // Library 12 MPI_Init(…);
  • 3. Before MPI-3.1, this could be erroneous int my_thread1_main(void *context) { MPI_Initialized(&flag); // … } int my_thread2_main(void *context) { MPI_Initialized(&flag); // … } int main(int argc, char **argv) { MPI_Init_thread(…, MPI_THREAD_FUNNELED, …); pthread_create(…, my_thread1_main, NULL); pthread_create(…, my_thread2_main, NULL); // … } These might run at the same time (!)
  • 4. The MPI-3.1 solution • MPI_INITIALIZED (and friends) are allowed to be called at any time – …even by multiple threads – …regardless of MPI_THREAD_* level • This is a simple, easy-to-explain solution – And probably what most applications do, anyway  • But many other paths were investigated
  • 5. MPI-3.1 MPI_INIT / FINALIZE limitations • Cannot init MPI from different entities within a process without a priori knowledge / coordination – I.e.: MPI-3.1 (intentionally) still did not solve the underlying problem  MPI Process // Library 1 (thread) MPI_Initialized(&flag); if (!flag) MPI_Init(…); // Library 2 (thread) MPI_Initialized(&flag); if (!flag) MPI_Init(…); THIS IS INSUFFICIENT / POTENTIALLY ERRONEOUS
  • 6. (More of) What we want • Fix MPI-3.1 limitations: – Cannot init MPI from different entities within a process without a priori knowledge / coordination – Cannot initialize MPI more than once – Cannot set error behavior of MPI initialization – Cannot re-initialize MPI after it has been finalized
  • 7. All these things overlap Still be able to finalize MPI Any thread can use MPI any time Re-initialize MPI Affect MPI initialization error behavior
  • 8. How do we get those things?
  • 10. New concept: “session” • A local handle to the MPI library – Implementation intent: lightweight / uses very few resources – Can also cache some local state • Can have multiple sessions in an MPI process – MPI_Session_init(…, &session); – MPI_Session_finalize(…, &session);
  • 11. MPI Session MPI Process ocean library MPI_SESSION_INIT(…) atmosphere library MPI_SESSION_INIT(…) MPI library
  • 12. MPI Session MPI Process ocean library atmosphere library MPI library ocean session atmos- phere session Unique handles to the underlying MPI library
  • 13. Initialize / finalize a session • MPI_Session_init( – IN MPI_Info info, – IN MPI_Errhandler errhandler, – OUT MPI_Session *session) • MPI_Session_finalize( – INOUT MPI_Session *session) • Parameters described in next slides…
  • 14. Session init params • Info: For future expansion • Errhandler: to be invoked if MPI_SESSION_INIT errors – Likely need a new type of errhandler • …or a generic errhandler • FT working is discussing exactly this topic
  • 15. MPI Session MPI Process ocean library atmosphere library MPI library ocean Errors return atmos- phere Errors abort Unique errhandlers, info, local state, etc.
  • 16. Great. I have a session. Now what?
  • 17. Fair warning • The MPI runtime has long-since been a bastard stepchild – Barely acknowledged in the standard – Mainly in the form of non-normative suggestions • It’s time to change that
  • 18. Overview • General scheme: – Query the underlying run-time system • Get a “set” of processes – Determine the processes you want • Create an MPI_Group – Create a communicator with just those processes • Create an MPI_Comm Query runtime for set of processes MPI_Group MPI_Comm MPI_Session
  • 19. Runtime concepts • Expose 2 concepts to MPI from the runtime: 1. Static sets of processes 2. Each set caches (key,value) string tuples These slides only discuss static sets (unchanged for the life of the process). However, there are several useful scenarios that involve dynamic membership of sets over time. More discussion needs to occur for these scenarios. For the purposes of these slides, just consider static sets.
  • 20. Static sets of processes • Sets are identified by string name • Two sets are mandated – “mpi://WORLD” – “mpi://SELF” • Other sets can be defined by the system: – “location://rack/19” – “network://leaf-switch/37” – “arch://x86_64” – “job://12942” – … etc. • Processes can be in more than one set These names are implementation- dependent
  • 21. Examples of sets MPI process 0 MPI process 1 MPI process 2 MPI process 3 mpi://WORLD
  • 22. Examples of sets MPI process 0 MPI process 1 MPI process 2 MPI process 3 mpi://WORLD arch://x86_64
  • 23. Examples of sets MPI process 0 MPI process 1 MPI process 2 MPI process 3 mpi://WORLD job://12942 arch://x86_64
  • 24. Examples of sets MPI process 0 MPI process 1 MPI process 2 MPI process 3 mpi://SELF mpi://SELF mpi://SELF mpi://SELF
  • 25. Examples of sets MPI process 0 MPI process 1 MPI process 2 MPI process 3 location://rack/self location://rack/self location://rack/17 location://rack/23
  • 26. Examples of sets MPI process 0 MPI process 1 MPI process 2 MPI process 3 user://ocean user://atmosphere mpiexec --np 2 --set user://ocean ocean.exe : --np 2 --set user://atmosphere atmosphere.exe
  • 27. Querying the run-time • MPI_Session_get_names( – IN MPI_Session session, – OUT char **set_names) • Returns argv-style list of 0-terminated names – Must be freed by caller Example list of set names returned mpi://WORLD mpi://SELF arch://x86_64 location://rack/17 job://12942 user://ocean
  • 28. Values in sets • Each set has an associated MPI_Info object • One mandated key in each info: – “size”: number of processes in this set • Runtime may also provide other keys – Implementation-dependent
  • 29. Querying the run-time • MPI_Session_get_info( – IN MPI_Session session, – IN const char *set_name, – OUT MPI_Info *info) • Use existing MPI_Info functions to retrieve (key,value) tuples
  • 30. Example MPI_Info info; MPI_Session_get_info(session, “mpi://WORLD”, &info); char *size_str[MPI_MAX_INFO_VAL] MPI_Info_get(info, “size”, …, size_str, …); int size = atoi(size_str);
  • 32. Make MPI_Groups! • MPI_Group_create_from_session( – IN MPI_Session session, – IN const char *set_name, – OUT MPI_Group *group); Advice to implementers: This MPI_Group can still be a lightweight object (even if there are a large number of processes in it)
  • 33. Example // Make a group of procs from “location://rack/self” MPI_Create_group_from_session_name( session, “location://rack/self”, &group); // Use just the even procs MPI_Group_size(group, &size); ranges[0][0] = 0; ranges[0][1] = size; ranges[0][2] = 2; MPI_Group_range_incl(group, 1, ranges, &group_of_evens);
  • 34. Make a communicator from that group • MPI_Create_comm_from_group( – IN MPI_Group group, – IN const char *tag, // for matching (see next slide) – IN MPI_Info info, – IN MPI_Errhandler errhander, – OUT MPI_Comm *comm) Note: this is different than the existing function MPI_Comm_create_group( oldcomm, group, (int) tag, &newcomm) Might need a better name for this new function…?
  • 35. String tag is used to match concurrent creations by different entities MPI Process ocean library atmosphere library MPI Process ocean library atmosphere library MPI Process ocean library atmosphere library MPI_Create_comm_from_group(…, tag = “gov.anl.ocean”, …) MPI_Create_comm_from_group(.., tag = “gov.llnl.atmosphere”, …)
  • 36. Make any kind of communicator • MPI_Create_cart_comm_from_group( – IN MPI_Group group, – IN const char *tag, – IN MPI_Info info, – IN MPI_Errhandler errhander, – IN int ndims, – IN const int dims[], – IN const int periods[], – IN int reorder, – OUT MPI_Comm *comm)
  • 37. Make any kind of communicator • MPI_Create_graph_comm_from_group(…) • MPI_Create_dist_graph_comm_from_group(…) • MPI_Create_dist_graph_adjacent_comm_from _group(…)
  • 38. Run-time static sets across different sessions in the same process • Making communicators from the same static set will always result in the same local rank – Even if created from different sessions See example in the next slide…
  • 39. Run-time static sets across different sessions in the same process // Session, group, and communicator 1 MPI_Create_group_from_session_name(session_1, “mpi://WORLD”, &group1); MPI_Create_comm_from_group(group1, “ocean”, …, &comm1); MPI_Comm_rank(comm1, &rank1); // Session, group, and communicator 2 MPI_Create_group_from_session_name(session_2, “mpi://WORLD”, &group2); MPI_Create_comm_from_group(group2, “atmosphere”, …, &comm2); MPI_Comm_rank(comm2, &rank2); // Ranks are guaranteed to be the same assert(rank1 == rank2); Law of Least Astonishment
  • 40. Mixing requests from different sessions: disallowed // Session, group, and communicator 1 MPI_Create_group_from_session_name(session_1, "mpi://WORLD", &group1); MPI_Create_comm_from_group(group1, "ocean", …, &comm1); MPI_Isend(…, &req[0]); // Session, group, and communicator 2 MPI_Create_group_from_session_name(session_2, "mpi://WORLD", &group2); MPI_Create_comm_from_group(group2, "atmosphere", …, &comm2); MPI_Isend(…, &req[1]); // Mixing requests from different // sessions is disallowed MPI_Waitall(2, req, …); Rationale: this is difficult to optimize, particularly if a session maps to hardware resources
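One way to stay within this rule is to complete each session's requests in separate calls. A hedged sketch, again using the proposed function names from the slides (the buffers, counts, and peer rank are illustrative):

```c
// Keep requests from different sessions out of a single WAITALL:
// complete each session's requests separately instead.
MPI_Request req1, req2;

MPI_Isend(buf1, n, MPI_INT, peer, 0, comm1, &req1);  // comm1: session 1
MPI_Isend(buf2, n, MPI_INT, peer, 0, comm2, &req2);  // comm2: session 2

// Legal under the proposed rule: each completion call only
// touches requests from one session
MPI_Wait(&req1, MPI_STATUS_IGNORE);
MPI_Wait(&req2, MPI_STATUS_IGNORE);
```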
  • 41. MPI_Session_finalize • Analogous to MPI_FINALIZE – Can block waiting for the destruction of the objects derived from that session • Communicators, windows, files, etc. – Each session that is initialized must be finalized
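Putting the lifecycle together: every object derived from a session should be freed before the session itself is finalized. A sketch under the proposal's (hypothetical) function names; the `"com.example.lib"` tag and the `MPI_Session_create` signature are assumptions:

```c
// Session lifecycle sketch (proposed API, not a ratified standard).
MPI_Session session;
MPI_Group   group;
MPI_Comm    comm;

MPI_Session_create(MPI_INFO_NULL, &session);        // assumed signature
MPI_Create_group_from_session_name(session, "mpi://WORLD", &group);
MPI_Create_comm_from_group(group, "com.example.lib",  // tag: illustrative
                           MPI_INFO_NULL, MPI_ERRORS_ARE_FATAL, &comm);
MPI_Group_free(&group);

// ... use comm ...

// Free every object derived from the session first;
// MPI_Session_finalize may block until they are all destroyed.
MPI_Comm_free(&comm);
MPI_Session_finalize(&session);
```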
  • 42. Well, that all sounds great. …but who calls MPI_INIT? And what session does MPI_COMM_WORLD / MPI_COMM_SELF belong to?
  • 43. New concept: no longer require MPI_INIT / MPI_FINALIZE
  • 44. New concept: no longer require MPI_INIT / MPI_FINALIZE • WHAT?! • When will MPI initialize itself? • How will MPI finalize itself? – It is still (very) desirable to allow MPI to clean itself up so that MPI processes can be “valgrind clean” when they exit
  • 45. Split MPI APIs into two sets Performance doesn’t matter (as much) • Functions that create / query / destroy: – MPI_Comm – MPI_File – MPI_Win – MPI_Info – MPI_Op – MPI_Errhandler – MPI_Datatype – MPI_Group – MPI_Session – Attributes – Processes • MPI_T Performance absolutely matters • Point to point • Collectives • I/O • RMA • Test/Wait • Handle language xfer
  • 46. Split MPI APIs into two sets Performance doesn’t matter (as much) • Functions that create / query / destroy: – MPI_Comm – MPI_File – MPI_Win – MPI_Info – MPI_Op – MPI_Errhandler – MPI_Datatype – MPI_Group – MPI_Session – Attributes – Processes • MPI_T Performance absolutely matters • Point to point • Collectives • I/O • RMA • Test/Wait • Handle language xfer Ensure that MPI is initialized (and/or finalized) by these functions These functions still can’t be used unless MPI is initialized
  • 47. Split MPI APIs into two sets Performance doesn’t matter (as much) • Functions that create / query / destroy: – MPI_Comm – MPI_File – MPI_Win – MPI_Info – MPI_Op – MPI_Errhandler – MPI_Datatype – MPI_Group – MPI_Session – Attributes – Processes • MPI_T Performance absolutely matters • Point to point • Collectives • I/O • RMA • Test/Wait • Handle language xfer These functions init / finalize MPI transparently These functions can’t be called without a handle created from the left-hand column
  • 48. Split MPI APIs into two sets Performance doesn’t matter (as much) • Functions that create / query / destroy: – MPI_Comm – MPI_File – MPI_Win – MPI_Info – MPI_Op – MPI_Errhandler – MPI_Datatype – MPI_Group – MPI_Session – Attributes – Processes • MPI_T Performance absolutely matters • Point to point • Collectives • I/O • RMA • Test/Wait • Handle language xfer MPI_COMM_WORLD and MPI_COMM_SELF are notable exceptions. …I’ll address this shortly.
  • 49. Example int main() { // Create a datatype – initializes MPI MPI_Type_contiguous(2, MPI_INT, &mytype); The creation of the first user-defined MPI object initializes MPI Initialization can be a local action!
  • 50. Example int main() { // Create a datatype – initializes MPI MPI_Type_contiguous(2, MPI_INT, &mytype); // Free the datatype – finalizes MPI MPI_Type_free(&mytype); // Valgrind clean return 0; } The destruction of the last user-defined MPI object finalizes / cleans up MPI. This is guaranteed. There are some corner cases described on the following slides.
  • 51. Example int main() { // Create a datatype – initializes MPI MPI_Type_contiguous(2, MPI_INT, &mytype); // Free the datatype – finalizes MPI MPI_Type_free(&mytype); // Re-initialize MPI! MPI_Type_dup(MPI_INT, &mytype); We can also re-initialize MPI! (it’s transparent to the user – so why not?)
  • 52. Example int main() { // Create a datatype – initializes MPI MPI_Type_contiguous(2, MPI_INT, &mytype); // Free the datatype – finalizes MPI MPI_Type_free(&mytype); // Re-initialize MPI! MPI_Type_dup(MPI_INT, &mytype); return 0; } (Sometimes) Not an error to exit the process with MPI still initialized
  • 53. The overall theme • Just use MPI functions whenever you want – MPI will initialize as it needs to – Initialization essentially becomes an implementation detail • Finalization will occur whenever all user-defined handles are destroyed
  • 54. Wait a minute – What about MPI_COMM_WORLD? int main() { // Can’t I do this? MPI_Send(…, MPI_COMM_WORLD); This would be calling a “performance matters” function before a “performance doesn’t matter” function I.e., MPI has not initialized yet
  • 55. Wait a minute – What about MPI_COMM_WORLD? int main() { // This is valid MPI_Init(NULL, NULL); MPI_Send(…, MPI_COMM_WORLD); Re-define MPI_INIT and MPI_FINALIZE: constructor and destructor for MPI_COMM_WORLD and MPI_COMM_SELF
  • 56. INIT and FINALIZE int main() { MPI_Init(NULL, NULL); MPI_Send(…, MPI_COMM_WORLD); MPI_Finalize(); } INIT and FINALIZE continue to exist for two reasons: 1. Backwards compatibility 2. Convenience So let’s keep them as close to MPI-3.1 as possible: • If you call INIT, you have to call FINALIZE • You can only call INIT / FINALIZE once • INITIALIZED / FINALIZED only refer to INIT / FINALIZE (not sessions) If you want different behavior, use sessions
  • 57. INIT and FINALIZE • INIT/FINALIZE create an implicit session – You cannot extract an MPI_Session handle for the implicit session created by MPI_INIT[_THREAD] • Yes, you can use INIT/FINALIZE in the same MPI process as other sessions
  • 58. Backwards compatibility: INITIALIZED and FINALIZED behavior int main() { MPI_Initialized(&flag); assert(flag == false); MPI_Finalized(&flag); assert(flag == false); MPI_Session_create(…, &session1); MPI_Initialized(&flag); assert(flag == false); MPI_Finalized(&flag); assert(flag == false); MPI_Init(NULL, NULL); MPI_Initialized(&flag); assert(flag == true); MPI_Finalized(&flag); assert(flag == false); MPI_Session_free(…, &session1); MPI_Initialized(&flag); assert(flag == true); MPI_Finalized(&flag); assert(flag == false); MPI_Session_create(…, &session2); MPI_Initialized(&flag); assert(flag == true); MPI_Finalized(&flag); assert(flag == false); MPI_Finalize(); MPI_Initialized(&flag); assert(flag == true); MPI_Finalized(&flag); assert(flag == true); MPI_Session_free(…, &session2); MPI_Initialized(&flag); assert(flag == true); MPI_Finalized(&flag); assert(flag == true); } Short version: INITIALIZED, FINALIZED, IS_THREAD_MAIN all still refer to INIT / FINALIZE
  • 59. FIN (for the main part of the proposal)
  • 60. Items that still need more discussion
  • 61. Issues that still need more discussion • Dynamic runtime sets – Temporal – Membership • Covered in other proposals: – Thread concurrent vs. non-concurrent – Generic error handlers
  • 62. Issues that still need more discussion • If COMM_WORLD|SELF are not available by default: – Do we need new destruction hooks to replace SELF attribute callbacks on FINALIZE? – What is the default error handler behavior for functions without comm/file/win? • Do we need syntactic sugar to get a comm from mpi://WORLD? • How do tools hook into MPI initialization and finalization?
  • 63. Session queries • Query session handle equality – MPI_Session_query(handle1, handle1_type, handle2, handle2_type, bool *are_they_equal) – Not 100% sure we need this…?
  • 64. Session thread support • Associate thread level support with sessions • Three options: 1. Similar to MPI-3.1: “first” initialization picks thread level 2. Let each session pick its own thread level (via info key in SESSION_CREATE) 3. Just make MPI always be THREAD_MULTIPLE
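Option 2 could look like the sketch below. The info key name `"thread_support"` is purely illustrative, since the slides do not name one, and `MPI_Session_create` taking an info argument is itself part of the proposal, not a ratified standard:

```c
// Option 2 sketched: each session picks its own thread level
// via an info key passed to SESSION_CREATE.
MPI_Info    info;
MPI_Session session;

MPI_Info_create(&info);
// "thread_support" is a hypothetical key name for illustration only
MPI_Info_set(info, "thread_support", "MPI_THREAD_MULTIPLE");
MPI_Session_create(info, &session);   // assumed signature
MPI_Info_free(&info);
```

This keeps thread-level negotiation local to each library: the ocean and atmosphere libraries from the earlier slides could request different levels without a priori coordination.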

Editor's Notes

  • #41: Even though WAITALL is semantically equivalent to a loop of WAITs, an implementation may have to scan/choose whether to use an optimized plural implementation or have to split it out into a sequence of WAITs. Plus: what does it mean if multiple sessions have different thread levels and we WAITALL on requests from them?
  • #49: NOTE: Handle xfer functions need to be high performance, too. We convinced ourselves that this is still implementable: MPICH-like implementations: no issue. OMPI-like implementation: can have an initial table that is all the pre-defined f2c lookups. Upon first user handle creation, alloc a new table (and re-alloc every time you need to grow after that).