PROGRAMMING USING
MPI AND OPENMP
- Mayuri Sewatkar(16101A1002)
Topics Covered
 MPI
 MPI Principles
 Building blocks
 The Message Passing Interface (MPI)
 Overlapping Communication and Computation
 Collective Communication Operations
 Composite Synchronization Constructs
 Pros and Cons of MPI
 OpenMP
 Threading
 Parallel Programming Model
 Combining MPI and OpenMP
 Shared Memory Programming
 Pros and Cons of OpenMP
What is MPI???
 Message Passing Interface (MPI) is a language-independent
communications protocol used to program parallel computers. Both
point-to-point and collective communication are supported.
 MPI "is a message-passing application programmer interface, together
with protocol and semantic specifications for how its features must
behave in any implementation." So, MPI is a specification, not an
implementation.
 MPI's goals are high performance, scalability, and portability.
MPI Principles
 MPI-1 model has no shared memory concept.
 MPI-2 has only a limited distributed shared memory
concept.
 MPI-3 includes new Fortran 2008 bindings, while it
removes deprecated C++ bindings as well as many
deprecated routines and MPI objects.
MPI Building Blocks
 Since interactions are accomplished by sending and receiving messages,
the basic operations in the message-passing programming paradigm are
SEND and RECEIVE.
 In their simplest form, the prototypes of these operations are defined as
follows:
 send(void *sendbuf, int nelems, int dest)
 receive(void *recvbuf, int nelems, int source)
 The sendbuf points to a buffer that stores the data to be sent, recvbuf
points to a buffer that stores the data to be received, nelems is the
number of data units to be sent and received, dest is the identifier of the
process that receives the data, and source is the identifier of the process
that sends the data.
MPI: the Message Passing Interface
 MPI defines a standard library for message-passing that can be used to
develop portable message-passing programs using either C or Fortran.
 The MPI standard defines both the syntax as well as the semantics of a
core set of library routines that are very useful in writing message-
passing programs.
 The MPI library contains over 125 routines.
 These routines are used to initialize and terminate the MPI library, to
get information about the parallel computing environment, and to send
and receive messages.
MPI: the Message Passing Interface
 MPI_Init - Initializes MPI.
 This function must be called in every MPI program, must be called
before any other MPI functions and must be called only once in an MPI
program.
 MPI_Init(&argc,&argv);
 MPI_Comm_size - Determines the number of processes.
 Returns the total number of MPI processes in the specified
communicator (MPI_COMM_WORLD).
 It represents the number of MPI tasks available to your application.
MPI: the Message Passing Interface
 MPI_Comm_rank - Determines the label of the calling process.
 Returns the rank of the calling MPI process within the specified
communicator. Initially, each process will be assigned a unique
integer rank between 0 and number of tasks - 1 within the
communicator MPI_COMM_WORLD. This rank is often referred
to as a task ID.
 MPI_Comm_rank (comm,&rank);
 MPI_Send - Sends a message.
 It performs a blocking send i.e. this routine may block until the
message is received by the destination process.
 int MPI_Send(void *buf, int count, MPI_Datatype datatype, int dest, int tag, MPI_Comm comm)
 buf -> initial address of send buffer
 count -> number of elements in send buffer
 datatype -> datatype of each send buffer element
 dest -> rank of destination
 tag -> message tag
 comm -> communicator
MPI: the Message Passing Interface
 MPI_Recv - Receives a message.
 The count argument indicates the maximum length of a message.
 int MPI_Recv(void *buf, int count, MPI_Datatype datatype, int source, int tag, MPI_Comm comm, MPI_Status *status)
 buf -> initial address of receive buffer
 count -> maximum number of elements in receive buffer
 datatype -> datatype of each receive buffer element
 source -> rank of source
 tag -> message tag
 comm -> communicator
 status -> status object
 MPI_Finalize - Terminates MPI.
 This function should be the last MPI routine called in every MPI program - no other MPI routines may be called after it.
 MPI_Finalize();
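 To tie these point-to-point routines together, here is a minimal sketch (not from the slides) in which process 0 sends one integer to process 1; the variable names, tag value, and payload are illustrative assumptions.
#include "mpi.h"
#include <stdio.h>
int main( int argc, char *argv[] )
{
    int rank, value = 0;
    MPI_Status status;
    MPI_Init( &argc, &argv );
    MPI_Comm_rank( MPI_COMM_WORLD, &rank );
    if ( rank == 0 ) {
        value = 42; /* illustrative payload */
        /* blocking send of one int to rank 1, tag 0 */
        MPI_Send( &value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD );
    } else if ( rank == 1 ) {
        /* blocking receive of one int from rank 0, tag 0 */
        MPI_Recv( &value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, &status );
        printf( "Process 1 received %d from process 0\n", value );
    }
    MPI_Finalize();
    return 0;
}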
Compiling and running MPI
Compiling: mpicc -o helloworld helloworld.c
Running: mpirun -np 4 ./helloworld
MPI Example – Hello World
#include "mpi.h"
#include <stdio.h>
int main( int argc, char *argv[] )
{
int rank, size;
MPI_Init( &argc, &argv );
MPI_Comm_rank( MPI_COMM_WORLD, &rank );
MPI_Comm_size( MPI_COMM_WORLD, &size );
printf( "Hello World from process %d of %d\n", rank, size );
MPI_Finalize();
return 0;
}
MPI Example – Hello World
Output –
Hello World from process 0 of 4
Hello World from process 2 of 4
Hello World from process 3 of 4
Hello World from process 1 of 4
Overlapping Communication and
Computation
 A blocking send operation remains blocked until the message has been
copied out of the send buffer (either into a system buffer at the source
process or sent to the destination process).
 Similarly, a blocking receive operation returns only after the message
has been received and copied into the receive buffer.
 In order to overlap communication with computation, MPI provides a
pair of functions for performing non-blocking send and receive
operations.
 These functions are:
 MPI_Isend
 MPI_Irecv
Overlapping Communication and
Computation
 MPI_Isend
 MPI_Isend starts a send operation but does not complete, that is, it returns
before the data is copied out of the buffer.
 The calling sequences of MPI_Isend is
 int MPI_Isend(void *buf, int count, MPI_Datatype datatype, int dest, int tag,
MPI_Comm comm, MPI_Request *request)
 MPI_Irecv
 MPI_Irecv starts a receive operation but returns before the data has been
received and copied into the buffer.
 The calling sequences of MPI_Irecv is
 int MPI_Irecv(void *buf, int count, MPI_Datatype datatype, int source, int tag,
MPI_Comm comm, MPI_Request *request)
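 A minimal sketch (not from the slides) of how these calls let communication overlap computation; the helper name exchange_and_compute and the placeholder for local work are assumptions.
#include "mpi.h"
/* Hypothetical helper: start an exchange, compute, then wait for completion. */
void exchange_and_compute( double *sendbuf, double *recvbuf, int count,
                           int dest, int source )
{
    MPI_Request send_req, recv_req;
    MPI_Status status;
    /* start the non-blocking transfers; both calls return immediately */
    MPI_Isend( sendbuf, count, MPI_DOUBLE, dest, 0, MPI_COMM_WORLD, &send_req );
    MPI_Irecv( recvbuf, count, MPI_DOUBLE, source, 0, MPI_COMM_WORLD, &recv_req );
    /* ... computation that does not touch sendbuf or recvbuf goes here ... */
    /* block only when the transferred data is actually needed */
    MPI_Wait( &send_req, &status );
    MPI_Wait( &recv_req, &status );
}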
Collective Communication Operations
 MPI provides the following routines for collective communication:
 MPI_Bcast() -> Broadcast (one to all)
 MPI_Reduce() -> Reduction (all to one)
 MPI_Allreduce() -> Reduction (all to all)
 MPI_Scatter() -> Distribute data (one to all)
 MPI_Gather() -> Collect data (all to one)
 MPI_Alltoall() -> Distribute data (all to all)
 MPI_Allgather() -> Collect data (all to all)
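 As an illustration (not in the slides), the sketch below broadcasts a parameter from rank 0 and then reduces per-process partial sums back to rank 0; the variable names and the work done are assumptions.
#include "mpi.h"
#include <stdio.h>
int main( int argc, char *argv[] )
{
    int rank, n = 0;
    double local_sum, global_sum;
    MPI_Init( &argc, &argv );
    MPI_Comm_rank( MPI_COMM_WORLD, &rank );
    if ( rank == 0 ) n = 100; /* illustrative parameter, known only to rank 0 */
    /* one to all: every process receives rank 0's value of n */
    MPI_Bcast( &n, 1, MPI_INT, 0, MPI_COMM_WORLD );
    local_sum = (double) rank * n; /* stand-in for real per-process work */
    /* all to one: sum the partial results onto rank 0 */
    MPI_Reduce( &local_sum, &global_sum, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD );
    if ( rank == 0 ) printf( "global sum = %f\n", global_sum );
    MPI_Finalize();
    return 0;
}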
Composite Synchronization
Constructs
 By design, Pthreads provide support for a basic set of operations.
 Higher level constructs can be built using basic synchronization
constructs.
 We discuss two such constructs - read-write locks and barriers.
 A read lock is granted when there are other threads that may already
have read locks.
 If there is a write lock on the data (or if there are queued write locks),
the thread performs a condition wait.
 If there are multiple threads requesting a write lock, they must perform
a condition wait.
 With this description, we can design functions for
 read locks mylib_rwlock_rlock,
 write locks mylib_rwlock_wlock and
 unlocking mylib_rwlock_unlock.
Read-Write Locks
 The lock data type mylib_rwlock_t holds the following:
 a count of the number of readers,
 the writer (a 0/1 integer specifying whether a writer is present),
 a condition variable readers_proceed that is signaled when readers
can proceed,
 a condition variable writer_proceed that is signaled when one of the
writers can proceed,
 a count pending_writers of pending writers, and
 a mutex read_write_lock associated with the shared data structure
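 The slides describe the fields but do not show code; below is a sketch of the data type and the read-lock routine that follows the description above (initialization and the write-lock/unlock routines are omitted, and the exact field handling is an assumption).
#include <pthread.h>
typedef struct {
    int readers;                      /* number of readers holding the lock */
    int writer;                       /* 0/1: is a writer present?          */
    pthread_cond_t readers_proceed;   /* signaled when readers can proceed  */
    pthread_cond_t writer_proceed;    /* signaled when one writer can go    */
    int pending_writers;              /* writers waiting for the lock       */
    pthread_mutex_t read_write_lock;  /* protects this structure            */
} mylib_rwlock_t;
void mylib_rwlock_rlock( mylib_rwlock_t *l )
{
    pthread_mutex_lock( &(l->read_write_lock) );
    /* wait while a writer holds the lock or writers are queued */
    while ( ( l->pending_writers > 0 ) || ( l->writer > 0 ) )
        pthread_cond_wait( &(l->readers_proceed), &(l->read_write_lock) );
    l->readers++;
    pthread_mutex_unlock( &(l->read_write_lock) );
}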
Barriers
 As in MPI, a barrier holds a thread until all threads participating in the
barrier have reached it.
 Barriers can be implemented using a counter, a mutex and a condition
variable.
 A single integer is used to keep track of the number of threads that have
reached the barrier.
 If the count is less than the total number of threads, the threads execute
a condition wait.
 The last thread entering (and setting the count to the number of
threads) wakes up all the threads using a condition broadcast.
Barriers
typedef struct
{
pthread_mutex_t count_lock;
pthread_cond_t ok_to_proceed;
int count;
} mylib_barrier_t;
void mylib_init_barrier(mylib_barrier_t *b)
{
b -> count = 0;
pthread_mutex_init(&(b -> count_lock), NULL);
pthread_cond_init(&(b -> ok_to_proceed), NULL);
}
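 The barrier wait routine itself is not shown on the slide; here is a sketch that follows the description above (passing the total thread count num_threads as an argument is an assumption of this sketch).
void mylib_barrier( mylib_barrier_t *b, int num_threads )
{
    pthread_mutex_lock( &(b->count_lock) );
    b->count++;
    if ( b->count == num_threads ) {
        /* last thread to arrive: reset the counter and wake the others */
        b->count = 0;
        pthread_cond_broadcast( &(b->ok_to_proceed) );
    } else {
        /* wait until the last arriving thread issues the broadcast */
        while ( pthread_cond_wait( &(b->ok_to_proceed), &(b->count_lock) ) != 0 )
            ;
    }
    pthread_mutex_unlock( &(b->count_lock) );
}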
Pros and Cons of MPI
 Pros
 Does not require shared memory architectures which are more expensive
than distributed memory architectures
 Can be used on a wider range of problems since it exploits both task
parallelism and data parallelism
 Can run on both shared memory and distributed memory architectures
 Highly portable with specific optimization for the implementation on most
hardware
 Cons
 Requires more programming changes to go from serial to parallel version
 Can be harder to debug
What is OpenMP???
 OpenMP (Open Multi-Processing) is an application programming
interface (API) that supports multi-platform shared memory
multiprocessing programming in C, C++, and Fortran, on most
platforms, processor architectures and operating systems, including
Solaris, AIX, HP-UX, Linux, MacOS, and Windows.
 OpenMP uses a portable, scalable model that gives programmers a
simple and flexible interface for developing parallel applications for
platforms ranging from the standard desktop computer to the
supercomputer.
What is OpenMP???
 OpenMP is essentially an add-on to the compiler. It is available in GCC (the GNU compiler), the Intel compiler, and other compilers.
 OpenMP targets shared memory systems, i.e. systems where the processors share main memory.
 OpenMP is based on a thread approach. It launches a single process, which in turn can create as many threads as desired. It follows the "fork and join" model, i.e. depending on the task it launches the number of threads directed by the user.
Threading
 A thread is a single stream of control in the flow of a program.
 Static Threads
 All work is allocated and assigned at runtime
 Dynamic Threads
 Consists of one Master and a pool of threads
 The pool is assigned some of the work at runtime, but not all of it
 When a thread from the pool becomes idle, the Master gives it a new
assignment
 “Round-robin assignments”
Parallel Programming Model
 OpenMP uses the fork-join model of parallel execution.
 All OpenMP programs begin with a single master thread.
 The master thread executes sequentially until a parallel region is
encountered, when it creates a team of parallel threads (FORK).
 When the team threads complete the parallel region, they synchronize and
terminate, leaving only the master thread that executes sequentially
(JOIN).
Variables
 2 types of Variables
 Private
 Shared
 Private Variables
 Variables in a thread's private space can only be accessed by that thread.
 A private variable has a different address in the execution context of every thread.
 Clause: private(variable list)
 Shared Variables
 Variables in the global data space are accessed by all parallel threads.
 A shared variable has the same address in the execution context of every thread. All threads have access to shared variables.
Variables
 A thread can access its own private variables, but cannot access the
private variable of another thread.
 In the parallel for pragma, variables are shared by default, except the loop index variable, which is private.
/* privIndx must also be listed as private: each thread needs its own
   copy of the inner loop index */
#pragma omp parallel for private(privIndx, privDbl)
for ( i = 0; i < arraySize; i++ ) {
    for ( privIndx = 0; privIndx < 16; privIndx++ ) {
        privDbl = ( (double) privIndx ) / 16;
        y[i] = sin( exp( cos( - exp( sin(x[i]) ) ) ) ) + cos( privDbl );
    }
}
OpenMP Functions
 omp_get_num_procs()
 Returns the number of CPUs in the multiprocessor on which this thread is executing.
 The integer returned by this function may be less than the total number of
physical processors in the multiprocessor, depending on how the run-time
system gives processes access to processors.
 e.g. int t= omp_get_num_procs();
 omp_get_num_threads()
 Returns the number of threads active in the current parallel region
 t=omp_get_num_threads();
OpenMP Functions Contd.
 omp_set_num_threads()
 Allows setting the number of threads that execute the parallel sections of code
 Typically the number of threads is set equal to the number of available CPUs
 e.g. omp_set_num_threads(t);
 omp_get_thread_num()
 Returns the thread identification number, from 0 to n-1, where n is the number of active threads.
 tid = omp_get_thread_num();
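 A small sketch (assumed, not from the slides) combining these calls: request one thread per available CPU, then report the team size from thread 0.
#include <omp.h>
#include <stdio.h>
int main( void )
{
    int procs = omp_get_num_procs(); /* CPUs the run-time system offers us   */
    omp_set_num_threads( procs );    /* ask for one thread per available CPU */
    #pragma omp parallel
    {
        /* inside the parallel region every thread has its own id */
        if ( omp_get_thread_num() == 0 )
            printf( "team size: %d (of %d CPUs)\n", omp_get_num_threads(), procs );
    }
    return 0;
}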
OpenMP compiler directives (Pragma)
 A compiler directive in C or C++ is called a pragma.
 Format:
 #pragma omp directive-name [clause,..]
1. #pragma omp parallel
 Block of code should be executed by all of the threads (code block is
replicated among the threads)
 use curly braces {} to create a block of code from a statement group.
OpenMP compiler directives (Pragma)
2. #pragma omp parallel for
 Indicates to the compiler that the iterations of a for loop may be executed in parallel.
 e.g.
#pragma omp parallel for
for (i = first; i < size; i += prime)
marked[i] = 1;
Compiling and running OpenMP
Compiling: $ gcc -o hello_omp hello_omp.c -fopenmp
Running: $ ./hello_omp
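 The hello_omp.c referenced above is not shown in the slides; a minimal sketch of what it might look like, mirroring the MPI Hello World earlier:
/* hello_omp.c */
#include <omp.h>
#include <stdio.h>
int main( void )
{
    /* the block is replicated among the threads of the team */
    #pragma omp parallel
    {
        printf( "Hello World from thread %d of %d\n",
                omp_get_thread_num(), omp_get_num_threads() );
    }
    return 0;
}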
Combining MPI and OpenMP
 In many cases hybrid programs using both MPI and OpenMP execute faster than programs using only MPI.
 Sometimes hybrid programs execute faster because they have lower
communication overhead.
 Suppose we are executing our program on a cluster of m multiprocessors,
where each multiprocessor has k CPUs. In order to utilize every CPU, a
program relying on MPI must create mk processes. During
communication steps, mk processes are active.
 On the other hand, a hybrid program need only create m processes. In
parallel sections of code, the workload is divided among k threads on
each multiprocessor. Hence every CPU is utilized.
 However, during communication steps, only m processes are active. This may well give the hybrid program lower communication overhead than a "pure" MPI program, resulting in higher speedup.
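 A minimal sketch of the hybrid structure described above (one MPI process per multiprocessor, OpenMP threads inside each process); the summation is only an illustrative stand-in for real work.
#include "mpi.h"
#include <omp.h>
#include <stdio.h>
int main( int argc, char *argv[] )
{
    int rank, size;
    double local_sum = 0.0, global_sum;
    MPI_Init( &argc, &argv );  /* one MPI process per multiprocessor (m total) */
    MPI_Comm_rank( MPI_COMM_WORLD, &rank );
    MPI_Comm_size( MPI_COMM_WORLD, &size );
    /* inside each process, k OpenMP threads share the node's portion of work */
    #pragma omp parallel for reduction(+:local_sum)
    for ( int i = rank; i < 1000000; i += size )
        local_sum += 1.0 / ( i + 1 );
    /* only the m processes take part in communication (MPI calls stay
       outside the parallel region, so plain MPI_Init suffices here) */
    MPI_Reduce( &local_sum, &global_sum, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD );
    if ( rank == 0 ) printf( "sum = %f\n", global_sum );
    MPI_Finalize();
    return 0;
}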
Shared Memory Programming
 The underlying hardware is assumed to be a collection of processors, each with access to the same shared memory.
 Because they have access to the same memory locations, processors can interact and synchronize with each other through shared variables.
 The standard view of parallelism in a shared memory program is
fork/join parallelism.
 When the program begins execution, only a single thread, called the
master thread, is active.
 The master thread executes the sequential portions of the algorithm. At those points where parallel operations are required, the master thread forks (creates or awakens) additional threads.
 The master thread and the created threads work concurrently through the parallel section. At the end of the parallel code the created threads die or are suspended, and the flow of control returns to the single master thread. This is called a join.
Shared Memory Programming
 The shared-memory model is characterized by fork/join parallelism, in which parallelism comes and goes.
 At the beginning of execution only a single thread, called the master thread, is active.
 The master thread executes the serial portions of the program. It forks additional threads to help it execute parallel portions of the program.
 These threads are deactivated when serial execution resumes.
Shared Memory Programming
 A key difference, then, between the shared-memory model and the
message passing model is that in the message-passing model all
processes typically remain active throughout the execution of the
program, whereas in the shared-memory model the number of
active threads is one at the program's start and finish and may
change dynamically throughout the execution of the program.
 Parallel shared-memory programs range from those with only a single fork/join around a single loop to those in which most of the code segments are executed in parallel. Hence the shared-memory model supports incremental parallelization, the process of transforming a sequential program into a parallel program one block of code at a time.
Pros and Cons of OpenMP
 Pros
 Considered by some to be easier to program and debug (compared to
MPI)
 Data layout and decomposition is handled automatically by directives.
 Allows incremental parallelism: directives can be added incrementally,
so the program can be parallelized one portion after another and thus
no dramatic change to code is needed.
 Unified code for both serial and parallel applications: OpenMP
constructs are treated as comments when sequential compilers are
used.
 Original (serial) code statements need not, in general, be modified when
parallelized with OpenMP. This reduces the chance of inadvertently
introducing bugs and helps maintenance as well.
 Both coarse-grained and fine-grained parallelism are possible
Pros and Cons of OpenMP
 Cons
 Currently only runs efficiently in shared-memory multiprocessor
platforms
 Requires a compiler that supports OpenMP.
 Scalability is limited by memory architecture.
 Reliable error handling is missing.
 Lacks fine-grained mechanisms to control thread-processor
mapping.
 Synchronization between subsets of threads is not allowed.
 Mostly used for loop parallelization
 Can be difficult to debug, due to implicit communication between
threads via shared variables.