PROGRAMMING USING
MPI AND OPENMP
- Mayuri Sewatkar(16101A1002)
Topics Covered
 MPI
 MPI Principles
 Building blocks
 The Message Passing Interface (MPI)
 Overlapping Communication and Computation
 Collective Communication Operations
 Composite Synchronization Constructs
 Pros and Cons of MPI
 OpenMP
 Threading
 Parallel Programming Model
 Combining MPI and OpenMP
 Shared Memory Programming
 Pros and Cons of OpenMP
What is MPI???
 Message Passing Interface (MPI) is a language-independent
communications protocol used to program parallel computers. Both
point-to-point and collective communication are supported.
 MPI "is a message-passing application programmer interface, together
with protocol and semantic specifications for how its features must
behave in any implementation." So, MPI is a specification, not an
implementation.
 MPI's goals are high performance, scalability, and portability.
MPI Principles
 MPI-1 model has no shared memory concept.
 MPI-2 has only a limited distributed shared memory
concept.
 MPI-3 includes new Fortran 2008 bindings, while it
removes deprecated C++ bindings as well as many
deprecated routines and MPI objects.
MPI Building Blocks
 Since interactions are accomplished by sending and receiving messages,
the basic operations in the message-passing programming paradigm are
SEND and RECEIVE.
 In their simplest form, the prototypes of these operations are defined as
follows:
 send(void *sendbuf, int nelems, int dest)
 receive(void *recvbuf, int nelems, int source)
 The sendbuf points to a buffer that stores the data to be sent, recvbuf
points to a buffer that stores the data to be received, nelems is the
number of data units to be sent and received, dest is the identifier of the
process that receives the data, and source is the identifier of the process
that sends the data.
MPI: the Message Passing Interface
 MPI defines a standard library for message-passing that can be used to
develop portable message-passing programs using either C or Fortran.
 The MPI standard defines both the syntax as well as the semantics of a
core set of library routines that are very useful in writing message-
passing programs.
 The MPI library contains over 125 routines.
 These routines are used to initialize and terminate the MPI library, to
get information about the parallel computing environment, and to send
and receive messages.
MPI: the Message Passing Interface
 MPI_Init - Initializes MPI.
 This function must be called in every MPI program, must be called
before any other MPI functions and must be called only once in an MPI
program.
 MPI_Init(&argc,&argv);
 MPI_Comm_size - Determines the number of processes.
 Returns the total number of MPI processes in the specified
communicator (MPI_COMM_WORLD).
 It represents the number of MPI tasks available to your application.
MPI: the Message Passing Interface
 MPI_Comm_rank - Determines the label of the calling process.
 Returns the rank of the calling MPI process within the specified
communicator. Initially, each process will be assigned a unique
integer rank between 0 and number of tasks - 1 within the
communicator MPI_COMM_WORLD. This rank is often referred
to as a task ID.
 MPI_Comm_rank (comm,&rank);
 MPI_Send - Sends a message.
 It performs a blocking send i.e. this routine may block until the
message is received by the destination process.
 int MPI_Send(void *buf, int count, MPI_Datatype datatype, int dest, int tag, MPI_Comm comm)
 buf -> initial address of send buffer
 count -> number of elements in send buffer
 datatype -> datatype of each send buffer element
 dest -> rank of destination
 tag -> message tag
 comm -> communicator
MPI: the Message Passing Interface
 MPI_Recv - Receives a message.
 The count argument indicates the maximum length of a message.
 int MPI_Recv(void *buf, int count, MPI_Datatype datatype, int source, int tag, MPI_Comm comm, MPI_Status *status)
 buf -> initial address of receive buffer
 count -> maximum number of elements in receive buffer
 datatype -> datatype of each receive buffer element
 source -> rank of source
 tag -> message tag
 comm -> communicator
 status -> status object
 MPI_Finalize - Terminates MPI.
 This function should be the last MPI routine called in every MPI program - no other MPI routines may be called after it.
 MPI_Finalize();
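 To tie these point-to-point routines together, here is a minimal sketch (not from the slides) in which process 0 sends one integer to process 1; the variable names, tag value, and payload are illustrative assumptions.
#include "mpi.h"
#include <stdio.h>
int main( int argc, char *argv[] )
{
    int rank, value = 0;
    MPI_Status status;
    MPI_Init( &argc, &argv );
    MPI_Comm_rank( MPI_COMM_WORLD, &rank );
    if ( rank == 0 ) {
        value = 42; /* illustrative payload */
        /* blocking send of one int to rank 1, tag 0 */
        MPI_Send( &value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD );
    } else if ( rank == 1 ) {
        /* blocking receive of one int from rank 0, tag 0 */
        MPI_Recv( &value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, &status );
        printf( "Process 1 received %d from process 0\n", value );
    }
    MPI_Finalize();
    return 0;
}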
Compiling and running MPI
Compiling: mpicc -o helloworld helloworld.c
Running: mpirun -np 4 ./helloworld
MPI Example – Hello World
#include "mpi.h"
#include <stdio.h>
int main( int argc, char *argv[] )
{
int rank, size;
MPI_Init( &argc, &argv );
MPI_Comm_rank( MPI_COMM_WORLD, &rank );
MPI_Comm_size( MPI_COMM_WORLD, &size );
printf( "Hello World from process %d of %d\n", rank, size );
MPI_Finalize();
return 0;
}
MPI Example – Hello World
Output –
Hello World from process 0 of 4
Hello World from process 2 of 4
Hello World from process 3 of 4
Hello World from process 1 of 4
Overlapping Communication and
Computation
 A blocking send operation remains blocked until the message has been
copied out of the send buffer (either into a system buffer at the source
process or sent to the destination process).
 Similarly, a blocking receive operation returns only after the message
has been received and copied into the receive buffer.
 In order to overlap communication with computation, MPI provides a
pair of functions for performing non-blocking send and receive
operations.
 These functions are:
 MPI_Isend
 MPI_Irecv
Overlapping Communication and
Computation
 MPI_Isend
 MPI_Isend starts a send operation but does not complete, that is, it returns
before the data is copied out of the buffer.
 The calling sequences of MPI_Isend is
 int MPI_Isend(void *buf, int count, MPI_Datatype datatype, int dest, int tag,
MPI_Comm comm, MPI_Request *request)
 MPI_Irecv
 MPI_Irecv starts a receive operation but returns before the data has been
received and copied into the buffer.
 The calling sequences of MPI_Irecv is
 int MPI_Irecv(void *buf, int count, MPI_Datatype datatype, int source, int tag,
MPI_Comm comm, MPI_Request *request)
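 A minimal sketch (not from the slides) of how these calls let communication overlap computation; the helper name exchange_and_compute and the placeholder for local work are assumptions.
#include "mpi.h"
/* Hypothetical helper: start an exchange, compute, then wait for completion. */
void exchange_and_compute( double *sendbuf, double *recvbuf, int count,
                           int dest, int source )
{
    MPI_Request send_req, recv_req;
    MPI_Status status;
    /* start the non-blocking transfers; both calls return immediately */
    MPI_Isend( sendbuf, count, MPI_DOUBLE, dest, 0, MPI_COMM_WORLD, &send_req );
    MPI_Irecv( recvbuf, count, MPI_DOUBLE, source, 0, MPI_COMM_WORLD, &recv_req );
    /* ... computation that does not touch sendbuf or recvbuf goes here ... */
    /* block only when the transferred data is actually needed */
    MPI_Wait( &send_req, &status );
    MPI_Wait( &recv_req, &status );
}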
Collective Communication Operations
 MPI provides the following routines for collective communication:
 MPI_Bcast() -> Broadcast (one to all)
 MPI_Reduce() -> Reduction (all to one)
 MPI_Allreduce() -> Reduction (all to all)
 MPI_Scatter() -> Distribute data (one to all)
 MPI_Gather() -> Collect data (all to one)
 MPI_Alltoall() -> Distribute data (all to all)
 MPI_Allgather() -> Collect data (all to all)
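 As an illustration (not in the slides), the sketch below broadcasts a parameter from rank 0 and then reduces per-process partial sums back to rank 0; the variable names and the work done are assumptions.
#include "mpi.h"
#include <stdio.h>
int main( int argc, char *argv[] )
{
    int rank, n = 0;
    double local_sum, global_sum;
    MPI_Init( &argc, &argv );
    MPI_Comm_rank( MPI_COMM_WORLD, &rank );
    if ( rank == 0 ) n = 100; /* illustrative parameter, known only to rank 0 */
    /* one to all: every process receives rank 0's value of n */
    MPI_Bcast( &n, 1, MPI_INT, 0, MPI_COMM_WORLD );
    local_sum = (double) rank * n; /* stand-in for real per-process work */
    /* all to one: sum the partial results onto rank 0 */
    MPI_Reduce( &local_sum, &global_sum, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD );
    if ( rank == 0 ) printf( "global sum = %f\n", global_sum );
    MPI_Finalize();
    return 0;
}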
Composite Synchronization
Constructs
 By design, Pthreads provide support for a basic set of operations.
 Higher level constructs can be built using basic synchronization
constructs.
 We discuss two such constructs - read-write locks and barriers.
 A read lock is granted when there are other threads that may already
have read locks.
 If there is a write lock on the data (or if there are queued write locks),
the thread performs a condition wait.
 If there are multiple threads requesting a write lock, they must perform
a condition wait.
 With this description, we can design functions for
 read locks mylib_rwlock_rlock,
 write locks mylib_rwlock_wlock and
 unlocking mylib_rwlock_unlock.
Read-Write Locks
 The lock data type mylib_rwlock_t holds the following:
 a count of the number of readers,
 the writer (a 0/1 integer specifying whether a writer is present),
 a condition variable readers_proceed that is signaled when readers
can proceed,
 a condition variable writer_proceed that is signaled when one of the
writers can proceed,
 a count pending_writers of pending writers, and
 a mutex read_write_lock associated with the shared data structure
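 The slides describe the fields but do not show code; below is a sketch of the data type and the read-lock routine that follows the description above (initialization and the write-lock/unlock routines are omitted, and the exact field handling is an assumption).
#include <pthread.h>
typedef struct {
    int readers;                      /* number of readers holding the lock */
    int writer;                       /* 0/1: is a writer present?          */
    pthread_cond_t readers_proceed;   /* signaled when readers can proceed  */
    pthread_cond_t writer_proceed;    /* signaled when one writer can go    */
    int pending_writers;              /* writers waiting for the lock       */
    pthread_mutex_t read_write_lock;  /* protects this structure            */
} mylib_rwlock_t;
void mylib_rwlock_rlock( mylib_rwlock_t *l )
{
    pthread_mutex_lock( &(l->read_write_lock) );
    /* wait while a writer holds the lock or writers are queued */
    while ( ( l->pending_writers > 0 ) || ( l->writer > 0 ) )
        pthread_cond_wait( &(l->readers_proceed), &(l->read_write_lock) );
    l->readers++;
    pthread_mutex_unlock( &(l->read_write_lock) );
}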
Barriers
 As in MPI, a barrier holds a thread until all threads participating in the
barrier have reached it.
 Barriers can be implemented using a counter, a mutex and a condition
variable.
 A single integer is used to keep track of the number of threads that have
reached the barrier.
 If the count is less than the total number of threads, the threads execute
a condition wait.
 The last thread entering (and setting the count to the number of
threads) wakes up all the threads using a condition broadcast.
Barriers
typedef struct
{
pthread_mutex_t count_lock;
pthread_cond_t ok_to_proceed;
int count;
} mylib_barrier_t;
void mylib_init_barrier(mylib_barrier_t *b)
{
b -> count = 0;
pthread_mutex_init(&(b -> count_lock), NULL);
pthread_cond_init(&(b -> ok_to_proceed), NULL);
}
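 The barrier wait routine itself is not shown on the slide; here is a sketch that follows the description above (passing the total thread count num_threads as an argument is an assumption of this sketch).
void mylib_barrier( mylib_barrier_t *b, int num_threads )
{
    pthread_mutex_lock( &(b->count_lock) );
    b->count++;
    if ( b->count == num_threads ) {
        /* last thread to arrive: reset the counter and wake the others */
        b->count = 0;
        pthread_cond_broadcast( &(b->ok_to_proceed) );
    } else {
        /* wait until the last arriving thread issues the broadcast */
        while ( pthread_cond_wait( &(b->ok_to_proceed), &(b->count_lock) ) != 0 )
            ;
    }
    pthread_mutex_unlock( &(b->count_lock) );
}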
Pros and Cons of MPI
 Pros
 Does not require shared memory architectures which are more expensive
than distributed memory architectures
 Can be used on a wider range of problems since it exploits both task
parallelism and data parallelism
 Can run on both shared memory and distributed memory architectures
 Highly portable with specific optimization for the implementation on most
hardware
 Cons
 Requires more programming changes to go from serial to parallel version
 Can be harder to debug
What is OpenMP???
 OpenMP (Open Multi-Processing) is an application programming
interface (API) that supports multi-platform shared memory
multiprocessing programming in C, C++, and Fortran, on most
platforms, processor architectures and operating systems, including
Solaris, AIX, HP-UX, Linux, MacOS, and Windows.
 OpenMP uses a portable, scalable model that gives programmers a
simple and flexible interface for developing parallel applications for
platforms ranging from the standard desktop computer to the
supercomputer.
What is OpenMP???
 OpenMP is essentially an add-on to the compiler. It is available in GCC (the GNU compiler), the Intel compiler, and other compilers.
 OpenMP targets shared memory systems, i.e. systems where the processors share main memory.
 OpenMP is based on a thread approach. It launches a single process, which in turn can create as many threads as desired. It follows the "fork and join" model, i.e. depending on the task it launches the number of threads directed by the user.
Threading
 A thread is a single stream of control in the flow of a program.
 Static Threads
 All work is allocated and assigned at runtime
 Dynamic Threads
 Consists of one Master and a pool of threads
 The pool is assigned some of the work at runtime, but not all of it
 When a thread from the pool becomes idle, the Master gives it a new
assignment
 “Round-robin assignments”
Parallel Programming Model
 OpenMP uses the fork-join model of parallel execution.
 All OpenMP programs begin with a single master thread.
 The master thread executes sequentially until a parallel region is
encountered, when it creates a team of parallel threads (FORK).
 When the team threads complete the parallel region, they synchronize and
terminate, leaving only the master thread that executes sequentially
(JOIN).
Variables
 2 types of Variables
 Private
 Shared
 Private Variables
 Variables in a thread's private space can only be accessed by that thread.
 A private variable has a different address in the execution context of every thread.
 Clause: private(variable list)
 Shared Variables
 Variables in the global data space are accessed by all parallel threads.
 A shared variable has the same address in the execution context of every thread. All threads have access to shared variables.
Variables
 A thread can access its own private variables, but cannot access the
private variable of another thread.
 In the parallel for pragma, variables are shared by default, except the loop index variable, which is private.
/* privIndx must also be listed as private: each thread needs its own
   copy of the inner loop index */
#pragma omp parallel for private(privIndx, privDbl)
for ( i = 0; i < arraySize; i++ ) {
    for ( privIndx = 0; privIndx < 16; privIndx++ ) {
        privDbl = ( (double) privIndx ) / 16;
        y[i] = sin( exp( cos( - exp( sin(x[i]) ) ) ) ) + cos( privDbl );
    }
}
OpenMP Functions
 omp_get_num_procs()
 Returns the number of CPUs in the multiprocessor on which this thread is executing.
 The integer returned by this function may be less than the total number of
physical processors in the multiprocessor, depending on how the run-time
system gives processes access to processors.
 e.g. int t= omp_get_num_procs();
 omp_get_num_threads()
 Returns the number of threads active in the current parallel region
 t=omp_get_num_threads();
OpenMP Functions Contd.
 omp_set_num_threads()
 Allows setting the number of threads that execute the parallel sections of code
 Typically the number of threads is set equal to the number of available CPUs
 e.g. omp_set_num_threads(t);
 omp_get_thread_num()
 Returns the thread identification number, from 0 to n-1, where n is the number of active threads.
 tid = omp_get_thread_num();
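 A small sketch (assumed, not from the slides) combining these calls: request one thread per available CPU, then report the team size from thread 0.
#include <omp.h>
#include <stdio.h>
int main( void )
{
    int procs = omp_get_num_procs(); /* CPUs the run-time system offers us   */
    omp_set_num_threads( procs );    /* ask for one thread per available CPU */
    #pragma omp parallel
    {
        /* inside the parallel region every thread has its own id */
        if ( omp_get_thread_num() == 0 )
            printf( "team size: %d (of %d CPUs)\n", omp_get_num_threads(), procs );
    }
    return 0;
}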
OpenMP compiler directives (Pragma)
 A compiler directive in C or C++ is called a pragma.
 Format:
 #pragma omp directive-name [clause,..]
1. #pragma omp parallel
 Block of code should be executed by all of the threads (code block is
replicated among the threads)
 use curly braces {} to create a block of code from a statement group.
OpenMP compiler directives (Pragma)
2. #pragma omp parallel for
 Indicates to the compiler that the iterations of a for loop may be executed in parallel.
 e.g.
#pragma omp parallel for
for (i = first; i < size; i += prime)
marked[i] = 1;
Compiling and running OpenMP
Compiling: $ gcc -o hello_omp hello_omp.c -fopenmp
Running: $ ./hello_omp
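 The hello_omp.c referenced above is not shown in the slides; a minimal sketch of what it might look like, mirroring the MPI Hello World earlier:
/* hello_omp.c */
#include <omp.h>
#include <stdio.h>
int main( void )
{
    /* the block is replicated among the threads of the team */
    #pragma omp parallel
    {
        printf( "Hello World from thread %d of %d\n",
                omp_get_thread_num(), omp_get_num_threads() );
    }
    return 0;
}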
Combining MPI and OpenMP
 In many cases hybrid programs using both MPI and OpenMP execute faster than programs using only MPI.
 Sometimes hybrid programs execute faster because they have lower
communication overhead.
 Suppose we are executing our program on a cluster of m multiprocessors,
where each multiprocessor has k CPUs. In order to utilize every CPU, a
program relying on MPI must create mk processes. During
communication steps, mk processes are active.
 On the other hand, a hybrid program need only create m processes. In
parallel sections of code, the workload is divided among k threads on
each multiprocessor. Hence every CPU is utilized.
 However, during communication steps, only m processes are active. This may well give the hybrid program lower communication overhead than a "pure" MPI program, resulting in higher speedup.
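 A minimal sketch of the hybrid structure described above (one MPI process per multiprocessor, OpenMP threads inside each process); the summation is only an illustrative stand-in for real work.
#include "mpi.h"
#include <omp.h>
#include <stdio.h>
int main( int argc, char *argv[] )
{
    int rank, size;
    double local_sum = 0.0, global_sum;
    MPI_Init( &argc, &argv );  /* one MPI process per multiprocessor (m total) */
    MPI_Comm_rank( MPI_COMM_WORLD, &rank );
    MPI_Comm_size( MPI_COMM_WORLD, &size );
    /* inside each process, k OpenMP threads share the node's portion of work */
    #pragma omp parallel for reduction(+:local_sum)
    for ( int i = rank; i < 1000000; i += size )
        local_sum += 1.0 / ( i + 1 );
    /* only the m processes take part in communication (MPI calls stay
       outside the parallel region, so plain MPI_Init suffices here) */
    MPI_Reduce( &local_sum, &global_sum, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD );
    if ( rank == 0 ) printf( "sum = %f\n", global_sum );
    MPI_Finalize();
    return 0;
}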
Shared Memory Programming
 The underlying hardware is assumed to be a collection of processors, each with access to the same shared memory.
 Because they have access to the same memory locations, processors can interact and synchronize with each other through shared variables.
 The standard view of parallelism in a shared memory program is
fork/join parallelism.
 When the program begins execution, only a single thread, called the
master thread, is active.
 The master thread executes the sequential portions of the algorithm. At those points where parallel operations are required, the master thread forks (creates or awakens) additional threads.
 The master thread and the created threads work concurrently through the parallel section. At the end of the parallel code the created threads die or are suspended, and the flow of control returns to the single master thread. This is called a join.
Shared Memory Programming
 The shared-memory model is characterized by fork/join parallelism, in which parallelism comes and goes.
 At the beginning of execution only a single thread, called the master thread, is active.
 The master thread executes the serial portions of the program. It forks additional threads to help it execute parallel portions of the program.
 These threads are deactivated when serial execution resumes.
Shared Memory Programming
 A key difference, then, between the shared-memory model and the
message passing model is that in the message-passing model all
processes typically remain active throughout the execution of the
program, whereas in the shared-memory model the number of
active threads is one at the program's start and finish and may
change dynamically throughout the execution of the program.
 Parallel shared-memory programs range from those with only a single fork/join around a single loop to those in which most of the code segments are executed in parallel. Hence the shared-memory model supports incremental parallelization, the process of transforming a sequential program into a parallel program one block of code at a time.
Pros and Cons of OpenMP
 Pros
 Considered by some to be easier to program and debug (compared to
MPI)
 Data layout and decomposition is handled automatically by directives.
 Allows incremental parallelism: directives can be added incrementally,
so the program can be parallelized one portion after another and thus
no dramatic change to code is needed.
 Unified code for both serial and parallel applications: OpenMP
constructs are treated as comments when sequential compilers are
used.
 Original (serial) code statements need not, in general, be modified when
parallelized with OpenMP. This reduces the chance of inadvertently
introducing bugs and helps maintenance as well.
 Both coarse-grained and fine-grained parallelism are possible
Pros and Cons of OpenMP
 Cons
 Currently only runs efficiently in shared-memory multiprocessor
platforms
 Requires a compiler that supports OpenMP.
 Scalability is limited by memory architecture.
 Reliable error handling is missing.
 Lacks fine-grained mechanisms to control thread-processor
mapping.
 Synchronization between subsets of threads is not allowed.
 Mostly used for loop parallelization
 Can be difficult to debug, due to implicit communication between
threads via shared variables.