Copyright © The McGraw-Hill Companies, Inc. Permission required for reproduction or display.
Chapter 8
Matrix-vector Multiplication
Chapter Objectives
• Review matrix-vector multiplication
• Propose replication of vectors
• Develop three parallel programs, each based on a
different data decomposition
Outline
• Sequential algorithm and its complexity
• Design, analysis, and implementation of three parallel
programs
• Rowwise block striped
• Columnwise block striped
• Checkerboard block
Sequential Algorithm
(Figure: the algorithm steps through c = A × b one row at a time.)

   ⎡ 2 1 0 4 ⎤   ⎡ 1 ⎤   ⎡  9 ⎤
   ⎢ 3 2 1 1 ⎥ × ⎢ 3 ⎥ = ⎢ 14 ⎥
   ⎢ 4 3 1 2 ⎥   ⎢ 4 ⎥   ⎢ 19 ⎥
   ⎣ 3 0 2 0 ⎦   ⎣ 1 ⎦   ⎣ 11 ⎦

Each element of c is the inner product of one row of A with b:
• c0 = 2·1 + 1·3 + 0·4 + 4·1 = 9
• c1 = 3·1 + 2·3 + 1·4 + 1·1 = 14
• c2 = 4·1 + 3·3 + 1·4 + 2·1 = 19
• c3 = 3·1 + 0·3 + 2·4 + 0·1 = 11
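In C, the sequential algorithm is a doubly nested loop (a sketch; array and variable names are illustrative, with A stored in row-major order):

/* Sequential matrix-vector multiplication: c = A x b, where A is
   n x n. Complexity is Theta(n^2): n inner products of length n. */
void matrix_vector_product (int n, double *a, double *b, double *c)
{
   int i, j;
   for (i = 0; i < n; i++) {
      c[i] = 0.0;
      for (j = 0; j < n; j++)
         c[i] += a[i*n + j] * b[j];   /* inner product of row i with b */
   }
}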
Matrix Decomposition
(Figure: the same 4 × 4 matrix partitioned three ways among four processes: rowwise block striped, columnwise block striped, and checkerboard block.)

   2 1 0 4
   3 2 1 1
   4 3 1 2
   3 0 2 0
Storing Vectors
• Divide vector elements among processes
• Replicate vector elements
• Vector replication acceptable because vectors have
only n elements, versus n² elements in matrices
Rowwise Block Striped Matrix
• Partitioning through domain decomposition
• Primitive task associated with
• Row of matrix
• Entire vector
Phases of Parallel Algorithm
(Figure: each primitive task begins with row i of A and the entire vector b.)
• Inner product computation: task i computes ci = (row i of A) · b
• All-gather communication: every task obtains the complete result vector c
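In code, the two phases might look like this (a sketch rather than the book's program: a is this process's block of rows stored row-major, b the replicated vector, and cnt/disp the block-decomposition arrays used with MPI_Allgatherv, described next):

/* Phase 1: inner product for each locally held row */
for (i = 0; i < local_rows; i++) {
   double sum = 0.0;
   for (j = 0; j < n; j++)
      sum += a[i*n + j] * b[j];
   c_block[i] = sum;
}
/* Phase 2: all-gather, so every process ends up with all of c */
MPI_Allgatherv (c_block, local_rows, MPI_DOUBLE,
   c, cnt, disp, MPI_DOUBLE, MPI_COMM_WORLD);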
Agglomeration and Mapping
• Static number of tasks
• Regular communication pattern (all-gather)
• Computation time per task is constant
• Strategy:
• Agglomerate groups of rows
• Create one task per MPI process
Block-to-replicated Transformation
(Figure: a vector divided into blocks on processes 0, 1, and 2 is transformed so that every process holds a complete copy.)
MPI_Allgatherv
(Figure: before the call, each of processes 0–3 holds a different-sized block; after it, every process holds the concatenation of all four blocks.)
MPI_Allgatherv
int MPI_Allgatherv (
   void *send_buffer,           /* In - This process's block */
   int send_cnt,                /* In - Elements in this block */
   MPI_Datatype send_type,      /* In - Type of sent elements */
   void *receive_buffer,        /* Out - Concatenation of all blocks */
   int *receive_cnt,            /* In - Elements contributed by each process */
   int *receive_disp,           /* In - Offset of each process's block */
   MPI_Datatype receive_type,   /* In - Type of received elements */
   MPI_Comm communicator)       /* In - Communicator */
MPI_Allgatherv in Action
(Figure: three processes concatenate their pieces of the string "concatenate".)
• Process 0: send_buffer = "con", send_cnt = 3
• Process 1: send_buffer = "cate", send_cnt = 4
• Process 2: send_buffer = "nate", send_cnt = 4
• On every process: receive_cnt = {3, 4, 4}, receive_disp = {0, 3, 7}
• After the call, each receive_buffer holds "concatenate"
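The figure's example as a runnable sketch (buffer names follow the slide; run with exactly three processes):

#include <mpi.h>
#include <stdio.h>

int main (int argc, char *argv[])
{
   char *pieces[] = {"con", "cate", "nate"};  /* one piece per process */
   int receive_cnt[]  = {3, 4, 4};            /* chars from each process */
   int receive_disp[] = {0, 3, 7};            /* where each block lands */
   char receive_buffer[12];                   /* 11 chars + terminator */
   int id;

   MPI_Init (&argc, &argv);
   MPI_Comm_rank (MPI_COMM_WORLD, &id);
   MPI_Allgatherv (pieces[id], receive_cnt[id], MPI_CHAR,
      receive_buffer, receive_cnt, receive_disp, MPI_CHAR,
      MPI_COMM_WORLD);
   receive_buffer[11] = '\0';
   if (!id) printf ("%s\n", receive_buffer);  /* prints "concatenate" */
   MPI_Finalize ();
   return 0;
}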
Function replicate_block_vector
• Create space for entire vector
• Create “mixed transfer” arrays
• Call MPI_Allgatherv
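A possible realization (a hedged sketch: the book's actual function is type-generic and uses helper routines for the block sizes; stdlib.h is assumed for malloc):

void replicate_block_vector (double *block, int n,
   double *vector, MPI_Comm comm)
{
   int *cnt, *disp;     /* the "mixed transfer" arrays */
   int id, p, i;

   MPI_Comm_size (comm, &p);
   MPI_Comm_rank (comm, &id);
   cnt  = (int *) malloc (p * sizeof(int));
   disp = (int *) malloc (p * sizeof(int));
   for (i = 0; i < p; i++) {
      cnt[i]  = (i+1)*n/p - i*n/p;   /* size of block i */
      disp[i] = i*n/p;               /* where block i begins */
   }
   MPI_Allgatherv (block, cnt[id], MPI_DOUBLE, vector,
      cnt, disp, MPI_DOUBLE, comm);
   free (cnt);
   free (disp);
}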
Function read_replicated_vector
• Process p-1
• Opens file
• Reads vector length
• Broadcast vector length (root process = p-1)
• Allocate space for vector
• Process p-1 reads vector, closes file
• Broadcast vector
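Following those steps, a sketch (assumed file format: an int length followed by that many doubles; error handling omitted and the usual includes assumed):

double *read_replicated_vector (char *filename, int *n,
   MPI_Comm comm)
{
   FILE *f = NULL;
   double *v;
   int id, p;

   MPI_Comm_size (comm, &p);
   MPI_Comm_rank (comm, &id);
   if (id == p-1) {                  /* process p-1 opens the file */
      f = fopen (filename, "r");
      fread (n, sizeof(int), 1, f);  /* ...and reads the length */
   }
   MPI_Bcast (n, 1, MPI_INT, p-1, comm);   /* root = p-1 */
   v = (double *) malloc (*n * sizeof(double));
   if (id == p-1) {                  /* p-1 reads vector, closes file */
      fread (v, sizeof(double), *n, f);
      fclose (f);
   }
   MPI_Bcast (v, *n, MPI_DOUBLE, p-1, comm);
   return v;
}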
Function print_replicated_vector
• Process 0 prints vector
• Exact call to printf depends on value of parameter
datatype
Run-time Expression
• χ: inner product loop iteration time
• Computational time: χ n ⎡n/p⎤
• All-gather requires ⎡log p⎤ messages with latency λ
• Total vector elements transmitted: n(2^⎡log p⎤ − 1) / 2^⎡log p⎤
• Each element occupies 8 bytes
• Total execution time:
χ n ⎡n/p⎤ + λ ⎡log p⎤ + 8n(2^⎡log p⎤ − 1) / (β 2^⎡log p⎤)
Benchmarking Results

 p   Predicted (msec)   Actual (msec)   Speedup   Mflops
 1        63.4              63.4          1.00      31.6
 2        32.4              32.7          1.94      61.2
 3        22.3              22.7          2.79      88.1
 4        17.0              17.8          3.56     112.4
 5        14.1              15.2          4.16     131.6
 6        12.0              13.3          4.76     150.4
 7        10.5              12.2          5.19     163.9
 8         9.4              11.1          5.70     180.2
16         5.7               7.2          8.79     277.8
Columnwise Block Striped Matrix
• Partitioning through domain decomposition
• Task associated with
• Column of matrix
• Vector element
Matrix-Vector Multiplication
(Figure: the terms below are grouped by the processor that initially computes them; in the columnwise decomposition, processor j holds column j of A and element bj, so it computes the terms ai,j bj of every equation.)
c0 = a0,0 b0 + a0,1 b1 + a0,2 b2 + a0,3 b3 + a0,4 b4
c1 = a1,0 b0 + a1,1 b1 + a1,2 b2 + a1,3 b3 + a1,4 b4
c2 = a2,0 b0 + a2,1 b1 + a2,2 b2 + a2,3 b3 + a2,4 b4
c3 = a3,0 b0 + a3,1 b1 + a3,2 b2 + a3,3 b3 + a3,4 b4
c4 = a4,0 b0 + a4,1 b1 + a4,2 b2 + a4,3 b3 + a4,4 b4
All-to-all Exchange (before and after)
(Figure: before the exchange, each process P0–P4 holds one partial contribution to every element of c; after it, process i holds all five partial contributions to its own elements, ready to be summed.)
Phases of Parallel Algorithm
(Figure: each primitive task begins with column i of A and element bi.)
• Multiplications: task i computes its partial products, yielding a vector of partial sums ~c
• All-to-all exchange: partial sums are routed to the tasks responsible for the corresponding elements of c
• Reduction: each task sums the partial results it received, producing its block of c
Agglomeration and Mapping
• Static number of tasks
• Regular communication pattern (all-to-all)
• Computation time per task is constant
• Strategy:
• Agglomerate groups of columns
• Create one task per MPI process
Reading a Block-Column Matrix
(Figure: the matrix is read from the file one row at a time, and each row is scattered among the processes by column blocks.)
MPI_Scatterv
(Figure: before the call, the root holds one block per process; after it, each of processes 0–3 holds its own, possibly different-sized, block.)
Header for MPI_Scatterv
int MPI_Scatterv (
   void *send_buffer,           /* In - Data to distribute (significant at root) */
   int *send_cnt,               /* In - Elements destined for each process */
   int *send_disp,              /* In - Offset of each process's block */
   MPI_Datatype send_type,      /* In - Type of sent elements */
   void *receive_buffer,        /* Out - This process's block */
   int receive_cnt,             /* In - Elements this process receives */
   MPI_Datatype receive_type,   /* In - Type of received elements */
   int root,                    /* In - Rank of the sending process */
   MPI_Comm communicator)       /* In - Communicator */
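For example, scattering one matrix row among the processes by column blocks might look like this (a sketch; send_cnt and send_disp hold each process's block width and starting column, and id is this process's rank):

/* Root process 0 holds the full row in row_buffer; afterward each
   process holds its send_cnt[id] columns in my_block */
MPI_Scatterv (row_buffer, send_cnt, send_disp, MPI_DOUBLE,
   my_block, send_cnt[id], MPI_DOUBLE, 0, MPI_COMM_WORLD);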
Printing a Block-Column Matrix
• Data motion is the reverse of that performed when reading the matrix
• Replace “scatter” with “gather”
• Use “v” variant because different processes contribute
different numbers of elements
Function MPI_Gatherv
(Figure: the reverse of Scatterv: each of processes 0–3 contributes its block, and after the call the root holds the concatenation.)
Header for MPI_Gatherv
int MPI_Gatherv (
   void *send_buffer,           /* In - This process's block */
   int send_cnt,                /* In - Elements in this block */
   MPI_Datatype send_type,      /* In - Type of sent elements */
   void *receive_buffer,        /* Out - Concatenated blocks (significant at root) */
   int *receive_cnt,            /* In - Elements contributed by each process */
   int *receive_disp,           /* In - Offset of each process's block */
   MPI_Datatype receive_type,   /* In - Type of received elements */
   int root,                    /* In - Rank of the receiving process */
   MPI_Comm communicator)       /* In - Communicator */
Function MPI_Alltoallv
(Figure: every process sends a different-sized block to every other process; the block process i sends to process j becomes process j's i-th received block.)
Header for MPI_Alltoallv
int MPI_Alltoallv (
   void *send_buffer,           /* In - Blocks destined for each process */
   int *send_cnt,               /* In - Elements to send to each process */
   int *send_disp,              /* In - Offset of each outgoing block */
   MPI_Datatype send_type,      /* In - Type of sent elements */
   void *receive_buffer,        /* Out - Blocks received from each process */
   int *receive_cnt,            /* In - Elements to receive from each process */
   int *receive_disp,           /* In - Offset of each incoming block */
   MPI_Datatype receive_type,   /* In - Type of received elements */
   MPI_Comm communicator)       /* In - Communicator */
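In the columnwise program, the exchange of partial sums might look like this (a sketch; partial holds this process's partial sums for every block of c, received the incoming blocks, and the count/displacement arrays describe the block decomposition):

/* After the call, this process holds every process's partial
   contributions to its own block of c, ready to be summed */
MPI_Alltoallv (partial, send_cnt, send_disp, MPI_DOUBLE,
   received, recv_cnt, recv_disp, MPI_DOUBLE, MPI_COMM_WORLD);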
Run-time Expression
• χ: inner product loop iteration time
• Computational time: χ n ⎡n/p⎤
• The all-to-all exchange requires p−1 messages, each of length about n/p
• 8 bytes per element
• Total execution time: χ n ⎡n/p⎤ + (p−1)(λ + (8n/p)/β)
Benchmarking Results

 p   Predicted (msec)   Actual (msec)   Speedup   Mflops
 1        63.4              63.8          1.00      31.4
 2        32.4              32.9          1.92      60.8
 3        22.2              22.6          2.80      88.5
 4        17.2              17.5          3.62     114.3
 5        14.3              14.5          4.37     137.9
 6        12.5              12.6          5.02     158.7
 7        11.3              11.2          5.65     178.6
 8        10.4              10.0          6.33     200.0
16         8.5               7.6          8.33     263.2
Checkerboard Block
Decomposition
• Associate primitive task with each element of the matrix A
• Each primitive task performs one multiply
• Agglomerate primitive tasks into rectangular blocks
• Processes form a 2-D grid
• Vector b distributed by blocks among processes in first column of grid
Tasks after Agglomeration
(Figure: A is divided into a 2-D grid of blocks, one per process; b is distributed by blocks among the processes in the first column of the grid.)
Algorithm’s Phases
(Figure: redistribute b so each process holds the block matching its columns of A; each process multiplies its block of A by its block of b; the partial vectors are reduced across each grid row; the result c ends up distributed among the processes of a single grid column.)
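The middle two phases in code (a sketch: a is this process's local_rows × local_cols block of A in row-major order, b_block the matching block of b, and row_comm the row communicator created later in the chapter, with rank 0 assumed to be the first-column process):

/* Block matrix-vector multiply */
for (i = 0; i < local_rows; i++) {
   partial[i] = 0.0;
   for (j = 0; j < local_cols; j++)
      partial[i] += a[i*local_cols + j] * b_block[j];
}
/* Sum partial vectors across the grid row; the first-column
   process of each row receives that row's block of c */
MPI_Reduce (partial, c_block, local_rows, MPI_DOUBLE, MPI_SUM,
   0, row_comm);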
Redistributing Vector b
• Step 1: Move b from the processes in the first column to the processes in the first row
• If p square
» First-column processes send their blocks of b to the first-row processes
• If p not square
» Gather b on process (0, 0)
» Process (0, 0) scatters b among the first-row processes
• Step 2: First-row processes broadcast their blocks of b within their columns (see the sketch after this list)
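A sketch of the square-p case (assumptions: grid_comm is the Cartesian communicator and col_comm the column communicator, both created later in this chapter; grid_coords holds this process's (row, column) coordinates; rank 0 of col_comm is the first-row process; b_block and block_len are illustrative names):

if (grid_coords[1] == 0 && grid_coords[0] != 0) {        /* first column */
   int dest, dest_coords[2] = {0, grid_coords[0]};
   MPI_Cart_rank (grid_comm, dest_coords, &dest);
   MPI_Send (b_block, block_len, MPI_DOUBLE, dest, 0, grid_comm);
} else if (grid_coords[0] == 0 && grid_coords[1] != 0) { /* first row */
   int src, src_coords[2] = {grid_coords[1], 0};
   MPI_Cart_rank (grid_comm, src_coords, &src);
   MPI_Recv (b_block, block_len, MPI_DOUBLE, src, 0, grid_comm,
      MPI_STATUS_IGNORE);
}
/* Step 2: each first-row process broadcasts its block down its column */
MPI_Bcast (b_block, block_len, MPI_DOUBLE, 0, col_comm);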
Redistributing Vector b
(Figure, panel (a): when p is a square number, first-column processes send/recv their blocks of b to the first-row processes, which then broadcast the blocks within their columns.)
(Figure, panel (b): when p is not a square number, b is gathered onto process (0, 0) and scattered among the first-row processes, which then broadcast the blocks within their columns.)
Creating Communicators
• Want processes in a virtual 2-D grid
• Create a custom communicator to do this
• Collective communications involve all processes in a
communicator
• We need to do broadcasts, reductions among subsets
of processes
• We will create communicators for processes in same
row or same column
What’s in a Communicator?
• Process group
• Context
• Attributes
• Topology (lets us address processes another way)
• Others we won’t consider
Creating 2-D Virtual Grid of
Processes
• MPI_Dims_create
• Input parameters
» Total number of processes in desired grid
» Number of grid dimensions
• Returns number of processes in each dim
• MPI_Cart_create
• Creates communicator with Cartesian topology
MPI_Dims_create
int MPI_Dims_create (
   int nodes,    /* Input - Procs in grid */
   int dims,     /* Input - Number of dims */
   int *size)    /* Input/Output - Size of each grid dimension */
MPI_Cart_create
int MPI_Cart_create (
MPI_Comm old_comm, /* Input - old communicator */
int dims, /* Input - grid dimensions */
int *size, /* Input - # procs in each dim */
int *periodic,
/* Input - periodic[j] is 1 if dimension j
wraps around; 0 otherwise */
int reorder,
/* 1 if process ranks can be reordered */
MPI_Comm *cart_comm)
/* Output - new communicator */
Using MPI_Dims_create and MPI_Cart_create
MPI_Comm cart_comm;
int p;
int periodic[2];
int size[2];
...
size[0] = size[1] = 0;
MPI_Dims_create (p, 2, size);
periodic[0] = periodic[1] = 0;
MPI_Cart_create (MPI_COMM_WORLD, 2, size,
   periodic, 1, &cart_comm);
Useful Grid-related Functions
• MPI_Cart_rank
• Given coordinates of process in Cartesian communicator,
returns process rank
• MPI_Cart_coords
• Given rank of process in Cartesian communicator, returns
process’ coordinates
Header for MPI_Cart_rank
int MPI_Cart_rank (
   MPI_Comm comm,   /* In - Communicator */
   int *coords,     /* In - Array containing process’ grid location */
   int *rank)       /* Out - Rank of process at specified coords */
Header for MPI_Cart_coords
int MPI_Cart_coords (
   MPI_Comm comm,   /* In - Communicator */
   int rank,        /* In - Rank of process */
   int dims,        /* In - Dimensions in virtual grid */
   int *coords)     /* Out - Coordinates of specified process in virtual grid */
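For example, a process can find its own grid position, or the rank at a given position (a sketch; grid_comm is the Cartesian communicator from MPI_Cart_create):

int grid_id, coords[2];
MPI_Comm_rank (grid_comm, &grid_id);
MPI_Cart_coords (grid_comm, grid_id, 2, coords);  /* my (row, col) */

int first_in_row, first_coords[2];
first_coords[0] = coords[0];                      /* same row... */
first_coords[1] = 0;                              /* ...column 0 */
MPI_Cart_rank (grid_comm, first_coords, &first_in_row);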
MPI_Comm_split
• Partitions the processes of a communicator into one or more subgroups
• Constructs a communicator for each subgroup
• Allows processes in each subgroup to perform their own collective
communications
• Needed for columnwise scatter and rowwise reduce
Header for MPI_Comm_split
int MPI_Comm_split (
   MPI_Comm old_comm,    /* In - Existing communicator */
   int partition,        /* In - Partition number */
   int new_rank,         /* In - Ranking order of processes in new communicator */
   MPI_Comm *new_comm)   /* Out - New communicator shared by processes in same partition */
Example: Create Communicators for Process Rows
MPI_Comm grid_comm;    /* 2-D process grid */
int grid_coords[2];    /* Location of process in grid */
MPI_Comm row_comm;     /* Processes in same row */

MPI_Comm_split (grid_comm, grid_coords[0],
   grid_coords[1], &row_comm);
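Creating the column communicators is symmetric: the column coordinate becomes the partition number and the row coordinate the rank (this is the col_comm assumed in the earlier sketches):

MPI_Comm col_comm;     /* Processes in same column */
MPI_Comm_split (grid_comm, grid_coords[1],
   grid_coords[0], &col_comm);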
Run-time Expression
• Computational time: χ ⎡n/√p⎤ ⎡n/√p⎤
• Suppose p is a square number
• Redistribute b
• Send/recv: λ + 8 ⎡n/√p⎤ / β
• Broadcast: log √p (λ + 8 ⎡n/√p⎤ / β)
• Reduce partial results: log √p (λ + 8 ⎡n/√p⎤ / β)
Benchmarking

 Procs   Predicted (msec)   Actual (msec)   Speedup   Megaflops
   1          63.4              63.4          1.00       31.6
   4          17.8              17.4          3.64      114.9
   9           9.7               9.7          6.53      206.2
  16           6.2               6.2         10.21      322.6
Comparison of Three Algorithms
(Figure: speedup versus number of processors, up to 16, for the rowwise block striped, columnwise block striped, and checkerboard block algorithms; consistent with the tables above, the checkerboard decomposition attains the highest speedup.)
Summary (1/3)
• Matrix decomposition ⇒ communications needed
• Rowwise block striped: all-gather
• Columnwise block striped: all-to-all exchange
• Checkerboard block: gather, scatter, broadcast, reduce
• All three algorithms: roughly same number of messages
• Elements transmitted per process varies
• First two algorithms: Θ(n) elements per process
• Checkerboard algorithm: Θ(n/√p) elements
• Checkerboard block algorithm has better scalability
Summary (2/3)
• Communicators with Cartesian topology
• Creation
• Identifying processes by rank or coords
• Subdividing communicators
• Allows collective operations among subsets of processes
Summary (3/3)
• Parallel programs and supporting functions much
longer than their sequential C counterparts
• Extra code devoted to reading, distributing, printing
matrices and vectors
• Developing and debugging these functions is tedious
and difficult
• Makes sense to generalize functions and put them in
libraries for reuse