1. Dense Matrix Algorithms
Ananth Grama, Anshul Gupta,
George Karypis, and Vipin Kumar
To accompany the text “Introduction to Parallel Computing”,
Addison Wesley, 2003.
3. Matrix Algorithms: Introduction
• Due to their regular structure, parallel computations
involving matrices and vectors readily lend themselves to
data-decomposition.
• Typical algorithms rely on input, output, or intermediate
data decomposition.
• Most algorithms use one- and two-dimensional block,
cyclic, and block-cyclic partitionings.
4. Matrix-Vector Multiplication
• We aim to multiply a dense n x n matrix A with an n x 1
vector x to yield the n x 1 result vector y.
• The serial algorithm requires n² multiplications and additions.
5. Matrix-Vector Multiplication:
Rowwise 1-D Partitioning
• The n x n matrix is partitioned among n processors, with each process storing one complete row of the matrix.
• The n x 1 vector x is distributed such that each process
owns one of its elements.
6. Matrix-Vector Multiplication:
Rowwise 1-D Partitioning
Multiplication of an n x n matrix with an n x 1 vector using
rowwise block 1-D partitioning. For the one-row-per-process
case, p = n.
7. Matrix-Vector Multiplication:
Rowwise 1-D Partitioning
• Since each process starts with only one element of x, an all-to-all broadcast is required to distribute all the elements to all the processes.
• Process Pi now computes y[i] = Σj A[i, j] · x[j] (summing over 0 ≤ j < n).
• The all-to-all broadcast and the computation of y[i] both take time Θ(n). Therefore, the parallel time is Θ(n).
8. Matrix-Vector Multiplication:
Rowwise 1-D Partitioning
• Consider now the case when p < n and we use block 1-D partitioning.
• Each process initially stores n/p complete rows of the matrix and a portion of the vector of size n/p.
• The all-to-all broadcast takes place among p processes and involves messages of size n/p.
• This is followed by n/p local dot products, each of length n.
• Thus, the parallel run time of this procedure is TP = n²/p + ts log p + tw n. This is cost-optimal.
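A minimal C/MPI sketch of this procedure follows; it is an illustration under our own assumptions (p divides n, row-major storage), not code from the text, and names such as matvec_1d and A_local are hypothetical. MPI_Allgather plays the role of the all-to-all broadcast.

#include <mpi.h>
#include <stdlib.h>

/* y_local = A_local * x, where A_local holds this process's n/p rows of A
   and x_local holds its n/p elements of x. */
void matvec_1d(int n, const double *A_local, const double *x_local,
               double *y_local, MPI_Comm comm)
{
    int p;
    MPI_Comm_size(comm, &p);
    int nloc = n / p;                        /* rows (and x elements) per process */
    double *x = malloc(n * sizeof(double));

    /* All-to-all broadcast: every process assembles the full vector x. */
    MPI_Allgather(x_local, nloc, MPI_DOUBLE, x, nloc, MPI_DOUBLE, comm);

    /* n/p local dot products, each of length n. */
    for (int i = 0; i < nloc; i++) {
        double sum = 0.0;
        for (int j = 0; j < n; j++)
            sum += A_local[i * n + j] * x[j];
        y_local[i] = sum;
    }
    free(x);
}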
9. Matrix-Vector Multiplication:
Rowwise 1-D Partitioning
Scalability Analysis:
• We know that T0 = pTP - W; therefore, we have T0 = ts p log p + tw n p.
• For isoefficiency, we have W = KT0, where K = E/(1 - E) for desired efficiency E.
• From this, we have W = O(p²) (from the tw term).
• There is also a bound on isoefficiency because of concurrency. In this case, p < n; therefore, W = n² = Ω(p²).
• Overall isoefficiency is W = O(p²).
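Spelled out, the balance behind the tw term (a worked sketch; the ts term yields only Θ(p log p) and is dominated):

W = n² = K tw n p   (equating W with the tw term of KT0)
⇒ n = K tw p
⇒ W = n² = K² tw² p² = Θ(p²).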
10. Matrix-Vector Multiplication:
2-D Partitioning
• The n x n matrix is partitioned among n² processors such that each processor owns a single element.
• The n x 1 vector x is distributed only in the last column of
n processors.
11. Matrix-Vector Multiplication: 2-D Partitioning
Matrix-vector multiplication with block 2-D partitioning. For the one-element-per-process case, p = n² if the matrix size is n x n.
12. Matrix-Vector Multiplication:
2-D Partitioning
• We must first align the vector with the matrix
appropriately.
• The first communication step for the 2-D partitioning
aligns the vector x along the principal diagonal of the
matrix.
• The second step copies the vector elements from each diagonal process to all the processes in the corresponding column, using n simultaneous one-to-all broadcasts (one in each column).
• Finally, the result vector is computed by performing an all-to-one reduction along the rows.
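A hedged C/MPI sketch of these steps, under our own assumptions (p a perfect square, √p divides n) and with hypothetical names (matvec_2d, rowcomm, colcomm); the split communicators stand in for the grid's rows and columns.

#include <mpi.h>
#include <math.h>
#include <stdlib.h>

/* A_loc: this process's (n/sqrt(p)) x (n/sqrt(p)) block, row-major.
   x_loc: n/sqrt(p) entries; valid input only in the last process column
   on entry, scratch elsewhere.
   y_loc: receives n/sqrt(p) entries of the result in the last column. */
void matvec_2d(int n, const double *A_loc, double *x_loc,
               double *y_loc, MPI_Comm comm)
{
    int rank, p;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &p);
    int q = (int) round(sqrt((double) p));     /* sqrt(p) x sqrt(p) grid */
    int row = rank / q, col = rank % q, nloc = n / q;

    MPI_Comm rowcomm, colcomm;
    MPI_Comm_split(comm, row, col, &rowcomm);  /* ranks ordered by column */
    MPI_Comm_split(comm, col, row, &colcomm);  /* ranks ordered by row    */

    /* Step 1: align x with the diagonal -- the last column sends its
       portion to the diagonal process of the same row. */
    if (col == q - 1 && row != q - 1)
        MPI_Send(x_loc, nloc, MPI_DOUBLE, row, 0, rowcomm);
    if (col == row && row != q - 1)
        MPI_Recv(x_loc, nloc, MPI_DOUBLE, q - 1, 0, rowcomm, MPI_STATUS_IGNORE);

    /* Step 2: each diagonal process broadcasts its portion down its column. */
    MPI_Bcast(x_loc, nloc, MPI_DOUBLE, col, colcomm);

    /* Step 3: local submatrix-vector product. */
    double *partial = calloc(nloc, sizeof(double));
    for (int i = 0; i < nloc; i++)
        for (int j = 0; j < nloc; j++)
            partial[i] += A_loc[i * nloc + j] * x_loc[j];

    /* Step 4: all-to-one reduction along each row into the last column. */
    MPI_Reduce(partial, y_loc, nloc, MPI_DOUBLE, MPI_SUM, q - 1, rowcomm);

    free(partial);
    MPI_Comm_free(&rowcomm); MPI_Comm_free(&colcomm);
}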
13. Matrix-Vector Multiplication:
2-D Partitioning
• Three basic communication operations are used in this
algorithm: one-to-one communication to align the vector
along the main diagonal, one-to-all broadcast of each
vector element among the n processes of each column,
and all-to-one reduction in each row.
• Each of these operations takes Θ(log n) time and the parallel time is Θ(log n).
• The cost (process-time product) is Θ(n² log n); hence, the algorithm is not cost-optimal.
14. Matrix-Vector Multiplication:
2-D Partitioning
• When using fewer than n² processors, each process owns an (n/√p) x (n/√p) block of the matrix.
• The vector is distributed in portions of n/√p elements in the last process-column only.
• In this case, the message sizes for the alignment, broadcast, and reduction are all n/√p.
• The computation is a product of an (n/√p) x (n/√p) submatrix with a vector of length n/√p.
16. Matrix-Vector Multiplication:
2-D Partitioning
• Scalability Analysis:
• The parallel run time is TP = n²/p + ts log p + (tw n log p)/√p, so T0 = pTP - W = ts p log p + tw n √p log p.
• Equating T0 with W, term by term, for isoefficiency, we have W = K² tw² p log² p as the dominant term.
• The isoefficiency due to concurrency is O(p).
• The overall isoefficiency is O(p log² p) (due to the network bandwidth).
• For cost optimality, we require W = n² = Ω(p log² p). For this, we have p = O(n²/log² n).
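Spelled out, the bandwidth-term balance behind these claims (a worked sketch, using T0 = ts p log p + tw n √p log p from above):

W = n² = K tw n √p log p
⇒ n = K tw √p log p
⇒ W = K² tw² p log² p = Θ(p log² p).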
17. Matrix-Matrix Multiplication
• Consider the problem of multiplying two n x n dense, square
matrices A and B to yield the product matrix C =A x B.
• The serial complexity is O(n³).
• We do not consider better serial algorithms (e.g., Strassen's method), although these can be used as serial kernels in the parallel algorithms.
• A useful concept in this case is that of block operations. In this view, an n x n matrix A can be regarded as a q x q array of blocks Ai,j (0 ≤ i, j < q) such that each block is an (n/q) x (n/q) submatrix.
• In this view, we perform q³ matrix multiplications, each involving (n/q) x (n/q) matrices.
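A small serial C sketch of this block view (the name block_matmul and the assumption that q divides n are ours):

#include <string.h>

/* View an n x n row-major matrix as a q x q array of s x s blocks, s = n/q,
   and perform the q^3 block multiplications. */
void block_matmul(int n, int q, const double *A, const double *B, double *C)
{
    int s = n / q;
    memset(C, 0, (size_t) n * n * sizeof(double));
    for (int bi = 0; bi < q; bi++)           /* q^3 block products:       */
        for (int bj = 0; bj < q; bj++)       /* C(bi,bj) += A(bi,bk) *    */
            for (int bk = 0; bk < q; bk++)   /*             B(bk,bj)      */
                for (int i = bi * s; i < (bi + 1) * s; i++)
                    for (int j = bj * s; j < (bj + 1) * s; j++) {
                        double sum = 0.0;
                        for (int k = bk * s; k < (bk + 1) * s; k++)
                            sum += A[i * n + k] * B[k * n + j];
                        C[i * n + j] += sum;
                    }
}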
18. Matrix-Matrix Multiplication
• Consider two n x n matrices A and B partitioned into p blocks Ai,j and Bi,j (0 ≤ i, j < √p) of size (n/√p) x (n/√p) each.
• Process Pi,j initially stores Ai,j and Bi,j and computes block Ci,j of the result matrix.
• Computing submatrix Ci,j requires all submatrices Ai,k and Bk,j for 0 ≤ k < √p.
• All-to-all broadcast blocks of A along rows and B along
columns.
• Perform local submatrix multiplication.
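A hedged C/MPI sketch of this algorithm, assuming p is a perfect square and √p divides n; matmul_simple and the buffer names are hypothetical. The two MPI_Allgather calls realize the all-to-all broadcasts along process rows and columns.

#include <mpi.h>
#include <math.h>
#include <stdlib.h>
#include <string.h>

void matmul_simple(int n, const double *A_loc, const double *B_loc,
                   double *C_loc, MPI_Comm comm)
{
    int rank, p;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &p);
    int q = (int) round(sqrt((double) p));
    int row = rank / q, col = rank % q, s = n / q;
    size_t blk = (size_t) s * s;

    MPI_Comm rowcomm, colcomm;
    MPI_Comm_split(comm, row, col, &rowcomm);  /* my process row    */
    MPI_Comm_split(comm, col, row, &colcomm);  /* my process column */

    /* All-to-all broadcast: gather all A blocks of my row and all B blocks
       of my column. These factor-sqrt(p) buffers are exactly why the
       algorithm is not memory optimal. */
    double *Arow = malloc(q * blk * sizeof(double));
    double *Bcol = malloc(q * blk * sizeof(double));
    MPI_Allgather(A_loc, s * s, MPI_DOUBLE, Arow, s * s, MPI_DOUBLE, rowcomm);
    MPI_Allgather(B_loc, s * s, MPI_DOUBLE, Bcol, s * s, MPI_DOUBLE, colcomm);

    /* C(row,col) = sum over k of A(row,k) * B(k,col). */
    memset(C_loc, 0, blk * sizeof(double));
    for (int k = 0; k < q; k++) {
        const double *Ak = Arow + k * blk;   /* A(row,k) */
        const double *Bk = Bcol + k * blk;   /* B(k,col) */
        for (int i = 0; i < s; i++)
            for (int j = 0; j < s; j++) {
                double sum = 0.0;
                for (int t = 0; t < s; t++)
                    sum += Ak[i * s + t] * Bk[t * s + j];
                C_loc[i * s + j] += sum;
            }
    }
    free(Arow); free(Bcol);
    MPI_Comm_free(&rowcomm); MPI_Comm_free(&colcomm);
}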
19. Matrix-Matrix Multiplication
• The two broadcasts take time 2(ts log √p + tw (n²/p)(√p - 1)).
• The computation requires √p multiplications of (n/√p) x (n/√p) sized submatrices.
• The parallel run time is approximately TP = n³/p + ts log p + 2tw n²/√p.
• The algorithm is cost-optimal and the isoefficiency is O(p^1.5) due to the bandwidth term tw and concurrency.
• A major drawback of the algorithm is that it is not memory optimal: after the broadcasts each process stores √p blocks of A and √p blocks of B, so the total memory requirement is Θ(n²√p).
20. Matrix-Matrix Multiplication:
Cannon's Algorithm
• In this algorithm, we schedule the computations of the √p processes of the ith row such that, at any given time, each process is using a different block Ai,k.
• These blocks can be systematically rotated among the processes after every submatrix multiplication so that every process gets a fresh Ai,k after each rotation.
22. Matrix-Matrix Multiplication:
Cannon's Algorithm
• Align the blocks of A and B in such a way that each
process multiplies its local submatrices. This is done by
shifting all submatrices Ai,j to the left (with wraparound)
by i steps and all submatrices Bi,j up (with wraparound)
by j steps.
• Perform local block multiplication.
• Each block of A moves one step left and each block of B
moves one step up (again with wraparound).
• Perform next block multiplication, add to partial result,
repeat until all blocks have been multiplied.
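A hedged C/MPI sketch of these steps, assuming p is a perfect square and √p divides n; the name cannon and the helper block_mul_add are ours. MPI_Cart_create builds the wraparound grid, and MPI_Sendrecv_replace performs the shifts.

#include <mpi.h>
#include <math.h>
#include <string.h>

/* C_loc += A_loc * B_loc for s x s row-major blocks. */
static void block_mul_add(int s, const double *A, const double *B, double *C)
{
    for (int i = 0; i < s; i++)
        for (int j = 0; j < s; j++) {
            double sum = 0.0;
            for (int k = 0; k < s; k++)
                sum += A[i * s + k] * B[k * s + j];
            C[i * s + j] += sum;
        }
}

void cannon(int n, double *A_loc, double *B_loc, double *C_loc, MPI_Comm comm)
{
    int p, rank;
    MPI_Comm_size(comm, &p);
    int q = (int) round(sqrt((double) p));   /* sqrt(p) x sqrt(p) grid */
    int s = n / q, count = s * s;
    memset(C_loc, 0, (size_t) count * sizeof(double));

    /* Periodic 2-D grid so that all shifts wrap around. */
    int dims[2] = { q, q }, periods[2] = { 1, 1 }, coords[2];
    MPI_Comm grid;
    MPI_Cart_create(comm, 2, dims, periods, 0, &grid);
    MPI_Comm_rank(grid, &rank);
    MPI_Cart_coords(grid, rank, 2, coords);  /* coords[0] = i, coords[1] = j */

    int src, dst;
    /* Alignment: shift A(i,j) left by i and B(i,j) up by j, with wraparound. */
    MPI_Cart_shift(grid, 1, -coords[0], &src, &dst);
    MPI_Sendrecv_replace(A_loc, count, MPI_DOUBLE, dst, 0, src, 0, grid, MPI_STATUS_IGNORE);
    MPI_Cart_shift(grid, 0, -coords[1], &src, &dst);
    MPI_Sendrecv_replace(B_loc, count, MPI_DOUBLE, dst, 1, src, 1, grid, MPI_STATUS_IGNORE);

    /* Compute-and-shift: q local multiplications, rotating A left, B up. */
    int asrc, adst, bsrc, bdst;
    MPI_Cart_shift(grid, 1, -1, &asrc, &adst);
    MPI_Cart_shift(grid, 0, -1, &bsrc, &bdst);
    for (int step = 0; step < q; step++) {
        block_mul_add(s, A_loc, B_loc, C_loc);
        MPI_Sendrecv_replace(A_loc, count, MPI_DOUBLE, adst, 0, asrc, 0, grid, MPI_STATUS_IGNORE);
        MPI_Sendrecv_replace(B_loc, count, MPI_DOUBLE, bdst, 1, bsrc, 1, grid, MPI_STATUS_IGNORE);
    }
    MPI_Comm_free(&grid);
}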
23. Matrix-Matrix Multiplication:
Cannon's Algorithm
• In the alignment step, since the maximum distance over which a block shifts is √p - 1, the two shift operations require a total of 2(ts + tw n²/p) time.
• Each of the √p single-step shifts in the compute-and-shift phase of the algorithm takes ts + tw n²/p time.
• The computation time for multiplying √p matrices of size (n/√p) x (n/√p) is n³/p.
• The parallel time is approximately: TP = n³/p + 2√p ts + 2tw n²/√p.
• The cost-optimality and isoefficiency of the algorithm are identical to the first algorithm, except that this algorithm is memory optimal.
24. Matrix-Matrix Multiplication:
DNS Algorithm
• Uses a 3-D partitioning.
• Visualize the matrix multiplication algorithm as a cube: matrices A and B come in through two orthogonal faces and the result C comes out through the third orthogonal face.
• Each internal node in the cube represents a single add-multiply operation (and thus the Θ(n³) complexity).
• DNS algorithm partitions this cube using a 3-D block
scheme.
25. Matrix-Matrix Multiplication:
DNS Algorithm
• Assume an n x n x n mesh of processors.
• Move the columns of A and rows of B and perform
broadcast.
• Each processor computes a single add-multiply.
• This is followed by an accumulation along the C
dimension.
• Since each add-multiply takes constant time and the accumulation and broadcast take Θ(log n) time, the total runtime is Θ(log n).
• With n³ processes the cost is Θ(n³ log n), so this is not cost optimal. It can be made cost optimal by using n / log n processors along the direction of accumulation.
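A hedged C/MPI sketch of the one-element-per-process case (p = n³); all names are ours and the initial values of A and B are arbitrary example data. The three MPI_Comm_split calls carve the cube into lines along each axis.

#include <mpi.h>
#include <math.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank, p;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &p);

    int n = (int) round(cbrt((double) p));   /* assume p = n^3 */
    int k = rank / (n * n);                  /* accumulation dimension */
    int i = (rank / n) % n;                  /* row of C               */
    int j = rank % n;                        /* column of C            */

    /* A and B start on the k = 0 plane; example values, arbitrary. */
    double a = 0.0, b = 0.0;
    if (k == 0) { a = i + 1.0; b = j + 1.0; }

    /* Sub-communicators along each axis of the cube. */
    MPI_Comm kcomm, jcomm, icomm;
    MPI_Comm_split(MPI_COMM_WORLD, i * n + j, k, &kcomm); /* vary k; rank = k */
    MPI_Comm_split(MPI_COMM_WORLD, i * n + k, j, &jcomm); /* vary j; rank = j */
    MPI_Comm_split(MPI_COMM_WORLD, j * n + k, i, &icomm); /* vary i; rank = i */

    /* Step 1 (one-to-one): A[i][j] moves to plane k = j, B[i][j] to k = i. */
    if (k == 0 && j != 0) MPI_Send(&a, 1, MPI_DOUBLE, j, 0, kcomm);
    if (k == 0 && i != 0) MPI_Send(&b, 1, MPI_DOUBLE, i, 1, kcomm);
    if (k == j && j != 0) MPI_Recv(&a, 1, MPI_DOUBLE, 0, 0, kcomm, MPI_STATUS_IGNORE);
    if (k == i && i != 0) MPI_Recv(&b, 1, MPI_DOUBLE, 0, 1, kcomm, MPI_STATUS_IGNORE);

    /* Step 2 (one-to-all): after these, (i,j,k) holds A[i][k] and B[k][j]. */
    MPI_Bcast(&a, 1, MPI_DOUBLE, k, jcomm);   /* root j = k holds A[i][k] */
    MPI_Bcast(&b, 1, MPI_DOUBLE, k, icomm);   /* root i = k holds B[k][j] */

    /* Step 3: one add-multiply per process, then all-to-one reduction along k. */
    double partial = a * b, c = 0.0;
    MPI_Reduce(&partial, &c, 1, MPI_DOUBLE, MPI_SUM, 0, kcomm);

    if (k == 0) printf("C[%d][%d] = %g\n", i, j, c);
    MPI_Finalize();
    return 0;
}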
27. Matrix-Matrix Multiplication:
DNS Algorithm
Using fewer than n³ processors.
• Assume that the number of processes p is equal to q³ for some q < n.
• The two matrices are partitioned into blocks of size (n/q)
x(n/q).
• Each matrix can thus be regarded as a q x q two-dimensional square array of blocks.
• The algorithm follows from the previous one, except, in
this case, we operate on blocks rather than on individual
elements.
28. Matrix-Matrix Multiplication:
DNS Algorithm
Using fewer than n³ processors.
• The first one-to-one communication step is performed for both A and B, and takes ts + tw (n/q)² time for each matrix.
• The two one-to-all broadcasts take (ts + tw (n/q)²) log q time for each matrix.
• The reduction takes (ts + tw (n/q)²) log q time.
• Multiplication of (n/q) x (n/q) submatrices takes (n/q)³ = n³/p time.
• The parallel time is approximated by TP = n³/p + (ts + tw n²/p^(2/3)) log p.
• The isoefficiency function is Θ(p log³ p).
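Spelled out, the bandwidth-term balance behind this isoefficiency (a worked sketch, with W = n³ and dominant overhead term tw n² p^(1/3) log p from T0 = pTP - W):

W = n³ = K tw n² p^(1/3) log p
⇒ n = K tw p^(1/3) log p
⇒ W = n³ = K³ tw³ p log³ p = Θ(p log³ p).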