Dense Matrix Algorithms
Ananth Grama, Anshul Gupta,
George Karypis, and Vipin Kumar
To accompany the text “Introduction to Parallel Computing”,
Addison Wesley, 2003.
Topic Overview
• Matrix-Vector Multiplication
• Matrix-Matrix Multiplication
• Solving a System of Linear Equations
Matrix Algorithms: Introduction
• Due to their regular structure, parallel computations
involving matrices and vectors readily lend themselves to
data-decomposition.
• Typical algorithms rely on input, output, or intermediate
data decomposition.
• Most algorithms use one- and two-dimensional block,
cyclic, and block-cyclic partitionings.
Matrix-Vector Multiplication
• We aim to multiply a dense n x n matrix A with an n x 1
vector x to yield the n x 1 result vector y.
• The serial algorithm requires n² multiplications and
additions.
Matrix-Vector Multiplication:
Rowwise 1-D Partitioning
• The n x n matrix is partitioned among n processors, with
each processor storing one complete row of the matrix.
• The n x 1 vector x is distributed such that each process
owns one of its elements.
Matrix-Vector Multiplication:
Rowwise 1-D Partitioning
Multiplication of an n x n matrix with an n x 1 vector using
rowwise block 1-D partitioning. For the one-row-per-process
case, p = n.
Matrix-Vector Multiplication:
Rowwise 1-D Partitioning
• Since each process starts with only one element of x ,
an all-to-all broadcast is required to distribute all the
elements to all the processes.
• Process Pi now computes y[i] = Σj (A[i, j] · x[j]).
• The all-to-all broadcast and the computation of y[i] both
take time Θ(n) . Therefore, the parallel time is Θ(n) .
Matrix-Vector Multiplication:
Rowwise 1-D Partitioning
• Consider now the case when p < n and we use block 1-D
partitioning.
• Each process initially stores n/p complete rows of the
matrix and a portion of the vector of size n/p.
• The all-to-all broadcast takes place among p processes
and involves messages of size n/p.
• This is followed by n/p local dot products.
• Thus, the parallel run time of this procedure is
TP = n²/p + ts log p + tw n.
This is cost-optimal.
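To make the data movement concrete, here is a minimal serial NumPy sketch of the rowwise block 1-D scheme. The p "processes", the block lists, and the function name rowwise_1d_matvec are illustrative assumptions; the all-to-all broadcast is simply modeled by concatenating the p vector pieces.

import numpy as np

def rowwise_1d_matvec(A, x, p):
    """Serial simulation of rowwise block 1-D parallel y = A @ x.

    p hypothetical processes each own n/p rows of A and n/p elements
    of x; the all-to-all broadcast is modeled by concatenation.
    """
    n = A.shape[0]
    assert n % p == 0
    b = n // p                                          # block size n/p
    A_blocks = [A[i*b:(i+1)*b, :] for i in range(p)]    # rows owned by process i
    x_blocks = [x[i*b:(i+1)*b] for i in range(p)]       # vector piece owned by process i

    x_full = np.concatenate(x_blocks)                   # all-to-all broadcast of x
    y_blocks = [A_blocks[i] @ x_full for i in range(p)] # n/p local dot products each
    return np.concatenate(y_blocks)

n, p = 8, 4
A = np.random.rand(n, n)
x = np.random.rand(n)
assert np.allclose(rowwise_1d_matvec(A, x, p), A @ x)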
Matrix-Vector Multiplication:
Rowwise 1-D Partitioning
Scalability Analysis:
• We know that T0 = pTP - W; therefore, we have
T0 = ts p log p + tw n p.
• For isoefficiency, we have W = KT0, where K = E/(1 – E)
for desired efficiency E.
• From this, we have W = O(p²) (from the tw term).
• There is also a bound on isoefficiency because of
concurrency. In this case, p < n; therefore, W = n² = Ω(p²).
• Overall isoefficiency is W = O(p²).
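The bullets above compress the derivation; a short sketch of it, using the text's ts/tw cost model with W = n² for matrix-vector multiplication, is:

% Isoefficiency sketch for rowwise 1-D matrix-vector multiplication (W = n^2).
\begin{align*}
  T_P &= \frac{n^2}{p} + t_s \log p + t_w n, \\
  T_0 &= p T_P - W = t_s p \log p + t_w n\, p, \\
  W = K T_0 &\;\Rightarrow\; n^2 = K t_w n p
            \;\Rightarrow\; n = K t_w p
            \;\Rightarrow\; W = n^2 = K^2 t_w^2\, p^2 = O(p^2).
\end{align*}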
Matrix-Vector Multiplication:
2-D Partitioning
• The n x n matrix is partitioned among n² processors such
that each processor owns a single element.
• The n x 1 vector x is distributed only in the last column of
n processors.
Matrix-Vector Multiplication: 2-D Partitioning
Matrix-vector multiplication with block 2-D partitioning. For the
one-element-per-process case, p = n² if the matrix size is n x n.
Matrix-Vector Multiplication:
2-D Partitioning
• We must first align the vector with the matrix
appropriately.
• The first communication step for the 2-D partitioning
aligns the vector x along the principal diagonal of the
matrix.
• The second step copies the vector elements from each
diagonal process to all the processes in the
corresponding column using n simultaneous broadcasts
among all processors in the column.
• Finally, the result vector is computed by performing an
all-to-one reduction along each row, leaving the result in
the last process-column.
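A serial NumPy sketch of the one-element-per-process case (p = n²) may help; the dictionary used for the diagonal alignment and the function name are illustrative assumptions, and the three communication steps are only modeled with array operations, not implemented with message passing.

import numpy as np

def matvec_2d_one_element(A, x):
    """Serial model of 2-D (one element per process) y = A @ x.

    Process (i, j) owns A[i, j]; x starts in the last process-column,
    i.e. process (i, n-1) owns x[i].
    """
    n = A.shape[0]
    # Step 1: alignment -- send x[i] from process (i, n-1) to the
    # diagonal process (i, i).
    diag = {(i, i): x[i] for i in range(n)}
    # Step 2: columnwise one-to-all broadcast -- every process in
    # column j receives x[j] from the diagonal process (j, j).
    col_val = np.array([diag[(j, j)] for j in range(n)])
    partial = A * col_val[np.newaxis, :]     # process (i, j) computes A[i, j] * x[j]
    # Step 3: all-to-one reduction along each row -- the partial
    # results in row i are summed (into the last process of the row).
    return partial.sum(axis=1)

n = 6
A = np.random.rand(n, n)
x = np.random.rand(n)
assert np.allclose(matvec_2d_one_element(A, x), A @ x)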
Matrix-Vector Multiplication:
2-D Partitioning
• Three basic communication operations are used in this
algorithm: one-to-one communication to align the vector
along the main diagonal, one-to-all broadcast of each
vector element among the n processes of each column,
and all-to-one reduction in each row.
• Each of these operations takes Θ(log n) time and the
parallel time is Θ(log n) .
• The cost (process-time product) is Θ(n² log n); hence,
the algorithm is not cost-optimal.
Matrix-Vector Multiplication:
2-D Partitioning
• When using fewer than n² processors, each process
owns an (n/√p) x (n/√p) block of the matrix.
• The vector is distributed in portions of n/√p elements in
the last process-column only.
• In this case, the message sizes for the alignment,
broadcast, and reduction are all n/√p.
• The computation is a product of an (n/√p) x (n/√p)
submatrix with a vector of length n/√p.
Matrix-Vector Multiplication:
2-D Partitioning
• The first alignment step takes time ts + tw n/√p.
• The broadcast and reductions take time (ts + tw n/√p) log(√p) each.
• Local matrix-vector products take time n²/p.
• Total time is TP ≈ n²/p + ts log p + tw (n/√p) log p.
Matrix-Vector Multiplication:
2-D Partitioning
• Scalability Analysis:
• T0 = pTP - W = ts p log p + tw n √p log p.
• Equating T0 with W, term by term, for isoefficiency, we
have W = K² tw² p log² p as the dominant term.
• The isoefficiency due to concurrency is O(p).
• The overall isoefficiency is O(p log² p) (due to the
network bandwidth).
• For cost optimality, we have n² = Ω(tw n √p log p), i.e.,
n = Ω(√p log p). For this, we have p = O(n²/log² n).
Matrix-Matrix Multiplication
• Consider the problem of multiplying two n x n dense, square
matrices A and B to yield the product matrix C =A x B.
• The serial complexity is O(n³).
• We do not consider better serial algorithms (Strassen's
method), although, these can be used as serial kernels in the
parallel algorithms.
• A useful concept in this case is called block operations. In
this view, an n x n matrix A can be regarded as a q x q array
of blocks Ai,j (0 ≤ i, j < q) such that each block is an (n/q) x
(n/q) submatrix.
• In this view, we perform q³ matrix multiplications, each
involving (n/q) x (n/q) matrices.
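A minimal serial NumPy sketch of the block-operations view, performing the q³ block multiplications explicitly (the function name block_matmul is an illustrative assumption):

import numpy as np

def block_matmul(A, B, q):
    """Compute C = A @ B as q^3 multiplications of (n/q) x (n/q) blocks."""
    n = A.shape[0]
    assert n % q == 0
    b = n // q
    C = np.zeros((n, n))
    for i in range(q):
        for j in range(q):
            for k in range(q):
                # Block operation: C_{i,j} += A_{i,k} * B_{k,j}
                C[i*b:(i+1)*b, j*b:(j+1)*b] += (
                    A[i*b:(i+1)*b, k*b:(k+1)*b] @ B[k*b:(k+1)*b, j*b:(j+1)*b]
                )
    return C

n, q = 8, 4
A, B = np.random.rand(n, n), np.random.rand(n, n)
assert np.allclose(block_matmul(A, B, q), A @ B)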
Matrix-Matrix Multiplication
• Consider two n x n matrices A and B partitioned into p
blocks Ai,j and Bi,j (0 ≤ i, j < √p) of size (n/√p) x (n/√p)
each.
• Process Pi,j initially stores Ai,j and Bi,j and computes block
Ci,j of the result matrix.
• Computing submatrix Ci,j requires all submatrices Ai,k and
Bk,j for 0 ≤ k < √p.
• All-to-all broadcast blocks of A along rows and B along
columns.
• Perform local submatrix multiplication.
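A serial model of this algorithm is sketched below: the block lists stand in for the data each process holds after the row and column all-to-all broadcasts, which makes the √p-fold memory overhead visible. The function and variable names are illustrative assumptions, not an MPI implementation.

import numpy as np

def allgather_block_matmul_model(A, B, sqrt_p):
    """Model of the all-to-all-broadcast matrix-matrix algorithm.

    Process (i, j) ends up storing a full block-row of A and a full
    block-column of B -- sqrt(p) times its own blocks, which is why
    the algorithm is not memory optimal.
    """
    n = A.shape[0]
    b = n // sqrt_p
    blk = lambda M, r, c: M[r*b:(r+1)*b, c*b:(c+1)*b]
    C = np.zeros((n, n))
    for i in range(sqrt_p):
        for j in range(sqrt_p):
            A_row = [blk(A, i, k) for k in range(sqrt_p)]  # received via rowwise all-to-all broadcast
            B_col = [blk(B, k, j) for k in range(sqrt_p)]  # received via columnwise all-to-all broadcast
            C[i*b:(i+1)*b, j*b:(j+1)*b] = sum(
                Aik @ Bkj for Aik, Bkj in zip(A_row, B_col)
            )
    return C

n, sqrt_p = 8, 4
A, B = np.random.rand(n, n), np.random.rand(n, n)
assert np.allclose(allgather_block_matmul_model(A, B, sqrt_p), A @ B)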
Matrix-Matrix Multiplication
• The two broadcasts take time 2(ts log(√p) + tw (n²/p)(√p – 1)).
• The computation requires √p multiplications of
(n/√p) x (n/√p) sized submatrices.
• The parallel run time is approximately
TP = n³/p + ts log p + 2 tw n²/√p.
• The algorithm is cost optimal and the isoefficiency is
O(p^1.5) due to bandwidth term tw and concurrency.
• Major drawback of the algorithm is that it is not memory
optimal.
Matrix-Matrix Multiplication:
Cannon's Algorithm
• In this algorithm, we schedule the computations of the
√p processes of the ith row such that, at any given
time, each process is using a different block Ai,k.
• These blocks can be systematically rotated among the
processes after every submatrix multiplication so that
every process gets a fresh Ai,k after each rotation.
Matrix-Matrix Multiplication:
Cannon's Algorithm
Communication steps in Cannon's algorithm on 16 processes.
Matrix-Matrix Multiplication:
Cannon's Algorithm
• Align the blocks of A and B in such a way that each
process multiplies its local submatrices. This is done by
shifting all submatrices Ai,j to the left (with wraparound)
by i steps and all submatrices Bi,j up (with wraparound)
by j steps.
• Perform local block multiplication.
• Each block of A moves one step left and each block of B
moves one step up (again with wraparound).
• Perform next block multiplication, add to partial result,
repeat until all blocks have been multiplied.
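A serial NumPy simulation of Cannon's algorithm on a q x q block grid (q = √p) is sketched below; the alignment and the single-step shifts are modeled with modular index arithmetic rather than actual message passing, and the function name is an illustrative assumption.

import numpy as np

def cannon_matmul(A, B, q):
    """Serial simulation of Cannon's algorithm on a q x q grid of blocks."""
    n = A.shape[0]
    b = n // q
    blk = lambda M, r, c: M[r*b:(r+1)*b, c*b:(c+1)*b]

    # Initial alignment: process (i, j) receives A_{i, (i+j) mod q}
    # (row i shifted left by i) and B_{(i+j) mod q, j} (column j shifted up by j).
    Ag = [[blk(A, i, (i + j) % q) for j in range(q)] for i in range(q)]
    Bg = [[blk(B, (i + j) % q, j) for j in range(q)] for i in range(q)]
    Cg = [[np.zeros((b, b)) for _ in range(q)] for _ in range(q)]

    for _ in range(q):
        # Local block multiplication and accumulation on every process.
        for i in range(q):
            for j in range(q):
                Cg[i][j] += Ag[i][j] @ Bg[i][j]
        # Single-step shifts: A one step left, B one step up (with wraparound).
        Ag = [[Ag[i][(j + 1) % q] for j in range(q)] for i in range(q)]
        Bg = [[Bg[(i + 1) % q][j] for j in range(q)] for i in range(q)]

    return np.block(Cg)

n, q = 8, 4
A, B = np.random.rand(n, n), np.random.rand(n, n)
assert np.allclose(cannon_matmul(A, B, q), A @ B)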
Matrix-Matrix Multiplication:
Cannon's Algorithm
• In the alignment step, since the maximum distance over
which a block shifts is √p – 1, the two shift operations require
a total of 2(ts + tw n²/p) time.
• Each of the √p single-step shifts in the compute-and-shift
phase of the algorithm takes ts + tw n²/p time.
• The computation time for multiplying √p matrices of size
(n/√p) x (n/√p) is n³/p.
• The parallel time is approximately:
TP = n³/p + 2√p ts + 2 tw n²/√p.
• The cost-optimality and isoefficiency of the algorithm are
identical to the first algorithm, except this one is memory optimal.
Matrix-Matrix Multiplication:
DNS Algorithm
• Uses a 3-D partitioning.
• Visualize the matrix multiplication algorithm as a cube:
matrices A and B come in on two orthogonal faces and the
result C comes out on the third orthogonal face.
• Each internal node in the cube represents a single add-
multiply operation (and thus the n³ complexity).
• DNS algorithm partitions this cube using a 3-D block
scheme.
Matrix-Matrix Multiplication:
DNS Algorithm
• Assume an n x n x n mesh of processors.
• Move the columns of A and rows of B and perform
broadcast.
• Each processor computes a single add-multiply.
• This is followed by an accumulation along the C
dimension.
• Since each add-multiply takes constant time and the
accumulation and broadcast take Θ(log n) time, the total
runtime is Θ(log n).
• This is not cost optimal. It can be made cost optimal by
using n / log n processors along the direction of
accumulation.
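A serial NumPy model of the p = n³ case is sketched below: entry (i, j, k) of a three-dimensional array plays the role of the cube node that computes A[i, k]·B[k, j], and the reduction along k yields C. The function name is an illustrative assumption.

import numpy as np

def dns_matmul_model(A, B):
    """Serial model of the DNS algorithm with one process per (i, j, k) node."""
    n = A.shape[0]
    # After the initial one-to-one moves and the broadcasts, node (i, j, k)
    # holds A[i, k] and B[k, j]; it performs one multiplication.
    prods = A[:, np.newaxis, :] * B.T[np.newaxis, :, :]  # prods[i, j, k] = A[i, k] * B[k, j]
    # Accumulation (all-to-one reduction) along the k dimension yields C.
    return prods.sum(axis=2)

n = 5
A, B = np.random.rand(n, n), np.random.rand(n, n)
assert np.allclose(dns_matmul_model(A, B), A @ B)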
Matrix-Matrix Multiplication:
DNS Algorithm
The communication steps in the DNS algorithm while
multiplying 4 x 4 matrices A and B on 64 processes.
Matrix-Matrix Multiplication:
DNS Algorithm
Using fewer than n³ processors.
• Assume that the number of processes p is equal to q³ for
some q < n.
• The two matrices are partitioned into blocks of size
(n/q) x (n/q).
• Each matrix can thus be regarded as a q x q two-
dimensional square array of blocks.
• The algorithm follows from the previous one, except, in
this case, we operate on blocks rather than on individual
elements.
Matrix-Matrix Multiplication:
DNS Algorithm
Using fewer than n³ processors.
• The first one-to-one communication step is performed for
both A and B, and takes ts + tw (n/q)² time for each matrix.
• The two one-to-all broadcasts take ts log q + tw (n/q)² log q
time for each matrix.
• The reduction takes ts log q + tw (n/q)² log q time.
• Multiplication of (n/q) x (n/q) submatrices takes (n/q)³ time.
• The parallel time is approximated by:
TP = n³/p + ts log p + tw (n²/p^(2/3)) log p.
• The isoefficiency function is Θ(p log³ p).