By
Dr. Heman Pathak
Associate Professor
KGC - Dehradun
CHAPTER-6: Elementary Parallel Algorithms
Parallel Computing: Theory and Practice
By Michael J. Quinn
CLASSIFYING PARALLEL ALGORITHMS
Parallel Algorithms
• Data Parallelism: SIMD, MIMD
• Control Parallelism: MIMD, Pipelined
Data Parallelism in MIMD
Pre-scheduled data-parallel algorithm
• The number of data items per functional unit is determined before any of the data items are processed.
• Pre-scheduling is commonly used when the time needed to process each data item is identical, or when the ratio of data items to functional units is high.
Self-scheduled data-parallel algorithm
• Data items are not assigned to functional units until run time.
• A global list of work to be done is kept, and when a functional unit is without work, another task (or small set of tasks) is removed from the list and assigned to it.
• Processes schedule themselves as the program executes, hence the name self-scheduled.
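To make the self-scheduled idea concrete, here is a minimal sketch (not from the slides) in which a shared work queue plays the role of the global list; process_item, the data, and the worker count are placeholders:

import queue
import threading

def self_scheduled(data, process_item, num_workers=4):
    # Global list of work to be done: items are handed out only when a worker is free.
    work = queue.Queue()
    for item in data:
        work.put(item)

    results = []
    results_lock = threading.Lock()

    def worker():
        while True:
            try:
                item = work.get_nowait()      # claim the next unassigned task
            except queue.Empty:
                return                        # no work left: this worker is done
            value = process_item(item)
            with results_lock:
                results.append(value)

    workers = [threading.Thread(target=worker) for _ in range(num_workers)]
    for w in workers:
        w.start()
    for w in workers:
        w.join()
    return results

# Example use: self_scheduled(range(100), lambda x: x * x)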
Control Parallelism
• Control parallelism is achieved through the simultaneous
application of different operations to different data elements.
• The flow of data among these processes can be arbitrarily
complex.
• If the data-flow graph forms a simple directed path, then we say
the algorithm is pipelined.
• We will use the term asynchronous algorithm to refer to any
control-parallel algorithm that is not pipelined.
REDUCTION
Design Strategy 1: If a cost-optimal CREW PRAM algorithm exists and the way the PRAM processors interact through shared variables maps onto the target architecture, the PRAM algorithm is a reasonable starting point.
• Consider the problem of performing a reduction operation on a set of n values, where n is much larger than p, the number of available processors.
• The objective is to develop a parallel algorithm that introduces the fewest extra operations compared with the best sequential algorithm.
REDUCTION - Where n is much larger than p
Summation is the reduction operation
• A cost-optimal PRAM algorithm for the global sum exists: n/log n processors can add n numbers in Θ(log n) time.
• The same principle can be used to develop good parallel algorithms for real SIMD and MIMD computers, even if p << n/log n.
• Either ⌈n/p⌉ or ⌊n/p⌋ values are allocated to each processor.
• In the first phase of the parallel algorithm each processor adds its own set of values, resulting in p partial sums.
• In the second phase the partial sums are combined into a global sum.
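A sequential sketch of this two-phase decomposition (illustrative only; the combining phase is written here as a plain loop over the partial sums rather than the parallel fan-in described below):

def two_phase_sum(values, p):
    # Phase 1: each of the p "processors" adds its own block of roughly n/p values.
    n = len(values)
    block = (n + p - 1) // p                  # ceil(n/p) values per processor
    partials = [sum(values[i * block:(i + 1) * block]) for i in range(p)]
    # Phase 2: combine the p partial sums into the global sum.
    return sum(partials)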
REDUCTION - Where n is much larger than p
Summation is the reduction operation
• It is important to check that the constant of proportionality associated with the cost of the PRAM algorithm is not significantly higher than that associated with an optimal sequential algorithm.
• In other words, make sure that the total number of operations performed by all the processors executing the PRAM algorithm is about the same as the number of operations performed by a single processor executing the best sequential algorithm.
Hypercube SIMD Model
(Figure: a three-dimensional hypercube processor array with nodes labelled 0 through 7.)
Sum of n Numbers: Hypercube
• If the PRAM processor interaction pattern forms a graph that embeds with dilation 1 in a target SIMD architecture, then there is a natural translation from the PRAM algorithm to the SIMD algorithm.
• The processors in the PRAM summation algorithm combine values in a binomial tree pattern.
• A dilation-1 embedding of the binomial tree in the hypercube is possible.
• The hypercube processor array version follows directly from the PRAM algorithm; the only significant difference is that the hypercube processor array model has no shared memory, so processors interact by passing data.
Sum of n Numbers: Hypercube
(Figures: pseudocode for the hypercube summation and a step-by-step illustration of the binomial-tree combining pattern.)
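Since that pseudocode is not reproduced above, the following sketch simulates the communication pattern sequentially: in step i the active processors differ from their partners only in bit i, which is exactly the binomial-tree pattern. This is an illustration, not SIMD code.

import math

def hypercube_sum(values, p):
    # Assumes p is a power of two and len(values) is a multiple of p.
    n = len(values)
    block = n // p
    local = [sum(values[i * block:(i + 1) * block]) for i in range(p)]
    # Combining phase: in step i, a processor whose lowest i bits are zero and
    # whose bit i is one sends its partial sum across dimension i and drops out.
    for i in range(int(math.log2(p))):
        for pid in range(p):
            if (pid & ((1 << (i + 1)) - 1)) == (1 << i):
                local[pid ^ (1 << i)] += local[pid]
    return local[0]           # processor 0 ends up with the global sum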
• Each processing element adds at most ⌈n/p⌉ values to find its local sum in Θ(n/p) time.
• Processor 0, which iterates the second inner for loop more than any other processor, performs log p communication steps and log p addition steps.
• The complexity of finding the sum of n values is Θ(n/p + log p) using the hypercube processor array model with p processors.
Sum of n Numbers: Hypercube
For every processing element to have a copy of the global sum, there are two options:
• Add a broadcast phase to the end of the algorithm. Once processing element 0 has the global sum, the value can be transmitted to the other processors in log p communication steps by reversing the direction of the edges in the binomial tree.
• Alternatively, each processing element swaps values across every dimension of the hypercube. After log p swap-and-accumulate steps, every processing element has the global sum.
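A sketch of the second option, again simulated sequentially; after crossing dimension i, every processor holds the sum over its (i+1)-dimensional sub-cube (illustrative only):

import math

def hypercube_swap_accumulate(local, p):
    # local[pid] holds each processor's partial sum; p is a power of two.
    for i in range(int(math.log2(p))):
        # Every processor exchanges its running total with the partner across
        # dimension i and adds the received value to its own.
        local = [local[pid] + local[pid ^ (1 << i)] for pid in range(p)]
    return local                  # every entry now equals the global sum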
Sum of n Numbers: Hypercube
(Figure: every processing element obtaining a copy of the global sum.)
Shuffle Exchange SIMD Model
(Figure: a shuffle-exchange network connecting the processors.)
Sum of n Numbers: Shuffle Exchange
• If the PRAM processor interaction pattern does not form a graph that embeds in the target SIMD architecture, the translation is not straightforward, but an efficient SIMD algorithm may still exist.
• There is no dilation-1 embedding of a binomial tree in a shuffle-exchange network.
• If the sums are combined in pairs, a logarithmic number of combining steps can still find the grand total.
• Two data routings - a shuffle followed by an exchange - on the shuffle-exchange model are sufficient to bring together two subtotals.
• After log p shuffle-exchange steps, processor 0 has the grand total.
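A sequential simulation of these log p shuffle-exchange-add steps; the perfect-shuffle link is modelled as a one-bit left rotation of the processor id (an illustration, not SIMD code):

import math

def shuffle_exchange_sum(local, p):
    # local[i] is processor i's partial sum; p = 2**d processors.
    d = int(math.log2(p))
    for _ in range(d):
        # Shuffle routing: each value moves to the processor whose id is a
        # one-bit left rotation of the sender's id.
        shuffled = [0] * p
        for i in range(p):
            dest = ((i << 1) | (i >> (d - 1))) & (p - 1)
            shuffled[dest] = local[i]
        # Exchange-and-add: partners differ in the lowest bit; keep the pair sum.
        local = [shuffled[i] + shuffled[i ^ 1] for i in range(p)]
    return local[0]               # processor 0 now holds the grand total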
Sum of n Numbers: Shuffle Exchange
(Figures: pseudocode for the shuffle-exchange summation and a step-by-step illustration.)
Sum of n Numbers: Shuffle Exchange
• At the termination of this algorithm, the variable sum on processor 0 holds the grand total.
• Every processing element spends Θ(n/p) time computing its local sum.
• Since there are log p iterations of the shuffle-exchange-add loop and every iteration takes constant time, the parallel algorithm has complexity Θ(n/p + log p).
2-D Mesh SIMD Model
(Figure: a two-dimensional mesh of processors.)
Sum of n Numbers: 2-D Mesh
• No dilation-1 embedding exists for a balanced binary tree or a binomial tree in a mesh.
• Instead, establish a lower bound on the complexity of any parallel algorithm for this topology. Once the lower bound is established, there is no reason to search for a solution of lower complexity.
• In order to find the sum of n values spread evenly among p processors organized as a √p × √p mesh, at least one of the processors in the mesh must eventually contain the grand sum.
Sum of n Numbers: 2-D Mesh
• The total number of communication steps needed to get the subtotals from the corner processors must be at least 2(√p - 1), assuming that during any time unit only communications in a single direction are allowed.
• Since the algorithm has at least 2(√p - 1) communication steps, the time complexity of the parallel algorithm is at least Θ(n/p + √p).
• There is therefore no point in looking for an algorithm for this model that requires only Θ(log p) communication steps.
UMA Multiprocessor Model
(Figure: a UMA multiprocessor, with processors accessing a shared memory through a common interconnect.)
Sum of n Numbers: UMA
• Unlike the PRAM model, processors execute instructions asynchronously.
• For that reason we must ensure that no processor accesses a variable containing a partial sum until that variable has been set.
• Each element of the array flags begins with the value 0.
• When the value is set to 1, the corresponding element of the array mono has a partial sum in it.
Sum of n Numbers: UMA
(Figure: pseudocode for the fan-in summation using the flags array.)
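Since that pseudocode is not reproduced here, the following threaded sketch mimics the idea: local sums, a flags array, and a fan-in over log p levels. The name partial and the spin-wait are illustrative, not the Sequent C of the original; flags follows the description above.

import math
import threading
import time

def uma_fan_in_sum(values, p):
    # Assumes p is a power of two and len(values) is a multiple of p.
    n = len(values)
    block = n // p
    partial = [0] * p          # one partial-sum slot per process
    flags = [0] * p            # flags[j] = number of fan-in levels process j has completed
    levels = int(math.log2(p))

    def worker(i):
        # Step 1: each process sums its own block of n/p values.
        partial[i] = sum(values[i * block:(i + 1) * block])
        flags[i] = 1                                # local sum is ready
        # Step 2: fan-in over log p levels. At level k the first p/2^(k+1)
        # processes each wait for a partner and add its partial sum.
        for k in range(levels):
            half = p >> (k + 1)
            if i >= half:
                return                              # this process only supplies its sum
            partner = i + half
            while flags[partner] < k + 1:           # wait until partner's sum is valid
                time.sleep(0)
            partial[i] += partial[partner]
            flags[i] = k + 2

    threads = [threading.Thread(target=worker, args=(i,)) for i in range(p)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return partial[0]          # process 0 holds the global sum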
Sum of n Numbers: UMA (n = 16, p = 4)
Index:  0   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15
Value:  6  -4  19   2  -9   0   3  -5  10  -3  -8   1   7  -2   4   5

P0 holds indices 0, 4, 8, 12 (values 6, -9, 10, 7): local sum 14
P1 holds indices 1, 5, 9, 13 (values -4, 0, -3, -2): local sum -9
P2 holds indices 2, 6, 10, 14 (values 19, 3, -8, 4): local sum 18
P3 holds indices 3, 7, 11, 15 (values 2, -5, 1, 5): local sum 3

Fan-in level 1: P0 = 14 + 18 = 32, P1 = -9 + 3 = -6, P2 = 18, P3 = 3
Fan-in level 2: P0 = 32 + (-6) = 26 (the global sum)
Worst-case Time Complexity
• If the initial process creates the p - 1 other processes all by itself, the time complexity of process creation is Θ(p).
• In practice we do not count this cost, since processes are created only once, at the beginning of the program, and most algorithms we analyze form subroutines of larger applications.
• Sequentially initializing the array flags has time complexity Θ(p).
• Each process finds the sum of n/p values. If we assume that memory-bank conflicts do not increase the complexity by more than a constant factor, the complexity of this section of the code is Θ(n/p).
Worst-case Time Complexity
• The while loop executes log p times.
• Each iteration of the while loop has time complexity Θ(1), so the total complexity of the while loop is Θ(log p).
• Synchronization among all the processes occurs at the final endfor; the complexity of this synchronization is Θ(p).
• The overall complexity of the algorithm is therefore
Θ(p + n/p + log p + p) = Θ(n/p + p)
Sum of n Numbers: UMA
• Since the time complexity of the parallel algorithm is Θ(n/p + p) anyway, why bother with a complicated fan-in-style parallel addition?
• It is simpler to compute the global sum from the local sums by having each process enter a critical section where its local sum is added to the global sum.
• The resulting algorithm also has time complexity Θ(n/p + p).
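A minimal threaded sketch of this critical-section alternative (illustrative; a lock stands in for the critical section):

import threading

def uma_critical_section_sum(values, p):
    # Each process sums its own block, then adds its local sum to a shared
    # global total inside a critical section (one process at a time).
    n = len(values)
    block = n // p
    total = [0]                        # shared global sum
    lock = threading.Lock()

    def worker(i):
        local = sum(values[i * block:(i + 1) * block])
        with lock:                     # critical section
            total[0] += local

    threads = [threading.Thread(target=worker, args=(i,)) for i in range(p)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return total[0]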
Sum of n Numbers: UMA
• Both algorithms have been implemented on the Sequent Balance™ (a UMA multiprocessor) using Sequent C.
• The figure compares the execution times of the two reduction steps as a function of the number of active processes.
• The original fan-in-style algorithm is uniformly superior to the critical-section-style algorithm.
• The constant of proportionality associated with the Θ(p) term is smaller in the first program.
Sum of n Numbers: UMA
Design Strategy 2: Look for a data-parallel algorithm before considering a control-parallel algorithm.
• The only parallelism we can exploit on a SIMD computer is data parallelism.
• On MIMD computers, however, we can look for ways to exploit both data parallelism and control parallelism.
• Data-parallel algorithms are more common, easier to design and debug, and better able to scale to large numbers of processors than control-parallel algorithms.
• For this reason a data-parallel solution should be sought first, and a control-parallel implementation considered a last resort.
• When we write in a data-parallel style on a MIMD machine, the result is an SPMD (Single Program Multiple Data) program. In general, SPMD programs are easier to write and debug than arbitrary MIMD programs.
In Multicomputers
BROADCAST - in Multicomputers
• Consider one processor broadcasting a list of values to all the other processors on a hypercube multicomputer.
• The execution time of the implemented algorithm has two
primary components:
▫ The time needed to initiate the messages and
▫ The time needed to perform the data transfers.
• Message start-up time is called message-passing overhead
or message latency.
BROADCAST - in Multicomputers
• If the amount of data to be broadcast is small, the message-passing overhead dominates the data-transfer time.
• The best algorithm is then the one that minimizes the number of communications performed by any processor.
• The binomial tree is a suitable broadcast pattern because there is a dilation-1 embedding of a binomial tree into a hypercube.
• The resulting algorithm requires only log p communication steps.
BROADCAST - in Multicomputers
(Figure: the binomial-tree broadcast algorithm.)
BROADCAST - in Multicomputers
p = 8, source = 000, broadcasting a value. Partner (i = k) is the processor to which a node forwards the value at step k ("-" means it does not send at that step):

Id    Position   Partner i=0   Partner i=1   Partner i=2
000   000        001           010           100
001   001        -             011           101
010   010        -             -             110
011   011        -             -             111
100   100        -             -             -
101   101        -             -             -
110   110        -             -             -
111   111        -             -             -
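A sequential simulation of this pattern: at step i, every processor that already holds the value forwards it to the partner obtained by flipping bit i of its id, reproducing the table above (illustrative only):

import math

def binomial_tree_broadcast(p, source, value):
    # has[j] is the value held by processor j, or None if not yet received.
    has = [None] * p
    has[source] = value
    for i in range(int(math.log2(p))):      # log p communication steps
        for pid in range(p):
            if has[pid] is not None:
                partner = pid ^ (1 << i)    # flip bit i: the step-i partner
                if has[partner] is None:
                    has[partner] = has[pid] # forward the value
    return has                              # every processor now holds the value

# Example: binomial_tree_broadcast(8, 0, "msg") fills all eight slots in 3 steps.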
BROADCAST - in Multicomputers
• If the amount of data to be broadcast is large, the data-transfer time dominates the message-passing overhead.
• Under these circumstances the binomial-tree algorithm has a serious weakness: at any one time no more than p/2 of the p log p communication links are in use.
• If the time needed to pass the entire message from one processor to another is M, then the broadcast algorithm requires time M log p.
BROADCAST - in Multicomputers
• Johnsson and Ho (1989) designed a broadcast algorithm that executes up to log p times faster than the binomial-tree algorithm.
• Their algorithm relies on the fact that every hypercube contains log p edge-disjoint spanning trees with the same root node.
• The algorithm breaks the message into log p parts and broadcasts each part to the other nodes through a different binomial spanning tree.
• Because the spanning trees have no edges in common, all the data flows concurrently, and the entire algorithm executes in approximately (M log p)/log p = M time.
BROADCAST - in Multicomputers
(Figure: the log p edge-disjoint spanning trees used by Johnsson and Ho's broadcast.)
BROADCAST - in Multicomputers
Design Strategy 3: As problem size grows, use the algorithm that makes best use of the available resources.
In the case of broadcasting large data sets on a hypercube multicomputer, the most constrained resource is the network capacity. Johnsson and Ho's algorithm makes better use of this resource than the binomial-tree broadcast algorithm and, as a result, achieves higher performance.
Hypercube Multicomputers
Prefix Sums on Hypercube Multicomputers
(Figure: statement of the prefix-sums problem for an array A.)
Prefix Sums on Hypercube Multicomputers
• The cost-optimal PRAM algorithm requires n/log n processors to solve the problem in Θ(log n) time.
• In order to achieve cost optimality, each processor uses the best sequential algorithm to manipulate its own set of n/p elements of A.
• The same strategy is used to design an efficient multicomputer algorithm where p << n/log n.
Prefix Sums on Hypercube Multicomputers
Design Strategy 4: Let each processor perform the most efficient sequential algorithm on its share of the data.
Prefix Sums on Hypercube Multicomputers
• There are n elements and p processors, where n is an integer multiple of p.
• The elements of A are distributed evenly among the local memories of the p processors.
• During step one each processor finds the sum of its n/p elements.
• In step two the processors cooperate to find the p prefix sums of their local sums.
• During step three each processor computes the prefix sums of its own n/p values, using the total held by lower-numbered processors.
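A sequential sketch of this three-step structure; step two is written here as a plain exclusive prefix over the local sums, while the hypercube realization of step two is discussed below (illustrative only):

def three_step_prefix_sums(A, p):
    # Assumes len(A) is a multiple of p. Returns the prefix sums of A,
    # computed blockwise the way the three-step algorithm divides the work.
    n = len(A)
    block = n // p
    chunks = [A[i * block:(i + 1) * block] for i in range(p)]
    # Step 1: each processor sums its own block.
    local_sums = [sum(c) for c in chunks]
    # Step 2: exclusive prefix of the local sums
    # (offsets[i] = sum of the local sums of processors 0 .. i-1).
    offsets, running = [], 0
    for s in local_sums:
        offsets.append(running)
        running += s
    # Step 3: each processor computes the prefix sums of its block,
    # shifted by the total from lower-numbered processors.
    result = []
    for i in range(p):
        acc = offsets[i]
        for x in chunks[i]:
            acc += x
            result.append(acc)
    return result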
Prefix Sums on Hypercube Multicomputers
(Figure: the three-step computation illustrated.)
Prefix Sums on Hypercube Multicomputers
• The communication time required by step two depends upon the multicomputer's topology.
• The memory-access pattern of the PRAM algorithm does not translate directly into a communication pattern having a dilation-1 embedding in a hypercube.
• For this reason we should look for a better method of computing the prefix sums.
Sum of n Numbers: Hypercube
(Figure: the swap-and-accumulate summation, shown again before it is modified for prefix sums.)
Prefix Sums on Hypercube Multicomputers
• Finding prefix sums is similar to performing a reduction, except that for each element in the list we are only interested in the values of prior elements.
• We can modify the hypercube reduction algorithm to perform prefix sums.
• As in the reduction algorithm, every processor swaps values across each dimension of the hypercube; however, each processor now maintains two variables containing totals.
• The first variable contains the total of all values received.
• The second variable contains the total of all values received from smaller-numbered processors.
• At the end of log p swap-and-add steps, the second variable associated with each processor contains the prefix sum for that processor.
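A sequential simulation of this modification: total accumulates everything received, while prefix accumulates only what arrived from smaller-numbered processors, which is exactly the offset step three needs (illustrative only):

import math

def hypercube_prefix_of_local_sums(local, p):
    # local[pid] is processor pid's local sum; p is a power of two.
    total = list(local)         # total of all values received so far (plus own)
    prefix = [0] * p            # total received from smaller-numbered processors
    for i in range(int(math.log2(p))):
        new_total = [0] * p
        new_prefix = [0] * p
        for pid in range(p):
            partner = pid ^ (1 << i)                 # swap across dimension i
            new_total[pid] = total[pid] + total[partner]
            if partner < pid:
                # everything the partner has accumulated came from smaller ids
                new_prefix[pid] = prefix[pid] + total[partner]
            else:
                new_prefix[pid] = prefix[pid]
        total, prefix = new_total, new_prefix
    return prefix               # prefix[pid] == local[0] + ... + local[pid - 1]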
(Figure: the swap-and-add prefix-sums computation illustrated.)
Prefix Sums on Hypercube Multicomputers
• χ : the time needed to perform the ⊕ operation
• λ : the time needed to initiate a message
• β : the message transmission time per value
For example, sending a k-element message from one processor to another requires time λ + βk.
Prefix Sums on Hypercube Multicomputers
• During step one each processor finds the sum of its n/p values in (n/p - 1)χ time units.
• During step three processor 0 computes the prefix sums of its n/p values in (n/p - 1)χ time units.
• Processors 1 through p - 1 must add the sum of the lower-numbered processors' values to the first elements of their lists before computing the prefix sums.
• These processors perform step three in (n/p)χ time units.
Prefix Sums on Hypercube Multicomputers
• Step two has log p phases.
• During each phase a processor performs the ⊕ operation at most twice, so the computation time required by step two is no more than 2χ log p.
• During each phase a processor sends one value to a neighbouring processor and receives one value from that processor.
• The total communication time of step two is 2(λ + β) log p.
• Summing the computation and communication times yields a total execution time of 2(λ + β + χ) log p for step two of the algorithm.
Prefix Sums on Hypercube Multicomputers
Estimated execution time = (n/p - 1)χ + 2(λ + β + χ) log p + (n/p)χ
Prefix Sums on Hypercube Multicomputers
• In other words, the efficiency of this algorithm cannot exceed 50%, no matter how large the problem size or how small the message latency.
• The figure compares the predicted speedup with the speedup actually achieved by this algorithm on the nCUBE 3200™, where the associative operator is integer addition, χ = 414 nanoseconds, λ = 363 microseconds, and β = 4.5 microseconds.