Parallel Algorithms
An overview
• A parallel merging algorithm
• Accelerated Cascading and Parallel List
Ranking
Parallel merging through
partitioning
The partitioning strategy consists of:
• Breaking up the given problem into many
independent subproblems of equal size
• Solving the subproblems in parallel
This is similar to the divide-and-conquer
strategy in sequential computing.
Partitioning and Merging
Given a set S with a relation ≤, S is linearly ordered if for every pair a, b ∈ S:
• either a ≤ b or b ≤ a.
The merging problem is the following:
Partitioning and Merging
Input: Two sorted arrays A = (a1, a2,..., am) and
B = (b1, b2,..., bn) whose elements are drawn
from a linearly ordered set.
Output: A merged sorted sequence
C = (c1, c2,..., cm+n).
Merging
For example, if A = (2,8,11,13,17,20) and B =
(3,6,10,15,16,73), the merged sequence
C = (2,3,6,8,10,11,13,15,16,17,20,73).
Merging
A sequential algorithm
• Simultaneously move two pointers along the
two arrays
• Write the items in sorted order in another
array
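As a concrete reference, here is a minimal Python sketch of this sequential two-pointer merge (the function name merge and the use of Python lists are illustrative choices, not part of the original slides):

def merge(A, B):
    # Merge two sorted lists A and B into one sorted list in O(m + n) time.
    C = []
    i = j = 0
    # Advance whichever pointer currently points at the smaller item.
    while i < len(A) and j < len(B):
        if A[i] <= B[j]:
            C.append(A[i]); i += 1
        else:
            C.append(B[j]); j += 1
    # One input is exhausted; copy the remainder of the other.
    C.extend(A[i:])
    C.extend(B[j:])
    return C

For the example above, merge([2, 8, 11, 13, 17, 20], [3, 6, 10, 15, 16, 73]) returns the merged sequence C.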
Partitioning and Merging
• The complexity of the sequential algorithm is
O(m + n).
• We will use the partitioning strategy for
solving this problem in parallel.
Partitioning and Merging
Definitions:
rank(ai : A) is the number of elements in A less than or equal to ai, where ai ∈ A.
rank(bi : A) is the number of elements in A less than or equal to bi, where bi ∈ B.
Merging
For example, consider the arrays:
A = (2,8,11,13,17,20)
B = (3,6,10,15,16,73)
rank(11 : A) = 3 and rank(11 : B) = 3.
Merging
• The position of an element ai ∈ A in the sorted array C is:
rank(ai : A) + rank(ai : B).
For example, the position of 11 in the sorted array C is:
rank(11 : A) + rank(11 : B) = 3 + 3 = 6.
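This rank arithmetic is easy to check with Python's standard bisect module (used here purely for illustration; for distinct elements, bisect_right counts exactly the elements less than or equal to x):

import bisect

A = [2, 8, 11, 13, 17, 20]
B = [3, 6, 10, 15, 16, 73]

def rank(x, X):
    # Number of elements of the sorted list X that are <= x.
    return bisect.bisect_right(X, x)

pos = rank(11, A) + rank(11, B)   # 3 + 3 = 6, the position of 11 in C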
Parallel Merging
• The idea is to decompose the overall merging
problem into many smaller merging
problems.
• When the problem size is sufficiently small,
we will use the sequential algorithm.
Merging
• The main task is to generate smaller merging
problems such that:
• Each sequence in such a smaller problem has
O(log m) or O(log n) elements.
• Then we can use the sequential algorithm since
the time complexity will be O(log m + log n).
Parallel Merging
Step 1. Divide the array B into blocks such that each block has log m elements. Hence there are m/log m blocks.
For each block, the last element is b(i log m), 1 ≤ i ≤ m/log m.
Parallel Merging
Step 2. We allocate one processor for each last element in B.
• For a last element b(i log m), this processor does a binary search in the array A to determine two elements ak, ak+1 such that ak ≤ b(i log m) ≤ ak+1.
• All the m/log m binary searches are done in parallel and take O(log m) time each.
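The sketch below simulates Steps 1 and 2 sequentially; on the PRAM each loop iteration would be executed by its own processor. The function name rank_block_boundaries and the exact block-size formula are illustrative assumptions (A is assumed non-empty):

import bisect, math

def rank_block_boundaries(A, B):
    # Step 1: block size ~ log m, giving ~m/log m block boundaries in B.
    blk = max(1, int(math.log2(max(2, len(A)))))
    boundaries = []
    # Step 2: each iteration = one processor doing an O(log m) binary search.
    for end in range(blk - 1, len(B), blk):
        boundaries.append((end, bisect.bisect_right(A, B[end])))
    return boundaries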
Parallel Merging
• After the binary searches are over, the array
A is divided into m/log m blocks.
• There is a one-to-one correspondence between the blocks in A and B. We call such a pair of blocks matching blocks.
Parallel Merging
• Each block in A is determined in the following
way.
• Consider the two elements b(i log m) and b((i + 1) log m). These are the boundary elements of the (i + 1)-th block of B.
• The two elements of A that determine rank(b(i log m) : A) and rank(b((i + 1) log m) : A) define the matching block in A.
Parallel Merging
• These two matching blocks determine a smaller
merging problem.
• Every element inside a matching block has to be
ranked inside the other matching block.
• Hence, the problem of merging a pair of matching
blocks is an independent subproblem which does
not affect any other block.
Parallel Merging
• If the size of each block in A is O(log m), we can
directly run the sequential algorithm on every pair of
matching blocks from A and B.
• Some blocks in A may be larger than O(log m) and
hence we have to do some more work to break
them into smaller blocks.
Parallel Merging
If a block Ai of A is larger than O(log m) and the matching block of Ai is Bj, we do the following:
• We divide Ai into blocks of size O(log m).
• Then we apply the same algorithm to rank the boundary elements of each block of Ai in Bj.
• Now each block in A is of size O(log m).
• This takes O(log log m) time.
Parallel Merging
Step 3.
• We now take every pair of matching blocks from A
and B and run the sequential merging algorithm.
• One processor is allocated for every matching pair
and this processor merges the pair in O(log m)
time.
We have to analyse the time and processor
complexities of each of the steps to get the overall
complexities.
Parallel Merging
Complexity of Step 1
• The task in Step 1 is to partition B into blocks of size log m.
• We allocate m/log m processors.
• Since B is an array, processor Pi, 1 ≤ i ≤ m/log m, can find the element b(i log m) in O(1) time.
Parallel Merging
Complexity of Step 2
• In Step 2, m/log m processors do binary search in array A in O(log n) time each.
• Hence the time complexity is O(log n), and the work done is
(m log n)/log m ≤ (m log(m + n))/log m ≤ (m + n)
for n, m ≥ 4. Hence the total work is O(m + n).
Parallel Merging
Complexity of Step 3
• In Step 3, we use m/log m processors.
• Each processor merges a pair Ai, Bi in O(log m) time. Hence the total work done is O(m).
Theorem
Let A and B be two sorted sequences each of
length n. A and B can be merged in O(log n) time
using O(n) operations in the CREW PRAM.
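Putting the three steps together, a sequential simulation of the whole algorithm might look as follows. This is a sketch, reusing the merge function from the earlier sketch; the recursive splitting of oversized A-blocks is omitted, since it only matters for the parallel running time, not for the output:

import bisect, math

def parallel_merge_sim(A, B):
    # Each (A-segment, B-block) pair below is an independent subproblem
    # that one PRAM processor would merge on its own; here we just loop.
    blk = max(1, int(math.log2(max(2, len(A)))))     # block size ~ log m
    C, a_lo, b_lo = [], 0, 0
    for b_hi in range(blk, len(B) + blk, blk):
        b_hi = min(b_hi, len(B))
        a_hi = bisect.bisect_right(A, B[b_hi - 1])   # Step 2: rank in A
        C.extend(merge(A[a_lo:a_hi], B[b_lo:b_hi]))  # Step 3: local merge
        a_lo, b_lo = a_hi, b_hi
    C.extend(A[a_lo:])      # tail of A beyond the last ranked boundary
    return C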
Accelerated Cascading and Parallel List
Ranking
• We will first discuss a technique called
accelerated cascading for designing very fast
parallel algorithms.
• We will then study a very important technique
for ranking the elements of a list in parallel.
Fast computation of maximum
Input: An array A holding p elements from a linearly ordered
universe S. We assume that all the elements in A are
distinct.
Output: The maximum element from the array A.
We use a boolean array M such that M(k)=1 if and only if
A(k) is the maximum element in A.
Initialization: We allocate p processors to set each entry in
M to 1.
Fast computation of maximum
Step 1: Assign p processors to each element in A, p² processors overall.
• Consider the p processors allocated to A(j). We name these processors P1, P2, ..., Pi, ..., Pp.
• Pi compares A(j) with A(i):
  If A(i) > A(j) then M(j) := 0
  else do nothing.
Fast computation of maximum
Step 2: At the end of Step 1, M(k), 1 ≤ k ≤ p, will be 1 if and only if A(k) is the maximum element.
• We allocate p processors, one for each entry in M.
• If the entry is 0, the processor does nothing.
• If the entry is 1, it outputs the index k of the maximum element.
Fast computation of maximum
Complexity: The processor requirement is p² and the time complexity is O(1).
• We need a concurrent write facility and hence the Common CRCW PRAM model.
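A sequential simulation of this O(1)-time algorithm is sketched below (the function name is an illustrative choice; on the Common CRCW PRAM all p² comparisons happen in one step, and the concurrent writes into M all write the same value 0, as the model requires):

def crcw_max(A):
    p = len(A)
    M = [1] * p                  # initialization: M(k) = 1 for every k
    for j in range(p):           # Step 1: p processors per element A(j)
        for i in range(p):
            if A[i] > A[j]:
                M[j] = 0         # A(j) loses at least one comparison
    for k in range(p):           # Step 2: the unique surviving index
        if M[k] == 1:
            return A[k]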
Optimal computation of
maximum
• This is the same binary tree algorithm that we used for adding n numbers.
Optimal computation of
maximum
• This algorithm takes O(n) processors and
O(log n) time.
• We can reduce the processor complexity to
O(n / log n). Hence the algorithm does optimal
O(n) work.
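A sketch of this work-optimal scheme, simulated sequentially (the function name is illustrative and A is assumed non-empty): each of the ~n/log n processors first scans its own block of ~log n elements, and the block maxima are then combined by a binary reduction tree.

import math

def optimal_max(A):
    n = len(A)
    g = max(1, int(math.log2(max(2, n))))        # block size ~ log n
    # Each processor finds the maximum of its own block in O(log n) time.
    maxima = [max(A[i:i + g]) for i in range(0, n, g)]
    # Binary-tree reduction over ~n/log n candidates: O(log n) rounds.
    while len(maxima) > 1:
        maxima = [max(maxima[i:i + 2]) for i in range(0, len(maxima), 2)]
    return maxima[0]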
An O(log log n) time algorithm
• Instead of a binary tree, we use a more complex tree. Assume that n = 2^(2^k).
• The root of the tree has 2^(2^(k−1)) children.
• Each node at the i-th level has 2^(2^(k−i−1)) children, for 0 ≤ i ≤ k − 1.
• Each node at level k − 1 therefore has two children, and the leaves are at level k.
An O(log log n) time algorithm
Some Properties
• The depth of the tree is k = log log n, since n = 2^(2^k).
• The number of nodes at the i-th level is 2^(2^k − 2^(k−i)), for 0 ≤ i ≤ k.
An O(log log n) time algorithm
The Algorithm
• The algorithm proceeds level by level,
starting from the leaves.
• At every level, we compute the maximum of
all the children of an internal node by the O(1)
time algorithm.
• The time complexity is O(log log n) since the
depth of the tree is O(log log n).
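An equivalent way to simulate the doubly logarithmic tree is to recurse on groups of about √p elements and combine the group maxima with the O(1)-time algorithm; the recursion depth then satisfies T(p) = T(√p) + O(1) = O(log log p). This grouping view is an assumption of the sketch below, which reuses crcw_max from above:

import math

def dlog_max(A):
    if len(A) <= 2:
        return max(A)
    g = max(2, math.isqrt(len(A)))     # group size ~ sqrt(|A|)
    # Each group is one subtree; all groups would recurse in parallel.
    group_maxima = [dlog_max(A[i:i + g]) for i in range(0, len(A), g)]
    # Combine ~sqrt(|A|) candidates with the O(1) CRCW algorithm.
    return crcw_max(group_maxima)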
An O(log log n) time algorithm
Total Work:
• Recall that the O(1) time algorithm needs O(p²) work for p elements.
• Each node at the i-th level has 2^(2^(k−i−1)) children.
• So the total work for each node at the i-th level is O((2^(2^(k−i−1)))²) = O(2^(2^(k−i))).
An O(log log n) time algorithm
Total Work:
• There are 2^(2^k − 2^(k−i)) nodes at the i-th level. Hence the total work for the i-th level is
O(2^(2^k − 2^(k−i)) · 2^(2^(k−i))) = O(2^(2^k)) = O(n).
• For O(log log n) levels, the total work is O(n log log n). This is suboptimal.
Accelerated cascading
• The first algorithm, which is based on a binary tree, is optimal but slow.
• The second algorithm is suboptimal, but very
fast.
• We combine these two algorithms through
the accelerated cascading strategy.
Accelerated cascading
• We start with the optimal algorithm until the
size of the problem is reduced to a certain
value.
• Then we use the suboptimal but very fast
algorithm.
Accelerated cascading
Phase 1.
• We apply the binary tree algorithm, starting from the leaves, for log log log n levels.
• The number of candidates reduces to n / 2^(log log log n) = n / log log n.
• The total work done so far is O(n) and the total time is O(log log log n).
Accelerated cascading
Phase 2.
• In this phase, we use the fast algorithm on the remaining n / log log n candidates.
• The total work is O((n / log log n) · log log n) = O(n).
• The total time is O(log log n).
• Theorem: The maximum of n elements can be computed in O(log log n) time and O(n) work on the Common CRCW PRAM.
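The two phases combine into the following sketch (sequentially simulated; the exact level count is an illustrative choice consistent with the analysis above, and A is assumed non-empty). It reuses dlog_max from the previous sketch:

import math

def cascaded_max(A):
    n = len(A)
    if n <= 4:
        return max(A)
    # Phase 1: pairwise (binary tree) reduction for ~log log log n levels,
    # leaving about n / log log n candidates after O(n) total work.
    levels = max(1, math.ceil(math.log2(math.log2(math.log2(n)))))
    for _ in range(levels):
        A = [max(A[i:i + 2]) for i in range(0, len(A), 2)]
    # Phase 2: the fast doubly-logarithmic algorithm on the survivors.
    return dlog_max(A)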
Two parallel list ranking algorithms
• An O(log n) time and O(n log n) work list
ranking algorithm.
• An O(log n loglog n) time and O(n) work list
ranking algorithm.
List ranking
Input: A linked list L of n elements.
L is given in an array S such that the entry S(i)
contains the index of the node which is the
successor of the node i in L.
Output: The distance of each node i from the
end of the list.
List ranking
List ranking can be solved in O(n) time
sequentially for a list of length n.
•Hence, a work-optimal parallel algorithm
should do only O(n) work.
A simple list ranking algorithm
Output: For each 1 ≤ i ≤ n, the distance R(i) of node i from the end of the list.
begin
  for 1 ≤ i ≤ n do in parallel
    if S(i) ≠ 0 then R(i) := 1
    else R(i) := 0
  endfor
  for 1 ≤ i ≤ n do in parallel
    while S(i) ≠ 0 and S(S(i)) ≠ 0 do
      Set R(i) := R(i) + R(S(i))
      Set S(i) := S(S(i))
    endwhile
  endfor
end
A simple list ranking algorithm
• At the start of an iteration of the while loop,
R(i) counts the nodes in a sublist starting at i
(a subset of nodes which are adjacent in the
list).
A simple list ranking algorithm
• After the iteration, R(i) counts the nodes in a
sublist of double the size.
• When the while loop terminates, R(i) counts all the nodes from i to the end of the list.
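For reference, here is a synchronous sequential simulation of this pointer-jumping algorithm (a standard Wyllie-style variant). The 1-based indexing, the convention S[i] == 0 for the tail, and the unused S[0] slot are assumptions matching the pseudocode above:

def list_rank(S):
    n = len(S) - 1                    # nodes are 1..n; S[0] is unused
    S = list(S)                       # copy, since successors are overwritten
    R = [0] * (n + 1)
    for i in range(1, n + 1):
        R[i] = 1 if S[i] != 0 else 0
    for _ in range(max(1, n.bit_length())):   # ~log2 n synchronous rounds
        newR, newS = list(R), list(S)
        for i in range(1, n + 1):             # one PRAM step, all i at once
            if S[i] != 0:
                newR[i] = R[i] + R[S[i]]      # splice in successor's count
                newS[i] = S[S[i]]             # pointer jumping
        R, S = newR, newS
    return R[1:]

For the 3-node list 1 -> 2 -> 3, list_rank([0, 2, 3, 0]) returns [2, 1, 0].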
Complexity and model
• The algorithm terminates after O(log n)
iterations of the while loop.
• The work complexity is O(n log n) since we
allocate one processor for each node.
• We need the CREW PRAM model since
several nodes may try to read the same
successor (S) values.
Complexity and model
Exercise:
Modify the algorithm to run on the EREW
PRAM with the same time and processor
complexities.
The strategy for an optimal
algorithm
• Our aim is to modify the simple algorithm so
that it does optimal O(n) work.
• The best algorithm would be the one which
does O(n) work and takes O(log n) time.
• There is an algorithm meeting these criteria; however, the algorithm and its analysis are very involved.
The strategy for an optimal
algorithm
• We will study an algorithm which does O(n)
work and takes O(log n loglog n) time.
• However, we will later use the optimal algorithm for designing other algorithms.
The strategy for an optimal
algorithm
1. Shrink the initial list L by removing some of
the nodes.
The modified list should have O(n / log n)
nodes.
2. Apply the pointer jumping technique (the
suboptimal algorithm) on the list with O(n /
log n) nodes.
The strategy for an optimal
algorithm
3. Restore the original list and rank all the
nodes removed in Step 1.
The important step is Step 1. We need to choose a subset of nodes for removal.
Independent sets
Definition
A set I of nodes is independent if whenever i ∈ I, S(i) ∉ I.
(In the original slide, a figure shows the blue nodes forming an independent set in a list.)
Independent sets
• The main task is to pick an independent set
correctly.
• We pick an independent set by first coloring
the nodes of the list by two colors.
2-coloring the nodes of a list
Definition: A k-coloring of a graph G is a mapping c : V → {0, 1, ..., k − 1} such that c(i) ≠ c(j) if (i, j) ∈ E.
• It is very easy to design an O(n) time
sequential algorithm for 2-coloring the nodes of
a linked list.
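A minimal sketch of that sequential pass (assuming, beyond the slides, that the head of the list is known; S[i] is the successor of node i and 0 marks the end):

def two_color(S, head):
    color, c, i = {}, 0, head
    while i != 0:
        color[i] = c          # adjacent nodes receive alternating colors
        c = 1 - c
        i = S[i]
    return color

For the list 1 -> 2 -> 3 used in the earlier list-ranking example, two_color([0, 2, 3, 0], head=1) returns {1: 0, 2: 1, 3: 0}.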
2-coloring the nodes of a list
• We will assume the following result:
Theorem: A linked list with n nodes can be 2-
colored in O(log n) time and O(n) work.
Independent sets
• When we 2-color the nodes of a list,
alternate nodes get the same color.
• Hence, we can remove the nodes of the
same color to reduce the size of the
original list from n to n/2.
Independent sets
• However, we need a list of size n/log n to run our pointer jumping algorithm for list ranking.
• If we repeat the halving process log log n times, we will reduce the size of the list to n / 2^(log log n), i.e., to n / log n.
Preserving the information
• When we reduce the size of the list to n/log n, we have lost a lot of information, because the removed nodes are no longer present in the list.
• Hence, we have to put back the removed nodes in their original positions to correctly compute the ranks of all the nodes in the list.
Preserving the information
• Note that we have removed the nodes in
O(log log n) iterations.
• So, we have to replace the nodes also in
O(log log n) iterations.