Steps in Creating a Parallel Program
4 steps:
Decomposition, Assignment, Orchestration, Mapping + Scheduling
• Done by programmer or system software (compiler, runtime, ...)
• Issues are the same, so assume programmer does it all explicitly
[Figure: a sequential computation is partitioned into tasks (Decomposition), the tasks are grouped into processes p0–p3 (Assignment), the processes are coordinated into a parallel program (Orchestration), and the program is placed onto processors P0–P3 (Mapping) — the path from sequential computation through a parallel algorithm to a running parallel program.]
Programming Level
Concurrent Processing
The challenges of concurrent processing can be attributed to issues such as:
• level of concurrency,
• computational granularity,
• time and space complexity,
• communication latencies, and
• scheduling and load balancing.
Programming Level
Concurrent Processing
Independence among segments of a program is
a necessary condition to execute them
concurrently.
In general, two independent segments can be
executed in any order without affecting each
other — a segment can be an instruction or a
sequence of instructions.
Programming Level
Concurrent Processing
A dependence graph is used to determine the
dependence relations among the program
segments.
Programming Level
Concurrent Processing
Dependence Graph — A dependence graph is a
directed graph G ≡ G(N, A) in which the set of
nodes (N) represents the program segments and
the set of directed arcs (A) shows the order of
dependence among the segments.
Programming Level
Concurrent Processing
Dependence Graph
• Dependence comes in various forms and kinds:
− Data dependence
− Control dependence
− Resource dependence
Programming Level
Concurrent Processing
Data Dependence: If an instruction uses a
value produced by a previous instruction, then
the second instruction is data dependent on the
first instruction.
Data Dependence comes in different forms:
Programming Level
Data dependence
Flow dependence: At least one output of S1 is an input
of S2 (Read-After-Write: RAW).
Anti-dependence: The output of S2 overlaps the
input of S1 (Write-After-Read: WAR).
Output dependence: S1 and S2 write to the same location
(Write-After-Write: WAW).
[Diagram: the arrow notation used between S1 and S2 for flow, anti-, and output dependence.]
Programming Level
Data dependence
I/O dependence: The same file is referred to by
both I/O statements.
Unknown dependence: The dependence relation
cannot be determined.
Programming Level
Example
Assume the following sequence of instructions:
• S1: R1 ← (A)
• S2: R2 ← (R1) + (R2)
• S3: R1 ← (R3)
• S4: B ← (R1)
Programming Level
Example
[Dependence graph for S1–S4: flow dependence S1 → S2 and S3 → S4 (through R1), anti-dependence S2 → S3 (R1), and output dependence S1 → S3 (R1).]
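These edges can be derived mechanically from per-segment read/write sets. A minimal Python sketch follows (a simplified pairwise check that ignores intervening redefinitions; the read/write sets are transcribed from S1–S4 above):

# Derive flow (RAW), anti (WAR), and output (WAW) dependences from read/write sets.
segs = {
    "S1": {"reads": {"A"},        "writes": {"R1"}},
    "S2": {"reads": {"R1", "R2"}, "writes": {"R2"}},
    "S3": {"reads": {"R3"},       "writes": {"R1"}},
    "S4": {"reads": {"R1"},       "writes": {"B"}},
}
order = ["S1", "S2", "S3", "S4"]
for i, a in enumerate(order):
    for b in order[i + 1:]:
        if segs[a]["writes"] & segs[b]["reads"]:
            print(f"flow   {a} -> {b}")   # RAW
        if segs[a]["reads"] & segs[b]["writes"]:
            print(f"anti   {a} -> {b}")   # WAR
        if segs[a]["writes"] & segs[b]["writes"]:
            print(f"output {a} -> {b}")   # WAW
# Note: a purely pairwise check also reports flow S1 -> S4,
# even though S3's redefinition of R1 kills that dependence.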
Programming Level
Control dependence
The order of execution is determined during
run-time — Conditional statements.
Control dependence could also exist between
operations performed in successive iterations of
a loop:
Do I = 1, N
If (A(I-1) = 0) then A(I) = 0
End
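A minimal Python rendering of that loop (A and N are assumed to be defined elsewhere, indexing simplified): whether iteration i writes A(i) depends on a value the previous iteration may have just written, so the branch cannot be resolved before run time.

# Cross-iteration control dependence: the test in iteration i reads A[i-1],
# which iteration i-1 may itself have set to 0.
for i in range(1, N):
    if A[i - 1] == 0:
        A[i] = 0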
Programming Level
Control dependence
Control dependence often does not allow
efficient exploitation of parallelism.
Programming Level
Resource dependence
Conflict in using shared resources, such as
concurrent requests for the same functional unit.
A resource conflict arises when two instructions
attempt to use the same resource at the same
time.
Programming Level
Resource dependence
Within the scope of resource dependence,
we can talk about storage dependence,
ALU dependence, ...
Programming Level
Question
What is “true dependence”?
What is “false dependence”?
Programming Level
Concurrent Processing
Bernstein's Conditions
• Let Ii and Oi be the input and output sets of process
Pi, respectively. The two processes P1 and P2 can be
executed in parallel (P1 || P2) iff:
I1 ∩ O2 = ∅ (WAR)
I2 ∩ O1 = ∅ (RAW)
O1 ∩ O2 = ∅ (WAW)
Programming Level
Concurrent Processing
Bernstein's Conditions
• In general, P1, P2, ..., Pk can be executed in parallel if
Bernstein's conditions hold for every pair of
processes:
P1 || P2 || P3 ... || Pk iff
Pi || Pj ∀ i ≠ j
Programming Level
Concurrent Processing
Bernstein's Conditions
• The parallelism relation (||) is commutative:
Pi || Pj ⇒ Pj || Pi
• The parallelism relation (||) is not transitive:
Pi || Pj and Pj || Pk
do not necessarily guarantee Pi || Pk
Programming Level
Concurrent Processing
Bernstein's Conditions
• The parallelism relation (||) is not an equivalence relation.
• The parallelism relation (||) is associative:
Pi || Pj || Pk ⇒ (Pi || Pj) || Pk = Pi || (Pj || Pk)
Programming Level
Concurrent Processing
Bernstein's Conditions — Example
• Detect parallelism in the following program, assume
a uniform execution time:
P1: C = D * E
P2: M = G + C
P3: A = B + C
P4: C = L + M
P5: F = G / E
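A small Python sketch that applies Bernstein's conditions pairwise to P1–P5, using the input and output sets read off the five statements above:

# Input (read) and output (write) sets of P1..P5.
I = {"P1": {"D", "E"}, "P2": {"G", "C"}, "P3": {"B", "C"},
     "P4": {"L", "M"}, "P5": {"G", "E"}}
O = {"P1": {"C"}, "P2": {"M"}, "P3": {"A"}, "P4": {"C"}, "P5": {"F"}}

def parallel(a, b):
    # Bernstein: Ia ∩ Ob = ∅, Ib ∩ Oa = ∅, Oa ∩ Ob = ∅
    return not (I[a] & O[b]) and not (I[b] & O[a]) and not (O[a] & O[b])

names = sorted(I)
for i, a in enumerate(names):
    for b in names[i + 1:]:
        if parallel(a, b):
            print(a, "||", b)
# Prints: P1 || P5, P2 || P3, P2 || P5, P3 || P5, P4 || P5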
Programming Level
Concurrent Processing — Bernstein's Condition
[Dataflow graph of P1–P5: the multiply produces C from D and E, one addition produces M from G and C, another produces A from B and C, a third produces the new C from L and M, and the divide produces F from G and E.]
Programming Level
Concurrent Processing — Bernstein's Condition
Example: If two adders are available then,
[Execution graph with two adders: the multiply and the divide execute in step 1, the two independent additions in step 2, and the remaining addition in step 3.]
Programming Level
Concurrent Processing
Bernstein's Conditions — Example
[Dependence graph among P1–P5, with data-dependence edges (through C and M) and resource-dependence edges between operations competing for the same functional unit.]
Programming Level
Concurrent Processing
Hardware parallelism refers to the type and
degree of parallelism defined by the architecture
and hardware multiplicity — a k-issue
processor is a processor with the hardware
capability to issue k instructions per
machine cycle.
Programming Level
Concurrent Processing
Software parallelism is defined by the control
and data dependence of programs. It is a
function of algorithm, programming style, and
compiler optimization.
Programming Level
Concurrent Processing
Hardware vs. software parallelism
N = (A * B) + (C * D)
M = (A * B) - (C * D)
Programming Level
Concurrent Processing
Hardware vs. software parallelism
• In machine code we have:
Load R1, A
Load R2, B
Load R3, C
Load R4, D
Mult Rx, R1, R2
Mult Ry, R3, R4
Add R1, Rx, Ry
Sub R2, Rx, Ry
Store N, R1
Store M, R2
Programming Level
Concurrent Processing
Hardware vs. software parallelism
• A machine which allows parallel multiplications,
parallel addition/subtraction, and simultaneous load/store
operations gives an average software parallelism of
10/4 = 2.5 instructions per cycle.
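A rough check of the 2.5 figure (a sketch, assuming an idealized machine that can issue any number of independent instructions per cycle; the grouping mirrors the dataflow graph on the next slide):

# Group the ten instructions into the four cycles an ideal machine could use.
cycles = [
    ["Load R1,A", "Load R2,B", "Load R3,C", "Load R4,D"],  # independent loads
    ["Mult Rx,R1,R2", "Mult Ry,R3,R4"],                     # both products
    ["Add R1,Rx,Ry", "Sub R2,Rx,Ry"],                       # N and M
    ["Store N,R1", "Store M,R2"],                           # independent stores
]
print(sum(len(c) for c in cycles) / len(cycles))            # 10 / 4 = 2.5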
Programming Level
Concurrent Processing
Hardware vs. software parallelism
[Dataflow graph under ideal hardware: the four loads issue together, then the two multiplies (*1, *2), then the add and subtract, then the two stores (Store N, Store M) — 10 instructions in 4 cycles.]
Programming Level
Concurrent Processing
Hardware vs. software parallelism
• For a machine which does not allow simultaneous
Load/Store and arithmetic operations, we have an
average software parallelism of 10/8 = 1.25
instructions per cycle.
Programming Level
Concurrent Processing
Hardware vs. software parallelism
[Execution graph for such a machine: the same 10 instructions — four loads, two multiplies (*1, *2), the add and subtract, and the two stores — now stretch over 8 cycles.]
Programming Level
Concurrent Processing
Hardware vs. software parallelism
• Now assume a multiprocessor composed of two
processors:
Programming Level
Concurrent Processing
Hardware vs. software parallelism
[Execution on the two-processor machine: the computation is split between the two processors, and extra Store K, Store L, Load K, Load L instructions are added purely for inter-processor communication of intermediate results before the final add/subtract and the stores of N and M.]
Programming Level
Concurrent Processing
Compilation support is one way to solve the
mismatch between software and hardware
parallelism.
A suitable compiler could exploit hardware
features in order to improve performance.
Programming Level
Summary
Dependence Graph
Different types of dependencies
Bernstein's Conditions
Hardware Parallelism
Software Parallelism
Programming Level
Concurrent Processing
Detection of concurrency in a program at the
instruction level using techniques like Bernstein's
is not practical, especially for large
programs.
So we will look at detection of concurrency at a
higher level. This brings us to the issues of
partitioning, scheduling, and load balancing.
Programming Level
Concurrent Processing
Two issues of concern:
• How can we partition a program into concurrent
branches, program modules, or grains to yield the
shortest execution time, and
• What is the optimal size of concurrent grains in a
computation?
Programming Level
Concurrent Processing
Partitioning is defined as the ability to partition
a program into subprograms that can be
executed in parallel.
Within the scope of partitioning, two major
issues are of concern:
• Grain Size, and
• Latency
Programming Level
Partitioning and Scheduling
Grain Size
• Granularity or grain size is a measure of the amount
of computation involved in a process — It
determines the basic program segment chosen for
parallel processing.
• Grain sizes are commonly named as:
− fine,
− medium, and
− coarse.
Programming Level
Partitioning and Scheduling
Latency
• Latency imposes a limiting factor on the scalability
of the machine size.
• Communication latency — inter-processor
communication — is a major factor of concern to a
system designer.
Programming Level
Partitioning and Scheduling
• In general, n tasks communicating with each
other may require n (n-1) / 2 communication
links among them.
• This leads to a communication bound which
limits the number of processors allowed in a
computer system.
Programming Level
Partitioning and Scheduling
Parallelism can be exploited at various levels:
• Job or Program
• Subprogram
• Procedure, Task, or Subroutine
• Loops or Iteration
• Instruction or Statement
Programming Level
Partitioning and Scheduling
The lower the level, the finer the granularity.
The finer the granularity, the higher the
communication and scheduling overheads.
The finer the granularity, the higher the degree
of parallelism.
Programming Level
Partitioning and Scheduling
Instruction and loop levels represent fine grain
size.
Procedure and subprogram levels represent
medium grain size, and
Job and subprogram levels represent coarse
grain size.
Programming Level
Partitioning and Scheduling
Instruction Level
• An instruction level granularity represents a grain
size consisting of up to 20 instructions.
• This level offers a high degree of parallelism in
common programs.
• It is expected that parallelism at this level will be
exploited automatically by the compiler.
Programming Level
Partitioning and Scheduling
Loop Level
• Here we are typically concerned with iterative
loops of fewer than 500 instructions.
• At this level, one can distinguish two classes of
loop:
− Loops with independent iterations and,
− Loops with dependent iterations.
Programming Level
Partitioning and Scheduling
Procedure Level
• A typical grain at this level contains fewer than 2,000
instructions.
• The communication requirement and penalty at this
level are lower than at the fine-grain
levels, at the expense of greater complexity in the
detection of parallelism — inter-procedural
dependence.
Programming Level
Partitioning and Scheduling
Subprogram Level
• Multiprogramming on a uni-processor or on a
multiprocessor platform represents this level.
• In the past, parallelism at this level was
exploited by the programmers or algorithm
designers rather than by compilers.
Programming Level
Partitioning and Scheduling
Job Level
• This level corresponds to the parallel execution of
independent jobs (programs) on concurrent
computers.
• Supercomputers with a small number of powerful
processors are the best platform for this level of
parallelism. In general, parallelism at this level is
exploitable by the program loader and the operating
system.
Programming Level
Grain Packing
Let us look at the following example to
illustrate the effect of grain size on
performance.
In the following discussion, we make
reference to the term program graph.
Programming Level
Grain Packing
A program graph is a dependence graph in
which:
• Each operation is labeled as (n, s) where n is the
node identifier and s is the execution time of the
node.
• Each edge is labeled as (v, d) where v is the edge
identifier and d is the communication delay.
Consider the following program graph:
Programming Level
Grain Packing — Fine grain size
[Fine-grain program graph: 17 nodes — nodes 1–6 with execution time 1, nodes 7–17 with execution time 2 — connected by edges a–p carrying communication delays of 3, 4, or 6 cycles.]
Programming Level
Grain Packing — Fine grain size
Nodes 1-6 are all memory references
• 1 cycle to calculate address, and
• 6 cycles to fetch data from memory.
Other nodes are CPU operations requiring 2
cycles each.
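A minimal Python representation of such a program graph (the node times and edge delays loosely follow the figure above, but the connections shown here are illustrative only):

# node id -> execution time; edge id -> (source, destination, communication delay)
nodes = {1: 1, 2: 1, 7: 2}
edges = {"a": (1, 7, 6), "b": (2, 7, 6)}

def path_time(path):
    """Execution plus communication time along a path of node ids."""
    total = sum(nodes[n] for n in path)
    for u, v in zip(path, path[1:]):
        total += next(d for (s, t, d) in edges.values() if (s, t) == (u, v))
    return total

print(path_time([1, 7]))  # 1 + 6 + 2 = 9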
Programming Level
Grain Packing — Fine grain size
The idea behind grain packing is to:
• apply fine-grain partitioning first to achieve a higher
degree of parallelism, and then
• combine multiple fine-grain nodes into a coarse-
grain node wherever doing so eliminates unnecessary
communication delays or reduces the overall cost.
Programming Level
Grain Packing — Partitioning
[Figure: the fine-grain graph above partitioned into five coarse-grain groups A, B, C, D, and E.]
Programming Level
Grain Packing — Coarse Grain
[Coarse-grain program graph: node A (execution time 8), B (4), C (4), D (6), and E (6), connected by the surviving inter-node edges and their communication delays.]
Programming Level
Grain Packing — Scheduling Fine Grain Size
[Gantt chart: the 17 fine-grain nodes scheduled on processors P1 and P2, interleaving busy, communication, and idle periods; the schedule completes at time 42.]
Programming Level
Grain Packing — Scheduling Coarse Grain Size
[Gantt chart: the coarse-grain nodes scheduled on P1 and P2 (A, C, D, E on P1; B on P2); the schedule completes at time 38.]
Programming Level
Grain Packing
As can be noted, through grain packing we
were able to reduce the overall execution time
by reducing the communication overhead.
The concept of grain packing can be recursively
applied.
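A sketch of the packing step itself (Python; pack is a hypothetical helper, not from the slides): merging a group of fine-grain nodes adds up their execution times and drops the communication edges that become internal to the new coarse node.

def pack(nodes, edges, group, new_id):
    """Merge the node ids in `group` into one coarse node `new_id`.
    nodes: {id: time}; edges: list of (src, dst, delay) tuples."""
    packed_nodes = {n: t for n, t in nodes.items() if n not in group}
    packed_nodes[new_id] = sum(nodes[n] for n in group)
    packed_edges = []
    for src, dst, delay in edges:
        s = new_id if src in group else src
        d = new_id if dst in group else dst
        if s != d:                              # an edge wholly inside the group
            packed_edges.append((s, d, delay))  # disappears: saved communication
    return packed_nodes, packed_edges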
Programming Level
Task Duplication
Task duplication is another way to reduce the
communication overhead and hence the
execution time.
Consider the following program graph:
Programming Level
Task Duplication
[Program graph: node A (execution time 4), B (1), C (1), D (2), and E (2), connected by edges with communication delays ranging from 1 to 8.]
Programming Level
Task Duplication
[Gantt chart without duplication: the schedule on P1 and P2 contains long communication and idle periods and completes at time 27.]
Programming Level
Task Duplication
Now let us duplicate tasks A and C:
[Program graph after duplication: copies A′ and C′ let each processor hold A and C locally, replacing the long inter-processor edges with short local ones.]
Programming Level
Task Duplication
[Gantt chart with A and C duplicated: the schedule on P1 and P2 now completes at time 14, with far less communication and idle time.]
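A back-of-the-envelope rule behind this example (a sketch; worth_duplicating is a hypothetical helper): duplicate a predecessor on the consumer's processor whenever re-executing it locally costs less than waiting for its result to arrive.

def worth_duplicating(exec_time, comm_delay):
    # Re-running the task locally beats paying the communication delay.
    return exec_time < comm_delay

# In the example above, A takes 4 time units while shipping its result to the
# other processor costs 8, so duplicating A (and, by the same reasoning, C) pays off.
print(worth_duplicating(4, 8))  # True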
Programming Level
Summary
In partitioning a task into subtasks two issues
must be taken into consideration:
• Grain Size, and
• Latency
Program Graph
Grain Packing
Task duplication
Programming Level
Loop Scheduling
Loops are the largest source of parallelism in a
program. Therefore, there is a need to pay
attention to the partitioning and allocation of
loop iterations among processors in a parallel
environment.
Programming Level
Loop Scheduling
Practically we can talk about three classes of
loops:
• Sequential Loops,
• Doall Loops — vector (parallel) loops,
• Doacross Loops — loops with an intermediate degree
of parallelism.
Programming Level
Loop Scheduling — Doall Loops
Static Scheduling
Dynamic Scheduling
Programming Level
Loop Scheduling — Doall Loops
Static Scheduling schemes assign a fixed
number of iterations to each processor:
• Block Scheduling (Static chunking),
• Cyclic Scheduling
In practice, cyclic scheduling offers better
load balancing than block scheduling if the
computation performed by each iteration varies
significantly.
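A quick Python sketch of the two static schemes for n iterations on p processors (0-based processor ids assumed):

def block(n, p, proc):
    """Block scheduling (static chunking): one contiguous chunk per processor."""
    size = (n + p - 1) // p                       # ceiling of n / p
    return list(range(proc * size, min((proc + 1) * size, n)))

def cyclic(n, p, proc):
    """Cyclic scheduling: iterations dealt out round-robin, one at a time."""
    return list(range(proc, n, p))

print(block(10, 3, 0))   # [0, 1, 2, 3]
print(cyclic(10, 3, 0))  # [0, 3, 6, 9]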
Programming Level
Loop Scheduling — Doall Loops
Dynamic scheduling schemes have been
proposed to respond to the workload imbalance
of the static scheduling schemes:
• Self Scheduling Scheme
• Fixed size chunking Scheme
• Guided Self Scheduling Scheme
• Factoring Scheme
• Trapezoid Self Scheduling Scheme
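As a sketch of one such scheme, guided self-scheduling hands each idle processor a chunk of ⌈remaining/p⌉ iterations, so chunks shrink as the loop drains (a Python sketch; the other schemes differ mainly in the chunk-size rule):

import math

def gss_chunks(n, p):
    """Sequence of chunk sizes issued by guided self-scheduling."""
    remaining, chunks = n, []
    while remaining > 0:
        chunk = math.ceil(remaining / p)
        chunks.append(chunk)
        remaining -= chunk
    return chunks

print(gss_chunks(100, 4))
# [25, 19, 14, 11, 8, 6, 5, 3, 3, 2, 1, 1, 1, 1]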
Programming Level
Loop Scheduling
Practice has shown that the loss of parallelism
after serializing Doacross loops is very
significant — Chen & Yew (1991).
There is a need to develop good schemes for the
parallel execution of Doacross loops.
Programming Level
Loop Scheduling — Doacross Loops
In a Doacross loop, iterations may be either data
or control dependent on each other:
• Control dependence is caused by conditional
statements.
• Data dependence appears in the form of sharing
computational results. Data dependence can be
either lexically forward or lexically
backward.
Programming Level
Loop Scheduling — Doacross Loops
Doacross loops can be either:
• regular (dependence distance is fixed) or
• irregular (dependence distance varies from iteration
to iteration).
Regular Doacross loops are more amenable to
parallel execution than irregular loops.
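A tiny example of a regular Doacross loop (Python; arrays A, B and bound N assumed defined elsewhere): every iteration needs the value written two iterations earlier, so the dependence distance is fixed at 2 and iterations can be overlapped with a lag of two.

d = 2                       # fixed dependence distance: a regular Doacross loop
for i in range(d, N):
    A[i] = A[i - d] + B[i]  # cross-iteration flow dependence A[i-2] -> A[i]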
Programming Level
Loop Scheduling — Doacross Loops
DOACROSS Model — Cytron 1986
Pre-synchronization Scheduling — Krothapalli
& Sadayappan 1990
Staggered Distribution Scheme — Lim et al.,
1992
Programming Level
Loop Scheduling
DOACROSS Model — Example
[Timing diagram: iterations i1–i8 of a Doacross loop executed on processors P1–P3, with each iteration's start staggered until the data it needs from the preceding iteration is available (time axis 0–38).]
Programming Level
Loop Scheduling
The DOACROSS model was intended to model the
execution of sequential loops, vector loops,
and loops with intermediate parallelism by
considering the:
• control dependencies, and
• data dependencies.
Programming Level
Loop Scheduling
Control dependencies are caused by conditional
and unconditional branches.
Data dependencies are due to the lexically
forward and lexically backward dependencies.
This simple model, however, does not consider
the effect of inter-processor communication
cost.
Programming Level
Partitioning and Scheduling
A proper allocation scheme should partition a
task (program graph) into subtasks (sub-graphs)
and distribute the subtasks among processors
in an attempt to minimize:
• processor contentions, and
• inter-processor communications.
Programming Level
Partitioning and Scheduling
There are two main allocation schemes:
• Static, and
• Dynamic.