Steps in Creating a Parallel Program
4 steps:
Decomposition, Assignment, Orchestration, Mapping + Scheduling
• Done by programmer or system software (compiler, runtime, ...)
• Issues are the same, so assume programmer does it all explicitly
[Figure: a sequential computation is partitioned into tasks (Decomposition), the tasks are grouped into processes p0–p3 (Assignment), the processes are coordinated into a parallel program (Orchestration), and the program is placed onto processors P0–P3 (Mapping) — the path from sequential computation through a parallel algorithm to a running parallel program.]
Programming Level
Concurrent Processing
The challenges of concurrent processing can be attributed to issues such as:
• level of concurrency,
• computational granularity,
• time and space complexity,
• communication latencies, and
• scheduling and load balancing.
Programming Level
Concurrent Processing
Independence among segments of a program is
a necessary condition to execute them
concurrently.
In general, two independent segments can be
executed in any order without affecting each
other — a segment can be an instruction or a
sequence of instructions.
Programming Level
Concurrent Processing
A dependence graph is used to determine the
dependence relations among the program
segments.
Programming Level
Concurrent Processing
Dependence Graph — A dependence graph is a
directed graph G ≡ G(N, A) in which the set of
nodes (N) represents the program segments and
the set of directed arcs (A) shows the order of
dependence among the segments.
Programming Level
Concurrent Processing
Dependence Graph
• Dependence comes in various forms and kinds:
− Data dependence
− Control dependence
− Resource dependence
Programming Level
Concurrent Processing
Data Dependence: If an instruction uses a
value produced by a previous instruction, then
the second instruction is data dependent on the
first instruction.
Data Dependence comes in different forms:
Programming Level
Data dependence
Flow dependence: At least one output of S1 is an input
of S2 (Read-After-Write: RAW).
Anti-dependence: The output of S2 overlaps the
input of S1 (Write-After-Read: WAR).
Output dependence: S1 and S2 write to the same location
(Write-After-Write: WAW).
[Diagram: the arrow notation used between S1 and S2 for flow, anti-, and output dependence.]
Programming Level
Data dependence
I/O dependence: The same file is referred to by
both I/O statements.
Unknown dependence: The dependence relation
cannot be determined.
Programming Level
Example
Assume the following sequence of instructions:
• S1: R1 ← (A)
• S2: R2 ← (R1) + (R2)
• S3: R1 ← (R3)
• S4: B ← (R1)
Programming Level
Example
[Dependence graph for S1–S4: flow dependence S1 → S2 and S3 → S4 (through R1), anti-dependence S2 → S3 (R1), and output dependence S1 → S3 (R1).]
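These edges can be derived mechanically from per-segment read/write sets. A minimal Python sketch follows (a simplified pairwise check that ignores intervening redefinitions; the read/write sets are transcribed from S1–S4 above):

# Derive flow (RAW), anti (WAR), and output (WAW) dependences from read/write sets.
segs = {
    "S1": {"reads": {"A"},        "writes": {"R1"}},
    "S2": {"reads": {"R1", "R2"}, "writes": {"R2"}},
    "S3": {"reads": {"R3"},       "writes": {"R1"}},
    "S4": {"reads": {"R1"},       "writes": {"B"}},
}
order = ["S1", "S2", "S3", "S4"]
for i, a in enumerate(order):
    for b in order[i + 1:]:
        if segs[a]["writes"] & segs[b]["reads"]:
            print(f"flow   {a} -> {b}")   # RAW
        if segs[a]["reads"] & segs[b]["writes"]:
            print(f"anti   {a} -> {b}")   # WAR
        if segs[a]["writes"] & segs[b]["writes"]:
            print(f"output {a} -> {b}")   # WAW
# Note: a purely pairwise check also reports flow S1 -> S4,
# even though S3's redefinition of R1 kills that dependence.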
Programming Level
Control dependence
The order of execution is determined during
run-time — Conditional statements.
Control dependence could also exist between
operations performed in successive iterations of
a loop:
Do I = 1, N
If (A(I-1) = 0) then A(I) = 0
End
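A minimal Python rendering of that loop (A and N are assumed to be defined elsewhere, indexing simplified): whether iteration i writes A(i) depends on a value the previous iteration may have just written, so the branch cannot be resolved before run time.

# Cross-iteration control dependence: the test in iteration i reads A[i-1],
# which iteration i-1 may itself have set to 0.
for i in range(1, N):
    if A[i - 1] == 0:
        A[i] = 0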
Programming Level
Control dependence
Control dependence often does not allow
efficient exploitation of parallelism.
Programming Level
Resource dependence
Conflict in using shared resources, such as
concurrent requests for the same functional unit.
A resource conflict arises when two instructions
attempt to use the same resource at the same
time.
Programming Level
Resource dependence
Within the scope of resource dependence,
we can talk about storage dependence,
ALU dependence, ...
Programming Level
Question
What is “true dependence”?
What is “false dependence”?
Programming Level
Concurrent Processing
Bernstein's Conditions
• Let Ii and Oi be the input and output sets of process
Pi, respectively. The two processes P1 and P2 can be
executed in parallel (P1 || P2) iff:
I1 ∩ O2 = ∅ (WAR)
I2 ∩ O1 = ∅ (RAW)
O1 ∩ O2 = ∅ (WAW)
Programming Level
Concurrent Processing
Bernstein's Conditions
• In general, P1, P2, ..., Pk can be executed in parallel if
Bernstein's conditions hold for every pair of
processes:
P1 || P2 || P3 ... || Pk iff
Pi || Pj ∀ i ≠ j
Programming Level
Concurrent Processing
Bernstein's Conditions
• The parallelism relation (||) is commutative:
Pi || Pj ⇒ Pj || Pi
• The parallelism relation (||) is not transitive:
Pi || Pj and Pj || Pk
do not necessarily guarantee Pi || Pk
Programming Level
Concurrent Processing
Bernstein's Conditions
• The parallelism relation (||) is not an equivalence relation.
• The parallelism relation (||) is associative:
Pi || Pj || Pk ⇒ (Pi || Pj) || Pk = Pi || (Pj || Pk)
Programming Level
Concurrent Processing
Bernstein's Conditions — Example
• Detect parallelism in the following program, assume
a uniform execution time:
P1: C = D * E
P2: M = G + C
P3: A = B + C
P4: C = L + M
P5: F = G / E
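A small Python sketch that applies Bernstein's conditions pairwise to P1–P5, using the input and output sets read off the five statements above:

# Input (read) and output (write) sets of P1..P5.
I = {"P1": {"D", "E"}, "P2": {"G", "C"}, "P3": {"B", "C"},
     "P4": {"L", "M"}, "P5": {"G", "E"}}
O = {"P1": {"C"}, "P2": {"M"}, "P3": {"A"}, "P4": {"C"}, "P5": {"F"}}

def parallel(a, b):
    # Bernstein: Ia ∩ Ob = ∅, Ib ∩ Oa = ∅, Oa ∩ Ob = ∅
    return not (I[a] & O[b]) and not (I[b] & O[a]) and not (O[a] & O[b])

names = sorted(I)
for i, a in enumerate(names):
    for b in names[i + 1:]:
        if parallel(a, b):
            print(a, "||", b)
# Prints: P1 || P5, P2 || P3, P2 || P5, P3 || P5, P4 || P5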
Programming Level
Concurrent Processing — Bernstein's Condition
[Dataflow graph of P1–P5: the multiply produces C from D and E, one addition produces M from G and C, another produces A from B and C, a third produces the new C from L and M, and the divide produces F from G and E.]
Programming Level
Concurrent Processing — Bernstein's Condition
Example: If two adders are available then,
[Execution graph with two adders: the multiply and the divide execute in step 1, the two independent additions in step 2, and the remaining addition in step 3.]
Programming Level
Concurrent Processing
Bernstein's Conditions — Example
[Dependence graph among P1–P5, with data-dependence edges (through C and M) and resource-dependence edges between operations competing for the same functional unit.]
Programming Level
Concurrent Processing
Hardware parallelism refers to the type and
degree of parallelism defined by the architecture
and hardware multiplicity — a k-issue
processor is a processor with the hardware
capability to issue k instructions per
machine cycle.
Programming Level
Concurrent Processing
Software parallelism is defined by the control
and data dependence of programs. It is a
function of algorithm, programming style, and
compiler optimization.
Programming Level
Concurrent Processing
Hardware vs. software parallelism
N = (A * B) + (C * D)
M = (A * B) - (C * D)
Programming Level
Concurrent Processing
Hardware vs. software parallelism
• In machine code we have:
Load R1, A
Load R2, B
Load R3, C
Load R4, D
Mult Rx, R1, R2
Mult Ry, R3, R4
Add R1, Rx, Ry
Sub R2, Rx, Ry
Store N, R1
Store M, R2
Programming Level
Concurrent Processing
Hardware vs. software parallelism
• A machine which allows parallel multiplications,
parallel addition/subtraction, and simultaneous load/store
operations gives an average software parallelism of
10/4 = 2.5 instructions per cycle.
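A rough check of the 2.5 figure (a sketch, assuming an idealized machine that can issue any number of independent instructions per cycle; the grouping mirrors the dataflow graph on the next slide):

# Group the ten instructions into the four cycles an ideal machine could use.
cycles = [
    ["Load R1,A", "Load R2,B", "Load R3,C", "Load R4,D"],  # independent loads
    ["Mult Rx,R1,R2", "Mult Ry,R3,R4"],                     # both products
    ["Add R1,Rx,Ry", "Sub R2,Rx,Ry"],                       # N and M
    ["Store N,R1", "Store M,R2"],                           # independent stores
]
print(sum(len(c) for c in cycles) / len(cycles))            # 10 / 4 = 2.5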
Programming Level
Concurrent Processing
Hardware vs. software parallelism
[Dataflow graph under ideal hardware: the four loads issue together, then the two multiplies (*1, *2), then the add and subtract, then the two stores (Store N, Store M) — 10 instructions in 4 cycles.]
Programming Level
Concurrent Processing
Hardware vs. software parallelism
• For a machine which does not allow simultaneous
Load/Store and arithmetic operations, we have an
average software parallelism of 10/8 = 1.25
instructions per cycle.
Programming Level
Concurrent Processing
Hardware vs. software parallelism
[Execution graph for such a machine: the same 10 instructions — four loads, two multiplies (*1, *2), the add and subtract, and the two stores — now stretch over 8 cycles.]
Programming Level
Concurrent Processing
Hardware vs. software parallelism
• Now assume a multiprocessor composed of two
processors:
Programming Level
Concurrent Processing
Hardware vs. software parallelism
[Execution on the two-processor machine: the computation is split between the two processors, and extra Store K, Store L, Load K, Load L instructions are added purely for inter-processor communication of intermediate results before the final add/subtract and the stores of N and M.]
Programming Level
Concurrent Processing
Compilation support is one way to solve the
mismatch between software and hardware
parallelism.
A suitable compiler could exploit hardware
features in order to improve performance.
Programming Level
Summary
Dependence Graph
Different types of dependencies
Bernstein's Conditions
Hardware Parallelism
Software Parallelism
Programming Level
Concurrent Processing
Detection of concurrency in a program at the
instruction level using techniques like Bernstein's
is not practical, especially for large
programs.
So we will look at detection of concurrency at a
higher level. This brings us to the issues of
partitioning, scheduling, and load balancing.
Programming Level
Concurrent Processing
Two issues of concern:
• How can we partition a program into concurrent
branches, program modules, or grains to yield the
shortest execution time, and
• What is the optimal size of concurrent grains in a
computation?
Programming Level
Concurrent Processing
Partitioning is defined as the ability to partition
a program into subprograms that can be
executed in parallel.
Within the scope of partitioning, two major
issues are of concern:
• Grain Size, and
• Latency
Programming Level
Partitioning and Scheduling
Grain Size
• Granularity or grain size is a measure of the amount
of computation involved in a process — It
determines the basic program segment chosen for
parallel processing.
• Grain sizes are commonly named as:
− fine,
− medium, and
− coarse.
Programming Level
Partitioning and Scheduling
Latency
• Latency imposes a limiting factor on the scalability
of the machine size.
• Communication latency — inter-processor
communication — is a major factor of concern to a
system designer.
Programming Level
Partitioning and Scheduling
• In general, n tasks communicating with each
other may require n (n-1) / 2 communication
links among them.
• This leads to a communication bound which
limits the number of processors allowed in a
computer system.
Programming Level
Partitioning and Scheduling
Parallelism can be exploited at various levels:
• Job or Program
• Subprogram
• Procedure, Task, or Subroutine
• Loops or Iteration
• Instruction or Statement
Programming Level
Partitioning and Scheduling
The lower the level, the finer the granularity.
The finer the granularity, the higher the
communication and scheduling overheads.
The finer the granularity, the higher the degree
of parallelism.
Programming Level
Partitioning and Scheduling
Instruction and loop levels represent fine grain
size.
Procedure and subprogram levels represent
medium grain size, and
Job and subprogram levels represent coarse
grain size.
Programming Level
Partitioning and Scheduling
Instruction Level
• An instruction level granularity represents a grain
size consisting of up to 20 instructions.
• This level offers a high degree of parallelism in
common programs.
• It is expected that parallelism at this level will be
exploited automatically by the compiler.
Programming Level
Partitioning and Scheduling
Loop Level
• Here we are typically concerned with iterative
loops of fewer than 500 instructions.
• At this level, one can distinguish two classes of
loop:
− Loops with independent iterations and,
− Loops with dependent iterations.
Programming Level
Partitioning and Scheduling
Procedure Level
• A typical grain at this level contains fewer than 2,000
instructions.
• The communication requirement and penalty at this
level are lower than at the fine-grain
levels, at the expense of greater complexity in the
detection of parallelism — inter-procedural
dependence.
Programming Level
Partitioning and Scheduling
Subprogram Level
• Multiprogramming on a uni-processor or on a
multiprocessor platform represents this level.
• In the past, parallelism at this level was
exploited by the programmers or algorithm
designers rather than by compilers.
Programming Level
Partitioning and Scheduling
Job Level
• This level corresponds to the parallel execution of
independent jobs (programs) on concurrent
computers.
• Supercomputers with a small number of powerful
processors are the best platform for this level of
parallelism. In general, parallelism at this level is
exploitable by the program loader and the operating
system.
Programming Level
Grain Packing
Let us look at the following example to
illustrate the effect of grain size on
performance.
In the following discussion, we make
reference to the term program graph.
Programming Level
Grain Packing
A program graph is a dependence graph in
which:
• Each operation is labeled as (n, s) where n is the
node identifier and s is the execution time of the
node.
• Each edge is labeled as (v, d) where v is the edge
identifier and d is the communication delay.
Consider the following program graph:
Programming Level
Grain Packing — Fine grain size
[Fine-grain program graph: 17 nodes — nodes 1–6 with execution time 1, nodes 7–17 with execution time 2 — connected by edges a–p carrying communication delays of 3, 4, or 6 cycles.]
Programming Level
Grain Packing — Fine grain size
Nodes 1-6 are all memory references
• 1 cycle to calculate address, and
• 6 cycles to fetch data from memory.
Other nodes are CPU operations requiring 2
cycles each.
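A minimal Python representation of such a program graph (the node times and edge delays loosely follow the figure above, but the connections shown here are illustrative only):

# node id -> execution time; edge id -> (source, destination, communication delay)
nodes = {1: 1, 2: 1, 7: 2}
edges = {"a": (1, 7, 6), "b": (2, 7, 6)}

def path_time(path):
    """Execution plus communication time along a path of node ids."""
    total = sum(nodes[n] for n in path)
    for u, v in zip(path, path[1:]):
        total += next(d for (s, t, d) in edges.values() if (s, t) == (u, v))
    return total

print(path_time([1, 7]))  # 1 + 6 + 2 = 9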
Programming Level
Grain Packing — Fine grain size
The idea behind grain packing is to:
• apply fine-grain partitioning first to achieve a higher
degree of parallelism, and then
• combine multiple fine-grain nodes into a coarse-
grain node wherever doing so eliminates unnecessary
communication delays or reduces the overall cost.
Programming Level
Grain Packing — Partitioning
[Figure: the fine-grain graph above partitioned into five coarse-grain groups A, B, C, D, and E.]
Programming Level
Grain Packing — Coarse Grain
[Coarse-grain program graph: node A (execution time 8), B (4), C (4), D (6), and E (6), connected by the surviving inter-node edges and their communication delays.]
Programming Level
Grain Packing — Scheduling Fine Grain Size
[Gantt chart: the 17 fine-grain nodes scheduled on processors P1 and P2, interleaving busy, communication, and idle periods; the schedule completes at time 42.]
Programming Level
Grain Packing — Scheduling Coarse Grain Size
[Gantt chart: the coarse-grain nodes scheduled on P1 and P2 (A, C, D, E on P1; B on P2); the schedule completes at time 38.]
Programming Level
Grain Packing
As can be noted, through grain packing we
were able to reduce the overall execution time
by reducing the communication overhead.
The concept of grain packing can be recursively
applied.
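A sketch of the packing step itself (Python; pack is a hypothetical helper, not from the slides): merging a group of fine-grain nodes adds up their execution times and drops the communication edges that become internal to the new coarse node.

def pack(nodes, edges, group, new_id):
    """Merge the node ids in `group` into one coarse node `new_id`.
    nodes: {id: time}; edges: list of (src, dst, delay) tuples."""
    packed_nodes = {n: t for n, t in nodes.items() if n not in group}
    packed_nodes[new_id] = sum(nodes[n] for n in group)
    packed_edges = []
    for src, dst, delay in edges:
        s = new_id if src in group else src
        d = new_id if dst in group else dst
        if s != d:                              # an edge wholly inside the group
            packed_edges.append((s, d, delay))  # disappears: saved communication
    return packed_nodes, packed_edges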
Programming Level
Task Duplication
Task duplication is another way to reduce the
communication overhead and hence the
execution time.
Consider the following program graph:
Programming Level
Task Duplication
[Program graph: node A (execution time 4), B (1), C (1), D (2), and E (2), connected by edges with communication delays ranging from 1 to 8.]
Programming Level
Task Duplication
[Gantt chart without duplication: the schedule on P1 and P2 contains long communication and idle periods and completes at time 27.]
Programming Level
Task Duplication
Now let us duplicate tasks A and C:
[Program graph after duplication: copies A′ and C′ let each processor hold A and C locally, replacing the long inter-processor edges with short local ones.]
Programming Level
Task Duplication
[Gantt chart with A and C duplicated: the schedule on P1 and P2 now completes at time 14, with far less communication and idle time.]
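A back-of-the-envelope rule behind this example (a sketch; worth_duplicating is a hypothetical helper): duplicate a predecessor on the consumer's processor whenever re-executing it locally costs less than waiting for its result to arrive.

def worth_duplicating(exec_time, comm_delay):
    # Re-running the task locally beats paying the communication delay.
    return exec_time < comm_delay

# In the example above, A takes 4 time units while shipping its result to the
# other processor costs 8, so duplicating A (and, by the same reasoning, C) pays off.
print(worth_duplicating(4, 8))  # True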
Programming Level
Summary
In partitioning a task into subtasks two issues
must be taken into consideration:
• Grain Size, and
• Latency
Program Graph
Grain Packing
Task duplication
Programming Level
Loop Scheduling
Loops are the largest source of parallelism in a
program. Therefore, there is a need to pay
attention to the partitioning and allocation of
loop iterations among processors in a parallel
environment.
Programming Level
Loop Scheduling
Practically we can talk about three classes of
loops:
• Sequential Loops,
• Doall Loops — vector (parallel) loops,
• Doacross Loops — loops with an intermediate degree
of parallelism.
Programming Level
Loop Scheduling — Doall Loops
Static Scheduling
Dynamic Scheduling
Programming Level
Loop Scheduling — Doall Loops
Static Scheduling schemes assign a fixed
number of iterations to each processor:
• Block Scheduling (Static chunking),
• Cyclic Scheduling
In practice, cyclic scheduling offers better
load balancing than block scheduling if the
computation performed by each iteration varies
significantly.
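A quick Python sketch of the two static schemes for n iterations on p processors (0-based processor ids assumed):

def block(n, p, proc):
    """Block scheduling (static chunking): one contiguous chunk per processor."""
    size = (n + p - 1) // p                       # ceiling of n / p
    return list(range(proc * size, min((proc + 1) * size, n)))

def cyclic(n, p, proc):
    """Cyclic scheduling: iterations dealt out round-robin, one at a time."""
    return list(range(proc, n, p))

print(block(10, 3, 0))   # [0, 1, 2, 3]
print(cyclic(10, 3, 0))  # [0, 3, 6, 9]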
Programming Level
Loop Scheduling — Doall Loops
Dynamic scheduling schemes have been
proposed to respond to the workload imbalance
of the static scheduling schemes:
• Self Scheduling Scheme
• Fixed size chunking Scheme
• Guided Self Scheduling Scheme
• Factoring Scheme
• Trapezoid Self Scheduling Scheme
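As a sketch of one such scheme, guided self-scheduling hands each idle processor a chunk of ⌈remaining/p⌉ iterations, so chunks shrink as the loop drains (a Python sketch; the other schemes differ mainly in the chunk-size rule):

import math

def gss_chunks(n, p):
    """Sequence of chunk sizes issued by guided self-scheduling."""
    remaining, chunks = n, []
    while remaining > 0:
        chunk = math.ceil(remaining / p)
        chunks.append(chunk)
        remaining -= chunk
    return chunks

print(gss_chunks(100, 4))
# [25, 19, 14, 11, 8, 6, 5, 3, 3, 2, 1, 1, 1, 1]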
Programming Level
Loop Scheduling
Practice has shown that the loss of parallelism
after serializing Doacross loops is very
significant — Chen & Yew (1991).
There is a need to develop good schemes for the
parallel execution of Doacross loops.
Programming Level
Loop Scheduling — Doacross Loops
In a Doacross loop, iterations may be either data
or control dependent on each other:
• Control dependence is caused by conditional
statements.
• Data dependence appears in the form of sharing
computational results. Data dependence can be
either lexically forward or lexically
backward.
Programming Level
Loop Scheduling — Doacross Loops
Doacross loops can be either:
• regular (dependence distance is fixed) or
• irregular (dependence distance varies from iteration
to iteration).
Regular Doacross loops are more amenable to
parallel execution than irregular loops.
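A tiny example of a regular Doacross loop (Python; arrays A, B and bound N assumed defined elsewhere): every iteration needs the value written two iterations earlier, so the dependence distance is fixed at 2 and iterations can be overlapped with a lag of two.

d = 2                       # fixed dependence distance: a regular Doacross loop
for i in range(d, N):
    A[i] = A[i - d] + B[i]  # cross-iteration flow dependence A[i-2] -> A[i]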
Programming Level
Loop Scheduling — Doacross Loops
DOACROSS Model — Cytron 1986
Pre-synchronization Scheduling — Krothapalli
& Sadayappan 1990
Staggered Distribution Scheme — Lim et al.,
1992
Programming Level
Loop Scheduling
DOACROSS Model — Example
[Timing diagram: iterations i1–i8 of a Doacross loop executed on processors P1–P3, with each iteration's start staggered until the data it needs from the preceding iteration is available (time axis 0–38).]
Programming Level
Loop Scheduling
The DOACROSS model was intended to model the
execution of sequential loops, vector loops,
and loops with intermediate parallelism by
considering the:
• control dependencies, and
• data dependencies.
Programming Level
Loop Scheduling
Control dependencies are caused by conditional
and unconditional branches.
Data dependencies are due to the lexically
forward and lexically backward dependencies.
This simple model, however, does not consider
the effect of inter-processor communication
cost.
Programming Level
Partitioning and Scheduling
A proper allocation scheme should partition a
task (program graph) into subtasks (sub-graphs)
and distribute the subtasks among processors
in an attempt to minimize:
• processor contentions, and
• inter-processor communications.
Programming Level
Partitioning and Scheduling
There are two main allocation schemes:
• Static, and
• Dynamic.