Pipelining

Pipelining is an implementation technique whereby
multiple instructions are overlapped in execution.

It takes advantage of the parallelism that exists among the
actions needed to execute an instruction.

Today, pipelining is the key implementation technique
used to make fast CPUs.

Each step in the pipeline completes a part of an instruction.

Each of these steps is called a pipe stage or a pipe
segment.

The stages are connected one to the next to form a pipe.

Instructions enter at one end, progress through the
stages, and exit at the other end.
Pipelining

The time required between moving an instruction one step
down the pipeline is a processor cycle.

Because all stages proceed at the same time, the length of
a processor cycle is determined by the time required for the
slowest pipe stage.

In a computer, this processor cycle is usually 1 clock cycle
(sometimes it is 2).

If the stages are perfectly balanced, then the time per
instruction on the pipelined processor (assuming ideal
conditions) is equal to:
Time per instruction on unpipelined machine
-----------------------------------------------------------
Number of pipeline stages
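A quick worked example with illustrative numbers: if the unpipelined machine takes 5 ns per instruction and the pipeline has 5 perfectly balanced stages, the ideal time per instruction becomes 5 ns / 5 = 1 ns.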
Pipelining

Under ideal conditions, the speedup from pipelining equals
the number of pipe stages.

Practically

The stages will not be perfectly balanced

Pipelining involves some overhead.

Pipelining yields a reduction in average execution time per
instruction.

Pipelining is not visible to the programmer.
Basics of a RISC Instruction Set

All operations on data apply to data in registers.

The only operations that affect memory are load and store.

The instruction formats are few in number, with all
instructions typically being one size.

64-bit instructions are designated by having a D at the start.

DADD is the 64-bit version of the ADD instruction.
Basics of a RISC Instruction Set

3 classes of instructions.

ALU instructions

Load and Store Instructions

Branches and jumps.

ALU Instructions.

Take either 2 registers or

A register and a sign-extended immediate

Operate on these and store the result into a 3rd
register.

E.g. arithmetic – DADD, DSUB; logical – AND, OR
Basics of a RISC Instruction Set

Load and Store Instructions.

Take a register source called a base register and an
immediate field called offset as operands.

Their sum is used as a memory address (effective address).
LD – Load Doubleword
SD – Store Doubleword

In the case of a load, a second register operand acts as the
destination for the data loaded from memory.

In the case of a store, the second register operand is the
source of the data that is to be stored into memory.
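As an illustrative sketch (register numbers and offset chosen arbitrarily), the effective address is the base register plus the offset:
LD R1,8(R2) ;load: R1 <- Mem[Regs[R2]+8]
SD R1,8(R2) ;store: Mem[Regs[R2]+8] <- R1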
Basics of a RISC Instruction Set

Branches and Jumps.

Branches are conditional transfer of control.

Two ways of specifying the branch condition in RISC:
(i) With a set of condition bits (called a condition code)
(ii) By a limited set of comparisons

Between a pair of registers or

A register and zero

Branch destination is obtained by adding a sign-extended
offset (16 bits in MIPS) to the current PC.

Unconditional jumps are also provided in MIPS.
A Simple Pipelined Implementation
Focus on a pipeline for an integer subset of a RISC
architecture that consists of :

Load – Store

Branch

Integer ALU operations.
Every instruction in this subset can be implemented in at most
5 clock cycles.
The Five pipeline stages are as follows.

Instruction Fetch cycle (IF)

Instruction Decode / Register Fetch cycle (ID)

Execution / Effective Address cycle (EX)

Memory Access (MEM)

Write Back cycle (WB)
Pipeline Stages

Instruction Fetch Cycle (IF)

Send the content of the PC to memory and fetch the
current instruction from that location.

Update the PC to the next sequential instruction by
adding 4 to the PC (assuming 4-byte instructions).

Instruction Decode /Register Fetch cycle (ID)

Decode the instruction and read the registers.

Decoding is done in parallel with reading registers.

This is possible because the register specifiers are at
a fixed location in a RISC architecture.

This technique is known as fixed field decoding.
Pipeline Stages

Execution / Effective Address Cycle (EX)

Performs one of three functions depending on the
instruction type:

Memory reference

The ALU adds the base register and the offset to form the EA.

Register – Register ALU instruction

Performs the operation

Register – Immediate ALU instruction

Performs the operation on the value from register
and the sign extended immediate.

Memory Access (MEM)

If the instruction is a load, the memory does a read using the EA
computed in the previous cycle.

If it is a store, the memory writes the data to the location
specified by the EA.
Pipeline Stages

Write Back Cycle (WB)

Register-Register ALU instructions or load instructions:

Write the result into the register file whether it comes
from the memory system (for a load) or from the ALU
(for an ALU instr)

In this implementation

Branch instruction requires 2 cycles

Store instruction – 4 cycles

All other instructions – 5 cycles.
The Classic Five-Stage Pipeline
Classic Pipeline Stages

Starts a new instruction on each cycle.

On each clock cycle another instruction is fetched and begins
its 5-cycle execution.

During each clock cycle, the hardware is executing some part of
five different instructions.

A single ALU cannot be asked to compute an effective
address and perform a subtract operation at the same time.
Classic Pipeline Stages

Because the register file is used as a source in the ID stage and
as a destination in the WB stage, it appears twice.

It is read in one part of a stage (clock cycle) and written in
another part, represented by a solid line and a dashed line.

IM – Instruction Memory

DM – Data Memory

CC – clock cycle.
Pipeline Registers between successive pipeline stages
Pipeline Registers

To ensure that instructions in different stages of a pipeline
do not interfere with one another,

a separation is made by introducing pipeline registers
between successive stages of the pipeline, so that

at the end of a clock cycle all the results from a
given stage are stored into a register

that is used as the input to the next stage on the
next clock cycle.
Pipeline Registers

Pipeline registers prevent interference between two different
instructions in adjacent stages in the pipeline.

The registers also play a critical role in carrying data for a
given instruction from one stage to the next.

The edge-triggered property of the registers is critical (values
change instantaneously on a clock edge).

Otherwise data from one instruction could interfere with the
execution of another.
Basic Performance Issues of Pipelining

Pipelining increases the CPU instruction throughput (i.e., the
number of instructions completed per unit time).

It does not reduce the execution time of an individual instruction.

It usually slightly increases the execution time of each
instruction due to overhead in the control of the pipeline.

The program runs faster, even though no single instruction runs
faster.

The clock can run no faster than the time needed for the
slowest pipeline stage.

Pipeline overhead arises from the combination of pipeline
register delay and clock skew.
Basic Performance Issues of Pipelining

Pipeline registers add setup time – the time that a
register input must be stable before the clock signal that
triggers a write occurs – plus propagation delay to the clock
cycle.

Clock skew is a phenomenon in synchronous digital circuit
systems (such as computer systems) in which

the same source clock signal arrives at different
components at different times.

The instantaneous difference between the readings of
any two clocks is called their skew.
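Putting these together, an illustrative bound consistent with the text:
Clock period >= max(stage delay) + register setup time + clock skew
so the deeper the pipeline (and the shorter each stage), the larger the fraction of each cycle consumed by this overhead.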
Pipeline Hazards
Hazards are situations that prevent the next instruction in
the instruction stream from executing during its designated
clock cycle.
Hazards reduce the performance from the ideal speedup
gained by pipelining.
There are 3 classes of hazards.
Structural hazards - arise from resource conflicts when the
hardware cannot support all possible combinations of
instructions simultaneously in overlapped execution.
Data hazards - arise when an instruction depends on the
results of a previous instruction.
Control hazards - arise from the pipelining of branches and
other instructions that change the PC.
Pipeline Hazards
Hazards in pipeline can make it necessary to stall
the pipeline.
Avoiding a hazard often requires that some
instructions in the pipeline be allowed to proceed
while others are delayed.
When an instruction is stalled, all instructions issued
later than the stalled instruction are also stalled.
Instructions issued earlier than the stalled instruction
must continue, otherwise the hazard will never clear.
As a result no new instructions are fetched during
the stall.
Performance of Pipeline with stalls
If we ignore the cycle time overhead of pipelining and assume
the stages are perfectly balanced, then the cycle times of the
pipelined and unpipelined processors can be equal.
Performance of Pipeline with stalls
When all instructions take the same number of cycles,
which must also equal the number of pipeline stages (also
called the depth of the pipeline), and there are no pipeline
stalls, pipelining can improve performance by a factor
equal to the depth of the pipeline (number of pipeline
stages).
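The standard form of this result is:
Speedup from pipelining = Pipeline depth / (1 + Pipeline stall cycles per instruction)
With no stalls the denominator is 1, giving a speedup equal to the pipeline depth.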
Structural Hazards
Avoiding structural hazards requires pipelining of functional
units and duplication of resources to allow all possible
combinations of instructions in the pipeline.
If some combination of instructions cannot be accommodated
because of resource conflicts, the processor is said to have a
structural hazard.
Structural hazards arise when some functional unit is not fully
pipelined, or when some resource has not been duplicated
enough to allow all combinations of instructions.
Why would a designer allow structural hazards?
The primary reason is to reduce the cost of the unit.
Data Hazards
Occur when the pipeline changes the order of read/write
accesses to operands so that the order differs from the
order seen by sequentially executing instructions on an
unpipelined processor.
Consider the pipelined execution of the following
instructions.
DADD R1,R2,R3
DSUB R4,R1,R5
AND R6,R1,R7
OR R8,R1,R9
XOR R10,R1,R11
Data Hazards
All the instructions after DADD use the result of DADD
instruction.
DADD writes the value of R1 in the WB pipe stage, but the
DSUB reads the value during its ID stage.
This problem is a data hazard.
Unless precautions are taken to prevent it , the DSUB
instruction will read the wrong value and try to use it.
AND, which reads R1 during CC4, will receive the wrong value
because R1 is not updated by DADD until CC5.
XOR operates properly because its register read occurs in
CC6, after register write.
OR also operates without a hazard because we perform
the register file reads in the second half of the cycle and
the writes in the first half.
Data Hazards
Minimizing Data Hazard Stalls by Forwarding
The previous problem can be solved with a simple hardware
technique called forwarding (also called bypassing and
sometimes short-circuiting ).
The key insight in forwarding is that the result is not really
needed by the DSUB until after the DADD actually produces
it.
If the result can be moved from the pipeline register where the
DADD stores it to where the DSUB needs it, then the need
for a stall can be avoided.
Data Hazards
Forwarding works as follows:
1) The ALU result from both the EX/MEM and MEM/WB
pipeline registers is always fed back to the ALU inputs.
2) If the forwarding hardware detects that the previous ALU
operation has written the register corresponding to a
source for the current ALU operation, control logic selects
the forwarded result as the ALU input rather than the value
read from the register file.
Data Hazards
Data Hazards Requiring Stalls
Not all potential data hazards can be handled by
bypassing.
LD R1,0(R2)
DSUB R4,R1,R5
AND R6,R1,R7
OR R8,R1,R9
Branch Hazards
Control hazards can cause a greater performance loss
than data hazards.
When a branch is executed, it may or may not change the
PC to something other than its current value plus 4.
If a branch changes the PC to its target address, it is a
taken branch.
If it falls through, it is not taken (also called untaken).
If instruction i is a taken branch, then the PC is normally not
changed until the end of ID, after the completion of
address calculation and comparison.
Branch Hazards
The following figure shows that a branch causes a 1-cycle stall in
the five-stage pipeline.
Branch Hazards
Reducing Pipeline Branch Penalties
Software can try to minimize the branch penalty using
knowledge of the hardware scheme and of branch
behavior.
Four schemes
1) freeze or flush the pipeline, holding or deleting any
instructions after the branch until the branch destination is
known.
2) predicted-not-taken or predicted untaken scheme -
implemented by continuing to fetch instructions as if the
branch were a normal instruction. If the branch is taken,
however, we need to turn the fetched instruction into a no-
op and restart the fetch at the target address.
Branch Hazards
3) predicted-taken scheme - no advantage in this approach
for the 5-stage pipeline.
4) delayed branch
branch instruction
sequential successor
branch target if taken
The sequential successor is in the branch delay slot.
This instruction is executed whether or not the branch is
taken.
Branch Hazards
The predicted-not-taken scheme and the pipeline sequence
when the branch is untaken (top) and taken (bottom).
Branch Hazards
The pipeline behavior of the five-stage pipeline with a branch
delay is shown in figure.
Branch Hazards
Performance of Branch Schemes
Pipeline stall cycles from branches =
Branch frequency × Branch penalty
The branch frequency and branch penalty can have a
component from both unconditional and conditional branches.
However, the latter dominate, since they are more frequent.
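An illustrative calculation with assumed numbers: if 20% of instructions are branches and the average branch penalty is 1 cycle, branches add 0.20 × 1 = 0.20 stall cycles per instruction to the CPI.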
Instruction Level Parallelism
 Pipelining overlaps the execution of instructions to improve
performance.
 Pipelining does not reduce the execution time of an
instruction.
 But it reduces the total execution time of the program.
 This potential overlap among instructions is called
“Instruction Level Parallelism”(ILP), since the instructions can
be evaluated in parallel.
Instruction Level Parallelism
 There are two main approaches to exploit ILP:
 An approach that relies on Hardware to help discover and
exploit parallelism dynamically.
Used in the Intel Core series, which dominates the desktop and
server markets.
An approach that relies on software technology to find
parallelism statically at compile time.
 Most processors for the PMD(Personal Mobile Device)
market use static approaches.
However, future processors are using dynamic
approaches.
Instruction Level Parallelism
The value of CPI for a pipelined processor is the sum of the
base CPI and all contributions from stalls.
 Pipeline CPI = Ideal pipeline CPI +
Structural stalls +
Data hazard stalls +
Control stalls.
 Ideal pipeline CPI is a measure of the maximum performance
attainable by the implementation.
By reducing each of the terms on the right-hand side, we
minimize the overall pipeline CPI or, alternatively, increase
the IPC (Instructions Per Clock).
Instruction Level Parallelism
 The amount of parallelism available within a basic block is
quite small.
 Since these instructions are likely to depend upon one
another,
 the amount of overlap we can exploit within a basic
block is likely to be less than the average basic block
size.
 To obtain substantial performance enhancements, we
must exploit ILP across multiple basic blocks.
Instruction Level Parallelism
 The simplest and most common way to increase the ILP is
to exploit parallelism among iterations of a loop.
 This type of parallelism is often called loop-level
parallelism.
 Consider a simple example of a loop that adds two 1000-
element arrays and is completely parallel:
 for (i=0; i<=999; i=i+1)
x[i] = x[i] + y[i];
 Every iteration of the loop can overlap with any other
iteration
 Within each loop iteration there is little or no
opportunity for overlap.
Instruction Level Parallelism
There are a number of techniques for converting such loop-
level parallelism into instruction-level parallelism.
 Basically, such techniques work by unrolling the loop
 either statically by the compiler or
 dynamically by the hardware
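A minimal C sketch of static unrolling for the array-add loop above (unroll factor 4; assumes the trip count of 1000 is a multiple of 4):
    for (i = 0; i <= 996; i += 4) {   /* one loop test/branch per 4 elements */
        x[i]   = x[i]   + y[i];
        x[i+1] = x[i+1] + y[i+1];
        x[i+2] = x[i+2] + y[i+2];
        x[i+3] = x[i+3] + y[i+3];
    }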
Data Dependence
 Determining how one instruction depends on another is
critical to determining
 How much parallelism exists in a program
 How that parallelism can be exploited.
 To exploit ILP we must determine which instructions can
be executed in parallel.
 If two instructions are parallel, they can execute
simultaneously.
 If two instructions are dependent, they are not parallel and
must be executed in order, although they may often be
partially overlapped
Bernstein’s Conditions for Detection of Parallelism
 Bernstein conditions are based on the following two sets of
variables:
i. The Read set or input set Ri, which consists of the variables
read by instruction Ii.
ii. The Write set or output set Wi, which consists of the variables
written by instruction Ii.
 Two instructions I1 and I2 can be executed in parallel if
they satisfy the following conditions:
 R1 ∩ W2 = φ
 R2 ∩ W1 = φ
 W1 ∩ W2 = φ
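An illustrative example (statements assumed, not from the source): for I1: a = b + c and I2: d = a + e, R2 ∩ W1 = {a} ≠ φ, so the pair fails the conditions and cannot run in parallel; replacing I2 with d = e + f makes all three intersections empty, so the pair can.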
Data Dependence
 Three different types of dependences:
 data dependences (also called true data dependences),
 name dependences and
 control dependences.
 Data Dependences
 True data dependence (or flow dependence)
 Anti dependence
 Output dependence
 An instruction j is data dependent on instruction i if either of
the following holds:
 Instruction i produces a result that may be used by
instruction j, or
 Instruction j is data dependent on instruction k and
instruction k is data dependent on instruction i
Data Dependence
 Dependences are a property of programs
 Pipeline organization determines
 whether a dependence is detected and
 whether it causes a stall
 A data dependence conveys three things:
 Possibility of a hazard
 Order in which results must be calculated
 An upper bound on how much parallelism can possibly
be exploited.
 A dependence can be overcome in two different ways:
 Maintaining the dependence but avoiding a hazard
 Eliminating the dependence by transforming the code.
Name Dependence
 Two instructions use the same name, but there is no flow of
information associated with that name.
 Two types of Name Dependences between an instruction i
that precedes instruction j in program order
 1) Antidependence: instruction j writes a register or
memory location that instruction i reads.
 The original ordering must be preserved to ensure that
instruction i reads the correct value.
 2) Output dependence: instruction i and instruction j write
the same register or memory location
 Ordering must be preserved to ensure that the value
finally written corresponds to instruction j.
 To resolve name dependences, we use renaming techniques
(register renaming)
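An illustrative MIPS-style sketch (registers chosen arbitrarily):
    DSUB R4,R1,R5 ;i: reads R1
    DADD R1,R2,R3 ;j: writes R1 -> antidependence between i and j on R1
    DADD R1,R6,R7 ;k: also writes R1 -> output dependence between j and k on R1
Renaming the later writes to unused registers (e.g., R8 and R9, with later uses updated) removes both name dependences.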
Data Hazards
 A hazard is created whenever there is a dependence
between instructions, and
 they are close enough that the overlap during execution
would change the order of access to the operand involved
in the dependence.
 Because of the dependence, we have to preserve the
program order.
 Three types of Data Hazards
 Read after write (RAW)
 Write after write (WAW)
 Write after read (WAR)
Data Hazards
 Read after write (RAW)
 Instruction j tries to read a source before i writes it, so j
incorrectly gets the old value.
 This hazard is the most common type.
 It corresponds to a true data dependence.
 Program order must be preserved to ensure that j receives
the value from i.
 Write After Write (WAW)
 Instruction j tries to write an operand before it is written
by i.
 The writes end up being performed in the wrong order.
 This corresponds to an output dependence
Data Hazards
 Write After Read (WAR)
 Instruction j tries to write a destination before it is read
by i, so i incorrectly gets the new value.
 This hazard arises from an antidependence.
 Read After Read (RAR) case is not a hazard.
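Illustrative instruction pairs (i before j; registers chosen arbitrarily):
    RAW: i: DADD R1,R2,R3   j: DSUB R4,R1,R5 ;j reads the R1 that i writes
    WAW: i: DADD R1,R2,R3   j: DSUB R1,R4,R5 ;both write R1
    WAR: i: DADD R4,R1,R3   j: DSUB R1,R5,R6 ;j writes the R1 that i reads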
Control Dependences
 A control dependence determines the ordering of an
instruction i with respect to a branch instruction, so that
instruction i is executed in correct program order and only
when it should be.
 These control dependences must be preserved to preserve
program order.
 One of the simplest examples of a control dependence is the
dependence of the statements in the “then” part of an if
statement on the branch.
Control Dependences
 For example, in the code segment :
if p1 {
s1
}
if p2 {
s2
}
 S1 is control dependent on p1, and S2 is control dependent
on p2 but not on p1.
Control Dependences
 In general, two constraints are imposed by control
dependences:
 An instruction that is control dependent on a branch
cannot be moved before the branch so that its execution
is no longer controlled by the branch.
 For example, we cannot take an instruction from
the then portion of an if statement and move it
before the if statement.
 An instruction that is not control dependent on a branch
cannot be moved after the branch so that its execution
is controlled by the branch.
 For example, we cannot take a statement before
the if statement and move it into the then portion.
Basic Compiler Techniques for Exposing ILP
➢
These techniques are crucial for processors that use static
scheduling.
➢
The basic compiler techniques includes:
➢
Scheduling the code
➢
Loop unrolling
➢
Reducing branch costs with advanced branch
prediction
Basic Pipeline Scheduling
➢
To keep a pipeline full,
➢
parallelism among instructions must be exploited by
➢
finding sequences of unrelated instructions that
can be overlapped in the pipeline.
➢
To avoid a pipeline stall,
➢
the execution of a dependent instruction must be
➢
separated from the source instruction by a
distance in clock cycles equal to
➢
The pipeline latency of that source
instruction.
Basic Pipeline Scheduling
➢
A compiler’s ability to perform this scheduling depends
both on
➢
the amount of ILP available in the program and
➢
On the latencies of the functional units in the
pipeline.
Instruction producing result   Instruction using result   Latency in clock cycles
FP ALU op                      Another FP ALU op          3
FP ALU op                      Store double               2
Load double                    FP ALU op                  1
Load double                    Store double               0
➢
The latencies of the FP operations used are given above.
➢
The last column is the number of intervening clock cycles
needed to avoid a stall.
Basic Pipeline Scheduling
➢
We assume
➢
the standard five-stage integer pipeline, so that
branches have a delay of one clock cycle.
➢
the functional units are fully pipelined or replicated (as
many times as the pipeline depth),
➢
so that an operation of any type can be issued on
every clock cycle and
➢
there are no structural hazards.
➢
The integer ALU operation has a latency of 0.
Basic Pipeline Scheduling
➢
Consider the following code segment which adds a scalar
to a vector:
for (i=999; i>=0; i--)
x[i] = x[i] + s ;
➢
This loop is parallel, since the body of each
iteration is independent of the others.
➢
The first step is to translate the above segment to MIPS
assembly language.
➢
In the following code segment,
➢
R1 is initially the address of the element in the array
with the highest address, and
➢
F2 contains the scalar value s.
➢
Register R2 is precomputed, so that 8(R2) is the
address of the last element to operate on.
Basic Pipeline Scheduling
➢
The straightforward MIPS code, not scheduled for the
pipeline, looks like :
➢
Loop: L.D F0,0(R1) ;F0=array element
ADD.D F4,F0,F2 ;add scalar in F2
S.D F4,0(R1) ;store result
DADDUI R1,R1,#-8 ;decrement pointer
;8 bytes (per DW)
BNE R1,R2,Loop ;branch R1!=R2
Basic Pipeline Scheduling
➢
Without any scheduling, the loop will execute as follows:
Clock cycle issued
Loop: L.D F0,0(R1) 1
stall 2
ADD.D F4,F0,F2 3
stall 4
stall 5
S.D F4,0(R1) 6
DADDUI R1,R1,#-8 7
stall 8
BNE R1,R2,Loop 9
Basic Pipeline Scheduling
➢
We can schedule the loop to obtain only two stalls and
reduce the time to seven cycles:
Clock cycle issued
Loop: L.D F0,0(R1) 1
DADDUI R1,R1,#-8 2
ADD.D F4,F0,F2 3
stall 4
stall 5
S.D F4, 8(R1) 6
BNE R1,R2,Loop 7
➢
The two stalls after ADD.D are for the use of its result by the S.D
Basic Pipeline Scheduling
➢
In the previous example, we complete one loop iteration
and store back one array element every seven clock
cycles.
➢
The actual work of operating on the array element takes
just three (the load, add, and store) of those seven clock
cycles.
➢
The remaining four clock cycles consist of
➢
loop overhead—the DADDUI and BNE—and
➢
two stalls.
➢
To eliminate these four clock cycles
➢
we need to get more operations relative to the
number of overhead instructions.
Loop Unrolling
➢
A simple scheme for increasing the number of instructions
relative to the branch and overhead instructions is loop
unrolling.
➢
Unrolling simply replicates the loop body multiple times,
adjusting the loop termination code.
➢
Loop unrolling can also be used to improve scheduling.
➢
Because it eliminates the branch,
➢
it allows instructions from different iterations to be
scheduled together
Loop Unrolling
➢
If we simply replicated the instructions when we unrolled
the loop,
➢
the resulting use of the same registers could prevent
us from effectively scheduling the loop.
➢
Thus, we will want to use different registers for each
iteration,
➢
increasing the required number of registers.
Loop Unrolling without scheduling
➢
Here we assume that the number of elements is a multiple of 4.
➢
Note that R2 must now be set so that 32(R2) is the starting
address of the last four elements
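The unrolled code itself appears in a figure not reproduced here; a sketch consistent with the counts below (unroll factor 4, 14 instructions) is:
    Loop: L.D    F0,0(R1)
          ADD.D  F4,F0,F2
          S.D    F4,0(R1)
          L.D    F6,-8(R1)
          ADD.D  F8,F6,F2
          S.D    F8,-8(R1)
          L.D    F10,-16(R1)
          ADD.D  F12,F10,F2
          S.D    F12,-16(R1)
          L.D    F14,-24(R1)
          ADD.D  F16,F14,F2
          S.D    F16,-24(R1)
          DADDUI R1,R1,#-32
          BNE    R1,R2,Loop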
Loop Unrolling without scheduling
➢
We have eliminated 3 branches and 3 decrements of R1.
➢
Without scheduling, every operation in the unrolled loop is
followed by a dependent operation and thus will cause a
stall.
➢
This loop will run in 27 clock cycles:
➢
each L.D has 1 stall (1 x 4 = 4)
➢
each ADD.D has 2 stalls (2 x 4 = 8)
➢
the DADDUI has 1 stall (1 x 1 = 1)
➢
plus 14 instruction issue cycles
➢
or 27/4 = 6.75 clock cycles per element.
➢
This can be scheduled to improve performance significantly.
Loop Unrolling with scheduling
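The scheduled version also appears in a figure not reproduced here; a sketch consistent with the 14-cycle total discussed below is:
    Loop: L.D    F0,0(R1)
          L.D    F6,-8(R1)
          L.D    F10,-16(R1)
          L.D    F14,-24(R1)
          ADD.D  F4,F0,F2
          ADD.D  F8,F6,F2
          ADD.D  F12,F10,F2
          ADD.D  F16,F14,F2
          S.D    F4,0(R1)
          S.D    F8,-8(R1)
          DADDUI R1,R1,#-32
          S.D    F12,16(R1)
          BNE    R1,R2,Loop
          S.D    F16,8(R1) ;branch delay slot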
Loop Unrolling with scheduling
➢
The execution time of the unrolled loop has dropped to a
total of 14 clock cycles.
➢
or 3.5 clock cycles per element,
➢
compared with
➢
9 cycles per element before any unrolling or
scheduling
➢
7 cycles when scheduled but not unrolled.
➢
6.75 cycles with unrolling but no scheduling
Strip mining
➢
In real programs we do not usually know the upper bound on
the loop.
➢
Suppose it is n
➢
we would like to unroll the loop to make k copies of the body.
➢
Instead of a single unrolled loop, we generate a pair of
consecutive loops.
➢
The first executes (n mod k) times and has a body that
is the original loop.
➢
The second is the unrolled body surrounded by an outer
loop that iterates (n/k) times
➢
For large values of n, most of the execution time will be
spent in the unrolled loop body.
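A minimal C sketch of strip mining (n, k, and the body follow the description above; the inner loop stands in for a body literally unrolled k times):
    i = 0;
    for (; i < n % k; i++)        /* first loop: n mod k original iterations */
        x[i] = x[i] + s;
    for (; i < n; i += k)         /* second loop: n/k passes of the unrolled body */
        for (j = 0; j < k; j++)
            x[i+j] = x[i+j] + s;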
Loop Unrolling
➢
Loop unrolling is a simple but useful method for
➢
increasing the size of straight-line code fragments that
can be scheduled effectively.
➢
Three different effects limit the gains from loop unrolling:
(1) a decrease in the amount of overhead amortized with
each unroll
➢
If the loop is unrolled 2n times instead of n times, the
overhead per iteration drops to half that of the
n-way unrolled loop.
(2) code size limitations
➢
growth in code size may increase the instruction cache
miss rate
(3) compiler limitations – a shortfall in registers
(register pressure)
Branch Prediction
Loop unrolling is one way to reduce the number of branch
hazards.
We can also reduce the performance losses of branches
by predicting how they will behave.
Branch prediction schemes are of two types:
static branch prediction (or compile-time branch
prediction)
dynamic branch prediction
Static Branch Prediction
It is the simplest scheme, because
it does not rely on information about the dynamic
history of the executing code.
It relies on information available at compile time.
It predicts the outcome of a branch based solely on the
branch instruction.
i.e., it uses information that was gathered before the
execution of the program,
e.g., profile information collected from earlier runs.
Dynamic Branch Prediction
Predict branches dynamically based on program behavior.
It uses information about taken or not taken branches
gathered at run-time to predict the outcome of a branch.
The simplest dynamic branch-prediction scheme is a branch-
prediction buffer or branch history table.
A branch-prediction buffer is a small memory indexed by the
lower portion of the address of the branch instruction.
The memory location contains a bit that says whether
the branch was recently taken or not.
Dynamic Branch Prediction
Different branch instructions may have the same low-order
bits.
With such a buffer, we don’t know whether the prediction is correct.
The prediction is a hint that is assumed to be correct, and
fetching begins in the predicted direction.
If the hint turns out to be wrong, the prediction bit is
inverted and stored back.
This simple 1-bit prediction scheme has a performance
shortcoming:
Even if a branch is almost always taken, we will likely
predict incorrectly twice, rather than once, when it is not
taken
since the misprediction causes the prediction bit to
be flipped.
Dynamic Branch Prediction
2-bit Prediction Scheme :-
To overcome the weakness of 1-bit prediction scheme,
2-bit prediction schemes are often used.
In a 2-bit scheme, a prediction must miss twice before it is
changed.
The figure shows the finite-state diagram for a 2-bit prediction
scheme.
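A minimal C sketch of the 2-bit saturating-counter table (table size and indexing are assumptions, not from the text):
    #define ENTRIES 1024
    static unsigned char bht[ENTRIES];  /* 0,1 = predict not taken; 2,3 = predict taken */

    int predict(unsigned long pc) {
        return bht[(pc >> 2) % ENTRIES] >= 2;   /* 1 = predict taken */
    }

    void update(unsigned long pc, int taken) {  /* saturates at 0 and 3 */
        unsigned char *c = &bht[(pc >> 2) % ENTRIES];
        if (taken) { if (*c < 3) (*c)++; }
        else       { if (*c > 0) (*c)--; }
    }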
Dynamic Branch Prediction
Correlating Branch Predictors :-
The 2-bit predictor schemes use only the recent behavior of
a single branch
to predict the future behavior of that branch.
It may be possible to improve the prediction accuracy
if we also look at the recent behavior of other
branches rather than just the branch we are trying to
predict.
Branch predictors that use the behavior of other branches to
make a prediction are called correlating predictors or two-
level predictors.
Dynamic Branch Prediction
Correlating Branch Predictors :-
Consider the following code :
if (aa == 2) // branch b1
aa=0;
if (bb==2) // branch b2
bb=0;
if (aa!=bb) { // branch b3
........
}
The behavior of branch b3 is correlated with the behavior of
branches b1 and b2.
If branches b1 and b2 are both not taken then branch b3 will
be taken.
Dynamic Branch Prediction
Correlating Branch Predictors :-
A predictor that uses only the behavior of a single branch to
predict the outcome of that branch can never capture this
behavior.
Existing correlating predictors add information about the
behavior of the most recent branches to decide how to
predict a given branch.
For example, a (1,2) predictor uses
the behavior of the last branch to choose from among a
pair of 2-bit branch predictors in predicting a particular
branch.
Dynamic Branch Prediction
Correlating Branch Predictors :-
In the general case, an (m, n) predictor uses
the behavior of the last m branches to choose from 2^m
branch predictors,
each of which is an n-bit predictor for a single
branch.
The attraction of this type of correlating branch predictor is
that it can yield higher prediction rates than the 2-bit
scheme and
requires only a trivial amount of additional hardware.
Dynamic Branch Prediction
Correlating Branch Predictors :-
The global history of the most recent m branches can be
recorded in an m-bit shift register,
where each bit records whether the branch was
taken or not taken.
The branch prediction buffer can then be indexed using
a concatenation of the low-order bits from the
branch address with the m-bit global history.
Dynamic Branch Prediction
Correlating Branch Predictors :-
For example, in a (2, 2) buffer with 64 total entries,
the 4 low-order address bits of the branch (word address)
and
the 2 global bits representing the behavior of the two
most recently executed branches
form a 6-bit index that can be used to index the 64
counters.
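A one-line C sketch of that index computation (variable names assumed):
    unsigned index = (((pc >> 2) & 0xF) << 2) | (ghr & 0x3); /* 4 word-address bits + 2 global history bits = 6-bit index into 64 counters */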
The number of bits in an (m, n) predictor is:
2^m × n × Number of prediction entries selected by the branch address
A 2-bit predictor with no global history is simply a (0,2)
predictor.
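Checking the (2,2) example against this formula: the branch address selects 16 entries, so the predictor holds 2^2 × 2 × 16 = 128 bits, i.e., the 64 two-bit counters noted above.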
Dynamic Branch Prediction
Tournament Predictors :-
Tournament predictors use
multiple predictors,
usually one based on global information and
one based on local information, and
combine them with a selector.
Tournament predictors can achieve both
better accuracy at medium sizes (8K–32K bits) and
also make use of very large numbers of prediction bits
effectively.
Dynamic Branch Prediction
Tournament Predictors :-
Existing tournament predictors use a 2-bit saturating counter
per branch
to choose among two different predictors based on
which predictor (local, global, or even some mix) was
most effective in recent predictions.
As in a simple 2-bit predictor,
the saturating counter requires two mispredictions
before changing the identity of the preferred predictor.
Dynamic Branch Prediction
Tournament Predictors :-
The advantage of a tournament predictor is
its ability to select the right predictor for a particular
branch.
Dynamic Branch Prediction
Fig: The misprediction rate for three different predictors on
SPEC89 (benchmark) as the total number of bits is increased.
Speculation overcomes control dependences by
Predicting the branch outcome and
Speculatively executing instructions as if the
predictions were correct.
Hardware Based Speculation
Hardware-based speculation combines three key ideas:
1) dynamic branch prediction to choose which instructions
to execute
2) speculation to allow the execution of instructions before
the control dependences are resolved (with the ability
to undo the effects of an incorrectly speculated
sequence)
3) dynamic scheduling to deal with the scheduling of
different combinations of basic blocks.
Hardware Based Speculation
Hardware-based speculation follows the predicted flow of
data values to choose when to execute instructions.
This method of executing programs is essentially a data
flow execution: Operations execute as soon as their
operands are available.
Hardware Based Speculation
The key idea behind implementing speculation is to
allow instructions to execute out of order
but to force them to commit in order and
to prevent any irrevocable action (such as updating
state or taking an exception) until an instruction
commits.
Hence, when we add speculation,
we need to separate the process of completing
execution from instruction commit,
since instructions may finish execution
considerably before they are ready to commit.
Hardware Based Speculation
Adding the commit phase to the instruction execution
sequence
requires an additional set of hardware buffers that
hold the results of instructions that have finished
execution but have not committed.
This hardware buffer, the reorder buffer (ROB), is also used to pass
results among instructions that may be speculated.
Hardware Based Speculation
• The reorder buffer (ROB) provides additional
registers.
• The ROB holds the result of an instruction
between the time the operation associated with
the instruction completes and the time the
instruction commits.
• Hence, the ROB is a source of operands for
instructions.
Reorder Buffer (ROB)
• With speculation, the register file is not
updated until the instruction commits ;
• thus, the ROB supplies operands in the
interval between completion of
instruction execution and instruction
commit.
Reorder Buffer (ROB)
Each entry in the ROB contains four fields:
the instruction type,
the destination field,
the value field, and
the ready field.
The instruction type field indicates whether the instruction
is
a branch (and has no destination result),
a store (which has a memory address destination), or
a register operation (ALU operation or load, which
has register destinations).
Reorder Buffer (ROB)
The destination field supplies
the register number (for loads and ALU operations) or
the memory address (for stores) where the instruction
result should be written.
The value field is used
to hold the value of the instruction result until the
instruction commits.
The ready field
Indicates that the instruction has completed execution,
and the value is ready.
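A minimal C sketch of such an entry (field names and types are assumptions):
    typedef enum { ROB_BRANCH, ROB_STORE, ROB_REG_OP } RobType;

    typedef struct {
        RobType  type;   /* instruction type */
        unsigned dest;   /* register number (loads/ALU ops) or memory address (stores) */
        long     value;  /* result held until the instruction commits */
        int      ready;  /* nonzero once execution has completed */
    } RobEntry;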
Reorder Buffer (ROB)
Basic Structure with H/W Based Speculation
There are the four steps involved in instruction execution:
Issue
Execute
Write result
Commit
Steps in Execution
Issue
Get an instruction from the instruction queue.
Issue the instruction if there is an empty reservation
station and an empty slot in the ROB;
send the operands to the reservation station if
they are available in either the registers or the
ROB.
Update the control entries to indicate the buffers are
in use.
The number of the ROB entry allocated for the result
is also sent to the reservation station, so that the
number can be used to tag the result when it is
placed on the CDB (Common Data Bus).
Steps in Execution
Issue
If either all reservation stations are full or the ROB is full,
then instruction issue is stalled until both have
available entries.
Write Result :
When the result is available,
write it on the CDB (with the ROB tag sent
when the instruction issued) and from the CDB
into the ROB, as well as to any reservation
stations waiting for this result.
Mark the reservation station as available.
Steps in Execution
Write Result :
Special actions are required for store instructions.
If the value to be stored is available,
it is written into the Value field of the ROB entry
for the store.
If the value to be stored is not available yet,
the CDB must be monitored until that value is
broadcast, at which time the Value field of the
ROB entry of the store is updated.
Steps in Execution
Commit :
This is the final stage of completing an instruction,
after which only its result remains.
There are three different sequences of actions at
commit depending on whether the committing
instruction is
a branch with an incorrect prediction,
a store, or
any other instruction (normal commit)
Steps in Execution
Commit :
The normal commit case occurs when an instruction
reaches the head of the ROB and its result is present
in the buffer;
at this point, the processor updates the register
with the result and
removes the instruction from the ROB.
Committing a store is similar except that
memory is updated rather than a result register.
Steps in Execution
Commit :
When a branch with incorrect prediction reaches the
head of the ROB, it indicates that the speculation was
wrong.
The ROB is flushed and execution is restarted at the
correct successor of the branch.
If the branch was correctly predicted, the branch is
finished.
Steps in Execution
Once an instruction commits,
its entry in the ROB is reclaimed and
the register or memory destination is updated,
eliminating the need for the ROB entry.
If the ROB fills, we simply stop issuing instructions
until an entry is made free.
Multithreading: Exploiting Thread-Level
Parallelism to Improve Uniprocessor Throughput
Multithreading allows multiple threads to share the
functional units of a single processor in an overlapping
fashion.
In contrast, a more general method to exploit thread-
level parallelism (TLP) is with a multiprocessor that has
multiple independent threads operating at once and in
parallel.
Multithreading, however, does not duplicate the entire
processor as a multiprocessor does.
Instead, multithreading shares most of the processor
core among a set of threads, duplicating only
the per-thread state.
contd..
• Duplicating the per-thread state of a processor core
means creating a separate register file, a separate PC,
and a separate page table for each thread.
• There are three main hardware approaches to
multithreading.
1. Fine-grained multithreading switches between threads
on each clock, causing the execution of instructions
from multiple threads to be interleaved.
2. Coarse-grained multithreading switches threads only
on costly stalls, such as level two or three cache
misses.
3. Simultaneous multithreading is a variation on fine
grained multithreading that arises naturally when fine-
grained multithreading is implemented on top of a
multiple-issue, dynamically scheduled processor.
Fig: The horizontal dimension represents the instruction
execution capability in each clock cycle. The vertical
dimension represents a sequence of clock cycles. An empty (white) box
indicates that the corresponding execution slot is unused in that clock
cycle. The shades of gray and black correspond to four different threads
in the multithreading processors.
End of Module 2

More Related Content

PPTX
Pipelining in Computer System Achitecture
PDF
Pipelining 16 computers Artitacher pdf
PPTX
Lecture-9 Parallel-processing .pptx
PPTX
pipelining
PPT
Pipelining slides
PPT
PPT
Pipelining _
PDF
The AVR Pipelining explanation detailed.pdf
Pipelining in Computer System Achitecture
Pipelining 16 computers Artitacher pdf
Lecture-9 Parallel-processing .pptx
pipelining
Pipelining slides
Pipelining _
The AVR Pipelining explanation detailed.pdf

Similar to Module 2 of apj Abdul kablam university hpc.pdf (20)

PPTX
Assembly p1
PPTX
ADNSU Computer Architecture Topic Presentation.pptx
DOC
Pipeline Mechanism
PPTX
Pipelining of Processors Computer Architecture
PPTX
3 Pipelining
PDF
pipelining ppt.pdf
PPT
Pipeline hazard
PDF
Topic2a ss pipelines
PPTX
Advanced Pipelining in ARM Processors.pptx
DOCX
Bc0040
PPTX
pipeline in computer architecture design
PPTX
complete DLD.pptxbjngjjgujjhhujhhhuujhguh
PPTX
Presentation on risc pipeline
PPTX
Pipelining of Processors
PPT
PPTX
CISC & RISC Architecture
PDF
INCREASING THE THROUGHPUT USING EIGHT STAGE PIPELINING
PPTX
Pipelining , structural hazards
PPTX
Instruction pipelining
PPTX
Design pipeline architecture for various stage pipelines
Assembly p1
ADNSU Computer Architecture Topic Presentation.pptx
Pipeline Mechanism
Pipelining of Processors Computer Architecture
3 Pipelining
pipelining ppt.pdf
Pipeline hazard
Topic2a ss pipelines
Advanced Pipelining in ARM Processors.pptx
Bc0040
pipeline in computer architecture design
complete DLD.pptxbjngjjgujjhhujhhhuujhguh
Presentation on risc pipeline
Pipelining of Processors
CISC & RISC Architecture
INCREASING THE THROUGHPUT USING EIGHT STAGE PIPELINING
Pipelining , structural hazards
Instruction pipelining
Design pipeline architecture for various stage pipelines
Ad

Recently uploaded (20)

PDF
Pre independence Education in Inndia.pdf
PPTX
PPT- ENG7_QUARTER1_LESSON1_WEEK1. IMAGERY -DESCRIPTIONS pptx.pptx
PPTX
Renaissance Architecture: A Journey from Faith to Humanism
PDF
Saundersa Comprehensive Review for the NCLEX-RN Examination.pdf
PPTX
human mycosis Human fungal infections are called human mycosis..pptx
PPTX
Pharma ospi slides which help in ospi learning
PDF
RMMM.pdf make it easy to upload and study
PDF
2.FourierTransform-ShortQuestionswithAnswers.pdf
PPTX
The Healthy Child – Unit II | Child Health Nursing I | B.Sc Nursing 5th Semester
PDF
Origin of periodic table-Mendeleev’s Periodic-Modern Periodic table
PDF
Physiotherapy_for_Respiratory_and_Cardiac_Problems WEBBER.pdf
PDF
O7-L3 Supply Chain Operations - ICLT Program
PDF
VCE English Exam - Section C Student Revision Booklet
PPTX
Cell Structure & Organelles in detailed.
PPTX
Introduction_to_Human_Anatomy_and_Physiology_for_B.Pharm.pptx
PDF
grade 11-chemistry_fetena_net_5883.pdf teacher guide for all student
PDF
Microbial disease of the cardiovascular and lymphatic systems
PDF
Mark Klimek Lecture Notes_240423 revision books _173037.pdf
PPTX
school management -TNTEU- B.Ed., Semester II Unit 1.pptx
PDF
FourierSeries-QuestionsWithAnswers(Part-A).pdf
Pre independence Education in Inndia.pdf
PPT- ENG7_QUARTER1_LESSON1_WEEK1. IMAGERY -DESCRIPTIONS pptx.pptx
Renaissance Architecture: A Journey from Faith to Humanism
Saundersa Comprehensive Review for the NCLEX-RN Examination.pdf
human mycosis Human fungal infections are called human mycosis..pptx
Pharma ospi slides which help in ospi learning
RMMM.pdf make it easy to upload and study
2.FourierTransform-ShortQuestionswithAnswers.pdf
The Healthy Child – Unit II | Child Health Nursing I | B.Sc Nursing 5th Semester
Origin of periodic table-Mendeleev’s Periodic-Modern Periodic table
Physiotherapy_for_Respiratory_and_Cardiac_Problems WEBBER.pdf
O7-L3 Supply Chain Operations - ICLT Program
VCE English Exam - Section C Student Revision Booklet
Cell Structure & Organelles in detailed.
Introduction_to_Human_Anatomy_and_Physiology_for_B.Pharm.pptx
grade 11-chemistry_fetena_net_5883.pdf teacher guide for all student
Microbial disease of the cardiovascular and lymphatic systems
Mark Klimek Lecture Notes_240423 revision books _173037.pdf
school management -TNTEU- B.Ed., Semester II Unit 1.pptx
FourierSeries-QuestionsWithAnswers(Part-A).pdf
Ad

Module 2 of apj Abdul kablam university hpc.pdf

  • 1. Pipelining  Pipelinig is an implementation technique whereby multiple instructions are overlapped in execution.  It takes advantage of parallelism that exist among the actions needed to execute an instruction.  Today pipelining is the key implementation technique used to make fast CPU.  Each step in pipeline completes a part of an instruction.  Each of these steps is called a pipe stage or a pipe segment.  The stages are connected one to next to form a pipe.  Instructions enter at one end, progress through the stages, and exit at the other end.
  • 2. Pipelining  The time required between moving an instruction one step down the pipeline is a processor cycle.  Because all stages proceed at the same time, the length of a processor cycle is determined by the time required for the slowest pipe stage.  In a computer this processor cycle is usually 1 clock cycle (some times it is 2).  If the stages are perfectly balanced, then the time per instruction on the pipelined processor- assuming ideal conditions- is equal to : Time per instruction on unpipelined machine ----------------------------------------------------------- Number of pipeline stages.
  • 3. Pipelining  Under ideal conditions the speed up from pipelining equals the number of pipe stages.  Practically  The stages will not be perfectly balanced  Pipelining involves some overhead.  Pipelining yields a reduction in average execution time per instruction.  Pipelining is not visible to the programmer.
  • 4. Basics of a RISC Instruction Set  All operations on data apply to data in registers.  The only operations that affect memory are load and store.  The instruction formats are few in number with all instructions typically being one size (same size).  64 bit instructions are designated by having a D on the start.  DADD is the 64 bit version of ADD instr.
  • 5. Basics of a RISC Instruction Set  3 classes of instructions.  ALU instructions  Load and Store Instructions  Branches and jumps.  ALU Instructions.  Take either 2 registers or  A register and a sign-extended immediate  Operate on these and store the result into a 3rd register.  Eg. DADD, DSUB logical – AND, OR
  • 6. Basics of a RISC Instruction Set  Load and Store Instructions.  Take a register source called a base register and an immediate field called offset as operands.  Their sum is used as a memory address (Effective addr.) LD – Load Word SD – Store Word  In case of Load a 2nd register operand acts as a destination for the data loaded from memory.  In case of Store the 2nd register operand is the source of the data that is to be stored into memory.
  • 7. Basics of a RISC Instruction Set  Branches and Jumps.  Branches are conditional transfer of control.  Two ways of specifying branch condition in RISC (i) With a set of condition bits (called condition code) (ii) By a limited set of comparisions  Between a pair of registers or  A register and zero  Branch destination is obtained by adding a sign-extended offset (16 bits in MIPS) to the current PC.  Unconditional jumps are also provided in MIPS.
  • 8. A Simple Pipelined Implementation Focus on a pipeline for an integer subset of a RISC architecture that consists of :  Load – Store  Branch  Integer ALU operations. Every instruction in this subset can be implemented in at most 5 clock cycles. The Five pipeline stages are as follows.  Instruction Fetch cycle (IF)  Instruction Decode / Register Fetch cycle (ID)  Execution / Effective Address cycle (EX)  Memory Access (MEM)  Write Back cycle (WB)
  • 9. Pipeline Stages  Instruction Fetch Cycle (IF)  Send the content of PC to the memory and fetch the current location from memory.  Update the PC to the next sequential instruction by adding 4 to the PC (assuming 4 bytes instruction)  Instruction Decode /Register Fetch cycle (ID)  Decode the instruction and read the registers.  Decoding is done in parallel with reading registers.  This is possible because the register specifiers are at a fixed location in a RISC architecture.  This technique is known as fixed field decoding.
  • 10. Pipeline Stages  Execution / Effective Address Cycle (EX)  Performing one of the 3 fuctions depending on the instruction type  Memory referrence  ALU adds the base reg and the offset to form EA.  Register – Register ALU instruction  Performs the operation  Register – Immediate ALU instruction  Performs the operation on the value from register and the sign extended immediate.  Memory Access (MEM)  If the instruction is a load , memory does a read using EA computed in previous cycle.  If it is a store the memory writes the data to the location specified by EA.
  • 11. Pipeline Stages  Write Back Cycle (WB)  Register-Register ALU instructions. Or Load instr.  Write the result into the register file whether it comes from the memory system (for a load) or from the ALU (for an ALU instr)  In this implementation  Branch instruction requires 2 cycles  Store instruction – 4 cycles  All other instructions – 5 cycles.
  • 13. Classic Pipeline Stages  Starts a new instruction on each cycle.  On each clock cycle another instrucion is fetched and begins its 5 cycle execution.  During each clock cycle, h/w will be executing some part of the five different instructions.  A single ALU can not be asked to compute an effective address and perform a subtract operation at the same time.
  • 14. Classic Pipeline Stages  Because register file is used as a source in the ID stage and as a destination in the WB stage it appears twice.  It is read in one part of a stage( clock cycle) and written in another part, represented by a solid line and a dashed line.  IM – Instruction Memory  DM – Data Memory  CC – clock cycle.
  • 15. Pipeline Registers between successive pipeline stages
  • 16. Pipeline Registers  To ensure that instructions in different states of a pipe line do not interfere with one another,  a separaion is done by introducing pipeline registers between successive stages of the pipeline, sothat  at the end of a clock cycle all the results from a given stage are stored into a register  that is used as the input to the next stage on the next clock cycle.
  • 17. Pipeline Registers  Pipeline registers prevent interference between two different instructions in adjacent stages in the pipeline.  The registers also play a critical role of carrying data for a given instruction from one stage to the other.  The edge-triggered property of register is critical. (value change instantaneously on a clock edge)  Otherwise data from one instruction could interfere with the execution of another.
  • 18. Basic Performance Issues of Pipelining  Pipelining increases the CPU instruction throughput. (ie. the number of instructions completed per unit time)  It does not reduces the execution time of an individual instr.  It usually slightly increases the execution time of each instruction due to overhead in the control of the pipeline.  Program runs faster, eventhough no single instruction runs faster.  The clock can run no faster than the time needed for the slowest pipeline stage.  Pipeline overhead arises from the combination of pipeline register delay and clock skew.
  • 19. Basic Performance Issues of Pipelining  Pipeline registers add set up time – which is the time that a register input must be stable before the clock signal that triggers a write occurs, plus propagation delay to the clock cycle.  Clock skew is a phenomenon in synchronous digital circuit systems (such as computer systems) in which  the same sourced clock signal arrives at different components at different times.  The instantaneous difference between the readings of any two clocks is called their skew.
  • 21. Pipeline Registers between successive pipeline stages
  • 22. Pipeline Hazards Hazards are situations, that prevent the next instruction in the instruction stream from executing during its designated clock cycle. Hazards reduce the performance from the ideal speedup gained by pipelining. There are 3 classes of hazards. Structural hazards - arise from resource conflicts when the hardware can not support all possible combinations of instructions simultaneously in overlapped execution. Data hazards - arise when an instruction depends on the results of a previous instruction. Control hazards - arise from the pipelining of branches and other instructions that change the PC.
  • 23. Pipeline Hazards Hazards in pipeline can make it necessary to stall the pipeline. Avoiding a hazard often requires the some instructions in the pipeline be allowed to proceed while others are delayed. When an instruction is stalled, all instructions issued later than the stalled instruction are also stalled. Instructions issued earlier than the stalled instruction must continue, otherwise the hazard will never clear. As a result no new instructions are fetched during the stall.
  • 24. Performance of Pipeline with stalls If we ignore the cycle time overhead of pipelining and assume the stages are perfectly balanced, then the cycle time of the two processors can be equal.
  • 25. Performance of Pipeline with stalls When all instructions take the same number of cycles, which must also equal the number of pipeline stages (also called the depth of the pipeline ) If there are no pipeline stalls, pipelining can improve performance by the depth of the pipeline. (No. of pipeline stages)
  • 26. Structural Hazards Requires pipelining of functional units and duplication of resources to allow all possible combinations of instructions in the pipeline. If some combination of instructions cannot be accommodated because of resource conflicts, the processor is said to have a structural hazard. Structural hazards arise when some functional unit is not fully pipelined. Some resource has not been duplicated enough to allow all combinations of instructions. Why would a designer allow structural hazards? The primary reason is to reduce cost of the unit.
  • 29. Data Hazards Occur when the pipeline changes the order of read/write accesses to operands so that the order differs from the order seen by sequentially executing instructions on an unpipelined processor. Consider the pipelined execution of the following instructions. DADD R1,R2,R3 DSUB R4,R1,R5 AND R6,R1,R7 OR R8,R1,R9 XOR R10,R1,R11
  • 31. Data Hazards All the instructions after DADD use the result of DADD instruction. DADD writes the value of R1 in the WB pipe stage, but the DSUB reads the value during its ID stage. This problem is a data hazard. Unless precautions are taken to prevent it , the DSUB instruction will read the wrong value and try to use it. AND reads R1 during CC4 will receive wrong value because R1 will be updated at CC5 by DADD. XOR operates properly because its register read occurs in CC6, after register write. OR also operates without a hazard because we perform the register file reads in the second half of the cycle and the writes in the first half.
  • 32. Data Hazards Minimizing Data Hazard Stalls by Forwarding The previous problem can be solved with a simple hardware technique called forwarding (also called bypassing and sometimes short-circuiting ). The key insight in forwarding is that the result is not really needed by the DSUB until after the DADD actually produces it. If the result can be moved from the pipeline reg where the DADD stores it to where the DSUB needs it, then the need for a stall can be avoided.
  • 33. Data Hazards Forwarding works as follows 1) The ALU result from both the EX/MEM and MEM/WB pipeline registers is always fed back to the ALU inputs. 2) If the forwarding h/w detects that the previous ALU operations has written the register corresponding to a source for the current ALU operation, control logic selects the forwarded result as the ALU input rather than the value read from the register file.
  • 36. Data Hazards Data Hazards Requiring Stalls Not all potential data hazards can be handled by bypassing. Consider:
      LD   R1,0(R2)
      DSUB R4,R1,R5
      AND  R6,R1,R7
      OR   R8,R1,R9
  The LD instruction does not have the data until the end of its MEM cycle, while the DSUB needs it at the start of its EX cycle; even with forwarding, the DSUB must be stalled for one clock cycle.
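A matching sketch of the load-use check, with the same kind of assumed register names as the forwarding sketch above: if the instruction in EX is a load whose destination matches a source of the instruction in decode, the younger instruction stalls one cycle.

    #include <stdbool.h>

    struct { bool mem_read; int rt; } ID_EX_LD;  /* load currently in EX  */
    struct { int rs; int rt;        } IF_ID;     /* instruction in decode */

    /* True when the decode-stage instruction must stall one cycle. */
    bool load_use_stall(void) {
        return ID_EX_LD.mem_read &&
               (ID_EX_LD.rt == IF_ID.rs || ID_EX_LD.rt == IF_ID.rt);
    }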
  • 40. Branch Hazards Control hazards can cause a greater performance loss than data hazards do. When a branch is executed, it may or may not change the PC to something other than its current value plus 4. If a branch changes the PC to its target address, it is a taken branch; if it falls through, it is not taken (untaken). If instruction i is a taken branch, then the PC is normally not changed until the end of ID, after the completion of the address calculation and comparison.
  • 41. Branch Hazards The following figure shows how a branch causes a 1-cycle stall in the five-stage pipeline.
  • 42. Branch Hazards Reducing Pipeline Branch Penalties Software can try to minimize the branch penalty using knowledge of the hardware scheme and of branch behavior. Four schemes:
  1) Freeze or flush the pipeline - hold or delete any instructions after the branch until the branch destination is known.
  2) Predicted-not-taken (predicted-untaken) scheme - implemented by continuing to fetch instructions as if the branch were a normal instruction. If the branch is taken, however, we need to turn the fetched instruction into a no-op and restart the fetch at the target address.
  • 43. Branch Hazards 3) Predicted-taken scheme - of no advantage for the five-stage pipeline, because the branch target and the branch outcome are both known at the same time (the end of ID). 4) Delayed branch - the execution sequence is:
      branch instruction
      sequential successor
      branch target (if taken)
  The sequential successor is in the branch delay slot. This instruction is executed whether or not the branch is taken.
  • 44. Branch Hazards The predicted-not-taken scheme and the pipeline sequence when the branch is untaken (top) and taken (bottom).
  • 45. Branch Hazards The pipeline behavior of the five-stage pipeline with a branch delay is shown in figure.
  • 46. Branch Hazards Performance of Branch Schemes Pipeline stall cycles from branches = Branch frequency × Branch penalty. The branch frequency and the branch penalty can have a component from both unconditional and conditional branches; however, the latter dominate since they are more frequent.
  • 47. Instruction Level Parallelism  Pipelining overlaps the execution of instructions to improve performance.  Pipelining does not reduce the execution time of an instruction.  But it reduces the total execution time of the program.  This potential overlap among instructions is called “Instruction Level Parallelism” (ILP), since the instructions can be evaluated in parallel.
  • 48. Instruction Level Parallelism  There are two main approaches to exploiting ILP:  an approach that relies on hardware to help discover and exploit parallelism dynamically;  used in the Intel Core series, which dominates the desktop and server markets.  An approach that relies on software technology to find parallelism statically, at compile time;  most processors for the PMD (Personal Mobile Device) market use static approaches.  However, future processors are expected to use dynamic approaches as well.
  • 49. Instruction Level Parallelism  The value of CPI for a pipelined processor is the sum of the base CPI and all contributions from stalls:  Pipeline CPI = Ideal pipeline CPI + Structural stalls + Data hazard stalls + Control stalls.  Ideal pipeline CPI is a measure of the maximum performance attainable by the implementation.  By reducing each of the terms on the right-hand side, we minimize the overall pipeline CPI or, alternatively, increase the IPC (Instructions Per Clock).
  • 50. Instruction Level Parallelism  The amount of parallelism available within a basic block is quite small.  Since these instructions are likely to depend upon one another,  the amount of overlap we can exploit within a basic block is likely to be less than the average basic block size.  To obtain substantial performance enhancements, we must exploit ILP across multiple basic blocks.
  • 51. Instruction Level Parallelism  The simplest and most common way to increase the ILP is to exploit parallelism among iterations of a loop.  This type of parallelism is often called loop-level parallelism.  Consider a simple example of a loop that adds two 1000-element arrays and is completely parallel:
      for (i=0; i<=999; i=i+1)
          x[i] = x[i] + y[i];
   Every iteration of the loop can overlap with any other iteration.  Within each loop iteration there is little or no opportunity for overlap.
  • 52. Instruction Level Parallelism  There are a number of techniques for converting such loop-level parallelism into instruction-level parallelism.  Basically, such techniques work by unrolling the loop,  either statically by the compiler or  dynamically by the hardware.
  • 53. Data Dependence  Determining how one instruction depends on another is critical to determine  How much parallelism exists in a program  How that parallelism can be exploited.  To exploit ILP we must determine which instructions can be executed in parallel.  If two instructions are parallel, they can execute simultaneously.  If two instructions are dependent, they are not parallel and must be executed in order, although they may often be partially overlapped
  • 54. Bernstein’s Conditions for Detection of Parallelism  Bernstein’s conditions are based on the following two sets of variables: i. The Read set or input set Ri, which consists of the variables read by statement (instruction) Ii. ii. The Write set or output set Wi, which consists of the variables written by instruction Ii.  Two instructions I1 and I2 can be executed in parallel if they satisfy the following conditions:  R1 ∩ W2 = φ  R2 ∩ W1 = φ  W1 ∩ W2 = φ
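As a worked illustration (the statements are hypothetical, not from the slides): for S1: a = b + c and S2: d = a + e, we have W1 = {a} and R2 = {a, e}, so R2 ∩ W1 = {a} ≠ φ and the two statements cannot execute in parallel. Encoding each read/write set as a bitmask makes the check a one-liner in C:

    /* Variables a..e encoded as bits 0..4 of a bitmask. */
    enum { A = 1 << 0, B = 1 << 1, C = 1 << 2, D = 1 << 3, E = 1 << 4 };

    /* Bernstein's conditions: all three intersections must be empty. */
    int can_run_in_parallel(unsigned r1, unsigned w1,
                            unsigned r2, unsigned w2)
    {
        return (r1 & w2) == 0 && (r2 & w1) == 0 && (w1 & w2) == 0;
    }

Here can_run_in_parallel(B|C, A, A|E, D) returns 0, confirming that S1 and S2 conflict on a.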
  • 55. Data Dependence  Three different types of dependences:  data dependences (also called true data dependences),  name dependences, and  control dependences.  Data dependences:  true data dependence (or flow dependence),  antidependence,  output dependence.  An instruction j is data dependent on instruction i if either of the following holds:  instruction i produces a result that may be used by instruction j, or  instruction j is data dependent on instruction k, and instruction k is data dependent on instruction i (a chain of dependences).
  • 56. Data Dependence  Dependences are a property of programs.  The pipeline organization determines  whether a dependence is detected and  whether it causes a stall.  A data dependence conveys 3 things:  the possibility of a hazard,  the order in which results must be calculated, and  an upper bound on how much parallelism can possibly be exploited.  A dependence can be overcome in two different ways:  maintaining the dependence but avoiding a hazard, or  eliminating the dependence by transforming the code.
  • 57. Name Dependence  Two instructions use the same name, but there is no flow of information associated with that name.  Two types of name dependences between an instruction i that precedes instruction j in program order:  1) Antidependence: instruction j writes a register or memory location that instruction i reads.  The original ordering must be preserved to ensure that instruction i reads the correct value.  2) Output dependence: instruction i and instruction j write the same register or memory location.  Ordering must be preserved to ensure that the value finally written corresponds to instruction j.  To resolve name dependences, we use renaming techniques (register renaming), illustrated below.
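A small illustration of renaming (hypothetical statements): S2 writes a, which S1 reads, so the pair has an antidependence; giving S2's result a fresh name removes it.

    int a, b, c, d, t;

    /* Before renaming: S2 overwrites 'a', which S1 reads (WAR),
       so S1 must run first. */
    void before(void) { t = a + b;  a = c + d; }

    /* After renaming: S2 writes a new name a2; the statements are
       now independent and may execute in either order or in parallel
       (later readers of 'a' are rewritten to read a2). */
    int a2;
    void after(void)  { t = a + b;  a2 = c + d; }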
  • 61. Data Hazards  A hazard is created whenever there is a dependence between instructions, and  they are close enough that the overlap during execution would change the order of access to the operand involved in the dependence.  Because of the dependence, we have to preserve the program order.  Three types of Data Hazards  Read after write (RAW)  Write after write (WAW)  Write after read (WAR)
  • 62. Data Hazards  Read after write (RAW)  Instruction j tries to read a source before i writes it, so j incorrectly gets the old value.  This hazard is the most common type.  It corresponds to a true data dependence.  Program order must be preserved to ensure that j receives the value from i.  Write after write (WAW)  Instruction j tries to write an operand before it is written by i.  The writes end up being performed in the wrong order.  This corresponds to an output dependence.
  • 63. Data Hazards  Write after read (WAR)  Instruction j tries to write a destination before it is read by i, so i incorrectly gets the new value.  This hazard arises from an antidependence.  The read after read (RAR) case is not a hazard. Short examples of the three hazard cases follow.
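For concreteness, a hypothetical two-statement version of each case, with i preceding j in program order and both touching r1:

    int r1, r2, r3, r4, r5;

    void raw(void) { r1 = r2 + r3;  r4 = r1 + 1;  }  /* j reads i's result     */
    void waw(void) { r1 = r2 + r3;  r1 = r4 + r5; }  /* j's write must be last */
    void war(void) { r4 = r1 + 1;   r1 = r2 + r3; }  /* i must read r1 first   */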
  • 64. Control Dependences  A control dependence determines the ordering of an instruction i with respect to a branch instruction, so that instruction i is executed in correct program order and only when it should be.  These control dependences must be preserved to preserve program order.  One of the simplest examples of a control dependence is the dependence of the statements in the “then” part of an if statement on the branch.
  • 65. Control Dependences  For example, in the code segment:
      if p1 {
          S1;
      }
      if p2 {
          S2;
      }
   S1 is control dependent on p1, and S2 is control dependent on p2 but not on p1.
  • 66. Control Dependences  In general, two constraints are imposed by control dependences:  An instruction that is control dependent on a branch cannot be moved before the branch so that its execution is no longer controlled by the branch.  For example, we cannot take an instruction from the then portion of an if statement and move it before the if statement.  An instruction that is not control dependent on a branch cannot be moved after the branch so that its execution is controlled by the branch.  For example, we cannot take a statement before the if statement and move it into the then portion.
  • 67. Basic Compiler Techniques for Exposing ILP ➢ These techniques are crucial for processors that use static scheduling. ➢ The basic compiler techniques include: ➢ scheduling the code, ➢ loop unrolling, and ➢ reducing branch costs with advanced branch prediction.
  • 68. Basic Pipeline Scheduling ➢ To keep a pipeline full, parallelism among instructions must be exploited by finding sequences of unrelated instructions that can be overlapped in the pipeline. ➢ To avoid a pipeline stall, the execution of a dependent instruction must be separated from the source instruction by a distance in clock cycles equal to the pipeline latency of that source instruction.
  • 69. Basic Pipeline Scheduling ➢ A compiler’s ability to perform this scheduling depends both on the amount of ILP available in the program and on the latencies of the functional units in the pipeline. ➢ The latencies of the FP operations used are given below; the last column is the number of intervening clock cycles needed to avoid a stall.

      Instruction producing result   Instruction using result   Latency in clock cycles
      FP ALU op                      Another FP ALU op          3
      FP ALU op                      Store double               2
      Load double                    FP ALU op                  1
      Load double                    Store double               0
  • 70. Basic Pipeline Scheduling ➢ We assume ➢ the standard five-stage integer pipeline, so that branches have a delay of one clock cycle; ➢ the functional units are fully pipelined or replicated (as many times as the pipeline depth), ➢ so that an operation of any type can be issued on every clock cycle and ➢ there are no structural hazards; ➢ and an integer ALU operation latency of 0.
  • 71. Basic Pipeline Scheduling ➢ Consider the following code segment, which adds a scalar to a vector:
      for (i=999; i>=0; i--)
          x[i] = x[i] + s;
  ➢ We can see that this loop is parallel by noticing that the body of each iteration is independent. ➢ The first step is to translate the above segment to MIPS assembly language. ➢ In the following code segment, ➢ R1 is initially the address of the element in the array with the highest address, and ➢ F2 contains the scalar value s. ➢ Register R2 is precomputed, so that 8(R2) is the address of the last element to operate on.
  • 72. Basic Pipeline Scheduling ➢ The straightforward MIPS code, not scheduled for the pipeline, looks like this:
      Loop: L.D    F0,0(R1)    ;F0 = array element
            ADD.D  F4,F0,F2    ;add scalar in F2
            S.D    F4,0(R1)    ;store result
            DADDUI R1,R1,#-8   ;decrement pointer by 8 bytes (per DW)
            BNE    R1,R2,Loop  ;branch if R1 != R2
  • 73. Basic Pipeline Scheduling ➢ Without any scheduling, the loop will execute as follows:
                                   Clock cycle issued
      Loop: L.D    F0,0(R1)        1
            stall                  2
            ADD.D  F4,F0,F2        3
            stall                  4
            stall                  5
            S.D    F4,0(R1)        6
            DADDUI R1,R1,#-8       7
            stall                  8
            BNE    R1,R2,Loop      9
  • 74. Basic Pipeline Scheduling ➢ We can schedule the loop to obtain only two stalls and reduce the time to seven cycles:
                                   Clock cycle issued
      Loop: L.D    F0,0(R1)        1
            DADDUI R1,R1,#-8       2
            ADD.D  F4,F0,F2        3
            stall                  4
            stall                  5
            S.D    F4,8(R1)        6
            BNE    R1,R2,Loop      7
  ➢ The two stalls after ADD.D are for the use of its result by the S.D.
  • 75. Basic Pipeline Scheduling ➢ In the previous example, we complete one loop iteration and store back one array element every seven clock cycles. ➢ The actual work of operating on the array element takes just three (the load, add, and store) of those seven clock cycles. ➢ The remaining four clock cycles consist of ➢ loop overhead—the DADDUI and BNE—and ➢ two stalls. ➢ To eliminate these four clock cycles ➢ we need to get more operations relative to the number of overhead instructions.
  • 76. Loop Unrolling ➢ A simple scheme for increasing the number of instructions relative to the branch and overhead instructions is loop unrolling. ➢ Unrolling simply replicates the loop body multiple times, adjusting the loop termination code. ➢ Loop unrolling can also be used to improve scheduling. ➢ Because it eliminates the branch, ➢ it allows instructions from different iterations to be scheduled together
  • 77. Loop Unrolling ➢ If we simply replicated the instructions when we unrolled the loop, ➢ the resulting use of the same registers could prevent us from effectively scheduling the loop. ➢ Thus, we will want to use different registers for each iteration, ➢ increasing the required number of registers.
  • 78. Loop Unrolling without scheduling ➢ Here we assume that the number of elements is a multiple of 4 (a source-level sketch of the unrolled loop follows). ➢ Note that R2 must now be set so that 32(R2) is the starting address of the last four elements.
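The slide's MIPS listing for the unrolled loop is shown as a figure; as an illustrative stand-in, the same unrolling by 4 at the source level looks like this (x and s as in the earlier loop; the function name is hypothetical):

    /* Loop body replicated four times; assumes the element count
       (1000 here) is a multiple of 4, as stated above. */
    void add_scalar_unrolled(double *x, double s) {
        int i;
        for (i = 999; i >= 0; i -= 4) {
            x[i]   = x[i]   + s;
            x[i-1] = x[i-1] + s;
            x[i-2] = x[i-2] + s;
            x[i-3] = x[i-3] + s;
        }
    }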
  • 79. Loop Unrolling without scheduling ➢ We have eliminated 3 branches and 3 decrements of R1. ➢ Without scheduling, every operation in the unrolled loop is followed by a dependent operation and thus will cause a stall. ➢ This loop will run in 27 clock cycles: ➢ each L.D has 1 stall (1×4 = 4), ➢ each ADD.D has 2 stalls (2×4 = 8), ➢ the DADDUI has 1 stall (1×1 = 1), ➢ plus 14 instruction issue cycles, ➢ or 27/4 = 6.75 clock cycles per element. ➢ This can be scheduled to improve performance significantly.
  • 80. Loop Unrolling with scheduling
  • 81. Loop Unrolling with scheduling ➢ The execution time of the unrolled loop has dropped to a total of 14 clock cycles. ➢ or 3.5 clock cycles per element, ➢ compared with ➢ 9 cycles per element before any unrolling or scheduling ➢ 7 cycles when scheduled but not unrolled. ➢ 6.75 cycles with unrolling but no scheduling
  • 82. Strip mining ➢ In real programs we do not usually know the upper bound on the loop; suppose it is n. ➢ We would like to unroll the loop to make k copies of the body. ➢ Instead of a single unrolled loop, we generate a pair of consecutive loops: ➢ the first executes (n mod k) times and has a body that is the original loop; ➢ the second is the unrolled body surrounded by an outer loop that iterates (n/k) times (see the sketch below). ➢ For large values of n, most of the execution time will be spent in the unrolled loop body.
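A minimal C sketch of strip mining for the same array loop with k = 4 (n, x, and s as before; the function name is illustrative):

    void add_scalar_strip_mined(double *x, double s, int n) {
        int i, rem = n % 4;

        /* First loop: the (n mod k) leftover iterations, original body. */
        for (i = 0; i < rem; i++)
            x[i] = x[i] + s;

        /* Second loop: body unrolled by k = 4, iterates n/k times. */
        for (; i < n; i += 4) {
            x[i]   = x[i]   + s;
            x[i+1] = x[i+1] + s;
            x[i+2] = x[i+2] + s;
            x[i+3] = x[i+3] + s;
        }
    }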
  • 83. Loop Unrolling ➢ Loop unrolling is a simple but useful method for increasing the size of straight-line code fragments that can be scheduled effectively. ➢ Three different effects limit the gains from loop unrolling: (1) a decrease in the amount of overhead amortized with each unroll ➢ if the loop is unrolled 2n times rather than n times, the overhead per unrolled iteration is halved; (2) code size limitations ➢ growth in code size may increase the instruction cache miss rate; (3) compiler limitations ➢ a shortfall in registers (register pressure).
  • 84. Branch Prediction Loop unrolling is one way to reduce the number of branch hazards. We can also reduce the performance losses of branches by predicting how they will behave. Branch prediction schemes are of two types: static branch prediction (compile-time branch prediction) and dynamic branch prediction.
  • 85. Static Branch Prediction This is the simplest scheme, because it does not rely on information about the dynamic history of the executing code. It relies on information available at compile time and predicts the outcome of a branch based solely on the branch instruction, i.e., it uses information gathered before the execution of the program, such as profile information collected from earlier runs.
  • 86. Dynamic Branch Prediction Dynamic schemes predict branches based on program behavior, using information about taken or not-taken branches gathered at run time to predict the outcome of a branch. The simplest dynamic branch-prediction scheme is a branch-prediction buffer or branch history table: a small memory indexed by the lower portion of the address of the branch instruction. Each memory location contains a bit that says whether the branch was recently taken or not.
  • 87. Dynamic Branch Prediction Different branch instructions may have the same low-order bits, so with such a buffer we don’t know whether the prediction is correct. The prediction is a hint that is assumed to be correct, and fetching begins in the predicted direction. If the hint turns out to be wrong, the prediction bit is inverted and stored back. This simple 1-bit prediction scheme has a performance shortcoming: even if a branch is almost always taken, we will likely predict incorrectly twice, rather than once, when it is not taken, since the misprediction causes the prediction bit to be flipped.
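A minimal C sketch of such a 1-bit buffer (the table size and the indexing by low-order word-address bits are illustrative choices):

    #define ENTRIES 4096                      /* illustrative buffer size */
    static unsigned char taken_bit[ENTRIES];  /* one prediction bit each  */

    static unsigned idx(unsigned pc) { return (pc >> 2) % ENTRIES; }

    int  predict_1bit(unsigned pc)           { return taken_bit[idx(pc)]; }
    void update_1bit(unsigned pc, int taken) { taken_bit[idx(pc)] = (unsigned char)taken; }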
  • 88. Dynamic Branch Prediction 2-bit Prediction Scheme :- To overcome the weakness of the 1-bit scheme, 2-bit prediction schemes are often used. In a 2-bit scheme, a prediction must miss twice before it is changed. The figure shows the finite-state diagram for a 2-bit prediction scheme.
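The same buffer with 2-bit saturating counters, reusing ENTRIES and idx from the sketch above (counter values 0-1 predict not taken, 2-3 predict taken):

    static unsigned char ctr[ENTRIES];        /* 2-bit counter per entry */

    int predict_2bit(unsigned pc) { return ctr[idx(pc)] >= 2; }

    void update_2bit(unsigned pc, int taken) {
        unsigned i = idx(pc);
        if (taken) { if (ctr[i] < 3) ctr[i]++; }   /* saturate at 3 */
        else       { if (ctr[i] > 0) ctr[i]--; }   /* saturate at 0 */
    }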
  • 89. Dynamic Branch Prediction Correlating Branch Predictors :- The 2-bit predictor schemes use only the recent behavior of a single branch to predict the future behavior of that branch. It may be possible to improve the prediction accuracy if we also look at the recent behavior of other branches, rather than just the branch we are trying to predict. Branch predictors that use the behavior of other branches to make a prediction are called correlating predictors or two-level predictors.
  • 90. Dynamic Branch Prediction Correlating Branch Predictors :- Consider the following code:
      if (aa == 2)     // branch b1
          aa = 0;
      if (bb == 2)     // branch b2
          bb = 0;
      if (aa != bb) {  // branch b3
          ........
      }
  The behavior of branch b3 is correlated with the behavior of branches b1 and b2: if branches b1 and b2 are both not taken (so aa and bb are both set to 0 and are equal), then branch b3 will be taken.
  • 91. Dynamic Branch Prediction Correlating Branch Predictors :- A predictor that uses only the behavior of a single branch to predict the outcome of that branch can never capture this behavior. Existing correlating predictors add information about the behavior of the most recent branches to decide how to predict a given branch. For example, a (1,2) predictor uses the behavior of the last branch to choose from among a pair of 2-bit branch predictors in predicting a particular branch.
  • 92. Dynamic Branch Prediction Correlating Branch Predictors :- In the general case, an (m, n) predictor uses the behavior of the last m branches to choose from 2^m branch predictors, each of which is an n-bit predictor for a single branch. The attraction of this type of correlating branch predictor is that it can yield higher prediction rates than the 2-bit scheme and requires only a trivial amount of additional hardware.
  • 93. Dynamic Branch Prediction Correlating Branch Predictors :- The global history of the most recent m branches can be recorded in an m-bit shift register, where each bit records whether the branch was taken or not taken. The branch prediction buffer can then be indexed using a concatenation of the low-order bits from the branch address with the m-bit global history.
  • 94. Dynamic Branch Prediction Correlating Branch Predictors :- For example, in a (2, 2) buffer with 64 total entries, the 4 low-order address bits of the branch (word address) and the 2 global bits representing the behavior of the two most recently executed branches form a 6-bit index that can be used to index the 64 counters. The number of bits in an (m, n) predictor is:
      2^m × n × Number of prediction entries selected by the branch address
  A 2-bit predictor with no global history is simply a (0,2) predictor.
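A sketch of the (2,2) indexing just described, in C (the names are illustrative; only the 64-entry example's bit widths are assumed):

    static unsigned char ctr22[64];  /* 64 two-bit counters             */
    static unsigned ghr;             /* global history, low 2 bits used */

    /* 4 low-order word-address bits concatenated with the 2 history
       bits form the 6-bit index into the 64 counters. */
    unsigned index22(unsigned pc) {
        return (((pc >> 2) & 0xF) << 2) | (ghr & 0x3);
    }

    /* Shift each resolved branch outcome into the history register. */
    void record_outcome(int taken) {
        ghr = ((ghr << 1) | (taken & 1)) & 0x3;
    }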
  • 96. Dynamic Branch Prediction Tournament Predictors :- Tournament predictors use multiple predictors, usually one based on global information and one based on local information, combining them with a selector. Tournament predictors can achieve both better accuracy at medium sizes (8K–32K bits) and effective use of very large numbers of prediction bits.
  • 97. Dynamic Branch Prediction Tournament Predictors :- Existing tournament predictors use a 2-bit saturating counter per branch to choose between two different predictors based on which predictor (local, global, or even some mix) was most effective in recent predictions. As in a simple 2-bit predictor, the saturating counter requires two mispredictions before changing the identity of the preferred predictor (see the sketch below).
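A minimal sketch of the selector, assuming hypothetical predict_global() and predict_local() helpers and reusing ENTRIES and idx from the earlier sketches:

    int predict_global(unsigned pc);  /* assumed helper: global predictor */
    int predict_local(unsigned pc);   /* assumed helper: local predictor  */

    static unsigned char chooser[ENTRIES];  /* 2-bit counter: >= 2 favors global */

    int predict_tournament(unsigned pc) {
        unsigned i = idx(pc);
        return (chooser[i] >= 2) ? predict_global(pc) : predict_local(pc);
    }

    /* When the two predictors disagreed, nudge the chooser toward
       whichever one turned out to be correct. */
    void update_chooser(unsigned pc, int global_ok, int local_ok) {
        unsigned i = idx(pc);
        if (global_ok && !local_ok && chooser[i] < 3) chooser[i]++;
        if (local_ok && !global_ok && chooser[i] > 0) chooser[i]--;
    }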
  • 98. Dynamic Branch Prediction Tournament Predictors :- The advantage of a tournament predictor is its ability to select the right predictor for a particular branch.
  • 99. Dynamic Branch Prediction Fig: The misprediction rate for three different predictors on SPEC89 (benchmark suite) as the total number of bits is increased.
  • 100. Hardware Based Speculation Speculation overcomes control dependences by predicting the branch outcome and speculatively executing instructions as if the predictions were correct.
  • 101. Hardware Based Speculation Hardware-based speculation combines three key ideas: 1) dynamic branch prediction to choose which instructions to execute, 2) speculation to allow the execution of instructions before the control dependences are resolved (with the ability to undo the effects of an incorrectly speculated sequence), and 3) dynamic scheduling to deal with the scheduling of different combinations of basic blocks.
  • 102. Hardware Based Speculation Hardware-based speculation follows the predicted flow of data values to choose when to execute instructions. This method of executing programs is essentially a data flow execution: operations execute as soon as their operands are available.
  • 103. Hardware Based Speculation The key idea behind implementing speculation is to allow instructions to execute out of order but to force them to commit in order, and to prevent any irrevocable action (such as updating state or taking an exception) until an instruction commits. Hence, when we add speculation, we need to separate the process of completing execution from instruction commit, since instructions may finish execution considerably before they are ready to commit.
  • 104. Hardware Based Speculation Adding the commit phase to the instruction execution sequence requires an additional set of hardware buffers that hold the results of instructions that have finished execution but have not committed. This hardware buffer, the reorder buffer, is also used to pass results among instructions that may be speculated.
  • 105. Reorder Buffer (ROB) • The reorder buffer (ROB) provides additional registers. • The ROB holds the result of an instruction between the time the operation associated with the instruction completes and the time the instruction commits. • Hence, the ROB is a source of operands for instructions.
  • 106. Reorder Buffer (ROB) • With speculation, the register file is not updated until the instruction commits; • thus, the ROB supplies operands in the interval between completion of instruction execution and instruction commit.
  • 107. Reorder Buffer (ROB) Each entry in the ROB contains four fields: the instruction type, the destination field, the value field, and the ready field. The instruction type field indicates whether the instruction is a branch (and has no destination result), a store (which has a memory address destination), or a register operation (ALU operation or load, which has register destinations).
  • 108. Reorder Buffer (ROB) The destination field supplies the register number (for loads and ALU operations) or the memory address (for stores) where the instruction result should be written. The value field is used to hold the value of the instruction result until the instruction commits. The ready field indicates that the instruction has completed execution and the value is ready.
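As a data-structure sketch, the four fields map directly onto a small C struct (the type and field names are illustrative):

    /* One reorder-buffer entry, mirroring the four fields above. */
    typedef enum { INSTR_BRANCH, INSTR_STORE, INSTR_REG_OP } InstrType;

    typedef struct {
        InstrType type;   /* branch, store, or register operation       */
        unsigned  dest;   /* register number, or memory address (store) */
        long      value;  /* result held until the instruction commits  */
        int       ready;  /* nonzero once execution has completed       */
    } ROBEntry;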
  • 109. Basic Structure with H/W Based Speculation
  • 110. Steps in Execution There are four steps involved in instruction execution: Issue, Execute, Write result, and Commit.
  • 111. Steps in Execution Issue : Get an instruction from the instruction queue. Issue the instruction if there is an empty reservation station and an empty slot in the ROB; send the operands to the reservation station if they are available in either the registers or the ROB. Update the control entries to indicate the buffers are in use. The number of the ROB entry allocated for the result is also sent to the reservation station, so that the number can be used to tag the result when it is placed on the CDB (Common Data Bus). If either all reservation stations are full or the ROB is full, instruction issue is stalled until both have available entries.
  • 112. Steps in Execution Execute : When all operands are available, execute the operation (monitoring the CDB for any operand that is still being computed). Write result : When the result is available, write it on the CDB (with the ROB tag sent when the instruction issued) and from the CDB into the ROB, as well as to any reservation stations waiting for this result. Mark the reservation station as available.
  • 113. Steps in Execution Write result : Special actions are required for store instructions. If the value to be stored is available, it is written into the Value field of the ROB entry for the store. If the value to be stored is not available yet, the CDB must be monitored until that value is broadcast, at which time the Value field of the ROB entry of the store is updated.
  • 114. Steps in Execution Commit : This is the final stage of completing an instruction, after which only its result remains. There are three different sequences of actions at commit, depending on whether the committing instruction is a branch with an incorrect prediction, a store, or any other instruction (normal commit).
  • 115. Steps in Execution Commit : The normal commit case occurs when an instruction reaches the head of the ROB and its result is present in the buffer; at this point, the processor updates the register with the result and removes the instruction from the ROB. Committing a store is similar, except that memory is updated rather than a result register.
  • 116. Steps in Execution Commit : When a branch with an incorrect prediction reaches the head of the ROB, it indicates that the speculation was wrong. The ROB is flushed and execution is restarted at the correct successor of the branch. If the branch was correctly predicted, the branch is finished.
  • 117. Steps in Execution Once an instruction commits, its entry in the ROB is reclaimed and the register or memory destination is updated, eliminating the need for the ROB entry. If the ROB fills, we simply stop issuing instructions until an entry is made free.
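Pulling the commit rules together, a minimal sketch of one commit step on a circular ROB, reusing the ROBEntry sketch above (rob, head, regs, and mem_write are assumptions of this sketch, not a specific design):

    #define ROB_SIZE 32                /* illustrative capacity       */
    ROBEntry rob[ROB_SIZE];
    unsigned head;                     /* index of oldest instruction */
    long     regs[32];

    void mem_write(unsigned addr, long v);  /* assumed memory interface */

    /* Commit at most one instruction, strictly in program order. */
    void commit_step(void) {
        ROBEntry *e = &rob[head];
        if (!e->ready) return;                 /* head not finished yet */
        if (e->type == INSTR_REG_OP)
            regs[e->dest] = e->value;          /* normal commit         */
        else if (e->type == INSTR_STORE)
            mem_write(e->dest, e->value);      /* store commit          */
        /* A mispredicted branch at the head would instead flush the
           ROB and restart fetch at the correct successor. */
        head = (head + 1) % ROB_SIZE;
    }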
  • 118. Multithreading: Exploiting Thread-Level Parallelism to Improve Uniprocessor Throughput Multithreading allows multiple threads to share the functional units of a single processor in an overlapping fashion. In contrast, a more general method to exploit thread-level parallelism (TLP) is with a multiprocessor that has multiple independent threads operating at once and in parallel. Multithreading, however, does not duplicate the entire processor as a multiprocessor does. Instead, multithreading shares most of the processor core among a set of threads, duplicating only the private state, such as the registers and the program counter.
  • 119. contd.. • Duplicating the per-thread state of a processor core means creating a separate register file, a separate PC, and a separate page table for each thread. • There are three main hardware approaches to multithreading. 1. Fine-grained multithreading switches between threads on each clock, causing the execution of instructions from multiple threads to be interleaved. 2. Coarse-grained multithreading switches threads only on costly stalls, such as level-two or level-three cache misses. 3. Simultaneous multithreading is a variation on fine-grained multithreading that arises naturally when fine-grained multithreading is implemented on top of a multiple-issue, dynamically scheduled processor.
  • 120. Fig: The horizontal dimension represents the instruction execution capability in each clock cycle. The vertical dimension represents a sequence of clock cycles. An empty (white) box indicates that the corresponding execution slot is unused in that clock cycle. The shades of gray and black correspond to four different threads in the multithreading processors.