Unit-3: Pipeline Processing
Pipeline Processing
Basic concepts of Pipeline Processing
Instruction pipeline
Arithmetic pipeline
Handling Data, Control and Structural hazards
Compiler techniques for improving performance
Let's say that there are four loads of dirty laundry that need to be washed, dried, and folded.
We could put the first load in the washer for 30 minutes, dry it for 40 minutes, and then
take 20 minutes to fold the clothes. Then pick up the second load and wash, dry, and fold,
and repeat for the third and fourth loads. Supposing we started at 6 PM and worked as
efficiently as possible, we would still be doing laundry until midnight.
However, a smarter approach to the problem would be to put the second load of dirty
laundry into the washer after the first was already clean and whirling happily in the dryer.
Then, while the first load was being folded, the second load would dry, and a third load
could be added to the pipeline of laundry. Using this method, the laundry would be finished
by 9:30.
Instruction Execution
The processing of an instruction is divided into 5 subtasks:
1. Instruction fetch (IF).
2. Instruction decode (ID).
3. Operand fetch (OF).
4. Instruction Execution (IE).
5. Output store (OS).
COA_Unit-3_slides_Pipeline Processing .pdf
• Instructions are executed one by one, in a non-parallel fashion.
• A single H/W component can take only one task at a time from its input and produce the result at the output.
Drawback
• Only one input can be processed at a time.
• Partial or intermediate output is not possible.
Pipelining
• Pipelining is a technique where multiple instructions are overlapped
during execution.
• Pipeline is divided into stages and these stages are connected with one
another to form a pipe like structure.
• Improves the throughput of the system, i.e., the number of instructions completed per unit time.
Pipelining
Execution in Pipelined Architecture
• Parallel execution of instructions takes place.
• At a particular time slot, the instructions in flight are all in different phases.
• Instead of a single H/W component, the H/W design is split into small components.
• The segments are connected with each other through interface registers, and they can execute multiple tasks independently and in parallel.
4-Stage instruction pipeline
•The processing of each instruction is divided into 4 segments.
•FI → the segment that fetches the instruction.
•DA → the segment that decodes the instruction and calculates the effective address.
•FO → the segment that fetches the operand.
•EX → the segment that executes the instruction.
Instruction Cycle:
• Fetch Instruction
• Decode Instruction (identify opcode and operands)
• Execute Instruction
• Write Back Result (in register/memory)
Registers Involved In Each Instruction Cycle:
• Memory Address Register (MAR): It is connected to the address lines of the system bus. It specifies the address in memory for a read or write operation.
• Memory Buffer Register (MBR): It is connected to the data lines of the system bus. It contains the value to be stored in memory or the last value read from memory.
• Program Counter (PC): Holds the address of the next instruction to be fetched.
• Instruction Register (IR): Holds the last instruction fetched.
Stages of Pipelining
• Instructions of the program execute in parallel. When one instruction moves from the nth stage to the (n+1)th stage, another instruction moves from the (n-1)th stage to the nth stage.
Pipelining
• Advantages
• Pipelining improves the throughput of the system.
• In every clock cycle, a new instruction finishes its
execution.
• Allow multiple instructions to be executed concurrently.
• Disadvantages
• The design of a pipelined processor is complex and costly to manufacture.
• The latency of an individual instruction increases.
Types of Pipelining
•Instruction Pipelining
•Arithmetic Pipelining
Instruction Pipelining
• Instruction pipelining is a technique in computer architecture where multiple
instruction stages (fetch, decode, execute, etc.) are overlapped to improve
CPU performance.
• It allows the next instruction to begin before the previous one completes,
increasing throughput and reducing execution time.
1. Fetch instruction from memory.
2. Decode the instruction.
3. Calculate effective address.
4. Fetch operand from memory.
5. Execute instruction.
6. Store the result in memory.
Problems with the instruction pipeline (Pipeline Hazards)
Pipeline hazards are issues that disrupt the smooth execution of instructions in a
pipeline. The main types are:
1. Resource Hazard (Resource Conflict): Occurs when two instructions need the same resource (like memory or registers) simultaneously, causing a conflict.
2. Data Hazard (Data Dependency): Arises when an instruction depends on the result of a previous instruction that hasn't completed, leading to delays.
3. Branch Hazard (Branch Difficulties): Happens when the pipeline cannot predict the outcome of a branch (like a conditional jump) early enough, leading to incorrect instruction execution.
Arithmetic Pipeline
• Arithmetic pipelining is a technique used in processors to break down
complex arithmetic operations (like multiplication or division) into smaller
stages, allowing multiple operations to be processed concurrently.
• This improves performance by executing parts of several instructions in
parallel.
Arithmetic Pipeline
• The combined operation of floating-point addition and subtraction
is divided into four segments. Each segment contains the
corresponding suboperation.
• The suboperations in the four segments are:
• Compare the exponents by subtraction.
• Align the mantissas.
• Add or subtract the mantissas.
• Normalize the result.
Floating-point addition (shown with a decimal example):
X = A × 10^a = 0.9504 × 10^3
Y = B × 10^b = 0.8200 × 10^2
1. Compare exponents by subtraction:
• The exponents are compared by
subtracting them to determine their
difference. The larger exponent is
chosen as the exponent of the result.
• The difference of the exponents,
i.e., 3 - 2 = 1 determines how many
times the mantissa associated with the
smaller exponent must be shifted to the
right.
2. Align the mantissas:
• The mantissa associated with the smaller exponent is shifted right according to the exponent difference determined in segment one.
X = 0.9504 × 10^3
Y = 0.08200 × 10^3
3. Add the mantissas:
The two mantissas are added in segment three.
Z = X + Y = 1.0324 × 10^3
4. Normalize the result:
After normalization, the result is written as:
Z = 0.10324 × 10^4
Parameters that Determine the Performance of a Pipeline
• Speed-Up Ratio (S_k)
• Latency (L_k)
• Efficiency (E_k)
• Throughput (H_k)
Parameters that Determine the Performance of a Pipeline
Parameters that Determine the Performance of a Pipeline: Speed-up (S)
Latency measures the time to complete a single instruction, and while it doesn’t
decrease with pipelining, pipelining allows more instructions to be completed
concurrently.
Throughput is significantly improved with pipelining because multiple
instructions are processed in parallel, leading to the completion of one
instruction per clock cycle after the pipeline is filled.
Speedup quantifies the performance improvement achieved by pipelining, often
close to the number of pipeline stages, provided hazards are minimized.
Efficiency is a measure of how effectively the pipeline stages are utilized. It is
affected by hazards, stalls, and the balance of work between stages.
Example: Space-Time Diagram
• Number of segments = 4
• Number of tasks = 6
• Speed-up ratio = ?
Branch instruction hazard
•In a pipeline, a branch instruction hazard occurs when the
pipeline cannot determine the next instruction to fetch because
it depends on the outcome of a branch instruction. This creates
a delay (or "stall") in the pipeline until the branch's outcome is
known. Let’s break down this scenario with the given
information:
•Number of time slots: 12
•Total instructions: 7
•Branch instruction: 3rd instruction
•Number of stalls due to the branch: 2
•Space-Time Diagram with Stalls for the Branch Instruction
•Here’s a pipeline diagram showing how these stalls impact the
execution. Each cell represents the stage that an instruction is in
during a specific time slot.
• Explanation of Diagram
• IF (Instruction Fetch), ID (Instruction Decode), EX (Execute), MEM (Memory
Access), and WB (Writeback) represent the stages each instruction goes
through.
• The 3rd instruction is a branch, which, after the EX stage, introduces 2 stalls in
time slots 6 and 7.
• These stalls occur because the pipeline must wait for the branch outcome.
• Once the branch direction is resolved, instruction 4 resumes at time slot 8.
• Impact of Stalls: The branch hazard delays the pipeline by 2 time slots, causing
subsequent instructions to start later than they would in an uninterrupted
pipeline.
• Calculating the Total Time with Stalls
• In this case:
• Without stalls, the pipeline could ideally complete all 7 instructions in 10 time
slots.
• With the 2 stalls caused by the branch, the total execution time increases to 12
time slots.
Pipeline
Hazard
• Any condition that causes a 'stall' in the pipeline operation is called a hazard.
• Pipeline hazards are situations that prevent the next instruction in the instruction stream from executing during its designated clock cycle.
• Hazard Occurs:
A <- 3+A
B <- 4*A
• No Hazard
A <- 5*C
B <- 20+C
Pipeline: It is a technique of decomposing a sequential process into a number of subprocesses, with each subprocess executed in a special dedicated segment that operates concurrently with all other segments.
Pipeline Hazards
a) Data Hazard
b) Control or Instruction Hazard
c) Structural Hazard
Pipeline Hazard
a) Data Hazards
•An instruction cannot continue because it needs a value that has not yet been generated by an earlier instruction.
•In other words, an instruction attempts to use a result before it is ready.
There are three types of data hazard:
1) RAW (Read after Write) [Flow/True data dependency]
2) WAR (Write after Read) [Anti-Data dependency]
3) WAW (Write after Write) [Output data dependency]
Pipeline Hazard
• Let there be two instructions I and J, such that J follows I. Then,
• RAW hazard occurs when instruction J tries to read data before instruction I writes
it.
• Eg:
• I: R2 <- R1 + R3
• J: R4 <- R2 + R3
• WAR hazard occurs when instruction J tries to write data before instruction I reads
it.
• Eg:
• I: R2 <- R1 + R3
• J: R3 <- R4 + R5
Pipeline Hazard
•WAW hazard occurs when instruction J tries to write output before
instruction I writes it.
Eg:
I: R2 <- R1 + R3
J: R2 <- R4 + R5
•WAR and WAW hazards occur during the out-of-order execution of
the instructions.
#Observations
• All the instructions after the ADD use the result of the ADD instruction (in R1). The ADD instruction writes the value of R1 in the WB stage (shown black), and the SUB instruction reads the value during its ID stage (ID_sub). This problem is called a data hazard. Unless precautions are taken to prevent it, the SUB instruction will read the wrong value and try to use it.
• The AND instruction is also affected by this data hazard. The write of R1 does not complete until the end of cycle 5 (shown black). Thus, the AND instruction, which reads the registers during cycle 4 (ID_and), will receive the wrong result.
• The OR instruction can be made to operate without incurring a hazard by a simple implementation technique: perform register-file reads in the second half of the cycle and writes in the first half. Because both the WB of ADD and the ID of OR (ID_or) occur in cycle 5, the write to the register file by ADD is performed in the first half of the cycle and the read of the registers by OR in the second half.
• The XOR instruction operates properly, because its register read occurs in cycle 6, after the register write by ADD.
Control Hazard
•A control hazard (or branch hazard) occurs in pipelined processors
when the pipeline cannot determine which instruction to fetch next due
to a conditional branch instruction. This uncertainty causes a delay
until the branch outcome (whether to take the branch or not) is known.
Control hazard
Memory
Location
Instructions
12: If R1 = R3 Jump to label 36
16: AND R2, R3, R5
20: MUL R6, R1, R7
24: ADD R8, R1, R9
Label 36 DIV R10,R1, R11
Control Hazard
For instruction 12:
• Whether it will jump to address location 36 or not becomes known only at the 'MEM' phase, i.e., at CC4.
• This means that by the time instruction 12 reaches its 4th clock cycle, the next three instructions at memory locations 16, 20 and 24 have already entered the pipe and are performing operations in their respective phases.
• In the normal pipeline flow, these instructions (at addresses 16, 20 and 24) have already entered the pipe.
• Now, if the condition (R1 = R3) turns out to be true, everything that happened with the subsequent instructions becomes wrong, because at this point it is clear that the instruction at location 36 should be the next one executed, instead of 16, 20 and 24.
• So the wrongly entered/processed instructions must be flushed out.
• Therefore, a stall of 3 CC occurs: the instructions at locations 16, 20 and 24 must not execute and are flushed from the pipe.
Control hazard
•Suppose the branch instruction decides to go to location 36 for the next instruction in the MEM stage, i.e., in CC4.
•The three subsequent instructions that follow the branch are fetched and begin executing as in the normal scenario, before the branch jumps to location 36.
•Generally, the pipeline is not stopped, so the instructions after the branch also get executed; but if the branch condition is true, the unwanted instructions are simply flushed out of the pipe.
•In this case the branch penalty is 3 cycles.
Control Hazard
So,
• The instruction fetch unit of the CPU is responsible for providing a stream of instructions to the execution unit.
• The instructions fetched by the fetch unit come from consecutive memory locations until a special condition or branch occurs.
• The problem arises when one of the instructions is a branch instruction and execution needs to go to some different memory location.
• In this case all the unwanted instructions fetched into the pipeline from consecutive memory locations are now invalid and need to be removed, i.e., flushed out of the pipe.
Memory Location | Instruction
12 | If R1 = R3, jump to label 36
16 | AND R2, R3, R5   <- flushed out
20 | MUL R6, R1, R7   <- flushed out
24 | ADD R8, R1, R9   <- flushed out
36 | DIV R10, R1, R11
Control Hazard
• This causes a stall in the pipeline until the new, corrected instructions are fetched from memory.
• The time lost as a result of this is called the branch penalty.
• To reduce the resulting delay, dedicated hardware is incorporated in the fetch/decode unit to identify a possible branch instruction in advance.
• This increases the cost.
Structural Hazard
• Occurs when multiple instructions need the same resource.
• In a computer organization, common resources are shared by multiple instructions during their execution.
• These resources include memory, different kinds of registers, the ALU, the common bus, etc.
• We have a limited number of resources and a large number of instructions, so conflicts can easily occur.
• This disturbs the normal pipeline flow and is called a structural hazard.
Structural Hazard
• For I1, focus on CC4: in the example below, I1 accesses 'MEM' in CC4 to load/store data in memory.
• For I4, focus on CC4: in the same CC4, I4 is fetching its instruction from memory, so the two instructions I1 and I4 use the same resource at the same time.
        CC1  CC2  CC3  CC4  CC5  CC6  CC7  CC8
I1      IF   ID   EX   MEM  WB
I2           IF   ID   EX   MEM  WB
I3                IF   ID   EX   MEM  WB
I4                     IF   ID   EX   MEM  WB
Structural Hazard
        CC1  CC2  CC3  CC4    CC5  CC6  CC7  CC8  CC9
I1      IF   ID   EX   MEM    WB
I2           IF   ID   EX     MEM  WB
I3                IF   ID     EX   MEM  WB
I4                     STALL  IF   ID   EX   MEM  WB
• For I4: if we make a stall at CC4 and start I4's fetch in CC5, a similar problem still exists, because in CC5, I2 is using the memory along with I4.
• The same kind of problem occurs again in CC6, where I3 is in its MEM stage.
Then what will be the solution for this?
Structural Hazard solved by Stalling
With two stalls, I4's fetch in CC6 still collides with I3's MEM access:
        CC1  CC2  CC3  CC4    CC5    CC6  CC7  CC8  CC9  CC10
I1      IF   ID   EX   MEM    WB
I2           IF   ID   EX     MEM    WB
I3                IF   ID     EX     MEM  WB
I4                     STALL  STALL  IF   ID   EX   MEM  WB
With three stalls, the conflict is finally resolved:
        CC1  CC2  CC3  CC4    CC5    CC6    CC7  CC8  CC9  CC10  CC11
I1      IF   ID   EX   MEM    WB
I2           IF   ID   EX     MEM    WB
I3                IF   ID     EX     MEM    WB
I4                     STALL  STALL  STALL  IF   ID   EX   MEM   WB
Structural Hazard solution by hardware technique
• One simple solution is to provide separate memories for instructions and data.
• In the Von Neumann architecture, the same memory stores both data and instructions, which is a big drawback of that architecture.
• If we use the Harvard architecture, instructions and operand data are stored in different memories.
VON NEUMANN: one memory holding instructions and data interleaved
(I1, I2, DATA1, I3, DATA2, …, I10).
HARVARD ARCHITECTURE: separate instruction memory (I1, I2, I3, I4, …, I10)
and operand-data memory (D1, D2, D3, D4, D5, …, D10).
Structural Hazard
        CC1  CC2  CC3  CC4    CC5  CC6  CC7  CC8  CC9
I1      IF   ID   EX   MEM    WB
I2           IF   ID   EX     MEM  WB
I3                IF   ID     EX   MEM  WB
I4                     STALL  IF   ID   EX   MEM  WB
• Again, at CC6, I2 is writing its result to the register file while I4 is in its decode stage. There is a chance that I2 and I4 use the same register in these stages, so a better register architecture is required.
• This means we need to keep multiple register files for specific purposes.
• Similarly, the common bus is accessed by multiple instructions in their different stages, so there is a great chance of conflict.
• Therefore, a better bus organization is also required.
Methods of Optimizing Against Hazards – Compiler/Software Level
• While stalling is a universal remedy in that it can be used to resolve any pipelining
hazard, the costs are high, and stalls impact a chip’s ability to perform efficiently.
However, there are other methods available to resolve hazards that help retain
efficiency.
• The first we will examine are all performed in the compiler; no additional hardware
need be added to implement them because the improvements are made to the
code itself, not to the machine running it.
• Implemented correctly, compiler-level optimizations, as opposed to hardware-level
optimizations, can provide a solution to hazards that does not require extra power
and can be performed on any hardware.
Resolving Structural Hazards using Compiler/Software Level
•One approach to structural hazards is to reorder operations so that two instructions needing the same resource are never close enough together for this sort of hazard to occur.
•Reordering operations to prevent this kind of hazard is often viable at compile time.
Resolving Data Hazards using
Compiler/Software Level
•Data hazards occur in a pipeline when an instruction depends on
the result of a previous instruction that hasn't completed yet.
•Resolving data hazards is crucial for maintaining pipeline efficiency,
and common techniques include stalling, data forwarding
(bypassing), and reordering instructions.
Cycle | ADD        | SUB
1     | Fetch      |
2     | Decode     | Fetch
3     | Execute    | Decode
4     | Memory     | Execute (receives R1 forwarded from ADD's Execute)
5     | Write Back | Memory
6     |            | Write Back
Resolving Control Hazards using
Compiler/Software Level
•There are several ways to handle control hazards, including
stalling (waiting for branch resolution), branch prediction,
and branch delay slots.
Resolving Control Hazards using software/compiler level solution
Managing control hazards at the compiler level involves distancing the logical operation on which the branch is based from the branch itself, and limiting the number of branches. Both can be accomplished by loop unrolling, which also increases performance by reducing the number of iterations (and hence branches) executed.
Loop unrolling essentially expands the body of a loop so that fewer branches are necessary.
Example:
for (i=1000; i>0; i=i-1)
x[i] = x[i] + s;
Unrolled loop (4 times):
for (i = 1000; i > 0; i = i - 4) {
x[i] = x[i] + s;
x[i - 1] = x[i - 1] + s;
x[i - 2] = x[i - 2] + s;
x[i - 3] = x[i - 3] + s;
}
Challenges in Loop Unrolling
• Increased Code Size and Cache Misses
• Trade-off of loop unrolling: a larger code footprint may increase cache misses
• Optimal Unrolling Factor
• Depends on factors such as cache size and register availability
Limitations of Compiler-Level Optimization
• Limitations at Compile Time
• Many runtime hazards, like data dependencies, can’t be fully predicted
• Examples of Runtime-Only Hazards
• Control flow changes, unpredictable loops
• Conclusion: Need for hardware-level solutions for complex hazards
Methods of Optimizing Against Hazards – Hardware-Level
Resolving Structural Hazards
• Resource Duplication
• Resource Pipelining
• Dynamic Scheduling
Resource Duplication
•Resource duplication offers a more direct approach to eliminating
structural hazards. By duplicating critical resources, such as providing
multiple ALUs or additional memory ports, the pipeline can handle
concurrent requests without conflicts. This eliminates the need for
stalls and ensures smoother instruction flow. However, resource
duplication increases the hardware complexity and cost of the
processor. The trade-off between performance gain and increased
cost must be carefully considered.
Resource Duplication
• One simple solution is to provide separate memories for instructions and data.
• In the Von Neumann architecture, the same memory stores both data and instructions, which is a big drawback of that architecture.
• If we use the Harvard architecture, instructions and operand data are stored in different memories.
VON NEUMANN: one memory holding instructions and data interleaved
(I1, I2, DATA1, I3, DATA2, …, I10).
HARVARD ARCHITECTURE: separate instruction memory (I1, I2, I3, I4, …, I10)
and operand-data memory (D1, D2, D3, D4, D5, …, D10).
Pipelining the Resource
•Pipelining the resource itself is another effective strategy to mitigate
structural hazards. Instead of duplicating the entire resource, it is
divided into smaller pipelined stages. This allows the resource to
handle multiple instructions concurrently, as different stages can
process different parts of the instructions. For example, a pipelined
ALU can have separate stages for operand fetching, arithmetic
operation, and result writing. This approach improves throughput
without requiring full resource duplication, but it may increase the
latency of the resource itself.
Dynamic Scheduling
•Dynamic scheduling employs sophisticated hardware mechanisms to
dynamically analyze and reorder instructions at runtime, aiming to
avoid structural hazards. Techniques like scoreboarding and
Tomasulo's algorithm track resource availability and instruction
dependencies, allowing the processor to schedule instructions out of
order to maximize resource utilization and minimize stalls. While
highly efficient, dynamic scheduling significantly increases the
complexity of the processor's control logic and can lead to higher
power consumption.
Resolving Data Hazards
• Register Renaming
• Operand forwarding (bypassing)
Register Renaming: Unlocking Parallelism
• Register renaming eliminates false dependencies by mapping logical to physical registers.
• This technique allows multiple instructions to execute concurrently without interference.
• It frees up resources and enhances the performance of the pipeline significantly.
• Register renaming is essential for modern processors aiming for higher instruction throughput.
• It’s a cornerstone strategy in overcoming data hazards.
• Example:
• Instruction 1: R1 = R2 + R3 (stores the result in R1)
• Instruction 2: R1 = R4 + R5 (also writes R1, overwriting the previous value)
• Without register renaming, the second instruction could overwrite R1 too soon, before consumers of the first instruction's result have read it, leading to incorrect results. With register renaming:
• Instruction 1: R6 = R2 + R3 (R1 renamed to R6)
• Instruction 2: R7 = R4 + R5 (R1 renamed to R7)
• This ensures that the two instructions write different physical registers (R6 and R7), allowing parallel execution without any false dependency.
Operand forwarding
• Operand forwarding (also called bypassing) allows the dynamic transfer of data between pipeline stages.
This technique minimizes delays by forwarding results directly to dependent instructions.
It enhances throughput and reduces latency in executing instructions smoothly.
By doing so, it optimizes resource utilization within the pipeline.
Operand forwarding is a key strategy to combat data hazards.
Resolving Control Hazards
• Branch Prediction
Branch Prediction
• Guessing whether a branch (like an if-statement) will go one way or another to keep the pipeline moving
without delays.
• There are two types 1) Static and 2) Dynamic
Static Prediction:
Uses simple rules, like always assuming branches are "taken" or "not taken".
• Pros: Reduces pauses and keeps the system working smoothly.
• Cons: Wrong guesses lead to wasted work and lost time.
Dynamic Branch prediction
• A smarter type of branch prediction that looks at what happened in the past to make better guesses about
branches.
• Common techniques include 1-bit and 2-bit prediction tables, and more complex methods like Pattern
History Tables (PHT) or Branch History Tables (BHT).
• Pros: More accurate than basic guessing, which means fewer stalls.
• Cons: Needs extra hardware, and wrong guesses can still waste time.
Consider the following 4-stage instruction pipeline, where different instructions take different amounts of time in different stages. How many clock cycles are required to complete these four instructions?

     IF  ID  EX  WB
I1    2   1   2   2
I2    1   3   3   1
I3    2   2   2   2
I4    1   2   1   2

(Space-time diagram: fill in slots t1 … t16 for I1-I4.)
Q. Consider the following program segment, executed in a 4-stage pipeline: Fetch (F), Decode (D), Execute (E), Write Back (W).

ADD R0, R1, R2
MUL R3, R4, R6
SUB R7, R8, R9
DIV R10, R11, R12
STORE X, R13

Fetch, Decode and Write Back take 1 CC each, while Execute takes 3 cycles for MUL and DIV and 1 cycle for the remaining instructions. What is the speed-up?
Per-stage cycle counts for the program segment above:

     F   D   E   W
I1   1   1   1   1
I2   1   1   3   1
I3   1   1   1   1
I4   1   1   3   1
I5   1   1   1   1

Speed-up = ?
Q) A CPU has a 5-stage pipeline and operates at a frequency of 1 GHz. Instruction fetch happens in the first stage. A conditional branch instruction computes the target address and evaluates the condition in the 3rd stage. The CPU stalls and does not fetch new instructions after a conditional branch until the branch outcome is known. Given that a program consists of 1 billion instructions, of which 20% are conditional branch instructions, and each instruction takes 1 clock cycle on average, calculate the total time required to complete the program.
• Consider two pipeline implementations that have the same instruction structure and support overlapping of all instructions except memory-related operations. If memory operations cannot be executed simultaneously, one stall cycle results. In the program, 20% of the instructions involve memory-related operations. Pipeline 1 uses a 1-port memory, while Pipeline 2 uses a 2-port memory. If the speed-up factors of the respective pipelines are S1 and S2, what is the value of S2/S1?

More Related Content

PPT
Pipeline hazard
PDF
Computer SAarchitecture Lecture 6_Pip.pdf
PPT
Computer architecture pipelining
PPTX
pipeline in computer architecture design
PPT
Instruction pipelining
PPT
pipelining
PPT
Pipelining in computer architecture
PPTX
pipelining.pptx
Pipeline hazard
Computer SAarchitecture Lecture 6_Pip.pdf
Computer architecture pipelining
pipeline in computer architecture design
Instruction pipelining
pipelining
Pipelining in computer architecture
pipelining.pptx

Similar to COA_Unit-3_slides_Pipeline Processing .pdf (20)

PPSX
Concept of Pipelining
PPT
PPT
Pipelining slides
PPT
Pipelining
PPT
Pipelining
PPTX
Core pipelining
PPTX
3 Pipelining
PDF
Pipelining
PDF
Pipeline Organization Overview and Performance.pdf
PPTX
Pipelining
PPTX
Instruction pipeline: Computer Architecture
PPTX
Assembly p1
PPTX
Pipeline processing - Computer Architecture
PPT
Chapter6 pipelining
PDF
Module 2 of apj Abdul kablam university hpc.pdf
PPTX
COA Unit-5.pptx
PDF
instruction pipeline in computer architecture and organization.pdf
PPTX
Pipelining , structural hazards
PPTX
Computer organisation and architecture .
PPTX
Pipeline & Nonpipeline Processor
Concept of Pipelining
Pipelining slides
Pipelining
Pipelining
Core pipelining
3 Pipelining
Pipelining
Pipeline Organization Overview and Performance.pdf
Pipelining
Instruction pipeline: Computer Architecture
Assembly p1
COA_Unit-3_slides_Pipeline Processing .pdf

  • 2. Pipeline Processing Basic concepts of Pipeline Processing Instruction pipeline Arithmetic pipeline Handling Data, Control and Structural hazards Compiler techniques for improving performance
  • 3. Let's say that there are four loads of dirty laundry that need to be washed, dried, and folded. We could put the first load in the washer for 30 minutes, dry it for 40 minutes, and then take 20 minutes to fold the clothes. Then pick up the second load and wash, dry, and fold, and repeat for the third and fourth loads. Supposing we started at 6 PM and worked as efficiently as possible, we would still be doing laundry until midnight.
  • 4. However, a smarter approach to the problem would be to put the second load of dirty laundry into the washer after the first was already clean and whirling happily in the dryer. Then, while the first load was being folded, the second load would dry, and a third load could be added to the pipeline of laundry. Using this method, the laundry would be finished by 9:30.
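The laundry timings above can be checked with a short simulation sketch (illustrative Python, not part of the original slides): each load enters a stage as soon as it has finished the previous stage and the stage is free.

```python
def schedule(num_loads, stage_minutes):
    """Return per-load, per-stage finish times for a pipeline with unequal stage times.

    A load enters a stage only when (a) it has left the previous stage and
    (b) the previous load has left this stage (one load per stage at a time).
    """
    finish = [[0] * len(stage_minutes) for _ in range(num_loads)]
    for i in range(num_loads):
        for s, dur in enumerate(stage_minutes):
            prev_stage_done = finish[i][s - 1] if s > 0 else 0
            stage_free = finish[i - 1][s] if i > 0 else 0
            finish[i][s] = max(prev_stage_done, stage_free) + dur
    return finish

stages = [30, 40, 20]          # wash, dry, fold (minutes)
sequential = 4 * sum(stages)   # one load at a time: 360 min (6 PM -> midnight)
pipelined = schedule(4, stages)[-1][-1]
print(sequential, pipelined)   # 360 210 -> pipelined laundry done by 9:30 PM
```

Note that the drying stage (40 min) is the bottleneck: the pipelined finish time is dominated by it, which is why pipelining improves throughput but not the time for any single load.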
  • 5. Instruction Execution An instruction in a process is divided into 5 subtasks: 1. Instruction fetch (IF). 2. Instruction decode (ID). 3. Operand fetch (OF). 4. Instruction Execution (IE). 5. Output store (OS).
  • 7. • Instructions are executed one by one, in a non-parallel fashion. • A single hardware component can take only one task at a time from its input and produce the result at the output. Drawback • Only one input can be processed at a time. • Partial or intermediate output is not possible
  • 8. Pipelining • Pipelining is a technique where multiple instructions are overlapped during execution. • The pipeline is divided into stages, and these stages are connected with one another to form a pipe-like structure. • Pipelining improves the overall instruction throughput of the system.
  • 9. Pipelining Execution in Pipelined Architecture • Parallel execution of instructions takes place. • At a particular time slot, all the instructions are in different phases. • Instead of a single hardware component, we can split the hardware design into small components. • The segments are connected with each other through interface registers, and they can execute multiple tasks independently, in parallel.
  • 10. 4-Stage instruction pipeline • The processing of each instruction is divided into 4 segments. • FI → the segment that fetches the instruction. • DA → the segment that decodes the instruction and calculates the effective address. • FO → the segment that fetches the operand. • EX → the segment that executes the instruction.
  • 11. Instruction Cycle: • Fetch Instruction • Decode Instruction(Identify opcode and operand) • Execute Instruction • Write Back Result(In Register/Memory) Registers Involved In Each Instruction Cycle: • Memory address registers(MAR) : It is connected to the address lines of the system bus. It specifies the address in memory for a read or write operation. • Memory Buffer Register(MBR) : It is connected to the data lines of the system bus. It contains the value to be stored in memory or the last value read from the memory. • Program Counter(PC) : Holds the address of the next instruction to be fetched. • Instruction Register(IR) : Holds the last instruction fetched.
  • 12. Stages of Pipelining • Instructions of the program execute in parallel. When one instruction moves from the nth stage to the (n+1)th stage, another instruction moves from the (n-1)th stage to the nth stage.
  • 13. Pipelining • Advantages • Pipelining improves the throughput of the system. • In every clock cycle, a new instruction finishes its execution. • Allows multiple instructions to be executed concurrently. • Disadvantages • The design of a pipelined processor is complex and costly to manufacture. • The latency of an individual instruction may increase.
  • 14. Types of Pipelining •Instruction Pipelining •Arithmetic Pipelining
  • 15. Instruction Pipelining • Instruction pipelining is a technique in computer architecture where multiple instruction stages (fetch, decode, execute, etc.) are overlapped to improve CPU performance. • It allows the next instruction to begin before the previous one completes, increasing throughput and reducing execution time. 1. Fetch instruction from memory. 2. Decode the instruction. 3. Calculate effective address. 4. Fetch operand from memory. 5. Execute instruction. 6. Store the result in memory.
  • 17. Problems with instruction pipeline (Pipeline Hazards). Pipeline hazards are issues that disrupt the smooth execution of instructions in a pipeline. The main types are: 1.Resource Hazard (Resource Conflict): Occurs when two instructions need the same resource (like memory or registers) simultaneously, causing a conflict. 2.Data Hazard (Data Dependency): Arises when an instruction depends on the result of a previous instruction that hasn’t completed, leading to delays. 3.Branch Hazard (Branch Difficulties): Happens when the pipeline cannot predict the outcome of a branch (like a conditional jump) early enough, leading to incorrect instruction execution.
  • 18. Arithmetic Pipeline • Arithmetic pipelining is a technique used in processors to break down complex arithmetic operations (like multiplication or division) into smaller stages, allowing multiple operations to be processed concurrently. • This improves performance by executing parts of several instructions in parallel.
  • 19. Arithmetic Pipeline • The combined operation of floating-point addition and subtraction is divided into four segments. Each segment contains the corresponding suboperation. • The suboperations in the four segments are: • Compare the exponents by subtraction. • Align the mantissas. • Add or subtract the mantissas. • Normalize the results
  • 20. Floating-point addition: X = A × 10^a = 0.9504 × 10^3, Y = B × 10^b = 0.8200 × 10^2. 1. Compare exponents by subtraction: • The exponents are compared by subtracting them to determine their difference. The larger exponent is chosen as the exponent of the result. • The difference of the exponents, i.e., 3 − 2 = 1, determines how many times the mantissa associated with the smaller exponent must be shifted to the right. 2. Align the mantissas: • The mantissa associated with the smaller exponent is shifted according to the difference of exponents determined in segment one. X = 0.9504 × 10^3, Y = 0.08200 × 10^3
  • 21. 3. Add mantissas: The two mantissas are added in segment three. Z = X + Y = 1.0324 × 10^3. 4. Normalize the result: After normalization, the result is written as: Z = 0.10324 × 10^4
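A minimal Python sketch of these four segments (not from the slides; mantissas are held as decimal integers to keep the arithmetic exact, and the 4-digit mantissa width is an assumption taken from the example values):

```python
def fp_add(x, y, digits=4):
    """Four-segment floating-point addition, one step per pipeline segment.

    Numbers are (mantissa, exponent) pairs meaning 0.mantissa * 10**exponent,
    with `mantissa` stored as a `digits`-digit integer: (9504, 3) is
    0.9504 * 10**3.  Returns (mantissa, exponent, digits) of the result.
    """
    (mx, ex), (my, ey) = x, y

    # Segment 1: compare the exponents by subtraction; keep the larger one.
    diff = ex - ey
    # Segment 2: align the mantissa of the smaller exponent by shifting right.
    if diff > 0:
        my, ey = my // 10**diff, ex
    elif diff < 0:
        mx, ex = mx // 10**(-diff), ey
    # Segment 3: add the mantissas.
    mz, ez = mx + my, ex
    # Segment 4: normalize -- if the mantissa overflowed its digit budget,
    # widen it by one digit and bump the exponent.
    while mz >= 10**digits:
        ez += 1
        digits += 1
    return mz, ez, digits

# X = 0.9504 * 10^3, Y = 0.8200 * 10^2
print(fp_add((9504, 3), (8200, 2)))   # (10324, 4, 5)  i.e. Z = 0.10324 * 10^4
```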
  • 22. Parameters that Determine the Performance of the pipeline process • Speed-Up Ratio (S_k) • Latency (L_k) • Efficiency (E_k) • Throughput (H_k)
  • 23. Parameters that Determine the Performance of the pipeline process
  • 25. Parameters that Determine the Performance of the pipeline process — Speed-up (S)
  • 29. Latency measures the time to complete a single instruction, and while it doesn’t decrease with pipelining, pipelining allows more instructions to be completed concurrently. Throughput is significantly improved with pipelining because multiple instructions are processed in parallel, leading to the completion of one instruction per clock cycle after the pipeline is filled. Speedup quantifies the performance improvement achieved by pipelining, often close to the number of pipeline stages, provided hazards are minimized. Efficiency is a measure of how effectively the pipeline stages are utilized. It is affected by hazards, stalls, and the balance of work between stages.
  • 30. Example: Space-Time Diagram. Number of segments = 4, number of tasks = 6. Speed-Up Ratio = ?
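The formula slides did not survive extraction, so here are the usual textbook relations for an ideal k-segment pipeline with clock period t_p processing n tasks: time = (k + n − 1)·t_p, speed-up S_k = n·k / (k + n − 1), efficiency E_k = S_k / k, throughput H_k = n / ((k + n − 1)·t_p). A quick check for the slide's example (k = 4, n = 6):

```python
def pipeline_metrics(k, n, tp=1.0):
    """Ideal k-segment pipeline metrics for n tasks with clock period tp."""
    cycles = k + n - 1                 # cycles until the last task drains
    speedup = (n * k) / cycles         # vs. n*k cycles without pipelining
    efficiency = speedup / k           # fraction of stage-slots kept busy
    throughput = n / (cycles * tp)     # tasks completed per unit time
    return cycles, speedup, efficiency, throughput

cycles, s, e, h = pipeline_metrics(k=4, n=6)
print(cycles, round(s, 3))   # 9 cycles, speed-up = 24/9, roughly 2.667
```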
  • 32. Branch instruction hazard •In a pipeline, a branch instruction hazard occurs when the pipeline cannot determine the next instruction to fetch because it depends on the outcome of a branch instruction. This creates a delay (or "stall") in the pipeline until the branch's outcome is known. Let’s break down this scenario with the given information: •Number of time slots: 12 •Total instructions: 7 •Branch instruction: 3rd instruction •Number of stalls due to the branch: 2
  • 33. •Space-Time Diagram with Stalls for the Branch Instruction •Here’s a pipeline diagram showing how these stalls impact the execution. Each cell represents the stage that an instruction is in during a specific time slot.
  • 34. • Explanation of Diagram • IF (Instruction Fetch), ID (Instruction Decode), EX (Execute), MEM (Memory Access), and WB (Writeback) represent the stages each instruction goes through. • The 3rd instruction is a branch, which, after the EX stage, introduces 2 stalls in time slots 6 and 7. • These stalls occur because the pipeline must wait for the branch outcome. • Once the branch direction is resolved, instruction 4 resumes at time slot 8. • Impact of Stalls: The branch hazard delays the pipeline by 2 time slots, causing subsequent instructions to start later than they would in an uninterrupted pipeline. • Calculating the Total Time with Stalls • In this case: • Without stalls, the pipeline could ideally complete all 7 instructions in 10 time slots. • With the 2 stalls caused by the branch, the total execution time increases to 12 time slots.
  • 35. Pipeline Hazard • Any condition that causes a ‘stall’ in the pipeline operation is called a hazard. • Pipeline hazards are situations that prevent the next instruction in the instruction stream from executing during its designated clock cycle. • Hazard occurs: A <- 3+A, B <- 4*A (B depends on the new value of A) • No hazard: A <- 5*C, B <- 20+C (the two are independent) Pipeline: a technique of decomposing a sequential process into a number of subprocesses, with each subprocess executed in a special dedicated segment that operates concurrently with all other segments.
  • 36. Pipeline Hazards a) Data Hazard b) Control or Instruction Hazard c) Structure Hazard
  • 37. Pipeline Hazard a) Data Hazards • An instruction cannot continue because it needs a value that has not yet been generated by an earlier instruction. • In other words, an instruction attempts to use a resource before it is ready. There are three types of data hazard: 1) RAW (Read after Write) [Flow/True data dependency] 2) WAR (Write after Read) [Anti-data dependency] 3) WAW (Write after Write) [Output data dependency]
  • 38. Pipeline Hazard • Let there be two instructions I and J, such that J follow I. Then, • RAW hazard occurs when instruction J tries to read data before instruction I writes it. • Eg: • I: R2 <- R1 + R3 • J: R4 <- R2 + R3 • WAR hazard occurs when instruction J tries to write data before instruction I reads it. • Eg: • I: R2 <- R1 + R3 • J: R3 <- R4 + R5
  • 39. Pipeline Hazard •WAW hazard occurs when instruction J tries to write output before instruction I writes it. Eg: I: R2 <- R1 + R3 J: R2 <- R4 + R5 •WAR and WAW hazards occur during the out-of-order execution of the instructions.
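The three dependences between instructions I and J above can be checked mechanically. A small illustrative sketch (Python; encoding each instruction as a (destination, source-set) pair is an assumption made for the example, not something from the slides):

```python
def classify_hazards(i_instr, j_instr):
    """Return the data-hazard types between instruction I and a later J.

    Each instruction is a (dest, set_of_sources) pair, e.g.
    R2 <- R1 + R3  is  ('R2', {'R1', 'R3'}).
    """
    i_dest, i_src = i_instr
    j_dest, j_src = j_instr
    hazards = set()
    if i_dest in j_src:      # J reads what I writes  -> RAW (true dependency)
        hazards.add('RAW')
    if j_dest in i_src:      # J writes what I reads  -> WAR (anti-dependency)
        hazards.add('WAR')
    if j_dest == i_dest:     # both write same target -> WAW (output dependency)
        hazards.add('WAW')
    return hazards

print(classify_hazards(('R2', {'R1', 'R3'}), ('R4', {'R2', 'R3'})))  # {'RAW'}
print(classify_hazards(('R2', {'R1', 'R3'}), ('R3', {'R4', 'R5'})))  # {'WAR'}
print(classify_hazards(('R2', {'R1', 'R3'}), ('R2', {'R4', 'R5'})))  # {'WAW'}
```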
  • 40. #Observations • All the instructions after the ADD use the result of the ADD instruction (in R1). The ADD instruction writes the value of R1 in the WB stage (shown black), and the SUB instruction reads the value during its ID stage (ID_sub). This problem is called a data hazard. Unless precautions are taken to prevent it, the SUB instruction will read the wrong value and try to use it. • The AND instruction is also affected by this data hazard. The write of R1 does not complete until the end of cycle 5 (shown black). Thus, the AND instruction that reads the registers during cycle 4 (ID_and) will receive the wrong result. • The OR instruction can be made to operate without incurring a hazard by a simple implementation technique: perform register file reads in the second half of the cycle, and writes in the first half. Because both WB for ADD and ID_or for OR occur in cycle 5, the write to the register file by ADD happens in the first half of the cycle, and the read of the registers by OR happens in the second half. • The XOR instruction operates properly, because its register read occurs in cycle 6, after the register write by ADD.
  • 41. Control Hazard •A control hazard (or branch hazard) occurs in pipelined processors when the pipeline cannot determine which instruction to fetch next due to a conditional branch instruction. This uncertainty causes a delay until the branch outcome (whether to take the branch or not) is known.
  • 42. Control hazard

  Memory Location   Instruction
  12                If R1 = R3 Jump to label 36
  16                AND R2, R3, R5
  20                MUL R6, R1, R7
  24                ADD R8, R1, R9
  36 (Label 36)     DIV R10, R1, R11
  • 43. Control Hazard For instruction 12: • Whether it will jump to address 36 or not becomes known only at the ‘MEM’ phase, i.e., at CC4. • This means that while instruction 12 is in its 4th clock cycle, the next three instructions at memory locations 16, 20 and 24 have already entered the pipe and are performing operations in their respective phases. • Under the normal pipeline scheme, these instructions (at addresses 16, 20 and 24) have entered the pipe. • Now, if the condition (R1 = R3) turns out to be true, everything done for the subsequent instructions becomes wrong, because at this point it is clear that the instruction at location 36 should be executed next, instead of 16, 20 and 24. • So the wrongly entered/processed instructions must be flushed out. • Therefore, a stall of 3 CC occurs: the instructions at locations 16, 20 and 24 must not be allowed to execute, and they are flushed from the pipe.
  • 44. Control hazard • Suppose the branch instruction decides, in the MEM stage (i.e., in CC4), to go to location 36 for the next instruction to be executed. • The three subsequent instructions that follow the branch are fetched and begin executing as in the normal scenario, before the branch redirects to location 36. • Generally, the pipeline is not stopped, so the instructions after the branch also get executed; but if the branch condition is true, the unwanted instructions are simply flushed out of the pipe. • In this case the Branch Penalty is 3 cycles.
  • 45. Control Hazard So, • The instruction fetch unit of the CPU is responsible for providing a stream of instructions to the execution unit. • The instructions fetched by the fetch unit come from consecutive memory locations until some special condition or branch occurs. • The problem arises when one of the instructions is a branch instruction and execution must move to a different memory location. • In that case all the unwanted instructions fetched into the pipeline from consecutive memory locations are invalid and must be removed, i.e., flushed out of the pipe. Memory Location / Instructions: 12 If R1 = R3 Jump to label 36; 16 AND R2, R3, R5; 20 MUL R6, R1, R7; 24 ADD R8, R1, R9; 36 DIV R10, R1, R11 — Flush OUT
  • 46. Control Hazard • This causes STALL in the pipeline till new corrected instruction are fetched from the memory. • Thus the time lost as a result of this called as Branch Penalty. • For reducing the resulting delay, dedicated hardware is incorporated in the fetch/decode unit to identify branch instruction possibility of occurrence in advance. • It can increase the cost.
  • 47. Structural Hazard • Occurs when multiple instructions need the same resource. • In a computer organization, common resources are used by multiple instructions during their execution. • These resources include memory, registers of various kinds, the ALU, the common bus, etc. • We have a limited number of resources and a large number of instructions, so many conflicts can occur. • When such a conflict disturbs the normal pipeline flow, it is called a Structural Hazard.
  • 48. Structural Hazard • For I1: focus on CC4 — in this example, I1 accesses ‘MEM’ in CC4 to load/store data in memory. • For I4: in the same CC4, I4 is fetching its instruction from memory, so the two instructions I1 and I4 use the same resource at the same time.

        CC1  CC2  CC3  CC4  CC5  CC6  CC7  CC8
  I1    IF   ID   EX   MEM  WB
  I2         IF   ID   EX   MEM  WB
  I3              IF   ID   EX   MEM  WB
  I4                   IF   ID   EX   MEM  WB
  • 49. Structural Hazard

        CC1  CC2  CC3  CC4    CC5  CC6  CC7  CC8  CC9
  I1    IF   ID   EX   MEM    WB
  I2         IF   ID   EX     MEM  WB
  I3              IF   ID     EX   MEM  WB
  I4                   STALL  IF   ID   EX   MEM  WB

  • For I4: if we stall at CC4 and start I4's fetch in CC5, a similar problem still exists, because in CC5, I2 is using the memory (MEM) along with I4's fetch. • The same kind of conflict can occur again in CC6 and CC7. Then what is the solution?
  • 50. Structural Hazard solved by Stalling

  Attempt (2 stalls — the fetch in CC6 still collides with I3's MEM):
        CC1  CC2  CC3  CC4    CC5    CC6  CC7  CC8  CC9
  I1    IF   ID   EX   MEM    WB
  I2         IF   ID   EX     MEM    WB
  I3              IF   ID     EX     MEM  WB
  I4                   STALL  STALL  IF   ID   EX   MEM

  Solution (3 stalls):
        CC1  CC2  CC3  CC4    CC5    CC6    CC7  CC8  CC9  CC10  CC11
  I1    IF   ID   EX   MEM    WB
  I2         IF   ID   EX     MEM    WB
  I3              IF   ID     EX     MEM    WB
  I4                   STALL  STALL  STALL  IF   ID   EX   MEM   WB
  • 51. Structural Hazard solution by hardware technique • One simple solution is to provide separate memories for instructions and data. • In the Von Neumann architecture, the same memory is used for storing both data and instructions, which is a big drawback here. • If we use the Harvard architecture, we can store instructions and operand data in separate memories.

  Von Neumann (single memory):      I1, I2, DATA1, I3, DATA2, …, I10
  Harvard (separate memories):      Instructions: I1, I2, I3, I4, …, I10
                                    Operand data: D1, D2, D3, D4, …, D10
  • 52. Structural Hazard

        CC1  CC2  CC3  CC4    CC5  CC6  CC7  CC8  CC9
  I1    IF   ID   EX   MEM    WB
  I2         IF   ID   EX     MEM  WB
  I3              IF   ID     EX   MEM  WB
  I4                   STALL  IF   ID   EX   MEM  WB

  • Again at CC6, I2 is writing its result to the register file while I4 is in its decode stage. There is a chance that both I2 and I4 use the same register in these stages, so a better register architecture is required. • This means we need multiple register files, each for a specific purpose. • Similarly, the common bus is accessed by multiple instructions in their different stages, so there is a great chance of conflict — a better bus organization is required.
  • 53. Methods of Optimizing Against Hazards – Compiler/Software Level • While stalling is a universal remedy in that it can be used to resolve any pipelining hazard, the costs are high, and stalls impact a chip’s ability to perform efficiently. However, there are other methods available to resolve hazards that help retain efficiency. • The first we will examine are all performed in the compiler; no additional hardware need be added to implement them because the improvements are made to the code itself, not to the machine running it. • Implemented correctly, compiler-level optimizations, as opposed to hardware-level optimizations, can provide a solution to hazards that does not require extra power and can be performed on any hardware.
  • 54. Resolving Structural Hazards using Compiler/Software Level • One approach to structural hazards is to reorder operations such that two instructions are never so close to one another that this sort of hazard occurs. • Reordering operations to prevent this kind of hazard is often viable at compile time.
  • 57. Resolving Data Hazards using Compiler/Software Level •Data hazards occur in a pipeline when an instruction depends on the result of a previous instruction that hasn't completed yet. •Resolving data hazards is crucial for maintaining pipeline efficiency, and common techniques include stalling, data forwarding (bypassing), and reordering instructions.
  • 60.
  CYCLE  ADD         SUB
  1      Fetch
  2      Decode      Fetch
  3      Execute     Decode
  4      Memory      Execute (receives R1 from ADD's execute stage)
  5      Write back  Memory
  6                  Write back
  • 64. Resolving Control Hazards using Compiler/Software Level •There are several ways to handle control hazards, including stalling (waiting for branch resolution), branch prediction, and branch delay slots.
  • 68. Resolving Control Hazards using software/compiler-level solutions. Managing control hazards at the compiler level means distancing the logical operation on which the branch is based from the branch itself, and limiting the number of branches. Both are accomplished by loop unrolling, which also improves performance by reducing the number of iterations. Loop unrolling essentially expands the body of a loop so that fewer branches are necessary. Example:

  for (i = 1000; i > 0; i = i - 1)
      x[i] = x[i] + s;

  Unrolled loop (4 times):

  for (i = 1000; i > 0; i = i - 4) {
      x[i] = x[i] + s;
      x[i - 1] = x[i - 1] + s;
      x[i - 2] = x[i - 2] + s;
      x[i - 3] = x[i - 3] + s;
  }
  • 69. Challenges in Loop Unrolling • Increased Code Size and Cache Misses • Trade-offs of loop unrolling: larger code footprint may increase cache misses • Optimal Unrolling Factor • Explanation of factors influencing optimal unrolling: cache size, register
  • 70. Limitations of Compiler-Level Optimization • Limitations at Compile Time • Many runtime hazards, like data dependencies, can’t be fully predicted • Examples of Runtime-Only Hazards • Control flow changes, unpredictable loops • Conclusion: Need for hardware-level solutions for complex hazards
  • 71. Methods of Optimizing Against Hazards – Hardware-Level Resolving Structural Hazards • Resource Duplication • Resource Pipelining • Dynamic Scheduling
  • 72. Resource Duplication •Resource duplication offers a more direct approach to eliminating structural hazards. By duplicating critical resources, such as providing multiple ALUs or additional memory ports, the pipeline can handle concurrent requests without conflicts. This eliminates the need for stalls and ensures smoother instruction flow. However, resource duplication increases the hardware complexity and cost of the processor. The trade-off between performance gain and increased cost must be carefully considered.
  • 73. Resource Duplication • One of the simple solution is to give separate memory for keeping the instructions and the data. • In Von Neumann Architecture, same memory is used for storing the data and the instructions, so it is a big drawback of Von Neumann Architecture. • If we use harvard Architecture, we can store instructions and operand data in different slots of memory locations. VON-NEUMANN I1 I2 DATA1 I3 DATA2 . . I10 Memory for Instruction Memory for operand Data I1 D1 I2 D2 I3 D3 I4 D4 . . D5 . . I10 D10 Harvard Architecture
  • 74. Pipelining the Resource •Pipelining the resource itself is another effective strategy to mitigate structural hazards. Instead of duplicating the entire resource, it is divided into smaller pipelined stages. This allows the resource to handle multiple instructions concurrently, as different stages can process different parts of the instructions. For example, a pipelined ALU can have separate stages for operand fetching, arithmetic operation, and result writing. This approach improves throughput without requiring full resource duplication, but it may increase the latency of the resource itself.
  • 75. Dynamic Scheduling •Dynamic scheduling employs sophisticated hardware mechanisms to dynamically analyze and reorder instructions at runtime, aiming to avoid structural hazards. Techniques like scoreboarding and Tomasulo's algorithm track resource availability and instruction dependencies, allowing the processor to schedule instructions out of order to maximize resource utilization and minimize stalls. While highly efficient, dynamic scheduling significantly increases the complexity of the processor's control logic and can lead to higher power consumption.
  • 76. Resolving Data Hazards • Register Renaming • Operand Forwarding (Bypassing)
  • 77. Register Renaming: Unlocking Parallelism • Register renaming eliminates false dependencies by mapping logical registers to physical registers. • This technique allows multiple instructions to execute concurrently without interference. • It frees up resources and significantly enhances pipeline performance. • Register renaming is essential for modern processors aiming for higher instruction throughput; it is a cornerstone strategy for overcoming data hazards. • Example: • Instruction 1: R1 = R2 + R3 (stores the result in R1) • Instruction 2: R1 = R4 + R5 (overwrites R1) • Without register renaming, the second instruction could overwrite R1 too soon, before the first instruction's result is consumed, leading to incorrect results. With register renaming: • Instruction 1: R6 = R2 + R3 (R1 renamed to R6) • Instruction 2: R7 = R4 + R5 (R1 renamed to R7) • Both instructions now use different physical registers (R6 and R7), allowing parallel execution without any conflict.
  • 78. Operand forwarding • Operand forwarding (bypassing) allows the dynamic transfer of data between pipeline stages. This technique minimizes delays by forwarding results directly to dependent instructions instead of waiting for them to be written back. It enhances throughput, reduces latency, and optimizes resource utilization within the pipeline. Operand forwarding is a key strategy to combat data hazards.
  • 79. Resolving Control Hazards • Branch Prediction
  • 80. Branch Prediction • Guessing whether a branch (like an if-statement) will go one way or the other, to keep the pipeline moving without delays. • There are two types: 1) Static and 2) Dynamic. Static prediction uses simple rules, like always assuming branches are "taken" or "not taken." • Pros: reduces pauses and keeps the system working smoothly. • Cons: wrong guesses lead to wasted work and lost time.
  • 81. Dynamic Branch prediction • A smarter type of branch prediction that looks at what happened in the past to make better guesses about branches. • Common techniques include 1-bit and 2-bit prediction tables, and more complex methods like Pattern History Tables (PHT) or Branch History Tables (BHT). • Pros: More accurate than basic guessing, which means fewer stalls. • Cons: Needs extra hardware, and wrong guesses can still waste time.
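As a sketch of the 2-bit counter mentioned above (illustrative Python; the 0–3 state encoding and the "weakly taken" initial state are assumptions, not from the slides):

```python
class TwoBitPredictor:
    """2-bit saturating-counter branch predictor for a single branch.

    States: 0 = strongly not taken, 1 = weakly not taken,
            2 = weakly taken,       3 = strongly taken.
    Two wrong guesses in a row are needed to flip the prediction,
    which is what makes this scheme robust on loop branches.
    """
    def __init__(self, state=2):
        self.state = state

    def predict(self):
        return self.state >= 2          # True = predict "taken"

    def update(self, taken):
        # Saturate at the ends of the 0..3 range.
        if taken:
            self.state = min(3, self.state + 1)
        else:
            self.state = max(0, self.state - 1)

# A loop branch that is taken 9 times and then falls through once:
p = TwoBitPredictor()
outcomes = [True] * 9 + [False]
correct = sum(1 for taken in outcomes
              if (p.predict() == taken, p.update(taken))[0])
print(correct, len(outcomes))   # 9 10 -> only the loop exit is mispredicted
```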
  • 83. Consider the following 4-stage instruction pipeline where different instructions take different amounts of time in different stages. How many CC will be required to complete these four instructions in the given pipeline?

        IF  ID  EX  WB
  I1    2   1   2   2
  I2    1   3   3   1
  I3    2   2   2   2
  I4    1   2   1   2
  • 84.
        IF  ID  EX  WB
  I1    2   1   2   2
  I2    1   3   3   1
  I3    2   2   2   2
  I4    1   2   1   2

        t1 t2 t3 t4 t5 t6 t7 t8 t9 t10 t11 t12 t13 t14 t15 t16
  I1
  I2
  I3
  I4
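The blank space–time grid above can be filled in mechanically with the usual rule: an instruction enters a stage only once it has left the previous stage and the earlier instruction has left this stage (no interstage buffering beyond the single interface register is assumed). A Python sketch:

```python
def pipeline_time(stage_times):
    """Finish time (in CC) of the last instruction in a pipeline where
    stage_times[i][s] is the number of cycles instruction i spends in stage s.

    Instruction i enters stage s only when it has left stage s-1 and
    instruction i-1 has left stage s (one instruction per stage at a time).
    """
    n, k = len(stage_times), len(stage_times[0])
    finish = [[0] * k for _ in range(n)]
    for i in range(n):
        for s in range(k):
            start = max(finish[i][s - 1] if s else 0,
                        finish[i - 1][s] if i else 0)
            finish[i][s] = start + stage_times[i][s]
    return finish[-1][-1]

#            IF ID EX WB
times = [[2, 1, 2, 2],   # I1
         [1, 3, 3, 1],   # I2
         [2, 2, 2, 2],   # I3
         [1, 2, 1, 2]]   # I4
print(pipeline_time(times))   # 15 clock cycles under these assumptions
```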
  • 85. Fetch(F), Decode(D), Execute(E) Write(W). ADD R0, R1,R2 MUL R3, R4, R6 SUB R7,R8,R9 DIV R10, R11, R12 STORE X, R13 Fetch, Decode, Write Back takes 1 CC while Execution takes 3 Cycle for remaining instructions. What is the speed up? Q. Consider the following program segment which is executed in the 4- stage pipeline.
  • 86. Q. Consider the following program segment, executed in the 4-stage pipeline Fetch(F), Decode(D), Execute(E), Write Back(W): ADD R0, R1, R2; MUL R3, R4, R6; SUB R7, R8, R9; DIV R10, R11, R12; STORE X, R13. Fetch, Decode and Write Back take 1 CC each, while Execute takes 3 CC for MUL and DIV. What is the speed-up?

        F   D   E   W
  I1    1   1   1   1
  I2    1   1   3   1
  I3    1   1   1   1
  I4    1   1   3   1
  I5    1   1   1   1
  • 87.
        F   D   E   W
  I1    1   1   1   1
  I2    1   1   3   1
  I3    1   1   1   1
  I4    1   1   3   1
  I5    1   1   1   1

  I1 I2 I3 I4 I5 — Speed Up = ?
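Under the usual assumptions (non-pipelined time = sum of every stage of every instruction; pipelined time from the stage-occupancy rule that an instruction enters a stage only when it has left the previous stage and the stage is free), the speed-up can be checked with a short sketch:

```python
def pipeline_time(stage_times):
    """Finish time (CC) of the last instruction; one instruction per stage,
    entry into a stage requires both the stage and the instruction free."""
    n, k = len(stage_times), len(stage_times[0])
    finish = [[0] * k for _ in range(n)]
    for i in range(n):
        for s in range(k):
            start = max(finish[i][s - 1] if s else 0,
                        finish[i - 1][s] if i else 0)
            finish[i][s] = start + stage_times[i][s]
    return finish[-1][-1]

#            F  D  E  W
times = [[1, 1, 1, 1],   # ADD
         [1, 1, 3, 1],   # MUL   (Execute takes 3 CC)
         [1, 1, 1, 1],   # SUB
         [1, 1, 3, 1],   # DIV   (Execute takes 3 CC)
         [1, 1, 1, 1]]   # STORE
non_pipelined = sum(sum(row) for row in times)    # 24 CC
pipelined = pipeline_time(times)                  # 12 CC
print(non_pipelined / pipelined)                  # speed-up = 2.0
```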
  • 88. Q) A CPU has a 5-stage pipeline and operates at a frequency of 1 GHz. The instruction fetch happens in the first stage. A conditional branch instruction computes the target address and evaluates the condition in the 3rd stage. The CPU stalls and does not fetch new instructions following a conditional branch instruction until the branch outcome is known. Given that a program consists of 1 billion instructions, where 20% of these instructions are conditional branch instructions, and each instruction takes 1 clock cycle on average, calculate the total time required for the completion of the program.
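One standard reading of this problem: fetch happens in stage 1 and the branch resolves at the end of stage 3, so fetching resumes after 2 stall cycles per conditional branch. A sketch of the arithmetic (the 2-stall figure follows from that fetch/resolve assumption):

```python
freq_hz = 1e9                # 1 GHz clock
instructions = 1e9           # 1 billion instructions
branch_fraction = 0.20       # 20% are conditional branches
stalls_per_branch = 2        # fetch (stage 1) waits for resolution (stage 3)

base_cycles = instructions * 1                      # 1 CPI on average
stall_cycles = instructions * branch_fraction * stalls_per_branch
total_seconds = (base_cycles + stall_cycles) / freq_hz
print(total_seconds)   # 1.4 -> the program takes 1.4 seconds
```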
  • 89. • Consider two pipeline implementations that have the same instruction structure and support overlapping of all instructions, except for memory-related operations. If memory operations cannot be executed simultaneously, each such conflict results in one stall cycle. In the program, 20% of the instructions involve memory-related operations. Pipeline 1 uses a 1-port memory, while Pipeline 2 uses a 2-port memory. If the speed-up factors for the respective pipelines are S1 and S2, what is the value of S2/S1?
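One plausible reading of this exercise (an assumption, since the slide leaves details open): with a 1-port memory, every memory-related instruction causes one structural-stall cycle (CPI = 1 + 0.2 × 1 = 1.2), while a 2-port memory removes those stalls (CPI = 1). Since both pipelines run the same program against the same non-pipelined baseline, the ratio of speed-ups is the inverse ratio of CPIs:

```python
mem_fraction = 0.20          # 20% of instructions touch memory

cpi_1port = 1 + mem_fraction * 1   # each memory op adds one stall cycle
cpi_2port = 1 + mem_fraction * 0   # dual-ported memory: no structural stall

# Speed-up over the same baseline is inversely proportional to CPI,
# so S2/S1 = cpi_1port / cpi_2port.
ratio = cpi_1port / cpi_2port
print(ratio)   # 1.2
```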