Understanding pipelining:
Pipelining is running multiple stages of the same process in parallel in a way that efficiently uses
all the available hardware while respecting the dependencies of each stage upon the previous
stages. In the laundry example, the stages are washing, drying, and folding. By starting a wash
stage as soon as the previous wash stage is moved to the dryer, the idle time of the washer is
minimized. Notice that the wash stage takes less time than the dry stage, so the wash stage must
remain idle until the dry stage finishes: the steady state throughput of the pipeline is limited by
the slowest stage in the pipeline. This can be mitigated by breaking up the bottleneck stage into
smaller sub-stages. For those less concerned with laundry-based examples, consider a video
game. The CPU computes the keyboard/mouse input each frame and moves the camera
accordingly, then the GPU takes that information and actually renders the scene; meanwhile, the
CPU has already begun calculating what's going to happen in the next frame.
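To make the slowest-stage limit concrete, here is a minimal Python sketch (with made-up stage times, not taken from any source) that schedules loads of laundry through the three stages and shows the gap between finished loads settling at the duration of the slowest stage:

# A minimal sketch: simulate a 3-stage laundry pipeline and observe that
# steady-state throughput is set by the slowest stage.
# Stage times are hypothetical illustrative numbers.

stage_times = {"wash": 30, "dry": 45, "fold": 15}  # minutes (assumed)
stages = list(stage_times)

def schedule(n_loads):
    """Return finish times; a stage is busy until its current load clears."""
    stage_free = {s: 0 for s in stages}   # when each machine is next free
    finish = []
    for _ in range(n_loads):
        t = 0  # when this load is ready for its next stage
        for s in stages:
            start = max(t, stage_free[s])  # wait for both the item and the machine
            t = start + stage_times[s]
            stage_free[s] = t              # machine is busy until then
        finish.append(t)
    return finish

f = schedule(5)
# The gap between consecutive finished loads settles at the slowest stage (45 min).
print([b - a for a, b in zip(f, f[1:])])  # -> [45, 45, 45, 45]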
How pipelining is done:
In class, we mentioned that interpreting each computer instruction is a four-step process: fetching the instruction, decoding it and reading the registers, executing it, and recording the results. Each
instruction may take 4 cycles to complete, but if our throughput is one instruction each cycle,
then we would like to perform, on average, $n$ instructions every $n$ cycles. To accomplish
this, we can split up an instruction's work into the 4 different steps so that other pieces of
hardware work to decode, execute, and record results while the CPU performs the fetch. The
latency to process each instruction is fixed at 4 cycles, so by processing a new instruction every
cycle, after four cycles, one instruction has been completed and three are "in progress" (they're
in the pipeline). After many cycles the steady state throughput approaches one completed
instruction every cycle.
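As an illustration, the following sketch (assuming the four stages named above, with conventional abbreviations, and one new instruction issued per cycle) prints the classic pipeline diagram: after cycle 4 the first instruction has completed and three more are in flight, and from then on one instruction finishes every cycle:

# A minimal sketch (assumed 4-stage pipeline, one issue per cycle):
# tabulate which stage each instruction occupies in each cycle.

STAGES = ["IF", "ID", "EX", "WB"]  # fetch, decode/read, execute, record results

def pipeline_table(n_instructions, n_cycles):
    rows = []
    for i in range(n_instructions):      # instruction i enters IF at cycle i
        row = []
        for c in range(n_cycles):
            k = c - i                    # how far instruction i has advanced
            row.append(STAGES[k] if 0 <= k < len(STAGES) else "..")
        rows.append(row)
    return rows

for i, row in enumerate(pipeline_table(4, 7)):
    print(f"i{i+1}: " + " ".join(f"{s:>3}" for s in row))
# After cycle 4, i1 has completed and i2..i4 are in flight; from then on,
# one instruction finishes every cycle (steady-state throughput = 1/cycle).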
An assembly line in an auto manufacturing plant is another good example of a pipelined process.
There are many steps in the assembly of the car, each of which is assigned a stage in the pipeline.
Typically the depth of these pipelines is very large: cars are pretty complex, so there need to be a
lot of stages in the assembly line. The more stages, the longer it takes to crank the system up to a
steady state. The larger the depth, the more costly it is to turn the system around: A branch
misprediction in an instruction pipeline would be like getting one of the steps wrong in the
assembly line: all the cars affected would have to go back to the beginning of the assembly line
and be processed again.
OnLive example (real-time):
OnLive is a company that allows gamers to play video games in the cloud. The games are run on
one of the company's server farms, and video of the game is sent back to your computer. The
idea is that even a low-end computer can run the most demanding games because all the
computer does is send your joystick input over the internet and display the frames it gets back.
Of course, no one wants to play a game with a noticeably low framerate. We're going to
demonstrate how OnLive could deliver a reasonable experience. For our purposes, we'll assume
that OnLive uses a four step process: the user's computer sends over the input to the server
(10ms), the server tells the game about the user's input and then compresses the resulting game
frame (15ms), the compressed video is sent back to the user (60ms) where it is then
decompressed and displayed (15ms). Note that OnLive doesn't share its data, so these numbers
are contrived.
The latency of this process is 100ms (10+15+60+15). This means that there will always be a
tenth of a second lag from when you perform an action to when you see it affect things on the
screen.
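Pipelining is what keeps this latency from also capping the frame rate. Here is a small worked example using the contrived stage times above: if the four stages overlap like pipeline stages, the frame rate is set by the slowest stage (the 60ms transfer), not by the 100ms total:

# Worked numbers from the contrived OnLive stages above: latency is the sum
# of the stages, but with pipelining the frame rate is set only by the
# slowest stage.

stages_ms = {"send input": 10, "simulate+compress": 15,
             "send video": 60, "decompress+display": 15}

latency_ms = sum(stages_ms.values())             # 100 ms input-to-screen lag
bottleneck_ms = max(stages_ms.values())          # 60 ms network stage
fps = 1000 / bottleneck_ms                       # ~16.7 frames per second

print(latency_ms, bottleneck_ms, round(fps, 1))  # 100 60 16.7

# If the 60 ms transfer were split into, say, three overlapping 20 ms chunks,
# the bottleneck would drop to 20 ms and the ceiling would rise to 50 fps,
# while the 100 ms lag per frame would remain.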
Communication between different parts of a machine is not particularly easy to manage, since it often occurs in bursts: a huge demand on the communication framework followed by a period of very little activity. Communication can be sped up by pipelining, however. We do not necessarily have to wait for one message to be delivered before we send the next piece of information, so we can overlap messages in flight. Often, however, the rate at which we can send messages is much faster than the rate at which data moves through the slowest part of our system. Therefore, pipelining only helps to an extent, because in the long run our communication is limited by the slowest part of the system.
Data hazards:
Data hazards occur when instructions that exhibit data dependence modify data in different
stages of a pipeline. Ignoring potential data hazards can result in race conditions (also termed
race hazards). There are three situations in which a data hazard can occur:
read after write (RAW), a true dependency
write after read (WAR), an anti-dependency
write after write (WAW), an output dependency
Consider two instructions i1 and i2, with i1 occurring before i2 in program order.
Read after write (RAW):
(i2 tries to read a source before i1 writes to it) A read after write (RAW) data hazard refers to a
situation where an instruction refers to a result that has not yet been calculated or retrieved. This
can occur because even though an instruction is executed after a prior instruction, the prior
instruction has been processed only partly through the pipeline.
For example:
i1. R2 <- R1 + R3
i2. R4 <- R2 + R3
The first instruction is calculating a value to be saved in register R2, and the second is going to
use this value to compute a result for register R4. However, in a pipeline, when operands are fetched for the second instruction, the results from the first will not yet have been saved, and hence a
data dependency occurs.
A data dependency occurs with instruction i2, as it is dependent on the completion of instruction
i1.
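To see the timing concretely, here is a small sketch assuming the classic five-stage IF/ID/EX/MEM/WB pipeline used in the forwarding example later: i2 reads its operands in ID two cycles before i1 writes R2 back in WB, so without a stall or forwarding it would read a stale value:

# A minimal sketch (assuming a five-stage IF/ID/EX/MEM/WB pipeline):
# operands are read in ID, results are written back in WB.

STAGES = ["IF", "ID", "EX", "MEM", "WB"]

def cycle_of(instr_index, stage):
    """Cycle (1-based) in which an instruction reaches a stage, issuing one per cycle."""
    return instr_index + STAGES.index(stage) + 1

i1_writes = cycle_of(0, "WB")   # cycle 5: R2 <- R1 + R3 is written back
i2_reads  = cycle_of(1, "ID")   # cycle 3: R4 <- R2 + R3 reads its operands
print(i2_reads < i1_writes)     # True -> RAW hazard without a stall or forwarding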
Write after write (WAW):
(i2 tries to write an operand before it is written by i1) A write after write (WAW) data hazard
may occur in a concurrent execution environment.
Example:
For example:
i1. R2 <- R4 + R7
i2. R2 <- R1 + R3
The write back (WB) of i2 must be delayed until i1 finishes executing.
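All three hazard types from the list above can be detected by comparing the registers each instruction reads and writes. The following is a minimal sketch using a hypothetical (destination, sources) encoding, not any real ISA:

# A minimal sketch: classify the hazard between two instructions
# i1 (earlier) and i2 (later) by comparing read and write sets,
# matching the three cases listed above.

def hazards(i1, i2):
    """Each instruction is (dest_register, set_of_source_registers)."""
    d1, src1 = i1
    d2, src2 = i2
    found = []
    if d1 in src2: found.append("RAW")   # i2 reads what i1 writes
    if d2 in src1: found.append("WAR")   # i2 writes what i1 reads
    if d1 == d2:   found.append("WAW")   # both write the same register
    return found

print(hazards(("R2", {"R1", "R3"}), ("R4", {"R2", "R3"})))  # ['RAW']
print(hazards(("R2", {"R4", "R7"}), ("R2", {"R1", "R3"})))  # ['WAW']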
Structural hazards:
A structural hazard occurs when a part of the processor's hardware is needed by two or more
instructions at the same time. A canonical example is a single memory unit that is accessed both
in the fetch stage where an instruction is retrieved from memory, and the memory stage where
data is written and/or read from memory.[3] They can often be resolved by separating the
component into orthogonal units (such as separate caches) or bubbling the pipeline.
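As a rough illustration of the single-memory-unit example, the sketch below (with assumed stage offsets: IF at 0, MEM at 3) finds the cycles in which an instruction fetch and a data access would collide on one memory port:

# A minimal sketch (assumed single-ported memory): an instruction in IF and
# a load/store in MEM collide whenever both need memory in the same cycle,
# forcing a bubble (or a separate instruction cache).

def mem_conflicts(n_instr, is_mem_op, if_stage=0, mem_stage=3):
    """Cycles where a fetch and a data access hit the single port together."""
    conflicts = []
    for i in range(n_instr):
        for j in range(n_instr):
            # instruction i fetches at cycle i + if_stage;
            # instruction j accesses data memory at cycle j + mem_stage
            if is_mem_op[j] and i + if_stage == j + mem_stage:
                conflicts.append((i, j, i + if_stage))
    return conflicts

# Instruction 0 is a load; its MEM stage (cycle 3) collides with the IF of
# instruction 3, which must be delayed by a bubble.
print(mem_conflicts(5, [True, False, False, False, False]))  # [(3, 0, 3)]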
Control hazards (branch hazards):
Further information: Branch (computer science)
Branching hazards (also termed control hazards) occur with branches. On many instruction
pipeline microarchitectures, the processor will not know the outcome of the branch when it needs
to insert a new instruction into the pipeline.
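The cost of this can be estimated with the standard effective-CPI formula; the numbers below are assumptions chosen only for illustration:

# A minimal sketch with made-up numbers: if the branch outcome is resolved
# k stages into the pipeline, each mispredicted branch costs roughly k-1
# bubble cycles, inflating the average cycles per instruction (CPI).

base_cpi = 1.0          # ideal pipelined CPI
branch_freq = 0.20      # fraction of instructions that are branches (assumed)
mispredict_rate = 0.10  # fraction of branches predicted wrong (assumed)
penalty = 3             # bubbles per misprediction (assumed resolution in stage 4)

effective_cpi = base_cpi + branch_freq * mispredict_rate * penalty
print(effective_cpi)    # 1.06 -> about a 6% slowdown from control hazards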
Forwarding:
The data hazards introduced by this sequence of instructions can be solved with a simple hardware technique called forwarding.
              1    2    3    4    5    6    7
ADD R1,R2,R3  IF   ID   EX   MEM  WB
SUB R4,R5,R1       IF   ID   EX   MEM  WB
AND R6,R1,R7            IF   ID   EX   MEM  WB
The key insight in forwarding is that the result is not really needed by SUB until after the ADD
actually produces it. The only problem is to make it available for SUB when it needs it.
If the result can be moved from where the ADD produces it (EX/MEM register), to where the
SUB needs it (ALU input latch), then the need for a stall can be avoided.
Using this observation, forwarding works as follows:
The ALU result from the EX/MEM register is always fed back to the ALU input latches.
If the forwarding hardware detects that the previous ALU operation has written the register
corresponding to the source for the current ALU operation, control logic selects the forwarded
result as the ALU input rather than the value read from the register file.
Forwarding of results to the ALU requires the addition of three extra inputs on each ALU multiplexer and the addition of three paths to the new inputs.
The paths correspond to a forwarding of:
(a) the ALU output at the end of EX,
(b) the ALU output at the end of MEM, and
(c) the memory output at the end of MEM.
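Putting the three paths together, the forwarding mux can be modeled as preferring the newest in-flight value of a source register over the register-file read. This is a minimal sketch with illustrative names, not a description of any particular hardware:

# A minimal sketch of the forwarding control described above: the ALU input
# mux selects the most recent in-flight value of a source register over the
# (possibly stale) register-file read.

def alu_input(src_reg, regfile, ex_mem, mem_wb):
    """ex_mem/mem_wb are (dest_register, value) pairs or None."""
    if ex_mem and ex_mem[0] == src_reg:   # (a) ALU output at the end of EX
        return ex_mem[1]
    if mem_wb and mem_wb[0] == src_reg:   # (b)/(c) ALU or memory output at the end of MEM
        return mem_wb[1]
    return regfile[src_reg]               # no hazard: use the register-file value

regfile = {"R1": 0, "R2": 2, "R3": 3, "R5": 5, "R7": 7}
ex_mem = ("R1", 5)   # ADD R1,R2,R3 just produced R1 = 5 into EX/MEM

# SUB R4,R5,R1 reads R1 in the next cycle: forwarding supplies 5, not the
# stale 0 still sitting in the register file.
print(alu_input("R1", regfile, ex_mem, None))  # 5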