Processor: Superscalars Pipeline Organization 
Z. Jerry Shi 
Computer Science and Engineering 
University of Connecticut 
* Slides adapted from Blumrich&Gschwind/ELE475’03, Peh/ELE475’*
Targeting better performance 
•Factors that decide the execution time 
Execution Time = Path Length × CPI × Cycle Time 
•Exploit parallelism
Abstract view of instruction execution unit for MIPS
Key components on datapath
Pipelining 
•An implementation technique whereby multiple instructions are overlapped in execution 
–The parallelism among instructions in a sequential stream 
–The parallelism among actions needed to execute an instruction 
•Divide the execution into multiple steps and do one step each time 
–Each step is called a pipe stage or a pipe segment 
•Pipeline throughput: how often an instruction leaves the pipeline 
•Need to balance the length of each pipeline stage 
–Processor cycle time is determined by the slowest stage 
•Ideally, the speedup is the number of pipe stages. However,… 
–Ideal time per instruction on the pipelined machine = Time per instruction on unpipelined machine / Number of pipe stages
Pipelined MIPS datapath
Pipelining in MIPS instruction execution
Two abstract representations of a 5-stage pipeline
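Performance of Pipelines 
$$\text{Speedup}_{\text{pipeline}} = \frac{\text{AVG Instr Time}_{\text{unpipelined}}}{\text{AVG Instr Time}_{\text{pipelined}}} = \frac{\text{CPI}_{\text{unpipelined}} \times \text{Cycle Time}_{\text{unpipelined}}}{\text{CPI}_{\text{pipelined}} \times \text{Cycle Time}_{\text{pipelined}}} = \frac{\text{CPI}_{\text{unpipelined}}}{\text{CPI}_{\text{pipelined}}} \times \frac{\text{Cycle Time}_{\text{unpipelined}}}{\text{Cycle Time}_{\text{pipelined}}}$$
Assume the cycle time is the same: 
$$\text{Speedup}_{\text{pipeline}} = \frac{\text{CPI}_{\text{unpipelined}}}{\text{CPI}_{\text{pipelined}}} = \frac{\text{Pipeline Depth}}{1}$$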
Things preventing you from getting the ideal speedup 
•Hazard 
•Cost of pipelining 
–Delay on pipeline registers 
–Unbalanced pipeline stages
A basic MIPS datapath
Registers added between stages
Towards Ideal Pipeline CPI 
Pipeline CPI = 
Ideal pipeline CPI + Structural Stalls + Data Hazard Stalls + Control Stalls 
–Ideal pipeline CPI: measure of the maximum performance attainable by the implementation 
–Structural hazards: HW cannot support this combination of instructions 
–Data hazards: Instruction depends on the result of prior instructions 
–Control hazards: Caused by delay between the fetching of instructions and decisions about changes in control flow (branches and jumps) 
•Stall the pipeline when there is a hazard 
–Any instructions issued earlier than the stalled instruction continue 
–Any instructions after the stalled instruction are also stalled 
•No new instructions are fetched
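The CPI breakdown above can be sketched in a few lines of Python. This is only an illustration: the function names and the stall values are hypothetical, and the speedup formula assumes, as the performance slide does, an equal cycle time and an unpipelined CPI equal to the pipeline depth.

```python
def pipeline_cpi(ideal_cpi, structural, data, control):
    """Pipeline CPI = ideal CPI + average stall cycles per instruction from each hazard type."""
    return ideal_cpi + structural + data + control


def pipeline_speedup(depth, cpi):
    """Speedup over an unpipelined machine, assuming the same cycle time
    and an unpipelined CPI equal to the pipeline depth (assumption)."""
    return depth / cpi


# Hypothetical stall cycles per instruction, chosen only for illustration.
cpi = pipeline_cpi(ideal_cpi=1.0, structural=0.05, data=0.20, control=0.15)
speedup = pipeline_speedup(depth=5, cpi=cpi)
print(f"CPI = {cpi:.2f}, speedup = {speedup:.2f}")
```

With no stalls the speedup equals the pipeline depth; every stall cycle per instruction pulls it below that ideal.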
Structural hazards in a simple RISC pipeline 
Accessing memory in the same cycle
Performance impact of structural hazards 
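Without the structural hazard: ideal CPI = 1, clock rate = 1 
With the structural hazard: 40% of the instructions cause a one-cycle stall, clock rate = 1.05 
Which one is faster? 
Instruction count is the same, so only the time per instruction needs to be considered 
The average time per instruction for the processor with the structural hazard is 
$$\text{AVG Instr Time} = \text{CPI} \times \text{Cycle Time} = (1 + 0.4 \times 1) \times \frac{\text{Cycle Time}_{\text{ideal}}}{1.05} \approx 1.3 \times \text{Cycle Time}_{\text{ideal}}$$
The processor without the structural hazard is therefore about 1.3 times faster.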
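The comparison can be checked numerically. The sketch below uses a hypothetical helper `avg_instr_time` and measures time in units of the ideal cycle time; the one-cycle stall per hazard is the same assumption the equation makes.

```python
def avg_instr_time(ideal_cpi, stall_fraction, stall_cycles, clock_rate):
    """Average time per instruction in units of the ideal cycle time.
    clock_rate is relative to the ideal machine (1.05 means 5% faster clock)."""
    cpi = ideal_cpi + stall_fraction * stall_cycles
    return cpi / clock_rate


no_hazard = avg_instr_time(1.0, 0.0, 1, 1.0)
with_hazard = avg_instr_time(1.0, 0.4, 1, 1.05)
print(f"without hazard: {no_hazard:.2f}, with hazard: {with_hazard:.2f}")
```

The faster clock does not make up for the extra 0.4 stall cycles per instruction.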
Data hazards
Bypassing can handle some data hazards 
Any other bypassing paths?
Forwarding required by stores
Some problems cannot be solved by bypassing
Data forwarding requires more inputs on multiplexers 
Any other paths?
Data forwarding to the MEM stage
Examples of data forwarding 
Cycle            1   2   3   4   5   6   7   8   9
LD R2, 0(R11)    IF  ID  EX  ME  WB
ADD R1, R2, R3       IF  ID  -   EX  ME  WB
ADD R4, R1, R4           IF  -   ID  EX  ME  WB
ADD R5, R1, R5                   IF  ID  EX  ME  WB

Cycle            1   2   3   4   5   6   7   8   9
LD R2, 0(R11)    IF  ID  EX  ME  WB
ST R2, 0(R12)        IF  ID  EX  ME  WB
ADD R1, R3, R4           IF  ID  EX  ME  WB
ST R1, 0(R13)                IF  ID  EX  ME  WB
Software scheduling to avoid load hazards 
Try producing fast code for 
a = b + c; 
d = e – f; 
assuming a, b, c, d, e, and f are in memory. 
Slow code: 
LW Rb,b 
LW Rc,c 
ADD Ra,Rb,Rc 
SW a,Ra 
LW Re,e 
LW Rf,f 
SUB Rd,Re,Rf 
SW d,Rd 
Fast code: 
LW Rb,b 
LW Rc,c 
LW Re,e 
ADD Ra,Rb,Rc 
LW Rf,f 
SW a,Ra 
SUB Rd,Re,Rf 
SW d,Rd 
Compiler optimizes for performance. Hardware checks for safety.
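To see why the second ordering is faster, the load-use stalls in each sequence can be counted. The sketch below is a simplified model rather than a pipeline simulator. It assumes a 5-stage pipeline with full forwarding, where the only stall is one cycle when an instruction reads a register loaded by the instruction directly before it. The tuple encoding and the function name are made up for this example, and the model ignores that a store of a just-loaded value can be forwarded to MEM without a stall (that case does not occur in either sequence).

```python
def load_use_stalls(program):
    """Count one-cycle load-use stalls.
    program: list of (opcode, dest, sources) tuples in issue order."""
    stalls = 0
    for prev, cur in zip(program, program[1:]):
        prev_op, prev_dest, _ = prev
        _, _, cur_sources = cur
        # A stall is needed when the previous instruction is a load
        # and the current instruction reads the loaded register.
        if prev_op == "LW" and prev_dest in cur_sources:
            stalls += 1
    return stalls


slow = [
    ("LW", "Rb", []), ("LW", "Rc", []),
    ("ADD", "Ra", ["Rb", "Rc"]), ("SW", None, ["Ra"]),
    ("LW", "Re", []), ("LW", "Rf", []),
    ("SUB", "Rd", ["Re", "Rf"]), ("SW", None, ["Rd"]),
]
fast = [
    ("LW", "Rb", []), ("LW", "Rc", []), ("LW", "Re", []),
    ("ADD", "Ra", ["Rb", "Rc"]), ("LW", "Rf", []),
    ("SW", None, ["Ra"]), ("SUB", "Rd", ["Re", "Rf"]), ("SW", None, ["Rd"]),
]

print(load_use_stalls(slow), load_use_stalls(fast))  # 2 0
```

Under this model the slow code stalls twice (ADD after LW Rc, SUB after LW Rf), while the rescheduled code never places a consumer right behind its load.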
Reducing branch hazards 
Forwarding from EX/MEM and MEM/WB
Handling control hazards 
Branch instruction:      IF  ID  EX  MEM  WB
Branch successor:        IF  IF  ID  EX  MEM  WB
Branch successor + 1:    IF  ID  EX  MEM  WB
Branch successor + 2:    IF  ID  EX  MEM  WB
• Freeze/flush the pipeline. Wait until the branch destination is known 
–Penalty is fixed 
• Treat every branch as not taken 
• Treat every branch as taken 
–Any advantages in our 5-stage pipeline? 
• Delayed branch 
–Execution order: Branch instruction, Sequential successor 1, Branch target if taken 
What if the condition is not resolved until the EX stage?
Predicted Not Taken 
Untaken Branch instr.:      IF  ID  EX  MEM  WB
Branch successor:           IF  ID  EX  MEM  WB
Branch successor + 1:       IF  ID  EX  MEM  WB
Branch successor + 2:       IF  ID  EX  MEM  WB

Taken Branch instruction:   IF  ID  EX  MEM  WB
Branch successor:           IF  IF  ID  EX  MEM  WB
Branch target:              IF  ID  EX  MEM  WB
Branch successor + 1:       IF  ID  EX  MEM  WB
Branch successor + 2:       IF  ID  EX  MEM  WB
Scheduling the branch delay slot 
•a) is the best choice, fills delay slot & reduces instruction count (IC) 
•In b), the sub instruction may need to be copied, increasing IC 
•In b) and c), it must be okay to execute sub when branch fails
Delayed Branch 
•Compiler effectiveness for single branch delay slot: 
–Fills about 60% of branch delay slots 
–About 80% of instructions executed in branch delay slots useful in computation 
–About 50% (60% x 80%) of slots usefully filled 
•Delayed branch downside: 
–As processors move to deeper pipelines and multiple issue, the branch delay grows and more than one delay slot is needed 
–Delayed branching has lost popularity compared to more expensive but more flexible dynamic approaches 
–Growth in available transistors has made dynamic approaches relatively cheaper
Performance of Branch Schemes 
Example: 
Assume a deeper pipeline. 
4% unconditional branch, 
6% conditional branch- untaken, 
10% conditional branch-taken. 
$$\text{Pipeline speedup} = \frac{\text{Pipeline depth}}{1 + \text{Branch frequency} \times \text{Branch penalty}}$$
Branch scheme        Penalty unconditional   Penalty untaken   Penalty taken
Flush                2                       3                 3
Predicted taken      2                       3                 2
Predicted untaken    2                       0                 3
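The branch frequencies and penalties above can be combined into a CPI for each scheme. The sketch below assumes an ideal CPI of 1 and that branch penalties simply add to it. The dictionary layout and function name are made up for this example, and this is not necessarily how the slides derive their speedup figures.

```python
# Branch mix from the example: fraction of all instructions.
FREQ = {"unconditional": 0.04, "untaken": 0.06, "taken": 0.10}

# Penalty in cycles for each scheme and branch kind, from the table above.
PENALTY = {
    "Flush":             {"unconditional": 2, "untaken": 3, "taken": 3},
    "Predicted taken":   {"unconditional": 2, "untaken": 3, "taken": 2},
    "Predicted untaken": {"unconditional": 2, "untaken": 0, "taken": 3},
}


def branch_cpi(scheme, ideal_cpi=1.0):
    """Ideal CPI plus the average branch stall cycles per instruction."""
    stalls = sum(FREQ[kind] * PENALTY[scheme][kind] for kind in FREQ)
    return ideal_cpi + stalls


for scheme in PENALTY:
    print(f"{scheme:18s} CPI = {branch_cpi(scheme):.2f}")
```

Dividing a pipeline depth by one of these CPI values gives the pipeline speedup from the formula above.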
Evaluating Branch Alternatives 
Branch scheme        Speedup vs Flush   Delayed branch
Flush                1
Predicted taken      1.06               1.14
Predicted untaken    1.12               1.19
For delayed branch, 50% of the slots can be filled with useful instructions.
MIPS pipeline with three unpipelined FP functional units
•Multiple FP instructions can be executed simultaneously 
Pipelined functional units
Latency and initiation interval 
•Latency: the number of intervening cycles between an instruction that produces a result and an instruction that uses the result 
–Typically 1 cycle less than the depth of the execution pipeline 
•Consider LD with a two-stage execution: the latency is 1 cycle if the following instruction is not ST 
•Initiation interval: the number of cycles that must elapse between issuing two operations to the same functional unit 
For example, a multiplier with a latency of 7 cycles 
Unpipelined: initiation interval is 7 cycles. 1, 8, 15, … 
Pipelined: initiation interval is 1 cycle. 1, 2, 3, …
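The two issue sequences follow directly from the initiation interval. A minimal sketch, with a hypothetical helper name:

```python
def issue_cycles(initiation_interval, count, first_cycle=1):
    """Cycles in which back-to-back operations can be issued to one functional unit."""
    return [first_cycle + i * initiation_interval for i in range(count)]


print(issue_cycles(7, 3))  # unpipelined multiplier: [1, 8, 15]
print(issue_cycles(1, 3))  # pipelined multiplier:   [1, 2, 3]
```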
Latencies and initiation intervals for functional units 
Functional unit   # of execution stages   Latency   Initiation interval
Integer ALU       1                       0         1
Data memory       2                       1         1
FP add            4                       3         1
FP multiply       7                       6         1
FP divide         25                      24        25
Pipeline timing of a set of independent FP operations 
•Instructions are fetched and sent to functional units in order 
•Instructions do not complete in order because their execution lengths differ 
Cycle    1   2   3   4   5   6   7   8   9   10  11
MUL.D    IF  ID  M1  M2  M3  M4  M5  M6  M7  ME  WB
ADD.D        IF  ID  A1  A2  A3  A4  ME  WB
L.D              IF  ID  EX  ME  WB
S.D                  IF  ID  EX  ME  WB
FP code sequence showing the stalls (from RAW) 
Cycle           1   2   3   4   5   6   7   8   9
L.D F4, 0(R2)   IF  ID  EX  ME  WB
MUL F0,F4,F6        IF  ID  -   M1  M2  M3  M4  M5
ADD F2,F0,F8            IF  -   ID  -   -   -   -
S.D F2, 0(R2)                   IF  -   -   -   -

Cycle           10  11  12  13  14  15  16  17  18
L.D F4, 0(R2)
MUL F0,F4,F6    M6  M7  ME  WB
ADD F2,F0,F8    -   -   A1  A2  A3  A4  ME  WB
S.D F2, 0(R2)   -   -   ID  EX  -   -   -   ME  WB
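The start cycles of the dependent L.D, MUL, ADD chain can be reproduced from the latency table. This sketch assumes full forwarding, so a consumer begins executing `latency + 1` cycles after its producer begins. The function name is hypothetical, and the store is left out because it needs its data only in the MEM stage.

```python
# Latencies (intervening cycles) from the functional-unit table in these slides.
LATENCY = {"L.D": 1, "MUL": 6, "ADD": 3}


def first_exec_cycles(chain, first_start=3):
    """chain: ops where each op consumes the result of the one before it.
    Returns the cycle in which each op enters its first execution stage."""
    starts = []
    for i, op in enumerate(chain):
        if i == 0:
            starts.append(first_start)  # IF in cycle 1, ID in 2, EX in 3
        else:
            producer = chain[i - 1]
            starts.append(starts[-1] + LATENCY[producer] + 1)
    return starts


print(first_exec_cycles(["L.D", "MUL", "ADD"]))  # [3, 5, 12]
```

These values match the timing above: M1 in cycle 5 and A1 in cycle 12.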
Handling multiple writes to register file 
•Track the use of the write port in the ID stage and stall an instruction before it issues 
–Stall the instruction if it would write in the same cycle as an instruction already issued 
–Use a shift register to track which instruction needs the register file in which cycle 
•Stall a conflicting instruction when it tries to enter either the MEM or WB stage 
–May choose either instruction to stall 
•May give priority to instructions with long latencies 
–Does not detect the conflict until the entrance of the MEM or WB stage, where it is easy to see 
–Complicates pipeline control as stalls may arise from two places
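The first approach (reserving the write port in ID with a shift register) might look roughly like the sketch below. It assumes a single write port. The class name, method names, and the cycle distances in the usage lines are hypothetical.

```python
class WritePortTracker:
    """Shift-register model of write-port reservations.
    Bit i of `reserved` set means the write port is taken i cycles from now."""

    def __init__(self):
        self.reserved = 0

    def try_issue(self, cycles_until_wb):
        """Called in ID. Returns False (stall) if the port is already reserved
        for the cycle in which this instruction would write back."""
        mask = 1 << cycles_until_wb
        if self.reserved & mask:
            return False
        self.reserved |= mask
        return True

    def tick(self):
        """Advance one clock cycle: every reservation moves one cycle closer."""
        self.reserved >>= 1


tracker = WritePortTracker()
tracker.try_issue(8)            # long-latency op reserves the port 8 cycles ahead
for _ in range(3):
    tracker.tick()
conflict = not tracker.try_issue(5)  # would write in the same cycle: must stall
```

A stalled instruction simply retries in the next cycle, after `tick()` has shifted the reservations.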
Problems with Pipelining 
•Exception: An unusual event happens to an instruction during its execution 
–Examples: divide by zero, undefined opcode 
•Interrupt: Hardware signal to switch the processor to a new instruction stream 
–Example: a sound card interrupts when it needs more audio output samples (an audio “click” happens if it is left waiting) 
•Problem: It must appear as if the exception or interrupt occurs between 2 instructions (Ii and Ii+1) 
–The effect of all instructions up to and including Ii is totally complete 
– No effect of any instruction after Ii can take place 
•The interrupt (exception) handler either aborts program or restarts at instruction Ii+1
Dealing with exceptions 
•Exceptions are harder to handle in a pipelined processor 
–An instruction is executed in several steps, making it more difficult to determine whether an instruction can safely change the state of the processor 
•Other instructions in pipeline may cause exceptions 
•Example of exceptions 
–Invoking an operating system service 
–Breakpoint (programmer-requested interrupt) 
–Integer/FP arithmetic overflow or anomaly 
–Memory access (Page fault, protection, misalignment) 
–Unknown instructions 
–Hardware malfunctions 
–I/O request 
–Power failure
Classification of exceptions 
•Synchronous versus asynchronous 
–Occur at the same place every time the program is executed? 
•User requested versus coerced 
–User asks for it? 
•User maskable versus nonmaskable 
–Can be masked (disabled) by user? 
•Within versus between instructions 
–Occur in the middle of execution and prevent instruction completion? 
•Resume versus terminate 
–Can program’s execution be resumed?
Stopping and restarting exceptions 
•Most difficult exceptions 
–Occur within instructions (e.g. in the EX and MEM stages) 
–Must be restartable 
•Possible solutions 
–Force a trap instruction into the pipeline on the next IF 
–Until the trap is taken, turn off all writes for the faulting and all following instructions 
–In the exception handler, save the PC of the faulting instruction 
•Precise exceptions: the pipeline can be stopped so that 
–the instructions before the faulting instruction can complete 
–the instructions after the faulting instruction can be restarted
Precise Exceptions in Static Pipelines 
Key observation: architected state changes only in the memory and register write stages.
A more complicated pipeline 
•Fetch 
•Decode 
•Dispatch 
•Issue 
•Execute 
•Finish 
•Complete 
•Retire 
Branch Prediction 
Dynamic Scheduling 
Reorder buffer
Superscalar Pipeline 
Executing multiple instructions in parallel
Instruction Fetch 
•Limit on maximum throughput of pipeline 
•Fetch s instructions per cycle from I cache 
•Problems with attaining throughput: 
–Control flow → Branch Prediction 
–Alignment of cache line and PC
Interactions between Instruction Fetch and Instruction Cache Structure 
•In b), if a fetch group (s instructions) straddles two cache lines, the I cache must be accessed twice 
–If either cache line misses, the pipeline stalls
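Whether a fetch group straddles a line boundary depends only on the PC, the group size, and the line size. A small sketch, assuming fixed 4-byte instructions and a 32-byte cache line (both values, and the function name, are assumptions for illustration):

```python
def icache_accesses(pc, s, instr_bytes=4, line_bytes=32):
    """Number of I-cache lines touched when fetching s instructions starting at pc."""
    first_line = pc // line_bytes
    last_line = (pc + s * instr_bytes - 1) // line_bytes
    return last_line - first_line + 1


print(icache_accesses(pc=0, s=4))   # aligned group: one line
print(icache_accesses(pc=24, s=4))  # bytes 24..39 cross the 32-byte boundary: two lines
```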
Instruction Decode 
•Extract from assembly instruction 
–Instruction Type (Decoder) 
–Dependencies (Comparators) 
–Operands (Register Files & Buses) 
•CISC → RISC: 
–Converted to ROP (RISC OP)
Instruction Decode - Predecoding
AMD’s K6 can decode two instructions per cycle
Instruction Dispatch 
•Dataflow: 
–Send an instruction to a functional unit as soon as its operands are available, regardless of original program order. 
–Tomasulo’s algorithm
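The dataflow idea can be illustrated with a toy scheduler in which each instruction starts as soon as its source operands are ready. This is a deliberately simplified sketch and not Tomasulo's algorithm itself: it assumes unlimited functional units, assumes register renaming has removed WAR and WAW hazards, and uses hypothetical latencies and names.

```python
def dataflow_schedule(instrs):
    """instrs: list of (name, dest, sources, latency) in program order.
    Returns {name: cycle in which execution starts}."""
    ready_at = {}  # register -> cycle at which its new value becomes available
    start = {}
    for name, dest, sources, latency in instrs:
        # Start once every source operand is available; registers with no
        # pending producer are treated as ready in cycle 1.
        begin = max([ready_at.get(src, 1) for src in sources], default=1)
        start[name] = begin
        ready_at[dest] = begin + latency
    return start


program = [
    ("DIV", "F0", ["F2", "F4"], 25),
    ("ADD", "F6", ["F0", "F8"], 4),     # waits for the DIV result
    ("MUL", "F10", ["F12", "F14"], 7),  # independent: need not wait for ADD
]
schedule = dataflow_schedule(program)
print(schedule)
```

Here the MUL starts before the ADD that precedes it in program order, which is the out-of-order behavior the slide describes.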
Instruction Dispatch 
•Centralized reservation station 
•Distributed reservation station
Instruction Execution 
•How many functional units? Why different types? 
–Constraints of area, power, interconnection, etc. 
•You cannot put as many as you want 
–Mix of functional units may not be ideal for some applications 
•Bypassing 
–Bypassing needed between functional units to minimize stalls
Instruction Completion & Retiring 
•Completion → Registers 
•Reorder/Store buffer in between 
–Registers in the buffer (not register file) hold the new values 
•Retiring → Memory
Limiting factors: Pipelining hazards 
•Structural hazards 
–Resource conflicts when hardware cannot support all possible combinations of instructions simultaneously 
•Data hazards 
–An instruction depends on the results of a previous instruction 
•Control hazards 
–Branch instructions that change the instruction flow
