ECE 4100/6100
Advanced Computer Architecture
Lecture 8 Dynamic Scheduling (II)
Prof. Hsien-Hsin Sean Lee
School of Electrical and Computer Engineering
Georgia Institute of Technology
Modern Processors
• Branch prediction results in speculative execution
• Speculative instructions (if wrongly speculated) must not alter the architectural state
– Architectural registers
– Memory
• Precise exceptions/interrupts are required
Modern Out-of-Order Core
[Block diagram: ALLOC, RAT, RS, ROB, ARF, LSQ]
• ALLOC: allocates instructions
• RAT (Register Alias Table): renames architectural registers
• ROB (Reorder Buffer): maintains state information (physical registers) for precise interrupts and speculative execution
• RS (Reservation Station): issues instructions to functional units
• ARF: architectural register file
• LSQ (Load Store Queue): maintains memory access ordering
Register Renaming
Architected registers: R0 to R7
Physical registers: T0 to Tn-1
Original code        Renamed code
R2 = R1 + R3         T1 = R1 + R3
R4 = R2 - R6         R4 = T1 - R6
…                    …
R2 = R7 / R5         T20 = R7 / R5
BEQ R2, #1           BEQ T20, #1
…                    …
R2 = R4 * R1         T7 = R4 * R1
R6 = Load [R2]       R6 = Load [T7]
• The WAW and WAR hazards on R2 are removed: No False Dependencies!
Adapted from Prof. G. Loh’s Slides
Sandy Bridge: 160 PRs for INT, 144 PRs for FP
Register Renaming
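For each instruction Dest = Src1 op Src2:
• The mapping mechanism looks up the sources: Src1 → TagS1, Src2 → TagS2
• A new tag TagD is taken from the unmapped physical registers
• The mapping is updated: Dest → TagD
• The renamed instruction is TagD = TagS1 op TagS2
• Repeat for each instruction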
Adapted from Prof. G. Loh’s Slides
Register Alias Table (RAT)
• Use a lookup table for renaming
• One entry per architectural register
• Each entry maps to the most recent version of the architectural register, which could be in
– Physical register file
– Architectural register file
[Figure: P6-style register renaming (HP PA-8000 and PPC604 do the same). The RAT has one entry per register (EAX, EBX, ECX, EDX, ESI, EDI, ESP, EBP); each entry points into the ROB (40 entries, with Data and Status fields) or into the RRF.]
RAT Example
RAT columns are R0 to R7; "-" means the value is in the architectural register file.
Instruction      Renamed            RAT (R0..R7)           Free PRegs
(initial)                           -  -  -  -  -  -  -  -   T13, T14, T15, T16
R1 = R2 + R3     T13 = R2 + R3      - 13  -  -  -  -  -  -   T14, T15, T16
R5 = R4 – R1     T14 = R4 – T13     - 13  -  -  - 14  -  -   T15, T16
R1 = R1 * R5     T15 = T13 * T14    - 15  -  -  - 14  -  -   T16
R2 = R5 / R1     T16 = T14 / T15    - 15 16  -  - 14  -  -   (empty)
Adapted from Prof. G. Loh’s Slides
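The renaming steps of the RAT example can be sketched in a few lines of Python. This is only an illustration under simple assumptions: the function name `rename`, the tuple layout `(dest, src1, src2)` and the use of a dictionary for the RAT are hypothetical choices, and the free list is assumed never to run dry.

```python
def rename(instructions, free_pregs):
    """Rename (dest, src1, src2) triples one at a time.

    rat maps an architectural register to its newest physical register;
    a register missing from rat still lives in the architectural file.
    """
    rat = {}
    free = list(free_pregs)          # free physical registers, used in order
    renamed = []
    for dest, src1, src2 in instructions:
        # Sources are looked up BEFORE the destination mapping is updated.
        s1 = rat.get(src1, src1)
        s2 = rat.get(src2, src2)
        tag = free.pop(0)            # assumption: a free register is available
        rat[dest] = tag
        renamed.append((tag, s1, s2))
    return renamed, rat, free


program = [("R1", "R2", "R3"),   # R1 = R2 + R3
           ("R5", "R4", "R1"),   # R5 = R4 - R1
           ("R1", "R1", "R5"),   # R1 = R1 * R5
           ("R2", "R5", "R1")]   # R2 = R5 / R1
renamed, rat, free = rename(program, ["T13", "T14", "T15", "T16"])
```

With the four free registers T13 to T16, the sketch ends with R1 mapped to T15, R2 to T16 and R5 to T14, and an empty free list.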
Superscalar Rename
Instruction        Source tags (from RAT)   Dest tag (from free register pool)
R1 = R2 + R3       T16, T23                 T10
R4 = R5 – R7       T39, T7                  T31
R3 = R0 / R2       T14, T16                 T19
R5 = Ld 12[R6]     T5, X                    T6
• Don’t rename immediates (X)
• For N-wide superscalar:
– 2N RAT read ports
– N RAT write ports
Intra-Group Dependencies
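Instruction        Source tags (from RAT)   Dest tag (from free register pool)
R2 = R2 + R3       T16, T23                 T10
R4 = R5 – R7       T39, T7                  T31
R3 = R0 / R2       T14, T16                 T19
R5 = Ld 12[R6]     T5, X                    T6
• R3 = R0 / R2 reads T16 from the RAT: this is the wrong version of R2
• It should be using T10, the version of R2 written by R2 = R2 + R3 in the same group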
Intra-Group Dependencies
Instruction       Source tags read from RAT   Result of sequential renaming   Dest tag (from free register pool)
R1 = R2 + R1      T16, T34                    T16, T34                        T10
R2 = R1 – R2      T34, T16                    T10, T16                        T31
R1 = R2 / R1      T16, T34                    T31, T10                        T19
R1 = R2 >> R1     T16, T34                    T31, T19                        T6
• The result of sequential renaming gives the correct final renamed registers
Resolving Intra-Group Dependencies
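[Figure: Inst 0 to Inst 3 each send Src L, Src R and Dest to the RAT. The RAT returns source tags T0L to T3L and T0R to T3R; destination tags come from the free register pool. The Intra-Group Dependency Checker combines the RAT tags with Pdst0, Pdst1 and Pdst2 to produce the final source tags.]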
Adapted from Prof. G. Loh’s Slides
Intra-Group Dependency Checking
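[Figure: comparator network. Each source of instruction i (srciL, srciR) is compared (=) with the destinations dst0 … dst(i-1) of the earlier instructions in the group. A multiplexer then selects either the RAT tag (TiL, TiR) or the matching Pdst to produce the renamed source (RiL, RiR). Instruction 1 needs one comparator per source, instruction 2 needs two, instruction 3 needs three; src0L and src0R need none.]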
Adapted from Prof. G. Loh’s Slides
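The checking logic can be mimicked in software to see why the comparators are needed. The sketch below is a behavioral model, not a description of the hardware: `rename_group` and its argument layout are hypothetical names, and the inner loop stands in for the parallel comparators and multiplexers.

```python
def rename_group(group, rat, pdsts):
    """Rename a group of (dest, srcL, srcR) instructions in one step.

    Every source is first read from the RAT as it was before the group,
    then overridden by the newest earlier instruction in the group that
    writes the same register (the job of the "=" comparators).
    """
    renamed = []
    for i, (dest, src_l, src_r) in enumerate(group):
        tags = []
        for src in (src_l, src_r):
            tag = rat.get(src, src)          # parallel RAT read (old mapping)
            for j in range(i):               # compare with dst0 .. dst(i-1)
                if group[j][0] == src:
                    tag = pdsts[j]           # a later match wins: newest producer
            tags.append(tag)
        renamed.append((pdsts[i], tags[0], tags[1]))
    return renamed


group = [("R1", "R2", "R1"),   # R1 = R2 + R1
         ("R2", "R1", "R2"),   # R2 = R1 - R2
         ("R1", "R2", "R1"),   # R1 = R2 / R1
         ("R1", "R2", "R1")]   # R1 = R2 >> R1
result = rename_group(group, {"R1": "T34", "R2": "T16"},
                      ["T10", "T31", "T19", "T6"])
```

For this group the model produces the same source tags as sequential renaming: (T16, T34), (T10, T16), (T31, T10) and (T31, T19).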
Mapping Selection
R1 = R2 + R1
R2 = R1 – R2
R1 = R2 / R1
R1 = R2 >> R1
• Only the mapping of the last instruction (pdst3) should be written into the RAT for R1
• Condition: use mapping if instruction is last writer to the register
– use pdst0 if dst0 != dst1, dst0 != dst2 and dst0 != dst3
– use pdst1 if dst1 != dst2 and dst1 != dst3
– use pdst2 if dst2 != dst3
– use pdst3 always
• Implemented with a priority encoder
Adapted from Prof. G. Loh’s Slides
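The last-writer condition is easy to state as code. The following is a minimal sketch with a hypothetical helper `rat_updates`; the `!=` checks are evaluated one after another here, whereas the hardware would evaluate them in parallel.

```python
def rat_updates(dests, pdsts):
    """Return {arch_reg: pdst} for the mappings that may be written to the RAT.

    Instruction i writes its mapping only if no later instruction in the
    group has the same destination (dst_i != dst_j for every j > i).
    """
    updates = {}
    for i, dest in enumerate(dests):
        is_last_writer = all(dest != later for later in dests[i + 1:])
        if is_last_writer:
            updates[dest] = pdsts[i]
    return updates


# Destinations of: R1 = R2 + R1, R2 = R1 - R2, R1 = R2 / R1, R1 = R2 >> R1
updates = rat_updates(["R1", "R2", "R1", "R1"], ["T10", "T31", "T19", "T6"])
```

Here only two RAT writes survive: R2 gets T31 (from the second instruction) and R1 gets T6 (from the fourth).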
Issue with Imprecise Interrupt
• Assume add instructions take one cycle
• E.g.,
– Load (left side) induces a “data page fault”;
– Add (right side) induces an “instruction page fault”
• If out-of-order completion is allowed
– r10, r12 (or r2, r4) … will be modified
– Wrong values will be used by the re-issued load
• Interrupt classes
– Program interrupts (exceptions or traps)
– External interrupts (asynchronous)
Left side:
lw r5, 8(r10)
add r10, r9, r8
add r12, r10, r7
Right side (the code crosses from the end of non-resident page X to the start of resident page X+1, causing an instruction page fault):
L1:
add r3, r1, r2
add r4, r1, r4
add r2, r4, r4
Precise Interrupts
• To reflect a sequential architecture model ⇒ serially correct (think about a single-issue, non-pipelined processor)
• Keep the “Precise State” of an execution
– All instructions before the interrupted instruction must be completed
– The state should appear as if no instruction after the interrupted instruction had issued
– The interrupted PC should be presented to the interrupt handler (restartable)
• Similar to branch misprediction handling
• Out-of-order execution makes the ordering hard
– Undo what comes after an interrupt
Why Support Precise Interrupts
• Need to maintain a precise state (for recovery)
• Software debugging
• I/O or timer interrupts
• Virtual memory (page fault)
• Instruction emulation
• Virtual machines
Supporting Precise Interrupts
• Buffer results
• Can reconstruct the scenario (state) as in sequential execution
• Restart from saved PC with saved PC state
Reorder Buffer (ROB) [Smith & Pleszkun ’85, ’88]
• Architecture Register File keeps the “In-order state”
• Reorder Buffer (ROB)
– A circular buffer
– Contains all in-flight instructions
– Buffers the “Lookahead state”
– In-order allocation/deallocation with head/tail pointers
• When an exception occurs
– Halt instruction issue
– Revert to the in-order state using the RF and discard ROB results
• Also used for branch misprediction recovery
• Pentium Pro/II/III integrates the physical register file within the ROB
• Pentium 4 decouples the ROB and the physical register file
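A small behavioral sketch may help to see how in-order retirement keeps the register file precise. Everything here is an assumption made for illustration: the class name `ReorderBuffer`, the entry fields, and the use of an unbounded `deque` in place of a fixed-size circular buffer with explicit head and tail pointers.

```python
from collections import deque


class ReorderBuffer:
    def __init__(self, arf):
        self.arf = dict(arf)      # in-order (committed) state
        self.entries = deque()    # left end = head (oldest), right end = tail

    def allocate(self, pc, dest):
        entry = {"pc": pc, "dest": dest, "done": False, "value": None, "exc": False}
        self.entries.append(entry)          # in-order allocation at the tail
        return entry

    def commit(self):
        """Retire finished instructions from the head, in program order.

        Returns the PC of an excepting instruction once it reaches the head
        (younger results are then discarded), otherwise None.
        """
        while self.entries and (self.entries[0]["done"] or self.entries[0]["exc"]):
            head = self.entries[0]
            if head["exc"]:
                self.entries.clear()        # discard the lookahead state
                return head["pc"]
            self.arf[head["dest"]] = head["value"]
            self.entries.popleft()
        return None


rob = ReorderBuffer({"R1": 1, "R2": 2, "R3": 3, "R4": 4})
i0 = rob.allocate(0xA000, "R1")    # R1 = R1 + 10
i1 = rob.allocate(0xA004, "R2")    # R2 = R2 * 2
i2 = rob.allocate(0xA008, "FR1")   # FR1 = FR2 / 0.0
i3 = rob.allocate(0xA00C, "R3")    # R3 = R3 + 1

i3.update(done=True, value=4)      # finishes out of order
i2["exc"] = True                   # exception is recorded, not yet taken
first = rob.commit()               # head (i0) is not done: nothing retires
i0.update(done=True, value=11)
i1.update(done=True, value=4)
fault_pc = rob.commit()            # retires i0 and i1, then meets the exception
```

After the second `commit`, R1 and R2 hold their new values while R3 keeps its old one, because the result of `R3 = R3 + 1` was younger than the excepting instruction and never left the ROB.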
Reorder Buffer (with physical registers)
[Figure: circular ROB. Each entry holds V, Spec?, Done?, PC, Exp event, RegDst and Data (physical register). Head points to the oldest instruction; Tail points to the next entry to be allocated.]
Sandy Bridge: 168-entry ROB
Handling Precise Interrupts
ROB entry fields: V, Spec?, Done?, PC, Exp event, RegDst, Data (physical register). Initial ARF: R1=1, R2=2, R3=3, R4=4.
1. Three instructions are in flight; Head points at xA000.
     V  Spec?  Done?  PC     Exp event  RegDst  Instruction
     1  0      0      xA000  0000       R1      R1=R1+10
     1  0      0      xA004  0000       R2      R2=R2*2
     1  0      0      xA008  0000       FR1     FR1=FR2/0.0
2. R1=R1+10 finishes (data 11) and retires: ARF R1=11, its entry is freed (V=0) and Head moves to xA004. R3=R3+1 (xA00C, RegDst R3) is allocated at Tail.
3. R3=R3+1 finishes out of order (Done=1, data 4) but stays in the ROB behind the older instructions. R4=R4*2 (xA010, RegDst R4) is allocated.
4. FR1=FR2/0.0 records an exception: its Exp event field becomes 0010. R4=R4*2 finishes (Done=1, data 8). FR4=FR4*2.0 (xA014, RegDst FR4) is allocated.
5. R2=R2*2 finishes (Done=1, data 4) and retires: ARF R2=4 and Head moves to xA008.
6. The excepting instruction is now at Head:
     V  Spec?  Done?  PC     Exp event  RegDst  Data  Instruction
     1  0      0      xA008  0010       FR1           FR1=FR2/0.0   (Head)
     1  0      1      xA00C  0000       R3      4     R3=R3+1
     1  0      1      xA010  0000       R4      8     R4=R4*2
     1  0      0      xA014  0000       FR4           FR4=FR4*2.0
   ARF: R1=11, R2=4, R3=3, R4=4
• Exception detected. Back up “PC” and current RF
• The values 4 (for R3) and 8 (for R4) were not committed into the RF
• Depending on the exception, the process will either abort or execution will be resumed from this excepting instruction
Handling Speculative Execution
ROB entry fields: V, Spec?, Done?, PC, Exp event, RegDst, Data (physical register). Initial ARF: R1=1, R2=2, R3=3, R4=4.
1. Two instructions are in flight:
     V  Spec?  Done?  PC     Exp event  RegDst  Instruction
     1  0      0      xB000  0000       R1      R1=R1+10
     1  0      0      xB004  0000               BEQ R1, R0, L1
2. BEQ R1, R0, L1 is predicted TAKEN. Instructions from the predicted path are allocated with Spec?=1:
     V  Spec?  Done?  PC     Exp event  RegDst  Instruction
     1  1      1      xC100  0000       R2      R2=R3 << 2
     1  1      0      xC104  0000       R1      R1=R2*R3
     1  1      0      xD2AC  0000               BEQ R3, R0, L1
     1  1      1      xD2B0  0000       R1      R1=R7+1
   Finished speculative results stay in the ROB (the figure shows data values 28 and 32) and are not written to the ARF.
3. R1=R1+10 retires (ARF R1=11). BEQ R1, R0, L1 is resolved, actually NOT TAKEN: BEQ misprediction.
4. Retire the branch and clear all entries after the mis-speculated branch.
5. Continue execution from the correct path (fall through in this case):
     V  Spec?  Done?  PC     Exp event  RegDst  Instruction
     1  0      0      xB008  0000       R2      R2=R5 << 4
RAT Recovery
• ARF state corresponds to the state prior to the oldest non-committed instruction
• As instructions are processed, the RAT corresponds to the register mapping after the most recently renamed instruction
• On a branch misprediction, wrong-path instructions are flushed from the machine
• The RAT is left with an invalid set of mappings corresponding to the wrong-path instruction state
Adapted from Prof. G. Loh’s Slide
Solution: Stall and Drain
• Correct-path instructions arrive from fetch; can’t rename because the RAT is wrong
• Allow all instructions to execute and commit; ARF corresponds to the last committed instruction
• ARF now corresponds to the state right before the next instruction to be renamed (foo)
• Reset the RAT so that all mappings refer to the ARF
• Resume renaming the new correct-path instructions from fetch
• Pros: Very simple to implement
• Cons: Performance loss due to stalls
Another Solution: Checkpointing
• At each branch, make a copy of the RAT (register mapping at the time of the branch)
• Checkpoint the free pool as well
• On a misprediction:
1. flush wrong-path instructions
2. deallocate RAT checkpoints
3. recover RAT from checkpoint
4. resume renaming (foo)
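Checkpoint recovery can be sketched as follows. The class `CheckpointedRAT` and its method names are hypothetical, branches are identified by arbitrary labels, and the free-pool checkpoint and the flush of wrong-path instructions are deliberately left out to keep the example short.

```python
class CheckpointedRAT:
    def __init__(self, initial):
        self.rat = dict(initial)
        self.checkpoints = []              # (branch_id, RAT copy), oldest first

    def rename_dest(self, arch_reg, preg):
        self.rat[arch_reg] = preg

    def branch(self, branch_id):
        # Copy the register mapping as it is at the time of the branch.
        self.checkpoints.append((branch_id, dict(self.rat)))

    def mispredict(self, branch_id):
        # Recover the RAT from the branch's checkpoint and deallocate that
        # checkpoint together with those of younger (wrong-path) branches.
        index = next(i for i, (b, _) in enumerate(self.checkpoints) if b == branch_id)
        self.rat = self.checkpoints[index][1]
        del self.checkpoints[index:]


rat = CheckpointedRAT({"R1": "T1", "R2": "T2"})
rat.branch("br0")
rat.rename_dest("R1", "T10")     # renamed after br0
rat.branch("br1")
rat.rename_dest("R2", "T11")     # renamed after br1
rat.mispredict("br0")            # br0 was mispredicted
```

After the misprediction of `br0`, the mapping is back to what it was when `br0` was renamed, and the checkpoint of the younger branch `br1` is gone as well. Compared with stall and drain, renaming can resume immediately, at the cost of storing one RAT copy per in-flight branch.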
Modern Instruction Scheduler
• At dispatch, instructions read all available operands from the register files and store a copy in the scheduler (Tomasulo’s algorithm)
• Unavailable operands will be “captured” from the functional unit outputs (CDB broadcast)
• When ready, instructions can issue directly from the scheduler without reading additional operands from any other register files (Wakeup and select)
[Figure: Fetch & Dispatch reads the ARF and the PRF/ROB and feeds the Instruction Scheduler, which issues to the Functional Units; results return through the bypass and the physical register update path.]
Adapted from Prof. G. Loh’s Slide
Instruction Scheduling: Wakeup and Select
• Wakeup Logic
– Notifies waiting instructions when the data dependencies of their input operands are resolved
– Wakes up instructions with zero outstanding input dependencies
• Select Logic
– Chooses and fires ready instructions
– Deals with structural hazards
• Wakeup-select is likely on the critical path
– Associative match
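The wakeup step amounts to an associative match of a broadcast tag against all waiting source tags. Below is a minimal sketch; `RSEntry`, `wakeup` and the tag values are illustrative assumptions, and the loop stands in for comparators that operate in parallel in hardware.

```python
class RSEntry:
    def __init__(self, name, src_tags, dest_tag):
        self.name = name
        self.waiting = set(src_tags)   # source tags not yet produced
        self.dest_tag = dest_tag

    def ready(self):
        return not self.waiting        # zero outstanding input dependencies


def wakeup(entries, broadcast_tag):
    """Compare the broadcast tag against every waiting source tag."""
    for entry in entries:
        entry.waiting.discard(broadcast_tag)
    return [e.name for e in entries if e.ready()]


entries = [RSEntry("i0", ["T39", "T6"], "T8"),
           RSEntry("i1", ["T39"], "T42")]
after_t39 = wakeup(entries, "T39")   # i1 becomes ready, i0 still waits for T6
after_t6 = wakeup(entries, "T6")     # now i0 is ready too
```

Every broadcast touches every entry, which is why the match is associative and why wakeup tends to sit on the critical path.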
Scalar Scheduler (Issue Width = 1)
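[Figure: four RS entries with source-tag pairs (T14, T16), (T39, T6), (T17, T39) and (T15, T39). Each source tag has one comparator (=) on the Tag Broadcast Bus. The entries are also labeled T39, T8, T17 and T42 and connect to the Select Logic, which feeds the Execute Logic and drives the Tag Broadcast Bus.]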
From Prof. G. Loh’s Slide
Superscalar Scheduler (Issue Width = 4)
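[Figure: snapshot of the RS (only 4 entries shown) with source-tag pairs (T14, T16), (T39, T6), (T17, T39) and (T15, T39). Each source tag now has four comparators (====), one per lane of the Tag Broadcast Bus [3..0]. The entries are also labeled T39, T8, T17 and T42 and connect to the Select Logic, which feeds the Execute Logic.]
Adapted from Prof. G. Loh’s Slide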
Selection Logic
• Select ready instructions to be issued
• Goal: to reduce the height of the data-flow graph (DFG)
• Methods
– Location-based (e.g., leftmost ready first)
• Allows simple, faster hardware
– Oldest ready first
• Can use location-based (in-order issue) with “compaction”
• Can be slow and complex
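The two selection methods differ only in how the winner among the ready entries is chosen. The functions below are a hedged sketch: the names, the `requests` bit list and the explicit `ages` list are assumptions made for the example (a compacting queue would encode age by position instead).

```python
def select_location_based(requests):
    """requests[i] is True if RS slot i is ready; grant the leftmost ready slot."""
    for slot, ready in enumerate(requests):
        if ready:
            return slot
    return None


def select_oldest_first(requests, ages):
    """Grant the ready slot with the smallest age stamp (oldest instruction)."""
    ready_slots = [slot for slot, ready in enumerate(requests) if ready]
    if not ready_slots:
        return None
    return min(ready_slots, key=lambda slot: ages[slot])


requests = [False, True, False, True]
ages = [7, 9, 3, 5]                      # smaller number = older instruction
leftmost = select_location_based(requests)
oldest = select_oldest_first(requests, ages)
```

With these inputs the location-based policy grants slot 1, while oldest-first grants slot 3, whose instruction is older than the one in slot 1.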
Simple Select Logic Implementation
[Palacharla ISCA’97]
[Figure: tree-like arbitrated selection logic on top of the Reservation Station. Each arbiter cell has four request inputs (Req0 to Req3), four grant outputs (Grant0 to Grant3) and the signals Enable and AnyQueue that connect it to the next level of the tree; the enable of the root cell is tied to 1. Inside a cell, a priority decoder turns Req0 to Req3 into Grt0 to Grt3. The remaining slides animate a request travelling up the tree and the grant travelling back down.]
Issuing to Distinct Functional Units
[Figure: two separate Reservation Stations]
• Distributed instruction windows (e.g., MIPS R10000 or Alpha 21264)
• Faster to have separate instruction schedulers for different instruction types
Dual Issues to Multiple Units (e.g., 2 Adders)
[Palacharla Dissertation]
[Figure: two select blocks, each with request inputs Req0 to Req3 and grant outputs Grant0 to Grant3.]
Memory Disambiguation
• Can we “undo” stores?
• Stores cannot be committed to memory until
they are marked ready to retire
• Completed stores are queued and waiting in
a store queue or store buffer
• Disambiguate (and resolve) memory
dependency dynamically
Memory Ordering
• A load to X bypassing an older load to X violates certain memory consistency models (e.g., sequential consistency)
• Load-load order trap replays
Source: Alpha 21264 HRM
Lec8 Computer Architecture by Hsien-Hsin Sean Lee Georgia Tech -- Dynamic Scheduling part 2
Load Store Queue (LSQ)
• Memory instructions are allocated into LSQ in program order
• LSQ manages memory reference ordering
• Unified LSQ vs. Split LSQ
• Sandy Bridge: 64 Load buffers, 36 Store buffers
[Figure: split LSQ. ALLOC allocates entries into the RS, the ROB and the age-ordered Store Queue and Load Queue.]
Issuing a Load for Execution
[Figure: Load Queue (columns: age, address, Issued?) and Store Queue (columns: age, address, data, Issued?); one store address is still unknown (???). A load to address C is issued to memory for execution.]
• Each load checks against older stores
– Associative search
– A performance issue of scalability
Issuing a Load for Execution
[Figure: Store Queue and Load Queue. The load to address C matches a store to C in the Store Queue and receives its data by store-to-load forwarding.]
• Implementation dependent: comprehensive size matching can be prohibitively expensive
• Simple method: forward when a larger store (word) precedes a smaller load (half)
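The check a load performs against older stores can be sketched like this. The function `issue_load`, the dictionary fields and the use of `age` as a program-order sequence number are assumptions for illustration; access sizes are ignored, and the sketch takes the conservative choice of waiting whenever an older store address is unknown.

```python
def issue_load(load_age, load_addr, store_queue):
    """store_queue: list of dicts with 'age', 'addr' (None if unknown), 'data'.

    Returns ("forward", data), ("memory", None) or ("wait", None).
    """
    older = [s for s in store_queue if s["age"] < load_age]
    # Conservative policy (an assumption): wait while any older store
    # address is still unknown.
    if any(s["addr"] is None for s in older):
        return ("wait", None)
    matches = [s for s in older if s["addr"] == load_addr]
    if matches:
        youngest = max(matches, key=lambda s: s["age"])   # closest older store
        return ("forward", youngest["data"])
    return ("memory", None)


store_queue = [{"age": 1, "addr": "A", "data": 0x00000001},
               {"age": 2, "addr": "C", "data": 0x12340000},
               {"age": 4, "addr": None, "data": None}]
hit = issue_load(3, "C", store_queue)      # older store to C: forward its data
miss = issue_load(3, "D", store_queue)     # no older store to D: go to memory
blocked = issue_load(5, "C", store_queue)  # store with age 4 has no address yet
```

The list comprehension over all older stores corresponds to the associative search, which is the part that scales poorly as the store queue grows.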
Issuing a Load for Execution
[Figure: Store Queue and Load Queue. A load to address K (age 3) is speculatively issued for execution while a store address in the Store Queue is still unknown (???).]
• Can speculatively issue loads to shorten latency (Alpha 21264, Pentium 4 (Prescott))
– Naively
– Using a memory dependency predictor
• A store, when its address is ready, checks newer loads in the Load Queue
• “Replay” is needed if the speculation turns out to be incorrect (e.g., Alpha’s store-load replay)
Store Checks Premature Loads
[Figure: Store Queue and Load Queue. The store whose address was unknown resolves to address K, and the newer load to K (age 3) has already issued. Conflict detected! Replay the load.]
• A store, when its address is ready, checks newer loads in the Load Queue
– Associative search
• “Replay” is needed if the speculation turns out to be incorrect (e.g., Alpha’s store-load replay)
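The store-side check is the mirror image of the load-side search. A minimal sketch, assuming a hypothetical helper `loads_to_replay` and treating `age` as a program-order sequence number shared by loads and stores:

```python
def loads_to_replay(store_age, store_addr, load_queue):
    """Return the ages of loads that issued too early.

    A load must be replayed if it is younger than the store, has already
    issued, and reads the address the store has just resolved.
    """
    return [ld["age"] for ld in load_queue
            if ld["age"] > store_age and ld["issued"] and ld["addr"] == store_addr]


load_queue = [{"age": 1, "addr": "K", "issued": True},    # older than the store
              {"age": 3, "addr": "K", "issued": True},    # premature load
              {"age": 4, "addr": "P", "issued": True},    # different address
              {"age": 5, "addr": "K", "issued": False}]   # has not issued yet
replay = loads_to_replay(2, "K", load_queue)
```

Only the load with age 3 is replayed: it is the one entry that is younger than the store, already issued, and aimed at the same address.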
Issuing a Store for Execution
[Figure: Store Queue (columns: age, address, data, Issued?) and Load Queue (columns: age, address, Issued?); one entry is annotated “Issued to memory”.]
• The figure shows the basic concept
• Implementation dependent
– Do not allow a store to bypass a load, since store bypassing has little impact on performance
– Perform an associative search
Issuing a Store for Execution
[Figure: the same Store Queue and Load Queue; the entry with age 6 and address K is annotated “cannot issue for execution”.]
Load-Load Ordering
• Needed for
– Multiprocessor support
– Maintaining the memory consistency model
• Load-load trap invoked
– Trap on the later, conflicting instructions
– Replay
[Figure: Load Queue (columns: age, address, Issued?). The load to address A with age 4 has not issued, while the later load to A with age 6 has already issued: load-load trap.]


  • 1. ECE 4100/6100 Advanced Computer Architecture Lecture 8 Dynamic Scheduling (II) Prof. Hsien-Hsin Sean Lee School of Electrical and Computer Engineering Georgia Institute of Technology
  • 2. Modern Processors • Branch Prediction results in speculative execution • Speculative instructions (if wrongly speculated) must not alter the architecture states – Architecture Registers – Memory • Requirement of precise exception/interrupts
  • 3. Modern Out-of-Order Core ALLOC RAT RS ARFROB Register Alias Table renames architecture registers Allocate instructions Reorder Buffer maintains state information (physical registers) for precise interrupts and speculative execution Reservation Station issues instructions to functional units Architectural register file LSQ Load Store Queue maintains memory access ordering
  • 4. Register Renaming R0 Architected Registers R1 R2 R3 R4 R5 R6 R7 T0 T2 T4 T6 T8 T10 T12 T14 T16 T18 T20 T22 Tn-2 T1 T3 T5 T7 T9 T11 T13 T15 T17 T19 T21 T23 Tn-1 Physical Registers R2 = R1+R3 R4 = R2 - R6 … R2 = R7 / R5 BEQ R2, #1 … R2 = R4 * R1 R6 = Load [R2] Original Code Renamed Code T1 = R1+R3 R4 = T1 - R6 … T20 = R7 / R5 BEQ T20, #1 … T7 = R4 * R1 R6 = Load [T7] WAW WAR No False Dependencies! Adapted from Prof. G. Loh’s Slides Sandy Bridge: 160 PRs for INT 144 PRs for FP
  • 5. Register Renaming Dest = Src1 op Src2 Mapping Mechanism TagS1 op TagS2 Src1  TagS1 Src2  TagS2 Unmapped Physical Registers TagD TagD = Dest  TagD Repeat for each instruction Adapted from Prof. G. Loh’s Slides
  • 6. Register Alias Table (RAT) • Use a lookup table for renaming • One entry per architectural register • Each entry maps to the most recent version of the architectural register, could be in – Physical register file – Architectural register file ROB (40 entries)ROB (40 entries) RRFRRF DataData StatusStatus EBXEBX ECXECX EDXEDX ESIESI EDIEDI EAXEAX ESPESP EBPEBP RATRAT P6 Style Register RenamingP6 Style Register Renaming (So does HP-PA8000, PPC604)(So does HP-PA8000, PPC604)
  • 7. RAT Example R1 = R2 + R3 R0 - R1 - R2 - R3 - R4 - R5 - R6 - R7 - T13, T14, T15, T16 Free PRegs T13 = R2 + R3 - 13 - - - - - - T14, T15, T16R5 = R4 – R1 T14 = R4 – T13 - 13 - - - 14 - -R1 = R1 * R5 T15, T16 T15 = T13 * T14 - 15 - - - 14 - -R2 = R5 / R1 T16 T16 = T14 / T15 - 15 16 - - 14 - - Adapted from Prof. G. Loh’s Slides
  • 8. Superscalar Rename R1 = R2 + R3 R4 = R5 – R7 R3 = R0 / R2 R5 = Ld 12[R6] RAT T16 T23 T39 T7 T14 T16 T5 X Don’t rename immediates T10 T31 T19 T6 Fromfree registerpool For N-wide superscalar: 2N RAT read-ports N RAT write-ports
  • 9. Intra-Group Dependencies R2 = R2 + R3 R4 = R5 – R7 R3 = R0 / R2 R5 = Ld 12[R6] RAT T16 T23 T39 T7 T14 T16 T5 X T10 T31 T19 T6 Fromfree registerpool This is the wrong version of R2 Should be using this version of R2
  • 10. Intra-Group Dependencies R1 = R2 + R1 R2 = R1 – R2 R1 = R2 / R1 R1 = R2 >> R1 RAT T16 T34 T34 T16 T16 T34 T16 T34 T16 T34 T10 T16 T31 T10 T31 T19 Result of sequential renaming T10 T31 T19 T6 Fromfree registerpool Correct final renamed registers
  • 11. Resolving Intra-Group Dependencies RAT From free register pool Intra-Group Dependency Checker Inst 0 Inst 1 Inst 2 Inst 3 Src L Src R Dest T0L T1L T2L T3L T0R T1R T2R T3R Pdst0 Pdst1 Pdst2 Adapted from Prof. G. Loh’s Slides
  • 12. Intra-Group Dependency Checking Pdst0 Pdst1 Pdst2 dst0 src1L =R1L T1L 0 1 src1R R1R = T1R R2L src2L = T2L = dst1 src2R = T2R R2R = dst2 src3L = T3L = R3L = = T3R = = R3R src3R Pdst3 src0L src0R dst3 Adapted from Prof. G. Loh’s Slides
  • 13. Mapping Selection: R1 = R2 + R1; R2 = R1 – R2; R1 = R2 / R1; R1 = R2 >> R1. Only the mapping of the last instruction (R1 = R2 >> R1) should be written into the RAT for R1. Condition: use a mapping if the instruction is the last writer to the register: use pdst0 if dst0 != dst1, dst0 != dst2 and dst0 != dst3; use pdst1 if dst1 != dst2 and dst1 != dst3; use pdst2 if dst2 != dst3; always use pdst3. Implemented with a priority encoder. Adapted from Prof. G. Loh’s Slides
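The three steps of group renaming (parallel RAT read, intra-group dependency check, last-writer mapping selection) can be sketched in Python as follows. This is only a behavioral model under assumptions: the name `rename_group`, the dict-based RAT, and the starting mapping R1 → T34, R2 → T16 are illustrative; real hardware does the comparisons with parallel comparators and muxes rather than loops.

```python
def rename_group(group, rat, free_tags):
    """Rename a group of (dest, srcL, srcR) instructions as one rename stage.

    rat maps architectural register names to tags; free_tags supplies one
    destination tag per instruction, in program order.
    """
    pdsts = free_tags[:len(group)]
    # Step 1: every source reads the RAT as it was BEFORE the group.
    rat_reads = [(rat[sl], rat[sr]) for _, sl, sr in group]
    renamed = []
    for i, (dest, sl, sr) in enumerate(group):
        tl, tr = rat_reads[i]
        # Step 2: compare sources with the dests of older instructions in
        # the group; iterating oldest to youngest lets the youngest match win.
        for j in range(i):
            if group[j][0] == sl:
                tl = pdsts[j]
            if group[j][0] == sr:
                tr = pdsts[j]
        renamed.append((pdsts[i], tl, tr))
    # Step 3: only the last writer of each register updates the RAT.
    new_rat = dict(rat)
    for i, (dest, _, _) in enumerate(group):
        if all(group[k][0] != dest for k in range(i + 1, len(group))):
            new_rat[dest] = pdsts[i]
    return renamed, new_rat

group = [("R1", "R2", "R1"), ("R2", "R1", "R2"),
         ("R1", "R2", "R1"), ("R1", "R2", "R1")]
renamed, new_rat = rename_group(group, {"R1": "T34", "R2": "T16"},
                                ["T10", "T31", "T19", "T6"])
for row in renamed:
    print(row)
print(new_rat)
```

With these inputs the renamed sources match what sequential renaming of the four instructions would produce, and only T6 is recorded for R1.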
  • 14. Issue with Imprecise Interrupt • add instructions take one cycle • E.g., – Load (left side) induces a “data page fault”; – Add (right side) induces an “instruction page fault” • If out-of-order completion is allowed – r10, r12, (or r2, r4) … will be modified – Wrong values will be used by the re-issued load • Interrupt classes – Program interrupts (exceptions or traps) – External interrupts (asynchronous)
      Left side:    lw r5, 8(r10); add r10, r9, r8; add r12, r10, r7
      Right side:   L1: add r3, r1, r2 (end of non-resident page X, instruction page fault); add r4, r1, r4; add r2, r4, r4 (start of resident page X+1)
  • 15. Precise Interrupts • To reflect a sequential architecture model ⇒ Serially correct (think about a single-issue, non-pipelined processor) • Keep “Precise State” of an execution – All instructions before the interrupted instruction must be completed – The state should appear as if no instruction after the interrupted instruction had been issued – The interrupted PC should be presented to the interrupt handler (restartable) • Similar to branch misprediction handling • Out-of-order execution makes the ordering hard – Undo what comes after an interrupt
  • 16. Why Support Precise Interrupts • Need to maintain a precise state (for recovery) • Software debugging • I/O or timer interrupts • Virtual memory (page fault) • Instruction emulation • Virtual machines
  • 17. Support Precise Interrupt • Buffer results • Can reconstruct the scenario (state) as sequential execution • Restart from saved PC with saved PC state
  • 18. Reorder Buffer (ROB) [Smith & Pleszkun ’85, ’88] • Architecture Register File keeps “In-order state” • Reorder Buffer (ROB) – A circular buffer – Contains all in-flight instructions – Buffers the “Lookahead state” – In-order allocation/deallocation with head/tail pointers • When an exception occurs – Halt instruction issue – Revert to in-order state using RF and discard ROB results • Also used for branch misprediction recovery • Pentium Pro/II/III integrates physical register file within ROB • Pentium 4 decouples ROB and physical register file
  • 19. Reorder Buffer (with physical registers). Figure: a circular buffer whose entries hold the fields V, Data (physical register), Exp event, RegDst, Done?, Spec? and PC; Head points to the oldest instruction, Tail to the next instruction to be allocated. Sandy Bridge: 168-entry ROB
  • 20. Handling Precise Interrupts. ROB entries, Head first (V Spec? Done? PC Exp-event RegDst): 1 0 0 xA000 0000 R1 (R1=R1+10); 1 0 0 xA004 0000 R2 (R2=R2*2); 1 0 0 xA008 0000 FR1 (FR1=FR2/0.0). ARF: R1 = 1, R2 = 2, R3 = 3, R4 = 4
  • 21. Handling Precise Interrupts. R1=R1+10 has retired (V = 0, ARF R1 = 11); Head moves to xA004. ROB: 1 0 0 xA004 0000 R2 (R2=R2*2); 1 0 0 xA008 0000 FR1 (FR1=FR2/0.0); 1 0 0 xA00C 0000 R3 (R3=R3+1)
  • 22. Handling Precise Interrupts. R3=R3+1 completes out of order (Done? = 1, data 4). ROB: 1 0 0 xA004 0000 R2 (R2=R2*2); 1 0 0 xA008 0000 FR1 (FR1=FR2/0.0); 1 0 1 xA00C 0000 R3 (R3=R3+1); 1 0 0 xA010 0000 R4 (R4=R4*2)
  • 23. Handling Precise Interrupts. FR1=FR2/0.0 records an exception event (0010); R4=R4*2 completes (data 8). ROB: 1 0 0 xA004 0000 R2 (R2=R2*2); 1 0 0 xA008 0010 FR1 (FR1=FR2/0.0); 1 0 1 xA00C 0000 R3 (R3=R3+1); 1 0 1 xA010 0000 R4 (R4=R4*2); 1 0 0 xA014 0000 FR4 (FR4=FR4*2.0)
  • 24. Handling Precise Interrupts. R2=R2*2 completes (data 4) and retires (ARF R2 = 4); Head moves to xA008. ROB: 1 0 0 xA008 0010 FR1 (FR1=FR2/0.0); 1 0 1 xA00C 0000 R3 (R3=R3+1); 1 0 1 xA010 0000 R4 (R4=R4*2); 1 0 0 xA014 0000 FR4 (FR4=FR4*2.0)
  • 25. Handling Precise Interrupts. Exception detected at the Head (xA008, FR1=FR2/0.0). Back up “PC” and current RF. The values of R3=R3+1 (4) and R4=R4*2 (8) were not committed into RF (ARF: R1 = 11, R2 = 4, R3 = 3, R4 = 4). Depending on the exception, the process will either abort or execution will be resumed from this excepting instruction
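The walk-through above can be condensed into a small Python model of in-order commit with a precise exception. It is a sketch under assumptions: the class and field names (`ReorderBuffer`, `RobEntry`) are hypothetical, a `deque` stands in for the circular buffer with head/tail pointers, and the Spec? bit is left out.

```python
from collections import deque
from dataclasses import dataclass

@dataclass
class RobEntry:
    pc: int
    dest: str
    value: object = None
    done: bool = False
    exception: bool = False

class ReorderBuffer:
    def __init__(self, arf):
        self.arf = dict(arf)      # in-order (committed) state
        self.entries = deque()    # left end = Head (oldest), right end = Tail

    def allocate(self, pc, dest):
        entry = RobEntry(pc, dest)
        self.entries.append(entry)    # in-order allocation at the Tail
        return entry

    def commit(self):
        """Retire finished Head entries in order.

        Returns the PC of an excepting instruction once it reaches the
        Head (after discarding all lookahead state), otherwise None.
        """
        while self.entries and (self.entries[0].done or self.entries[0].exception):
            head = self.entries[0]
            if head.exception:
                self.entries.clear()  # younger results never reach the ARF
                return head.pc
            self.arf[head.dest] = head.value
            self.entries.popleft()
        return None

rob = ReorderBuffer({"R1": 1, "R2": 2, "R3": 3, "R4": 4})
e1 = rob.allocate(0xA000, "R1")
e2 = rob.allocate(0xA004, "R2")
e3 = rob.allocate(0xA008, "FR1")
e4 = rob.allocate(0xA00C, "R3")
e5 = rob.allocate(0xA010, "R4")

e1.value, e1.done = 11, True
first = rob.commit()              # retires R1 only
e4.value, e4.done = 4, True       # R3 and R4 finish out of order
e5.value, e5.done = 8, True
e3.exception = True               # divide by zero recorded in the entry
second = rob.commit()             # Head (R2) is not done: nothing retires
e2.value, e2.done = 4, True
fault_pc = rob.commit()           # retires R2, then hits the exception
print(hex(fault_pc), rob.arf)
```

The exception is acted on only when the excepting entry reaches the Head, which is what keeps R3 and R4 at their old values in the ARF.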
  • 26. Handling Speculative Execution. ROB entries, Head first (V Spec? Done? PC Exp-event): 1 0 0 xB000 0000 (R1=R1+10, RegDst R1); 1 0 0 xB004 0000 (BEQ R1, R0, L1). ARF: R1 = 1, R2 = 2, R3 = 3, R4 = 4
  • 27. Handling Speculative Execution. BEQ R1, R0, L1 is predicted TAKEN; instructions from the predicted path are allocated with Spec? = 1. ROB: 1 0 0 xB000 0000 (R1=R1+10); 1 0 0 xB004 0000 (BEQ R1, R0, L1); 1 1 1 xC100 0000 (R2=R3 << 2); 1 1 0 xC104 0000 (R1=R2*R3); 1 1 0 xD2AC 0000 (BEQ R3, R0, L1); 1 1 1 xD2B0 0000 (R1=R7+1)
  • 28. Handling Speculative Execution. R1=R1+10 has retired (ARF R1 = 11). BEQ R1, R0, L1 is resolved, actually NOT TAKEN: BEQ misprediction. ROB: 1 0 0 xB004 0000 (BEQ R1, R0, L1); 1 1 1 xC100 0000 (R2=R3 << 2); 1 1 0 xC104 0000 (R1=R2*R3); 1 1 0 xD2AC 0000 (BEQ R3, R0, L1); 1 1 1 xD2B0 0000 (R1=R7+1)
  • 29. Handling Speculative Execution. Retire branch, clear all entries after the mis-speculated branch. ROB: 1 0 0 xB004 0000 (BEQ R1, R0, L1)
  • 30. Handling Speculative Execution. Continue execution from the correct path (fall through in this case). ROB: 1 0 0 xB008 0000 (R2=R5 << 4, RegDst R2)
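The recovery step in this sequence (drop everything younger than the mispredicted branch, then refetch from the correct path) can be sketched as below. The helper name `squash_wrong_path` and the list-of-dicts ROB are assumptions made for illustration only.

```python
def squash_wrong_path(rob, branch_pc, correct_pc):
    """Remove every ROB entry younger than the mispredicted branch.

    rob is a list of entry dicts ordered oldest first. Returns the PC to
    fetch from next and the PCs of the squashed entries.
    """
    idx = next(i for i, e in enumerate(rob) if e["pc"] == branch_pc)
    squashed = [e["pc"] for e in rob[idx + 1:]]
    del rob[idx + 1:]             # wrong-path entries never commit
    return correct_pc, squashed

rob = [
    {"pc": 0xB004, "op": "BEQ R1, R0, L1", "spec": False},
    {"pc": 0xC100, "op": "R2=R3 << 2", "spec": True},
    {"pc": 0xC104, "op": "R1=R2*R3", "spec": True},
    {"pc": 0xD2AC, "op": "BEQ R3, R0, L1", "spec": True},
    {"pc": 0xD2B0, "op": "R1=R7+1", "spec": True},
]
# The branch at xB004 was predicted taken but resolves not taken,
# so fetch resumes at the fall-through address xB008.
next_pc, squashed = squash_wrong_path(rob, 0xB004, correct_pc=0xB008)
print(hex(next_pc), [hex(pc) for pc in squashed])
```

Because speculative results sit only in the ROB, discarding the entries is enough to restore the register state; the RAT needs separate repair.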
  • 31. RAT Recovery • ARF state corresponds to the state prior to the oldest non-committed instruction • As instructions are processed, the RAT corresponds to the register mapping after the most recently renamed instruction • On a branch misprediction, wrong-path instructions are flushed from the machine • The RAT is left with an invalid set of mappings corresponding to the wrong-path instruction state. Adapted from Prof. G. Loh’s Slide
  • 32. Solution: Stall and Drain • Correct-path instructions arrive from fetch, but can’t be renamed because the RAT is wrong • Allow all instructions to execute and commit; ARF corresponds to the last committed instruction • ARF now corresponds to the state right before the next instruction to be renamed (foo) • Reset RAT so that all mappings refer to the ARF • Resume renaming the new correct-path instructions from fetch • Pros: Very simple to implement • Cons: Performance loss due to stalls
  • 33. Another Solution: Checkpointing • At each branch, make a copy of the RAT (register mapping at the time of the branch) • Checkpoint Free Pool • On a misprediction: 1. flush wrong-path instructions 2. deallocate RAT checkpoints 3. recover RAT from checkpoint 4. resume renaming (foo)
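A minimal Python sketch of RAT checkpointing follows. The class `CheckpointedRAT` and its method names are hypothetical, the RAT is a plain dict, and the free-pool checkpoint mentioned on the slide is deliberately left out to keep the example short.

```python
class CheckpointedRAT:
    def __init__(self, mapping):
        self.mapping = dict(mapping)
        self.checkpoints = []     # (branch_id, snapshot) pairs, oldest first

    def rename_dest(self, arch_reg, tag):
        self.mapping[arch_reg] = tag

    def take_checkpoint(self, branch_id):
        # Copy the mapping as it stands when the branch is renamed.
        self.checkpoints.append((branch_id, dict(self.mapping)))

    def recover(self, branch_id):
        """Restore the mapping saved at branch_id; drop it and younger copies."""
        for i, (bid, snapshot) in enumerate(self.checkpoints):
            if bid == branch_id:
                self.mapping = dict(snapshot)
                del self.checkpoints[i:]
                return
        raise KeyError(branch_id)

    def release(self, branch_id):
        """Free the checkpoint of a branch that resolved as predicted."""
        self.checkpoints = [(b, s) for b, s in self.checkpoints if b != branch_id]

rat = CheckpointedRAT({"R1": "T1", "R2": "T2"})
rat.take_checkpoint("br0")
rat.rename_dest("R1", "T7")       # renamed after br0
rat.take_checkpoint("br1")
rat.rename_dest("R2", "T9")       # renamed after br1
rat.recover("br0")                # br0 mispredicted
print(rat.mapping, rat.checkpoints)
```

Compared with stall-and-drain, renaming can resume right after `recover` instead of waiting for older instructions to commit, at the cost of storing one RAT copy per in-flight branch.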
  • 34. Modern Instruction Scheduler • At dispatch, instructions read all available operands from the register files and store a copy in the scheduler (Tomasulo’s algorithm) • Unavailable operands will be “captured” from the functional unit outputs (CDB broadcast) • When ready, instructions can issue directly from the scheduler without reading additional operands from any other register files (Wakeup and select). Figure: Fetch & Dispatch, ARF, PRF/ROB, Instruction Scheduler and Functional Units, with bypass and physical register update paths. Adapted from Prof. G. Loh’s Slide
  • 35. Instruction Scheduling: Wakeup and Select • Wakeup Logic – Notifies waiting instructions that the data dependencies of their input operands are resolved – Wakes up instructions with zero outstanding input dependencies • Select Logic – Chooses and fires ready instructions – Deals with structural hazards • Wakeup-select is likely on the critical path – Associative match
  • 36. Scalar Scheduler (Issue Width = 1). Figure: reservation station entries with source tag pairs (T14, T16), (T39, T6), (T17, T39), (T15, T39); each source tag has one comparator (=) on the Tag Broadcast Bus; tags T39, T8, T17, T42; Select Logic; To Execute Logic. From Prof. G. Loh’s Slide
  • 37. Superscalar Scheduler (Issue Width = 4). Figure: snapshot of RS (only 4 entries shown) with source tag pairs (T14, T16), (T39, T6), (T17, T39), (T15, T39); each source tag has four comparators (= = = =), one per lane of the Tag Broadcast Bus [3..0]; tags T39, T8, T17, T42; Select Logic; To Execute Logic. Adapted from Prof. G. Loh’s Slide
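The wakeup and select steps can be modeled in a few lines of Python. This is a behavioral sketch only: the entry layout (`dest`, `waiting`, `issued`), the tag values, and the leftmost-first policy are assumptions for illustration; hardware performs the tag match with one comparator per source per broadcast lane.

```python
def wakeup(entries, broadcast_tags):
    """Clear every waiting source tag that matches a broadcast tag."""
    tags = set(broadcast_tags)
    for entry in entries:
        entry["waiting"] = entry["waiting"] - tags

def select(entries, issue_width):
    """Grant up to issue_width ready entries, leftmost (lowest index) first."""
    ready = [e for e in entries if not e["waiting"] and not e["issued"]]
    granted = ready[:issue_width]
    for e in granted:
        e["issued"] = True
    return [e["dest"] for e in granted]

entries = [
    {"dest": "T39", "waiting": {"T14", "T16"}, "issued": False},
    {"dest": "T8",  "waiting": {"T39", "T6"},  "issued": False},
    {"dest": "T42", "waiting": {"T15", "T39"}, "issued": False},
]
wakeup(entries, ["T14", "T16"])
first = select(entries, issue_width=1)    # only the T39 producer is ready
wakeup(entries, ["T39", "T6", "T15"])     # its result tag wakes the consumers
second = select(entries, issue_width=1)   # two are ready, leftmost wins
third = select(entries, issue_width=1)
print(first, second, third)
```

Raising `issue_width` to 2 would let both consumers issue in the same cycle, which is why a wider scheduler needs more broadcast lanes and comparators.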
  • 38. Selection Logic • Select ready instructions to be issued • Goal: to reduce the height of DFG • Methods – Location-based (e.g., leftmost ready first) •Allow simple, faster hardware – Oldest ready first •Can use location-based (in-order issue) with “compaction” •Can be slow and complex
  • 39. Simple Select Logic Implementation [Palacharla ISCA’97]. Figure: tree-like arbitrated selection logic over the reservation station; each arbiter cell has inputs Req0 to Req3 and outputs Grant0 to Grant3, plus Enable and AnyQueue signals that connect it to the next level of the tree
  • 40. Simple Select Logic Implementation [Palacharla ISCA’97]. Figure: inside one arbiter cell, a Priority Decoder takes Req0 to Req3 together with Enable and produces Grt0 to Grt3 and AnyQueue
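The tree-like arbitration above can be sketched in Python: each group of requests reports "any request" upward, the parent grants one group, and that group then grants its lowest-index request. The function names and the fixed leftmost-first priority are assumptions for illustration, not the exact circuit from the cited work.

```python
def priority_grant(requests, enable=True):
    """Grant the lowest-index active request, but only when enabled."""
    grants = [False] * len(requests)
    if enable:
        for i, req in enumerate(requests):
            if req:
                grants[i] = True
                break
    return grants

def tree_select(requests, fanin=4):
    """Return the index of the granted request, or None if none is active.

    Requests are split into at most `fanin` groups (fanin >= 2); each group
    raises an any-request signal, the parent arbiter picks one group, and
    the choice is refined recursively inside that group.
    """
    if len(requests) <= fanin:
        grants = priority_grant(requests)
        return grants.index(True) if True in grants else None
    size = -(-len(requests) // fanin)          # ceiling division
    groups = [requests[i:i + size] for i in range(0, len(requests), size)]
    any_req = [any(group) for group in groups] # upward any-request signals
    winner = priority_grant(any_req)           # parent enables one child
    if True not in winner:
        return None
    g = winner.index(True)
    return g * size + tree_select(groups[g], fanin)

reqs = [False] * 16
reqs[5] = reqs[9] = True
print(tree_select(reqs))
```

With 16 entries and 4-input cells the decision takes two levels, so delay grows with the logarithm of the window size rather than linearly.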
  • 43. Issue to Distinct Functional Units • Distributed instruction windows, one reservation station per group of functional units (e.g., MIPS R10000 or Alpha 21264) • Faster to have separate instruction schedulers for different instruction types
  • 44. Dual Issue to Multiple Units (e.g., 2 Adders) [Palacharla Dissertation]. Figure: two 4-input arbiters, each with Req0 to Req3 and Grant0 to Grant3
  • 45. Memory Disambiguation • Can we “undo” stores? • Stores cannot be committed to memory until they are marked ready to retire • Completed stores are queued and waiting in a store queue or store buffer • Disambiguate (and resolve) memory dependency dynamically
  • 46. Memory Ordering • A load to X bypassing an older load to X violates certain memory consistency models (e.g., sequential consistency) • Load-load order trap replays. Source: Alpha 21264 HRM
  • 48. Load Store Queue (LSQ) • Memory instructions are allocated into LSQ in program order • LSQ manages memory reference ordering • Unified LSQ vs. Split LSQ • Sandy Bridge: 64 Load buffers, 36 Store buffers Store Queue Load Queue Age-ordered ALLOC RS ROB Split LSQ
  • 49. Issuing a Load for Execution • Each load checks against older stores – Associative search – Scalability is a performance issue
      Load Queue (age, address, issued?):          (1, A, 1), (2, D, 0), (2, C, 0); a load is issued to memory for execution
      Store Queue (age, address, issued?, data):   (1, A, 1, 00000001), (1, B, 1, 12340000), (1, C, 0, FFFF1111), (2, ???, 0, FFFFFF00)
  • 50. Issuing a Load for Execution: Store-to-load forwarding • Implementation dependent: comprehensive size matching can be prohibitively expensive • Simple method: forward when a larger store (word) precedes a smaller load (half)
      Load Queue (age, address, issued?):          (1, A, 1), (2, D, 1), (2, C, 0); the load to C takes its data from the store to C
      Store Queue (age, address, issued?, data):   (1, A, 1, 00000001), (1, B, 1, 12340000), (1, C, 0, FFFF1111), (2, ???, 0, FFFFFF00)
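The store queue search and store-to-load forwarding can be sketched as follows. The sketch makes several assumptions: `issue_load` is a hypothetical helper, ages are unique sequence numbers, accesses are all the same size (so no size matching), and a load conservatively waits when an older store address is still unknown.

```python
def issue_load(load_age, load_addr, store_queue):
    """Decide how a load obtains its data.

    store_queue holds dicts with 'age', 'addr' (None while unknown) and
    'data', in program order. Returns ("forward", data), ("memory", None)
    or ("wait", None).
    """
    older = [s for s in store_queue if s["age"] < load_age]
    for store in reversed(older):          # youngest older store first
        if store["addr"] is None:
            return ("wait", None)          # may alias: do not issue yet
        if store["addr"] == load_addr:
            return ("forward", store["data"])
    return ("memory", None)                # no older store to this address

store_queue = [
    {"age": 1, "addr": "A", "data": 0x00000001},
    {"age": 2, "addr": "B", "data": 0x12340000},
    {"age": 3, "addr": "C", "data": 0xFFFF1111},
    {"age": 5, "addr": None, "data": 0xFFFFFF00},
]
hit = issue_load(4, "C", store_queue)      # forwarded from the store to C
miss = issue_load(4, "D", store_queue)     # safe to read memory
blocked = issue_load(6, "D", store_queue)  # older store address unknown
print(hit, miss, blocked)
```

Scanning from the youngest older store downward makes the load see the most recent matching store, which is the value a sequential execution would return.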
  • 51. Issuing a Load for Execution • Can speculatively issue loads to shorten latency (Alpha 21264, Pentium 4 (Prescott)) – Naively – Use Memory Dependency Predictor • Store, when address ready, checks newer loads in the Load Queue • “Replay” needed if speculation turns out to be incorrect (e.g. Alpha’s store-load replay)
      Load Queue (age, address, issued?):          (1, A, 1), (2, D, 1), (2, C, 1), (3, K, 0); the load to K is speculatively issued for execution
      Store Queue (age, address, issued?, data):   (1, A, 1, 00000001), (1, B, 1, 12340000), (1, C, 0, FFFF1111), (2, ???, 0, FFFFFF00)
  • 52. Store Checks Premature Loads • Store, when address ready, checks newer loads in the Load Queue – Associative search • “Replay” needed if speculation turns out to be incorrect (e.g. Alpha’s store-load replay)
      Load Queue (age, address, issued?):          (1, A, 1), (2, D, 1), (2, C, 1), (3, K, 1), (3, M, 1), (4, P, 1)
      Store Queue (age, address, issued?, data):   (1, A, 1, 00000001), (1, B, 1, 12340000), (1, C, 1, FFFF1111), (2, K, 0, FFFFFF00)
      The store with age 2 resolves its address to K, which the newer load with age 3 has already read: conflict detected! Replay the load
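The check a store performs once its address is known can be sketched like this. The helper name `loads_to_replay` and the use of unique age numbers are assumptions; real designs typically replay the conflicting load together with the instructions that depend on it, which this sketch does not model.

```python
def loads_to_replay(store_age, store_addr, load_queue):
    """Return the ages of newer loads that already issued to store_addr.

    These loads ran ahead of the store and read stale data, so they have
    to be replayed.
    """
    return [ld["age"] for ld in load_queue
            if ld["age"] > store_age
            and ld["issued"]
            and ld["addr"] == store_addr]

load_queue = [
    {"age": 1, "addr": "A", "issued": True},
    {"age": 3, "addr": "D", "issued": True},
    {"age": 4, "addr": "C", "issued": True},
    {"age": 6, "addr": "K", "issued": True},   # issued speculatively
    {"age": 7, "addr": "M", "issued": True},
    {"age": 8, "addr": "K", "issued": False},  # not issued yet: no replay
]
# The store with age 5 finally resolves its address to K.
conflicts = loads_to_replay(5, "K", load_queue)
print(conflicts)
```

A memory dependency predictor aims to keep this list empty most of the time by holding back only the loads that are likely to conflict.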
  • 53. Issuing a Store for Execution • The figure shows the basic concept • Implementation dependent – Do not allow a store to bypass a load, since disallowing it has little impact on performance – Perform associative search. Figure: Store Queue (age, address, issued?, data) and Load Queue (age, address, issued?) snapshot with ages 4 to 6 and store data 11000000, 0F0F0F0F, 00000002; a store is issued to memory
  • 54. Issuing a Store for Execution. Figure: the same Store Queue and Load Queue snapshot (ages 4 to 6; store data 11000000, 0F0F0F0F, 00000002); a store cannot issue for execution
  • 55. Load-Load Ordering • Needed for – Multiprocessor support – Maintaining memory consistency model • Load-load trap invoked – Trap on the later, conflicted instructions – Replay
      Load Queue (age, address, issued?):   (4, A, 0), (5, D, 1), (5, C, 1), (6, A, 1), (6, M, 1), (6, N, 1), (7, K, 0)
      The load to A with age 6 has issued before the older load to A with age 4: load-load trap

Editor's Notes

  • #47: Quick example for load-load violation: initially X = 5; P0 executes R1 = X and then R2 = X; P1 executes X = 0. Under SC, it is not possible to have R1 = 0 and R2 = 5; that outcome can occur only if a load can bypass a load.
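The claim in this note can be checked by brute force. The Python sketch below enumerates every interleaving that keeps each processor's program order (which is what sequential consistency allows) for P0 loading X twice and P1 storing 0, starting from X = 5. The function names and the tuple encoding of operations are assumptions for illustration.

```python
def interleavings(a, b):
    """Yield every merge of sequences a and b that keeps each one's order."""
    if not a:
        yield list(b)
        return
    if not b:
        yield list(a)
        return
    for rest in interleavings(a[1:], b):
        yield [a[0]] + rest
    for rest in interleavings(a, b[1:]):
        yield [b[0]] + rest

def sc_outcomes():
    """Collect all (R1, R2) results that sequential consistency permits."""
    p0 = [("load", "R1"), ("load", "R2")]   # P0: R1 = X; R2 = X
    p1 = [("store", 0)]                     # P1: X = 0
    outcomes = set()
    for order in interleavings(p0, p1):
        x, regs = 5, {}                     # X is initially 5
        for kind, arg in order:
            if kind == "load":
                regs[arg] = x
            else:
                x = arg
        outcomes.add((regs["R1"], regs["R2"]))
    return outcomes

outcomes = sc_outcomes()
print(sorted(outcomes))
```

The pair (0, 5) never appears: seeing it would require the second load to read X before the first one, which is exactly the load-load reordering the trap prevents.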