ECE 4100/6100
Advanced Computer Architecture
Lecture 5 Branch Prediction
Prof. Hsien-Hsin Sean Lee
School of Electrical and Computer Engineering
Georgia Institute of Technology
2
Predict What?
• Direction (1-bit)
– Single direction for unconditional jumps and calls/returns
– Binary for conditional branches
• Target (32-bit or 64-bit addresses)
– Some are easy
• One: Uni-directional jumps
• Two: Fall through (Not Taken) vs. Taken
– Many: Function Pointer or Indirect Jump (e.g. jr r31)
3
Categorizing Branches
Frequency of branch instructions (Source: H&P using Alpha)
                     SPEC2000INT   SPEC2000FP
Call/Return               8%           19%
Jump                     10%            6%
Conditional Branch       82%           75%
4
Branch Misprediction
Pipeline stages (cycles 1 to 20): PC, Next PC, Fetch, Drive, Alloc, Rename, Queue, Schedule, Dispatch, Reg File, Exec, Flags, Br Resolve
• Single issue: the mispredict is detected only when the branch resolves near the end of the pipeline
– Flush the entailed (wrong-path) instructions and refetch
– Fetch the correct path
• 8-issue superscalar processor (worst case): each flushed stage can hold up to 8 wrong-path instructions
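A rough way to put numbers on the diagram: the sketch below bounds the issue slots lost to one misprediction. It assumes the branch resolves in stage `resolve_stage` and that every earlier stage is full of wrong-path work; `wasted_slots` and its parameters are illustrative names, not from the slides.

```c
#include <assert.h>

/* Upper bound on issue slots squashed by one misprediction.
 * Assumption: the branch resolves in pipeline stage `resolve_stage`
 * (1-based) and every earlier stage holds `issue_width` wrong-path
 * instructions. */
unsigned wasted_slots(unsigned issue_width, unsigned resolve_stage)
{
    assert(resolve_stage >= 1);
    return issue_width * (resolve_stage - 1);
}
```

With a 20-stage pipe this gives at most 19 slots for single issue and 152 for the 8-issue worst case.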
9
Why Are Branches Predictable?
for (i=0; i<100; i++) {
    ….
}
        addi r10, r0, 100
        add  r1, r0, r0
L1:
        … …
        … …
        addi r1, r1, 1
        bne  r1, r10, L1
        … …
if (aa==2)
    aa = 0;
if (bb==2)
    bb = 0;
if (aa!=bb)
    ….
        addi r2, r0, 2
        bne  r10, r2, L_bb
        xor  r10, r10, r10
L_bb:
        bne  r11, r2, L_xx
        xor  r11, r11, r11
L_xx:
        beq  r10, r11, L_exit
        …
L_exit:
10
Control Speculation
• Execute instructions beyond a branch before the branch is resolved ⇒ performance
• Speculative execution
• What if mis-speculated? Need:
– Recovery mechanism
– Squash instructions on the incorrect path
• Branch prediction: Dynamic vs. Static
• What to predict?
11
Static Branch Prediction
• Uni-directional: always predict taken (or always not taken)
• Backward taken, forward not taken
– Needs offset information (when is it available?)
• Compiler hints with branch annotation
– When will the info be available? Post-decode?
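The backward-taken, forward-not-taken rule can be sketched as below. This assumes the target address (or the sign of the offset) is already known when the prediction is made, which is exactly the timing question the slide raises; `predict_btfn` is a hypothetical name.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Backward taken / forward not taken: a branch whose target lies at or
 * before its own PC is assumed to close a loop and is predicted taken.
 * Assumption: 4-byte aligned instructions, target known at predict time. */
bool predict_btfn(uint32_t branch_pc, uint32_t target_pc)
{
    assert((branch_pc & 3u) == 0 && (target_pc & 3u) == 0);
    return target_pc <= branch_pc;
}
```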
12
Simplest Dynamic Branch Predictor
• Prediction based on latest outcome
• Index by some bits in the branch PC
– Aliasing
• 1-bit Branch History Table: each entry holds T or NT (e.g. T, NT, T, T, NT, NT, …)
for (i=0; i<100; i++) {
    ….
}
0x40010100          addi r10, r0, 100
0x40010104          add  r1, r0, r0
0x40010108  L1:     … …
…                   … …
0x40010A04          addi r1, r1, 1
0x40010A08          bne  r1, r10, L1
                    … …
• How accurate?
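A minimal sketch of such a 1-bit table follows. The 64-entry size and the choice of PC bits [7:2] as the index are assumptions made for illustration; the slide only says "some bits in the branch PC".

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define BHT_BITS 6                     /* assumed: 64-entry table */
#define BHT_SIZE (1u << BHT_BITS)

static bool bht[BHT_SIZE];             /* one bit per entry: last outcome */

/* Index with the PC bits just above the two alignment bits. */
unsigned bht_index(uint32_t pc)
{
    assert((pc & 3u) == 0);
    return (pc >> 2) & (BHT_SIZE - 1);
}

/* Predict whatever this entry saw last time. */
bool bht_predict(uint32_t pc) { return bht[bht_index(pc)]; }

/* After the branch resolves, remember its actual outcome. */
void bht_update(uint32_t pc, bool taken) { bht[bht_index(pc)] = taken; }
```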
13
Typical Table Organization
• PC (32 bits) ⇒ Hash ⇒ N-bit index
• Table of 2^N entries ⇒ Prediction
• Actual outcome ⇒ FSM Update Logic ⇒ table update
14
Simplest Dynamic Branch Predictor
for (i=0; i<100; i++) {
    if (a[i] == 0) {
        …
    }
    …
}
0x40010100          addi r10, r0, 100
0x40010104          add  r1, r0, r0
0x40010108  L1:     add  r21, r20, r1
0x4001010c          lw   r2, (r21)
0x40010110          beq  r2, r0, L2
                    … …
0x40010210          j    L3
            L2:     … … …
0x40010B0c  L3:     addi r1, r1, 1
0x40010B10          bne  r1, r10, L1
• 1-bit Branch History Table: each entry holds T or NT (e.g. T, NT, T, T, NT, NT, …)
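The addresses on this slide look chosen to show aliasing. Assuming 0x40010110 is the inner beq and 0x40010B10 is the loop-closing bne, and assuming a 64-entry table indexed by PC bits [7:2] (both assumptions, not stated on the slide), the two branches land on the same entry:

```c
#include <assert.h>
#include <stdint.h>

/* Index an assumed 64-entry table with PC bits [7:2]. */
unsigned bht_index64(uint32_t pc)
{
    assert((pc & 3u) == 0);
    return (pc >> 2) & 0x3Fu;
}

/* Two branches alias when they select the same table entry. */
int aliases(uint32_t pc_a, uint32_t pc_b)
{
    return bht_index64(pc_a) == bht_index64(pc_b);
}
```

When two branches share an entry, the data-dependent beq keeps overwriting the history that the highly regular bne would otherwise build up.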
15
FSM of the Simplest Predictor
• A 2-state machine
• Changes its mind fast
• State 00: predict not taken; state 11: predict taken
• If branch taken ⇒ go to 11; if branch not taken ⇒ go to 00
16
Example using 1-bit branch history table
for (i=0; i<4; i++) {
    ….
}
        addi r10, r0, 4
        add  r1, r0, r0
L1:
        … …
        addi r1, r1, 1
        bne  r1, r10, L1
Pred     00  11  11  11  11  00  11  11  11  11  00
Actual   T   T   T   T   NT  T   T   T   T   NT  T
Correct  ✗   √   √   √   ✗   ✗   √   √   √   ✗   ✗
• 60% accuracy (3 correct out of every 5 predictions)
17
2-bit Saturating Up/Down Counter Predictor
• States: 00/SN (Strongly Not Taken), 01/WN (Weakly Not Taken), 10/WT (Weakly Taken), 11/ST (Strongly Taken)
• Taken ⇒ count up (saturate at 11); Not Taken ⇒ count down (saturate at 00)
• Predict not taken in 00 and 01; predict taken in 10 and 11
• MSB: Direction bit
• LSB: Hysteresis bit
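The saturating up/down counter can be sketched as two small functions. The state encoding follows the slide (00 = SN … 11 = ST); the function names are illustrative.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

enum { SN = 0, WN = 1, WT = 2, ST = 3 };   /* 00, 01, 10, 11 */

/* The prediction is the MSB (direction bit) of the 2-bit state. */
bool counter_predict(uint8_t state)
{
    assert(state <= ST);
    return (state >> 1) != 0;
}

/* Count up on taken, down on not taken, saturating at ST and SN. */
uint8_t counter_update(uint8_t state, bool taken)
{
    assert(state <= ST);
    if (taken)
        return state < ST ? (uint8_t)(state + 1) : ST;
    return state > SN ? (uint8_t)(state - 1) : SN;
}
```

A single not-taken outcome in state ST only moves the counter to WT, so the prediction stays "taken": that is the hysteresis the LSB provides.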
18
2-bit Counter Predictor (Another Scheme)
• States: 00/SN (Strongly Not Taken), 01/WN (Weakly Not Taken), 10/WT (Weakly Taken), 11/ST (Strongly Taken)
• Predict not taken in 00 and 01; predict taken in 10 and 11
[State diagram: Taken / Not Taken transitions between SN, WN, WT and ST]
19
Example using 2-bit up/down counter
for (i=0; i<4; i++) {
    ….
}
        addi r10, r0, 4
        add  r1, r0, r0
L1:
        … …
        addi r1, r1, 1
        bne  r1, r10, L1
Pred     01  10  11  11  11  10  11  11  11  11  10
Actual   T   T   T   T   NT  T   T   T   T   NT  T
Correct  ✗   √   √   √   ✗   √   √   √   √   ✗   √
• 80% accuracy (4 correct out of every 5 predictions once warmed up)
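The 60% and 80% figures can be checked with a small simulation. It reads the outcome row of the two examples as four taken branches followed by one not taken, repeated; that reading, the warm-up convention and the function name are assumptions of this sketch.

```c
#include <assert.h>
#include <stdbool.h>

/* Count correct predictions of a saturating counter with `bits` bits
 * (1 or 2) over `loops` repetitions of the pattern T T T T NT.
 * The first repetition is treated as warm-up and not counted, so the
 * starting state does not affect the result. */
unsigned correct_after_warmup(unsigned bits, unsigned loops)
{
    static const bool pattern[5] = { true, true, true, true, false };
    unsigned max, state = 0, correct = 0;

    assert(bits == 1 || bits == 2);
    max = (1u << bits) - 1;                     /* 1 or 3 */
    for (unsigned l = 0; l < loops; l++) {
        for (unsigned i = 0; i < 5; i++) {
            bool predicted = state > max / 2;   /* MSB of the counter */
            if (l > 0 && predicted == pattern[i])
                correct++;
            if (pattern[i] && state < max) state++;
            if (!pattern[i] && state > 0) state--;
        }
    }
    return correct;
}
```

Over 10 counted repetitions (50 branches) the 1-bit scheme gets 30 right and the 2-bit scheme 40, because the 1-bit entry mispredicts twice per loop exit while the 2-bit counter mispredicts once.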
20
Branch Correlation
• Branch direction
– Not independent
– Correlated to the path taken
• Example: on path 1-1, the direction of b3 is known beforehand
• Track path using a 2-bit register
Code Snippet:
if (aa==2) // b1
    aa = 0;
if (bb==2) // b2
    bb = 0;
if (aa!=bb) { // b3
    …….
}
Path (b1-b2; 1 = T, 0 = NT):
A: 1-1   aa=0, bb=0
B: 1-0   aa=0, bb≠2
C: 0-1   aa≠2, bb=0
D: 0-0   aa≠2, bb≠2
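The claim about path 1-1 can be checked by brute force. The sketch below runs the snippet's three conditions (here "1" means the C condition is true, as in the path labels) and verifies that whenever b1 and b2 are both true, b3 is false. Function names and the tested value range are illustrative.

```c
#include <assert.h>
#include <stdbool.h>

/* Evaluate the three conditions of the snippet for given inputs.
 * Stores b1 and b2 and returns b3 (aa != bb after the two updates). */
bool run_snippet(int aa, int bb, bool *b1, bool *b2)
{
    assert(b1 && b2);
    *b1 = (aa == 2);
    if (*b1) aa = 0;
    *b2 = (bb == 2);
    if (*b2) bb = 0;
    return aa != bb;                 /* b3 */
}

/* True if, for all inputs in [lo, hi], path 1-1 implies b3 is false. */
bool path11_fixes_b3(int lo, int hi)
{
    for (int aa = lo; aa <= hi; aa++)
        for (int bb = lo; bb <= hi; bb++) {
            bool b1, b2;
            bool b3 = run_snippet(aa, bb, &b1, &b2);
            if (b1 && b2 && b3)
                return false;
        }
    return true;
}
```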
21
Correlated Branch Predictor [PanSoRahmeh’92]
• (M,N) correlation scheme
– M: shift register size (# bits)
– N: N-bit counter
• 2-bit Sat. Counter Scheme: w bits of the hashed branch PC index one table of 2^w 2-bit counters ⇒ Prediction
• (2,2) Correlation Scheme: the hashed branch PC indexes four tables of 2-bit counters; a 2-bit shift register (global branch history), fed with each subsequent branch direction, selects which table supplies the Prediction
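A (2,2) scheme can be sketched as a PC-indexed array with four 2-bit counters per row, one per global-history value. The row count `W` and the index bits are assumptions for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define W 4                               /* assumed: 2^4 PC-indexed rows */
static uint8_t counters[1u << W][4];      /* 4 = 2^M columns for M = 2 */
static unsigned ghist;                    /* 2-bit global branch history */

static unsigned row(uint32_t pc) { return (pc >> 2) & ((1u << W) - 1); }

/* The global history selects which of the row's counters to use. */
bool corr_predict(uint32_t pc)
{
    return counters[row(pc)][ghist] >= 2; /* MSB of the 2-bit counter */
}

void corr_update(uint32_t pc, bool taken)
{
    uint8_t *c = &counters[row(pc)][ghist];
    assert(*c <= 3);
    if (taken && *c < 3) (*c)++;
    if (!taken && *c > 0) (*c)--;
    ghist = ((ghist << 1) | (taken ? 1u : 0u)) & 3u;  /* shift in outcome */
}
```

After training on a strictly alternating branch, each history value maps to its own counter, so the scheme predicts both directions correctly, which a single 2-bit counter cannot do.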
22
Two-Level Branch Predictor [YehPatt91,92,93]
• Generalized correlated branch predictor
• 1st level keeps branch history in Branch History Register (BHR)
• 2nd level segregates pattern history in Pattern History Table (PHT)
• BHR: N bits (e.g. 1 1 . . . . . 1 0) holding outcomes Rc-k … Rc-1; shifts left when updated with Rc, the actual branch outcome
• The branch history pattern (00…..00, 00…..01, 00…..10, …, 11…..10, 11…..11) indexes the PHT of 2^N entries ⇒ Prediction
• PHT update: FSM Update Logic computes the new entry from the current state and Rc
23
Branch History Register
• An N-bit shift register ⇒ 2^N patterns in PHT
• Shift-in branch outcomes
– 1 ⇒ taken
– 0 ⇒ not taken
• First-in First-Out
• BHR can be
– Global
– Per-set
– Local (Per-address)
24
Pattern History Table
• 2^N entries addressed by N-bit BHR
• Each entry keeps a counter (2-bit or more) for prediction
– Counter update: the same as 2-bit counter
– Can be initialized in alternate patterns (01, 10, 01, 10, ..)
• Alias (or interference) problem
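Putting the BHR and PHT together, a minimal two-level sketch with one global history register and one global table looks like this. The history length of 8 and the function names are assumptions; the alternating 01/10 initialization follows the bullet above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define N 8                               /* assumed history length */
static uint8_t pht[1u << N];              /* 2^N two-bit counters */
static unsigned bhr;                      /* N-bit history, 1 = taken */

/* Initialize entries in the alternating 01, 10, 01, 10, .. pattern. */
void gag_init(void)
{
    for (unsigned i = 0; i < (1u << N); i++)
        pht[i] = (i & 1u) ? 2 : 1;
    bhr = 0;
}

/* First level (BHR) picks the entry; second level (PHT) predicts. */
bool gag_predict(void) { return pht[bhr] >= 2; }

void gag_update(bool taken)
{
    uint8_t *c = &pht[bhr];
    assert(*c <= 3);
    if (taken && *c < 3) (*c)++;
    if (!taken && *c > 0) (*c)--;
    bhr = ((bhr << 1) | (taken ? 1u : 0u)) & ((1u << N) - 1);
}
```

Once trained on a repeating T T T T NT loop, each 8-bit history identifies the position in the loop, so even the loop exit is predicted correctly.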
25
Global History Schemes
• GAg: Global BHR ⇒ Global PHT
• GAs: Global BHR ⇒ Per-set PHTs (SPHTs), selected by SetP(B)
• GAp: Global BHR ⇒ Per-addr PHTs (PPHTs), selected by Addr(B)
** [PanSoRahmeh’92] similar to GAp
• Set can be determined by branch opcode, compiler classification, or branch PC address.
26
GAs Two-Level Branch Prediction
• BHR = 01100110, PC = 0x4001000C
• The 2 LSBs of the PC are insignificant for 32-bit instructions
• PHT index = 00110110 (entries 00000000, 00000001, 00000010, …, 00110110, 00110111, …, 11111101, 11111110, 11111111)
• Selected counter = 10: MSB = 1 ⇒ Predict Taken
27
Predictor Update (Actually, Not Taken)
• Update predictor after branch is resolved
• Wrong prediction: the counter at PHT index 00110110 is decremented from 10 to 01
• BHR shifts in 0: 01100110 ⇒ 11001100
• New PHT index for PC = 0x4001000C: 00111100
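The numbers on these two slides are consistent with an index formed from PC bits [5:2] (selecting the per-set PHT) followed by the low 4 BHR bits. That bit split is inferred from the values shown, not stated by the slides, and the function names are illustrative.

```c
#include <assert.h>
#include <stdint.h>

/* PHT index as read off the slide (assumed split): PC bits [5:2] pick the
 * per-set PHT, the low 4 BHR bits pick the entry inside it. */
unsigned gas_index(uint32_t pc, unsigned bhr)
{
    assert((pc & 3u) == 0);               /* 2 LSBs carry no information */
    return (((pc >> 2) & 0xFu) << 4) | (bhr & 0xFu);
}

/* Shift the resolved outcome into an 8-bit BHR (1 = taken). */
unsigned bhr_shift(unsigned bhr, int taken)
{
    return ((bhr << 1) | (taken ? 1u : 0u)) & 0xFFu;
}
```

For PC = 0x4001000C and BHR = 01100110 this yields index 00110110; after the not-taken outcome the BHR becomes 11001100 and the index 00111100.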
28
Per-Address History Schemes
• PAg: Per-addr BHT (PBHT), indexed by Addr(B) ⇒ Global PHT
• PAs: Per-addr BHT (PBHT), indexed by Addr(B) ⇒ Per-set PHTs (SPHTs), selected by SetP(B)
• PAp: Per-addr BHT (PBHT), indexed by Addr(B) ⇒ Per-addr PHTs (PPHTs), selected by Addr(B)
• Ex: P6, Itanium
• Ex: Alpha 21264’s local predictor
29
PAs Two-Level Branch Predictor
• PC = 1110 0000 1001 1001 0010 1100 1110 1000
• PC bits select one of 8 BHT entries (000 to 111); entry 110 holds history 11010110
• The history selects PHT entry 11010110 (entries 00000000, 00000001, 00000010, …, 11010101, 11010110, …, 11111101, 11111110, 11111111)
• Selected counter = 11: MSB = 1 ⇒ Predict Taken
30
Per-Set History Schemes
• SAg: Per-set BHT (SBHT), indexed by SetH(B) ⇒ Global PHT
• SAs: Per-set BHT (SBHT), indexed by SetH(B) ⇒ Per-set PHTs (SPHTs), selected by SetP(B)
• SAp: Per-set BHT (SBHT), indexed by SetH(B) ⇒ Per-addr PHTs (PPHTs), selected by Addr(B)
31
PHT Indexing
Branch addr   Global history   Gselect 4/4
00000000      00000001         00000001
00000000      00000000         00000000
11111111      00000000         11110000
11111111      10000000         11110000   (insufficient history)
• Tradeoff between more history bits and address bits
• Too many bits needed in Gselect ⇒ sparse table entries
32
Gshare Branch Predictor [McFarling93]
• Tradeoff between more history bits and address bits
• Too many bits needed in Gselect ⇒ sparse table entries
• Gshare ⇒ Not to lose global history bits
• Ex: AMD Athlon, MIPS R12000, Sun MAJC, Broadcom SiByte’s SB-1
Branch addr   Global history   Gselect 4/4   Gshare 8/8
00000000      00000001         00000001      00000001
00000000      00000000         00000000      00000000
11111111      00000000         11110000      11111111
11111111      10000000         11110000      01111111
• Gselect 4/4: Index PHT by concatenating the low-order 4 bits of the branch address and of the global history
• Gshare 8/8: Index PHT by {Branch address ⊕ Global history}
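The two index functions are short enough to write out and check against the four rows of the table. Function names are illustrative; `check_slide_rows` simply replays the table values.

```c
#include <assert.h>
#include <stdint.h>

/* Gselect 4/4: concatenate low 4 address bits with low 4 history bits. */
uint8_t gselect44(uint8_t addr, uint8_t hist)
{
    return (uint8_t)(((addr & 0x0Fu) << 4) | (hist & 0x0Fu));
}

/* Gshare 8/8: XOR all 8 address bits with all 8 history bits. */
uint8_t gshare88(uint8_t addr, uint8_t hist)
{
    return (uint8_t)(addr ^ hist);
}

/* Replay the rows of the table (addr, history, gselect, gshare). */
void check_slide_rows(void)
{
    assert(gselect44(0x00, 0x01) == 0x01 && gshare88(0x00, 0x01) == 0x01);
    assert(gselect44(0x00, 0x00) == 0x00 && gshare88(0x00, 0x00) == 0x00);
    assert(gselect44(0xFF, 0x00) == 0xF0 && gshare88(0xFF, 0x00) == 0xFF);
    assert(gselect44(0xFF, 0x80) == 0xF0 && gshare88(0xFF, 0x80) == 0x7F);
}
```

The last two rows are the point of the comparison: gselect drops the high history bit and maps both to 11110000, while gshare keeps them apart with the same 8-bit index width.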
33
Gshare Branch Predictor
• PC Address ⊕ Global BHR ⇒ index into the PHT
• Selected counter = 00: MSB = 0 ⇒ Predict Not Taken
34
Aliasing Example (4-bit PHT index, entries 0000 to 1111)
• GAp: PHT selected by PC bits 10, index formed by concatenation (||)
– BHR 1101, PC 0110 ⇒ 10 || 01 = 1001
– BHR 1001, PC 1010 ⇒ 10 || 01 = 1001 (same entry: alias)
• Gshare: index formed by XOR
– BHR 1101 XOR PC 0110 = 1011
– BHR 1001 XOR PC 1010 = 0011 (different entries)
35
Hybrid Branch Predictor [McFarling93]
• Some branches correlated to global history, some correlated to local history
• Only update the meta-predictor when the 2 predictors disagree
• Branch PC indexes the Choice (or Meta) Predictor, which selects between predictors P0 and P1 ⇒ Final Prediction
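The selection and the "update only on disagreement" rule can be sketched as follows. The table size, the use of 2-bit choice counters and the convention that a high counter favors P1 are assumptions of this sketch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define CHOICE_BITS 4                         /* assumed table size */
static uint8_t choice[1u << CHOICE_BITS];     /* 2-bit: >= 2 favors P1 */

static unsigned cidx(uint32_t pc)
{
    return (pc >> 2) & ((1u << CHOICE_BITS) - 1);
}

/* Final prediction: the meta-predictor picks P0's or P1's answer. */
bool hybrid_predict(uint32_t pc, bool p0, bool p1)
{
    return choice[cidx(pc)] >= 2 ? p1 : p0;
}

/* Train the chooser only when the two component predictions disagree. */
void hybrid_update(uint32_t pc, bool p0, bool p1, bool taken)
{
    uint8_t *c = &choice[cidx(pc)];
    assert(*c <= 3);
    if (p0 == p1)
        return;                               /* nothing to learn */
    if (p1 == taken) { if (*c < 3) (*c)++; }
    else             { if (*c > 0) (*c)--; }
}
```

When both components agree, the outcome says nothing about which one is better for this branch, so the chooser is left alone.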
36
Alpha 21264 (EV6) Hybrid Predictor
• Local History Table: 1024 x 10 bits, indexed by PC; its 10-bit history indexes the Single Local Predictor: 1024 x 3 bits ⇒ Local prediction
• Global Predictor: 4096 x 2 bits, indexed by 12 bits of global history ⇒ Global prediction
• Choice Predictor: 4096 x 2 bits, indexed by global history ⇒ Meta prediction
• Meta prediction selects the local or global prediction ⇒ Final Branch Prediction
• For single-cycle prediction: Next Line/set Prediction alongside the L1 I-cache (64KB 2w) & TLB, indexed by virtual address, 4 instr./cycle
• A “tournament branch predictor”
• Multi-predictor scheme w/
– Local predictor (~PAg)
• Self-correlation
– Global predictor
• Inter-correlation
– Choice predictor as the decision maker: a 2-bit sat. counter to credit either local or global predictors.
• Die size impact
– History info tables ~2%
– BTB ~ 2.7% (associated with I-$ on a per-line basis)
• 2 cycle latency, we will discuss more later
37
Alpha EV8 Branch Predictor
[Diagram: Branch PC and Global history are hashed by F1, F2, F3, F4 to index the Bimodal, G0, G1 and Meta tables; Bimodal, G0 and G1 form the e-gskew predictor with a majority vote ⇒ prediction]
• Real silicon never saw the light of day
• Uses a 2Bc-gskew predictor (one form of enhanced gskew)
– Bimodal predictor used as (1) static biased predictor and (2) part of e-gskew predictor
– Global predictors G0 and G1 are part of e-gskew predictor
– Table sizes: 352Kbits in total (208Kbits for prediction table; 144Kbits for hysteresis table.)
38
Branch Target Prediction
• Try the easy ones first
– Direct jumps
– Call/Return
– Conditional branch (bi-directional)
• Branch Target Buffer (BTB)
• Return Address Stack (RAS)
39
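Branch Target Buffer (BTB)
• Branch PC indexes the BTB; each entry holds a Tag and a Target
• Tags are compared (==) with the Branch PC
• Predicted Branch Direction selects the next fetch address: 1 ⇒ Branch Target from the BTB, 0 ⇒ PC + 4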
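A direct-mapped BTB lookup can be sketched as below. The 16-entry size, storing the full PC as the tag, and the names are simplifying assumptions; the slide's BTB compares several tags in parallel.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define BTB_ENTRIES 16u                  /* assumed: direct-mapped, 16 entries */

struct btb_entry { bool valid; uint32_t tag; uint32_t target; };
static struct btb_entry btb[BTB_ENTRIES];

/* Record the target of a resolved taken branch. */
void btb_insert(uint32_t pc, uint32_t target)
{
    struct btb_entry *e = &btb[(pc >> 2) % BTB_ENTRIES];
    e->valid = true;
    e->tag = pc;                          /* full PC as tag, for simplicity */
    e->target = target;
}

/* Next fetch address: the stored target on a tag hit with a taken
 * prediction, otherwise the fall-through address PC + 4. */
uint32_t next_fetch_pc(uint32_t pc, bool predict_taken)
{
    const struct btb_entry *e = &btb[(pc >> 2) % BTB_ENTRIES];
    assert((pc & 3u) == 0);
    if (e->valid && e->tag == pc && predict_taken)
        return e->target;
    return pc + 4;
}
```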
40
Return Address Stack (RAS)
• Different call sites make the return address hard to predict
– printf() is called by many callers
– The target of the “return” instruction in printf() is a moving target
• A hardware stack (LIFO)
– A call pushes its return address on the stack
– A return takes its prediction from the top of stack (TOS)
41
Return Address Stack
• Does it always work?
– Call depth
– Setjmp/Longjmp
– Speculative call?
[Diagram: Call PC + 4 ⇒ push return address; BTB lookup on the Return PC, with "Return?" selecting the RAS]
• May not know it is a return instruction prior to decoding
– Rely on BTB for speculation
– Fix up once the instruction is recognized as a return
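A fixed-depth RAS can be sketched as a small circular buffer. The depth of 4, the wrap-around on overflow and the "0 means no prediction" convention are assumptions chosen to make the call-depth problem visible.

```c
#include <assert.h>
#include <stdint.h>

#define RAS_DEPTH 4u                     /* assumed depth */
static uint32_t ras[RAS_DEPTH];
static unsigned ras_top;                 /* pushes minus pops so far */

/* On a call: push the address of the instruction after the call. */
void ras_push(uint32_t call_pc)
{
    assert((call_pc & 3u) == 0);
    ras[ras_top % RAS_DEPTH] = call_pc + 4;   /* wraps: oldest entry lost */
    ras_top++;
}

/* On a return: pop the top of stack as the predicted target.
 * Returns 0 when the stack is empty (no prediction available). */
uint32_t ras_pop(void)
{
    if (ras_top == 0)
        return 0;
    ras_top--;
    return ras[ras_top % RAS_DEPTH];
}
```

With five nested calls and a depth of 4, the four innermost returns are predicted correctly, but the outermost return reads an overwritten slot and mispredicts.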
42
Indirect Jump
• Need Target Prediction
– Many (potentially 2^30 for 32-bit machine)
– In reality, not so many
– Similar to predicting values
• Tagless Target Prediction
• Tagged Target Prediction
43
Tagless Target Prediction [ChangHaoPatt’97]
• Branch History Register (BHR): 1 1 . . . . . 1 0
• Hash of Branch PC ⊗ BHR pattern (00…..00, 00…..01, 00…..10, …, 11…..10, 11…..11) indexes the Target Cache (2^N entries) ⇒ Predicted Target Address
• Modify the PHT to be a “Target Cache”
– (indirect jump) ? (from target cache) : (from BTB)
• Alias?
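A tagless target cache can be sketched as a history-indexed array of addresses. The 256-entry size and the use of XOR to combine PC and BHR are assumptions; the slide only shows a hash combining the two.

```c
#include <assert.h>
#include <stdint.h>

#define TC_BITS 8                         /* assumed: 2^8-entry target cache */
static uint32_t target_cache[1u << TC_BITS];

/* Index = PC bits combined with the branch history (XOR assumed). */
static unsigned tc_index(uint32_t pc, unsigned bhr)
{
    return ((pc >> 2) ^ bhr) & ((1u << TC_BITS) - 1);
}

/* No tag check: whatever target sits in the entry is the prediction. */
uint32_t tc_predict(uint32_t pc, unsigned bhr)
{
    return target_cache[tc_index(pc, bhr)];
}

void tc_update(uint32_t pc, unsigned bhr, uint32_t actual_target)
{
    assert((actual_target & 3u) == 0);
    target_cache[tc_index(pc, bhr)] = actual_target;
}
```

The same indirect jump can hold different targets under different histories, but with no tags a different jump that hashes to the same entry silently picks up the wrong target, which is the aliasing the last bullet asks about.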
44
Tagged Target Prediction [ChangHaoPatt’97]
• To reduce aliasing with set-associative target cache
• Use branch PC and/or history for tags
• Hash of Branch PC and BHR (1 1 . . . . . 1 0) gives an n-bit index into the Target Cache (2^n entries per way)
• Tag Array entries are compared (=?) to select the Predicted Target Address
45
Multiple Branch Prediction
• For a really wide machine
– Across several basic blocks
– Need to predict multiple branches per cycle
• How to fetch non-contiguous instructions in one
cycle?
• Prediction accuracy extremely critical (will be
reduced geometrically)
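To make "reduced geometrically" concrete, assume each of the n branches predicted in one cycle is correct independently with the same accuracy p (an idealization, not a figure from the slides). The probability that the whole fetch group is on the correct path is then:

```latex
P_{\text{all correct}} = p^{\,n},
\qquad \text{e.g. } p = 0.95,\ n = 3:\quad 0.95^{3} = 0.857375
```

So three 95%-accurate predictions per cycle already leave only about an 86% chance that the fetched block is entirely useful.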

More Related Content

PDF
Vivado hls勉強会1(基礎編)
PPTX
Introduction to DPDK
PDF
Intel dpdk Tutorial
PDF
Making Linux do Hard Real-time
PDF
LR parsing
PDF
Smalltalkだめ自慢
PDF
Intel DPDK Step by Step instructions
PDF
CUDAのアセンブリ言語基礎のまとめ PTXとSASSの概説
Vivado hls勉強会1(基礎編)
Introduction to DPDK
Intel dpdk Tutorial
Making Linux do Hard Real-time
LR parsing
Smalltalkだめ自慢
Intel DPDK Step by Step instructions
CUDAのアセンブリ言語基礎のまとめ PTXとSASSの概説

What's hot (20)

PDF
CMake - Introduction and best practices
PDF
深層学習向け計算機クラスター MN-3
PDF
RISC-V & SoC Architectural Exploration for AI and ML Accelerators
PPT
Javaバイトコード入門
PDF
2値化CNN on FPGAでGPUとガチンコバトル(公開版)
PPTX
MySQLの運用でありがちなこと
PPTX
Understanding DPDK
PDF
DPDK In Depth
PDF
ELFの動的リンク
PDF
Deflate
PPTX
Deep Learningのための専用プロセッサ「MN-Core」の開発と活用(2022/10/19東大大学院「 融合情報学特別講義Ⅲ」)
PDF
OSC2011 Tokyo/Fall 濃いバナ(virtio)
PDF
C++の話(本当にあった怖い話)
PDF
スペクトラル・クラスタリング
PDF
「FPGA 開発入門:FPGA を用いたエッジ AI の高速化手法を学ぶ」
PDF
論文紹介:The wavelet matrix
PDF
条件分岐とcmovとmaxps
PPTX
Gohan
PPTX
LLVM Instruction Selection
PDF
MCC CTF講習会 pwn編
CMake - Introduction and best practices
深層学習向け計算機クラスター MN-3
RISC-V & SoC Architectural Exploration for AI and ML Accelerators
Javaバイトコード入門
2値化CNN on FPGAでGPUとガチンコバトル(公開版)
MySQLの運用でありがちなこと
Understanding DPDK
DPDK In Depth
ELFの動的リンク
Deflate
Deep Learningのための専用プロセッサ「MN-Core」の開発と活用(2022/10/19東大大学院「 融合情報学特別講義Ⅲ」)
OSC2011 Tokyo/Fall 濃いバナ(virtio)
C++の話(本当にあった怖い話)
スペクトラル・クラスタリング
「FPGA 開発入門:FPGA を用いたエッジ AI の高速化手法を学ぶ」
論文紹介:The wavelet matrix
条件分岐とcmovとmaxps
Gohan
LLVM Instruction Selection
MCC CTF講習会 pwn編
Ad

Viewers also liked (20)

PPT
Lec1 Computer Architecture by Hsien-Hsin Sean Lee Georgia Tech -- Pipelining
PPT
Lec6 Computer Architecture by Hsien-Hsin Sean Lee Georgia Tech -- Instruction...
PPT
Lec4 Computer Architecture by Hsien-Hsin Sean Lee Georgia Tech -- ISA
PPT
Lec7 Computer Architecture by Hsien-Hsin Sean Lee Georgia Tech -- Dynamic Sch...
PPT
Lec20 Intro to Computer Engineering by Hsien-Hsin Sean Lee Georgia Tech -- Da...
PPT
Lec6 Intro to Computer Engineering by Hsien-Hsin Sean Lee Georgia Tech -- Can...
PPT
Lec14 Intro to Computer Engineering by Hsien-Hsin Sean Lee Georgia Tech -- Se...
PPT
Lec1 Intro to Computer Engineering by Hsien-Hsin Sean Lee Georgia Tech -- Intro
PPT
Lec2 Intro to Computer Engineering by Hsien-Hsin Sean Lee Georgia Tech -- Num...
PPT
Lec3 Computer Architecture by Hsien-Hsin Sean Lee Georgia Tech -- Performance
PPT
Lec10 Intro to Computer Engineering by Hsien-Hsin Sean Lee Georgia Tech -- Mu...
PPT
Lec18 Intro to Computer Engineering by Hsien-Hsin Sean Lee Georgia Tech -- In...
PPT
Lec19 Intro to Computer Engineering by Hsien-Hsin Sean Lee Georgia Tech -- Pr...
PPT
Lec8 Intro to Computer Engineering by Hsien-Hsin Sean Lee Georgia Tech -- Qui...
PPT
Lec3 Intro to Computer Engineering by Hsien-Hsin Sean Lee Georgia Tech -- CMO...
PPT
Lec9 Intro to Computer Engineering by Hsien-Hsin Sean Lee Georgia Tech -- Com...
PPT
Lec17 Intro to Computer Engineering by Hsien-Hsin Sean Lee Georgia Tech -- Me...
PPT
Lec13 Intro to Computer Engineering by Hsien-Hsin Sean Lee Georgia Tech -- Sh...
PPT
Lec16 Intro to Computer Engineering by Hsien-Hsin Sean Lee Georgia Tech -- Fi...
PPT
Lec12 Intro to Computer Engineering by Hsien-Hsin Sean Lee Georgia Tech -- Ad...
Lec1 Computer Architecture by Hsien-Hsin Sean Lee Georgia Tech -- Pipelining
Lec6 Computer Architecture by Hsien-Hsin Sean Lee Georgia Tech -- Instruction...
Lec4 Computer Architecture by Hsien-Hsin Sean Lee Georgia Tech -- ISA
Lec7 Computer Architecture by Hsien-Hsin Sean Lee Georgia Tech -- Dynamic Sch...
Lec20 Intro to Computer Engineering by Hsien-Hsin Sean Lee Georgia Tech -- Da...
Lec6 Intro to Computer Engineering by Hsien-Hsin Sean Lee Georgia Tech -- Can...
Lec14 Intro to Computer Engineering by Hsien-Hsin Sean Lee Georgia Tech -- Se...
Lec1 Intro to Computer Engineering by Hsien-Hsin Sean Lee Georgia Tech -- Intro
Lec2 Intro to Computer Engineering by Hsien-Hsin Sean Lee Georgia Tech -- Num...
Lec3 Computer Architecture by Hsien-Hsin Sean Lee Georgia Tech -- Performance
Lec10 Intro to Computer Engineering by Hsien-Hsin Sean Lee Georgia Tech -- Mu...
Lec18 Intro to Computer Engineering by Hsien-Hsin Sean Lee Georgia Tech -- In...
Lec19 Intro to Computer Engineering by Hsien-Hsin Sean Lee Georgia Tech -- Pr...
Lec8 Intro to Computer Engineering by Hsien-Hsin Sean Lee Georgia Tech -- Qui...
Lec3 Intro to Computer Engineering by Hsien-Hsin Sean Lee Georgia Tech -- CMO...
Lec9 Intro to Computer Engineering by Hsien-Hsin Sean Lee Georgia Tech -- Com...
Lec17 Intro to Computer Engineering by Hsien-Hsin Sean Lee Georgia Tech -- Me...
Lec13 Intro to Computer Engineering by Hsien-Hsin Sean Lee Georgia Tech -- Sh...
Lec16 Intro to Computer Engineering by Hsien-Hsin Sean Lee Georgia Tech -- Fi...
Lec12 Intro to Computer Engineering by Hsien-Hsin Sean Lee Georgia Tech -- Ad...
Ad

Similar to Lec5 Computer Architecture by Hsien-Hsin Sean Lee Georgia Tech -- Branch Predictor (20)

PPT
ENG241-Week1-NumberSystemsaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
PPT
Computer organiztion2
PDF
Datarepresentation2
PPT
Data representation moris mano ch 03
PPT
Lec8 Computer Architecture by Hsien-Hsin Sean Lee Georgia Tech -- Dynamic Sch...
PPT
LCDF3_Chap_10_P2.pptytttttyyyyyyyyyyyyyy
PPT
DATA REPRESENTATION
PPT
Register transfer and microoperations
PPT
dataformats and data conversionscpu1.ppt
PPT
An introduction to data_representation.ppt
PPT
DATA REPRESENTATIONS and Data codes and formats.ppt
PPTX
PROCESSOR AND CONTROL UNIT
PPTX
Counter Register power point to learn good
PPT
microprocessors
PPTX
Chapter 1 digital design.pptx
PPT
Micro operations
PPT
Digital Logic & Design
PPTX
PROCESSOR AND CONTROL UNIT - unit 3 Architecture
PPT
Kaizen cso002 l1
ENG241-Week1-NumberSystemsaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
Computer organiztion2
Datarepresentation2
Data representation moris mano ch 03
Lec8 Computer Architecture by Hsien-Hsin Sean Lee Georgia Tech -- Dynamic Sch...
LCDF3_Chap_10_P2.pptytttttyyyyyyyyyyyyyy
DATA REPRESENTATION
Register transfer and microoperations
dataformats and data conversionscpu1.ppt
An introduction to data_representation.ppt
DATA REPRESENTATIONS and Data codes and formats.ppt
PROCESSOR AND CONTROL UNIT
Counter Register power point to learn good
microprocessors
Chapter 1 digital design.pptx
Micro operations
Digital Logic & Design
PROCESSOR AND CONTROL UNIT - unit 3 Architecture
Kaizen cso002 l1

More from Hsien-Hsin Sean Lee, Ph.D. (13)

PPT
Lec15 Intro to Computer Engineering by Hsien-Hsin Sean Lee Georgia Tech -- Re...
PPT
Lec11 Intro to Computer Engineering by Hsien-Hsin Sean Lee Georgia Tech -- De...
PPT
Lec7 Intro to Computer Engineering by Hsien-Hsin Sean Lee Georgia Tech -- Kar...
PPT
Lec5 Intro to Computer Engineering by Hsien-Hsin Sean Lee Georgia Tech -- Boo...
PPT
Lec4 Intro to Computer Engineering by Hsien-Hsin Sean Lee Georgia Tech -- CMOS
PPT
Lec15 Computer Architecture by Hsien-Hsin Sean Lee Georgia Tech -- EPIC VLIW
PPT
Lec14 Computer Architecture by Hsien-Hsin Sean Lee Georgia Tech --- Coherence
PPT
Lec13 Computer Architecture by Hsien-Hsin Sean Lee Georgia Tech -- SMP
PPT
Lec13 Computer Architecture by Hsien-Hsin Sean Lee Georgia Tech -- Multicore
PPT
Lec12 Computer Architecture by Hsien-Hsin Sean Lee Georgia Tech -- P6, Netbur...
PPT
Lec11 Computer Architecture by Hsien-Hsin Sean Lee Georgia Tech -- Memory part3
PPT
Lec10 Computer Architecture by Hsien-Hsin Sean Lee Georgia Tech -- Memory part2
PPT
Lec9 Computer Architecture by Hsien-Hsin Sean Lee Georgia Tech -- Memory part 1
Lec15 Intro to Computer Engineering by Hsien-Hsin Sean Lee Georgia Tech -- Re...
Lec11 Intro to Computer Engineering by Hsien-Hsin Sean Lee Georgia Tech -- De...
Lec7 Intro to Computer Engineering by Hsien-Hsin Sean Lee Georgia Tech -- Kar...
Lec5 Intro to Computer Engineering by Hsien-Hsin Sean Lee Georgia Tech -- Boo...
Lec4 Intro to Computer Engineering by Hsien-Hsin Sean Lee Georgia Tech -- CMOS
Lec15 Computer Architecture by Hsien-Hsin Sean Lee Georgia Tech -- EPIC VLIW
Lec14 Computer Architecture by Hsien-Hsin Sean Lee Georgia Tech --- Coherence
Lec13 Computer Architecture by Hsien-Hsin Sean Lee Georgia Tech -- SMP
Lec13 Computer Architecture by Hsien-Hsin Sean Lee Georgia Tech -- Multicore
Lec12 Computer Architecture by Hsien-Hsin Sean Lee Georgia Tech -- P6, Netbur...
Lec11 Computer Architecture by Hsien-Hsin Sean Lee Georgia Tech -- Memory part3
Lec10 Computer Architecture by Hsien-Hsin Sean Lee Georgia Tech -- Memory part2
Lec9 Computer Architecture by Hsien-Hsin Sean Lee Georgia Tech -- Memory part 1

Recently uploaded (20)

PDF
How NGOs Save Costs with Affordable IT Rentals
PPTX
KVL KCL ppt electrical electronics eee tiet
PPTX
code of ethics.pptxdvhwbssssSAssscasascc
PPTX
STEEL- intro-1.pptxhejwjenwnwnenemwmwmwm
PDF
Layer23-Switch.com The Cisco Catalyst 9300 Series is Cisco’s flagship stackab...
PDF
Smarter Security: How Door Access Control Works with Alarms & CCTV
PPTX
Lecture-3-Computer-programming for BS InfoTech
PDF
YKS Chrome Plated Brass Safety Valve Product Catalogue
PPTX
Fundamentals of Computer.pptx Computer BSC
PDF
Colorful Illustrative Digital Education For Children Presentation.pdf
PPTX
Embeded System for Artificial intelligence 2.pptx
PPTX
ATL_Arduino_Complete_Presentation_AI_Visuals.pptx
PPTX
executive branch_no record.pptxsvvsgsggs
PPTX
了解新西兰毕业证(Wintec毕业证书)怀卡托理工学院毕业证存档可查的
PDF
Cableado de Controladores Logicos Programables
PPT
FABRICATION OF MOS FET BJT DEVICES IN NANOMETER
PPTX
PROGRAMMING-QUARTER-2-PYTHON.pptxnsnsndn
PPTX
Nanokeyer nano keyekr kano ketkker nano keyer
PPTX
quadraticequations-111211090004-phpapp02.pptx
PPTX
1.pptxsadafqefeqfeqfeffeqfqeqfeqefqfeqfqeffqe
How NGOs Save Costs with Affordable IT Rentals
KVL KCL ppt electrical electronics eee tiet
code of ethics.pptxdvhwbssssSAssscasascc
STEEL- intro-1.pptxhejwjenwnwnenemwmwmwm
Layer23-Switch.com The Cisco Catalyst 9300 Series is Cisco’s flagship stackab...
Smarter Security: How Door Access Control Works with Alarms & CCTV
Lecture-3-Computer-programming for BS InfoTech
YKS Chrome Plated Brass Safety Valve Product Catalogue
Fundamentals of Computer.pptx Computer BSC
Colorful Illustrative Digital Education For Children Presentation.pdf
Embeded System for Artificial intelligence 2.pptx
ATL_Arduino_Complete_Presentation_AI_Visuals.pptx
executive branch_no record.pptxsvvsgsggs
了解新西兰毕业证(Wintec毕业证书)怀卡托理工学院毕业证存档可查的
Cableado de Controladores Logicos Programables
FABRICATION OF MOS FET BJT DEVICES IN NANOMETER
PROGRAMMING-QUARTER-2-PYTHON.pptxnsnsndn
Nanokeyer nano keyekr kano ketkker nano keyer
quadraticequations-111211090004-phpapp02.pptx
1.pptxsadafqefeqfeqfeffeqfqeqfeqefqfeqfqeffqe

Lec5 Computer Architecture by Hsien-Hsin Sean Lee Georgia Tech -- Branch Predictor

  • 1. ECE 4100/6100 Advanced Computer Architecture Lecture 5 Branch Prediction Prof. Hsien-Hsin Sean Lee School of Electrical and Computer Engineering Georgia Institute of Technology
  • 2. 2 Predict What? • Direction (1-bit) – Single direction for unconditional jumps and calls/returns – Binary for conditional branches • Target (32-bit or 64-bit addresses) – Some are easy • One: Uni-directional jumps • Two: Fall through (Not Taken) vs. Taken – Many: Function Pointer or Indirect Jump (e.g. jr r31)
  • 3. 3 Categorizing Branches 8% 10% 82% 19% 6% 75% 0% 20% 40% 60% 80% 100% Call/Return Jump Conditional Branch Frequency of branch instructions SPEC2000INT SPEC2000FP Source: H&P using Alpha
  • 4. 4 Branch Misprediction PC Next PC Fetch DriveAlloc Rename Queue Schedule Dispatch Reg File ExecFlags Br Resolve 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 Single Issue
  • 5. 5 Branch Misprediction PC Next PC Fetch DriveAlloc Rename Queue Schedule Dispatch Reg File ExecFlags Br Resolve 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 Single Issue Mispredict
  • 6. 6 Branch Misprediction PC Next PC Fetch DriveAlloc Rename Queue Schedule Dispatch Reg File ExecFlags Br Resolve 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 Single Issue (flush entailed instructions and refetch) Mispredict
  • 7. 7 Branch Misprediction PC Next PC Fetch DriveAlloc Rename Queue Schedule Dispatch Reg File ExecFlags Br Resolve 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 Single Issue Fetch the correct path
  • 8. 8 Branch Misprediction PC Next PC Fetch DriveAlloc Rename Queue Schedule Dispatch Reg File ExecFlags Br Resolve 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 Single Issue Mispredict 8-issue Superscalar Processor (Worst case)
  • 9. 9 Why Branch is Predictable? for (i=0; i<100; i++) { …. } addi r10, r0, 100 add r1, r0, r0 L1: … … … … addi r1, r1, 1 bne r1, r10, L1 … … if (aa==2) aa = 0; if (bb==2) bb = 0; if (aa!=bb) …. addi r2, r0, 2 bne r10, r2, L_bb xor r10, r10, r10 L_bb: bne r11, r2, L_xx xor r11, r11, r11 L_xx: beq r10, r11, L_exit … Lexit:
  • 10. 10 Control Speculation • Execute instruction beyond a branch before the branch is resolved  Performance • Speculative execution • What if mis-speculated? need – Recovery mechanism – Squash instructions on the incorrect path • Branch prediction: Dynamic vs. Static • What to predict?
  • 11. 11 Static Branch Prediction • Uni-directional, always predict taken (or not taken) • Backward taken, Forward not taken – Need offset information (when?) • Compiler hints with branch annotation – When the info will be available? Post-decode?
  • 12. 12 Simplest Dynamic Branch Predictor • Prediction based on latest outcome • Index by some bits in the branch PC – Aliasing T NT T T NT NT . . . for (i=0; i<100; i++) { …. } addi r10, r0, 100 addi r1, r1, r0 L1: … … … … addi r1, r1, 1 bne r1, r10, L1 … … 0x40010100 0x40010104 0x40010108 … 0x40010A04 0x40010A08 How accurate? NT T 1-bit Branch History Table
  • 13. 13 Typical Table Organization Hash PC (32 bits) . . . . . 2N entries Prediction N bits FSM Update Logic table update Actual outcome
  • 14. 14 Simplest Dynamic Branch Predictor T NT T T NT NT . . . addi r10, r0, 100 addi r1, r1, r0 L1: add r21, r20, r1 lw r2, (r21) beq r2, r0, L2 … … j L3 L2: … … … L3: addi r1, r1, 1 bne r1, r10, L1 0x40010100 0x40010104 0x40010108 0x4001010c 0x40010110 0x40010210 0x40010B0c 0x40010B10 for (i=0; i<100; i++) { if (a[i] == 0) { … } … } NT T 1-bit Branch History Table
  • 15. 15 FSM of the Simplest Predictor • A 2-state machine • Change mind fast 00 11 If branch not taken If branch taken 00 11 Predict not taken Predict taken
  • 16. 16 Example using 1-bit branch history table for (i=0; i<44; i++) { …. } 00Pred Actual T T √ 11 11 √ T T √ 11 11 addi r10, r0, 4 addi r1, r1, r0 L1: … … addi r1, r1, 1 bne r1, r10, L1 NT 00   T 11  T √ 11 √ T T √ 11 11 NT 00  T 11  60% accuracy
  • 17. 17 2-bit Saturating Up/Down Counter Predictor Not Taken Taken Predict Not taken Predict taken ST: Strongly Taken WT: Weakly Taken WN: Weakly Not Taken SN: Strongly Not Taken 01/ WN 01/ WN 00/ SN 00/ SN 10/ WT 10/ WT 11/ ST 11/ ST MSB: Direction bit LSB: Hysteresis bit
  • 18. 18 2-bit Counter Predictor (Another Scheme) Not Taken Taken Predict Not taken Predict taken ST: Strongly Taken WT: Weakly Taken WN: Weakly Not Taken SN: Strongly Not Taken 01/ WN 01/ WN 00/ SN 00/ SN 11/ ST 11/ ST 10/ WT 10/ WT
  • 19. 19 Example using 2-bit up/down counter for (i=0; i<44; i++) { …. } 0101Pred Actual T T √ 1010 1111 √ T T √ 1111 1111 addi r10, r0, 4 addi r1, r1, r0 L1: … … addi r1, r1, 1 bne r1, r10, L1 NT 1010   T 1111 √ T √ 1111 √ T T √ 1111 1111 NT 1010  T 1111 √ 80% accuracy
  • 20. 20 Branch Correlation • Branch direction – Not independent – Correlated to the path taken • Example: Path 1-1 of b3 can be surely known beforehand • Track path using a 2-bit register if (aa==2) // b1 aa = 0; if (bb==2) // b2 bb = 0; if (aa!=bb) { // b3 ……. } b1 b2 b2 b3 b3 b3 1 (T) 1 1 0 (NT) 0 b3 0 Path: A:1-1 B:1-0 C:0-1 D:0-0 aa=0 bb=0 aa=0 bb≠2 aa≠2 bb=0 aa≠2 bb≠2 Code Snippet
  • 21. 21 Correlated Branch Predictor [PanSoRahmeh’92] • (M,N) correlation scheme – M: shift register size (# bits) – N: N-bit counter 2-bit counter hash . . . . X X Branch PC hash 2-bit counter . . . . 2-bit counter . . . . X X 2-bit counter . . . . 2-bit counter . . . . Prediction Prediction 2-bit shift register (global branch history) select Subsequent branch direction (2,2) Correlation Scheme 2-bit Sat. Counter Scheme 2w w Branch PC
  • 22. 22 Two-Level Branch Predictor [YehPatt91,92,93] • Generalized correlated branch predictor • 1st level keeps branch history in Branch History Register (BHR) • 2nd level segregates pattern history in Pattern History Table (PHT) 1 1 . . . . . 1 0 00…..00 00…..01 00…..10 11…..11 11…..10 Branch History Pattern Pattern History Table (PHT) Prediction Rc-k Rc-1 Rc: Actual Branch Outcome FSM Update Logic Branch History Register (BHR) (Shift left when update) N 2N entries Current StatePHT update
  • 23. 23 Branch History Register • An N-bit Shift Register = 2N patterns in PHT • Shift-in branch outcomes – 1 ⇒ taken – 0 ⇒ not taken • First-in First-Out • BHR can be – Global – Per-set – Local (Per-address)
  • 24. 24 Pattern History Table • 2N entries addressed by N-bit BHR • Each entry keeps a countercounter (2-bit or more) for prediction – Counter update: the same as 2-bit counter – Can be initialized in alternate patterns (01, 10, 01, 10, ..) • Alias (or interference) problem
  • 25. 25 Global History Schemes Global BHR Global PHT GAg Global BHR .. SetP(B) Per-set PHTs (SPHTs) GAs Global BHR .. Addr(B) Per-addr PHTs (PPHTs) GAp ** [PanSoRahmeh’92] similar to GAp . . . . . . . . . . . . . . . . . . . . . Set can be determined by branch opcode, compiler classification, or branch PC address.
  • 26. 26 GAs Two-Level Branch Prediction 01100110 BHR PC = 0x4001000C . . . PHT 00110110 . . 00110110 00110111 11111101 11111110 00000000 00000001 00000010 11111111 10 MSB = 1 Predict Taken The 2 LSBs are insignificant for 32-bit instruction
  • 27. 27 Predictor Update (Actually, Not Taken) 01100110 BHR PC = 0x4001000C . . . PHT 00110110 . . 00110110 00110111 11111101 11111110 00000000 00000001 00000010 11111111 1001 decremented 11001100 00111100 00111100 Wrong Predictio n • Update Predictor after branch is resolved
  • 28. 28 Per-Address History Schemes Global PHT PAg SetP(B) Per-set PHTs (SPHTs) PAs Addr(B) Per-addr PHTs (PPHTs) PAp . . . Addr(B) Per-addr BHT (PBHT) . . . Addr(B) Per-addr BHT (PBHT) . . . Addr(B) Per-set BHT (PBHT) •Ex: P6, Itanium •Ex: Alpha 21264’s local predictor . . . .. . . . . . . . . . .. . . . . . . . . .
  • 29. 29 PAs Two-Level Branch Predictor PC = 1110 0000 1001 1001 0010 1100 1110 1000 000 001 010 011 100 101 110 111 BHT 11010110 . . . PHT . . 11010101 11010110 11111101 11111110 00000000 00000001 00000010 11111111 MSB = 1 Predict Taken 11 110
  • 30. 30 Per-Set History Schemes Global PHT SAg SetP(B) Per-set PHTs (SPHTs) SAs Addr(B) Per-addr PHTs (PPHTs) SAp . . . Per-set BHT (SBHT) . . . SetH(B) Per-set BHT (SBHT) . . . SetH(B) Per-set BHT (SBHT) .. . . . . . . . . . . . . .. . . . . . . . . . SetH(B)
  • 31. 31 PHT Indexing Branch addr Global history Gselect 4/4 00000000 00000001 00000001 00000000 00000000 00000000 11111111 00000000 11110000 11111111 10000000 11110000 Insufficient History • Tradeoff between more history bits and address bits • Too many bits needed in Gselect ⇒ sparse table entries
  • 32. 32 Gshare Branch Predictor [McFarling93] • Tradeoff between more history bits and address bits • Too many bits needed in Gselect ⇒ sparse table entries • Gshare ⇒ Not to lose global history bits • Ex: AMD Athlon, MIPS R12000, Sun MAJC, Broadcom SiByte’s SB-1 • Gselect 4/4: index the PHT by concatenating the low-order 4 bits of the branch address and of the global history • Gshare 8/8: index the PHT by {Branch address ⊕ Global history} • Example rows (branch addr, global history → Gselect 4/4, Gshare 8/8): (00000000, 00000001 → 00000001, 00000001); (00000000, 00000000 → 00000000, 00000000); (11111111, 00000000 → 11110000, 11111111); (11111111, 10000000 → 11110000, 01111111)
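The two index functions on this slide can be checked against its example rows with a short Python sketch. The function names and default bit widths are illustrative assumptions; the four (address, history) pairs are the ones listed on the slide.

```python
def gselect_index(addr, hist, addr_bits=4, hist_bits=4):
    # Concatenate the low-order address bits with the low-order history bits.
    a = addr & ((1 << addr_bits) - 1)
    h = hist & ((1 << hist_bits) - 1)
    return (a << hist_bits) | h


def gshare_index(addr, hist, bits=8):
    # XOR the full-width address and history, so no history bits are dropped.
    return (addr ^ hist) & ((1 << bits) - 1)


rows = [
    (0b00000000, 0b00000001),
    (0b00000000, 0b00000000),
    (0b11111111, 0b00000000),
    (0b11111111, 0b10000000),
]
table = [(format(gselect_index(a, h), "08b"), format(gshare_index(a, h), "08b"))
         for a, h in rows]
```

The last two rows differ only in the top history bit, which Gselect 4/4 discards, so they collide at 11110000 while Gshare 8/8 keeps them apart.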
  • 33. 33 Gshare Branch Predictor • [Figure: PC address ⊕ global BHR forms the PHT index; the selected counter is 00; MSB = 0, so predict Not Taken]
  • 34. 34 Aliasing Example • Gshare: BHR 1101 XOR PC 0110 = 1011; BHR 1001 XOR PC 1010 = 0011 (two different entries of the 16-entry PHT, 0000 to 1111) • GAp, PHT (indexed by 10): BHR 1101 || PC 0110 = 1001; BHR 1001 || PC 1010 = 1001 (both branches alias to the same entry)
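The arithmetic of this example can be reproduced in Python. The sketch assumes that the concatenated index is built from the low 2 bits of the PC followed by the low 2 bits of the BHR, which is one reading consistent with the 1001 shown on the slide; the function names are hypothetical.

```python
def gshare(bhr, pc):
    # 4-bit XOR index.
    return (bhr ^ pc) & 0b1111


def concat(bhr, pc):
    # Assumption: low 2 PC bits pick the table, low 2 BHR bits pick the entry.
    return ((pc & 0b11) << 2) | (bhr & 0b11)


branches = [(0b1101, 0b0110), (0b1001, 0b1010)]   # (BHR, PC) pairs from the slide
gshare_idx = [gshare(b, p) for b, p in branches]
concat_idx = [concat(b, p) for b, p in branches]
aliased = concat_idx[0] == concat_idx[1]
```

With concatenation both branches land on entry 1001, while the XOR spreads them to 1011 and 0011.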
  • 35. 35 Hybrid Branch Predictor [McFarling93] • Some branches are correlated to global history, some to local history • Only update the meta-predictor when the 2 predictors disagree • [Figure: the branch PC indexes the choice (or meta) predictor, which selects between predictors P0 and P1 to produce the final prediction]
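The meta-predictor update rule can be sketched in Python as follows. This is a minimal illustration that assumes a table of 2-bit counters indexed by the PC modulo the table size and a counter MSB of 1 meaning "trust P1"; the component predictors are left abstract and all names are hypothetical.

```python
def bump(counter, up):
    # 2-bit saturating counter step.
    return min(counter + 1, 3) if up else max(counter - 1, 0)


class Chooser:
    def __init__(self, entries=1024):
        self.entries = entries
        self.table = [0b10] * entries   # assumed start: weakly prefer P1

    def select(self, pc, p0_pred, p1_pred):
        use_p1 = (self.table[pc % self.entries] >> 1) == 1
        return p1_pred if use_p1 else p0_pred

    def update(self, pc, p0_pred, p1_pred, taken):
        if p0_pred == p1_pred:
            return   # both right or both wrong: nothing to learn
        i = pc % self.entries
        # Move toward P1 when P1 was right, toward P0 otherwise.
        self.table[i] = bump(self.table[i], up=(p1_pred == taken))


chooser = Chooser()
final = chooser.select(0x40, p0_pred=False, p1_pred=True)
chooser.update(0x40, p0_pred=False, p1_pred=True, taken=False)
final_after = chooser.select(0x40, p0_pred=False, p1_pred=True)
```

Skipping the update when the two components agree keeps the counter from drifting on outcomes that say nothing about which predictor is better for that branch.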
  • 36. 36 Alpha 21264 (EV6) Hybrid Predictor • A “tournament branch predictor” • Multi-predictor scheme w/ – Local predictor (~PAg): self-correlation; local history table 1024 x 10 bits, single local predictor 1024 x 3 bits – Global predictor: inter-correlation; 4096 x 2 bits, indexed by 12 bits of global history – Choice predictor as the decision maker: 4096 x 2 bits; a 2-bit sat. counter to credit either the local or the global predictor • Die size impact – History info tables ~2% – BTB ~2.7% (associated with I-$ on a per-line basis) • 2-cycle latency, we will discuss more later • For single-cycle prediction: next line/set prediction in the L1 I-cache (64KB 2w) & TLB, 4 instr./cycle • [Figure: 10 PC bits index the local history table; the local, global, and meta predictions combine into the final branch prediction]
  • 37. 37 Alpha EV8 Branch Predictor • Real silicon never saw the light of day • Uses a 2Bc-gskew predictor (one form of enhanced gskew) – Bimodal predictor used as (1) a static biased predictor and (2) part of the e-gskew predictor – Global predictors G0 and G1 are part of the e-gskew predictor – Table sizes: 352 Kbits in total (208 Kbits for the prediction table; 144 Kbits for the hysteresis table) • [Figure: branch PC and global history feed hash functions F1 to F4, which index the Bimodal, G0, G1, and Meta tables; a majority vote forms the e-gskew prediction]
  • 38. 38 Branch Target Prediction • Try the easy ones first – Direct jumps – Call/Return – Conditional branch (bi-directional) • Branch Target Buffer (BTB) • Return Address Stack (RAS)
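  • 39. 39 Branch Target Buffer (BTB) • [Figure: the branch PC is compared (==) against the tag of each BTB entry (Tag, Target); a mux controlled by the predicted branch direction selects between PC + 4 (0) and the branch target from the BTB (1)]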
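The lookup in this figure can be sketched in Python. The sketch assumes a direct-mapped BTB for brevity (the figure compares several tags in parallel, which suggests an associative structure), 4-byte instructions, and hypothetical names throughout.

```python
class BTB:
    def __init__(self, index_bits=4):
        self.index_bits = index_bits
        self.entries = [None] * (1 << index_bits)   # each entry: (tag, target)

    def _split(self, pc):
        word = pc >> 2   # ignore the 2 LSBs of an aligned instruction address
        return word & ((1 << self.index_bits) - 1), word >> self.index_bits

    def lookup(self, pc):
        index, tag = self._split(pc)
        entry = self.entries[index]
        if entry is not None and entry[0] == tag:
            return entry[1]
        return None   # miss: no target known for this PC

    def insert(self, pc, target):
        index, tag = self._split(pc)
        self.entries[index] = (tag, target)


def next_fetch_pc(btb, pc, predict_taken):
    # Mux from the figure: BTB target if hit and predicted taken, else PC + 4.
    target = btb.lookup(pc)
    if target is not None and predict_taken:
        return target
    return pc + 4


btb = BTB()
btb.insert(0x1000, 0x2000)
taken_pc = next_fetch_pc(btb, 0x1000, predict_taken=True)
fallthrough_pc = next_fetch_pc(btb, 0x1000, predict_taken=False)
```

The tag check matters because a different branch that maps to the same entry must not reuse a stale target.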
  • 40. 40 Return Address Stack (RAS) • Different call sites make the return address hard to predict – printf() is called by many callers – The target of the “return” instruction in printf() is a moving target • A hardware stack (LIFO) – Call will push the return address on the stack – Return takes its prediction from the top of stack (TOS)
  • 41. 41 Return Address Stack • Does it always work? – Call depth – Setjmp/Longjmp – Speculative call? • May not know it is a return instruction prior to decoding – Rely on the BTB for speculation – Fix once the Return is recognized • [Figure: on a call, the call PC + 4 is pushed as the return address; on a return PC, the BTB indicates whether the instruction is a return]
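A bounded hardware stack and its call-depth limit can be illustrated with a short Python sketch. It assumes 4-byte instructions and a policy of dropping the oldest entry on overflow; the depth, the overflow policy, and the names are assumptions rather than details from the lecture.

```python
class ReturnAddressStack:
    def __init__(self, depth=4):
        self.depth = depth
        self.stack = []

    def on_call(self, call_pc):
        if len(self.stack) == self.depth:
            self.stack.pop(0)   # overflow: the oldest return address is lost
        self.stack.append(call_pc + 4)

    def on_return(self):
        # Predict from the top of stack; None means no prediction available.
        return self.stack.pop() if self.stack else None


ras = ReturnAddressStack(depth=2)
for call_pc in (0x100, 0x200, 0x300):   # three nested calls, only two slots
    ras.on_call(call_pc)
predictions = [ras.on_return() for _ in range(3)]
```

The two innermost returns are predicted correctly, while the outermost one has no entry left, which is the call-depth failure case listed on the slide.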
  • 42. 42 Indirect Jump • Need Target Prediction – Many (potentially 2^30 for 32-bit machine) – In reality, not so many – Similar to predicting values • Tagless Target Prediction • Tagged Target Prediction
  • 43. 43 Tagless Target Prediction [ChangHaoPatt’97] • Modify the PHT to be a “Target Cache” – (indirect jump) ? (from target cache) : (from BTB) • Alias? • [Figure: a hash of the branch PC and the Branch History Register (BHR) pattern indexes a target cache of 2^N entries (00…..00 to 11…..11), which supplies the predicted target address]
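A tagless target cache can be sketched in Python as a plain array of targets. The sketch assumes XOR of the word-aligned PC with the BHR as the hash, which is only one possible choice; the table size and names are hypothetical.

```python
class TaglessTargetCache:
    def __init__(self, n_bits=8):
        self.mask = (1 << n_bits) - 1
        self.targets = [None] * (1 << n_bits)   # 2**n_bits target slots

    def _index(self, pc, bhr):
        # Assumed hash: XOR of the word address and the history.
        return ((pc >> 2) ^ bhr) & self.mask

    def predict(self, pc, bhr):
        return self.targets[self._index(pc, bhr)]

    def update(self, pc, bhr, target):
        self.targets[self._index(pc, bhr)] = target


cache = TaglessTargetCache()
# The same indirect jump reached with two different histories gets two targets.
cache.update(0x100, 0b0001, 0xAAAA)
cache.update(0x100, 0b0010, 0xBBBB)
per_history = (cache.predict(0x100, 0b0001), cache.predict(0x100, 0b0010))
# A different jump whose hash collides silently returns the other jump's target.
aliased_target = cache.predict(0x104, 0b0000)
```

The last lookup returns a target that belongs to another jump, which is the aliasing question the slide raises.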
  • 44. 44 Tagged Target Prediction [ChangHaoPatt’97] • To reduce aliasing with a set-associative target cache • Use branch PC and/or history for tags • [Figure: a hash of the branch PC and the BHR forms an n-bit index into the target cache (2^n entries per way); the tag array is compared (=?) to select the predicted target address]
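Adding tags and associativity can be sketched in Python as below. The sketch assumes the word-aligned PC as the tag, the XOR hash as the set index, two ways, and least-recently-written replacement; all of these are illustrative choices, not details from the lecture.

```python
class TaggedTargetCache:
    def __init__(self, n_bits=4, ways=2):
        self.mask = (1 << n_bits) - 1
        self.ways = ways
        # Each set is a list of (tag, target) pairs, most recently written last.
        self.sets = [[] for _ in range(1 << n_bits)]

    def _index_tag(self, pc, bhr):
        word = pc >> 2
        return (word ^ bhr) & self.mask, word   # assumed tag: the word address

    def predict(self, pc, bhr):
        index, tag = self._index_tag(pc, bhr)
        for t, target in self.sets[index]:
            if t == tag:
                return target
        return None   # tag mismatch: no prediction instead of a wrong one

    def update(self, pc, bhr, target):
        index, tag = self._index_tag(pc, bhr)
        kept = [e for e in self.sets[index] if e[0] != tag]
        kept.append((tag, target))
        self.sets[index] = kept[-self.ways:]   # evict the oldest entry if full


cache = TaggedTargetCache()
cache.update(0x100, 0b0001, 0xAAAA)   # set 1, tag 0x40
cache.update(0x104, 0b0000, 0xBBBB)   # also set 1, tag 0x41
both = (cache.predict(0x100, 0b0001), cache.predict(0x104, 0b0000))
```

Two jumps that collide on the index now coexist in different ways of the same set, and a lookup with a non-matching tag reports a miss rather than another jump's target.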
  • 45. 45 Multiple Branch Prediction • For a really wide machine – Across several basic blocks – Need to predict multiple branches per cycle • How to fetch non-contiguous instructions in one cycle? • Prediction accuracy extremely critical (will be reduced geometrically)
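The geometric reduction in accuracy can be made concrete with a small Python calculation. It assumes each branch is predicted correctly with the same probability and independently of the others, which is a simplification; the function name and the 95% figure are illustrative.

```python
def all_correct_probability(per_branch_accuracy, branches_per_cycle):
    # Every prediction in the fetch group must be right for the group to be useful.
    return per_branch_accuracy ** branches_per_cycle


for n in (1, 2, 3, 4):
    print(n, round(all_correct_probability(0.95, n), 4))
```

Under these assumptions, a predictor that is 95% accurate per branch gets a three-branch fetch group entirely right only about 86% of the time.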