RISC Processors
Chapter 14
S. Dandamudi

2003
To be used with S. Dandamudi, “Fundamentals of Computer Organization and Design,” Springer, 2003.
Outline
• Introduction
• Evolution of CISC processors
• RISC design principles
• PowerPC processor
∗ Architecture
∗ Addressing modes
∗ Instruction set
• Itanium processor
∗ Architecture
∗ Addressing modes
∗ Instruction set
∗ Instruction-level parallelism
∗ Branch handling
∗ Speculative execution

Introduction
• CISC
∗ Complex instruction set
» Pentium is the most popular example
• RISC
∗ Simple instructions
» Reduced complexity
∗ Modern processors use this design philosophy
» PowerPC, MIPS, SPARC, Intel Itanium
– Borrow some features from CISC
∗ No precise definition
» We can identify some common characteristics

Evolution of CISC Designs
• Motivation to efficiently use expensive resources
∗ Processor
∗ Memory
• High density code
∗ Complex instructions
» Hardware complexity is handled by microprogramming
» Microprogramming also helps to
– Reduce the impact of memory access latency
– Offer flexibility
▪ Low-cost members of the same family
∗ Tailored to high-level language constructs

Evolution of CISC Designs (cont’d)
                     CISC                       RISC
                     VAX 11/780    Intel 486    MIPS R4000
# instructions       303           235          94
Addr. modes          22            11           1
Inst. size (bytes)   2-57          1-12         4
GP registers         16            8            32

Evolution of CISC Designs (cont’d)
Example
∗ Autoincrement addressing mode of VAX
» Performs the following actions:
(R2) = (R2) + R3; R2 = R2 + 1
∗ RISC equivalent
R4 = (R2)
R4 = R4 + R3
(R2) = R4
R2 = R2 + 1
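
The equivalence between the single VAX instruction and the four-instruction RISC sequence can be checked with a small simulation. This is only a sketch: the register and memory dictionaries, the starting values, and the function names are assumptions made for illustration, and the VAX behaviour is modelled exactly as written on the slide.

```python
def vax_autoincrement_add(regs, mem):
    """CISC style: one instruction reads memory, adds, writes back, bumps R2."""
    mem[regs["R2"]] = mem[regs["R2"]] + regs["R3"]   # (R2) = (R2) + R3
    regs["R2"] += 1                                  # R2 = R2 + 1


def risc_sequence(regs, mem):
    """RISC style: only the load and the store touch memory."""
    regs["R4"] = mem[regs["R2"]]            # R4 = (R2)
    regs["R4"] = regs["R4"] + regs["R3"]    # R4 = R4 + R3
    mem[regs["R2"]] = regs["R4"]            # (R2) = R4
    regs["R2"] = regs["R2"] + 1             # R2 = R2 + 1


def run(step):
    """Run one version on a fresh (hypothetical) machine state."""
    regs = {"R2": 100, "R3": 7, "R4": 0}
    mem = {100: 35}
    step(regs, mem)
    return regs["R2"], mem[100]
```

Both versions leave R2 and the memory word in the same state; the RISC version needs one extra scratch register (R4) and three more instructions.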

Why RISC?
• Simple instructions are preferred
∗ Complex instructions are mostly ignored by compilers
» Due to semantic gap
• Simple data structures
∗ Complex data structures are used relatively infrequently
∗ Better to support a few simple data types efficiently
» Synthesize complex ones
• Simple addressing modes
∗ Complex addressing modes lead to variable-length instructions
» These lead to inefficient instruction decoding and scheduling

Why RISC? (cont’d)
• Large register set
∗ Efficient support for procedure calls and returns
» Patterson and Sequin’s study
– Procedure call/return: 12−15% of HLL statements
▪ Constitute 31−33% of machine language instructions
▪ Generate nearly half (45%) of memory references
∗ Small activation record
» Tanenbaum’s study
– Only 1.25% of the calls have more than 6 arguments
– More than 93% have fewer than 6 local scalar variables
– Large register set can avoid memory references

RISC Design Principles
• Simple operations
∗ Simple instructions that can execute in one cycle
• Register-to-register operations
∗ Only load and store operations access memory
∗ Rest of the operations on a register-to-register basis
• Simple addressing modes
∗ A few addressing modes (1 or 2)
• Large number of registers
∗ Needed to support register-to-register operations
∗ Minimize the procedure call and return overhead

RISC Design Principles (cont’d)
[Figure: register windows storing activation records]

RISC Design Principles (cont’d)
• Fixed-length instructions
∗ Facilitates efficient instruction execution
• Simple instruction format
∗ Fixed boundaries for various fields
» opcode, source operands,…
• Other features
∗ Tend to use Harvard architecture
∗ Pipelining is visible at the architecture level

PowerPC
• Registers
∗ 32 general-purpose registers (GPR0 – GPR31)
∗ 32 floating-point registers (FPR0 – FPR31)
∗ Condition register (CR)
» Similar to Pentium’s flags register
» Divided into 8 CR fields (4 bits each)
– “less than” (LT), “greater than” (GT), “equal to” (EQ), summary overflow (SO)
– CR1 is for floating-point exceptions
– Other CR fields can be used for integer or FP exceptions
– Branch instructions can test a specific CR field bit

PowerPC (cont’d)
∗ XER register serves two distinct purposes
» Bits 0, 1, and 2 are used to capture
– Summary overflow (SO), overflow (OV), carry (CA)
– OV and CA are similar to Pentium’s overflow and carry
– Once set, SO can be cleared only by a special instruction
» Bits 25 to 31 (7 bits)
– Specify the number of bytes to be transferred between memory and registers
– Used by two instructions
▪ Load string word indexed (lswx)
▪ Store string word indexed (stswx)
▪ Can load/store all 32 registers (GPR0–GPR31)

PowerPC (cont’d)
∗ Link register (LR)
» Used to store the procedure return address
– Stores the effective address of the instruction following the procedure call instruction
– Procedure calls use the branch instructions
▪ Example: b = branch, bl = procedure call
∗ Count register (CTR)
» Maintains loop count value
– Similar to Pentium's ECX register
– Branch instructions can test the value
• 32-bit PowerPC implementations use segmentation like the Pentium

PowerPC (cont’d)
• Addressing modes
∗ Load/store instructions support three addressing modes
» Can use GPRs
∗ Register Indirect
» Effective address = contents of rA, or 0
» Specifying 0 for rA generates address 0
∗ Register Indirect with Immediate Index
» Effective address = (contents of rA, or 0) + imm16
∗ Register Indirect with Index
» Effective address = (contents of rA, or 0) + contents of rB
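
The three effective-address calculations can be sketched as follows. The helper names and the list-based register file are assumptions for illustration; imm16 is treated as a sign-extended 16-bit displacement and addresses are truncated to 32 bits.

```python
MASK32 = 0xFFFFFFFF


def sign_extend16(value):
    """Interpret the low 16 bits of value as a signed number."""
    value &= 0xFFFF
    return value - 0x10000 if value & 0x8000 else value


def base(gpr, ra):
    """An rA field of 0 selects the literal value 0, not the contents of GPR0."""
    return 0 if ra == 0 else gpr[ra]


def ea_register_indirect(gpr, ra):
    return base(gpr, ra) & MASK32


def ea_immediate_index(gpr, ra, imm16):
    return (base(gpr, ra) + sign_extend16(imm16)) & MASK32


def ea_index(gpr, ra, rb):
    return (base(gpr, ra) + gpr[rb]) & MASK32
```

For example, with 0x1000 in GPR3 and 0x20 in GPR4, `ea_index(gpr, 3, 4)` gives 0x1020, while `ea_index(gpr, 0, 4)` gives 0x20 regardless of what GPR0 holds.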

PowerPC (cont’d)
[Figure: instruction format]

PowerPC (cont’d)
• Bits 0-5
∗ Specify primary opcode
∗ Other fields specify suboperations
» Depends on instruction type
• AA bit
∗ 1 (use absolute address)
∗ 0 (use relative address)
• LK bit
∗ 0 (no link --- branch)
∗ 1 (link --- turns branch into a procedure call)

PowerPC Instruction Set
• Data Transfer instructions
• Byte loads
lbz rD,disp(rA) ;Load byte and zero
lbzu rD,disp(rA) ;Load byte and zero with update
» Effective address = contents of rA + disp
lbzx rD,rA,rB ;Load byte and zero indexed
lbzux rD,rA,rB ;Load byte and zero with update indexed
» Effective address = contents of rA + contents of rB
» Upper three bytes of rD are zeroed
» Update versions: rA ← effective address

PowerPC Instruction Set (cont’d)
• Similar instructions for halfword and word loads
lhz, lhzu, lhzx, lhzux
lwz, lwzu, lwzx, lwzux
• For halfword loads, sign extension is possible
lha, lhau, lhax, lhaux
• Multiword load
lmw rD,disp(rA)
» Loads n consecutive words at EA to registers rD, …, r31

• Similar instructions for store
stb, stbu, stbx, stbux
sth, sthu, sthx, sthux
stw, stwu, stwx, stwux
• Multiword store
stmw rS,disp(rA)
» Stores the n consecutive registers rS, …, r31 to memory starting at EA

Arithmetic Instructions
• Add instructions
add rD,rA,rB ; rD ← rA + rB
» Status and overflow bits of CR0 and XER are not altered
add. rD,rA,rB ; alters LT,GT,EQ,SO of CR0
addo rD,rA,rB ; alters SO,OV of XER
addo. rD,rA,rB ; alters LT,GT,EQ,SO of CR0
; and SO,OV of XER
» These four instructions do not alter the CA bit of XER

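∗ To alter CA bit, use
adde rD,rA,rB ; rD ← rA + rB + CA (add extended)
» Use addc rD,rA,rB to set CA without adding the incoming carry
∗ To alter the other bits as well, use
adde., addeo, addeo.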
∗ Immediate operand version
addi rD,rA,Simm16
∗ We can use addi to implement other instructions
li rD,value as addi rD,0,value
la rD,disp(rA) as addi rD,rA,disp
subi rD,rA,value as addi rD,rA,-value

• Subtract instructions
subf rD,rA,rB ; rD ← rB − rA
» subf = subtract from
∗ Like add, other forms are available
subf., subfo, subfo.
∗ Negate instruction
neg rD,rA ; rD ← 0 − rA

• Multiply instructions
∗ Two instructions to get upper and lower 32 bits of the 64-bit result
mullw rD,rA,rB ; signed/unsigned multiply
» Stores the lower-order 32 bits of the result
» Use the following to get the upper 32 bits
mulhw rD,rA,rB ; signed
mulhwu rD,rA,rB ; unsigned
∗ Immediate form
mulli rD,rA,Simm16
» Stores only lower 32 bits of the 48-bit result

∗ Divide instructions
» Two divide instructions
– Signed (divw)
divw rD,rA,rB ; rD = rA/rB
– Unsigned (divwu)
» Both give only quotient
» For quotient and remainder, use
divw rD,rA,rB ; quotient in rD
mullw rX,rD,rB
subf rC,rX,rA ; remainder in rC
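
The three-instruction quotient-and-remainder sequence can be traced with a small model. This is a sketch under stated assumptions: the Python function names simply mirror the mnemonics, values are wrapped to signed 32 bits, the divisor is assumed to be nonzero, and the overflow case (most negative value divided by −1) is not handled.

```python
def to_signed32(x):
    """Wrap an integer to the signed 32-bit range."""
    x &= 0xFFFFFFFF
    return x - (1 << 32) if x & 0x80000000 else x


def divw(ra, rb):
    """Signed divide; the quotient is truncated toward zero (rb != 0 assumed)."""
    q = abs(ra) // abs(rb)
    return to_signed32(-q if (ra < 0) != (rb < 0) else q)


def mullw(ra, rb):
    """Low-order 32 bits of the product."""
    return to_signed32(ra * rb)


def subf(ra, rb):
    """Subtract from: rb - ra."""
    return to_signed32(rb - ra)


def quotient_and_remainder(ra, rb):
    rd = divw(ra, rb)     # divw  rD,rA,rB   quotient in rD
    rx = mullw(rd, rb)    # mullw rX,rD,rB
    rc = subf(rx, ra)     # subf  rC,rX,rA   remainder in rC
    return rd, rc
```

Because the quotient is truncated toward zero, the remainder takes the sign of the dividend: dividing −17 by 5 yields quotient −3 and remainder −2.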

∗ Logical instructions
and rD,rS,rB and. rD,rS,rB
andi. rD,rS,Uimm16 andis. rD,rS,Uimm16
andc rD,rS,rB andc. rD,rS,rB
» andis. = left shift Uimm16 by 16 bit positions before ANDing
» andc = complement rB before ANDing
» Dot versions update the LT, GT, EQ, SO bits of CR0
» Logical OR also has these six versions
» Move register instruction is implemented using OR
mr rA,rS is equivalent to or rA,rS,rS
» NOP is implemented as
ori 0,0,0

∗ Other logical operations
» NAND
– nand
– nand.
» NOR
– nor
– nor.
» XOR
– xor, xor.
– xori, xoris
» Equivalence (exclusive-NOR)
– eqv
– eqv.

∗ Shift and Rotate instructions
» Shift left
slw rA,rS,rB ; shift left word
» Shift left the word in rS by rB positions and store result in rA
– Vacated bit positions are filled with zeros
» Also have the dot version slw.
» Shift right
srw srw. (logical)
sraw sraw. (arithmetic)
» Rotate left instructions
rlwnm rA,rS,rB,MB,ME
rotlw rA,rS,rB ≡ rlwnm rA,rS,rB,0,31
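
A minimal model of the rotate-and-mask behaviour shows why rotlw is just rlwnm with a full mask. The Python names are illustrative; the sketch assumes PowerPC bit numbering (bit 0 is the most significant bit), a mask of ones running from MB through ME that wraps around when MB > ME, and a rotate count taken from the low five bits of rB.

```python
MASK32 = 0xFFFFFFFF


def rotl32(value, n):
    """Rotate a 32-bit value left by n positions (n taken modulo 32)."""
    n &= 31
    value &= MASK32
    return ((value << n) | (value >> (32 - n))) & MASK32


def mask(mb, me):
    """Ones from bit mb through bit me, where bit 0 is the most significant."""
    bits = 0
    i = mb
    while True:
        bits |= 1 << (31 - i)
        if i == me:
            break
        i = (i + 1) % 32      # wrap around when mb > me
    return bits


def rlwnm(rs, rb, mb, me):
    """Rotate left word, then AND with the mask."""
    return rotl32(rs, rb & 0x1F) & mask(mb, me)


def rotlw(rs, rb):
    """Plain rotate: the mask 0..31 keeps every bit."""
    return rlwnm(rs, rb, 0, 31)
```

With MB = 24 and ME = 31 the mask keeps only the low byte, so `rlwnm(0x12345678, 8, 24, 31)` extracts the original top byte, 0x12.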

∗ Compare instructions
» Two versions:
– For signed and unsigned
» Two formats
– Register and immediate
» Register compare
cmp crfD,rA,rB
» Updates LT (rA < rB), GT (rA > rB), EQ, SO bits in the crfD
» If crfD is not specified, CR0 is used
» Immediate version
cmpi crfD,rA,Simm16

∗ Branch Instructions
» Used for both branch (LK = 0) and procedure calls (LK = 1)
» Can use absolute (AA = 1) or relative address (AA = 0)
b target (AA=0, LK=0) Branch
ba target (AA=1, LK=0) Branch Absolute
bl target (AA=0, LK=1) Branch then link
bla target (AA=1, LK=1) Branch Absolute then link
» The last two are procedure calls
» Three types of conditional branches
– Direct address
– Register indirect
▪ CTR or LR

∗ Conditional branch instructions (direct address)
bc BO,BI,target (AA=0, LK=0) Branch Conditional
bca BO,BI,target (AA=1, LK=0) Branch Conditional Absolute
bcl BO,BI,target (AA=0, LK=1) Branch Conditional then link
bcla BO,BI,target (AA=1, LK=1) Branch Conditional Absolute then link
» BO = branch options (5 bits) ⇒ specifies branch condition
» BI = branch input (5 bits) ⇒ specifies a bit in CR field

∗ Nine different branch conditions can be specified
» Decrement CTR; branch if CTR ≠ 0 AND cond = false
» Decrement CTR; branch if CTR = 0 AND cond = false
» Decrement CTR; branch if CTR ≠ 0 AND cond = true
» Decrement CTR; branch if CTR = 0 AND cond = true
» Branch if cond = false
» Branch if cond = true
» Decrement CTR; branch if CTR ≠ 0
» Decrement CTR; branch if CTR = 0
» Branch always
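
The nine conditions are the combinations of three CTR options (ignore CTR, decrement and require CTR ≠ 0, decrement and require CTR = 0) with three condition options (ignore, require true, require false). The sketch below models that decision abstractly; the parameter names are hypothetical and it deliberately does not decode the actual BO bit patterns.

```python
def branch_taken(ctr, cond, use_ctr=None, want_cond=None):
    """Decide a conditional branch.

    use_ctr:   None (ignore CTR), "nonzero" or "zero" (decrement CTR, then test it)
    want_cond: None (ignore the CR bit), True or False (required value of the bit)
    Returns (taken, new_ctr).
    """
    ctr_ok = True
    if use_ctr is not None:
        ctr = (ctr - 1) & 0xFFFFFFFF          # CTR is decremented before the test
        ctr_ok = (ctr != 0) if use_ctr == "nonzero" else (ctr == 0)
    cond_ok = True if want_cond is None else (cond == want_cond)
    return ctr_ok and cond_ok, ctr


# All combinations of the two option sets: 3 x 3 = 9 branch conditions.
ALL_CONDITIONS = [(c, w) for c in (None, "nonzero", "zero")
                  for w in (None, True, False)]
```

The combination `(None, None)` corresponds to "branch always", and `("nonzero", None)` is the usual loop-closing branch.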

∗ LR-based branch instructions
bclr BO,BI (LK=0) Branch Conditional to Link Register
bclrl BO,BI (LK=1) Branch Conditional to Link Register then Link
» Target address is taken from LR
» Used to return from procedure calls
∗ CTR-based branch instructions
bcctr BO,BI (LK=0)
bcctrl BO,BI (LK=1)
» CTR instead of LR is used to get target

Itanium
• Intel’s 64-bit processor
∗ RISC based
∗ Based on EPIC design philosophy
» Explicit Parallel Instruction Computing
» Support for ILP
– 3-instruction wide word
» Speculative computation
– Hides memory latency
» Predication
– Improves branch handling
» Large number of registers
– 128 integer and 128 FP
– Aids in efficient procedure calls

Itanium (cont’d)
• Registers
∗ 128 general-purpose registers (gr0 – gr127)
» 64-bit wide
» NaT (Not-a-Thing) bit
– Used in speculative loading
» Divided into static and stacked
– Static
▪ First 32 registers (gr0 – gr31)
▪ gr0 is read-only (always provides zero)
– Stacked
▪ Available for programs
▪ Used as register stack frame

Itanium (cont’d)
• Registers
∗ Branch registers
» 8 in total (br0 – br7)
» 64-bit wide
» Specify target address for
– Conditional branches
– Procedure calls
– Return
∗ User mask register
» Alignment, byte ordering, …
∗ Other registers
» Predicate register, Application registers, Current frame marker

Itanium (cont’d)
• Addressing modes
∗ Load/store instructions can access memory
» Specify three registers: r1, r2, r3
– r2 and r3 are used to compute effective address
– r1 receives/supplies data
∗ Register indirect addressing
» Effective address = contents of r3
∗ Register indirect with immediate addressing
» Effective address = contents of r3
» After the access, r3 = contents of r3 + imm9 (base update)
∗ Register indirect with index addressing
» Effective address = contents of r3
» After the access, r3 = contents of r3 + contents of r2 (base update)

Itanium (cont’d)
• Instruction Format
[(qp)] mnemonic[.comp] dests = srcs
∗ qp = qualifying predicate
» Specifies a predicate register
– 64 1-bit registers
– Executed if the specified PR is 1
– Otherwise, instruction is treated as NOP
∗ mnemonic
» Identifies an instruction (e.g., compare)
∗ comp
» Gives more information to completely specify the instruction
» E.g., the type of comparison is equality

Itanium (cont’d)
• Examples
add r1 = r2,r3
Predicated instruction
(p4) add r1 = r2,r3
add r1 = r2,r3,1
Compare instructions
cmp.eq p3 = r2,r4
cmp.gt p2,p3 = r3,r4
Branch instruction
br.cloop.sptk loop_back

Instruction-level Parallelism
• Itanium provides
∗ Runtime support for explicit parallelism
» Compiler/assembler can indicate parallelism
– Instruction groups
∗ Large number of registers
• Instruction groups
∗ Set of instructions that do not have conflicting dependencies
» Can be executed in parallel
∗ Compiler/assembler can indicate this by ;; notation

Instruction-level Parallelism
• Example: Logical expression with four terms
if (r10 || r11 || r12 || r13) {
/* if-block code */
}
can be done using or-tree evaluation
or r1 = r10,r11 /* Group 1 */
or r2 = r12,r13 ;;
or r3 = r1,r2 ;; /* Group 2 */
Other instructions /* Group 3 */
∗ Processor can execute as many instructions from a group as it can
» Depends on the available resources

Itanium Instruction Bundle
• Each instruction is encoded using 41 bits
• Three instructions are bundled together
∗ 128-bit Instruction bundle
∗ No conflicting dependencies among the three instructions
» Aids in instruction–level parallelism
∗ 5-bit template
» Specifies mapping of instruction slots to execution unit types
– Six instruction types
▪ Integer ALU, non-ALU integer, memory, branch, FP, extended
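
The bundle arithmetic works out exactly: 3 × 41 instruction bits plus the 5-bit template give 128 bits. The sketch below packs and unpacks such a bundle. The function names are illustrative, and the bit placement (template in the lowest five bits, followed by slots 0, 1, and 2) is an assumption stated here rather than something the slide specifies.

```python
SLOT_BITS = 41
TEMPLATE_BITS = 5
BUNDLE_BITS = TEMPLATE_BITS + 3 * SLOT_BITS   # 5 + 123 = 128


def pack_bundle(template, slot0, slot1, slot2):
    """Combine a template and three 41-bit instructions into one 128-bit integer."""
    if not 0 <= template < (1 << TEMPLATE_BITS):
        raise ValueError("template must fit in 5 bits")
    bundle = template
    for index, slot in enumerate((slot0, slot1, slot2)):
        if not 0 <= slot < (1 << SLOT_BITS):
            raise ValueError("each instruction must fit in 41 bits")
        bundle |= slot << (TEMPLATE_BITS + index * SLOT_BITS)
    return bundle


def unpack_bundle(bundle):
    """Split a 128-bit bundle back into (template, (slot0, slot1, slot2))."""
    template = bundle & ((1 << TEMPLATE_BITS) - 1)
    slots = tuple(
        (bundle >> (TEMPLATE_BITS + i * SLOT_BITS)) & ((1 << SLOT_BITS) - 1)
        for i in range(3)
    )
    return template, slots
```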

Itanium Instructions
• Data transfer instructions
» Load and store instructions are more complicated than those of a typical RISC processor
∗ Load instructions
(qp) ldSZ.ldtype.ldhint r1=[r3]
(qp) ldSZ.ldtype.ldhint r1=[r3],r2
(qp) ldSZ.ldtype.ldhint r1=[r3],imm9
» ldtype: special load operations (advanced, speculative)
» ldhint: locality of memory access
» Loads SZ bytes from memory
– SZ can be 1, 2, 4, or 8 to load 1, 2, 4, or 8 bytes
– Example:
ld8 r5 = [r6]

Itanium Instructions (cont’d)
• ldtype
∗ This completer can be used to specify special load operations
» Advanced
ld8.a r5 = [r6]
» Speculative
ld8.s r5 = [r6]
• ldhint
∗ Locality of memory access
None – Temporal locality, level 1
nt1 – No temporal locality, level 1
nta – No temporal locality, all levels

• Store instructions
∗ Simpler than load instructions
(qp) stSZ.sttype.sthint [r3] = r1
(qp) stSZ.sttype.sthint [r3] = r1,imm9
• Move instructions
(qp) mov r1 = r3
(qp) mov r1 = imm22
(qp) mov r1 = imm64
» First two are pseudo-instructions
– Implemented using other processor instructions

• Arithmetic instructions
∗ Simpler than load instructions
(qp) add r1 = r2,r3
(qp) add r1 = r2,r3,1
(qp) add r1 = imm,r4 (imm can be imm14 or imm22)
∗ Move instruction
(qp) mov r1 = r3
implemented as
(qp) add r1 = 0,r3
∗ Move instruction
(qp) mov r1 = imm22
implemented as
(qp) add r1 = imm22,r0

• Similar instructions for subtraction
• Shift-add
(qp) shladd r1 = r2,count,r3
» Before adding, r2 is left-shifted by count bit positions
• Integer multiply is realized using the xma instruction and floating-point registers
• No divide instruction
∗ Done in software
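
The shift-add instruction above is handy for scaled address arithmetic. A minimal model, assuming 64-bit wraparound and a shift count restricted to 1 through 4 (both labelled here as assumptions of this sketch), is:

```python
MASK64 = (1 << 64) - 1


def shladd(r2, count, r3):
    """Model of shladd r1 = r2,count,r3: shift r2 left by count, then add r3."""
    if not 1 <= count <= 4:
        raise ValueError("count assumed to be in the range 1..4")
    return ((r2 << count) + r3) & MASK64
```

For example, the address of element 5 in an array of 8-byte items starting at 0x1000 is `shladd(5, 3, 0x1000)`, that is, 0x1028.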

• Logical instructions
∗ AND
∗ OR
∗ XOR
∗ No NOT operation
» Can use and-complement (andcm)
– Complements one of the operands before ANDing
• Format
(qp) and r1 = r2,r3
(qp) and r1 = imm8,r3

• Shift instructions
∗ Left-shift
∗ Right-shift
• Left-shift format
(qp) shl r1 = r2,r3
(qp) shl r1 = r2,count
• Right-shift format
(qp) shr r1 = r2,r3 (signed version)
(qp) shr.u r1 = r2,r3 (unsigned version)

• Compare instructions
∗ Format
(qp) cmp.crel.ctype p1,p2 = r2,r3
(qp) cmp.crel.ctype p1,p2 = imm8,r3
∗ crel: Type of comparison
Cmp type   signed   unsigned
<          lt       ltu
≤          le       leu
>          gt       gtu
≥          ge       geu
=          eq       eq

∗ ctype: Specifies how the two predicate registers are to be updated
» Default:
– Comparison result in p1 and its complement in p2
» or type
– p1 and p2 are set to 1 only if the comparison result is 1
– Otherwise, p1 and p2 are not altered
– Useful in OR-type simultaneous execution
» and type
– p1 and p2 are set to 0 only if the comparison result is 0
– Otherwise, p1 and p2 are not altered
– Useful in AND-type simultaneous execution
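
The three update rules can be written down directly. In this sketch `cmp_update` and its string-valued `ctype` argument are hypothetical names; the function returns the new values of (p1, p2) given the comparison result and the old predicate values.

```python
def cmp_update(result, p1, p2, ctype="default"):
    """Return the new (p1, p2) after a compare with the given ctype."""
    if ctype == "default":
        return result, not result                  # result and its complement
    if ctype == "or":
        return (True, True) if result else (p1, p2)     # only ever sets to 1
    if ctype == "and":
        return (False, False) if not result else (p1, p2)  # only ever clears to 0
    raise ValueError("unknown ctype")


def any_nonzero(values):
    """OR-type evaluation: start at 0; any true comparison sets the predicates."""
    p = (False, False)
    for v in values:
        p = cmp_update(v != 0, p[0], p[1], "or")
    return p[0]


def all_nonzero(values):
    """AND-type evaluation: start at 1; any false comparison clears the predicates."""
    p = (True, True)
    for v in values:
        p = cmp_update(v != 0, p[0], p[1], "and")
    return p[0]
```

Because an or-type compare can only set the predicates and an and-type compare can only clear them, the order of the compares does not matter, which is what allows them to be issued in the same instruction group.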

• Branch instructions
∗ Used for jump as well as procedure calls
∗ Supports both direct and indirect branching
» All direct branches are IP-relative
∗ IP relative form
(qp) br.btype.bwh.ph.dh target25 (basic form)
(qp) br.btype.bwh.ph.dh b1=target25 (call form)
br.btype.bwh.ph.dh target25 (counted loop form)

∗ Indirect form
(qp) br.btype.bwh.ph.dh b2 (basic form)
(qp) br.btype.bwh.ph.dh b1=b2 (call form)
∗ btype: Type of branch
» cond or none (for basic form)
– Branch taken if qp is 1; otherwise not
» To invoke a procedure
– Use the call form with btype = call
– Turns branch into a conditional procedure call
– Procedure invoked only if qp is 1; otherwise not
– Return address is saved in b1 branch register

» Counted loop version
– Set btype = cloop
– Loop count is in application register ar65 (the loop count register, lc)
– If ar65 is not zero, it is decremented and the branch is taken
» RET version
– Use btype = ret
– Should use the indirect form and specify the branch register that has the return address
• Example 1: Conditional skip
(p3) br skip
or
(p3) br.cond skip

• Example 2: Loop iterates 100 times
mov lc = 99 // the body runs once more than the initial count
loop_back:
. . .
br.cloop loop_back
• Example 3: Procedure call to sum
(p0) br.call br2 = sum
• Example 4: Return from a procedure
(p0) br.ret br2
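
The iteration count of a counted loop follows from the cloop rule (if the loop count is not zero, decrement it and take the branch). The sketch below assumes the br.cloop sits at the bottom of the loop body, as in Example 2; the function name is illustrative.

```python
def run_counted_loop(initial_lc):
    """Count how many times the loop body executes for a given initial loop count."""
    lc = initial_lc
    executions = 0
    while True:
        executions += 1        # loop body
        if lc != 0:            # br.cloop: loop count not zero ...
            lc -= 1            # ... so decrement it
            continue           # ... and branch back
        break                  # loop count is zero: fall through
    return executions
```

Under this rule the body executes one more time than the initial loop count, so an initial count of 99 gives 100 passes through the body.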

Handling Branches
• Three techniques:
∗ Branch elimination
» Eliminate branches
– Best way to handle branches is not to have branches
▪ Possible to eliminate some types of branches
∗ Branch speedup
» Reduce the delay associated with branches
– Reorder instructions
– Speculative execution
∗ Branch prediction
» Discussed before (see Chapter 8)

Handling Branches (cont’d)
• Branch elimination in Itanium
∗ Can be done using predication
if (R1 == R2)
R3 = R3 + R1;
else
R3 = R3 - R1;
Conventional code (with branches):
cmp r1,r2
je equal
sub r3,r1
jmp next
equal:
add r3,r1
next:
Itanium code (with predication):
cmp.eq p1,p2 = r1,r2
(p1) add r3 = r3,r1
(p2) sub r3 = r3,r1
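
A small simulation can confirm that the predicated sequence computes the same result as the branching version. It is only a sketch: Python's conditional expression stands in for the hardware's "commit only if the predicate is 1" behaviour, and the function names are illustrative.

```python
def with_branches(r1, r2, r3):
    """The if/else version: control flow selects one of the two operations."""
    if r1 == r2:
        return r3 + r1
    return r3 - r1


def with_predication(r1, r2, r3):
    """Both operations are issued; only the one whose predicate is 1 commits."""
    p1 = r1 == r2                    # cmp.eq p1,p2 = r1,r2
    p2 = not p1                      # default ctype: p2 gets the complement
    add_result = r3 + r1
    sub_result = r3 - r1
    r3 = add_result if p1 else r3    # (p1) add r3 = r3,r1
    r3 = sub_result if p2 else r3    # (p2) sub r3 = r3,r1
    return r3
```

Since exactly one of p1 and p2 is 1, exactly one of the two results is written back, and no branch instruction is needed.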

Handling Branches (cont’d)
switch (r6){
case 1:
r2 = r3 + r4;
break;
case 2:
r2 = r3 - r4;
break;
case 3:
r2 = r3 + r5;
break;
case 4:
r2 = r3 - r5;
break;
}
cmp.eq p1,p0 = r6,1
cmp.eq p2,p0 = r6,2
cmp.eq p3,p0 = r6,3
cmp.eq p4,p0 = r6,4;;
(p1) add r2 = r3,r4
(p2) sub r2 = r3,r4
(p3) add r2 = r3,r5
(p4) sub r2 = r3,r5

Speculative Execution
• Instructions are executed in expectation that they will be needed
∗ Keeps pipeline full
∗ Masks memory latency
• Itanium supports two types
∗ Data speculation
» Handles data dependencies (discussed in Chapter 8)
∗ Control speculation
» Handles control dependencies
∗ Both are compiler optimizations
» The compiler reorders instructions

Speculative Execution (cont’d)
Data speculation
Original code:
sub r6 = r7,r8 ;; //cycle 1
sub r9 = r10,r6 //cycle 2
ld8 r4 = [r5] ;;
add r11 = r12,r4 ;; //cycle 4
Reordered code (load moved up):
ld8 r4 = [r5] //cycle 1
sub r6 = r7,r8 ;;
sub r9 = r10,r6 ;; //cycle 2
add r11 = r12,r4 //cycle 3

• Ambiguous dependency between first st8 and ld8
sub r6 = r7,r8 ;; //cycle 1
st8 [r9] = r6 //cycle 2
ld8 r4 = [r5] ;;
add r11 = r12,r4 ;; //cycle 4
st8 [r10] = r11 //cycle 5

• We can move such load instructions using advance load (ld.a) and check load (ld.c)
ld8.a r4 = [r5] //cycle 0 or earlier
. . .
sub r6 = r7,r8 ;; //cycle 1
st8 [r9] = r6 //cycle 2
ld8.c r4 = [r5]
add r11 = r12,r4 ;;
st8 [r10] = r11 //cycle 3

• Further improvement with advance check (chk.a)
ld8.a r4 = [r5] //cycle -1 or earlier
. . .
sub r6 = r7,r8 ;;
st8 [r9] = r6 //cycle 2
chk.a r4,recover
back:
st8 [r10] = r11
recover:
ld8 r4 = [r5] // reload
add r11 = r12,r4 // reexecute add
br back // jump back
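
The advance-load and check mechanism can be modelled with a toy table that remembers which address each advanced load read. This is a simplified sketch, not the hardware design: the `Alat` class, its methods, and the dictionary-based memory are assumptions for illustration, and the dependent add is assumed to have been hoisted along with the load.

```python
class Alat:
    """Toy advanced load address table: target register -> address it loaded."""

    def __init__(self):
        self.entries = {}

    def advanced_load(self, reg, addr, mem):      # models ld8.a
        self.entries[reg] = addr
        return mem[addr]

    def store(self, addr, value, mem):            # models st8
        mem[addr] = value
        # A store to a tracked address invalidates the matching entries.
        self.entries = {r: a for r, a in self.entries.items() if a != addr}

    def check(self, reg):                         # models chk.a (True = still valid)
        return reg in self.entries


def in_order(mem, r5, r7, r8, r9, r12):
    """Reference: the store happens before the load."""
    mem = dict(mem)
    r6 = r7 - r8
    mem[r9] = r6
    r4 = mem[r5]
    return r12 + r4


def speculative(mem, r5, r7, r8, r9, r12):
    """Load hoisted above the store, with recovery if the store aliased it."""
    mem = dict(mem)
    alat = Alat()
    r4 = alat.advanced_load("r4", r5, mem)   # ld8.a r4 = [r5]
    r11 = r12 + r4                           # dependent add, hoisted (assumption)
    r6 = r7 - r8                             # sub r6 = r7,r8
    alat.store(r9, r6, mem)                  # st8 [r9] = r6
    if not alat.check("r4"):                 # chk.a r4,recover
        r4 = mem[r5]                         # recover: reload
        r11 = r12 + r4                       # recover: re-execute add
    return r11
```

When the store and the load use different addresses the early result is kept; when they alias, the check fails and the recovery code redoes the load and the add, so both versions always agree.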

• Control speculation
∗ To hide the latency of slow instructions such as loads, move them earlier in the code
cmp.eq p1,p0 = r10,10 //cycle 0
(p1) br.cond skip ;; //cycle 0
ld8 r1 = [r2] ;; //cycle 1 (cannot be advanced because of the branch)
skip:
// other instructions

ld8.s r1 = [r2] ;; //cycle -2 or earlier
//other instructions
cmp.eq p1,p0 = r10,10 //cycle 0
(p1) br.cond skip //cycle 0
chk.s r1,recovery //cycle 0
skip:
//other instructions
recovery:
ld8 r1 = [r2]
br skip
» The speculative check (chk.s) allows us to advance ld8
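
Control speculation relies on the NaT bit to defer any exception raised by the early load. The sketch below is a simplified model: a missing dictionary key stands in for a faulting memory access, and the function names and the `None` return for the skipped path are assumptions made for illustration.

```python
def speculative_load(mem, addr):
    """Model of ld8.s: return (value, nat). A faulting access sets NaT instead of raising."""
    if addr in mem:
        return mem[addr], False
    return 0, True


def run(mem, r2, r10):
    """Return r1 if the load was needed, or None if the branch skipped it."""
    r1, nat = speculative_load(mem, r2)   # ld8.s hoisted above the branch
    p1 = r10 == 10                        # cmp.eq p1,p0 = r10,10
    if p1:                                # (p1) br.cond skip
        return None                       # r1 is never used; a deferred fault is ignored
    if nat:                               # chk.s r1,recovery
        r1 = mem[r2]                      # recovery: ordinary load; a real fault surfaces here
    return r1
```

If the branch is taken, a faulting speculative load costs nothing; if the value is actually needed, the check redirects to recovery code, which repeats the load non-speculatively.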

Branch Prediction
• Branch hints
∗ bwh completer (branch whether hint)
spnt static branch not taken
sptk static branch taken
dpnt dynamic branch not taken
dptk dynamic branch taken
• Prefetch hint (ph)
∗ Hint about sequential prefetch
» few or many
• Deallocation hint (dh)
∗ Specifies whether branch cache should be cleared
» clr indicates deallocation