Embedded Systems in Silicon
TD5102
Other Architectures
Henk Corporaal
http://www.ics.ele.tue.nl/~heco/courses/EmbSystems
Technical University Eindhoven
DTI / NUS Singapore
2005/2006
ACA 2003 2
Introduction
• Design alternatives:
  • provide more powerful operations
    • goal is to reduce the number of instructions executed
    • danger is a slower cycle time and/or a higher CPI
  • provide even simpler operations
    • to reduce code size / complexity of the interpreter
• Sometimes referred to as “RISC vs. CISC”
  • virtually all new instruction sets since 1982 have been RISC
  • VAX: minimize code size, make assembly language easy; instructions from 1 to 54 bytes long!
• We’ll look at IA-32 and the Java Virtual Machine

Topics
• Recap of MIPS architecture
• Why RISC?
• Other architecture styles
  • Accumulator architecture
  • Stack architecture
  • Memory-Memory architecture
  • Register architectures
• Examples
  • 80x86
  • Pentium Pro, II, III, 4
  • JVM

Recap of MIPS
• RISC architecture
• Register space
• Addressing
• Instruction format
• Pipelining

Why RISC? Keep it simple
RISC characteristics:
• Reduced number of instructions
• Limited addressing modes
  • load-store architecture
  • enables pipelining
• Large register set
  • uniform (no distinction between e.g. address and data registers)
• Limited number of instruction sizes (preferably one)
  • know directly where the following instruction starts
• Limited number of instruction formats
• Memory alignment restrictions
• ......
• Based on quantitative analysis
  • "the famous MIPS one percent rule": don't even think about it when it's not used more than one percent of the time

Register space
Name       Register number   Usage
$zero      0                 the constant value 0
$v0-$v1    2-3               values for results and expression evaluation
$a0-$a3    4-7               arguments
$t0-$t7    8-15              temporaries
$s0-$s7    16-23             saved (by callee)
$t8-$t9    24-25             more temporaries
$gp        28                global pointer
$sp        29                stack pointer
$fp        30                frame pointer
$ra        31                return address
32 integer (and 32 floating-point) registers, each 32 bits wide
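
Addressing
1. Immediate addressing: op | rs | rt | Immediate; the operand is the constant in the instruction
2. Register addressing: op | rs | rt | rd | ... | funct; the operand is a register
3. Base addressing: op | rs | rt | Address; the operand is in memory (byte, halfword, or word) at register + Address
4. PC-relative addressing: op | rs | rt | Address; the target is the memory word at PC + Address
5. Pseudodirect addressing: op | Address; the target is the memory word at Address combined with the upper bits of the PC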

Instruction format
Example instructions
Instruction          Meaning
add $s1,$s2,$s3      $s1 = $s2 + $s3
addi $s2,$s3,4       $s2 = $s3 + 4
lw $s1,100($s2)      $s1 = Memory[$s2+100]
bne $s4,$s5,L        if $s4<>$s5 goto L
j Label              goto Label

Format   Fields
R        op | rs | rt | rd | shamt | funct
I        op | rs | rt | 16 bit address
J        op | 26 bit address
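
The three formats can be made concrete by packing the fields into a 32-bit word. The following is a minimal Python sketch, not part of the original slides. It assumes the standard MIPS32 field widths (6/5/5/5/5/6 bits for R, 6/5/5/16 for I, 6/26 for J) and the usual opcode/funct values (add: funct 0x20, addi: opcode 8, lw: opcode 0x23). The helper names `encode_r`, `encode_i`, and `encode_j` are illustrative.

```python
# Register numbers follow the register table ($s0-$s7 are 16-23).
REG = {"$zero": 0, "$s1": 17, "$s2": 18, "$s3": 19}


def encode_r(op, rs, rt, rd, shamt, funct):
    """R-format: op(6) rs(5) rt(5) rd(5) shamt(5) funct(6)."""
    return (op << 26) | (rs << 21) | (rt << 16) | (rd << 11) | (shamt << 6) | funct


def encode_i(op, rs, rt, imm):
    """I-format: op(6) rs(5) rt(5) immediate/address(16)."""
    return (op << 26) | (rs << 21) | (rt << 16) | (imm & 0xFFFF)


def encode_j(op, target):
    """J-format: op(6) word address(26)."""
    return (op << 26) | (target & 0x3FFFFFF)


# add $s1,$s2,$s3: destination rd=$s1, sources rs=$s2 and rt=$s3
add_word = encode_r(0, REG["$s2"], REG["$s3"], REG["$s1"], 0, 0x20)
# addi $s2,$s3,4: destination rt=$s2, source rs=$s3
addi_word = encode_i(8, REG["$s3"], REG["$s2"], 4)
# lw $s1,100($s2): destination rt=$s1, base register rs=$s2
lw_word = encode_i(0x23, REG["$s2"], REG["$s1"], 100)

print(f"{add_word:08x} {addi_word:08x} {lw_word:08x}")
```

Note that the assembly operand order (destination first) differs from the field order in the encoded word, which is why the arguments appear shuffled in the calls.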

Pipelining
All integer instructions fit into the following pipeline:
time →
IF  ID  EX  MEM WB
    IF  ID  EX  MEM WB
        IF  ID  EX  MEM WB
            IF  ID  EX  MEM WB
                IF  ID  EX  MEM WB
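
A small Python sketch of the timing behind this diagram, added for illustration. It assumes an ideal pipeline (one instruction enters per cycle, no stalls or hazards), so n instructions on a k-stage pipeline finish in k + n - 1 cycles. The function names are hypothetical.

```python
STAGES = ["IF", "ID", "EX", "MEM", "WB"]


def pipeline_cycles(n_instructions, n_stages=len(STAGES)):
    """Cycles needed to finish n instructions in an ideal (stall-free) pipeline."""
    if n_instructions == 0:
        return 0
    return n_stages + n_instructions - 1


def timing_diagram(n_instructions):
    """One text row per instruction, each shifted one cycle to the right."""
    row = "".join(f"{stage:<4}" for stage in STAGES).rstrip()
    return ["    " * i + row for i in range(n_instructions)]


print("\n".join(timing_diagram(5)))
print(pipeline_cycles(5), "cycles for 5 instructions")
```

Without pipelining the same five instructions would need 5 × 5 = 25 cycles under the same assumptions.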

Other architecture styles
• Accumulator architecture
• Stack
• Register (load-store)
• Register-Memory
• Memory-Memory

Accumulator architecture
[Figure: accumulator datapath with accumulator, ALU, registers, address, latches, and memory]
Example code: a = b+c;
    load b;   // accumulator is implicit operand
    add c;
    store a;
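
To make the implicit operand explicit, here is a minimal Python sketch of an accumulator machine running the three instructions above. It is an illustration under simplifying assumptions: memory is a dictionary keyed by variable name, and only `load`, `add`, and `store` exist. The function name `run_accumulator` is hypothetical.

```python
def run_accumulator(program, memory):
    """Execute (opcode, address) pairs; the accumulator is the implicit operand."""
    acc = 0
    for opcode, addr in program:
        if opcode == "load":
            acc = memory[addr]        # acc <- Mem[addr]
        elif opcode == "add":
            acc += memory[addr]       # acc <- acc + Mem[addr]
        elif opcode == "store":
            memory[addr] = acc        # Mem[addr] <- acc
        else:
            raise ValueError(f"unknown opcode: {opcode}")
    return memory


# a = b + c with b = 2 and c = 3
memory = run_accumulator(
    [("load", "b"), ("add", "c"), ("store", "a")],
    {"a": 0, "b": 2, "c": 3},
)
print(memory["a"])
```

Every instruction names exactly one memory operand; the second operand and the destination are always the accumulator.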

Stack architecture
[Figure: stack datapath with ALU, memory, stack, stack pointer, and latches]
Example code: a = b+c;
    push b;
    push c;
    add;
    pop a;
Stack contents after each instruction (top of stack first):
    push b    b
    push c    c | b
    add       b+c
    pop a     (empty)
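
The same computation on a stack machine can be sketched in a few lines of Python. This is an illustrative model, not the slides' hardware: memory is a dictionary, the stack is a Python list with the top of stack at the end, and `run_stack` is a hypothetical name. The function also records the stack after every instruction so the trace can be compared with the stack contents shown on the slide.

```python
def run_stack(program, memory):
    """Run push/pop/add instructions; return memory and the stack after each step."""
    stack = []
    trace = []
    for instr in program:
        op, *args = instr.split()
        if op == "push":
            stack.append(memory[args[0]])
        elif op == "pop":
            memory[args[0]] = stack.pop()
        elif op == "add":
            right = stack.pop()       # operands are implicit: the two top entries
            left = stack.pop()
            stack.append(left + right)
        else:
            raise ValueError(f"unknown instruction: {instr}")
        trace.append(list(stack))     # snapshot, top of stack is the last element
    return memory, trace


memory, trace = run_stack(
    ["push b", "push c", "add", "pop a"],
    {"a": 0, "b": 2, "c": 3},
)
print(memory["a"], trace)
```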

Other architecture styles
Let's look at the code for C = A + B
Stack       Accumulator   Register-Memory   Memory-Memory   Register (load-store)
Push A      Load A        Load r1,A         Add C,B,A       Load r1,A
Push B      Add B         Add r1,B                          Load r2,B
Add         Store C       Store C,r1                        Add r3,r1,r2
Pop C                                                       Store C,r3
Q: What are the advantages / disadvantages of load-store (RISC) architecture?

Other architecture styles
• Accumulator architecture
  • one operand (in register or memory); the accumulator is almost always used implicitly
• Stack
  • zero operands: all operands are implicit (on TOS)
• Register (load-store)
  • three operands, all in registers
  • loads and stores are the only instructions accessing memory (i.e. with a memory (indirect) addressing mode)
• Register-Memory
  • two operands, one in memory
• Memory-Memory
  • three operands, may be all in memory
(there are more varieties / combinations)

Examples
• 80x86 (IA-32)
  • extended accumulator
• Pentium x (IA-32)
  • extended accumulator
• JVM
  • stack

A dominant architecture: x86/IA-32
A bit of history:
• 1978: The Intel 8086 is announced (16-bit architecture)
• 1980: The 8087 floating-point coprocessor is added
• 1981: The IBM PC is launched, equipped with the Intel 8088
• 1982: The 80286 increases address space to 24 bits + new instructions
• 1985: The 80386 extends to 32 bits, new addressing modes
• 1989-1995: The 80486, Pentium, Pentium Pro add a few instructions (mostly designed for higher performance)
• 1997: MMX is added
• 2000: Pentium 4; very deeply pipelined; extends SIMD instructions
• 2002: Hyper-Threading
“This history illustrates the impact of the ‘golden handcuffs’ of compatibility”
“adding new features as someone might add clothing to a packed bag”
“an architecture that is difficult to explain and impossible to love”

IA-32 Overview
• Complexity:
  • Instructions from 1 to 17 bytes long
  • two-address instructions: one operand must act as both a source and destination
      ADD EAX,EBX ; EAX = EAX+EBX
  • one operand can come from memory
  • complex addressing modes, e.g., “base or scaled index with 8 or 32 bit displacement”
• Saving grace:
  • the most frequently used instructions are not too difficult to build
  • compilers avoid the portions of the architecture that are slow
“what the 80x86 lacks in style is made up in quantity, making it beautiful from the right perspective”

80x86 (IA-32) registers
• General-purpose registers (32 bits): EAX, EBX, ECX, EDX
  • lower 16 bits: AX, BX, CX, DX
  • each 16-bit register splits into two 8-bit halves: AH/AL, BH/BL, CH/CL, DH/DL
• Index registers: ESI, EDI
• Pointer registers: EBP, ESP
• Segment registers: CS, SS, DS, ES, FS, GS
• EIP: PC
• condition codes (a.o.)

IA-32 Addressing Modes
Addressing modes: where are the operands?
• Immediate
    MOV EAX,10          ; EAX = 10
• Direct
    MOV EAX,I           ; EAX = Mem[&I]
    I DD 3              ; I is a 32-bit variable
• Register
    MOV EAX,EBX         ; EAX = EBX
• Register indirect
    MOV EAX,[EBX]       ; EAX = Mem[EBX]
• Based with 8- or 32-bit displacement
    MOV EAX,[EBX+8]     ; EAX = Mem[EBX+8]
• Based with scaled index (scale = 0 .. 3)
    MOV EAX,ECX[EBX]    ; EAX = Mem[EBX + 2^scale * ECX]
• Based plus scaled index with 8- or 32-bit displacement
    MOV EAX,ECX[EBX+8]  ; EAX = Mem[EBX + 8 + 2^scale * ECX]
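
All memory modes listed above are special cases of one formula: effective address = base + 2^scale × index + displacement. The Python sketch below illustrates that formula under the assumption of 32-bit wrap-around arithmetic; the function name `effective_address` and its keyword parameters are hypothetical and only mirror the terms used on the slide.

```python
def effective_address(base=0, index=0, scale=0, disp=0):
    """EA = base + index * 2**scale + disp, wrapped to 32 bits."""
    if scale not in (0, 1, 2, 3):
        raise ValueError("scale must be in 0..3")
    return (base + (index << scale) + disp) & 0xFFFFFFFF


EBX, ECX = 0x1000, 3

print(hex(effective_address(base=EBX)))                             # register indirect
print(hex(effective_address(base=EBX, disp=8)))                     # based + displacement
print(hex(effective_address(base=EBX, index=ECX, scale=2)))         # based + scaled index
print(hex(effective_address(base=EBX, index=ECX, scale=2, disp=8))) # all three terms
```

With scale = 2 the index is multiplied by 4, which is what makes this mode convenient for indexing arrays of 32-bit words.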

IA-32 Addressing Modes
• Not all modes apply to all instructions
  • one of the operands must be a register
• Not all registers can be used in all modes
• Why? Simply not enough bits in the instruction

Control: condition codes
• Many instructions set condition codes in the EFLAGS register
• Some condition codes:
  • sign: set if the result of an operation was negative
  • zero: set if the result was zero
  • carry: set if the operation had a carry out
  • overflow: set if the operation caused an overflow
  • parity: set when the result had even parity
• Subsequent conditional branch instructions test condition codes to determine if they should jump or not
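
The following Python sketch shows how these five condition codes could be derived for a 32-bit subtraction, which is the operation a compare performs. It is an illustration, not a model of the full EFLAGS register. Two points are assumptions drawn from the x86 architecture rather than from the slides: the parity flag looks only at the low 8 bits of the result, and JNL ("jump if not less") is taken when sign equals overflow. The function names are hypothetical.

```python
MASK32 = 0xFFFFFFFF


def flags_after_sub(a, b):
    """Condition codes for the 32-bit subtraction a - b (what CMP a,b computes)."""
    a &= MASK32
    b &= MASK32
    result = (a - b) & MASK32
    return {
        "sign": (result >> 31) == 1,
        "zero": result == 0,
        "carry": a < b,  # borrow out of the unsigned subtraction
        "overflow": (((a ^ b) & (a ^ result)) >> 31) == 1,  # signed overflow
        "parity": bin(result & 0xFF).count("1") % 2 == 0,   # even parity, low byte
    }


def jnl_taken(flags):
    """Signed 'not less' holds when the sign and overflow flags agree."""
    return flags["sign"] == flags["overflow"]


print(flags_after_sub(5, 10))
print(jnl_taken(flags_after_sub(10, 10)))
```

Keeping both carry and overflow lets the same subtraction serve unsigned comparisons (via carry) and signed comparisons (via sign and overflow).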

Control
• Special instruction: compare
    CMP SRC1,SRC2 ; set cc’s based on SRC1-SRC2
• Example
    for (i=0; i<10; i++)
      a[i]++;

           MOV EAX,0            ; EAX = i = 0
    _L:    CMP EAX,10           ; if (i<10)
           JNL _EXIT            ; jump to _EXIT if i>=10
           INC DWORD PTR [EBX]  ; Mem[EBX](=a[i])++, EBX assumed to hold &a[0] on entry
           ADD EBX,4            ; EBX = &a[i+1]
           INC EAX              ; EAX++
           JMP _L               ; goto _L
    _EXIT: ...

Control
• Peculiar control instruction
    LOOP _LABEL ; decrement ECX, if (ECX!=0) goto _LABEL
• Previous example rewritten:
        MOV ECX,10
    _L: INC DWORD PTR [EBX]
        ADD EBX,4
        LOOP _L
• Fewer instructions, but LOOP is slow

Procedures/functions
• Instructions
    CALL AProcedure ; push return address on stack and goto AProcedure
    RET             ; pop return address from stack and jump to it
• EBP is used as a frame pointer, which points to a fixed location within the stack frame (to access locals)
• ESP is used as the stack pointer
• Special instructions:
    PUSH EAX ; ESP -= 4, Mem[ESP] = EAX
    POP EAX  ; EAX = Mem[ESP], ESP += 4

IA-32 Machine Language
• IA-32 instruction formats:
    Field:   prefix   opcode   mode   sib   displ   imm
    Bytes:   0-5      1-2      0-1    0-1   0-4     0-4
• Opcode byte (bits): opcode (6) | source operand (1) | byte/word (1)
• Mode byte (bits): mod (2) | reg (3) | r/m (3)
• SIB byte (bits): scale (2) | index (3) | base (3)
• mod field:
    00  memory
    01  memory+d8
    10  memory+d16/d32
    11  register
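
Both the mode byte and the SIB byte use the same 2-3-3 bit split, so a single helper can pull the fields apart. The Python sketch below is illustrative only. The register numbering (EAX=0, ECX=1, EDX=2, EBX=3, ESP=4, EBP=5, ESI=6, EDI=7) is the standard IA-32 encoding and is an addition to the slides; the special cases of the real encoding (for example, index value 4 meaning "no index") are deliberately ignored. Function names are hypothetical.

```python
REG32 = ["EAX", "ECX", "EDX", "EBX", "ESP", "EBP", "ESI", "EDI"]
MOD_MEANING = {
    0b00: "memory",
    0b01: "memory+d8",
    0b10: "memory+d16/d32",
    0b11: "register",
}


def split_233(byte):
    """Split a byte into its 2-bit, 3-bit and 3-bit fields, high bits first."""
    return (byte >> 6) & 0b11, (byte >> 3) & 0b111, byte & 0b111


def decode_mode(byte):
    """Decode the mode byte into mod / reg / r/m."""
    mod, reg, rm = split_233(byte)
    return {"mod": MOD_MEANING[mod], "reg": REG32[reg], "rm": rm}


def decode_sib(byte):
    """Decode the SIB byte; special cases of the real encoding are ignored."""
    scale, index, base = split_233(byte)
    return {"scale": 1 << scale, "index": REG32[index], "base": REG32[base]}


# One valid encoding of ADD EAX,EBX is the byte pair 01 D8; D8 is the mode byte.
print(decode_mode(0xD8))
# 0x8B = 10 001 011: scale factor 4, index ECX, base EBX
print(decode_sib(0x8B))
```

With mod = 11 the r/m field names a register directly, so r/m = 0 in the first example refers to EAX.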

Pentium, Pentium Pro, II, III, 4
• Issue rate:
  • Pentium: 2-way issue, in-order
  • Pentium Pro .. 4: 3-way issue, out-of-order
    • IA-32 operations are translated into µops (by hardware)
• Pipeline
  • Pentium: 5-stage pipeline
  • Pentium Pro, II, III: 10-stage pipeline
  • Pentium 4: 20-stage pipeline
• Extra SIMD instructions
  • MMX (multimedia extensions), SSE/SSE-2 (streaming SIMD extensions)

Die example: Pentium 4
[Figure: Pentium 4 die photo]

Pentium 4 chip area breakdown
[Figure: Pentium 4 chip area breakdown]

Pentium 4
• Trace cache
• Hyper-Threading
• Add with ½ cycle throughput (1 ½ cycle latency)
  [Figure: staggered add in three consecutive steps: add least significant 16 bits; add most significant 16 bits (carry forwarded from the first step); calculate flags]

Pentium® 4 Processor Block Diagram
[Figure: block diagram showing BTB & I-TLB, Decoder, Trace Cache with its BTB, uCode ROM, Rename/Alloc, uop Queues, Schedulers, Integer RF, ALUs, Load AGU, Store AGU, L1 D-Cache and D-TLB, FP RF, FMul, FAdd, MMX, SSE, FP move, FP store, L2 Cache and Control, and the System Interface (3.2 GB/s)]
P4 slides from Doug Carmean, Intel

P4 vs P II, PIII
Basic P6 Pipeline (intro at 733MHz, .18µ):
    1 Fetch | 2 Fetch | 3 Decode | 4 Decode | 5 Decode | 6 Rename | 7 ROB Rd | 8 Rdy/Sch | 9 Dispatch | 10 Exec
Basic Pentium® 4 Processor Pipeline (intro at 1.4GHz, .18µ):
    1-2 TC Nxt IP | 3-4 TC Fetch | 5 Drive | 6 Alloc | 7-8 Rename | 9 Que | 10-12 Sch | 13-14 Disp | 15-16 RF | 17 Ex | 18 Flgs | 19 Br Ck | 20 Drive

Example with Higher IPC and Faster Clock!
Code sequence: Ld, Add, Add, Ld, Add, Add
Processor                       Clocks   Time     IPC
P6 @1GHz                        10       10ns     0.6
Pentium® 4 Processor @1.4GHz    6        4.3ns    1.0
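
The numbers in this comparison follow from two relations: IPC = instructions / clocks, and execution time = clocks / clock frequency. A short Python check, added here for illustration with hypothetical helper names, reproduces them from the six-instruction sequence:

```python
def ipc(instructions, clocks):
    """Instructions completed per clock cycle."""
    return instructions / clocks


def exec_time_ns(clocks, clock_ghz):
    """Execution time in nanoseconds; a 1 GHz clock has a 1 ns period."""
    return clocks / clock_ghz


INSTRUCTIONS = 6  # Ld, Add, Add, Ld, Add, Add

p6_ipc, p6_time = ipc(INSTRUCTIONS, 10), exec_time_ns(10, 1.0)
p4_ipc, p4_time = ipc(INSTRUCTIONS, 6), exec_time_ns(6, 1.4)

print(f"P6:        IPC = {p6_ipc:.1f}, time = {p6_time:.1f} ns")
print(f"Pentium 4: IPC = {p4_ipc:.1f}, time = {p4_time:.1f} ns")
print(f"speedup = {p6_time / p4_time:.2f}x")
```

The overall speedup is the product of the IPC gain (1.0 / 0.6) and the clock gain (1.4 / 1.0).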

The Execution Trace Cache
[Figure: the Pentium® 4 block diagram again, with the Trace Cache and its BTB marked]

Execution Trace Cache
• Advanced L1 instruction cache
  • Caches “decoded” IA-32 instructions (uops)
  • Removes decoder pipeline latency
  • Capacity is ~12K uops
• Integrates branches into single line
  • Follows predicted path of program execution
Execution Trace Cache feeds fast engine

Execution Trace Cache
Program code in memory:
          1  cmp
          2  br -> T1
             ... (unused code)
    T1:   3  sub
          4  br -> T2
             ... (unused code)
    T2:   5  mov
          6  sub
          7  br -> T3
             ... (unused code)
    T3:   8  add
          9  sub
          10 mul
          11 cmp
          12 br -> T4
Trace Cache Delivery (one trace line per row):
    1 cmp     | 2 br T1      | 3 T1: sub
    4 br T2   | 5 mov        | 6 sub
    7 br T3   | 8 T3: add    | 9 sub
    10 mul    | 11 cmp       | 12 br T4

Multi/Hyper-threading in Uniprocessor Architectures
[Figure: issue slots against clock cycles for three organizations: Superscalar, Simultaneous Multithreading (Hyperthreading), and Concurrent Multithreading. Legend: Empty Slot, Thread 1, Thread 2, Thread 3, Thread 4]

JVM: Java Virtual Machine
• Make Java code run everywhere
• Use virtual architecture
• Platform (processor) independent
    Java program → Java compiler → Java bytecode → JVM (interpreter)
• JVM = stack architecture

Stack Architecture
• JVM follows stack model of execution
  • operands are pushed onto stack from memory and popped off stack to memory
  • operations take operands from stack and place result on stack
• Example (not real Java bytecode): a = b+c;
    push b   ; stack: b
    push c   ; stack: c | b
    add      ; stack: b+c
    pop a    ; stack: (empty)

JVM Architecture
• For each method invocation, the JVM creates a stack frame consisting of
  • Local variable frame: parameters and local variables, numbered 0, 1, 2, …
  • Operand stack: stack used for evaluating expressions

    static void add3(int x, int y, int z){
      int r = x+y+z;
      System.out.println(r);
    }

  Local variable frame of add3: local var 0 = x, local var 1 = y, local var 2 = z, local var 3 = r

Some JVM instructions
• iload_n: push local variable n onto the stack
• iconst_n: push constant n onto the stack (n=-1,0,...,5)
• bipush imm8: push byte onto stack
• sipush imm16: push short onto stack
• istore_n: pop word from stack into local variable n
• iadd, isub, ineg, imul, idiv, irem: usual arithmetic operations
• if_icmpXX offset16 (XX can be eq, ne, lt, gt, le, ge):
  • pop TOS into a
  • pop TOS into b
  • if (b XX a) PC = PC + offset16
• goto offset16: PC = PC + offset16

Example 1
• Translate following expression to Java bytecode:
    v = 3*(x/y - 2/(u+y))
  assume x is local var 0, y local var 1, u local var 3, v local var 4

    Instruction    Stack (top of stack on the left)
    iconst_3       ; 3
    iload_0        ; x | 3
    iload_1        ; y | x | 3
    idiv           ; x/y | 3
    iconst_2       ; 2 | x/y | 3
    iload_3        ; u | 2 | x/y | 3
    iload_1        ; y | u | 2 | x/y | 3
    iadd           ; u+y | 2 | x/y | 3
    idiv           ; 2/(u+y) | x/y | 3
    isub           ; x/y - 2/(u+y) | 3
    imul           ; 3*(x/y - 2/(u+y))
    istore_4       ; v = 3*(x/y - 2/(u+y))
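
The bytecode sequence of Example 1 can be checked with a tiny interpreter. The Python sketch below is an illustration, not a real JVM: it handles only the `iconst_n`, `iload_n`, `istore_n` forms with a non-negative n plus the four binary operations used here, and it ignores 32-bit overflow. Division truncates toward zero, as Java's integer division does. The names `run_bytecode` and `java_idiv` are hypothetical.

```python
def java_idiv(a, b):
    """Integer division that truncates toward zero, like Java's idiv."""
    q = abs(a) // abs(b)
    return q if (a < 0) == (b < 0) else -q


BINARY_OPS = {
    "iadd": lambda a, b: a + b,
    "isub": lambda a, b: a - b,
    "imul": lambda a, b: a * b,
    "idiv": java_idiv,
}


def run_bytecode(bytecode, local_vars):
    """Run straight-line bytecode; local_vars is a list indexed by variable number."""
    stack = []
    for instr in bytecode:
        name, _, n = instr.partition("_")
        if name == "iconst":
            stack.append(int(n))
        elif name == "iload":
            stack.append(local_vars[int(n)])
        elif name == "istore":
            local_vars[int(n)] = stack.pop()
        elif instr in BINARY_OPS:
            right = stack.pop()   # TOS is the right-hand operand
            left = stack.pop()
            stack.append(BINARY_OPS[instr](left, right))
        else:
            raise ValueError(f"unsupported instruction: {instr}")
    return local_vars


EXAMPLE_1 = [
    "iconst_3", "iload_0", "iload_1", "idiv", "iconst_2", "iload_3",
    "iload_1", "iadd", "idiv", "isub", "imul", "istore_4",
]

# x = 20 (var 0), y = 4 (var 1), var 2 unused, u = -3 (var 3), v (var 4)
result = run_bytecode(EXAMPLE_1, [20, 4, 0, -3, 0])
print(result[4])  # v = 3*(20/4 - 2/(-3+4))
```

Because the operand popped first is the right-hand one, `idiv` and `isub` compute `left op right` in the order the operands were pushed.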

Example 2
• Translate following Java code to Java bytecode:
    if (x < 2) x = 0;
  assume x is local var 0

    Instruction        Stack (top of stack on the left)
    iload_0            ; x
    iconst_2           ; 2 | x
    if_icmpge endif    ; if (x>=2) goto endif
    iconst_0           ; 0
    istore_0           ; x = 0
    endif:
    ...
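
The branch in Example 2 can be traced with a small Python sketch. It is a simplification made for illustration: branch targets are symbolic labels (lines ending in a colon) rather than the 16-bit PC-relative offsets that real bytecode uses, and only the instructions needed here are supported. The comparison follows the rule given earlier: pop a, pop b, branch if (b XX a). The name `run_branching` is hypothetical.

```python
import operator

COMPARE = {
    "eq": operator.eq, "ne": operator.ne, "lt": operator.lt,
    "gt": operator.gt, "le": operator.le, "ge": operator.ge,
}


def run_branching(program, local_vars):
    """Run bytecode with symbolic labels; local_vars is indexed by variable number."""
    labels = {line[:-1]: i for i, line in enumerate(program) if line.endswith(":")}
    stack = []
    pc = 0
    while pc < len(program):
        instr = program[pc]
        pc += 1
        if instr.endswith(":"):          # a label occupies a slot but does nothing
            continue
        op, *args = instr.split()
        if op.startswith("iconst_"):
            stack.append(int(op[len("iconst_"):]))
        elif op.startswith("iload_"):
            stack.append(local_vars[int(op[len("iload_"):])])
        elif op.startswith("istore_"):
            local_vars[int(op[len("istore_"):])] = stack.pop()
        elif op.startswith("if_icmp"):
            a = stack.pop()              # pop TOS into a
            b = stack.pop()              # pop TOS into b
            if COMPARE[op[len("if_icmp"):]](b, a):
                pc = labels[args[0]]
        elif op == "goto":
            pc = labels[args[0]]
        else:
            raise ValueError(f"unsupported instruction: {instr}")
    return local_vars


EXAMPLE_2 = ["iload_0", "iconst_2", "if_icmpge endif", "iconst_0", "istore_0", "endif:"]

print(run_branching(EXAMPLE_2, [1]))  # x < 2, so x is set to 0
print(run_branching(EXAMPLE_2, [5]))  # x >= 2, so x is left unchanged
```

Note that the compiler inverts the source condition: `if (x < 2)` becomes a branch around the assignment when `x >= 2`.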
