2. The Role of Performance
• Text, 3rd Edition, Chapter 4
• Main focus topics
– Compare the performance of different
architectures or architectural variations in
executing a given application
– Determine the CPI for an executable
application on a given architecture
– HW1 solutions, 2.11, 2.12, 2.13
3. • Q2.13 [10] <§§2.2-2.3> Consider two different implementations, M1 and M2,
of the same instruction set. There are three classes of instructions (A, B, and C)
in the instruction set. M1 has a clock rate of 400 MHz, and M2 has a clock rate
of 200 MHz. The average number of cycles per instruction (CPI) for each class
of instruction on M1 and M2 is given in the following table:
Class   CPI on M1   CPI on M2   Instruction mix for C1   Instruction mix for C2   Instruction mix for C3
A       4           2           30%                      30%                      50%
B       6           4           50%                      20%                      30%
C       8           3           20%                      50%                      20%
• i. Using C1 on both M1 and M2, how much faster can the makers of M1 claim
that M1 is compared with M2?
• ii. Using C2 on both M1 and M2, how much faster can the makers of M2 claim
that M2 is compared with M1?
• iii. If you purchase M1 which of the three compilers would you choose?
• iv. If you purchase M2 which of the three compilers would you choose?
4. Sol.
Using the C1 compiler:
M1: Average CPI = 0.3*4 + 0.5*6 + 0.2*8 = 5.8
CPU time per instruction = CPI / Clock Rate = 5.8 / (400*10^6) = 0.0145*10^-6 s
M2: Average CPI = 0.3*2 + 0.5*4 + 0.2*3 = 3.2
CPU time per instruction = 3.2 / (200*10^6) = 0.016*10^-6 s
Thus, M1 is 0.016 / 0.0145 = 1.10 times as fast as M2.
Using the C2 compiler:
By the same method,
M1: Average CPI = 6.4, CPU time per instruction = 0.016*10^-6 s
M2: Average CPI = 2.9, CPU time per instruction = 0.0145*10^-6 s
Thus, M2 is 0.016 / 0.0145 = 1.10 times as fast as M1.
Using the third-party compiler (C3):
M1: Average CPI = 5.4, CPU time per instruction = 0.0135*10^-6 s
M2: Average CPI = 2.8, CPU time per instruction = 0.014*10^-6 s
Thus, M1 is 0.014 / 0.0135 = 1.04 times as fast as M2.
The third-party compiler is the superior product regardless of machine purchase:
it gives the lowest CPU time per instruction on both M1 and M2.
M1 with the third-party compiler is the combination to purchase.
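The arithmetic in this solution can be re-checked with a short script. This is only a sketch: the names `CPI`, `CLOCK_HZ`, `MIX`, `average_cpi`, `time_per_instruction`, and `speedup` are made up for illustration, the third-party compiler is labeled C3 as in the table, and every compiler is assumed to produce the same instruction count on both machines.

```python
# Per-class CPI and clock rates from the Q2.13 table.
CPI = {"M1": {"A": 4, "B": 6, "C": 8}, "M2": {"A": 2, "B": 4, "C": 3}}
CLOCK_HZ = {"M1": 400e6, "M2": 200e6}

# Instruction mix produced by each compiler (C3 = third-party compiler).
MIX = {
    "C1": {"A": 0.30, "B": 0.50, "C": 0.20},
    "C2": {"A": 0.30, "B": 0.20, "C": 0.50},
    "C3": {"A": 0.50, "B": 0.30, "C": 0.20},
}


def average_cpi(machine, compiler):
    """Weighted average CPI: sum of (class frequency * class CPI)."""
    mix = MIX[compiler]
    return sum(mix[cls] * CPI[machine][cls] for cls in mix)


def time_per_instruction(machine, compiler):
    """Average seconds per instruction = average CPI / clock rate."""
    return average_cpi(machine, compiler) / CLOCK_HZ[machine]


def speedup(fast, slow, compiler):
    """How many times as fast `fast` is compared with `slow`."""
    return time_per_instruction(slow, compiler) / time_per_instruction(fast, compiler)


if __name__ == "__main__":
    for compiler in MIX:
        for machine in CPI:
            t = time_per_instruction(machine, compiler)
            print(f"{compiler} on {machine}: CPI={average_cpi(machine, compiler):.1f}, "
                  f"time/instr={t * 1e9:.2f} ns")
```

Calling `min(MIX, key=lambda c: time_per_instruction("M1", c))` picks the best compiler for a given machine, which is one way to answer parts iii and iv.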
5. The Instruction Set Architecture
• Text, Ch. 2
• Compare instruction set architectures based on their
complexity (instruction format, number of
operands, addressing modes, operations supported)
• Instruction set architecture types
– Register-to-register
– Register-to-memory
– Memory-to-memory
• HW2 solutions
6. 2.51 Suppose we have made the following measurements of average CPI for
instructions:
Instruction          Average CPI
Arithmetic           1.0 clock cycles
Data Transfer        1.4 clock cycles
Conditional Branch   1.7 clock cycles
Jump                 1.2 clock cycles
Compute the effective CPI for MIPS. Average the instruction frequencies for
SPEC2000int and SPEC2000fp in Figure 2.48 to obtain the instruction mix.
Class           CPI   Avg. Freq (int & fp)   CPI x Freq
Arithmetic      1.0   0.36                   0.360
Data Transfer   1.4   0.375                  0.525
Cond. Branch    1.7   0.12                   0.204
Jump            1.2   0.03                   0.036
Total                                        1.125 CPI
The effective CPI for MIPS is 1.125. This seems inaccurate because the table does
not include a CPI for logical operations, so the frequencies used sum to only
0.885 rather than 1.
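The weighted sum in the table above can be reproduced in a few lines. In this sketch the dictionary `CLASSES` and the functions `effective_cpi` and `covered_fraction` are illustrative names, and the frequencies are copied from the table rather than re-read from Figure 2.48.

```python
# (average CPI, average frequency over SPEC2000int and SPEC2000fp) per class,
# copied from the solution table.
CLASSES = {
    "Arithmetic": (1.0, 0.36),
    "Data Transfer": (1.4, 0.375),
    "Cond. Branch": (1.7, 0.12),
    "Jump": (1.2, 0.03),
}


def effective_cpi(classes):
    """Sum of CPI * frequency over the listed instruction classes."""
    return sum(cpi * freq for cpi, freq in classes.values())


def covered_fraction(classes):
    """Fraction of the instruction mix the listed classes account for."""
    return sum(freq for _, freq in classes.values())


if __name__ == "__main__":
    print(f"effective CPI    = {effective_cpi(CLASSES):.3f}")
    print(f"covered fraction = {covered_fraction(CLASSES):.3f}")
```

The covered fraction comes out below 1, which is the gap the solution attributes to the missing logical-operation class.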
7. The Processor: Data Path and Control
• Text, ch. 5
• The data path organization: functional units
and their interconnections needed to support
the instruction set.
• The control unit design
– Hardwired vs microprogramming design
• HW3 and HW4
10. The concept of the “critical path”, the longest possible path in the
machine, was introduced in Section 5.4 on page 315. Based on your
understanding of the single-cycle implementation, show which units
can tolerate more delays (i.e. are not on the critical path), and which
units can benefit from hardware optimization. Quantify your answers
taking the same numbers presented on page 315.
The longest path is the load instruction (instruction memory, register file, ALU,
data memory, register file). It can benefit from hardware optimization.
Using the numbers from p. 315:
Memory units: 200 ps
ALU and adders: 100 ps
Register file: 50 ps
Critical path = 200 + 50 + 100 + 200 + 50 = 600 ps (for lw)
The adders on the path to the PC can tolerate more delay because
they do not lie on the critical path. Any unit on the critical
path (instruction memory, register file, ALU, data memory) would benefit
from hardware optimization, since speeding it up makes the critical path shorter.
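The critical-path sum, and the slack of a unit off that path, can be checked with a small sketch. The names `DELAY_PS`, `LW_PATH`, `PC_ADDER_PATH`, `path_delay`, and `slack` are hypothetical, and treating the PC update as a single 100 ps adder running in parallel with the lw path is a simplifying assumption, not something taken from the text.

```python
# Unit delays in picoseconds, as quoted in the solution.
DELAY_PS = {"memory": 200, "alu_or_adder": 100, "register_file": 50}

# Units a load word (lw) passes through, in order:
# instruction memory, register read, ALU, data memory, register write.
LW_PATH = ["memory", "register_file", "alu_or_adder", "memory", "register_file"]

# Assumption for illustration: the PC update path is one adder on its own.
PC_ADDER_PATH = ["alu_or_adder"]


def path_delay(path, delays):
    """Total delay of a path = sum of the delays of the units on it."""
    return sum(delays[unit] for unit in path)


def slack(path, critical_ps, delays):
    """Extra delay a path can absorb before it becomes the critical path."""
    return critical_ps - path_delay(path, delays)


if __name__ == "__main__":
    critical = path_delay(LW_PATH, DELAY_PS)
    print(f"lw critical path: {critical} ps")
    print(f"PC adder slack:   {slack(PC_ADDER_PATH, critical, DELAY_PS)} ps")
```

Changing an entry in `DELAY_PS` and re-running shows how much a faster unit shortens the lw path. The sketch models only lw, so it does not show when another instruction class would take over as the critical path.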
13. Pipelined Architectures
• Text, Ch.6
• Stages of a pipelined data path
• Pipeline hazards
• Pipelined performance: number of cycles to execute a code
segment (and the effective CPI); look for dependencies in
sequences involving lw and branch instructions (delay cycles)
• HW5
6.22
lw $4, 100($2)
sub $6, $4, $3
add $2, $3, $5
number of cycles = k + (n-1) + delay cycles = 5 + 2 + 1 = 8
eff. CPI = #cycles / #instructions = 8/3
k = number of stages, n = number of instructions
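The cycle-count formula used for 6.22 can be written as a small helper. This is a sketch with illustrative function names; the single delay cycle is the value used in the solution (sub reads $4 right after lw loads it) and is passed in by hand rather than detected from the code.

```python
from fractions import Fraction


def pipeline_cycles(stages, instructions, delay_cycles=0):
    """Cycles = k + (n - 1) + delay cycles, for a k-stage pipeline."""
    return stages + (instructions - 1) + delay_cycles


def effective_cpi(cycles, instructions):
    """Effective CPI = total cycles / number of instructions."""
    return Fraction(cycles, instructions)


if __name__ == "__main__":
    # 5-stage pipeline, 3 instructions, 1 delay cycle for the lw/sub dependence.
    cycles = pipeline_cycles(stages=5, instructions=3, delay_cycles=1)
    print(f"cycles = {cycles}, effective CPI = {effective_cpi(cycles, 3)}")
```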
14. The Memory Hierarchy
• Text, Ch. 7
• The levels of memory hierarchy, and the principle of
locality
• Cache Design, direct-mapped, fully associative and
set associative
– Cache access, factors affecting the miss rate, and the miss
penalty
• Virtual memory, address map, page tables, and the
TLB
• HW6
15. 1 KB Direct Mapped Cache with 32 B Blocks
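• Address fields: Cache Tag (bits 31-10, example: 0x50), Cache Index (bits 9-5, ex: 0x01), Byte Select (bits 4-0, ex: 0x00)
• Each cache entry (Cache Index 0, 1, 2, 3, ..., 31) holds a Valid Bit, a Cache Tag, and 32 bytes of Cache Data
• The Cache Tag (e.g. 0x50) is stored as part of the cache “state”
• Cache Data: Byte 0 to Byte 31 at index 0, Byte 32 to Byte 63 at index 1, ..., Byte 992 to Byte 1023 at index 31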
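The address split for this cache can be sketched as below. The function name `split_address` and the constants are illustrative, and the test address 0x14020 is simply the slide's example field values (tag 0x50, index 0x01, byte select 0x00) packed back into one address.

```python
CACHE_BYTES = 1024  # 1 KB cache
BLOCK_BYTES = 32    # 32-byte blocks
NUM_BLOCKS = CACHE_BYTES // BLOCK_BYTES        # 32 entries

OFFSET_BITS = BLOCK_BYTES.bit_length() - 1     # 5 bits: Byte Select
INDEX_BITS = NUM_BLOCKS.bit_length() - 1       # 5 bits: Cache Index


def split_address(addr):
    """Split a 32-bit address into (cache tag, cache index, byte select)."""
    byte_select = addr & (BLOCK_BYTES - 1)
    index = (addr >> OFFSET_BITS) & (NUM_BLOCKS - 1)
    tag = addr >> (OFFSET_BITS + INDEX_BITS)
    return tag, index, byte_select


if __name__ == "__main__":
    tag, index, byte_select = split_address(0x14020)
    print(f"tag={tag:#x} index={index:#x} byte_select={byte_select:#x}")
```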
16. And yet Another Extreme Example: Fully Associative
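• Address fields: Cache Tag (bits 31-5, 27 bits long), Byte Select (bits 4-0, ex: 0x01); there is no Cache Index
• Each cache entry holds a Valid Bit, a Cache Tag, and 32 bytes of Cache Data (Byte 0 to Byte 31, Byte 32 to Byte 63, ...)
• The address tag is compared (X) with the Cache Tag of every entry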
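A fully associative lookup can be sketched as a tag comparison against every entry. This is a software illustration only: hardware performs the comparisons in parallel, whereas the loop below is sequential. The entry layout `(valid, tag)` and the names `split_address` and `lookup` are assumptions made for the example.

```python
OFFSET_BITS = 5  # 32-byte blocks: 5-bit Byte Select, 27-bit Cache Tag


def split_address(addr):
    """Split a 32-bit address into (27-bit cache tag, 5-bit byte select)."""
    return addr >> OFFSET_BITS, addr & ((1 << OFFSET_BITS) - 1)


def lookup(entries, addr):
    """Return (entry position, byte select) on a hit, or None on a miss.

    `entries` is a list of (valid, tag) pairs; any entry may hold any block,
    so every valid tag has to be compared with the address tag.
    """
    tag, byte_select = split_address(addr)
    for position, (valid, stored_tag) in enumerate(entries):
        if valid and stored_tag == tag:
            return position, byte_select
    return None


if __name__ == "__main__":
    entries = [(False, 0), (True, 0x2801), (True, 0x7)]
    addr = (0x2801 << OFFSET_BITS) | 0x01
    print(lookup(entries, addr))
```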