CS455/CpE 442 Intro. to
Computer Architecture
Review for Term Exam
The Role of Performance
• Text, 3rd Edition, Chapter 4
• Main focus topics
– Compare the performance of different
architectures or architectural variations in
executing a given application
– Determine the CPI for an executable
application on a given architecture
– HW1 solutions, 2.11, 2.12, 2.13
• Q2.13 [10] <§§2.2-2.3> Consider two different implementations, M1 and M2,
of the same instruction set. There are three classes of instructions (A, B, and C)
in the instruction set. M1 has a clock rate of 400 MHz, and M2 has a clock rate
of 200 MHz. The average number of cycles per instruction (CPI) for each class
of instruction on M1 and M2 is given in the following table:
Class   CPI on M1   CPI on M2   Instruction mix for C1   Instruction mix for C2   Instruction mix for C3
A           4           2               30%                      30%                      50%
B           6           4               50%                      20%                      30%
C           8           3               20%                      50%                      20%
• i. Using C1 on both M1 and M2, how much faster can the makers of M1 claim
that M1 is compared with M2?
• ii. Using C2 on both M1 and M2, how much faster can the makers of M2 claim
that M2 is compared with M1?
• iii. If you purchase M1 which of the three compilers would you choose?
• iv. If you purchase M2 which of the three compilers would you choose?
Sol.
Using compiler C1:
M1: Average CPI = 0.3*4 + 0.5*6 + 0.2*8 = 5.8
CPU time per instruction = CPI / Clock Rate = 5.8 / (400*10^6) = 0.0145*10^-6 s
M2: Average CPI = 0.3*2 + 0.5*4 + 0.2*3 = 3.2
CPU time per instruction = 3.2 / (200*10^6) = 0.016*10^-6 s
Thus, M1 is 0.016 / 0.0145 = 1.10 times as fast as M2.
Using compiler C2:
By the same method,
M1: Average CPI = 0.3*4 + 0.2*6 + 0.5*8 = 6.4, CPU time per instruction = 0.016*10^-6 s
M2: Average CPI = 0.3*2 + 0.2*4 + 0.5*3 = 2.9, CPU time per instruction = 0.0145*10^-6 s
Thus, M2 is 0.016 / 0.0145 = 1.10 times as fast as M1.
Using compiler C3 (the third-party compiler):
M1: Average CPI = 0.5*4 + 0.3*6 + 0.2*8 = 5.4, CPU time per instruction = 0.0135*10^-6 s
M2: Average CPI = 0.5*2 + 0.3*4 + 0.2*3 = 2.8, CPU time per instruction = 0.014*10^-6 s
Thus, M1 is 0.014 / 0.0135 = 1.04 times as fast as M2.
The third-party compiler C3 gives the shortest CPU time on both machines, so it is
the compiler to choose whether you purchase M1 or M2.
Using the third-party compiler, M1 is the machine to purchase.
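As a quick arithmetic check on the figures above, the following small C program
(written for this review, not part of the homework handout) recomputes the weighted
CPI, the per-instruction CPU time, and the resulting speedup for each compiler:

#include <stdio.h>

/* CPI per instruction class (A, B, C) on each machine */
static const double cpi_m1[3] = { 4, 6, 8 };
static const double cpi_m2[3] = { 2, 4, 3 };

/* Instruction mix produced by each compiler (fractions of A, B, C) */
static const double mix[3][3] = {
    { 0.30, 0.50, 0.20 },   /* C1 */
    { 0.30, 0.20, 0.50 },   /* C2 */
    { 0.50, 0.30, 0.20 },   /* C3 */
};

/* Weighted-average CPI for one machine under one instruction mix */
static double avg_cpi(const double cpi[3], const double m[3]) {
    return cpi[0] * m[0] + cpi[1] * m[1] + cpi[2] * m[2];
}

int main(void) {
    const double clk_m1 = 400e6, clk_m2 = 200e6;        /* clock rates in Hz */
    for (int c = 0; c < 3; c++) {
        double t1 = avg_cpi(cpi_m1, mix[c]) / clk_m1;   /* CPU time per instruction (s) */
        double t2 = avg_cpi(cpi_m2, mix[c]) / clk_m2;
        printf("C%d: M1 = %.4g ns, M2 = %.4g ns, speedup of faster machine = %.2f\n",
               c + 1, t1 * 1e9, t2 * 1e9, (t1 > t2) ? t1 / t2 : t2 / t1);
    }
    return 0;
}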
The Instruction Set Architecture
• Text, Ch. 2
• Compare instruction set architectures based on their
complexity (instruction format, number of
operands, addressing modes, operations supported)
• Instruction set architecture types
– Register-to-register
– Register-to-memory
– Memory-to-memory
• HW2 solutions,
2.51 Suppose we have made the following measurements of average CPI for
instructions:
Instruction           Average CPI
Arithmetic            1.0 clock cycles
Data Transfer         1.4 clock cycles
Conditional Branch    1.7 clock cycles
Jump                  1.2 clock cycles
Compute the effective CPI for MIPS. Average the instruction frequencies for
SPEC2000int and SPEC2000fp in figure 2.48 to obtain the instruction mix.
Class           CPI   Avg. Freq (int & fp)   CPI x Freq
Arithmetic      1.0   .36                    .36
Data Transfer   1.4   .375                   .525
Cond. Branch    1.7   .12                    .204
Jump            1.2   .03                    .036
Effective CPI = .36 + .525 + .204 + .036 = 1.125
The effective CPI for MIPS is 1.125; this figure seems inaccurate because the table
does not include the CPI (or frequency) for logical operations, so the listed
frequencies sum to only .885 of the instruction mix.
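The same weighted-sum calculation, as a short C check (values copied from the table
above; the loop also reports how much of the instruction mix the four classes cover,
which is the reason the 1.125 figure should be read with caution):

#include <stdio.h>

int main(void) {
    /* Average CPI and averaged SPEC2000 int/fp frequency per class */
    const double cpi[]  = { 1.0, 1.4, 1.7, 1.2 };
    const double freq[] = { 0.36, 0.375, 0.12, 0.03 };
    double eff_cpi = 0.0, total_freq = 0.0;

    for (int i = 0; i < 4; i++) {
        eff_cpi    += cpi[i] * freq[i];
        total_freq += freq[i];
    }
    printf("Effective CPI = %.3f (classes cover %.1f%% of the mix)\n",
           eff_cpi, total_freq * 100.0);
    return 0;
}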
The Processor: Data Path and
Control
• Text, ch. 5
• The data path organization: functional units
and their interconnections needed to support
the instruction set.
• The control unit design
– Hardwired vs. microprogrammed design
• HW3 and HW4,
Instr    RegDst  ALUSrc  MemtoReg  RegWrite  MemRead  MemWrite  Branch  ALUOp1  ALUOp2  JMPReg
R-type     1       0        0         1         0        0         0       1       0       0
lw         0       1        1         1         1        0         0       0       0       0
sw         x       1        x         0         0        1         0       0       0       0
beq        x       0        x         0         0        0         1       0       1       0
jr         x       x        x         0         x        0         0       x       x       1
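One way to read this table is as the truth table of the main control decoder: given
the opcode (and, for jr, the funct field), assert the listed lines. The sketch below
is my own illustration of that idea, not part of the homework solution; don't-care
("x") entries are arbitrarily driven to 0, and the opcode tags are hypothetical names.

#include <stdio.h>
#include <stdint.h>

/* Single-cycle control word; one field per column of the table above.
   Don't-care ("x") entries are encoded as 0 here. */
typedef struct {
    uint8_t RegDst, ALUSrc, MemtoReg, RegWrite, MemRead;
    uint8_t MemWrite, Branch, ALUOp1, ALUOp2, JMPReg;
} Control;

enum Op { OP_RTYPE, OP_LW, OP_SW, OP_BEQ };   /* hypothetical opcode tags */

static Control decode(enum Op op, int funct_is_jr) {
    Control c = { 0 };
    switch (op) {
    case OP_RTYPE:
        if (funct_is_jr) c.JMPReg = 1;                        /* jr row     */
        else { c.RegDst = 1; c.RegWrite = 1; c.ALUOp1 = 1; }  /* R-type row */
        break;
    case OP_LW:  c.ALUSrc = 1; c.MemtoReg = 1; c.RegWrite = 1; c.MemRead = 1; break;
    case OP_SW:  c.ALUSrc = 1; c.MemWrite = 1; break;
    case OP_BEQ: c.Branch = 1; c.ALUOp2 = 1; break;
    }
    return c;
}

int main(void) {
    Control c = decode(OP_RTYPE, 1);   /* decode a jr instruction */
    printf("jr: RegWrite=%d JMPReg=%d\n", c.RegWrite, c.JMPReg);
    return 0;
}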
Instr    RegDst  ALUSrc  MemtoReg  RegWrite  MemRead  MemWrite  Branch  ALUOp1  ALUOp2  LUICtr
R-type     1       0        0         1         0        0         0       1       0       0
lw         0       1        1         1         1        0         0       0       0       0
sw         x       1        x         0         0        1         0       0       0       0
beq        x       0        x         0         0        0         1       0       1       0
lui        0       x        x         1         x        0         0       x       x       1
The concept of the “critical path”, the longest possible path in the
machine, was introduced in Section 5.4 on page 315. Based on your
understanding of the single-cycle implementation, show which units
can tolerate more delay (i.e., are not on the critical path), and which
units can benefit from hardware optimization. Quantify your answers
using the same numbers presented on page 315.
The longest path is the load instruction (instruction memory, register file,
ALU, data memory, register file), so those units benefit most from hardware
optimization. Using the numbers from page 315:
Memory units: 200 ps
ALU & adders: 100 ps
Register file: 50 ps
Critical path = 200 + 50 + 100 + 200 + 50 = 600 ps (for lw)
The paths through the branch/jump adders and the PC can tolerate more delay
because they do not lie on the critical path. Any unit on the critical path
(instruction memory, register file, ALU, data memory) would benefit from
hardware optimization, since that would shorten the critical path.
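To make the comparison concrete, the small sketch below adds up the unit delays each
instruction class passes through (the per-unit numbers are the ones quoted from page
315; the lists of units traversed assume the standard single-cycle MIPS datapath and
are my own summary):

#include <stdio.h>

int main(void) {
    const int MEM = 200, ALU = 100, REG = 50;    /* unit delays in ps (page 315) */

    /* Units traversed by each instruction class in the single-cycle datapath
       (assumed standard MIPS organization). */
    int r_type = MEM + REG + ALU + REG;          /* fetch, read regs, ALU, write reg */
    int lw     = MEM + REG + ALU + MEM + REG;    /* ... plus data memory read */
    int sw     = MEM + REG + ALU + MEM;          /* no register write-back */
    int beq    = MEM + REG + ALU;                /* compare only, no memory access */

    printf("R-type: %d ps\nlw: %d ps\nsw: %d ps\nbeq: %d ps\n",
           r_type, lw, sw, beq);                 /* lw = 600 ps is the critical path */
    return 0;
}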
Micro-program for LDI
(Figure: micro-program state diagram for LDI; visible labels include IorD=0,
MemRead, and “pc=pc+4 cont.”.)
Pipelined Architectures
• Text, Ch. 6
• Stages of a pipelined data path
• Pipeline hazards
• Pipelined performance: number of cycles to execute a code
segment (and the effective CPI); look for dependencies in
sequences involving lw and branch instructions (delay cycles)
• HW5
6.22 lw $4, 100($2)
sub $6, $4, $3
add $2, $3, $5
number of cycles = k + (n - 1) + delay cycles = 5 + 2 + 1 = 8
(one delay cycle for the load-use dependence between lw $4 and the following sub)
eff. CPI = #cycles / #instructions = 8/3
k = no. of stages, n = no. of instructions
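A minimal sketch of the cycle-count formula with the HW5 numbers plugged in (the
single stall reflects the load-use hazard between lw $4 and sub, assuming forwarding
is available):

#include <stdio.h>

/* cycles = k + (n - 1) + stalls; effective CPI = cycles / n */
static int pipeline_cycles(int k, int n, int stalls) {
    return k + (n - 1) + stalls;
}

int main(void) {
    int k = 5;        /* pipeline stages */
    int n = 3;        /* instructions: lw, sub, add */
    int stalls = 1;   /* one load-use stall between lw $4 and sub */
    int cycles = pipeline_cycles(k, n, stalls);
    printf("cycles = %d, effective CPI = %.2f\n", cycles, (double)cycles / n);
    return 0;
}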
The Memory Hierarchy
• Text, Ch. 7
• The levels of the memory hierarchy, and the principle of
locality
• Cache design: direct-mapped, fully associative, and
set-associative
– Cache access, factors affecting the miss rate, and the miss
penalty
• Virtual memory, address map, page tables, and the
TLB
• HW6
1 KB Direct Mapped Cache with 32 B Blocks
(Figure: the address is split into a cache tag (example 0x50, stored as part of the
cache “state”), a cache index (example 0x01), and a byte select (example 0x00); each
cache entry holds a valid bit, the tag, and a 32-byte data block.)
And yet Another Extreme Example: Fully Associative
(Figure: a 27-bit cache tag plus a byte select (example 0x01); there is no cache
index, since every stored tag is compared in parallel against the address tag.)
Review: 4-way set associative
(Figure.)
HW6 Problem 1
• 32-bit address space, 32 Kbyte cache
– Direct-mapped cache (32-byte blocks)
Byte select = 5 bits (lowest-order bits 0-4)
Cache index: 32768 / 32 = 1024 blocks, so log2(1024) = 10 bits (low order, after the byte select)
Tag = 32 - byte select - cache index = 32 - 5 - 10 = 17 bits (high order)
– 8-way set-associative cache (16-byte blocks), 8 blocks / set
Byte select for 16-byte blocks = 4 bits
Sets: 32768 bytes / (8 x 16 = 128 bytes per set) = 256 sets
Cache index = log2(256) = 8 bits
Tag = 32 - 8 - 4 = 20 bits
– Fully associative cache (128-byte blocks)
Byte select = 7 bits
No cache index, because a memory block can be placed in any cache entry
Tag = 32 - 7 = 25 bits
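The same bit-field arithmetic in a small C sketch (the fields() helper and the
shift-based log2 are mine, written for this review; ways = 0 is used here to mean
fully associative):

#include <stdio.h>

/* Integer log2 for power-of-two values. */
static int log2i(unsigned v) {
    int b = 0;
    while (v > 1) { v >>= 1; b++; }
    return b;
}

/* Print the byte-select / index / tag split for one cache configuration. */
static void fields(unsigned cache_bytes, unsigned block_bytes, unsigned ways) {
    int offset = log2i(block_bytes);
    int sets   = ways ? (int)(cache_bytes / (block_bytes * ways)) : 1;
    int index  = ways ? log2i((unsigned)sets) : 0;   /* no index if fully associative */
    printf("byte select = %d, index = %d, tag = %d\n",
           offset, index, 32 - offset - index);
}

int main(void) {
    fields(32768, 32, 1);    /* direct mapped      -> 5, 10, 17 */
    fields(32768, 16, 8);    /* 8-way set assoc.   -> 4,  8, 20 */
    fields(32768, 128, 0);   /* fully associative  -> 7,  0, 25 */
    return 0;
}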
Problem 7.46
word ReadDirectMappedCache(address a)
{
    static Entry cache[CACHE_SIZE_IN_WORDS];

    Entry *e = &cache[a.index];              /* entry selected by the index field */
    if (e->valid == FALSE || e->tag != a.tag) {
        /* miss: refill the entry from memory */
        e->valid = TRUE;
        e->tag = a.tag;
        e->data = load_from_memory(a);
    }
    return e->data;
}
Modified to the following for multi-word blocks:
word ReadDirectMappedCache(address a)
{
    static Entry cache[CACHE_SIZE_IN_BLOCKS];

    Entry *e = &cache[a.index];              /* one entry now holds a whole block */
    if (e->valid == FALSE || e->tag != a.tag) {
        /* miss: refill the whole block from memory */
        e->valid = TRUE;
        e->tag = a.tag;
        e->data = load_from_memory(a);
    }
    return e->data[a.word_index];            /* select the requested word within the block */
}
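For completeness, here is a self-contained, compilable version of the multi-word
variant; the Entry and address types, the sizes, and the load_block_from_memory stub
are assumptions made for this sketch rather than definitions from the text:

#include <stdio.h>
#include <stdbool.h>
#include <stdint.h>

#define WORDS_PER_BLOCK       8
#define CACHE_SIZE_IN_BLOCKS  64

typedef uint32_t word;

typedef struct {                 /* fields decoded from a 32-bit address */
    uint32_t tag;
    uint32_t index;
    uint32_t word_index;
} address;

typedef struct {                 /* one direct-mapped cache entry = one block */
    bool     valid;
    uint32_t tag;
    word     data[WORDS_PER_BLOCK];
} Entry;

/* Stand-in for main memory: returns dummy data for a whole block. */
static void load_block_from_memory(address a, word out[WORDS_PER_BLOCK]) {
    for (int i = 0; i < WORDS_PER_BLOCK; i++)
        out[i] = (a.tag << 16) | (a.index << 8) | (uint32_t)i;
}

word ReadDirectMappedCache(address a) {
    static Entry cache[CACHE_SIZE_IN_BLOCKS];

    Entry *e = &cache[a.index];
    if (!e->valid || e->tag != a.tag) {      /* miss: refill the whole block */
        e->valid = true;
        e->tag = a.tag;
        load_block_from_memory(a, e->data);
    }
    return e->data[a.word_index];            /* hit path: select the word */
}

int main(void) {
    address a = { .tag = 0x12, .index = 3, .word_index = 5 };
    printf("word = 0x%x\n", (unsigned)ReadDirectMappedCache(a));  /* miss, then refill */
    printf("word = 0x%x\n", (unsigned)ReadDirectMappedCache(a));  /* hit */
    return 0;
}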