December 20, 2024
204521 Digital System Architecture
Computer Performance and Cost
Pradondet Nilagupta
Spring 2001
(original notes from Randy Katz and Prof. Jan M. Rabaey, UC Berkeley)

Review: What is Computer Architecture?
[Figure: Computer Architect diagram. Labels: Technology, Applications, Interfaces, Machine Organization, Measurement & Evaluation, ISA, API, Link, I/O Chan, Regs, IR]

The Architecture Process
[Figure: new concepts are created, their cost and performance are estimated, and they are sorted into good ideas, mediocre ideas, and bad ideas]

Performance Measurement and Evaluation
Many dimensions to computer performance
– CPU execution time
• by instruction or sequence
– floating point
– integer
– branch performance
– Cache bandwidth
– Main memory bandwidth
– I/O performance
• bandwidth
• seeks
• pixels or polygons per second
Relative importance depends on applications
[Figure: blocks labeled P, $, M]

Evaluation Tools
Benchmarks, traces, & mixes
– macrobenchmarks & suites
• application execution time
– microbenchmarks
• measure one aspect of performance
– traces
• replay recorded accesses
– cache, branch, register
Simulation at many levels
– ISA, cycle accurate, RTL, gate, circuit
• trade fidelity for simulation rate
Area and delay estimation
Analysis
– e.g., queuing theory
[Figure: example instruction mix (MOVE 39%, BR 20%, LOAD 20%, STORE 10%, ALU 11%) and example trace (LD 5EA3, ST 31FF, …, LD 1EA2, …)]

Benchmarks
Microbenchmarks
– measure one performance dimension
• cache bandwidth
• main memory bandwidth
• procedure call overhead
• FP performance
– weighted combination of microbenchmark performance is a good predictor of application performance
– gives insight into the cause of performance bottlenecks
Macrobenchmarks
– application execution time
• measures overall performance, but on just one application
[Figure: Micro and Macro benchmarks plotted against Perf. Dimensions and Applications]

Some Warnings about Benchmarks
Benchmarks measure the whole system
– application
– compiler
– operating system
– architecture
– implementation
Popular benchmarks typically reflect yesterday’s programs
– computers need to be designed for tomorrow’s programs
Benchmark timings are often very sensitive to
– alignment in cache
– location of data on disk
– values of data
Benchmarks can lead to inbreeding or positive feedback
– if you make an operation fast (slow) it will be used more (less) often
• so you make it faster (slower)
– and it gets used even more (less)
» and so on…

Architectural Performance Laws and Rules of Thumb

Measurement and Evaluation
Architecture is an iterative process:
– Search the space of possible designs
– Make selections
– Evaluate the selections made
Good measurement tools are required to evaluate the selections accurately.
Measurement Tools
Benchmarks, Traces, Mixes
Cost, delay, area, power estimation
Simulation (many levels)
– ISA, RTL, Gate, Circuit
Queuing Theory
Rules of Thumb
Fundamental Laws

Measuring and Reporting Performance
What do we mean when we say one computer is faster than another?
– the program runs in less time
Response time or execution time
– time until the user sees the output
Throughput
– total amount of work done in a given time

Performance
“Increasing” and “decreasing”?
We use the term “improve performance” or “improve execution time” when we mean increase performance and decrease execution time.
improve performance = increase performance
improve execution time = decrease execution time

Measuring Performance
Definition of time
Wall clock time, response time, elapsed time
– the latency to complete a task, including disk accesses, memory accesses, I/O activities, and operating system overhead

Does Anybody Really Know What Time it is?
UNIX Time Command: 90.7u 12.9s 2:39 65%
User CPU Time (time spent in program): 90.7 sec
System CPU Time (time spent in OS): 12.9 sec
Elapsed Time (response time): 2:39 = 159 sec
(90.7 + 12.9)/159 * 100 = 65% of elapsed time is CPU time. The remaining 35% of the time is spent in I/O or running other programs.

Example
UNIX time command
90.7u 12.9s 2:39 65%
user CPU time is 90.7 sec
system CPU time is 12.9 sec
elapsed time is 2 min 39 sec (159 sec)
% of elapsed time that is CPU time is (90.7 + 12.9) / 159 = 65%
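The percentage in this example can be checked with a short calculation. The following is a minimal Python sketch, assuming the `minutes:seconds` elapsed-time format shown in the example; the function name `cpu_share` is hypothetical and chosen only for illustration.

```python
def cpu_share(user_s: float, system_s: float, elapsed: str) -> float:
    """Return CPU time as a percentage of elapsed (wall clock) time.

    `elapsed` is a "minutes:seconds" string such as "2:39".
    """
    minutes, seconds = elapsed.split(":")
    elapsed_s = int(minutes) * 60 + float(seconds)
    return 100.0 * (user_s + system_s) / elapsed_s


share = cpu_share(90.7, 12.9, "2:39")
print(f"CPU share of elapsed time: {share:.0f}%")  # 103.6 / 159 rounds to 65%
```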

Time
CPU time
– time the CPU is computing
– not including the time waiting for I/O or running other programs
User CPU time
– CPU time spent in the program
System CPU time
– CPU time spent in the operating system performing tasks requested by the program
CPU time = User CPU time + System CPU time
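The split between user CPU time, system CPU time, and elapsed time can be observed from inside a program. The sketch below uses Python's standard `os.times()` and `time.perf_counter()`; the `measure` helper and the `busy_loop` workload are hypothetical names introduced only for this illustration, and the numbers printed depend on the machine and its load.

```python
import os
import time


def measure(workload):
    """Run `workload` and return (user, system, elapsed) seconds."""
    start_cpu = os.times()
    start_wall = time.perf_counter()
    workload()
    end_cpu = os.times()
    elapsed = time.perf_counter() - start_wall
    user = end_cpu.user - start_cpu.user        # time spent in the program
    system = end_cpu.system - start_cpu.system  # time spent in the OS on its behalf
    return user, system, elapsed


def busy_loop():
    # Hypothetical CPU-bound workload used only for illustration.
    total = 0
    for i in range(200_000):
        total += i * i
    return total


user, system, elapsed = measure(busy_loop)
print(f"user={user:.3f}s system={system:.3f}s "
      f"CPU={user + system:.3f}s elapsed={elapsed:.3f}s")
```

On a loaded system the elapsed time can be much larger than the CPU time, which is why the notes define CPU performance on an unloaded system.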

Performance
System performance
– elapsed time on an unloaded system
CPU performance
– user CPU time on an unloaded system

Programs to Evaluate Processor Performance
Real program
Kernel
– Time critical excerpts
Toy benchmark
– 10-100 lines of code: Sieve of Eratosthenes, Puzzles, Quick Sort
Synthetic benchmark
– Attempt to match average frequencies of real workloads
– similar to kernel
– Whetstone, Dhrystone
– not even pieces of real programs, though kernels might be

Benchmark Suite
A collection of benchmarks that tries to measure the performance of a processor across a variety of applications
– SPECint92, SPECfp92
– see fig. 1.9

Benchmarking Games
Differing configurations used to run the same workload on two systems
Compiler wired to optimize the workload
Test specification written to be biased towards one machine
Synchronized CPU/IO intensive job sequence used
Workload arbitrarily picked
Very small benchmarks used
Benchmarks manually translated to optimize performance

Common Benchmarking Mistakes (1/2)
Only average behavior represented in test workload
Skewness of device demands ignored
Loading level controlled inappropriately
Caching effects ignored
Buffer sizes not appropriate
Inaccuracies due to sampling ignored

Common Benchmarking Mistakes (2/2)
Ignoring monitoring overhead
Not validating measurements
Not ensuring same initial conditions
Not measuring transient (cold start) performance
Using device utilizations for performance comparisons
Collecting too much data but doing too little analysis

SPEC: System Performance Evaluation Cooperative (1/2)
First Round 1989
– 10 programs yielding a single number
Second Round 1992
– SpecInt92 (6 integer programs) and SpecFP92 (14 floating point programs)
– Compiler flags unlimited. March 93 of DEC 4000 Model 610:
spice: unix.c:/def=(sysv,has_bcopy,”bcopy(a,b,c)=memcpy(b,a,c)”
wave5: /ali=(all,dcom=nat)/ag=a/ur=4/ur=200
nasa7: /norecu/ag=a/ur=4/ur2=200/lc=blas

SPEC: System Performance Evaluation Cooperative (2/2)
Third Round 1995
– new set of programs: SPECint95 (8 integer programs) and SPECfp95 (10 floating point)
– “benchmarks useful for 3 years”
– Single flag setting for all programs: SPECint_base95, SPECfp_base95

How to Summarize Performance (1/2)
Arithmetic mean (weighted arithmetic mean) tracks execution time:
$\frac{1}{n} \sum_{i=1}^{n} T_i$ or $\sum_{i=1}^{n} W_i \times T_i$
Harmonic mean (weighted harmonic mean) of rates (e.g., MFLOPS) tracks execution time:
$\frac{n}{\sum_{i=1}^{n} 1/R_i}$ or $\frac{1}{\sum_{i=1}^{n} W_i / R_i}$
where the weights $W_i$ sum to 1.
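To see why the harmonic mean is the right average for rates, consider a small example in which every program performs the same amount of work. The Python sketch below uses made-up numbers (two programs of 100 MFLOP each, taking 1 s and 4 s); the function names and the workload values are assumptions for illustration, not data from the notes.

```python
def arithmetic_mean(values):
    """Unweighted arithmetic mean: (1/n) * sum of the values."""
    return sum(values) / len(values)


def harmonic_mean(rates):
    """Unweighted harmonic mean: n / sum(1 / R_i)."""
    return len(rates) / sum(1.0 / r for r in rates)


WORK_MFLOP = 100.0              # hypothetical work per program
times_s = [1.0, 4.0]            # hypothetical execution times
rates = [WORK_MFLOP / t for t in times_s]   # 100 and 25 MFLOPS

# The rate that matches total execution time is total work / total time.
overall_rate = WORK_MFLOP * len(times_s) / sum(times_s)

print(f"mean time               = {arithmetic_mean(times_s):.2f} s")
print(f"harmonic mean of rates  = {harmonic_mean(rates):.1f} MFLOPS")
print(f"total work / total time = {overall_rate:.1f} MFLOPS")
print(f"arithmetic mean of rates = {arithmetic_mean(rates):.1f} MFLOPS (overstates)")
```

The harmonic mean of the rates agrees with total work divided by total time, while the arithmetic mean of the rates gives a larger figure that does not correspond to the time actually spent.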

How to Summarize Performance (2/2)
Normalized execution time is handy for scaling performance (e.g., X times faster than SPARCstation 10)
– Arithmetic mean impacted by choice of reference machine
Use the geometric mean for comparison: $\sqrt[n]{\prod_{i=1}^{n} T_i}$, where $T_i$ is the normalized execution time of program $i$
– Independent of chosen machine
– but not a good metric for total execution time
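The effect of the reference machine can be demonstrated numerically. The following Python sketch normalizes two machines' execution times first to machine A and then to machine B, and compares B to A under each mean. The times are illustrative values and the helper names are hypothetical.

```python
import math


def geometric_mean(values):
    """n-th root of the product, computed via logs for numerical stability."""
    return math.exp(sum(math.log(v) for v in values) / len(values))


def arithmetic_mean(values):
    return sum(values) / len(values)


def normalize(times, reference):
    """Execution times divided, program by program, by the reference machine's."""
    return [t / r for t, r in zip(times, reference)]


# Illustrative execution times (seconds) for two programs on two machines.
TIMES = {"A": [1.0, 1000.0], "B": [10.0, 100.0]}


def b_over_a(reference_name):
    """Return (arithmetic, geometric) mean of B divided by that of A."""
    ref = TIMES[reference_name]
    norm_a = normalize(TIMES["A"], ref)
    norm_b = normalize(TIMES["B"], ref)
    return (arithmetic_mean(norm_b) / arithmetic_mean(norm_a),
            geometric_mean(norm_b) / geometric_mean(norm_a))


for ref_name in ("A", "B"):
    am, gm = b_over_a(ref_name)
    print(f"normalized to {ref_name}: arithmetic B/A = {am:.3f}, geometric B/A = {gm:.3f}")
```

With these numbers the arithmetic mean says B is about 5 times slower than A when normalized to A, and about 5 times faster when normalized to B; the geometric mean gives the same answer either way.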

SPEC First Round
One program: 99% of time in a single line of code
New front-end compiler could improve dramatically
[Figure: bar chart of SPEC Perf (axis 0 to 800) by benchmark: gcc, espresso, spice, doduc, nasa7, li, eqntott, matrix300, fpppp, tomcatv]

Impact of Means on SPECmark89 for IBM 550 (without and with special compiler option)
             Ratio to VAX      Time             Weighted Time
Program      Before  After     Before  After    Before  After
gcc          30      29        49      51       8.91    9.22
espresso     35      34        65      67       7.64    7.86
spice        47      47        510     510      5.69    5.69
doduc        46      49        41      38       5.81    5.45
nasa7        78      144       258     140      3.43    1.86
li           34      34        183     183      7.86    7.86
eqntott      40      40        28      28       6.68    6.68
matrix300    78      730       58      6        3.43    0.37
fpppp        90      87        34      35       2.97    3.07
tomcatv      33      138       20      19       2.01    1.94
Mean         54      72        124     108      54.42   49.99
             Geometric         Arithmetic       Weighted Arith.
Ratio        1.33              1.16             1.09

The Bottom Line: Performance (and Cost)
• Time to run the task (ExTime)
– Execution time, response time, latency
• Tasks per day, hour, week, sec, ns … (Performance)
– Throughput, bandwidth
Plane              DC to Paris   Speed      Passengers   Throughput (pass/hr)
Boeing 747         6.5 hours     610 mph    470          72
BAC/Sud Concorde   3 hours       1350 mph   132          44

The Bottom Line: Performance (and Cost)
"X is n times faster than Y" means
ExTime(Y) / ExTime(X) = Performance(X) / Performance(Y) = n
• Performance is the reciprocal of execution time
• Speed of Concorde vs. Boeing 747
• Throughput of Boeing 747 vs. Concorde
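The two comparisons can be worked out directly from the plane figures given earlier (trip times of 6.5 and 3 hours, 470 and 132 passengers). The Python sketch below is one way to do the arithmetic; the function and variable names are chosen for illustration.

```python
def times_faster(ex_time_y: float, ex_time_x: float) -> float:
    """n such that X is n times faster than Y: ExTime(Y) / ExTime(X)."""
    return ex_time_y / ex_time_x


# DC to Paris trip time (hours) and passenger capacity, from the notes.
boeing = {"hours": 6.5, "passengers": 470}
concorde = {"hours": 3.0, "passengers": 132}

# Latency view: the Concorde finishes one trip sooner.
speed_ratio = times_faster(boeing["hours"], concorde["hours"])

# Throughput view: passengers delivered per hour of flight.
tput_boeing = boeing["passengers"] / boeing["hours"]
tput_concorde = concorde["passengers"] / concorde["hours"]
throughput_ratio = tput_boeing / tput_concorde

print(f"Concorde is {speed_ratio:.2f} times faster (execution time)")
print(f"Boeing 747 throughput: {tput_boeing:.0f} pass/hr, Concorde: {tput_concorde:.0f} pass/hr")
print(f"Boeing 747 has {throughput_ratio:.2f} times the throughput")
```

The same pair of machines ranks in opposite order depending on whether latency or throughput is the metric, which is the point of the example.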

Performance Terminology
“X is n% faster than Y” means:
ExTime(Y) / ExTime(X) = Performance(X) / Performance(Y) = 1 + n/100
n = 100 (Performance(X) - Performance(Y)) / Performance(Y)
n = 100 (ExTime(Y) - ExTime(X)) / ExTime(X)

Example
Example: Y takes 15 seconds to complete a task, X takes 10 seconds. What % faster is X?
n = 100 (ExTime(Y) - ExTime(X)) / ExTime(X)
n = 100 (15 - 10) / 10
n = 50%
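The formula translates into a one-line function. A minimal Python sketch, with the hypothetical name `percent_faster`:

```python
def percent_faster(ex_time_y: float, ex_time_x: float) -> float:
    """Return n such that X is n% faster than Y.

    n = 100 * (ExTime(Y) - ExTime(X)) / ExTime(X)
    """
    return 100.0 * (ex_time_y - ex_time_x) / ex_time_x


n = percent_faster(ex_time_y=15.0, ex_time_x=10.0)
print(f"X is {n:.0f}% faster than Y")  # 100 * (15 - 10) / 10 = 50
```

Note that the denominator is the execution time of the faster machine X, so "50% faster" here does not mean X takes half the time.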

Speedup
Speedup due to enhancement E:
Speedup(E) = ExTime w/o E / ExTime w/ E = Performance w/ E / Performance w/o E
Suppose that enhancement E accelerates a fraction Fraction_enhanced of the task by a factor Speedup_enhanced, and the remainder of the task is unaffected. Then what is
ExTime(E) = ?
Speedup(E) = ?

Amdahl’s Law
States that the performance improvement to be gained from using some faster mode of execution is limited by the fraction of the time the faster mode can be used

Amdahl’s Law
$\text{ExTime}_{new} = \text{ExTime}_{old} \times \left[ (1 - \text{Fraction}_{enhanced}) + \frac{\text{Fraction}_{enhanced}}{\text{Speedup}_{enhanced}} \right]$
$\text{Speedup}_{overall} = \frac{\text{ExTime}_{old}}{\text{ExTime}_{new}} = \frac{1}{(1 - \text{Fraction}_{enhanced}) + \frac{\text{Fraction}_{enhanced}}{\text{Speedup}_{enhanced}}}$

Example of Amdahl’s Law
Floating point instructions improved to run 2X; but only 10% of actual instructions are FP
ExTime_new = ExTime_old x (0.9 + 0.1/2) = 0.95 x ExTime_old
Speedup_overall = 1 / 0.95 = 1.053
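Amdahl's Law is easy to experiment with in code. The Python sketch below reproduces the floating-point example and also shows the upper bound reached when the enhanced part becomes arbitrarily fast; the function name `amdahl_speedup` is hypothetical, and the example treats the 10% figure as a fraction of execution time.

```python
def amdahl_speedup(fraction_enhanced: float, speedup_enhanced: float) -> float:
    """Overall speedup = 1 / ((1 - F) + F / S)."""
    return 1.0 / ((1.0 - fraction_enhanced) + fraction_enhanced / speedup_enhanced)


# FP runs 2x faster, but FP is only 10% of the execution time.
fp_example = amdahl_speedup(fraction_enhanced=0.10, speedup_enhanced=2.0)
print(f"overall speedup = {fp_example:.3f}")      # 1 / 0.95

# Even an effectively infinite FP speedup is bounded by 1 / (1 - F).
upper_bound = amdahl_speedup(fraction_enhanced=0.10, speedup_enhanced=1e12)
print(f"upper bound     = {upper_bound:.3f}")     # approaches 1 / 0.9
```

The bound of roughly 1.11 shows why speeding up a rarely used feature has little payoff, which motivates making the common case fast.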

Corollary: Make The Common Case Fast
All instructions require an instruction fetch, only a fraction require a data fetch/store.
– Optimize instruction access over data access
Programs exhibit locality
– 90% of time in 10% of code
– Spatial Locality
– Temporal Locality
Access to small memories is faster
– Provide a storage hierarchy such that the most frequent accesses are to the smallest (closest) memories.
[Figure: storage hierarchy: Reg's, Cache, Memory, Disk / Tape]

RISC Philosophy
The simple case is usually the most frequent and the easiest to optimize!
Do simple, fast things in hardware and be sure the rest can be handled correctly in software

Metrics of Performance
[Figure: levels of the system, from top to bottom: Application, Programming Language, Compiler, ISA, Datapath and Control, Function Units, Transistors / Wires / Pins]
Metrics shown alongside the levels:
– Answers per month
– Operations per second
– (millions) of Instructions per second: MIPS
– (millions) of (FP) operations per second: MFLOP/s
– Megabytes per second
– Cycles per second (clock rate)

% of Instructions Responsible for 80-90% of Instructions Executed
[Figure: bar chart (axis 0 to 80) for the benchmarks compress, li, ear]

Aspects of CPU Performance
(User) CPU time = Seconds / Program = (Instructions / Program) x (Cycles / Instruction) x (Seconds / Cycle)
               Inst. Cnt   CPI    Clock Rate
Program        X
Compiler       X           (X)
Inst. Set      X           X
Organization               X      X
Technology                        X

Marketing Metrics
MIPS = Instruction Count / (Time x 10^6) = Clock Rate / (CPI x 10^6)
• Machines with different instruction sets?
• Programs with different instruction mixes?
– Dynamic frequency of instructions
• Uncorrelated with performance
MFLOP/s = FP Operations / (Time x 10^6)
• Machine dependent
• Often not where time is spent
Normalized:
add, sub, compare, mult   1
divide, sqrt              4
exp, sin, . . .           8
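A small numeric example shows how MIPS can disagree with execution time when two machines need different instruction counts for the same program. The Python sketch below uses made-up machines A and B (same 100 MHz clock; A needs 10 million instructions at CPI 1.0, B needs 6 million at CPI 1.5); all values and names are assumptions for illustration.

```python
def cpu_time(instr_count: float, cpi: float, clock_hz: float) -> float:
    """CPU time = instruction count x CPI / clock rate."""
    return instr_count * cpi / clock_hz


def mips(instr_count: float, seconds: float) -> float:
    """MIPS = instruction count / (time x 10^6)."""
    return instr_count / (seconds * 1e6)


CLOCK_HZ = 100e6  # hypothetical 100 MHz clock for both machines
machines = {"A": (10e6, 1.0), "B": (6e6, 1.5)}  # (instruction count, CPI)

results = {}
for name, (count, cpi) in machines.items():
    seconds = cpu_time(count, cpi, CLOCK_HZ)
    results[name] = (seconds, mips(count, seconds))
    print(f"machine {name}: time = {seconds:.3f} s, MIPS = {results[name][1]:.1f}")
```

Machine B finishes the program sooner (0.090 s vs. 0.100 s) yet reports the lower MIPS rating, so ranking by MIPS would pick the slower machine.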

Cycles Per Instruction
“Average Cycles per Instruction”
CPI = Clock Cycles for a Program / Instr. Count
$\text{CPU time} = \text{CycleTime} \times \sum_{i=1}^{n} \text{CPI}_i \times I_i$
“Instruction Frequency”
$\text{Avg. CPI} = \sum_{i=1}^{n} \text{CPI}_i \times F_i$, where $F_i = \frac{I_i}{\text{Instruction Count}}$
Invest resources where time is spent!

Example: Calculating CPI
Base Machine (Reg / Reg), typical mix
Op       Freq   Cycles   CPI(i)   (% Time)
ALU      50%    1        .5       (33%)
Load     20%    2        .4       (27%)
Store    10%    2        .2       (13%)
Branch   20%    2        .4       (27%)
Total                    1.5
1.5 is the average CPI for this instruction mix
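The table can be reproduced by applying Avg. CPI = sum of CPI_i x F_i to the instruction mix. A minimal Python sketch, with variable names chosen for illustration:

```python
# Instruction mix of the base machine: op -> (frequency, cycles)
MIX = {
    "ALU": (0.50, 1),
    "Load": (0.20, 2),
    "Store": (0.10, 2),
    "Branch": (0.20, 2),
}

# Each class contributes frequency x cycles to the average CPI.
contributions = {op: freq * cycles for op, (freq, cycles) in MIX.items()}
avg_cpi = sum(contributions.values())

print(f"{'Op':<8}{'CPI(i)':>8}{'% Time':>8}")
for op, cpi_i in contributions.items():
    share = cpi_i / avg_cpi  # fraction of execution time spent in this class
    print(f"{op:<8}{cpi_i:>8.1f}{share:>8.0%}")
print(f"Average CPI = {avg_cpi:.1f}")
```

The "% Time" column is each class's CPI contribution divided by the average CPI, which is why ALU operations account for half the instructions but only a third of the time.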

Organizational Trade-offs
[Figure: system levels (Application, Programming Language, Compiler, ISA, Datapath and Control, Function Units, Transistors / Wires / Pins), annotated with Instruction Mix, CPI, and Cycle Time]

Trade-off Example
Add register / memory operations:
– One source operand in memory
– One source operand in register
– Cycle count of 2
Branch cycle count to increase to 3.
What fraction of the loads must be eliminated for this to pay off?
Base Machine (Reg / Reg)
Op       Freq   Cycles
ALU      50%    1
Load     20%    2
Store    10%    2
Branch   20%    2

Example Solution
Exec Time = Instr Cnt x CPI x Clock
          Base machine              With Reg/Mem operations
Op        Freq   Cycles  CPI(i)     Freq     Cycles  CPI(i)
ALU       .50    1       .5         .5 – X   1       .5 – X
Load      .20    2       .4         .2 – X   2       .4 – 2X
Store     .10    2       .2         .1       2       .2
Branch    .20    2       .4         .2       3       .6
Reg/Mem                             X        2       2X
Total     1.00           1.5        1 – X            (1.7 – X)/(1 – X)
CPI_New must be normalized to the new instruction frequency: CPI_New = Cycles_New / Instructions_New
Instr Cnt_Old x CPI_Old x Clock_Old = Instr Cnt_New x CPI_New x Clock_New
1.00 x 1.5 = (1 – X) x (1.7 – X)/(1 – X)
1.5 = 1.7 – X
0.2 = X
ALL loads must be eliminated for this to be a win!
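The break-even point can be verified with exact arithmetic. The Python sketch below models the cycles spent per original instruction when a fraction X of the original instruction count is converted (X loads disappear and X ALU operations become register/memory operations). It assumes the clock cycle time is unchanged, and the function and constant names are hypothetical.

```python
from fractions import Fraction

OLD_CPI = Fraction(3, 2)     # base machine: .5*1 + .2*2 + .1*2 + .2*2
LOAD_FREQ = Fraction(1, 5)   # loads are 20% of the original instructions


def new_cycles(x: Fraction) -> Fraction:
    """Cycles per original instruction on the reg/mem machine.

    x is the fraction of the original instruction count that is converted:
    x loads are removed and x ALU ops become 2-cycle reg/mem ops.
    """
    alu = (Fraction(1, 2) - x) * 1
    load = (Fraction(1, 5) - x) * 2
    store = Fraction(1, 10) * 2
    branch = Fraction(1, 5) * 3      # branches now take 3 cycles
    reg_mem = x * 2
    return alu + load + store + branch + reg_mem   # simplifies to 1.7 - x


# new_cycles is linear in x with slope -1, so the break-even X is the gap
# between the cost at x = 0 and the old cost.
break_even = new_cycles(Fraction(0)) - OLD_CPI

print(f"break-even X = {float(break_even)}")
print(f"share of loads that must be eliminated = {float(break_even / LOAD_FREQ):.0%}")
```

Because the break-even X equals the load frequency itself, the change only matches the base machine when every load is folded into a register/memory operation.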

Means
Arithmetic mean: $\frac{1}{n} \sum_{i=1}^{n} T_i$
– Can be weighted. Represents total execution time
Harmonic mean: $\frac{n}{\sum_{i=1}^{n} \frac{1}{R_i}}$
Geometric mean: $\left( \prod_{i=1}^{n} \frac{T_i}{T_{ri}} \right)^{1/n} = \exp\left( \frac{1}{n} \sum_{i=1}^{n} \log \frac{T_i}{T_{ri}} \right)$
– Consistent independent of reference

Which Machine is “Better”?
                   Computer A   Computer B   Computer C
Program P1 (sec)   1            10           20
Program P2 (sec)   1000         100          20
Total Time (sec)   1001         110          40

Weighted Arithmetic Mean
Assume three weighting schemes
P1/P2       Comp A   Comp B   Comp C
.5/.5       500.5    55.0     20
.909/.091   91.82    18.18    20
.999/.001   2        10.09    20
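The weighted means can be recomputed from the execution times of programs P1 and P2 on the three computers (1/1000, 10/100, and 20/20 seconds). The Python sketch below does so; it assumes the middle weighting is exactly 10/11 and 1/11 (which rounds to .909/.091), and the helper names are hypothetical.

```python
# Execution times in seconds for (P1, P2) on each computer.
TIMES = {"A": (1.0, 1000.0), "B": (10.0, 100.0), "C": (20.0, 20.0)}

# Weighting schemes (W_P1, W_P2); the middle one is assumed to be 10/11 and 1/11.
WEIGHTS = {
    ".5/.5": (0.5, 0.5),
    ".909/.091": (10 / 11, 1 / 11),
    ".999/.001": (0.999, 0.001),
}


def weighted_mean(times, weights):
    """Weighted arithmetic mean: sum of W_i x T_i, with weights summing to 1."""
    return sum(w * t for w, t in zip(weights, times))


best = {}
for label, weights in WEIGHTS.items():
    means = {name: weighted_mean(times, weights) for name, times in TIMES.items()}
    best[label] = min(means, key=means.get)
    row = "  ".join(f"{name}={value:.2f}" for name, value in means.items())
    print(f"{label:<10} {row}  fastest: {best[label]}")
```

Each weighting scheme crowns a different machine (C, then B, then A), so the choice of weights decides the outcome as much as the measurements do.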

Performance Evaluation
Given that sales are a function of performance relative to the competition, companies invest heavily in improving the product as reported by the performance summary
Good products are created when there are:
– Good benchmarks
– Good ways to summarize performance
If the benchmarks or summary are inadequate, the choice is between improving the product for real programs and improving the product to get more sales; sales almost always wins!
Execution time is the measure of performance

More Related Content

PPT
Cost, Price, and Price for Performance.ppt
PPT
L-2 (Computer Performance).ppt
PPT
Fundamentals of Computer Architecture lecture notes
PPT
computer architecture.
PPTX
Computer Architechture and Organization
PPT
Technology trends-Computer food chain technologies
Cost, Price, and Price for Performance.ppt
L-2 (Computer Performance).ppt
Fundamentals of Computer Architecture lecture notes
computer architecture.
Computer Architechture and Organization
Technology trends-Computer food chain technologies

Similar to Computer performance and cost analysis in systems (20)

PDF
Computer architecture short note (version 8)
PPTX
Performance Computer Architecture Stuff.pptx
PPTX
FUNDAMENTALS OF COMPUTER DESIGN
PDF
Computer Architecture Performance and Energy
PPT
Evaluation of morden computer & system attributes in ACA
PPTX
Fundamentals.pptx
PPT
Computer Organization Design ch2Slides.ppt
PDF
analysis-of-computer-system-architecture-and-functionality.pdf
PPT
Chapter 1 computer abstractions and technology
PDF
The OpenCL-based FPGA accelerator data flow is as follows. First, the kernel ...
PPTX
Caqa5e ch1 with_review_and_examples
PPTX
Mirabilis_Presentation_DAC_June_2024.pptx
PPTX
Pertemuan 5.pptx
PPT
lect1.ppt of a lot of things like computer
PPTX
Advanced Computer Architecture – An Introduction
PPT
Chapter_01computer architecture chap 2 .ppt
PPT
Computer Abstractions and Technologies
PPTX
Unit 1c
PDF
Computer Organization and Architecture 10th Edition Stallings Test Bank
PPTX
Compare Performance-power of Arm Cortex vs RISC-V for AI applications_oct_2021
Computer architecture short note (version 8)
Performance Computer Architecture Stuff.pptx
FUNDAMENTALS OF COMPUTER DESIGN
Computer Architecture Performance and Energy
Evaluation of morden computer & system attributes in ACA
Fundamentals.pptx
Computer Organization Design ch2Slides.ppt
analysis-of-computer-system-architecture-and-functionality.pdf
Chapter 1 computer abstractions and technology
The OpenCL-based FPGA accelerator data flow is as follows. First, the kernel ...
Caqa5e ch1 with_review_and_examples
Mirabilis_Presentation_DAC_June_2024.pptx
Pertemuan 5.pptx
lect1.ppt of a lot of things like computer
Advanced Computer Architecture – An Introduction
Chapter_01computer architecture chap 2 .ppt
Computer Abstractions and Technologies
Unit 1c
Computer Organization and Architecture 10th Edition Stallings Test Bank
Compare Performance-power of Arm Cortex vs RISC-V for AI applications_oct_2021
Ad

More from VivekanandaGN1 (18)

PPTX
Study_Material_Presentations_Unit-2.pptx
PPT
Classical-Problem-of-Synchronization in OS
PPTX
Web Security and its Importance in the Present era
PPT
Digital computer architecture issues in IO
PPT
Storage devices metrics productivity- IO Introduction
PPTX
Web security Threats and approaches in Security.pptx
PPTX
Remote User Authentication ,Symmetric, Asymmetric and Kerberos.ppt
PPTX
Key management and Distribution in Network security.ppt
PPTX
Message Authentication Codes in Security.pptx
PPTX
Cryptographic Hash Functions in Security.pptx
PPTX
Asymmetric Ciphers in Networks and Security.pptx
PPTX
IdentityTheft by federal trade comission
PPTX
Cybercrime Mobile and Wireless Devices.pptx
PPTX
Cyber Secuirty Fully explained Lecture Notes
PPT
CYBER-CRIME PRESENTATION with real-time examples
PDF
GANS Project for Image idetification.pdf
PDF
Cheat sheet SQL commands with examples and easy understanding
PDF
Master the arrays and algorithms using Algotutor
Study_Material_Presentations_Unit-2.pptx
Classical-Problem-of-Synchronization in OS
Web Security and its Importance in the Present era
Digital computer architecture issues in IO
Storage devices metrics productivity- IO Introduction
Web security Threats and approaches in Security.pptx
Remote User Authentication ,Symmetric, Asymmetric and Kerberos.ppt
Key management and Distribution in Network security.ppt
Message Authentication Codes in Security.pptx
Cryptographic Hash Functions in Security.pptx
Asymmetric Ciphers in Networks and Security.pptx
IdentityTheft by federal trade comission
Cybercrime Mobile and Wireless Devices.pptx
Cyber Secuirty Fully explained Lecture Notes
CYBER-CRIME PRESENTATION with real-time examples
GANS Project for Image idetification.pdf
Cheat sheet SQL commands with examples and easy understanding
Master the arrays and algorithms using Algotutor
Ad

Recently uploaded (20)

PPT
Mechanical Engineering MATERIALS Selection
PPT
CRASH COURSE IN ALTERNATIVE PLUMBING CLASS
PPTX
Geodesy 1.pptx...............................................
PPTX
Sustainable Sites - Green Building Construction
PPTX
IOT PPTs Week 10 Lecture Material.pptx of NPTEL Smart Cities contd
DOCX
573137875-Attendance-Management-System-original
PDF
Operating System & Kernel Study Guide-1 - converted.pdf
PDF
composite construction of structures.pdf
PDF
keyrequirementskkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkk
PDF
BMEC211 - INTRODUCTION TO MECHATRONICS-1.pdf
PDF
TFEC-4-2020-Design-Guide-for-Timber-Roof-Trusses.pdf
PDF
SM_6th-Sem__Cse_Internet-of-Things.pdf IOT
PDF
Mohammad Mahdi Farshadian CV - Prospective PhD Student 2026
PPTX
Construction Project Organization Group 2.pptx
PPTX
Lecture Notes Electrical Wiring System Components
PPTX
M Tech Sem 1 Civil Engineering Environmental Sciences.pptx
PDF
R24 SURVEYING LAB MANUAL for civil enggi
PDF
July 2025 - Top 10 Read Articles in International Journal of Software Enginee...
PPTX
CARTOGRAPHY AND GEOINFORMATION VISUALIZATION chapter1 NPTE (2).pptx
PPTX
UNIT 4 Total Quality Management .pptx
Mechanical Engineering MATERIALS Selection
CRASH COURSE IN ALTERNATIVE PLUMBING CLASS
Geodesy 1.pptx...............................................
Sustainable Sites - Green Building Construction
IOT PPTs Week 10 Lecture Material.pptx of NPTEL Smart Cities contd
573137875-Attendance-Management-System-original
Operating System & Kernel Study Guide-1 - converted.pdf
composite construction of structures.pdf
keyrequirementskkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkk
BMEC211 - INTRODUCTION TO MECHATRONICS-1.pdf
TFEC-4-2020-Design-Guide-for-Timber-Roof-Trusses.pdf
SM_6th-Sem__Cse_Internet-of-Things.pdf IOT
Mohammad Mahdi Farshadian CV - Prospective PhD Student 2026
Construction Project Organization Group 2.pptx
Lecture Notes Electrical Wiring System Components
M Tech Sem 1 Civil Engineering Environmental Sciences.pptx
R24 SURVEYING LAB MANUAL for civil enggi
July 2025 - Top 10 Read Articles in International Journal of Software Enginee...
CARTOGRAPHY AND GEOINFORMATION VISUALIZATION chapter1 NPTE (2).pptx
UNIT 4 Total Quality Management .pptx

Computer performance and cost analysis in systems

  • 1. December 20, 2024 204521 Digital System Architecture Computer Performance and Cost Pradondet Nilagupta Spring 2001 (original notes from Randy Katz, & Prof. Jan M. Rabaey , UC Berkeley)
  • 2. December 20, 2024 204521 Digital System Architecture 2 Review: What is Computer Architecture? Technology Applications Computer Architect Interfaces Machine Organization Measurement & Evaluation ISA API Link I/O Chan Regs IR
  • 3. December 20, 2024 204521 Digital System Architecture 3 Review: What is Computer Architecture? Technology Applications Computer Architect Interfaces Machine Organization Measurement & Evaluation ISA API Link I/O Chan Regs IR
  • 4. December 20, 2024 204521 Digital System Architecture 4 The Architecture Process New concepts created Estimate Cost & Performance Sort Good ideas Mediocre ideas Bad ideas
  • 5. December 20, 2024 204521 Digital System Architecture 5 Performance Measurement and Evaluation Many dimensions to computer performance – CPU execution time • by instruction or sequence – floating point – integer – branch performance – Cache bandwidth – Main memory bandwidth – I/O performance • bandwidth • seeks • pixels or polygons per second Relative importance depends on applications P $ M
  • 6. December 20, 2024 204521 Digital System Architecture 6 Evaluation Tools Benchmarks, traces, & mixes – macrobenchmarks & suites • application execution time – microbenchmarks • measure one aspect of performance – traces • replay recorded accesses – cache, branch, register Simulation at many levels – ISA, cycle accurate, RTL, gate, circuit • trade fidelity for simulation rate Area and delay estimation Analysis – e.g., queuing theory MOVE 39% BR 20% LOAD 20% STORE 10% ALU 11% LD 5EA3 ST 31FF …. LD 1EA2 ….
  • 7. December 20, 2024 204521 Digital System Architecture 7 Benchmarks Microbenchmarks – measure one performance dimension • cache bandwidth • main memory bandwidth • procedure call overhead • FP performance – weighted combination of microbenchmark performance is a good predictor of application performance – gives insight into the cause of performance bottlenecks Macrobenchmarks – application execution time • measures overall performance, but on just one application Perf. Dimensions Applications Micro Macro
  • 8. December 20, 2024 204521 Digital System Architecture 8 Some Warnings about Benchmarks Benchmarks measure the whole system – application – compiler – operating system – architecture – implementation Popular benchmarks typically reflect yesterday’s programs – computers need to be designed for tomorrow’s programs Benchmark timings often very sensitive to – alignment in cache – location of data on disk – values of data Benchmarks can lead to inbreeding or positive feedback – if you make an operation fast (slow) it will be used more (less) often • so you make it faster (slower) – and it gets used even more (less) » and so on…
  • 9. December 20, 2024 204521 Digital System Architecture Architectural Performance Laws and Rules of Thumb
  • 10. December 20, 2024 204521 Digital System Architecture 10 Measurement and Evaluation Architecture is an iterative process: – Searching the space of possible designs – Make selections – Evaluate the selections made Good measurement tools are required to accurately evaluate the selection.
  • 11. December 20, 2024 204521 Digital System Architecture 11 Measurement Tools Benchmarks, Traces, Mixes Cost, delay, area, power estimation Simulation (many levels) – ISA, RTL, Gate, Circuit Queuing Theory Rules of Thumb Fundamental Laws
  • 12. December 20, 2024 204521 Digital System Architecture 12 Measuring and Reporting Performance What do we mean by one Computer is faster than another? – program runs less time Response time or execution time – time that users see the output Throughput – total amount of work done in a given time
  • 13. December 20, 2024 204521 Digital System Architecture 13 Performance “Increasing and decreasing” ????? We use the term “improve performance” or “ improve execution time” When we mean in crease performance and decrease executi on time . improve performance = increase performance improve execution time = decrease execution time
  • 14. December 20, 2024 204521 Digital System Architecture 14 Measuring Performance Definition of time Wall Clock time Response time Elapsed time – A latency to complete a task including disk accesses, memory accesses, I/O activities, operating system overhead
  • 15. December 20, 2024 204521 Digital System Architecture Does Anybody Really Know What Time it is? User CPU Time (Time spent in program) System CPU Time (Time spent in OS) Elapsed Time (Response Time = 159 Sec.) (90.7+12.9)/159 * 100 = 65%, % of lapsed time that is CPU time. 45% of the time spent in I/O or running other programs UNIX Time Command: 90.7u 12.9s 2:39 65%
  • 16. December 20, 2024 204521 Digital System Architecture 16 Example UNIX time command 90.7u 12.95 2:39 65% user CPU time is 90.7 sec system CPU time is 12.9 sec elapsed time is 2 min 39 sec. (159 sec) % of elapsed time that is CPU time is 90.7 + 12.9 = 65% 159
  • 17. December 20, 2024 204521 Digital System Architecture 17 Time CPU time – time the CPU is computing – not including the time waiting for I/O or running other program User CPU time – CPU time spent in the program System CPU time – CPU time spent in the operating system performing task requested by the program decrease execution time CPU time = User CPU time + System CPU time
  • 18. December 20, 2024 204521 Digital System Architecture 18 Performance System Performance – elapsed time on unloaded system CPU performance – user CPU time on an unloaded system
  • 19. December 20, 2024 204521 Digital System Architecture 19 Programs to Evaluate Processor Performance Real Program Kernel – Time critical excerpts Toy benchmark – 10-100 lines of code, Sieve of Erastasthens, Puzzles, Quick Sort Synthetic benchmark – Attempt to match average frequencies of real workloads – similar to kernel – Whetstone, Dhrystone – not even pieces of real program, but kernel might be.
  • 20. December 20, 2024 204521 Digital System Architecture 20 Benchmark Suite collecting of benchmark, try to measure the performance of processor with a variety of application – SPECint92, SPECfp92 – see fig. 1.9
  • 21. Benchmarking Games
    – Differing configurations used to run the same workload on two systems
    – Compiler wired to optimize the workload
    – Test specification written to be biased towards one machine
    – Synchronized CPU/IO-intensive job sequence used
    – Workload arbitrarily picked
    – Very small benchmarks used
    – Benchmarks manually translated to optimize performance
  • 22. Common Benchmarking Mistakes (1/2)
    – Only average behavior represented in test workload
    – Skewness of device demands ignored
    – Loading level controlled inappropriately
    – Caching effects ignored
    – Buffer sizes not appropriate
    – Inaccuracies due to sampling ignored
  • 23. Common Benchmarking Mistakes (2/2)
    – Ignoring monitoring overhead
    – Not validating measurements
    – Not ensuring same initial conditions
    – Not measuring transient (cold start) performance
    – Using device utilizations for performance comparisons
    – Collecting too much data but doing too little analysis
  • 24. SPEC: System Performance Evaluation Cooperative (1/2)
    – First round, 1989: 10 programs yielding a single number
    – Second round, 1992: SPECint92 (6 integer programs) and SPECfp92 (14 floating-point programs)
    – Compiler flags unlimited. Flags from March 93 for the DEC 4000 Model 610:
        spice: unix.c:/def=(sysv,has_bcopy,"bcopy(a,b,c)=memcpy(b,a,c)"
        wave5: /ali=(all,dcom=nat)/ag=a/ur=4/ur=200
        nasa7: /norecu/ag=a/ur=4/ur2=200/lc=blas
  • 25. SPEC: System Performance Evaluation Cooperative (2/2)
    – Third round, 1995: new set of programs, SPECint95 (8 integer programs) and SPECfp95 (10 floating-point programs)
    – "Benchmarks useful for 3 years"
    – Single flag setting for all programs: SPECint_base95, SPECfp_base95
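  • 26. How to Summarize Performance (1/2)
    – Arithmetic mean (or weighted arithmetic mean) of execution times tracks execution time:
      $$ \frac{1}{n}\sum_{i=1}^{n} T_i \quad\text{or}\quad \sum_{i=1}^{n} W_i\,T_i $$
    – Harmonic mean (or weighted harmonic mean) of rates (e.g., MFLOPS) tracks execution time:
      $$ \frac{n}{\sum_{i=1}^{n} 1/R_i} \quad\text{or}\quad \frac{1}{\sum_{i=1}^{n} W_i/R_i} $$
    Here $T_i$ is the execution time of program $i$, $R_i$ is its rate, and the weights $W_i$ sum to 1.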
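A minimal sketch of why the harmonic mean, not the arithmetic mean, is the right average for rates. The two-program workload below is hypothetical (it does not come from the slides); each program is assumed to perform the same amount of floating-point work.

```python
def arithmetic_mean(values):
    """Plain arithmetic mean: (1/n) * sum(values)."""
    return sum(values) / len(values)


def harmonic_mean(rates):
    """Harmonic mean of rates: n / sum(1 / R_i)."""
    return len(rates) / sum(1.0 / r for r in rates)


# Hypothetical workload: two programs, 100 million FP operations each.
work_mflop = [100.0, 100.0]
times_s = [1.0, 4.0]
rates = [w / t for w, t in zip(work_mflop, times_s)]  # 100 and 25 MFLOPS

overall_rate = sum(work_mflop) / sum(times_s)  # total work / total time
print(harmonic_mean(rates), overall_rate)      # both are about 40 MFLOPS
print(arithmetic_mean(rates))                  # 62.5, overstates the machine
```

With equal work per program, the harmonic mean of the rates equals total work divided by total time, which is why it tracks execution time.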
  • 27. How to Summarize Performance (2/2)
    – Normalized execution time is handy for scaling performance (e.g., X times faster than a SPARCstation 10)
    – The arithmetic mean of normalized times is affected by the choice of reference machine
    – Use the geometric mean for comparison:
      $$ \left(\prod_{i=1}^{n} T_i\right)^{1/n} $$
      where $T_i$ is the normalized execution time of program $i$
      – Independent of the chosen reference machine
      – But not a good metric for total execution time
  • 28. SPEC First Round
    – One program: 99% of time in a single line of code
    – A new front-end compiler could improve its result dramatically
    [Figure: bar chart of SPEC Perf (axis 0 to 800) per benchmark: gcc, espresso, spice, doduc, nasa7, li, eqntott, matrix300, fpppp, tomcatv]
  • 29. Impact of Means on SPECmark89 for IBM 550 (without and with special compiler option)
                  Ratio to VAX       Time            Weighted Time
    Program       Before   After     Before  After   Before   After
    gcc               30      29         49     51     8.91    9.22
    espresso          35      34         65     67     7.64    7.86
    spice             47      47        510    510     5.69    5.69
    doduc             46      49         41     38     5.81    5.45
    nasa7             78     144        258    140     3.43    1.86
    li                34      34        183    183     7.86    7.86
    eqntott           40      40         28     28     6.68    6.68
    matrix300         78     730         58      6     3.43    0.37
    fpppp             90      87         34     35     2.97    3.07
    tomcatv           33     138         20     19     2.01    1.94
    Mean              54      72        124    108    54.42   49.99
    Type of mean  Geometric          Arithmetic      Weighted arith.
    Ratio             1.33               1.16            1.09
  • 30. The Bottom Line: Performance (and Cost)
    – Time to run the task (ExTime): execution time, response time, latency
    – Tasks per day, hour, week, sec, ns, ... (Performance): throughput, bandwidth
    Plane              Speed      DC to Paris   Passengers   Throughput (pass/hr)
    Boeing 747         610 mph    6.5 hours     470          72
    BAC/Sud Concorde   1350 mph   3 hours       132          44
  • 31. The Bottom Line: Performance (and Cost)
    "X is n times faster than Y" means
      $$ \frac{\text{ExTime}(Y)}{\text{ExTime}(X)} = \frac{\text{Performance}(X)}{\text{Performance}(Y)} = n $$
    – Performance is the reciprocal of execution time
    – Speed of Concorde vs. Boeing 747
    – Throughput of Boeing 747 vs. Concorde
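As a rough illustration, the two comparisons can be worked through using the flight times and passenger counts from the plane table. The helper names `times_faster` and `throughput` are illustrative, not part of the original notes.

```python
def times_faster(extime_y: float, extime_x: float) -> float:
    """n such that X is n times faster than Y: ExTime(Y) / ExTime(X)."""
    return extime_y / extime_x


def throughput(passengers: int, hours: float) -> float:
    """Passengers delivered per hour of flight."""
    return passengers / hours


# Latency view: Concorde (3 h) vs. Boeing 747 (6.5 h), DC to Paris.
latency_ratio = times_faster(6.5, 3.0)

# Throughput view: Boeing 747 (470 passengers) vs. Concorde (132 passengers).
throughput_ratio = throughput(470, 6.5) / throughput(132, 3.0)

print(f"Concorde is {latency_ratio:.2f} times faster")            # 2.17
print(f"Boeing 747 has {throughput_ratio:.2f} times the throughput")  # 1.64
```

Which plane is "better" therefore depends on whether the metric of interest is latency or throughput.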
  • 32. Performance Terminology
    "X is n% faster than Y" means:
      $$ \frac{\text{ExTime}(Y)}{\text{ExTime}(X)} = \frac{\text{Performance}(X)}{\text{Performance}(Y)} = 1 + \frac{n}{100} $$
      $$ n = \frac{100\,(\text{Performance}(X) - \text{Performance}(Y))}{\text{Performance}(Y)} $$
      $$ n = \frac{100\,(\text{ExTime}(Y) - \text{ExTime}(X))}{\text{ExTime}(X)} $$
  • 33. Example
    Y takes 15 seconds to complete a task, X takes 10 seconds. What % faster is X?
      $$ n = \frac{100\,(\text{ExTime}(Y) - \text{ExTime}(X))}{\text{ExTime}(X)} = \frac{100\,(15 - 10)}{10} = 50 $$
    So X is 50% faster than Y.
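The same calculation as a small sketch; `percent_faster` is an illustrative name, and the inputs are the 15-second and 10-second times from the example.

```python
def percent_faster(extime_y: float, extime_x: float) -> float:
    """n such that X is n% faster than Y, computed from execution times."""
    return 100.0 * (extime_y - extime_x) / extime_x


n = percent_faster(extime_y=15.0, extime_x=10.0)
print(n)  # 50.0
```

Note that the denominator is the execution time of the faster machine X, so the result is 50%, not 33%.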
  • 34. Speedup
    Speedup due to enhancement E:
      $$ \text{Speedup}(E) = \frac{\text{ExTime without } E}{\text{ExTime with } E} = \frac{\text{Performance with } E}{\text{Performance without } E} $$
    Suppose that enhancement E accelerates a fraction $\text{Fraction}_{\text{enhanced}}$ of the task by a factor $\text{Speedup}_{\text{enhanced}}$, and the remainder of the task is unaffected. What, then, are ExTime(E) and Speedup(E)?
  • 35. Amdahl's Law
    States that the performance improvement to be gained from using some faster mode of execution is limited by the fraction of the time the faster mode can be used
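  • 36. Amdahl's Law
      $$ \text{ExTime}_{\text{new}} = \text{ExTime}_{\text{old}} \times \left[ (1 - \text{Fraction}_{\text{enhanced}}) + \frac{\text{Fraction}_{\text{enhanced}}}{\text{Speedup}_{\text{enhanced}}} \right] $$
      $$ \text{Speedup}_{\text{overall}} = \frac{\text{ExTime}_{\text{old}}}{\text{ExTime}_{\text{new}}} = \frac{1}{(1 - \text{Fraction}_{\text{enhanced}}) + \dfrac{\text{Fraction}_{\text{enhanced}}}{\text{Speedup}_{\text{enhanced}}}} $$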
  • 38. Example of Amdahl's Law
    Floating-point instructions are improved to run 2X faster, but only 10% of actual instructions are FP:
      $$ \text{ExTime}_{\text{new}} = \text{ExTime}_{\text{old}} \times (0.9 + 0.1/2) = 0.95 \times \text{ExTime}_{\text{old}} $$
      $$ \text{Speedup}_{\text{overall}} = \frac{1}{0.95} = 1.053 $$
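A minimal sketch of Amdahl's Law as a function, applied to the floating-point example. The function and parameter names are illustrative; the second call uses a very large, hypothetical enhancement factor to show the upper bound.

```python
def amdahl_speedup(fraction_enhanced: float, speedup_enhanced: float) -> float:
    """Overall speedup when a fraction of the time is sped up by a given factor."""
    if not 0.0 <= fraction_enhanced <= 1.0:
        raise ValueError("fraction_enhanced must lie in [0, 1]")
    if speedup_enhanced <= 0.0:
        raise ValueError("speedup_enhanced must be positive")
    new_time = (1.0 - fraction_enhanced) + fraction_enhanced / speedup_enhanced
    return 1.0 / new_time


fp_example = amdahl_speedup(0.10, 2.0)
print(round(fp_example, 3))                 # 1.053

# Even a hypothetical near-infinite FP speedup is capped at 1 / 0.9.
print(round(amdahl_speedup(0.10, 1e12), 3))  # 1.111
```

The second line shows the limit stated by the law: with only 10% of the time affected, no enhancement can deliver more than about an 11% overall gain.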
  • 39. Corollary: Make the Common Case Fast
    – All instructions require an instruction fetch; only a fraction require a data fetch/store
      – Optimize instruction access over data access
    – Programs exhibit locality: 90% of time in 10% of code
      – Spatial locality
      – Temporal locality
    – Access to small memories is faster
      – Provide a storage hierarchy such that the most frequent accesses are to the smallest (closest) memories
    Storage hierarchy, closest first: Reg's, Cache, Memory, Disk / Tape
  • 40. RISC Philosophy
    – The simple case is usually the most frequent and the easiest to optimize!
    – Do simple, fast things in hardware and be sure the rest can be handled correctly in software
  • 41. Metrics of Performance
    [Figure: system levels from top to bottom (Application, Programming Language, Compiler, ISA, Datapath and Control, Function Units, Transistors / Wires / Pins), annotated with performance metrics]
    Metrics shown, from the application level down to the hardware level:
    – Answers per month
    – Operations per second
    – (Millions of) instructions per second: MIPS
    – (Millions of) floating-point operations per second: MFLOP/s
    – Megabytes per second
    – Cycles per second (clock rate)
  • 42. % of Instructions Responsible for 80-90% of Instructions Executed
    [Figure: bar chart per benchmark (compress, li, ear); vertical axis 0 to 80]
  • 44. Aspects of CPU Performance
      $$ \text{(User) CPU time} = \frac{\text{Seconds}}{\text{Program}} = \frac{\text{Instructions}}{\text{Program}} \times \frac{\text{Cycles}}{\text{Instruction}} \times \frac{\text{Seconds}}{\text{Cycle}} $$
    What affects each factor:
                    Inst. Cnt   CPI    Clock Rate
    Program             X
    Compiler            X       (X)
    Inst. Set           X        X
    Organization                 X         X
    Technology                             X
  • 45. Marketing Metrics
      $$ \text{MIPS} = \frac{\text{Instruction Count}}{\text{Time} \times 10^6} = \frac{\text{Clock Rate}}{\text{CPI} \times 10^6} $$
    – Machines with different instruction sets?
    – Programs with different instruction mixes? (dynamic frequency of instructions)
    – Uncorrelated with performance
      $$ \text{MFLOP/s} = \frac{\text{FP Operations}}{\text{Time} \times 10^6} $$
    – Machine dependent
    – Often not where time is spent
    Normalized FP operation weights:
      add, sub, compare, mult   1
      divide, sqrt              4
      exp, sin, . . .           8
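The sketch below illustrates why MIPS can mislead when instruction sets differ. Both machines and all their numbers (clock rate, instruction counts, CPIs) are hypothetical and chosen only for illustration.

```python
def mips(clock_hz: float, cpi: float) -> float:
    """MIPS = Clock Rate / (CPI * 10^6)."""
    return clock_hz / (cpi * 1e6)


def cpu_time(instr_count: float, cpi: float, clock_hz: float) -> float:
    """CPU time = Instruction Count * CPI / Clock Rate, in seconds."""
    return instr_count * cpi / clock_hz


CLOCK_HZ = 100e6  # hypothetical 100 MHz clock for both machines

# Hypothetical: the same program compiled for two different instruction sets.
machine_a = {"instr_count": 200e6, "cpi": 1.0}  # many simple instructions
machine_b = {"instr_count": 100e6, "cpi": 1.5}  # fewer, slower instructions

for name, m in (("A", machine_a), ("B", machine_b)):
    rating = mips(CLOCK_HZ, m["cpi"])
    seconds = cpu_time(m["instr_count"], m["cpi"], CLOCK_HZ)
    print(f"machine {name}: {rating:.1f} MIPS, {seconds:.2f} s")
```

Under these assumptions machine A has the higher MIPS rating (100 vs. about 66.7), yet machine B finishes the program sooner (1.5 s vs. 2.0 s).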
  • 47. Cycles Per Instruction
    "Average cycles per instruction":
      $$ \text{CPI} = \frac{\text{Clock Cycles for a Program}}{\text{Instruction Count}} $$
      $$ \text{CPU time} = \text{CycleTime} \times \sum_{i=1}^{n} \text{CPI}_i \times I_i $$
    "Instruction frequency":
      $$ \text{Avg. CPI} = \sum_{i=1}^{n} \text{CPI}_i \times F_i \quad\text{where}\quad F_i = \frac{I_i}{\text{Instruction Count}} $$
    Invest resources where time is spent!
  • 48. Example: Calculating CPI
    Base Machine (Reg / Reg), typical mix
    Op        Freq   Cycles   CPI(i)   (% Time)
    ALU       50%    1        .5       (33%)
    Load      20%    2        .4       (27%)
    Store     10%    2        .2       (13%)
    Branch    20%    2        .4       (27%)
    Total                     1.5
    1.5 is the average CPI for this instruction mix
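The table can be reproduced with a few lines. This is a sketch using the instruction mix from the slide; the dictionary layout and variable names are illustrative.

```python
# Instruction mix of the base machine: op -> (frequency, cycles per instruction)
mix = {
    "ALU": (0.50, 1),
    "Load": (0.20, 2),
    "Store": (0.10, 2),
    "Branch": (0.20, 2),
}

# Avg. CPI = sum over instruction classes of CPI_i * F_i
avg_cpi = sum(freq * cycles for freq, cycles in mix.values())

# Share of execution time spent in each class: CPI_i * F_i / Avg. CPI
time_share = {op: freq * cycles / avg_cpi for op, (freq, cycles) in mix.items()}

print(f"average CPI = {avg_cpi:.1f}")
for op, share in time_share.items():
    print(f"{op:6s} {share:.0%}")
```

The time shares show where to invest resources: ALU operations are half of the instructions but only a third of the time.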
  • 49. Organizational Trade-offs
    [Figure: system levels from top to bottom (Application, Programming Language, Compiler, ISA, Datapath and Control, Function Units, Transistors / Wires / Pins), annotated with the three quantities being traded off: instruction mix, CPI, and cycle time]
  • 50. Trade-off Example
    Add register/memory operations:
    – One source operand in memory
    – One source operand in a register
    – Cycle count of 2
    Branch cycle count increases to 3.
    What fraction of the loads must be eliminated for this to pay off?
    Base Machine (Reg / Reg)
    Op        Freq   Cycles
    ALU       50%    1
    Load      20%    2
    Store     10%    2
    Branch    20%    2
  • 51. Example Solution
    Exec Time = Instr Cnt x CPI x Clock
    Base machine:
    Op        Freq   Cycles   Freq x Cycles
    ALU       .50    1        .5
    Load      .20    2        .4
    Store     .10    2        .2
    Branch    .20    2        .4
    Total     1.00            1.5
  • 52. Example Solution
    Exec Time = Instr Cnt x CPI x Clock
    Let X be the fraction of instructions that are loads eliminated by the new Reg/Mem operations.
                Old (Reg / Reg)                New (with Reg/Mem)
    Op          Freq   Cycles  Freq x Cycles   Freq      Cycles  Freq x Cycles
    ALU         .50    1       .5              .5 - X    1       .5 - X
    Load        .20    2       .4              .2 - X    2       .4 - 2X
    Store       .10    2       .2              .1        2       .2
    Branch      .20    2       .4              .2        3       .6
    Reg/Mem                                    X         2       2X
    Total       1.00           1.5             1 - X             1.7 - X
    CPI_New = Cycles_New / Instructions_New = (1.7 - X)/(1 - X); it must be normalized to the new instruction frequency.
  • 54. Example Solution
    Exec Time = Instr Cnt x CPI x Clock
    The old totals are an instruction frequency of 1.00 and a CPI of 1.5; the new totals are an instruction frequency of 1 - X and a CPI of (1.7 - X)/(1 - X). With the clock unchanged, break-even requires:
      $$ \text{Instr Cnt}_{\text{Old}} \times \text{CPI}_{\text{Old}} \times \text{Clock}_{\text{Old}} = \text{Instr Cnt}_{\text{New}} \times \text{CPI}_{\text{New}} \times \text{Clock}_{\text{New}} $$
      $$ 1.00 \times 1.5 = (1 - X) \times \frac{1.7 - X}{1 - X} $$
      $$ 1.5 = 1.7 - X $$
      $$ X = 0.2 $$
    Loads are only 20% of the base mix, so ALL loads must be eliminated for this to be a win!
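A small sketch that evaluates the new design for several values of X, under the same assumptions as the solution (unchanged clock, each eliminated load also removes one ALU instruction and adds one Reg/Mem instruction). The function names are illustrative.

```python
# Cycles per original instruction on the base machine (Instr Cnt x CPI = 1.5).
BASE_CYCLES = 0.50 * 1 + 0.20 * 2 + 0.10 * 2 + 0.20 * 2


def new_cycles(x: float) -> float:
    """Total cycles per original instruction when a fraction x becomes Reg/Mem ops."""
    alu = (0.50 - x) * 1     # ALU ops folded into Reg/Mem ops
    load = (0.20 - x) * 2    # loads folded into Reg/Mem ops
    store = 0.10 * 2
    branch = 0.20 * 3        # branches now take 3 cycles
    reg_mem = x * 2          # new Reg/Mem ops take 2 cycles
    return alu + load + store + branch + reg_mem


def relative_time(x: float) -> float:
    """New execution time divided by old execution time (clock unchanged)."""
    return new_cycles(x) / BASE_CYCLES


for x in (0.0, 0.1, 0.2):
    print(f"X = {x:.1f}: relative time = {relative_time(x):.3f}")
```

The relative time only drops to 1.000 at X = 0.2, which matches the conclusion that every load would have to be eliminated just to break even.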
  • 55. Means
    – Arithmetic mean: can be weighted; represents total execution time
      $$ \frac{1}{n}\sum_{i=1}^{n} T_i $$
    – Harmonic mean:
      $$ \frac{n}{\sum_{i=1}^{n} \dfrac{1}{R_i}} $$
    – Geometric mean: consistent, independent of the reference machine
      $$ \left(\prod_{i=1}^{n} \frac{T_i}{T_{ri}}\right)^{1/n} = \exp\!\left(\frac{1}{n}\sum_{i=1}^{n} \log\frac{T_i}{T_{ri}}\right) $$
    Here $T_i$ is the execution time of program $i$, $T_{ri}$ is its time on the reference machine, and $R_i$ is its rate.
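The "independent of reference" property can be checked numerically. The sketch below uses two machines and two programs with illustrative execution times; the helper names are assumptions of this example.

```python
import math


def geometric_mean(values):
    """exp of the mean of logs, i.e. the n-th root of the product."""
    return math.exp(sum(math.log(v) for v in values) / len(values))


def arithmetic_mean(values):
    return sum(values) / len(values)


def normalize(times, reference):
    """Execution times divided by the reference machine's times, per program."""
    return [t / r for t, r in zip(times, reference)]


# Illustrative execution times in seconds for programs P1 and P2.
machine_a = [1.0, 1000.0]
machine_b = [10.0, 100.0]

for ref_name, ref in (("A", machine_a), ("B", machine_b)):
    for name, times in (("A", machine_a), ("B", machine_b)):
        norm = normalize(times, ref)
        print(f"ref {ref_name}, machine {name}: "
              f"AM = {arithmetic_mean(norm):.2f}, GM = {geometric_mean(norm):.2f}")
```

With A as the reference, the arithmetic mean of normalized times makes B look about five times slower; with B as the reference it makes A look about five times slower. The geometric mean reports 1.00 for both machines either way, which is consistent but says nothing about total execution time.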
  • 56. Which Machine is "Better"?
                        Computer A   Computer B   Computer C
    Program P1 (sec)    1            10           20
    Program P2 (sec)    1000         100          20
    Total time (sec)    1001         110          40
  • 57. Weighted Arithmetic Mean
    Assume three weighting schemes (weights for P1/P2):
    P1/P2        Comp A   Comp B   Comp C
    .5/.5        500.5    55.0     20
    .909/.091    91.82    18.18    20
    .999/.001    2.00     10.09    20
    The weights in the second row are rounded from 10/11 and 1/11.
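The table entries can be recomputed from the program times on the three computers. In this sketch the second weighting scheme is assumed to be exactly 10/11 and 1/11, which is what reproduces the 91.82 entry.

```python
# Execution times in seconds for (P1, P2) on each computer.
times = {"A": (1, 1000), "B": (10, 100), "C": (20, 20)}

# Weighting schemes for (P1, P2); each pair sums to 1.
schemes = {
    ".5/.5": (0.5, 0.5),
    ".909/.091": (10 / 11, 1 / 11),  # assumed exact form of the rounded weights
    ".999/.001": (0.999, 0.001),
}


def weighted_mean(program_times, weights):
    """Weighted arithmetic mean: sum of W_i * T_i."""
    return sum(w * t for w, t in zip(weights, program_times))


for label, weights in schemes.items():
    row = "  ".join(f"{name}: {weighted_mean(t, weights):7.2f}"
                    for name, t in times.items())
    print(f"{label:10s} {row}")
```

Each scheme picks a different winner (C, then B, then A), so the choice of weights decides which machine looks best.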
  • 58. Performance Evaluation
    – Given that sales are a function of performance relative to the competition, companies invest heavily in improving the product as reported by the performance summary
    – Good products are created when there are:
      – Good benchmarks
      – Good ways to summarize performance
    – If the benchmarks or summary are inadequate, the choice is between improving the product for real programs and improving it to get more sales; sales almost always wins!
    – Execution time is the measure of performance