soc 4.1
Chapter 4
Memory Design: SOC and
Board-Based Systems
Computer System Design
System-on-Chip
by M. Flynn & W. Luk
Pub. Wiley 2011 (copyright 2011)
soc 4.2
Cache and Memory
• cache
– performance
– cache partitioning
– multi-level cache
• memory
– off-die memory designs
soc 4.3
Outline for memory design
soc 4.4
Area comparison of memory tech.
soc 4.5
System environments and memory
soc 4.6
Performance factors
Factors:
1. physical word size
• processor ↔ cache
2. block / line size
• cache ↔ memory
3. cache hit time
• cache size, organization
4. cache miss time
• memory and bus
5. virtual-to-real translation time
6. number of processor requests per cycle
soc 4.7
Design target miss rates
beyond 1 MB: doubling the cache size halves the miss rate
soc 4.8
System effects limit hit rate
• the operating system affects the miss ratio
– about 20% increase
• so does multiprogramming (M)
– miss rates may not be affected by increased
cache size
– Q = no. instructions between task switches
soc 4.9
System Effects
• Cold-start
– short transactions are created frequently and run quickly to completion
• Warm-start
– long processes are executed in time slices
soc 4.10
Some common cache types
soc 4.11
Multi-level caches: mostly on die
• useful for matching the processor to memory
– generally at least 2-level
• for microprocessors, L1 runs at the pipeline frequency and L2 at a slower latency
– often 3-level
• size at each level is limited by access time, even as cycle times improve
soc 4.12
Cache partitioning:
scaling effect on cache access time
• access time to a cache is approximately
access time (ns) = (0.35 + 3.8f + (0.006 + 0.025f)C) x (1 + 0.3(1 − 1/A)), where
– f is the feature size in microns
– C is the cache capacity in Kbytes
– A is the associativity, e.g. direct-mapped A = 1
• for example, at f = 0.1 µm, A = 1 and C = 32 KB, the access time is 1.00 ns
• the problem at small feature sizes is cache access time, not cache size
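The access-time approximation above is easy to check numerically. A minimal sketch (the function name is my own; the constants come straight from the slide's formula):

```python
def cache_access_time_ns(f_um, cap_kb, assoc):
    # access time (ns) = (0.35 + 3.8f + (0.006 + 0.025f)C) * (1 + 0.3(1 - 1/A))
    # f_um: feature size in microns, cap_kb: capacity in KB, assoc: associativity
    base = 0.35 + 3.8 * f_um + (0.006 + 0.025 * f_um) * cap_kb
    return base * (1 + 0.3 * (1 - 1 / assoc))

# Slide example: f = 0.1 um, direct-mapped (A = 1), C = 32 KB
print(round(cache_access_time_ns(0.1, 32, 1), 2))   # ~1.0 ns, matching the slide
```

Note how the associativity factor vanishes for a direct-mapped cache (A = 1), so only feature size and capacity matter there.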
soc 4.13
Minimum cache access time
1 array; larger sizes use multiple arrays (interleaving)
– L1: usually less than 64 KB
– L2: usually less than 512 KB (interleaved from smaller arrays)
– L3: multiple 256 KB arrays
soc 4.14
Analysis: multi-level cache miss rate
• L2 cache analysis by statistical inclusion
• if the L2 cache is > 4 x the size of the L1 cache, then
– assume statistically that the contents of L1 lie in L2
• relevant L2 miss rates
– local miss rate: no. L2 misses / no. L2 references
– global miss rate: no. L2 misses / no. processor references
– solo miss rate: no. misses without L1 / no. processor references
– inclusion => solo miss rate = global miss rate
• miss penalty calculation
– L1 miss rate x (miss in L1, hit in L2 penalty) plus
– L2 miss rate x (miss in L1, miss in L2 penalty − L1-to-L2 penalty)
soc 4.15
Multi-level cache example
Miss rates: L1 = 4%, L2 = 1% (global)
Delays:
– miss in L1, hit in L2: 2 cycles
– miss in L1, miss in L2 (to memory): 15 cycles
Assume one reference per instruction:
L1 delay = 1 ref/instr x 0.04 misses/ref x 2 cycles/miss = 0.08 cpi
L2 delay = 1 ref/instr x 0.01 misses/ref x (15 − 2) cycles/miss = 0.13 cpi
Total effect of the 2-level system = 0.08 + 0.13 = 0.21 cpi
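The miss-penalty calculation above can be sketched as a small function (names are my own; the parameters mirror the slide's example):

```python
def two_level_cpi_penalty(refs_per_instr, l1_miss, l2_miss, l2_hit_cycles, mem_cycles):
    # L1 misses that hit in L2 pay the L1->L2 penalty; L2 misses pay the
    # memory penalty minus the L1->L2 part already counted above.
    l1_delay = refs_per_instr * l1_miss * l2_hit_cycles
    l2_delay = refs_per_instr * l2_miss * (mem_cycles - l2_hit_cycles)
    return l1_delay + l2_delay

# Slide example: 1 ref/instr, 4% L1 misses, 1% global L2 misses, 2 and 15 cycles
print(round(two_level_cpi_penalty(1, 0.04, 0.01, 2, 15), 2))   # 0.21 cpi
```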
soc 4.16
Memory design
• logical inclusion
• embedded RAM
• off-die: DRAM
• basic memory model
• Strecker’s model
soc 4.17
Physical memory system
soc 4.18
Hierarchy of caches
Name  Description         Size         Access        Transfer
L0    Registers           <256 words   <1 cycle      word
L1    Core local          <64K         <4 cycles     line
L2    On chip             <64M         <30 cycles    line
L3    DRAM on chip        <1G          <60 cycles    >= line
M0    Off-chip cache
M1    Local main memory   <16G         <150 cycles   >= line
M2    Cluster memory
soc 4.19
Hierarchy of caches
• working set: how much memory an “iteration” requires
• if it fits in a level, then that level gives the worst case
• if it does not, the hit rate typically determines performance
• good rule of thumb: double the cache level size, halve the miss rate
• with a 90% hit rate and memory access 10x the cache access time, performance drops to about 50%
• and that’s for 1 core
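The 90%-hit-rate claim follows from the average-access-time arithmetic. A minimal sketch, assuming a 1-unit cache access and that every miss pays the full memory time:

```python
def perf_relative_to_all_hit(hit_rate, mem_ratio):
    # Average access time vs. an all-hit cache: hits cost 1 unit,
    # misses cost mem_ratio units. Returns relative performance.
    avg = hit_rate * 1 + (1 - hit_rate) * mem_ratio
    return 1 / avg

# 90% hit rate, memory 10x slower: average time 0.9 + 1.0 = 1.9 units
print(round(perf_relative_to_all_hit(0.90, 10), 2))   # ~0.53, i.e. roughly half
```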
soc 4.20
Logical inclusion
• multiprocessors with L1 and L2 caches
– it is important to know when the L1 cache does NOT contain a line
• sufficient to determine this:
– the L2 cache does not have the line
• need to ensure
– all the contents of L1 are always in L2
• this property: Logical Inclusion
soc 4.21
Logical inclusion techniques
• passive
– control cache size, organization, policies
– no. L2 sets ≥ no. L1 sets
– L2 set size ≥ L1 set size
– compatible replacement algorithms
– but: highly restrictive and difficult to guarantee
• active
– whenever a line is replaced or invalidated in the L2
– ensure it is not present in L1, or evict it from L1
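The active technique amounts to back-invalidation: every L2 eviction is forwarded to L1. A toy sketch (sets of line addresses stand in for real cache arrays; all names are illustrative):

```python
class InclusiveCachePair:
    """Toy model of active logical-inclusion enforcement."""

    def __init__(self):
        self.l1 = set()   # line addresses currently in L1
        self.l2 = set()   # line addresses currently in L2

    def fill(self, line):
        # A fill brings the line into both levels.
        self.l2.add(line)
        self.l1.add(line)

    def evict_l2(self, line):
        # Back-invalidate: an L2 eviction forces the line out of L1 too,
        # so L1's contents always remain a subset of L2's.
        self.l2.discard(line)
        self.l1.discard(line)

c = InclusiveCachePair()
c.fill(0x100)
c.fill(0x200)
c.evict_l2(0x100)
assert c.l1 <= c.l2   # logical inclusion holds after the eviction
```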
soc 4.22
Memory system design outline
• memory chip technology
– on-die or off die
• static versus dynamic:
– SRAM versus DRAM
• access protocol: talking to memory
– synchronous vs asynchronous DRAMs
• simple memory performance model
– Strecker’s model for memory banks
soc 4.23
Why BIG memory?
soc 4.24
Memory
• computation is often limited by memory
– not processor organization or cycle time
• memory: characterized by 3 parameters
– size
– access time: latency
– cycle time: bandwidth
soc 4.25
Embedded RAM
soc 4.26
Embedded RAM density (1)
soc 4.27
Embedded RAM density (2)
soc 4.28
Embedded RAM cycle time
soc 4.29
Embedded RAM error rates
soc 4.30
Off-die Memory Module
• a module contains the DRAM chips that make up the physical memory word
• if each DRAM chip is organized as 2^n words x b bits and the memory has p bits per physical word, then the module has p/b DRAM chips
• total memory size is then 2^n words x p bits
• parity or an Error-Correction Code (ECC) is generally required for error detection and availability
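The p/b chip-count arithmetic can be sketched directly. The example geometry below (16M x 4 chips building a 64-bit word) is my own assumption, not from the slides:

```python
def module_geometry(n, b, p):
    # DRAM chip: 2**n words x b bits; module word is p bits wide.
    chips = p // b                 # p/b chips side by side per word
    total_bits = (2 ** n) * p      # 2**n words x p bits
    return chips, total_bits

# Assumed example: 16M x 4 chips (n = 24, b = 4) forming a 64-bit word (p = 64)
chips, bits = module_geometry(24, 4, 64)
print(chips, bits // (8 * 2**20))   # 16 chips, 128 MB module
```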
soc 4.31
Simple asynchronous DRAM array
• DRAM cell
– capacitor: stores charge for the 0/1 state
– transistor: switches the capacitor onto the bit line
– charge decays => refresh required
• DRAM array
– stores 2^n bits in a square array
– 2^(n/2) row lines connect to data lines
– 2^(n/2) column bit lines connect to sense amplifiers
soc 4.32
DRAM basics
• Row read is destructive
• Sequence
– Read row into SRAM from dynamic
memory(>1000 bits)
– Select word (<64 bits)
– Write Word into row (writing)
– Repeat till done with row
– WRITE back row into dynamic memory
soc 4.33
DRAM timing
• row and column addresses muxed
• row and column Strobes for timing
soc 4.34
Increase DRAM bandwidth
• Burst Mode
– aka page mode, nibble mode, fast page mode
• Synchronous DRAM (SDRAM)
• DDR SDRAM
– DDR1
– DDR2
– DDR3
soc 4.35
DDR SDRAM
(Double Data Rate Synchronous DRAM)
soc 4.36
Burst mode
• burst mode
– save the most recently accessed row (“page”)
– only need a column address + CAS to access within the page
• most DDR SDRAMs: multiple rows can be open
– an address counter in each row buffer supports sequential accesses
– only need CAS (DRAM) or the bus clock (SDRAM) for sequential accesses
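The payoff of keeping a row open is easy to see in a toy timing model. The cycle counts below are illustrative assumptions, not real device parameters:

```python
# Toy open-row (page/burst mode) model: a column access to the open row
# costs only CAS; any other row pays precharge + RAS + CAS first.
T_RAS, T_CAS, T_PRE = 3, 2, 2   # assumed, illustrative cycle counts

def access(open_row, row):
    if row == open_row:
        return T_CAS, row                 # row-buffer hit: CAS only
    return T_PRE + T_RAS + T_CAS, row     # row miss: open the new row

open_row, total = None, 0
for r in [5, 5, 5, 9]:                    # three accesses to one row, then a miss
    cost, open_row = access(open_row, r)
    total += cost
print(total)   # 7 + 2 + 2 + 7 = 18 cycles
```

Without an open-row buffer every access would pay the full 7 cycles (28 total), which is the bandwidth gap burst mode closes for sequential references.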
soc 4.37
Configuration parameters
Parameters for typical DRAM chips used in a 64-bit module
soc 4.38
DRAM timing
soc 4.39
Physical memory system
soc 4.40
Basic memory model
• assume n processors
– each makes 1 request per Tc to one of m memories
• B(m,n)
– number of successful requests per cycle
• Tc
– memory cycle time
• one processor making n requests per Tc
– behaves as n processors making 1 request per Tc
soc 4.41
Achieved vs. offered bandwidth
• offered request rate
– rate at which processor(s) would make requests if
memory had unlimited bandwidth and no contention
soc 4.42
Basic terms
• B = B(m,n) or B(m)
– number of requests that succeed each Tc
(= average number of busy modules)
– B: bandwidth normalized to Tc
• Ts: more generalized term for service time
– Tc = Ts
• BW: achieved bandwidth
– in requests serviced per second
– BW = B / Ts = B(m,n)/ Ts
soc 4.43
Modeling + evaluation methodology
• relevant physical parameters for memory
– word size
– module size
– number of modules
– cycle time Tc (=Ts)
• find the offered Bandwidth
– number of requests/Ts
• find the bottleneck
– performance limited by most restrictive service point
soc 4.44
Strecker’s model: compute B(m,n)
• model description
– each processor generates 1 reference per cycle
– requests randomly/uniformly distributed over modules
– each busy module serves 1 request per cycle
– all unserviced requests are dropped each cycle
– assume there are no queues
• B(m,n) = m[1 − (1 − 1/m)^n]
• relative performance Prel = B(m,n) / n
soc 4.45
Deriving Strecker’s model
• Prob[a given processor does not reference a module] = 1 − 1/m
• Prob[no processor references the module] = P[idle] = (1 − 1/m)^n
• Prob[module busy] = 1 − (1 − 1/m)^n
• the average number of busy modules is B(m,n)
• B(m,n) = m[1 − (1 − 1/m)^n]
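Strecker's formula is a one-liner in code. A minimal sketch (function names are my own), checked against the slides' Example 2 numbers:

```python
def strecker_bandwidth(m, n):
    # B(m, n) = m * (1 - (1 - 1/m)**n): expected number of busy modules
    # when n requests are spread uniformly over m modules, no queueing.
    return m * (1 - (1 - 1 / m) ** n)

def prel(m, n):
    # Relative performance: achieved bandwidth over offered requests.
    return strecker_bandwidth(m, n) / n

# Example 2 from the slides: m = 8 banks, n = 1.2 requests per cycle
print(round(strecker_bandwidth(8, 1.2), 2))   # ~1.18, as on the slide
```

Prel(8, 1.2) comes out near 0.98, matching Example 2; note that B(m, n) approaches m as n grows, which is the saturation behavior the model captures.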
soc 4.46
Example 1
• 2 dual-core processor dice share memory
– Ts = 24 ns
• each die has 2 processors
– sharing a 4 MB L2
– miss rate is 0.001 misses per reference
– each processor makes 3 references/cycle @ 4 GHz
2 x 2 x 3 x 0.001 = 0.012 refs/cycle
Ts = 4 GHz x 24 ns = 96 cycles
n = 0.012 x 96 = 1.152 processor requests per Ts; if m = 4,
success rate B(m,n) = B(4, 1.152) = 0.81
relative performance = B/n = 0.81/1.152 = 0.7
soc 4.47
Example 2
• 8-way interleaved associative data cache
• processor issues 2LD/ST per cycle
– each processor: data reference per cycle = 0.6
– n = 2 ; m = 8
– B(m,n) = B(8,1.2) = 1.18
• Relative Performance = B/n = 1.18/1.2 = 0.98
soc 4.48
Summary
• cache
– performance, cache partitioning, multi-level cache
• memory chip technology
– on-die or off die
• static versus dynamic:
– SRAM versus DRAM
• access protocol: talking to memory
– synchronous vs asynchronous DRAMs
• simple memory performance model
– Strecker’s model for memory banks