CACHE MEMORY
By
Anand Goyal
2010C6PS648
Memory Hierarchy
 Computer memory is organized in a hierarchy. This is done to cope with the speed of the processor and hence increase performance.
 Closest to the processor are the processor registers. Then comes the cache memory, followed by main memory.
SRAM and DRAM
 Both are random access memories and are volatile, i.e. a constant power supply is required to avoid data loss.
 DRAM :- each cell is made up of a capacitor and a transistor. The transistor acts as a switch, and data in the form of charge is stored on the capacitor. Requires periodic refreshing to maintain the stored data. Lower cost per bit, so less expensive. Used for large memories.
 SRAM :- each cell is made up of four transistors, cross-connected in an arrangement that produces a stable logic state. Higher cost per bit, so more expensive. Used for small memories.
Principles of Locality
 Since programs access only a small portion of their address space at any given instant, two properties are exploited to increase performance :-
 A) Temporal Locality :- locality in time, i.e. if an item is referenced, it will tend to be referenced again soon.
 B) Spatial Locality :- locality in space, i.e. if an item is referenced, its neighboring items will tend to be referenced soon.
Mapping Functions
 There are three main types of memory
mapping functions :-
 1) Direct Mapped
 2) Fully Associative
 3) Set Associative
 For the coming explanations, let us
assume 1GB main memory, 128KB
Cache memory and Cache line size
32B.
Direct Mapping
TAG (s – r) | LINE or SLOT (r) | OFFSET (w)
• Each memory block is mapped to a single cache line. For the purpose of cache access, each main memory address can be viewed as consisting of three fields.
• No two blocks that map to the same line have the same Tag field.
• The cache contents are checked by using the Line field to select a line and then comparing that line's Tag with the Tag field of the address.
 For the given example, we have –
 1GB main memory = 2^30 bytes
 Cache size = 128KB = 2^17 bytes
 Block size = 32B = 2^5 bytes
 No. of cache lines = 2^17/2^5 = 2^12, thus 12 bits are required to locate the 2^12 lines.
 Also, the offset within a 2^5-byte block requires 5 bits to locate an individual byte.
 Thus Tag bits = 30 – 12 – 5 = 13 bits
TAG (13) | LINE (12) | OFFSET (5)
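To make the bit arithmetic concrete, here is a small sketch (not part of the original slides; the example address value is made up) that splits a 30-bit address into the tag, line and offset fields derived above:

```python
# Direct-mapped address breakdown for the example cache:
# 1 GB main memory (2^30 bytes), 128 KB cache, 32 B lines.

OFFSET_BITS = 5    # 32 B line  -> 2^5
LINE_BITS   = 12   # 2^12 lines -> 128 KB / 32 B
ADDR_BITS   = 30   # 1 GB main memory -> 2^30 bytes

def split_direct_mapped(addr: int):
    offset = addr & ((1 << OFFSET_BITS) - 1)
    line   = (addr >> OFFSET_BITS) & ((1 << LINE_BITS) - 1)
    tag    = addr >> (OFFSET_BITS + LINE_BITS)   # remaining 13 bits
    return tag, line, offset

if __name__ == "__main__":
    addr = 0x12345678 & ((1 << ADDR_BITS) - 1)   # keep it a 30-bit address
    tag, line, offset = split_direct_mapped(addr)
    print(f"tag={tag:#x} line={line:#x} offset={offset:#x}")
```

The line index is simply the block number modulo the number of cache lines, which matches the mapping function I = J modulo M summarized on the next slide.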
Summary
 Address length = (s + w) bits
 Number of addressable units = 2^(s+w) words or bytes
 Block size = line size = 2^w words or bytes
 No. of blocks in main memory = 2^(s+w)/2^w = 2^s
 Number of lines in cache = m = 2^r
 Size of tag = (s – r) bits
 Mapping Function
 The Jth block of main memory maps to the Ith cache line
 I = J modulo M (M = no. of cache lines)
Pros and Cons
 Simple
 Inexpensive
 Fixed location for given block
 If a program accesses 2 blocks that
map to the same line repeatedly,
cache misses (conflict misses) are
very high
Fully Associative Mapping
 A main memory block can load into any line of the cache
 The memory address is interpreted as tag and word
 Tag uniquely identifies a block of memory
 Every line’s tag is examined for a match
 Cache searching gets expensive, and power consumption is higher, due to the parallel comparators
TAG (s) | OFFSET (w)
Fully Associative Cache
Organization
 For the given example, we have –
 1GB main memory = 2^30 bytes
 Cache size = 128KB = 2^17 bytes
 Block size = 32B = 2^5 bytes
 Here, the offset within a 2^5-byte block requires 5 bits to locate an individual byte.
 Thus Tag bits = 30 – 5 = 25 bits
TAG (25) | OFFSET (5)
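A minimal sketch of a fully associative lookup (my own illustration; real hardware compares all tags in parallel, the loop below only models the behaviour):

```python
# Fully associative lookup: every line's tag is compared with the address tag.

OFFSET_BITS = 5

def fa_lookup(tags, addr):
    """tags: list of (valid, tag) pairs, one per cache line."""
    addr_tag = addr >> OFFSET_BITS          # everything above the offset is tag
    for index, (valid, tag) in enumerate(tags):
        if valid and tag == addr_tag:
            return index                    # hit: the block may live in any line
    return None                             # miss: any line may be chosen as victim
```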
Fully Associative Mapping
Summary
 Address length = (s + w) bits
 Number of addressable units = 2^(s+w) words or bytes
 Block size = line size = 2^w words or bytes
 No. of blocks in main memory = 2^(s+w)/2^w = 2^s
 Number of lines in cache = total number of blocks in the cache
 Size of tag = s bits
Pros and Cons
 There is flexibility as to which block to
replace when a new block is read into
the cache
 The complex circuitry required for
parallel Tag comparison is however a
major disadvantage.
Set Associative Mapping
 Cache is divided into a number of sets
 Each set contains a number of lines
 A given block maps to any line in a
given set. e.g. Block B can be in any
line of set i
 If 2 lines per set,
 2-way associative mapping
 A given block can be in one of 2 lines in only one set
TAG (s – d) | SET (d) | OFFSET (w)
K-Way Set Associative
Organization
 For the given example, we have –
 1GB main memory = 2^30 bytes
 Cache size = 128KB = 2^17 bytes
 Block size = 32B = 2^5 bytes
 Let it be a 2-way set associative cache.
 No. of sets = 2^17/(2 × 2^5) = 2^11, thus 11 bits are required to locate the 2^11 sets, each set containing 2 lines.
 Also, the offset within a 2^5-byte block requires 5 bits to locate an individual byte.
 Thus Tag bits = 30 – 11 – 5 = 14 bits
TAG (14) | SET (11) | OFFSET (5)
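The sketch below (illustrative; the data-structure layout is an assumption of mine) decomposes an address with these field widths and searches only the two lines of the selected set:

```python
# 2-way set associative example: 128 KB cache, 32 B lines, 2^11 sets.

OFFSET_BITS = 5
SET_BITS    = 11
WAYS        = 2

def split_set_associative(addr: int):
    offset = addr & ((1 << OFFSET_BITS) - 1)
    set_index = (addr >> OFFSET_BITS) & ((1 << SET_BITS) - 1)
    tag = addr >> (OFFSET_BITS + SET_BITS)      # remaining 14 bits
    return tag, set_index, offset

def sa_lookup(sets, addr):
    """sets: list of 2^11 sets, each a list of WAYS (valid, tag) pairs."""
    tag, set_index, _ = split_set_associative(addr)
    for way, (valid, line_tag) in enumerate(sets[set_index]):
        if valid and line_tag == tag:           # only the lines of one set are searched
            return set_index, way
    return None                                 # miss within the selected set
```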
Set Associative Mapping
Summary
 Address length = (s + w) bits
 Number of addressable units = 2^(s+w) words or bytes
 Block size = line size = 2^w words or bytes
 Number of blocks in main memory = 2^s
 Number of lines in set = k
 Number of sets = v = 2^d
 Number of lines in cache = k·v = k × 2^d
 Size of tag = (s – d) bits
 Mapping Function
 The Jth block of main memory maps to the Ith set
 I = J modulo v (v = no. of sets)
 Within the set, the block can be mapped to any cache line.
Pros and Cons
 After simulating the hit ratio for direct mapped and (2, 4, 8-way) set associative mapped caches, we observe a significant difference in performance at least up to a cache size of 64KB, set associative being the better one.
 However, beyond that, the complexity of the cache increases in proportion to the associativity, and both mappings give approximately similar hit ratios.
N-way Set Associative Cache
Vs. Direct Mapped Cache:
 N comparators Vs 1
 Extra mux delay for the data
 Data comes after hit/miss
 In a direct-mapped cache, the cache block is available before the hit/miss is determined
 Number of misses
 DM > SA > FA
 Access latency : the time to perform a read or write operation, i.e. the time from the instant the address is presented to memory to the instant the data have been stored or made available
 DM < SA < FA
Types of Misses
Compulsory Misses :-
 When a program is started, the cache is completely empty, and hence the first access to a block will always be a miss, as the block has to be brought into the cache from memory, at least the first time.
 Also called first reference misses. They cannot be avoided easily.
Capacity Misses
 Occur because the cache cannot hold all the blocks needed during the execution of a program.
 Thus these misses occur due to blocks being discarded and later retrieved.
 They occur because the cache is limited in size.
 For a fully associative cache, this is the major source of misses.
Conflict Misses
 Occur because multiple distinct memory locations map to the same cache location.
 Thus, in a DM or SA cache, they occur because blocks are discarded and later retrieved.
 In DM, this is a repeated phenomenon, as two blocks which map to the same cache line can be accessed alternately, thereby decreasing the hit ratio.
 This phenomenon is called thrashing.
Solutions to reduce misses
 Capacity Misses :-
◦ Increase cache size
◦ Re-structure the program
 Conflict Misses :-
◦ Increase cache size
◦ Increase associativity
Coherence Misses
 Occur when other processors update
memory which in turn invalidates the
data block present in other
processor’s cache.
Replacement Algorithms
 For a Direct Mapped cache, since each block maps to only one line, we have no choice but to replace that line itself
 Hence there isn’t any replacement policy for DM.
 For SA and FA, a few replacement policies are :-
◦ Optimal
◦ Random
◦ Arrival
◦ Frequency
◦ Recently Used
Optimal
This is the ideal benchmarking
replacement strategy.
 All other policies are compared to it.
 This is not implemented, but used just
for comparison purposes.
Random
 Block to be replaced is randomly
picked
 Minimum hardware complexity – just a
pseudo random number generator
required.
 Access time is not affected by the
replacement circuit.
 Not suitable for high performance
systems
Arrival - FIFO
 For an N-way set associative cache
 Implementation 1
 Use an N-bit register per cache line to store arrival time information
 On a cache miss – the registers of all cache lines in the set are compared to choose the victim cache line
 Implementation 2
 Maintain a FIFO queue
 Register with (log2 N) bits per cache line
 On a cache miss – the cache line whose register value is 0 will be the victim
 Decrement all other registers in the set by 1 and set the victim register to N-1
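A rough model of Implementation 2 (the counter-update details are my reading of the slide, not a definitive design):

```python
# FIFO replacement with a log2(N)-bit arrival counter per line.
# Counter 0 marks the oldest line (the victim); the victim is reloaded
# with N-1 and every other counter ages by one.

N = 4                                # 4-way set associative

def fifo_victim(counters):
    """counters: list of N arrival counters for one set."""
    victim = counters.index(0)       # line that arrived earliest
    for i in range(N):
        if i == victim:
            counters[i] = N - 1      # newest arrival
        else:
            counters[i] -= 1         # everyone else ages by one
    return victim

if __name__ == "__main__":
    counters = [0, 1, 2, 3]          # line 0 is the oldest
    print(fifo_victim(counters), counters)   # -> 0 [3, 0, 1, 2]
```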
FIFO : Advantages &
Disadvantages
 Advantages
 Low hardware complexity
 Better cache hit performance than Random replacement
 The cache access time is not affected by the replacement strategy (not in the critical path)
 Disadvantages
 Cache hit performance is poor compared to LRU and frequency-based replacement schemes
 Not suitable for high performance systems
 Replacement circuit complexity increases with increase in associativity
Frequency – Least Frequently
Used
 Requires a register per cache line to save the number of references (frequency count)
 If a cache access is a hit, the frequency count of the corresponding register is increased by 1
 On a cache miss, the victim cache line is the one with the minimum frequency count in the set
 The register corresponding to the victim cache line is reset to 0
 LFU cannot differentiate between blocks referenced heavily in the past and those referenced recently
Least Frequently Used –
Dynamic Aging (LFU-DA)
 When any frequency count register in the set reaches its maximum value, all the frequency count registers in that set are shifted one position right (divided by 2)
 The rest is the same as LFU
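A compact sketch of LFU with the LFU-DA aging rule applied on counter saturation (my own minimal model; the register width is an assumed value):

```python
# LFU replacement with dynamic aging: each line keeps a frequency counter,
# and when a counter saturates, all counters in the set are halved.

MAX_COUNT = 255                                  # assumed 8-bit frequency register

def lfu_da_access(freq, hit_way=None):
    """freq: list of frequency counters for one set.
    hit_way: index of the line that hit, or None on a miss.
    Returns the way that was used or chosen as victim."""
    if hit_way is not None:
        freq[hit_way] += 1
        if freq[hit_way] >= MAX_COUNT:           # LFU-DA: age the whole set
            for i in range(len(freq)):
                freq[i] >>= 1
        return hit_way
    victim = freq.index(min(freq))               # line with the lowest count
    freq[victim] = 0                             # reset counter for the new block
    return victim
```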
LFU : Advantages &
Disadvantages
 Advantages
 For small and medium caches, LFU works better than FIFO and Random replacement
 Suitable for high performance systems whose memory access pattern follows frequency order
 Disadvantages
 The register must be updated on every cache access
 This affects the critical path
 The replacement circuit becomes more complicated as associativity increases
Least Recently Used Policy
 Most widely used replacement
strategy
 Replaces the least recently used
cache line
 Implemented by two techniques :-
◦ Square Matrix Implementation
◦ Counter Implementation
Square Matrix Implementation
 N^2 bits per set (D flip-flops) to store the LRU information
 The cache line corresponding to the row with all zeros is the victim cache line for replacement
 On a cache hit, all the bits in the corresponding row are set to 1 and then all the bits in the corresponding column are set to 0
 On a cache miss, a priority encoder selects the cache line corresponding to the row with all zeros for replacement
 Used when associativity is low
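The following sketch models the matrix update for a 4-way set (illustrative only; hardware would realize this with flip-flops and a priority encoder):

```python
# Square-matrix LRU: on an access to way i, set row i to all ones,
# then clear column i. The all-zero row marks the LRU way.

N = 4

def matrix_access(m, way):
    """m: N x N list of 0/1 bits for one set; way: index that was accessed."""
    for j in range(N):
        m[way][j] = 1          # set the whole row
    for i in range(N):
        m[i][way] = 0          # clear the whole column

def matrix_victim(m):
    for i in range(N):
        if all(bit == 0 for bit in m[i]):   # all-zero row -> least recently used
            return i
    return 0

if __name__ == "__main__":
    m = [[0] * N for _ in range(N)]
    for way in (2, 0, 3, 1):                # access order
        matrix_access(m, way)
    print(matrix_victim(m))                 # -> 2, the least recently used way
```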
Matrix Implementation – 4 way
set Associative Cache
Counter Implementation
 N registers with log2 N bits each for an N-way set associative cache; thus N·log2 N bits are used
 One register per cache line
 The cache line corresponding to counter value 0 is the victim cache line for replacement
 On a hit, all cache lines with a counter greater than that of the hit cache line are decremented by 1, and the hit cache line's counter is set to N-1
 On a miss, the cache line whose count value is 0 is replaced; all other counters in the set are decremented by 1 and the new line's counter is set to N-1
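A small model of the counter scheme described above (illustrative; counters hold a permutation of 0..N-1, with 0 marking the LRU line):

```python
# Counter-based LRU for an N-way set.

N = 4

def lru_counter_access(count, hit_way=None):
    """count: list of N LRU counters for one set (N-1 = most recent, 0 = LRU)."""
    if hit_way is not None:
        old = count[hit_way]
        for i in range(N):
            if count[i] > old:
                count[i] -= 1            # everything "newer" than the hit ages by one
        count[hit_way] = N - 1           # the hit line becomes most recently used
        return hit_way
    victim = count.index(0)              # miss: counter 0 marks the LRU line
    for i in range(N):
        if i != victim:
            count[i] -= 1
    count[victim] = N - 1                # newly loaded block is most recent
    return victim
```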
Look Policy
Look Through : access the cache; if the data is not found, access the lower level
Look Aside : send the request to the cache and its lower level at the same time
Write Policy
Need of a Write Policy :-
 A block in the cache might have been updated, but the corresponding update in main memory might not have been done
 Multiple CPUs have individual caches, and a write in one can invalidate the data in another processor's cache
 I/O may be able to read and write directly into main memory
Write Through
 In this technique, all write operations are made to main memory as well as to the cache, ensuring MM is always valid.
 Any other processor–cache module may monitor traffic to MM to maintain consistency.
DISADVANTAGE
 It generates memory traffic and may create a bottleneck.
 Bottleneck : a delay in the transmission of data due to limited bandwidth, so information is not relayed at the speed at which it is processed.
Pseudo Write Through
 Also called Write Buffer
 Processor writes data into the cache
and the write buffer
 Memory controller writes contents of
the buffer to memory
 FIFO (typical number of entries 4)
 After write is complete, buffer is
flushed
Write Back
 In this technique, updates are made only in the cache.
 When an update is made, a dirty bit (or use bit) associated with the line is set
 When a block is replaced, it is written back into main memory iff its dirty bit is set
 Thus it minimizes memory writes
DISADVANTAGE
 Portions of MM are still invalid, hence I/O should be allowed access only through the cache
 This requires complex circuitry and creates a potential bottleneck
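A toy model of the write-back policy with a dirty bit (illustrative only, not any particular hardware; the class and memory layout are my own simplification):

```python
# Write-back: writes stay in the cache; memory is updated only when a
# dirty line is evicted.

class WriteBackLine:
    def __init__(self):
        self.valid = False
        self.dirty = False
        self.tag = None
        self.data = None

def write(line, tag, data):
    line.valid, line.tag, line.data = True, tag, data
    line.dirty = True                       # update stays local to the cache

def evict(line, memory):
    if line.valid and line.dirty:
        memory[line.tag] = line.data        # write back only if the dirty bit is set
    line.valid = line.dirty = False
```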
Cache Coherency
This is required only in case of
multiprocessors where each CPU has
its own cache
Why is it needed ?
 Whatever the write policy, if data is modified in one cache, the copies held in other caches become invalid if they hold the same data
 Hence we need to maintain cache coherency to obtain correct results
Approaches towards Cache
Coherency
1) Bus watching write through :
 Cache controller monitors writes into
shared memory that also resides in
the cache memory
 If any writes are made, the controller
invalidates the cache entry
 This approach depends on the use of a write-through policy
2) Hardware Transparency :-
 Additional hardware ensures that all updates to main memory via a cache are reflected in all caches
3) Non Cacheable memory :-
 Only a portion of main memory is shared by more than one processor, and this is designated as non-cacheable.
 Here, all accesses to shared memory are cache misses, as it is never copied to the cache
Cache Optimization
 Reducing the miss penalty
1. Multi level caches
2. Critical word first
3. Priority to Read miss over writes
4. Merging write buffers
5. Victim caches
Multilevel Cache
 The inclusion of an on-chip cache leaves open the question of whether an additional external cache is still desirable.
 The answer is yes! The reasons are :
◦ If there is no L2 cache and the processor makes a request for a memory location not in the L1 cache, it accesses DRAM or ROM. Due to the relatively slower bus speed, performance degrades.
◦ Whereas, if an L2 SRAM cache is included, the frequently missed information can be quickly retrieved. Also, SRAM is fast enough to match the bus speed, giving zero-wait-state transactions.
 The L2 cache does not use the system bus as the path for transfers between L2 and the processor, but a separate data path, to reduce the burden on the bus
 A series of simulations has shown that the L2 cache is most efficient when it is at least double the size of the L1 cache, as otherwise its contents will be similar to those of L1
 Due to the continued shrinkage of processor components, many processors can accommodate the L2 cache on chip, giving rise to the opportunity to include an L3 cache
 The only disadvantage of a multilevel cache is that it complicates the design.
Cache Performance
 Average memory access time = Hit time(L1) + Miss rate(L1) × (Hit time(L2) + Miss rate(L2) × Miss penalty(L2))
 Average memory stalls per instruction = Misses per instruction(L1) × Hit time(L2) + Misses per instruction(L2) × Miss penalty(L2)
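A worked example of the first formula with made-up numbers (the slides give only the formula itself; all values below are assumptions for illustration):

```python
# Two-level cache AMAT with hypothetical latencies and miss rates.

hit_time_L1     = 1      # cycles
miss_rate_L1    = 0.05
hit_time_L2     = 10     # cycles
miss_rate_L2    = 0.20   # local miss rate of L2
miss_penalty_L2 = 100    # cycles to main memory

amat = hit_time_L1 + miss_rate_L1 * (hit_time_L2 + miss_rate_L2 * miss_penalty_L2)
print(amat)   # 1 + 0.05 * (10 + 0.20 * 100) = 2.5 cycles
```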
Unified Vs Split Cache
 Earlier, the same cache was used for data as well as instructions, i.e. a Unified Cache
 Now we have separate caches for data and instructions, i.e. a Split Cache
 Thus, if the processor attempts to fetch an instruction from main memory, it first consults the instruction L1 cache, and similarly for data.
Advantages of Unified Cache
 It balances load between data and
instructions automatically.
 That is, if execution involves more
instruction fetches, the cache will tend
to fill up with instructions, and if
execution involves more of data
fetches, the cache tends to fill up with
data.
 Only one cache needs to be designed
Advantages of Split Cache
 Useful in parallel instruction execution and pre-fetching of predicted future instructions
 Eliminates contention between the instruction fetch/decode unit and the execution unit, thereby supporting pipelining
 The processor will fetch instructions ahead of time and fill the buffer, or pipeline
 E.g. superscalar machines such as the Pentium and PowerPC
Critical Word First
 This policy involves sending the requested word first and then transferring the rest, thus getting the needed data to the processor in the 1st cycle.
 Assume that 1 block = 16 bytes and 1 cycle transfers 4 bytes. Thus at least 4 cycles are required to transfer the block.
 If the processor demands the 2nd word, why should we wait for the entire block to be transferred? We can first send that word and then the rest of the block with the remaining bytes.
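A tiny sketch of the resulting transfer order (wrap-around fill; the function name is mine):

```python
# Critical-word-first transfer order for a 16-byte block moved 4 bytes per cycle.

BLOCK_WORDS = 4          # 16 B block / 4 B per transfer

def transfer_order(requested_word):
    """Order in which the words of the block are sent, starting with
    the word the processor actually asked for."""
    return [(requested_word + i) % BLOCK_WORDS for i in range(BLOCK_WORDS)]

print(transfer_order(2))   # -> [2, 3, 0, 1]: requested word first, rest wrap around
```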
Priority to read miss over
writes
Write Buffer:
 Using write buffers can create RAW conflicts with reads on cache misses
 If we simply wait for the write buffer to empty, the read miss penalty increases by about 50%
 Check the contents of the write buffer on a read miss; if there are no conflicts and the memory system is available, allow the read miss to continue. If there is a conflict, flush the buffer before the read
Write Back?
 Read miss replacing a dirty block
 Normal: write the dirty block to memory, and then do the read
 Instead: copy the dirty block to a write buffer, then do the read, and perform the write afterwards
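A simplified model of the conflict check described above (my own sketch; real designs may instead forward matching data directly from the buffer):

```python
# Give a read miss priority over buffered writes, but only when it does not
# conflict with a pending write; otherwise drain the buffer first.

def service_read_miss(read_addr, write_buffer, memory):
    """write_buffer: list of (addr, data) entries waiting to be written."""
    if any(addr == read_addr for addr, _ in write_buffer):
        # RAW hazard: flush pending writes before reading
        for addr, data in write_buffer:
            memory[addr] = data
        write_buffer.clear()
    return memory.get(read_addr)    # read proceeds ahead of unrelated writes
```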
Victim Cache
 How can we combine the fast hit time of DM with reduced conflict misses?
 Add a small fully associative buffer (cache) to hold data discarded from the cache – a Victim Cache
 A small fully associative cache is used for collecting spilled-out data
 Blocks that are discarded because of a miss (victims) are stored in the victim cache, which is checked on a cache miss.
 If found, swap the data block between the victim cache and the main cache
 Replacement always happens with the LRU block of the victim cache. The block that we want to transfer is made MRU.
 Then, the block evicted from the main cache comes to the victim cache and is made MRU.
 The block which was transferred to the main cache is now marked LRU
 If there is a miss in the victim cache as well, then MM is referred.
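A sketch of a victim-cache lookup with the swap behaviour described above (illustrative; the entry count and data structure are assumptions of mine):

```python
# Small fully associative victim cache: on a main-cache miss that hits here,
# the blocks are swapped; on a miss in both, main memory must be referenced.

from collections import OrderedDict

class VictimCache:
    def __init__(self, entries=4):
        self.store = OrderedDict()          # insertion order: first item is oldest
        self.entries = entries

    def lookup_and_swap(self, tag, evicted_from_main):
        """Return the block for `tag` if present, swapping in the (tag, block)
        pair just evicted from the main cache; otherwise return None."""
        if tag in self.store:
            block = self.store.pop(tag)                 # hit in victim cache
            self.insert(*evicted_from_main)             # main-cache victim becomes MRU
            return block
        self.insert(*evicted_from_main)                 # miss: still keep the victim
        return None                                     # caller must go to main memory

    def insert(self, tag, block):
        if len(self.store) >= self.entries:
            self.store.popitem(last=False)              # evict the oldest victim entry
        self.store[tag] = block                         # newly inserted block is MRU
```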
Cache Optimization
 Reducing hit time
1. Small and simple caches
2. Way prediction cache
3. Trace cache
4. Avoid Address translation during
indexing of the cache
Cache Optimization
 Reducing miss rate
1)Changing cache configurations
2)Compiler optimization
Cache Optimization
 Reducing miss penalty or miss rate via parallelism
1)Hardware prefetching
2)Compiler prefetching
Cache Optimization
 Increasing cache bandwidth
1)Pipelined cache,
2)Multi-banked cache
3)Non-blocking cache