CACHE MEMORY
By
Anand Goyal
2010C6PS648
Memory Hierarchy
 Computer memory is organized in a hierarchy. This is done to cope with the speed of the processor and hence increase performance.
 Closest to the processor are the processor registers. Then comes the cache memory, followed by main memory.
SRAM and DRAM
 Both are random access memories and are volatile, i.e. a constant power supply is required to avoid data loss.
 DRAM :- each cell is made up of a capacitor and a transistor. The transistor acts as a switch, and data in the form of charge is stored on the capacitor. Requires periodic refreshing to maintain the stored data. Lower cost per bit, so less expensive. Used for large memories.
 SRAM :- each cell is made up of four transistors, cross-connected in an arrangement that produces a stable logic state. Higher cost per bit, so more expensive. Used for small memories.
Principles of Locality
 Since programs access only a small portion of their address space at any given instant, two properties are exploited to increase performance :-
 A) Temporal Locality :- locality in time, i.e. if an item is referenced, it will tend to be referenced again soon.
 B) Spatial Locality :- locality in space, i.e. if an item is referenced, its neighboring items will tend to be referenced soon.
Mapping Functions
 There are three main types of memory
mapping functions :-
 1) Direct Mapped
 2) Fully Associative
 3) Set Associative
 For the coming explanations, let us
assume 1GB main memory, 128KB
Cache memory and Cache line size
32B.
Direct Mapping
TAG (s – r) | LINE or SLOT (r) | OFFSET (w)
• Each memory block is mapped to a single cache line. For the purpose of cache access, each main memory address can be viewed as consisting of three fields.
• No two blocks that map to the same line have the same Tag field.
• The cache contents are checked by using the Line field to select a line and then comparing that line's Tag with the Tag field of the address.
 For the given example, we have –
 1GB main memory = 2^30 bytes
 Cache size = 128KB = 2^17 bytes
 Block size = 32B = 2^5 bytes
 No. of cache lines = 2^17/2^5 = 2^12, thus 12 bits are required to locate the 2^12 lines.
 Also, the offset within a 2^5-byte block requires 5 bits to locate an individual byte.
 Thus Tag bits = 30 – 12 – 5 = 13 bits
TAG (13) | LINE (12) | OFFSET (5)
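To make the bit arithmetic concrete, here is a small sketch (not part of the original slides; the example address value is made up) that splits a 30-bit address into the tag, line and offset fields derived above:

```python
# Direct-mapped address breakdown for the example cache:
# 1 GB main memory (2^30 bytes), 128 KB cache, 32 B lines.

OFFSET_BITS = 5    # 32 B line  -> 2^5
LINE_BITS   = 12   # 2^12 lines -> 128 KB / 32 B
ADDR_BITS   = 30   # 1 GB main memory -> 2^30 bytes

def split_direct_mapped(addr: int):
    offset = addr & ((1 << OFFSET_BITS) - 1)
    line   = (addr >> OFFSET_BITS) & ((1 << LINE_BITS) - 1)
    tag    = addr >> (OFFSET_BITS + LINE_BITS)   # remaining 13 bits
    return tag, line, offset

if __name__ == "__main__":
    addr = 0x12345678 & ((1 << ADDR_BITS) - 1)   # keep it a 30-bit address
    tag, line, offset = split_direct_mapped(addr)
    print(f"tag={tag:#x} line={line:#x} offset={offset:#x}")
```

The line index is simply the block number modulo the number of cache lines, which matches the mapping function I = J modulo M summarized on the next slide.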
Summary
 Address length = (s + w) bits
 Number of addressable units = 2^(s+w) words or bytes
 Block size = line size = 2^w words or bytes
 No. of blocks in main memory = 2^(s+w)/2^w = 2^s
 Number of lines in cache = m = 2^r
 Size of tag = (s – r) bits
 Mapping Function
 The Jth block of main memory maps to the Ith cache line
 I = J modulo M (M = no. of cache lines)
Pros and Cons
 Simple
 Inexpensive
 Fixed location for given block
 If a program accesses 2 blocks that
map to the same line repeatedly,
cache misses (conflict misses) are
very high
Fully Associative Mapping
 A main memory block can load into any line of the cache
 The memory address is interpreted as tag and word
 Tag uniquely identifies a block of memory
 Every line’s tag is examined for a match
 Cache searching gets expensive, and power consumption is higher, due to the parallel comparators
TAG (s) | OFFSET (w)
Fully Associative Cache
Organization
 For the given example, we have –
 1GB main memory = 2^30 bytes
 Cache size = 128KB = 2^17 bytes
 Block size = 32B = 2^5 bytes
 Here, the offset within a 2^5-byte block requires 5 bits to locate an individual byte.
 Thus Tag bits = 30 – 5 = 25 bits
TAG (25) | OFFSET (5)
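A minimal sketch of a fully associative lookup (my own illustration; real hardware compares all tags in parallel, the loop below only models the behaviour):

```python
# Fully associative lookup: every line's tag is compared with the address tag.

OFFSET_BITS = 5

def fa_lookup(tags, addr):
    """tags: list of (valid, tag) pairs, one per cache line."""
    addr_tag = addr >> OFFSET_BITS          # everything above the offset is tag
    for index, (valid, tag) in enumerate(tags):
        if valid and tag == addr_tag:
            return index                    # hit: the block may live in any line
    return None                             # miss: any line may be chosen as victim
```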
Fully Associative Mapping
Summary
 Address length = (s + w) bits
 Number of addressable units = 2^(s+w) words or bytes
 Block size = line size = 2^w words or bytes
 No. of blocks in main memory = 2^(s+w)/2^w = 2^s
 Number of lines in cache = total number of blocks in the cache
 Size of tag = s bits
Pros and Cons
 There is flexibility as to which block to
replace when a new block is read into
the cache
 The complex circuitry required for
parallel Tag comparison is however a
major disadvantage.
Set Associative Mapping
 Cache is divided into a number of sets
 Each set contains a number of lines
 A given block maps to any line in a
given set. e.g. Block B can be in any
line of set i
 If 2 lines per set,
 2-way associative mapping
 A given block can be in one of 2 lines in only one set
TAG (s – d) | SET (d) | OFFSET (w)
K-Way Set Associative
Organization
 For the given example, we have –
 1GB main memory = 2^30 bytes
 Cache size = 128KB = 2^17 bytes
 Block size = 32B = 2^5 bytes
 Let it be a 2-way set associative cache.
 No. of sets = 2^17/(2 × 2^5) = 2^11, thus 11 bits are required to locate the 2^11 sets, each set containing 2 lines.
 Also, the offset within a 2^5-byte block requires 5 bits to locate an individual byte.
 Thus Tag bits = 30 – 11 – 5 = 14 bits
TAG (14) | SET (11) | OFFSET (5)
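The sketch below (illustrative; the data-structure layout is an assumption of mine) decomposes an address with these field widths and searches only the two lines of the selected set:

```python
# 2-way set associative example: 128 KB cache, 32 B lines, 2^11 sets.

OFFSET_BITS = 5
SET_BITS    = 11
WAYS        = 2

def split_set_associative(addr: int):
    offset = addr & ((1 << OFFSET_BITS) - 1)
    set_index = (addr >> OFFSET_BITS) & ((1 << SET_BITS) - 1)
    tag = addr >> (OFFSET_BITS + SET_BITS)      # remaining 14 bits
    return tag, set_index, offset

def sa_lookup(sets, addr):
    """sets: list of 2^11 sets, each a list of WAYS (valid, tag) pairs."""
    tag, set_index, _ = split_set_associative(addr)
    for way, (valid, line_tag) in enumerate(sets[set_index]):
        if valid and line_tag == tag:           # only the lines of one set are searched
            return set_index, way
    return None                                 # miss within the selected set
```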
Set Associative Mapping
Summary
 Address length = (s + w) bits
 Number of addressable units = 2^(s+w) words or bytes
 Block size = line size = 2^w words or bytes
 Number of blocks in main memory = 2^s
 Number of lines in set = k
 Number of sets = v = 2^d
 Number of lines in cache = k·v = k × 2^d
 Size of tag = (s – d) bits
 Mapping Function
 The Jth block of main memory maps to the Ith set
 I = J modulo v (v = no. of sets)
 Within the set, the block can be mapped to any cache line.
Pros and Cons
 After simulating the hit ratio for direct mapped and (2, 4, 8-way) set associative mapped caches, we observe a significant difference in performance at least up to a cache size of 64KB, set associative being the better one.
 However, beyond that, the complexity of the cache increases in proportion to the associativity, and both mappings give approximately similar hit ratios.
N-way Set Associative Cache
Vs. Direct Mapped Cache:
 N comparators Vs 1
 Extra mux delay for the data
 Data comes after hit/miss
 In a direct-mapped cache, the cache block is available before the hit/miss is determined
 Number of misses
 DM > SA > FA
 Access latency : the time to perform a read or write operation, i.e. the time from the instant the address is presented to memory to the instant the data have been stored or made available
 DM < SA < FA
Types of Misses
Compulsory Misses :-
 When a program is started, the cache is completely empty, and hence the first access to a block will always be a miss, as the block has to be brought into the cache from memory, at least the first time.
 Also called first reference misses. They cannot be avoided easily.
Capacity Misses
 Occur because the cache cannot hold all the blocks needed during the execution of a program.
 Thus these misses occur due to blocks being discarded and later retrieved.
 They occur because the cache is limited in size.
 For a fully associative cache, this is the major source of misses.
Conflict Misses
 Occur because multiple distinct memory locations map to the same cache location.
 Thus, in a DM or SA cache, they occur because blocks are discarded and later retrieved.
 In DM, this is a repeated phenomenon, as two blocks which map to the same cache line can be accessed alternately, thereby decreasing the hit ratio.
 This phenomenon is called thrashing.
Solutions to reduce misses
 Capacity Misses :-
◦ Increase cache size
◦ Re-structure the program
 Conflict Misses :-
◦ Increase cache size
◦ Increase associativity
Coherence Misses
 Occur when other processors update
memory which in turn invalidates the
data block present in other
processor’s cache.
Replacement Algorithms
 For a Direct Mapped cache, since each block maps to only one line, we have no choice but to replace that line itself
 Hence there isn’t any replacement policy for DM.
 For SA and FA, a few replacement policies are :-
◦ Optimal
◦ Random
◦ Arrival
◦ Frequency
◦ Recently Used
Optimal
This is the ideal benchmarking
replacement strategy.
 All other policies are compared to it.
 This is not implemented, but used just
for comparison purposes.
Random
 Block to be replaced is randomly
picked
 Minimum hardware complexity – just a
pseudo random number generator
required.
 Access time is not affected by the
replacement circuit.
 Not suitable for high performance
systems
Arrival - FIFO
 For an N-way set associative cache
 Implementation 1
 Use an N-bit register per cache line to store arrival time information
 On a cache miss – the registers of all cache lines in the set are compared to choose the victim cache line
 Implementation 2
 Maintain a FIFO queue
 Register with (log2 N) bits per cache line
 On a cache miss – the cache line whose register value is 0 will be the victim
 Decrement all other registers in the set by 1 and set the victim register to N-1
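A rough model of Implementation 2 (the counter-update details are my reading of the slide, not a definitive design):

```python
# FIFO replacement with a log2(N)-bit arrival counter per line.
# Counter 0 marks the oldest line (the victim); the victim is reloaded
# with N-1 and every other counter ages by one.

N = 4                                # 4-way set associative

def fifo_victim(counters):
    """counters: list of N arrival counters for one set."""
    victim = counters.index(0)       # line that arrived earliest
    for i in range(N):
        if i == victim:
            counters[i] = N - 1      # newest arrival
        else:
            counters[i] -= 1         # everyone else ages by one
    return victim

if __name__ == "__main__":
    counters = [0, 1, 2, 3]          # line 0 is the oldest
    print(fifo_victim(counters), counters)   # -> 0 [3, 0, 1, 2]
```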
FIFO : Advantages &
Disadvantages
 Advantages
 Low hardware complexity
 Better cache hit performance than Random replacement
 The cache access time is not affected by the replacement strategy (not in the critical path)
 Disadvantages
 Cache hit performance is poor compared to LRU and frequency-based replacement schemes
 Not suitable for high performance systems
 Replacement circuit complexity increases with increase in associativity
Frequency – Least Frequently
Used
 Requires a register per cache line to save the number of references (frequency count)
 If a cache access is a hit, the frequency count of the corresponding register is increased by 1
 On a cache miss, the victim cache line is the one with the minimum frequency count in the set
 The register corresponding to the victim cache line is reset to 0
 LFU cannot differentiate between blocks referenced heavily in the past and those referenced recently
Least Frequently Used –
Dynamic Aging (LFU-DA)
 When any frequency count register in the set reaches its maximum value, all the frequency count registers in that set are shifted one position right (divided by 2)
 The rest is the same as LFU
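A compact sketch of LFU with the LFU-DA aging rule applied on counter saturation (my own minimal model; the register width is an assumed value):

```python
# LFU replacement with dynamic aging: each line keeps a frequency counter,
# and when a counter saturates, all counters in the set are halved.

MAX_COUNT = 255                                  # assumed 8-bit frequency register

def lfu_da_access(freq, hit_way=None):
    """freq: list of frequency counters for one set.
    hit_way: index of the line that hit, or None on a miss.
    Returns the way that was used or chosen as victim."""
    if hit_way is not None:
        freq[hit_way] += 1
        if freq[hit_way] >= MAX_COUNT:           # LFU-DA: age the whole set
            for i in range(len(freq)):
                freq[i] >>= 1
        return hit_way
    victim = freq.index(min(freq))               # line with the lowest count
    freq[victim] = 0                             # reset counter for the new block
    return victim
```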
LFU : Advantages &
Disadvantages
 Advantages
 For small and medium caches, LFU works better than FIFO and Random replacement
 Suitable for high performance systems whose memory access pattern follows frequency order
 Disadvantages
 The register must be updated on every cache access
 This affects the critical path
 The replacement circuit becomes more complicated as associativity increases
Least Recently Used Policy
 Most widely used replacement
strategy
 Replaces the least recently used
cache line
 Implemented by two techniques :-
◦ Square Matrix Implementation
◦ Counter Implementation
Square Matrix Implementation
 N^2 bits per set (D flip-flops) to store the LRU information
 The cache line corresponding to the row with all zeros is the victim cache line for replacement
 On a cache hit, all the bits in the corresponding row are set to 1 and then all the bits in the corresponding column are set to 0
 On a cache miss, a priority encoder selects the cache line corresponding to the row with all zeros for replacement
 Used when associativity is low
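The following sketch models the matrix update for a 4-way set (illustrative only; hardware would realize this with flip-flops and a priority encoder):

```python
# Square-matrix LRU: on an access to way i, set row i to all ones,
# then clear column i. The all-zero row marks the LRU way.

N = 4

def matrix_access(m, way):
    """m: N x N list of 0/1 bits for one set; way: index that was accessed."""
    for j in range(N):
        m[way][j] = 1          # set the whole row
    for i in range(N):
        m[i][way] = 0          # clear the whole column

def matrix_victim(m):
    for i in range(N):
        if all(bit == 0 for bit in m[i]):   # all-zero row -> least recently used
            return i
    return 0

if __name__ == "__main__":
    m = [[0] * N for _ in range(N)]
    for way in (2, 0, 3, 1):                # access order
        matrix_access(m, way)
    print(matrix_victim(m))                 # -> 2, the least recently used way
```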
Matrix Implementation – 4 way
set Associative Cache
Counter Implementation
 N registers with log2 N bits each for an N-way set associative cache; thus N·log2 N bits are used
 One register per cache line
 The cache line corresponding to counter value 0 is the victim cache line for replacement
 On a hit, all cache lines with a counter greater than that of the hit cache line are decremented by 1, and the hit cache line's counter is set to N-1
 On a miss, the cache line whose count value is 0 is replaced; all other counters in the set are decremented by 1 and the new line's counter is set to N-1
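A small model of the counter scheme described above (illustrative; counters hold a permutation of 0..N-1, with 0 marking the LRU line):

```python
# Counter-based LRU for an N-way set.

N = 4

def lru_counter_access(count, hit_way=None):
    """count: list of N LRU counters for one set (N-1 = most recent, 0 = LRU)."""
    if hit_way is not None:
        old = count[hit_way]
        for i in range(N):
            if count[i] > old:
                count[i] -= 1            # everything "newer" than the hit ages by one
        count[hit_way] = N - 1           # the hit line becomes most recently used
        return hit_way
    victim = count.index(0)              # miss: counter 0 marks the LRU line
    for i in range(N):
        if i != victim:
            count[i] -= 1
    count[victim] = N - 1                # newly loaded block is most recent
    return victim
```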
Look Policy
Look Through : access the cache; if the data is not found, access the lower level
Look Aside : send the request to the cache and its lower level at the same time
Write Policy
Need of a Write Policy :-
 A block in the cache might have been updated, but the corresponding update in main memory might not have been done
 Multiple CPUs have individual caches, and a write in one can invalidate the data in another processor's cache
 I/O may be able to read and write directly into main memory
Write Through
 In this technique, all write operations are made to main memory as well as to the cache, ensuring MM is always valid.
 Any other processor–cache module may monitor traffic to MM to maintain consistency.
DISADVANTAGE
 It generates memory traffic and may create a bottleneck.
 Bottleneck : a delay in the transmission of data due to limited bandwidth, so information is not relayed at the speed at which it is processed.
Pseudo Write Through
 Also called Write Buffer
 Processor writes data into the cache
and the write buffer
 Memory controller writes contents of
the buffer to memory
 FIFO (typical number of entries 4)
 After write is complete, buffer is
flushed
Write Back
 In this technique, updates are made only in the cache.
 When an update is made, a dirty bit (or use bit) associated with the line is set
 When a block is replaced, it is written back into main memory iff its dirty bit is set
 Thus it minimizes memory writes
DISADVANTAGE
 Portions of MM are still invalid, hence I/O should be allowed access only through the cache
 This requires complex circuitry and creates a potential bottleneck
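A toy model of the write-back policy with a dirty bit (illustrative only, not any particular hardware; the class and memory layout are my own simplification):

```python
# Write-back: writes stay in the cache; memory is updated only when a
# dirty line is evicted.

class WriteBackLine:
    def __init__(self):
        self.valid = False
        self.dirty = False
        self.tag = None
        self.data = None

def write(line, tag, data):
    line.valid, line.tag, line.data = True, tag, data
    line.dirty = True                       # update stays local to the cache

def evict(line, memory):
    if line.valid and line.dirty:
        memory[line.tag] = line.data        # write back only if the dirty bit is set
    line.valid = line.dirty = False
```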
Cache Coherency
This is required only in case of
multiprocessors where each CPU has
its own cache
Why is it needed ?
 Whatever the write policy, if data is modified in one cache, the copies held in other caches become invalid if they hold the same data
 Hence we need to maintain cache coherency to obtain correct results
Approaches towards Cache
Coherency
1) Bus watching write through :
 Cache controller monitors writes into
shared memory that also resides in
the cache memory
 If any writes are made, the controller
invalidates the cache entry
 This approach depends on the use of a write-through policy
2) Hardware Transparency :-
 Additional hardware ensures that all updates to main memory via a cache are reflected in all caches
3) Non Cacheable memory :-
 Only a portion of main memory is shared by more than one processor, and this is designated as non-cacheable.
 Here, all accesses to shared memory are cache misses, as it is never copied to the cache
Cache Optimization
 Reducing the miss penalty
1. Multi level caches
2. Critical word first
3. Priority to Read miss over writes
4. Merging write buffers
5. Victim caches
Multilevel Cache
 The inclusion of an on-chip cache leaves open the question of whether an additional external cache is still desirable.
 The answer is yes! The reasons are :
◦ If there is no L2 cache and the processor makes a request for a memory location not in the L1 cache, it accesses DRAM or ROM. Due to the relatively slower bus speed, performance degrades.
◦ Whereas, if an L2 SRAM cache is included, the frequently missed information can be quickly retrieved. Also, SRAM is fast enough to match the bus speed, giving zero-wait-state transactions.
 The L2 cache does not use the system bus as the path for transfers between L2 and the processor, but a separate data path, to reduce the burden on the bus
 A series of simulations has shown that the L2 cache is most efficient when it is at least double the size of the L1 cache, as otherwise its contents will be similar to those of L1
 Due to the continued shrinkage of processor components, many processors can accommodate the L2 cache on chip, giving rise to the opportunity to include an L3 cache
 The only disadvantage of a multilevel cache is that it complicates the design.
Cache Performance
 Average memory access time = Hit time(L1) + Miss rate(L1) × (Hit time(L2) + Miss rate(L2) × Miss penalty(L2))
 Average memory stalls per instruction = Misses per instruction(L1) × Hit time(L2) + Misses per instruction(L2) × Miss penalty(L2)
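A worked example of the first formula with made-up numbers (the slides give only the formula itself; all values below are assumptions for illustration):

```python
# Two-level cache AMAT with hypothetical latencies and miss rates.

hit_time_L1     = 1      # cycles
miss_rate_L1    = 0.05
hit_time_L2     = 10     # cycles
miss_rate_L2    = 0.20   # local miss rate of L2
miss_penalty_L2 = 100    # cycles to main memory

amat = hit_time_L1 + miss_rate_L1 * (hit_time_L2 + miss_rate_L2 * miss_penalty_L2)
print(amat)   # 1 + 0.05 * (10 + 0.20 * 100) = 2.5 cycles
```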
Unified Vs Split Cache
 Earlier, the same cache was used for data as well as instructions, i.e. a Unified Cache
 Now we have separate caches for data and instructions, i.e. a Split Cache
 Thus, if the processor attempts to fetch an instruction from main memory, it first consults the instruction L1 cache, and similarly for data.
Advantages of Unified Cache
 It balances load between data and
instructions automatically.
 That is, if execution involves more
instruction fetches, the cache will tend
to fill up with instructions, and if
execution involves more of data
fetches, the cache tends to fill up with
data.
 Only one cache needs to be designed
Advantages of Split Cache
 Useful in parallel instruction execution and pre-fetching of predicted future instructions
 Eliminates contention between the instruction fetch/decode unit and the execution unit, thereby supporting pipelining
 The processor will fetch instructions ahead of time and fill the buffer, or pipeline
 E.g. superscalar machines such as the Pentium and PowerPC
Critical Word First
 This policy involves sending the requested word first and then transferring the rest, thus getting the needed data to the processor in the 1st cycle.
 Assume that 1 block = 16 bytes and 1 cycle transfers 4 bytes. Thus at least 4 cycles are required to transfer the block.
 If the processor demands the 2nd word, why should we wait for the entire block to be transferred? We can first send that word and then the rest of the block with the remaining bytes.
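A tiny sketch of the resulting transfer order (wrap-around fill; the function name is mine):

```python
# Critical-word-first transfer order for a 16-byte block moved 4 bytes per cycle.

BLOCK_WORDS = 4          # 16 B block / 4 B per transfer

def transfer_order(requested_word):
    """Order in which the words of the block are sent, starting with
    the word the processor actually asked for."""
    return [(requested_word + i) % BLOCK_WORDS for i in range(BLOCK_WORDS)]

print(transfer_order(2))   # -> [2, 3, 0, 1]: requested word first, rest wrap around
```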
Priority to read miss over
writes
Write Buffer:
 Using write buffers can create RAW conflicts with reads on cache misses
 If we simply wait for the write buffer to empty, the read miss penalty increases by about 50%
 Check the contents of the write buffer on a read miss; if there are no conflicts and the memory system is available, allow the read miss to continue. If there is a conflict, flush the buffer before the read
Write Back?
 Read miss replacing a dirty block
 Normal: write the dirty block to memory, and then do the read
 Instead: copy the dirty block to a write buffer, then do the read, and perform the write afterwards
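A simplified model of the conflict check described above (my own sketch; real designs may instead forward matching data directly from the buffer):

```python
# Give a read miss priority over buffered writes, but only when it does not
# conflict with a pending write; otherwise drain the buffer first.

def service_read_miss(read_addr, write_buffer, memory):
    """write_buffer: list of (addr, data) entries waiting to be written."""
    if any(addr == read_addr for addr, _ in write_buffer):
        # RAW hazard: flush pending writes before reading
        for addr, data in write_buffer:
            memory[addr] = data
        write_buffer.clear()
    return memory.get(read_addr)    # read proceeds ahead of unrelated writes
```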
Victim Cache
 How can we combine the fast hit time of DM with reduced conflict misses?
 Add a small fully associative buffer (cache) to hold data discarded from the cache – a Victim Cache
 A small fully associative cache is used for collecting spilled-out data
 Blocks that are discarded because of a miss (victims) are stored in the victim cache, which is checked on a cache miss.
 If found, swap the data block between the victim cache and the main cache
 Replacement always happens with the LRU block of the victim cache. The block that we want to transfer is made MRU.
 Then, the block evicted from the main cache comes to the victim cache and is made MRU.
 The block which was transferred to the main cache is now marked LRU
 If there is a miss in the victim cache as well, then MM is referred.
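A sketch of a victim-cache lookup with the swap behaviour described above (illustrative; the entry count and data structure are assumptions of mine):

```python
# Small fully associative victim cache: on a main-cache miss that hits here,
# the blocks are swapped; on a miss in both, main memory must be referenced.

from collections import OrderedDict

class VictimCache:
    def __init__(self, entries=4):
        self.store = OrderedDict()          # insertion order: first item is oldest
        self.entries = entries

    def lookup_and_swap(self, tag, evicted_from_main):
        """Return the block for `tag` if present, swapping in the (tag, block)
        pair just evicted from the main cache; otherwise return None."""
        if tag in self.store:
            block = self.store.pop(tag)                 # hit in victim cache
            self.insert(*evicted_from_main)             # main-cache victim becomes MRU
            return block
        self.insert(*evicted_from_main)                 # miss: still keep the victim
        return None                                     # caller must go to main memory

    def insert(self, tag, block):
        if len(self.store) >= self.entries:
            self.store.popitem(last=False)              # evict the oldest victim entry
        self.store[tag] = block                         # newly inserted block is MRU
```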
Cache Optimization
 Reducing hit time
1. Small and simple caches
2. Way prediction cache
3. Trace cache
4. Avoid Address translation during
indexing of the cache
Cache Optimization
 Reducing miss rate
1)Changing cache configurations
2)Compiler optimization
Cache Optimization
 Reducing miss penalty or miss rate via parallelism
1)Hardware prefetching
2)Compiler prefetching
Cache Optimization
 Increasing cache bandwidth
1)Pipelined cache,
2)Multi-banked cache
3)Non-blocking cache