Ch5-2
Cache coherence
Snooping protocol
Directory protocol
2
Centralized Shared-Memory
Architecture
Characteristics of SMP
Limited number of processor nodes (small scale) sharing a
single physical memory connected by a shared
bus.
Large caches: provide a sufficient amount of
memory bandwidth.
Increase effective bandwidth relative to the bus/memory
Reduce access latency
Valuable for both private data and shared data
UMA: uniform memory access time.
3
Major issues for Shared Memory
Cache coherence (values returned by reads of the same location)
 “Common Sense”
 P1 Write[X] => P1 Read[X] returns the value P1 wrote (no intervening write to X by another processor)
 P1 Write[X] => P2 Read[X] returns the value written by P1 (if the write and read are sufficiently separated in time)
 P1 Write[X] => P2 Write[X] => Serialized (all processors see the writes in
the same order)
Synchronization
 Atomic read/write operations
Memory consistency Model ( order, different locations)
 In what order must a processor observe the data writes of another
processor ?
 What properties must be enforced among reads and writes to
different locations by different processors?
These are not issues for message passing systems
 Why?
4
What is Multiprocessor Cache
Coherence?
5
Cache coherence in uniprocessor
6
Cache Coherence in Multiprocessor
7
Cache incoherence due to write
8
Cache incoherence due to write
9
What Does Coherency Mean?
Informally:
“Any read must return the most recent write”
Too strict and too difficult to implement
Better:
“Any write must eventually be seen by a read”
All writes are seen in proper order (“serialization”)
Two rules to ensure this:
“If P writes x and P1 reads it, P’s write will be seen by
P1 if the read and write are sufficiently far apart”
Writes to a single location are serialized:
seen in one order
 Latest write will be seen
 Otherwise a processor could see writes in an illogical order
(could see an older value after a newer value)
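A minimal Python sketch of the serialization rule above: given the single order in which the writes to one location were serialized, a reader's observations are acceptable only if they never step back to an older value. The function name and the assumption that every written value is distinct are illustrative and not from the slides.

```python
def sees_writes_in_order(write_order, observed):
    """Return True if `observed` never shows an older write after a newer one.

    write_order: values written to one location, in serialized order
                 (assumed distinct, so each value identifies one write).
    observed:    values one processor read from that location, in program order.
    """
    position = {value: i for i, value in enumerate(write_order)}
    seen = [position[value] for value in observed]
    # Positions in the serialized order must be non-decreasing.
    return all(earlier <= later for earlier, later in zip(seen, seen[1:]))
```

For example, with writes serialized as 1, 2, 3, a reader that observes 1, 1, 3 is consistent, while a reader that observes 2 and then 1 has seen an older value after a newer one.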
10
Definition of Cache coherence
Cache coherence
P1 Write[X] => P1 Read[X] returns the value P1 wrote (no intervening write to X by another processor)
P1 Write[X] => P2 Read[X] returns the value written by P1 (if the write and read are sufficiently separated in time)
P1 Write[X] => P2 Write[X] => Serialized (all processors see
the writes in the same order)
11
HW Coherence Protocols
Snooping Solution (Snoopy Bus):
 Send all requests for data to all processors
 Processors snoop to see if they have a copy and respond
accordingly
 Requires broadcast, since caching information is at processors
 Works well with bus (natural broadcast medium)
 Dominates for small scale machines (most of the market)
Directory-Based Schemes (discuss later)
 Keep track of what is being shared in 1 centralized place (logically)
 Distributed memory => distributed directory for scalability
(avoids bottlenecks)
 Send point-to-point requests to processors via network
 Scales better than Snooping
 Actually existed BEFORE Snooping-based schemes
12
Snooping solution
Every cache that has a copy of the data from a block of
physical memory also has a copy of the sharing status of
the block, but no centralized state is kept.
13
Basic Snoopy Protocols
Write Invalidate Protocol:
Multiple readers, single writer
Write to shared data: an invalidate is sent to all caches
which snoop and invalidate any copies
Read Miss:
 Write-through: memory is always up-to-date
 Write-back: snoop in caches to find most recent copy
Write Broadcast Protocol (typically write through):
Write to shared data: broadcast on bus, processors
snoop, and update any copies
Read miss: memory is always up-to-date
Write serialization: bus serializes requests!
Bus is single point of arbitration
14
EX: write-back cache, write invalidate
Processor activity | Bus activity | Contents of CPU A’s cache | Contents of CPU B’s cache | Contents of memory location X
(initial) | - | - | - | 0
CPU A reads X | Cache miss for X | 0 | - | 0
CPU B reads X | Cache miss for X | 0 | 0 | 0
CPU A writes a 1 to X | Invalidation for X | 1 | - | 0
CPU B reads X | Cache miss for X | 1 | 1 | 1
Mechanics
 Broadcast address of cache line to invalidate
 All processors snoop, then invalidate the line if it is in the local cache
 The same snooping policy can be used to service cache misses in write-back caches
15
Ex: write-back cache, write update (broadcast)
Processor activity | Bus activity | Contents of CPU A’s cache | Contents of CPU B’s cache | Contents of memory location X
(initial) | - | - | - | 0
CPU A reads X | Cache miss for X | 0 | - | 0
CPU B reads X | Cache miss for X | 0 | 0 | 0
CPU A writes a 1 to X | Write broadcast of X | 1 | 1 | 1
CPU B reads X | - | 1 | 1 | 1
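The two worked examples (write invalidate and write update) can be replayed with a small Python sketch. It models only location X and two CPUs; the names `simulate`, `copy`, and `dirty`, and the `"invalidate"`/`"update"` labels, are illustrative assumptions rather than anything defined in the slides.

```python
def simulate(protocol):
    """Replay the four-step example for location X.

    protocol: "invalidate" or "update" (hypothetical labels for the two schemes).
    Returns rows of (activity, bus activity, A's copy, B's copy, memory);
    None means the cache holds no valid copy of X.
    """
    memory = 0
    copy = {"A": None, "B": None}
    dirty = {"A": False, "B": False}   # used only by the write-back invalidate scheme
    rows = []

    def other(cpu):
        return "B" if cpu == "A" else "A"

    def read(cpu):
        nonlocal memory
        bus = ""
        if copy[cpu] is None:                 # miss: fetch the most recent value
            bus = "Cache miss for X"
            if dirty[other(cpu)]:             # dirty owner supplies data, memory is updated
                memory = copy[other(cpu)]
                dirty[other(cpu)] = False
            copy[cpu] = memory
        rows.append((f"CPU {cpu} reads X", bus, copy["A"], copy["B"], memory))

    def write(cpu, value):
        nonlocal memory
        copy[cpu] = value
        if protocol == "invalidate":
            copy[other(cpu)] = None           # the other copy is invalidated
            dirty[other(cpu)] = False
            dirty[cpu] = True                 # memory is now stale
            bus = "Invalidation for X"
        else:
            if copy[other(cpu)] is not None:  # the other copy is updated in place
                copy[other(cpu)] = value
            memory = value                    # memory is updated on the broadcast
            bus = "Write broadcast of X"
        rows.append((f"CPU {cpu} writes {value} to X", bus, copy["A"], copy["B"], memory))

    read("A"); read("B"); write("A", 1); read("B")
    return rows
```

Calling `simulate("invalidate")` and `simulate("update")` yields the cache and memory columns of the two tables; the only difference is the third step and whether CPU B's final read misses.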
16
Bus-based protocols (Snooping)
Snooping
All caches see and react to all bus events
Protocol relies on global visibility of events
(ordered broadcast)
The serialization of access by the bus forces
serialization of writes.
Events:
Processor (events from own processor)
Read (R), Write (W), Writeback (WB)
Bus Events (events from other processors)
Bus Read (BR), Bus Write (BW)
17
Implementation of snooping protocols
18
5 snooping protocols
Name | Protocol type | Memory-write policy | Unique feature | Machine using
Write Once | Write invalidate | Write back after first write | First snooping protocol described in literature | -
Synapse N+1 | Write invalidate | Write back | Explicit state where memory is the owner | Synapse machines; first cache-coherent machines available
Berkeley | Write invalidate | Write back | Owned shared state | Berkeley SPUR machine
Illinois | Write invalidate | Write back | Clean private state; can supply data from any cache with a clean copy | SGI Power and Challenge series
“Firefly” | Write broadcast | Write back when private, write through when shared | Memory updated on broadcast | No current machines; SPARCCenter 2000 closest
19
Simple write-invalidate protocol
Three states
Invalid, Shared, exclusive
Events
CPU-R, CPU-W
BUS-R, BUS-W
20
Snoopy Coherence Protocols
21
Snoopy-Cache State Machine-I
State machine for CPU requests, for each cache block
Cache block states: Invalid, Shared (read only), Exclusive (read/write)
Invalid -> Shared: CPU read; place read miss on bus
Invalid -> Exclusive: CPU write; place write miss on bus
Shared -> Shared: CPU read hit
Shared -> Shared: CPU read miss; place read miss on bus
Shared -> Exclusive: CPU write; place write miss on bus
Exclusive -> Exclusive: CPU read hit; CPU write hit
Exclusive -> Shared: CPU read miss; write back block, place read miss on bus
Exclusive -> Exclusive: CPU write miss; write back cache block, place write miss on bus
22
Snoopy-Cache State Machine-II
State machine for bus requests, for each cache block
Cache block states: Invalid, Shared (read only), Exclusive (read/write)
Shared -> Invalid: write miss for this block
Exclusive -> Invalid: write miss for this block; write back block (abort memory access)
Exclusive -> Shared: read miss for this block; write back block (abort memory access)
23
Snoopy-Cache State Machine-III
State machine for CPU requests and for bus requests, for each cache block
[Figure: combined state diagram over Invalid, Shared (read only), and Exclusive (read/write), superimposing the CPU-request transitions of State Machine-I and the bus-request transitions of State Machine-II.]
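The CPU-request and bus-request transitions of the three-state protocol can be written as two small Python functions. This is a sketch under the slides' simple protocol; the function names, the single-letter state codes, and the action strings are illustrative assumptions.

```python
def cpu_event(state, op, hit):
    """Next state and bus actions for a CPU request on one cache block.

    state: "I", "S" or "E"; op: "read" or "write";
    hit: True if the requested address is the one held in this block
         (ignored when the block is Invalid).
    """
    hit = hit and state != "I"
    actions = []
    if not hit and state == "E":
        actions.append("write back block")      # dirty victim goes to memory first
    if op == "read":
        if hit:
            return state, actions
        return "S", actions + ["place read miss on bus"]
    if hit and state == "E":
        return "E", actions                     # write hit on an Exclusive block
    return "E", actions + ["place write miss on bus"]


def bus_event(state, msg):
    """Next state and actions when a snooped miss matches this block.

    msg: "read miss" or "write miss" placed on the bus by another processor.
    """
    if state == "E":                            # only an Exclusive block is dirty
        next_state = "S" if msg == "read miss" else "I"
        return next_state, ["write back block"]
    if state == "S" and msg == "write miss":
        return "I", []
    return state, []
```

For instance, `cpu_event("E", "read", False)` returns `("S", ["write back block", "place read miss on bus"])`, which is the Exclusive-to-Shared arc for a CPU read miss.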
24
Example
step | P1 (State Addr Value) | P2 (State Addr Value) | Bus (Action Proc. Addr Value) | Memory (Addr Value)
P1: Write 10 to A1
P1: Read A1
P2: Read A1
P2: Write 20 to A1
P2: Write 40 to A2
Assumes A1 and A2 map to same cache block,
initial cache state is invalid
25
Example: step 1
step | P1 (State Addr Value) | P2 (State Addr Value) | Bus (Action Proc. Addr Value) | Memory (Addr Value)
P1: Write 10 to A1 | Excl. A1 10 | - | WrMs P1 A1 | -
P1: Read A1
P2: Read A1
P2: Write 20 to A1
P2: Write 40 to A2
Assumes initial cache state
is invalid and A1 and A2 map
to same cache block,
but A1 != A2.
26
Example: step 2
step | P1 (State Addr Value) | P2 (State Addr Value) | Bus (Action Proc. Addr Value) | Memory (Addr Value)
P1: Write 10 to A1 | Excl. A1 10 | - | WrMs P1 A1 | -
P1: Read A1 | Excl. A1 10 | - | - | -
P2: Read A1
P2: Write 20 to A1
P2: Write 40 to A2
27
Example:step 3
step | P1 (State Addr Value) | P2 (State Addr Value) | Bus (Action Proc. Addr Value)
P1: Write 10 to A1 | Excl. A1 10 | - | WrMs P1 A1
P1: Read A1 | Excl. A1 10 | - | -
P2: Read A1 | - | Shar. A1 | RdMs P2 A1
(cont.) | Shar. A1 10 | - | WrBk P1 A1 10
(cont.) | - | Shar. A1 10 | RdDa P2 A1 10
P2: Write 20 to A1
P2: Write 40 to A2
28
Example: step4
step | P1 (State Addr Value) | P2 (State Addr Value) | Bus (Action Proc. Addr Value) | Memory (Addr Value)
P1: Write 10 to A1 | Excl. A1 10 | - | WrMs P1 A1 | -
P1: Read A1 | Excl. A1 10 | - | - | -
P2: Read A1 | - | Shar. A1 | RdMs P2 A1 | -
(cont.) | Shar. A1 10 | - | WrBk P1 A1 10 | A1 10
(cont.) | - | Shar. A1 10 | RdDa P2 A1 10 | A1 10
P2: Write 20 to A1 | Inv. | Excl. A1 20 | WrMs P2 A1 | A1 10
P2: Write 40 to A2
29
Example:step 5
step | P1 (State Addr Value) | P2 (State Addr Value) | Bus (Action Proc. Addr Value) | Memory (Addr Value)
P1: Write 10 to A1 | Excl. A1 10 | - | WrMs P1 A1 | -
P1: Read A1 | Excl. A1 10 | - | - | -
P2: Read A1 | - | Shar. A1 | RdMs P2 A1 | -
(cont.) | Shar. A1 10 | - | WrBk P1 A1 10 | A1 10
(cont.) | - | Shar. A1 10 | RdDa P2 A1 10 | A1 10
P2: Write 20 to A1 | Inv. | Excl. A1 20 | WrMs P2 A1 | A1 10
P2: Write 40 to A2 | - | - | WrMs P2 A2 | A1 10
(cont.) | - | Excl. A2 40 | WrBk P2 A1 20 | A1 20
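The five steps of this example can be replayed in Python. The sketch below assumes two processors with one cache block each (so A1 and A2 conflict), the three-state write-invalidate protocol from the earlier slides, and bus records of the form (action, processor, address, value). It omits the RdDa data-return record, and all names are illustrative.

```python
def run_trace(trace):
    """Replay (processor, op, addr, value) steps on two one-block caches."""
    caches = {p: {"state": "I", "addr": None, "value": None} for p in ("P1", "P2")}
    memory = {}
    bus = []                                    # (action, processor, addr, value)

    def write_back(name):
        c = caches[name]
        memory[c["addr"]] = c["value"]
        bus.append(("WrBk", name, c["addr"], c["value"]))

    def miss(name, action, addr):
        bus.append((action, name, addr, None))
        mine = caches[name]
        if mine["state"] == "E" and mine["addr"] != addr:
            write_back(name)                    # evict the dirty block being replaced
        for other, c in caches.items():         # every other cache snoops the miss
            if other == name or c["state"] == "I" or c["addr"] != addr:
                continue
            if c["state"] == "E":
                write_back(other)               # owner supplies the latest value
            c["state"] = "S" if action == "RdMs" else "I"

    for name, op, addr, value in trace:
        c = caches[name]
        hit = c["state"] != "I" and c["addr"] == addr
        if op == "read":
            if not hit:
                miss(name, "RdMs", addr)
                c.update(state="S", addr=addr, value=memory.get(addr))
        else:
            if not (hit and c["state"] == "E"):
                miss(name, "WrMs", addr)        # a write to a Shared block also goes on the bus
            c.update(state="E", addr=addr, value=value)
    return caches, memory, bus


TRACE = [("P1", "write", "A1", 10), ("P1", "read", "A1", None),
         ("P2", "read", "A1", None), ("P2", "write", "A1", 20),
         ("P2", "write", "A2", 40)]
```

Running `run_trace(TRACE)` leaves P1 Invalid, P2 Exclusive with A2 = 40, and memory holding A1 = 20, in line with the last rows of the table.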
30
Block containing x: (w x y z)
Block containing i: (h i j k)
Write x: miss
Fetch the whole block: w x y* z
31
Snooping Cache Variations
Basic Protocol: Exclusive, Shared, Invalid
Berkeley Protocol: Owned Exclusive, Owned Shared, Shared, Invalid
 Owner can update via bus invalidate operation
 Owner must write back when replaced in cache
Illinois Protocol: Private Dirty, Private Clean, Shared, Invalid
 If read sourced from memory, then Private Clean
 If read sourced from other cache, then Shared
 Can write in cache if held Private Clean or Private Dirty
MESI Protocol:
 Modified (private, != Memory)
 eXclusive (private, = Memory)
 Shared (shared, = Memory)
 Invalid
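The extra clean-private state changes two decisions: which state a block enters after a read miss, and whether a CPU write needs a bus transaction. A minimal Python sketch using MESI letter codes (the function names are illustrative assumptions):

```python
def fill_state_on_read_miss(other_cache_states):
    """State of a block just read: Shared if any other cache holds it, else Exclusive."""
    return "S" if any(s != "I" for s in other_cache_states) else "E"


def on_cpu_write(state):
    """Return (next_state, needs_bus_transaction) for a CPU write."""
    if state in ("M", "E"):
        return "M", False       # private copy (clean or dirty): write locally
    return "M", True            # Shared or Invalid: other copies must be invalidated first
```

The saving comes from the `"E"` case: a block that was read from memory and is held by no one else can be written without placing a write miss on the bus.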
32
MESI (Illinois protocol) (write-back cache)
[Figure: state diagram over Invalid, Shared (read only), Exclusive (read only), and Modified (read/write).]
Transition labels shown in the diagram:
CPU read (from Invalid): place read miss on bus
CPU read hit / CPU read miss (Shared): place read miss on bus
CPU read hit (Exclusive)
CPU read hit, CPU write hit (Modified)
CPU write (from Exclusive): no need to place write miss on bus
Read-for-ownership: place read miss on bus (shown on two arcs)
CPU read miss (Modified): write back block, place read miss on bus
CPU write miss (Modified): write back, place write miss on bus
Remote read: place data on bus
Remote read (Modified): write back block
Remote write (Modified): write back block
Remote write: block becomes Invalid
33
Snoopy Coherence Protocols
34
Coherence Protocols: Extensions
Shared memory bus and snooping bandwidth is the
bottleneck for scaling symmetric multiprocessors
 Duplicating tags
 Place directory in outermost cache
 Use crossbars or point-to-point networks with banked memory
35
Coherence Protocols
36
Performance
37
Performance Study: Commercial Workload
38
Performance Study: Commercial Workload
39
Performance Study: Commercial Workload
40
Performance Study: Commercial Workload
Directory-based
Cache coherence
42 Nov. 12 2008
Distributed Shared Memory
Each node has a local memory and cache
Local or remote memory access via memory controller
[Figure: six nodes, each with a processor + cache, memory, and I/O, connected by an interconnection network.]
43
Directory protocol
Directory: track state of every block in memory,
and change the state of block in cache according to
directory.
Information in directory
Status of Every block: shared/uncached/exclusive
Which processors have copies of the block: bit vector
Whether the block is dirty or clean
Directory protocol can be implemented with a
distributed memory
Directory protocol can be applied to a centralized
memory organized into banks
44
Distributed Directory MPs
[Figure: six nodes, each with a processor + cache, memory, I/O, and a directory, connected by an interconnection network.]
45
Directory protocol
implementation
Block status
 Shared: ≥ 1 processors have data; memory up-to-date
 Uncached: no processor has it; not valid in any cache
 Exclusive: 1 processor (owner) has data; memory out-of-date
Directory size = f(number of entries × entry size)
 Number of entries: each memory block has an entry in the directory, or keep entries only
for cached blocks
 Entry size: every processor has one bit, or a limited number of processor bits in the bit vector
Directory can be distributed along with the memory to
avoid becoming the bottleneck
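To make the size relation concrete, here is a small Python sketch for the simplest organization: one entry per memory block, with a full bit vector of one presence bit per processor. The two status bits per entry (enough for Shared/Uncached/Exclusive) and the function names are assumptions made for illustration.

```python
def directory_bits(memory_bytes, block_bytes, processors, state_bits=2):
    """Total directory storage in bits, with one entry per memory block."""
    entries = memory_bytes // block_bytes
    entry_size = processors + state_bits      # presence bit vector + block status
    return entries * entry_size


def overhead(block_bytes, processors, state_bits=2):
    """Directory bits per bit of data memory."""
    return (processors + state_bits) / (8 * block_bytes)
```

With 64-byte blocks and 64 processors, `overhead(64, 64)` is 66/512, roughly 13% of the data memory. The overhead grows linearly with the processor count, which is the motivation for the limited-bit and cached-blocks-only alternatives.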
Assumptions to Keep it simple:
 Writes to non-exclusive data => write miss
 Processor blocks until access completes
 Assume messages received and acted upon in order as sent
46
Directory Protocol
No bus and don’t want to broadcast:
Interconnect means no longer single arbitration point
all messages have explicit responses
Terms: typically 3 processors involved
Local node where a request originates
Home node where the memory location
of an address resides
Remote node has a copy of a cache
block, whether exclusive or shared
Example messages on next slide:
P = processor number, A = address
47
Message type | Source | Destination | Msg content | Function
Read miss | Local cache | Home directory | P, A | Processor P reads data at address A; make P a read sharer and arrange to send data back
Write miss | Local cache | Home directory | P, A | Processor P has a write miss at address A; make P the exclusive owner and arrange to send data back
Invalidate | Local cache | Home directory | A | Request to send invalidates to all remote caches that are caching the block at address A
Invalidate | Home directory | Remote caches | A | Invalidate a shared copy at address A
Fetch | Home directory | Remote cache | A | Fetch the block at address A and send it to its home directory
Fetch/Invalidate | Home directory | Remote cache | A | Fetch the block at address A and send it to its home directory; invalidate the block in the cache
Data value reply | Home directory | Local cache | Data | Return a data value from the home memory (read miss response)
Data write-back | Remote cache | Home directory | A, Data | Write back a data value for address A (invalidate response)
48
State Transition Diagram for an Individual
Cache Block in a Directory Based System
States identical to snoopy case; transactions very
similar.
Transitions caused by read misses, write misses,
invalidates, data fetch requests
Generates read miss & write miss msg to home
directory.
Write misses that were broadcast on the bus for
snooping => explicit invalidate & data fetch requests.
Note: on a write, a cache block is bigger, so need to
read the full cache block
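The message types exchanged between the local cache, the home directory, and remote caches can be captured in a small Python structure. This is only a sketch: the `Message` class, the `ROUTES` table, and the lower-case string labels are illustrative names, with the routes copied from the message table on the previous slide.

```python
from dataclasses import dataclass
from typing import Optional

# Allowed (source, destination) pairs per message type.
ROUTES = {
    "read miss": [("local cache", "home directory")],
    "write miss": [("local cache", "home directory")],
    "invalidate": [("local cache", "home directory"),
                   ("home directory", "remote caches")],
    "fetch": [("home directory", "remote cache")],
    "fetch/invalidate": [("home directory", "remote cache")],
    "data value reply": [("home directory", "local cache")],
    "data write-back": [("remote cache", "home directory")],
}


@dataclass(frozen=True)
class Message:
    kind: str
    source: str
    destination: str
    address: Optional[int] = None     # A in the table
    processor: Optional[int] = None   # P in the table
    data: Optional[int] = None

    def __post_init__(self):
        # Reject a message sent along a route the table does not list.
        if (self.source, self.destination) not in ROUTES[self.kind]:
            raise ValueError(
                f"{self.kind} cannot go from {self.source} to {self.destination}")
```

For example, `Message("read miss", "local cache", "home directory", address=0x40, processor=1)` is accepted, while a fetch that originates at a local cache raises `ValueError`.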
49
CPU -Cache State Machine
State machine for CPU requests, for each memory block
(Invalid state if in memory)
Cache block states: Invalid, Shared (read only), Exclusive (read/write)
Invalid -> Shared: CPU read; send Read Miss
Invalid -> Exclusive: CPU write; send Write Miss to home directory (h.d.)
Directory, on the Read Miss from p:
Uncached: send reply; -> Shared; share = {p}
Shared: send reply; share += {p}
Exclusive: send Fetch to remote node (R.N.); get reply back from R.N.; send reply; -> Shared; share += {p}
Directory, on the Write Miss from p:
Uncached: send reply; -> Exclusive; share = {p}
Shared: send Invalidate; send reply; -> Exclusive; share = {p}
Exclusive: send Fetch/Invalidate to R.N.; get reply back from R.N.; send reply to p; -> Exclusive; share = {p}
50
CPU -Cache State Machine
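State machine for CPU requests, for each memory block
(Invalid state if in memory)
Cache block states: Invalid, Shared (read only), Exclusive (read/write)
Invalid -> Shared: CPU read; send Read Miss
Invalid -> Exclusive: CPU write; send Write Miss to home directory (h.d.)
Shared -> Shared: CPU read hit
Shared -> Shared: CPU read miss; send Read Miss
Shared -> Exclusive: CPU write hit; send Invalidate to home directory
Shared -> Exclusive: CPU write miss; send Write Miss to home directory
Shared -> Invalid: Invalidate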
51
CPU -Cache State Machine
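State machine for CPU requests, for each memory block
(Invalid state if in memory)
Cache block states: Invalid, Shared (read only), Exclusive (read/write)
Invalid -> Shared: CPU read; send Read Miss
Invalid -> Exclusive: CPU write; send Write Miss to home directory (h.d.)
Shared -> Shared: CPU read hit
Shared -> Shared: CPU read miss; send Read Miss
Shared -> Exclusive: CPU write hit; send Invalidate to home directory
Shared -> Exclusive: CPU write miss; send Write Miss to home directory
Shared -> Invalid: Invalidate
Exclusive -> Exclusive: CPU read hit; CPU write hit
Exclusive -> Shared: Fetch; Data Write Back to home directory
Exclusive -> Invalid: Fetch/Invalidate; Data Write Back
Exclusive -> Shared: CPU read miss; Data Write Back and send Read Miss to home directory
Exclusive -> Exclusive: CPU write miss; Data Write Back and send Write Miss to home directory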
52
CPU -Cache State Machine
[Figure: the same CPU-cache state diagram as on the previous slide, over Invalid, Shared (read only), and Exclusive (read/write), with transitions on CPU read/write hits and misses, Invalidate, Fetch, and Fetch/Invalidate.]
Are these write-backs the same?
53
State Transition Diagram for the
Directory
Same states & structure as the transition
diagram for an individual cache
2 actions: update the directory state & send
msgs to satisfy requests
Tracks all copies of memory block.
Also indicates an action that updates the
sharing set, Sharers, as well as sending a
message.
54
Directory State Machine
State machine for directory requests, for each memory block
(Uncached state if in memory)
Directory states: Uncached, Shared (read only), Exclusive (read/write)
Uncached -> Shared: Read miss; Sharers = {P}; send Data Value Reply
Uncached -> Exclusive: Write miss; Data Value Reply; Sharers = {P}
55
Directory State Machine
State machine for directory requests, for each memory block
(Uncached state if in memory)
Directory states: Uncached, Shared (read only), Exclusive (read/write)
Uncached -> Shared: Read miss; Sharers = {P}; send Data Value Reply
Uncached -> Exclusive: Write miss; Data Value Reply; Sharers = {P}
Shared -> Shared: Read miss; Data Value Reply; Sharers += {P}
Shared -> Exclusive: Write miss / Invalidate; send Invalidate to remote nodes (R.N.); Sharers = {P}; (Data Value Reply)
56
Directory State Machine
State machine for directory requests, for each memory block
(Uncached state if in memory)
Directory states: Uncached, Shared (read only), Exclusive (read/write)
Uncached -> Shared: Read miss; Sharers = {P}; send Data Value Reply
Uncached -> Exclusive: Write miss; Data Value Reply; Sharers = {P}
Shared -> Shared: Read miss; Data Value Reply; Sharers += {P}
Shared -> Exclusive: Write miss / Invalidate; Invalidate; Sharers = {P}; Data Value Reply
Exclusive -> Uncached: Data Write Back; Sharers = {}
Exclusive -> Shared: Read miss; send Fetch to R.N.; get data from R.N.; reply back to local processor; Sharers += {P}
Exclusive -> Exclusive: Write miss; Fetch/Invalidate; receive data from R.N.; Data Value Reply to local node; Sharers = {P}
57
Directory State Machine
State machine for directory requests, for each memory block
(Uncached state if in memory)
Directory states: Uncached, Shared (read only), Exclusive (read/write)
Uncached -> Shared: Read miss; Sharers = {P}; send Data Value Reply
Uncached -> Exclusive: Write miss; Data Value Reply; Sharers = {P}
Shared -> Shared: Read miss; Data Value Reply; Sharers += {P}
Shared -> Exclusive: Write miss; Invalidate; Sharers = {P}; Data Value Reply
Exclusive -> Uncached: Data Write Back; Sharers = {}
Exclusive -> Shared: Read miss; Fetch (msg to remote cache); Data Value Reply; Sharers += {P}
Exclusive -> Exclusive: Write miss; Fetch/Invalidate (msg to remote cache); Data Value Reply; Sharers = {P}
58
Example Directory Protocol
Message sent to directory causes two actions:
 Update the directory
 More messages to satisfy request
Block is in Uncached state: the copy in memory is the current
value; only possible requests for that block are:
 Read miss: requesting processor is sent the data from memory & the requestor is
made the only sharing node; state of block is made Shared.
 Write miss: requesting processor is sent the value & becomes the sharing
node. The block is made Exclusive to indicate that the only valid copy is
cached. Sharers indicates the identity of the owner.
Block is Shared => the memory value is up-to-date:
 Read miss: requesting processor is sent back the data from memory &
requesting processor is added to the sharing set.
 Write miss: requesting processor is sent the value. All processors in the
set Sharers are sent invalidate messages, & Sharers is set to identity of
requesting processor. The state of the block is made Exclusive.
59
Example Directory Protocol
Block is Exclusive: current value of the block is held in the cache of the
processor identified by the set Sharers (the owner) => three possible
directory requests:
 Read miss: owner processor sent data fetch message, causing state of block in
owner’s cache to transition to Shared and causes owner to send data to directory,
where it is written to memory & sent back to requesting processor.
Identity of requesting processor is added to set Sharers, which still contains the
identity of the processor that was the owner (since it still has a readable copy).
State is shared.
 Data write-back: owner processor is replacing the block and hence must write it
back, making memory copy up-to-date
(the home directory essentially becomes the owner), the block is now Uncached,
and the Sharer set is empty.
 Write miss: block has a new owner. A message is sent to old owner causing the
cache to send the value of the block to the directory from which it is sent to the
requesting processor, which becomes the new owner. Sharers is set to identity of
new owner, and state of block is made Exclusive.
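The directory behaviour described on these two slides fits in one small Python class. It is a sketch under the slides' simplifying assumptions (in-order messages, one request at a time); the class name, the state letters `"U"`, `"S"`, `"E"`, and the message strings are illustrative, and the replies from remote caches are not modelled.

```python
class DirectoryEntry:
    """Directory state for one memory block: "U", "S" or "E", plus the Sharers set."""

    def __init__(self):
        self.state = "U"
        self.sharers = set()

    def handle(self, msg, p):
        """Apply a request from processor p; return the messages the directory sends."""
        out = []
        if msg == "read miss":
            if self.state == "E":
                (owner,) = self.sharers
                out.append(("fetch", owner))      # owner writes back and keeps a Shared copy
            self.sharers.add(p)
            self.state = "S"
            out.append(("data value reply", p))
        elif msg == "write miss":
            if self.state == "S":
                out += [("invalidate", q) for q in sorted(self.sharers - {p})]
            elif self.state == "E":
                (owner,) = self.sharers
                out.append(("fetch/invalidate", owner))
            self.sharers = {p}                    # requester becomes the only owner
            self.state = "E"
            out.append(("data value reply", p))
        elif msg == "data write-back" and self.state == "E":
            self.sharers = set()                  # memory is up to date again
            self.state = "U"
        else:
            raise ValueError(f"unexpected {msg!r} in state {self.state}")
        return out
```

Starting from a block that is Shared by two processors, a write miss from a third produces two invalidates followed by a data value reply, and leaves the entry Exclusive with the writer as the sole member of Sharers.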
60
Case Study: p1 write 888 to x
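Nodes: P1 (local), P5 (home of X), P2 (remote), P3 (remote); P4 and P6 are not involved
Initial state: directory at P5: X: S, {P2,P3}; memory X = 111; P2 cache X: S (111); P3 cache X: S (111); P1 cache X: I
1. P1 -> P5: WriteMiss X
2. P5 -> P2 and P5 -> P3: Invalidate X
3. P2 cache X: I and P3 cache X: I; each sends Ack to P5
4. Directory at P5: X: E, {P1}
5. P5 -> P1: Reply X=111
6. P1 cache X: E, 888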
61
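P2 write 999 to X
Nodes: P1 (remote), P5 (home of X), P2 (local), P3 (remote)
State before the write: P1 cache X: E, 888; P2 cache X: I; P3 cache X: I; directory at P5: X: E, {P1}; memory X = 111
[Figure: six-node diagram (P1 to P6) with the messages left blank.]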
62
Answer for P2 write 999 to X
Nodes: P1 (remote), P5 (home of X), P2 (local), P3 (remote)
State before the write: P1 cache X: E, 888; directory at P5: X: E, {P1}; memory X = 111
1. P2 -> P5: WriteMiss
2. P5 -> P1: Fetch/Invalidate
3. P1 cache X: I; P1 -> P5: Write back X 888; memory X = 888
4. Directory at P5: X: E, {P2}
5. P5 -> P2: Data reply X 888
6. P2 cache X: E, 999
How about P2 read X?
More Cases for Cache Coherence
of Directory Protocol
Could you fill in the blanks to complete the directories?
What operations are performed when P0 reads 300?
Nodes: P0 (local node), P2 (home node)
1. P0 -> P2: ReadMiss for Tag=300
2. Directory M2, entry 300: {}, U -> {P0}, S
3. P2 -> P0: DataReply 300 (0300)
4. Cache0, B0: I, 100, 0100 -> S, 300, 0300
What operations are performed when P2 reads 218?
Nodes: P2 (local node), P1 (home node), P0 (remote node)
1. P2 -> P1: ReadMiss for Tag=218
2. P1 -> P0: Fetch Tag=218
3. Cache0, B3: E, 218, 1218 -> S, 218, 1218
4. P0 -> P1: WriteBack 218 (1218)
5. Modify directory: M1, entry 218: {P0}, E, 0218 -> {P0,P2}, S, 1218
6. P1 -> P2: DataReply 218 (1218)
7. Cache2, B3: I, 118, 0318 -> S, 218, 1218
What operations are performed when P1 writes 0888 into 310?
Nodes: P1 (local node), P2 (home node), P0 and P2 (remote nodes)
1. P1 -> P2: WriteMiss for Tag=310
2. P2 -> Cache0, Cache2: Invalidate Tag=310
3. Cache0, Cache2, B2: S, 310, 0310 -> I, 310, 0310; each replies with Ack
4. Directory M2, entry 310: {P0,P2}, S, 0310 -> {P1}, E, 0310
5. P2 -> P1: DataReply 310 (0310)
6. Cache1, B2: S, 110, 0110 -> E, 310, 0310 -> E, 310, 0888
Anything uncomfortable?
What operations are performed when P1 writes 0888 into 310? (continued)
Nodes: P1 (local node), P2 (home node), P0 (remote node)
Same sequence as before: WriteMiss for Tag=310; Invalidate Tag=310 to Cache0 {Cache2}; Ack; DataReply 0310
Directory M2, entry 310: {P0,P2}, S, 0310 -> {P1}, E, 0310
Cache1, B2: S, 110, 0110 -> E, 310, 0888, so block 110 is kicked out of Cache1
Kickout 110: directory M0, entry 110: {P1}, S, 0110 should become { }, U, 0110
Unless the home directory of 110 is told about the replacement,
the directory is outdated with wrong info.
How to solve it?
74
Example 1: initial
A1 and A2 map to the same cache block
step | Processor 1 (State Addr Value) | Processor 2 (State Addr Value) | Interconnect (Action Proc. Addr Value) | Directory (Addr State {Procs}) | Memory (Value)
P1: Write 10 to A1
P1: Read A1
P2: Read A1
P2: Write 20 to A1
P2: Write 40 to A2
75
Example: P1 write 10 to A1
A1 and A2 map to the same cache block
step | Processor 1 (State Addr Value) | Processor 2 (State Addr Value) | Interconnect (Action Proc. Addr Value) | Directory (Addr State {Procs}) | Memory (Value)
P1: Write 10 to A1 | - | - | WrMs P1 A1 | A1 Ex {P1} | -
(cont.) | Excl. A1 10 | - | DaRp P1 A1 0 | - | -
P1: Read A1
P2: Read A1
P2: Write 20 to A1
P2: Write 40 to A2
76
Example: P1 read A1, P2 read A1
A1 and A2 map to the same cache block
step | Processor 1 (State Addr Value) | Processor 2 (State Addr Value) | Interconnect (Action Proc. Addr Value) | Directory (Addr State {Procs}) | Memory (Value)
P1: Write 10 to A1 | - | - | WrMs P1 A1 | A1 Ex {P1} | -
(cont.) | Excl. A1 10 | - | DaRp P1 A1 VA1 | - | -
P1: Read A1 | Excl. A1 10 | - | - | - | -
P2: Read A1 | - | - | RdMs P2 A1 | - | -
(cont.) | - | - | Ftch Ph A1 | - | -
(cont.) | Shar. A1 10 | - | DaWb P1 A1 10 | - | 10
(cont.) | - | Shar. A1 10 | DaRp P2 A1 10 | A1 Shar. {P1,P2} | 10
P2: Write 20 to A1
P2: Write 40 to A2
77
Example: P2 write 20 to A1
A1 and A2 map to the same cache block
step | Processor 1 (State Addr Value) | Processor 2 (State Addr Value) | Interconnect (Action Proc. Addr Value) | Directory (Addr State {Procs}) | Memory (Value)
P1: Write 10 to A1 | - | - | WrMs P1 A1 | A1 Ex {P1} | -
(cont.) | Excl. A1 10 | - | DaRp P1 A1 0 | - | -
P1: Read A1 | Excl. A1 10 | - | - | - | -
P2: Read A1 | - | - | RdMs P2 A1 | - | -
(cont.) | - | - | Ftch Ph A1 | - | -
(cont.) | Shar. A1 10 | - | DaWb P1 A1 10 | - | 10
(cont.) | - | Shar. A1 10 | DaRp P2 A1 10 | A1 Shar. {P1,P2} | 10
P2: Write 20 to A1 | - | - | Inva. P2 A1 20 | A1 Ex {P2} | 10
(cont.) | Inv. A1 10 | Ex A1 20 | Inva. Ph A1 | - | 10
P2: Write 40 to A2
78
Example: P2 write 40 to A2
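A1 and A2 map to the same cache block
step | Processor 1 (State Addr Value) | Processor 2 (State Addr Value) | Interconnect (Action Proc. Addr Value) | Directory (Addr State {Procs}) | Memory (Value)
P1: Write 10 to A1 | - | - | WrMs P1 A1 | A1 Ex {P1} | -
(cont.) | Excl. A1 10 | - | DaRp P1 A1 0 | - | -
P1: Read A1 | Excl. A1 10 | - | - | - | -
P2: Read A1 | - | - | RdMs P2 A1 | A1 Shar. {P1,P2} | -
(cont.) | - | - | Ftch Ph A1 | - | -
(cont.) | Shar. A1 10 | - | DaWb P1 A1 10 | - | 10
(cont.) | - | Shar. A1 10 | DaRp P2 A1 10 | - | 10
P2: Write 20 to A1 | - | - | Inva. P2 A1 | A1 Ex {P2} | 10
(cont.) | Inv. A1 10 | Excl A1 20 | Inva. Ph A1 | - | 10
P2: Write 40 to A2 | - | - | DaWb P2 A1 20 | A1 Unc {} | 20
(cont.) | - | - | WrMs P2 A2 Value | A2 Ex {P2} | -
(cont.) | - | Excl A2 40 | DaRp Ph A2 Value | - | -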
Implementation of
Directory-Based Coherence
80
Implementation issues
Nonatomic operations
Write serialization
Without Broadcast
81
Assumptions for implementation
simplicity
Network provides point-to-point in-order delivery
of messages
Network has unlimited buffering
Network delivers all messages within a finite
time.
Coherence controller is duplicated for each
cache block.
A transition only completes when a message has
been transmitted and a data value reply received.
Omit the pending status
Outgoing message can be transmitted before the
next incoming message is accepted.
82
Deadlock example
Assume P1 and P2 each have exclusive copies
of cache blocks X1 and X2 that have different
home directories.
Resolve: duplicate coherence controller for each block
83
CPU -Cache State Machine
[Figure: the complete CPU-cache state diagram for the directory protocol (Invalid, Shared, Exclusive), repeated from slide 51.]
84
How to assure write serialization ?
The home directory serializes exclusive accesses
 Buffer all requests (write miss / invalidate);
 Process the requests in order;
 Start processing a new request only after the
previous one completes.
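A minimal Python sketch of this buffering discipline: requests for a block queue up at the home directory, and the next one starts only after the current one completes. The class and method names are illustrative assumptions; `order` records the serialization order that results.

```python
from collections import deque


class HomeDirectoryQueue:
    """Serializes requests for one block at its home directory."""

    def __init__(self):
        self.pending = deque()   # buffered requests, in arrival order
        self.busy = False        # True while a request is being processed
        self.order = []          # requests in the order they were started

    def receive(self, request):
        self.pending.append(request)
        self._start_next()

    def complete(self):
        """Called when the request in progress has fully finished."""
        self.busy = False
        self._start_next()

    def _start_next(self):
        if not self.busy and self.pending:
            self.busy = True
            self.order.append(self.pending.popleft())
```

If two write misses arrive back to back, only the first is started; the second stays buffered until `complete()` is called, so every node sees the two writes in the same order.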
85
How to solve the “race” ?
How does the processor know who is the winner?
Get an acknowledgement message from the home directory
 Data Reply (for write miss)
 Explicit ACK (for invalidate)
About the loser:
 Simplest: home directory sends a NAK to the loser.
How to know the invalidations are completed?
1. The directory collects and counts ACK messages from remote
nodes, and then sends a confirmation to the requester.
2. The requesting (local) node collects and counts ACK messages from
remote nodes directly.
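The first option (the directory counts the ACKs) can be sketched in a few lines of Python. The class name and the `("confirm", requester)` tuple are illustrative assumptions.

```python
class PendingInvalidation:
    """Tracks the ACKs still outstanding for one write miss or invalidate."""

    def __init__(self, requester, sharers):
        self.requester = requester
        self.waiting = set(sharers)     # remote nodes that were sent an invalidate

    def on_ack(self, node):
        """Record one ACK; return a confirmation once every sharer has answered."""
        self.waiting.discard(node)
        if not self.waiting:
            return ("confirm", self.requester)
        return None
```

In the second option the same counting would live at the requesting node instead, with the remote nodes sending their ACKs there.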
86
Buffer requirement
Large amount of buffers required
A write miss may produce a large number of invalidate
messages
Prefetch schemes might be used
Multiple outstanding misses
Limited buffers in practice
87
Avoid deadlock with limited
buffering
Deadlock arises from three properties
 More than one resource is needed to complete a transaction
 Buffers for request, reply, and accept messages
Resources are held until a nonatomic transaction completes
There is no global partial order on the acquisition of
resources
88
Resolution
Strategy: try to ensure that the resources will
always be available.
Separate networks are used for requests and
replies.
Every request that needs a reply allocates the space
to accept the reply when the request is generated.
Replier can free the reply buffer.
Any controller can reject any request with a
NAK, but never NAK a reply.
Any request that receives a NAK is simply
retried.
89
Multithreaded directory to
handle multiple blocks
Directory controller must be reentrant.
Handle incoming requests for independent blocks
before the previous one has finished.
Control state needs to be saved and restored while a
fetch (or fetch/invalidate) is outstanding
Owner node can provide the data directly to the
requester as well as to the home node to reduce
latency.
Can limit the number of outstanding transactions
by sending NAKs to new requests.
90
How to deal with NAK ?
How to know which is the original transaction?
1. The processor keeps track of its outstanding
requests
2. Pack the original request into the NAK.
3. The buffer holding the return slot for the
request can also hold info about the
request.
So that when it receives a NAK, the processor
knows which request to resend.
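Option 1 (the processor tracks its own outstanding requests) can be sketched in Python as follows. The request ids, the `sent` log that stands in for the network, and all names are illustrative assumptions.

```python
class Requester:
    """Keeps each outstanding request so that it can be resent after a NAK."""

    def __init__(self):
        self.outstanding = {}    # request id -> (kind, address)
        self.next_id = 0
        self.sent = []           # stands in for messages put on the network

    def send(self, kind, address):
        request_id = self.next_id
        self.next_id += 1
        self.outstanding[request_id] = (kind, address)
        self.sent.append((request_id, kind, address))
        return request_id

    def on_reply(self, request_id):
        # The transaction is finished; its record is no longer needed.
        del self.outstanding[request_id]

    def on_nak(self, request_id):
        # Look up the original request and simply retry it.
        kind, address = self.outstanding[request_id]
        self.sent.append((request_id, kind, address))
```

A NAK carries only the request id here; with option 2 the NAK itself would carry the original request, and the lookup table would be unnecessary.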
91
Summary
Caches contain all information on state of cached memory
blocks
Snooping and Directory Protocols similar; bus makes
snooping easier because of broadcast (snooping => uniform
memory access)
Directory has extra data structure to keep track of state of all
cache blocks
Distributing directory => scalable shared address
multiprocessor
=> Cache coherent, Non uniform memory access
92
How about write through cache
with write invalidate?
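States: Invalid, Valid
Events: PR = processor read, PW = processor write, BR = bus read, BW = bus write
Invalid -> Valid: PR [BR miss on bus]
Invalid: PW [BW miss on bus]
Valid -> Invalid: BW
Valid -> Valid: BR, PR
Valid -> Valid: PW [send BW]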

2021Arch_14_Ch5_2_coherence.pptx Cache coherence

  • 2. 2 Centralized Shared-Memory Architecture Characteristics of SMP Limited processors nodes----small scale, share single physical memory connected by a shared bus. Large cache ----provide a sufficient amount of memory bandwidth. Increase bandwidth versus bus/memory Reduce latency of access Valuable for both private data and shared data UMA----uniform memory access time.
  • 3. 3 Major issues for Shared Memory Cache coherence ( Value, same location)  “Common Sense”  P1 Read[X] => P1 Write[X] => P1 Read[X] will return X  P2 Read[X] => P1 Write[X] => will return value written by P1  P1 Write[X] => P2 Write[X] => Serialized (all processor see the writes in the same order) Synchronization  Atomic read/write operations Memory consistency Model ( order, different locations)  In what order must a processor observe the data writes of another processor ?  What properties must be enforced among reads and writes to different locations by different processors? These are not issues for message passing systems  Why?
  • 4. 4 What is Multiprocessor Cache Coherence?
  • 5. 5 Cache coherence in uniprocessor
  • 6. 6 Cache Coherence in Multiprocessor
  • 9. 9 What Does Coherency Mean? Informally: “Any read must return the most recent write” Too strict and too difficult to implement Better: “Any write must eventually be seen by a read” All writes are seen in proper order (“serialization”) Two rules to ensure this: “If P writes x and P1 reads it, P’s write will be seen by P1 if the read and write are sufficiently far apart” Writes to a single location are serialized: seen in one order  Latest write will be seen  Otherewise could see writes in illogical order (could see older value after a newer value)
  • 10. 10 Definition of Cache coherence Cache coherence P1 Read[X] => P1 Write[X] => P1 Read[X] will return X P2 Read[X] => P1 Write[X] => will return value written by P1 P1 Write[X] => P2 Write[X] => Serialized (all processor see the writes in the same order)
  • 11. 11 HW Coherence Protocols Snooping Solution (Snoopy Bus):  Send all requests for data to all processors  Processors snoop to see if they have a copy and respond accordingly  Requires broadcast, since caching information is at processors  Works well with bus (natural broadcast medium)  Dominates for small scale machines (most of the market) Directory-Based Schemes (discuss later)  Keep track of what is being shared in 1 centralized place (logically)  Distributed memory => distributed directory for scalability (avoids bottlenecks)  Send point-to-point requests to processors via network  Scales better than Snooping  Actually existed BEFORE Snooping-based schemes
  • 12. 12 Snooping solution Every cache that has a copy of the data from a block of physical memory also has a copy of the sharing status of the block, but no centralized state is kept.
  • 13. 13 Basic Snoopy Protocols Write Invalidate Protocol: Multiple readers, single writer Write to shared data: an invalidate is sent to all caches which snoop and invalidate any copies Read Miss:  Write-through: memory is always up-to-date  Write-back: snoop in caches to find most recent copy Write Broadcast Protocol (typically write through): Write to shared data: broadcast on bus, processors snoop, and update any copies Read miss: memory is always up-to-date Write serialization: bus serializes requests! Bus is single point of arbitration
  • 14. 14 EX: write back Cache, write invalidate Processor Activity Bus activity Contents of CPU A’s cache Contents of CPU B’s cache Contents of Memory Location X 0 CPU A Reads X Cache miss for X 0 0 CPU B Reads X Cache miss for X 0 0 0 CPU A writes A 1 to X Invalidation for X 1 0 CPU B Reads X Cache miss for X 1 1 1 Mechanics  Broadcast address of cache line to invalidate  All processor snoop, then invalidate if in local cache  policy can be used to service cache misses in write-back caches
  • 15. 15 Ex: Write back Cache, update(Broadcast) Processor Activity Bus activity Contents of CPU A’s cache Contents of CPU B’s cache Contents of Memory Location X 0 CPU A Reads X Cache miss for X 0 0 CPU B Reads X Cache miss for X 0 0 0 CPU A writes A 1 to X Write broadcast Of X 1 1 1 CPU B Reads X 1 1 1
  • 16. 16 Bus-based protocols (Snooping) Snooping All caches see and react to all bus events Protocol relies on global visibility of events (ordered broadcast) The serialization of access by the bus forces serialization of writes. Events: Processor (events from own processor) Read (R), Write (W), Writeback (WB) Bus Events (events from other processors) Bus Read (BR), Bus Write (BW)
  • 18. 18 5 snooping protocols Name Protocol type Memory-write policy Unique feature Machine using Write once Write invalidate Write back after first write First snooping protocol described in literature Synapse N+1 Write invalidate Write back Explicit state Where Memory is the owner Synapse machine; first cache-coherence machines available Berkeley Write invalidate Write back Owned shared state Berkeley SPUR machine Illinois Write invalidate Write back Clean private State; Can supply data from any cache with a clean copy SGI Power and Challenge series “Firefly” Write broadcast Write back When private, Write through When shared Memory updated on broadcast No current machines; SPARCCenter 2000 closest
  • 19. 19 Simple write-invalidate protocol Three states Invalid, Shared, exclusive Events CPU-R, CPU-W BUS-R, BUS-W
  • 21. 21 Snoopy-Cache State Machine-I State machine for CPU requests for each cache block Invalid Shared (read/only) Exclusive (read/write) CPU Read CPU Write CPU Read hit Place read miss on bus Place Write Miss on bus CPU read miss Write back block, Place read miss on bus CPU Write Place Write Miss on Bus CPU Read miss Place read miss on bus CPU Write Miss Write back cache block Place write miss on bus CPU read hit CPU write hit Cache Block State
  • 22. 22 Snoopy-Cache State Machine-II State machine for bus requests for each cache block Invalid Shared (read/ only) Exclusive (read/ write) Write Back Block; (abort memory access) Write miss for this block Read miss for this block Write miss for this block Write Back Block; (abort memory access)
  • 23. 23 Snoopy-Cache State Machine-III  State machine for CPU requests for each cache block and for bus requests for each cache block Cache Block State Place read miss on bus Invalid Shared (read/only) Exclusive (read/write) CPU Read CPU Write CPU Read hit Place Write Miss on bus CPU read miss Write back block, Place read miss on bus CPU Write Place Write Miss on Bus CPU Read miss Place read miss on bus CPU Write Miss Write back cache block Place write miss on bus CPU read hit CPU write hit Write miss for this block Write Back Block; (abort memory access) Write miss for this block Read miss for this block Write Back Block; (abort memory access)
  • 24. 24 Example P1 P2 Bus Memory step State Addr ValueState Addr ValueActionProc.Addr ValueAddrValue P1: Write 10 to A1 P1: Read A1 P2: Read A1 P2: Write 20 to A1 P2: Write 40 to A2 P1: Read A1 P2: Read A1 P1 Write 10 to A1 P2: Write 20 to A1 P2: Write 40 to A2 Assumes A1 and A2 map to same cache block, initial cache state is invalid
  • 25. 25 Example: step 1 P1 P2 Bus Memory step State Addr Value State Addr Value Action Proc. Addr Value Addr Value P1: Write 10 to A1 Excl. A1 10 WrMs P1 A1 P1: Read A1 P2: Read A1 P2: Write 20 to A1 P2: Write 40 to A2 P1: Read A1 P2: Read A1 P1 Write 10 to A1 P2: Write 20 to A1 P2: Write 40 to A2 Remote Write Write Back Remote Write Invalid Shared Exclusive CPU Read hit Read miss on bus Write miss on bus CPU Write Place Write Miss on Bus CPU read hit CPU write hit Remote Read Write Back Assumes initial cache state is invalid and A1 and A2 map to same cache block, but A1 != A2.
  • 26. 26 Example: step 2 P1 P2 Bus Memory step State Addr Value State Addr Value Action Proc. Addr Value Addr Value P1: Write 10 to A1 Excl. A1 10 WrMs P1 A1 P1: Read A1 Excl. A1 10 P2: Read A1 P2: Write 20 to A1 P2: Write 40 to A2 P1: Read A1 P2: Read A1 P1 Write 10 to A1 P2: Write 20 to A1 P2: Write 40 to A2 Remote Write Write Back Remote Write Invalid Shared Exclusive CPU Read hit Read miss on bus Write miss on bus CPU Write Place Write Miss on Bus CPU read hit CPU write hit Remote Read Write Back CPU Write Miss Write Back
  • 27. 27 Example:step 3 P1 P2 Bus step State Addr ValueState Addr ValueActionProc.Addr Value P1: Write 10 to A1 Excl. A1 10 WrMs P1 A1 P1: Read A1 Excl. A1 10 P2: Read A1 Shar. A1 RdMs P2 A1 Shar. A1 10 WrBk P1 A1 10 Shar. A1 10 RdDa P2 A1 10 P2: Write 20 to A1 P2: Write 40 to A2 P1: Read A1 P2: Read A1 P1 Write 10 to A1 P2: Write 20 to A1 P2: Write 40 to A2 Remote Write Write Back Remote Write Invalid Shared Exclusive CPU Read hit Read miss on bus Write miss on bus CPU Write Place Write Miss on Bus CPU read hit CPU write hit Remote Read Write Back CPU Write Miss Write Back
  • 28. 28 Example: step4 P1 P2 Bus Memory step State Addr Value State Addr Value Action Proc. Addr Value Addr Value P1: Write 10 to A1 Excl. A1 10 WrMs P1 A1 P1: Read A1 Excl. A1 10 P2: Read A1 Shar. A1 RdMs P2 A1 Shar. A1 10 WrBk P1 A1 10 A1 10 Shar. A1 10 RdDa P2 A1 10 A1 10 P2: Write 20 to A1 Inv. Excl. A1 20 WrMs P2 A1 A1 10 P2: Write 40 to A2 P1: Read A1 P2: Read A1 P1 Write 10 to A1 P2: Write 20 to A1 P2: Write 40 to A2 Remote Write Write Back Remote Write Invalid Shared Exclusive CPU Read hit Read miss on bus Write miss on bus CPU Write Place Write Miss on Bus CPU read hit CPU write hit Remote Read Write Back
  • 29. 29 Example:step 5 P1 P2 Bus Memory step State Addr Value State Addr Value Action Proc. Addr Value Addr Value P1: Write 10 to A1 Excl. A1 10 WrMs P1 A1 P1: Read A1 Excl. A1 10 P2: Read A1 Shar. A1 RdMs P2 A1 Shar. A1 10 WrBk P1 A1 10 A1 10 Shar. A1 10 RdDa P2 A1 10 A1 10 P2: Write 20 to A1 Inv. Excl. A1 20 WrMs P2 A1 A1 10 P2: Write 40 to A2 WrMs P2 A2 A1 10 Excl. A2 40 WrBk P2 A1 20 A1 20 P1: Read A1 P2: Read A1 P1 Write 10 to A1 P2: Write 20 to A1 P2: Write 40 to A2 Remote Write Write Back Remote Write Invalid Shared Exclusive CPU Read hit Read miss on bus Write miss on bus CPU Write Place Write Miss on Bus CPU read hit CPU write hit Remote Read Write Back CPU Write Miss Write Back
  • 30. 30 x 所在的 block (w x y z) i 所在的 block ( h i j k ) 写 x, miss 取入整个 w x y* z
  • 31. 31 Snooping Cache Variations Berkeley Protocol Owned Exclusive Owned Shared Shared Invalid Basic Protocol Exclusive Shared Invalid Illinois Protocol Private Dirty Private Clean Shared Invalid Owner can update via bus invalidate operation Owner must write back when replaced in cache If read sourced from memory, then Private Clean if read sourced from other cache, then Shared Can write in cache if held private clean or dirty MESI Protocol Modfied (private,!=Memory) eXclusive (private,=Memory) Shared (shared,=Memory) Invalid
  • 32. 32 CPU Read hit / CPU read miss place read miss on bus Shared (read/only) CPU Write No need to Place Write Miss on Bus Remote Read Place Data on Bus MESI (Illinois protocol) (write back cache) Remote Write Write back block Remote Write Invalid Modified (read/write) CPU Read Place read miss on Bus CPU read hit CPU write hit Exclusive (read/only) CPU Read hit Remote Read Write back block Read-for-ownership Place read miss on bus Read-for-ownership Place read miss on bus CPU read miss Write back block Place read miss on bus CPU Write miss Write back Place Write Miss on Bus
  • 34. 34 Coherence Protocols: Extensions Shared memory bus and snooping bandwidth is bottleneck for scaling symmetric multiprocessors  Duplicating tags  Place directory in outermost cache  Use crossbars or point-to- point networks with banked memory
  • 42. 42 Nov. 12 2008 Distributed Shared Memory Each node has a local memory and cache Local or remote memory access via memory controller 互联网络 Processor +cache Memory I/O Processor +cache Memory I/O Processor +cache Memory I/O Processor +cache Memory I/O Processor +cache Memory I/O Processor +cache Memory I/O
  • 43. 43 Directory protocol Directory: track state of every block in memory, and change the state of block in cache according to directory. Information in directory Status of Every block: shared/uncached/exclusive Which processors have copies of the block: bit vector Whether the block is dirty or clean Directory protocol can be implemented with a distributed memory Directory protocol can be applied to a centralized memory organized into banks
  • 44. 44 Nov. 12 2008 Distributed Directory MPs Interconnection network Processor +cache Memory I/O Directory Processor +cache Memory I/O Directory Processor +cache Memory I/O Directory Processor +cache Memory I/O Directory Processor +cache Memory I/O Directory Processor +cache Memory I/O Directory
  • 45. 45 Nov. 12 2008 Directory protocol implementation Block status  Shared: ≥ 1 processors have data, memory up-to-date  Uncached (no processor hasit; not valid in any cache)  Exclusive: 1 processor (owner) has data; memory out-of-date Directory size = f (entry number * entry size)  Each memory block has an entry in directory / only keep the entries for cached blocks  Every processor has one bit / Limited processor bits in bit vector Directory can be distributed along with the memory to avoid becoming the bottleneck Assumptions to Keep it simple:  Writes to non-exclusive data => write miss  Processor blocks until access completes  Assume messages received and acted upon in order as sent
  • 46. 46 Nov. 12 2008 Directory Protocol No bus and don’t want to broadcast: Interconnect means no longer single arbitration point all messages have explicit responses Terms: typically 3 processors involved Local node where a request originates Home node where the memory location of an address resides Remote node has a copy of a cache block, whether exclusive or shared Example messages on next slide: P = processor number, A = address
  • 47. 47 Message type Source Destination Msg Content Read miss Local cache Home directory P, A  Processor P reads data at address A; make P a read sharer and arrange to send data back Write miss Local cache Home directory P, A  Processor P has a write miss at address A; make P the exclusive owner and arrange to send data back Invalidate Local cache Home directory A  Request to send invalidates to all remote caches that are caching the block at address A Invalidate Home directory Remote caches A  Invalidate a shared copy at address A. Fetch Home directory Remote cache A  Fetch the block at address A and send it to its home directory Fetch/Invalidate Home directory Remote cache A  Fetch the block at address A and send it to its home directory; invalidate the block in the cache Data value reply Home directory Local cache Data  Return a data value from the home memory (read miss response) Data write-back Remote cache Home directory A, Data  Write-back a data value for address A (invalidate response)
  • 48. 48 State Transition Diagram for an Individual Cache Block in a Directory Based System States identical to snoopy case; transactions very similar. Transitions caused by read misses, write misses, invalidates, data fetch requests Generates read miss & write miss msg to home directory. Write misses that were broadcast on the bus for snooping => explicit invalidate & data fetch requests. Note: on a write, a cache block is bigger, so need to read the full cache block
  • 49. 49 CPU -Cache State Machine State machine for CPU requests for each memory block Invalid state if in memory Invalid Shared (read/only) Exclusive (read/writ) CPU Read Send Read Miss CPU Write: Send Write Miss to h.d. Directory: Uncached: Send Rp; S shared; share = {p} Shared: Send Rp; share + = {p} Exclusive: Send Fetch to R.N. get reply back from R.N. Send RP;  Shared share + = {p} Directory: Uncached: Send Rp; S Exclusive share = {p} Shared: Send invalidate; Send Rp; S Exclusive share = {p} Exclusive: Send Fetch/invalidate to R.N. get reply back from R.N. Send RP to P; SExclusive share = {p}
  • 50. 50 CPU -Cache State Machine State machine for CPU requests for each memory block Invalid state if in memory Invalidate Invalid Shared (read/only) Exclusive (read/writ) CPU Read Send Read Miss CPU Write: Send Write Miss to h.d. CPU Read hit CPU Write hit:Send invalidate to home directory CPU read miss: Send Read Miss CPU Write miss:Send Write Miss to home directory
  • 51. 51 Nov. 12 2008 CPU -Cache State Machine State machine for CPU requests for each memory block Invalid state if in memory Invalidate Invalid Shared (read/only) Exclusive (read/writ) CPU Read CPU Read hit Send Read Miss CPU Write: Send Write Miss to h.d. CPU Write hit:Send invalidate to home directory CPU read hit CPU write hit Fetch/Invalidate Data Write Back Fetch: Data Write Back to home directory CPU read miss: Send Read Miss CPU write miss: Data Write Back and send Write Miss to home directory CPU read miss: Data Write Back and Send read miss to home directory CPU Write miss:Send Write Miss to home directory
  • 52. 52 Nov. 12 2008 CPU -Cache State Machine State machine for CPU requests for each memory block Invalid state if in memory Fetch/Invalidate Data Write Back Invalidate Invalid Shared (read/only) Exclusive (read/writ) CPU Read CPU Read hit Send Read Miss CPU Write: Send Write Miss to h.d. CPU Write hit:Send invalidate to home directory CPU read hit CPU write hit Fetch: Data Write Back to home directory CPU read miss: Send Read Miss CPU write miss: Data Write Back and send Write Miss to home directory CPU read miss: Data Write Back and Send read miss to home directory CPU Write miss:Send Write Miss to home directory re these write back e same ?
  • 53. 53 State Transition Diagram for the Directory Same states & structure as the transition diagram for an individual cache 2 actions: update of directory state & send msgs to statisfy requests Tracks all copies of memory block. Also indicates an action that updates the sharing set, Sharers, as well as sending a message.
  • 54. 54 Directory State Machine State machine for Directory requests for each memory block Uncached state if in memory Uncached Shared (read only) Exclusive (read/writ) Read miss: Sharers = {P} send Data Value Reply Write Miss: Data Value Reply Sharers = {P};
  • 55. 55 Directory State Machine State machine for Directory requests for each memory block Uncached state if in memory Uncached Shared (read only) Exclusive (read/writ) Read miss: Sharers = {P} send Data Value Reply Write Miss: Data Value Reply Sharers = {P}; Write Miss/ Invalidate: Send Invalidate to R.N; Sharers = {P}; (Data Value Reply) Read miss: Data Value Reply Sharers += {P};
  • 56. 56 Nov. 12 2008 Directory State Machine State machine for Directory requests for each memory block Uncached state if in memory Uncached Shared (read only) Exclusive (read/writ) Read miss: Sharers = {P} send Data Value Reply Write Miss/Invalid: Invalidate ; Sharers = {P}; Data Value Reply Write Miss: Data Value Reply Sharers = {P}; Data Write Back: Sharers = {} Read miss: Send Fetch to R.N.; Get Data from R. N. Reply Back to local processor Sharers += {P}; Read miss: Data Value Reply Sharers += {P}; Write Miss: Fetch/Invalidate; Receive Date from R.N Data Value Reply to local Node Sharers = {P};
  • 57. 57 Nov. 12 2008 Directory State Machine State machine for Directory requests for each memory block Uncached state if in memory Data Write Back: Sharers = {} Uncached Shared (read only) Exclusive (read/writ) Read miss: Sharers = {P} send Data Value Reply Write Miss: Invalidate ; Sharers = {P}; Data Value Reply Write Miss: Data Value Reply Sharers = {P}; Read miss: Fetch; Data Value Reply msg to remote cache Sharers += {P}; Read miss: Data Value Reply Sharers += {P}; Write Miss: Fetch/Invalidate; Data Value Reply msg to remote cache Sharers = {P};
  • 58. 58 Example Directory Protocol Message sent to directory causes two actions:  Update the directory  More messages to satisfy request Block is in Uncached state: the copy in memory is the current value; only possible requests for that block are:  Read miss: requesting processor sent data from memory &requestor made only sharing node; state of block made Shared.  Write miss: requesting processor is sent the value & becomes the Sharing node. The block is made Exclusive to indicate that the only valid copy is cached. Sharers indicates the identity of the owner. Block is Shared => the memory value is up-to-date:  Read miss: requesting processor is sent back the data from memory & requesting processor is added to the sharing set.  Write miss: requesting processor is sent the value. All processors in the set Sharers are sent invalidate messages, & Sharers is set to identity of requesting processor. The state of the block is made Exclusive.
  • 59. 59 Example Directory Protocol Block is Exclusive: current value of the block is held in the cache of the processor identified by the set Sharers (the owner) => three possible directory requests:  Read miss: owner processor sent data fetch message, causing state of block in owner’s cache to transition to Shared and causes owner to send data to directory, where it is written to memory & sent back to requesting processor. Identity of requesting processor is added to set Sharers, which still contains the identity of the processor that was the owner (since it still has a readable copy). State is shared.  Data write-back: owner processor is replacing the block and hence must write it back, making memory copy up-to-date (the home directory essentially becomes the owner), the block is now Uncached, and the Sharer set is empty.  Write miss: block has a new owner. A message is sent to old owner causing the cache to send the value of the block to the directory from which it is sent to the requesting processor, which becomes the new owner. Sharers is set to identity of new owner, and state of block is made Exclusive.
  • 60. 60 Case Study: p1 write 888 to x P1(local) P5(home) P2(remote) P3(remote) P1 P2 X=111 P3 X=111 P4 P5 X’HOME 111 P6 Direct: X: S, {P2,P3} Cache:X: S Cache: X: S Cache X: I WriteMiss X Invalidate X Invalidate X Ack Cache:X: I Cache:X: I Ack Direct: X: E, {P1} Reply X=111 Cache X: E, 888
  • 61. 61 P2 write 999 to X  P1(remote) P5(home) P2(local) P3(remote) P1 X=888 P2 P3 P4 P5 X’HOME 111 P6 X: S, {P2,P3} X: S X: S X: I X: I X: I X: E, {P1} X: E, 888
  • 62. 62 Answer for P2 write 999 to X  P1(remote) P5(home) P2(local) P3(remote) P1 X=888 P2 P3 P4 P5 X’HOME 111 P6 X: S, {P2,P3} X: S X: S X: I X: I X: I X: E, {P1} X: E, 888 WriteMiss Fetch/Invalidate X: I Write back X 888 888 X: E, { P2} Data reply X 888 X: E, 999 How about P2 read x ?
  • 63. More Cases for Cache Coherence of Directory Protocol
  • 64. Could you feel the blanks to complete the directories ?
  • 66. What operations will do when P0 read 300 ? I DataReply 300(0300) S 300 0300 P0 read 300 ReadMiss 300 {P0} S 300 0300
  • 67. P0(local node) P2(home node) ReadMiss for Tag=300 DataReply 0300 M2, 300, {}, U  {P0}, S Cach0, B0: I, 100, 0100 S, 300, 0300 P0 read 300
  • 68. What operations will do when P2 read 218 ? I ReadMiss(218) Fetch(218) S Writeback 218(1218) {P0,P2} S 218 1218 Modify Directory P0 read 300 DataReply 218(1218) S 218 1218
  • 69. P2(local node) P1(home node) P0(remote node) ReadMiss for Tag=218 DataReply 218(1218) M1, 218, {P0}, E, 0218 {P1,P0}, S, 1218 Cach2, B3: I, 118, 0318 S, 218, 1218 Fetch Tag=218 Cach0, B3: E, 218, 1218 S, 218, 1218 WriteBack218(1218) P2 read 218
  • 70. What operations will do when P1 write 0888 into 310 ? I WriteMiss(310) ACK {P1} E P0 read 300 invalidate 310 I I invalidate 310 ACK DataReply 310(0310) E 310 0310 E 310 0888
  • 71. P1(local node ) P2(home node) P0. P2 (remote node) WriteMiss for Tag=310 DataReply 0310 M2, 310,{P0,P2},S,0310 {P1}, E, 0310 Invalidate Tag=310 Cach0, Cache2 B2: S, 310, 0310 I, 310, 0310 Ack P1 write 0888 into 310 Cache1, B2:S, 110, 0110 E, 310, 0888 Anything UNcomfortable ?
  • 72. What operations will do when P1 write 0888 into 310 ? I WriteMiss(310) P0 read 300 { } U The directory is outdated with wrong info. How to solve it?
  • 73. P1(local node ) P2(home node) P0(remote node) WriteMiss for Tag=310 DataReply 0310 M2, 310,{P0,P2},S,0310 {P1}, E, 0310 Invalidate Tag=310 Cach0, {Cache2} B2: S, 310, 0310 I, 310, 0310 Ack P1 write 0888 into 310 Cache1, B2:S, 110, 0110 E, 310, 0888 Kickout 110 M0, 110,{P1},S,0110 { }, U, 0110
  • 74. 74 Example 1: initial A1 and A2 map to the same cache block Processor 1 Processor 2 Interconnect Memory Directory P1 P2 Bus Directory Memo step State AddrValue State Addr Value Action Proc. AddrValue AddrState {Procs} Value P1: Write 10 to A1 P1: Read A1 P2: Read A1 P2: Write 40 to A2 P1: Read A1 P2: Read A1 P1 Write 10 to A1 P2: Write 20 to A1 P2: Write 40 to A2
  • 75. 75 Example: P1 write 10 to A1 A1 and A2 map to the same cache block Processor 1 Processor 2 Interconnect Memory Directory P1 P2 Bus Directory Memo step State AddrValue State Addr Value Action Proc. AddrValue AddrState {Procs} Value P1: Write 10 to A1 WrMs P1 A1 A1 Ex {P1} Excl. A1 10 DaRp P1 A1 0 P1: Read A1 P2: Read A1 P2: Write 40 to A2 P1: Read A1 P2: Read A1 P1 Write 10 to A1 P2: Write 20 to A1 P2: Write 40 to A2
  • 76. 76 Example: P1 read A1, P2 read A1 P1 P2 Bus Directory Memo step State AddrValue State Addr Value Action Proc. AddrValue AddrState {Procs} Value P1: Write 10 to A1 WrMs P1 A1 A1 Ex {P1} Excl. A1 10 DaRp P1 A1 VA1 P1: Read A1 Excl. A1 10 P2: Read A1 RdMs P2 A1 Ftch Ph A1 Shar. A1 10 DaWb P1 A1 10 10 Shar. A1 10 DaRp P2 A1 10 A1Shar. {P1,P2} 10 P2: Write 40 to A2 P1: Read A1 P2: Read A1 P1 Write 10 to A1 P2: Write 20 to A1 P2: Write 40 to A2 A1 and A2 map to the same cache block Processor 1 Processor 2 Interconnect Memory Directory
  • 77. 77 Example: P2 write 20 to A1 A1 and A2 map to the same cache block P1: Write 10 to A1 WrMs P1 A1 A1 Ex {P1} Excl. A1 10 DaRp P1 A1 0 P1: Read A1 Excl. A1 10 P2: Read A1 RdMs P2 A1 Ftch Ph A1 Shar. A1 10 DaWb P1 A1 10 10 Shar. A1 10 DaRp P2 A1 10 A1Shar. {P1,P2} 10 Inva. P2 A1 20 A1 Ex {P2} 10 Inv. A1 10 Ex A1 20 Inva. Ph A1 10 P2: Write 40 to A2 P1: Read A1 P2: Read A1 P1 Write 10 to A1 P2: Write 20 to A1 P2: Write 40 to A2 Processor 1 Processor 2 Interconnect Memory Directory A1
  • 78. 78 Example: P2 write 40 to A2 A1 and A2 map to the same cache block P1: Write 10 to A1 WrMs P1 A1 A1 Ex {P1} Excl. A1 10 DaRp P1 A1 0 P1: Read A1 Excl. A1 10 P2: Read A1 RdMs P2 A1 A1 sharP1,P2 Ftch Ph A1 Shar. A1 10 DaWb P1 A1 10 10 Shar. A1 10 DaRp P2 A1 10 10 Inva. P2 A1 A1 Ex P2 10 Inv. A1 10 Excl A1 20 Inva. Ph A1 10 P2: Write 40 to A2 DaWbP2A2 P2 A1 20 A1 Unc {} 20 WrMs P2 A2 ValueA2 Ex P2 Excl A2 40 DaRp Ph A2 Value P1: Read A1 P2: Read A1 P1 Write 10 to A1 P2: Write 20 to A1 P2: Write 40 to A2 Processor 1 Processor 2 Interconnect Memory Directory A1
  • 80. 80 Implementation issues Nonatomic operations Write serialization Without Broadcast
  • 81. 81 Assumptions for implementation simplicity Network provides point-to-point in-order delivery of message Network has unlimited buffering Network delivers all messages within a finite time. Coherence controller is duplicated for each cache block. A transition only completes when a message has been transmitted and a data value reply received. Omit the pending status Outgoing message can be transmitted before the next incoming message is accepted.
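  • 82. Deadlock example: Assume P1 and P2 hold exclusive copies of cache blocks X1 and X2, respectively, and that the two blocks have different home directories. If each processor then writes the block held by the other, each controller can end up waiting on the other. Resolution: duplicate the coherence controller for each block.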
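  • 83. CPU-cache state machine (Nov. 12 2008): state machine for CPU requests, one per memory block. A block is in the Invalid state if it is only in memory. States: Invalid, Shared (read only), Exclusive (read/write).
      Invalid, CPU read: send Read Miss; go to Shared.
      Invalid, CPU write: send Write Miss to home directory; go to Exclusive.
      Shared, CPU read hit: stay in Shared.
      Shared, CPU read miss: send Read Miss; stay in Shared.
      Shared, CPU write hit: send Invalidate to home directory; go to Exclusive.
      Shared, CPU write miss: send Write Miss to home directory; go to Exclusive.
      Shared, Invalidate: go to Invalid.
      Exclusive, CPU read hit or CPU write hit: stay in Exclusive.
      Exclusive, Fetch: Data Write Back to home directory; go to Shared.
      Exclusive, Fetch/Invalidate: Data Write Back; go to Invalid.
      Exclusive, CPU read miss: Data Write Back and send Read Miss to home directory; go to Shared.
      Exclusive, CPU write miss: Data Write Back and send Write Miss to home directory; stay in Exclusive.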
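The state machine can be written down as a lookup table. This is a minimal sketch, assuming the usual reading of the diagram (which state each labeled arc starts from and ends in); the event and message names are hypothetical spellings of the labels on the slide.

```python
from enum import Enum


class State(Enum):
    INVALID = "Invalid"
    SHARED = "Shared"
    EXCLUSIVE = "Exclusive"


# (current state, event) -> (next state, messages sent to the home directory)
TRANSITIONS = {
    (State.INVALID, "cpu_read"): (State.SHARED, ["ReadMiss"]),
    (State.INVALID, "cpu_write"): (State.EXCLUSIVE, ["WriteMiss"]),
    (State.SHARED, "cpu_read_hit"): (State.SHARED, []),
    (State.SHARED, "cpu_read_miss"): (State.SHARED, ["ReadMiss"]),
    (State.SHARED, "cpu_write_hit"): (State.EXCLUSIVE, ["Invalidate"]),
    (State.SHARED, "cpu_write_miss"): (State.EXCLUSIVE, ["WriteMiss"]),
    (State.SHARED, "invalidate"): (State.INVALID, []),
    (State.EXCLUSIVE, "cpu_read_hit"): (State.EXCLUSIVE, []),
    (State.EXCLUSIVE, "cpu_write_hit"): (State.EXCLUSIVE, []),
    (State.EXCLUSIVE, "fetch"): (State.SHARED, ["DataWriteBack"]),
    (State.EXCLUSIVE, "fetch_invalidate"): (State.INVALID, ["DataWriteBack"]),
    (State.EXCLUSIVE, "cpu_read_miss"): (State.SHARED, ["DataWriteBack", "ReadMiss"]),
    (State.EXCLUSIVE, "cpu_write_miss"): (State.EXCLUSIVE, ["DataWriteBack", "WriteMiss"]),
}


def step(state, event):
    """Return (next_state, messages) for one event on one cache block."""
    try:
        return TRANSITIONS[(state, event)]
    except KeyError:
        raise ValueError(f"no transition from {state.value} on {event}") from None
```

For example, `step(State.SHARED, "cpu_write_hit")` yields the Exclusive state together with a single Invalidate message for the home directory. Events that have no arc in the diagram raise `ValueError` rather than being silently ignored.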
  • 84. How to ensure write serialization? The home directory serializes exclusive access: it buffers all requests (write miss / invalidate); it processes the requests in order; and it starts to process a new request only after the previous one has completed.
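The buffering rule amounts to a FIFO queue at the home directory. A minimal sketch follows; the `HomeDirectory` class and its fields are hypothetical, and each request is treated as completing instantly, whereas a real controller would wait for the fetches and acknowledgements first.

```python
from collections import deque


class HomeDirectory:
    """Serializes exclusive-access requests for its blocks (hypothetical sketch)."""

    def __init__(self):
        self.queue = deque()  # buffered write-miss / invalidate requests
        self.owner = {}       # block -> processor currently granted exclusive access
        self.log = []         # order in which requests were completed

    def receive(self, proc, kind, block):
        # Requests are only buffered on arrival, never acted on immediately.
        self.queue.append((proc, kind, block))

    def process_all(self):
        # Strictly one at a time, in arrival order: the next request is not
        # started until the previous one has been completed.
        while self.queue:
            proc, kind, block = self.queue.popleft()
            self.owner[block] = proc
            self.log.append((proc, kind, block))
        return self.log
```

Because every write to a block passes through this single queue, all nodes observe the writes to that block in the same order, namely the order recorded in `log`.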
  • 85. How to resolve the "race"? How does a processor know whether it is the winner? It gets an acknowledgement message from the home directory: a Data Reply (for a write miss) or an explicit ACK (for an invalidate). About the loser: in the simplest scheme, the home directory sends a NAK to the loser. How to know that the invalidations are completed? 1. The directory collects and counts the ACK messages from the remote nodes, and then sends a confirmation to the requester. 2. The requesting node collects and counts the ACK messages from the remote nodes directly.
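The first counting scheme, in which the directory collects the ACKs and then confirms to the requester, might look like the following sketch. The `InvalidationTracker` class and the `("Confirm", requester)` message tuple are hypothetical names chosen for illustration.

```python
class InvalidationTracker:
    """Directory-side ACK counting for one invalidation (hypothetical sketch)."""

    def __init__(self, requester, sharers):
        self.requester = requester
        # Every sharer except the requester must acknowledge its invalidate.
        self.waiting = set(sharers) - {requester}

    @property
    def done(self):
        return not self.waiting

    def ack(self, node):
        """Record one ACK; return the confirmation message once all have arrived."""
        if node not in self.waiting:
            return None  # duplicate or unexpected ACK: ignore it
        self.waiting.remove(node)
        if self.done:
            return ("Confirm", self.requester)
        return None
```

If the requester is the only sharer, `done` is already true on construction and no ACK will ever arrive, so a caller would have to send the confirmation immediately in that case.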
  • 86. Buffer requirement: A large number of buffers is required, because a write miss may produce a large number of invalidate messages, prefetch schemes might be used, and there may be multiple outstanding misses. In practice, buffering is limited.
  • 87. Avoiding deadlock with limited buffering: Deadlock arises from three properties. First, more than one resource is needed to complete a transaction (buffers for requests, for replies, and for accepting messages). Second, resources are held until a nonatomic transaction completes. Third, there is no global partial order on the acquisition of resources.
  • 88. Resolution strategy: try to ensure that the resources will always be available. Separate networks are used for requests and replies. Every request that needs a reply allocates the space to accept the reply when the request is generated. The replier can free the reply buffer. Any controller can reject any request with a NAK, but never NAKs a reply. Any request that receives a NAK is simply retried.
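Two of these rules, reserving reply space before a request is issued and NAKing requests (never replies) when a controller is full, can be sketched as follows. The class names, the slot counts, and the `"ACCEPTED"`, `"NAK"`, and `"BLOCKED"` return values are assumptions made for illustration only.

```python
class Controller:
    """Receiving side with a bounded request buffer (hypothetical sketch)."""

    def __init__(self, request_slots):
        self.request_slots = request_slots
        self.requests = []

    def handle_request(self, request):
        # A full controller may reject a request, but it never rejects a reply.
        if len(self.requests) >= self.request_slots:
            return "NAK"
        self.requests.append(request)
        return "ACCEPTED"

    def complete_one(self):
        # Finish the oldest buffered request, freeing its slot.
        return self.requests.pop(0)


class Requester:
    """Sending side that reserves reply space up front (hypothetical sketch)."""

    def __init__(self, reply_slots):
        self.free_reply_slots = reply_slots
        self.reserved = set()

    def send(self, controller, request):
        if request not in self.reserved:
            if self.free_reply_slots == 0:
                return "BLOCKED"  # no room for the reply, so do not issue
            self.free_reply_slots -= 1
            self.reserved.add(request)
        # A retry after a NAK reuses the slot that is already reserved.
        return controller.handle_request(request)

    def accept_reply(self, request):
        # Replies can always be accepted because their space was reserved.
        self.reserved.remove(request)
        self.free_reply_slots += 1
```

A request that comes back as `"NAK"` is simply passed to `send` again later. Because the reply slot stays reserved across retries, the eventual reply can always be accepted.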
  • 89. Multithreaded directory to handle multiple blocks: The directory controller must be reentrant, that is, able to handle incoming requests for independent blocks before the previous request has finished. Control state needs to be saved and restored while a fetch (or fetch/invalidate) is outstanding. The owner node can provide the data directly to the requester as well as to the home node, to reduce latency. The number of outstanding transactions can be limited by sending NAKs to new requests.
  • 90. How to deal with a NAK? How to know which is the original transaction? 1. The processor keeps track of its outstanding requests. 2. The original request is packed into the NAK. 3. The buffer holding the return slot for the request can also hold information about the request, so that when a NAK is received, the processor knows which request to resend.
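The first option, in which the processor tracks its own outstanding requests, can be sketched with a small table keyed by a request tag. The `OutstandingRequests` class and the integer tags are hypothetical; the slide does not say how requests are identified.

```python
import itertools


class OutstandingRequests:
    """Per-processor table of requests awaiting a reply (hypothetical sketch)."""

    def __init__(self):
        self._next_tag = itertools.count(1)
        self._table = {}  # tag -> (kind, block)

    def issue(self, kind, block):
        """Record a new request and return the message to send."""
        tag = next(self._next_tag)
        self._table[tag] = (kind, block)
        return (tag, kind, block)

    def on_nak(self, tag):
        """A NAK carries only the tag; rebuild the original request to resend."""
        kind, block = self._table[tag]
        return (tag, kind, block)

    def on_reply(self, tag):
        """The reply completes the request, so its table entry is released."""
        return self._table.pop(tag)

    def __len__(self):
        return len(self._table)
```

With this bookkeeping the NAK itself can stay small. The second option on the slide makes the opposite trade: the NAK grows to carry the whole request and the processor needs no table.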
  • 91. Summary: Caches contain all the information on the state of cached memory blocks. Snooping and directory protocols are similar; a bus makes snooping easier because of broadcast (snooping => uniform memory access). A directory has an extra data structure to keep track of the state of all cache blocks. Distributing the directory => scalable shared-address multiprocessor => cache-coherent, non-uniform memory access.
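  • 92. How about a write-through cache with write invalidate? States: Invalid, Valid (bus action in brackets).
      Invalid: PR [BR miss on bus] goes to Valid; PW [BW miss on bus].
      Valid: BR and PR stay in Valid; PW [send BW] stays in Valid; BW goes to Invalid.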
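The two-state protocol is small enough to simulate directly. The sketch below reads PR/PW as processor read/write and BR/BW as bus read/write. It also assumes no write allocation, so a PW in Invalid puts a BW on the bus and leaves the block Invalid; the diagram labels alone do not settle that point. The function and table names are hypothetical.

```python
# (state, event) -> (next state, bus transaction generated or None)
TRANSITIONS = {
    ("Invalid", "PR"): ("Valid", "BR"),    # read miss: fetch the block over the bus
    ("Invalid", "PW"): ("Invalid", "BW"),  # assumption: no write allocation
    ("Invalid", "BR"): ("Invalid", None),
    ("Invalid", "BW"): ("Invalid", None),
    ("Valid", "PR"): ("Valid", None),      # read hit
    ("Valid", "PW"): ("Valid", "BW"),      # write through: every write goes to the bus
    ("Valid", "BR"): ("Valid", None),
    ("Valid", "BW"): ("Invalid", None),    # another cache wrote: invalidate our copy
}


def simulate(ops, num_caches=2):
    """Apply (cache index, 'PR' or 'PW') operations to one block; return final states."""
    states = ["Invalid"] * num_caches
    for cpu, op in ops:
        states[cpu], bus = TRANSITIONS[(states[cpu], op)]
        if bus is not None:
            # Every other cache snoops the bus transaction.
            for other in range(num_caches):
                if other != cpu:
                    states[other], _ = TRANSITIONS[(states[other], bus)]
    return states
```

For instance, `simulate([(0, "PR"), (1, "PR"), (0, "PW")])` ends with cache 0 Valid and cache 1 Invalid, since the write by cache 0 invalidates the copy held by cache 1.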

Editor's Notes

  • #21, #23, #49 to #52, #54 to #57, #83: From Invalid, a read leads to Shared and a write leads to Dirty; Shared looks the same.
  • #25: Assumes the initial cache state is Invalid and that A1 and A2 map to the same cache block, but A1 != A2.
  • #29: Why the write miss first? In general a write updates only a piece of a block, so the block may need to be read first to obtain a full block; the write-back is therefore a low-priority event.