2. 2
Centralized Shared-Memory
Architecture
Characteristics of SMP
Limited number of processor nodes----small scale; the processors share a
single physical memory connected by a shared
bus.
Large caches----so the bus and the single memory can supply a sufficient amount of
memory bandwidth.
Increase effective bandwidth relative to the bus/memory
Reduce access latency
Valuable for both private data and shared data
UMA----uniform memory access time.
3. 3
Major issues for Shared Memory
Cache coherence (which value is returned; same location)
“Common Sense”
P1 Read[X] => P1 Write[X] => P1 Read[X] returns the value P1 wrote to X
P2 Read[X] => P1 Write[X] => P2 Read[X] returns the value written by P1 (if the read and write are sufficiently far apart)
P1 Write[X] => P2 Write[X] => serialized (all processors see the writes in
the same order)
Synchronization
Atomic read/write operations
Memory consistency Model ( order, different locations)
In what order must a processor observe the data writes of another
processor ?
What properties must be enforced among reads and writes to
different locations by different processors?
These are not issues for message passing systems
Why?
9. 9
What Does Coherency Mean?
Informally:
“Any read must return the most recent write”
Too strict and too difficult to implement
Better:
“Any write must eventually be seen by a read”
All writes are seen in proper order (“serialization”)
Two rules to ensure this:
“If P writes x and P1 reads it, P’s write will be seen by
P1 if the read and write are sufficiently far apart”
Writes to a single location are serialized:
seen in one order
Latest write will be seen
Otherwise could see writes in illogical order
(could see older value after a newer value)
10. 10
Definition of Cache coherence
Cache coherence
P1 Read[X] => P1 Write[X] => P1 Read[X] returns the value P1 wrote to X
P2 Read[X] => P1 Write[X] => P2 Read[X] returns the value written by P1 (if the read and write are sufficiently far apart)
P1 Write[X] => P2 Write[X] => serialized (all processors see
the writes in the same order)
11. 11
HW Coherence Protocols
Snooping Solution (Snoopy Bus):
Send all requests for data to all processors
Processors snoop to see if they have a copy and respond
accordingly
Requires broadcast, since caching information is at processors
Works well with bus (natural broadcast medium)
Dominates for small scale machines (most of the market)
Directory-Based Schemes (discuss later)
Keep track of what is being shared in 1 centralized place (logically)
Distributed memory => distributed directory for scalability
(avoids bottlenecks)
Send point-to-point requests to processors via network
Scales better than Snooping
Actually existed BEFORE Snooping-based schemes
12. 12
Snooping solution
Every cache that has a copy of the data from a block of
physical memory also has a copy of the sharing status of
the block, but no centralized state is kept.
13. 13
Basic Snoopy Protocols
Write Invalidate Protocol:
Multiple readers, single writer
Write to shared data: an invalidate is sent to all caches
which snoop and invalidate any copies
Read Miss:
Write-through: memory is always up-to-date
Write-back: snoop in caches to find most recent copy
Write Broadcast Protocol (typically write through):
Write to shared data: broadcast on bus, processors
snoop, and update any copies
Read miss: memory is always up-to-date
Write serialization: bus serializes requests!
Bus is single point of arbitration
14. 14
Ex: write-back cache, write invalidate
Processor activity | Bus activity | Contents of CPU A's cache | Contents of CPU B's cache | Contents of memory location X
 | | | | 0
CPU A reads X | Cache miss for X | 0 | | 0
CPU B reads X | Cache miss for X | 0 | 0 | 0
CPU A writes a 1 to X | Invalidation for X | 1 | | 0
CPU B reads X | Cache miss for X | 1 | 1 | 1
Mechanics
Broadcast the address of the cache line to invalidate
All processors snoop, then invalidate the line if it is in their local cache
The same snooping policy can be used to service cache misses in write-back caches
15. 15
Ex: write-back cache, update (broadcast)
Processor activity | Bus activity | Contents of CPU A's cache | Contents of CPU B's cache | Contents of memory location X
 | | | | 0
CPU A reads X | Cache miss for X | 0 | | 0
CPU B reads X | Cache miss for X | 0 | 0 | 0
CPU A writes a 1 to X | Write broadcast of X | 1 | 1 | 1
CPU B reads X | | 1 | 1 | 1
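The two examples can be replayed with a small simulation. This is only a sketch under simplifying assumptions: two caches, one location X, and no capacity or replacement effects. The function name `run` and the row layout are illustrative, not part of the slides.

```python
def run(policy):
    """Replay: A reads X, B reads X, A writes 1 to X, B reads X."""
    memory = {"X": 0}
    caches = {"A": {}, "B": {}}   # cpu -> {address: cached value}
    dirty = set()                 # (cpu, address) copies newer than memory
    rows = []                     # (activity, bus activity, A's copy, B's copy, memory)

    def record(activity, bus, addr):
        rows.append((activity, bus, caches["A"].get(addr),
                     caches["B"].get(addr), memory[addr]))

    def read(cpu, addr):
        bus = ""
        if addr not in caches[cpu]:
            bus = f"Cache miss for {addr}"
            # A dirty copy in another cache is written back before the miss is served.
            for other in caches:
                if (other, addr) in dirty:
                    memory[addr] = caches[other][addr]
                    dirty.discard((other, addr))
            caches[cpu][addr] = memory[addr]
        record(f"CPU {cpu} reads {addr}", bus, addr)

    def write(cpu, addr, value):
        caches[cpu][addr] = value
        if policy == "invalidate":
            bus = f"Invalidation for {addr}"
            for other in caches:
                if other != cpu:
                    caches[other].pop(addr, None)   # drop any other copy
            dirty.add((cpu, addr))                  # memory is now stale
        else:
            bus = f"Write broadcast of {addr}"
            for other in caches:
                if other != cpu and addr in caches[other]:
                    caches[other][addr] = value     # update other copies
            memory[addr] = value                    # memory updated on broadcast
        record(f"CPU {cpu} writes {value} to {addr}", bus, addr)

    read("A", "X")
    read("B", "X")
    write("A", "X", 1)
    read("B", "X")
    return rows


for policy in ("invalidate", "update"):
    print(policy)
    for row in run(policy):
        print("  ", row)
```

With write invalidate, B's final read misses and forces A's write back; with write update, B's final read hits and causes no bus activity.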
16. 16
Bus-based protocols (Snooping)
Snooping
All caches see and react to all bus events
Protocol relies on global visibility of events
(ordered broadcast)
The serialization of access by the bus forces
serialization of writes.
Events:
Processor (events from own processor)
Read (R), Write (W), Writeback (WB)
Bus Events (events from other processors)
Bus Read (BR), Bus Write (BW)
18. 18
5 snooping protocols
Name | Protocol type | Memory-write policy | Unique feature | Machines using
Write Once | Write invalidate | Write back after first write | First snooping protocol described in literature |
Synapse N+1 | Write invalidate | Write back | Explicit state where memory is the owner | Synapse machines; first cache-coherent machines available
Berkeley | Write invalidate | Write back | Owned shared state | Berkeley SPUR machine
Illinois | Write invalidate | Write back | Clean private state; can supply data from any cache with a clean copy | SGI Power and Challenge series
“Firefly” | Write broadcast | Write back when private, write through when shared | Memory updated on broadcast | No current machines; SPARCCenter 2000 closest
21. 21
Snoopy-Cache State Machine-I
State machine for CPU requests, for each cache block
Cache block states: Invalid, Shared (read only), Exclusive (read/write)
Invalid:
  CPU read: place read miss on bus => Shared
  CPU write: place write miss on bus => Exclusive
Shared:
  CPU read hit => Shared
  CPU read miss: place read miss on bus => Shared
  CPU write: place write miss on bus => Exclusive
Exclusive:
  CPU read hit, CPU write hit => Exclusive
  CPU read miss: write back block, place read miss on bus => Shared
  CPU write miss: write back cache block, place write miss on bus => Exclusive
22. 22
Snoopy-Cache State Machine-II
State machine for bus requests, for each cache block
Cache block states: Invalid, Shared (read only), Exclusive (read/write)
Shared:
  Write miss for this block => Invalid
Exclusive:
  Write miss for this block: write back block (abort memory access) => Invalid
  Read miss for this block: write back block (abort memory access) => Shared
23. 23
Snoopy-Cache State Machine-III
State machine for CPU requests and for bus requests, for each cache block
Cache block states: Invalid, Shared (read only), Exclusive (read/write)
Invalid:
  CPU read: place read miss on bus => Shared
  CPU write: place write miss on bus => Exclusive
Shared:
  CPU read hit => Shared
  CPU read miss: place read miss on bus => Shared
  CPU write: place write miss on bus => Exclusive
  Write miss for this block (from bus) => Invalid
Exclusive:
  CPU read hit, CPU write hit => Exclusive
  CPU read miss: write back block, place read miss on bus => Shared
  CPU write miss: write back cache block, place write miss on bus => Exclusive
  Write miss for this block (from bus): write back block (abort memory access) => Invalid
  Read miss for this block (from bus): write back block (abort memory access) => Shared
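The combined state machine can be written down as a lookup table. The sketch below is one possible encoding, not the slides' own notation: the event names (`cpu_read`, `bus_write_miss`, and so on) are made up for illustration, and any (state, event) pair not listed is assumed to leave the state unchanged with no bus action.

```python
# (current state, event) -> (next state, action); event names are illustrative.
TRANSITIONS = {
    ("Invalid", "cpu_read"): ("Shared", "place read miss on bus"),
    ("Invalid", "cpu_write"): ("Exclusive", "place write miss on bus"),
    ("Shared", "cpu_read_hit"): ("Shared", None),
    ("Shared", "cpu_read_miss"): ("Shared", "place read miss on bus"),
    ("Shared", "cpu_write"): ("Exclusive", "place write miss on bus"),
    ("Shared", "bus_write_miss"): ("Invalid", None),
    ("Exclusive", "cpu_read_hit"): ("Exclusive", None),
    ("Exclusive", "cpu_write_hit"): ("Exclusive", None),
    ("Exclusive", "cpu_read_miss"):
        ("Shared", "write back block; place read miss on bus"),
    ("Exclusive", "cpu_write_miss"):
        ("Exclusive", "write back block; place write miss on bus"),
    ("Exclusive", "bus_read_miss"):
        ("Shared", "write back block; abort memory access"),
    ("Exclusive", "bus_write_miss"):
        ("Invalid", "write back block; abort memory access"),
}


def step(state, event):
    """Return (next_state, action); unlisted pairs are treated as no-ops."""
    return TRANSITIONS.get((state, event), (state, None))


# One block as seen by a single cache: it writes, reads, then snoops
# another processor's read miss and later its write miss.
state = "Invalid"
for event in ("cpu_write", "cpu_read_hit", "bus_read_miss", "bus_write_miss"):
    state, action = step(state, event)
    print(event, "=>", state, "|", action)
```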
24. 24
Example
step | P1 (State Addr Value) | P2 (State Addr Value) | Bus (Action Proc. Addr Value) | Memory (Addr Value)
P1: Write 10 to A1 | | | |
P1: Read A1 | | | |
P2: Read A1 | | | |
P2: Write 20 to A1 | | | |
P2: Write 40 to A2 | | | |
Assumes A1 and A2 map to the same cache block and the
initial cache state is invalid
25. 25
Example: step 1
step | P1 (State Addr Value) | P2 (State Addr Value) | Bus (Action Proc. Addr Value) | Memory (Addr Value)
P1: Write 10 to A1 | Excl. A1 10 | | WrMs P1 A1 |
P1: Read A1 | | | |
P2: Read A1 | | | |
P2: Write 20 to A1 | | | |
P2: Write 40 to A2 | | | |
[Figure: snoopy-cache state diagram (Invalid, Shared, Exclusive) marking the transition taken in this step]
Assumes initial cache state
is invalid and A1 and A2 map
to same cache block,
but A1 != A2.
26. 26
Example: step 2
step | P1 (State Addr Value) | P2 (State Addr Value) | Bus (Action Proc. Addr Value) | Memory (Addr Value)
P1: Write 10 to A1 | Excl. A1 10 | | WrMs P1 A1 |
P1: Read A1 | Excl. A1 10 | | |
P2: Read A1 | | | |
P2: Write 20 to A1 | | | |
P2: Write 40 to A2 | | | |
[Figure: snoopy-cache state diagram (Invalid, Shared, Exclusive) marking the transition taken in this step]
27. 27
Example: step 3
step | P1 (State Addr Value) | P2 (State Addr Value) | Bus (Action Proc. Addr Value)
P1: Write 10 to A1 | Excl. A1 10 | | WrMs P1 A1
P1: Read A1 | Excl. A1 10 | |
P2: Read A1 | | Shar. A1 | RdMs P2 A1
 | Shar. A1 10 | | WrBk P1 A1 10
 | | Shar. A1 10 | RdDa P2 A1 10
P2: Write 20 to A1 | | |
P2: Write 40 to A2 | | |
[Figure: snoopy-cache state diagram (Invalid, Shared, Exclusive) marking the transition taken in this step]
28. 28
Example: step 4
step | P1 (State Addr Value) | P2 (State Addr Value) | Bus (Action Proc. Addr Value) | Memory (Addr Value)
P1: Write 10 to A1 | Excl. A1 10 | | WrMs P1 A1 |
P1: Read A1 | Excl. A1 10 | | |
P2: Read A1 | | Shar. A1 | RdMs P2 A1 |
 | Shar. A1 10 | | WrBk P1 A1 10 | A1 10
 | | Shar. A1 10 | RdDa P2 A1 10 | A1 10
P2: Write 20 to A1 | Inv. | Excl. A1 20 | WrMs P2 A1 | A1 10
P2: Write 40 to A2 | | | |
[Figure: snoopy-cache state diagram (Invalid, Shared, Exclusive) marking the transition taken in this step]
29. 29
Example: step 5
step | P1 (State Addr Value) | P2 (State Addr Value) | Bus (Action Proc. Addr Value) | Memory (Addr Value)
P1: Write 10 to A1 | Excl. A1 10 | | WrMs P1 A1 |
P1: Read A1 | Excl. A1 10 | | |
P2: Read A1 | | Shar. A1 | RdMs P2 A1 |
 | Shar. A1 10 | | WrBk P1 A1 10 | A1 10
 | | Shar. A1 10 | RdDa P2 A1 10 | A1 10
P2: Write 20 to A1 | Inv. | Excl. A1 20 | WrMs P2 A1 | A1 10
P2: Write 40 to A2 | | | WrMs P2 A2 | A1 10
 | | Excl. A2 40 | WrBk P2 A1 20 | A1 20
[Figure: snoopy-cache state diagram (Invalid, Shared, Exclusive) marking the transition taken in this step]
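The five-step trace can be reproduced with a small simulator. This is a sketch under the slide's assumptions (initial state invalid, A1 and A2 map to the same block), modeled here as a one-block cache per processor. The class and function names are illustrative, and the bus log uses the slide's abbreviations (WrMs, RdMs, WrBk, RdDa).

```python
class Cache:
    """One-block cache, so A1 and A2 compete for the same block."""
    def __init__(self, name):
        self.name, self.state, self.addr, self.value = name, "Invalid", None, None


def simulate(ops):
    memory, bus = {}, []
    caches = {"P1": Cache("P1"), "P2": Cache("P2")}

    def write_back(c):
        memory[c.addr] = c.value
        bus.append(("WrBk", c.name, c.addr, c.value))

    def snoop(requester, kind, addr):
        # Every other cache holding the block reacts to the bus event.
        for c in caches.values():
            if c is requester or c.addr != addr or c.state == "Invalid":
                continue
            if c.state == "Exclusive":
                write_back(c)                      # supply the dirty block
            c.state = "Shared" if kind == "RdMs" else "Invalid"

    for proc, op, addr, value in ops:
        c = caches[proc]
        hit = c.addr == addr and c.state != "Invalid"
        if op == "read":
            if not hit:
                bus.append(("RdMs", proc, addr, None))
                if c.state == "Exclusive":
                    write_back(c)                  # evict the dirty block
                snoop(c, "RdMs", addr)
                c.state, c.addr, c.value = "Shared", addr, memory.get(addr)
                bus.append(("RdDa", proc, addr, c.value))
        else:
            if not (hit and c.state == "Exclusive"):
                bus.append(("WrMs", proc, addr, None))
                if c.state == "Exclusive":
                    write_back(c)                  # evict the dirty block
                snoop(c, "WrMs", addr)
            c.state, c.addr, c.value = "Exclusive", addr, value
    return caches, memory, bus


ops = [("P1", "write", "A1", 10), ("P1", "read", "A1", None),
       ("P2", "read", "A1", None), ("P2", "write", "A1", 20),
       ("P2", "write", "A2", 40)]
caches, memory, bus = simulate(ops)
for entry in bus:
    print(entry)
```

As in the table, the last step places the write miss for A2 on the bus before the write back of A1.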
30. 30
Block containing x: (w x y z)
Block containing i: (h i j k)
Write x: miss
Fetch the whole block w x y* z
31. 31
Snooping Cache Variations
Basic Protocol
  Exclusive
  Shared
  Invalid
Berkeley Protocol
  Owned Exclusive
  Owned Shared
  Shared
  Invalid
  Owner can update via bus invalidate operation
  Owner must write back when replaced in cache
Illinois Protocol
  Private Dirty
  Private Clean
  Shared
  Invalid
  If read sourced from memory, then Private Clean
  If read sourced from other cache, then Shared
  Can write in cache if held private clean or dirty
MESI Protocol
  Modified (private, != Memory)
  eXclusive (private, = Memory)
  Shared (shared, = Memory)
  Invalid
32. 32
MESI (Illinois protocol) (write-back cache)
[Figure: state diagram with states Invalid, Shared (read only), Exclusive (read only), and Modified (read/write). Transition labels recoverable from the figure:]
CPU read (from Invalid): place read miss on bus
CPU read hit / CPU read miss (Shared): place read miss on bus
CPU read hit (Exclusive)
CPU read hit, CPU write hit (Modified)
CPU write: no need to place write miss on bus
Read-for-ownership: place read miss on bus (shown on two transitions)
CPU read miss: write back block, place read miss on bus
CPU write miss: write back, place write miss on bus
Remote read: place data on bus
Remote read: write back block
Remote write: write back block
Remote write
34. 34
Coherence Protocols: Extensions
Shared memory bus and
snooping bandwidth is
bottleneck for scaling
symmetric multiprocessors
Duplicating tags
Place directory in outermost
cache
Use crossbars or point-to-
point networks with banked
memory
42. 42 Nov. 12 2008
Distributed Shared Memory
Each node has a local memory and cache
Local or remote memory access via memory controller
[Figure: six nodes, each with a processor + cache, memory, and I/O, connected by an interconnection network]
43. 43
Directory protocol
Directory: tracks the state of every block in memory;
the state of a block in a cache is changed according to
the directory.
Information in directory
Status of Every block: shared/uncached/exclusive
Which processors have copies of the block: bit vector
Whether the block is dirty or clean
Directory protocol can be implemented with a
distributed memory
Directory protocol can be applied to a centralized
memory organized into banks
45. 45
Directory protocol
implementation
Block status
Shared: one or more processors have the data; memory is up-to-date
Uncached: no processor has it; not valid in any cache
Exclusive: one processor (the owner) has the data; memory is out-of-date
Directory size = f (number of entries * entry size)
Number of entries: one entry per memory block, or entries only
for cached blocks
Entry size: one bit per processor, or a limited number of processor bits in the bit vector
Directory can be distributed along with the memory to
avoid becoming the bottleneck
Assumptions to Keep it simple:
Writes to non-exclusive data => write miss
Processor blocks until access completes
Assume messages received and acted upon in order as sent
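The directory-size relation can be made concrete for the full-bit-vector case. The sketch below assumes one entry per memory block, one presence bit per processor, and 2 state bits (enough for shared/uncached/exclusive); the function names and the 2-bit figure are illustrative assumptions, not from the slides.

```python
def directory_bits(memory_bytes, block_bytes, processors, state_bits=2):
    """Total directory storage in bits: entries * entry size."""
    entries = memory_bytes // block_bytes      # one entry per memory block
    entry_bits = processors + state_bits       # bit vector + block status
    return entries * entry_bits


def directory_overhead(block_bytes, processors, state_bits=2):
    """Directory bits per bit of data memory."""
    return (processors + state_bits) / (block_bytes * 8)


# Hypothetical machine: 1 GiB of memory, 64-byte blocks, 64 processors.
print(directory_bits(1 << 30, 64, 64))
print(directory_overhead(64, 64))
```

Because the entry grows with the processor count, the overhead grows linearly with the number of processors, which is the motivation for keeping entries only for cached blocks or limiting the bits in the vector.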
46. 46
Directory Protocol
No bus and don’t want to broadcast:
Interconnect means no longer single arbitration point
all messages have explicit responses
Terms: typically 3 processors involved
Local node where a request originates
Home node where the memory location
of an address resides
Remote node has a copy of a cache
block, whether exclusive or shared
Example messages on next slide:
P = processor number, A = address
47. 47
Message type | Source | Destination | Msg content | Function
Read miss | Local cache | Home directory | P, A | Processor P reads data at address A; make P a read sharer and arrange to send data back
Write miss | Local cache | Home directory | P, A | Processor P has a write miss at address A; make P the exclusive owner and arrange to send data back
Invalidate | Local cache | Home directory | A | Request to send invalidates to all remote caches that are caching the block at address A
Invalidate | Home directory | Remote caches | A | Invalidate a shared copy at address A
Fetch | Home directory | Remote cache | A | Fetch the block at address A and send it to its home directory
Fetch/Invalidate | Home directory | Remote cache | A | Fetch the block at address A and send it to its home directory; invalidate the block in the cache
Data value reply | Home directory | Local cache | Data | Return a data value from the home memory (read miss response)
Data write-back | Remote cache | Home directory | A, Data | Write back a data value for address A (invalidate response)
48. 48
State Transition Diagram for an Individual
Cache Block in a Directory Based System
States identical to snoopy case; transactions very
similar.
Transitions caused by read misses, write misses,
invalidates, data fetch requests
Generates read miss & write miss msg to home
directory.
Write misses that were broadcast on the bus for
snooping => explicit invalidate & data fetch requests.
Note: on a write, the cache block is bigger than the data being written, so the
full cache block must be read
49. 49
CPU -Cache State Machine
State machine for CPU requests, for each memory block (Invalid state if the block is only in memory)
Cache block states: Invalid, Shared (read only), Exclusive (read/write)
Invalid:
  CPU read: send Read Miss to home directory (h.d.) => Shared
  CPU write: send Write Miss to h.d. => Exclusive
Directory response to the Read Miss:
  Uncached: send reply (Rp); state => Shared; share = {p}
  Shared: send Rp; share += {p}
  Exclusive: send Fetch to remote node (R.N.); get reply back from R.N.; send Rp; state => Shared; share += {p}
Directory response to the Write Miss:
  Uncached: send Rp; state => Exclusive; share = {p}
  Shared: send Invalidate; send Rp; state => Exclusive; share = {p}
  Exclusive: send Fetch/Invalidate to R.N.; get reply back from R.N.; send Rp to P; state => Exclusive; share = {p}
50. 50
CPU -Cache State Machine
State machine for CPU requests, for each memory block (Invalid state if the block is only in memory)
Cache block states: Invalid, Shared (read only), Exclusive (read/write)
Invalid:
  CPU read: send Read Miss => Shared
  CPU write: send Write Miss to home directory (h.d.) => Exclusive
Shared:
  CPU read hit => Shared
  CPU read miss: send Read Miss => Shared
  CPU write hit: send Invalidate to home directory => Exclusive
  CPU write miss: send Write Miss to home directory => Exclusive
  Invalidate => Invalid
51. 51
CPU -Cache State Machine
State machine for CPU requests, for each memory block (Invalid state if the block is only in memory)
Cache block states: Invalid, Shared (read only), Exclusive (read/write)
Invalid:
  CPU read: send Read Miss => Shared
  CPU write: send Write Miss to home directory (h.d.) => Exclusive
Shared:
  CPU read hit => Shared
  CPU read miss: send Read Miss => Shared
  CPU write hit: send Invalidate to home directory => Exclusive
  CPU write miss: send Write Miss to home directory => Exclusive
  Invalidate => Invalid
Exclusive:
  CPU read hit, CPU write hit => Exclusive
  Fetch: Data Write Back to home directory => Shared
  Fetch/Invalidate: Data Write Back => Invalid
  CPU read miss: Data Write Back and send Read Miss to home directory => Shared
  CPU write miss: Data Write Back and send Write Miss to home directory => Exclusive
52. 52
CPU -Cache State Machine
State machine for CPU requests, for each memory block (Invalid state if the block is only in memory)
Cache block states: Invalid, Shared (read only), Exclusive (read/write)
Invalid:
  CPU read: send Read Miss => Shared
  CPU write: send Write Miss to home directory (h.d.) => Exclusive
Shared:
  CPU read hit => Shared
  CPU read miss: send Read Miss => Shared
  CPU write hit: send Invalidate to home directory => Exclusive
  CPU write miss: send Write Miss to home directory => Exclusive
  Invalidate => Invalid
Exclusive:
  CPU read hit, CPU write hit => Exclusive
  Fetch: Data Write Back to home directory => Shared
  Fetch/Invalidate: Data Write Back => Invalid
  CPU read miss: Data Write Back and send Read Miss to home directory => Shared
  CPU write miss: Data Write Back and send Write Miss to home directory => Exclusive
Are these write backs the same?
53. 53
State Transition Diagram for the
Directory
Same states & structure as the transition
diagram for an individual cache
2 actions: update the directory state & send
msgs to satisfy requests
Tracks all copies of memory block.
Also indicates an action that updates the
sharing set, Sharers, as well as sending a
message.
54. 54
Directory State Machine
State machine for directory requests, for each memory block (Uncached state if the block is only in memory)
Directory states: Uncached, Shared (read only), Exclusive (read/write)
Uncached:
  Read miss: Sharers = {P}; send Data Value Reply => Shared
  Write miss: Data Value Reply; Sharers = {P} => Exclusive
55. 55
Directory State Machine
State machine for directory requests, for each memory block (Uncached state if the block is only in memory)
Directory states: Uncached, Shared (read only), Exclusive (read/write)
Uncached:
  Read miss: Sharers = {P}; send Data Value Reply => Shared
  Write miss: Data Value Reply; Sharers = {P} => Exclusive
Shared:
  Read miss: Data Value Reply; Sharers += {P} => Shared
  Write miss / Invalidate: send Invalidate to remote nodes (R.N.); Sharers = {P}; (Data Value Reply) => Exclusive
56. 56
Directory State Machine
State machine for directory requests, for each memory block (Uncached state if the block is only in memory)
Directory states: Uncached, Shared (read only), Exclusive (read/write)
Uncached:
  Read miss: Sharers = {P}; send Data Value Reply => Shared
  Write miss: Data Value Reply; Sharers = {P} => Exclusive
Shared:
  Read miss: Data Value Reply; Sharers += {P} => Shared
  Write miss / Invalidate: Invalidate; Sharers = {P}; Data Value Reply => Exclusive
Exclusive:
  Read miss: send Fetch to remote node (R.N.); get data from R.N.; reply back to local processor; Sharers += {P} => Shared
  Write miss: Fetch/Invalidate; receive data from R.N.; Data Value Reply to local node; Sharers = {P} => Exclusive
  Data Write Back: Sharers = {} => Uncached
57. 57
Directory State Machine
State machine for directory requests, for each memory block (Uncached state if the block is only in memory)
Directory states: Uncached, Shared (read only), Exclusive (read/write)
Uncached:
  Read miss: Sharers = {P}; send Data Value Reply => Shared
  Write miss: Data Value Reply; Sharers = {P} => Exclusive
Shared:
  Read miss: Data Value Reply; Sharers += {P} => Shared
  Write miss: Invalidate; Sharers = {P}; Data Value Reply => Exclusive
Exclusive:
  Read miss: Fetch (msg to remote cache); Data Value Reply; Sharers += {P} => Shared
  Write miss: Fetch/Invalidate (msg to remote cache); Data Value Reply; Sharers = {P} => Exclusive
  Data Write Back: Sharers = {} => Uncached
58. 58
Example Directory Protocol
Message sent to directory causes two actions:
Update the directory
More messages to satisfy request
Block is in Uncached state: the copy in memory is the current
value; only possible requests for that block are:
Read miss: the requesting processor is sent the data from memory & the requestor
is made the only sharing node; the state of the block is made Shared.
Write miss: the requesting processor is sent the value & becomes the sharing
node. The block is made Exclusive to indicate that the only valid copy is
cached. Sharers indicates the identity of the owner.
Block is Shared => the memory value is up-to-date:
Read miss: requesting processor is sent back the data from memory &
requesting processor is added to the sharing set.
Write miss: requesting processor is sent the value. All processors in the
set Sharers are sent invalidate messages, & Sharers is set to identity of
requesting processor. The state of the block is made Exclusive.
59. 59
Example Directory Protocol
Block is Exclusive: current value of the block is held in the cache of the
processor identified by the set Sharers (the owner) => three possible
directory requests:
Read miss: owner processor sent data fetch message, causing state of block in
owner’s cache to transition to Shared and causes owner to send data to directory,
where it is written to memory & sent back to requesting processor.
Identity of requesting processor is added to set Sharers, which still contains the
identity of the processor that was the owner (since it still has a readable copy).
State is shared.
Data write-back: owner processor is replacing the block and hence must write it
back, making memory copy up-to-date
(the home directory essentially becomes the owner), the block is now Uncached,
and the Sharer set is empty.
Write miss: block has a new owner. A message is sent to old owner causing the
cache to send the value of the block to the directory from which it is sent to the
requesting processor, which becomes the new owner. Sharers is set to identity of
new owner, and state of block is made Exclusive.
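The directory behaviour described on these two slides can be sketched as a per-block controller. This is a minimal illustration, assuming messages are handled one at a time and that the fetch reply arrives before the data value reply is sent; the class name, method name, and message strings are illustrative.

```python
class DirectoryEntry:
    """Directory state for one memory block: Uncached, Shared, or Exclusive."""
    def __init__(self):
        self.state = "Uncached"
        self.sharers = set()

    def handle(self, msg, p):
        """Process a request from processor p; return the messages sent
        by the home directory as (message, destination) pairs."""
        out = []
        if msg == "read_miss":
            if self.state == "Exclusive":
                (owner,) = self.sharers
                out.append(("fetch", owner))          # owner writes the block back
            out.append(("data_value_reply", p))
            self.sharers.add(p)                       # old owner keeps a readable copy
            self.state = "Shared"
        elif msg == "write_miss":
            if self.state == "Shared":
                out += [("invalidate", q) for q in sorted(self.sharers - {p})]
            elif self.state == "Exclusive":
                (owner,) = self.sharers
                out.append(("fetch_invalidate", owner))
            out.append(("data_value_reply", p))
            self.sharers = {p}                        # p is the new owner
            self.state = "Exclusive"
        elif msg == "data_write_back":
            if self.state != "Exclusive":
                raise ValueError("write-back is only expected from an owner")
            self.sharers = set()
            self.state = "Uncached"
        else:
            raise ValueError(f"unknown message: {msg}")
        return out


entry = DirectoryEntry()
entry.state, entry.sharers = "Shared", {"P2", "P3"}
print(entry.handle("write_miss", "P1"), entry.state, entry.sharers)
print(entry.handle("write_miss", "P2"), entry.state, entry.sharers)
```

The usage lines start from a block shared by P2 and P3 and apply two successive write misses, first from P1 and then from P2.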
60. 60
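Case Study: P1 write 888 to X
Nodes: P1 (local), P5 (home of X), P2 (remote), P3 (remote); P4 and P6 are not involved
Initial state: P1 cache X: I; P2 cache X: S (X=111); P3 cache X: S (X=111); home memory X = 111; directory X: S, {P2,P3}
1. P1 => P5: WriteMiss X
2. P5 => P2: Invalidate X; P5 => P3: Invalidate X
3. P2 cache X: I; P3 cache X: I; P2 => P5: Ack; P3 => P5: Ack
4. Directory X: E, {P1}
5. P5 => P1: Reply X=111
6. P1 cache X: E, 888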
61. 61
P2 write 999 to X
Nodes: P1 (remote), P5 (home of X), P2 (local), P3 (remote); P4 and P6 are not involved
Starting state (after P1's write): P1 cache X: E, 888; P2 cache X: I; P3 cache X: I; home memory X = 111; directory X: E, {P1}
62. 62
Answer for P2 write 999 to X
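Nodes: P1 (remote), P5 (home of X), P2 (local), P3 (remote); P4 and P6 are not involved
Starting state: P1 cache X: E, 888; P2 cache X: I; P3 cache X: I; home memory X = 111; directory X: E, {P1}
1. P2 => P5: WriteMiss
2. P5 => P1: Fetch/Invalidate
3. P1 cache X: I; P1 => P5: Write back X 888; home memory X = 888
4. Directory X: E, {P2}
5. P5 => P2: Data reply X 888
6. P2 cache X: E, 999
How about P2 read X?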
66. What operations occur when P0 reads 300?
P0 cache block: I
ReadMiss 300
DataReply 300 (0300)
Directory entry: {P0}, S, 300, 0300
P0 cache block: S, 300, 0300
67. P0 read 300: P0 (local node), P2 (home node)
P0 => P2: ReadMiss for Tag=300
P2 => P0: DataReply 0300
Directory in M2: 300, {}, U => {P0}, S
Cache0, B0: I, 100, 0100 => S, 300, 0300
68. What operations occur when P2 reads 218?
P2 cache block: I
ReadMiss(218)
Fetch(218)
Owner's cache block => S
Writeback 218 (1218)
Modify directory: {P0,P2}, S, 218, 1218
DataReply 218 (1218)
P2 cache block: S, 218, 1218
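70. What operations occur when P1 writes 0888 into 310?
P1 cache block: I
WriteMiss(310)
invalidate 310 (sent to each of the two sharers); each sharer's block => I; each sharer returns ACK
Directory entry: {P1}, E
DataReply 310 (0310)
P1 cache block: E, 310, 0310 => E, 310, 0888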
71. P1 write 0888 into 310: P1 (local node), P2 (home node), P0 and P2 (remote nodes)
P1 => P2: WriteMiss for Tag=310
Directory in M2: 310, {P0,P2}, S, 0310 => {P1}, E, 0310
P2 => Cache0, Cache2: Invalidate Tag=310
Cache0, Cache2, B2: S, 310, 0310 => I, 310, 0310
Cache0, Cache2 => P2: Ack
P2 => P1: DataReply 0310
Cache1, B2: S, 110, 0110 => E, 310, 0888
Anything uncomfortable?
72. What operations occur when P1 writes 0888 into 310?
I
WriteMiss(310)
{ } U
The directory is outdated and holds wrong information.
How to solve it?
73. P1 write 0888 into 310: P1 (local node), P2 (home node), P0 (remote node)
P1 => P2: WriteMiss for Tag=310
Directory in M2: 310, {P0,P2}, S, 0310 => {P1}, E, 0310
P2 => Cache0, {Cache2}: Invalidate Tag=310
Cache0, {Cache2}, B2: S, 310, 0310 => I, 310, 0310
Ack
P2 => P1: DataReply 0310
Cache1, B2: S, 110, 0110 => E, 310, 0888
Kickout 110: directory in M0: 110, {P1}, S, 0110 => { }, U, 0110
74. 74
Example 1: initial
A1 and A2 map to the same cache block
step | Processor 1 (State Addr Value) | Processor 2 (State Addr Value) | Interconnect (Action Proc. Addr Value) | Directory (Addr State {Procs}) | Memory (Value)
P1: Write 10 to A1 | | | | |
P1: Read A1 | | | | |
P2: Read A1 | | | | |
P2: Write 20 to A1 | | | | |
P2: Write 40 to A2 | | | | |
75. 75
Example: P1 write 10 to A1
A1 and A2 map to the same cache block
step | Processor 1 (State Addr Value) | Processor 2 (State Addr Value) | Interconnect (Action Proc. Addr Value) | Directory (Addr State {Procs}) | Memory (Value)
P1: Write 10 to A1 | | | WrMs P1 A1 | A1 Ex {P1} |
 | Excl. A1 10 | | DaRp P1 A1 0 | |
P1: Read A1 | | | | |
P2: Read A1 | | | | |
P2: Write 20 to A1 | | | | |
P2: Write 40 to A2 | | | | |
76. 76
Example: P1 read A1, P2 read A1
A1 and A2 map to the same cache block
step | Processor 1 (State Addr Value) | Processor 2 (State Addr Value) | Interconnect (Action Proc. Addr Value) | Directory (Addr State {Procs}) | Memory (Value)
P1: Write 10 to A1 | | | WrMs P1 A1 | A1 Ex {P1} |
 | Excl. A1 10 | | DaRp P1 A1 VA1 | |
P1: Read A1 | Excl. A1 10 | | | |
P2: Read A1 | | | RdMs P2 A1 | |
 | | | Ftch Ph A1 | |
 | Shar. A1 10 | | DaWb P1 A1 10 | | 10
 | | Shar. A1 10 | DaRp P2 A1 10 | A1 Shar. {P1,P2} | 10
P2: Write 20 to A1 | | | | |
P2: Write 40 to A2 | | | | |
77. 77
Example: P2 write 20 to A1
A1 and A2 map to the same cache block
step | Processor 1 (State Addr Value) | Processor 2 (State Addr Value) | Interconnect (Action Proc. Addr Value) | Directory (Addr State {Procs}) | Memory (Value)
P1: Write 10 to A1 | | | WrMs P1 A1 | A1 Ex {P1} |
 | Excl. A1 10 | | DaRp P1 A1 0 | |
P1: Read A1 | Excl. A1 10 | | | |
P2: Read A1 | | | RdMs P2 A1 | |
 | | | Ftch Ph A1 | |
 | Shar. A1 10 | | DaWb P1 A1 10 | | 10
 | | Shar. A1 10 | DaRp P2 A1 10 | A1 Shar. {P1,P2} | 10
P2: Write 20 to A1 | | | Inva. P2 A1 20 | A1 Ex {P2} | 10
 | Inv. A1 10 | Ex A1 20 | Inva. Ph A1 | | 10
P2: Write 40 to A2 | | | | |
78. 78
Example: P2 write 40 to A2
A1 and A2 map to the same cache block
step | Processor 1 (State Addr Value) | Processor 2 (State Addr Value) | Interconnect (Action Proc. Addr Value) | Directory (Addr State {Procs}) | Memory (Value)
P1: Write 10 to A1 | | | WrMs P1 A1 | A1 Ex {P1} |
 | Excl. A1 10 | | DaRp P1 A1 0 | |
P1: Read A1 | Excl. A1 10 | | | |
P2: Read A1 | | | RdMs P2 A1 | A1 Shar. {P1,P2} |
 | | | Ftch Ph A1 | |
 | Shar. A1 10 | | DaWb P1 A1 10 | | 10
 | | Shar. A1 10 | DaRp P2 A1 10 | | 10
P2: Write 20 to A1 | | | Inva. P2 A1 | A1 Ex {P2} | 10
 | Inv. A1 10 | Excl. A1 20 | Inva. Ph A1 | | 10
P2: Write 40 to A2 | | | DaWb P2 A1 20 | A1 Unc {} | 20
 | | | WrMs P2 A2 Value | A2 Ex {P2} |
 | | Excl. A2 40 | DaRp Ph A2 Value | |
81. 81
Assumptions for implementation
simplicity
Network provides point-to-point, in-order delivery
of messages
Network has unlimited buffering
Network delivers all messages within a finite
time.
The coherence controller is duplicated for each
cache block.
A transition completes only when a message has
been transmitted and a data value reply received.
The pending status is omitted
An outgoing message can be transmitted before the
next incoming message is accepted.
82. 82
Deadlock example
Assume P1 and P2 each have exclusive copies
of cache blocks X1 and X2 that have different
home directories.
Resolve: duplicate coherence controller for each block
83. 83
CPU -Cache State Machine
State machine for CPU requests, for each memory block (Invalid state if the block is only in memory)
[Figure: the CPU-cache state machine with states Invalid, Shared (read only), and Exclusive (read/write), repeated unchanged from the earlier CPU-Cache State Machine slide]
84. 84
How to assure write serialization ?
The home directory serializes exclusive access:
Buffer all requests (write miss / invalidate);
Process the requests in order;
Start processing a new request only after the
previous one completes.
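The buffering rule can be sketched as a FIFO at the home directory. This is only an illustration: the class and method names are made up, and `complete()` stands in for whatever event (for example, the last acknowledgement arriving) marks the end of the current request.

```python
from collections import deque


class HomeDirectory:
    """Serializes write requests for one block by handling them one at a time."""
    def __init__(self):
        self.pending = deque()   # buffered requests, in arrival order
        self.current = None      # request being processed, if any
        self.order = []          # the serialized order of completed writes

    def receive(self, requester, value):
        self.pending.append((requester, value))
        self._start_next()

    def _start_next(self):
        # Start a new request only when no request is in progress.
        if self.current is None and self.pending:
            self.current = self.pending.popleft()

    def complete(self):
        """Called when the current request has fully completed."""
        if self.current is None:
            raise RuntimeError("no request in progress")
        self.order.append(self.current)
        self.current = None
        self._start_next()


home = HomeDirectory()
home.receive("P1", 888)      # starts immediately
home.receive("P2", 999)      # buffered until P1's request completes
home.complete()
home.complete()
print(home.order)
```

Every processor then observes the writes in the order recorded by the home directory, which is the serialization the coherence definition requires.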
85. 85
How to solve the “race” ?
How does the processor know who is the winner?
It gets an acknowledgement message from the home directory
Data Reply (for write miss)
Explicit ACK (for invalidate)
About the loser:
Simplest: the home directory sends a NAK to the loser.
How to know the invalidations are completed?
1. The directory collects and counts ACK messages from remote
nodes, and then sends a confirmation to the requester.
2. The requesting node collects and counts ACK messages from
remote nodes directly.
86. 86
Buffer requirement
A large number of buffers is required
A write miss may produce a large number of invalidate
messages
Prefetch schemes might be used
Multiple outstanding misses
Buffering is limited in practice
87. 87
Avoid deadlock with limited
buffering
Deadlock arises from three properties
More than one resource is needed to complete a transaction
Buffers for request, reply, and accept message
Resources are held until a nonatomic transaction completes
There is no global partial order on the acquisition of
resources
88. 88
Resolution
Strategy: try to ensure that the resources will
always be available.
Separate networks are used for requests and
replies.
Every request that needs a reply allocates the space
to accept the reply when the request is generated.
Replier can free the reply buffer.
Any controller can reject any request with a
NAK, but never NAKs a reply.
Any request that receives a NAK is simply
retried.
89. 89
Multithreaded directory to
handle multiple blocks
Directory controller must be reentrant.
Handle incoming requests for independent blocks
before the previous one has finished.
Control state needs to be saved and restored while a
fetch (or fetch/invalidate) is outstanding
The owner node can provide the data directly to the
requester as well as to the home node to reduce
latency.
Can limit the number of outstanding transactions
via NAKs to new requests.
90. 90
How to deal with NAK ?
How to know which is the original transaction?
1. The processor keeps track of its outstanding
requests
2. Pack the original request into the NAK.
3. The buffer holding the return slot for the
request can also hold info about the
request.
So that when it receives a NAK, the processor
knows to resend the request.
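Option 1 (the processor tracks its outstanding requests) can be sketched as a table keyed by a request id. The class, the `send` callback, and the id scheme are hypothetical; the slides do not specify how requests are identified.

```python
class Requester:
    """Keeps each outstanding request so a NAK can be answered with a retry."""
    def __init__(self, send):
        self.send = send          # callback that puts (id, request) on the network
        self.outstanding = {}     # request id -> original request
        self.next_id = 0

    def issue(self, request):
        rid = self.next_id
        self.next_id += 1
        self.outstanding[rid] = request
        self.send(rid, request)
        return rid

    def on_nak(self, rid):
        # The NAK only needs to carry the id; the request itself is looked up.
        self.send(rid, self.outstanding[rid])

    def on_reply(self, rid):
        # The transaction is finished; its slot can be reused.
        return self.outstanding.pop(rid)


sent = []
cpu = Requester(lambda rid, req: sent.append((rid, req)))
rid = cpu.issue(("write_miss", "X"))
cpu.on_nak(rid)                   # rejected once, so the request is resent
cpu.on_reply(rid)
print(sent)
```

Option 2 trades this table for larger NAK messages, since the original request then travels back inside the NAK.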
91. 91
Summary
Caches contain all information on state of cached memory
blocks
Snooping and Directory Protocols similar; bus makes
snooping easier because of broadcast (snooping => uniform
memory access)
Directory has extra data structure to keep track of state of all
cache blocks
Distributing directory => scalable shared address
multiprocessor
=> Cache coherent, Non uniform memory access
92. 92
How about write through cache
with write invalidate?
[Figure: two-state diagram with states Invalid and Valid. Transition labels: PR [BR miss on bus]; PW [BW miss on bus]; BW; BR, PR; PW [send BW]]
Editor's Notes
#21, #23, #49-#52, #54-#57, #83: Invalid:
read => shared
write => dirty
shared looks the same
#25:Assumes initial cache state
is invalid and A1 and A2 map
to same cache block,
but A1 != A2.
#29: Why write miss first?
Because in general only a piece of the block is written, so the block may need to be read first in order to have a full block; therefore, need to get
Write back is a low-priority event.