ECE 4100/6100
Advanced Computer Architecture
Lecture 14: Multiprocessor and Memory Coherence
Prof. Hsien-Hsin Sean Lee
School of Electrical and Computer Engineering
Georgia Institute of Technology
2. Memory Hierarchy in a Multiprocessor
[Figure: four shared-memory organizations: a shared cache; bus-based shared memory (private caches on a shared bus); fully-connected shared memory (dancehall: private caches and memory modules joined by an interconnection network); and distributed shared memory (a memory module at each processor node, connected by an interconnection network)]
3. Cache Coherency
• Closest cache level is private
• Multiple copies of a cache line can be present across different processor nodes
• Local updates
  – Lead to an incoherent state
  – The problem appears in both write-through and writeback caches
• Bus-based interconnect → writes are globally visible
• Point-to-point interconnect → writes are visible only to the communicating processor nodes
4. Example (Writeback Cache)
[Figure: three processors with private caches and a shared memory holding X = -100; two caches have read X = -100, one of them then writes X = 505 in its cache only, so the other processors' subsequent reads (Rd?) still return the stale X = -100]
5. Example (Write-through Cache)
[Figure: the same three processors; the write of X = 505 goes through to memory, but the other cache still holds the stale X = -100, so its read (Rd?) returns stale data]
6. Defining Coherence
• An MP is coherent if the results of any execution of a program can be reconstructed by a hypothetical serial order
Implicit definition of coherence
• Write propagation
  – Writes are visible to other processes
• Write serialization
  – All writes to the same location are seen in the same order by all processes (extending this to writes to all locations is called write atomicity)
  – E.g., if w1 followed by w2 is seen by a read from P1, then the same order will be seen by reads from every other processor Pi
7. Sounds Easy?
[Figure: a timeline across four processors P0-P3 with initial values A = 0, B = 0. At T1, one processor writes A = 1 and another writes B = 2. The updates reach the other processors at different times, so by T3 one processor sees A's update before B's while another sees B's update before A's]
8. Bus Snooping Based on Write-Through Cache
• All writes appear as transactions on the shared bus to memory
• Two protocols
  – Update-based protocol
  – Invalidation-based protocol
9. Bus Snooping (Update-based Protocol on Write-Through Cache)
• Each processor's cache controller constantly snoops on the bus
• Update local copies upon snoop hit
[Figure: one processor writes X = 505; the write-through bus transaction is snooped, and both memory and the other caches' copies of X are updated to 505]
10. Bus Snooping (Invalidation-based Protocol on Write-Through Cache)
• Each processor's cache controller constantly snoops on the bus
• Invalidate local copies upon snoop hit
[Figure: one processor writes X = 505; memory is updated and the snooping caches invalidate their copies of X; a later Load X by another processor misses and fetches X = 505]
11. A Simple Snoopy Coherence Protocol for a WT, No-Write-Allocate Cache
[Two-state diagram; edge labels are Observed event / Bus transaction]
• Invalid → Valid: PrRd / BusRd
• Valid → Valid: PrRd / ---, PrWr / BusWr
• Invalid → Invalid: PrWr / BusWr (no write-allocate)
• Valid → Invalid: BusWr / --- (bus-snooper-initiated; all other transitions are processor-initiated)
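The sketch below is a minimal, illustrative C++ model of this two-state controller, assuming a write-through, no-write-allocate cache on a single snooped bus; the enum and function names are my own, not from the slides.

```cpp
// A minimal sketch of the Valid/Invalid snoopy controller above; only states
// and bus transactions are modeled (no data, no arbitration).
#include <cstdio>

enum class State { Invalid, Valid };
enum class Bus   { None, BusRd, BusWr };

// Processor-initiated transitions.
Bus pr_read(State& s) {
    if (s == State::Invalid) { s = State::Valid; return Bus::BusRd; }  // read miss
    return Bus::None;                                                  // read hit
}
Bus pr_write(State& s) {
    (void)s;               // write-through, no write-allocate: state is unchanged
    return Bus::BusWr;     // every write appears on the bus
}

// Bus-snooper-initiated transition: another cache's write invalidates our copy.
void snoop(State& s, Bus op) {
    if (op == Bus::BusWr && s == State::Valid) s = State::Invalid;
}

int main() {
    State p1 = State::Invalid, p2 = State::Invalid;
    pr_read(p1);              // P1: Invalid -> Valid via BusRd
    Bus t = pr_write(p2);     // P2 writes X: BusWr goes on the bus
    snoop(p1, t);             // P1 snoops the BusWr: Valid -> Invalid
    std::printf("P1 valid? %d\n", p1 == State::Valid ? 1 : 0);  // prints 0
}
```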
12. How about Writeback Cache?
• WB cache to reduce bandwidth requirement
• The majority of local writes are hidden behind the processor nodes
• How to snoop?
• Write ordering
13. Cache Coherence Protocols for WB Caches
• A cache has an exclusive copy of a line if
  – It is the only cache holding a valid copy
  – Memory may or may not have an up-to-date copy
• Modified (dirty) cache line
  – The cache holding the line is the owner of the line, because it must supply the block
14. Cache Coherence Protocol (Update-based Protocol on Writeback Cache)
• Update the data in all processor nodes that share the same data
• If a processor node keeps updating the same memory location, a lot of traffic will be incurred
[Figure: all three caches hold X = -100; one processor stores X = 505 and update transactions on the bus change the other caches' copies to X = 505]
15. Cache Coherence Protocol (Update-based Protocol on Writeback Cache)
• Update the data in all processor nodes that share the same data
• If a processor node keeps updating the same memory location, a lot of traffic will be incurred
[Figure: a Load X by another processor hits locally (X = 505); a subsequent Store X = 333 again updates every sharer's copy over the bus]
16. Cache Coherence Protocol (Invalidation-based Protocol on Writeback Cache)
• Invalidate the data copies for the sharing processor nodes
• Reduced traffic when a processor node keeps updating the same memory location
[Figure: all three caches hold X = -100; one processor stores X = 505 and the bus transaction invalidates the copies in the other two caches]
17. Cache Coherence Protocol (Invalidation-based Protocol on Writeback Cache)
• Invalidate the data copies for the sharing processor nodes
• Reduced traffic when a processor node keeps updating the same memory location
[Figure: a Load X by an invalidated processor misses; the bus snoop hits in the owning cache, which supplies X = 505]
18. Cache Coherence Protocol (Invalidation-based Protocol on Writeback Cache)
• Invalidate the data copies for the sharing processor nodes
• Reduced traffic when a processor node keeps updating the same memory location
[Figure: the owning processor keeps storing to X (505, 333, 987, 444) with no further bus transactions, since the other copies are already invalidated]
19. MSI Writeback Invalidation Protocol
• Modified
  – Dirty
  – Only this cache has a valid copy
• Shared
  – Memory is consistent
  – One or more caches have a valid copy
• Invalid
• Writeback protocol: a cache line can be written multiple times before memory is updated
20. MSI Writeback Invalidation Protocol
• Two types of requests from the processor
  – PrRd
  – PrWr
• Three types of bus transactions posted by the cache controller
  – BusRd
    • PrRd misses the cache
    • Memory or another cache supplies the line
  – BusRdX: BusRd eXclusive (read-to-own)
    • PrWr is issued to a line which is not in the Modified state
  – BusWB
    • Writeback due to replacement
    • The processor is not directly involved in initiating this operation
21. MSI Writeback Invalidation Protocol (Processor Requests)
[State diagram, processor-initiated transitions]
• Invalid → Shared: PrRd / BusRd
• Invalid → Modified: PrWr / BusRdX
• Shared → Shared: PrRd / ---
• Shared → Modified: PrWr / BusRdX
• Modified → Modified: PrRd / ---, PrWr / ---
22. MSI Writeback Invalidation Protocol (Bus Transactions)
• Flush puts the data on the bus
• Both memory and the requestor will grab the copy
• The requestor gets the data by
  – Cache-to-cache transfer; or
  – Memory
[State diagram, bus-snooper-initiated]
• Shared → Shared: BusRd / ---
• Shared → Invalid: BusRdX / ---
• Modified → Shared: BusRd / Flush
• Modified → Invalid: BusRdX / Flush
23. MSI Writeback Invalidation Protocol (Bus Transactions): Another Possible Implementation
[Same bus-snooper-initiated diagram as the previous slide, except Modified → Invalid on BusRd / Flush]
• Another possible, valid implementation
• Anticipates no more reads from this processor
• A performance concern
• Saves the “invalidation” trip if the requesting cache writes the shared line later
24. MSI Writeback Invalidation Protocol
[Combined state diagram]
• Processor-initiated: Invalid → Shared on PrRd / BusRd; Invalid → Modified and Shared → Modified on PrWr / BusRdX; PrRd hits in S and M, and PrWr hits in M, generate no bus transaction (---)
• Bus-snooper-initiated: Shared → Shared on BusRd / ---; Shared → Invalid on BusRdX / ---; Modified → Shared on BusRd / Flush; Modified → Invalid on BusRdX / Flush
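As a companion to the combined MSI diagram, here is a minimal, hedged C++ sketch of the per-line controller logic, assuming a single atomic bus and ignoring data movement; the enum and function names are illustrative, not any particular machine's implementation.

```cpp
// MSI per-line controller: processor requests and snooped bus transactions.
#include <cstdio>

enum class LineState { Invalid, Shared, Modified };
enum class BusOp     { None, BusRd, BusRdX, BusWB };

// Processor-side requests: return the bus transaction (if any) and update state.
BusOp on_pr_read(LineState& s) {
    if (s == LineState::Invalid) { s = LineState::Shared; return BusOp::BusRd; }
    return BusOp::None;                                   // hit in S or M
}
BusOp on_pr_write(LineState& s) {
    if (s == LineState::Modified) return BusOp::None;     // already the owner
    s = LineState::Modified;                              // from I or S: read-to-own
    return BusOp::BusRdX;
}

// Snooper side: react to a transaction observed on the bus for this line.
// Returns true if this cache must flush (supply) its dirty copy.
bool on_snoop(LineState& s, BusOp op) {
    bool flush = false;
    if (op == BusOp::BusRd && s == LineState::Modified) {
        flush = true; s = LineState::Shared;              // M -> S, supply data
    } else if (op == BusOp::BusRdX) {
        if (s == LineState::Modified) flush = true;       // M -> I, supply data
        s = LineState::Invalid;                           // S -> I as well
    }
    return flush;
}

int main() {
    LineState p1 = LineState::Invalid, p3 = LineState::Invalid;
    on_pr_read(p1);                       // P1 reads X: I -> S (BusRd)
    BusOp t = on_pr_write(p3);            // P3 writes X: I -> M (BusRdX)
    on_snoop(p1, t);                      // P1 snoops the BusRdX: S -> I
    std::printf("P1=%d P3=%d\n", static_cast<int>(p1), static_cast<int>(p3));
}
```

The same functions reproduce the first three rows of the MSI example that follows on the next slides.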
25–29. MSI Example
[Three processors P1, P2, P3 with private caches on a shared bus to MEMORY; X = 10 in memory initially. The table below is built up one action per slide]

Processor Action | State in P1 | State in P2 | State in P3 | Bus Transaction | Data Supplier
P1 reads X       | S           | ---         | ---         | BusRd           | Memory
P3 reads X       | S           | ---         | S           | BusRd           | Memory
P3 writes X      | I           | ---         | M           | BusRdX          | Memory
P1 reads X       | S           | ---         | S           | BusRd           | P3 Cache
P2 reads X       | S           | S           | S           | BusRd           | Memory

Notes: P3’s write (X = -25) invalidates P1’s copy; with a “BusUpgrade” transaction the data would not need to come from memory. When P1 later reads X, P3 flushes the dirty line and both P1’s cache and memory pick up X = -25.
30. MESI Writeback Invalidation Protocol
• To reduce two types of unnecessary bus transactions
  – BusRdX that snoops and converts the block from S to M when you are already the sole owner of the block
  – BusRd that fetches the line into the S state when there are no other sharers (which leads to the overhead above)
• Introduce the Exclusive state
  – One can write to the copy without generating a BusRdX
• Illinois Protocol: proposed by Papamarcos and Patel in 1984
• Employed in Intel, PowerPC, and MIPS processors
31. MESI Writeback Invalidation Protocol: Processor Requests (Illinois Protocol)
[State diagram, processor-initiated. S: shared signal on the bus]
• Invalid → Exclusive: PrRd / BusRd (shared signal not asserted)
• Invalid → Shared: PrRd / BusRd (shared signal asserted)
• Invalid → Modified: PrWr / BusRdX
• Exclusive → Modified: PrWr / --- (no bus transaction needed)
• Shared → Modified: PrWr / BusRdX
• Read hits (PrRd in E, S, or M) and write hits in M generate no bus transaction
32. MESI Writeback Invalidation Protocol: Bus Transactions (Illinois Protocol)
[State diagram, bus-snooper-initiated. Flush*: Flush by the data supplier, no action for the other sharers]
• Modified → Shared: BusRd / Flush
• Modified → Invalid: BusRdX / Flush
• Exclusive → Shared: BusRd / Flush (or ---)
• Exclusive → Invalid: BusRdX / ---
• Shared → Shared: BusRd / Flush*
• Shared → Invalid: BusRdX / Flush*
• Whenever possible, the Illinois protocol performs cache-to-cache ($-to-$) transfer rather than having memory supply the data
• Use a selection algorithm if there are multiple possible suppliers (alternatives: add an O state or force an update of memory)
• Most MESI implementations simply write the data back to memory
33. MESI Writeback Invalidation Protocol (Illinois Protocol)
[Combined state diagram: the processor-initiated transitions of slide 31 together with the bus-snooper-initiated transitions of slide 32. S: shared signal; Flush*: Flush by the data supplier only, no action for the other sharers]
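A minimal, hedged C++ sketch of the MESI processor-side decisions follows, assuming the bus returns a "shared" signal on BusRd; names are illustrative, and this is not the Illinois implementation itself.

```cpp
// MESI processor-side read/write decisions driven by the shared signal.
#include <cstdio>

enum class St { I, S, E, M };

// Read: a miss issues BusRd and the shared signal decides E vs. S; hits stay put.
void pr_read(St& st, bool shared_signal) {
    if (st == St::I) st = shared_signal ? St::S : St::E;
}

// Write: from E, silently upgrade to M (no bus transaction); from I or S, a
// BusRdX is required. Returns true if BusRdX must be placed on the bus.
bool pr_write(St& st) {
    bool need_busrdx = (st == St::I || st == St::S);
    st = St::M;
    return need_busrdx;
}

int main() {
    St line = St::I;
    pr_read(line, /*shared_signal=*/false);  // no other sharer -> Exclusive
    bool rdx = pr_write(line);               // E -> M silently
    std::printf("state=%d BusRdX=%s\n", static_cast<int>(line), rdx ? "yes" : "no");
}
```

This is exactly the savings MESI targets: the E-to-M write never reaches the bus, unlike the S-to-M upgrade in plain MSI.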
34. MOESI Protocol
• Adds one additional state: the Owner state
• Similar to the Shared state
• The processor in the O state is responsible for supplying the data (the copy in memory may be stale)
• Employed by
  – Sun UltraSparc
  – AMD Opteron
• In the dual-core Opteron, cache-to-cache transfer is done through a System Request Interface (SRI) running at full CPU speed
[Figure: dual-core Opteron block diagram: CPU0 + L2 and CPU1 + L2 feed the System Request Interface, which connects through a crossbar to HyperTransport and the memory controller]
35. Implication on Multi-Level Caches
• How to guarantee coherence in a multi-level cache hierarchy
  – Snoop all cache levels?
  – Intel’s 8870 chipset has a “snoop filter” for quad-core
• Maintain the inclusion property
  – Ensure that data present in the inner level (L1) is also present in the outer level (L2)
  – Only snoop the outermost level (e.g., L2)
  – L2 needs to know when L1 has write hits
    • Use a write-through L1 cache
    • Use a write-back L1 but maintain an extra “modified-but-stale” bit in L2
36. Inclusion Property
• Not so easy …
  – Replacement: different levels observe different access activity, e.g., L2 may replace a line that is frequently accessed in L1
  – Split L1 caches: imagine all caches are direct-mapped
  – Different cache line sizes
37. Inclusion Property
• Use specific cache configurations
  – E.g., a DM L1 + a bigger DM or set-associative L2 with the same cache line size
• Explicitly propagate L2 actions to L1
  – An L2 replacement will flush the corresponding L1 line
  – An observed BusRdX bus transaction will invalidate the corresponding L1 line
  – To avoid excess traffic, L2 maintains an Inclusion bit per line for filtering (to indicate whether the line is in L1 or not); see the sketch below
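To make the inclusion-bit filtering concrete, here is a minimal sketch of an assumed structure (not a real design): L2 forwards back-invalidations to L1 only when the per-line bit says the line may be present in L1.

```cpp
// Per-line L2 inclusion bit used to filter back-invalidations to L1.
#include <cstdint>
#include <unordered_map>

struct L1Cache {
    std::unordered_map<uint64_t, int> lines;          // tag -> data (placeholder)
    void invalidate(uint64_t addr) { lines.erase(addr); }
};

struct L2Line { bool valid = false; bool in_l1 = false; };  // in_l1 = inclusion bit

struct L2Cache {
    L1Cache* l1;
    std::unordered_map<uint64_t, L2Line> lines;

    // L1 miss serviced by L2: remember that L1 now holds the line.
    void fill_l1(uint64_t addr) { lines[addr].valid = true; lines[addr].in_l1 = true; }

    // Snooped BusRdX or an L2 replacement: only bother L1 if the bit is set.
    void back_invalidate(uint64_t addr) {
        auto it = lines.find(addr);
        if (it != lines.end() && it->second.in_l1) l1->invalidate(addr);
        lines.erase(addr);
    }
};

int main() {
    L1Cache l1;
    L2Cache l2{&l1};
    l2.fill_l1(0x40);
    l2.back_invalidate(0x40);            // propagated to L1 because in_l1 was set
    return static_cast<int>(l1.lines.count(0x40));   // 0: the L1 copy is gone
}
```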
38. Directory-based Coherence Protocol
• Snooping-based protocol
  – N transactions for an N-node MP
  – All caches need to watch every memory request from each processor
  – Not a scalable solution for maintaining coherence in large shared-memory systems
• Directory protocol
  – Directory-based bookkeeping of who has what
  – HW overhead to keep the directory (~ #lines × #processors)
[Figure: processors with private caches ($) on an interconnection network to memory; each memory block has a directory entry holding a modified bit and presence bits, one per node]
39. Directory-based Coherence Protocol
• 1 presence bit for each processor, for each cache block in memory
• 1 modified bit for each cache block in memory
[Figure: directory entries for blocks C(k), C(k+1), …, C(k+j), each showing the modified bit followed by the per-node presence bits]
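The sketch below models the full-bit-vector directory entry drawn above; the sizes, names, and the elided message traffic are illustrative assumptions, not the protocol of any specific machine.

```cpp
// Full-bit-vector directory entry: one presence bit per node plus a modified bit.
#include <bitset>
#include <cstdio>
#include <string>

constexpr int kNodes = 16;

struct DirEntry {
    bool modified = false;               // exactly one cache holds the block dirty
    std::bitset<kNodes> presence;        // which nodes' caches hold the block
};

// Read miss from `node`: if dirty elsewhere, the owner must supply / write back.
void handle_read_miss(DirEntry& e, int node) {
    if (e.modified) {
        // real protocol: forward the request to the owner, collect the writeback
        e.modified = false;
    }
    e.presence.set(node);                // the block is now also cached at `node`
}

// Write miss from `node`: invalidate every other sharer, then mark dirty.
void handle_write_miss(DirEntry& e, int node) {
    for (int i = 0; i < kNodes; ++i) {
        if (i != node && e.presence.test(i)) {
            // real protocol: send an invalidation and wait for the ACK
            e.presence.reset(i);
        }
    }
    e.presence.set(node);
    e.modified = true;
}

int main() {
    DirEntry e;
    handle_read_miss(e, 1);
    handle_read_miss(e, 3);
    handle_write_miss(e, 3);             // node 1's copy gets invalidated
    std::printf("M=%d sharers=%s\n", e.modified ? 1 : 0,
                e.presence.to_string().c_str());
}
```

The same entry structure underlies the read-miss and write-miss walkthroughs on slides 44-46.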
40. Directory-based Coherence Protocol (Limited Dir)
• Encoded presence pointers (log2 N bits each); each cache line can reside in at most 2 processor caches in this example
• 1 modified bit for each cache block in memory
• An extra flag encodes whether each presence pointer is NULL or not
[Figure: 16 nodes (P0, P1, …, P13, P14, P15) with caches on an interconnection network; each directory entry holds the modified bit and two encoded node pointers]
41. Distributed Directory Coherence Protocol
• A centralized directory is less scalable (contention)
• Distributed shared memory (DSM) for a large MP system
• The interconnection network is no longer a shared bus
• Maintain cache coherence (CC-NUMA)
• Each address has a “home” node
[Figure: six processor nodes, each with a cache, a memory module, and a directory, connected by an interconnection network]
42. Distributed Directory Coherence Protocol
• Stanford DASH (4 CPUs in each cluster, 16 clusters in total)
  – Invalidation-based cache coherence
  – The directory keeps one of 3 states for each cache block at its home node
    • Uncached
    • Shared (unmodified)
    • Dirty
[Figure: clusters of processors with caches on a per-cluster snoop bus, each cluster with its own memory and directory, connected by an interconnection network]
43. DASH Memory Hierarchy
• Processor level
• Local cluster level
• Home cluster level (the address is at home)
  – If the block is dirty, the data must be fetched from the remote node that owns it
• Remote cluster level
[Figure: the same clustered organization as the previous slide]
44. Directory Coherence Protocol: Read Miss
[Figure: a processor misses on Z (read) and goes to Z's home node; the directory shows data Z is shared (clean), so the home supplies the data and sets the requester's presence bit]
45. Directory Coherence Protocol: Read Miss
[Figure: a processor misses on Z (read) and goes to the home node; the directory shows data Z is dirty, so the home responds with the owner's identity, the requester sends a data request to the owner, and afterwards Z is clean and shared by 3 nodes]
46. Directory Coherence Protocol: Write Miss
[Figure: a processor misses on Z (write) and goes to the home node; the home responds with the list of sharers, invalidations are sent to them, and once all ACKs arrive the write to Z can proceed in P0]
47. Memory Consistency Issue
• What do you expect from the following code? (Initial values: A = 0, B = 0)

  P1: A = 1;        P2: while (Flag == 0) {};
      Flag = 1;         print A;
  Is it possible that P2 prints A = 0?

  P1: A = 1;        P2: print B;
      B = 1;            print A;
  Is it possible that P2 prints A = 0, B = 1?
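For the first fragment, here is a hedged C++11 sketch: with a release store of Flag paired with an acquire load, P2 cannot print A = 0; with relaxed ordering (or on hardware that reorders freely), it could. The variable names mirror the slide, but the API usage is my own illustration.

```cpp
// Publishing A through Flag with release/acquire ordering.
#include <atomic>
#include <cstdio>
#include <thread>

int A = 0;
std::atomic<int> Flag{0};

void p1() {
    A = 1;                                        // plain store
    Flag.store(1, std::memory_order_release);     // publishes A = 1
}

void p2() {
    while (Flag.load(std::memory_order_acquire) == 0) {}   // spin until published
    std::printf("A = %d\n", A);                   // guaranteed to print 1
}

int main() {
    std::thread t1(p1), t2(p2);
    t1.join();
    t2.join();
}
```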
48. Memory Consistency Model
• Programmers anticipate certain memory ordering and program behavior
• Becomes very complex when
  – Running shared-memory programs
  – A processor supports out-of-order execution
• A memory consistency model specifies the legal ordering of memory events when several processors access shared memory locations
49. Sequential Consistency (SC) [Leslie Lamport]
• An MP is sequentially consistent if the result of any execution is the same as if the operations of all the processors were executed in some sequential order, and the operations of each individual processor appear in this sequence in the order specified by its program
• Two properties
  – Program ordering
  – Write atomicity (all writes to any location should appear to all processors in the same order)
• Intuitive to programmers
[Figure: several processors P sharing a single memory]
50. SC Example
[Two processors write A (A = 1 and A = 2); the other two each read A twice (T = A; U = A and Y = A; Z = A)]

Outcome T=1, U=2, Y=1, Z=2: sequentially consistent
Outcome T=1, U=2, Y=2, Z=1: violates sequential consistency! The two readers see the writes to A in opposite orders (this outcome is possible in the processor consistency model)
51. Maintain Program Ordering (SC)
• Dekker’s algorithm
• Only one processor is allowed to enter the critical section

  Flag1 = Flag2 = 0
  P1: Flag1 = 1                    P2: Flag2 = 1
      if (Flag2 == 0)                  if (Flag1 == 0)
          enter Critical Section           enter Critical Section

Caveat: an implementation that is fine on a uniprocessor can violate the ordering above.
[Figure: each processor puts its own flag write into a write buffer and then reads the other flag from memory as 0 before either buffered write drains. INCORRECT: both processors are in the critical section!]
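A hedged sketch of the two-flag entry test above (the simplified fragment, not full Dekker) using sequentially consistent C++ atomics: the seq_cst store/load pair forbids the store-then-load reordering caused by the write buffer, so at most one thread can enter the critical section.

```cpp
// Two-flag critical-section entry with seq_cst atomics.
#include <atomic>
#include <cassert>
#include <functional>
#include <thread>

std::atomic<int> flag1{0}, flag2{0};
std::atomic<int> in_cs{0};

void enter(std::atomic<int>& mine, std::atomic<int>& other) {
    mine.store(1);                      // seq_cst store
    if (other.load() == 0) {            // seq_cst load: cannot be hoisted above the store
        int others = in_cs.fetch_add(1);   // how many threads were already inside
        assert(others == 0);               // mutual exclusion holds
        (void)others;
        in_cs.fetch_sub(1);
    }
}

int main() {
    std::thread t1(enter, std::ref(flag1), std::ref(flag2));
    std::thread t2(enter, std::ref(flag2), std::ref(flag1));
    t1.join();
    t2.join();
}
```

With plain (non-atomic) flags, or with relaxed atomics, the write-buffer scenario in the figure is legal and both threads may enter.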
52. Atomic and Instantaneous Update (SC)
• An update (of A) must take place atomically with respect to all processors
• A read cannot return the value of another processor’s write until that write is made visible to “all” processors

  A = B = 0
  P1: A = 1        P2: if (A == 1)        P3: if (B == 1)
                           B = 1                   R1 = A
53. Atomic and Instantaneous Update (SC)
• An update (of A) must take place atomically with respect to all processors
• A read cannot return the value of another processor’s write until that write is made visible to “all” processors

  A = B = 0
  P1: A = 1        P2: if (A == 1)        P3: if (B == 1)
                           B = 1                   R1 = A

Caveat when an update is not atomic to all: A = 1 can reach P2 early, P2 then sets B = 1, and P3 may see B = 1 before A = 1 arrives, so R1 = 0 is possible.
54. Atomic and Instantaneous Update (SC)
• Caches also make things complicated
• P3 caches A and B
• A = 1 will not show up in P3 until P3 reads it in R1 = A

  A = B = 0
  P1: A = 1        P2: if (A == 1)        P3: if (B == 1)
                           B = 1                   R1 = A
55. Relaxed Memory Models
• How to relax the program order requirement?
  – Load bypasses store
  – Load bypasses load
  – Store bypasses store
  – Store bypasses load
• How to relax the write atomicity requirement?
  – Read others’ writes early
  – Read own write early
56. Relaxed Consistency
• Processor Consistency
  – Used in the P6
  – Write visibility can occur in different orders on different processors (write atomicity is not guaranteed)
  – Allows loads to bypass independent stores in each individual processor
  – To achieve SC, explicit synchronization operations need to be substituted or inserted
    • Read-modify-write instructions
    • Memory fence instructions
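A hedged sketch of the "insert a memory fence" option: the seq_cst fences between each store and the following load forbid the store-then-load reordering that processor consistency permits, so the outcome r1 == 0 && r2 == 0 becomes impossible. Variable names are illustrative.

```cpp
// Store-buffering example with explicit fences.
#include <atomic>
#include <cstdio>
#include <thread>

std::atomic<int> X{0}, Y{0};
int r1 = -1, r2 = -1;

void t1() {
    X.store(1, std::memory_order_relaxed);
    std::atomic_thread_fence(std::memory_order_seq_cst);   // the "memory fence"
    r1 = Y.load(std::memory_order_relaxed);
}

void t2() {
    Y.store(1, std::memory_order_relaxed);
    std::atomic_thread_fence(std::memory_order_seq_cst);
    r2 = X.load(std::memory_order_relaxed);
}

int main() {
    std::thread a(t1), b(t2);
    a.join();
    b.join();
    std::printf("r1=%d r2=%d\n", r1, r2);   // (0,0) is ruled out by the fences
}
```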
57. Processor Consistency: Load Bypassing Stores

  P1: F1 = 1        P2: F2 = 1
      A = 1             A = 2
      R1 = A            R3 = A
      R2 = F2           R4 = F1

R1 = 1; R3 = 1; R2 = 0; R4 = 0 is a possible outcome (each processor's flag read bypasses its earlier stores)
58. Processor Consistency
• Intuitive for event synchronization: “A” must be printed as “1”

  A = Flag = 0
  P1: A = 1            P2: while (Flag == 0);
      Flag = 1             print A
59. Processor Consistency
• Allows a load to bypass a store to a different address
• Unlike SC, cannot guarantee mutual exclusion in the critical section

  Flag1 = Flag2 = 0
  P1: Flag1 = 1                    P2: Flag2 = 1
      if (Flag2 == 0)                  if (Flag1 == 0)
          enter Critical Section           enter Critical Section
60. Processor Consistency
• B = 1; R1 = 0 is a possible outcome, since PC allows A = 1 to become visible to P2 before it is visible to P3

  A = B = 0
  P1: A = 1        P2: if (A == 1)        P3: if (B == 1)
                           B = 1                   R1 = A
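A hedged C++ sketch of the example above: with relaxed atomics the outcome B = 1, R1 = 0 is allowed (much like PC), while release stores paired with acquire loads restore the intuitive result, because happens-before is transitive through P2. Names mirror the slide; the API usage is my own illustration.

```cpp
// Transitive visibility: A's update reaches P3 through P2 under acquire/release.
#include <atomic>
#include <cstdio>
#include <thread>

std::atomic<int> A{0}, B{0};
int R1 = -1;

void p1() { A.store(1, std::memory_order_release); }

void p2() {
    if (A.load(std::memory_order_acquire) == 1)
        B.store(1, std::memory_order_release);
}

void p3() {
    if (B.load(std::memory_order_acquire) == 1)
        R1 = A.load(std::memory_order_relaxed);   // must be 1 under acquire/release
}

int main() {
    std::thread t1(p1), t2(p2), t3(p3);
    t1.join(); t2.join(); t3.join();
    std::printf("R1 = %d\n", R1);   // -1 (P3 saw B == 0) or 1, never 0
}
```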
