Chapter 6
Multiprocessors and Thread-Level Parallelism
吳俊興
Department of Computer Science and Information Engineering, National University of Kaohsiung
December 2004
EEF011 Computer Architecture (計算機結構)
Chapter 6. Multiprocessors and Thread-Level Parallelism
6.1 Introduction
6.2 Characteristics of Application Domains
6.3 Symmetric Shared-Memory Architectures
6.4 Performance of Symmetric Shared-Memory Multiprocessors
6.5 Distributed Shared-Memory Architectures
6.6 Performance of Distributed Shared-Memory Multiprocessors
6.7 Synchronization
6.8 Models of Memory Consistency: An Introduction
6.9 Multithreading: Exploiting Thread-Level Parallelism within a Processor
6.1 Introduction
•Increasing demand for parallel processors
–Microprocessors are likely to remain the dominant uniprocessor technology
• Connecting multiple microprocessors together is likely to be more cost-effective than designing a custom parallel processor
–It’s unclear whether architectural innovation can be sustained indefinitely
• Multiprocessors are another way to improve parallelism
–Server and embedded applications exhibit natural parallelism that can be exploited, beyond the instruction-level parallelism (ILP) of desktop applications
•Challenges to architecture research and development
–Death of advances in uniprocessor architecture?
–More multiprocessor architectures have failed than succeeded
• a larger design space with more tradeoffs
Taxonomy of Parallel Architectures
Flynn Categories
• SISD (Single Instruction Single Data)
– Uniprocessors
• MISD (Multiple Instruction Single Data)
– no commercial machine of this type has been built; multiple processors operate on a single data stream
• SIMD (Single Instruction Multiple Data)
– same instruction executed by multiple processors using different data streams
• Each processor has its own data memory (hence multiple data)
• There’s a single instruction memory and control processor
– Simple programming model, Low overhead, Flexibility
– (the phrase was reused by Intel marketing for media instructions, which are essentially vector operations)
– Examples: vector architectures, Illiac-IV, CM-2
• MIMD (Multiple Instruction Multiple Data)
– Each processor fetches its own instructions and operates on its own data
– MIMD is the current winner: the major design emphasis is on machines with <= 128 processors
• Use off-the-shelf microprocessors: cost-performance advantages
• Flexible: high performance for one application, running many tasks simultaneously
– Examples: Sun Enterprise 5000, Cray T3D, SGI Origin
MIMD Class 1:
Centralized shared-memory multiprocessor
Processors share a single centralized memory; processors and memory are interconnected by a bus
• also known as “uniform memory access” (UMA) or “symmetric (shared-memory) multiprocessor” (SMP)
– a symmetric relationship to all processors
– a uniform memory access time from any processor
• scalability problem: less attractive for large processor counts
MIMD Class 2:
Distributed-memory multiprocessor
Memory modules are distributed with the individual CPUs
• Advantages:
– cost-effective way to scale memory bandwidth
– lower memory latency for local memory accesses
• Drawbacks:
– longer communication latency for communicating data between processors
– a more complex software model
MIMD Hybrid I (Clusters of SMP):
Distributed Shared Memory Multiprocessor
[Figure: two SMP nodes, each with processors and caches connected by a node interconnection network to memory and I/O, joined by a cluster interconnection network]
Physically separate memories addressed as one logically shared address space
– a memory reference can be made by any processor to any memory location
– also called NUMA (Nonuniform memory access)
MIMD Hybrid II (Multicomputers):
Message-Passing Multiprocessor
•Data Communication Models for Multiprocessors
–shared memory: access shared address space implicitly via load and store
operations
–message-passing: done by explicitly passing messages among the processors
• can invoke software with Remote Procedure Call (RPC)
• often via library, such as MPI: Message Passing Interface
• also called “synchronous communication”, since the communication causes synchronization between the two processes
•Message-Passing Multiprocessor
–The address space can consist of multiple private address spaces that are
logically disjoint and cannot be addressed by a remote processor
–The same physical address on two different processors refers to two different
locations in two different memories
•Multicomputer (cluster): can even consist of completely separate
computers connected on a LAN
–cost-effective for applications that require little or no communication
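
To make the message-passing model concrete, here is a minimal sketch using MPI (named above); the rank numbers, message tag, and the value 42 are arbitrary illustration choices:

    #include <mpi.h>
    #include <stdio.h>

    /* Minimal sketch of explicit message passing: rank 0 sends a value that
       lives only in its private address space; rank 1 receives it. The
       blocking receive also synchronizes the two processes. */
    int main(int argc, char *argv[]) {
        int rank, x = 0;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        if (rank == 0) {
            x = 42;                                          /* private data */
            MPI_Send(&x, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);  /* explicit send to rank 1 */
        } else if (rank == 1) {
            MPI_Recv(&x, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            printf("rank 1 received %d\n", x);
        }
        MPI_Finalize();
        return 0;
    }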
Comparisons of Communication Models
Advantages of Shared-Memory Communication Model
• Compatibility with SMP hardware
• Ease of programming when communication patterns are complex or vary
dynamically during execution
• Ability to develop applications using the familiar SMP model, with attention focused only on performance-critical accesses
• Lower communication overhead, better use of bandwidth for small items, due to
implicit communication and memory mapping to implement protection in hardware,
rather than through the I/O system
• Hardware-controlled caching to reduce the frequency of remote communication by
caching of all data, both shared and private
Advantages of Message-Passing Communication Model
• The hardware can be simpler (esp. vs. NUMA)
• Communication explicit => simpler to understand; in shared memory it can be hard
to know when communicating and when not, and how costly it is
• Explicit communication focuses programmer attention on the costly aspects of parallel computation, sometimes leading to improved structure in the multiprocessor program
• Synchronization is naturally associated with sending messages, reducing the
possibility for errors introduced by incorrect synchronization
• Easier to use sender-initiated communication, which may have some advantages in
performance
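
For contrast with the MPI sketch above, a shared-memory version of the same exchange communicates implicitly through ordinary loads and stores; this is a minimal sketch using POSIX threads, with arbitrary variable names:

    #include <pthread.h>
    #include <stdio.h>

    /* Shared-memory counterpart of the MPI sketch: the value is communicated
       implicitly by a store to a shared variable; the join provides the
       synchronization that orders the store before the final load. */
    static int x = 0;                       /* shared data, visible to all threads */

    static void *producer(void *arg) {
        (void)arg;
        x = 42;                             /* an ordinary store communicates the value */
        return NULL;
    }

    int main(void) {
        pthread_t t;
        pthread_create(&t, NULL, producer, NULL);
        pthread_join(t, NULL);              /* synchronizes with the producer thread */
        printf("main read %d\n", x);        /* an ordinary load reads the value back */
        return 0;
    }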
6.3 Symmetric Shared-Memory Architectures
Caching in shared-memory machines
• private data: used by a single processor
– When a private item is cached, its location is migrated to the cache
– Since no other processor uses the data, the program behavior is identical to that
in a uniprocessor
• shared data: used by multiple processors
– When shared data are cached, the shared value may be replicated in multiple caches
– advantages: reduced access latency and memory contention
– but this introduces a new problem: cache coherence
Coherent caches provide:
• migration: a data item can be moved to a local cache and used there in a transparent fashion
• replication: shared data that are being simultaneously read are replicated in the readers’ caches
Both are critical to performance in accessing shared data
Multiprocessor Cache Coherence Problem
• Informally:
– “Any read must return the most recent write”
– Too strict and too difficult to implement
• Better:
– “Any write must eventually be seen by a read”
– All writes are seen in proper order (“serialization”)
• Two rules to ensure this:
– “If P writes x and then P1 reads it, P’s write will be seen by P1 if the read and
write are sufficiently far apart”
– Writes to a single location are serialized: seen in one order
• Latest write will be seen
• Otherwise could see writes in illogical order
(could see older value after a newer value)
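
To make the problem concrete, consider the following sketch (hypothetical; on real coherent hardware the protocol prevents the stale read described in the comments):

    /* Two processors with write-back caches and NO coherence protocol.
       Both CPU A and CPU B have cached X, which is initially 0. */
    int X = 0;                     /* shared memory location */

    void cpu_a(void) {
        X = 1;                     /* new value may sit only in A's write-back cache */
    }

    int cpu_b(void) {
        return X;                  /* without coherence, B's cache can keep returning
                                      the stale 0 long after A's write; coherence
                                      guarantees the write is eventually seen */
    }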
Two Classes of Cache Coherence Protocols
•Snooping Solution (Snoopy Bus)
– Send all requests for data to all processors
– Processors snoop to see if they have a copy and respond accordingly
– Requires broadcast, since caching information is at processors
– Works well with bus (natural broadcast medium)
– Dominates for small scale machines (most of the market)
•Directory-Based Schemes (Section 6.5)
– Directory keeps track of what is being shared in a centralized place (logically)
– Distributed memory => distributed directory for scalability
(avoids bottlenecks)
– Send point-to-point requests to processors via network
– Scales better than Snooping
– Actually existed BEFORE Snooping-based schemes
Basic Snoopy Protocols
• Write strategies
– Write-through: memory is always up-to-date
– Write-back: snoop in caches to find most recent copy
• Write Invalidate Protocol
– Multiple readers, single writer
– Write to shared data: an invalidate is sent to all caches, which snoop and invalidate any copies
• Read miss: a subsequent read by another processor will miss in its cache and fetch a new copy of the data
• Write Broadcast/Update Protocol (typically write-through)
– Write to shared data: broadcast on bus; processors snoop and update any copies
– Read miss: memory/cache is always up-to-date
• Write serialization: bus serializes requests!
– Bus is single point of arbitration
Examples of Basic Snooping Protocols
Assume neither cache initially holds X and the value of X in memory is 0.
[Tables: step-by-step processor activity, bus activity, and cache/memory contents under a write invalidate protocol and under a write update protocol]
Comparisons of Basic Snoopy Protocols
• Multiple writes to the same word with no intervening reads
– multiple write broadcasts in a write update protocol
– only one initial invalidation in a write invalidate protocol
• With multiword cache blocks, each word written in a cache block
– A write broadcast for each word is required in an update protocol
– Only the first write to any word in the block needs to generate an invalidate in an invalidation protocol
An invalidation protocol works on cache blocks, while an update protocol must work on individual words
• Delay between writing a word in one processor and reading the written value in another processor is usually less in a write update scheme
– In an invalidation protocol, the reader is invalidated first, then later reads the data
An Example Snoopy Protocol
Invalidation protocol, write-back cache
• Each cache block is in one state (track these):
– Shared : block can be read
– OR Exclusive : cache has the only copy, it is writable, and it is dirty
– OR Invalid : block contains no data
– an extra state bit (shared/exclusive) associated with a valid bit and a
dirty bit for each block
• Each block of memory is in one state:
– Clean in all caches and up-to-date in memory (Shared)
– OR Dirty in exactly one cache (Exclusive)
– OR Not in any caches
• Each processor snoops every address placed on the bus
– If a processor finds that it has a dirty copy of the requested cache block, it provides that cache block in response to the read request
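
One way to see the protocol is as a per-block state machine. The sketch below is a simplified, hypothetical encoding of only the CPU-side transitions; the bus-side transitions and the actual bus transactions and data movement are omitted:

    enum block_state { INVALID, SHARED, EXCLUSIVE };

    /* CPU-side transitions of the three-state write-back invalidate protocol.
       A read or write in Invalid places the corresponding miss on the bus;
       a write hit in Shared also places a write miss on the bus to gain
       exclusivity (no data transfer needed, as noted above). */
    enum block_state cpu_access(enum block_state s, int is_write) {
        switch (s) {
        case INVALID:            /* miss: fetch the block via the bus */
            return is_write ? EXCLUSIVE : SHARED;
        case SHARED:             /* read hit, or write hit that invalidates other copies */
            return is_write ? EXCLUSIVE : SHARED;
        case EXCLUSIVE:          /* read/write hits: block is already writable and dirty */
            return EXCLUSIVE;
        }
        return s;                /* unreachable */
    }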
Cache Coherence Mechanism of the Example
Placing a write miss on the bus when a write hits in the shared state ensures an
exclusive copy (data not transferred)
Figure 6.11 State Transitions for Each Cache Block
• CPU may read/write hit/miss to the block
• May place write/read miss on bus
• May receive read/write miss from bus
[Figure: two transition tables, one for requests from the CPU and one for requests from the bus]
Cache Coherence State Diagram
[Figures 6.10 and 6.12: the state diagram, with CPU-induced transitions in black and bus-induced transitions in gray, combined from Figure 6.11]
6.5 Distributed Shared-Memory Architectures
Distributed shared-memory architectures
• Separate memory per processor
– Local or remote access via memory controller
– The physical address space is statically distributed
Coherence Problems
• Simple approach: uncacheable
– shared data are marked as uncacheable and only private data are kept in caches
– very long latency to access memory for shared data
• Alternative: a directory for memory blocks
– A directory per memory tracks the state of every block in every cache
• which caches have copies of the memory block, dirty vs. clean, ...
– Two additional complications
• The interconnect cannot be used as a single point of arbitration like the bus
• Because the interconnect is message oriented, many messages must have explicit responses
Distributed Directory Multiprocessor
To prevent the directory from becoming a bottleneck, directory entries are distributed along with the memory; each directory keeps track of which processors have copies of its memory blocks
Directory Protocols
• Similar to Snoopy Protocol: Three states
– Shared: 1 or more processors have the block cached, and the value in memory is
up-to-date (as well as in all the caches)
– Uncached: no processor has a copy of the cache block (not valid in any cache)
– Exclusive: Exactly one processor has a copy of the cache block, and it has
written the block, so the memory copy is out of date
• The processor is called the owner of the block
• In addition to tracking the state of each cache block, we must track
the processors that have copies of the block when it is shared
(usually a bit vector for each memory block: 1 if processor has copy)
• Keep it simple(r):
– Writes to non-exclusive data
=> write miss
– Processor blocks until access completes
– Assume messages received and acted upon in order sent
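
For illustration, a directory entry can be represented as a state plus the sharer bit vector described above; the layout below is a hypothetical sketch for a machine with up to 64 processors:

    #include <stdint.h>

    enum dir_state { UNCACHED, SHARED_ST, EXCLUSIVE_ST };

    /* Directory entry for one memory block: the protocol state plus one bit
       per processor recording who holds a copy. In the Exclusive state
       exactly one bit is set, identifying the owner. */
    struct dir_entry {
        enum dir_state state;
        uint64_t sharers;                   /* bit i set => processor i has a copy */
    };

    static inline void add_sharer(struct dir_entry *e, int proc) {
        e->sharers |= (uint64_t)1 << proc;
    }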
Messages for Directory Protocols
• local node: the node where a request originates
• home node: the node where the memory location and directory entry of an address reside
• remote node: a node that has a copy of a cache block (exclusive or shared)
[Table: the message types exchanged among these nodes, e.g., read miss, write miss, invalidate, data fetch, data value reply, and data write back]
State Transition Diagram for an Individual Cache Block
• Comparing to snooping protocols:
– identical states
– stimulus is almost identical
– a write to a shared cache block is treated as a write miss (without fetching the block)
– cache block must be in exclusive
state when it is written
– any shared block must be up to
date in memory
• write miss: the directory controller sends data fetch and selective invalidate operations (instead of the broadcast used in snooping protocols)
State Transition Diagram for the Directory
[Figure 6.29: transition diagram for an individual cache block at the directory]
Three requests: read miss, write miss, and data write back
Directory Operations: Requests and Actions
• Message sent to directory causes two actions:
– Update the directory
– More messages to satisfy request
• Block is in Uncached state: the copy in memory is the current value; the only possible requests for that block are:
– Read miss: the requesting processor is sent the data from memory, and the requestor is made the only sharing node; the state of the block is made Shared.
– Write miss: the requesting processor is sent the value and becomes the sharing node. The block is made Exclusive to indicate that the only valid copy is cached. Sharers indicates the identity of the owner.
• Block is Shared => the memory value is up-to-date:
– Read miss: the requesting processor is sent the data from memory, and the requesting processor is added to the sharing set.
– Write miss: the requesting processor is sent the value. All processors in the set Sharers are sent invalidate messages, and Sharers is set to the identity of the requesting processor. The state of the block is made Exclusive.
Directory Operations: Requests and Actions (cont.)
• Block is Exclusive: current value of the block is held in the cache of
the processor identified by the set Sharers (the owner) => three
possible directory requests:
– Read miss: the owner processor is sent a data fetch message, causing the state of the block in the owner’s cache to transition to Shared and causing the owner to send the data to the directory, where it is written to memory and sent back to the requesting processor. The identity of the requesting processor is added to the set Sharers, which still contains the identity of the processor that was the owner (since it still has a readable copy). The state is made Shared.
– Data write-back: the owner processor is replacing the block and hence must write it back, making the memory copy up-to-date (the home directory essentially becomes the owner); the block is now Uncached, and the Sharers set is empty.
– Write miss: the block has a new owner. A message is sent to the old owner, causing its cache to send the value of the block to the directory, from which it is sent to the requesting processor, which becomes the new owner. Sharers is set to the identity of the new owner, and the state of the block is made Exclusive.
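
The three cases above can be collected into a single request handler. The sketch below is hypothetical and heavily simplified: it reuses the dir_entry type sketched earlier, and send_data_from_memory(), send_invalidates(), send_fetch(), send_fetch_invalidate(), and owner_of() are stand-ins for the real network messages and owner lookup:

    enum msg { READ_MISS, WRITE_MISS };

    /* Stand-ins for interconnect messages and owner lookup (hypothetical). */
    void send_data_from_memory(int proc);
    void send_invalidates(uint64_t sharers);
    void send_fetch(int owner);
    void send_fetch_invalidate(int owner);
    int  owner_of(const struct dir_entry *e);

    /* Simplified directory controller for read and write misses from
       processor P, following the Uncached/Shared/Exclusive cases above. */
    void handle_request(struct dir_entry *e, enum msg m, int P) {
        uint64_t pbit = (uint64_t)1 << P;
        switch (e->state) {
        case UNCACHED:                          /* memory holds the current value */
            send_data_from_memory(P);
            e->sharers = pbit;                  /* requestor is the only sharing node */
            e->state = (m == WRITE_MISS) ? EXCLUSIVE_ST : SHARED_ST;
            break;
        case SHARED_ST:                         /* memory is up-to-date */
            send_data_from_memory(P);
            if (m == WRITE_MISS) {
                send_invalidates(e->sharers);   /* invalidate every current sharer */
                e->sharers = pbit;
                e->state = EXCLUSIVE_ST;
            } else {
                e->sharers |= pbit;             /* add requestor to the sharing set */
            }
            break;
        case EXCLUSIVE_ST:                      /* owner's cache holds the current value */
            if (m == READ_MISS) {
                send_fetch(owner_of(e));        /* owner writes back and goes Shared */
                e->sharers |= pbit;             /* old owner and requestor both share */
                e->state = SHARED_ST;
            } else {
                send_fetch_invalidate(owner_of(e)); /* old owner surrenders the block */
                e->sharers = pbit;              /* requestor becomes the new owner */
            }
            break;
        }
    }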
Summary
Chapter 6. Multiprocessors and Thread-Level Parallelism
6.1 Introduction
6.2 Characteristics of Application Domains
6.3 Symmetric Shared-Memory Architectures
6.4 Performance of Symmetric Shared-Memory Multiprocessors
6.5 Distributed Shared-Memory Architectures
6.6 Performance of Distributed Shared-Memory Multiprocessors
6.7 Synchronization
6.8 Models of Memory Consistency: An Introduction
6.9 Multithreading: Exploiting Thread-Level Parallelism within a Processor