Multiprocessors and Thread-Level Parallelism.pptx

Copyright © 2012, Elsevier Inc. All rights reserved.
Chapter 5: Multiprocessors and Thread-Level Parallelism
Computer Architecture: A Quantitative Approach, Fifth Edition

Introduction
• Thread-level parallelism
  - Have multiple program counters
  - Uses MIMD model
  - Targeted for tightly-coupled shared-memory multiprocessors
• For n processors, need at least n threads
• Amount of computation assigned to each thread = grain size
  - Threads can be used for data-level parallelism, but the overheads may outweigh the benefit

Types
• Symmetric multiprocessors (SMP)
  - Small number of cores
  - Share single memory with uniform memory latency
• Distributed shared memory (DSM)
  - Memory distributed among processors
  - Non-uniform memory access/latency (NUMA)
  - Processors connected via direct (switched) and non-direct (multi-hop) interconnection networks

Centralized Shared-Memory Architectures

Cache Coherence
• Processors may see different values for the same memory location through their caches
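
A minimal sketch of the problem, assuming two private write-through caches with no coherence mechanism at all. The class `NaiveCache` and the address `"X"` are illustrative names, not taken from the slides:

```python
class NaiveCache:
    """Private write-through cache with NO coherence (illustrative only)."""

    def __init__(self, memory):
        self.memory = memory   # shared backing store (a dict)
        self.lines = {}        # this processor's private copies

    def read(self, addr):
        # Miss: fetch from memory. Hit: return the private copy unchanged.
        if addr not in self.lines:
            self.lines[addr] = self.memory[addr]
        return self.lines[addr]

    def write(self, addr, value):
        # Write-through updates memory, but no other cache is told.
        self.lines[addr] = value
        self.memory[addr] = value


memory = {"X": 1}
cache_a, cache_b = NaiveCache(memory), NaiveCache(memory)

first_a = cache_a.read("X")   # A caches X = 1
first_b = cache_b.read("X")   # B caches X = 1
cache_a.write("X", 0)         # A stores 0; B's copy is now stale
stale_b = cache_b.read("X")   # B hits on its old copy
```

After the store, memory and cache A hold 0 while cache B still returns 1, which is exactly the situation a coherence protocol has to rule out.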

Cache Coherence
• Coherence
  - All reads by any processor must return the most recently written value
  - Writes to the same location by any two processors are seen in the same order by all processors
• Consistency
  - Defines when a written value will be returned by a read
  - If a processor writes location A followed by location B, any processor that sees the new value of B must also see the new value of A

Enforcing Coherence
• Coherent caches provide:
  - Migration: movement of data
  - Replication: multiple copies of data
• Cache coherence protocols
  - Directory based
    - Sharing status of each block kept in one location
  - Snooping
    - Each core tracks sharing status of each block

Snoopy Coherence Protocols
• Write invalidate
  - On write, invalidate all other copies
  - Use bus itself to serialize
    - Write cannot complete until bus access is obtained
• Write update
  - On write, update all copies

• Locating an item when a read miss occurs
  - In write-back cache, the updated value must be sent to the requesting processor
• Cache lines marked as shared or exclusive/modified
  - Only writes to shared lines need an invalidate broadcast
    - After this, the line is marked as exclusive
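
The write-invalidate behaviour described above can be sketched as a toy simulation. This is a simplified model under stated assumptions, not the protocol as specified in the textbook: each block is a single word, the bus is a plain Python object, and the names `Bus`, `SnoopyCache`, `snoop_read` and `snoop_invalidate` are hypothetical.

```python
class Bus:
    def __init__(self, memory):
        self.memory = memory      # shared memory (a dict)
        self.caches = []          # every cache snoops this bus
        self.invalidates = 0      # number of invalidate broadcasts


class SnoopyCache:
    def __init__(self, bus):
        self.bus = bus
        self.state = {}           # addr -> "M" or "S"; absent means invalid
        self.data = {}
        bus.caches.append(self)

    def _others(self):
        return [c for c in self.bus.caches if c is not self]

    def snoop_read(self, addr):
        # Another cache read-missed: a modified copy is written back
        # and downgraded to shared.
        if self.state.get(addr) == "M":
            self.bus.memory[addr] = self.data[addr]
            self.state[addr] = "S"

    def snoop_invalidate(self, addr):
        # Another cache is writing: write back if modified, then drop the copy.
        if self.state.get(addr) == "M":
            self.bus.memory[addr] = self.data[addr]
        self.state.pop(addr, None)
        self.data.pop(addr, None)

    def read(self, addr):
        if addr not in self.state:                 # read miss
            for cache in self._others():
                cache.snoop_read(addr)
            self.data[addr] = self.bus.memory[addr]
            self.state[addr] = "S"
        return self.data[addr]

    def write(self, addr, value):
        if self.state.get(addr) != "M":            # shared or invalid line
            self.bus.invalidates += 1              # one broadcast on the bus
            for cache in self._others():
                cache.snoop_invalidate(addr)
        self.state[addr] = "M"                     # exclusive/modified from now on
        self.data[addr] = value


bus = Bus({"X": 1})
p0, p1 = SnoopyCache(bus), SnoopyCache(bus)
p0.read("X")
p1.read("X")                        # both copies are shared
p0.write("X", 2)                    # invalidate broadcast, p0 holds X modified
p0.write("X", 3)                    # already modified: no bus traffic
value_seen_by_p1 = p1.read("X")     # miss; p0 supplies the value via write-back
```

Only the first write to the shared line costs a broadcast; the second write finds the line already exclusive and stays off the bus.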

• Complications for the basic MSI protocol:
  - Operations are not atomic
    - E.g. detect miss, acquire bus, receive a response
    - Creates possibility of deadlock and races
    - One solution: processor that sends invalidate can hold bus until other processors receive the invalidate
• Extensions:
  - Add exclusive state to indicate clean block in only one cache (MESI protocol)
    - Avoids broadcasting an invalidate on a write to a block in this state
  - Owned state (MOESI protocol)

Coherence Protocols: Extensions
• Shared memory bus and snooping bandwidth is the bottleneck for scaling symmetric multiprocessors
  - Duplicating tags
  - Place directory in outermost cache
  - Use crossbars or point-to-point networks with banked memory

Coherence Protocols
• AMD Opteron:
  - Memory directly connected to each multicore chip in NUMA-like organization
  - Implement coherence protocol using point-to-point links
  - Use explicit acknowledgements to order operations

Performance of Symmetric Shared-Memory Multiprocessors

Performance
• Coherence influences cache miss rate
  - Coherence misses
    - True sharing misses
      - Write to shared block (transmission of invalidation)
      - Read an invalidated block
    - False sharing misses
      - Read an unmodified word in an invalidated block
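
To make the true/false sharing distinction concrete, here is a small sketch that labels the coherence misses in an access trace to a single block. It assumes a write-invalidate protocol and covers only misses on a previously invalidated block (not the invalidation sent on a write to a shared block); the function name and the trace format `(core, op, word)` are assumptions made for illustration.

```python
def classify_coherence_misses(trace):
    """Label each coherence miss in a trace of (core, op, word) accesses to ONE block."""
    has_copy = set()          # cores currently holding a valid copy
    written_since_loss = {}   # core -> words others wrote since its copy was invalidated
    labels = []
    for core, op, word in trace:
        if core not in has_copy:
            if core in written_since_loss:
                # The block was cached here before and was invalidated.
                modified = written_since_loss.pop(core)
                kind = "true sharing" if word in modified else "false sharing"
                labels.append((core, op, word, kind))
            has_copy.add(core)   # a first-ever access is a cold miss, not labelled
        if op == "write":
            # Write invalidate: every other valid copy is dropped.
            for other in has_copy - {core}:
                written_since_loss[other] = set()
            has_copy = {core}
            # Remember the written word for every core that has lost its copy.
            for modified in written_since_loss.values():
                modified.add(word)
    return labels


# Words x1 and x2 live in the same block.
trace = [
    (1, "read", "x1"),
    (2, "read", "x2"),
    (1, "write", "x1"),   # invalidates core 2's copy
    (2, "read", "x2"),    # x2 was not modified
    (1, "write", "x1"),   # invalidates core 2 again
    (2, "read", "x1"),    # x1 was modified
]
labels = classify_coherence_misses(trace)
```

Core 2's read of `x2` misses only because `x1` happens to share its block (false sharing), whereas its later read of `x1` needs the value core 1 actually wrote (true sharing).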

Performance Study: Commercial Workload

Distributed Shared Memory and Directory-Based Coherence

Directory Protocols
• Directory keeps track of every block
  - Which caches have each block
  - Dirty status of each block
• Implement in shared L3 cache
  - Keep bit vector of size = # cores for each block in L3
  - Not scalable beyond shared L3
• Implement in a distributed fashion
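
As a rough worked example of the per-block bit vector, the sketch below computes the storage overhead of one presence bit per core and decodes a vector into the list of sharing cores. The 64-byte block size and the core counts are assumptions chosen for illustration, and both function names are hypothetical.

```python
def directory_overhead(num_cores, block_bytes):
    """Extra storage, as a fraction of block size, for one presence bit per core."""
    return num_cores / (block_bytes * 8)


def sharers(bit_vector, num_cores):
    """Decode a presence bit vector into the list of core IDs that hold the block."""
    return [core for core in range(num_cores) if (bit_vector >> core) & 1]


vector = 0
vector |= 1 << 0          # core 0 caches the block
vector |= 1 << 3          # core 3 caches the block

overhead_8_cores = directory_overhead(8, 64)     # 8 bits per 512-bit block
overhead_64_cores = directory_overhead(64, 64)   # 64 bits per 512-bit block
```

Under these assumptions the overhead grows linearly with the core count, from about 1.6% at 8 cores to 12.5% at 64 cores.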

Directory Protocols
• For each block, maintain state:
  - Shared
    - One or more nodes have the block cached, value in memory is up-to-date
    - Set of node IDs
  - Uncached
  - Modified (exclusive)
    - Exactly one node has a copy of the cache block, value in memory is out-of-date
    - Owner node ID
• Directory maintains block states and sends invalidation messages

Directory Protocols
• For uncached block:
  - Read miss
    - Requesting node is sent the requested data and is made the only sharing node, block is now shared
  - Write miss
    - The requesting node is sent the requested data and becomes the sharing node, block is now exclusive
• For shared block:
  - Read miss
    - The requesting node is sent the requested data from memory, node is added to sharing set
  - Write miss
    - The requesting node is sent the value, all nodes in the sharing set are sent invalidate messages, sharing set only contains requesting node, block is now exclusive

Directory Protocols
• For exclusive block:
  - Read miss
    - The owner is sent a data fetch message, block becomes shared, owner sends data to the directory, data written back to memory, sharers set contains old owner and requestor
  - Data write back
    - Block becomes uncached, sharer set is empty
  - Write miss
    - Message is sent to old owner to invalidate and send the value to the directory, requestor becomes new owner, block remains exclusive
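
The directory transitions for uncached, shared and exclusive blocks can be sketched as a small state machine for one block. This is an illustrative model only: the class `DirectoryEntry` and the message names (`"fetch"`, `"invalidate"`, `"fetch_invalidate"`, `"data_reply"`) are assumptions, and data movement is reduced to a list of `(message, destination node)` pairs.

```python
class DirectoryEntry:
    """Directory state for one block: uncached, shared, or exclusive."""

    def __init__(self):
        self.state = "uncached"
        self.sharers = set()      # node IDs; exactly one (the owner) when exclusive

    def read_miss(self, node):
        messages = []
        if self.state == "exclusive":
            (owner,) = self.sharers
            messages.append(("fetch", owner))      # owner returns data, memory is updated
        messages.append(("data_reply", node))
        self.sharers.add(node)                     # old owner (if any) stays a sharer
        self.state = "shared"
        return messages

    def write_miss(self, node):
        messages = []
        if self.state == "shared":
            messages += [("invalidate", n) for n in sorted(self.sharers - {node})]
        elif self.state == "exclusive":
            (owner,) = self.sharers
            messages.append(("fetch_invalidate", owner))
        messages.append(("data_reply", node))
        self.sharers = {node}                      # requester becomes the owner
        self.state = "exclusive"
        return messages

    def write_back(self, node):
        if self.state != "exclusive" or self.sharers != {node}:
            raise ValueError("only the owner of an exclusive block can write it back")
        self.sharers = set()
        self.state = "uncached"
        return []


entry = DirectoryEntry()
m1 = entry.read_miss(0)     # uncached -> shared {0}
m2 = entry.read_miss(1)     # shared {0, 1}
m3 = entry.write_miss(2)    # invalidate 0 and 1, exclusive {2}
m4 = entry.read_miss(0)     # fetch from owner 2, shared {0, 2}
```

Each method returns the messages the directory would send, so a trace of misses can be checked against the transitions listed in the bullets.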

Synchronization
• Basic building blocks:
  - Atomic exchange
    - Swaps register with memory location
  - Test-and-set
    - Tests a value and sets it if the value passes the test
  - Fetch-and-increment
    - Reads original value from memory and increments it in memory
• Requires memory read and write in uninterruptible instruction
• Load linked/store conditional
  - If the contents of the memory location specified by the load linked are changed before the store conditional to the same address, the store conditional fails
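
The load linked/store conditional rule can be illustrated with a single-threaded simulation. This is a sketch, not a hardware description: the `Memory` class, its per-processor link table and the `fetch_and_increment` helper are hypothetical names, and interleaving between processors is simulated by calling methods in a chosen order.

```python
class Memory:
    """Word memory with one link (reservation) per processor, simulated sequentially."""

    def __init__(self):
        self.words = {}
        self.links = {}    # processor id -> address it has load-linked

    def load_linked(self, pid, addr):
        self.links[pid] = addr
        return self.words.get(addr, 0)

    def store(self, addr, value):
        # Any store to an address breaks every link to that address.
        self.words[addr] = value
        for pid in [p for p, a in self.links.items() if a == addr]:
            del self.links[pid]

    def store_conditional(self, pid, addr, value):
        # Succeeds only if nothing was stored to addr since pid's load linked.
        if self.links.get(pid) != addr:
            return False
        self.store(addr, value)
        return True


def fetch_and_increment(mem, pid, addr):
    """Build fetch-and-increment from LL/SC by retrying until the SC succeeds."""
    while True:
        old = mem.load_linked(pid, addr)
        if mem.store_conditional(pid, addr, old + 1):
            return old


mem = Memory()
mem.store("ctr", 5)
value = mem.load_linked(0, "ctr")                     # processor 0 links to ctr
mem.store("ctr", 9)                                   # another processor writes ctr
failed = mem.store_conditional(0, "ctr", value + 1)   # link broken, store fails
old = fetch_and_increment(mem, 0, "ctr")              # retry loop succeeds
```

The failed store conditional leaves memory untouched, which is why the retry loop can safely re-read and try again.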

Implementing Locks
• Spin lock
  - If no coherence:
            DADDUI R2,R0,#1    ;R2 = 1 (locked value)
    lockit: EXCH   R2,0(R1)    ;atomic exchange
            BNEZ   R2,lockit   ;already locked?
  - If coherence:
    lockit: LD     R2,0(R1)    ;load of lock
            BNEZ   R2,lockit   ;not available-spin
            DADDUI R2,R0,#1    ;load locked value
            EXCH   R2,0(R1)    ;swap
            BNEZ   R2,lockit   ;branch if lock wasn't 0

Implementing Locks
• Advantage of this scheme (spinning on a locally cached copy of the lock): reduces memory traffic
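
The coherent spin lock (read the lock until it looks free, then attempt the exchange) can be sketched in Python threads. Python has no atomic exchange instruction, so this sketch assumes a `threading.Lock` as a stand-in for the hardware atomicity of `EXCH`; the class `SpinLock` and the function `run_counter_demo` are hypothetical names.

```python
import threading
import time


class SpinLock:
    def __init__(self):
        self._value = 0                   # 0 = free, 1 = held
        self._atomic = threading.Lock()   # stands in for the atomicity of EXCH

    def _exchange(self, new):
        with self._atomic:
            old, self._value = self._value, new
        return old

    def acquire(self):
        while True:
            while self._value != 0:       # LD / BNEZ: spin on plain reads
                time.sleep(0)             # let other Python threads run
            if self._exchange(1) == 0:    # EXCH / BNEZ: retry if someone else won
                return

    def release(self):
        self._exchange(0)                 # store 0 to free the lock


def run_counter_demo(num_threads, iterations):
    lock = SpinLock()
    shared = {"counter": 0}

    def worker():
        for _ in range(iterations):
            lock.acquire()
            shared["counter"] += 1        # critical section
            lock.release()

    threads = [threading.Thread(target=worker) for _ in range(num_threads)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return shared["counter"]


final_count = run_counter_demo(4, 200)
```

The inner read loop is the part that would be served from the local cache on a coherent machine; only the exchange generates a write and therefore an invalidation.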

Models of Memory Consistency: An Introduction

Models of Memory Consistency
Processor 1:        Processor 2:
  A=0                 B=0
  …                   …
  A=1                 B=1
  if (B==0) …         if (A==0) …
• Should be impossible for both if-statements to be evaluated as true
  - Delayed write invalidate?
• Sequential consistency:
  - Result of execution should be the same as if:
    - Accesses on each processor were kept in order
    - Accesses on different processors were arbitrarily interleaved

Implementing Sequential Consistency
• To implement sequential consistency, delay completion of all memory accesses until all invalidations caused by the access are completed
• Reduces performance!
• Alternatives:
  - Program-enforced synchronization to force write on processor to occur before read on the other processor
    - Requires synchronization object for A and another for B
      - “Unlock” after write
      - “Lock” before read
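
The claim that both if-statements cannot be true under sequential consistency can be checked by brute force for the two-processor example (P1: `A=1; if (B==0)`, P2: `B=1; if (A==0)`). The sketch below enumerates every interleaving that keeps each processor's accesses in program order; the tuple encoding of operations and the function names are assumptions made for illustration.

```python
from itertools import combinations

P1 = [("write", "A", 1), ("read", "B")]
P2 = [("write", "B", 1), ("read", "A")]


def interleavings(p, q):
    """Yield every merge of p and q that preserves each program's own order."""
    total = len(p) + len(q)
    for slots in combinations(range(total), len(p)):
        p_ops, q_ops = iter(p), iter(q)
        yield [(1, next(p_ops)) if i in slots else (2, next(q_ops))
               for i in range(total)]


def run(schedule):
    """Execute one interleaving against a single shared memory."""
    memory = {"A": 0, "B": 0}
    seen = {}
    for proc, op in schedule:
        if op[0] == "write":
            memory[op[1]] = op[2]
        else:
            seen[proc] = memory[op[1]]
    return seen[1], seen[2]    # (B as read by P1, A as read by P2)


sc_outcomes = {run(schedule) for schedule in interleavings(P1, P2)}
both_ifs_true_possible = (0, 0) in sc_outcomes
```

Across the six legal interleavings, at least one processor always observes the other's write, so the outcome where both reads return 0 never appears.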

Relaxed Consistency Models
• Rules:
  - X → Y
    - Operation X must complete before operation Y is done
    - Sequential consistency requires: R → W, R → R, W → R, W → W
  - Relax W → R
    - “Total store ordering”
  - Relax W → W
    - “Partial store order”
  - Relax R → W and R → R
    - “Weak ordering” and “release consistency”
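
A small sketch of what relaxing W → R permits, using the two-processor example in which P1 writes A then reads B and P2 writes B then reads A. Relaxation is modelled here, as a simplifying assumption, by letting each processor's read be performed before its own earlier write to a different address; real total store ordering implementations achieve this with store buffers. All names in the code are hypothetical.

```python
from itertools import combinations, product

W_A, R_B = ("write", "A"), ("read", "B")
W_B, R_A = ("write", "B"), ("read", "A")


def merges(p, q):
    """Yield every interleaving of p and q that keeps each list's order."""
    total = len(p) + len(q)
    for slots in combinations(range(total), len(p)):
        p_ops, q_ops = iter(p), iter(q)
        yield [next(p_ops) if i in slots else next(q_ops) for i in range(total)]


def outcome(schedule):
    """Return (value of B read by P1, value of A read by P2); writes store 1."""
    memory = {"A": 0, "B": 0}
    seen = {}
    for kind, var in schedule:
        if kind == "write":
            memory[var] = 1
        else:
            seen[var] = memory[var]
    return seen["B"], seen["A"]


def outcomes(orders_p1, orders_p2):
    return {outcome(s)
            for p, q in product(orders_p1, orders_p2)
            for s in merges(p, q)}


# Sequential consistency: W -> R is enforced, so only program order is allowed.
sequential = outcomes([[W_A, R_B]], [[W_B, R_A]])

# W -> R relaxed: the read may also complete before the earlier write.
relaxed_w_r = outcomes([[W_A, R_B], [R_B, W_A]], [[W_B, R_A], [R_A, W_B]])
```

With W → R relaxed, the result in which both reads return 0 becomes possible, which is why programs that depend on such ordering need explicit synchronization on these models.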

Relaxed Consistency Models
• Consistency model is multiprocessor specific
• Programmers will often implement explicit synchronization
• Speculation gives much of the performance advantage of relaxed models with sequential consistency
  - Basic idea: if an invalidation arrives for a result that has not been committed, use speculation recovery
