Thread-Level Parallelism
CS4342 Advanced Computer Architecture
Dilum Bandara
Dilum.Bandara@uom.lk
Slides adapted from “Computer Architecture, A Quantitative Approach” by John L.
Hennessy and David A. Patterson, 5th Edition, 2012, Morgan Kaufmann Publishers
Outline
 Multi-processors
 Shared-memory architectures
 Memory synchronization
2
Increasing Importance of
Multiprocessing
 Inability to exploit more ILP
 Power & silicon costs grow faster than performance
 Growing interest in
 High-end servers for cloud computing & software as a
service (SaaS)
 Data-intensive applications
 Lower interest in increasing desktop performance
 Better understanding of how to use multiprocessors
effectively
 Replicating an existing core is easier than a
completely new design
3
Vectors, MMX, GPUs vs.
Multiprocessors
 Vectors, MMX, GPUs
 SIMD
 Multiprocessors
 MIMD
4
Multiprocessor Architecture
 MIMD multiprocessor with n processors
 Need n threads or processes to keep it fully utilized
 Thread-Level parallelism
 Uses MIMD model
 Have multiple program counters
 Targeted for tightly-coupled shared-memory
multiprocessors
 Communication among threads through shared
memory
5
Definition – Grain Size
 Amount of computation assigned to each thread
 Threads can be used for data-level parallelism,
but overheads may outweigh benefits
6
Symmetric Multiprocessors (SMP)
 Small number of cores
 Share single memory
with uniform memory
latency
 A.k.a. Uniform Memory
Access (UMA)
7
Distributed Shared Memory (DSM)
 Memory distributed among processors
 Processors connected via direct (switched) & non-direct
(multi-hop) interconnection networks
 A.k.a. Non-Uniform Memory Access/latency (NUMA)
8
Cache Coherence
9
Core A Core B
Cache Cache
RAM
Cache Coherence (Cont.)
 Processors may see different values through
their caches
 Caching shared data introduces new problems
 In a coherent memory system, a read of a data
item returns the most recently written value
 2 aspects
 Coherence
 Consistency
10
1. Coherence
 Defines behavior of reads & writes to the same
memory location
 What value can be returned by a read
 All reads by any processor must return most recently
written value
 Writes to same location by any 2 processors are seen
in same order by all processors
 Serialized writes to same location
11
2. Consistency
 Defines behavior of reads & writes with respect
to access to other memory locations
 When a written value will be returned by a read
 If a processor writes location x followed by
location y, any processor that sees new value of
y must also see new value of x
 Serialized writes to different locations
12
Enforcing Coherence
 Coherent caches provide
 Migration – movement of data
 Replication – multiple copies of data
 Cache coherence protocols
 Snooping
 Each core tracks sharing status of each of its blocks
 Distributed
 Directory based
 Sharing status of each block kept in a single location
 Centralized
13
Snooping Coherence Protocol
 Write invalidate protocol
 On write, invalidate all other cached copies
 Use bus itself to serialize
 Write can’t complete until bus access is obtained
 Concurrent writes?
 The one that obtains the bus first wins
 Most common implementation
14
Snooping Coherence Protocol (Cont.)
 Write update protocol
 A.k.a. Write broadcast protocol
 On write, update all copies
 Needs more bandwidth
15
Snooping Coherence Protocol –
Implementation Techniques
 Locating an item when a read miss occurs
 Write-through cache
 Most recent copy is in memory
 Write-back cache
 Every processor snoops every address placed on shared bus
 If a processor finds it has a dirty block, updated block is sent to
requesting processor
 Cache lines marked as shared or
exclusive/modified
 Only writes to shared lines need an invalidate
broadcast
 After this, line is marked as exclusive
16
Snoopy Coherence Protocol – State
Transition Example
17
Snoopy Coherence Protocol – State
Transition Example
18
Snoopy Coherence Protocol – Issues
 Operations are not atomic
 e.g., detect miss, acquire bus, receive a response
 Applies to both reads & writes
 Creates possibility of deadlock & races
 Actual Snoopy Coherence Protocols are more
complicated
19
Coherence Protocols – Extensions
 Shared memory bus &
snooping bandwidth
become the bottleneck for
scaling symmetric
multiprocessors
 Duplicating tags
 Place directory in
outermost cache
 Use crossbars or point-to-
point networks with
banked memory
20
Coherence Protocols – Example
 AMD Opteron
 Memory directly connected
to each multicore chip in
NUMA-like organization
 Implement coherence
protocol using point-to-
point links
 Use explicit
acknowledgements to
order operations
21
Source: www.qdpma.com/systemarchitecture/SystemArchitecture_Opteron.html
Cache Coherence – Performance
 Coherence influences cache miss rate
 Coherence misses
 True sharing misses
 Write to shared block (transmission of invalidation)
 Read an invalidated block
 False sharing misses
 Read an unmodified word in an invalidated block
22
Performance Study – Commercial
Workload
23
Directory-Based Cache Coherence
 Sharing status of each physical memory block
kept in a single location
 Approaches
 Central directory for memory or common cache
 For Symmetric Multiprocessors (SMPs)
 Distributed directory
 For Distributed Shared Memory (DSM) systems
 Overcomes single point of contention in SMPs
24
Source: www.icsa.inf.ed.ac.uk/cgi-bin/hase/dir-cache-m.pl?cd-t.html,cd-f.html,menu1.html
Directory Protocols
 For each block, maintain state
 Shared
 1 or more nodes have the block cached, value in memory is
up-to-date
 Set of node IDs
 Uncached – no node has a copy of the block
 Modified
 Exactly 1 node has a copy of the cache block, value in
memory is out-of-date
 Owner node ID
 Directory maintains block states & sends
invalidation messages
25
Directory Protocols (Cont.)
26
Directory Protocols (Cont.)
 For uncached block
 Read miss
 Requesting node is sent requested data & is made the only
sharing node, block is now shared
 Write miss
 Requesting node is sent requested data & becomes the sharing
node, block is now exclusive (modified)
 For shared block
 Read miss
 Requesting node is sent requested data from memory, node is
added to sharing set
 Write miss
 Requesting node is sent value, all nodes in sharing set are sent
invalidate messages, sharing set only contains requesting
node, block is now exclusive
27
Directory Protocols (Cont.)
 For exclusive block
 Read miss
 Owner is sent a data fetch message, block becomes shared,
owner sends data to directory, data written back to memory,
sharers set contains old owner & requestor
 Data write back
 Block becomes uncached, sharer set is empty
 Write miss
 Message is sent to old owner to invalidate & send value to
directory, requestor becomes new owner, block remains
exclusive
28
Directory Protocols (Cont.)
29
Synchronization Operations
 Basic building blocks
 Atomic exchange
 Swaps register with memory location
 Test-and-set
 Tests a value & sets it if the value passes the test
 Fetch-and-increment
 Reads original value from memory & increments it in memory
 Requires memory read & write in uninterruptable
instruction
 load linked/store conditional
 If contents of memory location specified by load linked are
changed before store conditional to same address, store
conditional fails
30
Implementing Locks Using Coherence
 Spin lock
 If no coherence
DADDUI R2,R0,#1
lockit: EXCH R2,0(R1) ;atomic exchange
BNEZ R2,lockit ;already locked?
 If coherence
lockit: LD R2,0(R1) ;load of lock
BNEZ R2,lockit ;not available-spin
DADDUI R2,R0,#1 ;load locked value
EXCH R2,0(R1) ;swap
BNEZ R2,lockit ;branch if lock wasn’t 0
31
Implementing Locks – Advantages
 Reduces memory traffic
 During each iteration, current lock value can be read from cache
 Locality of accessing lock
32
Summary
 Multi-processors
 Create new problems, e.g., cache coherency
 Aspects of cache coherence
 Coherence
 Consistency
 Shared-memory architectures
 Snooping
 Directory based
 Memory synchronization
34

Editor's Notes

  • #32: Without coherence, the lock is in memory and is read from memory on every iteration. With coherence, the lock is in the cache and the spin reads the local cached copy until it gets changed. If we did the EXCH on the 1st line instead, every iteration would require exclusive access to the cache line.