Thread-Level Parallelism
CS4342 Advanced Computer Architecture
Dilum Bandara
Dilum.Bandara@uom.lk
Slides adapted from “Computer Architecture, A Quantitative Approach” by John L.
Hennessy and David A. Patterson, 5th Edition, 2012, Morgan Kaufmann Publishers
Outline
 Multi-processors
 Shared-memory architectures
 Memory synchronization
2
Increasing Importance of
Multiprocessing
 Inability to exploit more ILP
 Power & silicon costs grow faster than performance
 Growing interest in
 High-end servers for cloud computing & software as a
service (SaaS)
 Data-intensive applications
 Lower interest in increasing desktop performance
 Better understanding of how to use multiprocessors
effectively
 Replicating an existing core is easier than a
completely new design
3
Vectors, MMX, GPUs vs.
Multiprocessors
 Vectors, MMX, GPUs
 SIMD
 Multiprocessors
 MIMD
4
Multiprocessor Architecture
 MIMD multiprocessor with n processors
 Need n threads or processes to keep it fully utilized
 Thread-Level parallelism
 Uses MIMD model
 Have multiple program counters
 Targeted for tightly-coupled shared-memory
multiprocessors
 Communication among threads through shared
memory
5
Definition – Grain Size
 Amount of computation assigned to each thread
 Threads can be used for data-level parallelism,
but overheads may outweigh benefits
6
Symmetric Multiprocessors (SMP)
 Small number of cores
 Share single memory
with uniform memory
latency
 A.k.a. Uniform Memory
Access (UMA)
7
Distributed Shared Memory (DSM)
 Memory distributed among processors
 Processors connected via direct (switched) & non-direct
(multi-hop) interconnection networks
 A.k.a. Non-Uniform Memory Access/latency (NUMA)
8
Cache Coherence
9
Core A Core B
Cache Cache
RAM
Cache Coherence (Cont.)
 Processors may see different values through
their caches
 Caching shared data introduces new problems
 In a coherent memory system, a read of a data
item returns the most recently written value
 2 aspects
 Coherence
 Consistency
10
1. Coherence
 Defines behavior of reads & writes to the same
memory location
 What value can be returned by a read
 All reads by any processor must return most recently
written value
 Writes to same location by any 2 processors are seen
in same order by all processors
 Serialized writes to same location
11
2. Consistency
 Defines behavior of reads & writes with respect
to access to other memory locations
 When a written value will be returned by a read
 If a processor writes location x followed by
location y, any processor that sees new value of
y must also see new value of x
 Serialized writes to different locations
12
Enforcing Coherence
 Coherent caches provide
 Migration – movement of data
 Replication – multiple copies of data
 Cache coherence protocols
 Snooping
 Each core tracks sharing status of each of its blocks
 Distributed
 Directory based
 Sharing status of each block kept in a single location
 Centralized
13
Snooping Coherence Protocol
 Write invalidate protocol
 On write, invalidate all other cached copies
 Use bus itself to serialize
 Write can’t complete until bus access is obtained
 Concurrent writes?
 The one that obtains the bus first wins
 Most common implementation
14
Snooping Coherence Protocol (Cont.)
 Write update protocol
 A.k.a. Write broadcast protocol
 On write, update all copies
 Needs more bandwidth
15
Snooping Coherence Protocol –
Implementation Techniques
 Locating an item when a read miss occurs
 Write-through cache
 Most recent copy is in memory
 Write-back cache
 Every processor snoops every address placed on shared bus
 If a processor finds it has a dirty block, updated block is sent to
requesting processor
 Cache lines marked as shared or
exclusive/modified
 Only writes to shared lines need an invalidate
broadcast
 After this, line is marked as exclusive
16
Snoopy Coherence Protocol – State
Transition Example
17
Snoopy Coherence Protocol – State
Transition Example
18
Snoopy Coherence Protocol – Issues
 Operations are not atomic
 e.g., detect miss, acquire bus, receive a response
 Applies to both reads & writes
 Creates possibility of deadlock & races
 Actual Snoopy Coherence Protocols are more
complicated
19
Coherence Protocols – Extensions
 Shared memory bus &
snooping bandwidth
become the bottleneck for
scaling symmetric
multiprocessors
 Duplicating tags
 Place directory in
outermost cache
 Use crossbars or point-to-
point networks with
banked memory
20
Coherence Protocols – Example
 AMD Opteron
 Memory directly connected
to each multicore chip in
NUMA-like organization
 Implement coherence
protocol using point-to-
point links
 Use explicit
acknowledgements to
order operations
21
Source: www.qdpma.com/systemarchitecture/SystemArchitecture_Opteron.html
Cache Coherence – Performance
 Coherence influences cache miss rate
 Coherence misses
 True sharing misses
 Write to shared block (transmission of invalidation)
 Read an invalidated block
 False sharing misses
 Read an unmodified word in an invalidated block
22
Performance Study – Commercial
Workload
23
Directory-Based Cache Coherence
 Sharing status of each physical memory block
kept in a single location
 Approaches
 Central directory for memory or common cache
 For Symmetric Multiprocessors (SMPs)
 Distributed directory
 For Distributed Shared Memory (DSM) systems
 Overcomes single point of contention in SMPs
24
Source: www.icsa.inf.ed.ac.uk/cgi-bin/hase/dir-cache-m.pl?cd-t.html,cd-f.html,menu1.html
Directory Protocols
 For each block, maintain state
 Shared
 1 or more nodes have the block cached, value in memory is
up-to-date
 Set of node IDs
 Uncached – no node has a copy of the block
 Modified
 Exactly 1 node has a copy of the cache block, value in
memory is out-of-date
 Owner node ID
 Directory maintains block states & sends
invalidation messages
25
Directory Protocols (Cont.)
26
Directory Protocols (Cont.)
 For uncached block
 Read miss
 Requesting node is sent requested data & is made the only
sharing node, block is now shared
 Write miss
 Requesting node is sent requested data & becomes the sharing
node, block is now exclusive (modified)
 For shared block
 Read miss
 Requesting node is sent requested data from memory, node is
added to sharing set
 Write miss
 Requesting node is sent value, all nodes in sharing set are sent
invalidate messages, sharing set only contains requesting
node, block is now exclusive
27
Directory Protocols (Cont.)
 For exclusive block
 Read miss
 Owner is sent a data fetch message, block becomes shared,
owner sends data to directory, data written back to memory,
sharers set contains old owner & requestor
 Data write back
 Block becomes uncached, sharer set is empty
 Write miss
 Message is sent to old owner to invalidate & send value to
directory, requestor becomes new owner, block remains
exclusive
28
Directory Protocols (Cont.)
29
Synchronization Operations
 Basic building blocks
 Atomic exchange
 Swaps register with memory location
 Test-and-set
 Tests a value & sets it if the value passes the test
 Fetch-and-increment
 Reads original value from memory & increments it in memory
 Requires memory read & write in uninterruptable
instruction
 load linked/store conditional
 If contents of memory location specified by load linked are
changed before store conditional to same address, store
conditional fails
30
Implementing Locks Using Coherence
 Spin lock
 If no coherence
DADDUI R2,R0,#1
lockit: EXCH R2,0(R1) ;atomic exchange
BNEZ R2,lockit ;already locked?
 If coherence
lockit: LD R2,0(R1) ;load of lock
BNEZ R2,lockit ;not available-spin
DADDUI R2,R0,#1 ;load locked value
EXCH R2,0(R1) ;swap
BNEZ R2,lockit ;branch if lock wasn’t 0
31
Implementing Locks – Advantages
 Reduces memory traffic
 During each iteration, current lock value can be read from cache
 Locality of accessing lock
32
Summary
 Multi-processors
 Create new problems, e.g., cache coherency
 Aspects of cache coherence
 Coherence
 Consistency
 Shared-memory architectures
 Snooping
 Directory based
 Memory synchronization
34

Editor's Notes

  • #32: Without coherence, the lock is in memory and is read from memory on every iteration. With coherence, the lock is in the cache and the spin reads the local cached copy until it gets changed. If we did the EXCH on the 1st line instead, every iteration would require exclusive access to the cache line.