John Mellor-Crummey
Department of Computer Science
Rice University
johnmc@rice.edu
Principles of Parallel
Algorithm Design:
Concurrency and Mapping
COMP 422/534 Lecture 3 21 January 2020
Last Thursday
• Introduction to parallel algorithms
—tasks and decomposition
—threads and mapping
—threads versus cores
• Decomposition techniques - part 1
—recursive decomposition
—data decomposition
Owner Computes Rule
• Each datum is assigned to a thread
• Each thread computes values associated with its data
• Implications
—input data decomposition
– all computations using an input datum are performed by its thread
—output data decomposition
– an output is computed by the thread assigned to the output data
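As a concrete illustration of the owner-computes rule under an output data decomposition, here is a minimal C++ sketch (the array sizes, thread count, and per-element computation are illustrative placeholders, not from the slides): each thread owns a contiguous range of the output and computes only the values it owns.

#include <thread>
#include <vector>

// Owner-computes sketch: each thread owns a contiguous block of 'out' and
// computes only the elements it owns.
void owner_computes(std::vector<double>& out, const std::vector<double>& in,
                    int num_threads) {
  const int n = static_cast<int>(out.size());    // assume in.size() == out.size()
  std::vector<std::thread> workers;
  for (int t = 0; t < num_threads; ++t) {
    const int lo = (n * t) / num_threads;        // first index owned by thread t
    const int hi = (n * (t + 1)) / num_threads;  // one past the last owned index
    workers.emplace_back([&, lo, hi] {
      for (int i = lo; i < hi; ++i)
        out[i] = 2.0 * in[i];                    // placeholder computation on owned data
    });
  }
  for (auto& w : workers) w.join();
}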
Topics for Today
• Decomposition techniques - part 2
—exploratory decomposition
—hybrid decomposition
• Characteristics of tasks and interactions
• Mapping techniques for load balancing
—static mappings
—dynamic mappings
• Methods for minimizing interaction overheads
Exploratory Decomposition
• Exploration (search) of a state space of solutions
—problem decomposition reflects shape of execution
• Examples
—discrete optimization
– 0/1 integer programming
—theorem proving
—game playing
Exploratory Decomposition Example
Solving a 15 puzzle
• Sequence of three moves from state (a) to final state (d)
• From an arbitrary state, must search for a solution
Exploratory Decomposition: Example
Solving a 15 puzzle
Search
— generate successor states of the current state
— explore each as an independent task
Figure: the initial state, the state after the first move, and the final state (solution)
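To make the search step concrete, here is a rough C++ sketch (an assumption of this write-up, not the lecture's code): generate the successor states of the starting configuration and explore each one as an independent task via std::async, with an ordinary depth-limited search inside each task.

#include <array>
#include <future>
#include <utility>
#include <vector>

using Board = std::array<int, 16>;   // 4x4 board, row-major; 0 marks the blank tile

// All states reachable from b by one move of the blank tile.
std::vector<Board> successors(const Board& b) {
  int blank = 0;
  while (b[blank] != 0) ++blank;
  const int r = blank / 4, c = blank % 4;
  const int dr[] = {-1, 1, 0, 0}, dc[] = {0, 0, -1, 1};
  std::vector<Board> next;
  for (int k = 0; k < 4; ++k) {
    const int nr = r + dr[k], nc = c + dc[k];
    if (nr < 0 || nr >= 4 || nc < 0 || nc >= 4) continue;
    Board s = b;
    std::swap(s[blank], s[nr * 4 + nc]);
    next.push_back(s);
  }
  return next;
}

bool solved(const Board& b) {         // tiles 1..15 in order, blank last
  for (int i = 0; i < 15; ++i)
    if (b[i] != i + 1) return false;
  return b[15] == 0;
}

// Sequential depth-limited search used inside each task.
bool search(const Board& b, int depth) {
  if (solved(b)) return true;
  if (depth == 0) return false;
  for (const Board& s : successors(b))
    if (search(s, depth - 1)) return true;
  return false;
}

// Exploratory decomposition: each successor of the initial state is an independent task.
bool parallel_search(const Board& start, int depth) {
  std::vector<std::future<bool>> tasks;
  for (const Board& s : successors(start))
    tasks.push_back(std::async(std::launch::async, search, s, depth - 1));
  bool found = false;
  for (auto& t : tasks)
    found = t.get() || found;
  return found;
}

How much work each task ends up doing depends on where the solution lies, which is exactly the super-/sub-linear speedup effect discussed on the next slide.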
Exploratory Decomposition Speedup
• Parallel formulation may perform a different amount of work
• Can cause super- or sub-linear speedup
Figure: two search trees whose four subtrees each contain m nodes, with the solution marked. In the first, total serial work = 2m + 1 while total parallel work = 4 (superlinear speedup); in the second, total serial work = m while total parallel work = 4m (sublinear speedup).
Speculative Decomposition
• Dependencies between tasks are not always known a-priori
—this makes it impossible to identify all independent tasks a-priori
• Conservative approach
—identify independent tasks only when no dependencies left
• Optimistic (speculative) approach
—schedule tasks even when they may potentially be erroneous
• Drawbacks for each
—conservative approaches
– may yield little concurrency
—optimistic approaches
– may require a roll-back mechanism if a dependence is encountered
Speculative Decomposition in Practice
Discrete event simulation
• Data structure: centralized time-ordered event list
• Simulation
— extract next event in time order
— process the event
— if required, insert new events into the event list
• Optimistic event scheduling
— assume outcomes of all prior events
— speculatively process next event
— if assumption is incorrect, roll back its effects and continue
Time Warp
David Jefferson. “Virtual Time,”
ACM TOPLAS, 7(3):404-425, July 1985
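The rollback idea can be sketched for a single logical process as follows; this is an assumption-laden miniature (placeholder Event/State types, no anti-messages to other processes, no fossil collection), not Jefferson's algorithm in full. Execute events speculatively in timestamp order, checkpoint the state before each one, and when a straggler arrives, restore an earlier checkpoint and re-execute.

#include <map>
#include <vector>

struct Event { double time; int delta; };
struct State { int value = 0; };

// Placeholder event handler: apply one event to the simulation state.
static void apply(State& s, const Event& e) { s.value += e.delta; }

// One logical process that executes events optimistically and rolls back stragglers.
class OptimisticLP {
  State state_;
  std::vector<Event> processed_;         // events already executed, in timestamp order
  std::map<double, State> checkpoints_;  // state saved just before each executed event
public:
  void receive(const Event& e) {
    std::vector<Event> redo;
    // Straggler: undo every event that was speculatively executed too early.
    while (!processed_.empty() && processed_.back().time > e.time) {
      redo.push_back(processed_.back());
      processed_.pop_back();
    }
    if (!redo.empty()) {
      state_ = checkpoints_[redo.back().time];   // restore state before the earliest undone event
      for (const Event& r : redo) checkpoints_.erase(r.time);
    }
    execute(e);                                  // process the new event in its correct place
    for (auto it = redo.rbegin(); it != redo.rend(); ++it)
      execute(*it);                              // re-execute the undone events in timestamp order
  }
private:
  void execute(const Event& e) {
    checkpoints_[e.time] = state_;               // checkpoint, then speculatively apply
    apply(state_, e);
    processed_.push_back(e);
  }
};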
Speculative Decomposition in Practice
Time Warp OS http://bit.ly/twos-94
• A new operating system for military simulations
—expensive computational tasks
—composed of many interacting subsystems
—highly irregular temporal behavior
• Optimistic execution and process rollback
—don't treat rollback as a special case for handling exceptions,
breaking deadlock, aborting transactions, …
—use rollback as frequently as other systems use blocking
• Why a new OS?
—rollback forces a rethinking of all OS issues
– scheduling, synchronization, message queueing, flow control,
memory management, error handling, I/O, and commitment
—building Time Warp on top of an OS would require two levels of
synchronization, two levels of message queues, …
Optimistic Simulation
David Bauer et al. “ROSS.NET: Optimistic Simulation
Framework For Large-scale Internet Models,” Proc. of
the 2003 Winter Simulation Conference
Hybrid Decomposition
Use multiple decomposition strategies together
Often necessary for adequate concurrency
• Quicksort
—recursive decomposition alone limits concurrency
—augmenting recursive with data decomposition is better
– can use data decomposition on input data to compute a split
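A sketch of the hybrid idea for quicksort in C++ (illustrative, using std::async): recursive decomposition turns the two sub-arrays produced by each partition into independent tasks; for brevity the partition itself is sequential here, but it is exactly the step one would additionally data-decompose (per-block counts plus a prefix sum) to get concurrency near the root of the recursion.

#include <future>
#include <utility>
#include <vector>

// Sequential Lomuto partition around a[hi]; returns the pivot's final position.
// This is the step the slide suggests data-decomposing for extra concurrency.
static int partition(std::vector<int>& a, int lo, int hi) {
  const int pivot = a[hi];
  int i = lo;
  for (int j = lo; j < hi; ++j)
    if (a[j] < pivot) std::swap(a[i++], a[j]);
  std::swap(a[i], a[hi]);
  return i;
}

// Recursive decomposition: sort the two halves as independent tasks
// (stop spawning below a small depth to limit task-creation overhead).
void quicksort(std::vector<int>& a, int lo, int hi, int depth = 3) {
  if (lo >= hi) return;
  const int p = partition(a, lo, hi);
  if (depth > 0) {
    auto left = std::async(std::launch::async,
                           [&a, lo, p, depth] { quicksort(a, lo, p - 1, depth - 1); });
    quicksort(a, p + 1, hi, depth - 1);   // right half sorted by the current task
    left.get();                           // the two halves touch disjoint index ranges
  } else {
    quicksort(a, lo, p - 1, 0);
    quicksort(a, p + 1, hi, 0);
  }
}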
Hybrid Decomposition for Climate Simulation
Figure courtesy of Pat Worley (ORNL)
Data decomposition within atmosphere, ocean, land, and sea-ice tasks
Topics for Today
• Decomposition techniques - part 2
— data decomposition
— exploratory decomposition
— hybrid decomposition
• Characteristics of tasks and interactions
• Mapping techniques for load balancing
— static mappings
— dynamic mappings
• Methods for minimizing interaction overheads
• Parallel algorithm design templates
Characteristics of Tasks
• Key characteristics
—generation strategy
—associated work
—associated data size
• Impact choice and performance of parallel algorithms
Task Generation
• Static task generation
—identify concurrent tasks a-priori
—typically decompose using data or recursive decomposition
—examples
– matrix operations
– graph algorithms on static graphs
– image processing applications
– other regularly structured problems
• Dynamic task generation
—identify concurrent tasks as a computation unfolds
—typically decompose using exploratory or speculative
decompositions
—examples
– puzzle solving
– game playing
Task Size
• Uniform: all the same size
• Non-uniform
— sometimes sizes known or can be estimated a-priori
— sometimes not
– example: tasks in quicksort
size of each partition depends upon pivot selected
Size of Data Associated with Tasks
• Data may be small or large compared to the computation
— size(input) < size(computation), e.g., 15 puzzle
— size(input) = size(computation) > size(output), e.g., min
— size(input) = size(output) < size(computation), e.g., sort
• Implications
— small data: task can easily migrate to another thread
— large data: ties the task to a thread
– can sometimes avoid communicating the task context by reconstructing or recomputing it elsewhere
Characteristics of Task Interactions
Orthogonal classification criteria
• Static vs. dynamic
• Regular vs. irregular
• Read-only vs. read-write
• One-sided vs. two-sided
Characteristics of Task Interactions
• Static interactions
—tasks and interactions are known a-priori
—simpler to code
• Dynamic interactions
—the timing or the identity of interacting tasks cannot be determined a-priori
—harder to code
– especially using two-sided message passing APIs
Characteristics of Task Interactions
• Regular interactions
—interactions have a pattern that can be described with a function
– e.g. mesh, ring
—regular patterns can be exploited for efficient implementation
– e.g. schedule communication to avoid conflicts on network links
• Irregular interactions
—lack a well-defined topology
—modeled by a graph
Static Regular Task Interaction Pattern
Image operations, e.g., edge detection
Nearest neighbor interactions on a 2D mesh
Figure: Sobel edge detection stencils
Static Irregular Task Interaction Pattern
Sparse matrix-vector multiply
Characteristics of Task Interactions
• Read-only interactions
—tasks only read data associated with other tasks
• Read-write interactions
—read and modify data associated with other tasks
—harder to code: requires synchronization
– need to avoid read-write and write-write ordering races
Characteristics of Task Interactions
• One-sided
—initiated & completed independently by 1 of 2 interacting tasks
– READ or WRITE
– GET or PUT
• Two-sided
—both tasks coordinate in an interaction
– SEND and RECV
Topics for Today
• Decomposition techniques - part 2
— data decomposition
— exploratory decomposition
— hybrid decomposition
• Characteristics of tasks and interactions
• Mapping techniques for load balancing
— static mappings
— dynamic mappings
• Methods for minimizing interaction overheads
• Parallel algorithm design templates
Mapping Techniques
Map concurrent tasks to threads for execution
• Overheads of mappings
—serialization (idling)
—communication
• Select mapping to minimize overheads
• Conflicting objectives: minimizing one increases the other
—assigning all work to one thread
– minimizes communication
– significant idling
—minimizing serialization introduces communication
Mapping Techniques for Minimum Idling
• Must balance load and minimize idling at the same time
• Balancing load alone does not minimize idling
Mapping Techniques for Minimum Idling
Static vs. dynamic mappings
• Static mapping
—a-priori mapping of tasks to threads or processes
— requirements
– a good estimate of task size
– even so, computing an optimal mapping may be NP-hard
e.g., mapping an even decomposition is analogous to bin packing
• Dynamic mapping
— map tasks to threads or processes at runtime
— why?
– tasks are generated at runtime, or
– their sizes are unknown
Factors that influence choice of mapping
• size of data associated with a task
• nature of underlying domain
Schemes for Static Mapping
• Data partitionings
• Task graph partitionings
• Hybrid strategies
Mappings Based on Data Partitioning
Partition computation using a combination of
—data partitioning
—owner-computes rule
Example: 1-D block distribution for dense matrices
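The index arithmetic behind a 1-D block distribution is tiny; a sketch (the names are illustrative):

// Rows [lo, hi) of an n-row matrix owned by thread t under a 1-D block distribution.
struct Block { int lo, hi; };

Block block_1d(int n, int p, int t) {
  return { (n * t) / p, (n * (t + 1)) / p };   // works even when p does not divide n
}

// Owner-computes: thread t then loops only over its own rows, e.g.
//   Block b = block_1d(n, p, t);
//   for (int i = b.lo; i < b.hi; ++i) { ... }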
Block Array Distribution Schemes
Multi-dimensional block distributions
Multi-dimensional partitioning enables larger # of threads
Block Array Distribution Example
Multiplying two dense matrices C = A x B
• Partition the output matrix C using a block decomposition
• Give each task the same number of elements of C
— each element of C corresponds to a dot product
— even load balance
• Obvious choices: 1D or 2D decomposition
• Select to minimize associated communication overhead
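For the 2-D choice, here is a sketch of one task's work, assuming C is n x n, the tasks form a q x q grid, and q divides n (all names illustrative): the task at grid position (bi, bj) owns an (n/q) x (n/q) block of C and evaluates every dot product in that block.

#include <vector>

// Compute the block of C = A x B owned by the task at grid position (bi, bj).
// A, B, C are n x n row-major matrices.
void matmul_block(const std::vector<double>& A, const std::vector<double>& B,
                  std::vector<double>& C, int n, int q, int bi, int bj) {
  const int bs = n / q;                    // block edge length (assume q divides n)
  for (int i = bi * bs; i < (bi + 1) * bs; ++i)
    for (int j = bj * bs; j < (bj + 1) * bs; ++j) {
      double dot = 0.0;                    // C[i][j] is the dot product of row i of A
      for (int k = 0; k < n; ++k)          // with column j of B
        dot += A[i * n + k] * B[k * n + j];
      C[i * n + j] = dot;
    }
}

Each task reads only a block of rows of A and a block of columns of B; comparing that data usage for the 1-D and 2-D choices is what drives the selection.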
Data Usage in Dense Matrix Multiplication
Consider: Gaussian Elimination
Active submatrix shrinks as elimination progresses
Figure: element A[k,j]; the submatrix active for step k and the smaller submatrix active for step k+1
Imbalance and Block Array Distributions
• Consider a block distribution for Gaussian Elimination
— amount of computation per data item varies
— a block decomposition would lead to significant load
imbalance
Block Cyclic Distribution
Variant of the block distribution scheme that can be used to
alleviate the load-imbalance and idling
Steps
1. partition an array into many more blocks than the number
of available threads or processes
2. round-robin assignment of blocks to threads or processes
– each thread or process gets several non-adjacent blocks
Block-Cyclic Distribution
1D block-cyclic 2D block-cyclic
• Cyclic distribution: special case with block size = 1
• Block distribution: special case with block size = n/p
—n is the dimension of the matrix; p is the # of threads
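Ownership under a 1-D block-cyclic distribution is a one-liner; a sketch with illustrative names:

// Thread that owns row i under a 1-D block-cyclic distribution with block size b over p threads.
int owner_block_cyclic(int i, int b, int p) {
  return (i / b) % p;   // b = 1 gives the cyclic distribution; b = n/p gives the block distribution
}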
Decomposition by Graph Partitioning
Sparse matrix-vector multiply
• Graph of the matrix is useful for decomposition
— work ~ number of edges
— communication for a node ~ node degree
• Goal: balance work & minimize communication
• Partition the graph
— assign equal number of nodes to each thread
— minimize edge count of the graph partition
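The quantity being minimized can be written down directly; a small sketch, assuming an adjacency-list graph and an assignment part[v] of vertices to threads:

#include <vector>

// Edges whose endpoints are assigned to different threads (the "edge cut").
// adj[v] lists the neighbors of v; each undirected edge appears in both lists.
int edge_cut(const std::vector<std::vector<int>>& adj, const std::vector<int>& part) {
  int cut = 0;
  for (int v = 0; v < static_cast<int>(adj.size()); ++v)
    for (int u : adj[v])
      if (part[u] != part[v]) ++cut;
  return cut / 2;   // each cut edge was counted once from each endpoint
}

Graph partitioners such as METIS search for assignments that balance the vertices per thread while heuristically minimizing this count.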
Partitioning a Graph of Lake Superior
Random Partitioning
Partitioning for minimum edge-cut
Mappings Based on Task Partitioning
Partitioning a task-dependency graph
• Optimal partitioning for general task-dependency graph
— NP-hard problem
• Excellent heuristics exist for structured graphs
Mapping a Sparse Matrix
Sparse matrix-vector product
Figure: sparse matrix structure, its partitioning, and a mapping; 17 items to communicate
Mapping a Sparse Matrix
Sparse matrix-vector product
Figure: the same sparse matrix structure and partitioning with a better mapping; 13 items to communicate instead of 17
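To connect the "items to communicate" counts to code, here is a sketch (the CSR layout, the 1-D block owner function, and the counting are assumptions of this write-up, not the slides' example): with rows block-partitioned, every x[j] whose owner differs from the row's owner is one item that must be fetched from another thread.

#include <set>
#include <vector>

// CSR sparse matrix: row_ptr has n+1 entries; col and val hold the nonzeros.
struct CSR {
  int n;
  std::vector<int> row_ptr, col;
  std::vector<double> val;
};

int owner(int i, int n, int p) { return (i * p) / n; }   // 1-D block owner of index i

// y = A*x for the rows owned by thread t, counting distinct remote x entries needed.
int spmv_rows(const CSR& A, const std::vector<double>& x, std::vector<double>& y,
              int p, int t) {
  std::set<int> remote;                        // x entries owned by other threads
  for (int i = 0; i < A.n; ++i) {
    if (owner(i, A.n, p) != t) continue;       // owner-computes: skip rows we don't own
    double sum = 0.0;
    for (int k = A.row_ptr[i]; k < A.row_ptr[i + 1]; ++k) {
      const int j = A.col[k];
      if (owner(j, A.n, p) != t) remote.insert(j);
      sum += A.val[k] * x[j];
    }
    y[i] = sum;
  }
  return static_cast<int>(remote.size());      // "items to communicate" for thread t
}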
Hierarchical Mappings
• Sometimes a single-level mapping is inadequate
• Hierarchical approach
— use a task mapping at the top level
— data partitioning within each task
Example:
Hybrid Decomposition
+ Data Partitioning for
Community Earth System Model
Topics for Today
• Decomposition techniques - part 2
— data decomposition
— exploratory decomposition
— hybrid decomposition
• Characteristics of tasks and interactions
• Mapping techniques for load balancing
— static mappings
— dynamic mappings
• Methods for minimizing interaction overheads
• Parallel algorithm design templates
Schemes for Dynamic Mapping
• Dynamic mapping AKA dynamic load balancing
—load balancing is the primary motivation for dynamic mapping
• Styles
—centralized
—distributed
Centralized Dynamic Mapping
• Thread types: masters or slaves
• General strategy
—when a slave runs out of work → request more from master
• Challenge
—master may become bottleneck for large # of threads
• Approach
—chunk scheduling: a thread picks up several tasks at once (sketched below)
—however
– large chunk sizes may cause significant load imbalances
– gradually decrease chunk size as the computation progresses
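A sketch of chunk scheduling with a shrinking chunk size, in the spirit of guided self-scheduling (the shared atomic counter stands in for the master, and process() is a placeholder):

#include <algorithm>
#include <atomic>
#include <thread>
#include <vector>

static void process(int task) { /* placeholder per-task work */ }

// Each worker repeatedly claims a chunk of tasks from a shared counter;
// the chunk size shrinks as the pool of remaining tasks shrinks.
void chunk_scheduled(int num_tasks, int num_workers) {
  std::atomic<int> next{0};
  auto worker = [&] {
    for (;;) {
      const int remaining = num_tasks - next.load();
      if (remaining <= 0) break;
      const int chunk = std::max(1, remaining / (2 * num_workers));  // shrinking chunk size
      const int begin = next.fetch_add(chunk);
      if (begin >= num_tasks) break;
      const int end = std::min(begin + chunk, num_tasks);
      for (int i = begin; i < end; ++i) process(i);
    }
  };
  std::vector<std::thread> pool;
  for (int w = 0; w < num_workers; ++w) pool.emplace_back(worker);
  for (auto& t : pool) t.join();
}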
Distributed Dynamic Mapping
• All threads act as peers
• Each thread can send work to or receive work from other threads
—avoids centralized bottleneck
• Four critical design questions
—how are sending and receiving threads paired together?
—who initiates work transfer?
—how much work is transferred?
—when is a transfer triggered?
• Ideal answers can be application specific
• Cilk uses a distributed dynamic mapping: “work stealing”
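A rough, lock-based sketch of the per-thread structure behind work stealing (illustrative only; Cilk's actual deques use a carefully designed lock-free protocol): the owner pushes and pops at one end, and idle threads steal from the other end of a randomly chosen victim.

#include <deque>
#include <functional>
#include <mutex>
#include <optional>

using Task = std::function<void()>;

// Per-thread task deque: the owner works on one end, thieves take the other.
class WorkStealingDeque {
  std::deque<Task> tasks_;
  std::mutex m_;
public:
  void push(Task t) {                       // owner adds newly spawned work
    std::lock_guard<std::mutex> g(m_);
    tasks_.push_back(std::move(t));
  }
  std::optional<Task> pop() {               // owner takes the most recently pushed task
    std::lock_guard<std::mutex> g(m_);
    if (tasks_.empty()) return std::nullopt;
    Task t = std::move(tasks_.back());
    tasks_.pop_back();
    return t;
  }
  std::optional<Task> steal() {             // thief takes the oldest task
    std::lock_guard<std::mutex> g(m_);
    if (tasks_.empty()) return std::nullopt;
    Task t = std::move(tasks_.front());
    tasks_.pop_front();
    return t;
  }
};

// Worker loop sketch: run your own tasks; when empty, steal from a random victim.
//   while (running) {
//     if (auto t = my_deque.pop()) (*t)();
//     else if (auto t = deques[random_victim()].steal()) (*t)();
//   }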
Topics for Today
• Decomposition techniques - part 2
— data decomposition
— exploratory decomposition
— hybrid decomposition
• Characteristics of tasks and interactions
• Mapping techniques for load balancing
— static mappings
— dynamic mappings
• Methods for minimizing interaction overheads
• Parallel algorithm design templates
Minimizing Interaction Overheads (1)
“Rules of thumb”
• Maximize data locality
— don’t fetch data you already have
— restructure computation to reuse data promptly
• Minimize volume of data exchange
— partition the interaction graph to minimize the number of edges crossing partition boundaries
• Minimize frequency of communication
— try to aggregate messages where possible
• Minimize contention and hot-spots
— use decentralized techniques (avoidance)
Minimizing Interaction Overheads (2)
Techniques
• Overlap communication with computation
— use non-blocking communication primitives
– overlap communication with your own computation
– one-sided: prefetch remote data to hide latency
— multithread code
– overlap communication with another thread’s computation
• Replicate data or computation to reduce communication
• Use group communication instead of point-to-point primitives
• Issue multiple communications and overlap their latency
(reduces exposed latency)
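A minimal sketch of the first technique using non-blocking MPI point-to-point calls (the buffers, neighbor rank, and the split into interior and boundary work are illustrative):

#include <mpi.h>

static void compute_interior() { /* work that does not depend on incoming data */ }
static void compute_boundary(const double* halo) { /* work that needs the received data */ }

// Overlap an exchange with independent computation using non-blocking MPI calls.
void exchange_and_compute(double* sendbuf, double* recvbuf, int count, int neighbor) {
  MPI_Request reqs[2];
  MPI_Irecv(recvbuf, count, MPI_DOUBLE, neighbor, 0, MPI_COMM_WORLD, &reqs[0]);
  MPI_Isend(sendbuf, count, MPI_DOUBLE, neighbor, 0, MPI_COMM_WORLD, &reqs[1]);

  compute_interior();                          // overlapped with the messages in flight

  MPI_Waitall(2, reqs, MPI_STATUSES_IGNORE);   // only the remaining latency is exposed here
  compute_boundary(recvbuf);
}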
Topics for Today
• Decomposition techniques - part 2
— data decomposition
— exploratory decomposition
— hybrid decomposition
• Characteristics of tasks and interactions
• Mapping techniques for load balancing
— static mappings
— dynamic mappings
• Methods for minimizing interaction overheads
• Parallel algorithm design templates
Parallel Algorithm Model
• Definition: ways of structuring a parallel algorithm
• Aspects of a model
— decomposition
— mapping technique
— strategy to minimize interactions
Common Parallel Algorithm Templates
• Data parallel
— each task performs similar operations on different data
— typically statically map tasks to threads or processes
• Task graph
— use task dependency graph relationships to promote locality,
or reduce interaction costs
• Master-slave
— one or more master threads generate work
— allocate it to worker threads
— allocation may be static or dynamic
• Pipeline / producer-consumer
— pass a stream of data through a sequence of workers
— each performs some operation on it
• Hybrid
— apply multiple models hierarchically, or
— apply multiple models in sequence to different phases
Topics for Tuesday
• Threaded programming models
• Introduction to Cilk Plus
—tasks
—algorithmic complexity measures
—scheduling
—performance and granularity
—task parallelism examples
References
• Adapted from slides “Principles of Parallel Algorithm
Design” by Ananth Grama
• Based on Chapter 3 of “Introduction to Parallel
Computing” by Ananth Grama, Anshul Gupta, George
Karypis, and Vipin Kumar. Addison-Wesley, 2003