Insertion Tree Phasers
Efficient and Scalable Barrier Synchronization for Fine-grained Parallelism
Stefan Marr, S. Verhaegen, B. De Fraine, T. D’Hondt, W. De Meuter
Software Languages Lab, Vrije Universiteit Brussel

Agenda
  • Introduction: Barriers, Phasers
  • Insertion Tree Phasers: Insertion Tree, Phaser Algorithm
  • Evaluation
  • Summary
(Slides dated 9/26/10)

Barriers
  • Synchronizing parallel activities
      • High productivity: easy to get right
      • Mostly for scientific computing
  • Many-core evolution
      • Synchronizing dynamic and irregular problems
      • Requires low-overhead dynamic hierarchical barriers
[Figure: threads t1, t2, t3 each passing through phases p1, p2, p3]

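A minimal sketch of the phase-by-phase pattern in the figure, using Python's standard `threading.Barrier` as a stand-in for a classic barrier. It is not a phaser: the number of parties is fixed at construction. The names `worker` and `log` and the thread and phase counts are illustrative assumptions.

```python
import threading

NUM_THREADS = 3   # t1..t3 in the figure
NUM_PHASES = 3    # p1..p3 in the figure

barrier = threading.Barrier(NUM_THREADS)
log = []                      # (phase, thread id) in the order work happened
log_lock = threading.Lock()


def worker(tid):
    for phase in range(1, NUM_PHASES + 1):
        # "Work" of this phase: record that thread tid ran phase p.
        with log_lock:
            log.append((phase, tid))
        # Nobody continues to the next phase until all threads arrive.
        barrier.wait()


threads = [threading.Thread(target=worker, args=(t,))
           for t in range(1, NUM_THREADS + 1)]
for t in threads:
    t.start()
for t in threads:
    t.join()

phases = [phase for phase, _ in log]
print(phases)
```

Because every thread records its phase before waiting, no entry for a later phase can appear in `log` before all entries of the earlier phase.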
Phasers
  • Extension of X10 clocks
      • Clocks: dynamic two-phase barrier for fork/join parallelism
  • Registration modes for barrier
      • Enables expression of producer/consumer relation
  • Single statements
      • Executed only by a single thread, avoids duplicated barrier operations
[Figure: threads t1 to t3 progressing through phases p1 to p3]

Hierarchical Phasers
  • Shirako & Sarkar in Proc. of IEEE IPDPS 2010 [1]
  • First scalable implementation strategy
  • Predefined tree structure
      • Degree, i.e., tree arity
      • Max. number of tiers, i.e., height
  • Composed from phasers
  • Problematic
      • Non-dynamic structure
      • Two-phase support incomplete
      • Leaves design decisions open
[Figure: phaser tree with the phaser at tier 0, sub-phasers at tier 1 and tier 2 (leaves), and signal objects (sig) for activities A1 to A8; further labels: array access, list access]

Open Questions with Hierarchical Phasers
  • Dynamic tree construction, or on initialization?
      • Tradeoffs for atomic operations, overhead of joining/leaving phaser
  • How are operations synchronized?
      • Tradeoffs for overheads and restrictions on parallelism
  • Garbage collection problem for dropped participants
      • Keeps list of synchronization objects incl. dropped participants
  • After reaching max. #participants
      • Is the tree rebalanced? (Hint at it for dropped nodes)
  • Two-phase barrier support does not hide latency for original phasers

Insertion Tree Phasers

Design Goal
  • Support for full generality of Phaser properties
      • Two-phase support
      • Signal-only/wait-only for producers/consumers
      • Single statement
  • Full dynamicity: fine-grained hierarchical fork/join
  • Adaptation of existing, scalable approaches
      • Dissemination barrier not adaptable
      • Remaining are tree-based approaches

Insertion Tree
  • Goals
      • Stable, i.e., minimized tree modifications
      • Avoid inconsistent synchronization information
      • Maximizing parallel operations
  • Solution: Insertion Tree
      • Inverted tree
      • No removal
      • Complete smallest subtree first

Insertion Tree
[Figure sequence: the tree grows as leaves 1 to 8 are added. With two leaves the tree has helper node h1; with three leaves h2 and h1; with four leaves h2, h1, h3; with five leaves h4, h2, h1, h3. With eight leaves the helper nodes are h4 at the top, h2 and h6 below it, and h1, h3, h5, h7 above the leaves 1 to 8.]

Determining the Insertion Point

    def getNextInsertNode(tree):
        # Start at the most recently inserted leaf.
        result = tree.lastNode
        i = tree.numLeaves
        # Climb one level for every factor of two in the leaf count.
        while i % 2 == 0:
            result = result.parent
            i = i // 2
        return result

    # This is for 2-ary trees; it is adaptable for n-ary trees, too.

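The following sketch embeds the insertion-point search in a small runnable tree. The insertion step itself is an assumption inferred from the figure sequence, not taken from the slides: a new helper node takes the place of the insertion node and becomes the parent of both the insertion node and the new leaf. `Node`, `InsertionTree`, `addLeaf`, and `path_to_root` are hypothetical names; helper labels follow the figures (h1, h2, ...).

```python
class Node:
    """Tree node that only knows its parent (an inverted tree)."""

    def __init__(self, label, parent=None):
        self.label = label
        self.parent = parent


class InsertionTree:
    def __init__(self):
        self.lastNode = Node("1")   # the tree starts with a single leaf
        self.numLeaves = 1

    def getNextInsertNode(self):
        result = self.lastNode
        i = self.numLeaves
        while i % 2 == 0:
            result = result.parent
            i //= 2
        return result

    def addLeaf(self):
        # Assumed insertion step: splice a new helper in above the target.
        target = self.getNextInsertNode()
        helper = Node(f"h{self.numLeaves}", parent=target.parent)
        target.parent = helper
        self.numLeaves += 1
        leaf = Node(str(self.numLeaves), parent=helper)
        self.lastNode = leaf
        return leaf


def path_to_root(node):
    """Labels of all helper nodes between a node and the root."""
    labels = []
    while node.parent is not None:
        node = node.parent
        labels.append(node.label)
    return labels


tree = InsertionTree()
leaves = [tree.lastNode] + [tree.addLeaf() for _ in range(7)]
for leaf in leaves:
    print(leaf.label, path_to_root(leaf))
```

With eight leaves, every leaf reaches the root h4 through two further helpers, which agrees with the last frame of the figure sequence. Existing leaves keep their position; only one parent pointer changes per insertion.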
Synchronization Tree*
  • Phaser: phase
  • Helper nodes: phase counter, wait-only flag (wo)
  • Participant nodes: phase counter, resume flag (rsmd)
*) is simplified, leaves out registration modes
[Figure: phaser with phase 0 above helper nodes and the participant nodes A1 to A4, all phase counters at 0]

Announcing Synchronization
[Figure sequence: starting with all phase counters at 0, the participants A1 to A4 announce synchronization. Their counters change to 1 (shown together with the rsmd flag), the counters of the helper nodes follow, and finally the phaser's phase changes from 0 to 1: "Synchronization reached. Continue to next phase."]

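A sequential sketch of how such an announcement could travel up the tree. It assumes that each helper node holds the minimum phase count of its children (the notes describe node properties as aggregations of the subtree) and that the root's counter plays the role of the phaser's phase. It uses no atomic operations, so it only illustrates the data flow, not the concurrent algorithm. `Node` and `signal` are hypothetical names.

```python
class Node:
    def __init__(self, name, children=()):
        self.name = name
        self.phase = 0
        self.parent = None
        self.children = list(children)
        for child in self.children:
            child.parent = self


def signal(participant):
    """Advance one participant; return True if the root advanced too."""
    participant.phase += 1
    node = participant
    while node.parent is not None:
        parent = node.parent
        lowest = min(child.phase for child in parent.children)
        if lowest == parent.phase:
            return False          # a sibling subtree is still behind
        parent.phase = lowest
        node = parent
    return True                   # reached the root: synchronization


a1, a2, a3, a4 = (Node(f"A{i}") for i in range(1, 5))
h1 = Node("h1", [a1, a2])
h3 = Node("h3", [a3, a4])
root = Node("h2", [h1, h3])

results = [signal(p) for p in (a3, a4, a1, a2)]
print(results, root.phase)
```

Only the last announcement reaches the root; earlier ones stop at the first helper whose other subtree has not arrived yet.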
Dropping Participants
[Figure sequence: participants are dropped from the tree of A1 to A4. Nodes are marked wait-only (wo) step by step, with annotations h1:R and h1:L on some frames; the remaining participants are at phase 1 (1rsmd) and the phaser's phase changes from 0 to 1: "Synchronization reached. Continue to next phase."]

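One possible reading of the figures, sketched sequentially: a dropped participant is not removed from the tree but flagged as wait-only, a helper becomes wait-only once all its children are, and flagged children are ignored when the minimum phase is computed. These rules are assumptions made for illustration; the slides do not spell them out. `Node`, `propagate`, `signal`, and `drop` are hypothetical names.

```python
class Node:
    def __init__(self, name, children=()):
        self.name = name
        self.phase = 0
        self.wait_only = False
        self.parent = None
        self.children = list(children)
        for child in self.children:
            child.parent = self


def propagate(node):
    """Recompute the aggregated state of all ancestors; return root phase."""
    while node.parent is not None:
        node = node.parent
        active = [c for c in node.children if not c.wait_only]
        node.wait_only = not active       # wait-only if nothing signals below
        if active:
            node.phase = min(c.phase for c in active)
    return node.phase


def signal(participant):
    participant.phase += 1
    return propagate(participant)


def drop(participant):
    participant.wait_only = True          # flagged, never removed
    return propagate(participant)


a1, a2, a3, a4 = (Node(f"A{i}") for i in range(1, 5))
h1 = Node("h1", [a1, a2])
h3 = Node("h3", [a3, a4])
root = Node("h2", [h1, h3])

phases = [signal(a3), signal(a4), drop(a1), drop(a2)]
print(phases, h1.wait_only, root.wait_only)
```

After A3 and A4 have signalled, dropping A1 and A2 is enough for the root to advance, because the subtree under h1 no longer contributes to the minimum.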
Adding New Participants
[Figure sequence: a new participant is added to the tree of A1 to A4 while the phaser is at phase 8 and nodes hold phase counts 8 and 9; annotations 8-1 and 8+1 appear on one frame, and the final frame is labelled "propagate phase count".]

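A sequential sketch of the propagation step, following the note that the phase count minimum is propagated up the tree. It assumes that a new participant starts at the phaser's current phase and that a new helper is spliced in above the insertion node; the waiting for racing values described in the notes is left out because nothing runs concurrently here. `Node` and `add_participant` are hypothetical names.

```python
class Node:
    def __init__(self, name, phase, children=()):
        self.name = name
        self.phase = phase
        self.parent = None
        self.children = list(children)
        for child in self.children:
            child.parent = self


def add_participant(target, name, phaser_phase):
    """Insert a new leaf next to target and propagate the minimum upward."""
    old_parent = target.parent
    leaf = Node(name, phaser_phase)       # assumed: join at the current phase
    helper = Node("h_" + name, min(target.phase, leaf.phase), [target, leaf])
    helper.parent = old_parent
    if old_parent is not None:
        position = old_parent.children.index(target)
        old_parent.children[position] = helper
    node = helper
    while node.parent is not None:
        node = node.parent
        node.phase = min(child.phase for child in node.children)
    return leaf


# A3 has already signalled phase 9; A1 and A2 are still at phase 8.
a1, a2, a3 = Node("A1", 8), Node("A2", 8), Node("A3", 9)
h1 = Node("h1", 8, [a1, a2])
root = Node("h2", 8, [h1, a3])

a4 = add_participant(a3, "A4", root.phase)
print(a4.phase, a4.parent.phase, root.phase)
```

The new helper above A3 and A4 reports phase 8 even though A3 is already at 9, so under these assumptions the root cannot advance until the new participant has signalled as well.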
Evaluation
Two-Phase Barrier Operation
Overhead: Two-Phase vs. Classic
Use as Drop-In Replacement for SPLASH-2
  • Speedup compared to TmcSpinBarrier

Summary
  • Scalable and efficient approach to Phasers
      • Documents implementation
      • Based on fully dynamic insertion tree
      • Overcomes limitations of existing approaches
      • Usable as drop-in replacement
  • Future work
      • Scalability beyond 59 cores
      • Optimization for other memory architectures
Stefan Marr, IEEE HPCC 2010, Insertion Tree Phasers


Editor's Notes

  • #5: Shirako et al.; X10: Vijay Saraswat
  • #6: Shirako + Sarkar
  • #8: So I went to the whiteboard, drew a tree, and figured out how to do it slightly differently
  • #10: How to build a tree to synchronize dynamic parallelism?
  • #19 to #31: Example tree like in paper, briefly the different properties, and that they are aggregations of the subtree
  • #32 to #35: In the general case: propagate the phase count minimum up the tree. While doing this, wait for racing values by checking that the found value is the one expected from the last visited node; if it is not, wait until it is, which means the racing activity has passed.