Insertion Tree Phasers: Efficient and Scalable Barrier Synchronization for Fine-grained Parallelism

Insertion Tree Phasers
Efficient and Scalable Barrier Synchronization for Fine-grained Parallelism
Stefan Marr
S. Verhaegen, B. De Fraine, T. D’Hondt, W. De Meuter
Software Languages Lab
Vrije Universiteit Brussel

Agenda
- Introduction
  - Barriers, Phasers
- Insertion Tree Phasers
  - Insertion Tree
  - Phaser Algorithm
- Evaluation
- Summary
9/26/10

Barriers (Introduction)
- Synchronizing parallel activities
  - High productivity: easy to get right
  - Mostly for scientific computing
- Many-core evolution
  - Synchronizing dynamic and irregular problems
  - Requires low-overhead dynamic hierarchical barriers
[Figure: threads t1, t2, t3, each passing through phases p1, p2, p3]
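The thread/phase pattern on this slide (t1 to t3, p1 to p3) can be illustrated with a classic barrier. The following is a minimal sketch using Python's standard `threading.Barrier`; it is a plain, fixed-size barrier and not a phaser, and the names `worker`, `log`, and `phases_in_order` are made up for this illustration.

```python
import threading

NUM_THREADS = 3
NUM_PHASES = 3

barrier = threading.Barrier(NUM_THREADS)
log = []                      # (phase, thread id) records, in arrival order
log_lock = threading.Lock()

def worker(tid):
    for phase in range(1, NUM_PHASES + 1):
        with log_lock:
            log.append((phase, tid))   # "work" of thread tid in this phase
        barrier.wait()                 # no thread enters phase+1 before all finished phase

threads = [threading.Thread(target=worker, args=(tid,))
           for tid in range(1, NUM_THREADS + 1)]
for t in threads:
    t.start()
for t in threads:
    t.join()

# All records of phase p precede all records of phase p+1.
phases_in_order = [phase for phase, _ in log]
```

The number of parties is fixed when the barrier is created, which is the limitation that dynamic barriers such as clocks and phasers address.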

Phasers (Introduction)
- Extension of X10 clocks
  - Clocks: dynamic two-phase barrier for fork/join parallelism
- Registration modes for barrier
  - Enables expression of producer/consumer relation
- Single statements
  - Executed only by a single thread, avoids duplicated barrier operations
[Figure: threads t1 to t3 across phases p1 to p3]

Hierarchical Phasers (Introduction)
- Shirako & Sarkar in Proc. of IEEE IPDPS 2010 [1]
- First scalable implementation strategy
  - Predefined tree structure
    - Degree, i.e., tree arity
    - Max. number of tiers, i.e., height
  - Composed from phasers
- Problematic
  - Non-dynamic structure
  - Two-phase support incomplete
  - Leaves design decisions open
[Figure: phaser tree with the phaser at Tier 0, sub-phasers (sub) at Tier 1 and Tier 2 (leaves), signal objects (sig), and participants A1 to A8; labels "Array access" and "List access"]

Open Questions with Hierarchical Phasers (Introduction)
- Dynamic tree construction, or on initialization?
  - Tradeoffs for atomic operations, overhead of joining/leaving phaser
- How are operations synchronized?
  - Tradeoffs for overheads and restrictions on parallelism
- Garbage collection problem for dropped participants
  - Keeps list of synchronization objects incl. dropped participants
- After reaching max. #participants
  - Is the tree rebalanced? (Hint at it for dropped nodes)
- Two-phase barrier support does not hide latency for original phasers

Insertion Tree Phasers

Design Goal
- Support for full generality of Phaser properties
  - Two-phase support
  - Signal-only/wait-only for producers/consumers
  - Single statement
  - Full dynamicity: fine-grained hierarchical fork/join
- Adaptation of existing, scalable approaches
  - Dissemination barrier not adaptable
  - Remaining are tree-based approaches

Insertion Tree (1/2)
- Goals
  - Stable, i.e., minimized tree modifications
  - Avoid inconsistent synchronization information
  - Maximizing parallel operations
- Solution: Insertion Tree
  - Inverted tree
  - No removal
  - Complete smallest subtree first

Insertion Tree (2/2)
[Figure sequence: the tree grows as leaves 1 to 8 are inserted one at a time, with helper nodes h1 to h7 added as inner nodes. With 5 leaves the helper nodes are h4, h2, h1, h3; with 8 leaves they are h4, h2, h6, h1, h3, h5, h7.]

Determining the Insertion Point

    def getNextInsertNode(tree):
        # start at the most recently inserted leaf
        result = tree.lastNode
        i = tree.numLeaves
        # climb one level for every factor of 2 in the leaf count
        # (assumes the tree already has at least one leaf)
        while i % 2 == 0:
            result = result.parent
            i = i // 2
        return result

    # this is for 2-ary trees
    # is adaptable for n-ary trees, too
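How the insertion point might be used to grow the tree can be sketched as follows. This is a minimal sketch, not the authors' implementation: `getNextInsertNode`, `lastNode`, and `numLeaves` come from the slide, while `Node`, `InsertionTree`, `insert`, and `shape` are hypothetical names, and the step "create a new helper node above the insertion point, with the new leaf as its second child" is an assumption read off the figure sequence for 2-ary trees.

```python
class Node:
    def __init__(self, label, parent=None):
        self.label = label
        self.parent = parent
        self.children = []

class InsertionTree:
    def __init__(self):
        self.root = None
        self.lastNode = None    # most recently inserted leaf
        self.numLeaves = 0
        self.numHelpers = 0     # only used to label helper nodes h1, h2, ...

def getNextInsertNode(tree):
    # As on the slide: climb one level per factor of 2 in the leaf count.
    result = tree.lastNode
    i = tree.numLeaves
    while i % 2 == 0:
        result = result.parent
        i //= 2
    return result

def insert(tree, label):
    """Add a leaf; existing nodes are never removed or reordered (assumption)."""
    leaf = Node(label)
    if tree.numLeaves == 0:
        tree.root = leaf
    else:
        target = getNextInsertNode(tree)
        tree.numHelpers += 1
        helper = Node("h%d" % tree.numHelpers, parent=target.parent)
        if target.parent is None:
            tree.root = helper              # the tree grows upward at the root
        else:
            siblings = target.parent.children
            siblings[siblings.index(target)] = helper
        helper.children = [target, leaf]
        target.parent = helper
        leaf.parent = helper
    tree.lastNode = leaf
    tree.numLeaves += 1
    return leaf

def shape(node):
    """Nested (label, [children]) view of a subtree, for inspection."""
    if not node.children:
        return node.label
    return (node.label, [shape(child) for child in node.children])

tree = InsertionTree()
for n in range(1, 9):
    insert(tree, n)
```

Under these assumptions, eight insertions give a complete binary tree whose helper labels match the last frame of the figure sequence (h4 at the root, h2 and h6 below it, then h1, h3, h5, h7).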

Synchronization Tree*
[Figure: synchronization tree for participants A1 to A4, all counters at 0]
- Phaser: current phase (phase: 0)
- Helper nodes: phase counter, wait-only flag (wo)
- Participant nodes: phase counter, resume flag (rsmd)
*) is simplified, leaves out registration modes

Announcing Synchronization
[Figure sequence: synchronization tree for participants A1 to A4, starting with phase: 0 and all phase counters at 0. Frame by frame, phase counters in the tree change from 0 to 1 and participant nodes show the resume flag (rsmd), until all counters are 1 and the phaser shows phase: 1.]
Synchronization reached. Continue to next phase.
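The upward propagation suggested by these frames can be sketched sequentially. This is a simplified illustration under stated assumptions, not the algorithm from the talk: it runs in a single thread, uses no atomic operations, and ignores the resume flag and registration modes. `SyncNode`, `Phaser`, and `signal` are hypothetical names.

```python
class SyncNode:
    def __init__(self, name, children=()):
        self.name = name
        self.phase = 0              # phase counter of this node
        self.parent = None
        self.children = list(children)
        for child in self.children:
            child.parent = self

class Phaser:
    def __init__(self, root):
        self.root = root
        self.phase = 0

def signal(phaser, participant):
    """Announce that `participant` finished the current phase.

    Returns True if this announcement completed the phase.
    """
    participant.phase += 1
    node = participant.parent
    # A helper advances once all of its children are ahead of it;
    # only then is the announcement passed further up.
    while node is not None and all(c.phase > node.phase for c in node.children):
        node.phase += 1
        node = node.parent
    if phaser.root.phase > phaser.phase:
        phaser.phase = phaser.root.phase    # synchronization reached
        return True
    return False

# Tree for A1..A4, shaped like the insertion tree for four leaves.
a1, a2, a3, a4 = (SyncNode(n) for n in ("A1", "A2", "A3", "A4"))
h1 = SyncNode("h1", [a1, a2])
h3 = SyncNode("h3", [a3, a4])
h2 = SyncNode("h2", [h1, h3])
phaser = Phaser(h2)

results = [signal(phaser, p) for p in (a1, a2, a3, a4)]
```

In this sketch only the last announcement reaches the root, so participants in different subtrees touch disjoint nodes for most of the phase.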

Dropping Participants
[Figure sequence: synchronization tree for participants A1 to A4 at phase: 0. Frame by frame, more nodes are marked wait-only (wo) and the remaining phase counters change to 1, until the phaser shows phase: 1; later frames carry the annotations h1:R and h1:L.]
Synchronization reached. Continue to next phase.
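One plausible reading of the wait-only flag is that a dropped node stays in the tree (the insertion tree has no removal) and is flagged so that its parent no longer waits for it. The sketch below illustrates only that idea. It is an assumption-laden, sequential illustration and not the algorithm from the talk: it puts the flag on participants as well as helpers, uses no atomic operations, and does not model the h1:R / h1:L annotations. All names (`SyncNode`, `ready`, `propagate`, `signal`, `drop`) are hypothetical.

```python
class SyncNode:
    def __init__(self, name, children=()):
        self.name = name
        self.phase = 0
        self.wait_only = False      # flagged nodes are skipped by their parent
        self.parent = None
        self.children = list(children)
        for child in self.children:
            child.parent = self

def ready(node):
    """A helper may advance once every non-flagged child is ahead of it."""
    active = [c for c in node.children if not c.wait_only]
    return bool(active) and all(c.phase > node.phase for c in active)

def propagate(node):
    while node is not None and ready(node):
        node.phase += 1
        node = node.parent

def signal(participant):
    participant.phase += 1
    propagate(participant.parent)

def drop(participant):
    participant.wait_only = True
    node = participant.parent
    # A helper whose children are all flagged becomes flagged itself.
    while node is not None and all(c.wait_only for c in node.children):
        node.wait_only = True
        node = node.parent
    # Dropping may complete the phase for participants that already signalled.
    propagate(node)

a1, a2, a3, a4 = (SyncNode(n) for n in ("A1", "A2", "A3", "A4"))
h1 = SyncNode("h1", [a1, a2])
h3 = SyncNode("h3", [a3, a4])
root = SyncNode("h2", [h1, h3])

signal(a1)
signal(a2)      # left subtree has announced phase 1
drop(a3)
drop(a4)        # right subtree is now entirely flagged
```

With both right-hand participants dropped, the root in this sketch advances as soon as the left subtree has announced, without any node being unlinked from the tree.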

Adding New Participants
[Figure sequence: new nodes are added to a synchronization tree for participants A1 to A4 while the phaser is at phase: 8 and node phase counters are at 8 and 9; annotations "8-1", "8+1", and "propagate phase count".]

Two-Phase Barrier Operation (Evaluation)

Overhead: Two-Phase vs. Classic (Evaluation)

Use as Drop-In Replacement for SPLASH-2 (Evaluation)
Speedup compared to TmcSpinBarrier

Summary
- Scalable and efficient approach to Phasers
  - Documents implementation
  - Based on fully dynamic insertion tree
  - Overcomes limitations of existing approaches
  - Usable as drop-in replacement
- Future work
  - Scalability beyond 59 cores
  - Optimization for other memory architectures
Stefan Marr, IEEE HPCC 2010, Insertion Tree Phasers
