Parallel and High Performance Computing
Burton Smith, Technical Fellow, Microsoft
7/17/2009
Agenda
- Introduction
- Definitions
- Architecture and Programming
- Examples
- Conclusions
Introduction
“Parallel and High Performance”?
- “Parallel computing is a form of computation in which many calculations are carried out simultaneously” (G.S. Almasi and A. Gottlieb, Highly Parallel Computing, Benjamin/Cummings, 1994)
- A High Performance (Super) Computer is:
  - One of the 500 fastest computers as measured by HPL, the High Performance Linpack benchmark
  - A computer that costs 200,000,000 rubles or more
  - Necessarily parallel, at least since the 1970s
Recent Developments
- For 20 years, parallel and high performance computing have been the same subject
- Parallel computing is now mainstream
  - It reaches well beyond HPC into client systems: desktops, laptops, mobile phones
- HPC software once had to stand alone
  - Now, it can be based on parallel PC software
  - The result: better tools and new possibilities
The Emergence of the Parallel Client
- Uniprocessor performance is leveling off
  - Instruction-level parallelism nears a limit (ILP Wall)
  - Power is getting painfully high (Power Wall)
  - Caches show diminishing returns (Memory Wall)
- Logic density continues to grow (Moore’s Law)
  - So uniprocessors will collapse in area and cost
  - Cores per chip need to increase exponentially
- We must all learn to write parallel programs
  - So new “killer apps” will enjoy more speed
The ILP Wall
- Instruction-level parallelism preserves the serial programming model
  - While getting speed from “undercover” parallelism
  - For example, see HPS†: out-of-order issue, in-order retirement, register renaming, branch prediction, speculation, …
- At best, we get a few instructions/clock
† Y.N. Patt et al., “Critical Issues Regarding HPS, a High Performance Microarchitecture,” Proc. 18th Ann. ACM/IEEE Int'l Symp. on Microarchitecture, 1985, pp. 109−116.
The Power Wall
- In the old days, power was kept roughly constant
  - Dynamic power, equal to CV²f, dominated
  - Every shrink of 0.7 in feature size halved transistor area
  - Capacitance C and voltage V also decreased by 0.7
  - Even with the clock frequency f increased by 1.4, power per transistor was cut in half
- Now, shrinking no longer reduces V very much
  - So even at constant frequency, power density doubles
  - Static (leakage) power is also getting worse
- Simpler, slower processors are more efficient
  - And to conserve power, we can turn some of them off
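To make the arithmetic behind this slide explicit (a worked restatement, not an additional claim from the deck): with dynamic power P = C·V²·f, a classic 0.7x shrink gives

    P' = (0.7 C) · (0.7 V)² · (1.4 f) ≈ 0.48 · C·V²·f ≈ P / 2

while transistor area also halves (0.7² ≈ 0.5), so power per unit area stays flat. Once V stops scaling, the V² term no longer shrinks: with C → 0.7C and f → 1.4f, power per transistor stays roughly constant while the area halves, so power density roughly doubles each generation.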
The Memory Wall
- We can get bigger caches from more transistors
  - Does this suffice, or is there a problem scaling up?
- To speed up 2X without changing bandwidth below the cache, the miss rate must be halved
- How much bigger does the cache have to be?†
  - For dense matrix multiply or dense LU, 4x bigger
  - For sorting or FFTs, the square of its former size
  - For sparse or dense matrix-vector multiply, impossible
- Deeper interconnects increase miss latency
  - Latency tolerance needs memory access parallelism
† H.T. Kung, “Memory requirements for balanced computer architectures,” 13th International Symposium on Computer Architecture, 1986, pp. 49−54.
Overcoming the Memory Wall
- Provide more memory bandwidth
  - Increase DRAM I/O bandwidth per gigabyte
  - Increase microprocessor off-chip bandwidth
- Use architecture to tolerate memory latency
  - More latency ⇒ more threads or longer vectors
  - No change in programming model is needed
- Use caches for bandwidth as well as latency
  - Let compilers control locality
  - Keep cache lines short
  - Avoid mis-speculation
The End of the von Neumann Model
- “Instructions are executed one at a time…”
  - We have relied on this idea for 60 years
  - Now it (and things it brought) must change
- Serial programming is easier than parallel programming, at least for the moment
  - But serial programs are now slow programs
- We need parallel programming paradigms that will make all programmers successful
- The stakes for our field’s vitality are high
- Computing must be reinvented
Definitions
Asymptotic Notation
- Quantities are often meaningful only within a constant factor
  - Algorithm performance analyses, for example
- f(n) = O(g(n)) means there exist constants c and n0 such that n ≥ n0 implies |f(n)| ≤ |cg(n)|
- f(n) = Ω(g(n)) means there exist constants c and n0 such that n ≥ n0 implies |f(n)| ≥ |cg(n)|
- f(n) = Θ(g(n)) means both f(n) = O(g(n)) and f(n) = Ω(g(n))
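A quick illustrative instance of these definitions (added as an example, not from the original deck):

    f(n) = 3n² + 5n = O(n²):  take c = 4, n0 = 5; for n ≥ 5, 3n² + 5n ≤ 4n²
    f(n) = Ω(n²):             take c = 3, n0 = 1; for all n ≥ 1, 3n² + 5n ≥ 3n²
    hence f(n) = Θ(n²)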
Speedup, Time, and Work
- The speedup of a computation is how much faster it runs in parallel compared to serially
  - If one processor takes T1 and p of them take Tp, then the p-processor speedup is Sp = T1/Tp
- The work done is the number of operations performed, either serially or in parallel
  - W1 = O(T1) is the serial work, Wp the parallel work
- We say a parallel computation is work-optimal if Wp = O(W1) = O(T1)
- We say a parallel computation is time-optimal if Tp = O(W1/p) = O(T1/p)
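For a concrete (purely illustrative) set of numbers: if one processor takes T1 = 120 s and p = 8 processors take Tp = 20 s, the speedup is Sp = 120/20 = 6, against a best possible (linear) speedup of 8.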
Latency, Bandwidth, & Concurrency
- In any system that moves items from input to output without creating or destroying them,
    latency × bandwidth = concurrency
- Queueing theory calls this result Little’s law
[Slide diagram: a pipeline with concurrency = 6, bandwidth = 2, latency = 3]
Architecture and Programming
Parallel Processor Architecture
- SIMD: each instruction operates concurrently on multiple data items
- MIMD: multiple instruction sequences execute concurrently
- Concurrency is expressible in space or time
  - Spatial: the hardware is replicated
  - Temporal: the hardware is pipelined
Trends in Parallel Processors
- Today’s chips are spatial MIMD at the top level
  - To get enough performance, even in PCs
- Temporal MIMD is also used
- SIMD is tending back toward spatial
- Intel’s Larrabee combines all three
- Temporal concurrency is easily “adjusted”
  - Vector length or number of hardware contexts
- Temporal concurrency tolerates latency
  - Memory latency in the SIMD case
  - For MIMD, branches and synchronization also
Parallel Memory Architecture
- A shared memory system is one in which any processor can address any memory location
  - Quality of access can be either uniform (UMA) or nonuniform (NUMA), in latency and/or bandwidth
- A distributed memory system is one in which processors can’t address most of memory
  - The disjoint memory regions and their associated processors are usually called nodes
- A cluster is a distributed memory system with more than one processor per node
  - Nearly all HPC systems are clusters
Parallel Programming Variations
- Data Parallelism and Task Parallelism
- Functional Style and Imperative Style
- Shared Memory and Message Passing
- …and more we won’t have time to look at
- A parallel application may use all of them
Data Parallelism and Task Parallelism
- A computation is data parallel when similar independent sub-computations are done simultaneously on multiple data items
  - Applying the same function to every element of a data sequence, for example
- A computation is task parallel when dissimilar independent sub-computations are done simultaneously
  - Controlling the motions of a robot, for example
- It sounds like SIMD vs. MIMD, but isn’t quite
  - Some kinds of data parallelism need MIMD
Functional and Imperative Programs
- A program is said to be written in (pure) functional style if it has no mutable state
  - Computing = naming and evaluating expressions
- Programs with mutable state are usually called imperative because the state changes must be done when and where specified:
    while (z < x) { x = y; y = z; z = f(x, y); } return y;
- Often, programs can be written either way:
    let w(x, y, z) = if (z < x) then w(y, z, f(x, y)) else y;
Shared Memory and Message Passing
- Shared memory programs access data in a shared address space
  - When to access the data is the big issue
  - Subcomputations therefore must synchronize
- Message passing programs transmit data between subcomputations
  - The sender computes a value and then sends it
  - The receiver receives a value and then uses it
  - Synchronization can be built in to communication
- Message passing can be implemented very well on shared memory architectures
Barrier Synchronization
- A barrier synchronizes multiple parallel sub-computations by letting none proceed until all have arrived
  - It is named after the barrier used to start horse races
- It guarantees everything before the barrier finishes before anything after it begins
- It is a central feature in several data-parallel languages such as OpenMP
Mutual Exclusion
- This type of synchronization ensures only one subcomputation can do a thing at any time
  - If the thing is a code block, it is a critical section
- It classically uses a lock: a data structure with which subcomputations can stop and start
- Basic operations on a lock object L might be
  - Acquire(L): blocks until other subcomputations are finished with L, then acquires exclusive ownership
  - Release(L): yields L and unblocks some Acquire(L)
- A lot has been written on these subjects
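A minimal sketch of the Acquire/Release pattern, using POSIX pthread mutexes as a stand-in for the abstract lock L (the counter and function name are illustrative; the deck itself names no library):

#include <pthread.h>

static pthread_mutex_t L = PTHREAD_MUTEX_INITIALIZER;
static long counter = 0;              /* shared mutable state */

void add_to_counter(long x) {
    pthread_mutex_lock(&L);           /* Acquire(L): blocks until we own L */
    counter += x;                     /* critical section: one thread at a time */
    pthread_mutex_unlock(&L);         /* Release(L): unblocks some waiting Acquire */
}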
Non-Blocking Synchronization
- The basic idea is to achieve mutual exclusion using memory read-modify-write operations
- Most commonly used is compare-and-swap: CAS(addr, old, new) reads memory at addr and, if it contains old, replaces it with new
- Arbitrary update operations at an addr require that {read old; compute new; CAS(addr, old, new);} be repeated until the CAS operation succeeds
- If there is significant updating contention at addr, the repeated computation of new may be wasteful
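A minimal sketch of that retry loop in C11 atomics, where atomic_compare_exchange_weak plays the role of CAS(addr, old, new); the update function f and the function name are illustrative assumptions:

#include <stdatomic.h>

/* Apply an arbitrary update x -> f(x) to *addr without taking a lock. */
long atomic_update(_Atomic long *addr, long (*f)(long)) {
    long old = atomic_load(addr);
    long new_val;
    do {
        new_val = f(old);             /* compute new from the value just read */
        /* on failure, old is refreshed with the current contents, so we retry */
    } while (!atomic_compare_exchange_weak(addr, &old, new_val));
    return new_val;
}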
Load Balancing
- Some processors may be busier than others
  - To balance the workload, subcomputations can be scheduled on processors dynamically
- A technique for parallel loops is self-scheduling: processors repetitively grab chunks of iterations
  - In guided self-scheduling, the chunk sizes shrink
- Analogous imbalances can occur in memory
  - Overloaded memory locations are called hot spots
  - Parallel algorithms and data structures must be designed to avoid them
- Imbalanced messaging is sometimes seen
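A rough sketch of chunked self-scheduling (illustrative only: the fixed CHUNK size, the work() callback, and the use of a C11 atomic counter are assumptions, not from the deck):

#include <stdatomic.h>

#define CHUNK 64
static _Atomic long next_iter = 0;        /* shared loop counter */

/* Each worker thread repeatedly grabs the next CHUNK iterations. */
void worker(long n, void (*work)(long)) {
    for (;;) {
        long start = atomic_fetch_add(&next_iter, CHUNK);
        if (start >= n) break;            /* no iterations left */
        long end = (start + CHUNK < n) ? start + CHUNK : n;
        for (long i = start; i < end; i++)
            work(i);                      /* one loop iteration */
    }
}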
Examples
A Data Parallel Example: Sorting

void sort(int *src, int *dst, int size, int nvals) {
  int i, j, t1[nvals], t2[nvals];
  for (j = 0; j < nvals; j++) { t1[j] = 0; }
  for (i = 0; i < size; i++) { t1[src[i]]++; }
  // t1[] now contains a histogram of the values
  t2[0] = 0;
  for (j = 1; j < nvals; j++) { t2[j] = t2[j-1] + t1[j-1]; }
  // t2[j] now contains the origin for value j
  for (i = 0; i < size; i++) { dst[t2[src[i]]++] = src[i]; }
}
When Is a Loop Parallelizable?
- The loop instances must safely interleave
  - A way to do this is to only read the data
  - Another way is to isolate data accesses
- Look at the first loop:
    for (j = 0; j < nvals; j++) { t1[j] = 0; }
  - The accesses to t1[] are isolated from each other
  - This loop can run in parallel “as is”
Isolating Data Updates
- The second loop seems to have a problem:
    for (i = 0; i < size; i++) { t1[src[i]]++; }
  - Two iterations may access the same t1[src[i]]
  - If both reads precede both increments, oops!
- A few ways to isolate the iteration conflicts:
  - Use an “isolated update” (lock prefix) instruction
  - Use an array of locks, perhaps as big as t1[]
  - Use non-blocking updates
  - Use a transaction
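Two of these options sketched in plain C (the function names, the per-bin lock granularity, and the use of POSIX mutexes and C11 atomics are assumptions, not the deck's own code; each function would be run concurrently by several threads over disjoint slices of src):

#include <pthread.h>
#include <stdatomic.h>

/* Option "array of locks": one mutex per histogram bin guards its counter. */
void histogram_locked(const int *src, int size, int *t1, pthread_mutex_t *lock) {
    for (int i = 0; i < size; i++) {
        pthread_mutex_lock(&lock[src[i]]);
        t1[src[i]]++;
        pthread_mutex_unlock(&lock[src[i]]);
    }
}

/* Option "non-blocking updates": each increment is a single atomic read-modify-write. */
void histogram_atomic(const int *src, int size, _Atomic int *t1) {
    for (int i = 0; i < size; i++)
        atomic_fetch_add(&t1[src[i]], 1);
}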
Dependent Loop Iterations
- The 3rd loop is an interesting challenge:
    for (j = 1; j < nvals; j++) { t2[j] = t2[j-1] + t1[j-1]; }
  - Each iteration depends on the previous one
- This loop is an example of a prefix computation
  - If • is an associative binary operation on a set S, the •-prefixes of the sequence x0, x1, x2, … of values from S are x0, x0•x1, x0•x1•x2, …
- Prefix computations are often known as scans
- Scan can be done efficiently in parallel
Cyclic Reduction
- Each vertical line represents a loop iteration
  - The associated sequence element is to its right
- On step k of the scan, iteration j prefixes its own value with the value from iteration j − 2^k
[Slide diagram, reconstructed as a table of per-iteration values:]
    start:        a   b    c    d     e      f       g
    after step 0: a   ab   bc   cd    de     ef      fg
    after step 1: a   ab   abc  abcd  bcde   cdef    defg
    after step 2: a   ab   abc  abcd  abcde  abcdef  abcdefg
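A serial-looking C sketch of the doubling pattern the diagram shows (illustrative; in a real parallel version the inner loop's iterations run concurrently with a barrier between steps, as the OpenMP slide later in the deck does). This naive form performs roughly n·log2(n) additions; the blocked scheme on the "More Iterations n Than Processors p" slide gets that down to about 3n.

#include <stdlib.h>

/* In-place inclusive scan of x[0..n-1] by the doubling ("cyclic reduction") pattern. */
void scan_doubling(long *x, int n) {
    long *tmp = malloc(n * sizeof *x);
    for (int stride = 1; stride < n; stride *= 2) {    /* step k: stride = 2^k */
        for (int j = 0; j < n; j++)                    /* these iterations are independent */
            tmp[j] = (j >= stride) ? x[j - stride] + x[j] : x[j];
        for (int j = 0; j < n; j++)                    /* copy back; in parallel, a barrier goes here */
            x[j] = tmp[j];
    }
    free(tmp);
}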
Applications of Scan
- Linear recurrences like the third loop
- Polynomial evaluation
- String comparison
- High-precision addition
- Finite automata
  - Each xi is the next-state function given the ith input symbol, and • is function composition
- APL compress
- When only the final value is needed, the computation is called a reduction instead
  - It’s a little bit cheaper than a full scan
More Iterations n Than Processors p
- Wp = 3n + O(p log p),  Tp = 3n/p + O(log p)
  - (Each element is handled about three times: a local reduction per block, a scan across the p block totals, and a local scan that adds in each block’s offset, as in the OpenMP version two slides below.)
OpenMP
- OpenMP is a widely-implemented extension to C++ and Fortran for data† parallelism
- It adds directives to serial programs
- A few of the more important directives:
    #pragma omp parallel for <modifiers>
    <for loop>
    #pragma omp atomic
    <binary op=, ++ or -- statement>
    #pragma omp critical <name>
    <structured block>
    #pragma omp barrier
† And perhaps task parallelism soon
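For instance, the first two loops of the sorting example can be annotated with these directives as follows (a sketch, not the deck's own code; the function name is an assumption, and the scan loop is handled on the next slide):

#include <omp.h>

void histogram(int *src, int *t1, int size, int nvals) {
    #pragma omp parallel for
    for (int j = 0; j < nvals; j++)
        t1[j] = 0;                    /* independent iterations: parallel "as is" */

    #pragma omp parallel for
    for (int i = 0; i < size; i++) {
        #pragma omp atomic            /* isolated update of the shared histogram */
        t1[src[i]]++;
    }
}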
The Sorting Example in OpenMP
- Only the third “scan” loop is a problem
- We can at least do this loop “manually”:

nt = omp_get_num_threads();
int ta[nt], tb[nt];
#pragma omp parallel for
for (myt = 0; myt < nt; myt++) {
  // Set ta[myt] = local sum of nvals/nt elements of t1[]
  #pragma omp barrier
  for (k = 1; k <= myt; k *= 2) {
    tb[myt] = ta[myt];
    ta[myt] += tb[myt - k];
    #pragma omp barrier
  }
  fix = (myt > 0) ? ta[myt - 1] : 0;
  // Set nvals/nt elements of t2[] to fix + local scan of t1[]
}
Parallel Patterns Library (PPL)
- PPL is a Microsoft C++ library built on top of the ConcRT user-mode scheduling runtime
- It supports mixed data- and task-parallelism:
  - parallel_for, parallel_for_each, parallel_invoke
  - agent, send, receive, choice, join, task_group
- Parallel loops use C++ lambda expressions:
    parallel_for(1, nvals, [&t1](int j) { t1[j] = 0; });
- Updates can be isolated using intrinsic functions:
    (void)_InterlockedIncrement(&t1[src[i]]);
- Microsoft and Intel plan to unify PPL and TBB
Dynamic Resource Management
- PPL programs are written for an arbitrary number of processors, which could be just one
- Load balancing is mostly done by work stealing
- There are two kinds of work to steal:
  - Work that is unblocked and waiting for a processor
  - Work that is not yet started and is potentially parallel
- Work of the latter kind will be done serially unless it is first stolen by another processor
  - This makes recursive divide and conquer easy
  - There is no concern about when to stop parallelism
A Quicksort Example

void quicksort(vector<int>::iterator first,
               vector<int>::iterator last) {
    if (last - first < 2) { return; }
    int pivot = *first;
    auto mid1 = partition(first, last,
                          [=](int e) { return e < pivot; });
    auto mid2 = partition(mid1, last,
                          [=](int e) { return e == pivot; });
    parallel_invoke(
        [=] { quicksort(first, mid1); },
        [=] { quicksort(mid2, last); }
    );
}
LINQ and PLINQ
- LINQ (Language Integrated Query) extends the .NET languages C#, Visual Basic, and F#
  - A LINQ query is really just a functional monad
  - It queries databases, XML, or any IEnumerable
- PLINQ is a parallel implementation of LINQ
  - Non-isolated functions must be avoided
  - Otherwise it is hard to tell the two apart
A LINQ Example

var q = from n in names
        where n.Name == queryInfo.Name &&
              n.State == queryInfo.State &&
              n.Year >= yearStart && n.Year <= yearEnd
        orderby n.Year ascending
        select n;

(For PLINQ, the data source becomes names.AsParallel().)
Message Passing Interface (MPI)
- MPI is a widely used message passing library for distributed memory HPC systems
- Some of its basic functions:
    MPI_Init, MPI_Comm_rank, MPI_Comm_size, MPI_Send, MPI_Recv
- A few of its “collective communication” functions:
    MPI_Reduce, MPI_Allreduce, MPI_Scan, MPI_Exscan, MPI_Barrier, MPI_Gather, MPI_Allgather, MPI_Alltoall
Sorting in MPI
- Roughly, it could work like this on n nodes:
  - Run the first two loops locally
  - Use MPI_Allreduce to build a global histogram
  - Run the third loop (redundantly) at every node
  - Allocate n value intervals to nodes (redundantly), balancing the data per node as well as possible
  - Run the fourth loop using the local histogram
  - Use MPI_Alltoall to redistribute the data
  - Merge the n sorted subarrays on each node
- Collective communication is expensive
  - But sorting needs it (see the Memory Wall slide)
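A sketch of just the global-histogram step (illustrative; the function name, buffer names, and the choice of MPI_INT are assumptions, and error handling is omitted):

#include <mpi.h>
#include <stdlib.h>

/* Combine per-node histograms into a global histogram that every node receives. */
void global_histogram(const int *src, int size, int nvals, int *global_t1) {
    int *local_t1 = calloc(nvals, sizeof(int));
    for (int i = 0; i < size; i++)        /* the second loop of the serial sort, run locally */
        local_t1[src[i]]++;
    /* Element-wise sum across all nodes; every node gets the result. */
    MPI_Allreduce(local_t1, global_t1, nvals, MPI_INT, MPI_SUM, MPI_COMM_WORLD);
    free(local_t1);
}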
Another Way to Sort in MPI
- The Samplesort algorithm is like Quicksort
- It works like this on n nodes:
  - Sort the local data on each node independently
  - Take s samples of the sorted data on each node
  - Use MPI_Allgather to send all nodes all samples
  - Compute n − 1 splitters (redundantly) on all nodes, balancing the data per node as well as possible
  - Use MPI_Alltoall to redistribute the data
  - Merge the n sorted subarrays on each node
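A sketch of the sampling and splitter-selection steps (illustrative only; the regular-sampling scheme, the qsort comparator, and all names are assumptions rather than the deck's own code):

#include <mpi.h>
#include <stdlib.h>

static int cmp_int(const void *a, const void *b) {
    int x = *(const int *)a, y = *(const int *)b;
    return (x > y) - (x < y);
}

/* "local" is this node's already-sorted data. Every node computes the same
   nnodes-1 splitters, redundantly, from all nodes' samples.                */
void choose_splitters(const int *local, int size, int s, int *splitters) {
    int nnodes;
    MPI_Comm_size(MPI_COMM_WORLD, &nnodes);
    int *samples = malloc(s * sizeof(int));
    int *all     = malloc(nnodes * s * sizeof(int));
    for (int i = 0; i < s; i++)                     /* s evenly spaced local samples */
        samples[i] = local[(long long)(i + 1) * size / (s + 1)];
    MPI_Allgather(samples, s, MPI_INT, all, s, MPI_INT, MPI_COMM_WORLD);
    qsort(all, nnodes * s, sizeof(int), cmp_int);   /* sort the gathered samples */
    for (int k = 1; k < nnodes; k++)                /* every s-th sample becomes a splitter */
        splitters[k - 1] = all[k * s];
    free(samples); free(all);
}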
Conclusions
Parallel Computing Has Arrived
- We must rethink how we write programs
  - And we are definitely doing that
- Other things will also need to change
  - Architecture
  - Operating systems
  - Algorithms
  - Theory
  - Application software
- We are seeing the biggest revolution in computing since its very beginnings
