Parallel and High Performance Computing
Burton Smith, Technical Fellow, Microsoft
7/17/2009
Agenda
- Introduction
- Definitions
- Architecture and Programming
- Examples
- Conclusions
Introduction
“Parallel and High Performance”?
- “Parallel computing is a form of computation in which many calculations are carried out simultaneously” (G.S. Almasi and A. Gottlieb, Highly Parallel Computing, Benjamin/Cummings, 1994)
- A High Performance (Super) Computer is:
  - One of the 500 fastest computers as measured by HPL, the High Performance Linpack benchmark
  - A computer that costs 200,000,000 rubles or more
  - Necessarily parallel, at least since the 1970s
Recent Developments
- For 20 years, parallel and high performance computing have been the same subject
- Parallel computing is now mainstream
  - It reaches well beyond HPC into client systems: desktops, laptops, mobile phones
- HPC software once had to stand alone
  - Now, it can be based on parallel PC software
  - The result: better tools and new possibilities
The Emergence of the Parallel Client
- Uniprocessor performance is leveling off
  - Instruction-level parallelism nears a limit (ILP Wall)
  - Power is getting painfully high (Power Wall)
  - Caches show diminishing returns (Memory Wall)
- Logic density continues to grow (Moore’s Law)
  - So uniprocessors will collapse in area and cost
  - Cores per chip need to increase exponentially
- We must all learn to write parallel programs
  - So new “killer apps” will enjoy more speed
The ILP Wall
- Instruction-level parallelism preserves the serial programming model
  - While getting speed from “undercover” parallelism
  - For example, see HPS†: out-of-order issue, in-order retirement, register renaming, branch prediction, speculation, …
- At best, we get a few instructions/clock
† Y.N. Patt et al., “Critical Issues Regarding HPS, a High Performance Microarchitecture,” Proc. 18th Ann. ACM/IEEE Int'l Symp. on Microarchitecture, 1985, pp. 109−116.
The Power Wall
- In the old days, power was kept roughly constant
  - Dynamic power, equal to CV²f, dominated
  - Every shrink of 0.7 in feature size halved transistor area
  - Capacitance C and voltage V also decreased by 0.7
  - Even with the clock frequency f increased by 1.4, power per transistor was cut in half
- Now, shrinking no longer reduces V very much
  - So even at constant frequency, power density doubles
  - Static (leakage) power is also getting worse
- Simpler, slower processors are more efficient
  - And to conserve power, we can turn some of them off
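To make the arithmetic behind this slide explicit (a worked restatement, not an additional claim from the deck): with dynamic power P = C·V²·f, a classic 0.7x shrink gives

    P' = (0.7 C) · (0.7 V)² · (1.4 f) ≈ 0.48 · C·V²·f ≈ P / 2

while transistor area also halves (0.7² ≈ 0.5), so power per unit area stays flat. Once V stops scaling, the V² term no longer shrinks: with C → 0.7C and f → 1.4f, power per transistor stays roughly constant while the area halves, so power density roughly doubles each generation.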
The Memory Wall
- We can get bigger caches from more transistors
  - Does this suffice, or is there a problem scaling up?
- To speed up 2X without changing bandwidth below the cache, the miss rate must be halved
- How much bigger does the cache have to be?†
  - For dense matrix multiply or dense LU, 4x bigger
  - For sorting or FFTs, the square of its former size
  - For sparse or dense matrix-vector multiply, impossible
- Deeper interconnects increase miss latency
  - Latency tolerance needs memory access parallelism
† H.T. Kung, “Memory requirements for balanced computer architectures,” 13th International Symposium on Computer Architecture, 1986, pp. 49−54.
Overcoming the Memory Wall
- Provide more memory bandwidth
  - Increase DRAM I/O bandwidth per gigabyte
  - Increase microprocessor off-chip bandwidth
- Use architecture to tolerate memory latency
  - More latency ⇒ more threads or longer vectors
  - No change in programming model is needed
- Use caches for bandwidth as well as latency
  - Let compilers control locality
  - Keep cache lines short
  - Avoid mis-speculation
The End of the von Neumann Model
- “Instructions are executed one at a time…”
  - We have relied on this idea for 60 years
  - Now it (and things it brought) must change
- Serial programming is easier than parallel programming, at least for the moment
  - But serial programs are now slow programs
- We need parallel programming paradigms that will make all programmers successful
- The stakes for our field’s vitality are high
- Computing must be reinvented
Definitions
Asymptotic Notation
- Quantities are often meaningful only within a constant factor
  - Algorithm performance analyses, for example
- f(n) = O(g(n)) means there exist constants c and n0 such that n ≥ n0 implies |f(n)| ≤ |cg(n)|
- f(n) = Ω(g(n)) means there exist constants c and n0 such that n ≥ n0 implies |f(n)| ≥ |cg(n)|
- f(n) = Θ(g(n)) means both f(n) = O(g(n)) and f(n) = Ω(g(n))
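A quick illustrative instance of these definitions (added as an example, not from the original deck):

    f(n) = 3n² + 5n = O(n²):  take c = 4, n0 = 5; for n ≥ 5, 3n² + 5n ≤ 4n²
    f(n) = Ω(n²):             take c = 3, n0 = 1; for all n ≥ 1, 3n² + 5n ≥ 3n²
    hence f(n) = Θ(n²)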
Speedup, Time, and Work
- The speedup of a computation is how much faster it runs in parallel compared to serially
  - If one processor takes T1 and p of them take Tp, then the p-processor speedup is Sp = T1/Tp
- The work done is the number of operations performed, either serially or in parallel
  - W1 = O(T1) is the serial work, Wp the parallel work
- We say a parallel computation is work-optimal if Wp = O(W1) = O(T1)
- We say a parallel computation is time-optimal if Tp = O(W1/p) = O(T1/p)
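For a concrete (purely illustrative) set of numbers: if one processor takes T1 = 120 s and p = 8 processors take Tp = 20 s, the speedup is Sp = 120/20 = 6, against a best possible (linear) speedup of 8.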
Latency, Bandwidth, & Concurrency
- In any system that moves items from input to output without creating or destroying them,
    latency × bandwidth = concurrency
- Queueing theory calls this result Little’s law
[Slide diagram: a pipeline with concurrency = 6, bandwidth = 2, latency = 3]
Architecture and Programming
Parallel Processor Architecture
- SIMD: each instruction operates concurrently on multiple data items
- MIMD: multiple instruction sequences execute concurrently
- Concurrency is expressible in space or time
  - Spatial: the hardware is replicated
  - Temporal: the hardware is pipelined
Trends in Parallel Processors
- Today’s chips are spatial MIMD at the top level
  - To get enough performance, even in PCs
- Temporal MIMD is also used
- SIMD is tending back toward spatial
- Intel’s Larrabee combines all three
- Temporal concurrency is easily “adjusted”
  - Vector length or number of hardware contexts
- Temporal concurrency tolerates latency
  - Memory latency in the SIMD case
  - For MIMD, branches and synchronization also
Parallel Memory Architecture
- A shared memory system is one in which any processor can address any memory location
  - Quality of access can be either uniform (UMA) or nonuniform (NUMA), in latency and/or bandwidth
- A distributed memory system is one in which processors can’t address most of memory
  - The disjoint memory regions and their associated processors are usually called nodes
- A cluster is a distributed memory system with more than one processor per node
  - Nearly all HPC systems are clusters
Parallel Programming Variations
- Data Parallelism and Task Parallelism
- Functional Style and Imperative Style
- Shared Memory and Message Passing
- …and more we won’t have time to look at
- A parallel application may use all of them
Data Parallelism and Task Parallelism
- A computation is data parallel when similar independent sub-computations are done simultaneously on multiple data items
  - Applying the same function to every element of a data sequence, for example
- A computation is task parallel when dissimilar independent sub-computations are done simultaneously
  - Controlling the motions of a robot, for example
- It sounds like SIMD vs. MIMD, but isn’t quite
  - Some kinds of data parallelism need MIMD
Functional and Imperative Programs
- A program is said to be written in (pure) functional style if it has no mutable state
  - Computing = naming and evaluating expressions
- Programs with mutable state are usually called imperative because the state changes must be done when and where specified:
    while (z < x) { x = y; y = z; z = f(x, y); } return y;
- Often, programs can be written either way:
    let w(x, y, z) = if (z < x) then w(y, z, f(x, y)) else y;
Shared Memory and Message Passing
- Shared memory programs access data in a shared address space
  - When to access the data is the big issue
  - Subcomputations therefore must synchronize
- Message passing programs transmit data between subcomputations
  - The sender computes a value and then sends it
  - The receiver receives a value and then uses it
  - Synchronization can be built in to communication
- Message passing can be implemented very well on shared memory architectures
Barrier Synchronization
- A barrier synchronizes multiple parallel sub-computations by letting none proceed until all have arrived
  - It is named after the barrier used to start horse races
- It guarantees everything before the barrier finishes before anything after it begins
- It is a central feature in several data-parallel languages such as OpenMP
Mutual Exclusion
- This type of synchronization ensures only one subcomputation can do a thing at any time
  - If the thing is a code block, it is a critical section
- It classically uses a lock: a data structure with which subcomputations can stop and start
- Basic operations on a lock object L might be
  - Acquire(L): blocks until other subcomputations are finished with L, then acquires exclusive ownership
  - Release(L): yields L and unblocks some Acquire(L)
- A lot has been written on these subjects
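A minimal sketch of the Acquire/Release pattern, using POSIX pthread mutexes as a stand-in for the abstract lock L (the counter and function name are illustrative; the deck itself names no library):

#include <pthread.h>

static pthread_mutex_t L = PTHREAD_MUTEX_INITIALIZER;
static long counter = 0;              /* shared mutable state */

void add_to_counter(long x) {
    pthread_mutex_lock(&L);           /* Acquire(L): blocks until we own L */
    counter += x;                     /* critical section: one thread at a time */
    pthread_mutex_unlock(&L);         /* Release(L): unblocks some waiting Acquire */
}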
Non-Blocking Synchronization
- The basic idea is to achieve mutual exclusion using memory read-modify-write operations
- Most commonly used is compare-and-swap: CAS(addr, old, new) reads memory at addr and, if it contains old, replaces it with new
- Arbitrary update operations at an addr require that {read old; compute new; CAS(addr, old, new);} be repeated until the CAS operation succeeds
- If there is significant updating contention at addr, the repeated computation of new may be wasteful
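A minimal sketch of that retry loop in C11 atomics, where atomic_compare_exchange_weak plays the role of CAS(addr, old, new); the update function f and the function name are illustrative assumptions:

#include <stdatomic.h>

/* Apply an arbitrary update x -> f(x) to *addr without taking a lock. */
long atomic_update(_Atomic long *addr, long (*f)(long)) {
    long old = atomic_load(addr);
    long new_val;
    do {
        new_val = f(old);             /* compute new from the value just read */
        /* on failure, old is refreshed with the current contents, so we retry */
    } while (!atomic_compare_exchange_weak(addr, &old, new_val));
    return new_val;
}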
Load Balancing
- Some processors may be busier than others
  - To balance the workload, subcomputations can be scheduled on processors dynamically
- A technique for parallel loops is self-scheduling: processors repetitively grab chunks of iterations
  - In guided self-scheduling, the chunk sizes shrink
- Analogous imbalances can occur in memory
  - Overloaded memory locations are called hot spots
  - Parallel algorithms and data structures must be designed to avoid them
- Imbalanced messaging is sometimes seen
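A rough sketch of chunked self-scheduling (illustrative only: the fixed CHUNK size, the work() callback, and the use of a C11 atomic counter are assumptions, not from the deck):

#include <stdatomic.h>

#define CHUNK 64
static _Atomic long next_iter = 0;        /* shared loop counter */

/* Each worker thread repeatedly grabs the next CHUNK iterations. */
void worker(long n, void (*work)(long)) {
    for (;;) {
        long start = atomic_fetch_add(&next_iter, CHUNK);
        if (start >= n) break;            /* no iterations left */
        long end = (start + CHUNK < n) ? start + CHUNK : n;
        for (long i = start; i < end; i++)
            work(i);                      /* one loop iteration */
    }
}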
Examples
A Data Parallel Example: Sorting

void sort(int *src, int *dst, int size, int nvals) {
  int i, j, t1[nvals], t2[nvals];
  for (j = 0; j < nvals; j++) { t1[j] = 0; }
  for (i = 0; i < size; i++) { t1[src[i]]++; }
  // t1[] now contains a histogram of the values
  t2[0] = 0;
  for (j = 1; j < nvals; j++) { t2[j] = t2[j-1] + t1[j-1]; }
  // t2[j] now contains the origin for value j
  for (i = 0; i < size; i++) { dst[t2[src[i]]++] = src[i]; }
}
When Is a Loop Parallelizable?
- The loop instances must safely interleave
  - A way to do this is to only read the data
  - Another way is to isolate data accesses
- Look at the first loop:
    for (j = 0; j < nvals; j++) { t1[j] = 0; }
  - The accesses to t1[] are isolated from each other
  - This loop can run in parallel “as is”
Isolating Data Updates
- The second loop seems to have a problem:
    for (i = 0; i < size; i++) { t1[src[i]]++; }
  - Two iterations may access the same t1[src[i]]
  - If both reads precede both increments, oops!
- A few ways to isolate the iteration conflicts:
  - Use an “isolated update” (lock prefix) instruction
  - Use an array of locks, perhaps as big as t1[]
  - Use non-blocking updates
  - Use a transaction
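Two of these options sketched in plain C (the function names, the per-bin lock granularity, and the use of POSIX mutexes and C11 atomics are assumptions, not the deck's own code; each function would be run concurrently by several threads over disjoint slices of src):

#include <pthread.h>
#include <stdatomic.h>

/* Option "array of locks": one mutex per histogram bin guards its counter. */
void histogram_locked(const int *src, int size, int *t1, pthread_mutex_t *lock) {
    for (int i = 0; i < size; i++) {
        pthread_mutex_lock(&lock[src[i]]);
        t1[src[i]]++;
        pthread_mutex_unlock(&lock[src[i]]);
    }
}

/* Option "non-blocking updates": each increment is a single atomic read-modify-write. */
void histogram_atomic(const int *src, int size, _Atomic int *t1) {
    for (int i = 0; i < size; i++)
        atomic_fetch_add(&t1[src[i]], 1);
}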
Dependent Loop Iterations
- The 3rd loop is an interesting challenge:
    for (j = 1; j < nvals; j++) { t2[j] = t2[j-1] + t1[j-1]; }
  - Each iteration depends on the previous one
- This loop is an example of a prefix computation
  - If • is an associative binary operation on a set S, the •-prefixes of the sequence x0, x1, x2, … of values from S are x0, x0•x1, x0•x1•x2, …
- Prefix computations are often known as scans
- Scan can be done efficiently in parallel
Cyclic Reduction
- Each vertical line represents a loop iteration
  - The associated sequence element is to its right
- On step k of the scan, iteration j prefixes its own value with the value from iteration j − 2^k
[Slide diagram, reconstructed as a table of per-iteration values:]
    start:        a   b    c    d     e      f       g
    after step 0: a   ab   bc   cd    de     ef      fg
    after step 1: a   ab   abc  abcd  bcde   cdef    defg
    after step 2: a   ab   abc  abcd  abcde  abcdef  abcdefg
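A serial-looking C sketch of the doubling pattern the diagram shows (illustrative; in a real parallel version the inner loop's iterations run concurrently with a barrier between steps, as the OpenMP slide later in the deck does). This naive form performs roughly n·log2(n) additions; the blocked scheme on the "More Iterations n Than Processors p" slide gets that down to about 3n.

#include <stdlib.h>

/* In-place inclusive scan of x[0..n-1] by the doubling ("cyclic reduction") pattern. */
void scan_doubling(long *x, int n) {
    long *tmp = malloc(n * sizeof *x);
    for (int stride = 1; stride < n; stride *= 2) {    /* step k: stride = 2^k */
        for (int j = 0; j < n; j++)                    /* these iterations are independent */
            tmp[j] = (j >= stride) ? x[j - stride] + x[j] : x[j];
        for (int j = 0; j < n; j++)                    /* copy back; in parallel, a barrier goes here */
            x[j] = tmp[j];
    }
    free(tmp);
}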
Applications of Scan
- Linear recurrences like the third loop
- Polynomial evaluation
- String comparison
- High-precision addition
- Finite automata
  - Each xi is the next-state function given the ith input symbol, and • is function composition
- APL compress
- When only the final value is needed, the computation is called a reduction instead
  - It’s a little bit cheaper than a full scan
More Iterations n Than Processors p
- Wp = 3n + O(p log p),  Tp = 3n/p + O(log p)
  - (Each element is handled about three times: a local reduction per block, a scan across the p block totals, and a local scan that adds in each block’s offset, as in the OpenMP version two slides below.)
OpenMP
- OpenMP is a widely-implemented extension to C++ and Fortran for data† parallelism
- It adds directives to serial programs
- A few of the more important directives:
    #pragma omp parallel for <modifiers>
    <for loop>
    #pragma omp atomic
    <binary op=, ++ or -- statement>
    #pragma omp critical <name>
    <structured block>
    #pragma omp barrier
† And perhaps task parallelism soon
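For instance, the first two loops of the sorting example can be annotated with these directives as follows (a sketch, not the deck's own code; the function name is an assumption, and the scan loop is handled on the next slide):

#include <omp.h>

void histogram(int *src, int *t1, int size, int nvals) {
    #pragma omp parallel for
    for (int j = 0; j < nvals; j++)
        t1[j] = 0;                    /* independent iterations: parallel "as is" */

    #pragma omp parallel for
    for (int i = 0; i < size; i++) {
        #pragma omp atomic            /* isolated update of the shared histogram */
        t1[src[i]]++;
    }
}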
The Sorting Example in OpenMP
- Only the third “scan” loop is a problem
- We can at least do this loop “manually”:

nt = omp_get_num_threads();
int ta[nt], tb[nt];
#pragma omp parallel for
for (myt = 0; myt < nt; myt++) {
  // Set ta[myt] = local sum of nvals/nt elements of t1[]
  #pragma omp barrier
  for (k = 1; k <= myt; k *= 2) {
    tb[myt] = ta[myt];
    ta[myt] += tb[myt - k];
    #pragma omp barrier
  }
  fix = (myt > 0) ? ta[myt - 1] : 0;
  // Set nvals/nt elements of t2[] to fix + local scan of t1[]
}
Parallel Patterns Library (PPL)
- PPL is a Microsoft C++ library built on top of the ConcRT user-mode scheduling runtime
- It supports mixed data- and task-parallelism:
  - parallel_for, parallel_for_each, parallel_invoke
  - agent, send, receive, choice, join, task_group
- Parallel loops use C++ lambda expressions:
    parallel_for(1, nvals, [&t1](int j) { t1[j] = 0; });
- Updates can be isolated using intrinsic functions:
    (void)_InterlockedIncrement(&t1[src[i]]);
- Microsoft and Intel plan to unify PPL and TBB
Dynamic Resource Management
- PPL programs are written for an arbitrary number of processors, which could be just one
- Load balancing is mostly done by work stealing
- There are two kinds of work to steal:
  - Work that is unblocked and waiting for a processor
  - Work that is not yet started and is potentially parallel
- Work of the latter kind will be done serially unless it is first stolen by another processor
  - This makes recursive divide and conquer easy
  - There is no concern about when to stop parallelism
A Quicksort Example

void quicksort(vector<int>::iterator first,
               vector<int>::iterator last) {
    if (last - first < 2) { return; }
    int pivot = *first;
    auto mid1 = partition(first, last,
                          [=](int e) { return e < pivot; });
    auto mid2 = partition(mid1, last,
                          [=](int e) { return e == pivot; });
    parallel_invoke(
        [=] { quicksort(first, mid1); },
        [=] { quicksort(mid2, last); }
    );
}
LINQ and PLINQ
- LINQ (Language Integrated Query) extends the .NET languages C#, Visual Basic, and F#
  - A LINQ query is really just a functional monad
  - It queries databases, XML, or any IEnumerable
- PLINQ is a parallel implementation of LINQ
  - Non-isolated functions must be avoided
  - Otherwise it is hard to tell the two apart
A LINQ Example

var q = from n in names
        where n.Name == queryInfo.Name &&
              n.State == queryInfo.State &&
              n.Year >= yearStart && n.Year <= yearEnd
        orderby n.Year ascending
        select n;

(For PLINQ, the data source becomes names.AsParallel().)
Message Passing Interface (MPI)
- MPI is a widely used message passing library for distributed memory HPC systems
- Some of its basic functions:
    MPI_Init, MPI_Comm_rank, MPI_Comm_size, MPI_Send, MPI_Recv
- A few of its “collective communication” functions:
    MPI_Reduce, MPI_Allreduce, MPI_Scan, MPI_Exscan, MPI_Barrier, MPI_Gather, MPI_Allgather, MPI_Alltoall
Sorting in MPI
- Roughly, it could work like this on n nodes:
  - Run the first two loops locally
  - Use MPI_Allreduce to build a global histogram
  - Run the third loop (redundantly) at every node
  - Allocate n value intervals to nodes (redundantly), balancing the data per node as well as possible
  - Run the fourth loop using the local histogram
  - Use MPI_Alltoall to redistribute the data
  - Merge the n sorted subarrays on each node
- Collective communication is expensive
  - But sorting needs it (see the Memory Wall slide)
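A sketch of just the global-histogram step (illustrative; the function name, buffer names, and the choice of MPI_INT are assumptions, and error handling is omitted):

#include <mpi.h>
#include <stdlib.h>

/* Combine per-node histograms into a global histogram that every node receives. */
void global_histogram(const int *src, int size, int nvals, int *global_t1) {
    int *local_t1 = calloc(nvals, sizeof(int));
    for (int i = 0; i < size; i++)        /* the second loop of the serial sort, run locally */
        local_t1[src[i]]++;
    /* Element-wise sum across all nodes; every node gets the result. */
    MPI_Allreduce(local_t1, global_t1, nvals, MPI_INT, MPI_SUM, MPI_COMM_WORLD);
    free(local_t1);
}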
Another Way to Sort in MPI
- The Samplesort algorithm is like Quicksort
- It works like this on n nodes:
  - Sort the local data on each node independently
  - Take s samples of the sorted data on each node
  - Use MPI_Allgather to send all nodes all samples
  - Compute n − 1 splitters (redundantly) on all nodes, balancing the data per node as well as possible
  - Use MPI_Alltoall to redistribute the data
  - Merge the n sorted subarrays on each node
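A sketch of the sampling and splitter-selection steps (illustrative only; the regular-sampling scheme, the qsort comparator, and all names are assumptions rather than the deck's own code):

#include <mpi.h>
#include <stdlib.h>

static int cmp_int(const void *a, const void *b) {
    int x = *(const int *)a, y = *(const int *)b;
    return (x > y) - (x < y);
}

/* "local" is this node's already-sorted data. Every node computes the same
   nnodes-1 splitters, redundantly, from all nodes' samples.                */
void choose_splitters(const int *local, int size, int s, int *splitters) {
    int nnodes;
    MPI_Comm_size(MPI_COMM_WORLD, &nnodes);
    int *samples = malloc(s * sizeof(int));
    int *all     = malloc(nnodes * s * sizeof(int));
    for (int i = 0; i < s; i++)                     /* s evenly spaced local samples */
        samples[i] = local[(long long)(i + 1) * size / (s + 1)];
    MPI_Allgather(samples, s, MPI_INT, all, s, MPI_INT, MPI_COMM_WORLD);
    qsort(all, nnodes * s, sizeof(int), cmp_int);   /* sort the gathered samples */
    for (int k = 1; k < nnodes; k++)                /* every s-th sample becomes a splitter */
        splitters[k - 1] = all[k * s];
    free(samples); free(all);
}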
Conclusions
Parallel Computing Has Arrived
- We must rethink how we write programs
  - And we are definitely doing that
- Other things will also need to change
  - Architecture
  - Operating systems
  - Algorithms
  - Theory
  - Application software
- We are seeing the biggest revolution in computing since its very beginnings
