Multithreaded Programming in Cilk, LECTURE 1. Charles E. Leiserson, Supercomputing Technologies Research Group, Computer Science and Artificial Intelligence Laboratory, Massachusetts Institute of Technology
Cilk: A C language for programming dynamic multithreaded applications on shared-memory multiprocessors. Example applications: virus shell assembly, graphics rendering, n-body simulation, heuristic search, dense and sparse matrix computations, friction-stir welding simulation, artificial evolution.
Shared-Memory Multiprocessor. In particular, over the next decade, chip multiprocessors (CMPs) will be an increasingly important platform! (Diagram: processors P, each with a cache $, connected through a network to shared memory and I/O.)
Cilk Is Simple. Cilk extends the C language with just a handful of keywords. Every Cilk program has a serial semantics. Not only is Cilk fast, it provides performance guarantees based on performance abstractions. Cilk is processor-oblivious. Cilk's provably good runtime system automatically manages low-level aspects of parallel execution, including protocols, load balancing, and scheduling. Cilk supports speculative parallelism.
Minicourse Outline. LECTURE 1, Basic Cilk programming: Cilk keywords, performance measures, scheduling. LECTURE 2, Analysis of Cilk algorithms: matrix multiplication, sorting, tableau construction. LABORATORY, Programming matrix multiplication in Cilk, Dr. Bradley C. Kuszmaul. LECTURE 3, Advanced Cilk programming: inlets, abort, speculation, data synchronization, & more.
LECTURE 1: Performance Measures, Scheduling Theory, Basic Cilk Programming, Cilk's Scheduler, Parallelizing Vector Addition, A Chess Lesson, Conclusion
Fibonacci Cilk is a  faithful   extension of C.  A Cilk program’s  serial elision  is always a legal implementation of Cilk semantics.  Cilk provides  no   new data types. int fib (int n) { if (n<2) return (n); else { int x,y; x = fib(n-1); y = fib(n-2); return (x+y); } } C elision cilk  int fib (int n) { if (n<2) return (n); else { int x,y; x =  spawn  fib(n-1); y =  spawn  fib(n-2); sync; return (x+y); } } Cilk code
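Because the serial elision is obtained just by deleting the Cilk keywords, it can be compiled and tested as ordinary C. A minimal sketch:

```c
#include <assert.h>

/* Serial elision of the Cilk fib: delete "cilk", "spawn", and "sync",
   and what remains is a legal C implementation of the same semantics. */
int fib(int n) {
    if (n < 2) return n;
    else {
        int x, y;
        x = fib(n - 1);   /* was: x = spawn fib(n-1); */
        y = fib(n - 2);   /* was: y = spawn fib(n-2); */
                          /* was: sync; */
        return x + y;
    }
}
```

Running the elision gives the same answers the parallel version must produce, e.g. fib(10) = 55.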
Basic Cilk Keywords cilk int fib (int n) { if (n<2) return (n); else { int x,y; x = spawn fib(n-1); y = spawn fib(n-2); sync; return (x+y); } } The keyword cilk identifies a function as a Cilk procedure, capable of being spawned in parallel. spawn: the named child Cilk procedure can execute in parallel with the parent caller. sync: control cannot pass this point until all spawned children have returned.
Dynamic Multithreading cilk int fib (int n) { if (n<2) return (n); else { int x,y; x = spawn fib(n-1); y = spawn fib(n-2); sync; return (x+y); } } The computation dag unfolds dynamically. "Processor oblivious." Example: fib(4). (Diagram: the spawn tree of fib(4), with nodes labeled 4, 3, 2, 2, 1, 1, 1, 0, 0.)
Multithreaded Computation. The dag G = (V, E) represents a parallel instruction stream. Each vertex v ∈ V represents a (Cilk) thread: a maximal sequence of instructions not containing parallel control (spawn, sync, return). Every edge e ∈ E is either a spawn edge, a return edge, or a continue edge. (Diagram: a dag running from an initial thread to a final thread, with spawn, return, and continue edges marked.)
Cactus Stack. Cilk supports C's rule for pointers: A pointer to stack space can be passed from parent to child, but not from child to parent. (Cilk also supports malloc.) Cilk's cactus stack supports several views in parallel. (Diagram: A spawns B and C, and C spawns D and E; the corresponding views of the stack are A B, A C, A C D, and A C E.)
LECTURE 1 Performance Measures Scheduling Theory Basic Cilk Programming Cilk’s Scheduler Parallelizing Vector Addition A Chess Lesson Conclusion
Algorithmic Complexity Measures. TP = execution time on P processors.
Algorithmic Complexity Measures. TP = execution time on P processors. T1 = work.
Algorithmic Complexity Measures. TP = execution time on P processors. T1 = work. T∞ = span.* (*Also called critical-path length or computational depth.)
Algorithmic Complexity Measures. TP = execution time on P processors. T1 = work. T∞ = span.* LOWER BOUNDS: TP ≥ T1/P and TP ≥ T∞. (*Also called critical-path length or computational depth.)
Speedup. Definition: T1/TP = speedup on P processors. If T1/TP = Θ(P) ≤ P, we have linear speedup; if T1/TP = P, we have perfect linear speedup; if T1/TP > P, we have superlinear speedup, which is not possible in our model, because of the lower bound TP ≥ T1/P.
Parallelism. Because we have the lower bound TP ≥ T∞, the maximum possible speedup given T1 and T∞ is T1/T∞ = parallelism = the average amount of work per step along the span.
Example: fib(4). Assume for simplicity that each Cilk thread in fib() takes unit time to execute. Work: T1 = 17. Span: T∞ = 8. (Diagram: the dag of fib(4), with the critical path numbered 1 through 8.)
Example: fib(4). Assume for simplicity that each Cilk thread in fib() takes unit time to execute. Work: T1 = 17. Span: T∞ = 8. Parallelism: T1/T∞ = 2.125. Using many more than 2 processors makes little sense.
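These counts can be checked mechanically. In the slide's unit-time model, each fib(n) instance with n ≥ 2 contributes three threads (before the first spawn, between the two spawns, and after the sync) and each base case contributes one; the sketch below, a hypothetical encoding of those recurrences rather than anything from the lecture, reproduces T1 = 17 and T∞ = 8 for fib(4):

```c
#include <assert.h>

/* Work: total number of unit-time threads in the dag of fib(n). */
int fib_work(int n) {
    if (n < 2) return 1;                           /* base case: one thread */
    return 3 + fib_work(n - 1) + fib_work(n - 2);  /* 3 parent threads + children */
}

/* Span: longest path of threads.  The path through the first child
   crosses 2 parent threads; through the second child, 3 (the parent's
   middle thread sits between the two spawns). */
int fib_span(int n) {
    if (n < 2) return 1;
    int via1 = 2 + fib_span(n - 1);
    int via2 = 3 + fib_span(n - 2);
    return via1 > via2 ? via1 : via2;
}
```

Dividing the two gives the parallelism 17/8 = 2.125 quoted on the slide.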
LECTURE 1: Performance Measures, Scheduling Theory, Basic Cilk Programming, Cilk's Scheduler, Parallelizing Vector Addition, A Chess Lesson, Conclusion
Parallelizing Vector Addition void vadd (real *A, real *B, int n){ int i; for (i=0; i<n; i++) A[i]+=B[i]; } C
Parallelizing Vector Addition. Parallelization strategy: Convert loops to recursion. C: void vadd (real *A, real *B, int n){ int i; for (i=0; i<n; i++) A[i]+=B[i]; } becomes C: void vadd (real *A, real *B, int n){ if (n<=BASE) { int i; for (i=0; i<n; i++) A[i]+=B[i]; } else { vadd (A, B, n/2); vadd (A+n/2, B+n/2, n-n/2); } }
Parallelizing Vector Addition. Parallelization strategy: Convert loops to recursion. Insert Cilk keywords. Cilk: cilk void vadd (real *A, real *B, int n){ if (n<=BASE) { int i; for (i=0; i<n; i++) A[i]+=B[i]; } else { spawn vadd (A, B, n/2); spawn vadd (A+n/2, B+n/2, n-n/2); sync; } } Side benefit: D&C is generally good for caches!
Vector Addition cilk void vadd (real *A, real *B, int n){ if (n<=BASE) { int i; for (i=0; i<n; i++) A[i]+=B[i]; } else { spawn vadd (A, B, n/2); spawn vadd (A+n/2, B+n/2, n-n/2); sync; } }
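As with fib, deleting the keywords gives the serial elision, which can be tested in plain C. A minimal sketch, assuming the slides' real type is double and picking BASE = 4 purely for illustration:

```c
#include <assert.h>

#define BASE 4            /* illustrative choice; the slides leave it abstract */
typedef double real;      /* assumption: "real" on the slides means double */

/* Serial elision of the divide-and-conquer vadd. */
void vadd(real *A, real *B, int n) {
    if (n <= BASE) {
        int i;
        for (i = 0; i < n; i++) A[i] += B[i];
    } else {
        vadd(A, B, n / 2);                   /* was: spawn vadd(...) */
        vadd(A + n / 2, B + n / 2, n - n / 2);
                                             /* was: sync; */
    }
}
```

The recursion bottoms out in simple loops over BASE-sized chunks, and every element ends up with A[i] + B[i].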
Vector Addition Analysis. To add two vectors of length n, where BASE = Θ(1): Work: T1 = Θ(n). Span: T∞ = Θ(lg n). Parallelism: T1/T∞ = Θ(n/lg n).
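The recurrences behind these bounds can be evaluated numerically. Assuming unit cost per procedure instance and BASE = 1 (my simplification, not the slides'), work satisfies T1(n) = T1(⌈n/2⌉) + T1(⌊n/2⌋) + 1 and span satisfies T∞(n) = T∞(⌈n/2⌉) + 1, since the two spawned halves run in parallel:

```c
#include <assert.h>

/* Work of recursive vadd: both halves plus the call itself.
   For n a power of 2 this comes to exactly 2n - 1, which is Theta(n). */
long vadd_work(long n) {
    if (n <= 1) return 1;
    return 1 + vadd_work(n / 2) + vadd_work(n - n / 2);
}

/* Span: only the larger spawned half counts toward the critical path.
   For n a power of 2 this is lg n + 1, which is Theta(lg n). */
long vadd_span(long n) {
    if (n <= 1) return 1;
    return 1 + vadd_span(n - n / 2);   /* n - n/2 = ceil(n/2) */
}
```

For n = 1024 this gives work 2047 and span 11, consistent with Θ(n) work and Θ(lg n) span.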
Another Parallelization C void vadd1 (real *A, real *B, int n){ int i; for (i=0; i<n; i++) A[i]+=B[i]; } void vadd (real *A, real *B, int n){ int j; for (j=0; j<n; j+=BASE) { vadd1(A+j, B+j, min(BASE, n-j)); } } Cilk cilk  void vadd1 (real *A, real *B, int n){ int i; for (i=0; i<n; i++) A[i]+=B[i]; } cilk  void vadd (real *A, real *B, int n){ int j; for (j=0; j<n; j+=BASE) { spawn  vadd1(A+j, B+j, min(BASE, n-j)); } sync; }
Analysis. To add two vectors of length n, where BASE = Θ(1): Work: T1 = Θ(n). Span: T∞ = Θ(n). Parallelism: T1/T∞ = Θ(1). PUNY!
Optimal Choice of BASE. To add two vectors of length n using an optimal choice of BASE to maximize parallelism: Work: T1 = Θ(n). Span: T∞ = Θ(BASE + n/BASE). Choosing BASE = √n ⇒ T∞ = Θ(√n). Parallelism: T1/T∞ = Θ(√n).
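The trade-off is easy to see numerically: the span grows like BASE + n/BASE, and the two terms balance at BASE = √n. A sketch under unit costs (the constant factors are my simplification):

```c
#include <assert.h>

/* Span of the one-level-of-spawns vadd: the serial spawn loop
   contributes n/BASE steps, and one vadd1 chunk contributes BASE. */
long span_for_base(long n, long base) {
    return base + n / base;
}
```

For n = 10000, BASE = 100 = √n gives span 200, while BASE = 10 and BASE = 1000 both give 1010: too-small chunks pay in the spawn loop, too-large chunks pay in the serial chunk.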
LECTURE 1: Performance Measures, Scheduling Theory, Basic Cilk Programming, Cilk's Scheduler, Parallelizing Vector Addition, A Chess Lesson, Conclusion
Scheduling. Cilk allows the programmer to express potential parallelism in an application. The Cilk scheduler maps Cilk threads onto processors dynamically at runtime. Since on-line schedulers are complicated, we'll illustrate the ideas with an off-line scheduler. (Diagram: processors with caches connected by a network to shared memory and I/O.)
Greedy Scheduling. IDEA: Do as much as possible on every step. Definition: A thread is ready if all its predecessors have executed.
Greedy Scheduling. IDEA: Do as much as possible on every step. Definition: A thread is ready if all its predecessors have executed. Complete step: ≥ P threads ready. Run any P. (P = 3 in the diagram.)
Greedy Scheduling. IDEA: Do as much as possible on every step. Definition: A thread is ready if all its predecessors have executed. Complete step: ≥ P threads ready. Run any P. Incomplete step: < P threads ready. Run all of them. (P = 3 in the diagram.)
Greedy-Scheduling Theorem. Theorem [Graham '68 & Brent '75]. Any greedy scheduler achieves TP ≤ T1/P + T∞. Proof. # complete steps ≤ T1/P, since each complete step performs P work. # incomplete steps ≤ T∞, since each incomplete step reduces the span of the unexecuted dag by 1. ■ (P = 3 in the diagram.)
Optimality of Greedy. Corollary. Any greedy scheduler achieves within a factor of 2 of optimal. Proof. Let TP* be the execution time produced by the optimal scheduler. Since TP* ≥ max{T1/P, T∞} (lower bounds), we have TP ≤ T1/P + T∞ ≤ 2·max{T1/P, T∞} ≤ 2TP*. ■
Linear Speedup. Corollary. Any greedy scheduler achieves near-perfect linear speedup whenever P ≪ T1/T∞. Proof. Since P ≪ T1/T∞ is equivalent to T∞ ≪ T1/P, the Greedy Scheduling Theorem gives us TP ≤ T1/P + T∞ ≈ T1/P. Thus, the speedup is T1/TP ≈ P. ■ Definition. The quantity (T1/T∞)/P is called the parallel slackness.
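Plugging numbers into the greedy bound shows the corollary at work. Suppose, hypothetically, T1 = 1,000,000 and T∞ = 100, so the parallelism is 10,000. With P = 100 the slackness is 100, and the bound already guarantees about 99% of perfect linear speedup:

```c
#include <assert.h>

/* Greedy upper bound on running time: T_P <= T_1/P + T_inf. */
double greedy_bound(double work, double span, double p) {
    return work / p + span;
}
```

Here the bound gives TP ≤ 10,000 + 100 = 10,100, hence speedup ≥ 1,000,000 / 10,100 ≈ 99 on 100 processors.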
Cilk Performance. Cilk's "work-stealing" scheduler achieves TP = T1/P + O(T∞) expected time (provably); TP ≈ T1/P + T∞ time (empirically). Near-perfect linear speedup if P ≪ T1/T∞. Instrumentation in Cilk allows the user to determine accurate measures of T1 and T∞. The average cost of a spawn in Cilk-5 is only 2–6 times the cost of an ordinary C function call, depending on the platform.
LECTURE 1: Performance Measures, Scheduling Theory, Basic Cilk Programming, Cilk's Scheduler, Parallelizing Vector Addition, A Chess Lesson, Conclusion
Cilk Chess Programs. ★Socrates placed 3rd in the 1994 International Computer Chess Championship running on NCSA's 512-node Connection Machine CM5. ★Socrates 2.0 took 2nd place in the 1995 World Computer Chess Championship running on Sandia National Labs' 1824-node Intel Paragon. Cilkchess placed 1st in the 1996 Dutch Open running on a 12-processor Sun Enterprise 5000. It placed 2nd in 1997 and 1998 running on Boston University's 64-processor SGI Origin 2000. Cilkchess tied for 3rd in the 1999 WCCC running on NASA's 256-node SGI Origin 2000.
★Socrates Normalized Speedup. (Plot: measured speedup T1/TP, normalized by the parallelism T1/T∞, versus P normalized the same way, on log-log axes from 0.01 to 1. The measurements track the model TP = T1/P + T∞, between the asymptotes TP = T1/P and TP = T∞.)
Developing ★Socrates. For the competition, ★Socrates was to run on a 512-processor Connection Machine Model CM5 supercomputer at the University of Illinois. The developers had easy access to a similar 32-processor CM5 at MIT. One of the developers proposed a change to the program that produced a speedup of over 20% on the MIT machine. After a back-of-the-envelope calculation, the proposed "improvement" was rejected!
★Socrates Speedup Paradox. TP ≈ T1/P + T∞. Original program: T1 = 2048 seconds, T∞ = 1 second; T32 = 2048/32 + 1 = 65 seconds; T512 = 2048/512 + 1 = 5 seconds. Proposed program: T1′ = 1024 seconds, T∞′ = 8 seconds; T′32 = 1024/32 + 8 = 40 seconds; T′512 = 1024/512 + 8 = 10 seconds.
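The paradox is pure arithmetic on the model TP ≈ T1/P + T∞ and is easy to verify: the proposed program wins on the 32-processor machine but loses on the 512-processor one.

```c
#include <assert.h>

/* Predicted running time (seconds) from the work/span model
   T_P = T_1/P + T_inf, using the slide's integer figures. */
int predict(int work, int span, int p) {
    return work / p + span;
}
```

With work 2048 and span 1 for the original versus work 1024 and span 8 for the proposal, the model reproduces the 65 vs. 40 and 5 vs. 10 second figures.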
Lesson Work  and  span  can predict performance on large machines better than running times on small machines can.
LECTURE 1: Performance Measures, Scheduling Theory, Basic Cilk Programming, Cilk's Scheduler, Parallelizing Vector Addition, A Chess Lesson, Conclusion
Cilk’s Work-Stealing Scheduler Each processor maintains a  work deque   of ready threads, and it manipulates the bottom of the deque like a stack. P P P P Spawn!
Cilk’s Work-Stealing Scheduler Each processor maintains a  work deque  of ready threads, and it manipulates the bottom of the deque like a stack. P P P P Spawn! Spawn!
Cilk’s Work-Stealing Scheduler Each processor maintains a   work deque  of ready threads, and it manipulates the bottom of the deque like a stack. P P P P Return!
Cilk’s Work-Stealing Scheduler Each processor maintains a  work deque  of ready threads, and it manipulates the bottom of the deque like a stack. P P P P Return!
Cilk’s Work-Stealing Scheduler Each processor maintains a  work deque  of ready threads, and it manipulates the bottom of the deque like a stack. P P P P Steal! When a processor runs out of work, it  steals  a thread from the top of a  random  victim’s deque.
Cilk’s Work-Stealing Scheduler Each processor maintains a  work deque  of ready threads, and it manipulates the bottom of the deque like a stack. P P P P Steal! When a processor runs out of work, it  steals  a thread from the top of a  random  victim’s deque.
Cilk’s Work-Stealing Scheduler Each processor maintains a  work deque  of ready threads, and it manipulates the bottom of the deque like a stack. P P P P When a processor runs out of work, it  steals  a thread from the top of a  random  victim’s deque.
Cilk’s Work-Stealing Scheduler Each processor maintains a  work deque  of ready threads, and it manipulates the bottom of the deque like a stack. P P P P Spawn! When a processor runs out of work, it  steals  a thread from the top of a  random  victim’s deque.
Performance of Work-Stealing. Theorem: Cilk's work-stealing scheduler achieves an expected running time of TP ≤ T1/P + O(T∞) on P processors. Pseudoproof. A processor is either working or stealing. The total time all processors spend working is T1. Each steal has a 1/P chance of reducing the span by 1. Thus, the expected cost of all steals is O(PT∞). Since there are P processors, the expected time is (T1 + O(PT∞))/P = T1/P + O(T∞). ■
Space Bounds. Theorem. Let S1 be the stack space required by a serial execution of a Cilk program. Then, the space required by a P-processor execution is at most SP ≤ PS1. Proof (by induction). The work-stealing algorithm maintains the busy-leaves property: every extant procedure frame with no extant descendants has a processor working on it. ■ (P = 3 in the diagram.)
Linguistic Implications. Code like the following executes properly without any risk of blowing out memory: for (i=1; i<1000000000; i++) { spawn foo(i); } sync; MORAL: Better to steal parents than children!
LECTURE 1: Performance Measures, Scheduling Theory, Basic Cilk Programming, Cilk's Scheduler, Parallelizing Vector Addition, A Chess Lesson, Conclusion
Key Ideas Cilk is simple:  cilk ,  spawn ,  sync Recursion, recursion, recursion, … Work & span Work & span Work & span Work & span Work & span Work & span Work & span Work & span Work & span Work & span Work & span Work & span Work & span Work & span Work & span
Minicourse Outline. LECTURE 1, Basic Cilk programming: Cilk keywords, performance measures, scheduling. LECTURE 2, Analysis of Cilk algorithms: matrix multiplication, sorting, tableau construction. LABORATORY, Programming matrix multiplication in Cilk, Dr. Bradley C. Kuszmaul. LECTURE 3, Advanced Cilk programming: inlets, abort, speculation, data synchronization, & more.

More Related Content

PDF
Anlysis and design of algorithms part 1
PPT
multi threaded and distributed algorithms
PPTX
Lecture 2 data structures and algorithms
DOCX
Basic Computer Engineering Unit II as per RGPV Syllabus
PPT
Time andspacecomplexity
PDF
Analysis and design of algorithms part2
PPT
Complexity of Algorithm
PPT
Parallel algorithms
Anlysis and design of algorithms part 1
multi threaded and distributed algorithms
Lecture 2 data structures and algorithms
Basic Computer Engineering Unit II as per RGPV Syllabus
Time andspacecomplexity
Analysis and design of algorithms part2
Complexity of Algorithm
Parallel algorithms

What's hot (19)

PDF
Symbolic Execution as DPLL Modulo Theories
PDF
14 - 08 Feb - Dynamic Programming
PPTX
Lecture 5: Asymptotic analysis of algorithms
PDF
Introduction to Algorithms Complexity Analysis
PDF
Data Structures - Lecture 8 - Study Notes
PPT
02 order of growth
PDF
Algorithm Analyzing
RTF
Design and Analysis of algorithms
PPT
Algorithm analysis
PPT
Analysis of Algorithm
PPTX
asymptotic analysis and insertion sort analysis
PPT
Introduction to Algorithms
PDF
Data Structure: Algorithm and analysis
PPTX
Performance analysis(Time & Space Complexity)
PPT
how to calclute time complexity of algortihm
PDF
PPT
Data Structures- Part2 analysis tools
PDF
Stack Hybridization: A Mechanism for Bridging Two Compilation Strategies in a...
DOC
Algorithms Question bank
Symbolic Execution as DPLL Modulo Theories
14 - 08 Feb - Dynamic Programming
Lecture 5: Asymptotic analysis of algorithms
Introduction to Algorithms Complexity Analysis
Data Structures - Lecture 8 - Study Notes
02 order of growth
Algorithm Analyzing
Design and Analysis of algorithms
Algorithm analysis
Analysis of Algorithm
asymptotic analysis and insertion sort analysis
Introduction to Algorithms
Data Structure: Algorithm and analysis
Performance analysis(Time & Space Complexity)
how to calclute time complexity of algortihm
Data Structures- Part2 analysis tools
Stack Hybridization: A Mechanism for Bridging Two Compilation Strategies in a...
Algorithms Question bank
Ad

Similar to Lecture 1 (20)

PDF
Data Structure & Algorithms - Mathematical
PDF
Analysis and Matrix Multiplication using Parallel
PPTX
Intro to super. advance algorithm..pptx
PPT
chapter1.ppt
PDF
design and analysis of algorithm basic concepts.pdf
PDF
008. PROGRAM EFFICIENCY computer science.pdf
PPTX
Analysis of Algorithms (1).pptx, asymptotic
PPT
Matlab Nn Intro
PPTX
01 - DAA - PPT.pptx
PPTX
Module-1.pptxbdjdhcdbejdjhdbchchchchchjcjcjc
PDF
PPTX
AA_Unit 1_part-I.pptx
PPTX
Analysis of Algorithms, recurrence relation, solving recurrences
PPTX
Unit i basic concepts of algorithms
PDF
BCS401 ADA First IA Test Question Bank.pdf
PDF
Towards an SMT-based approach for Quantitative Information Flow
PPT
lecture3.pptlecture3 data structures pptt
PDF
A peek on numerical programming in perl and python e christopher dyken 2005
PPT
Basic_analysis.ppt
Data Structure & Algorithms - Mathematical
Analysis and Matrix Multiplication using Parallel
Intro to super. advance algorithm..pptx
chapter1.ppt
design and analysis of algorithm basic concepts.pdf
008. PROGRAM EFFICIENCY computer science.pdf
Analysis of Algorithms (1).pptx, asymptotic
Matlab Nn Intro
01 - DAA - PPT.pptx
Module-1.pptxbdjdhcdbejdjhdbchchchchchjcjcjc
AA_Unit 1_part-I.pptx
Analysis of Algorithms, recurrence relation, solving recurrences
Unit i basic concepts of algorithms
BCS401 ADA First IA Test Question Bank.pdf
Towards an SMT-based approach for Quantitative Information Flow
lecture3.pptlecture3 data structures pptt
A peek on numerical programming in perl and python e christopher dyken 2005
Basic_analysis.ppt
Ad

Recently uploaded (20)

PPTX
Digital-Transformation-Roadmap-for-Companies.pptx
PPTX
VMware vSphere Foundation How to Sell Presentation-Ver1.4-2-14-2024.pptx
PDF
MIND Revenue Release Quarter 2 2025 Press Release
PDF
Empathic Computing: Creating Shared Understanding
PDF
Building Integrated photovoltaic BIPV_UPV.pdf
PPTX
Understanding_Digital_Forensics_Presentation.pptx
PDF
Encapsulation theory and applications.pdf
PPTX
Big Data Technologies - Introduction.pptx
PPTX
Detection-First SIEM: Rule Types, Dashboards, and Threat-Informed Strategy
PPTX
sap open course for s4hana steps from ECC to s4
PDF
Reach Out and Touch Someone: Haptics and Empathic Computing
PDF
Encapsulation_ Review paper, used for researhc scholars
PDF
Spectral efficient network and resource selection model in 5G networks
PDF
Chapter 3 Spatial Domain Image Processing.pdf
DOCX
The AUB Centre for AI in Media Proposal.docx
PDF
Profit Center Accounting in SAP S/4HANA, S4F28 Col11
PDF
Review of recent advances in non-invasive hemoglobin estimation
PDF
cuic standard and advanced reporting.pdf
PPTX
Cloud computing and distributed systems.
PDF
Approach and Philosophy of On baking technology
Digital-Transformation-Roadmap-for-Companies.pptx
VMware vSphere Foundation How to Sell Presentation-Ver1.4-2-14-2024.pptx
MIND Revenue Release Quarter 2 2025 Press Release
Empathic Computing: Creating Shared Understanding
Building Integrated photovoltaic BIPV_UPV.pdf
Understanding_Digital_Forensics_Presentation.pptx
Encapsulation theory and applications.pdf
Big Data Technologies - Introduction.pptx
Detection-First SIEM: Rule Types, Dashboards, and Threat-Informed Strategy
sap open course for s4hana steps from ECC to s4
Reach Out and Touch Someone: Haptics and Empathic Computing
Encapsulation_ Review paper, used for researhc scholars
Spectral efficient network and resource selection model in 5G networks
Chapter 3 Spatial Domain Image Processing.pdf
The AUB Centre for AI in Media Proposal.docx
Profit Center Accounting in SAP S/4HANA, S4F28 Col11
Review of recent advances in non-invasive hemoglobin estimation
cuic standard and advanced reporting.pdf
Cloud computing and distributed systems.
Approach and Philosophy of On baking technology

Lecture 1

  • 1. Multithreaded Programming in Cilk L ECTURE 1 Charles E. Leiserson Supercomputing Technologies Research Group Computer Science and Artificial Intelligence Laboratory Massachusetts Institute of Technology
  • 2. Cilk A C language for programming dynamic multithreaded applications on shared-memory multiprocessors. virus shell assembly graphics rendering n -body simulation heuristic search dense and sparse matrix computations friction-stir welding simulation artificial evolution Example applications:
  • 3. Shared-Memory Multiprocessor In particular, over the next decade, chip multiprocessors (CMP’s) will be an increasingly important platform! P P P Network … Memory I/O $ $ $
  • 4. Cilk Is Simple Cilk extends the C language with just a handful of keywords. Every Cilk program has a serial semantics . Not only is Cilk fast, it provides performance guarantees based on performance abstractions. Cilk is processor-oblivious . Cilk’s provably good runtime system auto-matically manages low-level aspects of parallel execution, including protocols, load balancing, and scheduling. Cilk supports speculative parallelism.
  • 5. Minicourse Outline L ECTURE 1 Basic Cilk programming: Cilk keywords, performance measures, scheduling. L ECTURE 2 Analysis of Cilk algorithms: matrix multiplication, sorting, tableau construction. L ABORATORY Programming matrix multiplication in Cilk — Dr. Bradley C. Kuszmaul L ECTURE 3 Advanced Cilk programming: inlets, abort, speculation, data synchronization, & more.
  • 6. L ECTURE 1 Performance Measures Scheduling Theory Basic Cilk Programming Cilk’s Scheduler Conclusion Parallelizing Vector Addition A Chess Lesson
  • 7. Fibonacci Cilk is a faithful extension of C. A Cilk program’s serial elision is always a legal implementation of Cilk semantics. Cilk provides no new data types. int fib (int n) { if (n<2) return (n); else { int x,y; x = fib(n-1); y = fib(n-2); return (x+y); } } C elision cilk int fib (int n) { if (n<2) return (n); else { int x,y; x = spawn fib(n-1); y = spawn fib(n-2); sync; return (x+y); } } Cilk code
  • 8. Basic Cilk Keywords cilk int fib (int n) { if (n<2) return (n); else { int x,y; x = spawn fib(n-1); y = spawn fib(n-2); sync; return (x+y); } } Identifies a function as a Cilk procedure , capable of being spawned in parallel. The named child Cilk procedure can execute in parallel with the parent caller. Control cannot pass this point until all spawned children have returned.
  • 9. Dynamic Multithreading cilk int fib (int n) { if (n<2) return (n); else { int x,y; x = spawn fib(n-1); y = spawn fib(n-2); sync ; return (x+y); } } The computation dag unfolds dynamically. Example: fib(4) “ Processor oblivious” 4 3 2 2 1 1 1 0 0
  • 10. Multithreaded Computation The dag G = ( V , E ) represents a parallel instruction stream. Each vertex v 2 V represents a (Cilk) thread : a maximal sequence of instructions not containing parallel control ( spawn , sync , return ). Every edge e 2 E is either a spawn edge, a return edge, or a continue edge. spawn edge return edge continue edge initial thread final thread
  • 11. Cactus Stack Cilk supports C’s rule for pointers: A pointer to stack space can be passed from parent to child, but not from child to parent. (Cilk also supports malloc .) Cilk’s cactus stack supports several views in parallel. B A C E D A A B A C A C D A C E Views of stack C B A D E
  • 12. LECTURE 1 Performance Measures Scheduling Theory Basic Cilk Programming Cilk’s Scheduler Parallelizing Vector Addition A Chess Lesson Conclusion
  • 13. Algorithmic Complexity Measures T P = execution time on P processors
  • 14. Algorithmic Complexity Measures T P = execution time on P processors T 1 = work
  • 15. Algorithmic Complexity Measures T P = execution time on P processors T 1 = work T 1 = span * * Also called critical-path length or computational depth .
  • 16. Algorithmic Complexity Measures T P = execution time on P processors T 1 = work L OWER B OUNDS T P ¸ T 1 / P T P ¸ T 1 * Also called critical-path length or computational depth . T 1 = span *
  • 17. Speedup Definition: T 1 /T P = speedup on P processors. If T 1 /T P =  ( P ) · P , we have linear speedup ; = P , we have perfect linear speedup ; > P , we have superlinear speedup , which is not possible in our model, because of the lower bound T P ¸ T 1 / P .
  • 18. Parallelism Because we have the lower bound T P ¸ T 1 , the maximum possible speedup given T 1 and T 1 is T 1 /T 1 = parallelism = the average amount of work per step along the span.
  • 19. Example: fib(4) Span: T 1 = ? Work: T 1 = ? Assume for simplicity that each Cilk thread in fib() takes unit time to execute. Span: T 1 = 8 3 4 5 6 1 2 7 8 Work: T 1 = 17
  • 20. Example: fib(4) Parallelism: T 1 / T 1 = 2.125 Span: T 1 = ? Work: T 1 = ? Assume for simplicity that each Cilk thread in fib() takes unit time to execute. Span: T 1 = 8 Work: T 1 = 17 Using many more than 2 processors makes little sense.
  • 21. L ECTURE 1 Performance Measures Scheduling Theory Basic Cilk Programming Cilk’s Scheduler Parallelizing Vector Addition A Chess Lesson Conclusion
  • 22. Parallelizing Vector Addition void vadd (real *A, real *B, int n){ int i; for (i=0; i<n; i++) A[i]+=B[i]; } C
  • 23. Parallelizing Vector Addition C C if (n<=BASE) { int i; for (i=0; i<n; i++) A[i]+=B[i]; } else { void vadd (real *A, real *B, int n){ vadd (A, B, n/2); vadd (A+n/2, B+n/2, n-n/2); Parallelization strategy: Convert loops to recursion. void vadd (real *A, real *B, int n){ int i; for (i=0; i<n; i++) A[i]+=B[i]; } } }
  • 24. Parallelizing Vector Addition if (n<=BASE) { int i; for (i=0; i<n; i++) A[i]+=B[i]; } else { C Parallelization strategy: Convert loops to recursion. Insert Cilk keywords. void vadd (real *A, real *B, int n){ cil k spawn vadd (A, B, n/2; vadd (A+n/2, B+n/2, n-n/2; spawn Side benefit: D&C is generally good for caches! sync ; C ilk void vadd (real *A, real *B, int n){ int i; for (i=0; i<n; i++) A[i]+=B[i]; } } }
  • 25. Vector Addition cil k void vadd (real *A, real *B, int n){ if (n<=BASE) { int i; for (i=0; i<n; i++) A[i]+=B[i]; } else { spawn vadd (A, B, n/2); spawn vadd (A+n/2, B+n/2, n-n/2); sync ; } }
  • 26. Vector Addition Analysis To add two vectors of length n , where BASE =  (1) : Work: T 1 =  Span: T 1 =  Parallelism: T 1 / T 1 =   ( n /lg n )  ( n )  (lg n ) BASE
  • 27. Another Parallelization C void vadd1 (real *A, real *B, int n){ int i; for (i=0; i<n; i++) A[i]+=B[i]; } void vadd (real *A, real *B, int n){ int j; for (j=0; j<n; j+=BASE) { vadd1(A+j, B+j, min(BASE, n-j)); } } Cilk cilk void vadd1 (real *A, real *B, int n){ int i; for (i=0; i<n; i++) A[i]+=B[i]; } cilk void vadd (real *A, real *B, int n){ int j; for (j=0; j<n; j+=BASE) { spawn vadd1(A+j, B+j, min(BASE, n-j)); } sync; }
  • 28. Analysis To add two vectors of length n , where BASE =  (1) :  (1)  ( n ) … …  ( n ) BASE Work: T 1 =  Span: T 1 =  Parallelism: T 1 / T 1 =  PUNY!
  • 29. Optimal Choice of BASE To add two vectors of length n using an optimal choice of BASE to maximize parallelism: Parallelism: T 1 / T 1 =   ( √ n ) … Work: T 1 =   ( n ) BASE … Span: T 1 =   (BASE + n /BASE) Choosing BASE = √ n ) T 1 =  √ n )
  • 30. L ECTURE 1 Performance Measures Scheduling Theory Basic Cilk Programming Cilk’s Scheduler Parallelizing Vector Addition A Chess Lesson Conclusion
  • 31. Scheduling Cilk allows the programmer to express potential parallelism in an application. The Cilk scheduler maps Cilk threads onto processors dynamically at runtime. Since on-line schedulers are complicated, we’ll illustrate the ideas with an off-line scheduler. P P P Network … Memory I/O $ $ $
  • 32. Greedy Scheduling I DEA : Do as much as possible on every step. Definition: A thread is ready if all its predecessors have executed .
  • 33. Greedy Scheduling I DEA : Do as much as possible on every step. Complete step ¸ P threads ready. Run any P . Definition: A thread is ready if all its predecessors have executed . P = 3
  • 34. Greedy Scheduling I DEA : Do as much as possible on every step. Complete step ¸ P threads ready. Run any P . Incomplete step < P threads ready. Run all of them. Definition: A thread is ready if all its predecessors have executed . P = 3
  • 35. Greedy-Scheduling Theorem Theorem [Graham ’68 & Brent ’75]. Any greedy scheduler achieves T P  T 1 / P + T  . Proof . # complete steps · T 1 / P , since each complete step performs P work. # incomplete steps · T 1 , since each incomplete step reduces the span of the unexecuted dag by 1 . ■ P = 3
  • 36. Optimality of Greedy Corollary. Any greedy scheduler achieves within a factor of 2 of optimal. Proof . Let T P * be the execution time produced by the optimal scheduler. Since T P * ¸ max{ T 1 / P , T 1 } (lower bounds), we have T P · T 1 / P + T 1 · 2 ¢ max{ T 1 / P , T 1 } · 2 T P * . ■
  • 37. Linear Speedup Corollary. Any greedy scheduler achieves near-perfect linear speedup whenever P ¿ T 1 / T 1 . Proof. Since P ¿ T 1 / T 1 is equivalent to T 1 ¿ T 1 / P , the Greedy Scheduling Theorem gives us T P · T 1 / P + T 1 ¼ T 1 / P . Thus, the speedup is T 1 / T P ¼ P . ■ Definition. The quantity ( T 1 / T 1 )/ P is called the parallel slackness .
  • 38. Cilk Performance Cilk’s “work-stealing” scheduler achieves T P = T 1 / P + O ( T 1 ) expected time (provably); T P  T 1 / P + T 1 time (empirically). Near-perfect linear speedup if P ¿ T 1 / T 1 . Instrumentation in Cilk allows the user to determine accurate measures of T 1 and T 1 . The average cost of a spawn in Cilk-5 is only 2–6 times the cost of an ordinary C function call, depending on the platform.
  • 39. L ECTURE 1 Performance Measures Scheduling Theory Basic Cilk Programming Cilk’s Scheduler Parallelizing Vector Addition A Chess Lesson Conclusion
  • 40. Cilk Chess Programs  Socrates placed 3rd in the 1994 International Computer Chess Championship running on NCSA’s 512 -node Connection Machine CM5.  Socrates 2.0 took 2nd place in the 1995 World Computer Chess Championship running on Sandia National Labs’ 1824 -node Intel Paragon. Cilkchess placed 1st in the 1996 Dutch Open running on a 12 -processor Sun Enterprise 5000. It placed 2nd in 1997 and 1998 running on Boston University’s 64 -processor SGI Origin 2000. Cilkchess tied for 3rd in the 1999 WCCC running on NASA’s 256 -node SGI Origin 2000.
  • 41.  Socrates Normalized Speedup T P = T 1 / P + T  measured speedup 0.01 0.1 1 0.01 0.1 1 T P = T  T P = T 1 / P T 1 /T P T 1 /T  P T 1 /T 
  • 42. Developing  Socrates For the competition,  Socrates was to run on a 512 -processor Connection Machine Model CM5 supercomputer at the University of Illinois. The developers had easy access to a similar 32 -processor CM5 at MIT. One of the developers proposed a change to the program that produced a speedup of over 20% on the MIT machine. After a back-of-the-envelope calculation, the proposed “improvement” was rejected!
★Socrates Speedup Paradox Using T_P ≈ T_1/P + T_∞: Original program: T_1 = 2048 seconds, T_∞ = 1 second, so T_32 = 2048/32 + 1 = 65 seconds and T_512 = 2048/512 + 1 = 5 seconds. Proposed program: T_1′ = 1024 seconds, T_∞′ = 8 seconds, so T_32′ = 1024/32 + 8 = 40 seconds and T_512′ = 1024/512 + 8 = 10 seconds. The proposal is faster on 32 processors but twice as slow on 512.
Lesson Work and span can predict performance on large machines better than running times on small machines can.
L ECTURE  1 Performance Measures Scheduling Theory Basic Cilk Programming Cilk’s Scheduler Parallelizing Vector Addition A Chess Lesson Conclusion
Cilk’s Work-Stealing Scheduler Each processor maintains a work deque of ready threads, and it manipulates the bottom of the deque like a stack: a spawn pushes a frame onto the bottom, and a return pops one off.
Cilk’s Work-Stealing Scheduler When a processor runs out of work, it steals a thread from the top of a random victim’s deque.
Performance of Work-Stealing Theorem: Cilk’s work-stealing scheduler achieves an expected running time of T_P = T_1/P + O(T_∞) on P processors. Pseudoproof. A processor is either working or stealing. The total time all processors spend working is T_1. Each steal has a 1/P chance of reducing the span by 1. Thus, the expected cost of all steals is O(PT_∞). Since there are P processors, the expected time is (T_1 + O(PT_∞))/P = T_1/P + O(T_∞). ■
Space Bounds Theorem. Let S_1 be the stack space required by a serial execution of a Cilk program. Then, the space required by a P-processor execution is at most S_P ≤ PS_1. Proof (by induction). The work-stealing algorithm maintains the busy-leaves property: every extant procedure frame with no extant descendants has a processor working on it. ■ [Figure: P = 3 processors, each working on a busy leaf of the spawn tree, each using at most S_1 stack space.]
Linguistic Implications Code like the following executes properly without any risk of blowing out memory: for (i=1; i<1000000000; i++) { spawn foo(i); } sync; M ORAL Better to steal parents than children!
L ECTURE  1 Performance Measures Scheduling Theory Basic Cilk Programming Cilk’s Scheduler Parallelizing Vector Addition A Chess Lesson Conclusion
Key Ideas Cilk is simple: cilk, spawn, sync Recursion, recursion, recursion, … Work & span Work & span Work & span Work & span Work & span Work & span Work & span Work & span Work & span Work & span Work & span Work & span Work & span Work & span Work & span
Minicourse Outline L ECTURE  1 Basic Cilk programming: Cilk keywords, performance measures, scheduling. L ECTURE  2 Analysis of Cilk algorithms: matrix multiplication, sorting, tableau construction. L ABORATORY Programming matrix multiplication in Cilk — Dr. Bradley C. Kuszmaul L ECTURE  3 Advanced Cilk programming: inlets, abort, speculation, data synchronization, & more.