Multithreaded Programming in Cilk, LECTURE 1. Charles E. Leiserson, Supercomputing Technologies Research Group, Computer Science and Artificial Intelligence Laboratory, Massachusetts Institute of Technology
Cilk: A C language for programming dynamic multithreaded applications on shared-memory multiprocessors. Example applications: virus shell assembly, graphics rendering, n-body simulation, heuristic search, dense and sparse matrix computations, friction-stir welding simulation, artificial evolution.
Shared-Memory Multiprocessor. In particular, over the next decade, chip multiprocessors (CMPs) will be an increasingly important platform! (Diagram: processors P, each with a cache $, connected through a network to shared memory and I/O.)
Cilk Is Simple. Cilk extends the C language with just a handful of keywords. Every Cilk program has a serial semantics. Not only is Cilk fast, it provides performance guarantees based on performance abstractions. Cilk is processor-oblivious. Cilk's provably good runtime system automatically manages low-level aspects of parallel execution, including protocols, load balancing, and scheduling. Cilk supports speculative parallelism.
Minicourse Outline. LECTURE 1, Basic Cilk programming: Cilk keywords, performance measures, scheduling. LECTURE 2, Analysis of Cilk algorithms: matrix multiplication, sorting, tableau construction. LABORATORY, Programming matrix multiplication in Cilk, Dr. Bradley C. Kuszmaul. LECTURE 3, Advanced Cilk programming: inlets, abort, speculation, data synchronization, & more.
LECTURE 1: Performance Measures, Scheduling Theory, Basic Cilk Programming, Cilk's Scheduler, Parallelizing Vector Addition, A Chess Lesson, Conclusion
Fibonacci Cilk is a  faithful   extension of C.  A Cilk program’s  serial elision  is always a legal implementation of Cilk semantics.  Cilk provides  no   new data types. int fib (int n) { if (n<2) return (n); else { int x,y; x = fib(n-1); y = fib(n-2); return (x+y); } } C elision cilk  int fib (int n) { if (n<2) return (n); else { int x,y; x =  spawn  fib(n-1); y =  spawn  fib(n-2); sync; return (x+y); } } Cilk code
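Because the serial elision is obtained just by deleting the Cilk keywords, it can be compiled and tested as ordinary C. A minimal sketch:

```c
#include <assert.h>

/* Serial elision of the Cilk fib: delete "cilk", "spawn", and "sync",
   and what remains is a legal C implementation of the same semantics. */
int fib(int n) {
    if (n < 2) return n;
    else {
        int x, y;
        x = fib(n - 1);   /* was: x = spawn fib(n-1); */
        y = fib(n - 2);   /* was: y = spawn fib(n-2); */
                          /* was: sync; */
        return x + y;
    }
}
```

Running the elision gives the same answers the parallel version must produce, e.g. fib(10) = 55.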
Basic Cilk Keywords cilk int fib (int n) { if (n<2) return (n); else { int x,y; x = spawn fib(n-1); y = spawn fib(n-2); sync; return (x+y); } } The keyword cilk identifies a function as a Cilk procedure, capable of being spawned in parallel. spawn: the named child Cilk procedure can execute in parallel with the parent caller. sync: control cannot pass this point until all spawned children have returned.
Dynamic Multithreading cilk int fib (int n) { if (n<2) return (n); else { int x,y; x = spawn fib(n-1); y = spawn fib(n-2); sync; return (x+y); } } The computation dag unfolds dynamically. "Processor oblivious." Example: fib(4). (Diagram: the spawn tree of fib(4), with nodes labeled 4, 3, 2, 2, 1, 1, 1, 0, 0.)
Multithreaded Computation. The dag G = (V, E) represents a parallel instruction stream. Each vertex v ∈ V represents a (Cilk) thread: a maximal sequence of instructions not containing parallel control (spawn, sync, return). Every edge e ∈ E is either a spawn edge, a return edge, or a continue edge. (Diagram: a dag running from an initial thread to a final thread, with spawn, return, and continue edges marked.)
Cactus Stack. Cilk supports C's rule for pointers: A pointer to stack space can be passed from parent to child, but not from child to parent. (Cilk also supports malloc.) Cilk's cactus stack supports several views in parallel. (Diagram: A spawns B and C, and C spawns D and E; the corresponding views of the stack are A B, A C, A C D, and A C E.)
LECTURE 1 Performance Measures Scheduling Theory Basic Cilk Programming Cilk’s Scheduler Parallelizing Vector Addition A Chess Lesson Conclusion
Algorithmic Complexity Measures. TP = execution time on P processors.
Algorithmic Complexity Measures. TP = execution time on P processors. T1 = work.
Algorithmic Complexity Measures. TP = execution time on P processors. T1 = work. T∞ = span.* (*Also called critical-path length or computational depth.)
Algorithmic Complexity Measures. TP = execution time on P processors. T1 = work. T∞ = span.* LOWER BOUNDS: TP ≥ T1/P and TP ≥ T∞. (*Also called critical-path length or computational depth.)
Speedup. Definition: T1/TP = speedup on P processors. If T1/TP = Θ(P) ≤ P, we have linear speedup; if T1/TP = P, we have perfect linear speedup; if T1/TP > P, we have superlinear speedup, which is not possible in our model, because of the lower bound TP ≥ T1/P.
Parallelism. Because we have the lower bound TP ≥ T∞, the maximum possible speedup given T1 and T∞ is T1/T∞ = parallelism = the average amount of work per step along the span.
Example: fib(4). Assume for simplicity that each Cilk thread in fib() takes unit time to execute. Work: T1 = 17. Span: T∞ = 8. (Diagram: the dag of fib(4), with the critical path numbered 1 through 8.)
Example: fib(4). Assume for simplicity that each Cilk thread in fib() takes unit time to execute. Work: T1 = 17. Span: T∞ = 8. Parallelism: T1/T∞ = 2.125. Using many more than 2 processors makes little sense.
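These counts can be checked mechanically. In the slide's unit-time model, each fib(n) instance with n ≥ 2 contributes three threads (before the first spawn, between the two spawns, and after the sync) and each base case contributes one; the sketch below, a hypothetical encoding of those recurrences rather than anything from the lecture, reproduces T1 = 17 and T∞ = 8 for fib(4):

```c
#include <assert.h>

/* Work: total number of unit-time threads in the dag of fib(n). */
int fib_work(int n) {
    if (n < 2) return 1;                           /* base case: one thread */
    return 3 + fib_work(n - 1) + fib_work(n - 2);  /* 3 parent threads + children */
}

/* Span: longest path of threads.  The path through the first child
   crosses 2 parent threads; through the second child, 3 (the parent's
   middle thread sits between the two spawns). */
int fib_span(int n) {
    if (n < 2) return 1;
    int via1 = 2 + fib_span(n - 1);
    int via2 = 3 + fib_span(n - 2);
    return via1 > via2 ? via1 : via2;
}
```

Dividing the two gives the parallelism 17/8 = 2.125 quoted on the slide.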
LECTURE 1: Performance Measures, Scheduling Theory, Basic Cilk Programming, Cilk's Scheduler, Parallelizing Vector Addition, A Chess Lesson, Conclusion
Parallelizing Vector Addition void vadd (real *A, real *B, int n){ int i; for (i=0; i<n; i++) A[i]+=B[i]; } C
Parallelizing Vector Addition. Parallelization strategy: Convert loops to recursion. C: void vadd (real *A, real *B, int n){ int i; for (i=0; i<n; i++) A[i]+=B[i]; } becomes C: void vadd (real *A, real *B, int n){ if (n<=BASE) { int i; for (i=0; i<n; i++) A[i]+=B[i]; } else { vadd (A, B, n/2); vadd (A+n/2, B+n/2, n-n/2); } }
Parallelizing Vector Addition. Parallelization strategy: Convert loops to recursion. Insert Cilk keywords. Cilk: cilk void vadd (real *A, real *B, int n){ if (n<=BASE) { int i; for (i=0; i<n; i++) A[i]+=B[i]; } else { spawn vadd (A, B, n/2); spawn vadd (A+n/2, B+n/2, n-n/2); sync; } } Side benefit: D&C is generally good for caches!
Vector Addition cilk void vadd (real *A, real *B, int n){ if (n<=BASE) { int i; for (i=0; i<n; i++) A[i]+=B[i]; } else { spawn vadd (A, B, n/2); spawn vadd (A+n/2, B+n/2, n-n/2); sync; } }
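As with fib, deleting the keywords gives the serial elision, which can be tested in plain C. A minimal sketch, assuming the slides' real type is double and picking BASE = 4 purely for illustration:

```c
#include <assert.h>

#define BASE 4            /* illustrative choice; the slides leave it abstract */
typedef double real;      /* assumption: "real" on the slides means double */

/* Serial elision of the divide-and-conquer vadd. */
void vadd(real *A, real *B, int n) {
    if (n <= BASE) {
        int i;
        for (i = 0; i < n; i++) A[i] += B[i];
    } else {
        vadd(A, B, n / 2);                   /* was: spawn vadd(...) */
        vadd(A + n / 2, B + n / 2, n - n / 2);
                                             /* was: sync; */
    }
}
```

The recursion bottoms out in simple loops over BASE-sized chunks, and every element ends up with A[i] + B[i].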
Vector Addition Analysis. To add two vectors of length n, where BASE = Θ(1): Work: T1 = Θ(n). Span: T∞ = Θ(lg n). Parallelism: T1/T∞ = Θ(n/lg n).
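The recurrences behind these bounds can be evaluated numerically. Assuming unit cost per procedure instance and BASE = 1 (my simplification, not the slides'), work satisfies T1(n) = T1(⌈n/2⌉) + T1(⌊n/2⌋) + 1 and span satisfies T∞(n) = T∞(⌈n/2⌉) + 1, since the two spawned halves run in parallel:

```c
#include <assert.h>

/* Work of recursive vadd: both halves plus the call itself.
   For n a power of 2 this comes to exactly 2n - 1, which is Theta(n). */
long vadd_work(long n) {
    if (n <= 1) return 1;
    return 1 + vadd_work(n / 2) + vadd_work(n - n / 2);
}

/* Span: only the larger spawned half counts toward the critical path.
   For n a power of 2 this is lg n + 1, which is Theta(lg n). */
long vadd_span(long n) {
    if (n <= 1) return 1;
    return 1 + vadd_span(n - n / 2);   /* n - n/2 = ceil(n/2) */
}
```

For n = 1024 this gives work 2047 and span 11, consistent with Θ(n) work and Θ(lg n) span.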
Another Parallelization C void vadd1 (real *A, real *B, int n){ int i; for (i=0; i<n; i++) A[i]+=B[i]; } void vadd (real *A, real *B, int n){ int j; for (j=0; j<n; j+=BASE) { vadd1(A+j, B+j, min(BASE, n-j)); } } Cilk cilk  void vadd1 (real *A, real *B, int n){ int i; for (i=0; i<n; i++) A[i]+=B[i]; } cilk  void vadd (real *A, real *B, int n){ int j; for (j=0; j<n; j+=BASE) { spawn  vadd1(A+j, B+j, min(BASE, n-j)); } sync; }
Analysis. To add two vectors of length n, where BASE = Θ(1): Work: T1 = Θ(n). Span: T∞ = Θ(n). Parallelism: T1/T∞ = Θ(1). PUNY!
Optimal Choice of BASE. To add two vectors of length n using an optimal choice of BASE to maximize parallelism: Work: T1 = Θ(n). Span: T∞ = Θ(BASE + n/BASE). Choosing BASE = √n ⇒ T∞ = Θ(√n). Parallelism: T1/T∞ = Θ(√n).
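The trade-off is easy to see numerically: the span grows like BASE + n/BASE, and the two terms balance at BASE = √n. A sketch under unit costs (the constant factors are my simplification):

```c
#include <assert.h>

/* Span of the one-level-of-spawns vadd: the serial spawn loop
   contributes n/BASE steps, and one vadd1 chunk contributes BASE. */
long span_for_base(long n, long base) {
    return base + n / base;
}
```

For n = 10000, BASE = 100 = √n gives span 200, while BASE = 10 and BASE = 1000 both give 1010: too-small chunks pay in the spawn loop, too-large chunks pay in the serial chunk.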
LECTURE 1: Performance Measures, Scheduling Theory, Basic Cilk Programming, Cilk's Scheduler, Parallelizing Vector Addition, A Chess Lesson, Conclusion
Scheduling. Cilk allows the programmer to express potential parallelism in an application. The Cilk scheduler maps Cilk threads onto processors dynamically at runtime. Since on-line schedulers are complicated, we'll illustrate the ideas with an off-line scheduler. (Diagram: processors with caches connected by a network to shared memory and I/O.)
Greedy Scheduling. IDEA: Do as much as possible on every step. Definition: A thread is ready if all its predecessors have executed.
Greedy Scheduling. IDEA: Do as much as possible on every step. Definition: A thread is ready if all its predecessors have executed. Complete step: ≥ P threads ready. Run any P. (P = 3 in the diagram.)
Greedy Scheduling. IDEA: Do as much as possible on every step. Definition: A thread is ready if all its predecessors have executed. Complete step: ≥ P threads ready. Run any P. Incomplete step: < P threads ready. Run all of them. (P = 3 in the diagram.)
Greedy-Scheduling Theorem. Theorem [Graham '68 & Brent '75]. Any greedy scheduler achieves TP ≤ T1/P + T∞. Proof. # complete steps ≤ T1/P, since each complete step performs P work. # incomplete steps ≤ T∞, since each incomplete step reduces the span of the unexecuted dag by 1. ■ (P = 3 in the diagram.)
Optimality of Greedy. Corollary. Any greedy scheduler achieves within a factor of 2 of optimal. Proof. Let TP* be the execution time produced by the optimal scheduler. Since TP* ≥ max{T1/P, T∞} (lower bounds), we have TP ≤ T1/P + T∞ ≤ 2·max{T1/P, T∞} ≤ 2TP*. ■
Linear Speedup. Corollary. Any greedy scheduler achieves near-perfect linear speedup whenever P ≪ T1/T∞. Proof. Since P ≪ T1/T∞ is equivalent to T∞ ≪ T1/P, the Greedy Scheduling Theorem gives us TP ≤ T1/P + T∞ ≈ T1/P. Thus, the speedup is T1/TP ≈ P. ■ Definition. The quantity (T1/T∞)/P is called the parallel slackness.
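Plugging numbers into the greedy bound shows the corollary at work. Suppose, hypothetically, T1 = 1,000,000 and T∞ = 100, so the parallelism is 10,000. With P = 100 the slackness is 100, and the bound already guarantees about 99% of perfect linear speedup:

```c
#include <assert.h>

/* Greedy upper bound on running time: T_P <= T_1/P + T_inf. */
double greedy_bound(double work, double span, double p) {
    return work / p + span;
}
```

Here the bound gives TP ≤ 10,000 + 100 = 10,100, hence speedup ≥ 1,000,000 / 10,100 ≈ 99 on 100 processors.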
Cilk Performance. Cilk's "work-stealing" scheduler achieves TP = T1/P + O(T∞) expected time (provably); TP ≈ T1/P + T∞ time (empirically). Near-perfect linear speedup if P ≪ T1/T∞. Instrumentation in Cilk allows the user to determine accurate measures of T1 and T∞. The average cost of a spawn in Cilk-5 is only 2–6 times the cost of an ordinary C function call, depending on the platform.
LECTURE 1: Performance Measures, Scheduling Theory, Basic Cilk Programming, Cilk's Scheduler, Parallelizing Vector Addition, A Chess Lesson, Conclusion
Cilk Chess Programs. ★Socrates placed 3rd in the 1994 International Computer Chess Championship running on NCSA's 512-node Connection Machine CM5. ★Socrates 2.0 took 2nd place in the 1995 World Computer Chess Championship running on Sandia National Labs' 1824-node Intel Paragon. Cilkchess placed 1st in the 1996 Dutch Open running on a 12-processor Sun Enterprise 5000. It placed 2nd in 1997 and 1998 running on Boston University's 64-processor SGI Origin 2000. Cilkchess tied for 3rd in the 1999 WCCC running on NASA's 256-node SGI Origin 2000.
★Socrates Normalized Speedup. (Plot: measured speedup T1/TP, normalized by the parallelism T1/T∞, versus P normalized the same way, on log-log axes from 0.01 to 1. The measurements track the model TP = T1/P + T∞, between the asymptotes TP = T1/P and TP = T∞.)
Developing ★Socrates. For the competition, ★Socrates was to run on a 512-processor Connection Machine Model CM5 supercomputer at the University of Illinois. The developers had easy access to a similar 32-processor CM5 at MIT. One of the developers proposed a change to the program that produced a speedup of over 20% on the MIT machine. After a back-of-the-envelope calculation, the proposed "improvement" was rejected!
★Socrates Speedup Paradox. TP ≈ T1/P + T∞. Original program: T1 = 2048 seconds, T∞ = 1 second; T32 = 2048/32 + 1 = 65 seconds; T512 = 2048/512 + 1 = 5 seconds. Proposed program: T1′ = 1024 seconds, T∞′ = 8 seconds; T′32 = 1024/32 + 8 = 40 seconds; T′512 = 1024/512 + 8 = 10 seconds.
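The paradox is pure arithmetic on the model TP ≈ T1/P + T∞ and is easy to verify: the proposed program wins on the 32-processor machine but loses on the 512-processor one.

```c
#include <assert.h>

/* Predicted running time (seconds) from the work/span model
   T_P = T_1/P + T_inf, using the slide's integer figures. */
int predict(int work, int span, int p) {
    return work / p + span;
}
```

With work 2048 and span 1 for the original versus work 1024 and span 8 for the proposal, the model reproduces the 65 vs. 40 and 5 vs. 10 second figures.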
Lesson Work  and  span  can predict performance on large machines better than running times on small machines can.
LECTURE 1: Performance Measures, Scheduling Theory, Basic Cilk Programming, Cilk's Scheduler, Parallelizing Vector Addition, A Chess Lesson, Conclusion
Cilk’s Work-Stealing Scheduler Each processor maintains a  work deque   of ready threads, and it manipulates the bottom of the deque like a stack. P P P P Spawn!
Cilk’s Work-Stealing Scheduler Each processor maintains a  work deque  of ready threads, and it manipulates the bottom of the deque like a stack. P P P P Spawn! Spawn!
Cilk’s Work-Stealing Scheduler Each processor maintains a   work deque  of ready threads, and it manipulates the bottom of the deque like a stack. P P P P Return!
Cilk’s Work-Stealing Scheduler Each processor maintains a  work deque  of ready threads, and it manipulates the bottom of the deque like a stack. P P P P Return!
Cilk’s Work-Stealing Scheduler Each processor maintains a  work deque  of ready threads, and it manipulates the bottom of the deque like a stack. P P P P Steal! When a processor runs out of work, it  steals  a thread from the top of a  random  victim’s deque.
Cilk’s Work-Stealing Scheduler Each processor maintains a  work deque  of ready threads, and it manipulates the bottom of the deque like a stack. P P P P Steal! When a processor runs out of work, it  steals  a thread from the top of a  random  victim’s deque.
Cilk’s Work-Stealing Scheduler Each processor maintains a  work deque  of ready threads, and it manipulates the bottom of the deque like a stack. P P P P When a processor runs out of work, it  steals  a thread from the top of a  random  victim’s deque.
Cilk’s Work-Stealing Scheduler Each processor maintains a  work deque  of ready threads, and it manipulates the bottom of the deque like a stack. P P P P Spawn! When a processor runs out of work, it  steals  a thread from the top of a  random  victim’s deque.
Performance of Work-Stealing. Theorem: Cilk's work-stealing scheduler achieves an expected running time of TP ≤ T1/P + O(T∞) on P processors. Pseudoproof. A processor is either working or stealing. The total time all processors spend working is T1. Each steal has a 1/P chance of reducing the span by 1. Thus, the expected cost of all steals is O(PT∞). Since there are P processors, the expected time is (T1 + O(PT∞))/P = T1/P + O(T∞). ■
Space Bounds. Theorem. Let S1 be the stack space required by a serial execution of a Cilk program. Then, the space required by a P-processor execution is at most SP ≤ PS1. Proof (by induction). The work-stealing algorithm maintains the busy-leaves property: every extant procedure frame with no extant descendants has a processor working on it. ■ (P = 3 in the diagram.)
Linguistic Implications. Code like the following executes properly without any risk of blowing out memory: for (i=1; i<1000000000; i++) { spawn foo(i); } sync; MORAL: Better to steal parents than children!
LECTURE 1: Performance Measures, Scheduling Theory, Basic Cilk Programming, Cilk's Scheduler, Parallelizing Vector Addition, A Chess Lesson, Conclusion
Key Ideas Cilk is simple:  cilk ,  spawn ,  sync Recursion, recursion, recursion, … Work & span Work & span Work & span Work & span Work & span Work & span Work & span Work & span Work & span Work & span Work & span Work & span Work & span Work & span Work & span
Minicourse Outline. LECTURE 1, Basic Cilk programming: Cilk keywords, performance measures, scheduling. LECTURE 2, Analysis of Cilk algorithms: matrix multiplication, sorting, tableau construction. LABORATORY, Programming matrix multiplication in Cilk, Dr. Bradley C. Kuszmaul. LECTURE 3, Advanced Cilk programming: inlets, abort, speculation, data synchronization, & more.

More Related Content

PDF
Anlysis and design of algorithms part 1
PPT
multi threaded and distributed algorithms
PPTX
Lecture 2 data structures and algorithms
DOCX
Basic Computer Engineering Unit II as per RGPV Syllabus
PPT
Time andspacecomplexity
PDF
Analysis and design of algorithms part2
PPT
Complexity of Algorithm
PPT
Parallel algorithms
Anlysis and design of algorithms part 1
multi threaded and distributed algorithms
Lecture 2 data structures and algorithms
Basic Computer Engineering Unit II as per RGPV Syllabus
Time andspacecomplexity
Analysis and design of algorithms part2
Complexity of Algorithm
Parallel algorithms

What's hot (19)

PDF
Symbolic Execution as DPLL Modulo Theories
PDF
14 - 08 Feb - Dynamic Programming
PPTX
Lecture 5: Asymptotic analysis of algorithms
PDF
Introduction to Algorithms Complexity Analysis
PDF
Data Structures - Lecture 8 - Study Notes
PPT
02 order of growth
PDF
Algorithm Analyzing
RTF
Design and Analysis of algorithms
PPT
Algorithm analysis
PPT
Analysis of Algorithm
PPTX
asymptotic analysis and insertion sort analysis
PPT
Introduction to Algorithms
PDF
Data Structure: Algorithm and analysis
PPTX
Performance analysis(Time & Space Complexity)
PPT
how to calclute time complexity of algortihm
PDF
PPT
Data Structures- Part2 analysis tools
PDF
Stack Hybridization: A Mechanism for Bridging Two Compilation Strategies in a...
DOC
Algorithms Question bank
Symbolic Execution as DPLL Modulo Theories
14 - 08 Feb - Dynamic Programming
Lecture 5: Asymptotic analysis of algorithms
Introduction to Algorithms Complexity Analysis
Data Structures - Lecture 8 - Study Notes
02 order of growth
Algorithm Analyzing
Design and Analysis of algorithms
Algorithm analysis
Analysis of Algorithm
asymptotic analysis and insertion sort analysis
Introduction to Algorithms
Data Structure: Algorithm and analysis
Performance analysis(Time & Space Complexity)
how to calclute time complexity of algortihm
Data Structures- Part2 analysis tools
Stack Hybridization: A Mechanism for Bridging Two Compilation Strategies in a...
Algorithms Question bank
Ad

Similar to Lecture 1 (20)

PDF
Data Structure & Algorithms - Mathematical
PDF
Analysis and Matrix Multiplication using Parallel
PPTX
Intro to super. advance algorithm..pptx
PPT
chapter1.ppt
PDF
design and analysis of algorithm basic concepts.pdf
PDF
008. PROGRAM EFFICIENCY computer science.pdf
PPTX
Analysis of Algorithms (1).pptx, asymptotic
PPT
Matlab Nn Intro
PPTX
01 - DAA - PPT.pptx
PPTX
Module-1.pptxbdjdhcdbejdjhdbchchchchchjcjcjc
PDF
PPTX
AA_Unit 1_part-I.pptx
PPTX
Analysis of Algorithms, recurrence relation, solving recurrences
PPTX
Unit i basic concepts of algorithms
PDF
BCS401 ADA First IA Test Question Bank.pdf
PDF
Towards an SMT-based approach for Quantitative Information Flow
PPT
lecture3.pptlecture3 data structures pptt
PDF
A peek on numerical programming in perl and python e christopher dyken 2005
PPT
Basic_analysis.ppt
Data Structure & Algorithms - Mathematical
Analysis and Matrix Multiplication using Parallel
Intro to super. advance algorithm..pptx
chapter1.ppt
design and analysis of algorithm basic concepts.pdf
008. PROGRAM EFFICIENCY computer science.pdf
Analysis of Algorithms (1).pptx, asymptotic
Matlab Nn Intro
01 - DAA - PPT.pptx
Module-1.pptxbdjdhcdbejdjhdbchchchchchjcjcjc
AA_Unit 1_part-I.pptx
Analysis of Algorithms, recurrence relation, solving recurrences
Unit i basic concepts of algorithms
BCS401 ADA First IA Test Question Bank.pdf
Towards an SMT-based approach for Quantitative Information Flow
lecture3.pptlecture3 data structures pptt
A peek on numerical programming in perl and python e christopher dyken 2005
Basic_analysis.ppt
Ad

Recently uploaded (20)

PPTX
Digital-Transformation-Roadmap-for-Companies.pptx
PPTX
VMware vSphere Foundation How to Sell Presentation-Ver1.4-2-14-2024.pptx
PDF
MIND Revenue Release Quarter 2 2025 Press Release
PDF
Empathic Computing: Creating Shared Understanding
PDF
Building Integrated photovoltaic BIPV_UPV.pdf
PPTX
Understanding_Digital_Forensics_Presentation.pptx
PDF
Encapsulation theory and applications.pdf
PPTX
Big Data Technologies - Introduction.pptx
PPTX
Detection-First SIEM: Rule Types, Dashboards, and Threat-Informed Strategy
PPTX
sap open course for s4hana steps from ECC to s4
PDF
Reach Out and Touch Someone: Haptics and Empathic Computing
PDF
Encapsulation_ Review paper, used for researhc scholars
PDF
Spectral efficient network and resource selection model in 5G networks
PDF
Chapter 3 Spatial Domain Image Processing.pdf
DOCX
The AUB Centre for AI in Media Proposal.docx
PDF
Profit Center Accounting in SAP S/4HANA, S4F28 Col11
PDF
Review of recent advances in non-invasive hemoglobin estimation
PDF
cuic standard and advanced reporting.pdf
PPTX
Cloud computing and distributed systems.
PDF
Approach and Philosophy of On baking technology
Digital-Transformation-Roadmap-for-Companies.pptx
VMware vSphere Foundation How to Sell Presentation-Ver1.4-2-14-2024.pptx
MIND Revenue Release Quarter 2 2025 Press Release
Empathic Computing: Creating Shared Understanding
Building Integrated photovoltaic BIPV_UPV.pdf
Understanding_Digital_Forensics_Presentation.pptx
Encapsulation theory and applications.pdf
Big Data Technologies - Introduction.pptx
Detection-First SIEM: Rule Types, Dashboards, and Threat-Informed Strategy
sap open course for s4hana steps from ECC to s4
Reach Out and Touch Someone: Haptics and Empathic Computing
Encapsulation_ Review paper, used for researhc scholars
Spectral efficient network and resource selection model in 5G networks
Chapter 3 Spatial Domain Image Processing.pdf
The AUB Centre for AI in Media Proposal.docx
Profit Center Accounting in SAP S/4HANA, S4F28 Col11
Review of recent advances in non-invasive hemoglobin estimation
cuic standard and advanced reporting.pdf
Cloud computing and distributed systems.
Approach and Philosophy of On baking technology

Lecture 1

  • 1. Multithreaded Programming in Cilk L ECTURE 1 Charles E. Leiserson Supercomputing Technologies Research Group Computer Science and Artificial Intelligence Laboratory Massachusetts Institute of Technology
  • 2. Cilk A C language for programming dynamic multithreaded applications on shared-memory multiprocessors. virus shell assembly graphics rendering n -body simulation heuristic search dense and sparse matrix computations friction-stir welding simulation artificial evolution Example applications:
  • 3. Shared-Memory Multiprocessor In particular, over the next decade, chip multiprocessors (CMP’s) will be an increasingly important platform! P P P Network … Memory I/O $ $ $
  • 4. Cilk Is Simple Cilk extends the C language with just a handful of keywords. Every Cilk program has a serial semantics . Not only is Cilk fast, it provides performance guarantees based on performance abstractions. Cilk is processor-oblivious . Cilk’s provably good runtime system auto-matically manages low-level aspects of parallel execution, including protocols, load balancing, and scheduling. Cilk supports speculative parallelism.
  • 5. Minicourse Outline L ECTURE 1 Basic Cilk programming: Cilk keywords, performance measures, scheduling. L ECTURE 2 Analysis of Cilk algorithms: matrix multiplication, sorting, tableau construction. L ABORATORY Programming matrix multiplication in Cilk — Dr. Bradley C. Kuszmaul L ECTURE 3 Advanced Cilk programming: inlets, abort, speculation, data synchronization, & more.
  • 6. L ECTURE 1 Performance Measures Scheduling Theory Basic Cilk Programming Cilk’s Scheduler Conclusion Parallelizing Vector Addition A Chess Lesson
  • 7. Fibonacci Cilk is a faithful extension of C. A Cilk program’s serial elision is always a legal implementation of Cilk semantics. Cilk provides no new data types. int fib (int n) { if (n<2) return (n); else { int x,y; x = fib(n-1); y = fib(n-2); return (x+y); } } C elision cilk int fib (int n) { if (n<2) return (n); else { int x,y; x = spawn fib(n-1); y = spawn fib(n-2); sync; return (x+y); } } Cilk code
  • 8. Basic Cilk Keywords cilk int fib (int n) { if (n<2) return (n); else { int x,y; x = spawn fib(n-1); y = spawn fib(n-2); sync; return (x+y); } } Identifies a function as a Cilk procedure , capable of being spawned in parallel. The named child Cilk procedure can execute in parallel with the parent caller. Control cannot pass this point until all spawned children have returned.
  • 9. Dynamic Multithreading cilk int fib (int n) { if (n<2) return (n); else { int x,y; x = spawn fib(n-1); y = spawn fib(n-2); sync ; return (x+y); } } The computation dag unfolds dynamically. Example: fib(4) “ Processor oblivious” 4 3 2 2 1 1 1 0 0
  • 10. Multithreaded Computation The dag G = ( V , E ) represents a parallel instruction stream. Each vertex v 2 V represents a (Cilk) thread : a maximal sequence of instructions not containing parallel control ( spawn , sync , return ). Every edge e 2 E is either a spawn edge, a return edge, or a continue edge. spawn edge return edge continue edge initial thread final thread
  • 11. Cactus Stack Cilk supports C’s rule for pointers: A pointer to stack space can be passed from parent to child, but not from child to parent. (Cilk also supports malloc .) Cilk’s cactus stack supports several views in parallel. B A C E D A A B A C A C D A C E Views of stack C B A D E
  • 12. LECTURE 1 Performance Measures Scheduling Theory Basic Cilk Programming Cilk’s Scheduler Parallelizing Vector Addition A Chess Lesson Conclusion
  • 13. Algorithmic Complexity Measures T P = execution time on P processors
  • 14. Algorithmic Complexity Measures T P = execution time on P processors T 1 = work
  • 15. Algorithmic Complexity Measures T P = execution time on P processors T 1 = work T 1 = span * * Also called critical-path length or computational depth .
  • 16. Algorithmic Complexity Measures T P = execution time on P processors T 1 = work L OWER B OUNDS T P ¸ T 1 / P T P ¸ T 1 * Also called critical-path length or computational depth . T 1 = span *
  • 17. Speedup Definition: T 1 /T P = speedup on P processors. If T 1 /T P =  ( P ) · P , we have linear speedup ; = P , we have perfect linear speedup ; > P , we have superlinear speedup , which is not possible in our model, because of the lower bound T P ¸ T 1 / P .
  • 18. Parallelism Because we have the lower bound T P ¸ T 1 , the maximum possible speedup given T 1 and T 1 is T 1 /T 1 = parallelism = the average amount of work per step along the span.
  • 19. Example: fib(4) Span: T 1 = ? Work: T 1 = ? Assume for simplicity that each Cilk thread in fib() takes unit time to execute. Span: T 1 = 8 3 4 5 6 1 2 7 8 Work: T 1 = 17
  • 20. Example: fib(4) Parallelism: T 1 / T 1 = 2.125 Span: T 1 = ? Work: T 1 = ? Assume for simplicity that each Cilk thread in fib() takes unit time to execute. Span: T 1 = 8 Work: T 1 = 17 Using many more than 2 processors makes little sense.
  • 21. L ECTURE 1 Performance Measures Scheduling Theory Basic Cilk Programming Cilk’s Scheduler Parallelizing Vector Addition A Chess Lesson Conclusion
  • 22. Parallelizing Vector Addition void vadd (real *A, real *B, int n){ int i; for (i=0; i<n; i++) A[i]+=B[i]; } C
  • 23. Parallelizing Vector Addition C C if (n<=BASE) { int i; for (i=0; i<n; i++) A[i]+=B[i]; } else { void vadd (real *A, real *B, int n){ vadd (A, B, n/2); vadd (A+n/2, B+n/2, n-n/2); Parallelization strategy: Convert loops to recursion. void vadd (real *A, real *B, int n){ int i; for (i=0; i<n; i++) A[i]+=B[i]; } } }
  • 24. Parallelizing Vector Addition if (n<=BASE) { int i; for (i=0; i<n; i++) A[i]+=B[i]; } else { C Parallelization strategy: Convert loops to recursion. Insert Cilk keywords. void vadd (real *A, real *B, int n){ cil k spawn vadd (A, B, n/2; vadd (A+n/2, B+n/2, n-n/2; spawn Side benefit: D&C is generally good for caches! sync ; C ilk void vadd (real *A, real *B, int n){ int i; for (i=0; i<n; i++) A[i]+=B[i]; } } }
  • 25. Vector Addition cil k void vadd (real *A, real *B, int n){ if (n<=BASE) { int i; for (i=0; i<n; i++) A[i]+=B[i]; } else { spawn vadd (A, B, n/2); spawn vadd (A+n/2, B+n/2, n-n/2); sync ; } }
  • 26. Vector Addition Analysis To add two vectors of length n , where BASE =  (1) : Work: T 1 =  Span: T 1 =  Parallelism: T 1 / T 1 =   ( n /lg n )  ( n )  (lg n ) BASE
  • 27. Another Parallelization C void vadd1 (real *A, real *B, int n){ int i; for (i=0; i<n; i++) A[i]+=B[i]; } void vadd (real *A, real *B, int n){ int j; for (j=0; j<n; j+=BASE) { vadd1(A+j, B+j, min(BASE, n-j)); } } Cilk cilk void vadd1 (real *A, real *B, int n){ int i; for (i=0; i<n; i++) A[i]+=B[i]; } cilk void vadd (real *A, real *B, int n){ int j; for (j=0; j<n; j+=BASE) { spawn vadd1(A+j, B+j, min(BASE, n-j)); } sync; }
  • 28. Analysis To add two vectors of length n , where BASE =  (1) :  (1)  ( n ) … …  ( n ) BASE Work: T 1 =  Span: T 1 =  Parallelism: T 1 / T 1 =  PUNY!
  • 29. Optimal Choice of BASE To add two vectors of length n using an optimal choice of BASE to maximize parallelism: Parallelism: T 1 / T 1 =   ( √ n ) … Work: T 1 =   ( n ) BASE … Span: T 1 =   (BASE + n /BASE) Choosing BASE = √ n ) T 1 =  √ n )
  • 30. L ECTURE 1 Performance Measures Scheduling Theory Basic Cilk Programming Cilk’s Scheduler Parallelizing Vector Addition A Chess Lesson Conclusion
  • 31. Scheduling Cilk allows the programmer to express potential parallelism in an application. The Cilk scheduler maps Cilk threads onto processors dynamically at runtime. Since on-line schedulers are complicated, we’ll illustrate the ideas with an off-line scheduler. P P P Network … Memory I/O $ $ $
  • 32. Greedy Scheduling I DEA : Do as much as possible on every step. Definition: A thread is ready if all its predecessors have executed .
  • 33. Greedy Scheduling I DEA : Do as much as possible on every step. Complete step ¸ P threads ready. Run any P . Definition: A thread is ready if all its predecessors have executed . P = 3
  • 34. Greedy Scheduling I DEA : Do as much as possible on every step. Complete step ¸ P threads ready. Run any P . Incomplete step < P threads ready. Run all of them. Definition: A thread is ready if all its predecessors have executed . P = 3
  • 35. Greedy-Scheduling Theorem Theorem [Graham ’68 & Brent ’75]. Any greedy scheduler achieves T P  T 1 / P + T  . Proof . # complete steps · T 1 / P , since each complete step performs P work. # incomplete steps · T 1 , since each incomplete step reduces the span of the unexecuted dag by 1 . ■ P = 3
  • 36. Optimality of Greedy Corollary. Any greedy scheduler achieves within a factor of 2 of optimal. Proof . Let T P * be the execution time produced by the optimal scheduler. Since T P * ¸ max{ T 1 / P , T 1 } (lower bounds), we have T P · T 1 / P + T 1 · 2 ¢ max{ T 1 / P , T 1 } · 2 T P * . ■
  • 37. Linear Speedup Corollary. Any greedy scheduler achieves near-perfect linear speedup whenever P ¿ T 1 / T 1 . Proof. Since P ¿ T 1 / T 1 is equivalent to T 1 ¿ T 1 / P , the Greedy Scheduling Theorem gives us T P · T 1 / P + T 1 ¼ T 1 / P . Thus, the speedup is T 1 / T P ¼ P . ■ Definition. The quantity ( T 1 / T 1 )/ P is called the parallel slackness .
  • 38. Cilk Performance Cilk’s “work-stealing” scheduler achieves T P = T 1 / P + O ( T 1 ) expected time (provably); T P  T 1 / P + T 1 time (empirically). Near-perfect linear speedup if P ¿ T 1 / T 1 . Instrumentation in Cilk allows the user to determine accurate measures of T 1 and T 1 . The average cost of a spawn in Cilk-5 is only 2–6 times the cost of an ordinary C function call, depending on the platform.
  • 39. L ECTURE 1 Performance Measures Scheduling Theory Basic Cilk Programming Cilk’s Scheduler Parallelizing Vector Addition A Chess Lesson Conclusion
  • 40. Cilk Chess Programs  Socrates placed 3rd in the 1994 International Computer Chess Championship running on NCSA’s 512 -node Connection Machine CM5.  Socrates 2.0 took 2nd place in the 1995 World Computer Chess Championship running on Sandia National Labs’ 1824 -node Intel Paragon. Cilkchess placed 1st in the 1996 Dutch Open running on a 12 -processor Sun Enterprise 5000. It placed 2nd in 1997 and 1998 running on Boston University’s 64 -processor SGI Origin 2000. Cilkchess tied for 3rd in the 1999 WCCC running on NASA’s 256 -node SGI Origin 2000.
  • 41.  Socrates Normalized Speedup T P = T 1 / P + T  measured speedup 0.01 0.1 1 0.01 0.1 1 T P = T  T P = T 1 / P T 1 /T P T 1 /T  P T 1 /T 
  • 42. Developing  Socrates For the competition,  Socrates was to run on a 512 -processor Connection Machine Model CM5 supercomputer at the University of Illinois. The developers had easy access to a similar 32 -processor CM5 at MIT. One of the developers proposed a change to the program that produced a speedup of over 20% on the MIT machine. After a back-of-the-envelope calculation, the proposed “improvement” was rejected!
★Socrates Speedup Paradox Using T_P ≈ T_1/P + T_∞: Original program: T_1 = 2048 seconds, T_∞ = 1 second, so T_32 = 2048/32 + 1 = 65 seconds and T_512 = 2048/512 + 1 = 5 seconds. Proposed program: T_1′ = 1024 seconds, T_∞′ = 8 seconds, so T_32′ = 1024/32 + 8 = 40 seconds and T_512′ = 1024/512 + 8 = 10 seconds. The proposal is faster on 32 processors but twice as slow on 512.
Lesson Work and span can predict performance on large machines better than running times on small machines can.
L ECTURE  1 Performance Measures Scheduling Theory Basic Cilk Programming Cilk’s Scheduler Parallelizing Vector Addition A Chess Lesson Conclusion
Cilk’s Work-Stealing Scheduler Each processor maintains a work deque of ready threads, and it manipulates the bottom of the deque like a stack: a spawn pushes a frame onto the bottom, and a return pops one off.
Cilk’s Work-Stealing Scheduler When a processor runs out of work, it steals a thread from the top of a random victim’s deque.
Performance of Work-Stealing Theorem: Cilk’s work-stealing scheduler achieves an expected running time of T_P = T_1/P + O(T_∞) on P processors. Pseudoproof. A processor is either working or stealing. The total time all processors spend working is T_1. Each steal has a 1/P chance of reducing the span by 1. Thus, the expected cost of all steals is O(PT_∞). Since there are P processors, the expected time is (T_1 + O(PT_∞))/P = T_1/P + O(T_∞). ■
Space Bounds Theorem. Let S_1 be the stack space required by a serial execution of a Cilk program. Then, the space required by a P-processor execution is at most S_P ≤ PS_1. Proof (by induction). The work-stealing algorithm maintains the busy-leaves property: every extant procedure frame with no extant descendants has a processor working on it. ■ [Figure: P = 3 processors, each working on a busy leaf of the spawn tree, each using at most S_1 stack space.]
Linguistic Implications Code like the following executes properly without any risk of blowing out memory: for (i=1; i<1000000000; i++) { spawn foo(i); } sync; M ORAL Better to steal parents than children!
L ECTURE  1 Performance Measures Scheduling Theory Basic Cilk Programming Cilk’s Scheduler Parallelizing Vector Addition A Chess Lesson Conclusion
Key Ideas Cilk is simple: cilk, spawn, sync Recursion, recursion, recursion, … Work & span Work & span Work & span Work & span Work & span Work & span Work & span Work & span Work & span Work & span Work & span Work & span Work & span Work & span Work & span
Minicourse Outline L ECTURE  1 Basic Cilk programming: Cilk keywords, performance measures, scheduling. L ECTURE  2 Analysis of Cilk algorithms: matrix multiplication, sorting, tableau construction. L ABORATORY Programming matrix multiplication in Cilk — Dr. Bradley C. Kuszmaul L ECTURE  3 Advanced Cilk programming: inlets, abort, speculation, data synchronization, & more.