Multicore Programming
Agenda
• Part 1 – Current state of affairs
• Part 2 – Multithreaded algorithms
• Part 3 – Task Parallel Library
Multicore Programming
Part 1: Current state of affairs
Why Moore's law is not working anymore
• Power consumption
• Wire delays
• DRAM access latency
• Diminishing returns of more instruction-level parallelism
Power consumption

[Figure: power density (W/cm²) on a log scale from 1 to 10,000, plotted against year ('70 to '10) for the 8080, 386, 486, and Pentium® processors, with reference levels for Hot Plate, Nuclear Reactor, Rocket Nozzle, and Sun's Surface.
Source: Pat Gelsinger, Intel Developer Forum, Spring 2004]
Wire delays
DRAM access latency
Diminishing returns
• '80s: 10 CPI → 1 CPI (CPI = cycles per instruction)
• '90s: 1 CPI → 0.5 CPI
• '00s: multicore
The Free Lunch Is Over: A Fundamental Turn Toward Concurrency in Software
        -- Herb Sutter
Survival
• To scale performance, put many processing cores on the microprocessor chip
• The new edition of Moore's law is about doubling the number of cores
Quotations
• No matter how fast processors get, software consistently finds new ways to eat up the extra speed
• "If you haven't done so already, now is the time to take a hard look at the design of your application, determine what operations are CPU-sensitive now or are likely to become so soon, and identify how those places could benefit from concurrency."
        -- Herb Sutter, C++ Architect at Microsoft (March 2005)
• "After decades of single core processors, the high volume processor industry has gone from single to dual to quad-core in just the last two years. Moore's Law scaling should easily let us hit the 80-core mark in mainstream processors within the next ten years and quite possibly even less."
        -- Justin Rattner, CTO, Intel (February
What keeps us away from multicore
• A sequential way of thinking
• The belief that parallel programming is difficult and error-prone
• Unwillingness to accept that the sequential era is over
• Neglecting performance
What has been done
• Many frameworks have been created that bring parallelism to the application level
• Vendors try hard to teach the programming community how to write parallel programs
• MIT and other education centers have done a lot of research in this area
Multicore Programming
Part 2: Multithreaded algorithms
Chapter 27: Multithreaded Algorithms (Introduction to Algorithms, 3rd edition)
Multithreaded algorithms
• No single architecture of parallel computers → no single, widely accepted model of parallel computing
• We rely on a parallel shared-memory computer
Dynamic multithreaded model (DMM)
• Lets the programmer express "logical parallelism" without worrying about the issues of static threading
• Two main features:
  - Nested parallelism (the parent can proceed while a spawned child computes its result)
  - Parallel loops (iterations of the loop can execute concurrently)
DMM - advantages
• A simple extension of the serial model: only 3 new keywords (parallel, spawn, and sync)
• Provides a theoretically clean way of quantifying parallelism, based on the notions of "work" and "span"
• Many MT algorithms based on nested parallelism follow naturally from the divide-and-conquer approach
Multithreaded execution model
Work
Span
Speedup
Parallelism
Performance summary
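These five slides are images in the original deck; in the standard CLRS notation they presumably define the following quantities, stated here for reference:

  work:        $T_1$, the total running time on one processor
  span:        $T_\infty$, the length of the critical path (the longest chain of dependent computations)
  speedup:     $T_1 / T_P$ on $P$ processors, which can never exceed $P$
  parallelism: $T_1 / T_\infty$, the maximum possible speedup
  work law:    $T_P \ge T_1 / P$;  span law: $T_P \ge T_\infty$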
Example: fib(4)
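The fib(4) slide presumably shows the computation DAG of the CLRS P-Fib procedure. A minimal C# sketch of the same spawn/sync structure, using the TPL tasks introduced in Part 3 (the mapping of spawn onto a Task is my illustration, not code from the deck):

using System;
using System.Threading.Tasks;

class ParallelFib
{
    // P-Fib in TPL terms: "spawn" becomes starting a Task,
    // "sync" becomes waiting on its Result.
    static long Fib(int n)
    {
        if (n < 2) return n;                               // base case
        var x = Task.Factory.StartNew(() => Fib(n - 1));   // spawn Fib(n - 1)
        long y = Fib(n - 2);                               // runs concurrently with the child
        return x.Result + y;                               // sync: wait for the spawned child
    }

    static void Main() => Console.WriteLine(Fib(4));       // fib(4) = 3
}

Spawning a task per call is far too fine-grained to be fast in practice; the point is only to mirror the structure of the DAG on the slide.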
Scheduler role
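The slide content is an image; the standard result it presumably presents is the greedy-scheduler bound from CLRS: on $P$ processors a greedy scheduler achieves $T_P \le T_1/P + T_\infty$. Since any schedule must satisfy both the work law $T_P \ge T_1/P$ and the span law $T_P \ge T_\infty$, a greedy scheduler is always within a factor of 2 of optimal.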
Analyzing MT algorithms: Matrix multiplication

P-Square-Matrix-Multiply(A, B):
1. n = A.rows
2. let C be a new n × n matrix
3. parallel for i = 1 to n
4.     parallel for j = 1 to n
5.         c_ij = 0
6.         for k = 1 to n
7.             c_ij = c_ij + a_ik · b_kj
8. return C
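As a hedged sketch, here is the same algorithm in C# using the TPL's Parallel.For (covered in Part 3). Only the outer loop is parallelized, which is usually enough to occupy all cores:

using System.Threading.Tasks;

static double[,] PSquareMatrixMultiply(double[,] A, double[,] B)
{
    int n = A.GetLength(0);                      // n = A.rows
    var C = new double[n, n];                    // let C be a new n × n matrix
    Parallel.For(0, n, i =>                      // parallel for i: rows are
    {                                            // partitioned across worker threads
        for (int j = 0; j < n; j++)
        {
            double sum = 0;
            for (int k = 0; k < n; k++)          // serial inner loop, as in the pseudocode
                sum += A[i, k] * B[k, j];
            C[i, j] = sum;                       // each thread writes disjoint elements,
        }                                        // so no locking is needed
    });
    return C;
}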
Analyzing MT algorithms: Matrix multiplication
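The analysis slide is an image; the standard CLRS analysis of P-Square-Matrix-Multiply is: work $T_1(n) = \Theta(n^3)$ (three nested loops of $n$ iterations each); span $T_\infty(n) = \Theta(\lg n) + \Theta(\lg n) + \Theta(n) = \Theta(n)$ (each parallel for contributes $\Theta(\lg n)$ of control depth, and the serial inner loop contributes $\Theta(n)$); hence parallelism $T_1/T_\infty = \Theta(n^2)$, which is enormous for any realistic matrix size.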
Chess Lesson
Multicore Programming
Part 3: Task Parallel Library
TPL building blocks
• Consists of:
  - Tasks
  - Thread-safe scalable collections
  - Phases and work exchange
  - Partitioning
  - Looping
  - Control
  - Breaking
  - Exceptions
  - Results
Data parallelism

Parallel.ForEach(letters, ch => Capitalize(ch));
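A self-contained sketch of the slide's one-liner (letters and Capitalize are illustrative names from the slide, not TPL APIs):

using System;
using System.Threading.Tasks;

class DataParallelismDemo
{
    static char Capitalize(char ch) => char.ToUpper(ch);   // stand-in for the slide's Capitalize

    static void Main()
    {
        var letters = new[] { 'a', 'b', 'c', 'd', 'e' };
        // Data parallelism: the same operation is applied to every element,
        // and the TPL partitions the input across thread-pool workers.
        // Output order is nondeterministic.
        Parallel.ForEach(letters, ch => Console.Write(Capitalize(ch)));
    }
}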
Task parallelism

Parallel.Invoke(() => Average(), () => Minimum(), …);
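A self-contained sketch (Average and Minimum stand in for any independent operations):

using System;
using System.Linq;
using System.Threading.Tasks;

class TaskParallelismDemo
{
    static void Main()
    {
        int[] data = Enumerable.Range(1, 1000000).ToArray();
        double average = 0;
        int minimum = 0;

        // Task parallelism: different, independent operations run concurrently;
        // Parallel.Invoke returns when all of them have finished.
        Parallel.Invoke(
            () => average = data.Average(),
            () => minimum = data.Min());

        Console.WriteLine("average = " + average + ", minimum = " + minimum);
    }
}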
Thread Pool in .NET 3.5
Thread Pool in .NET 4.0
Task Scheduler & Thread pool
• .NET 3.5 ThreadPool.QueueUserWorkItem disadvantages:
  - Zero information about each work item
  - Fairness: a single FIFO queue must be maintained
• .NET 4.0 improvements:
  - More efficient FIFO queue (ConcurrentQueue)
  - Enhanced API that gets more information from the user:
    - Task
    - Work stealing
    - Thread injection
    - Waiting for completion, handling exceptions, getting the computation result
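A sketch of the difference (the summing lambda is illustrative):

using System;
using System.Threading;
using System.Threading.Tasks;

class SchedulerDemo
{
    static void Main()
    {
        // .NET 3.5 style: fire-and-forget; the pool gets no information about
        // the work item, and the caller gets no handle for waiting, results,
        // or exceptions. (It may not even run before the process exits.)
        ThreadPool.QueueUserWorkItem(_ => Console.WriteLine("work item ran"));

        // .NET 4.0 style: a Task is a first-class handle to the work.
        Task<int> sum = Task.Factory.StartNew(() =>
        {
            int s = 0;
            for (int i = 1; i <= 100; i++) s += i;
            return s;
        });

        try
        {
            Console.WriteLine(sum.Result);    // wait for completion: prints 5050
        }
        catch (AggregateException ex)         // exceptions are captured in the task
        {                                     // and rethrown when observed
            Console.WriteLine(ex.InnerException);
        }
    }
}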
New Primitives
• Thread-safe, scalable collections
  - IProducerConsumerCollection<T>
    - ConcurrentQueue<T>
    - ConcurrentStack<T>
    - ConcurrentBag<T>
    - ConcurrentDictionary<TKey,TValue>
• Phases and work exchange
  - Barrier
  - BlockingCollection<T>
  - CountdownEvent
• Partitioning
  - {Orderable}Partitioner<T>
    - Partitioner.Create
• Exception handling
  - AggregateException
• Initialization
  - Lazy<T>
    - LazyInitializer.EnsureInitialized<T>
  - ThreadLocal<T>
• Locks
  - ManualResetEventSlim
  - SemaphoreSlim
  - SpinLock
  - SpinWait
• Cancellation
  - CancellationToken{Source}
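A small sketch combining three of these primitives (a bounded BlockingCollection for work exchange, tasks, and a CancellationTokenSource):

using System;
using System.Collections.Concurrent;
using System.Threading;
using System.Threading.Tasks;

class PrimitivesDemo
{
    static void Main()
    {
        var queue = new BlockingCollection<int>(boundedCapacity: 4);
        var cts = new CancellationTokenSource();

        var producer = Task.Factory.StartNew(() =>
        {
            for (int i = 0; i < 10; i++)
                queue.Add(i, cts.Token);      // blocks while the buffer is full
            queue.CompleteAdding();           // signal: no more items are coming
        });

        var consumer = Task.Factory.StartNew(() =>
        {
            // GetConsumingEnumerable blocks until items arrive and completes
            // once CompleteAdding has been called and the queue drains.
            foreach (int item in queue.GetConsumingEnumerable(cts.Token))
                Console.WriteLine(item);
        });

        Task.WaitAll(producer, consumer);     // cts.Cancel() would abort both loops
    }
}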
References
• The Free Lunch Is Over: A Fundamental Turn Toward Concurrency in Software (Herb Sutter)
• MIT Introduction to Algorithms video lectures
• Chapter 27, "Multithreaded Algorithms", Introduction to Algorithms, 3rd edition
• CLR 4.0 ThreadPool Improvements: Part 1
• Multicore Programming Primer
• ThreadPool on Channel 9