Exploiting Parallelism with Multi-core Technologies James Reinders Date:  Thursday, July 26 Time:  2:35pm - 3:20pm Location:  E142
Coding with TBB Contest threadingbuildingblocks.org Challenge : Integrate TBB in open source projects through 8/31/07 Judging: Best implementation and most benefit achieved from using TBB  Grand Prize : A multi-core laptop and recognition at the upcoming Intel Developers Forum  Details : threadingbuildingblocks.org
Problem Gaining performance from multi-core requires parallel programming Even a simple “parallel for” is tricky for a non-expert to write well with threads. Two aspects to parallel programming Correctness : avoiding race conditions and deadlock Performance : efficient use of resources Hardware threads (keep busy with real work)  Memory space (share between cores) Memory bandwidth (efficient cache line use)
Three Approaches for Improvement
  New language: Cilk, NESL, Haskell, Erlang, Fortress, …
  Language extensions / pragmas: OpenMP. Easier to get acceptance, but requires a special compiler.
  Library: POOMA, Hood, … Easy to use, but domain specific.
Family Tree (figure: timeline marked 1988, 1995, 2001, 2006; influences on Intel® Threading Building Blocks, grouped as Languages, Pragmas, and Libraries)
  Cilk: space efficient scheduler, cache-oblivious algorithms
  Threaded-C: continuation tasks, task stealing
  OpenMP*: fork/join tasks
  OpenMP taskqueue: while & recursion
  Chare Kernel: small tasks
  JSR-166 (FJTask): containers
  STL: generic programming
  STAPL: recursive ranges
  ECMA .NET*: parallel iteration classes
  *Other names and brands may be claimed as the property of others
Enter Intel® Threading Building Blocks 2 years from conception to 1.0 beta Born from a meeting of Principal Engineers who wanted to: Ease parallel programming for variable numbers of cores Build an accessible technology to encourage adoption Take advantage of existing research to meld ease of conception with performance Version 1.0 launched August 2006 Version 1.1 launched April 2007 Version 2.0 Intel provides Threading Building Blocks as an Open Source project with more ports done, and more we can work on together.  threadingbuildingblocks.org
Example (just a peek)
  SERIAL VERSION
    void ApplyFoo( size_t n, int x ) {
        for( size_t i=0; i<n; ++i )
            Foo( i, x );
    }
  PARALLEL VERSION (the way I wish I could write it)
    void ParallelApplyFoo( size_t n, int x ) {
        parallel_for( blocked_range<size_t>(0,n,10),
            <>(const blocked_range<size_t>& range) {
                for( size_t i=range.begin(); i<range.end(); ++i )
                    Foo( i, x );
            } );
    }
Parallel Version (as it can be written)
    class ApplyFoo {
    public:
        int my_x;
        ApplyFoo( int x ) : my_x(x) {}
        void operator()( const blocked_range<size_t>& range ) const {
            for( size_t i=range.begin(); i!=range.end(); ++i )
                Foo( i, my_x );
        }
    };
    void ParallelApplyFoo( size_t n, int x ) {
        parallel_for( blocked_range<size_t>(0,n,10), ApplyFoo(x) );
    }
Underlying concepts
Generic Programming Best known:  C++ Standard Template Library Write best possible algorithm - most general way parallel_for  does not require  specific  type of iteration space, but only that it have  signatures  for recursive splitting Instantiate algorithm to specific situation Enables distribution of broadly-useful high-quality algorithms and data structures
Key Features of Intel Threading Building Blocks Manage work as  divisible tasks   instead of threads Intel TBB maps such tasks onto physical threads Solving cache issues and load balancing for us Full support for  nested parallelism Targets threading for  robust performance portable, scalable perf. for computationally intense portions Interoperable  with other threading packages Emphasizes  scalable, data parallel  programming
Relaxed Sequential Semantics
  TBB emphasizes relaxed sequential semantics.
  Parallelism as accelerator, not mandatory for correctness.
Generic Parallel Algorithms: parallel_for, parallel_while, parallel_reduce, pipeline, parallel_sort, parallel_scan
Concurrent Containers: concurrent_hash_map, concurrent_queue, concurrent_vector
Task scheduler
Synchronization Primitives: atomic, spin_mutex, spin_rw_mutex, queuing_mutex, queuing_rw_mutex, mutex
Memory Allocation: cache_aligned_allocator, scalable_allocator
Serial Example
    static void SerialUpdateVelocity() {
        for( int i=1; i<UniverseHeight-1; ++i )
            for( int j=1; j<UniverseWidth-1; ++j )
                V[i][j] += (S[i][j] - S[i][j-1] + T[i][j] - T[i-1][j])*M[i];
    }
Parallel Version
  (Color key on the original slide: blue = original code, red = provided by TBB, black = boilerplate for library)
    struct UpdateVelocityBody {                                    // task
        void operator()( const blocked_range<int>& range ) const {
            int end = range.end();
            for( int i=range.begin(); i<end; ++i ) {
                for( int j=1; j<UniverseWidth-1; ++j ) {
                    V[i][j] += (S[i][j] - S[i][j-1] + T[i][j] - T[i-1][j])*M[i];
                }
            }
        }
    };
    void ParallelUpdateVelocity() {
        parallel_for( blocked_range<int>( 1, UniverseHeight-1 ),   // parallel control structure
                      UpdateVelocityBody(),
                      auto_partitioner() );                        // task subdivision handler
    }
Range is Generic
  Requirements for parallel_for Range:
    R::R (const R&)                 Copy constructor
    R::~R()                         Destructor
    bool R::empty() const           True if range is empty
    bool R::is_divisible() const    True if range can be partitioned
    R::R (R& r, split)              Split r into two subranges
  Library provides blocked_range and blocked_range2d.
  You can define your own ranges.
  Partitioner calls splitting constructor to spread tasks over range.
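The Range requirements above can be illustrated with a minimal sketch that uses only the C++ standard library. This is not TBB's implementation: `SketchRange`, `SplitTag`, and `parallel_for_sketch` are hypothetical names, and `std::async` stands in for the task scheduler, so it spawns a thread per split instead of stealing work.

```cpp
#include <cassert>
#include <cstddef>
#include <future>

// Hypothetical tag type that selects the splitting constructor.
struct SplitTag {};

// Half-open range [begin, end) that can be split down to a grain size.
class SketchRange {
public:
    SketchRange(std::size_t begin, std::size_t end, std::size_t grain = 1)
        : begin_(begin), end_(end), grain_(grain) {
        assert(grain > 0);
    }
    // Splitting constructor: r keeps the lower half, *this takes the upper half.
    SketchRange(SketchRange& r, SplitTag)
        : begin_(r.begin_ + (r.end_ - r.begin_) / 2), end_(r.end_), grain_(r.grain_) {
        r.end_ = begin_;
    }
    bool empty() const { return begin_ >= end_; }
    bool is_divisible() const { return !empty() && end_ - begin_ > grain_; }
    std::size_t begin() const { return begin_; }
    std::size_t end() const { return end_; }
private:
    std::size_t begin_, end_, grain_;
};

// Recursively split the range; run the upper half asynchronously.
template <typename Range, typename Body>
void parallel_for_sketch(Range range, const Body& body) {
    if (range.empty()) return;
    if (!range.is_divisible()) {
        body(range);                      // leaf: run the body serially on this chunk
        return;
    }
    Range upper(range, SplitTag{});       // range now holds only the lower half
    auto pending = std::async(std::launch::async,
                              [upper, &body] { parallel_for_sketch(upper, body); });
    parallel_for_sketch(range, body);     // lower half runs on the current thread
    pending.get();                        // join; rethrows if the upper half threw
}
```

Calling `parallel_for_sketch(SketchRange(0, n, 10), body)` hands `body` disjoint subranges of at most 10 elements, which is the contract the functor in the parallel examples relies on.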
parallel_reduce parallel_scan parallel_while parallel_sort pipeline
Parallel  pipeline Linear pipeline of stages specify maximum number of items that can be in flight handling arbitrary DAG can take thought but is doable Each stage can be  serial  or  parallel Serial stage = one item at a time, in order. Parallel stage = multiple items at a time, out of order. Uses cache efficiently Each thread carries an item through as many stages as possible. Biases towards finishing old items before tackling new ones.
Parallel pipeline (figure: numbered items 1 to 12 flowing through a parallel stage and two serial stages)
  Parallel stage scales because it can process items in parallel or out of order.
  Serial stage processes items one at a time, in order. Items wait for their turn in a serial stage.
  Tag incoming items with sequence numbers; use the sequence numbers to recover order for a serial stage.
  Controls excessive parallelism by limiting the total number of items flowing through the pipeline.
  Throughput is limited by the throughput of the slowest serial stage.
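The sequence-number idea can be sketched in a few lines of standard C++. `ReorderBuffer` is a hypothetical name, not a TBB class, and the sketch is single-threaded: a real pipeline would also need synchronization around it.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <utility>
#include <vector>

// Holds items that left a parallel stage out of order and releases them
// to the following serial stage strictly by sequence number.
template <typename T>
class ReorderBuffer {
public:
    // Accept the item tagged `seq`; return every item that is now ready, in order.
    std::vector<T> push(std::size_t seq, T item) {
        waiting_.emplace(seq, std::move(item));
        std::vector<T> ready;
        auto it = waiting_.find(next_);
        while (it != waiting_.end()) {
            ready.push_back(std::move(it->second));
            waiting_.erase(it);
            it = waiting_.find(++next_);   // look for the next expected number
        }
        return ready;
    }
private:
    std::map<std::size_t, T> waiting_;     // items that arrived early
    std::size_t next_ = 0;                 // sequence number the serial stage expects
};
```

An item that arrives early simply waits in the map; the arrival of the missing number releases the whole run behind it at once.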
Concurrent Containers, Mutual Exclusion Memory Allocator Task Scheduler
Concurrent Containers Intel TBB provides  concurrent  containers  STL containers are  not safe  under concurrent operations attempting multiple modifications concurrently could corrupt them Standard practice: wrap a lock around STL container accesses Limits accessors to operating one at a time TBB provides fine-grained locking and lockless operations where possible Worse single-thread performance, but better scalability. Can be used with TBB, OpenMP, or native threads.
Concurrent interface requirements
  Some STL interfaces are not designed to support concurrency.
  For example, suppose two threads each execute:
    extern std::queue<T> q;
    if( !q.empty() ) {
        // At this instant, another thread might pop the last element.
        item = q.front();
        q.pop();
    }
  Solution: Intel TBB concurrent_queue has pop_if_present()
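To show why a combined test-and-pop closes the race, here is a minimal sketch built on `std::mutex`. `LockedQueue` is a hypothetical class; it borrows the method name `pop_if_present` from the slide but uses one coarse lock, which is exactly the "wrap a lock around the container" practice rather than TBB's finer-grained approach.

```cpp
#include <cassert>
#include <mutex>
#include <queue>

// Queue whose emptiness check and pop happen under one lock acquisition.
template <typename T>
class LockedQueue {
public:
    void push(const T& value) {
        std::lock_guard<std::mutex> guard(mutex_);
        items_.push(value);
    }
    // Returns false if the queue was empty; otherwise moves the front item to `out`.
    // No other thread can pop between the empty() check and the pop().
    bool pop_if_present(T& out) {
        std::lock_guard<std::mutex> guard(mutex_);
        if (items_.empty()) return false;
        out = items_.front();
        items_.pop();
        return true;
    }
private:
    std::mutex mutex_;
    std::queue<T> items_;
};
```

The caller never sees a separate `empty()` answer that could go stale, so there is no window for another thread to take the last element.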
concurrent_vector<T>
  Dynamically growable array of T: grow_by(n), grow_to_at_least(n)
  Never moves elements until cleared
  Can concurrently access and grow
  Method clear() is not thread-safe with respect to access/resizing
  Example:
    // Append sequence [begin,end) to x in thread-safe way.
    template<typename T>
    void Append( concurrent_vector<T>& x, const T* begin, const T* end ) {
        std::copy( begin, end, x.begin() + x.grow_by(end-begin) );
    }
concurrent_queue<T>
  Preserves local element order: if one thread pushes two values and another thread pops them, they come out in the same order as they went in.
  Two kinds of pops: blocking and non-blocking.
  Method size() returns a signed integer. If size() returns -n, it means n pops await corresponding pushes.
  Caution: queues are cache coolers.
concurrent_hash_map<Key,T,HashCompare>
  Associative table allows concurrent access for reads and updates
    bool insert( accessor& result, const Key& key )       to add or edit
    bool find( accessor& result, const Key& key )         to edit
    bool find( const_accessor& result, const Key& key )   to look up
    bool erase( const Key& key )                          to remove
  Reader locks coexist; writer locks are exclusive
Example: map strings to integers
    struct MyHashCompare {
        static long hash( const char* x ) {
            long h = 0;
            for( const char* s = x; *s; s++ )
                h = (h*157)^*s;
            return h;
        }
        static bool equal( const char* x, const char* y ) {
            return strcmp(x,y)==0;
        }
    };
    typedef concurrent_hash_map<const char*,int,MyHashCompare> StringTable;
    StringTable MyTable;
    void MyUpdateCount( const char* x ) {
        StringTable::accessor a;
        MyTable.insert( a, x );
        a->second += 1;
    }
  Multiple threads can insert and update entries concurrently.
  The accessor object acts as a smart pointer and a writer lock.
Synchronization Primitives Shared Data use mutual exclusion to avoid races TBB mutual exclusion regions are protected by scoped locks The range of the lock is determined by its lifetime (scope) Leaving lock scope calls the destructor, making it exception safe Minimizing lock lifetime avoids possible contention Several mutex behaviors are available Spin vs. queued Writer vs. reader/writer Scoped wrapper of native mutual exclusion function
Example: spin_rw_mutex promotion
    spin_rw_mutex MyMutex;
    int foo() {
        spin_rw_mutex::scoped_lock lock( MyMutex, /*is_writer*/ false );
        …
        if( !lock.upgrade_to_writer() ) { /* reacquire state */ }
        /* perform desired write */
        return 0;  /* Destructor of 'lock' releases 'MyMutex' */
    }
  Exceptions occurring within the locked code range automatically release the lock (lock passes out of scope), avoiding deadlock.
  Any reader lock may be upgraded to a writer lock; upgrade_to_writer() returns false if the lock had to be released before it could be acquired for writing, in which case any state read earlier must be re-read.
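The "reacquire state" step is easiest to see with a lock that cannot upgrade at all. The sketch below assumes C++17 and uses `std::shared_mutex`, which has no upgrade operation, so the reader lock is always dropped before the writer lock is taken; that corresponds to the case where `upgrade_to_writer()` reports failure. `CountTable` is a hypothetical class, not part of TBB.

```cpp
#include <cassert>
#include <map>
#include <mutex>
#include <shared_mutex>
#include <string>

class CountTable {
public:
    // Returns the count for key, inserting 0 on first use.
    int get_or_create(const std::string& key) {
        {
            std::shared_lock<std::shared_mutex> reader(mutex_);
            auto it = counts_.find(key);
            if (it != counts_.end()) return it->second;
        }   // reader lock released here by the scoped lock's destructor
        std::unique_lock<std::shared_mutex> writer(mutex_);
        // Another thread may have inserted the key while no lock was held,
        // so the lookup is repeated: emplace leaves an existing entry untouched.
        return counts_.emplace(key, 0).first->second;
    }
    void increment(const std::string& key) {
        std::unique_lock<std::shared_mutex> writer(mutex_);
        ++counts_[key];
    }
private:
    std::shared_mutex mutex_;
    std::map<std::string, int> counts_;
};
```

Both locks are scoped objects, so an exception thrown inside either region releases the mutex on the way out, as described for TBB's scoped locks.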
Scalable Memory Allocator Problem Memory allocation can bottle-neck a concurrent environment Thread allocation from a global heap requires global locks. Solution Intel® Threading Building Blocks provides tested, tuned, scalable, per-thread memory allocation  Scalable memory allocator interface can be used… As an  allocator  argument to STL template classes As a replacement for  malloc/realloc/free  calls (C programs) As a replacement for  new  and  delete  operators (C++ programs)
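As an illustration of the first usage mode (an allocator argument to an STL template class), here is a minimal standard-conforming allocator, assuming C++17 aligned `operator new`. `AlignedAllocator` is a hypothetical stand-in: it only aligns blocks to a 64-byte boundary (an assumed cache-line size) and does nothing to make allocation scalable, so it shows the plug-in mechanism, not TBB's allocator.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <new>
#include <vector>

template <typename T, std::size_t Alignment = 64>
struct AlignedAllocator {
    using value_type = T;

    AlignedAllocator() = default;
    template <typename U>
    AlignedAllocator(const AlignedAllocator<U, Alignment>&) noexcept {}

    // Explicit rebind is needed because of the non-type template parameter.
    template <typename U>
    struct rebind { using other = AlignedAllocator<U, Alignment>; };

    T* allocate(std::size_t n) {
        return static_cast<T*>(
            ::operator new(n * sizeof(T), std::align_val_t(Alignment)));
    }
    void deallocate(T* p, std::size_t) noexcept {
        ::operator delete(p, std::align_val_t(Alignment));
    }
};

// All instances are interchangeable: memory from one can be freed by another.
template <typename T, typename U, std::size_t A>
bool operator==(const AlignedAllocator<T, A>&, const AlignedAllocator<U, A>&) { return true; }
template <typename T, typename U, std::size_t A>
bool operator!=(const AlignedAllocator<T, A>&, const AlignedAllocator<U, A>&) { return false; }
```

A container opts in through its second template argument, for example `std::vector<int, AlignedAllocator<int>>`; swapping in a different allocator type requires no other change to the container code.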
Task Scheduler
  Underlying the generic task structure is a task scheduler.
  Core to task scheduling is a thread pool, whereby Intel TBB maximizes thread efficiency and manages its complexity.
  The task scheduler is designed to address common performance issues of parallel programming with native threads:
    Problem               TBB Approach
    Oversubscription      One scheduler thread per hardware thread
    "Fair" scheduling     "Greedy" scheduling often wins
    Program complexity    Programmer specifies tasks, not threads
    Load imbalance        Task chunking and work-stealing help balance load
Task Scheduler
  Intel TBB task interest is managed in the task_scheduler_init object
  Thread pool construction is also tied to the life of this object
    Nested construction is reference counted, low overhead
    Put the init object's scope high to avoid pool reconstruction overhead
  Construction specifies thread pool size: automatic, explicit, or deferred
    Dynamic init object lifetime management offers thread pool size control
    #include "tbb/task_scheduler_init.h"
    using namespace tbb;
    int main() {
        task_scheduler_init init;
        …
        return 0;
    }
Tasking Development tools Intel TBB offers facilities to accelerate development Linking with  libtbb_debug.so  (or Win/Mac equivalents) adds checking TBB_DO_ASSERT  macro extends checking into the header/inline code TBB_DO_THREADING_TOOLS  adds hooks for Intel Thread Analysis tools The  tick_count  class offers convenient timing services. tick_count:: now () returns current timestamp tick_count::interval_t:: operator- (const tick_count &t1, const tick_count &t2) double tick_count::interval_t:: seconds () converts intervals to real time ITT_NOTIFY  events can be useful with Intel Thread Analysis tools
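The timing pattern described for `tick_count` (take two timestamps, subtract, convert to seconds) can be sketched with `std::chrono`. `Tick` and `Interval` are hypothetical names that only mirror the shape of the interface listed above; they are not the TBB classes.

```cpp
#include <cassert>
#include <chrono>

// A timestamp taken from a monotonic clock.
struct Tick {
    std::chrono::steady_clock::time_point t;
    static Tick now() { return Tick{std::chrono::steady_clock::now()}; }
};

// The difference between two timestamps.
struct Interval {
    std::chrono::steady_clock::duration d;
    // Convert the clock's native duration to seconds as a double.
    double seconds() const { return std::chrono::duration<double>(d).count(); }
};

inline Interval operator-(const Tick& later, const Tick& earlier) {
    return Interval{later.t - earlier.t};
}
```

Typical use brackets the region of interest: `Tick t0 = Tick::now();`, run the work, then `double elapsed = (Tick::now() - t0).seconds();`.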
open source tour
Open Source – quick ‘tour’ Source library organized around 4 directories src – C++ source for Intel TBB, TBBmalloc and the unit tests include – the standard include files build – catchall for platform-specific build information examples – TBB sample code  Top level index.html offers help on building and porting Build prerequisites: C++ compiler for target environment GNU make Bourne or BASH-compatible shell Some architectures may require an assembler for low-level primitives
Learn more… INTEL booth (today until 5pm!) many Intel engineers – come see us! threadingbuildingblocks.org
