Learning and Development
       Presents




OPEN TALK SERIES
A series of illuminating talks and interactions that open our minds to new ideas and concepts; that make us look for newer or better ways of doing what we do; or point us to exciting things we have never done before. A range of topics on Technology, Business, Fun and Life.

Be part of the learning experience at Aditi.
Join the talks. It's free. Free as in freedom at work, not free beer.
Speak at these events. Or bring an expert/friend to talk.
Mail LEAD with topic and availability.
Parallel Programming

    Sundararajan Subramanian
        Aditi Technologies



Introduction to Parallel Computing
• The challenge
  – Provide the abstractions , programming
    paradigms, and algorithms needed to
    effectively design, implement, and maintain
    applications that exploit the parallelism
    provided by the underlying hardware in order
    to solve modern problems.
Single-core CPU chip
[Diagram: a CPU chip containing a single core]
Multi-core architectures
[Diagram: a multi-core CPU chip with Core 1, Core 2, Core 3, and Core 4 on one die]
Multi-core CPU chip
• The cores fit on a single processor socket
• Also called CMP (Chip Multi-Processor)
[Diagram: cores 1–4 side by side on one chip]
The cores run in parallel
[Diagram: thread 1 through thread 4, each running on core 1 through core 4]
Within each core, threads are time-sliced (just like on a uniprocessor)
[Diagram: several threads multiplexed onto each of cores 1–4]
Instruction-level parallelism
• Parallelism at the machine-instruction level
• The processor can re-order, pipeline
  instructions, split them into
  microinstructions, do aggressive branch
  prediction, etc.
• Instruction-level parallelism enabled rapid
  increases in processor speeds over the
  last 15 years

Instruction-level parallelism
• for (int i = 0; i < 1000; i++) { a[0]++; a[0]++; }
  – both increments touch the same element, so they must execute in order

• for (int i = 0; i < 1000; i++) { a[0]++; a[1]++; }
  – the two increments are independent, so the processor can execute them in parallel
Thread-level parallelism (TLP)
• This is parallelism on a coarser scale
• A server can serve each client in a separate
  thread (Web server, database server)
• A computer game can do AI, graphics, and
  physics in three separate threads
• Single-core superscalar processors cannot
  fully exploit TLP
• Multi-core architectures are the next step in
  processor evolution: explicitly exploiting TLP
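A minimal sketch of coarse-grained TLP along the lines of the game example above. The three worker methods are hypothetical stand-ins for real subsystems; each independent activity gets its own thread, so a multi-core CPU can run them truly in parallel.

```csharp
using System;
using System.Threading;

class TlpSketch
{
    // Hypothetical independent activities; in a game these might be
    // AI, graphics, and physics updates for one frame.
    static void RunAI()       { /* ... */ }
    static void RunGraphics() { /* ... */ }
    static void RunPhysics()  { /* ... */ }

    static void Main()
    {
        // One thread per coarse-grained activity.
        var threads = new[]
        {
            new Thread(RunAI),
            new Thread(RunGraphics),
            new Thread(RunPhysics)
        };
        foreach (var t in threads) t.Start();
        foreach (var t in threads) t.Join();
        Console.WriteLine("all subsystems finished");
    }
}
```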
A technique complementary to multi-core:
         Simultaneous multithreading

• Problem addressed: the processor pipeline can get stalled:
  – Waiting for the result of a long floating-point (or integer) operation
  – Waiting for data to arrive from memory
• While stalled, the other execution units wait unused

[Diagram: processor pipeline — decoder, trace cache, uCode ROM, rename/alloc, uop queues, schedulers, integer and floating-point units, L1 D-Cache/D-TLB, L2 cache and control, BTB and I-TLB, bus. Source: Intel]
Simultaneous multithreading (SMT)
• Permits multiple independent threads to execute
  SIMULTANEOUSLY on the SAME core
• Weaving together multiple “threads”
  on the same core

• Example: if one thread is waiting for a floating
  point operation to complete, another thread can
  use the integer units


Without SMT, only a single thread can run at any given time
[Diagram: the pipeline with only the floating-point path active — Thread 1: floating point]
Without SMT, only a single thread can run at any given time
[Diagram: the pipeline with only the integer path active — Thread 2: integer operation]
SMT processor: both threads can run concurrently
[Diagram: Thread 1 using the floating-point units while Thread 2 uses the integer units on the same core]
But: Can’t simultaneously use the same functional unit
[Diagram: Thread 1 and Thread 2 both targeting the integer unit — IMPOSSIBLE. This scenario is impossible with SMT on a single core (assuming a single integer unit)]
SMT is not a “true” parallel processor
• Enables better threading (e.g. up to 30%)
• The OS and applications perceive each
  simultaneous thread as a separate
  “virtual processor”
• The chip has only a single copy
  of each resource
• Compare to multi-core:
  each core has its own copy of resources
Multi-core: threads can run on separate cores
[Diagram: two complete cores, each with its own pipeline and caches — Thread 1 on one core, Thread 2 on the other]
Multi-core: threads can run on separate cores
[Diagram: the same two cores running Thread 3 and Thread 4]
Combining Multi-core and SMT
• Cores can be SMT-enabled (or not)
• The different combinations:
  – Single-core, non-SMT: standard uniprocessor
  – Single-core, with SMT
  – Multi-core, non-SMT
  – Multi-core, with SMT: our fish machines
• The number of SMT threads:
  2, 4, or sometimes 8 simultaneous threads
• Intel calls them “hyper-threads”
SMT Dual-core: all four threads can run concurrently
[Diagram: two SMT cores — Threads 1 and 3 on the first core, Threads 2 and 4 on the second]
Parallel Programming
Designs with private L2 caches
[Diagram, left: CORE0 and CORE1, each with a private L1 cache and a private L2 cache, sharing memory. Diagram, right: CORE0 and CORE1, each with private L1 and L2 caches plus L3 caches, sharing memory.]

Both L1 and L2 are private.
Examples: AMD Opteron, AMD Athlon, Intel Pentium D.

A design with L3 caches.
Example: Intel Itanium 2.
Private vs shared caches?
• Advantages/disadvantages?




Private vs shared caches
• Advantages of private:
  – They are closer to the core, so access is faster
  – Reduces contention
• Advantages of shared:
  – Threads on different cores can share the
    same cache data
  – More cache space is available if a single (or a
    few) high-performance thread runs on the
    system
Parallel Architectures
• Use multiple
  – Datapaths
  – Memory units
  – Processing units
Parallel Architectures
• SIMD
  – Single instruction stream, multiple data streams
[Diagram: one Control Unit driving several Processing Units through an Interconnect]
Parallel Architectures
• MIMD
  – Multiple instruction streams, multiple data streams
[Diagram: several Processing/Control Units connected through an Interconnect]
Parallelism in Visual Studio 2010
[Diagram of the parallel stack:
• Integrated Tooling — Parallel Debugger Toolwindows; Profiler Concurrency Analysis
• Managed libraries — Programming Models: PLINQ, Task Parallel Library; Data Structures; Concurrency Runtime: ThreadPool, Task Scheduler, Resource Manager
• Native libraries — Programming Models: Parallel Pattern Library, Agents Library; Data Structures; Concurrency Runtime: Task Scheduler, Resource Manager
• Everything runs on Operating System Threads]
Multi-threading Today
• Divide the total number of activities across n
  processors
• With 2 processors, divide it by 2.
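The idea above can be sketched as a manual range partition: split the work into n chunks, one per processor, and run each chunk on its own thread. This is a minimal illustration, not the scheduler the later slides describe.

```csharp
using System;
using System.Threading;

class DivideWork
{
    static void Main()
    {
        int[] data = new int[1000];
        int n = Environment.ProcessorCount;        // e.g. 2 procs -> 2 chunks
        int chunk = (data.Length + n - 1) / n;     // ceiling division
        var threads = new Thread[n];

        for (int p = 0; p < n; p++)
        {
            int start = p * chunk;                 // per-iteration locals, so the
            int end = Math.Min(start + chunk, data.Length); // lambda captures fresh copies
            threads[p] = new Thread(() =>
            {
                for (int i = start; i < end; i++) data[i] = i * 2;
            });
            threads[p].Start();
        }
        foreach (var t in threads) t.Join();

        Console.WriteLine(data[999]);   // 1998
    }
}
```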
User Mode Scheduler
CLR Thread Pool
[Diagram: the program thread enqueues work items into a global queue; worker threads 1 through p dequeue and execute them]
User Mode Scheduler For Tasks
CLR Thread Pool: Work-Stealing
[Diagram: the program thread enqueues tasks (Task 1 through Task 6) into a global queue; each worker thread 1 through p also has its own local queue, and idle workers steal tasks from other workers' queues]
DEMO
Task-based Programming

ThreadPool Summary
ThreadPool.QueueUserWorkItem(…);

System.Threading.Tasks

Starting:
Task.Factory.StartNew(…);

Parent/Child:
var p = new Task(() => {
    var t = new Task(…);
});

Continue/Wait/Cancel:
Task t = …
Task p = t.ContinueWith(…);
t.Wait(2000);
t.Cancel();

Tasks with results:
Task<int> f =
  new Task<int>(() => C());
…
int result = f.Result;
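A small runnable version of the snippets above, with a placeholder C() standing in for real work: start a task, attach a continuation, and read a Task&lt;int&gt; result (which blocks until the task finishes).

```csharp
using System;
using System.Threading.Tasks;

class TaskBasics
{
    static int C() { return 21; }   // stand-in for a real computation

    static void Main()
    {
        // Start a task on the scheduler.
        Task t = Task.Factory.StartNew(() => Console.WriteLine("work"));

        // A continuation runs after t completes.
        Task done = t.ContinueWith(_ => Console.WriteLine("continuation"));
        done.Wait();

        // A task with a result: reading Result blocks until it finishes.
        var f = new Task<int>(() => C());
        f.Start();
        Console.WriteLine(f.Result * 2);   // 42
    }
}
```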
Coordination Data Structures (1 of 3)

Concurrent Collections
• BlockingCollection<T>
• ConcurrentBag<T>
• ConcurrentDictionary<TKey,TValue>
• ConcurrentLinkedList<T>
• ConcurrentQueue<T>
• ConcurrentStack<T>
• IProducerConsumerCollection<T>
• Partitioner, Partitioner<T>, OrderablePartitioner<T>

[Diagram: producers (P) block when the collection is full; consumers (C) block when it is empty]
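The blocking behavior in the diagram above can be shown with BlockingCollection&lt;T&gt;: with a bounded capacity, Add blocks a producer when the collection is full, and GetConsumingEnumerable blocks the consumer when it is empty.

```csharp
using System;
using System.Collections.Concurrent;
using System.Threading.Tasks;

class ProducerConsumer
{
    static void Main()
    {
        // Capacity 2: the producer blocks in Add when the collection is full;
        // the consumer blocks inside GetConsumingEnumerable when it is empty.
        var queue = new BlockingCollection<int>(boundedCapacity: 2);

        var producer = Task.Factory.StartNew(() =>
        {
            for (int i = 1; i <= 5; i++) queue.Add(i);
            queue.CompleteAdding();   // lets the consuming loop terminate
        });

        int sum = 0;
        foreach (int item in queue.GetConsumingEnumerable()) sum += item;

        producer.Wait();
        Console.WriteLine(sum);   // 1+2+3+4+5 = 15
    }
}
```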
Coordination Data Structures (2 of 3)

Synchronization Primitives
• Barrier
• CountdownEvent
• ManualResetEventSlim
• SemaphoreSlim
• SpinLock
• SpinWait

[Diagrams: a Barrier looping through phases, running a postPhaseAction between phases; a CountdownEvent releasing waiters once its count reaches zero]
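A brief sketch of the two primitives in the diagrams: CountdownEvent releases its waiter once the count reaches zero, and Barrier runs a post-phase action each time all participants reach it.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

class SyncPrimitives
{
    static void Main()
    {
        // CountdownEvent: Wait() returns once Signal() has been called 3 times.
        var countdown = new CountdownEvent(3);
        for (int i = 0; i < 3; i++)
            Task.Factory.StartNew(() => countdown.Signal());
        countdown.Wait();
        Console.WriteLine("all three signaled");

        // Barrier: a phase completes only when every participant has called
        // SignalAndWait; the post-phase action runs between phases.
        var barrier = new Barrier(2, b =>
            Console.WriteLine("phase " + b.CurrentPhaseNumber + " done"));
        Action worker = () =>
        {
            barrier.SignalAndWait();   // phase 0
            barrier.SignalAndWait();   // phase 1
        };
        Parallel.Invoke(worker, worker);
    }
}
```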
Coordination Data Structures (3 of 3)

Initialization Primitives
• Lazy<T>, LazyVariable<T>, LazyInitializer
• ThreadLocal<T>

Cancellation Primitives
• CancellationToken
• CancellationTokenSource
• ICancelableOperation

[Diagram: MyMethod( ) creates a CancellationTokenSource; across the thread boundary, Foo(…, CancellationToken ct) calls Bar(…, CancellationToken ct), which blocks in ManualResetEventSlim.Wait( ct ) on the CancellationToken]
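A minimal version of the cancellation flow in the diagram, with a hypothetical Foo standing in for real work: the source stays with the caller, only the token crosses the thread boundary, and the worker polls it.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

class CancellationSketch
{
    // The token crosses the thread boundary by value; only the
    // CancellationTokenSource can trigger cancellation.
    static void Foo(CancellationToken ct)
    {
        while (true)
        {
            if (ct.IsCancellationRequested)
            {
                Console.WriteLine("cancelled");
                return;
            }
            Thread.Sleep(10);   // simulated work
        }
    }

    static void Main()
    {
        var cts = new CancellationTokenSource();
        Task worker = Task.Factory.StartNew(() => Foo(cts.Token));

        Thread.Sleep(50);
        cts.Cancel();           // request cancellation
        worker.Wait();          // Foo observes the token and returns
    }
}
```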
