Scheduler Performance in Many-Core Architecture

               Itai Avron
                MSc Thesis
  Technion - Electrical Engineering Dept.

                   May 2, 2012
Agenda
•   Introduction and Motivation
•   The Plural Architecture
•   Improved Scheduler
•   Analysis of Simulation Results
•   Conclusions and Future Work




Background
• CPU performance improvements
  – In the past : Increase of clock frequency
     • We reached the power wall
  – Today : Multi-cores
  – The future : Many-cores
     •   Homogeneous -> Heterogeneous?
     •   What architecture?
     •   Memory model?
     •   Scheduler?
     •   …


Scheduling In Many-Core Architecture
• Software scheduling is slow
  – A lot of cores to schedule
  – Fine-granularity tasks -> many tasks to schedule at
    the same time
     • To enhance parallelism


• Dedicated Hardware required!


Scheduler Challenges
• Latency
  – Message delay
     • From core to scheduler (completed prev. task)
     • From scheduler to core (start new task)
  – Schedule time
     • to allocate tasks to cores


• Capacity
  – Number of instances/tasks scheduled per cycle
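
As a rough illustration of how latency and capacity combine, the sketch below estimates how many cycles pass before the last instance of a task is started. The model is an assumption made for illustration and is not taken from the thesis: the scheduler issues up to `capacity` instances per cycle, and each scheduler-to-core message takes `latency` cycles. The function name and the numbers in the usage line are hypothetical.

```python
import math

def cycles_until_last_start(instances, capacity, latency):
    """Cycles until the last instance starts, under a simplified model.

    capacity: instances issued per cycle, or None for an unlimited scheduler.
    latency:  cycles for a scheduler-to-core message.
    """
    # An unlimited scheduler issues everything in a single cycle.
    issue_cycles = 1 if capacity is None else math.ceil(instances / capacity)
    return issue_cycles + latency

# Example: 100 instances, 5 instances per cycle, 20-cycle message latency.
print(cycles_until_last_start(100, 5, 20))
```

In this toy model a low capacity can dominate the delay even when the message latency is zero, which is why the two challenges are listed separately.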

Other Architectures
•   Graphics Processing Units (GPUs)
•   Tilera
•   Larrabee
•   XMT
•   Rigel
•   Data-Driven Multithreading Model
•   Task Superscalar

GPU – NVIDIA Fermi
• Composed of many
  processing elements
  (PEs)
• Scheduling is done in
  hardware
   – Schedule warps
   – Only one control flow
• SIMD


Tilera
• Composed of tiles
   – Each tile is independent
• Static Scheduling
   – Determined during
     compile time
• MIMD



 [Agarwal (MIT) 1997- ]

Larrabee (Intel)
• Array of processor cores
• Software controlled
  Scheduling
   – Lightweight distributed
     task-stealing scheduler
• MIMD




XMT
• Composed of TCUs
  – Thread control units
• Hardware Scheduling
  – Using Prefix-Sum
• PRAM Programming
  Model
• SPMD


 [Vishkin (UMD) 2005-]

Rigel
• Composed of tiles of
  clusters
   – Each cluster holds 8
     cores
• Software Scheduling
   – Allocation via task
     queues
   – Synchronization via
     Barriers
• SPMD
    [Patel (UIUC) 2008- ]

Data-Driven Multithreading Model
• A Thread
  Synchronization Unit
  (TSU)
   – Connects to existing
     cores
• Hardware Scheduling
   – Using Task Map
• Producer-Consumer
  Programming Model
 [Evripidou (U Cyprus) 1997- ]

Task Superscalar
• An Out-of-Order Task
  Pipeline
   – Connects to existing cores
   – No Speculations
• Hardware Scheduling
   – Creation of new tasks is
     done in software
   – Management and
     Allocation is done in
     Hardware
• StarSs Programming
  Model
                                                [Etsion (BSC) 2009- ]


Agenda
•   Introduction and Motivation
•   The Plural Architecture
•   Improved Scheduler
•   Analysis of Simulation Results
•   Conclusions and Future Work




The ‘Plural’ System Architecture
[Block diagram: Scheduler, Cores, Memory Network, Memory banks]
[Bayer (Technion) 1988 ]

The System
• Many RISC cores
  – In-Order, Blocking Load/Store
  – No data cache
• Shared On-Chip memory banks
  – Interleaved address
  – Access takes 2 cycles
     • Core retries on collision
• Hardware synchronization and scheduling unit
  – Distributes tasks to cores according to a task map
  – Collects task completion messages from cores
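
The interleaved banks and the retry-on-collision rule can be sketched as follows. This is a minimal illustration under stated assumptions, not the thesis design: it assumes low-order interleaving (address modulo bank count), uses the 256-bank configuration simulated later in the talk, and assumes that the lowest-numbered core wins when several cores hit the same bank in one cycle. The names `bank_of` and `colliding_cores` are hypothetical.

```python
from collections import defaultdict

N_BANKS = 256  # matches the simulated configuration (256 cores, 256 banks)

def bank_of(address, n_banks=N_BANKS):
    """Low-order interleaving (assumed): consecutive addresses map to consecutive banks."""
    return address % n_banks

def colliding_cores(requests, n_banks=N_BANKS):
    """requests maps core id -> address, all issued in the same cycle.

    Returns the sorted core ids that lose arbitration and must retry.
    Assumed policy: the lowest core id wins each contested bank.
    """
    by_bank = defaultdict(list)
    for core in sorted(requests):
        by_bank[bank_of(requests[core], n_banks)].append(core)
    # Every core after the first one in a bank's list has to retry.
    return sorted(c for cores in by_bank.values() for c in cores[1:])

# Cores 0 and 1 both map to bank 0, so core 1 retries; core 2 uses bank 1.
print(colliding_cores({0: 0, 1: 256, 2: 1}))
```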

Plural Task Map
• Precedence Graph
• Created by the programmer
• Duplicable Tasks
  – All instances are concurrent
[Figure: example task map. Each node shows the task name and its number of instances: A (1), C (5000), B (1200), D (130), E (1). Edges are dependencies; one edge carries the condition cntr=4.]

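
A task map of this kind can be held in a small data structure. The sketch below is illustrative only: the instance counts follow the example figure, but the edges are hypothetical (the figure's arrows are not recoverable from the text), conditions such as cntr=4 are left out, and `Task` and `ready_tasks` are names invented here.

```python
from dataclasses import dataclass, field

@dataclass
class Task:
    name: str
    instances: int                              # concurrent instances of a duplicable task
    preds: list = field(default_factory=list)   # tasks that must finish first

def ready_tasks(task_map, finished, started):
    """Names of tasks not yet started whose predecessors have all finished."""
    return [t.name for t in task_map.values()
            if t.name not in started and all(p in finished for p in t.preds)]

# Instance counts from the example figure; the dependencies are assumed.
task_map = {t.name: t for t in [
    Task("A", 1),
    Task("B", 1200, ["A"]),
    Task("C", 5000, ["A"]),
    Task("D", 130, ["B", "C"]),
    Task("E", 1, ["D"]),
]}

print(ready_tasks(task_map, finished=set(), started=set()))
print(ready_tasks(task_map, finished={"A"}, started={"A"}))
```

With these assumed edges only A is ready at the start, and B and C become ready together once A has finished.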
Plural Scheduling



•   Central Synchronization Unit (CSU)
     –   Manages allocation, scheduling, and synchronization of tasks
     –   Collects task-termination events
     –   Programmed by the task map
     –   Allocates packs (sets) of parallel task-instances
•   Distribution Network (DN)
     –   Organized as a tree with the CSU as its root
     –   Mediates between the CSU and the processing cores
     –   Downstream flow -> decomposes allocated packs of task instances
     –   Upstream flow -> unifies task-termination events from the cores
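
The downstream decomposition and upstream unification can be sketched with a recursive split. Everything specific here is an assumption for illustration: a binary tree, a power-of-two number of cores, packs modelled as contiguous index ranges `(first, count)`, and an even split at each node. The slides only state that the DN decomposes packs and unifies termination events.

```python
def decompose(first, count, n_cores):
    """Split a pack of `count` consecutive instances over a binary subtree.

    Returns one (first, count) sub-pack per leaf (core) that receives work.
    n_cores is assumed to be a power of two.
    """
    if count == 0:
        return []
    if n_cores == 1:
        return [(first, count)]
    left = (count + 1) // 2            # left child takes the larger half
    half = n_cores // 2
    return (decompose(first, left, half)
            + decompose(first + left, count - left, half))

def unify(terminations):
    """Upstream flow: merge per-core termination counts into a single total."""
    return sum(count for _, count in terminations)

packs = decompose(0, 3, 4)   # a pack of 3 instances sent to a 4-core subtree
print(packs)
print(unify(packs))
```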


Scheduling Process
[Cycle diagram]
1.   CSU allocates ready-to-run tasks
2.   DN distributes packs to cores
3.   Cores send termination messages on completion
4.   DN unifies termination messages
5.   CSU processes newly eligible tasks, and the cycle repeats

Agenda
•   Introduction and Motivation
•   The Plural Architecture
•   Improved Scheduler
•   Analysis of Simulation Results
•   Conclusions and Future Work




Scheduler Improvements
• Enhancing scheduler capacity
• Reducing scheduling latency
• Adding task queues to each core
  – Sharing queues
• Adding task length indicator




Simulation Environment
• Matlab Simulator [Friedman, Khoretz, Ginosar, PDP 2012]
  – Based on Eyal and Dima’s simulator
• Benchmarks
  – 3 Demo programs
  – 3 Benchmarks
       • JPEG, Mandelbrot, Linear Solver
• 24 System configurations
  –   256 cores, 256 banks
  –   Scheduler capacity: 5, 10, infinite [instances]
  –   Latency (scheduler—cores): 0, 20 [cycles]
  –   Task queue depth: 0, 1, 2, 10 [instances]
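
The count of 24 is consistent with a full cross product of the three varied parameters (3 capacities x 2 latencies x 4 queue depths). The sketch below enumerates it, assuming the configurations are indeed that cross product; the dictionary keys are hypothetical names.

```python
from itertools import product

capacities = [5, 10, float("inf")]   # instances scheduled per cycle
latencies = [0, 20]                  # cycles between scheduler and cores
queue_depths = [0, 1, 2, 10]         # instances held in each core's task queue

configs = [
    {"cores": 256, "banks": 256,
     "capacity": c, "latency": l, "queue_depth": q}
    for c, l, q in product(capacities, latencies, queue_depths)
]

print(len(configs))  # 3 * 2 * 4 = 24
```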

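Benchmark Task Maps
[Figure: task maps of the benchmarks. Each entry below gives the task name (number of instances, length in time units); dependency edges are not recoverable from the text.]
• Normal and Shared Variable
  – A (1, 23), B (100, 15), D (600, 20), C (500, 35), E (130, 18), F (1, 27)
  – Condition: cntr=4
• Parallel
  – A (1, 23), B (2000, 25), D (2600, 26), C (2500, 35), E (2300, 18), F (1, 19)
  – Condition: cntr=4
• Mandelbrot
  – A (1, 540), B (1, 225), C (4096, 80), D (4096, 7)
• JPEG
  – A (1, 10), B (1, 10)
  – C (1, 5715), E (1, 12810), G (300, 2418), J (200, 1490), I (100, 1952), K (100, 1659)
  – D (300, 181), F (300, 705), H (300, 2927)
  – L (1, 460), M (1, 2548), N (1, 207)
• Linear Solver
  – A (1, 236), B (1, 40), C (1, 214), D (1, 172), F (1, 58), E (100, 126)
  – G (7720, 197), H (100, 78), J (1, 47), K (1, 87)
  – Condition: cntr=5
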
Agenda
•   Introduction and Motivation
•   The Plural Architecture
•   Improved Scheduler
•   Analysis of Simulation Results
•   Conclusions and Future Work




Analysis of Simulation Results
•   “Normal” Benchmark
•   “Parallel” Benchmark
•   “Shared Variable” Benchmark
•   JPEG Benchmark
•   Linear Solver Benchmark
•   Mandelbrot Benchmark

• Benchmarks Analysis

“Normal” Benchmark
Activity Per core, Latency = 0 cycles
[Figure: simulation plot with the “Normal” task map alongside]

“Normal” Benchmark
Unbalanced scheduling, Latency = 0 cycles
[Figure: simulation plot with the “Normal” task map alongside]

“Normal” Benchmark
Activity Per core, Latency = 20 cycles
[Figure: simulation plot with the “Normal” task map alongside]

Analysis of Simulation Results
•   “Normal” Benchmark
•   “Parallel” Benchmark
•   “Shared Variable” Benchmark
•   JPEG Benchmark
•   Linear Solver Benchmark
•   Mandelbrot Benchmark

• Benchmarks Analysis

“Parallel” Benchmark
Activity Per core, Latency = 0 cycles
[Figure: simulation plot with the “Parallel” task map alongside]

“Parallel” Benchmark
Activity Per core, Latency = 20 cycles
[Figure: simulation plot with the “Parallel” task map alongside]
Queues help hide latency only if scheduler capacity is sufficiently high

Analysis of Simulation Results
•   “Normal” Benchmark
•   “Parallel” Benchmark
•   “Shared Variable” Benchmark
•   JPEG Benchmark
•   Linear Solver Benchmark
•   Mandelbrot Benchmark

• Benchmarks Analysis

“Shared Variable” Benchmark
Activity Per cycle, Latency = 0 cycles
[Figure: simulation plot with the “Shared Variable” task map alongside]
Is this a problem of the scheduler?

Analysis of Simulation Results
•   “Normal” Benchmark
•   “Parallel” Benchmark
•   “Shared Variable” Benchmark
•   JPEG Benchmark
•   Linear Solver Benchmark
•   Mandelbrot Benchmark

• Benchmarks Analysis
JPEG Benchmark
Activity Per cycle, Latency = 0 cycles
[Figure: simulation plot with the JPEG task map alongside]

JPEG Benchmark
Unbalanced scheduling, Latency = 0 cycles
[Figure: simulation plot with the JPEG task map alongside]
Queues may degrade system performance

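
A toy model shows one way per-core queues can hurt when task lengths are unbalanced. It is a sketch under assumptions, not the simulator's policy: with queues, instances are pre-assigned round-robin into unbounded per-core queues; without queues, each instance goes to whichever core becomes free first; latency is zero. The task lengths in the example are made up.

```python
import heapq

def makespan_no_queue(lengths, n_cores):
    """Each task is handed to the core that becomes free earliest."""
    free_at = [0] * n_cores
    heapq.heapify(free_at)
    for length in lengths:
        start = heapq.heappop(free_at)      # earliest-free core
        heapq.heappush(free_at, start + length)
    return max(free_at)

def makespan_with_queues(lengths, n_cores):
    """Tasks are pre-assigned round-robin to per-core queues (assumed policy)."""
    finish = [0] * n_cores
    for i, length in enumerate(lengths):
        finish[i % n_cores] += length       # task waits behind its queue
    return max(finish)

tasks = [10, 1, 1, 1, 1, 1, 1]   # one long task followed by short ones
print(makespan_no_queue(tasks, 2))     # 10
print(makespan_with_queues(tasks, 2))  # 13
```

In the queued case short tasks wait behind the long one while the other core sits idle, which is the kind of imbalance the following slides try to remove.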
Solutions to imbalance
1.   Queue sharing among multiple cores (simulated)
2.   Scheduling awareness of long tasks (simulated)
3.   Using fine granularity tasks (simulated)
4.   Task migration among queues
5.   Task map optimization
6.   Pipeline multiple instances of an algorithm

Solutions to imbalance
• Queue sharing among multiple cores
• Scheduling awareness of long tasks
• Using fine granularity tasks




JPEG Benchmark
Shared Queues
Activity Per cycle, Latency = 0 cycles
[Figure: simulation plot with the JPEG task map alongside]

Solutions to imbalance
• Queue sharing among multiple cores
• Scheduling awareness of long tasks
• Using fine granularity tasks




JPEG Benchmark [Green 2010]
Execution-Time Aware Scheduler
Activity Per cycle, Latency = 0 cycles, Task E flagged as long
Flag task C as well

JPEG Benchmark
Execution-Time Aware Scheduler
Activity Per cycle, Latency = 0 cycles, Tasks E and C flagged as long
Need Profiling Tool

Solutions to imbalance
• Queue sharing among multiple cores
• Scheduling awareness of long tasks
• Using fine granularity tasks




JPEG Benchmark
Fine Granularity
Activity Per cycle, Latency = 0 cycles
[Figure: simulation plot with a modified JPEG task map in which task E (1 instance, 12810 time units) is split into E1, E2 and E3 (1 instance, 4270 time units each)]
Might be further improved by decomposing task E further and by also decomposing task C
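
The arithmetic behind this remark can be checked directly from the task lengths listed on the Benchmark Task Maps slide. The sketch assumes that splitting adds no overhead and that the task divides evenly; `split_task` and `longest` are hypothetical helper names.

```python
# JPEG task lengths in time units, from the Benchmark Task Maps slide.
jpeg_lengths = {
    "A": 10, "B": 10, "C": 5715, "E": 12810, "G": 2418, "J": 1490,
    "I": 1952, "K": 1659, "D": 181, "F": 705, "H": 2927,
    "L": 460, "M": 2548, "N": 207,
}

def split_task(lengths, name, parts):
    """Return a copy in which `name` is replaced by `parts` equal sub-tasks."""
    if lengths[name] % parts:
        raise ValueError("task length does not divide evenly")
    out = {k: v for k, v in lengths.items() if k != name}
    piece = lengths[name] // parts
    for i in range(1, parts + 1):
        out[f"{name}{i}"] = piece           # e.g. E1, E2, E3
    return out

def longest(lengths):
    """Name and length of the longest single task instance."""
    name = max(lengths, key=lengths.get)
    return name, lengths[name]

fine = split_task(jpeg_lengths, "E", 3)
print(longest(jpeg_lengths))  # ('E', 12810)
print(longest(fine))          # ('C', 5715)
```

After the three-way split each piece of E takes 4270 time units, so task C at 5715 becomes the longest single instance, which is consistent with the suggestion to decompose C as well.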
Analysis of Simulation Results
•   “Normal” Benchmark
•   “Parallel” Benchmark
•   “Shared Variable” Benchmark
•   JPEG Benchmark
•   Linear Solver Benchmark
•   Mandelbrot Benchmark

• Benchmarks Analysis
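Linear Solver Benchmark

Activity per core, Latency = 20 cycles

Task map (task: number of instances × length in time units):
  A: 1 × 236
  B: 1 × 40
  C: 1 × 214
  D: 1 × 172
  E: 100 × 126
  F: 1 × 58
  G: 7720 × 197
  H: 100 × 78
  J: 1 × 47
  K: 1 × 87
  Condition: cntr=5 (shown between J and K)
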
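As a rough illustration of why this benchmark is dominated by parallel work, the sketch below encodes the task map from the slide as a Python dictionary and sums the work per task. The names `LINEAR_SOLVER`, `total_work`, and `min_task_time` are illustrative and not part of the original simulator. The lower bound assumes every instance runs for exactly its listed length, and it ignores dependencies, latency, and memory collisions.

```python
import math

# Linear Solver task map from the slide: name -> (instances, length in time units).
LINEAR_SOLVER = {
    "A": (1, 236), "B": (1, 40), "C": (1, 214), "D": (1, 172),
    "E": (100, 126), "F": (1, 58), "G": (7720, 197), "H": (100, 78),
    "J": (1, 47), "K": (1, 87),
}
CORES = 256  # core count used in the simulated configurations


def total_work(task_map):
    """Sum of instances * length over all tasks (time units of core work)."""
    return sum(n * length for n, length in task_map.values())


def min_task_time(instances, length, cores=CORES):
    """Lower bound on one task's run time: some core must run
    ceil(instances / cores) instances back to back."""
    return math.ceil(instances / cores) * length


work = total_work(LINEAR_SOLVER)
g_instances, g_length = LINEAR_SOLVER["G"]
g_share = g_instances * g_length / work

print(f"total work: {work} time units")
print(f"task G share of total work: {g_share:.1%}")
print(f"task G lower bound on {CORES} cores: "
      f"{min_task_time(g_instances, g_length)} time units")
```

Under these assumptions, almost all of the work sits in task G, whose instances are long relative to the 20-cycle latency.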
Analysis of Simulation Results
•   “Normal” Benchmark
•   “Parallel” Benchmark
•   “Shared Variable” Benchmark
•   JPEG Benchmark
•   Linear Solver Benchmark
•   Mandelbrot Benchmark

• Benchmarks Analysis
Mandelbrot Benchmark

Activity per cycle, Latency = 20 cycles

Task map (task: number of instances × length in time units):
  A: 1 × 540
  B: 1 × 225
  C: 4096 × 80
  D: 4096 × 7

Mandelbrot Benchmark

Activity per cycle, Latency = 20 cycles, zoom on task D execution for
infinite capacity

Task map (task: number of instances × length in time units):
  A: 1 × 540
  B: 1 × 225
  C: 4096 × 80
  D: 4096 × 7

Fine-grained tasks require deep queues and a powerful
   scheduler to assign instances fast enough to hide
                        latencies

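The conclusion on this slide can be illustrated with a back-of-the-envelope calculation. This is a minimal sketch and not the thesis's simulator: it assumes a steady state in which only one task type is running, takes scheduler capacity to mean instances issued per cycle, and ignores latency, queues, and memory collisions. The function names are hypothetical.

```python
CORES = 256  # core count from the simulated configurations


def required_capacity(task_length, cores=CORES):
    """Instances per cycle the scheduler must issue to keep `cores` busy
    when every instance runs for `task_length` cycles."""
    return cores / task_length


def max_busy_cores(capacity, task_length, cores=CORES):
    """Cores that a scheduler issuing `capacity` instances per cycle
    can keep busy in steady state."""
    return min(cores, capacity * task_length)


# Mandelbrot tasks C (80 cycles) and D (7 cycles) from the task map.
for name, length in (("C", 80), ("D", 7)):
    need = required_capacity(length)
    print(f"task {name}: needs {need:.1f} instances/cycle; "
          f"capacity 5 -> {max_busy_cores(5, length)} cores, "
          f"capacity 10 -> {max_busy_cores(10, length)} cores")
```

In this simplified model a capacity of 5 is already enough for task C, while task D would need roughly 37 instances per cycle to occupy all 256 cores.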
Analysis of Simulation Results
•   “Normal” Benchmark
•   “Parallel” Benchmark
•   “Shared Variable” Benchmark
•   JPEG Benchmark
•   Linear Solver Benchmark
•   Mandelbrot Benchmark

• Benchmarks Analysis
Total Run-Time

A 2-slot queue and a scheduler capacity of 10 are enough
                  to utilize 256 cores
                  Load Balancing
(STD of cores' busy time, latency = 20)

•   Queues may cause imbalance
•   Larger scheduler capacity decreases imbalance
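The load-balancing metric named on this slide, the standard deviation of per-core busy time, can be computed as in the sketch below. The busy-time figures are hypothetical and chosen only to show how the metric reacts when one core receives a long task plus queued work. The use of the population standard deviation is also an assumption, since the slide does not say which variant was used.

```python
from statistics import pstdev


def imbalance(busy_times):
    """Population standard deviation of per-core busy time (cycles)."""
    return pstdev(busy_times)


# Hypothetical busy times for four cores doing the same total work (400 cycles).
balanced = [100, 100, 100, 100]
skewed = [190, 70, 70, 70]  # one core holds a long task plus queued instances

print(f"balanced: STD = {imbalance(balanced):.2f} cycles")
print(f"skewed:   STD = {imbalance(skewed):.2f} cycles")
```

A lower value means the cores were busy for similar amounts of time. A higher value means some cores were overloaded while others sat idle.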
Effective Allocation Latency




A 1-slot queue is sufficient to hide much of the latency
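One way to see why a shallow queue can hide most of the latency is a simple credit-window model, sketched below. It is an illustration under stated assumptions, not the model used in the thesis. It assumes the 20-cycle latency is one-way, the scheduler always has a ready instance and unlimited capacity, and a queue slot is refilled only after the core's termination message reaches the scheduler. The function `effective_latency` is hypothetical.

```python
def effective_latency(task_length, one_way_latency, queue_depth):
    """Idle cycles per instance under a simple credit-window model.

    A core with `queue_depth` slots holds at most queue_depth + 1 instances
    (one running plus the queued ones). Each instance occupies its credit for
    one round trip plus its own run time, so the core completes at most
    queue_depth + 1 instances in that period.
    """
    round_trip = 2 * one_way_latency
    window = queue_depth + 1
    time_per_instance = max(task_length, (round_trip + task_length) / window)
    return time_per_instance - task_length


ONE_WAY_LATENCY = 20          # cycles, the non-zero latency in the simulations
QUEUE_DEPTHS = (0, 1, 2, 10)  # queue depths used in the simulated configurations

for length in (7, 80):  # Mandelbrot tasks D and C
    row = ", ".join(
        f"depth {d}: {effective_latency(length, ONE_WAY_LATENCY, d):.1f}"
        for d in QUEUE_DEPTHS
    )
    print(f"task length {length}: {row}")
```

In this model a single slot removes the idle gap entirely for an 80-cycle task, while a 7-cycle task still needs a deeper queue. That is consistent in spirit with the slide's observation that one slot hides much, but not all, of the latency.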
Agenda
•   Introduction and Motivation
•   The Plural Architecture
•   Improved Scheduler
•   Analysis of Simulation Results
•   Conclusions and Future Work




Conclusions
• Analysis of scheduler effect on many-core
  architecture
• A simulation and investigation tool
• Queues to hide latencies
  – Might cause imbalance
     • Task map optimization and tuning
     • Sharing queues



Future Research
• Scheduler distribution networks
• Implications of scheduler on power
• Other imbalance solutions
  – As described before
• Profiling for task map optimization and
  scheduling analysis



QUESTIONS?



Editor's Notes

  • #27: Unbalanced work distribution at low capacity. Capacity reduces run-time. Unbalanced scheduling with deep queues.
  • #29: Latency is added to the task’s run time. Queues hide latency. Synchronization points cannot be compensated for by queues.
  • #31: Low capacity cannot utilize all the cores
  • #32: Queues at low capacity generate imbalance (only the low cores receive an instance into their queue and hide latency)
  • #34: The low-capacity scheduler spreads accesses to the shared bank over time
  • #37: An instance of task G is stuck behind task E
  • #38: Requiring more complex hardware. Possibly requiring a more complex scheduler. Possibly requiring more complex hardware and enhanced communication bandwidth, and incurring higher power and latency.
  • #39: Queues are shared between 2 cores
  • #40: Notice that sharing will not always solve this problem
  • #41: The scheduler does not schedule new tasks to a queue to which it has already scheduled a “long” task
  • #44: Break long tasks into many fine-grained tasks. In this case, we break task E into 3 parts
  • #47: Very parallel. Long tasks, so a one-slot queue and low capacity are sufficient
  • #49: Task D is very short (7 cycles), so a large-capacity scheduler is needed. The infinite capacity causes collisions in memory (after the first collision the accesses are spread in time). In the no-queue configuration we can see all tasks finish together
  • #50: Might be solved by unifying several instances together (but it degrades parallelism)