Scheduler performance in manycore architecture

Scheduler Performance in Many-Core Architecture

Itai Avron
MSc Thesis
Technion - Electrical Engineering Dept.

May 2, 2012

Agenda
• Introduction and Motivation
• The Plural Architecture
• Improved Scheduler
• Analysis of Simulation Results
• Conclusions and Future Work

Background
• CPU performance improvements
– In the past : Increase of clock frequency
• We reached the power wall
– Today : Multi-cores
– The future : Many-cores
• Homogeneous or heterogeneous?
• What architecture?
• Memory model?
• Scheduler?
• …

Scheduling In Many-Core Architecture
• Software scheduling is slow
– A lot of cores to schedule
– Fine-granularity tasks -> many tasks to schedule at the same time
• To enhance parallelism

• Dedicated Hardware required!

Scheduler Challenges
• Latency
– Message delay
• From core to scheduler (completed prev. task)
• From scheduler to core (start new task)
– Schedule time
• to allocate tasks to cores

• Capacity
– Number of instances/tasks scheduled per cycle
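The latency components above can be turned into a rough back-of-envelope estimate. The sketch below is not taken from the thesis: the function names and the `queue_depth` window model are assumptions made for illustration, treating a core with q queue slots as having at most q+1 instances in flight per round trip.

```python
def idle_gap(up_delay, schedule_time, down_delay):
    """Cycles a core waits between finishing one instance and starting the next
    when nothing is queued locally: completion message, scheduling, start message."""
    return up_delay + schedule_time + down_delay


def core_utilization(task_len, gap, queue_depth=0):
    """Upper bound on the fraction of time a core is busy (assumed model).

    Without a queue the core does task_len cycles of work every
    task_len + gap cycles. Each queue slot adds one more instance in flight.
    """
    in_flight = queue_depth + 1
    return min(1.0, in_flight * task_len / (task_len + gap))


# Hypothetical numbers: 20-cycle messages each way, zero scheduling time.
gap = idle_gap(20, 0, 20)
for depth in (0, 1, 2, 10):
    print(depth, round(core_utilization(7, gap, depth), 2))
```

Under this model a short task needs several queue slots before the gap is hidden, while a task much longer than the gap is already covered by one slot.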

Other Architectures
• Graphics Processing Units (GPUs)
• Tilera
• Larrabee
• XMT
• Rigel
• Data-Driven Multithreading Model
• Task Superscalar

GPU – NVIDIA Fermi
• Composed of many processing elements (PEs)
• Scheduling is done in hardware
– Schedule warps
– Only one control flow
• SIMD

Tilera
• Composed of tiles
– Each tile is independent
• Static Scheduling
– Determined at compile time
• MIMD

[Agarwal (MIT) 1997- ]

Larrabee (Intel)
• Array of processor cores
• Software-controlled scheduling
– Lightweight distributed task-stealing scheduler
• MIMD

XMT
• Composed of TCUs
– Thread control units
• Hardware Scheduling
– Using Prefix-Sum
• PRAM Programming Model
• SPMD

[Vishkin (UMD) 2005-]

Rigel
• Composed of tiles of clusters
– Each cluster holds 8 cores
• Software Scheduling
– Allocation via task queues
– Synchronization via barriers
• SPMD
[Patel (UIUC) 2008- ]

Data-Driven Multithreading Model
• A Thread Synchronization Unit (TSU)
– Connects to existing cores
– Using Task Map
• Producer-Consumer Programming Model
[Evripidou (U Cyprus) 1997- ]

Task Superscalar
• An Out-of-Order Task Pipeline
– Connects to existing cores
– No speculation
– Creation of new tasks is done in software
– Management and allocation are done in hardware
• StarSs Programming Model
[Etsion (BSC) 2009- ]

The ‘Plural’ System Architecture

[Figure: block diagram showing the scheduler, the cores, the memory network, and the memory banks]

[Bayer (Technion) 1988 ]

The System
• Many RISC cores
– In-Order, Blocking Load/Store
– No data cache
• Shared On-Chip memory banks
– Interleaved address
– Access takes 2 cycles
• Core retries on collision
• Hardware synchronization and scheduling unit
– Distributes tasks to cores according to a task map
– Collects task completion messages from cores
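As a rough illustration of interleaved banks and retry on collision, the sketch below assumes low-order word interleaving and a fixed lowest-core-wins arbitration. Both are assumptions chosen for the example, not details given on the slide, and the 2-cycle access time is not modelled.

```python
N_BANKS = 256  # assumption: same bank count as the simulated configurations


def bank_of(word_addr, n_banks=N_BANKS):
    """Low-order interleaving (assumed): consecutive words hit consecutive banks."""
    return word_addr % n_banks


def arbitrate(requests, n_banks=N_BANKS):
    """One cycle of bank arbitration.

    `requests` maps core id -> word address. The lowest-numbered core wins
    each bank (an arbitrary policy picked for illustration); losers retry.
    Returns (granted, retry): granted maps core id -> bank, retry lists core ids.
    """
    granted, retry, taken = {}, [], set()
    for core in sorted(requests):
        bank = bank_of(requests[core], n_banks)
        if bank in taken:
            retry.append(core)
        else:
            taken.add(bank)
            granted[core] = bank
    return granted, retry


# Cores 0 and 2 both target bank 0 (addresses 0 and 256), so core 2 retries.
granted, retry = arbitrate({0: 0, 1: 1, 2: 256, 3: 5})
print(granted, retry)
```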

Plural Task Map
• Precedence graph
• Created by the programmer
• Duplicable tasks
– All instances are concurrent
[Figure: example task map with dependency edges and a condition (cntr=4). Each node shows the task name and its number of instances: A (1), B (1200), C (5000), D (130), E (1)]
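One way to picture a task map in software is a small precedence graph whose nodes carry an instance count. The sketch below is only an illustration: the `Task` class, the `ready_tasks` helper, and the dependency edges are hypothetical, and conditions such as cntr=4 are left out.

```python
from dataclasses import dataclass, field


@dataclass
class Task:
    name: str
    instances: int = 1                          # duplicable tasks have instances > 1
    deps: list = field(default_factory=list)    # names of predecessor tasks


def ready_tasks(task_map, finished):
    """Tasks that have not run yet and whose predecessors have all finished."""
    return [t.name for t in task_map.values()
            if t.name not in finished and all(d in finished for d in t.deps)]


# Hypothetical edges: A fans out to B and C, which join at D.
task_map = {t.name: t for t in [
    Task("A"),
    Task("B", instances=1200, deps=["A"]),
    Task("C", instances=5000, deps=["A"]),
    Task("D", instances=130, deps=["B", "C"]),
]}

print(ready_tasks(task_map, finished=set()))
print(ready_tasks(task_map, finished={"A"}))
```

Because all instances of a duplicable task are concurrent, a task that becomes ready contributes all of its instances to the pool the scheduler can allocate.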

Plural Scheduling

• Central Synchronization Unit (CSU)
– Manages allocation, scheduling, and synchronization of tasks
– Collects task-termination messages
– Programmed by the task map
– Allocates packs (sets) of parallel task-instances
• Distribution Network (DN)
– Organized as a tree with the CSU as its root
– Mediates between the CSU and the processing cores
– Downstream flow -> decomposes allocated packs of task instances
– Upstream flow -> unifies task-termination events from the cores
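The two flows through the distribution network can be sketched as follows. This is a minimal illustration under stated assumptions: a binary tree, a pack represented as a contiguous range of instance indices, and idle cores known at each node. The names `distribute` and `unify` are hypothetical.

```python
def distribute(first, count, free_cores):
    """Downstream: split a pack of `count` instances (indices first..first+count-1)
    over a binary tree whose leaves are the idle cores in `free_cores`.
    Returns {core_id: instance_index}. Assumes count <= len(free_cores)."""
    if count == 0:
        return {}
    if len(free_cores) == 1:
        return {free_cores[0]: first}
    mid = len(free_cores) // 2
    left, right = free_cores[:mid], free_cores[mid:]
    n_left = min(count, len(left))          # fill the left subtree first
    out = distribute(first, n_left, left)
    out.update(distribute(first + n_left, count - n_left, right))
    return out


def unify(left_terminations, right_terminations):
    """Upstream: merge per-task termination counts coming from two subtrees."""
    merged = dict(left_terminations)
    for task, n in right_terminations.items():
        merged[task] = merged.get(task, 0) + n
    return merged


print(distribute(0, 3, [4, 5, 6, 7]))
print(unify({"B": 2}, {"B": 1, "C": 4}))
```

Splitting at every tree level means the root only handles one pack descriptor per allocation, and merging on the way up means it receives one combined termination count rather than one message per core.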

Scheduling Process
[Figure: the scheduling cycle]
1. CSU allocates ready-to-run tasks
2. DN distributes packs to cores
3. Cores send termination messages on completion
4. DN unifies termination messages
5. CSU processes new eligible-to-run tasks

Scheduler Improvements
• Enhancing scheduler capacity
• Reducing scheduling latency
• Adding task queues to each core
– Sharing queues
• Adding task length indicator

Simulation Environment
• Matlab Simulator [Friedman, Khoretz, Ginosar, PDP 2012]
– Based on Eyal and Dima’s simulator
• Benchmarks
– 3 Demo programs
– 3 Benchmarks
• JPEG, Mandelbrot, Linear Solver
• 24 System configurations
– 256 cores, 256 banks
– Scheduler capacity: 5, 10, infinite [instances]
– Latency (scheduler—cores): 0, 20 [cycles]
– Task queue depth: 0, 1, 2, 10 [instances]
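The count of 24 configurations follows from combining the listed parameter values (3 capacities × 2 latencies × 4 queue depths). A short sketch of that enumeration, with dictionary keys that are illustrative names only and `None` standing in for infinite capacity:

```python
from itertools import product

CAPACITIES = [5, 10, None]       # instances per cycle; None stands for "infinite"
LATENCIES = [0, 20]              # cycles between scheduler and cores
QUEUE_DEPTHS = [0, 1, 2, 10]     # instances per core queue

configs = [
    {"cores": 256, "banks": 256, "capacity": c, "latency": lat, "queue_depth": q}
    for c, lat, q in product(CAPACITIES, LATENCIES, QUEUE_DEPTHS)
]

print(len(configs))  # 3 * 2 * 4 = 24
```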

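Benchmark Task Maps
[Figure: task-map graphs of the benchmarks. Nodes are listed below as task name (number of instances, length in time units).]
• Normal and Shared Variable: A (1, 23), B (100, 15), D (600, 20), C (500, 35), E (130, 18), F (1, 27); condition cntr=4
• Parallel: A (1, 23), B (2000, 25), D (2600, 26), C (2500, 35), E (2300, 18), F (1, 19); condition cntr=4
• Mandelbrot: A (1, 540), B (1, 225), C (4096, 80), D (4096, 7)
• JPEG: A (1, 10), B (1, 10), C (1, 5715), E (1, 12810), G (300, 2418), J (200, 1490), I (100, 1952), K (100, 1659), D (300, 181), F (300, 705), H (300, 2927), L (1, 460), M (1, 2548), N (1, 207)
• Linear Solver: A (1, 236), B (1, 40), C (1, 214), D (1, 172), E (100, 126), F (1, 58), G (7720, 197), H (100, 78), J (1, 47), K (1, 87); condition cntr=5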

Analysis of Simulation Results
• “Normal” Benchmark
• “Parallel” Benchmark
• “Shared Variable” Benchmark
• JPEG Benchmark
• Linear Solver Benchmark
• Mandelbrot Benchmark

• Benchmarks Analysis

“Normal” Benchmark
[Figure: activity per core, latency = 0 cycles]

[Figure: unbalanced scheduling, latency = 0 cycles]

“Parallel” Benchmark
Queues help hide latency only if scheduler capacity is sufficiently high

“Shared Variable” Benchmark
[Figure: activity per cycle, latency = 0 cycles]
Is this a problem of the scheduler?

JPEG Benchmark
[Figure: activity per cycle, latency = 0 cycles]

JPEG Benchmark
[Figure: unbalanced scheduling, latency = 0 cycles]
Queues may degrade system performance
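A toy example of how queues can hurt: once an instance is bound to a core's queue it cannot move, so a short instance may sit behind a long one while other cores idle. The sketch below is a deliberately simplified model, not the thesis simulator: zero latency, unbounded queues filled round-robin, and hypothetical function names and task lengths.

```python
import heapq


def makespan_on_demand(lengths, n_cores):
    """No queues: each instance goes to whichever core becomes free first."""
    free_at = [0] * n_cores
    heapq.heapify(free_at)
    for length in lengths:
        start = heapq.heappop(free_at)          # earliest available core
        heapq.heappush(free_at, start + length)
    return max(free_at)


def makespan_prebound(lengths, n_cores):
    """Queues with early binding: instances are dealt round-robin to per-core
    queues before their run time is known, and never migrate."""
    busy = [0] * n_cores
    for i, length in enumerate(lengths):
        busy[i % n_cores] += length
    return max(busy)


lengths = [100, 1, 100, 1]   # hypothetical mix of long and short instances
print(makespan_on_demand(lengths, 2), makespan_prebound(lengths, 2))
```

With these lengths, early binding puts both long instances on the same core, which roughly doubles the finish time compared with on-demand allocation.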

Solutions to imbalance
1. Queue sharing among multiple cores (simulated)
2. Scheduling awareness of long tasks (simulated)
3. Using fine granularity tasks (simulated)
4. Task migration among queues
5. Task map optimization
6. Pipeline multiple instances of an algorithm

Solutions to imbalance
• Queue sharing among multiple cores
• Scheduling awareness of long tasks
• Using fine granularity tasks

JPEG Benchmark
Shared Queues
[Figure: activity per cycle, latency = 0 cycles]

JPEG Benchmark [Green 2010]

Execution-Time Aware Scheduler
Activity Per cycle, Latency = 0 cycles, Task E flagged as long

Flag task C as well

JPEG Benchmark
Execution-Time Aware Scheduler
Activity Per cycle, Latency = 0 cycles, Task E and C flagged as long

Need Profiling Tool

JPEG Benchmark
Fine Granularity
[Figure: activity per cycle, latency = 0 cycles; task E (1 instance, length 12810) is split into E1, E2, and E3 (1 instance, length 4270 each)]
Might be further improved by decomposing task E further and by also decomposing task C

Linear Solver Benchmark
[Figure: activity per core, latency = 20 cycles]

Mandelbrot Benchmark
[Figure: activity per cycle, latency = 20 cycles]

Mandelbrot Benchmark
[Figure: activity per cycle, latency = 20 cycles, zoom on task D execution for infinite capacity]
Fine-grained tasks require deep queues and a powerful scheduler to assign instances fast enough to hide latencies
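Why short tasks stress the scheduler can be seen from a simple steady-state bound: if at most `capacity` instances start per cycle and each runs for `task_len` cycles, no more than `capacity × task_len` cores can be busy at once. The helper below is an illustrative calculation, not part of the thesis simulator. It uses the Mandelbrot task lengths from the task map (C: 80, D: 7).

```python
def max_busy_cores(capacity, task_len, n_cores=256):
    """Upper bound on concurrently busy cores in steady state.

    capacity: instances the scheduler can start per cycle (None = infinite).
    task_len: run time of one instance in cycles.
    """
    if capacity is None:
        return n_cores
    return min(n_cores, capacity * task_len)


for name, task_len in (("C", 80), ("D", 7)):
    for capacity in (5, 10, None):
        print(name, capacity, max_busy_cores(capacity, task_len))
```

By this bound a capacity of 10 can keep all 256 cores busy on task C but at most 70 on task D, and no queue depth can raise that ceiling.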

Total Run-Time

A 2-slot queue and a scheduler capacity of 10 are enough to utilize 256 cores

Load Balancing
(STD of cores' busy time, latency = 20)

• Queues may cause imbalance
• Larger scheduler capacity decreases imbalance
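The imbalance metric named on this slide, the standard deviation of per-core busy time, can be computed as below. The busy-time lists are made-up numbers for illustration, and the choice of population (rather than sample) standard deviation is an assumption.

```python
from statistics import pstdev


def load_imbalance(busy_cycles):
    """Population standard deviation of per-core busy time (0 means perfectly balanced)."""
    return pstdev(busy_cycles)


balanced = [100, 100, 100, 100]   # hypothetical: every core equally busy
skewed = [200, 2, 100, 98]        # hypothetical: same total work, uneven split

print(load_imbalance(balanced), load_imbalance(skewed))
```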

Effective Allocation Latency

A 1-slot queue is sufficient to hide much of the latency

Conclusions
• Analysis of scheduler effect on many-core
architecture
• A simulation and investigation tool
• Queues to hide latencies
– Might cause imbalance
• Task map optimization and tuning
• Sharing queues

Future Research
• Scheduler distribution networks
• Implications of scheduler on power
• Other imbalance solutions
– As described before
• Profiling for task map optimization and
scheduling analysis

QUESTIONS?
