[Harvard CS264] 16 - Managing Dynamic Parallelism on GPUs: A Case Study of High Performance Sorting (Duane Merrill, University of Virginia)

– Data-independent tasks

– Tasks with statically-known data dependences

– SIMD divergence

– Lacking fine-grained synchronization

– Lacking writeable, coherent caches

DEVICE            32-bit Key-value Sorting    Keys-only Sorting
                  (10^6 pairs/sec)            (10^6 keys/sec)
NVIDIA GTX 280    449 (3.8x speedup*)         534 (2.9x speedup*)

* Satish et al., "Designing efficient sorting algorithms for manycore GPUs," in IPDPS '09

DEVICE                               32-bit Key-value Sorting    Keys-only Sorting
                                     (10^6 pairs/sec)            (10^6 keys/sec)
NVIDIA GTX 480                       775                         1005
NVIDIA GTX 280                       449                         534
NVIDIA 8800 GT                       129                         171
Intel Knight's Ferry MIC 32-core*                                560
Intel Core i7 quad-core*                                         240
Intel Core-2 quad-core*                                          138

* Satish et al., "Fast Sort on CPUs, GPUs and Intel MIC Architectures," Intel Tech Report 2010.

[Figure: Input → Thread, Thread, Thread, Thread → Output]

– Each output is dependent upon a finite subset of the input
• Threads are decomposed by output element
• The output (and at least one input) index is a static function of thread-id

[Figure: Input → ? → Output]

– Each output element has dependences upon any / all input elements
– E.g., sorting, reduction, compaction, duplicate removal, histogram generation,
map-reduce, etc.

– Threads are decomposed by output element
– Repeatedly iterate over recycled input streams
– Output stream size is statically known before each pass

[Figure: four Threads combining neighboring elements (+ + + +)]

– O(n) global work from passes of pairwise-neighbor-reduction
– Static dependences, uniform output allocation

– Repeated pairwise swapping
• Bubble sort is O(n^2)
• Bitonic sort is O(n log^2 n)
– Need partitioning: dynamic, cooperative allocation

– Repeatedly check each vertex or edge
• Breadth-first search becomes O(V^2)
• O(V+E) is work-optimal
– Need queue: dynamic, cooperative allocation



– Variable output per thread
– Need dynamic, cooperative allocation

[Figure: Input → twelve Threads → ? → Output]

• Where do I put something in a list?  Where do I enqueue something?
– Duplicate removal
– Sorting
– Histogram compilation
– Search space exploration
– Graph traversal
– General work queues

• For 30,000 producers and consumers?
– Locks serialize everything

– O(n) work
– For allocation: use scan results as a scattering vector
– Popularized by Blelloch et al. in the '90s
– Merrill et al. Parallel Scan for Stream Architectures. Technical Report CS2009-14, University of Virginia. 2009

Input (& allocation requirement):    2 1 0 3 2
Result of prefix scan (sum):         0 2 3 3 6

[Figure: five Threads scatter their results into Output slots 0 1 2 3 4 5 6 7]
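A minimal host-side sketch of scan-based allocation, written sequentially in C++ rather than as a GPU kernel. The function names `exclusive_scan` and `scatter_by_scan` are illustrative and not from the talk. Each producer's count is scanned into a starting offset, and the producer then writes into its own reserved range without locks.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Exclusive prefix sum: offsets[i] = counts[0] + ... + counts[i-1].
// Returns the total number of output slots required.
std::size_t exclusive_scan(const std::vector<std::size_t>& counts,
                           std::vector<std::size_t>& offsets) {
    offsets.resize(counts.size());
    std::size_t running = 0;
    for (std::size_t i = 0; i < counts.size(); ++i) {
        offsets[i] = running;  // where producer i starts writing
        running += counts[i];  // reserve counts[i] slots for producer i
    }
    return running;
}

// Each producer i writes counts[i] items starting at offsets[i].
// The payload (the producer's own index) is a placeholder value.
std::vector<int> scatter_by_scan(const std::vector<std::size_t>& counts) {
    std::vector<std::size_t> offsets;
    const std::size_t total = exclusive_scan(counts, offsets);
    std::vector<int> output(total, -1);
    for (std::size_t i = 0; i < counts.size(); ++i) {
        for (std::size_t k = 0; k < counts[i]; ++k) {
            assert(offsets[i] + k < total);
            output[offsets[i] + k] = static_cast<int>(i);
        }
    }
    return output;
}
```

With the counts 2 1 0 3 2 this yields the offsets 0 2 3 3 6 and a total of 8 output slots. On a GPU the scan itself would be parallel; the sequential loop here only shows the data flow.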

Key sequence                     1110 0011 1010 0111 1100 1000 0101 0001
(position)                          0    1    2    3    4    5    6    7

                                 0s bin             1s bin
Allocation requirements          1 0 1 0 1 1 0 0    0 1 0 1 0 0 1 1
Scanned allocations              0 1 1 2 2 3 4 4    0 0 1 1 2 2 2 3
(bin relocation offsets)
Adjusted allocations             0 1 1 2 2 3 4 4    4 4 5 5 6 6 6 7
(global relocation offsets)

Scatter destinations             0 4 1 5 2 3 6 7

Output key sequence              1110 1010 1100 1000 0011 0111 0101 0001
(position)                          0    1    2    3    4    5    6    7
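The split step above can be sketched sequentially as follows. This is an illustration under assumptions, not the kernel from the talk: `split_by_bit` and `radix_sort_1bit` are hypothetical names, and the per-bin scans are folded into running counters.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// One binary split pass: stable partition of keys by the given bit
// (0s bin first, then 1s bin), using scanned flags as relocation offsets.
std::vector<unsigned> split_by_bit(const std::vector<unsigned>& keys, unsigned bit) {
    const std::size_t n = keys.size();
    std::vector<std::size_t> offset(n);  // bin relocation offset of each key
    std::size_t zeros = 0, ones = 0;     // running (exclusive) scans per bin
    for (std::size_t i = 0; i < n; ++i) {
        if ((keys[i] >> bit) & 1u) offset[i] = ones++;
        else                       offset[i] = zeros++;
    }
    std::vector<unsigned> out(n);
    for (std::size_t i = 0; i < n; ++i) {
        const bool is_one = ((keys[i] >> bit) & 1u) != 0;
        // The 1s bin starts after all keys of the 0s bin (global offset).
        const std::size_t dst = is_one ? zeros + offset[i] : offset[i];
        assert(dst < n);
        out[dst] = keys[i];
    }
    return out;
}

// Least-significant-bit-first radix sort built from repeated splits.
std::vector<unsigned> radix_sort_1bit(std::vector<unsigned> keys, unsigned num_bits) {
    for (unsigned bit = 0; bit < num_bits; ++bit) keys = split_by_bit(keys, bit);
    return keys;
}
```

Splitting the eight 4-bit keys on bit 0 reproduces the output key sequence shown above; four such passes sort them completely because each pass is stable.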

[Figure: Un-fused vs. fused execution between Host and GPU. Un-fused: the Host Program launches Determine allocation size, CUDPP scan, CUDPP scan, CUDPP scan, and Distribute output as separate steps around Global Device Memory. Fused: Determine allocation, Scan, Scan, Scan, Distribute output.]

1. Heavy SMT (over-threading) yields usable “bubbles” of free computation
2. Propagate live data between steps in fast registers / smem
3. Use scan (or variant) as a “runtime” for everything

Device            Memory Bandwidth   Compute Throughput       Memory wall     Memory wall
                  (10^9 bytes/s)     (10^9 thread-cycles/s)   (bytes/cycle)   (instrs/word)
GTX 480           169.0              672.0                    0.251           15.9
GTX 285           159.0              354.2                    0.449           8.9
GTX 280           141.7              311.0                    0.456           8.8
Tesla C1060       102.0              312.0                    0.327           12.2
9800 GTX+         70.4               235.0                    0.300           13.4
8800 GT           57.6               168.0                    0.343           11.7
9800 GT           57.6               168.0                    0.343           11.7
8800 GTX          86.4               172.8                    0.500           8.0
Quadro FX 5600    76.8               152.3                    0.504           7.9
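The two memory-wall columns appear to follow from the first two by simple division. The sketch below assumes one instruction per thread-cycle and a 4-byte (32-bit) word, which is what reproduces the tabulated values; the function names are illustrative.

```cpp
#include <cassert>

// Bytes of memory bandwidth available per thread-cycle.
double bytes_per_cycle(double bandwidth_gbytes_per_s, double throughput_gcycles_per_s) {
    assert(throughput_gcycles_per_s > 0.0);
    return bandwidth_gbytes_per_s / throughput_gcycles_per_s;
}

// Thread-instructions that fit in the time it takes to move one word,
// assuming one instruction per thread-cycle (an assumption of this sketch).
double instrs_per_word(double bandwidth_gbytes_per_s, double throughput_gcycles_per_s,
                       double word_bytes = 4.0) {
    return word_bytes / bytes_per_cycle(bandwidth_gbytes_per_s, throughput_gcycles_per_s);
}

// A kernel that reads and writes every word moves two words per element.
double read_write_instrs_per_word(double bandwidth_gbytes_per_s,
                                  double throughput_gcycles_per_s) {
    return 2.0 * instrs_per_word(bandwidth_gbytes_per_s, throughput_gcycles_per_s);
}
```

For the GTX 285 row this gives about 8.9 instructions per word one-way and about 17.8 for read plus write, matching the "r+w memory wall" figure quoted on the plots.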

[Figure: Thread-instructions / 32-bit scan element (0 to 25) vs. Problem Size (millions, 0 to 112). Series: GTX285 r+w memory wall (17.8 instructions per input word), Our Scan Kernel, Data Movement Skeleton. Annotation: "Insert work here".]

– Increase granularity / redundant computation
• ghost cells
• radix bits
– Orthogonal kernel fusion

[Figure: Thread-instructions / 32-bit scan element (0 to 25) vs. Problem Size (millions, 0 to 120). Series: CUDPP Scan Kernel, Our Scan Kernel.]

[Figure: Thread-instructions / 32-bit scan element (0 to 35) vs. Problem Size (millions, 0 to 112). Series: GTX285 Radix Scatter Kernel Wall, GTX285 Scan Kernel Wall, Our Scan Kernel. Annotation: "Insert work here".]

– Partially-coalesced writes
– 2x write overhead
– 4 total concurrent scan operations (radix 16)

[Figure: Thread-instructions / 32-bit word (0 to 50) vs. Problem Size (millions, 0 to 90). Series: 480 Radix Scatter Kernel Wall, 285 Radix Scatter Kernel Wall.]

– Need kernels with tunable local (or redundant) work
• ghost cells
• radix bits

– Virtual processors abstract a diversity of hardware configurations

– Leads to a host of inefficiencies

– E.g., only several hundred CTAs

Grid A: grid-size = (N / tilesize) CTAs
Grid B: grid-size = 150 CTAs (or other small constant)


– Thread-dependent predicates

– Setup and initialization code (notably for
smem)

– Offset calculations (notably for smem)

– Common values are hoisted and kept live
– Spills are really bad

log_tilesize(N)-level tree vs. two-level tree

– O(N / tilesize) gmem accesses
– 2-4 instructions per access (offset calcs, load, store)
– GPU is least efficient here: get it over with as quickly as possible

[Figure: Thread-instructions / Element (0 to 20) vs. Grid Size (# of threadblocks, 0 to 9000). Series: Compute, Load, 285 Scan Kernel Wall.]

C = number of CTAs
N = problem size
T = tile size
B = tiles per CTA

– 16.1M / 150 CTAs / 1024 = 109.91 tiles per CTA
– conditional evaluation
– singleton loads

– floor(16.1M / (1024 * 150)) = 109 tiles per CTA (14 CTAs)
– 109 + 1 = 110 tiles per CTA (136 CTAs)
– 16.1M % (1024 * 150) = 136.4 extra tiles
– After 136 CTAs take one extra tile each, 0.4 extra tiles remain
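The tile arithmetic above can be written out as a small helper. This is a sketch only: `EvenShare` and `even_share` are hypothetical names, and the problem size used in the note below is a value chosen to match "16.1M", since the slides do not give the exact N.

```cpp
#include <cassert>
#include <cstddef>

struct EvenShare {
    std::size_t tiles_per_cta;       // B: full tiles every CTA processes
    std::size_t ctas_with_extra;     // CTAs that process B + 1 tiles
    std::size_t ctas_without_extra;  // CTAs that process exactly B tiles
    std::size_t leftover_elements;   // final partial tile, in elements
};

// Split n elements into tiles of tile_size and spread them over num_ctas CTAs
// so that no CTA gets more than one tile beyond any other.
EvenShare even_share(std::size_t n, std::size_t tile_size, std::size_t num_ctas) {
    assert(tile_size > 0 && num_ctas > 0);
    const std::size_t full_tiles = n / tile_size;
    EvenShare s;
    s.tiles_per_cta = full_tiles / num_ctas;
    s.ctas_with_extra = full_tiles % num_ctas;
    s.ctas_without_extra = num_ctas - s.ctas_with_extra;
    s.leftover_elements = n % tile_size;
    return s;
}
```

For N = 16,882,074 (roughly 16.1 x 2^20), T = 1024 and C = 150 this gives 109 tiles per CTA, 136 CTAs with one extra tile, 14 without, and a 410-element remainder (about 0.4 of a tile). Every CTA then runs a fixed-trip-count loop over full tiles, and only one CTA needs the guarded partial-tile path.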

– If you breathe on your code, run it through the VP
• Kernel runtimes
• Instruction counts

– Indispensable for tuning
• Host-side timing requires too many iterations
• Only 1-2 cudaprof iterations for consistent counter-based perf data

– Write tools to parse the output
• “Dummy” kernels useful for demarcation

[Figure: Sorting Rate (10^6 keys/sec, 0 to 1100) vs. Problem size (millions, 0 to 272). Series: GTX 480, C2050 (no ECC), GTX 285, C2050 (ECC), GTX 280, C1060, 9800 GTX+.]

[Figure: Sorting Rate (millions of pairs/sec, 0 to 800) vs. Problem size (millions, 0 to 240). Series: GTX 480, C2050 (no ECC), GTX 285, GTX 280, C2050 (ECC), C1060, 9800 GTX+.]
[Figure: Kernel Bandwidth (GiBytes/sec, 0 to 180) vs. Problem Size (millions, 0 to 120). Series: merrill_tree Reduce, merrill_rts Scan.]

[Figure: Kernel Bandwidth (10^9 bytes/sec, 0 to 180) vs. Problem Size (millions, 0 to 120). Series: merrill_linear Reduce, merrill_linear Scan.]

– Implement device “memcpy” for tile-processing
• Optimize for “full tiles”

– Specialize for different SM versions, input types, etc.

– Use templated code to generate various instances

– Run with cudaprof env vars to collect data

[Figure: One-way, 128-Thread CTA (64B ld). Bandwidth (GiBytes/sec, 80 to 160) vs. Words Copied (millions, 0 to 120). Series: Single, Single (no-overlap), Double, Double (no-overlap), Quad, Quad (no-overlap).]

[Figure: Two-way, 128-Thread CTA (128B ld/st), us vs. cudaMemcpy(). Bandwidth (GiBytes/sec, 70 to 140) vs. Words Copied (millions, 0 to 120). Series: Single, Single (no-overlap), Double, Double (no-overlap), Quad, Quad (no-overlap), Intrinsic Copy.]

[Figure: two scan networks over x0..x7 in shared memory. Left (cells m0..m7), one row per step:

t0  x0  x1         x2         x3         x4         x5         x6         x7
t1  x0  ⊕(x0..x1)  x2         ⊕(x2..x3)  x4         ⊕(x4..x5)  x6         ⊕(x6..x7)
t2  x0  ⊕(x0..x1)  x2         ⊕(x0..x3)  x4         ⊕(x4..x5)  x6         ⊕(x4..x7)
t3  x0  ⊕(x0..x1)  x2         i          x4         ⊕(x4..x5)  x6         ⊕(x0..x3)
t4  x0  i          x2         ⊕(x0..x1)  x4         ⊕(x0..x3)  x6         ⊕(x0..x5)
t5  i   x0         ⊕(x0..x1)  ⊕(x0..x2)  ⊕(x0..x3)  ⊕(x0..x4)  ⊕(x0..x5)  ⊕(x0..x6)

Right (cells m0..m11, the first four padded with the identity i):

t0  i i i i  x0  x1         x2         x3         x4         x5         x6         x7
t1  i i i i  x0  ⊕(x0..x1)  ⊕(x1..x2)  ⊕(x2..x3)  ⊕(x3..x4)  ⊕(x4..x5)  ⊕(x5..x6)  ⊕(x6..x7)
(steps t2 and t3 follow the same pattern)]
– SIMD lanes wasted on O(n)-work Brent-Kung (left), but less work when n > warp size
– Kogge-Stone (right) is O(n log n)-work, but faster when n ≤ warp size
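To make the work trade-off concrete, the sketch below simulates both scan networks sequentially and counts the ⊕ operations (here, integer additions). It is an illustration, not the talk's kernels: the work-efficient version is the standard full upsweep/downsweep form, the function names are assumptions, and a real warp would run each inner loop as one SIMD step.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Kogge-Stone inclusive scan. Returns the number of additions performed.
std::size_t kogge_stone_scan(std::vector<int>& x) {
    std::size_t ops = 0;
    const std::size_t n = x.size();
    for (std::size_t d = 1; d < n; d *= 2) {
        const std::vector<int> prev = x;  // every lane reads before any lane writes
        for (std::size_t i = d; i < n; ++i) {
            x[i] = prev[i - d] + prev[i];
            ++ops;
        }
    }
    return ops;
}

// Work-efficient (upsweep / downsweep) exclusive scan; n must be a power of two.
// Returns the number of additions performed.
std::size_t brent_kung_scan(std::vector<int>& x) {
    const std::size_t n = x.size();
    assert(n != 0 && (n & (n - 1)) == 0);
    std::size_t ops = 0;
    for (std::size_t d = 1; d < n; d *= 2) {          // upsweep (reduce)
        for (std::size_t i = 2 * d - 1; i < n; i += 2 * d) {
            x[i] += x[i - d];
            ++ops;
        }
    }
    x[n - 1] = 0;                                     // seed with the identity
    for (std::size_t d = n / 2; d >= 1; d /= 2) {     // downsweep
        for (std::size_t i = 2 * d - 1; i < n; i += 2 * d) {
            const int left = x[i - d];
            x[i - d] = x[i];
            x[i] += left;
            ++ops;
        }
    }
    return ops;
}
```

For n = 8 the Kogge-Stone version performs 17 additions in 3 steps, while the work-efficient version performs 14 additions in 6 steps, most of which leave the majority of lanes idle.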

[Figure: tree-based reduction (a barrier between successive levels, threads t0 … tT-1) vs. raking-based reduction (a single barrier; threads t0 to t3 labeled "Worker", the remaining threads labeled "DMA")]

– Barriers make O(n) code O(n log n)
– The rest are “DMA engine” threads
– Use threadblocks to cover pipeline latencies, e.g., for Fermi:
• 2 worker warps per CTA
• 6-7 CTAs

– Different SMs (varied local storage: registers/smem)
– Different input types (e.g., sorting chars vs. ulongs)

– # of steps for each algorithm phase is configuration-driven

– Template expansion + Constant-propagation + Static loop unrolling +
Preprocessor Macros
– Compiler produces a target assembly that is well-tuned for the specifically
targeted hardware and problem
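A small sketch of the template-expansion idea, assuming a hypothetical tuning policy type. `TilePolicy`, `SerialReduce` and `reduce_thread_items` are names invented for this example. The per-thread item count is a compile-time constant, so the serial reduction unrolls fully with no loop counter at run time.

```cpp
#include <cassert>
#include <cstddef>

// Compile-time-unrolled serial reduction over ITEMS values.
template <typename T, std::size_t ITEMS>
struct SerialReduce {
    static T run(const T* items) {
        return items[0] + SerialReduce<T, ITEMS - 1>::run(items + 1);
    }
};

// Base case ends the recursive template expansion.
template <typename T>
struct SerialReduce<T, 1> {
    static T run(const T* items) { return items[0]; }
};

// Hypothetical tuning policy: one instantiation per (key type, CTA shape).
template <typename KeyT, std::size_t THREADS, std::size_t ITEMS_PER_THREAD>
struct TilePolicy {
    typedef KeyT Key;
    static const std::size_t kThreads = THREADS;
    static const std::size_t kItemsPerThread = ITEMS_PER_THREAD;
    static const std::size_t kTileSize = THREADS * ITEMS_PER_THREAD;
};

// The number of reduction steps is driven entirely by the policy.
template <typename Policy>
typename Policy::Key reduce_thread_items(const typename Policy::Key* items) {
    assert(items != nullptr);
    return SerialReduce<typename Policy::Key, Policy::kItemsPerThread>::run(items);
}
```

Choosing, say, `TilePolicy<unsigned, 128, 4>` for one device and a different shape or key type for another yields separately specialized code from the same source.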

Subsequence of Keys
211 122 302 232 300 021 022 013 021 123 330 023 130 220 020 301
112 221 023 322 003 012 022 130 010 121 323 020 101 212 220 333

Digit Decoding → Digit Flag Vectors
0s  0 0 0 0 1 0 0 0 0 0 1 0 1 1 1 0 0 0 0 0 0 0 0 1 1 0 0 1 0 0 1 0
1s  1 0 0 0 0 1 0 0 1 0 0 0 0 0 0 1 0 1 0 0 0 0 0 0 0 1 0 0 1 0 0 0
2s  0 1 1 1 0 0 1 0 0 0 0 0 0 0 0 0 1 0 0 1 0 1 1 0 0 0 0 0 0 1 0 0
3s  0 0 0 0 0 0 0 1 0 1 0 1 0 0 0 0 0 0 1 0 1 0 0 0 0 0 1 0 0 0 0 1

[Figure: per-tile steps across threads t0 … tT-1, in order: Serial Reduction (regs), Serial Reduction (shmem), SIMD Kogge-Stone Scan (shmem), Serial Scan (shmem), Serial Scan (regs)]

0s total: 9    1s total: 7    2s total: 9    3s total: 7

Scanned flag vectors (after Serial Scan (regs))
0s  0 0 0 0 1 1 1 1 1 1 2 2 3 4 5 5 5 5 5 5 5 5 5 6 7 7 7 8 8 8 9 9
1s  1 1 1 1 1 2 2 2 3 3 3 3 3 3 3 4 4 5 5 5 5 5 5 5 5 6 6 6 7 7 7 7
2s  0 1 2 3 3 3 4 4 4 4 4 4 4 4 4 4 5 5 5 6 6 7 8 8 8 8 8 8 8 9 9 9
3s  0 0 0 0 0 0 0 1 1 2 2 3 3 3 3 3 3 3 4 4 5 5 5 5 5 5 6 6 6 6 6 7

Local Exchange Offsets
10 17 18 19  1 11 20 26 12 27  2 28  3  4  5 13
21 14 29 22 30 23 24  6  7 15 31  8 16 25  9 32

Scatter (SIMD Kogge-Stone Scan, smem exchange)

Exchanged Keys
300 330 130 220 020 130 010 020 220 211 021 021 301 221 121 101
122 302 232 022 112 322 012 022 212 013 123 023 023 003 323 333

0s carry-in:  9    0s carry-out: 18
1s carry-in: 25    1s carry-out: 32
2s carry-in: 33    2s carry-out: 42
3s carry-in: 49    3s carry-out: 56

Global Scatter Offsets
10 11 12 13 14 15 16 17 18 26 27 28 29 30 31 32
33 34 35 36 37 38 39 40 41 50 51 52 53 54 55 56
• Resource-allocation as runtime

1. Kernel memory-wall analysis (kernel fusion)

2. Algorithm serialization
3. Tune for data-movement

4. Warp-synchronous programming

5. Flexible granularity via meta-programming

– Back40Computing (a Google Code Project)
• http://code.google.com/p/back40computing/

– Default sorting method for Thrust
• http://code.google.com/p/thrust/

– A single host-side procedure call launches a kernel that performs orthogonal
program steps

MyUberKernel<<<grid_size, num_threads>>>(d_device_storage);

– No existing public repositories of kernel “subroutines” for scavenging

– Callbacks, iterators, visitors, functors, etc.
– ReduceKernel<<<grid_size, num_threads>>>(CountingIterator(100));
– E.g., the fused kernel (left) can’t be composed using a callback-based pattern

[Figure: fused kernel steps, in order]
GATHER (key), GATHER (value)
Extract radix digit
Encode flag bit (into flag vectors)
LOCAL MULTI-SCAN (flag vectors)
Decode local rank (from flag vectors)
EXCHANGE (key), EXCHANGE (value)
Extract radix digit (again)
Update global radix digit partition offsets
SCATTER (key), SCATTER (value)

Fused radix sorting kernel
• Digit extraction
• Local prefix scan
• Scatter accordingly
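For contrast, a callback-style composition looks roughly like the host-side sketch below. All names and semantics are assumptions for this example: the slides do not define `CountingIterator`, so here it simply yields base + i, and `reduce_sketch` stands in for a reduction kernel parameterized by an input iterator and a functor.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical "fancy iterator": element i is base + i, with no memory behind it.
struct CountingIterator {
    long base;
    explicit CountingIterator(long b) : base(b) {}
    long operator[](std::size_t i) const { return base + static_cast<long>(i); }
};

// Functor passed into the reduction as the combining operator.
struct Sum {
    template <typename T>
    T operator()(T a, T b) const { return a + b; }
};

// Generic reduction over anything indexable; the caller customizes it only
// through the iterator type and the operator.
template <typename T, typename InputIt, typename Op>
T reduce_sketch(InputIt in, std::size_t n, Op op) {
    assert(n > 0);
    T acc = static_cast<T>(in[0]);
    for (std::size_t i = 1; i < n; ++i) acc = op(acc, static_cast<T>(in[i]));
    return acc;
}
```

This pattern customizes where data comes from and how two values combine, but the surrounding loop structure is fixed. A fused kernel that interleaves gather, multi-scan, exchange and scatter has no single hook of that shape, which is the limitation the slide points at.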

– Compiled libraries suffer from code bloat
• CUDPP primitives library is 100s of MBs, yet still doesn’t support all built-in numeric types.
• Specializing for device configurations makes it even worse

– The alternative is to ship source for #include’ing
• Have to be willing to share source

– Need a way to fit meta-programming in at the JIT / bytecode level to help avoid
expansion / mismatch-by-omission

– Can leverage fundamentally different algorithms for different phases
• How to teach the compiler to do this?
