[Harvard CS264] 16 - Managing Dynamic Parallelism on GPUs: A Case Study of High Performance Sorting (Duane Merrill, University of Virginia)
–  Data-independent tasks

–  Tasks with statically-known data dependences



–  SIMD divergence

–  Lacking fine-grained synchronization

–  Lacking writeable, coherent caches
                    32-bit Key-value Sorting    Keys-only Sorting
DEVICE              (10^6 pairs / sec)          (10^6 keys / sec)

NVIDIA GTX 280      449 (3.8x speedup*)         534 (2.9x speedup*)

* Satish et al., "Designing efficient sorting algorithms for manycore GPUs," in IPDPS '09
                                     32-bit Key-value Sorting    Keys-only Sorting
DEVICE                               (10^6 pairs / sec)          (10^6 keys / sec)

NVIDIA GTX 480                       775                         1005
NVIDIA GTX 280                       449                         534
NVIDIA 8800 GT                       129                         171

Intel Knights Ferry MIC 32-core*                                 560
Intel Core i7 quad-core*                                         240
Intel Core-2 quad-core*                                          138

* Satish et al., "Fast Sort on CPUs, GPUs and Intel MIC Architectures," Intel Tech Report 2010.
 
[Diagram: Input -> Thread, Thread, Thread, Thread -> Output]

–  Each output is dependent upon a finite subset of the input
    •  Threads are decomposed by output element
    •  The output (and at least one input) index is a static function of thread-id
[Diagram: Input -> ? -> Output]

–  Each output element has dependences upon any / all input elements
–  E.g., sorting, reduction, compaction, duplicate removal, histogram generation,
   map-reduce, etc.
–  Threads are decomposed by output element
–  Repeatedly iterate over recycled input streams
–  Output stream size is statically known before each pass

[Diagram: two rows of four threads, one row per pass]
[Diagram: pairwise additions ( + + + + )]




–  O(n) global work from passes of pairwise-neighbor-reduction

–  Static dependences, uniform output
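
The pairwise-neighbor reduction above can be sketched in a few lines. This is an illustrative serial model, not the presenter's kernel: the name `pairwise_reduce` and the handling of an odd leftover element are assumptions. It shows that each pass halves the stream and that the total number of additions stays at n - 1, i.e. O(n) global work.

```python
def pairwise_reduce(values):
    """Sum `values` by repeated passes of pairwise-neighbor addition.

    Returns (total, additions, passes). Hypothetical helper for illustration.
    """
    level = list(values)
    additions = 0
    passes = 0
    while len(level) > 1:
        nxt = []
        # Output element i depends on the fixed input pair (2i, 2i + 1),
        # so the dependences are known statically from the "thread" id.
        for i in range(0, len(level) - 1, 2):
            nxt.append(level[i] + level[i + 1])
            additions += 1
        if len(level) % 2 == 1:
            nxt.append(level[-1])  # odd element is carried to the next pass
        level = nxt
        passes += 1
    total = level[0] if level else 0
    return total, additions, passes


total, additions, passes = pairwise_reduce(range(8))
print(total, additions, passes)
```

The output size of every pass is known before the pass starts (half the input, rounded up), which is what makes the allocation uniform.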
–  Repeated pairwise swapping
     • Bubble sort is O(n^2)
     • Bitonic sort is O(n log^2 n)
–  Need partitioning: dynamic, cooperative allocation

–  Repeatedly check each vertex or edge
     • Breadth-first search becomes O(V^2)
     • O(V+E) is work-optimal
–  Need queue: dynamic, cooperative allocation
	
  
–  Variable output per thread
–  Need dynamic, cooperative allocation
[Diagram: Input -> twelve threads -> ? -> Output]

•  Where do I put something in a list?
     –  Duplicate removal
     –  Sorting
     –  Histogram compilation
•  Where do I enqueue something?
     –  Search space exploration
     –  Graph traversal
     –  General work queues
• For 30,000 producers and consumers?




–  Locks serialize everything
Input (& allocation requirement):     2    1    0    3    2
Result of prefix scan (sum):          0    2    3    3    6
Output:                               0    1    2    3    4    5    6    7

[Diagram: five threads, one per input element, between the scan result and the output]

–    O(n) work
–    For allocation: use scan results as a scattering vector
–    Popularized by Blelloch et al. in the '90s
–    Merrill et al. Parallel Scan for Stream Architectures. Technical Report CS2009-14, University of Virginia. 2009
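
To make the "scan results as a scattering vector" idea concrete, here is a minimal serial sketch using the slide's numbers. The helper names (`exclusive_scan`, `scatter_allocate`) are hypothetical, and the loop over "threads" stands in for what would run in parallel on the GPU.

```python
from itertools import accumulate


def exclusive_scan(counts):
    """Exclusive prefix sum: result[i] = sum(counts[:i])."""
    counts = list(counts)
    if not counts:
        return []
    return [0] + list(accumulate(counts))[:-1]


def scatter_allocate(counts):
    """Give each producer a private, contiguous range of output slots.

    counts[t] is how many items producer t wants to write. Returns the
    per-producer offsets and a list naming the owner of every output slot.
    """
    offsets = exclusive_scan(counts)
    owners = [None] * sum(counts)
    for tid, (offset, count) in enumerate(zip(offsets, counts)):
        for k in range(count):
            owners[offset + k] = tid  # no two producers share a slot
    return offsets, owners


offsets, owners = scatter_allocate([2, 1, 0, 3, 2])
print(offsets)  # [0, 2, 3, 3, 6]
print(owners)   # [0, 0, 1, 3, 3, 3, 4, 4]
```

Because every producer learns its offset from the scan, no locks are needed when writing into the eight output slots.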
Key sequence:          1110  0011  1010  0111  1100  1000  0101  0001

Output key sequence:   1110  1010  1100  1000  0011  0111  0101  0001
                       |--------- 0s ---------|  |--------- 1s ---------|
Index                              0     1     2     3     4     5     6     7
Key sequence                    1110  0011  1010  0111  1100  1000  0101  0001

0s
Allocation requirements            1     0     1     0     1     1     0     0
Scanned allocations                0     1     1     2     2     3     4     4
  (bin relocation offsets)
Adjusted allocations               0     1     1     2     2     3     4     4
  (global relocation offsets)

1s
Allocation requirements            0     1     0     1     0     0     1     1
Scanned allocations                0     0     1     1     2     2     2     3
  (bin relocation offsets)
Adjusted allocations               4     4     5     5     6     6     6     7
  (global relocation offsets)

Relocation offset per key          0     4     1     5     2     3     6     7

Output index                       0     1     2     3     4     5     6     7
Output key sequence             1110  1010  1100  1000  0011  0111  0101  0001
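
The table above can be reproduced with a short serial sketch of one radix-split pass. The function names are hypothetical and the per-key loop models what each GPU thread would do; the scan itself is the same exclusive prefix sum shown earlier in the deck.

```python
from itertools import accumulate


def exclusive_scan(flags):
    """Exclusive prefix sum: result[i] = sum(flags[:i])."""
    flags = list(flags)
    if not flags:
        return []
    return [0] + list(accumulate(flags))[:-1]


def split_by_bit(keys, bit=0):
    """Stable partition of `keys` into 0s then 1s on the given bit."""
    ones = [(k >> bit) & 1 for k in keys]   # allocation requirements, 1s bin
    zeros = [1 - f for f in ones]           # allocation requirements, 0s bin
    zero_offsets = exclusive_scan(zeros)    # bin relocation offsets
    base = sum(zeros)                       # the 1s bin starts after all 0s
    one_offsets = [base + o for o in exclusive_scan(ones)]  # global offsets
    dest = [zero_offsets[i] if zeros[i] else one_offsets[i]
            for i in range(len(keys))]
    out = [None] * len(keys)
    for i, d in enumerate(dest):
        out[d] = keys[i]                    # scatter each key to its slot
    return dest, out


keys = [0b1110, 0b0011, 0b1010, 0b0111, 0b1100, 0b1000, 0b0101, 0b0001]
dest, out = split_by_bit(keys, bit=0)
print(dest)                                 # [0, 4, 1, 5, 2, 3, 6, 7]
print([format(k, "04b") for k in out])
```

Repeating the pass for bits 1, 2, and 3 yields a fully sorted sequence, because each split is stable.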
 
[Diagram: Un-fused vs. fused pipelines, each drawn as a Host Program driving GPU steps that touch Global Device Memory.
   Un-fused: Determine allocation size, CUDPP scan, CUDPP scan, CUDPP scan, Distribute output
   Fused:    Determine allocation, Scan, Scan, Scan, Distribute output]
[Diagram: Fused pipeline (Host Program, GPU, Global Device Memory): Determine allocation, Scan, Scan, Scan, Distribute output]

1.  Heavy SMT (over-threading) yields usable "bubbles" of free computation
2.  Propagate live data between steps in fast registers / smem
3.  Use scan (or variant) as a "runtime" for everything
Device             Memory Bandwidth    Compute Throughput          Memory wall      Memory wall
                   (10^9 bytes/s)      (10^9 thread-cycles/s)      (bytes/cycle)    (instrs/word)

GTX 480                 169.0               672.0                     0.251            15.9
GTX 285                 159.0               354.2                     0.449             8.9
GTX 280                 141.7               311.0                     0.456             8.8
Tesla C1060             102.0               312.0                     0.327            12.2
9800 GTX+                70.4               235.0                     0.300            13.4
8800 GT                  57.6               168.0                     0.343            11.7
9800 GT                  57.6               168.0                     0.343            11.7
8800 GTX                 86.4               172.8                     0.500             8.0
Quadro FX 5600           76.8               152.3                     0.504             7.9
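
The two "memory wall" columns appear to follow directly from the first two. The sketch below is a reading of the table, not a formula stated on the slide: it assumes bytes/cycle = bandwidth / compute throughput, instrs/word = 4 bytes / (bytes/cycle) for 32-bit words, and that a read-plus-write (r+w) wall is twice the one-way figure. The function name `memory_wall` is hypothetical.

```python
def memory_wall(bandwidth_gb_per_s, compute_gcycles_per_s, word_bytes=4):
    """Return (bytes per thread-cycle, thread-instructions per word moved).

    Assumes one thread-instruction per thread-cycle and 32-bit words.
    """
    bytes_per_cycle = bandwidth_gb_per_s / compute_gcycles_per_s
    instrs_per_word = word_bytes / bytes_per_cycle
    return bytes_per_cycle, instrs_per_word


# GTX 480 row of the table
bpc, ipw = memory_wall(169.0, 672.0)
print(round(bpc, 3), round(ipw, 1))   # 0.251 15.9

# GTX 285: one word read plus one word written per element (assumed)
_, ipw_285 = memory_wall(159.0, 354.2)
print(round(2 * ipw_285, 1))          # 17.8
```

Read this way, a kernel that issues fewer thread-instructions per word than the wall is memory-bound and has compute headroom left over.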
  
[Figure: Thread-instructions per 32-bit scan element (y axis, 0 to 25) vs. problem size in millions (x axis, 0 to 112). Curves and markers: GTX285 r+w memory wall (17.8 instructions per input word), Data Movement Skeleton, Our Scan Kernel, and an "Insert work here" annotation.]
–    Increase granularity / redundant computation
      • ghost cells
      • radix bits
–    Orthogonal kernel fusion

[Figure: Thread-instructions per 32-bit scan element (y axis, 0 to 25) vs. problem size in millions (x axis, 0 to 112). Curves and markers: GTX285 r+w memory wall (17.8), Data Movement Skeleton, Our Scan Kernel, and an "Insert work here" annotation.]
[Figure: Thread-instructions per 32-bit scan element (y axis, 0 to 25) vs. problem size in millions (x axis, 0 to 120). Curves: CUDPP Scan Kernel and Our Scan Kernel.]
–    Partially-coalesced writes
–    2x write overhead
–    4 total concurrent scan operations (radix 16)

[Figure: Thread-instructions per 32-bit scan element (y axis, 0 to 35) vs. problem size in millions (x axis, 0 to 112). Curves and markers: GTX285 Radix Scatter Kernel Wall, GTX285 Scan Kernel Wall, Our Scan Kernel, and an "Insert work here" annotation.]
–    Need kernels with tunable local (or redundant) work
      • ghost cells
      • radix bits

[Figure: Thread-instructions per 32-bit word (y axis, 0 to 50) vs. problem size in millions (x axis, 0 to 90). Markers: 480 Radix Scatter Kernel Wall and 285 Radix Scatter Kernel Wall.]
 
–  Virtual processors abstract a diversity of hardware configurations



–  Leads to a host of inefficiencies




–  E.g., only several hundred CTAs
Grid A
[Diagram: a long row of threadblocks]
         grid-size = (N / tilesize) CTAs

Grid B
[Diagram: a short row of threadblocks]
         grid-size = 150 CTAs (or other small constant)
[Diagram: threadblock]
–  Thread-dependent predicates
–  Setup and initialization code (notably for smem)
–  Offset calculations (notably for smem)
–  Common values are hoisted and kept live
                                                –  Spills are really bad
log_tilesize(N)-level tree vs. two-level tree
–  O(N / tilesize) gmem accesses
–  2-4 instructions per access (offset calcs, load, store)
–  GPU is least efficient here: get it over with as quickly as possible
[Figure: Thread-instructions / Element vs. Grid Size (# of threadblocks); series: Compute Load, 285 Scan Kernel Wall]
C = number of CTAs
N = problem size
T = tile size
B = tiles per CTA

–  16.1M / 150 CTAs / 1024 = 109.91 tiles per CTA
–  conditional evaluation
–  singleton loads
–  floor(16.1M / (1024 * 150)) = 109 tiles per CTA
–  16.1M % (1024 * 150)        = 136.4 extra tiles
–  floor(16.1M / (1024 * 150)) = 109 tiles per CTA (14 CTAs)
–  109 + 1                     = 110 tiles per CTA (136 CTAs)
–  16.1M % (1024 * 150), less the 136 tiles handed out above = 0.4 extra tiles
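The tile arithmetic above can be sketched in a few lines of host-side Python. This is only an illustration under stated assumptions: `even_share` is a hypothetical name, "16.1M" is taken to mean roughly 16.1 × 2^20 elements, and handing the extra tiles to the lowest-numbered CTAs is an arbitrary choice that the slides do not specify.

```python
def even_share(num_elements, tile_size, num_ctas):
    """Split full tiles across a fixed number of CTAs (hypothetical helper).

    Returns (tiles_per_cta, leftover_elements), where leftover_elements is
    the size of the final partial tile.
    """
    total_tiles = num_elements // tile_size      # full tiles only
    leftover = num_elements % tile_size          # elements in the partial tile
    base, extra = divmod(total_tiles, num_ctas)  # B and the "extra tiles"
    # Assumption: the first `extra` CTAs each take one additional tile.
    counts = [base + 1 if cta < extra else base for cta in range(num_ctas)]
    return counts, leftover


# N ~ 16.1 * 2**20 elements, T = 1024, C = 150 (values from the slide)
counts, leftover = even_share(16_882_073, 1024, 150)
print(counts.count(110), "CTAs process 110 tiles")
print(counts.count(109), "CTAs process 109 tiles")
print(leftover / 1024, "of a tile left over")
```

Because every CTA knows its tile count up front, the per-tile loop needs no conditional evaluation; only the single partial tile needs guarded loads.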
 
–  If you breathe on your code, run it through the VP (Visual Profiler)
    •  Kernel runtimes
    •  Instruction counts

–  Indispensable for tuning
    •  Host-side timing requires too many iterations
    •  Only 1-2 cudaprof iterations for consistent counter-based perf data

–  Write tools to parse the output
    •  “Dummy” kernels useful for demarcation
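A minimal sketch of such a parsing tool follows. The log shape (`key=[ value ]` pairs per line), the kernel names, and the marker name `dummy_marker` are all assumptions made for illustration; check them against the profiler output you actually collect.

```python
import re
from collections import defaultdict

# Assumed line shape: method=[ name ] gputime=[ 12.3 ] cputime=[ 45.6 ]
FIELD = re.compile(r"(\w+)=\[\s*(.*?)\s*\]")


def sum_gputime_by_section(lines, marker="dummy_marker"):
    """Sum gputime per kernel, starting a new section at each marker launch."""
    sections = [defaultdict(float)]
    for line in lines:
        fields = dict(FIELD.findall(line))
        if "method" not in fields or "gputime" not in fields:
            continue  # header or unrelated line
        if fields["method"] == marker:
            sections.append(defaultdict(float))  # dummy kernel: demarcation only
            continue
        sections[-1][fields["method"]] += float(fields["gputime"])
    return [dict(section) for section in sections]


# Made-up sample log, used only to exercise the parser.
log = [
    "# header line",
    "method=[ scan ] gputime=[ 10.5 ] cputime=[ 20.0 ]",
    "method=[ dummy_marker ] gputime=[ 1.0 ] cputime=[ 5.0 ]",
    "method=[ scan ] gputime=[ 9.5 ] cputime=[ 19.0 ]",
    "method=[ scatter ] gputime=[ 30.0 ] cputime=[ 40.0 ]",
    "method=[ scan ] gputime=[ 0.5 ] cputime=[ 2.0 ]",
]
result = sum_gputime_by_section(log)
print(result)
```

Launching an empty marker kernel between experiments keeps each configuration's counters in its own section without any host-side bookkeeping.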
[Figure: Sorting Rate (10^6 keys/sec) vs. Problem size (millions); series: GTX 480, C2050 (no ECC), GTX 285, C2050 (ECC), GTX 280, C1060, 9800 GTX+]
[Figure: Sorting Rate (millions of pairs/sec) vs. Problem size (millions); series: GTX 480, C2050 (no ECC), GTX 285, GTX 280, C2050 (ECC), C1060, 9800 GTX+]
[Figure: Kernel Bandwidth (GiBytes/sec) vs. Problem Size (millions); series: merrill_tree Reduce, merrill_rts Scan]
[Figure: Kernel Bandwidth (10^9 Bytes/sec) vs. Problem Size (millions); series: merrill_linear Reduce, merrill_linear Scan]
–  Implement device “memcpy” for tile-processing
    •  Optimize for “full tiles”

–  Specialize for different SM versions, input types, etc.
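One way to read "optimize for full tiles" is sketched below as a sequential Python model, not device code. The name `copy_tiles` and the structure are assumptions for illustration: full tiles take a path with no per-element bounds check, and only the final partial tile pays for guarded accesses.

```python
def copy_tiles(src, dst, tile_size):
    """Copy src into dst tile by tile (hypothetical model of a tile memcpy).

    Returns (number_of_full_tiles, elements_in_partial_tile).
    """
    n = len(src)
    full_tiles = n // tile_size

    # Fast path: every element of a full tile is known to be in range,
    # so no per-element guard is needed.
    for tile in range(full_tiles):
        base = tile * tile_size
        dst[base:base + tile_size] = src[base:base + tile_size]

    # Slow path: the last, partial tile checks each "lane" against n.
    base = full_tiles * tile_size
    for lane in range(tile_size):
        idx = base + lane
        if idx < n:
            dst[idx] = src[idx]

    return full_tiles, n - base


src = list(range(10))
dst = [None] * len(src)
print(copy_tiles(src, dst, 4))
```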
–  Use templated code to generate various instances

–  Run with cudaprof env vars to collect data
[Figure: One-way, 128-Thread CTA (64B ld): Bandwidth (GiBytes/sec) vs. Words Copied (millions); series: Single, Single (no-overlap), Double, Double (no-overlap), Quad, Quad (no-overlap)]
[Figure: Two-way, 128-Thread CTA (128B ld/st): Bandwidth (GiBytes/sec) vs. Words Copied (millions); series: Single, Single (no-overlap), Double, Double (no-overlap), Quad, Quad (no-overlap), Intrinsic Copy; annotations: "us", "cudaMemcpy()"]
 
Brent-Kung (left), 8 elements in m0..m7 (i = identity):

      m0   m1          m2          m3          m4          m5          m6          m7
t0    x0   x1          x2          x3          x4          x5          x6          x7
t1    x0   ⊕(x0..x1)   x2          ⊕(x2..x3)   x4          ⊕(x4..x5)   x6          ⊕(x6..x7)
t2    x0   ⊕(x0..x1)   x2          ⊕(x0..x3)   x4          ⊕(x4..x5)   x6          ⊕(x4..x7)
t3    x0   ⊕(x0..x1)   x2          i           x4          ⊕(x4..x5)   x6          ⊕(x0..x3)
t4    x0   i           x2          ⊕(x0..x1)   x4          ⊕(x0..x3)   x6          ⊕(x0..x5)
t5    i    x0          ⊕(x0..x1)   ⊕(x0..x2)   ⊕(x0..x3)   ⊕(x0..x4)   ⊕(x0..x5)   ⊕(x0..x6)

Active lanes per step: t0→t1: ⊕0..⊕3; t1→t2: ⊕0, ⊕1; t2→t3: =0, ⊕0; t3→t4: =0, ⊕0, =1, ⊕1; t4→t5: =0, ⊕0, =1, ⊕1, =2, ⊕2, =3, ⊕3

Kogge-Stone (right), 8 elements in m4..m11, with m0..m3 holding the identity i:

      m0..m3    m4   m5          m6          m7          m8          m9          m10         m11
t0    i i i i   x0   x1          x2          x3          x4          x5          x6          x7
t1    i i i i   x0   ⊕(x0..x1)   ⊕(x1..x2)   ⊕(x2..x3)   ⊕(x3..x4)   ⊕(x4..x5)   ⊕(x5..x6)   ⊕(x6..x7)
t2    i i i i   x0   ⊕(x0..x1)   ⊕(x0..x2)   ⊕(x0..x3)   ⊕(x1..x4)   ⊕(x2..x5)   ⊕(x3..x6)   ⊕(x4..x7)
t3    i i i i   x0   ⊕(x0..x1)   ⊕(x0..x2)   ⊕(x0..x3)   ⊕(x0..x4)   ⊕(x0..x5)   ⊕(x0..x6)   ⊕(x0..x7)

Active lanes per step: ⊕0..⊕7 at every step

–  SIMD lanes wasted on O(n)-work Brent-Kung (left), but less work when n > warp size
–  Kogge-Stone (right) is O(n log n)-work, but faster when n ≤ warp size
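The work difference between the two networks can be checked with a small sequential model. This is a sketch under assumptions rather than device code: the function names are hypothetical, n is assumed to be a power of two, each inner loop stands in for one SIMD step, and only ⊕ applications are counted (the `=` copies are not).

```python
import operator


def kogge_stone_inclusive(xs, op=operator.add, identity=0):
    """Identity-padded Kogge-Stone: all n lanes apply op at every step."""
    n = len(xs)
    pad = n // 2                       # identity cells to the left of the data
    buf = [identity] * pad + list(xs)
    ops = 0
    offset = 1
    while offset < n:
        prev = list(buf)               # all lanes read before any lane writes
        for lane in range(n):
            i = pad + lane
            buf[i] = op(prev[i - offset], prev[i])
            ops += 1
        offset *= 2
    return buf[pad:], ops


def brent_kung_exclusive(xs, op=operator.add, identity=0):
    """Up-sweep then down-sweep; fewer and fewer lanes are active per step."""
    vals = list(xs)
    n = len(vals)
    ops = 0
    stride = 1
    while stride < n // 2:             # up-sweep (the root is never needed)
        for i in range(2 * stride - 1, n, 2 * stride):
            vals[i] = op(vals[i - stride], vals[i])
            ops += 1
        stride *= 2
    vals[n - 1] = identity
    stride = n // 2
    while stride >= 1:                 # down-sweep
        for i in range(2 * stride - 1, n, 2 * stride):
            left = vals[i - stride]
            vals[i - stride] = vals[i]     # the "=" copy
            vals[i] = op(vals[i], left)    # the "⊕" step
            ops += 1
        stride //= 2
    return vals, ops


data = [1, 2, 3, 4, 5, 6, 7, 8]
print(kogge_stone_inclusive(data))
print(brent_kung_exclusive(data))
```

For n = 8 the model applies ⊕ 24 times in 3 steps for Kogge-Stone and 13 times in 5 steps for Brent-Kung, which is the trade-off the two bullets describe.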
[Diagram: tree-based vs. raking-based cooperation among the T threads of a threadblock (t0, t1, t2, …, tT/4-1, tT/4, tT/4+1, tT/4+2, …, tT/2-1, tT/2, tT/2+1, tT/2+2, …, t3T/4-1, t3T/4, t3T/4+1, t3T/4+2, …, tT-1)]
Tree-based: three barriers shown, with threads t0 … t3 drawn at each level
vs. raking-based: one barrier shown, followed by threads t0 … t3
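A sequential Python model of a raking-style exclusive scan is sketched below. It assumes the usual meaning of raking: all T threads deposit one element each into shared memory, a single barrier follows, and then a small group of raking threads (four here, matching t0 … t3 in the diagram) each walk a contiguous segment serially. The function name and the 16-element tile are illustrative assumptions.

```python
import operator
from itertools import accumulate


def raking_exclusive_scan(tile, num_rakers=4, op=operator.add, identity=0):
    """Model of a raking scan over one tile; returns (scanned_tile, total)."""
    T = len(tile)
    assert T % num_rakers == 0, "tile must split evenly among raking threads"
    seg = T // num_rakers

    # Phase 1: every thread writes its element to shared memory (one barrier).
    smem = list(tile)

    # Phase 2: each raking thread serially reduces its own segment.
    partials = []
    for r in range(num_rakers):
        acc = identity
        for x in smem[r * seg:(r + 1) * seg]:
            acc = op(acc, x)
        partials.append(acc)

    # Phase 3: exclusive scan of the few partials (small enough for one warp).
    seeds = []
    running = identity
    for p in partials:
        seeds.append(running)
        running = op(running, p)

    # Phase 4: each raking thread rescans its segment, seeded with its prefix.
    for r in range(num_rakers):
        acc = seeds[r]
        for i in range(r * seg, (r + 1) * seg):
            smem[i], acc = acc, op(acc, smem[i])

    return smem, running


tile = list(range(1, 17))
scanned, total = raking_exclusive_scan(tile)
print(scanned, total)
```

The serial segment walks need no barriers between them, which is the contrast with the tree-based scheme, where every level is separated by a barrier.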
  +	
  
                                                                                                                                                                                                                         2	
  
                                                                                                                                                                                                                                         …	
   t
                                                                                                                                                                                                                                               3T/4	
  -­‐1	
          t3T/4	
  	
     t3T/4+1	
     t3T/4+2	
     …	
   t           	
  
                                                                                                                                                                                                                                                                                                                         T	
  -­‐	
  1




vs.	
  raking-­‐based:	
     barrier	
     t0   	
                                                t1     	
                                                                      t2     	
                                                                                t3   	
  
                                           t0   	
                                                t1     	
                                                                      t2     	
                                                                                t3   	
  
                                           t0   	
                                                t1     	
                                                                      t2     	
                                                                                t3   	
  
                                                                                                                                                         t0   	
                           t1  	
                             t2       	
                          	
  
                                                                                                                                                                                                                                                                  t3

                                                                                                                                                         t0   	
                           t1  	
                             t2       	
                          	
  
                                                                                                                                                                                                                                                                  t3

                                                                                                                                                         t0   	
                           t1  	
                             t2       	
                          	
  
                                                                                                                                                                                                                                                                  t3

                                                                                                                                                         t0   	
                           t1  	
                             t2       	
                          	
  
                                                                                                                                                                                                                                                                  t3
DMA	
  
                                                         t0	
        t1	
     t2	
     …	
   t 	
  
                                                                                                T/4	
     tT/4	
  	
     tT/4	
  +	
  
                                                                                                                             1	
  
                                                                                                                                         tT/4	
  +	
  
                                                                                                                                             2	
  
                                                                                                                                                         …	
   t 	
  
                                                                                                                                                                  T/2	
  -­‐	
     tT/2	
  	
       tT/2	
  +	
  
                                                                                                                                                                                                      1	
  
                                                                                                                                                                                                                    tT/2	
  +	
  
                                                                                                                                                                                                                      2	
  
                                                                                                                                                                                                                                     …	
   t
                                                                                                                                                                                                                                           3T/4	
  
                                                                                                                                                                                                                                           -­‐1	
  
                                                                                                                                                                                                                                                           t3T/4	
  	
     t3T/
                                                                                                                                                                                                                                                                           4+1	
  
                                                                                                                                                                                                                                                                                     t3T/
                                                                                                                                                                                                                                                                                     4+2	
  
                                                                                                                                                                                                                                                                                               …	
   t           	
  
                                                                                                                                                                                                                                                                                                     T	
  -­‐	
  1


–  Barriers make O(n) code O(n log n)
–  The rest are “DMA engine” threads
–  Use threadblocks to cover pipeline latencies, e.g., for Fermi:
    •  2 worker warps per CTA
    •  6-7 CTAs
[Figure: rows of threads t0 … t3 labeled Worker]
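One way to read the barrier bullet is as a count of thread-steps. The sketch below is a sequential C++ cost model, not GPU code and not the author's implementation. It assumes that in the tree-based scheme every one of the n threads occupies a slot at each barrier-delimited level, and that in the raking scheme a small set of worker threads each reduce a contiguous segment serially after a single barrier. The names `ReduceStats`, `tree_reduce`, and `raking_reduce` are hypothetical, and the initial deposit of elements into shared storage is left out of both counts.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Cost counters for a simulated block-wide reduction (illustrative model only).
struct ReduceStats {
    long long sum = 0;             // reduction result
    std::size_t barriers = 0;      // block-wide barriers executed
    std::size_t thread_slots = 0;  // thread-steps consumed, idle threads included
};

// Tree-based: one thread per element and a barrier after every level.
// Assumption: all n threads occupy a slot at each level, because every
// thread must reach the barrier even when it has nothing to add.
ReduceStats tree_reduce(std::vector<long long> data) {
    const std::size_t n = data.size();
    assert(n > 0 && (n & (n - 1)) == 0);  // power-of-two size keeps the sketch short
    ReduceStats s;
    for (std::size_t stride = n / 2; stride > 0; stride /= 2) {
        for (std::size_t t = 0; t < stride; ++t) data[t] += data[t + stride];
        s.thread_slots += n;  // n threads step through this level
        s.barriers += 1;      // one barrier per level
    }
    s.sum = data[0];
    return s;
}

// Raking-based: a single barrier, then `workers` threads each rake a
// contiguous segment serially. The other threads only move data.
ReduceStats raking_reduce(const std::vector<long long>& data, std::size_t workers) {
    const std::size_t n = data.size();
    assert(workers > 0 && n % workers == 0);
    const std::size_t segment = n / workers;
    ReduceStats s;
    s.barriers = 1;  // single barrier before the raking phase
    for (std::size_t w = 0; w < workers; ++w) {
        long long partial = 0;
        for (std::size_t i = 0; i < segment; ++i) partial += data[w * segment + i];
        s.sum += partial;                // combining a partial is modeled as one step
        s.thread_slots += segment + 1;   // serial rake plus the combine step
    }
    return s;
}
```

Under this model, 1024 elements cost 10 barriers and 10240 thread-steps in the tree-based version, against 1 barrier and 1056 thread-steps with 32 raking workers. Both return the same sum.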
 
–  Different SMs (varied local storage: registers/smem)
–  Different input types (e.g., sorting chars vs. ulongs)



–  # of steps for each algorithm phase is configuration-driven

–  Template expansion + Constant-propagation + Static loop unrolling +
   Preprocessor Macros
–  Compiler produces a target assembly that is well-tuned for the specifically
   targeted hardware and problem
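The configuration-driven approach in the bullets above can be sketched with plain C++ templates. This is a minimal host-side illustration under assumed names (`SortPolicy`, `Unroll`, `count_digits` are hypothetical, not taken from the talk). A policy type fixes the key type, radix bits, and keys per thread at compile time, the number of passes follows by constant propagation, and a recursive template stands in for static loop unrolling.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical tuning policy: every quantity is a compile-time constant.
template <typename Key, int RADIX_BITS, int KEYS_PER_THREAD>
struct SortPolicy {
    using KeyType = Key;
    static constexpr int kRadixBits = RADIX_BITS;
    static constexpr int kRadixDigits = 1 << RADIX_BITS;
    static constexpr int kKeyBits = static_cast<int>(sizeof(Key) * 8);
    // Number of digit passes needed to cover the whole key.
    static constexpr int kPasses = (kKeyBits + RADIX_BITS - 1) / RADIX_BITS;
    static constexpr int kKeysPerThread = KEYS_PER_THREAD;
};

// Static loop unrolling: Unroll<0, N>::run(f) expands to f(0); ...; f(N-1).
template <int I, int N>
struct Unroll {
    template <typename F>
    static void run(F& f) {
        f(I);
        Unroll<I + 1, N>::run(f);
    }
};
template <int N>
struct Unroll<N, N> {
    template <typename F>
    static void run(F&) {}  // recursion ends here
};

// Histogram the digits of one thread's keys for a given pass.
template <typename Policy>
void count_digits(const typename Policy::KeyType (&keys)[Policy::kKeysPerThread],
                  int pass,
                  int (&counts)[Policy::kRadixDigits]) {
    assert(pass >= 0 && pass < Policy::kPasses);
    auto body = [&](int i) {
        const unsigned digit =
            static_cast<unsigned>(keys[i] >> (pass * Policy::kRadixBits)) &
            static_cast<unsigned>(Policy::kRadixDigits - 1);
        ++counts[digit];
    };
    Unroll<0, Policy::kKeysPerThread>::run(body);
}
```

With 4 radix bits, a policy over `unsigned char` keys needs 2 passes, while one over `unsigned long long` keys needs 16 (assuming 8-bit and 64-bit types respectively). Only the template arguments change between the two cases.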
Subsequence of Keys
Keys    211 122 302 232 300 021 022 013   021 123 330 023 130 220 020 301   112 221 023 322 003 012 022 130   010 121 323 020 101 212 220 333

Digit Decoding

Digit Flag Vectors
0s        0   0   0   0   1   0   0   0     0   0   1   0   1   1   1   0     0   0   0   0   0   0   0   1     1   0   0   1   0   0   1   0
1s        1   0   0   0   0   1   0   0     1   0   0   0   0   0   0   1     0   1   0   0   0   0   0   0     0   1   0   0   1   0   0   0
2s        0   1   1   1   0   0   1   0     0   0   0   0   0   0   0   0     1   0   0   1   0   1   1   0     0   0   0   0   0   1   0   0
3s        0   0   0   0   0   0   0   1     0   1   0   1   0   0   0   0     0   0   1   0   1   0   0   0     0   0   1   0   0   0   0   1
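The decoding step in the figure can be reproduced with a short host-side sketch. It assumes the keys are three base-4 digits and that the flags are taken on the least-significant digit, which is what the flag vectors above are consistent with. All function names (`parse_base4`, `decode_digit`, `digit_flags`, `digit_counts`) are hypothetical, and the final count is a plain serial sum rather than the author's register-based reduction.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

constexpr int kRadixBits = 2;            // base-4 digits, as in the figure
constexpr int kRadix = 1 << kRadixBits;  // four possible digit values

// Parse a key written as base-4 digits (e.g. "211") into an integer.
std::uint32_t parse_base4(const std::string& s) {
    std::uint32_t key = 0;
    for (char c : s) {
        assert(c >= '0' && c < '0' + kRadix);
        key = (key << kRadixBits) | static_cast<std::uint32_t>(c - '0');
    }
    return key;
}

std::vector<std::uint32_t> parse_keys(const std::vector<std::string>& texts) {
    std::vector<std::uint32_t> keys;
    for (const std::string& t : texts) keys.push_back(parse_base4(t));
    return keys;
}

// Extract the digit at `place` (0 = least significant).
std::uint32_t decode_digit(std::uint32_t key, int place) {
    return (key >> (place * kRadixBits)) & static_cast<std::uint32_t>(kRadix - 1);
}

using FlagVectors = std::array<std::vector<int>, kRadix>;

// Build one 0/1 flag vector per digit value: flags[d][i] is 1 when
// key i has digit d at the given place.
FlagVectors digit_flags(const std::vector<std::uint32_t>& keys, int place) {
    FlagVectors flags;
    for (auto& f : flags) f.assign(keys.size(), 0);
    for (std::size_t i = 0; i < keys.size(); ++i) {
        flags[decode_digit(keys[i], place)][i] = 1;
    }
    return flags;
}

// Serially sum each flag vector to get the per-digit counts.
std::array<int, kRadix> digit_counts(const FlagVectors& flags) {
    std::array<int, kRadix> counts{};
    for (int d = 0; d < kRadix; ++d) {
        for (int f : flags[d]) counts[d] += f;
    }
    return counts;
}
```

Applied to the first eight keys of the figure (211 122 302 232 300 021 022 013), `digit_flags(keys, 0)` yields the first eight entries of each flag vector shown above, and `digit_counts` returns 1, 2, 4, and 1 for digits 0 through 3.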
  




Serial Reduction (regs)
[Figure: four rows of threads t0 … tT-1, followed by rows of threads t0 … t3]
  
                                                                                                                                                                                             t0	
                                                                                                                                                                                                                                                                                                                                                                                                   t1	
                                                                                                                                                                                                                                                                                                                                                                                                     t2	
                                                                                                                                                                                                                                                                                                                                                                                    t3	
  

                                                                                                                                                                                                                t0	
                                                                                                                                                                                                                                                                                                                                                                                                 t1	
                                                                                                                                                                                                                                                                                                                                                                                                     t2	
                                                                                                                                                                                                                                                                                                                                                                                    t3	
  




                                                                                                                                                                   t0	
                                                                                                                                                                                                                                                                                                                                                                                               t1	
                                                                                                                                                                                                                                                                                                                                                                                                      t2	
                                                                                                                                                                                                                                                                                                                                                                                   t3	
  




                      Serial	
  	
  
                    (shmem)	
  
                    ReducOon	
  
                                                                                                                                                                              t0	
                                                                                                                                                                                                                                                                                                                                                                                                    t1	
                                                                                                                                                                                                                                                                                                                                                                                                   t2	
                                                                                                                                                                                                                                                                                                                                                                                    t3	
  

                                                                                                                                                                                             t0	
                                                                                                                                                                                                                                                                                                                                                                                                   t1	
                                                                                                                                                                                                                                                                                                                                                                                                     t2	
                                                                                                                                                                                                                                                                                                                                                                                    t3	
  

                                                                                                                                                                                                                t0	
                                                                                                                                                                                                                                                                                                                                                                                                 t1	
                                                                                                                                                                                                                                                                                                                                                                                                     t2	
                                                                                                                                                                                                                                                                                                                                                                                    t3	
  




                                                                                                                                                                   t0	
                                                                                                                                                                                                                                                                                                                                                                                               t1	
                                                                                                                                                                                                                                                                                                                                                                                                      t2	
                                                                                                                                                                                                                                                                                                                                                                                   t3	
  
                                                                                                                                                                              t0	
                                                                                                                                                                                                                                                                                                                                                                                                    t1	
                                                                                                                                                                                                                                                                                                                                                                                                   t2	
                                                                                                                                                                                                                                                                                                                                                                                    t3	
  

                                                                                                                                                                                             t0	
                                                                                                                                                                                                                                                                                                                                                                                                   t1	
                                                                                                                                                                                                                                                                                                                                                                                                     t2	
                                                                                                                                                                                                                                                                                                                                                                                    t3	
  
                                                                                                                                                                                                                t0	
                                                                                                                                                                                                                                                                                                                                                                                                 t1	
                                                                                                                                                                                                                                                                                                                                                                                                     t2	
                                                                                                                                                                                                                                                                                                                                                                                    t3	
  




                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          t0	
                                                                                                                                                               t1	
                                                                                                                                                 t2	
                                                                                                                                                    t3	
  

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            t0	
                                                                                                                                                             t1	
                                                                                                                                                  t2	
                                                                                                                                                  t3	
  
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  t0	
                                                                                                                                                         t1	
                                                                                                                                                 t2	
                                                                                                                                                t3	
  

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                   t0	
                                                                                                                                                        t1	
                                                                                                                                                   t2	
                                                                                                                                             t3	
  




                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          t0	
                                                                                                                                                               t1	
                                                                                                                                                 t2	
                                                                                                                                                    t3	
  

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            t0	
                                                                                                                                                             t1	
                                                                                                                                                  t2	
                                                                                                                                                  t3	
  

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  t0	
                                                                                                                                                         t1	
                                                                                                                                                 t2	
                                                                                                                                                t3	
  
                             Scan	
  	
  
                        (shmem)	
  




                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                   t0	
                                                                                                                                                        t1	
                                                                                                                                                   t2	
                                                                                                                                             t3	
  
                    SIMD	
  Kogge-­‐Stone	
  	
  




                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          t0	
                                                                                                                                                               t1	
                                                                                                                                                 t2	
                                                                                                                                                    t3	
  
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                9	
  
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            t0	
                                                                                                                                                             t1	
                                                                                                                                                  t2	
                                                                                                                                                  t3	
  
[Figure: Serial Scan (shmem). Diagram of threads t0 to t3 with rows labeled 0s total, 1s total, 2s total, and 3s total.]
                                                                                                                                                                                                                                                                                                                                                                                              t2	
                                                                                                                                                                                                                                                                                                                                                                                                                   t3	
  
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    t0	
                                                                                                                                                                                                                                                                                                                                                                                                t1	
                                                                                                                                                                                                                                                                                                                                                                                             t2	
                                                                                                                                                                                                                                                                                                                                                                                                                         t3	
  




[Figure: four identical rows of T threads, each row shown in quarters]
t0  t1  …  t(T/4-2)  t(T/4-1)    t(T/4)  t(T/4+1)  …  t(T/2-2)  t(T/2-1)    t(T/2)  t(T/2+1)  …  t(3T/4-2)  t(3T/4-1)    t(3T/4)  t(3T/4+1)  …  t(T-2)  t(T-1)

(regs)
Serial Scan
  




0s:  0 0 0 0 1 1 1 1    1 1 2 2 3 4 5 5    5 5 5 5 5 5 5 6    7 7 7 8 8 8 9 9
1s:  1 1 1 1 1 2 2 2    3 3 3 3 3 3 3 4    4 5 5 5 5 5 5 5    5 6 6 6 7 7 7 7
2s:  0 1 2 3 3 3 4 4    4 4 4 4 4 4 4 4    5 5 5 6 6 7 8 8    8 8 8 8 8 9 9 9
3s:  0 0 0 0 0 0 0 1    1 2 2 3 3 3 3 3    3 3 4 4 5 5 5 5    5 5 6 6 6 6 6 7

0s total    1s total    2s total    3s total
  




t0  t1  t2  t3
t0  t1  t2  t3
t0  t1  t2  t3
t0  t1  t2  t3
  




Local Exchange Offsets

10 17 18 19  1 11 20 26    12 27  2 28  3  4  5 13    21 14 29 22 30 23 24  6     7 15 31  8 16 25  9 32

Scatter
  




                                                                                                                                                                                                                                                                                                                                                                                                      …	
                                                                                                                                                                                                                                                                                                                                                                                        …	
                                                                                                                                                                                                                                                                                                                                                                                                          …	
                                                                                                                                                                                                                                                                                                                                                                                     …	
  
                         (SIMD	
  Kogge-­‐Stone	
  Scan,	
  smem	
  exchange)	
  




                                                                                                                                                                                                                                                     t0	
                                                                                 t1	
                                                                                                                tT/4	
  -­‐2	
                                                                       tT/4	
  -­‐1	
                                                                                    tT/4	
  	
                                                                   tT/4	
  +	
  1	
                                                                                                         tT/2	
  -­‐	
  2	
                                                                            tT/2	
  -­‐	
  1	
                                                                                     tT/2	
  	
                                                                   tT/2	
  +	
  1	
                                                                                                       t3T/4	
  -­‐2	
                                                                t3T/4	
  -­‐1	
                                                                                 t3T/4	
  	
                                                                    t3T/4+1	
                                                                                                                                 tT	
  -­‐	
  2	
                                                                                             tT	
  -­‐	
  1	
  




                                                                                                                                                                                                                                                   t0	
                                                                                 t1	
                                                                                                                tT/4	
  -­‐2	
                                                                       tT/4	
  -­‐1	
                                                                                    tT/4	
  	
                                                                  tT/4	
  +	
  1	
                                                                                                       tT/2	
  -­‐	
  2	
                                                                            tT/2	
  -­‐	
  1	
                                                                                        tT/2	
  	
                                                                   tT/2	
  +	
  1	
                                                                                                       t3T/4	
  -­‐2	
                                                                 t3T/4	
  -­‐1	
                                                                                 t3T/4	
  	
                                                                   t3T/4+1	
                                                                                                                                 tT	
  -­‐	
  2	
                                                                                             tT	
  -­‐	
  1	
  
                                                                                                                                                                                                                                                                                                                                                                                             …	
                                                                                                                                                                                                                                                                                                                                                                                        …	
                                                                                                                                                                                                                                                                                                                                                                                                          …	
                                                                                                                                                                                                                                                                                                                                                                                   …	
  




                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                         Exchanged	
  Keys	
  




                                                                                                                           300	
                        330	
                                130	
                                                220	
                                               020	
                                                130	
                                                010	
                                               020	
                                                                        220	
                                            211	
                                            021	
                                            021	
                                              301	
                                              221	
                                                   121	
                                                      101	
                                                                        122	
                                              302	
                                               232	
                                           022	
                                                 122	
                                              112	
                                               322	
                                        022	
                                                            212	
                                             013	
                                              123	
                                                    023	
                                                             023	
                                                              003	
                                                            323	
               333	
  




                                                                                                                                                                                    9	
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                     18	
  

                                                                                                                                        0s	
  carry-­‐in	
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                 0s	
  carry-­‐out	
  
                                                                                                                                                                                                25	
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              32	
  

                                                                                                                                                  1s	
  carry-­‐in	
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                             1s	
  carry-­‐out	
  
                                                                                                                                                                                                                      33	
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                             42	
  

                                                                                                                                                          2s	
  carry-­‐in	
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                           2s	
  carry-­‐out	
  
                                                                                                                                                                                                                                    49	
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      56	
  

                                                                                                                                                                      3s	
  carry-­‐in	
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      3s	
  carry-­‐out	
  

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            Global	
  Scatter	
  Offsets	
  




                                                                                                                            10	
                         11	
                                  12	
                                                 13	
                                               14	
                                                  15	
                                                 16	
                                                 17	
                                                                       18	
                                             26	
                                             27	
                                              28	
                                              29	
                                               30	
                                                      31	
                                                        32	
                                                                      33	
                                               34	
                                                35	
                                            36	
                                                  37	
                                              38	
                                                39	
                                          40	
                                                              41	
                                             50	
                                                   51	
                                                  52	
                                                              53	
                                                                  54	
                                                         55	
               56	
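
The offset computation in the figure can be sketched on the CPU as follows. This is a minimal illustration, not the talk's kernel: the helper names `kogge_stone_inclusive_scan` and `scatter_offsets` are hypothetical, the scan is simulated sequentially rather than run in SIMD lock-step, and the 1-based "carry-in plus rank" convention is an assumption chosen to resemble the figure.

```cpp
#include <array>
#include <cstddef>
#include <vector>

// Hypothetical helper: inclusive Kogge-Stone scan, simulated sequentially.
// In round d, every element i >= d adds the value d positions to its left.
std::vector<int> kogge_stone_inclusive_scan(std::vector<int> v) {
    for (std::size_t d = 1; d < v.size(); d *= 2) {
        std::vector<int> prev = v;  // snapshot models lock-step SIMD reads
        for (std::size_t i = d; i < v.size(); ++i) {
            v[i] = prev[i] + prev[i - d];
        }
    }
    return v;
}

// Hypothetical helper: stable scatter offsets for one radix-4 digit pass.
// carry_in[d] is the number of output slots already used for digit d
// (assumed convention; offsets are 1-based).
std::vector<int> scatter_offsets(const std::vector<int>& digits,
                                 const std::array<int, 4>& carry_in) {
    std::vector<int> offsets(digits.size());
    for (int d = 0; d < 4; ++d) {
        // Flag vector: 1 where the key's current digit equals d.
        std::vector<int> flags(digits.size());
        for (std::size_t i = 0; i < digits.size(); ++i) {
            flags[i] = (digits[i] == d) ? 1 : 0;
        }
        // The scan of the flags gives each matching key its rank among digit-d keys.
        std::vector<int> rank = kogge_stone_inclusive_scan(flags);
        for (std::size_t i = 0; i < digits.size(); ++i) {
            if (digits[i] == d) offsets[i] = carry_in[d] + rank[i];
        }
    }
    return offsets;
}
```

For example, digits `{0, 1, 0, 3, 2, 1}` with carry-ins `{9, 25, 33, 49}` yield offsets `{10, 26, 11, 50, 34, 27}`: keys with equal digits keep their relative order, which is what makes the pass stable.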
  
•  Resource-allocation as runtime



1.  Kernel memory-wall analysis (kernel fusion)

2.  Algorithm serialization
3.  Tune for data-movement

4.  Warp-synchronous programming

5.  Flexible granularity via meta-programming
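
As one possible reading of item 5, granularity can be fixed at compile time through template parameters, so the same source yields differently sized tiles per device. The names `TilePolicy` and `tiles_needed` below are hypothetical and do not come from the talk or from Back40Computing; the sketch only shows the pattern.

```cpp
#include <cstddef>

// Hypothetical policy type: the tile size is a compile-time constant
// derived from threads per block and items per thread.
template <int THREADS, int ITEMS_PER_THREAD>
struct TilePolicy {
    static constexpr int kTileSize = THREADS * ITEMS_PER_THREAD;
};

// Number of tiles needed to cover n items under a given policy (rounds up).
template <typename Policy>
constexpr std::size_t tiles_needed(std::size_t n) {
    return (n + static_cast<std::size_t>(Policy::kTileSize) - 1) /
           static_cast<std::size_t>(Policy::kTileSize);
}
```

With `TilePolicy<128, 4>` a problem of 1000 items needs 2 tiles, while `TilePolicy<64, 2>` needs 8. Each distinct policy is a separate instantiation, which is also where the code-size cost of this approach comes from.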
–  Back40Computing (a Google Code Project)
    •  http://code.google.com/p/back40computing/

–  Default sorting method for Thrust
    •  http://code.google.com/p/thrust/
–  A single host-side procedure call launches a kernel that performs orthogonal
   program steps


MyUberKernel<<<grid_size, num_threads>>>(d_device_storage);




–  No existing public repositories of kernel “subroutines” for scavenging
–  Callbacks, iterators, visitors, functors, etc.

–  ReduceKernel<<<grid_size, num_threads>>>(CountingIterator(100));

–  E.g., the fused kernel below can’t be composed using a callback-based pattern

Fused radix sorting kernel
    • Digit extraction
    • Local prefix scan
    • Scatter accordingly

Kernel steps (from the flow diagram):
    GATHER (key), GATHER (value)
    Extract radix digit
    Encode flag bit (into flag vectors)
    LOCAL MULTI-SCAN (flag vectors)
    Decode local rank (from flag vectors)
    EXCHANGE (key), EXCHANGE (value)
    Extract radix digit (again)
    Update global radix digit partition offsets
    SCATTER (key), SCATTER (value)
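
The callback and iterator style of composition mentioned above can be sketched on the host like this. It is a CPU stand-in, not the slide's `ReduceKernel`: `reduce_n`, `Sum`, and this `CountingIterator` are hypothetical, and the meaning of the argument in `CountingIterator(100)` on the slide is not stated, so the sketch takes a start value and an explicit element count instead.

```cpp
#include <cstddef>

// Hypothetical stand-in for a counting iterator: "element" i is start + i,
// so no input array has to exist in memory.
struct CountingIterator {
    int start;
    int operator[](std::size_t i) const { return start + static_cast<int>(i); }
};

// Functor supplied by the caller; the reduction itself knows nothing about it.
struct Sum {
    int operator()(int a, int b) const { return a + b; }
};

// Generic reduction: works with raw pointers or any type providing operator[].
template <typename InputIt, typename BinaryOp>
int reduce_n(InputIt in, std::size_t n, int init, BinaryOp op) {
    int acc = init;
    for (std::size_t i = 0; i < n; ++i) {
        acc = op(acc, in[i]);
    }
    return acc;
}
```

Calling `reduce_n(CountingIterator{0}, 100, 0, Sum{})` sums the integers 0 through 99. The pattern composes well when the library owns a single step and the caller plugs in element access or an operator; the fused sorting kernel interleaves gather, scan, exchange, and scatter, which does not fit a single callback slot.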
–  Compiled libraries suffer from code bloat
    • CUDPP primitives library is 100s of MBs, yet still doesn’t support all built-in numeric types.
    • Specializing for device configurations makes it even worse

–  The alternative is to ship source for #include’ing
    • Have to be willing to share source

–  Need a way to fit meta-programming in at the JIT / bytecode level to help avoid
   expansion / mismatch-by-omission



–  Can leverage fundamentally different algorithms for different phases
    • How to teach the compiler to do this?

My Ph.D. Research
[Harvard CS264] 02 - Parallel Thinking, Architecture, Theory & Patterns
GPU programming
Introduction To Parallel Computing
AFDS 2011 Phil Rogers Keynote: “The Programmer’s Guide to the APU Galaxy.”
C++ Data-flow Parallelism sounds great! But how practical is it? Let’s see ho...
Ad

More from npinto (12)

PDF
"AI" for Blockchain Security (Case Study: Cosmos)
PDF
[Harvard CS264] 06 - CUDA Ninja Tricks: GPU Scripting, Meta-programming & Aut...
PDF
[Harvard CS264] 05 - Advanced-level CUDA Programming
PDF
[Harvard CS264] 04 - Intermediate-level CUDA Programming
PDF
[Harvard CS264] 03 - Introduction to GPU Computing, CUDA Basics
PDF
[Harvard CS264] 01 - Introduction
PDF
IAP09 CUDA@MIT 6.963 - Guest Lecture: CUDA Tricks and High-Performance Comput...
PDF
IAP09 CUDA@MIT 6.963 - Lecture 07: CUDA Advanced #2 (Nicolas Pinto, MIT)
PDF
MIT 6.870 - Template Matching and Histograms (Nicolas Pinto, MIT)
PDF
IAP09 CUDA@MIT 6.963 - Lecture 03: CUDA Basics #2 (Nicolas Pinto, MIT)
PDF
IAP09 CUDA@MIT 6.963 - Lecture 02: CUDA Basics #1 (Nicolas Pinto, MIT)
PDF
IAP09 CUDA@MIT 6.963 - Lecture 01: High-Throughput Scientific Computing (Hans...
"AI" for Blockchain Security (Case Study: Cosmos)
[Harvard CS264] 06 - CUDA Ninja Tricks: GPU Scripting, Meta-programming & Aut...
[Harvard CS264] 05 - Advanced-level CUDA Programming
[Harvard CS264] 04 - Intermediate-level CUDA Programming
[Harvard CS264] 03 - Introduction to GPU Computing, CUDA Basics
[Harvard CS264] 01 - Introduction
IAP09 CUDA@MIT 6.963 - Guest Lecture: CUDA Tricks and High-Performance Comput...
IAP09 CUDA@MIT 6.963 - Lecture 07: CUDA Advanced #2 (Nicolas Pinto, MIT)
MIT 6.870 - Template Matching and Histograms (Nicolas Pinto, MIT)
IAP09 CUDA@MIT 6.963 - Lecture 03: CUDA Basics #2 (Nicolas Pinto, MIT)
IAP09 CUDA@MIT 6.963 - Lecture 02: CUDA Basics #1 (Nicolas Pinto, MIT)
IAP09 CUDA@MIT 6.963 - Lecture 01: High-Throughput Scientific Computing (Hans...

Recently uploaded (20)

PDF
Saundersa Comprehensive Review for the NCLEX-RN Examination.pdf
PDF
Insiders guide to clinical Medicine.pdf
PDF
O7-L3 Supply Chain Operations - ICLT Program
PDF
RMMM.pdf make it easy to upload and study
PDF
Abdominal Access Techniques with Prof. Dr. R K Mishra
PDF
Classroom Observation Tools for Teachers
PDF
Basic Mud Logging Guide for educational purpose
PDF
102 student loan defaulters named and shamed – Is someone you know on the list?
PDF
The Lost Whites of Pakistan by Jahanzaib Mughal.pdf
PPTX
Final Presentation General Medicine 03-08-2024.pptx
PDF
Pre independence Education in Inndia.pdf
PDF
Chapter 2 Heredity, Prenatal Development, and Birth.pdf
PDF
01-Introduction-to-Information-Management.pdf
PDF
ANTIBIOTICS.pptx.pdf………………… xxxxxxxxxxxxx
PDF
VCE English Exam - Section C Student Revision Booklet
PDF
STATICS OF THE RIGID BODIES Hibbelers.pdf
PDF
Physiotherapy_for_Respiratory_and_Cardiac_Problems WEBBER.pdf
PDF
O5-L3 Freight Transport Ops (International) V1.pdf
PDF
Supply Chain Operations Speaking Notes -ICLT Program
PPTX
Cell Types and Its function , kingdom of life
Saundersa Comprehensive Review for the NCLEX-RN Examination.pdf
Insiders guide to clinical Medicine.pdf
O7-L3 Supply Chain Operations - ICLT Program
RMMM.pdf make it easy to upload and study
Abdominal Access Techniques with Prof. Dr. R K Mishra
Classroom Observation Tools for Teachers
Basic Mud Logging Guide for educational purpose
102 student loan defaulters named and shamed – Is someone you know on the list?
The Lost Whites of Pakistan by Jahanzaib Mughal.pdf
Final Presentation General Medicine 03-08-2024.pptx
Pre independence Education in Inndia.pdf
Chapter 2 Heredity, Prenatal Development, and Birth.pdf
01-Introduction-to-Information-Management.pdf
ANTIBIOTICS.pptx.pdf………………… xxxxxxxxxxxxx
VCE English Exam - Section C Student Revision Booklet
STATICS OF THE RIGID BODIES Hibbelers.pdf
Physiotherapy_for_Respiratory_and_Cardiac_Problems WEBBER.pdf
O5-L3 Freight Transport Ops (International) V1.pdf
Supply Chain Operations Speaking Notes -ICLT Program
Cell Types and Its function , kingdom of life

[Harvard CS264] 16 - Managing Dynamic Parallelism on GPUs: A Case Study of High Performance Sorting (Duane Merrill, University of Virginia)

  • 2. –  Data-independent tasks –  Tasks with statically-known data dependences –  SIMD divergence –  Lacking fine-grained synchronization –  Lacking writeable, coherent caches
  • 5. DEVICE            32-bit Key-value Sorting   Keys-only Sorting
                          (10^6 pairs / sec)         (10^6 keys / sec)
      NVIDIA GTX 280    449 (3.8x speedup*)        534 (2.9x speedup*)
      * Satish et al., "Designing efficient sorting algorithms for manycore GPUs," in IPDPS '09
  • 6. DEVICE            32-bit Key-value Sorting   Keys-only Sorting
                          (10^6 pairs / sec)         (10^6 keys / sec)
      NVIDIA GTX 480    775                        1005
      NVIDIA GTX 280    449                        534
      NVIDIA 8800 GT    129                        171
  • 7. DEVICE            32-bit Key-value Sorting   Keys-only Sorting
                          (10^6 pairs / sec)         (10^6 keys / sec)
      NVIDIA GTX 480    775                        1005
      NVIDIA GTX 280    449                        534
      NVIDIA 8800 GT    129                        171
      Intel Knight's Ferry MIC 32-core*: 560
      Intel Core i7 quad-core*: 240
      Intel Core-2 quad-core*: 138
      * Satish et al., "Fast Sort on CPUs, GPUs and Intel MIC Architectures," Intel Tech Report 2010.
  • 9. [Diagram: Input → Threads → Output] –  Each output is dependent upon a finite subset of the input •  Threads are decomposed by output element •  The output (and at least one input) index is a static function of thread-id
  • 10. [Diagram: Input → ? → Output] –  Each output element has dependences upon any / all input elements –  E.g., sorting, reduction, compaction, duplicate removal, histogram generation, map-reduce, etc.
  • 11. [Diagram: rows of Threads over successive passes] –  Threads are decomposed by output element –  Repeatedly iterate over recycled input streams –  Output stream size is statically known before each pass
  • 12. [Diagram: tree of pairwise + operations] –  O(n) global work from passes of pairwise-neighbor-reduction –  Static dependences, uniform output
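The pass-based reduction on slide 12 can be sketched on the host as follows. This is a serial illustration rather than the deck's kernel; the names `ReduceResult` and `pairwise_reduce` are hypothetical. It counts additions to show that the passes together perform n - 1 operations, which is O(n) work.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Result of a pass-based reduction: the final value plus the number of
// binary "+" operations performed across all passes.
struct ReduceResult {
    long long sum;
    std::size_t additions;
};

// Repeatedly combine neighboring pairs until one element remains.
// Each pass roughly halves the stream, so total work is n - 1 additions.
ReduceResult pairwise_reduce(std::vector<long long> data) {
    assert(!data.empty());
    std::size_t additions = 0;
    while (data.size() > 1) {
        std::vector<long long> next((data.size() + 1) / 2);
        for (std::size_t i = 0; i < next.size(); ++i) {
            // Output i depends statically on inputs 2i and 2i + 1.
            next[i] = data[2 * i];
            if (2 * i + 1 < data.size()) {
                next[i] += data[2 * i + 1];
                ++additions;
            }
        }
        data.swap(next);
    }
    return {data[0], additions};
}
```

Because the output size of every pass is known before the pass starts, each output element could be assigned to one thread by index alone, which is the "static dependences, uniform output" property the slide names.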
  • 13. –  Repeated pairwise swapping • Bubble sort is O(n^2) • Bitonic sort is O(n log^2 n) –  Need partitioning: dynamic, cooperative allocation
      –  Repeatedly check each vertex or edge • Breadth-first search becomes O(V^2) • O(V+E) is work-optimal –  Need queue: dynamic, cooperative allocation
  • 15.    –  Variable output per thread –  Need dynamic, cooperative allocation
  • 16. [Diagram: Input → Threads → ? → Output] •  Where do I put something in a list? Where do I enqueue something? –  Duplicate removal –  Search space exploration –  Sorting –  Graph traversal –  Histogram compilation –  General work queues
  • 17. • For 30,000 producers and consumers? –  Locks serialize everything
  • 18. Input:        2   1   0   3   2
      Prefix sum:   0   2   3   3   6
      –  O(n) work
      –  For allocation: use scan results as a scattering vector
      –  Popularized by Blelloch et al. in the '90s
      –  Merrill et al. Parallel Scan for Stream Architectures. Technical Report CS2009-14, University of Virginia. 2009
  • 19. [Diagram: one Thread per input element]
      Input (& allocation requirement):   2   1   0   3   2
      Result of prefix scan (sum):        0   2   3   3   6
  • 20. [Diagram: each Thread scatters its items into the Output at its scanned offset]
      Input (& allocation requirement):   2   1   0   3   2
      Result of prefix scan (sum):        0   2   3   3   6
      Output:                             0   1   2   3   4   5   6   7
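A minimal host-side sketch of scan-based allocation, using the slide's input 2 1 0 3 2. The function names are hypothetical and the loop is serial; on a GPU the scan itself would be parallel. Each producer's exclusive prefix sum is the offset where its output begins, so no locks are needed.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Exclusive prefix sum: offsets[i] = counts[0] + ... + counts[i-1].
// *total receives the sum of all counts (the output size to allocate).
std::vector<std::size_t> exclusive_scan_counts(
        const std::vector<std::size_t>& counts, std::size_t* total) {
    assert(total != nullptr);
    std::vector<std::size_t> offsets(counts.size());
    std::size_t running = 0;
    for (std::size_t i = 0; i < counts.size(); ++i) {
        offsets[i] = running;
        running += counts[i];
    }
    *total = running;
    return offsets;
}

// Each "thread" i writes counts[i] items starting at offsets[i].
// For illustration, every item is tagged with its producer's index.
std::vector<std::size_t> scatter_by_offsets(
        const std::vector<std::size_t>& counts) {
    std::size_t total = 0;
    std::vector<std::size_t> offsets = exclusive_scan_counts(counts, &total);
    std::vector<std::size_t> output(total);
    for (std::size_t i = 0; i < counts.size(); ++i)
        for (std::size_t k = 0; k < counts[i]; ++k)
            output[offsets[i] + k] = i;
    return output;
}
```

With counts 2 1 0 3 2 the offsets come out as 0 2 3 3 6 and the total as 8, matching the eight output slots 0 through 7 on slide 20.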
  • 21. [Diagram: keys partitioned into a 0s bin and a 1s bin]
      Key sequence:          1110 0011 1010 0111 1100 1000 0101 0001
      Output key sequence:   1110 1010 1100 1000 0011 0111 0101 0001
  • 22. Index:                          0    1    2    3    4    5    6    7
      Key sequence:                   1110 0011 1010 0111 1100 1000 0101 0001
      Allocation requirements, 0s:    1    0    1    0    1    1    0    0
      Allocation requirements, 1s:    0    1    0    1    0    0    1    1
      Scanned allocations, 0s:        0    1    1    2    2    3    4    4
      Scanned allocations, 1s:        0    0    1    1    2    2    2    3
      (relocation offsets)
  • 23. Index:                          0    1    2    3    4    5    6    7
      Key sequence:                   1110 0011 1010 0111 1100 1000 0101 0001
      Allocation requirements, 0s:    1    0    1    0    1    1    0    0
      Allocation requirements, 1s:    0    1    0    1    0    0    1    1
      Scanned allocations, 0s:        0    1    1    2    2    3    4    4
      Scanned allocations, 1s:        0    0    1    1    2    2    2    3
      (bin relocation offsets)
      Adjusted allocations, 0s:       0    1    1    2    2    3    4    4
      Adjusted allocations, 1s:       4    4    5    5    6    6    6    7
      (global relocation offsets)
      Destination of each key:        0    4    1    5    2    3    6    7
      Output key sequence:            1110 1010 1100 1000 0011 0111 0101 0001
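The split on slides 21 to 23 can be sketched as below. This is a serial host-side illustration under the assumption that the slide partitions on the least-significant bit, which is what its flag vectors imply; `split_on_bit` is a hypothetical name.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Stable split of keys on one bit: keys whose bit is 0 come first.
std::vector<unsigned> split_on_bit(const std::vector<unsigned>& keys,
                                   unsigned bit) {
    assert(bit < 32);
    const std::size_t n = keys.size();
    // Scanned allocations: exclusive running counts within each bin.
    std::vector<std::size_t> zero_offset(n), one_offset(n);
    std::size_t zeros = 0, ones = 0;
    for (std::size_t i = 0; i < n; ++i) {
        zero_offset[i] = zeros;
        one_offset[i] = ones;
        if ((keys[i] >> bit) & 1u) ++ones; else ++zeros;
    }
    // Adjusted allocations: the 1s bin starts after all the 0s.
    std::vector<unsigned> out(n);
    for (std::size_t i = 0; i < n; ++i) {
        const bool is_one = (keys[i] >> bit) & 1u;
        const std::size_t dst = is_one ? zeros + one_offset[i] : zero_offset[i];
        out[dst] = keys[i];
    }
    return out;
}
```

Calling `split_on_bit` with bit 0 on the slide's keys (14, 3, 10, 7, 12, 8, 5, 1 in decimal) sends them to destinations 0 4 1 5 2 3 6 7, the same row shown on slide 23. Repeating the split for each successive bit yields a least-significant-digit radix sort.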
  • 25. [Diagram, un-fused: the Host Program launches separate GPU steps (Determine allocation size → CUDPP scan → CUDPP scan → CUDPP scan → Distribute output) that communicate through Global Device Memory]
  • 26. [Diagram, un-fused vs. fused: in the fused version the steps are Determine allocation → Scan → Scan → Scan → Distribute output]
  • 27. [Diagram, fused: Determine allocation → Scan → Scan → Scan → Distribute output] 1.  Heavy SMT (over-threading) yields usable "bubbles" of free computation 2.  Propagate live data between steps in fast registers / smem 3.  Use scan (or variant) as a "runtime" for everything
  • 29. Device            Memory Bandwidth   Compute Throughput       Memory wall     Memory wall
                          (10^9 bytes/s)     (10^9 thread-cycles/s)   (bytes/cycle)   (instrs/word)
      GTX 480           169.0              672.0                    0.251           15.9
      GTX 285           159.0              354.2                    0.449           8.9
      GTX 280           141.7              311.0                    0.456           8.8
      Tesla C1060       102.0              312.0                    0.327           12.2
      9800 GTX+         70.4               235.0                    0.300           13.4
      8800 GT           57.6               168.0                    0.343           11.7
      9800 GT           57.6               168.0                    0.343           11.7
      8800 GTX          86.4               172.8                    0.500           8.0
      Quadro FX 5600    76.8               152.3                    0.504           7.9
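The two "memory wall" columns appear to follow directly from the first two columns. The sketch below assumes 4-byte (32-bit) words, which is what the table's numbers imply; the function names are hypothetical.

```cpp
#include <cassert>

// Bytes of memory traffic the device can sustain per thread-cycle.
double bytes_per_cycle(double bandwidth_gbytes, double throughput_gcycles) {
    assert(throughput_gcycles > 0.0);
    return bandwidth_gbytes / throughput_gcycles;
}

// Thread-instructions available per 32-bit word moved before a kernel
// stops being bandwidth-bound and becomes compute-bound.
double instrs_per_word(double bandwidth_gbytes, double throughput_gcycles) {
    assert(bandwidth_gbytes > 0.0);
    const double bytes_per_word = 4.0;  // assumption: 32-bit words
    return bytes_per_word * throughput_gcycles / bandwidth_gbytes;
}
```

For the GTX 480 row (169.0 and 672.0) this gives about 0.251 bytes/cycle and 15.9 instructions per word. For the GTX 285 it gives about 8.9, and doubling that for a kernel that both reads and writes each word is consistent with the 17.8 "r+w memory wall" quoted on the following chart slides.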
  • 31. [Chart: Thread-Instructions / 32-bit scan element vs. Problem Size (millions). Reference line: GTX285 r+w memory wall (17.8 instructions per input word). Region below the line: "Insert work here"]
  • 32. [Same chart, adding the series Data Movement Skeleton]
  • 33. [Same chart, adding the series Our Scan Kernel]
  • 34. [Same chart] –  Increase granularity / redundant computation • ghost cells • radix bits –  Orthogonal kernel fusion
  • 35. [Chart: Thread-Instructions / 32-bit scan element vs. Problem Size (millions). Series: CUDPP Scan Kernel, Our Scan Kernel]
  • 36. [Chart: Thread-Instructions / 32-bit scan element vs. Problem Size (millions). Reference lines: GTX285 Radix Scatter Kernel Wall, GTX285 Scan Kernel Wall. Series: Our Scan Kernel. Region: "Insert work here"] –  Partially-coalesced writes –  2x write overhead –  4 total concurrent scan operations (radix 16)
  • 37. [Chart: Thread-instructions / 32-bit word vs. Problem Size (millions). Reference lines: 480 Radix Scatter Kernel Wall, 285 Radix Scatter Kernel Wall] –  Need kernels with tunable local (or redundant) work • ghost cells • radix bits
  • 39. –  Virtual processors abstract a diversity of hardware configurations –  Leads to a host of inefficiencies –  E.g., only several hundred CTAs
  • 41. Grid A: grid-size = (N / tilesize) CTAs
      Grid B: grid-size = 150 CTAs (or other small constant)
  • 42. –  Thread-dependent predicates –  Setup and initialization code (notably for smem) –  Offset calculations (notably for smem) –  Common values are hoisted and kept live
  • 44. –  Thread-dependent predicates –  Setup and initialization code (notably for smem) –  Offset calculations (notably for smem) –  Common values are hoisted and kept live –  Spills are really bad
  • 45. [Diagram: log_tilesize(N)-level tree vs. two-level tree] –  O(N / tilesize) gmem accesses –  2-4 instructions per access (offset calcs, load, store) –  GPU is least efficient here: get it over with as quickly as possible
  • 47. [Chart: Thread-instructions / Element vs. Grid Size (# of threadblocks). Series: Compute Load. Reference line: 285 Scan Kernel Wall]
  • 48. C = number of CTAs, N = problem size, T = tile size, B = tiles per CTA –  16.1M / 150 CTAs / 1024 = 109.91 tiles per CTA –  conditional evaluation –  singleton loads
  • 50. C = number of CTAs, N = problem size, T = tile size, B = tiles per CTA –  floor(16.1M / (1024 * 150)) = 109 tiles per CTA –  16.1M % (1024 * 150) = 136.4 extra tiles
  • 51. C = number of CTAs, N = problem size, T = tile size, B = tiles per CTA –  floor(16.1M / (1024 * 150)) = 109 tiles per CTA (14 CTAs) –  109 + 1 = 110 tiles per CTA (136 CTAs) –  16.1M % (1024 * 150) = 0.4 extra tiles
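The tile distribution on slides 48 to 51 can be sketched as follows. The struct and function names are hypothetical, and the sketch assumes "16.1M" means roughly 16.1 x 2^20 = 16,882,073 elements, since that value reproduces the slides' 109.91 tiles per CTA.

```cpp
#include <cassert>
#include <cstddef>

// How a fixed grid of C CTAs divides the tiles of an N-element problem.
struct EvenShare {
    std::size_t full_tiles;      // whole tiles of T elements
    std::size_t base_tiles;      // B: tiles per CTA for the "small" CTAs
    std::size_t big_ctas;        // CTAs that process B + 1 tiles
    std::size_t small_ctas;      // CTAs that process B tiles
    std::size_t leftover_elems;  // final partial tile (fewer than T elements)
};

EvenShare even_share(std::size_t n, std::size_t tile_size, std::size_t num_ctas) {
    assert(tile_size > 0 && num_ctas > 0);
    EvenShare s;
    s.full_tiles = n / tile_size;
    s.leftover_elems = n % tile_size;
    s.base_tiles = s.full_tiles / num_ctas;
    s.big_ctas = s.full_tiles % num_ctas;
    s.small_ctas = num_ctas - s.big_ctas;
    return s;
}
```

For N = 16,882,073, T = 1024 and C = 150 this yields 136 CTAs with 110 tiles, 14 CTAs with 109 tiles, and 409 leftover elements (about 0.4 of a tile), in line with slide 51. Every CTA then runs a fixed number of full-tile iterations, so the conditional evaluation and singleton loads mentioned on slide 48 are confined to the one partial tile.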
  • 53. –  If you breathe on your code, run it through the VP •  Kernel runtimes •  Instruction counts –  Indispensable for tuning •  Host-side timing requires too many iterations •  Only 1-2 cudaprof iterations for consistent counter-based perf data –  Write tools to parse the output •  "Dummy" kernels useful for demarcation
  • 54. [Chart: Sorting Rate (10^6 keys/sec) vs. Problem size (millions). Devices: GTX 480, C2050 (no ECC), GTX 285, C2050 (ECC), GTX 280, C1060, 9800 GTX+]
  • 55. [Chart: Sorting Rate (millions of pairs/sec) vs. Problem size (millions). Devices: GTX 480, C2050 (no ECC), GTX 285, GTX 280, C2050 (ECC), C1060, 9800 GTX+]
  • 56. [Chart: Kernel Bandwidth (GiBytes / sec) vs. Problem Size (millions). Series: merrill_tree Reduce, merrill_rts Scan]
  • 57. [Chart: Kernel Bandwidth (10^9 Bytes / sec) vs. Problem Size (millions). Series: merrill_linear Reduce, merrill_linear Scan]
  • 58. –  Implement device “memcpy” for tile-processing •  Optimize for “full tiles” –  Specialize for different SM versions, input types, etc.
  • 60. –  Use templated code to generate various instances –  Run with cudaprof env vars to collect data
  • 61. [Chart, One-way: 128-Thread CTA (64B ld). Bandwidth (GiBytes/sec) vs. Words Copied (millions). Series: Single, Single (no-overlap), Double, Double (no-overlap), Quad, Quad (no-overlap)]
      [Chart, Two-way: 128-Thread CTA (128B ld/st). Bandwidth (GiBytes/sec) vs. Words Copied (millions). Series: Single, Single (no-overlap), Double, Double (no-overlap), Quad, Quad (no-overlap), Intrinsic Copy. Annotations: "us", "cudaMemcpy()"]
  • 63. [Diagram: two scan networks over inputs x0..x7. Left: Brent-Kung over memory cells m0..m7, time steps t0..t5. Right: Kogge-Stone over identity-padded cells m0..m11, time steps t0..t3] –  SIMD lanes wasted on O(n)-work Brent-Kung (left), but less work when n > warp size –  Kogge-Stone (right) is O(n log n)-work, but faster when n ≤ warp size
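To make the work versus step trade-off concrete, here is a serial host-side sketch that simulates both networks and counts their additions and parallel steps. It computes inclusive sums (the slide's left-hand network appears to be an exclusive variant), and all names are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

struct ScanResult {
    std::vector<int> values;  // inclusive prefix sums
    std::size_t steps;        // parallel time steps
    std::size_t additions;    // total "+" operations (work)
};

// Kogge-Stone: at stride d, every element i >= d adds in element i - d.
// log2(n) steps, O(n log n) additions.
ScanResult kogge_stone_scan(std::vector<int> x) {
    ScanResult r{{}, 0, 0};
    for (std::size_t d = 1; d < x.size(); d *= 2) {
        std::vector<int> prev = x;  // all lanes read before any lane writes
        for (std::size_t i = d; i < x.size(); ++i) {
            x[i] = prev[i] + prev[i - d];
            ++r.additions;
        }
        ++r.steps;
    }
    r.values = x;
    return r;
}

// Brent-Kung style: an up-sweep builds partial sums in a tree, then a
// down-sweep fills in the rest. About 2*log2(n) steps, O(n) additions.
// Assumes the length is a power of two.
ScanResult brent_kung_scan(std::vector<int> x) {
    const std::size_t n = x.size();
    assert(n > 0 && (n & (n - 1)) == 0);
    ScanResult r{{}, 0, 0};
    for (std::size_t d = 1; d < n; d *= 2) {        // up-sweep
        for (std::size_t i = 2 * d - 1; i < n; i += 2 * d) {
            x[i] += x[i - d];
            ++r.additions;
        }
        ++r.steps;
    }
    for (std::size_t d = n / 4; d >= 1; d /= 2) {   // down-sweep
        for (std::size_t i = 3 * d - 1; i < n; i += 2 * d) {
            x[i] += x[i - d];
            ++r.additions;
        }
        ++r.steps;
    }
    r.values = x;
    return r;
}
```

On eight elements the Kogge-Stone version takes 3 steps and 17 additions, while the Brent-Kung version takes 5 steps and 11 additions. When the whole scan fits in one warp the extra additions occupy lanes that would otherwise idle, so fewer steps win; for larger n the lower total work matters more.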
  • 65. [Diagram: thread schedules over threads t0 .. tT-1. Tree-based: three barriers shown between levels. Raking-based: one barrier shown, after which a few threads (t0..t3) process the data]
  • 67. [Diagram: "DMA" threads t0 .. tT-1 feeding "Worker" threads t0..t3] –  Barriers make O(n) code O(n log n) –  The rest are "DMA engine" threads –  Use threadblocks to cover pipeline latencies, e.g., for Fermi: •  2 worker warps per CTA •  6-7 CTAs
  • 69. –  Different SMs (varied local storage: registers/smem) –  Different input types (e.g., sorting chars vs. ulongs) –  # of steps for each algorithm phase is configuration-driven –  Template expansion + Constant-propagation + Static loop unrolling + Preprocessor Macros –  Compiler produces a target assembly that is well-tuned for the specifically targeted hardware and problem
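A small host-side sketch of configuration-driven specialization via templates. It is not the deck's code: `RadixConfig` and `radix_sort` are hypothetical names, and the sort is a plain serial LSD radix sort. The point is that the key type and radix width are compile-time parameters, so pass counts, bin counts and digit masks become constants the compiler can propagate and unroll.

```cpp
#include <cassert>
#include <cstddef>
#include <type_traits>
#include <vector>

// Compile-time configuration: key type and radix width are template
// parameters, so digit masks and pass counts fold to constants.
template <typename Key, unsigned RADIX_BITS>
struct RadixConfig {
    static_assert(std::is_unsigned<Key>::value, "sketch assumes unsigned keys");
    static_assert(RADIX_BITS > 0 && RADIX_BITS < 8 * sizeof(Key),
                  "radix width must be smaller than the key");
    static constexpr unsigned kKeyBits = 8 * sizeof(Key);
    static constexpr unsigned kDigits = 1u << RADIX_BITS;
    static constexpr unsigned kPasses = (kKeyBits + RADIX_BITS - 1) / RADIX_BITS;

    static constexpr unsigned digit(Key key, unsigned pass) {
        return static_cast<unsigned>(key >> (pass * RADIX_BITS)) & (kDigits - 1);
    }
};

// LSD radix sort whose pass count and bin count are fixed per instantiation.
template <typename Key, unsigned RADIX_BITS>
void radix_sort(std::vector<Key>& keys) {
    using Cfg = RadixConfig<Key, RADIX_BITS>;
    std::vector<Key> tmp(keys.size());
    for (unsigned pass = 0; pass < Cfg::kPasses; ++pass) {
        std::size_t offsets[Cfg::kDigits + 1] = {};
        for (Key k : keys) ++offsets[Cfg::digit(k, pass) + 1];      // histogram
        for (unsigned d = 0; d < Cfg::kDigits; ++d)                  // scan
            offsets[d + 1] += offsets[d];
        for (Key k : keys) tmp[offsets[Cfg::digit(k, pass)]++] = k;  // scatter
        keys.swap(tmp);
    }
}
```

Instantiating `radix_sort<unsigned, 4>` and `radix_sort<unsigned char, 2>` produces two separately tuned code paths from one source, which mirrors the slide's sorting chars vs. ulongs example.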
  • 70. [Diagram: one radix-4 digit pass over a tile of 32 keys; per-thread labels (t0 .. tT-1) omitted]
      Subsequence of Keys:
        211 122 302 232 300 021 022 013 021 123 330 023 130 220 020 301
        112 221 023 322 003 012 022 130 010 121 323 020 101 212 220 333
      Digit Decoding → Digit Flag Vectors:
        0s   0 0 0 0 1 0 0 0 0 0 1 0 1 1 1 0 0 0 0 0 0 0 0 1 1 0 0 1 0 0 1 0
        1s   1 0 0 0 0 1 0 0 1 0 0 0 0 0 0 1 0 1 0 0 0 0 0 0 0 1 0 0 1 0 0 0
        2s   0 1 1 1 0 0 1 0 0 0 0 0 0 0 0 0 1 0 0 1 0 1 1 0 0 0 0 0 0 1 0 0
        3s   0 0 0 0 0 0 0 1 0 1 0 1 0 0 0 0 0 0 1 0 1 0 0 0 0 0 1 0 0 0 0 1
      Stages: Serial Reduction (regs) → Serial Reduction (shmem) → SIMD Kogge-Stone Scan (shmem) → Serial Scan (shmem) → Serial Scan (regs)
      Totals: 0s total 9, 1s total 7, 2s total 9, 3s total 7
      Scanned flag vectors:
        0s   0 0 0 0 1 1 1 1 1 1 2 2 3 4 5 5 5 5 5 5 5 5 5 6 7 7 7 8 8 8 9 9
        1s   1 1 1 1 1 2 2 2 3 3 3 3 3 3 3 4 4 5 5 5 5 5 5 5 5 6 6 6 7 7 7 7
        2s   0 1 2 3 3 3 4 4 4 4 4 4 4 4 4 4 5 5 5 6 6 7 8 8 8 8 8 8 8 9 9 9
        3s   0 0 0 0 0 0 0 1 1 2 2 3 3 3 3 3 3 3 4 4 5 5 5 5 5 5 6 6 6 6 6 7
      Local Exchange Offsets:
        10 17 18 19 1 11 20 26 12 27 2 28 3 4 5 13 21 14 29 22 30 23 24 6 7 15 31 8 16 25 9 32
      Scatter (SIMD Kogge-Stone Scan, smem exchange)
      Exchanged Keys:
        300 330 130 220 020 130 010 020 220 211 021 021 301 221 121 101
        122 302 232 022 122 112 322 022 212 013 123 023 023 003 323 333
      Carry-in / carry-out: 0s 9 / 18, 1s 25 / 32, 2s 33 / 42, 3s 49 / 56
      Global Scatter Offsets:
        10 11 12 13 14 15 16 17 18 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 50 51 52 53 54 55 56
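The local ranking step on slide 70 can be sketched serially as below. This is an illustration, not the deck's kernel: the four running counters stand in for the four scanned flag vectors, offsets are 0-based (the slide's Local Exchange Offsets are 1-based), and the names are hypothetical.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <vector>

// Local ranking of a tile of keys by one radix-4 digit (values 0..3).
struct TileRanks {
    std::array<std::size_t, 4> totals;  // keys per digit value
    std::vector<std::size_t> offsets;   // 0-based destination of each key
};

TileRanks rank_tile(const std::vector<unsigned>& digits) {
    TileRanks r{{{0, 0, 0, 0}}, std::vector<std::size_t>(digits.size())};
    // Per-digit running counts play the role of the scanned flag vectors.
    for (std::size_t i = 0; i < digits.size(); ++i) {
        assert(digits[i] < 4);
        r.offsets[i] = r.totals[digits[i]]++;  // rank within its own bin
    }
    // Exclusive scan over the totals gives each bin's base offset.
    std::array<std::size_t, 4> base{{0, 0, 0, 0}};
    for (std::size_t d = 1; d < 4; ++d) base[d] = base[d - 1] + r.totals[d - 1];
    for (std::size_t i = 0; i < digits.size(); ++i) r.offsets[i] += base[digits[i]];
    return r;
}
```

Feeding in the last digit of each of the slide's 32 keys gives totals 9, 7, 9, 7 and, after adding 1 to each offset, the same Local Exchange Offsets row that begins 10 17 18 19 1 11.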
  • 71. •  Resource-allocation as runtime 1.  Kernel memory-wall analysis (kernel fusion) 2.  Algorithm serialization 3.  Tune for data-movement 4.  Warp-synchronous programming 5.  Flexible granularity via meta-programming
  • 72. –  Back40Computing (a Google Code Project) •  http://code.google.com/p/back40computing/ –  Default sorting method for Thrust •  http://code.google.com/p/thrust/
  • 74. –  A single host-side procedure call launches a kernel that performs orthogonal program steps MyUberKernel<<<grid_size, num_threads>>>(d_device_storage); –  No existing public repositories of kernel “subroutines” for scavenging
  • 75. –  Callbacks, iterators, visitors, functors, etc. •  ReduceKernel<<<grid_size, num_threads>>>(CountingIterator(100)); –  E.g., the fused radix sorting kernel (steps listed below) can't be composed using a callback-based pattern • Digit extraction • Local prefix scan • Scatter accordingly
      Fused radix sorting kernel: GATHER (key), GATHER (value) → Extract radix digit → Encode flag bit (into flag vectors) → LOCAL MULTI-SCAN (flag vectors) → Decode local rank (from flag vectors) → EXCHANGE (key), EXCHANGE (value) → Extract radix digit (again) → Update global radix digit partition offsets → SCATTER (key), SCATTER (value)
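The iterator-style composition mentioned on slide 75 can be illustrated on the host as follows. `CountingIterator` here is a hypothetical stand-in written for this sketch, not the type used in the deck, and `reduce_sequence` is a serial loop rather than a kernel.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical stand-in for the slide's CountingIterator: indexing it
// yields start, start + 1, start + 2, ... without touching memory.
struct CountingIterator {
    long long start;
    explicit CountingIterator(long long s) : start(s) {}
    long long operator[](std::size_t i) const {
        return start + static_cast<long long>(i);
    }
};

// A reduction skeleton parameterized on its input source. Anything that
// supports operator[] (raw pointer, counting iterator, ...) can be plugged in.
template <typename InputIterator>
long long reduce_sequence(InputIterator in, std::size_t n) {
    long long sum = 0;
    for (std::size_t i = 0; i < n; ++i) sum += in[i];
    return sum;
}
```

For example, `reduce_sequence(CountingIterator(100), 5)` sums 100 through 104. This style lets a caller swap the input source of a single skeleton, but it only customizes one point of one primitive. The fused radix sorting kernel interleaves digit extraction, a local multi-scan and scatters, which is why the slide argues it cannot be assembled from callbacks alone.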
  • 76. –  Compiled libraries suffer from code bloat • CUDPP primitives library is 100s of MBs, yet still doesn't support all built-in numeric types. • Specializing for device configurations makes it even worse –  The alternative is to ship source for #include'ing • Have to be willing to share source –  Need a way to fit meta-programming in at the JIT / bytecode level to help avoid expansion / mismatch-by-omission –  Can leverage fundamentally different algorithms for different phases • How to teach the compiler to do this?