High-Performance Computing Needs
Machine Learning... And Vice Versa
(was “GPU Metaprogramming: A Case Study in Large-Scale Convolutional Neural Networks”)




Nicolas Pinto
NIPS “Big Learning” | December 16th, 2011




                                                                      The Rowland Institute at Harvard
                                                                      HARVARD UNIVERSITY
Outline
1. HPC-aware ML
2. GPU Meta-programming
3. ML-aware HPC
Motivation...
The Problem:
Visual Object Recognition
Why?
it seems easy, right?
44 years ago...
The Problem:
Visual Object Recognition

                fast
                accurate
                effortless
                critical to survival

                tolerant to variations!
hard?

// the world is 3D but the retina is 2D
// the curse of dimensionality

// considerable image variation
~50% of the cortex is for vision!
you may have learned it...
Background
The Approach
Reverse and Forward Engineering the Brain

     REVERSE                 FORWARD
       Study                       Build
    Natural System            Artificial System
Reverse Engineering
The Ventral Visual Stream
(images by DiCarlo JJ & Cox DD; animation by Li N)
Reverse Engineering
The Ventral Visual Stream

brain = 20 petaflops?!
The Approach
Reverse and Forward Engineering the Brain
Forward Engineering
The Ventral Visual Stream

all about learning???
[Diagram: a two-layer (L1, L2) slice of the model family; each layer exposes kernel size, number of filters, threshold/saturation, normalization strength and neighborhood, and learning parameters (Rate, Trace, "Temp. Adv.", "Auto-reset", ...).]
How are things done normally?

  Usual Formula:

  1) One grad student
  2) One Model (size limited by runtime)
  3) Performance numbers on a few standard test sets
  4) yay. we. rock.
  5) One Ph.D.
What do you call this?

  “This is graduate student descent”
  - David McAllester
What’s better than this?




“Conjugate graduate student descent?”
- Nicolas Poilvert
Doing things a little bit differently

  1) One grad student
  2) One → Hundreds of Thousands of BIG Models
  3) Performance numbers on a few standard test sets
  4) yay. we. rock.
  5) Hundreds of Thousands → One PhD?
“If you want to have good ideas you must have many ideas.”
“Most of them will be wrong, and what you have to learn is which ones to throw away.”

                    Linus Pauling
                    (double Nobel Prize Winner)
High-throughput Screening
[Diagram: a large family of brain-inspired models; a read-out stage on top of three layers (L1, L2, L3), each exposing kernel size, number of filters, threshold/saturation, normalization strength and neighborhood, and learning parameters (Rate, Trace, "Temp. Adv.", "Auto-reset", ...). Very inclusive!]

52 parameters
more than 10^25 possible unique combinations!

Pinto, Doukhan, DiCarlo, Cox PLoS 2009
The curse of speed

  thousands of big models

  large amounts of unsupervised learning experience
The curse of speed
...and the blessing of massively parallel computing

  No off-the-shelf solution? DIY!
  Engineering (Hardware/SysAdmin/Software)   Science

  Leverage non-scientific high-tech markets and their $billions of R&D...
  Gaming: Graphics Cards (GPUs), PlayStation 3
  Web 2.0: Cloud Computing (Amazon, Google)
Build your own!
The blessing of GPUs
Computational power: DIY GPU pr0n (since 2006), Sony PlayStation 3s (since 2007)

[Chart: peak GFLOP/s over time, GPUs vs. CPUs; GPUs pull far ahead.]
speed
(in billion floating point operations per second)

    Q9450 (Matlab/C) [2008]:          0.3
    Q9450 (C/SSE) [2008]:             9.0
    7900GTX (OpenGL/Cg) [2006]:      68.2
    PS3/Cell (C/ASM) [2007]:        111.4
    8800GTX (CUDA1.x) [2007]:       192.7
    GTX280 (CUDA2.x) [2008]:        339.3
    GTX480 (CUDA3.x, Fermi) [2010]: 974.3

Pinto, Doukhan, DiCarlo, Cox PLoS 2009
Pinto, Cox GPU Comp. Gems 2011
>1000X speedup is game-changing...
High-throughput Screening
Skimming off the best models

[Histogram: count of models (N=2500) vs. performance (%), from 50 to 100%; chance and a "stupid baseline" marked at the low end.]

Pinto, Doukhan, DiCarlo, Cox PLoS 2009
High-throughput Screening
Validate on other tasks

[Bar chart: V1-like baseline and state-of-the-art models from the literature (including "HMAX 2.1", ~80%) vs. the top 5 high-throughput models (best ~90%).]

Pinto, Doukhan, DiCarlo, Cox PLoS 2009
High-throughput Screening
Validate on faces

[Bar chart: V1-like baseline and state-of-the-art methods from the literature (SIFT, GB, PHOG, PHOW, HMAX 2.1) vs. the top 5 high-throughput models and their blend.]

Pinto, Doukhan, DiCarlo, Cox PLoS 2009
Human vs. Machine
8-way object categorization

[Bar chart, % correct: chance 12.5%; baseline 31.3%; best model 64%; best human 99.1%.]
What does it all mean?
what have we learned? (briefly...)

[Pipeline: Grayscale Input → Normalize → L1 → L2 → L3 → Linear SVM (simple classifier); within each layer: Filter (Φ1 ... Φk) → Threshold & Saturate → Pool → Normalize]

➡ dimensionality: more filters is better
➡ learning is difficult
➡ non-linearities are important
➡ normalization is very important
    missed in previous modeling efforts
    now confirmed by LeCun et al., Poggio et al., Ng et al.
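
For intuition, here is a minimal NumPy sketch of one such stage (illustrative only: the filter shapes, clipping range, and pooling/normalization details are assumptions, not the paper's exact operations):

import numpy as np

def layer(x, filters, thresh=0.0, pool=2, eps=1e-6):
    # x: (H, W) grayscale array; filters: (k, fh, fw) filter bank
    k, fh, fw = filters.shape
    H, W = x.shape
    out = np.empty((k, H - fh + 1, W - fw + 1), dtype=np.float32)
    # -- filter: "valid" cross-correlation with each kernel
    for i in range(k):
        for r in range(out.shape[1]):
            for c in range(out.shape[2]):
                out[i, r, c] = np.sum(x[r:r + fh, c:c + fw] * filters[i])
    # -- threshold & saturate
    out = np.clip(out, thresh, 1.0)
    # -- pool over non-overlapping pool x pool neighborhoods
    h, w = out.shape[1] // pool, out.shape[2] // pool
    out = out[:, :h * pool, :w * pool]
    out = out.reshape(k, h, pool, w, pool).sum(axis=(2, 4))
    # -- divisive normalization across the filter bank
    norm = np.sqrt((out ** 2).sum(axis=0, keepdims=True)) + eps
    return out / norm

rng = np.random.default_rng(0)
features = layer(rng.random((64, 64)), rng.standard_normal((4, 5, 5)))
print(features.shape)   # (4, 30, 30)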
What are these models not good for?

low level · objects · backgrounds · faces
Outline
1. HPC-aware ML
2. GPU Meta-programming
3. ML-aware HPC
one more thing
Real-world apps?
testing the generality and scalability of the approach
Facebook
Really Real World Problem

                                  enormous scale:
                                     billions of photos
                                     3TB+ uploaded every day
                                     dense, collaborative face labels

collab. with Zak Stone & Todd Zickler @ Harvard
Relevance to Social Networking

                         slide courtesy of David Cox
High-throughput Screening
High-Throughput Screening
Labeled Faces in the Wild (LFW) View 1
> 30,000 large-scale models (1 to 3 layers) screened in only 3 days

[Chart: HT L3s (3 layers); LFW View 1 performance of the top 5 models.]

No Unsupervised Learning!

Pinto, Cox (FG 2011); Pinto, Stone, Zickler, Cox (CVPR 2011)
Generalization
Performance on LFW View 2 (hold out)

Face Verification Performance (% correct):
    V1-like:                             79.4
    Wolf et al. (ACCV 2009):             85.3
    Kumar et al. / face.com (ICCV 2009): 86.8
    Ours (HT):                           88.1

Pinto, Cox (FG 2011)
“Facebook100”
typical social network size?




collab. with Zak Stone & Todd Zickler @ Harvard
                                    Pinto, Stone, Zickler, Cox (CVPR 2011)
Auto-tagging
a network of 100 Facebook friends

                             > 86% accurate
                             (w/ 90 training examples)

collab. with Zak Stone & Todd Zickler @ Harvard
Pinto, Stone, Zickler, Cox (CVPR 2011)
vs face.com
comparison with a heavily-specialized commercial system

[Plot: performance (% correct) vs. number of training examples per friend, for three systems: L3 (hardware-accelerated brute-force random model), face.com (best technology around), and V1-like (one layer).]

Pinto, Stone, Zickler, Cox (CVPR 2011)
Conclusion?
Hardware Matters !


       Yann LeCun’s Mac




              picture courtesy of Koray Kavukcuoglu
Outline
1. HPC-aware ML
2. GPU Meta-programming
3. ML-aware HPC
Two conflicting requirements

   The brain is a massively parallel computer
➡ Big models are paralyzingly slow to run (they need to be FAST)

   Neural data only provides weak constraints
➡ Lots of parameters – hard to explore (the code needs to be FLEXIBLE)

  How to optimize?
What’s the bottleneck?
3D Filterbank Convolutions!
Our answer?
Meta-programming!

What?
Meta-programming !

 Leave the grunt-programming to the computer (i.e. auto-tuning like ATLAS or FFTW):
 •   Dynamically compile specialized versions of the same kernel for different conditions
 •   Empirical run-time tuning
 •   For free: smooth syntactic ugliness: unroll loops, index un-indexable registers, etc.
Meta-programming !

“Instrument” your solutions:
•   Block size
•   Work size
•   Loop unrolling
•   Pre-fetching
•   Spilling
•   etc.

                     ... and let the computer generate → find the optimal code (a sketch follows)
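
To make the idea concrete, a minimal sketch in Python (string.Template stands in for the deck's Cheetah templates; the kernel body, parameter names, and candidate values are illustrative assumptions):

import string

# Cheetah-style placeholders approximated with string.Template:
# ${BLOCK_W} and ${UNROLL} are baked into the generated CUDA source.
KERNEL_TEMPLATE = string.Template("""
extern "C" __global__
void convolve_${BLOCK_W}_${UNROLL}(const float *in, float *out)
{
    #pragma unroll ${UNROLL}
    for (int i = 0; i < ${BLOCK_W}; ++i) {
        /* ... body specialized for this configuration ... */
    }
}
""")

def render(block_w, unroll):
    # generate one specialized kernel source string
    return KERNEL_TEMPLATE.substitute(BLOCK_W=block_w, UNROLL=unroll)

# Enumerate candidate configurations; in the real system each variant is
# compiled and timed empirically, and the fastest one is kept.
for block_w in (64, 128, 256):
    for unroll in (1, 2, 4):
        print(render(block_w, unroll))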
How?
Always use the right tool !
Templating

texture<float4, 1, cudaReadModeElementType> tex_float4;
__constant__ float constant[$FILTER_D][$FILTER_W][$N_FILTERS];

#define IMUL(a, b) __mul24(a, b)

extern "C" {

#for j in xrange($FILTER_H)

  __global__ void convolve_beta_j${j}(float4 *input, float4 *output)
  {

#set INPUT_BLOCK_W = $BLOCK_W+$FILTER_W-1
    __shared__ float shared_in[$INPUT_BLOCK_W][4+1];

    // -- input/output offsets
    const uint in_idx = (blockIdx.y+$j)*INPUT_W + blockIdx.x*blockDim.x + threadIdx.x;
    const uint out_idx = blockIdx.y*OUTPUT_W + blockIdx.x*blockDim.x + threadIdx.x;
    float4 input_v4;

    // -- load input to shared memory
#for i in xrange($LOAD_ITERATIONS)
#if $i==($LOAD_ITERATIONS-1)
    if((threadIdx.x+$BLOCK_W*$i)<$INPUT_BLOCK_W)
#end if
      {
        input_v4 = tex1Dfetch(tex_float4, in_idx+$BLOCK_W*$i);
        shared_in[threadIdx.x+$BLOCK_W*$i][0] = input_v4.x;
        shared_in[threadIdx.x+$BLOCK_W*$i][1] = input_v4.y;
        shared_in[threadIdx.x+$BLOCK_W*$i][2] = input_v4.z;
Compilation?
(with Python-based solutions)

PyCUDA/PyOpenCL (by Andreas Klöckner)

Klöckner, Pinto, Lee, Catanzaro, Ivanov, Fasih (ParCo 2011)
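
For flavor, a hedged sketch of run-time compilation and timing with PyCUDA (the kernel and sizes here are illustrative; the PyCUDA calls are the library's standard ones):

import numpy as np
import pycuda.autoinit              # initializes a CUDA context
import pycuda.driver as drv
from pycuda.compiler import SourceModule

# compile CUDA source at run time (this is where a generated,
# specialized kernel string would be dropped in)
mod = SourceModule("""
__global__ void scale(float *x, float a, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] *= a;
}
""")
scale = mod.get_function("scale")

n = 1 << 20
x_gpu = drv.to_device(np.ones(n, dtype=np.float32))

# empirical timing with CUDA events
start, end = drv.Event(), drv.Event()
start.record()
scale(x_gpu, np.float32(2.0), np.int32(n),
      block=(256, 1, 1), grid=((n + 255) // 256, 1))
end.record()
end.synchronize()
print("elapsed: %.3f ms" % start.time_till(end))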
Basic GPU Meta-programming System

GPU Meta-Programming: A Case Study in Biologically-Inspired Machine Vision
[GPU Computing Gems]
Pinto N, Cox DD
conv_kernel_template.cu (the template):

texture<float4, 1, cudaReadModeElementType> tex_float4;
__constant__ float constant[$FILTER_D][$FILTER_W][$N_FILTERS];

#define IMUL(a, b) __mul24(a, b)
extern "C" {

#for j in xrange($FILTER_H)

  __global__ void convolve_beta_j${j}(float4 *input, float4 *output)
  {

#set INPUT_BLOCK_W = $BLOCK_W+$FILTER_W-1
    __shared__ float shared_in[$INPUT_BLOCK_W][4+1];

    // -- input/output offsets
    const uint in_idx = (blockIdx.y+$j)*INPUT_W + blockIdx.x*blockDim.x + threadIdx.x;
    const uint out_idx = blockIdx.y*OUTPUT_W + blockIdx.x*blockDim.x + threadIdx.x;
    float4 input_v4;

    // -- load input to shared memory
#for i in xrange($LOAD_ITERATIONS)
#if $i==($LOAD_ITERATIONS-1)
    if((threadIdx.x+$BLOCK_W*$i)<$INPUT_BLOCK_W)
#end if
      {
        input_v4 = tex1Dfetch(tex_float4, in_idx+$BLOCK_W*$i);
        shared_in[threadIdx.x+$BLOCK_W*$i][0] = input_v4.x;
        shared_in[threadIdx.x+$BLOCK_W*$i][1] = input_v4.y;
        shared_in[threadIdx.x+$BLOCK_W*$i][2] = input_v4.z;
        shared_in[threadIdx.x+$BLOCK_W*$i][3] = input_v4.w;
      }
#end for

conv_kernel_4x4x4.cu (generated from the template; 20 kB of source — the 8x8x4 variant, conv_kernel_8x8x4.cu, is 64 kB):

#include <stdio.h>

texture<float4, 1, cudaReadModeElementType> tex_float4;
__constant__ float constant[4][4][4];

#define IMUL(a, b) __mul24(a, b)
extern "C" {

  __global__ void convolve_beta_j0(float4 *input, float4 *output)
  {
    __shared__ float shared_in[131][4+1];

    // -- input/output offsets
    const uint in_idx = (blockIdx.y+0)*INPUT_W + blockIdx.x*blockDim.x + threadIdx.x;
    const uint out_idx = blockIdx.y*OUTPUT_W + blockIdx.x*blockDim.x + threadIdx.x;
    float4 input_v4;

    // -- load input to shared memory
      {
        input_v4 = tex1Dfetch(tex_float4, in_idx+128*0);
        shared_in[threadIdx.x+128*0][0] = input_v4.x;
        shared_in[threadIdx.x+128*0][1] = input_v4.y;
        shared_in[threadIdx.x+128*0][2] = input_v4.z;
        shared_in[threadIdx.x+128*0][3] = input_v4.w;
      }
    if((threadIdx.x+128*1)<131)
      {
        input_v4 = tex1Dfetch(tex_float4, in_idx+128*1);
        shared_in[threadIdx.x+128*1][0] = input_v4.x;
        shared_in[threadIdx.x+128*1][1] = input_v4.y;
        shared_in[threadIdx.x+128*1][2] = input_v4.z;
        shared_in[threadIdx.x+128*1][3] = input_v4.w;
      }
    __syncthreads();

    // -- compute dot products
    float v, w;
    float sum0 = 0;
    float sum1 = 0;
    float sum2 = 0;
    float sum3 = 0;

    v = shared_in[threadIdx.x+0][0];
    w = constant[0][0][0];
    sum0 += v*w;
    w = constant[0][0][1];
    sum1 += v*w;
    w = constant[0][0][2];
    sum2 += v*w;
    w = constant[0][0][3];
    sum3 += v*w;
    v = shared_in[threadIdx.x+1][0];
    w = constant[0][1][0];
    sum0 += v*w;
    w = constant[0][1][1];
    sum1 += v*w;
    w = constant[0][1][2];
    sum2 += v*w;
    w = constant[0][1][3];
    sum3 += v*w;
    v = shared_in[threadIdx.x+2][0];
    w = constant[0][2][0];
    sum0 += v*w;
    w = constant[0][2][1];
    sum1 += v*w;
    ...
Benefits?

Smooth syntactic ugliness

  Manipulations that are not easily accessible in CUDA C code:
  • fine-controlled loop unrolling / jamming (see the generated code below)
  v = shared_in[threadIdx.x+0][0];
  w = constant[0][0][0];
  sum0 += v*w;
  w = constant[0][0][1];
  sum1 += v*w;
  w = constant[0][0][2];
  sum2 += v*w;
  w = constant[0][0][3];
  sum3 += v*w;
  v = shared_in[threadIdx.x+1][0];
  w = constant[0][1][0];
  sum0 += v*w;
  w = constant[0][1][1];
  sum1 += v*w;
  w = constant[0][1][2];
  sum2 += v*w;
  w = constant[0][1][3];
  sum3 += v*w;
  v = shared_in[threadIdx.x+2][0];
  w = constant[0][2][0];
  sum0 += v*w;
  w = constant[0][2][1];
  sum1 += v*w;
  w = constant[0][2][2];
  sum2 += v*w;
  w = constant[0][2][3];
  sum3 += v*w;
  v = shared_in[threadIdx.x+3][0];
  w = constant[0][3][0];
  sum0 += v*w;
  w = constant[0][3][1];
  sum1 += v*w;
  w = constant[0][3][2];
  sum2 += v*w;
  w = constant[0][3][3];
  sum3 += v*w;
  v = shared_in[threadIdx.x+0][1];
  w = constant[1][0][0];
  sum0 += v*w;
  w = constant[1][0][1];
  sum1 += v*w;
  w = constant[1][0][2];
  sum2 += v*w;
  w = constant[1][0][3];
  sum3 += v*w;
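
A sketch of how such unrolled/jammed code can be emitted (an illustrative Python generator, not the deck's actual one; the loop bounds match the 4x4x4 excerpt above):

# emit the unrolled dot-product updates for one depth slice d
FILTER_W, N_FILTERS, d = 4, 4, 0

lines = []
for i in range(FILTER_W):
    lines.append("v = shared_in[threadIdx.x+%d][%d];" % (i, d))
    for f in range(N_FILTERS):
        lines.append("w = constant[%d][%d][%d];" % (d, i, f))
        lines.append("sum%d += v*w;" % f)
print("\n".join(lines))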
How about #pragma unroll ?
   (why don’t you trust the compiler?)
we are not alone...

“Don’t trust compilers”: Using GPUs for Signal Correlation, code fragments from The Murchison Widefield Array (Michael Clark, Paul La Plante, Lincoln Greenhill, Daniel A. Mitchell; IICS 2011):

• Compare these “identical” code fragments:

  a += b*c + d*c + e*f + g*h;

  a += b*c;
  a += d*c;
  a += e*f;
  a += g*h;

(on the slide, the two forms measure at wildly different throughput: 770 GFLOPS vs. tens of GFLOPS)
Smooth syntactic ugliness

  Manipulations that are not easily accessible in CUDA C code:
  • variable-length argument lists
  • index un-indexable resources (e.g. registers); a sketch follows
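
A minimal illustration (hypothetical generator, not the deck's code) of both tricks: the argument-list length and the number of named accumulators are decided at generation time, since CUDA C offers no true register arrays:

def make_kernel(n_outputs):
    # variable-length argument list, fixed at generation time
    args = ", ".join("float4 *out%d" % i for i in range(n_outputs))
    # one named accumulator per output: registers "indexed" by name
    decls = "\n    ".join("float sum%d = 0.0f;" % i for i in range(n_outputs))
    return """__global__ void convolve(%s)
{
    %s
    /* ... unrolled body updating sum0..sum%d ... */
}""" % (args, decls, n_outputs - 1)

print(make_kernel(4))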
Explore design decision space more freely
... too many optimizations?

bank conflicts, precision, coalescing, caching, partition camping, loop unrolling, clamping, broadcasting, streams, zero-copy, ...

can't decide? keep them all !
Exploring design decision space more freely

  Meta-programming:


  • enables efficient learning of the GPU
    hardware/software


  • allows full exploitation of the GPU
    architecture
version A vs. version B
(the same conv_kernel_beta_template.cu, two generated variants, disassembled)

version A:

...
mad.rn.f32 $r4, s[$ofs3+0x0000], $r4, $r1
mov.b32 $r1, c0[$ofs2+0x0008]
mad.rn.f32 $r4, s[$ofs3+0x0008], $r1, $r4
mov.b32 $r1, c0[$ofs2+0x000c]
mad.rn.f32 $r4, s[$ofs3+0x000c], $r1, $r4
mov.b32 $r1, c0[$ofs2+0x0010]
mad.rn.f32 $r4, s[$ofs3+0x0010], $r1, $r4
...

version B:

...
mad.rn.f32 $r1, s[$ofs1+0x007c], c0[$ofs1+0x0078], $r1
mad.rn.f32 $r1, s[$ofs2+0x0000], c0[$ofs2+0x007c], $r1
mad.rn.f32 $r1, s[$ofs2+0x0008], c0[$ofs2+0x0080], $r1
mad.rn.f32 $r1, s[$ofs2+0x000c], c0[$ofs2+0x0084], $r1
mad.rn.f32 $r1, s[$ofs2+0x0010], c0[$ofs2+0x0088], $r1
...

2x faster... Why?
(note that version B folds each constant-memory operand directly into its mad, while version A spends a separate mov.b32 per multiply-add)
Results

speed (in billion floating point operations per second)

       Q9450 (Matlab/C) [2008]       0.3
          Q9450 (C/SSE) [2008]       9.0
    7900GTX (OpenGL/Cg) [2006]      68.2
      PS3/Cell (C/ASM)  [2007]     111.4
     8800GTX (CUDA1.x)  [2007]     192.7
      GTX280 (CUDA2.x)  [2008]     339.3
      GTX480 (CUDA3.x)  [2010]     974.3
      (Fermi)

>1000X speedup is game-changing...

Pinto, Doukhan, DiCarlo, Cox PLoS 2009
Pinto, Cox GPU Comp. Gems 2011
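
For reference, these throughput numbers are just arithmetic: (floating-point operations per output) x (number of outputs) / runtime. A tiny worked example with made-up dimensions and timing (not the benchmark's actual configuration):

    # Hedged example: GFLOP/s of a dense 3D filterbank convolution.
    h, w, d = 1024, 1024, 8            # input height, width, depth (illustrative)
    n_f, f_h, f_w = 16, 5, 5           # filter count and spatial size (illustrative)
    out_h, out_w = h - f_h + 1, w - f_w + 1

    flops = 2.0 * out_h * out_w * n_f * (f_h * f_w * d)   # one mul + one add per tap
    runtime_s = 0.0092                 # measured kernel time (illustrative)
    print("%.1f GFLOP/s" % (flops / runtime_s / 1e9))     # ~723.8 GFLOP/s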
Analysis

➡ Different hardware ?

[Fragment of Table 33.1, GFLOP/s on two input configurations (reference vs. auto-tuned):
  1024x1024x8 input, 16x5x5x8 filters: 726.412 ± 0.398 vs. 744.973 ± 0.571
  2048x2048x4 input,  4x8x8x4 filters: 474.681 ± 0.160 vs. 887.974 ± 1.017]

Table 33.2 Performance of Auto-Tuned Implementations on Two Hardware Platforms, Including Performance Tuned on One Platform and Run on the Other

                    Optimized for:
  Run on:        9400M       GTX480      Tuning Speedup
  9400M          0.32s       2.52s       675%
  GTX480         0.016s      0.011s      52%

Significant performance gains are observed for the auto-tuned meta-kernels as compared to the default, which was hand-picked to allow correct execution of all input ranges without running up against hardware limitations.
Analysis

➡ Different input configurations

Table 33.3 Performance of Auto-Tuned Implementations on Two Input Configurations, Including Performance Tuned for One Configuration and Run with the Other

                    Optimized for:
  Run on:        Config1     Config2     Tuning Speedup
  config1        11.1ms      15.7ms      41%
  config2        fails       10.8ms      not comparable

In Table 33.3 we show the effect of tuning on one input configuration and running with the other. Again, significant speedups are obtained using kernels tailored to a specific input configuration.
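
The "fails" entry is the interesting one: a configuration tuned for one input can exceed hard device limits on another. A small, hypothetical sketch of the kind of guard a tuner needs before launching a candidate (the PyCUDA attribute names are real; the check itself is illustrative):

    # Hedged sketch: reject candidate configs that hit hard device limits.
    import pycuda.autoinit
    import pycuda.driver as cuda

    dev = pycuda.autoinit.device
    max_threads = dev.get_attribute(cuda.device_attribute.MAX_THREADS_PER_BLOCK)
    max_smem = dev.get_attribute(cuda.device_attribute.MAX_SHARED_MEMORY_PER_BLOCK)

    def is_launchable(block_w, smem_bytes):
        # real tuners also check register pressure, grid limits, etc.
        return block_w <= max_threads and smem_bytes <= max_smem

    print(is_launchable(block_w=1024, smem_bytes=64 * 1024))  # False on most GPUs of this era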
Summary

 Meta-programming:

 • can assist exploration and manual
   optimization
 • can de-clutter highly-optimized code
 • is easy and flexible with the right tools
   (e.g. Python, PyCUDA/CL, Cheetah, decuda)


 ➡ helps get drastic speed-ups !
 ➡ facilitates “auto-tuning” !
Outline
1. HPC-aware ML
2. GPU Meta-programming
3. ML-aware HPC
Intelligent and fast Auto-Tuning
with Machine Learning

with James Bergstra and David Cox
Auto-tuning: two approaches

• Analytical model-based optimization:
  - pros: very generic (dominant in compilers), fast “inference”
  - cons: hard to build, domain expertise required, auto-tuned code far from peak

• Empirical optimization:
  - pros: auto-tuned code close to peak (dominant in specialized libraries, e.g. ATLAS, FFTW), easier to build
  - cons: very slow “inference” (for new inputs, etc.)
Empirical Auto-Tuning

The goal is to empirically optimize execution
time given both


• the environment
 - hardware (GPU, CPU, Memory, Mobo, etc.)
 - software (SDK, Compiler suite, etc.)


• the data (input dimensions, repetitions, etc.)
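
Putting the pieces together, here is a minimal, self-contained sketch of the empirical loop (the toy kernel, parameter grid, and data size are illustrative stand-ins, not the chapter's filterbank kernel):

    # Hedged sketch: empirical auto-tuning = generate, compile, time, keep the best.
    import numpy as np
    import pycuda.autoinit
    import pycuda.driver as cuda
    from pycuda.compiler import SourceModule
    from string import Template

    src = Template("""
    __global__ void scale(float *x, int n)
    {
        const int i = blockIdx.x * ${BLOCK_W} + threadIdx.x;
        for (int k = 0; k < ${UNROLL}; ++k) {
            const int j = i * ${UNROLL} + k;
            if (j < n) x[j] *= 2.0f;
        }
    }
    """)

    x = np.random.randn(1 << 20).astype(np.float32)
    x_gpu = cuda.mem_alloc(x.nbytes)
    cuda.memcpy_htod(x_gpu, x)

    best = None
    for block_w in (64, 128, 256):
        for unroll in (1, 2, 4):
            mod = SourceModule(src.substitute(BLOCK_W=block_w, UNROLL=unroll))
            f = mod.get_function("scale")
            n_blocks = (x.size + block_w * unroll - 1) // (block_w * unroll)
            start, stop = cuda.Event(), cuda.Event()
            start.record()   # in practice: warm up and average several repetitions
            f(x_gpu, np.int32(x.size), block=(block_w, 1, 1), grid=(n_blocks, 1))
            stop.record()
            stop.synchronize()
            t_ms = stop.time_since(start)
            if best is None or t_ms < best[0]:
                best = (t_ms, block_w, unroll)

    print("best: BLOCK_W=%d UNROLL=%d (%.3f ms)" % (best[1], best[2], best[0]))

The slow part is exactly this loop: every new (environment, data) pair means re-timing real kernel launches, which is why empirical “inference” is expensive.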
Empirical Auto-Tuning with Meta-programming

“GPU Meta-Programming: A Case Study in Biologically-Inspired Machine Vision”
[GPU Computing Gems]
Pinto N, Cox DD
Intelligent and fast Auto-Tuning
with Machine Learning
Auto-tuning: best of both approaches ?

• Empirically-learned model-based optimization:
  - pros: auto-tuned code close to peak*, easier to build (?), fast “inference” (for new inputs, hardware, etc.)
  - cons: unexplored !

* could be dominant in specialized libraries
(e.g. machine learning!)
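
Under stated assumptions, a toy sketch of the idea (scikit-learn's GradientBoostingRegressor stands in for the boosted regression trees of the paper on the next slide; the configurations and “measured” runtimes below are synthetic):

    # Hedged sketch: learn a timing model from a few measurements, then search
    # the model instead of the GPU ("fast inference").
    import random
    from sklearn.ensemble import GradientBoostingRegressor

    space = [(bw, u) for bw in (32, 64, 128, 256) for u in (1, 2, 4, 8)]
    measured = {c: 1.0 / (c[0] * c[1]) + 0.002 * c[1] for c in space}  # synthetic

    train = random.sample(space, 8)   # only a few configs are actually timed
    model = GradientBoostingRegressor().fit(train, [measured[c] for c in train])

    def neighbors(bw, u):
        cand = [(bw // 2, u), (bw * 2, u), (bw, u // 2), (bw, u * 2)]
        return [c for c in cand if c in set(space)]

    cur = random.choice(space)        # hill-climb on *predicted* runtimes
    while True:
        nxt = min(neighbors(*cur), key=lambda c: model.predict([c])[0])
        if model.predict([nxt])[0] >= model.predict([cur])[0]:
            break
        cur = nxt
    print("predicted-best configuration:", cur)

No kernel is launched during the search, so “inference” for a new input or a new GPU costs a fraction of a second once the model is trained.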
Fast Machine Learning-based Runtime Auto-Tuning
(“ML-based”)

“Machine Learning for Predictive Auto-Tuning with Boosted Regression Trees”
James Bergstra, Nicolas Pinto, David Cox [submitted]

ABSTRACT
The rapidly evolving landscape of multicore architectures makes the construction of efficient libraries a daunting task. A family of methods known collectively as “auto-tuning” has emerged to address this challenge. Two major approaches to auto-tuning are empirical and model-based: empirical auto-tuning is a generic but slow approach that works by measuring runtimes of candidate implementations; model-based auto-tuning predicts those runtimes using simplified abstractions designed by hand. We show that machine learning methods for non-linear regression can be used to estimate timing models from data, capturing the best of both approaches. A statistically-derived model offers the speed of a model-based approach, with the generality and simplicity of empirical auto-tuning. We validate our approach using the filterbank correlation kernel described in Pinto and Cox [2012], where we find that 0.1 seconds of hill climbing on the regression model (“predictive auto-tuning”) can achieve an average of 95% of the speed-up brought by minutes of empirical auto-tuning. Our approach is not specific to filterbank correlation, nor even to GPU kernel auto-tuning, and can be applied to almost any templated-code optimization problem, spanning a wide variety of problem types, kernel types, and platforms.

[First page of the paper; its introduction motivates auto-tuning for rapidly evolving multicore and GPU architectures, where kernels produce staggeringly large optimization spaces [Datta et al., 2008] that can be highly discontinuous [Ryoo et al., 2008], and where quasi-optimal solutions lie at the edge of “performance cliffs” induced by hard device-specific constraints (e.g. register file size or low-latency cache size).]
3D Filterbank Convolutions!

NVIDIA GTX 580 (Fermi) (preview)

[Scatter plot: GFLOP/s of predictive auto-tuning (y-axis; ML-based, < 0.1 sec) vs. GFLOP/s of empirical auto-tuning (x-axis; old way: minutes!), both from 0 to 1400, with “equality”, “2x faster”, and “2x slower” reference lines and the auto-tuned and reference means marked. Predictive auto-tuning reaches > 1.1 TERAFLOP/s!]
What else could we do for HPC ?



• Minimize failures (exascale supercomputers)
• Minimize mixed-precision errors
• Help better understand hardware features and
  their complex interactions
• Help design better architectures ?
• $$$
• etc.
It would be a win-win-win situation!
(The Office Season 2, Episode 27: Conflict Resolution)
Outline
1. HPC-aware ML
2. GPU Meta-programming
3. ML-aware HPC
Acknowledgements

DiCarlo Lab @ MIT
Jim DiCarlo
David Cox

High-Performance Computing Needs Machine Learning... And Vice Versa (NIPS 2011, Big Learning)

  • 1. High-Performance Computing Needs Machine Learning... And Vice Versa (was “GPU Metaprogramming: A Case Study in Large-Scale Convolutional Neural Networks”) dit ion e Nicolas Pinto NIPS “Big Learning” | December 16th, 2011 The Rowland Institute a HARVARD UNIVERSITY
  • 2. Outline 1. HPC-aware ML 2. GPU Meta-programming 3. ML-aware HPC
  • 3. Outline 1. HPC-aware ML 2. GPU Meta-programming 3. ML-aware HPC
  • 11. The Problem: Visual Object Recognition fast
  • 12. The Problem: Visual Object Recognition fast accurate
  • 13. The Problem: Visual Object Recognition fast accurate effortless
  • 14. The Problem: Visual Object Recognition fast accurate effortless critical to survival
  • 15. The Problem: Visual Object Recognition fast accurate effortless critical to survival tolerant to variations!
  • 16. hard?
  • 17. hard? // the world is 3D but the retina is 2D
  • 18. hard? // the world is 3D but the retina is 2D // the curse of dimensionality
  • 19. hard? // the world is 3D but the retina is 2D // the curse of dimensionality // considerable image variation
  • 20. ~50% of is for vision!
  • 21. you learned it... ve y ha ma
  • 23. The Approach Reverse and Forward Engineering the Brain
  • 24. The Approach Reverse and Forward Engineering the Brain REVERSE FORWARD Study Build Natural System Artificial System
  • 25. The Approach Reverse and Forward Engineering the Brain REVERSE FORWARD Study Build Natural System Artificial System
  • 26. Reverse Engineering Images by DiCarlo JJ & Cox DD Animation by Li N The Ventral Visual Stream
  • 27. Reverse Engineering Images by DiCarlo JJ & Cox DD Animation by Li N The Ventral Visual Stream
  • 28. Reverse Engineering The Ventral Visual Stream taflo ps ?! in =2 0 pe bra
  • 29. The Approach Reverse and Forward Engineering the Brain REVERSE FORWARD Study Build Natural System Artificial System
  • 30. Forward Engineering The Ventral Visual Stream a rnin g ??? a bo ut le all
  • 31. “Temp. Adv.” “Auto-reset” ... number of lters L2 thresh/sat norm strength Learning normalization neighborhood Rate kernel Trace size “Temp. Adv.” “Auto-reset” ... n. of lters L1 thresh/sat norm strength Learning Rate normalization Trace neighborhood “Temp. Adv.” “Auto-reset” kernel ...
  • 32. How are things done normally?
  • 33. How are things done normally? Usual Formula:
  • 34. How are things done normally? Usual Formula: 1) One grad student
  • 35. How are things done normally? Usual Formula: 1) One grad student 2) One Model (size limited by runtime)
  • 36. How are things done normally? Usual Formula: 1) One grad student 2) One Model (size limited by runtime) 3) Performance numbers on a few standard test sets
  • 37. How are things done normally? Usual Formula: 1) One grad student 2) One Model (size limited by runtime) 3) Performance numbers on a few standard test sets 4) yay. we. rock.
  • 38. How are things done normally? Usual Formula: 1) One grad student 2) One Model (size limited by runtime) 3) Performance numbers on a few standard test sets 4) yay. we. rock. 5) One Ph.D.
  • 39. How do you call this ? “This is graduate student descent” - David McAllester
  • 40. How do you call this ? “This is graduate student descent” - David McAllester
  • 41. What’s better than this? “Conjugate graduate student descent?” - Nicolas Poilvert
  • 42. Doing things a little bit differently
  • 43. Doing things a little bit differently 1) One grad student
  • 44. Doing things a little bit differently 1) One grad student 2) One Hundreds of Thousands of BIG Models
  • 45. Doing things a little bit differently 1) One grad student 2) One Hundreds of Thousands of BIG Models 3) Performance numbers on a few standard test sets
  • 46. Doing things a little bit differently 1) One grad student 2) One Hundreds of Thousands of BIG Models 3) Performance numbers on a few standard test sets
  • 47. Doing things a little bit differently 1) One grad student 2) One Hundreds of Thousands of BIG Models 3) Performance numbers on a few standard test sets 4) yay. we. rock.
  • 48. Doing things a little bit differently 1) One grad student 2) One Hundreds of Thousands of BIG Models 3) Performance numbers on a few standard test sets 4) yay. we. rock. 5) Hundreds of Thousands One PhD ?
  • 49. If you want to have good ideas you must have many ideas. ” “ Most of them will be wrong, and what you have to learn is which ones to throw away. ” Linus Pauling (double Nobel Prize Winner)
  • 52. High-throughput Screening
  • 53. Read-out L3 thresh/sat norm strength normalization Learning large family of neighborhood Rate Trace “Temp. Adv.” “Auto-reset” number of lters ... brain-inspired models L2 thresh/sat norm strength clusive! Learning normalization neighborhood Rate in Trace 52 parameters ery kernel size “Temp. Adv.” v “Auto-reset” ... n. of lters more than 10 25 L1 thresh/sat norm strength Learning possible unique Rate normalization Trace neighborhood “Temp. Adv.” “Auto-reset” kernel ... combinations! size number of lters input kernel size Pinto, Doukhan, DiCarlo, Cox PLoS 2009
  • 54. The curse of speed
  • 55. The curse of speed thousands of big models
  • 56. The curse of speed thousands of big models large amounts of unsupervised learning experience
  • 57. The curse of speed ...and the blessing of massively parallel computing No off-the-shelf solution? DIY! Engineering (Hardware/SysAdmin/Software) Science
  • 58. The curse of speed ...and the blessing of massively parallel computing No off-the-shelf solution? DIY! Engineering (Hardware/SysAdmin/Software) Science Leverage non-scientific high-tech markets and their $billions of R&D... Gaming: Graphics Cards (GPUs), PlayStation 3 Web 2.0: Cloud Computing (Amazon, Google)
  • 59. r ow n! u ild you B
  • 60. The blessing of GPUs Computational power DIY GPU pr0n (since 2006) Sony Playstation 3s (since 2007) GPUs Peak GFLOP/s CPUs
  • 61. speed (in billion floating point operations per second) Q9450 (Matlab/C) [2008] 0.3 Q9450 (C/SSE) [2008] 9.0 7900GTX (OpenGL/Cg) [2006] 68.2 PS3/Cell (C/ASM) [2007] 111.4 8800GTX (CUDA1.x) [2007] 192.7 GTX280 (CUDA2.x) [2008] 339.3 GTX480 (CUDA3.x) [2010] 974.3 (Fermi) Pinto, Doukhan, DiCarlo, Cox PLoS 2009 Pinto, Cox GPU Comp. Gems 2011
  • 62. speed (in billion floating point operations per second) Q9450 (Matlab/C) [2008] 0.3 Q9450 (C/SSE) [2008] 9.0 7900GTX (OpenGL/Cg) [2006] 68.2 PS3/Cell (C/ASM) [2007] 111.4 8800GTX (CUDA1.x) [2007] 192.7 GTX280 (CUDA2.x) [2008] 339.3 cha n ging... e GTX480 (CUDA3.x) [2010] pe edu p is g a m 974.3 (Fermi) >1 000X s Pinto, Doukhan, DiCarlo, Cox PLoS 2009 Pinto, Cox GPU Comp. Gems 2011
  • 63. High-throughput Screening Skimming off the best models stupid chance baseline 250 200 N=2500 150 Count 100 50 0 50 60 70 80 90 100 Performance (%) Pinto, Doukhan, DiCarlo, Cox PLoS 2009
  • 64. High-throughput Screening Skimming off the best models stupid chance baseline 250 200 N=2500 150 Count 100 50 0 50 60 70 80 90 100 Performance (%) Pinto, Doukhan, DiCarlo, Cox PLoS 2009
  • 65. High-throughput Screening Skimming off the best models stupid chance baseline 250 200 N=2500 150 Count 100 50 0 50 60 70 80 90 100 Performance (%) Pinto, Doukhan, DiCarlo, Cox PLoS 2009
  • 66. High-throughput Screening Validate on other tasks ~90% vs. “HMAX 2.1” (~80%) V1-like 5 4 3 2 1 (baseline) state-of-the-art high-throughput models (from literature) Pinto, Doukhan, DiCarlo, Cox PLoS 2009
  • 67. High-throughput Screening Validate on other tasks ~90% vs. “HMAX 2.1” (~80%) V1-like 5 4 3 2 1 (baseline) state-of-the-art high-throughput models (from literature) Pinto, Doukhan, DiCarlo, Cox PLoS 2009
  • 68. High-throughput Screening Validate on other tasks ~90% vs. “HMAX 2.1” (~80%) V1-like 5 4 3 2 1 (baseline) state-of-the-art high-throughput models (from literature) Pinto, Doukhan, DiCarlo, Cox PLoS 2009
  • 69. High-throughput Screening Validate on other tasks ~90% vs. “HMAX 2.1” (~80%) V1-like 5 4 3 2 1 (baseline) state-of-the-art high-throughput models (from literature) Pinto, Doukhan, DiCarlo, Cox PLoS 2009
  • 70. High-throughput Screening Validate on faces vs. HMAX 2.1 PHOG GB PHOW SIFT blend 5 4 3 2 1 V1-like high-throughput models (baseline) state-of-the-art (from literature) Pinto, Doukhan, DiCarlo, Cox PLoS 2009
  • 71. Human vs. Machine 8-way object categorization 99.1 64 31.3 chance (12.5%) baseline best model best human
  • 72. What does it all mean? what have we learned ? briefly...
  • 73. What does it all mean? what have we learned ? Grayscale Input Normalize Linear SVM simple classifier L1 L2 L3 Filter Threshold & Φ1 Pool Normalize Saturate Φ2 ... Φk ➡ dimensionality: more filters is better
  • 74. What does it all mean? what have we learned ? Grayscale Input Normalize Linear SVM simple classifier L1 L2 L3 Filter Threshold & Φ1 Pool Normalize Saturate Φ2 ... Φk ➡ learning is difficult
  • 75. What does it all mean? what have we learned ? Grayscale Input Normalize Linear SVM simple classifier L1 L2 L3 Filter Threshold & Φ1 Pool Normalize Saturate Φ2 ... Φk ➡ non-linearities are important
  • 76. What does it all mean? what have we learned ? Grayscale Input Normalize Linear SVM simple classifier L1 L2 L3 Filter Threshold & Φ1 Pool Normalize Saturate Φ2 ... Φk ➡ normalization is very important missed in previous modeling efforts now confirmed by LeCun et al., Poggio et al., Ng et al.
  • 77. What are these models not good for? ob jects low level s ckgr ound ba fa ces
  • 78. Outline 1. HPC-aware ML 2. GPU Meta-programming 3. ML-aware HPC
  • 80. Real-world apps? testing the generality and scalability of the approach
  • 81. Facebook Really Real World Problem enormous scale billion of photos 3TB+ uploaded every day dense, collaborative face labels collab. with Zak Stone & Todd Zickler @ Harvard
  • 82. Relevance to Social Networking slide courtesy of David Cox
  • 83. Relevance to Social Networking
  • 85. High-throughput Screening
  • 86. High-Throughput Screening Labeled Faces in the Wild (LFW) View 1 > 30,000 large-scale models (1to3 layers) screened in only 3 days HT L3s (3 layers) top 5 models LFW view 1 performance Lea rning! vised o Un super N Pinto, Cox (FG 2011) Pinto, Stone, Zickler, Cox (CVPR 2011)
  • 87. Generalization Performance on LFW View 2 (hold out) Face Verification Performance (% correct) 88.1 86.8 85.3 79.4 Wolf et al. ACCV 2009 Kumar et al. Ours V1-like face.com ICCV 2009 (HT) Pinto, Cox (FG 2011)
  • 88. “Facebook100” typical social network size? collab. with Zak Stone & Todd Zickler @ Harvard Pinto, Stone, Zickler, Cox (CVPR 2011)
  • 89. Auto-tagging a network of 100 Facebook friends > 86% accurate (w/ 90 training examples) collab. with Zak Stone & Todd Zickler @ Harvard Pinto, Stone, Zickler, Cox (CVPR 2011)
  • 91. vs face.com comparison with a heavily-specialized commercial system L3 (hardware-accelerated brute-force random model) Performance (% correct) face.com V1-likearound) (best technology (one layer) training example(s) / friend Pinto, Stone, Zickler, Cox (CVPR 2011)
  • 93. Hardware Matters ! Yann LeCun’s Mac picture courtesy of Koray Kavukcuoglu
  • 94. Outline 1. HPC-aware ML 2. GPU Meta-programming 3. ML-aware HPC
  • 95. Two conflicting requirements The brain is a massively parallel computer ➡ Big models are paralyzingly slow to run Neural data only provides weak constraints ➡ Lots of parameters – hard to explore
  • 96. Two conflicting requirements The brain is a massively parallel computer FA ST slow to run ➡ Big models are paralyzingly Neural data only provides weak constraints ➡ Lots of parameters – hard to explore
  • 97. Two conflicting requirements The brain is a massively parallel computer FA ST slow to run ➡ Big models are paralyzingly Neural data only provides weak constraints LEXI BLE F ➡ Lots of parameters – hard to explore
  • 98. Two conflicting requirements The brain is a massively parallel computer FA ST slow to run ➡ Big models are paralyzingly Neural data only provides weak constraints LEXI BLE F ➡ Lots of parameters – hard to explore How to optimize?
  • 101. lutio ns! k Co nvo i lter ba n 3D F
  • 105. Meta-programming ! Leave the grunt-programming to the computer (i.e. auto-tuning like ATLAS or FFTW) • Dynamically compile specialized versions of the same kernel for different conditions • Empirical run-time tuning • For free: smooth syntactic ugliness: unroll loops, index un-indexable registers, etc.
  • 106. Meta-programming ! “Instrument” your solutions: • Block size • Work size • Loop unrolling • Pre-fetching • Spilling • etc. ... and let the computer generate find the optimal code
  • 107. How?
  • 108. Always use the right tool !
  • 110. texture<float4, 1, cudaReadModeElementType> tex_float4; __constant__ float constant[$FILTER_D][$FILTER_W][$N_FILTERS]; #define IMUL(a, b) __mul24(a, b) plating Tem extern "C" { #for j in xrange($FILTER_H) __global__ void convolve_beta_j${j}(float4 *input, float4 *output) { #set INPUT_BLOCK_W = $BLOCK_W+$FILTER_W-1 __shared__ float shared_in[$INPUT_BLOCK_W][4+1]; // -- input/output offsets const uint in_idx = (blockIdx.y+$j)*INPUT_W + blockIdx.x*blockDim.x + threadIdx.x; const uint out_idx = blockIdx.y*OUTPUT_W + blockIdx.x*blockDim.x + threadIdx.x; float4 input_v4; // -- load input to shared memory #for i in xrange($LOAD_ITERATIONS) #if $i==($LOAD_ITERATIONS-1) if((threadIdx.x+$BLOCK_W*$i)<$INPUT_BLOCK_W) #end if { input_v4 = tex1Dfetch(tex_float4, in_idx+$BLOCK_W*$i); shared_in[threadIdx.x+$BLOCK_W*$i][0] = input_v4.x; shared_in[threadIdx.x+$BLOCK_W*$i][1] = input_v4.y; shared_in[threadIdx.x+$BLOCK_W*$i][2] = input_v4.z;
  • 111. Compilation? (with Python-based solutions)
  • 112. PyCUDA/PyOpenCL (by Andreas Klockner) Klöckner, Pinto, Lee, Catanzaro, Ivanov, Fasih (ParCo 2011)
  • 113. Basic GPU Meta-programming System A Case Study GPU Meta-Programming: red Machine Vision in Biologically-Inspi s] [GPU Computing Gem Pinto N, Cox DD
  • 114. conv_kernel_4x4x4.cu conv_kernel_template.cu #include <stdio.h> texture<float4, 1, cudaReadModeElementType> tex_float4; __constant__ float constant[4][4][4]; #define IMUL(a, b) __mul24(a, b) texture<float4, 1, cudaReadModeElementType> tex_float4; extern "C" { __constant__ float constant[$FILTER_D][$FILTER_W] [$N_FILTERS]; __global__ void convolve_beta_j0(float4 *input, float4 *output) { #define IMUL(a, b) __mul24(a, b) extern "C" { __shared__ float shared_in[131][4+1]; // -- input/output offsets #for j in xrange($FILTER_H) const uint in_idx = (blockIdx.y+0)*INPUT_W + blockIdx.x*blockDim.x + threadIdx.x; const uint out_idx = blockIdx.y*OUTPUT_W + blockIdx.x*blockDim.x + threadIdx.x; __global__ void convolve_beta_j${j}(float4 *input, float4 float4 input_v4; *output) // -- load input to shared memory { { input_v4 = tex1Dfetch(tex_float4, in_idx+128*0); #set INPUT_BLOCK_W = $BLOCK_W+$FILTER_W-1 shared_in[threadIdx.x+128*0][0] = input_v4.x; __shared__ float shared_in[$INPUT_BLOCK_W][4+1]; shared_in[threadIdx.x+128*0][1] = input_v4.y; shared_in[threadIdx.x+128*0][2] = input_v4.z; shared_in[threadIdx.x+128*0][3] = input_v4.w; // -- input/output offsets } const uint in_idx = (blockIdx.y+$j)*INPUT_W + if((threadIdx.x+128*1)<131) blockIdx.x*blockDim.x + threadIdx.x; { const uint out_idx = blockIdx.y*OUTPUT_W + input_v4 = tex1Dfetch(tex_float4, in_idx+128*1); blockIdx.x*blockDim.x + threadIdx.x; shared_in[threadIdx.x+128*1][0] = input_v4.x; shared_in[threadIdx.x+128*1][1] = input_v4.y; float4 input_v4; shared_in[threadIdx.x+128*1][2] = input_v4.z; shared_in[threadIdx.x+128*1][3] = input_v4.w; // -- load input to shared memory } #for i in xrange($LOAD_ITERATIONS) __syncthreads(); #if $i==($LOAD_ITERATIONS-1) // -- compute dot products if((threadIdx.x+$BLOCK_W*$i)<$INPUT_BLOCK_W) float v, w; #end if { float sum0 = 0; input_v4 = tex1Dfetch(tex_float4, in_idx+$BLOCK_W* float sum1 = 0; $i); float sum2 = 0; float sum3 = 0; shared_in[threadIdx.x+$BLOCK_W*$i][0] = input_v4.x; shared_in[threadIdx.x+$BLOCK_W*$i][1] = input_v4.y; v = shared_in[threadIdx.x+0][0]; shared_in[threadIdx.x+$BLOCK_W*$i][2] = input_v4.z; w = constant[0][0][0]; shared_in[threadIdx.x+$BLOCK_W*$i][3] = input_v4.w; sum0 += v*w; } w = constant[0][0][1]; sum1 += v*w; #end for w = constant[0][0][2]; sum2 += v*w; w = constant[0][0][3]; sum3 += v*w; v = shared_in[threadIdx.x+1][0]; w = constant[0][1][0]; sum0 += v*w; w = constant[0][1][1]; sum1 += v*w; w = constant[0][1][2]; sum2 += v*w; w = constant[0][1][3]; sum3 += v*w; v = shared_in[threadIdx.x+2][0]; w = constant[0][2][0]; sum0 += v*w; w = constant[0][2][1]; sum1 += v*w;
  • 115. conv_kernel_template.cu texture<float4, 1, cudaReadModeElementType> tex_float4; __constant__ float constant[$FILTER_D][$FILTER_W] [$N_FILTERS]; #define IMUL(a, b) __mul24(a, b) conv_kernel_4x4x4.cu extern "C" { #for j in xrange($FILTER_H) __global__ void convolve_beta_j${j}(float4 *input, float4 20 kB *output) { #set INPUT_BLOCK_W = $BLOCK_W+$FILTER_W-1 __shared__ float shared_in[$INPUT_BLOCK_W][4+1]; // -- input/output offsets const uint in_idx = (blockIdx.y+$j)*INPUT_W + blockIdx.x*blockDim.x + threadIdx.x; const uint out_idx = blockIdx.y*OUTPUT_W + blockIdx.x*blockDim.x + threadIdx.x; float4 input_v4; // -- load input to shared memory #for i in xrange($LOAD_ITERATIONS) #if $i==($LOAD_ITERATIONS-1) if((threadIdx.x+$BLOCK_W*$i)<$INPUT_BLOCK_W) #end if $i); { input_v4 = tex1Dfetch(tex_float4, in_idx+$BLOCK_W* conv_kernel_8x8x4.cu shared_in[threadIdx.x+$BLOCK_W*$i][0] = input_v4.x; shared_in[threadIdx.x+$BLOCK_W*$i][1] = input_v4.y; shared_in[threadIdx.x+$BLOCK_W*$i][2] = input_v4.z; shared_in[threadIdx.x+$BLOCK_W*$i][3] = input_v4.w; } 64 kB #end for
  • 118. Smooth syntactic ugliness Manipulations that are not easily accessible in CUDA C code: • loop unrolling (possibly fine-controlled)
  • 119. Smooth syntactic ugliness Manipulations that are not easily accessible in CUDA C code: • fine-controlled loop unrolling / jamming ..) v = shared_in[threadIdx.x+0][0]; w = constant[0][0][0]; sum0 += v*w; w = constant[0][0][1]; sum1 += v*w; w = constant[0][0][2]; sum2 += v*w; w = constant[0][0][3]; sum3 += v*w; v = shared_in[threadIdx.x+1][0]; w = constant[0][1][0]; sum0 += v*w; w = constant[0][1][1]; sum1 += v*w; w = constant[0][1][2]; sum2 += v*w; w = constant[0][1][3]; sum3 += v*w; v = shared_in[threadIdx.x+2][0]; w = constant[0][2][0]; sum0 += v*w; w = constant[0][2][1]; sum1 += v*w; w = constant[0][2][2]; sum2 += v*w; w = constant[0][2][3]; sum3 += v*w; v = shared_in[threadIdx.x+3][0]; w = constant[0][3][0]; sum0 += v*w; w = constant[0][3][1]; sum1 += v*w; w = constant[0][3][2]; sum2 += v*w; w = constant[0][3][3]; sum3 += v*w; v = shared_in[threadIdx.x+0][1]; w = constant[1][0][0]; sum0 += v*w; w = constant[1][0][1]; sum1 += v*w; w = constant[1][0][2]; sum2 += v*w; w = constant[1][0][3]; sum3 += v*w;
  • 120. How about #pragma unroll ? (why don’t you trust the compiler?)
  • 121. o t alo ne.... we are n s for S ignal Using GPU elatio n pil ers Corr ust com ’t tr itchell Daniel A. M Don gmen The Murch ode fr a ts ison Widefi eld Array c tical” e “iden re thes + g *h; ompa LOPS • C *c + e*f 770 GF + d b*c grating 8-s econd snap shots over a += inte peeling, roduced by lanking and b*c; -2 526 field p d after RFI b f the J2107 e of the fiel an image o ht is an imag S FLOP n the left is . On the rig a += d*c; Figure 3: O ing hout blank interval wit 20 G entire time eeled imag e. noise the e unp e above the ntours of th f magnitud 10 along with co rs o This at are orde ious data. a += e*f; els th dub ivers at lev ply discard n here to the rece m will sim tector show k ste ichael hClar ct in fl ect or refra real-time sy n-based de occasion, re s the MWA mple media integration hich the si M floor. D wit wil uring deep l require a series of d ata-quality art. tests, of w a += g*h; n integral p will form a eenhill Lincoln Gr Paul La Plante and Reference s t Boolard a += y, EDGES Memo, 058 , 2010. R.J. Cappal lo, M.F. M orales, and ics a ale, d Topics RFI Statist , C.J. Lonsd l of Selecte [1] A.E .E. Rogers, , R.J. Sault IE EE Journa R.B. Wayth eld Array, . Greenhill, hison Widefi ]. itchell, L.J of the Murc 07.1912 E, 97 [2] D.A. M Time Calib ration , [astro- ph/08 s of the IEE S.M. O rd, Real- 7 17, 2008 , Proceeding 2 (5), 707– n Overview 1 nuary 201 sday, 27 Ja rocessing, rray: Desig in Signal P on Widefield A he Murchis 8]. , Graphics ale, et al., T 903.182 R.G. Edgar [3] C.J. Lonsd [ast ro-ph/0 H. Pfister, and Series, 506, 2009, ell, K. Dale, Conference (8), 1497–1 , D.A. Mitch d Array, ASP R.B. Wayth on Wide-fiel Greenhill, the Murchis IICS‘2011 [4] S.M . Ord, L.J. ata Pro cessing in cal Units for D Mathemati Processing 1 radio pola rimetry. I. 009. aa d nderstryn20 ing 1 411, 127, 2 .J. Sault, U Janu 6. . Breg man, and R ursday,.,2117, 137–147, 199 7 alar amaker, J.D Th pl. Ser up alogue of sc [5 ] J.P. H st rophys. S ll-co herency an rophys. Su ppl. s, Astron. A . IV. The fu Astron. Ast foundation polarimetry ric fidelity, g radio ge and pola rimet derstandin
  • 122. Smooth syntactic ugliness Manipulations that are not easily accessible in CUDA C code: • variable-length argument lists
  • 123. Smooth syntactic ugliness Manipulations that were not easily accessible in CUDA C code: • index un-indexable resources (e.g. regs)
  • 124. Explore design decision space more freely
  • 125. Basic GPU Meta-programming System A Case Study GPU Meta-Programming: red Machine Vision in Biologically-Inspi s] [GPU Computing Gem Pinto N, Cox DD
  • 126. ... too many optimizations? ba nk c onflict s on ing isi ale sc ec co ca pr ch d part ling itionnrol in ixe cla p u ca mpin g m loo g m pi ng adca sting bro ms zero-cop trea
  • 127. e ? ec id ’t d c an keep them all !
  • 128. Exploring design decision space more freely Meta-programming: • enables efficient learning of the GPU hardware/software • allows full exploitation of the GPU architecture
  • 129. version A conv_kernel_beta_template.cu ... mad.rn.f32 $r4, s[$ofs3+0x0000], $r4, $r1 mov.b32 $r1, c0[$ofs2+0x0008] texture<float4, 1, cudaReadModeElementType> tex_float4; __constant__ float constant[$FILTER_D][$FILTER_W] mad.rn.f32 $r4, s[$ofs3+0x0008], $r1, $r4 [$N_FILTERS]; mov.b32 $r1, c0[$ofs2+0x000c] #define IMUL(a, b) __mul24(a, b) extern "C" { mad.rn.f32 $r4, s[$ofs3+0x000c], $r1, $r4 #for j in xrange($FILTER_H) mov.b32 $r1, c0[$ofs2+0x0010] __global__ void convolve_beta_j${j}(float4 *input, float4 *output) mad.rn.f32 $r4, s[$ofs3+0x0010], $r1, $r4 { #set INPUT_BLOCK_W = $BLOCK_W+$FILTER_W-1 __shared__ float shared_in[$INPUT_BLOCK_W][4+1]; ... // -- input/output offsets const uint in_idx = (blockIdx.y+$j)*INPUT_W + blockIdx.x*blockDim.x + threadIdx.x; const uint out_idx = blockIdx.y*OUTPUT_W + blockIdx.x*blockDim.x + threadIdx.x; float4 input_v4; // -- load input to shared memory #for i in xrange($LOAD_ITERATIONS) version B #if $i==($LOAD_ITERATIONS-1) if((threadIdx.x+$BLOCK_W*$i)<$INPUT_BLOCK_W) #end if { input_v4 = tex1Dfetch(tex_float4, in_idx+$BLOCK_W* $i); shared_in[threadIdx.x+$BLOCK_W*$i][0] = input_v4.x; shared_in[threadIdx.x+$BLOCK_W*$i][1] = input_v4.y; shared_in[threadIdx.x+$BLOCK_W*$i][2] = input_v4.z; ... shared_in[threadIdx.x+$BLOCK_W*$i][3] = input_v4.w; mad.rn.f32 $r1, s[$ofs1+0x007c], c0[$ofs1+0x0078], $r1 } #end for mad.rn.f32 $r1, s[$ofs2+0x0000], c0[$ofs2+0x007c], $r1 mad.rn.f32 $r1, s[$ofs2+0x0008], c0[$ofs2+0x0080], $r1 mad.rn.f32 $r1, s[$ofs2+0x000c], c0[$ofs2+0x0084], $r1 mad.rn.f32 $r1, s[$ofs2+0x0010], c0[$ofs2+0x0088], $r1 ... aster... Why ? 2x f
  • 131. speed (in billion floating point operations per second) Q9450 (Matlab/C) [2008] 0.3 Q9450 (C/SSE) [2008] 9.0 7900GTX (OpenGL/Cg) [2006] 68.2 PS3/Cell (C/ASM) [2007] 111.4 8800GTX (CUDA1.x) [2007] 192.7 GTX280 (CUDA2.x) [2008] 339.3 cha n ging... e GTX480 (CUDA3.x) [2010] pe edu p is g a m 974.3 (Fermi) >1 000X s Pinto, Doukhan, DiCarlo, Cox PLoS 2009 Pinto, Cox GPU Comp. Gems 2011
  • 132. -10.4 1024x1024x8 16x5x5x8 726.412 ± 0.398 744.973 ± 0.571 Analysis 2048x2048x4 4x8x8x4 474.681 ± 0.160 887.974 ± 1.017 ➡ Different hardware ? Table 33.2 Performance of Auto-Tuned Implementations on Two Hardware Platforms, Including Performance Tuned on One Platform and Run on the Other Optimized for: Run on: 9400M GTX480 Tuning Speedup 9400M 0.32s 2.52s 675% GTX480 0.016s 0.011s 52% formance gains are observed for the auto-tuned meta-kernels as compared to t, which was hand-picked to allow correct execution of all input ranges ng up against hardware limitations.
  • 133. APTER 33 GPU Metaprogramming: A Case Study Analysis ➡ Different input configurations Table 33.3 Performance of Auto-Tuned Implementations on Two Input Configurations, Including Performance Tuned for One Configuration and Run with the Other Optimized for: Run on: Config1 Config2 Tuning Speedup config1 11.1ms 15.7ms 41% config2 fails 10.8ms not comparable , in Table 33.3 we show the effect of tuning on one input configuration an in, significant speedups are obtained using kernels tailored to a specific inp
  • 136. Summary Meta-programming: • can assist exploration and manual optimization
  • 137. Summary Meta-programming: • can assist exploration and manual optimization • can de-clutter highly-optimized code
  • 138. Summary Meta-programming: • can assist exploration and manual optimization • can de-clutter highly-optimized code • is easy and flexible with the right tools (e.g. Python, PyCUDA/CL, Cheetah, decuda)
  • 139. Summary Meta-programming: • can assist exploration and manual optimization • can de-clutter highly-optimized code • is easy and flexible with the right tools (e.g. Python, PyCUDA/CL, Cheetah, decuda) ➡ helps get drastic speed-ups !
  • 140. Summary Meta-programming: • can assist exploration and manual optimization • can de-clutter highly-optimized code • is easy and flexible with the right tools (e.g. Python, PyCUDA/CL, Cheetah, decuda) ➡ helps get drastic speed-ups ! ➡ facilitates “auto-tuning” !
  • 141. Outline 1. HPC-aware ML 2. GPU Meta-programming 3. ML-aware HPC
  • 142. Intelligent and fast Auto-Tuning with Machine Learning with James Bergstra and David Cox
  • 143. Intelligent and fast Auto-Tuning with Machine Learning
• 150. Auto-tuning: two approaches

• Analytical model-based optimization:
  - pros: very generic (dominant in compilers), fast “inference”
  - cons: hard to build, domain expertise required, auto-tuned code far from peak

• Empirical optimization:
  - pros: auto-tuned code close to peak (dominant in specialized libraries, e.g. ATLAS, FFTW), easier to build
  - cons: very slow “inference” (for new inputs, etc.)
• 151. Empirical Auto-Tuning

The goal is to empirically optimize execution time given both:
• the environment
  - hardware (GPU, CPU, memory, motherboard, etc.)
  - software (SDK, compiler suite, etc.)
• the data (input dimensions, repetitions, etc.)
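A minimal sketch of that empirical loop, assuming a hypothetical toy PyCUDA kernel and a made-up search space of block widths and unroll factors; the generate/compile/time/select structure is the point, not the placeholder kernel:

import itertools
import numpy as np
import pycuda.autoinit
import pycuda.driver as drv
from pycuda.compiler import SourceModule

def build(block_w, unroll):
    # bake the unroll factor into the source: one statement per output
    body = "\n        ".join(
        "out[base + %d] = 2.0f * in_data[base + %d];" % (u * block_w, u * block_w)
        for u in range(unroll))
    src = """
    __global__ void k(const float *in_data, float *out) {
        const int base = blockIdx.x * %d + threadIdx.x;
        %s
    }""" % (block_w * unroll, body)
    return SourceModule(src).get_function("k")

n = 1 << 20
x = np.random.rand(n).astype(np.float32)
y = np.zeros(n, dtype=np.float32)

best = None
for block_w, unroll in itertools.product([64, 128, 256], [1, 2, 4]):
    fn = build(block_w, unroll)
    start, stop = drv.Event(), drv.Event()
    start.record()
    fn(drv.In(x), drv.Out(y),
       block=(block_w, 1, 1), grid=(n // (block_w * unroll), 1))
    stop.record()
    stop.synchronize()
    dt = stop.time_since(start)   # ms; includes transfers, fine for a sketch
    if best is None or dt < best[0]:
        best = (dt, block_w, unroll)
print("fastest variant: %.3f ms with block_w=%d, unroll=%d" % best)

This is generic (it only needs a way to render and time variants) but slow: every candidate must actually be compiled and benchmarked on the target device and data.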
• 152. Empirical Auto-Tuning with Meta-programming

A Case Study: “GPU Meta-Programming: A Case Study in Biologically-Inspired Machine Vision” [GPU Computing Gems], Pinto N, Cox DD
  • 153. Intelligent and fast Auto-Tuning with Machine Learning
• 158. Auto-tuning: best of both approaches?

• Empirically-learned model-based optimization:
  - pros: auto-tuned code close to peak*, easier to build (?), fast “inference” (for new inputs, hardware, etc.)
  - cons: unexplored!

* could be dominant in specialized libraries (e.g. machine learning!)
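One way to make the “empirically-learned model” idea concrete, assuming scikit-learn's boosted regression trees as the non-linear regressor (the paper below also uses boosted regression trees) and hypothetical (block width, unroll, input size) features with made-up runtimes:

import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

# X: one row per benchmarked variant, e.g. (block_w, unroll, input_size);
# y: the corresponding measured runtimes (illustrative values, in ms)
X = np.array([[64, 1, 1024], [128, 2, 1024], [256, 4, 1024],
              [64, 4, 4096], [128, 1, 4096], [256, 2, 4096]], dtype=float)
y = np.array([3.1, 1.9, 1.2, 9.8, 7.4, 4.3])

model = GradientBoostingRegressor(n_estimators=200).fit(X, y)

# "Inference" is now a cheap model query instead of a slow benchmark:
# score many unseen configurations and launch only the predicted best.
grid = np.array([[bw, u, 2048] for bw in (64, 128, 256) for u in (1, 2, 4)],
                dtype=float)
best = grid[model.predict(grid).argmin()]
print("predicted-best config:", best)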
• 159. Fast Machine Learning-based Runtime Auto-Tuning
• 160. “Machine Learning for Predictive Auto-Tuning with Boosted Regression Trees”
James Bergstra, Nicolas Pinto, David Cox [submitted]

Abstract: The rapidly evolving landscape of multicore architectures makes the construction of efficient libraries a daunting task. A family of methods known collectively as “auto-tuning” has emerged to address this challenge. Two major approaches to auto-tuning are empirical and model-based: empirical auto-tuning is a generic but slow approach that works by measuring runtimes of candidate implementations; model-based auto-tuning predicts those runtimes using simplified abstractions designed by hand. We show that machine learning methods for non-linear regression can be used to estimate timing models from data, capturing the best of both approaches. A statistically-derived model offers the speed of a model-based approach, with the generality and simplicity of empirical auto-tuning. We validate our approach using the filterbank correlation kernel described in Pinto and Cox [2012], where we find that 0.1 seconds of hill climbing on the regression model (“predictive auto-tuning”) can achieve an average of 95% of the speed-up brought by minutes of empirical auto-tuning. Our approach is not specific to filterbank correlation, nor even to GPU kernel auto-tuning, and can be applied to almost any templated-code optimization problem, spanning a wide variety of problem types, kernel types, and platforms.
• 161. 3D Filterbank Convolutions!
• 162-163. [Preview figure, panel (b): GFLOP/s of predictive auto-tuning vs. GFLOP/s of empirical auto-tuning on an NVIDIA GTX 580 (Fermi), with equality, 2x-faster, and 2x-slower guide lines and auto-tuned vs. reference means. ML-based predictive tuning takes < 0.1 sec per problem, versus minutes of empirical auto-tuning the old way, and exceeds 1 TERAFLOP/s.]
• 171. What else could we do for HPC?

• Minimize failures (exascale supercomputers)
• Minimize mixed-precision errors
• Help better understand hardware features and their complex interactions
• Help design better architectures?
• $$$
• etc.
  • 172. It would be a win-win-win situation! (The Office Season 2, Episode 27: Conflict Resolution)
  • 173. Outline 1. HPC-aware ML 2. GPU Meta-programming 3. ML-aware HPC
• 174. Acknowledgements: DiCarlo Lab @ MIT, Jim DiCarlo, David Cox
• 175. Acknowledgements