High-Performance Computing Needs
Machine Learning... And Vice Versa
(was “GPU Metaprogramming: A Case Study in Large-Scale Convolutional Neural Networks”)




Nicolas Pinto
NIPS “Big Learning” | December 16th, 2011




                                                                      The Rowland Institute at Harvard
                                                                      HARVARD UNIVERSITY
Outline
1. HPC-aware ML
2. GPU Meta-programming
3. ML-aware HPC
Motivation...
The Problem:
Visual Object Recognition
Why?
it seems easy, right?
44 years ago...
The Problem:
Visual Object Recognition

                fast
                accurate
                effortless
                critical to survival

                tolerant to variations!
hard?

// the world is 3D but the retina is 2D
// the curse of dimensionality

// considerable image variation
~50% of the cortex is for vision!
you may have learned it...
Background
The Approach
Reverse and Forward Engineering the Brain

     REVERSE                 FORWARD
       Study                       Build
    Natural System            Artificial System
Reverse Engineering
The Ventral Visual Stream
(images by DiCarlo JJ & Cox DD; animation by Li N)
Reverse Engineering
The Ventral Visual Stream

brain = 20 petaflops?!
The Approach
Reverse and Forward Engineering the Brain
Forward Engineering
The Ventral Visual Stream

all about learning???
[Diagram: a two-layer (L1, L2) slice of the model family; each layer exposes kernel size, number of filters, threshold/saturation, normalization strength and neighborhood, and learning parameters (Rate, Trace, "Temp. Adv.", "Auto-reset", ...).]
How are things done normally?

  Usual Formula:

  1) One grad student
  2) One Model (size limited by runtime)
  3) Performance numbers on a few standard test sets
  4) yay. we. rock.
  5) One Ph.D.
What do you call this?

  “This is graduate student descent”
  - David McAllester
What’s better than this?




“Conjugate graduate student descent?”
- Nicolas Poilvert
Doing things a little bit differently

  1) One grad student
  2) One → Hundreds of Thousands of BIG Models
  3) Performance numbers on a few standard test sets
  4) yay. we. rock.
  5) Hundreds of Thousands → One PhD?
“If you want to have good ideas you must have many ideas.”
“Most of them will be wrong, and what you have to learn is which ones to throw away.”

                    Linus Pauling
                    (double Nobel Prize Winner)
High-throughput Screening
[Diagram: a large family of brain-inspired models; a read-out stage on top of three layers (L1, L2, L3), each exposing kernel size, number of filters, threshold/saturation, normalization strength and neighborhood, and learning parameters (Rate, Trace, "Temp. Adv.", "Auto-reset", ...). Very inclusive!]

52 parameters
more than 10^25 possible unique combinations!

Pinto, Doukhan, DiCarlo, Cox PLoS 2009
The curse of speed

  thousands of big models

  large amounts of unsupervised learning experience
The curse of speed
...and the blessing of massively parallel computing

  No off-the-shelf solution? DIY!
  Engineering (Hardware/SysAdmin/Software)   Science

  Leverage non-scientific high-tech markets and their $billions of R&D...
  Gaming: Graphics Cards (GPUs), PlayStation 3
  Web 2.0: Cloud Computing (Amazon, Google)
Build your own!
The blessing of GPUs
Computational power: DIY GPU pr0n (since 2006), Sony PlayStation 3s (since 2007)

[Chart: peak GFLOP/s over time, GPUs vs. CPUs; GPUs pull far ahead.]
speed
(in billion floating point operations per second)

    Q9450 (Matlab/C) [2008]:          0.3
    Q9450 (C/SSE) [2008]:             9.0
    7900GTX (OpenGL/Cg) [2006]:      68.2
    PS3/Cell (C/ASM) [2007]:        111.4
    8800GTX (CUDA1.x) [2007]:       192.7
    GTX280 (CUDA2.x) [2008]:        339.3
    GTX480 (CUDA3.x, Fermi) [2010]: 974.3

Pinto, Doukhan, DiCarlo, Cox PLoS 2009
Pinto, Cox GPU Comp. Gems 2011
>1000X speedup is game-changing...
High-throughput Screening
Skimming off the best models

[Histogram: count of models (N=2500) vs. performance (%), from 50 to 100%; chance and a "stupid baseline" marked at the low end.]

Pinto, Doukhan, DiCarlo, Cox PLoS 2009
High-throughput Screening
Validate on other tasks

[Bar chart: V1-like baseline and state-of-the-art models from the literature (including "HMAX 2.1", ~80%) vs. the top 5 high-throughput models (best ~90%).]

Pinto, Doukhan, DiCarlo, Cox PLoS 2009
High-throughput Screening
Validate on faces

[Bar chart: V1-like baseline and state-of-the-art methods from the literature (SIFT, GB, PHOG, PHOW, HMAX 2.1) vs. the top 5 high-throughput models and their blend.]

Pinto, Doukhan, DiCarlo, Cox PLoS 2009
Human vs. Machine
8-way object categorization

[Bar chart, % correct: chance 12.5%; baseline 31.3%; best model 64%; best human 99.1%.]
What does it all mean?
what have we learned? (briefly...)

[Pipeline: Grayscale Input → Normalize → L1 → L2 → L3 → Linear SVM (simple classifier); within each layer: Filter (Φ1 ... Φk) → Threshold & Saturate → Pool → Normalize]

➡ dimensionality: more filters is better
➡ learning is difficult
➡ non-linearities are important
➡ normalization is very important
    missed in previous modeling efforts
    now confirmed by LeCun et al., Poggio et al., Ng et al.
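
For intuition, here is a minimal NumPy sketch of one such stage (illustrative only: the filter shapes, clipping range, and pooling/normalization details are assumptions, not the paper's exact operations):

import numpy as np

def layer(x, filters, thresh=0.0, pool=2, eps=1e-6):
    # x: (H, W) grayscale array; filters: (k, fh, fw) filter bank
    k, fh, fw = filters.shape
    H, W = x.shape
    out = np.empty((k, H - fh + 1, W - fw + 1), dtype=np.float32)
    # -- filter: "valid" cross-correlation with each kernel
    for i in range(k):
        for r in range(out.shape[1]):
            for c in range(out.shape[2]):
                out[i, r, c] = np.sum(x[r:r + fh, c:c + fw] * filters[i])
    # -- threshold & saturate
    out = np.clip(out, thresh, 1.0)
    # -- pool over non-overlapping pool x pool neighborhoods
    h, w = out.shape[1] // pool, out.shape[2] // pool
    out = out[:, :h * pool, :w * pool]
    out = out.reshape(k, h, pool, w, pool).sum(axis=(2, 4))
    # -- divisive normalization across the filter bank
    norm = np.sqrt((out ** 2).sum(axis=0, keepdims=True)) + eps
    return out / norm

rng = np.random.default_rng(0)
features = layer(rng.random((64, 64)), rng.standard_normal((4, 5, 5)))
print(features.shape)   # (4, 30, 30)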
What are these models not good for?

low level · objects · backgrounds · faces
Outline
1. HPC-aware ML
2. GPU Meta-programming
3. ML-aware HPC
one more thing
Real-world apps?
testing the generality and scalability of the approach
Facebook
Really Real World Problem

                                  enormous scale:
                                     billions of photos
                                     3TB+ uploaded every day
                                     dense, collaborative face labels

collab. with Zak Stone & Todd Zickler @ Harvard
Relevance to Social Networking

                         slide courtesy of David Cox
High-throughput Screening
High-Throughput Screening
Labeled Faces in the Wild (LFW) View 1
> 30,000 large-scale models (1 to 3 layers) screened in only 3 days

[Chart: HT L3s (3 layers); LFW View 1 performance of the top 5 models.]

No Unsupervised Learning!

Pinto, Cox (FG 2011); Pinto, Stone, Zickler, Cox (CVPR 2011)
Generalization
Performance on LFW View 2 (hold out)

Face Verification Performance (% correct):
    V1-like:                             79.4
    Wolf et al. (ACCV 2009):             85.3
    Kumar et al. / face.com (ICCV 2009): 86.8
    Ours (HT):                           88.1

Pinto, Cox (FG 2011)
“Facebook100”
typical social network size?




collab. with Zak Stone & Todd Zickler @ Harvard
                                    Pinto, Stone, Zickler, Cox (CVPR 2011)
Auto-tagging
a network of 100 Facebook friends

                             > 86% accurate
                             (w/ 90 training examples)

collab. with Zak Stone & Todd Zickler @ Harvard
Pinto, Stone, Zickler, Cox (CVPR 2011)
vs face.com
comparison with a heavily-specialized commercial system

[Plot: performance (% correct) vs. number of training examples per friend, for three systems: L3 (hardware-accelerated brute-force random model), face.com (best technology around), and V1-like (one layer).]

Pinto, Stone, Zickler, Cox (CVPR 2011)
Conclusion?
Hardware Matters !


       Yann LeCun’s Mac




              picture courtesy of Koray Kavukcuoglu
Outline
1. HPC-aware ML
2. GPU Meta-programming
3. ML-aware HPC
Two conflicting requirements

   The brain is a massively parallel computer
➡ Big models are paralyzingly slow to run (they need to be FAST)

   Neural data only provides weak constraints
➡ Lots of parameters – hard to explore (the code needs to be FLEXIBLE)

  How to optimize?
What’s the bottleneck?
3D Filterbank Convolutions!
Our answer?
Meta-programming!

What?
Meta-programming !

 Leave the grunt-programming to the computer (i.e. auto-tuning like ATLAS or FFTW):
 •   Dynamically compile specialized versions of the same kernel for different conditions
 •   Empirical run-time tuning
 •   For free: smooth syntactic ugliness: unroll loops, index un-indexable registers, etc.
Meta-programming !

“Instrument” your solutions:
•   Block size
•   Work size
•   Loop unrolling
•   Pre-fetching
•   Spilling
•   etc.

                     ... and let the computer generate → find the optimal code (a sketch follows)
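
To make the idea concrete, a minimal sketch in Python (string.Template stands in for the deck's Cheetah templates; the kernel body, parameter names, and candidate values are illustrative assumptions):

import string

# Cheetah-style placeholders approximated with string.Template:
# ${BLOCK_W} and ${UNROLL} are baked into the generated CUDA source.
KERNEL_TEMPLATE = string.Template("""
extern "C" __global__
void convolve_${BLOCK_W}_${UNROLL}(const float *in, float *out)
{
    #pragma unroll ${UNROLL}
    for (int i = 0; i < ${BLOCK_W}; ++i) {
        /* ... body specialized for this configuration ... */
    }
}
""")

def render(block_w, unroll):
    # generate one specialized kernel source string
    return KERNEL_TEMPLATE.substitute(BLOCK_W=block_w, UNROLL=unroll)

# Enumerate candidate configurations; in the real system each variant is
# compiled and timed empirically, and the fastest one is kept.
for block_w in (64, 128, 256):
    for unroll in (1, 2, 4):
        print(render(block_w, unroll))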
How?
Always use the right tool !
Templating

texture<float4, 1, cudaReadModeElementType> tex_float4;
__constant__ float constant[$FILTER_D][$FILTER_W][$N_FILTERS];

#define IMUL(a, b) __mul24(a, b)

extern "C" {

#for j in xrange($FILTER_H)

  __global__ void convolve_beta_j${j}(float4 *input, float4 *output)
  {

#set INPUT_BLOCK_W = $BLOCK_W+$FILTER_W-1
    __shared__ float shared_in[$INPUT_BLOCK_W][4+1];

    // -- input/output offsets
    const uint in_idx = (blockIdx.y+$j)*INPUT_W + blockIdx.x*blockDim.x + threadIdx.x;
    const uint out_idx = blockIdx.y*OUTPUT_W + blockIdx.x*blockDim.x + threadIdx.x;
    float4 input_v4;

    // -- load input to shared memory
#for i in xrange($LOAD_ITERATIONS)
#if $i==($LOAD_ITERATIONS-1)
    if((threadIdx.x+$BLOCK_W*$i)<$INPUT_BLOCK_W)
#end if
      {
        input_v4 = tex1Dfetch(tex_float4, in_idx+$BLOCK_W*$i);
        shared_in[threadIdx.x+$BLOCK_W*$i][0] = input_v4.x;
        shared_in[threadIdx.x+$BLOCK_W*$i][1] = input_v4.y;
        shared_in[threadIdx.x+$BLOCK_W*$i][2] = input_v4.z;
Compilation?
(with Python-based solutions)

PyCUDA/PyOpenCL (by Andreas Klöckner)

Klöckner, Pinto, Lee, Catanzaro, Ivanov, Fasih (ParCo 2011)
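
For flavor, a hedged sketch of run-time compilation and timing with PyCUDA (the kernel and sizes here are illustrative; the PyCUDA calls are the library's standard ones):

import numpy as np
import pycuda.autoinit              # initializes a CUDA context
import pycuda.driver as drv
from pycuda.compiler import SourceModule

# compile CUDA source at run time (this is where a generated,
# specialized kernel string would be dropped in)
mod = SourceModule("""
__global__ void scale(float *x, float a, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] *= a;
}
""")
scale = mod.get_function("scale")

n = 1 << 20
x_gpu = drv.to_device(np.ones(n, dtype=np.float32))

# empirical timing with CUDA events
start, end = drv.Event(), drv.Event()
start.record()
scale(x_gpu, np.float32(2.0), np.int32(n),
      block=(256, 1, 1), grid=((n + 255) // 256, 1))
end.record()
end.synchronize()
print("elapsed: %.3f ms" % start.time_till(end))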
Basic GPU Meta-programming System

GPU Meta-Programming: A Case Study in Biologically-Inspired Machine Vision
[GPU Computing Gems]
Pinto N, Cox DD
conv_kernel_template.cu (the template):

texture<float4, 1, cudaReadModeElementType> tex_float4;
__constant__ float constant[$FILTER_D][$FILTER_W][$N_FILTERS];

#define IMUL(a, b) __mul24(a, b)
extern "C" {

#for j in xrange($FILTER_H)

  __global__ void convolve_beta_j${j}(float4 *input, float4 *output)
  {

#set INPUT_BLOCK_W = $BLOCK_W+$FILTER_W-1
    __shared__ float shared_in[$INPUT_BLOCK_W][4+1];

    // -- input/output offsets
    const uint in_idx = (blockIdx.y+$j)*INPUT_W + blockIdx.x*blockDim.x + threadIdx.x;
    const uint out_idx = blockIdx.y*OUTPUT_W + blockIdx.x*blockDim.x + threadIdx.x;
    float4 input_v4;

    // -- load input to shared memory
#for i in xrange($LOAD_ITERATIONS)
#if $i==($LOAD_ITERATIONS-1)
    if((threadIdx.x+$BLOCK_W*$i)<$INPUT_BLOCK_W)
#end if
      {
        input_v4 = tex1Dfetch(tex_float4, in_idx+$BLOCK_W*$i);
        shared_in[threadIdx.x+$BLOCK_W*$i][0] = input_v4.x;
        shared_in[threadIdx.x+$BLOCK_W*$i][1] = input_v4.y;
        shared_in[threadIdx.x+$BLOCK_W*$i][2] = input_v4.z;
        shared_in[threadIdx.x+$BLOCK_W*$i][3] = input_v4.w;
      }
#end for

conv_kernel_4x4x4.cu (generated from the template; 20 kB of source — the 8x8x4 variant, conv_kernel_8x8x4.cu, is 64 kB):

#include <stdio.h>

texture<float4, 1, cudaReadModeElementType> tex_float4;
__constant__ float constant[4][4][4];

#define IMUL(a, b) __mul24(a, b)
extern "C" {

  __global__ void convolve_beta_j0(float4 *input, float4 *output)
  {
    __shared__ float shared_in[131][4+1];

    // -- input/output offsets
    const uint in_idx = (blockIdx.y+0)*INPUT_W + blockIdx.x*blockDim.x + threadIdx.x;
    const uint out_idx = blockIdx.y*OUTPUT_W + blockIdx.x*blockDim.x + threadIdx.x;
    float4 input_v4;

    // -- load input to shared memory
      {
        input_v4 = tex1Dfetch(tex_float4, in_idx+128*0);
        shared_in[threadIdx.x+128*0][0] = input_v4.x;
        shared_in[threadIdx.x+128*0][1] = input_v4.y;
        shared_in[threadIdx.x+128*0][2] = input_v4.z;
        shared_in[threadIdx.x+128*0][3] = input_v4.w;
      }
    if((threadIdx.x+128*1)<131)
      {
        input_v4 = tex1Dfetch(tex_float4, in_idx+128*1);
        shared_in[threadIdx.x+128*1][0] = input_v4.x;
        shared_in[threadIdx.x+128*1][1] = input_v4.y;
        shared_in[threadIdx.x+128*1][2] = input_v4.z;
        shared_in[threadIdx.x+128*1][3] = input_v4.w;
      }
    __syncthreads();

    // -- compute dot products
    float v, w;
    float sum0 = 0;
    float sum1 = 0;
    float sum2 = 0;
    float sum3 = 0;

    v = shared_in[threadIdx.x+0][0];
    w = constant[0][0][0];
    sum0 += v*w;
    w = constant[0][0][1];
    sum1 += v*w;
    w = constant[0][0][2];
    sum2 += v*w;
    w = constant[0][0][3];
    sum3 += v*w;
    v = shared_in[threadIdx.x+1][0];
    w = constant[0][1][0];
    sum0 += v*w;
    w = constant[0][1][1];
    sum1 += v*w;
    w = constant[0][1][2];
    sum2 += v*w;
    w = constant[0][1][3];
    sum3 += v*w;
    v = shared_in[threadIdx.x+2][0];
    w = constant[0][2][0];
    sum0 += v*w;
    w = constant[0][2][1];
    sum1 += v*w;
    ...
Benefits?

Smooth syntactic ugliness

  Manipulations that are not easily accessible in CUDA C code:
  • fine-controlled loop unrolling / jamming (see the generated code below)
  v = shared_in[threadIdx.x+0][0];
  w = constant[0][0][0];
  sum0 += v*w;
  w = constant[0][0][1];
  sum1 += v*w;
  w = constant[0][0][2];
  sum2 += v*w;
  w = constant[0][0][3];
  sum3 += v*w;
  v = shared_in[threadIdx.x+1][0];
  w = constant[0][1][0];
  sum0 += v*w;
  w = constant[0][1][1];
  sum1 += v*w;
  w = constant[0][1][2];
  sum2 += v*w;
  w = constant[0][1][3];
  sum3 += v*w;
  v = shared_in[threadIdx.x+2][0];
  w = constant[0][2][0];
  sum0 += v*w;
  w = constant[0][2][1];
  sum1 += v*w;
  w = constant[0][2][2];
  sum2 += v*w;
  w = constant[0][2][3];
  sum3 += v*w;
  v = shared_in[threadIdx.x+3][0];
  w = constant[0][3][0];
  sum0 += v*w;
  w = constant[0][3][1];
  sum1 += v*w;
  w = constant[0][3][2];
  sum2 += v*w;
  w = constant[0][3][3];
  sum3 += v*w;
  v = shared_in[threadIdx.x+0][1];
  w = constant[1][0][0];
  sum0 += v*w;
  w = constant[1][0][1];
  sum1 += v*w;
  w = constant[1][0][2];
  sum2 += v*w;
  w = constant[1][0][3];
  sum3 += v*w;
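
A sketch of how such unrolled/jammed code can be emitted (an illustrative Python generator, not the deck's actual one; the loop bounds match the 4x4x4 excerpt above):

# emit the unrolled dot-product updates for one depth slice d
FILTER_W, N_FILTERS, d = 4, 4, 0

lines = []
for i in range(FILTER_W):
    lines.append("v = shared_in[threadIdx.x+%d][%d];" % (i, d))
    for f in range(N_FILTERS):
        lines.append("w = constant[%d][%d][%d];" % (d, i, f))
        lines.append("sum%d += v*w;" % f)
print("\n".join(lines))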
How about #pragma unroll ?
   (why don’t you trust the compiler?)
we are not alone...

“Don’t trust compilers”: Using GPUs for Signal Correlation, code fragments from The Murchison Widefield Array (Michael Clark, Paul La Plante, Lincoln Greenhill, Daniel A. Mitchell; IICS 2011):

• Compare these “identical” code fragments:

  a += b*c + d*c + e*f + g*h;

  a += b*c;
  a += d*c;
  a += e*f;
  a += g*h;

(on the slide, the two forms measure at wildly different throughput: 770 GFLOPS vs. tens of GFLOPS)
Smooth syntactic ugliness

  Manipulations that are not easily accessible in CUDA C code:
  • variable-length argument lists
  • index un-indexable resources (e.g. registers); a sketch follows
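
A minimal illustration (hypothetical generator, not the deck's code) of both tricks: the argument-list length and the number of named accumulators are decided at generation time, since CUDA C offers no true register arrays:

def make_kernel(n_outputs):
    # variable-length argument list, fixed at generation time
    args = ", ".join("float4 *out%d" % i for i in range(n_outputs))
    # one named accumulator per output: registers "indexed" by name
    decls = "\n    ".join("float sum%d = 0.0f;" % i for i in range(n_outputs))
    return """__global__ void convolve(%s)
{
    %s
    /* ... unrolled body updating sum0..sum%d ... */
}""" % (args, decls, n_outputs - 1)

print(make_kernel(4))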
Explore design decision space more freely
... too many optimizations?

bank conflicts, precision, coalescing, caching, partition camping, loop unrolling, clamping, broadcasting, streams, zero-copy, ...

can't decide? keep them all !
Exploring design decision space more freely

  Meta-programming:


  • enables efficient learning of the GPU
    hardware/software


  • allows full exploitation of the GPU
    architecture
version A vs. version B
(the same conv_kernel_beta_template.cu, two generated variants, disassembled)

version A:

...
mad.rn.f32 $r4, s[$ofs3+0x0000], $r4, $r1
mov.b32 $r1, c0[$ofs2+0x0008]
mad.rn.f32 $r4, s[$ofs3+0x0008], $r1, $r4
mov.b32 $r1, c0[$ofs2+0x000c]
mad.rn.f32 $r4, s[$ofs3+0x000c], $r1, $r4
mov.b32 $r1, c0[$ofs2+0x0010]
mad.rn.f32 $r4, s[$ofs3+0x0010], $r1, $r4
...

version B:

...
mad.rn.f32 $r1, s[$ofs1+0x007c], c0[$ofs1+0x0078], $r1
mad.rn.f32 $r1, s[$ofs2+0x0000], c0[$ofs2+0x007c], $r1
mad.rn.f32 $r1, s[$ofs2+0x0008], c0[$ofs2+0x0080], $r1
mad.rn.f32 $r1, s[$ofs2+0x000c], c0[$ofs2+0x0084], $r1
mad.rn.f32 $r1, s[$ofs2+0x0010], c0[$ofs2+0x0088], $r1
...

2x faster... Why?
(note that version B folds each constant-memory operand directly into its mad, while version A spends a separate mov.b32 per multiply-add)
Results

speed (in billion floating point operations per second)

       Q9450 (Matlab/C) [2008]       0.3
          Q9450 (C/SSE) [2008]       9.0
    7900GTX (OpenGL/Cg) [2006]      68.2
      PS3/Cell (C/ASM)  [2007]     111.4
     8800GTX (CUDA1.x)  [2007]     192.7
      GTX280 (CUDA2.x)  [2008]     339.3
      GTX480 (CUDA3.x)  [2010]     974.3
      (Fermi)

>1000X speedup is game-changing...

Pinto, Doukhan, DiCarlo, Cox PLoS 2009
Pinto, Cox GPU Comp. Gems 2011
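
For reference, these throughput numbers are just arithmetic: (floating-point operations per output) x (number of outputs) / runtime. A tiny worked example with made-up dimensions and timing (not the benchmark's actual configuration):

    # Hedged example: GFLOP/s of a dense 3D filterbank convolution.
    h, w, d = 1024, 1024, 8            # input height, width, depth (illustrative)
    n_f, f_h, f_w = 16, 5, 5           # filter count and spatial size (illustrative)
    out_h, out_w = h - f_h + 1, w - f_w + 1

    flops = 2.0 * out_h * out_w * n_f * (f_h * f_w * d)   # one mul + one add per tap
    runtime_s = 0.0092                 # measured kernel time (illustrative)
    print("%.1f GFLOP/s" % (flops / runtime_s / 1e9))     # ~723.8 GFLOP/s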
Analysis

➡ Different hardware ?

[Fragment of Table 33.1, GFLOP/s on two input configurations (reference vs. auto-tuned):
  1024x1024x8 input, 16x5x5x8 filters: 726.412 ± 0.398 vs. 744.973 ± 0.571
  2048x2048x4 input,  4x8x8x4 filters: 474.681 ± 0.160 vs. 887.974 ± 1.017]

Table 33.2 Performance of Auto-Tuned Implementations on Two Hardware Platforms, Including Performance Tuned on One Platform and Run on the Other

                    Optimized for:
  Run on:        9400M       GTX480      Tuning Speedup
  9400M          0.32s       2.52s       675%
  GTX480         0.016s      0.011s      52%

Significant performance gains are observed for the auto-tuned meta-kernels as compared to the default, which was hand-picked to allow correct execution of all input ranges without running up against hardware limitations.
Analysis

➡ Different input configurations

Table 33.3 Performance of Auto-Tuned Implementations on Two Input Configurations, Including Performance Tuned for One Configuration and Run with the Other

                    Optimized for:
  Run on:        Config1     Config2     Tuning Speedup
  config1        11.1ms      15.7ms      41%
  config2        fails       10.8ms      not comparable

In Table 33.3 we show the effect of tuning on one input configuration and running with the other. Again, significant speedups are obtained using kernels tailored to a specific input configuration.
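
The "fails" entry is the interesting one: a configuration tuned for one input can exceed hard device limits on another. A small, hypothetical sketch of the kind of guard a tuner needs before launching a candidate (the PyCUDA attribute names are real; the check itself is illustrative):

    # Hedged sketch: reject candidate configs that hit hard device limits.
    import pycuda.autoinit
    import pycuda.driver as cuda

    dev = pycuda.autoinit.device
    max_threads = dev.get_attribute(cuda.device_attribute.MAX_THREADS_PER_BLOCK)
    max_smem = dev.get_attribute(cuda.device_attribute.MAX_SHARED_MEMORY_PER_BLOCK)

    def is_launchable(block_w, smem_bytes):
        # real tuners also check register pressure, grid limits, etc.
        return block_w <= max_threads and smem_bytes <= max_smem

    print(is_launchable(block_w=1024, smem_bytes=64 * 1024))  # False on most GPUs of this era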
Summary

 Meta-programming:

 • can assist exploration and manual
   optimization
 • can de-clutter highly-optimized code
 • is easy and flexible with the right tools
   (e.g. Python, PyCUDA/CL, Cheetah, decuda)


 ➡ helps get drastic speed-ups !
 ➡ facilitates “auto-tuning” !
Outline
1. HPC-aware ML
2. GPU Meta-programming
3. ML-aware HPC
Intelligent and fast Auto-Tuning
with Machine Learning

with James Bergstra and David Cox
Auto-tuning: two approaches

• Analytical model-based optimization:
  - pros: very generic (dominant in compilers), fast “inference”
  - cons: hard to build, domain expertise required, auto-tuned code far from peak

• Empirical optimization:
  - pros: auto-tuned code close to peak (dominant in specialized libraries, e.g. ATLAS, FFTW), easier to build
  - cons: very slow “inference” (for new inputs, etc.)
Empirical Auto-Tuning

The goal is to empirically optimize execution
time given both


• the environment
 - hardware (GPU, CPU, Memory, Mobo, etc.)
 - software (SDK, Compiler suite, etc.)


• the data (input dimensions, repetitions, etc.)
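
Putting the pieces together, here is a minimal, self-contained sketch of the empirical loop (the toy kernel, parameter grid, and data size are illustrative stand-ins, not the chapter's filterbank kernel):

    # Hedged sketch: empirical auto-tuning = generate, compile, time, keep the best.
    import numpy as np
    import pycuda.autoinit
    import pycuda.driver as cuda
    from pycuda.compiler import SourceModule
    from string import Template

    src = Template("""
    __global__ void scale(float *x, int n)
    {
        const int i = blockIdx.x * ${BLOCK_W} + threadIdx.x;
        for (int k = 0; k < ${UNROLL}; ++k) {
            const int j = i * ${UNROLL} + k;
            if (j < n) x[j] *= 2.0f;
        }
    }
    """)

    x = np.random.randn(1 << 20).astype(np.float32)
    x_gpu = cuda.mem_alloc(x.nbytes)
    cuda.memcpy_htod(x_gpu, x)

    best = None
    for block_w in (64, 128, 256):
        for unroll in (1, 2, 4):
            mod = SourceModule(src.substitute(BLOCK_W=block_w, UNROLL=unroll))
            f = mod.get_function("scale")
            n_blocks = (x.size + block_w * unroll - 1) // (block_w * unroll)
            start, stop = cuda.Event(), cuda.Event()
            start.record()   # in practice: warm up and average several repetitions
            f(x_gpu, np.int32(x.size), block=(block_w, 1, 1), grid=(n_blocks, 1))
            stop.record()
            stop.synchronize()
            t_ms = stop.time_since(start)
            if best is None or t_ms < best[0]:
                best = (t_ms, block_w, unroll)

    print("best: BLOCK_W=%d UNROLL=%d (%.3f ms)" % (best[1], best[2], best[0]))

The slow part is exactly this loop: every new (environment, data) pair means re-timing real kernel launches, which is why empirical “inference” is expensive.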
Empirical Auto-Tuning with Meta-programming

“GPU Meta-Programming: A Case Study in Biologically-Inspired Machine Vision”
[GPU Computing Gems]
Pinto N, Cox DD
Intelligent and fast Auto-Tuning
with Machine Learning
Auto-tuning: best of both approaches ?

• Empirically-learned model-based optimization:
  - pros: auto-tuned code close to peak*, easier to build (?), fast “inference” (for new inputs, hardware, etc.)
  - cons: unexplored !

* could be dominant in specialized libraries
(e.g. machine learning!)
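
Under stated assumptions, a toy sketch of the idea (scikit-learn's GradientBoostingRegressor stands in for the boosted regression trees of the paper on the next slide; the configurations and “measured” runtimes below are synthetic):

    # Hedged sketch: learn a timing model from a few measurements, then search
    # the model instead of the GPU ("fast inference").
    import random
    from sklearn.ensemble import GradientBoostingRegressor

    space = [(bw, u) for bw in (32, 64, 128, 256) for u in (1, 2, 4, 8)]
    measured = {c: 1.0 / (c[0] * c[1]) + 0.002 * c[1] for c in space}  # synthetic

    train = random.sample(space, 8)   # only a few configs are actually timed
    model = GradientBoostingRegressor().fit(train, [measured[c] for c in train])

    def neighbors(bw, u):
        cand = [(bw // 2, u), (bw * 2, u), (bw, u // 2), (bw, u * 2)]
        return [c for c in cand if c in set(space)]

    cur = random.choice(space)        # hill-climb on *predicted* runtimes
    while True:
        nxt = min(neighbors(*cur), key=lambda c: model.predict([c])[0])
        if model.predict([nxt])[0] >= model.predict([cur])[0]:
            break
        cur = nxt
    print("predicted-best configuration:", cur)

No kernel is launched during the search, so “inference” for a new input or a new GPU costs a fraction of a second once the model is trained.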
Fast Machine Learning-based Runtime Auto-Tuning
(“ML-based”)

“Machine Learning for Predictive Auto-Tuning with Boosted Regression Trees”
James Bergstra, Nicolas Pinto, David Cox [submitted]

ABSTRACT
The rapidly evolving landscape of multicore architectures makes the construction of efficient libraries a daunting task. A family of methods known collectively as “auto-tuning” has emerged to address this challenge. Two major approaches to auto-tuning are empirical and model-based: empirical auto-tuning is a generic but slow approach that works by measuring runtimes of candidate implementations; model-based auto-tuning predicts those runtimes using simplified abstractions designed by hand. We show that machine learning methods for non-linear regression can be used to estimate timing models from data, capturing the best of both approaches. A statistically-derived model offers the speed of a model-based approach, with the generality and simplicity of empirical auto-tuning. We validate our approach using the filterbank correlation kernel described in Pinto and Cox [2012], where we find that 0.1 seconds of hill climbing on the regression model (“predictive auto-tuning”) can achieve an average of 95% of the speed-up brought by minutes of empirical auto-tuning. Our approach is not specific to filterbank correlation, nor even to GPU kernel auto-tuning, and can be applied to almost any templated-code optimization problem, spanning a wide variety of problem types, kernel types, and platforms.

[First page of the paper; its introduction motivates auto-tuning for rapidly evolving multicore and GPU architectures, where kernels produce staggeringly large optimization spaces [Datta et al., 2008] that can be highly discontinuous [Ryoo et al., 2008], and where quasi-optimal solutions lie at the edge of “performance cliffs” induced by hard device-specific constraints (e.g. register file size or low-latency cache size).]
3D Filterbank Convolutions!

NVIDIA GTX 580 (Fermi) (preview)

[Scatter plot: GFLOP/s of predictive auto-tuning (y-axis; ML-based, < 0.1 sec) vs. GFLOP/s of empirical auto-tuning (x-axis; old way: minutes!), both from 0 to 1400, with “equality”, “2x faster”, and “2x slower” reference lines and the auto-tuned and reference means marked. Predictive auto-tuning reaches > 1.1 TERAFLOP/s!]
What else could we do for HPC ?



• Minimize failures (exascale supercomputers)
• Minimize mixed-precision errors
• Help better understand hardware features and
  their complex interactions
• Help design better architectures ?
• $$$
• etc.
It would be a win-win-win situation!
(The Office Season 2, Episode 27: Conflict Resolution)
Outline
1. HPC-aware ML
2. GPU Meta-programming
3. ML-aware HPC
Acknowledgements

DiCarlo Lab @ MIT
Jim DiCarlo
David Cox

High-Performance Computing Needs Machine Learning... And Vice Versa (NIPS 2011, Big Learning)

  • 1. High-Performance Computing Needs Machine Learning... And Vice Versa (was “GPU Metaprogramming: A Case Study in Large-Scale Convolutional Neural Networks”) dit ion e Nicolas Pinto NIPS “Big Learning” | December 16th, 2011 The Rowland Institute a HARVARD UNIVERSITY
  • 2. Outline 1. HPC-aware ML 2. GPU Meta-programming 3. ML-aware HPC
  • 3. Outline 1. HPC-aware ML 2. GPU Meta-programming 3. ML-aware HPC
  • 11. The Problem: Visual Object Recognition fast
  • 12. The Problem: Visual Object Recognition fast accurate
  • 13. The Problem: Visual Object Recognition fast accurate effortless
  • 14. The Problem: Visual Object Recognition fast accurate effortless critical to survival
  • 15. The Problem: Visual Object Recognition fast accurate effortless critical to survival tolerant to variations!
  • 16. hard?
  • 17. hard? // the world is 3D but the retina is 2D
  • 18. hard? // the world is 3D but the retina is 2D // the curse of dimensionality
  • 19. hard? // the world is 3D but the retina is 2D // the curse of dimensionality // considerable image variation
  • 20. ~50% of is for vision!
  • 21. you learned it... ve y ha ma
  • 23. The Approach Reverse and Forward Engineering the Brain
  • 24. The Approach Reverse and Forward Engineering the Brain REVERSE FORWARD Study Build Natural System Artificial System
  • 25. The Approach Reverse and Forward Engineering the Brain REVERSE FORWARD Study Build Natural System Artificial System
  • 26. Reverse Engineering Images by DiCarlo JJ & Cox DD Animation by Li N The Ventral Visual Stream
  • 27. Reverse Engineering Images by DiCarlo JJ & Cox DD Animation by Li N The Ventral Visual Stream
  • 28. Reverse Engineering The Ventral Visual Stream taflo ps ?! in =2 0 pe bra
  • 29. The Approach Reverse and Forward Engineering the Brain REVERSE FORWARD Study Build Natural System Artificial System
  • 30. Forward Engineering The Ventral Visual Stream a rnin g ??? a bo ut le all
  • 31. “Temp. Adv.” “Auto-reset” ... number of lters L2 thresh/sat norm strength Learning normalization neighborhood Rate kernel Trace size “Temp. Adv.” “Auto-reset” ... n. of lters L1 thresh/sat norm strength Learning Rate normalization Trace neighborhood “Temp. Adv.” “Auto-reset” kernel ...
  • 32. How are things done normally?
  • 33. How are things done normally? Usual Formula:
  • 34. How are things done normally? Usual Formula: 1) One grad student
  • 35. How are things done normally? Usual Formula: 1) One grad student 2) One Model (size limited by runtime)
  • 36. How are things done normally? Usual Formula: 1) One grad student 2) One Model (size limited by runtime) 3) Performance numbers on a few standard test sets
  • 37. How are things done normally? Usual Formula: 1) One grad student 2) One Model (size limited by runtime) 3) Performance numbers on a few standard test sets 4) yay. we. rock.
  • 38. How are things done normally? Usual Formula: 1) One grad student 2) One Model (size limited by runtime) 3) Performance numbers on a few standard test sets 4) yay. we. rock. 5) One Ph.D.
  • 39. How do you call this ? “This is graduate student descent” - David McAllester
  • 40. How do you call this ? “This is graduate student descent” - David McAllester
  • 41. What’s better than this? “Conjugate graduate student descent?” - Nicolas Poilvert
  • 42. Doing things a little bit differently
  • 43. Doing things a little bit differently 1) One grad student
  • 44. Doing things a little bit differently 1) One grad student 2) One Hundreds of Thousands of BIG Models
  • 45. Doing things a little bit differently 1) One grad student 2) One Hundreds of Thousands of BIG Models 3) Performance numbers on a few standard test sets
  • 46. Doing things a little bit differently 1) One grad student 2) One Hundreds of Thousands of BIG Models 3) Performance numbers on a few standard test sets
  • 47. Doing things a little bit differently 1) One grad student 2) One Hundreds of Thousands of BIG Models 3) Performance numbers on a few standard test sets 4) yay. we. rock.
  • 48. Doing things a little bit differently 1) One grad student 2) One Hundreds of Thousands of BIG Models 3) Performance numbers on a few standard test sets 4) yay. we. rock. 5) Hundreds of Thousands One PhD ?
  • 49. If you want to have good ideas you must have many ideas. ” “ Most of them will be wrong, and what you have to learn is which ones to throw away. ” Linus Pauling (double Nobel Prize Winner)
  • 52. High-throughput Screening
  • 53. Read-out L3 thresh/sat norm strength normalization Learning large family of neighborhood Rate Trace “Temp. Adv.” “Auto-reset” number of lters ... brain-inspired models L2 thresh/sat norm strength clusive! Learning normalization neighborhood Rate in Trace 52 parameters ery kernel size “Temp. Adv.” v “Auto-reset” ... n. of lters more than 10 25 L1 thresh/sat norm strength Learning possible unique Rate normalization Trace neighborhood “Temp. Adv.” “Auto-reset” kernel ... combinations! size number of lters input kernel size Pinto, Doukhan, DiCarlo, Cox PLoS 2009
  • 54. The curse of speed
  • 55. The curse of speed thousands of big models
  • 56. The curse of speed thousands of big models large amounts of unsupervised learning experience
  • 57. The curse of speed ...and the blessing of massively parallel computing No off-the-shelf solution? DIY! Engineering (Hardware/SysAdmin/Software) Science
  • 58. The curse of speed ...and the blessing of massively parallel computing No off-the-shelf solution? DIY! Engineering (Hardware/SysAdmin/Software) Science Leverage non-scientific high-tech markets and their $billions of R&D... Gaming: Graphics Cards (GPUs), PlayStation 3 Web 2.0: Cloud Computing (Amazon, Google)
  • 59. r ow n! u ild you B
  • 60. The blessing of GPUs Computational power DIY GPU pr0n (since 2006) Sony Playstation 3s (since 2007) GPUs Peak GFLOP/s CPUs
  • 61. speed (in billion floating point operations per second) Q9450 (Matlab/C) [2008] 0.3 Q9450 (C/SSE) [2008] 9.0 7900GTX (OpenGL/Cg) [2006] 68.2 PS3/Cell (C/ASM) [2007] 111.4 8800GTX (CUDA1.x) [2007] 192.7 GTX280 (CUDA2.x) [2008] 339.3 GTX480 (CUDA3.x) [2010] 974.3 (Fermi) Pinto, Doukhan, DiCarlo, Cox PLoS 2009 Pinto, Cox GPU Comp. Gems 2011
  • 62. speed (in billion floating point operations per second) Q9450 (Matlab/C) [2008] 0.3 Q9450 (C/SSE) [2008] 9.0 7900GTX (OpenGL/Cg) [2006] 68.2 PS3/Cell (C/ASM) [2007] 111.4 8800GTX (CUDA1.x) [2007] 192.7 GTX280 (CUDA2.x) [2008] 339.3 cha n ging... e GTX480 (CUDA3.x) [2010] pe edu p is g a m 974.3 (Fermi) >1 000X s Pinto, Doukhan, DiCarlo, Cox PLoS 2009 Pinto, Cox GPU Comp. Gems 2011
  • 63. High-throughput Screening Skimming off the best models stupid chance baseline 250 200 N=2500 150 Count 100 50 0 50 60 70 80 90 100 Performance (%) Pinto, Doukhan, DiCarlo, Cox PLoS 2009
  • 64. High-throughput Screening Skimming off the best models stupid chance baseline 250 200 N=2500 150 Count 100 50 0 50 60 70 80 90 100 Performance (%) Pinto, Doukhan, DiCarlo, Cox PLoS 2009
  • 65. High-throughput Screening Skimming off the best models stupid chance baseline 250 200 N=2500 150 Count 100 50 0 50 60 70 80 90 100 Performance (%) Pinto, Doukhan, DiCarlo, Cox PLoS 2009
  • 66. High-throughput Screening Validate on other tasks ~90% vs. “HMAX 2.1” (~80%) V1-like 5 4 3 2 1 (baseline) state-of-the-art high-throughput models (from literature) Pinto, Doukhan, DiCarlo, Cox PLoS 2009
  • 67. High-throughput Screening Validate on other tasks ~90% vs. “HMAX 2.1” (~80%) V1-like 5 4 3 2 1 (baseline) state-of-the-art high-throughput models (from literature) Pinto, Doukhan, DiCarlo, Cox PLoS 2009
  • 68. High-throughput Screening Validate on other tasks ~90% vs. “HMAX 2.1” (~80%) V1-like 5 4 3 2 1 (baseline) state-of-the-art high-throughput models (from literature) Pinto, Doukhan, DiCarlo, Cox PLoS 2009
  • 69. High-throughput Screening Validate on other tasks ~90% vs. “HMAX 2.1” (~80%) V1-like 5 4 3 2 1 (baseline) state-of-the-art high-throughput models (from literature) Pinto, Doukhan, DiCarlo, Cox PLoS 2009
  • 70. High-throughput Screening Validate on faces vs. HMAX 2.1 PHOG GB PHOW SIFT blend 5 4 3 2 1 V1-like high-throughput models (baseline) state-of-the-art (from literature) Pinto, Doukhan, DiCarlo, Cox PLoS 2009
  • 71. Human vs. Machine 8-way object categorization 99.1 64 31.3 chance (12.5%) baseline best model best human
  • 72. What does it all mean? what have we learned ? briefly...
  • 73. What does it all mean? what have we learned ? Grayscale Input Normalize Linear SVM simple classifier L1 L2 L3 Filter Threshold & Φ1 Pool Normalize Saturate Φ2 ... Φk ➡ dimensionality: more filters is better
  • 74. What does it all mean? what have we learned ? Grayscale Input Normalize Linear SVM simple classifier L1 L2 L3 Filter Threshold & Φ1 Pool Normalize Saturate Φ2 ... Φk ➡ learning is difficult
  • 75. What does it all mean? what have we learned ? Grayscale Input Normalize Linear SVM simple classifier L1 L2 L3 Filter Threshold & Φ1 Pool Normalize Saturate Φ2 ... Φk ➡ non-linearities are important
  • 76. What does it all mean? what have we learned ? Grayscale Input Normalize Linear SVM simple classifier L1 L2 L3 Filter Threshold & Φ1 Pool Normalize Saturate Φ2 ... Φk ➡ normalization is very important missed in previous modeling efforts now confirmed by LeCun et al., Poggio et al., Ng et al.
  • 77. What are these models not good for? ob jects low level s ckgr ound ba fa ces
  • 78. Outline 1. HPC-aware ML 2. GPU Meta-programming 3. ML-aware HPC
  • 80. Real-world apps? testing the generality and scalability of the approach
  • 81. Facebook Really Real World Problem enormous scale billion of photos 3TB+ uploaded every day dense, collaborative face labels collab. with Zak Stone & Todd Zickler @ Harvard
  • 82. Relevance to Social Networking slide courtesy of David Cox
  • 83. Relevance to Social Networking
  • 85. High-throughput Screening
  • 86. High-Throughput Screening Labeled Faces in the Wild (LFW) View 1 > 30,000 large-scale models (1to3 layers) screened in only 3 days HT L3s (3 layers) top 5 models LFW view 1 performance Lea rning! vised o Un super N Pinto, Cox (FG 2011) Pinto, Stone, Zickler, Cox (CVPR 2011)
  • 87. Generalization Performance on LFW View 2 (hold out) Face Verification Performance (% correct) 88.1 86.8 85.3 79.4 Wolf et al. ACCV 2009 Kumar et al. Ours V1-like face.com ICCV 2009 (HT) Pinto, Cox (FG 2011)
  • 88. “Facebook100” typical social network size? collab. with Zak Stone & Todd Zickler @ Harvard Pinto, Stone, Zickler, Cox (CVPR 2011)
  • 89. Auto-tagging a network of 100 Facebook friends > 86% accurate (w/ 90 training examples) collab. with Zak Stone & Todd Zickler @ Harvard Pinto, Stone, Zickler, Cox (CVPR 2011)
  • 91. vs face.com comparison with a heavily-specialized commercial system L3 (hardware-accelerated brute-force random model) Performance (% correct) face.com V1-likearound) (best technology (one layer) training example(s) / friend Pinto, Stone, Zickler, Cox (CVPR 2011)
  • 93. Hardware Matters ! Yann LeCun’s Mac picture courtesy of Koray Kavukcuoglu
  • 94. Outline 1. HPC-aware ML 2. GPU Meta-programming 3. ML-aware HPC
  • 95. Two conflicting requirements The brain is a massively parallel computer ➡ Big models are paralyzingly slow to run Neural data only provides weak constraints ➡ Lots of parameters – hard to explore
  • 96. Two conflicting requirements The brain is a massively parallel computer FA ST slow to run ➡ Big models are paralyzingly Neural data only provides weak constraints ➡ Lots of parameters – hard to explore
  • 97. Two conflicting requirements The brain is a massively parallel computer FA ST slow to run ➡ Big models are paralyzingly Neural data only provides weak constraints LEXI BLE F ➡ Lots of parameters – hard to explore
  • 98. Two conflicting requirements The brain is a massively parallel computer FA ST slow to run ➡ Big models are paralyzingly Neural data only provides weak constraints LEXI BLE F ➡ Lots of parameters – hard to explore How to optimize?
  • 101. lutio ns! k Co nvo i lter ba n 3D F
  • 105. Meta-programming ! Leave the grunt-programming to the computer (i.e. auto-tuning like ATLAS or FFTW) • Dynamically compile specialized versions of the same kernel for different conditions • Empirical run-time tuning • For free: smooth syntactic ugliness: unroll loops, index un-indexable registers, etc.
  • 106. Meta-programming ! “Instrument” your solutions: • Block size • Work size • Loop unrolling • Pre-fetching • Spilling • etc. ... and let the computer generate find the optimal code
  • 107. How?
  • 108. Always use the right tool !
  • 110. texture<float4, 1, cudaReadModeElementType> tex_float4; __constant__ float constant[$FILTER_D][$FILTER_W][$N_FILTERS]; #define IMUL(a, b) __mul24(a, b) plating Tem extern "C" { #for j in xrange($FILTER_H) __global__ void convolve_beta_j${j}(float4 *input, float4 *output) { #set INPUT_BLOCK_W = $BLOCK_W+$FILTER_W-1 __shared__ float shared_in[$INPUT_BLOCK_W][4+1]; // -- input/output offsets const uint in_idx = (blockIdx.y+$j)*INPUT_W + blockIdx.x*blockDim.x + threadIdx.x; const uint out_idx = blockIdx.y*OUTPUT_W + blockIdx.x*blockDim.x + threadIdx.x; float4 input_v4; // -- load input to shared memory #for i in xrange($LOAD_ITERATIONS) #if $i==($LOAD_ITERATIONS-1) if((threadIdx.x+$BLOCK_W*$i)<$INPUT_BLOCK_W) #end if { input_v4 = tex1Dfetch(tex_float4, in_idx+$BLOCK_W*$i); shared_in[threadIdx.x+$BLOCK_W*$i][0] = input_v4.x; shared_in[threadIdx.x+$BLOCK_W*$i][1] = input_v4.y; shared_in[threadIdx.x+$BLOCK_W*$i][2] = input_v4.z;
  • 111. Compilation? (with Python-based solutions)
  • 112. PyCUDA/PyOpenCL (by Andreas Klockner) Klöckner, Pinto, Lee, Catanzaro, Ivanov, Fasih (ParCo 2011)
  • 113. Basic GPU Meta-programming System A Case Study GPU Meta-Programming: red Machine Vision in Biologically-Inspi s] [GPU Computing Gem Pinto N, Cox DD
  • 114. conv_kernel_4x4x4.cu conv_kernel_template.cu #include <stdio.h> texture<float4, 1, cudaReadModeElementType> tex_float4; __constant__ float constant[4][4][4]; #define IMUL(a, b) __mul24(a, b) texture<float4, 1, cudaReadModeElementType> tex_float4; extern "C" { __constant__ float constant[$FILTER_D][$FILTER_W] [$N_FILTERS]; __global__ void convolve_beta_j0(float4 *input, float4 *output) { #define IMUL(a, b) __mul24(a, b) extern "C" { __shared__ float shared_in[131][4+1]; // -- input/output offsets #for j in xrange($FILTER_H) const uint in_idx = (blockIdx.y+0)*INPUT_W + blockIdx.x*blockDim.x + threadIdx.x; const uint out_idx = blockIdx.y*OUTPUT_W + blockIdx.x*blockDim.x + threadIdx.x; __global__ void convolve_beta_j${j}(float4 *input, float4 float4 input_v4; *output) // -- load input to shared memory { { input_v4 = tex1Dfetch(tex_float4, in_idx+128*0); #set INPUT_BLOCK_W = $BLOCK_W+$FILTER_W-1 shared_in[threadIdx.x+128*0][0] = input_v4.x; __shared__ float shared_in[$INPUT_BLOCK_W][4+1]; shared_in[threadIdx.x+128*0][1] = input_v4.y; shared_in[threadIdx.x+128*0][2] = input_v4.z; shared_in[threadIdx.x+128*0][3] = input_v4.w; // -- input/output offsets } const uint in_idx = (blockIdx.y+$j)*INPUT_W + if((threadIdx.x+128*1)<131) blockIdx.x*blockDim.x + threadIdx.x; { const uint out_idx = blockIdx.y*OUTPUT_W + input_v4 = tex1Dfetch(tex_float4, in_idx+128*1); blockIdx.x*blockDim.x + threadIdx.x; shared_in[threadIdx.x+128*1][0] = input_v4.x; shared_in[threadIdx.x+128*1][1] = input_v4.y; float4 input_v4; shared_in[threadIdx.x+128*1][2] = input_v4.z; shared_in[threadIdx.x+128*1][3] = input_v4.w; // -- load input to shared memory } #for i in xrange($LOAD_ITERATIONS) __syncthreads(); #if $i==($LOAD_ITERATIONS-1) // -- compute dot products if((threadIdx.x+$BLOCK_W*$i)<$INPUT_BLOCK_W) float v, w; #end if { float sum0 = 0; input_v4 = tex1Dfetch(tex_float4, in_idx+$BLOCK_W* float sum1 = 0; $i); float sum2 = 0; float sum3 = 0; shared_in[threadIdx.x+$BLOCK_W*$i][0] = input_v4.x; shared_in[threadIdx.x+$BLOCK_W*$i][1] = input_v4.y; v = shared_in[threadIdx.x+0][0]; shared_in[threadIdx.x+$BLOCK_W*$i][2] = input_v4.z; w = constant[0][0][0]; shared_in[threadIdx.x+$BLOCK_W*$i][3] = input_v4.w; sum0 += v*w; } w = constant[0][0][1]; sum1 += v*w; #end for w = constant[0][0][2]; sum2 += v*w; w = constant[0][0][3]; sum3 += v*w; v = shared_in[threadIdx.x+1][0]; w = constant[0][1][0]; sum0 += v*w; w = constant[0][1][1]; sum1 += v*w; w = constant[0][1][2]; sum2 += v*w; w = constant[0][1][3]; sum3 += v*w; v = shared_in[threadIdx.x+2][0]; w = constant[0][2][0]; sum0 += v*w; w = constant[0][2][1]; sum1 += v*w;
  • 115. conv_kernel_template.cu texture<float4, 1, cudaReadModeElementType> tex_float4; __constant__ float constant[$FILTER_D][$FILTER_W] [$N_FILTERS]; #define IMUL(a, b) __mul24(a, b) conv_kernel_4x4x4.cu extern "C" { #for j in xrange($FILTER_H) __global__ void convolve_beta_j${j}(float4 *input, float4 20 kB *output) { #set INPUT_BLOCK_W = $BLOCK_W+$FILTER_W-1 __shared__ float shared_in[$INPUT_BLOCK_W][4+1]; // -- input/output offsets const uint in_idx = (blockIdx.y+$j)*INPUT_W + blockIdx.x*blockDim.x + threadIdx.x; const uint out_idx = blockIdx.y*OUTPUT_W + blockIdx.x*blockDim.x + threadIdx.x; float4 input_v4; // -- load input to shared memory #for i in xrange($LOAD_ITERATIONS) #if $i==($LOAD_ITERATIONS-1) if((threadIdx.x+$BLOCK_W*$i)<$INPUT_BLOCK_W) #end if $i); { input_v4 = tex1Dfetch(tex_float4, in_idx+$BLOCK_W* conv_kernel_8x8x4.cu shared_in[threadIdx.x+$BLOCK_W*$i][0] = input_v4.x; shared_in[threadIdx.x+$BLOCK_W*$i][1] = input_v4.y; shared_in[threadIdx.x+$BLOCK_W*$i][2] = input_v4.z; shared_in[threadIdx.x+$BLOCK_W*$i][3] = input_v4.w; } 64 kB #end for
  • 118. Smooth syntactic ugliness Manipulations that are not easily accessible in CUDA C code: • loop unrolling (possibly fine-controlled)
  • 119. Smooth syntactic ugliness Manipulations that are not easily accessible in CUDA C code: • fine-controlled loop unrolling / jamming ..) v = shared_in[threadIdx.x+0][0]; w = constant[0][0][0]; sum0 += v*w; w = constant[0][0][1]; sum1 += v*w; w = constant[0][0][2]; sum2 += v*w; w = constant[0][0][3]; sum3 += v*w; v = shared_in[threadIdx.x+1][0]; w = constant[0][1][0]; sum0 += v*w; w = constant[0][1][1]; sum1 += v*w; w = constant[0][1][2]; sum2 += v*w; w = constant[0][1][3]; sum3 += v*w; v = shared_in[threadIdx.x+2][0]; w = constant[0][2][0]; sum0 += v*w; w = constant[0][2][1]; sum1 += v*w; w = constant[0][2][2]; sum2 += v*w; w = constant[0][2][3]; sum3 += v*w; v = shared_in[threadIdx.x+3][0]; w = constant[0][3][0]; sum0 += v*w; w = constant[0][3][1]; sum1 += v*w; w = constant[0][3][2]; sum2 += v*w; w = constant[0][3][3]; sum3 += v*w; v = shared_in[threadIdx.x+0][1]; w = constant[1][0][0]; sum0 += v*w; w = constant[1][0][1]; sum1 += v*w; w = constant[1][0][2]; sum2 += v*w; w = constant[1][0][3]; sum3 += v*w;
  • 120. How about #pragma unroll ? (why don’t you trust the compiler?)
  • 121. o t alo ne.... we are n s for S ignal Using GPU elatio n pil ers Corr ust com ’t tr itchell Daniel A. M Don gmen The Murch ode fr a ts ison Widefi eld Array c tical” e “iden re thes + g *h; ompa LOPS • C *c + e*f 770 GF + d b*c grating 8-s econd snap shots over a += inte peeling, roduced by lanking and b*c; -2 526 field p d after RFI b f the J2107 e of the fiel an image o ht is an imag S FLOP n the left is . On the rig a += d*c; Figure 3: O ing hout blank interval wit 20 G entire time eeled imag e. noise the e unp e above the ntours of th f magnitud 10 along with co rs o This at are orde ious data. a += e*f; els th dub ivers at lev ply discard n here to the rece m will sim tector show k ste ichael hClar ct in fl ect or refra real-time sy n-based de occasion, re s the MWA mple media integration hich the si M floor. D wit wil uring deep l require a series of d ata-quality art. tests, of w a += g*h; n integral p will form a eenhill Lincoln Gr Paul La Plante and Reference s t Boolard a += y, EDGES Memo, 058 , 2010. R.J. Cappal lo, M.F. M orales, and ics a ale, d Topics RFI Statist , C.J. Lonsd l of Selecte [1] A.E .E. Rogers, , R.J. Sault IE EE Journa R.B. Wayth eld Array, . Greenhill, hison Widefi ]. itchell, L.J of the Murc 07.1912 E, 97 [2] D.A. M Time Calib ration , [astro- ph/08 s of the IEE S.M. O rd, Real- 7 17, 2008 , Proceeding 2 (5), 707– n Overview 1 nuary 201 sday, 27 Ja rocessing, rray: Desig in Signal P on Widefield A he Murchis 8]. , Graphics ale, et al., T 903.182 R.G. Edgar [3] C.J. Lonsd [ast ro-ph/0 H. Pfister, and Series, 506, 2009, ell, K. Dale, Conference (8), 1497–1 , D.A. Mitch d Array, ASP R.B. Wayth on Wide-fiel Greenhill, the Murchis IICS‘2011 [4] S.M . Ord, L.J. ata Pro cessing in cal Units for D Mathemati Processing 1 radio pola rimetry. I. 009. aa d nderstryn20 ing 1 411, 127, 2 .J. Sault, U Janu 6. . Breg man, and R ursday,.,2117, 137–147, 199 7 alar amaker, J.D Th pl. Ser up alogue of sc [5 ] J.P. H st rophys. S ll-co herency an rophys. Su ppl. s, Astron. A . IV. The fu Astron. Ast foundation polarimetry ric fidelity, g radio ge and pola rimet derstandin
  • 122. Smooth syntactic ugliness Manipulations that are not easily accessible in CUDA C code: • variable-length argument lists
  • 123. Smooth syntactic ugliness Manipulations that were not easily accessible in CUDA C code: • index un-indexable resources (e.g. regs)
  • 124. Explore design decision space more freely
  • 125. Basic GPU Meta-programming System A Case Study GPU Meta-Programming: red Machine Vision in Biologically-Inspi s] [GPU Computing Gem Pinto N, Cox DD
  • 126. ... too many optimizations? ba nk c onflict s on ing isi ale sc ec co ca pr ch d part ling itionnrol in ixe cla p u ca mpin g m loo g m pi ng adca sting bro ms zero-cop trea
  • 127. e ? ec id ’t d c an keep them all !
  • 128. Exploring design decision space more freely Meta-programming: • enables efficient learning of the GPU hardware/software • allows full exploitation of the GPU architecture
  • 129. version A conv_kernel_beta_template.cu ... mad.rn.f32 $r4, s[$ofs3+0x0000], $r4, $r1 mov.b32 $r1, c0[$ofs2+0x0008] texture<float4, 1, cudaReadModeElementType> tex_float4; __constant__ float constant[$FILTER_D][$FILTER_W] mad.rn.f32 $r4, s[$ofs3+0x0008], $r1, $r4 [$N_FILTERS]; mov.b32 $r1, c0[$ofs2+0x000c] #define IMUL(a, b) __mul24(a, b) extern "C" { mad.rn.f32 $r4, s[$ofs3+0x000c], $r1, $r4 #for j in xrange($FILTER_H) mov.b32 $r1, c0[$ofs2+0x0010] __global__ void convolve_beta_j${j}(float4 *input, float4 *output) mad.rn.f32 $r4, s[$ofs3+0x0010], $r1, $r4 { #set INPUT_BLOCK_W = $BLOCK_W+$FILTER_W-1 __shared__ float shared_in[$INPUT_BLOCK_W][4+1]; ... // -- input/output offsets const uint in_idx = (blockIdx.y+$j)*INPUT_W + blockIdx.x*blockDim.x + threadIdx.x; const uint out_idx = blockIdx.y*OUTPUT_W + blockIdx.x*blockDim.x + threadIdx.x; float4 input_v4; // -- load input to shared memory #for i in xrange($LOAD_ITERATIONS) version B #if $i==($LOAD_ITERATIONS-1) if((threadIdx.x+$BLOCK_W*$i)<$INPUT_BLOCK_W) #end if { input_v4 = tex1Dfetch(tex_float4, in_idx+$BLOCK_W* $i); shared_in[threadIdx.x+$BLOCK_W*$i][0] = input_v4.x; shared_in[threadIdx.x+$BLOCK_W*$i][1] = input_v4.y; shared_in[threadIdx.x+$BLOCK_W*$i][2] = input_v4.z; ... shared_in[threadIdx.x+$BLOCK_W*$i][3] = input_v4.w; mad.rn.f32 $r1, s[$ofs1+0x007c], c0[$ofs1+0x0078], $r1 } #end for mad.rn.f32 $r1, s[$ofs2+0x0000], c0[$ofs2+0x007c], $r1 mad.rn.f32 $r1, s[$ofs2+0x0008], c0[$ofs2+0x0080], $r1 mad.rn.f32 $r1, s[$ofs2+0x000c], c0[$ofs2+0x0084], $r1 mad.rn.f32 $r1, s[$ofs2+0x0010], c0[$ofs2+0x0088], $r1 ... aster... Why ? 2x f
  • 131. speed (in billion floating point operations per second) Q9450 (Matlab/C) [2008] 0.3 Q9450 (C/SSE) [2008] 9.0 7900GTX (OpenGL/Cg) [2006] 68.2 PS3/Cell (C/ASM) [2007] 111.4 8800GTX (CUDA1.x) [2007] 192.7 GTX280 (CUDA2.x) [2008] 339.3 cha n ging... e GTX480 (CUDA3.x) [2010] pe edu p is g a m 974.3 (Fermi) >1 000X s Pinto, Doukhan, DiCarlo, Cox PLoS 2009 Pinto, Cox GPU Comp. Gems 2011
  • 132. -10.4 1024x1024x8 16x5x5x8 726.412 ± 0.398 744.973 ± 0.571 Analysis 2048x2048x4 4x8x8x4 474.681 ± 0.160 887.974 ± 1.017 ➡ Different hardware ? Table 33.2 Performance of Auto-Tuned Implementations on Two Hardware Platforms, Including Performance Tuned on One Platform and Run on the Other Optimized for: Run on: 9400M GTX480 Tuning Speedup 9400M 0.32s 2.52s 675% GTX480 0.016s 0.011s 52% formance gains are observed for the auto-tuned meta-kernels as compared to t, which was hand-picked to allow correct execution of all input ranges ng up against hardware limitations.
  • 133. APTER 33 GPU Metaprogramming: A Case Study Analysis ➡ Different input configurations Table 33.3 Performance of Auto-Tuned Implementations on Two Input Configurations, Including Performance Tuned for One Configuration and Run with the Other Optimized for: Run on: Config1 Config2 Tuning Speedup config1 11.1ms 15.7ms 41% config2 fails 10.8ms not comparable , in Table 33.3 we show the effect of tuning on one input configuration an in, significant speedups are obtained using kernels tailored to a specific inp
  • 136. Summary Meta-programming: • can assist exploration and manual optimization
  • 137. Summary Meta-programming: • can assist exploration and manual optimization • can de-clutter highly-optimized code
  • 138. Summary Meta-programming: • can assist exploration and manual optimization • can de-clutter highly-optimized code • is easy and flexible with the right tools (e.g. Python, PyCUDA/CL, Cheetah, decuda)
  • 139. Summary Meta-programming: • can assist exploration and manual optimization • can de-clutter highly-optimized code • is easy and flexible with the right tools (e.g. Python, PyCUDA/CL, Cheetah, decuda) ➡ helps get drastic speed-ups !
  • 140. Summary Meta-programming: • can assist exploration and manual optimization • can de-clutter highly-optimized code • is easy and flexible with the right tools (e.g. Python, PyCUDA/CL, Cheetah, decuda) ➡ helps get drastic speed-ups ! ➡ facilitates “auto-tuning” !
  • 141. Outline 1. HPC-aware ML 2. GPU Meta-programming 3. ML-aware HPC
  • 142. Intelligent and fast Auto-Tuning with Machine Learning with James Bergstra and David Cox
  • 143. Intelligent and fast Auto-Tuning with Machine Learning
• 150. Auto-tuning: two approaches

• Analytical model-based optimization:
  - pros: very generic (dominant in compilers), fast “inference”
  - cons: hard to build, domain expertise required, auto-tuned code far from peak

• Empirical optimization:
  - pros: auto-tuned code close to peak (dominant in specialized libraries, e.g. ATLAS, FFTW), easier to build
  - cons: very slow “inference” (for new inputs, etc.)
• 151. Empirical Auto-Tuning

The goal is to empirically optimize execution time given both:
• the environment
  - hardware (GPU, CPU, memory, motherboard, etc.)
  - software (SDK, compiler suite, etc.)
• the data (input dimensions, repetitions, etc.)
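A minimal sketch of that empirical loop, assuming a hypothetical toy PyCUDA kernel and a made-up search space of block widths and unroll factors; the generate/compile/time/select structure is the point, not the placeholder kernel:

import itertools
import numpy as np
import pycuda.autoinit
import pycuda.driver as drv
from pycuda.compiler import SourceModule

def build(block_w, unroll):
    # bake the unroll factor into the source: one statement per output
    body = "\n        ".join(
        "out[base + %d] = 2.0f * in_data[base + %d];" % (u * block_w, u * block_w)
        for u in range(unroll))
    src = """
    __global__ void k(const float *in_data, float *out) {
        const int base = blockIdx.x * %d + threadIdx.x;
        %s
    }""" % (block_w * unroll, body)
    return SourceModule(src).get_function("k")

n = 1 << 20
x = np.random.rand(n).astype(np.float32)
y = np.zeros(n, dtype=np.float32)

best = None
for block_w, unroll in itertools.product([64, 128, 256], [1, 2, 4]):
    fn = build(block_w, unroll)
    start, stop = drv.Event(), drv.Event()
    start.record()
    fn(drv.In(x), drv.Out(y),
       block=(block_w, 1, 1), grid=(n // (block_w * unroll), 1))
    stop.record()
    stop.synchronize()
    dt = stop.time_since(start)   # ms; includes transfers, fine for a sketch
    if best is None or dt < best[0]:
        best = (dt, block_w, unroll)
print("fastest variant: %.3f ms with block_w=%d, unroll=%d" % best)

This is generic (it only needs a way to render and time variants) but slow: every candidate must actually be compiled and benchmarked on the target device and data.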
• 152. Empirical Auto-Tuning with Meta-programming

A Case Study: “GPU Meta-Programming: A Case Study in Biologically-Inspired Machine Vision” [GPU Computing Gems], Pinto N, Cox DD
  • 153. Intelligent and fast Auto-Tuning with Machine Learning
• 158. Auto-tuning: best of both approaches?

• Empirically-learned model-based optimization:
  - pros: auto-tuned code close to peak*, easier to build (?), fast “inference” (for new inputs, hardware, etc.)
  - cons: unexplored!

* could be dominant in specialized libraries (e.g. machine learning!)
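One way to make the “empirically-learned model” idea concrete, assuming scikit-learn's boosted regression trees as the non-linear regressor (the paper below also uses boosted regression trees) and hypothetical (block width, unroll, input size) features with made-up runtimes:

import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

# X: one row per benchmarked variant, e.g. (block_w, unroll, input_size);
# y: the corresponding measured runtimes (illustrative values, in ms)
X = np.array([[64, 1, 1024], [128, 2, 1024], [256, 4, 1024],
              [64, 4, 4096], [128, 1, 4096], [256, 2, 4096]], dtype=float)
y = np.array([3.1, 1.9, 1.2, 9.8, 7.4, 4.3])

model = GradientBoostingRegressor(n_estimators=200).fit(X, y)

# "Inference" is now a cheap model query instead of a slow benchmark:
# score many unseen configurations and launch only the predicted best.
grid = np.array([[bw, u, 2048] for bw in (64, 128, 256) for u in (1, 2, 4)],
                dtype=float)
best = grid[model.predict(grid).argmin()]
print("predicted-best config:", best)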
• 159. Fast Machine Learning-based Runtime Auto-Tuning
• 160. “Machine Learning for Predictive Auto-Tuning with Boosted Regression Trees”
James Bergstra, Nicolas Pinto, David Cox [submitted]

Abstract: The rapidly evolving landscape of multicore architectures makes the construction of efficient libraries a daunting task. A family of methods known collectively as “auto-tuning” has emerged to address this challenge. Two major approaches to auto-tuning are empirical and model-based: empirical auto-tuning is a generic but slow approach that works by measuring runtimes of candidate implementations; model-based auto-tuning predicts those runtimes using simplified abstractions designed by hand. We show that machine learning methods for non-linear regression can be used to estimate timing models from data, capturing the best of both approaches. A statistically-derived model offers the speed of a model-based approach, with the generality and simplicity of empirical auto-tuning. We validate our approach using the filterbank correlation kernel described in Pinto and Cox [2012], where we find that 0.1 seconds of hill climbing on the regression model (“predictive auto-tuning”) can achieve an average of 95% of the speed-up brought by minutes of empirical auto-tuning. Our approach is not specific to filterbank correlation, nor even to GPU kernel auto-tuning, and can be applied to almost any templated-code optimization problem, spanning a wide variety of problem types, kernel types, and platforms.
• 161. 3D Filterbank Convolutions!
• 162-163. [Preview figure, panel (b): GFLOP/s of predictive auto-tuning vs. GFLOP/s of empirical auto-tuning on an NVIDIA GTX 580 (Fermi), with equality, 2x-faster, and 2x-slower guide lines and auto-tuned vs. reference means. ML-based predictive tuning takes < 0.1 sec per problem, versus minutes of empirical auto-tuning the old way, and exceeds 1 TERAFLOP/s.]
• 171. What else could we do for HPC?

• Minimize failures (exascale supercomputers)
• Minimize mixed-precision errors
• Help better understand hardware features and their complex interactions
• Help design better architectures?
• $$$
• etc.
  • 172. It would be a win-win-win situation! (The Office Season 2, Episode 27: Conflict Resolution)
  • 173. Outline 1. HPC-aware ML 2. GPU Meta-programming 3. ML-aware HPC
• 174. Acknowledgements: DiCarlo Lab @ MIT, Jim DiCarlo, David Cox
• 175. Acknowledgements