Training and Inference for Deep Gaussian Processes
Keyon Vafa
April 26, 2016
Motivation

An ideal model for prediction is
- accurate
- computationally efficient
- easy to tune without overfitting
- able to provide certainty estimates
Motivation

This thesis focuses on one particular class of prediction models, deep Gaussian processes for regression. They are a relatively new model, introduced by Damianou and Lawrence in 2013.

Exact inference is intractable. In this thesis, we introduce a new method to learn deep GPs, the Deep Gaussian Process Sampling (DGPS) algorithm.
Motivation

The DGPS algorithm
- is more straightforward than existing methods
- can more easily adapt to using arbitrary kernels
- relies on Monte Carlo sampling to circumvent the intractability hurdle
- uses pseudo data to ease the computational burden
Table of Contents
1 Gaussian Processes
2 Deep Gaussian Processes
3 Implementation
4 Experiments and Analysis
5 Conclusion

Gaussian Processes
Definition of a Gaussian Process

A function f is a Gaussian process (GP) if any finite set of values $f(x_1), \dots, f(x_N)$ has a multivariate normal distribution.
- The inputs $\{x_n\}_{n=1}^N$ can be vectors from any arbitrarily sized domain.
- A GP is specified by a mean function m(x) and a covariance function k(x, x'), where
$$m(x) = \mathbb{E}[f(x)], \qquad k(x, x') = \mathrm{Cov}(f(x), f(x')).$$
Covariance Function

The covariance function (or kernel) determines the smoothness and stationarity of functions drawn from a GP.

The squared exponential covariance function has the following form:
$$k(x, x') = \sigma_f^2 \exp\left(-\tfrac{1}{2}(x - x')^\top M (x - x')\right)$$
When M is a diagonal matrix, its diagonal elements are the inverse squared length-scales, $\ell_i^{-2}$; $\sigma_f^2$ is known as the signal variance.
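To make the kernel concrete, here is a minimal NumPy sketch of the squared exponential covariance above, assuming M is diagonal with entries $\ell_i^{-2}$; the helper name and default values are ours, not the thesis's code.

```python
# Minimal sketch of the squared exponential kernel (our helper, not thesis code),
# assuming M is diagonal with entries 1 / l_i^2.
import numpy as np

def squared_exponential(X1, X2, signal_variance=1.0, length_scales=1.0):
    """k(x, x') = sigma_f^2 * exp(-0.5 * (x - x')^T M (x - x'))."""
    X1 = np.atleast_2d(X1) / length_scales   # divide each dimension by its length-scale
    X2 = np.atleast_2d(X2) / length_scales
    sq_dists = (np.sum(X1 ** 2, axis=1)[:, None]
                + np.sum(X2 ** 2, axis=1)[None, :]
                - 2.0 * X1 @ X2.T)
    return signal_variance * np.exp(-0.5 * np.maximum(sq_dists, 0.0))
```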
Sampling from a GP

[Figure: random samples from GP priors with (signal variance, length-scale) pairs (1.0, 0.2), (1.0, 1.0), (1.0, 5.0), (0.2, 1.0), (1.0, 1.0), and (5.0, 1.0).] The length-scale controls the smoothness of our function, while the signal variance controls the deviation from the mean.
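Draws like those in the figure can be reproduced, under the same assumptions, by sampling from $\mathcal{N}(0, K_{XX})$ on a dense grid; this sketch reuses the hypothetical squared_exponential helper above.

```python
# Sketch of drawing prior functions: sample from N(0, K_XX) on a grid.
# Reuses the hypothetical squared_exponential helper defined earlier.
import numpy as np

def sample_gp_prior(x_grid, signal_variance, length_scale, n_samples=3, jitter=1e-8):
    K = squared_exponential(x_grid[:, None], x_grid[:, None],
                            signal_variance, length_scale)
    K += jitter * np.eye(len(x_grid))        # jitter for numerical stability
    L = np.linalg.cholesky(K)                # K = L L^T
    z = np.random.randn(len(x_grid), n_samples)
    return L @ z                             # each column is one draw from the GP prior

x_grid = np.linspace(-5, 5, 200)
draws = sample_gp_prior(x_grid, signal_variance=1.0, length_scale=0.2)  # wiggly, short length-scale draws
```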
GPs for Regression

Setup: We are given a set of inputs $X \in \mathbb{R}^{N \times D}$ and corresponding outputs $y \in \mathbb{R}^N$, the function values from a GP evaluated at X. We assume a mean function m(x) and a covariance function k(x, x'), which rely on parameters $\theta$.

We would like to learn the optimal $\theta$, and estimate the function values $y_*$ for a set of new inputs $X_*$.

To learn $\theta$, we optimize the marginal likelihood:
$$P(y \mid X, \theta) = \mathcal{N}(y \mid 0, K_{XX}).$$
We can then use the multivariate normal conditional distribution to evaluate the predictive distribution:
$$P(y_* \mid X_*, X, y, \theta) = \mathcal{N}\left(y_* \mid K_{X_*X} K_{XX}^{-1} y,\; K_{X_*X_*} - K_{X_*X} K_{XX}^{-1} K_{XX_*}\right).$$
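A hedged sketch of these two computations, assuming noise-free observations and the hypothetical kernel interface from the earlier sketch (`theta` is a dict of kernel hyperparameters); jitter is added only for numerical stability.

```python
# Sketch of the GP marginal likelihood and posterior predictive equations above
# (noise-free observations; `kernel` and `theta` follow the earlier hypothetical helper).
import numpy as np

def log_marginal_likelihood(X, y, kernel, theta, jitter=1e-6):
    """log N(y | 0, K_XX), the quantity maximized with respect to theta."""
    K = kernel(X, X, **theta) + jitter * np.eye(len(X))
    L = np.linalg.cholesky(K)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))   # K^{-1} y via Cholesky
    return (-0.5 * y @ alpha
            - np.sum(np.log(np.diag(L)))
            - 0.5 * len(X) * np.log(2.0 * np.pi))

def gp_predict(X_star, X, y, kernel, theta, jitter=1e-6):
    """Mean and covariance of P(y* | X*, X, y, theta)."""
    K_xx = kernel(X, X, **theta) + jitter * np.eye(len(X))
    K_sx = kernel(X_star, X, **theta)
    K_ss = kernel(X_star, X_star, **theta)
    mean = K_sx @ np.linalg.solve(K_xx, y)                 # K_{X*X} K_XX^{-1} y
    cov = K_ss - K_sx @ np.linalg.solve(K_xx, K_sx.T)      # K_{X*X*} - K_{X*X} K_XX^{-1} K_{XX*}
    return mean, cov
```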
GPs for Regression

Note this is all true because we are assuming the outputs correspond to a Gaussian process. We therefore make the following assumption:
$$\begin{bmatrix} y \\ y_* \end{bmatrix} \sim \mathcal{N}\left( \begin{bmatrix} 0 \\ 0 \end{bmatrix}, \begin{bmatrix} K_{XX} & K_{XX_*} \\ K_{X_*X} & K_{X_*X_*} \end{bmatrix} \right).$$
Computing $P(y \mid X)$ and $P(y_* \mid X_*, X, y)$ only requires matrix algebra on the above assumption.
Example of a GP for Regression

Figure: On the left, data from a sigmoidal curve with noise. On the right, samples from a GP trained on the data (represented by 'x'), using a squared exponential covariance function.
Deep Gaussian Processes
Definition of a Deep Gaussian Process

Formally, we define a deep Gaussian process as the composition of GPs:
$$f^{(1:L)}(x) = f^{(L)}\left(f^{(L-1)}\left(\dots f^{(2)}\left(f^{(1)}(x)\right) \dots\right)\right)$$
where $f_d^{(l)} \sim \mathcal{GP}\left(0, k_d^{(l)}(x, x')\right)$ for $f_d^{(l)} \in f^{(l)}$.
Deep GP Notation

- Each layer l consists of $D^{(l)}$ GPs, where $D^{(l)}$ is the number of units at layer l.
- For an L layer deep GP, we have
  - one input layer $x_n \in \mathbb{R}^{D^{(0)}}$
  - L − 1 hidden layers $\{h_n^{(l)}\}_{l=1}^{L-1}$
  - one output layer $y_n$, which we assume to be 1-dimensional.
- All layers are completely connected by GPs, each with their own kernel.
Example: Two-Layer Deep GP

[Graphical model: $x_n \to h_n \to y_n$, with GP f mapping $x_n$ to $h_n$ and GP g mapping $h_n$ to $y_n$.]

We have a one dimensional input, $x_n$, a one dimensional hidden unit, $h_n$, and a one dimensional output, $y_n$. This two-layer network consists of two GPs, f and g, where
$$h_n = f(x_n), \quad f \sim \mathcal{GP}\left(0, k^{(1)}(x, x')\right)$$
and
$$y_n = g(h_n), \quad g \sim \mathcal{GP}\left(0, k^{(2)}(h, h')\right).$$
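As an illustration, forward sampling from this two-layer construction only requires drawing from one GP prior and feeding the draw into another. The sketch below reuses the hypothetical sample_gp_prior helper from earlier, with length-scales matching the deep GP sampling figure later in the deck (0.5 for layer 1, 1.0 for layer 2).

```python
# Illustrative forward sampling from the two-layer composition y_n = g(f(x_n)),
# reusing the hypothetical sample_gp_prior helper defined earlier.
import numpy as np

x = np.linspace(-6, 6, 300)
h = sample_gp_prior(x, signal_variance=1.0, length_scale=0.5, n_samples=1)[:, 0]  # h_n = f(x_n)
y = sample_gp_prior(h, signal_variance=1.0, length_scale=1.0, n_samples=1)[:, 0]  # y_n = g(h_n)
```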
Example: More Complicated Model

[Graphical representation of a more complicated deep GP architecture: inputs $x_1, x_2, x_3$, two hidden layers $h_1^{(1)}, \dots, h_4^{(1)}$ and $h_1^{(2)}, \dots, h_4^{(2)}$, and output y.] Every edge corresponds to a GP between units, as the outputs of each layer are the inputs of the following layer. Our input data is 3-dimensional, while the two hidden layers in this model each have 4 hidden units.
Sampling From a Deep GP

[Figure: samples from a deep GP, showing the full composition g(f(x)) alongside Layer 1 (length-scale 0.5) and Layer 2 (length-scale 1.0).] As opposed to single-layer GPs, a deep GP can model non-stationary functions (functions whose shape changes along the input space) without the use of a non-stationary kernel.
Comparison with Neural Networks

- Similarities: deep architectures, completely connected, single-layer GPs correspond to two-layer neural networks with random weights and infinitely many hidden units.
- Differences: deep GP is nonparametric, no activation functions, must specify kernels, training is intractable.
Implementation
FITC Approximation for Single-Layer GP

- The Fully Independent Training Conditional (FITC) approximation circumvents the $O(N^3)$ training time for a single-layer GP by introducing pseudo data, points that are not in the data set but can be chosen to approximate the function (Snelson and Ghahramani, 2005).
- We introduce M pseudo inputs $\bar{X} = \{\bar{x}_m\}_{m=1}^M$ and corresponding pseudo outputs $\bar{y} = \{\bar{y}_m\}_{m=1}^M$, the function values at the pseudo inputs.
- Key assumption: conditioned on the pseudo data, the output values are independent.
FITC Approximation for Single-Layer GP

We assume a prior
$$P(\bar{y} \mid \bar{X}) = \mathcal{N}\left(0, K_{\bar{X}\bar{X}}\right).$$
Training takes time $O(NM^2)$, and testing requires $O(M^2)$.
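The FITC-style predictive used throughout the rest of the deck reduces to a conditional mean through the pseudo data plus a diagonal covariance. Below is a minimal sketch, not the thesis implementation, assuming the hypothetical kernel interface from earlier; it mirrors the $\mu^{(1)}, \Sigma^{(1)}$ equations that appear later in the deck.

```python
# Minimal sketch (not the thesis implementation) of a FITC-style predictive:
# the mean passes through the pseudo data and the covariance is kept diagonal.
import numpy as np

def fitc_predict(X, X_bar, y_bar, kernel, theta, jitter=1e-6):
    K_mm = kernel(X_bar, X_bar, **theta) + jitter * np.eye(len(X_bar))
    K_nm = kernel(X, X_bar, **theta)
    K_nn_diag = np.diag(kernel(X, X, **theta))
    mean = K_nm @ np.linalg.solve(K_mm, y_bar)        # K_XXbar K_XbarXbar^{-1} ybar
    A = np.linalg.solve(K_mm, K_nm.T)                 # K_XbarXbar^{-1} K_XbarX
    var = K_nn_diag - np.sum(K_nm * A.T, axis=1)      # diag(K_XX - K_XXbar K_XbarXbar^{-1} K_XbarX)
    return mean, np.maximum(var, 0.0)
```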
FITC Example

Figure: The predictive mean of a GP trained on sigmoidal data using the FITC approximation. On the left, we use 5 pseudo data points, while on the right, we use 10.
Learning Deep GPs is Intractable

[Graphical model: $x_n \to h_n \to y_n$, with f governed by $\theta^{(1)}$ and g by $\theta^{(2)}$.]

Example: two-layer model, with inputs X, outputs y, and hidden layer H (which is N × 1). Ideally, a Bayesian treatment would allow us to integrate out the hidden function values to evaluate
$$P(y \mid X, \theta) = \int P\left(y \mid H, \theta^{(2)}\right) P\left(H \mid X, \theta^{(1)}\right) dH = \int \mathcal{N}\left(y \mid 0, K_{HH}\right) \mathcal{N}\left(H \mid 0, K_{XX}\right) dH.$$
Evaluating this integral is intractable: H enters nonlinearly through the kernel matrix $K_{HH}$, so the Gaussians cannot be combined analytically.
DGPS Algorithm Overview

The Deep Gaussian Process Sampling algorithm relies on two central ideas:
- We sample predictive means and covariances to approximate the marginal likelihood, relying on automatic differentiation techniques to evaluate the gradients and optimize our objective.
- We replace every GP with the FITC GP, so the time complexity for L layers and H hidden units per layer is $O(N^2MLH)$ as opposed to $O(N^3LH)$.
Related Work

- Damianou and Lawrence (2013) also use the FITC approximation at every layer, but they perform inference with approximate variational marginalization. Subsequent methods (Hensman and Lawrence, 2014; Dai et al., 2015; Bui et al., 2016) also use variational approximations.
- These methods are able to integrate out the pseudo outputs at each layer, but they rely on integral approximations that restrict the kernel. Meanwhile, the DGPS uses Monte Carlo sampling, which is easier to implement, more intuitive to understand, and can extend easily to most kernels.
Sampling Hidden Values

- For inputs X, we calculate the predictive mean and covariance for every unit in the first hidden layer. We then sample values from each predictive distribution.
- For every hidden layer thereafter, we take the samples from the previous layer, calculate the predictive mean and covariance, and repeat sampling until the final layer (a minimal sketch of this loop appears below).
- We use K different samples $\{(\tilde{\mu}_k, \tilde{\Sigma}_k)\}_{k=1}^K$ to approximate the marginal likelihood:
$$P(y \mid X) \approx \frac{1}{K}\sum_{k=1}^K P\left(y \mid \tilde{\mu}_k, \tilde{\Sigma}_k\right) = \frac{1}{K}\sum_{k=1}^K \mathcal{N}\left(y \mid \tilde{\mu}_k, \tilde{\Sigma}_k\right)$$
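A minimal sketch of the layer-by-layer propagation described above. It assumes the hypothetical fitc_predict helper from earlier and a `params` structure holding, for every layer and unit, pseudo inputs, pseudo outputs, and kernel hyperparameters (one possible layout for that structure is sketched after the next slide group); none of these names come from the thesis.

```python
# Minimal sketch of layer-by-layer sampling: `params` is a list of layers, each a
# list of per-unit dicts with keys "X_bar", "y_bar", and "theta"; fitc_predict is
# the hypothetical helper sketched earlier. X is assumed to have shape (N, D).
import numpy as np

def sample_forward(X, params, kernel):
    """Propagate one Monte Carlo sample through every layer.

    Returns the predictive mean and diagonal variance at the output layer,
    assuming the final layer has a single unit (1-dimensional output).
    """
    current = X
    for layer in params:
        outputs = []
        for unit in layer:
            mu, var = fitc_predict(current, unit["X_bar"], unit["y_bar"], kernel, unit["theta"])
            outputs.append(mu + np.sqrt(var) * np.random.randn(len(mu)))  # sample this unit
        current = np.column_stack(outputs)   # samples become inputs to the next layer
    return mu, var
```

Repeating `sample_forward` K times and averaging $\mathcal{N}(y \mid \tilde{\mu}_k, \tilde{\Sigma}_k)$ over the K draws gives the Monte Carlo approximation above.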
FITC for Deep GPs

- To make fitting more scalable, we replace every GP in the model with a FITC GP.
- For each GP, corresponding to hidden unit d in layer l, we introduce pseudo inputs $\bar{X}_d^{(l)}$ and corresponding pseudo outputs $\bar{y}_d^{(l)}$.
- With the addition of the pseudo data, we are required to learn the following set of parameters (one possible container for them is sketched below):
$$\Theta = \left\{ \left\{ \bar{X}_d^{(l)}, \bar{y}_d^{(l)}, \theta_d^{(l)} \right\}_{d=1}^{D^{(l)}} \right\}_{l=1}^{L}.$$
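One possible, purely illustrative layout for $\Theta$, initializing pseudo data and kernel hyperparameters for every unit; nothing about this container or its initialization comes from the thesis itself.

```python
# Illustrative container for the parameter set Theta: pseudo inputs, pseudo outputs,
# and kernel hyperparameters for every unit d in every layer l (an assumption, not
# the thesis's data structure).
import numpy as np

def init_params(layer_dims, n_pseudo, seed=0):
    """layer_dims = [D(0), D(1), ..., D(L)]; returns params[l-1][d] for layer l, unit d."""
    rng = np.random.default_rng(seed)
    params = []
    for l in range(1, len(layer_dims)):
        layer = []
        for _ in range(layer_dims[l]):
            layer.append({
                "X_bar": rng.standard_normal((n_pseudo, layer_dims[l - 1])),  # pseudo inputs
                "y_bar": rng.standard_normal(n_pseudo),                        # pseudo outputs
                "theta": {"signal_variance": 1.0, "length_scales": 1.0},       # kernel parameters
            })
        params.append(layer)
    return params
```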
Example: DGPS Algorithm on 2 Layers

[Graphical model: $X_n \to H_n \to y_n$, with f governed by $(\bar{X}^{(1)}, \bar{y}^{(1)}, \theta^{(1)})$ and g by $(\bar{X}^{(2)}, \bar{y}^{(2)}, \theta^{(2)})$.]

Our goal is to learn
- $\{(\bar{X}^{(l)}, \bar{y}^{(l)})\}_{l=1}^{2}$, the pseudo data for each layer
- $\theta^{(1)}$ and $\theta^{(2)}$, the kernel parameters for f and g
Example: DGPS Algorithm on 2 Layers

To sample values H from the hidden layer, we use the FITC approximation and assume
$$P\left(H \mid X, \bar{X}^{(1)}, \bar{y}^{(1)}\right) = \mathcal{N}\left(\mu^{(1)}, \Sigma^{(1)}\right)$$
where
$$\mu^{(1)} = K_{X\bar{X}^{(1)}} K_{\bar{X}^{(1)}\bar{X}^{(1)}}^{-1} \bar{y}^{(1)}, \qquad \Sigma^{(1)} = \mathrm{diag}\left(K_{XX} - K_{X\bar{X}^{(1)}} K_{\bar{X}^{(1)}\bar{X}^{(1)}}^{-1} K_{\bar{X}^{(1)}X}\right).$$
We obtain K samples, $\{\tilde{H}_k\}_{k=1}^K$, from the above distribution.
Example: DGPS Algorithm on 2 Layers

For each sample $\tilde{H}_k$, we can approximate
$$P\left(y \mid \tilde{H}_k, \bar{X}^{(2)}, \bar{y}^{(2)}\right) \approx \mathcal{N}\left(\tilde{\mu}^{(2)}, \tilde{\Sigma}^{(2)}\right)$$
where
$$\tilde{\mu}^{(2)} = K_{\tilde{H}_k\bar{X}^{(2)}} K_{\bar{X}^{(2)}\bar{X}^{(2)}}^{-1} \bar{y}^{(2)}, \qquad \tilde{\Sigma}^{(2)} = \mathrm{diag}\left(K_{\tilde{H}_k\tilde{H}_k} - K_{\tilde{H}_k\bar{X}^{(2)}} K_{\bar{X}^{(2)}\bar{X}^{(2)}}^{-1} K_{\bar{X}^{(2)}\tilde{H}_k}\right).$$
Example: DGPS Algorithm on 2 Layers

Thus, we can approximate the marginal likelihood with our samples:
$$P(y \mid X, \Theta) \approx \frac{1}{K}\sum_{k=1}^K P\left(y \mid \tilde{H}_k, \bar{X}^{(2)}, \bar{y}^{(2)}\right).$$
Incorporating the prior over the pseudo outputs into our objective, we have:
$$\mathcal{L}(y \mid X, \Theta) = \log P(y \mid X, \Theta) + \sum_{l=1}^{L}\sum_{d=1}^{D^{(l)}} \log P\left(\bar{y}_d^{(l)} \mid \bar{X}_d^{(l)}\right).$$
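Putting the pieces together, here is a hedged sketch of the objective $\mathcal{L}(y \mid X, \Theta)$: the log of the Monte Carlo average plus the log prior over the pseudo outputs at every unit. It reuses the hypothetical sample_forward helper and kernel interface from the earlier sketches, none of which are the thesis's own code.

```python
# Hedged sketch of the objective L(y | X, Theta): log of the Monte Carlo average of
# N(y | mu_k, Sigma_k) plus the log prior over the pseudo outputs at every unit.
import numpy as np
from scipy.special import logsumexp
from scipy.stats import multivariate_normal

def log_prior_pseudo_outputs(params, kernel, jitter=1e-6):
    """sum over layers l and units d of log N(y_bar_d^(l) | 0, K_XbarXbar)."""
    total = 0.0
    for layer in params:
        for unit in layer:
            m = len(unit["y_bar"])
            K = kernel(unit["X_bar"], unit["X_bar"], **unit["theta"]) + jitter * np.eye(m)
            total += multivariate_normal.logpdf(unit["y_bar"], mean=np.zeros(m), cov=K)
    return total

def objective(params, X, y, kernel, n_mc=10):
    log_terms = []
    for _ in range(n_mc):
        mu, var = sample_forward(X, params, kernel)  # hypothetical helper from earlier
        cov = np.diag(var) + 1e-8 * np.eye(len(y))   # small jitter keeps the covariance nonsingular
        log_terms.append(multivariate_normal.logpdf(y, mean=mu, cov=cov))
    log_lik = logsumexp(log_terms) - np.log(n_mc)    # log of the Monte Carlo average
    return log_lik + log_prior_pseudo_outputs(params, kernel)
```

In the DGPS itself this quantity would be written against an automatic-differentiation library so that its gradient with respect to every element of $\Theta$ (pseudo data and kernel parameters) can drive the optimization; the plain NumPy/SciPy calls above are only meant to make the objective being optimized explicit.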
Experiments and Analysis
Step Function

- We test on a step function with noise: $X \in [-2, 2]$, $y_i = \mathrm{sign}(x_i) + \epsilon_i$, where $\epsilon_i \sim \mathcal{N}(0, 0.01)$ (a sketch of this data-generating step appears below).
- The non-stationarity of a step function is appealing from a deep GP perspective.
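For concreteness, one way (our assumption, including reading 0.01 as the noise variance) to generate the noisy step data and the 80/20 train/test split used later in the experiments:

```python
# Sketch of the noisy step function data and an 80/20 train/test split
# (helper name, seed, and uniform input placement are our choices).
import numpy as np

def make_step_data(n_points, seed=0):
    rng = np.random.default_rng(seed)
    x = rng.uniform(-2, 2, size=n_points)
    y = np.sign(x) + rng.normal(0.0, np.sqrt(0.01), size=n_points)  # eps_i ~ N(0, 0.01)
    order = rng.permutation(n_points)
    split = int(0.8 * n_points)
    train, test = order[:split], order[split:]
    return (x[train], y[train]), (x[test], y[test])

(train_x, train_y), (test_x, test_y) = make_step_data(100)
```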
Step Function

Figure: Functions sampled from a single-layer GP. Evidently, the predictive draws do not fully capture the shape of the step function.
Figure: Predictive draws from a single-layer GP and a two-layer deep GP.
Figure: Predictive draws from a three-layer deep GP.
Step Function

[Figure: impact of parameter initializations on predictive draws, comparing a random initialization with a smart initialization (predictive draws and learned hidden values for each).]
Step Function

Figure: Experimental results measuring the test log-likelihood per data point and test mean squared error on the noisy step function. We vary the number of layers used in the model (1, 2, or 3), along with the number of data points used in the original step function (50, 100, or 200, divided 80/20 into train/test). We run 10 trials at each combination.
Step Function

Occasionally, models with deeper architectures outperform those that are more shallow, yet they also possess the widest distributions and trials with the worst results.

Figure: Test set log-likelihoods per data point and mean squared errors plotted against their training set counterparts for the step function experiment, for 1-, 2-, and 3-layer models. Overfitting does not appear to be a problem.
Step Function

- Overfitting does not appear to be a problem.
- If we can successfully optimize our objective, deeper architectures are better suited to learning the noisy step function than shallower ones.
- However, it becomes more difficult to train and successfully optimize as the number of layers grows and the number of parameters increases.
Step Function

Figure: Predictive draws from two identical three-layer models trained with different random parameter initializations (random seeds 66 and 0), along with the learned hidden values at each layer.
Step Function

Ways to combat optimization challenges:
- Using random restarts
- Decreasing the number of model parameters
- Trying different optimization methods
- Experimenting with more diverse architectures, e.g. increasing the dimension of the hidden layer
Toy Non-Stationary Data

- We create toy non-stationary data to evaluate a deep GP's ability to learn a non-stationary function.
- We divide the input space into three regions: $X_1 \in [-4, -3]$, $X_2 \in [-1, 1]$, and $X_3 \in [2, 4]$, each of which consists of 40 data points.
- We sample from a GP with length-scale $\ell = 0.25$ for regions $X_1$ and $X_3$, and $\ell = 2$ for region $X_2$ (a sketch of this construction appears below).
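A sketch of this construction, reusing the hypothetical sample_gp_prior helper from earlier; the seed and the uniform placement of inputs within each region are our choices.

```python
# Sketch of the toy non-stationary data: 40 points per region, each region drawn
# from a GP prior with the stated length-scale.
import numpy as np

np.random.seed(0)
regions = [(-4.0, -3.0, 0.25), (-1.0, 1.0, 2.0), (2.0, 4.0, 0.25)]  # (low, high, length-scale)
xs, ys = [], []
for lo, hi, length_scale in regions:
    x = np.sort(np.random.uniform(lo, hi, size=40))
    y = sample_gp_prior(x, signal_variance=1.0, length_scale=length_scale, n_samples=1)[:, 0]
    xs.append(x)
    ys.append(y)
x_data, y_data = np.concatenate(xs), np.concatenate(ys)
```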
Toy Non-Stationary Data

Figure: Predictive draws from the single-layer and 2-layer models for the toy non-stationary data with squared exponential kernels (the 2-layer panel also shows the learned hidden values).
Toy Non-Stationary Data

Figure: The optimization for a 3-layer model can get stuck in a local optimum, and although the predictive draws are non-stationary, our predictions are poor at the tails.
Motorcycle Data

- 94 points, where the inputs are time in milliseconds since impact of a motorcycle accident and the outputs are the corresponding helmet accelerations.
- The dataset is somewhat non-stationary, as the accelerations are roughly constant early on but become more variable after a certain time.
Motorcycle Data

Figure: Predictive draws from the single-layer and 2-layer models trained on the motorcycle data with squared exponential kernels (the 2-layer panel also shows the learned hidden values).
Conclusion
Future Directions

Natural extensions include
- Trying different optimization methods to avoid getting stuck in local optima
- Introducing variational parameters so we do not have to learn pseudo outputs
- Extending the model to classification
- Exploring properties of more complex architectures, and evaluating the model likelihood to choose the optimal configuration
Acknowledgments

A huge thank-you to Sasha Rush, Finale Doshi-Velez, David Duvenaud, and Miguel Hernández-Lobato. This thesis would not have been possible without your help and support!
More Related Content

PPTX
Beta distribution and Dirichlet distribution (ベータ分布とディリクレ分布)
PDF
파이썬으로 익히는 딥러닝 기본 (18년)
PDF
(DL hacks輪読) Deep Kernel Learning
PDF
Chapter13.2.3
PPTX
JSX 速さの秘密 - 高速なJavaScriptを書く方法
PDF
Lifted-ElGamal暗号を用いた任意関数演算の二者間秘密計算プロトコルのmaliciousモデルにおける効率化
PPTX
冗長変換とその画像復元応用
PDF
連続変量を含む相互情報量の推定
Beta distribution and Dirichlet distribution (ベータ分布とディリクレ分布)
파이썬으로 익히는 딥러닝 기본 (18년)
(DL hacks輪読) Deep Kernel Learning
Chapter13.2.3
JSX 速さの秘密 - 高速なJavaScriptを書く方法
Lifted-ElGamal暗号を用いた任意関数演算の二者間秘密計算プロトコルのmaliciousモデルにおける効率化
冗長変換とその画像復元応用
連続変量を含む相互情報量の推定

What's hot (20)

PDF
MLaPP 24章 「マルコフ連鎖モンテカルロ法 (MCMC) による推論」
PDF
[DL輪読会]Deep Neural Networks as Gaussian Processes
PPTX
PRMLrevenge_3.3
PDF
3.3節 変分近似法(前半)
PDF
Prml 4.1.1
PDF
강화학습 알고리즘의 흐름도 Part 2
PDF
Ορισμένο ολοκλήρωμα με 918 ασκήσεις
PDF
Graphic Notes on Linear Algebra and Data Science
PDF
παράγωγοι β' (2013)
PPTX
R6パッケージの紹介―機能と実装
PDF
PDF
機械学習を使った時系列売上予測
PDF
Wasserstein GAN 수학 이해하기 I
PDF
深層学習によるポアソンデノイジング: 残差学習はポアソンノイズに対して有効か? 論文 Poisson Denoising by Deep Learnin...
PDF
【輪読】Bayesian Optimization of Combinatorial Structures
PPTX
RとPythonを比較する
PDF
確率的主成分分析
PPTX
クラシックな機械学習の入門  5. サポートベクターマシン
PPTX
色々な確率分布とその応用
PDF
POLINOMIOS CICLOTÓMICOS EN CUERPOS K[X] Y RAICES PRIMITIVAS MÓDULO N
MLaPP 24章 「マルコフ連鎖モンテカルロ法 (MCMC) による推論」
[DL輪読会]Deep Neural Networks as Gaussian Processes
PRMLrevenge_3.3
3.3節 変分近似法(前半)
Prml 4.1.1
강화학습 알고리즘의 흐름도 Part 2
Ορισμένο ολοκλήρωμα με 918 ασκήσεις
Graphic Notes on Linear Algebra and Data Science
παράγωγοι β' (2013)
R6パッケージの紹介―機能と実装
機械学習を使った時系列売上予測
Wasserstein GAN 수학 이해하기 I
深層学習によるポアソンデノイジング: 残差学習はポアソンノイズに対して有効か? 論文 Poisson Denoising by Deep Learnin...
【輪読】Bayesian Optimization of Combinatorial Structures
RとPythonを比較する
確率的主成分分析
クラシックな機械学習の入門  5. サポートベクターマシン
色々な確率分布とその応用
POLINOMIOS CICLOTÓMICOS EN CUERPOS K[X] Y RAICES PRIMITIVAS MÓDULO N
Ad

Viewers also liked (20)

PDF
PCAの最終形態GPLVMの解説
PPT
Pasolli_TH1_T09_2.ppt
PDF
Bird’s-eye view of Gaussian harmonic analysis
PPTX
The Role Of Translators In MT: EU 2010
PDF
Flexible and efficient Gaussian process models for machine ...
PDF
1 factor vs.2 factor gaussian model for zero coupon bond pricing final
PDF
YSC 2013
PDF
03 the gaussian kernel
PPTX
Kernal methods part2
PDF
Temporal Activity Detection in Untrimmed Videos with Recurrent Neural Networks
PDF
Differentiating the translation process: A corpus analysis of editorial influ...
PDF
Inventory
PPTX
Gaussian model (kabani & sumeet)
PPTX
linear equation and gaussian elimination
PDF
Gaussian Processes: Applications in Machine Learning
PDF
The Gaussian Process Latent Variable Model (GPLVM)
PPTX
Image encryption and decryption
PPTX
Noise filtering
KEY
Google I/O 2011, Android Accelerated Rendering
PDF
Social Network Analysis & an Introduction to Tools
PCAの最終形態GPLVMの解説
Pasolli_TH1_T09_2.ppt
Bird’s-eye view of Gaussian harmonic analysis
The Role Of Translators In MT: EU 2010
Flexible and efficient Gaussian process models for machine ...
1 factor vs.2 factor gaussian model for zero coupon bond pricing final
YSC 2013
03 the gaussian kernel
Kernal methods part2
Temporal Activity Detection in Untrimmed Videos with Recurrent Neural Networks
Differentiating the translation process: A corpus analysis of editorial influ...
Inventory
Gaussian model (kabani & sumeet)
linear equation and gaussian elimination
Gaussian Processes: Applications in Machine Learning
The Gaussian Process Latent Variable Model (GPLVM)
Image encryption and decryption
Noise filtering
Google I/O 2011, Android Accelerated Rendering
Social Network Analysis & an Introduction to Tools
Ad

Similar to Training and Inference for Deep Gaussian Processes (20)

PPTX
Gaussian processing
PDF
proposal_pura
PDF
A nonlinear approximation of the Bayesian Update formula
PPTX
Bayesian Inference : Kalman filter 에서 Optimization 까지 - 김홍배 박사님
PDF
PDF
MLHEP 2015: Introductory Lecture #1
PDF
Statement of stochastic programming problems
PDF
A brief introduction to Gaussian process
PDF
Expectation propagation
PDF
Hands-on Tutorial of Machine Learning in Python
PDF
Elementary Landscape Decomposition of the Hamiltonian Path Optimization Problem
PDF
QMC: Transition Workshop - Approximating Multivariate Functions When Function...
PDF
QMC Program: Trends and Advances in Monte Carlo Sampling Algorithms Workshop,...
PDF
Stochastic Approximation and Simulated Annealing
PDF
Knowledge extraction from support vector machines
PDF
Iclr2016 vaeまとめ
PDF
Logistic Regression, Linear and Quadratic Discriminant Analyses, and KNN
PDF
Delayed acceptance for Metropolis-Hastings algorithms
PDF
QTML2021 UAP Quantum Feature Map
Gaussian processing
proposal_pura
A nonlinear approximation of the Bayesian Update formula
Bayesian Inference : Kalman filter 에서 Optimization 까지 - 김홍배 박사님
MLHEP 2015: Introductory Lecture #1
Statement of stochastic programming problems
A brief introduction to Gaussian process
Expectation propagation
Hands-on Tutorial of Machine Learning in Python
Elementary Landscape Decomposition of the Hamiltonian Path Optimization Problem
QMC: Transition Workshop - Approximating Multivariate Functions When Function...
QMC Program: Trends and Advances in Monte Carlo Sampling Algorithms Workshop,...
Stochastic Approximation and Simulated Annealing
Knowledge extraction from support vector machines
Iclr2016 vaeまとめ
Logistic Regression, Linear and Quadratic Discriminant Analyses, and KNN
Delayed acceptance for Metropolis-Hastings algorithms
QTML2021 UAP Quantum Feature Map

Recently uploaded (20)

PPTX
IBA_Chapter_11_Slides_Final_Accessible.pptx
PPTX
Qualitative Qantitative and Mixed Methods.pptx
PPTX
Business Ppt On Nestle.pptx huunnnhhgfvu
PPTX
Supervised vs unsupervised machine learning algorithms
PPTX
01_intro xxxxxxxxxxfffffffffffaaaaaaaaaaafg
PPTX
Business Acumen Training GuidePresentation.pptx
PPTX
oil_refinery_comprehensive_20250804084928 (1).pptx
PPTX
AI Strategy room jwfjksfksfjsjsjsjsjfsjfsj
PPTX
Introduction to Knowledge Engineering Part 1
PDF
annual-report-2024-2025 original latest.
PPTX
climate analysis of Dhaka ,Banglades.pptx
PPTX
mbdjdhjjodule 5-1 rhfhhfjtjjhafbrhfnfbbfnb
PDF
TRAFFIC-MANAGEMENT-AND-ACCIDENT-INVESTIGATION-WITH-DRIVING-PDF-FILE.pdf
PPTX
DISORDERS OF THE LIVER, GALLBLADDER AND PANCREASE (1).pptx
PPTX
Data_Analytics_and_PowerBI_Presentation.pptx
PPTX
Introduction to machine learning and Linear Models
PDF
168300704-gasification-ppt.pdfhghhhsjsjhsuxush
PDF
Mega Projects Data Mega Projects Data
PDF
Recruitment and Placement PPT.pdfbjfibjdfbjfobj
PDF
BF and FI - Blockchain, fintech and Financial Innovation Lesson 2.pdf
IBA_Chapter_11_Slides_Final_Accessible.pptx
Qualitative Qantitative and Mixed Methods.pptx
Business Ppt On Nestle.pptx huunnnhhgfvu
Supervised vs unsupervised machine learning algorithms
01_intro xxxxxxxxxxfffffffffffaaaaaaaaaaafg
Business Acumen Training GuidePresentation.pptx
oil_refinery_comprehensive_20250804084928 (1).pptx
AI Strategy room jwfjksfksfjsjsjsjsjfsjfsj
Introduction to Knowledge Engineering Part 1
annual-report-2024-2025 original latest.
climate analysis of Dhaka ,Banglades.pptx
mbdjdhjjodule 5-1 rhfhhfjtjjhafbrhfnfbbfnb
TRAFFIC-MANAGEMENT-AND-ACCIDENT-INVESTIGATION-WITH-DRIVING-PDF-FILE.pdf
DISORDERS OF THE LIVER, GALLBLADDER AND PANCREASE (1).pptx
Data_Analytics_and_PowerBI_Presentation.pptx
Introduction to machine learning and Linear Models
168300704-gasification-ppt.pdfhghhhsjsjhsuxush
Mega Projects Data Mega Projects Data
Recruitment and Placement PPT.pdfbjfibjdfbjfobj
BF and FI - Blockchain, fintech and Financial Innovation Lesson 2.pdf

Training and Inference for Deep Gaussian Processes

  • 1. Training and Inference for Deep Gaussian Processes Keyon Vafa April 26, 2016 Keyon Vafa Training and Inference for Deep GPs April 26, 2016 1 / 50
  • 2. Motivation An ideal model for prediction is Keyon Vafa Training and Inference for Deep GPs April 26, 2016 2 / 50
  • 3. Motivation An ideal model for prediction is accurate Keyon Vafa Training and Inference for Deep GPs April 26, 2016 2 / 50
  • 4. Motivation An ideal model for prediction is accurate computationally efficient Keyon Vafa Training and Inference for Deep GPs April 26, 2016 2 / 50
  • 5. Motivation An ideal model for prediction is accurate computationally efficient easy to tune without overfitting Keyon Vafa Training and Inference for Deep GPs April 26, 2016 2 / 50
  • 6. Motivation An ideal model for prediction is accurate computationally efficient easy to tune without overfitting able to provide certainty estimates Keyon Vafa Training and Inference for Deep GPs April 26, 2016 2 / 50
  • 7. Motivation This thesis focuses on one particular class of prediction models, deep Gaussian processes for regression. They are a new model, having been introduced by Damianou and Lawrence in 2013. Keyon Vafa Training and Inference for Deep GPs April 26, 2016 3 / 50
  • 8. Motivation This thesis focuses on one particular class of prediction models, deep Gaussian processes for regression. They are a new model, having been introduced by Damianou and Lawrence in 2013. Exact inference is intractable. In this thesis, we introduce a new method to learn deep GPs, the Deep Gaussian Process Sampling algorithm (DPGS). Keyon Vafa Training and Inference for Deep GPs April 26, 2016 3 / 50
  • 9. Motivation The DGPS algorithm Keyon Vafa Training and Inference for Deep GPs April 26, 2016 4 / 50
  • 10. Motivation The DGPS algorithm is more straightforward than existing methods Keyon Vafa Training and Inference for Deep GPs April 26, 2016 4 / 50
  • 11. Motivation The DGPS algorithm is more straightforward than existing methods can more easily adapt to using arbitrary kernels Keyon Vafa Training and Inference for Deep GPs April 26, 2016 4 / 50
  • 12. Motivation The DGPS algorithm is more straightforward than existing methods can more easily adapt to using arbitrary kernels relies on Monte Carlo sampling to circumvent the intractability hurdle Keyon Vafa Training and Inference for Deep GPs April 26, 2016 4 / 50
  • 13. Motivation The DGPS algorithm is more straightforward than existing methods can more easily adapt to using arbitrary kernels relies on Monte Carlo sampling to circumvent the intractability hurdle uses pseudo data to ease the computational burden Keyon Vafa Training and Inference for Deep GPs April 26, 2016 4 / 50
  • 14. Gaussian Processes Table of Contents 1 Gaussian Processes 2 Deep Gaussian Processes 3 Implementation 4 Experiments and Analysis 5 Conclusion Keyon Vafa Training and Inference for Deep GPs April 26, 2016 5 / 50
  • 15. Gaussian Processes Definition of a Gaussian Process A function f is a Gaussian process (GP) if any finite set of values f (x1), . . . , f (xN) has a multivariate normal distribution. Keyon Vafa Training and Inference for Deep GPs April 26, 2016 6 / 50
  • 16. Gaussian Processes Definition of a Gaussian Process A function f is a Gaussian process (GP) if any finite set of values f (x1), . . . , f (xN) has a multivariate normal distribution. The inputs {xn}N n=1 can be vectors from any arbitrary sized domain. Keyon Vafa Training and Inference for Deep GPs April 26, 2016 6 / 50
  • 17. Gaussian Processes Definition of a Gaussian Process A function f is a Gaussian process (GP) if any finite set of values f (x1), . . . , f (xN) has a multivariate normal distribution. The inputs {xn}N n=1 can be vectors from any arbitrary sized domain. Specified by a mean function m(x) and a covariance function k(x, x ) where m(x) = E[f (x)] k(x, x ) = Cov(f (x), f (x )). Keyon Vafa Training and Inference for Deep GPs April 26, 2016 6 / 50
  • 18. Gaussian Processes Covariance Function The covariance function (or kernel) determines the smoothness and stationarity of functions drawn from a GP. Keyon Vafa Training and Inference for Deep GPs April 26, 2016 7 / 50
  • 19. Gaussian Processes Covariance Function The covariance function (or kernel) determines the smoothness and stationarity of functions drawn from a GP. The squared exponential covariance function has the following form: k(x, x ) = σ2 f exp − 1 2 (x − x )T M(x − x ) Keyon Vafa Training and Inference for Deep GPs April 26, 2016 7 / 50
  • 20. Gaussian Processes Covariance Function The covariance function (or kernel) determines the smoothness and stationarity of functions drawn from a GP. The squared exponential covariance function has the following form: k(x, x ) = σ2 f exp − 1 2 (x − x )T M(x − x ) When M is a diagonal matrix, the elements on the diagonal are known as the length-scales, denoted by l−2 i . σ2 f is known as the signal variance. Keyon Vafa Training and Inference for Deep GPs April 26, 2016 7 / 50
  • 21. Gaussian Processes Sampling from a GP x f(x) Signal variance 1.0, Length-scale 0.2 x f(x) Signal variance 1.0, Length-scale 1.0 x f(x) Signal variance 1.0, Length-scale 5.0 x f(x) Signal variance 0.2, Length-scale 1.0 x f(x) Signal variance 1.0, Length-scale 1.0 x f(x) Signal variance 5.0, Length-scale 1.0 Random samples from GP priors. The length-scale controls the smoothness of our function, while the signal variance controls the deviation from the mean. Keyon Vafa Training and Inference for Deep GPs April 26, 2016 8 / 50
  • 22. Gaussian Processes GPs for Regression Setup: We are given a set of inputs X ∈ RN×D and corresponding outputs y ∈ RN, the function values from a GP evaluated at X. We assume a mean function m(x) and a covariance function k(x, x ), which rely on parameters θ. Keyon Vafa Training and Inference for Deep GPs April 26, 2016 9 / 50
  • 23. Gaussian Processes GPs for Regression Setup: We are given a set of inputs X ∈ RN×D and corresponding outputs y ∈ RN, the function values from a GP evaluated at X. We assume a mean function m(x) and a covariance function k(x, x ), which rely on parameters θ. We would like to learn the optimal θ, and estimate the function values y∗ for a set of new inputs X∗. Keyon Vafa Training and Inference for Deep GPs April 26, 2016 9 / 50
  • 24. Gaussian Processes GPs for Regression Setup: We are given a set of inputs X ∈ RN×D and corresponding outputs y ∈ RN, the function values from a GP evaluated at X. We assume a mean function m(x) and a covariance function k(x, x ), which rely on parameters θ. We would like to learn the optimal θ, and estimate the function values y∗ for a set of new inputs X∗. To learn θ, we optimize the marginal likelihood: P(y|X, θ) = N(0, KXX ). Keyon Vafa Training and Inference for Deep GPs April 26, 2016 9 / 50
  • 25. Gaussian Processes GPs for Regression Setup: We are given a set of inputs X ∈ RN×D and corresponding outputs y ∈ RN, the function values from a GP evaluated at X. We assume a mean function m(x) and a covariance function k(x, x ), which rely on parameters θ. We would like to learn the optimal θ, and estimate the function values y∗ for a set of new inputs X∗. To learn θ, we optimize the marginal likelihood: P(y|X, θ) = N(0, KXX ). We can then use the multivariate normal conditional distribution to evaluate the predictive distribution: P(y∗|X∗, X, y, θ) ∼ N(KX∗XK−1 XXy, KX∗X∗ − KX∗XK−1 XXKXX∗ ). Keyon Vafa Training and Inference for Deep GPs April 26, 2016 9 / 50
  • 26. Gaussian Processes GPs for Regression Note this is all true because we are assuming the outputs correspond to a Gaussian process. We therefore make the following assumption: y y∗ ∼ N 0 0 , KXX KXX∗ KX∗X KX∗X∗ . Keyon Vafa Training and Inference for Deep GPs April 26, 2016 10 / 50
  • 27. Gaussian Processes GPs for Regression Note this is all true because we are assuming the outputs correspond to a Gaussian process. We therefore make the following assumption: y y∗ ∼ N 0 0 , KXX KXX∗ KX∗X KX∗X∗ . Computing P(y|X) and P(y∗|X∗, X, y) only requires matrix algebra on the above assumption. Keyon Vafa Training and Inference for Deep GPs April 26, 2016 10 / 50
  • 28. Gaussian Processes Example of a GP for Regression xf(x) Gaussian Process Regression x Outputs Data Figure: On the left, data from a sigmoidal curve with noise. On the right, samples from a GP trained on the data (represented by ‘x’), using a squared exponential covariance function. Keyon Vafa Training and Inference for Deep GPs April 26, 2016 11 / 50
  • 29. Deep Gaussian Processes Table of Contents 1 Gaussian Processes 2 Deep Gaussian Processes 3 Implementation 4 Experiments and Analysis 5 Conclusion Keyon Vafa Training and Inference for Deep GPs April 26, 2016 12 / 50
  • 30. Deep Gaussian Processes Definition of a Deep Gaussian Process Formally, we define a deep Gaussian Process as the composition of GPs: f(1:L) (x) = f(L) (f(L−1) (. . . f(2) (f(1) (x)) . . . )) where f (l) d ∼ GP 0, k (l) d (x, x ) for f (l) d ∈ f(l) . Keyon Vafa Training and Inference for Deep GPs April 26, 2016 13 / 50
  • 31. Deep Gaussian Processes Deep GP Notation Each layer l consists of D(l) GPs, where D(l) is the number of units at layer l Keyon Vafa Training and Inference for Deep GPs April 26, 2016 14 / 50
  • 32. Deep Gaussian Processes Deep GP Notation Each layer l consists of D(l) GPs, where D(l) is the number of units at layer l For an L layer deep GP, we have Keyon Vafa Training and Inference for Deep GPs April 26, 2016 14 / 50
  • 33. Deep Gaussian Processes Deep GP Notation Each layer l consists of D(l) GPs, where D(l) is the number of units at layer l For an L layer deep GP, we have one input layer xn ∈ RD(0) Keyon Vafa Training and Inference for Deep GPs April 26, 2016 14 / 50
  • 34. Deep Gaussian Processes Deep GP Notation Each layer l consists of D(l) GPs, where D(l) is the number of units at layer l For an L layer deep GP, we have one input layer xn ∈ RD(0) L − 1 hidden layers {hl n}L−1 l=1 Keyon Vafa Training and Inference for Deep GPs April 26, 2016 14 / 50
  • 35. Deep Gaussian Processes Deep GP Notation Each layer l consists of D(l) GPs, where D(l) is the number of units at layer l For an L layer deep GP, we have one input layer xn ∈ RD(0) L − 1 hidden layers {hl n}L−1 l=1 one output layer yn, which we assume to be 1-dimensional. Keyon Vafa Training and Inference for Deep GPs April 26, 2016 14 / 50
  • 36. Deep Gaussian Processes Deep GP Notation Each layer l consists of D(l) GPs, where D(l) is the number of units at layer l For an L layer deep GP, we have one input layer xn ∈ RD(0) L − 1 hidden layers {hl n}L−1 l=1 one output layer yn, which we assume to be 1-dimensional. All layers are completely connected by GPs, each with their own kernel. Keyon Vafa Training and Inference for Deep GPs April 26, 2016 14 / 50
  • 37. Deep Gaussian Processes Example: Two-Layer Deep GP ynhnxn f g Keyon Vafa Training and Inference for Deep GPs April 26, 2016 15 / 50
  • 38. Deep Gaussian Processes Example: Two-Layer Deep GP ynhnxn f g We have a one dimensional input, xn, a one dimensional hidden unit, hn, and a one dimensional output, yn. This two-layer network consists of two GPs, f and g where hn = f (xn), where f ∼ GP(0, k(1) (x, x )) and yn = g(hn), where g ∼ GP(0, k(2) (h, h )). Keyon Vafa Training and Inference for Deep GPs April 26, 2016 15 / 50
  • 39. Deep Gaussian Processes Example: More Complicated Model y h (2) 1 h (2) 2 h (2) 3 h (2) 4 h (1) 1 h (1) 2 h (1) 3 h (1) 4 x1 x2 x3 Graphical representation of a more complicated deep GP architecture. Every edge corresponds to a GP between units, as the outputs of each layer are the inputs of the following layer. Our input data is 3-dimensional, while the two hidden layers in this model each have 4 hidden units. Keyon Vafa Training and Inference for Deep GPs April 26, 2016 16 / 50
  • 40. Deep Gaussian Processes Sampling From a Deep GP 6 4 2 0 2 4 6 x 2.0 1.5 1.0 0.5 0.0 0.5 1.0 1.5 2.0 2.5 g(f(x)) Full Deep GP 6 4 2 0 2 4 6 x 2 1 0 1 2 3 4 f(x) Layer 1: Length-scale 0.5 2 1 0 1 2 3 4 f(x) 2.0 1.5 1.0 0.5 0.0 0.5 1.0 1.5 2.0 2.5 g(f(x)) Layer 2: Length-scale 1.0 Samples from deep GPs. As opposed to single-layer GPs, a deep GP can model non-stationary functions (functions whose shape changes along the input space) without the use of a non-stationary kernel. Keyon Vafa Training and Inference for Deep GPs April 26, 2016 17 / 50
  • 41. Deep Gaussian Processes Comparison with Neural Networks Similarities: deep architectures, completely connected, single-layer GPs correspond to two-layer neural networks with random weights and infinitely many hidden units. Keyon Vafa Training and Inference for Deep GPs April 26, 2016 18 / 50
  • 42. Deep Gaussian Processes Comparison with Neural Networks Similarities: deep architectures, completely connected, single-layer GPs correspond to two-layer neural networks with random weights and infinitely many hidden units. Differences: deep GP is nonparametric, no activation functions, must specify kernels, training is intractable Keyon Vafa Training and Inference for Deep GPs April 26, 2016 18 / 50
  • 43. Implementation Table of Contents 1 Gaussian Processes 2 Deep Gaussian Processes 3 Implementation 4 Experiments and Analysis 5 Conclusion Keyon Vafa Training and Inference for Deep GPs April 26, 2016 19 / 50
  • 44. Implementation FITC Approximation for Single-Layer GP The Fully Independent Training Conditional Approximation (FITC) circumvents the O(N3) training time for a single-layer GP by introducing pseudo data, points that are not in the data set but can be chosen to approximate the function (Snelson and Ghahramani, 2005). Keyon Vafa Training and Inference for Deep GPs April 26, 2016 20 / 50
  • 45. Implementation FITC Approximation for Single-Layer GP The Fully Independent Training Conditional Approximation (FITC) circumvents the O(N3) training time for a single-layer GP by introducing pseudo data, points that are not in the data set but can be chosen to approximate the function (Snelson and Ghahramani, 2005). We introduce M pseudo inputs ¯X = {¯xm}M m=1 and the corresponding pseudo outputs ¯y = {¯ym}M m=1, which correspond to the function values at the pseudo inputs. Keyon Vafa Training and Inference for Deep GPs April 26, 2016 20 / 50
  • 46. Implementation FITC Approximation for Single-Layer GP The Fully Independent Training Conditional Approximation (FITC) circumvents the O(N3) training time for a single-layer GP by introducing pseudo data, points that are not in the data set but can be chosen to approximate the function (Snelson and Ghahramani, 2005). We introduce M pseudo inputs ¯X = {¯xm}M m=1 and the corresponding pseudo outputs ¯y = {¯ym}M m=1, which correspond to the function values at the pseudo inputs. Key assumption: conditioned on the pseudo data, the output values are independent. Keyon Vafa Training and Inference for Deep GPs April 26, 2016 20 / 50
  • 47. Implementation FITC Approximation for Single-Layer GP We assume a prior P(¯y|¯X) = N (0, K¯X¯X) . Keyon Vafa Training and Inference for Deep GPs April 26, 2016 21 / 50
  • 48. Implementation FITC Approximation for Single-Layer GP We assume a prior P(¯y|¯X) = N (0, K¯X¯X) . Training takes time O(NM2), and testing requires O(M2). Keyon Vafa Training and Inference for Deep GPs April 26, 2016 21 / 50
  • 49. Implementation FITC Example x f(x) 5 Pseudo Parameters x f(x) 10 Pseudo Parameters Figure: The predictive mean of a GP trained on sigmoidal data using the FITC approximation. On the left, we use 5 pseudo data points, while on the right, we use 10. Keyon Vafa Training and Inference for Deep GPs April 26, 2016 22 / 50
  • 50. Implementation Learning Deep GPs is Intractable [Graphical model: $x_n \to h_n \to y_n$, with mappings $f$ (parameters $\theta^{(1)}$) and $g$ (parameters $\theta^{(2)}$).] Example: two-layer model, with inputs X, outputs y, and hidden layer H (which is N × 1). Ideally, a Bayesian treatment would allow us to integrate out the hidden function values to evaluate $P(y \mid X, \theta) = \int P(y \mid H, \theta^{(2)})\, P(H \mid X, \theta^{(1)})\, dH = \int \mathcal{N}(y; 0, K_{HH})\, \mathcal{N}(H; 0, K_{XX})\, dH$. Keyon Vafa Training and Inference for Deep GPs April 26, 2016 23 / 50
  • 51. Implementation Learning Deep GPs is Intractable [Graphical model: $x_n \to h_n \to y_n$, with mappings $f$ (parameters $\theta^{(1)}$) and $g$ (parameters $\theta^{(2)}$).] Example: two-layer model, with inputs X, outputs y, and hidden layer H (which is N × 1). Ideally, a Bayesian treatment would allow us to integrate out the hidden function values to evaluate $P(y \mid X, \theta) = \int P(y \mid H, \theta^{(2)})\, P(H \mid X, \theta^{(1)})\, dH = \int \mathcal{N}(y; 0, K_{HH})\, \mathcal{N}(H; 0, K_{XX})\, dH$. Evaluating this integral of Gaussians is intractable: $K_{HH}$ depends on H through the kernel, so the integrand has no closed-form marginal. Keyon Vafa Training and Inference for Deep GPs April 26, 2016 23 / 50
  • 52. Implementation DGPS Algorithm Overview The Deep Gaussian Process Sampling algorithm relies on two central ideas: Keyon Vafa Training and Inference for Deep GPs April 26, 2016 24 / 50
  • 53. Implementation DGPS Algorithm Overview The Deep Gaussian Process Sampling algorithm relies on two central ideas: We sample predictive means and covariances to approximate the marginal likelihood, relying on automatic differentiation techniques to evaluate the gradients and optimize our objective. Keyon Vafa Training and Inference for Deep GPs April 26, 2016 24 / 50
  • 54. Implementation DGPS Algorithm Overview The Deep Gaussian Process Sampling algorithm relies on two central ideas: We sample predictive means and covariances to approximate the marginal likelihood, relying on automatic differentiation techniques to evaluate the gradients and optimize our objective. We replace every GP with the FITC GP, so the time complexity for L layers and H hidden units per layer is O(N^2MLH) as opposed to O(N^3LH). Keyon Vafa Training and Inference for Deep GPs April 26, 2016 24 / 50
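The automatic-differentiation component can be illustrated in isolation. The sketch below is my own; it assumes the autograd package, which may or may not be what the thesis used. It computes gradients of a single-layer GP log marginal likelihood with respect to a log length-scale without any hand-derived derivatives, and takes plain gradient-ascent steps.

```python
import autograd.numpy as np
from autograd import grad

def log_marginal_likelihood(log_lengthscale, X, y, noise_var=0.01):
    """Single-layer GP log marginal likelihood (additive constant dropped)."""
    ell = np.exp(log_lengthscale)
    K = np.exp(-0.5 * (X[:, None] - X[None, :]) ** 2 / ell ** 2)
    K = K + noise_var * np.eye(len(X))
    L = np.linalg.cholesky(K)
    alpha = np.linalg.solve(K, y)
    return -0.5 * np.dot(y, alpha) - np.sum(np.log(np.diag(L)))

X = np.linspace(-2, 2, 50)
y = np.sign(X) + 0.1 * np.random.randn(50)       # noisy step data

objective_grad = grad(log_marginal_likelihood)   # gradient via automatic differentiation
log_ell = 0.0
for _ in range(100):                             # simple gradient ascent
    log_ell = log_ell + 0.01 * objective_grad(log_ell, X, y)
```

The idea in the DGPS setting is the same: the sampled objective is written as ordinary array code, and the gradients with respect to all parameters come from automatic differentiation.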
  • 55. Implementation Related Work Damianou and Lawrence (2013) also use the FITC approximation at every layer, but they perform inference with approximate variational marginalization. Subsequent methods (Hensman and Lawrence, 2014; Dai et al., 2015; Bui et al., 2016) also use variational approximations. Keyon Vafa Training and Inference for Deep GPs April 26, 2016 25 / 50
  • 56. Implementation Related Work Damianou and Lawrence (2013) also use the FITC approximation at every layer, but they perform inference with approximate variational marginalization. Subsequent methods (Hensman and Lawrence, 2014; Dai et al., 2015; Bui et al., 2016) also use variational approximations. These methods are able to integrate out the pseudo outputs at each layer, but they rely on integral approximations that restrict the kernel. Meanwhile, the DGPS uses Monte Carlo sampling, which is easier to implement, more intuitive to understand, and can extend easily to most kernels. Keyon Vafa Training and Inference for Deep GPs April 26, 2016 25 / 50
  • 57. Implementation Sampling Hidden Values For inputs X, we calculate the predictive mean and covariance for every unit in the first hidden layer. We then sample values from each predictive distribution Keyon Vafa Training and Inference for Deep GPs April 26, 2016 26 / 50
  • 58. Implementation Sampling Hidden Values For inputs X, we calculate the predictive mean and covariance for every unit in the first hidden layer. We then sample values from each predictive distribution For every hidden layer thereafter, we take the samples from the previous layer, calculate the predictive mean and covariance, and repeat sampling until the final layer. Keyon Vafa Training and Inference for Deep GPs April 26, 2016 26 / 50
  • 59. Implementation Sampling Hidden Values For inputs X, we calculate the predictive mean and covariance for every unit in the first hidden layer. We then sample values from each predictive distribution For every hidden layer thereafter, we take the samples from the previous layer, calculate the predictive mean and covariance, and repeat sampling until the final layer. We use K different samples $\{(\tilde{\mu}_k, \tilde{\Sigma}_k)\}_{k=1}^{K}$ to approximate the marginal likelihood: $P(y \mid X) \approx \frac{1}{K}\sum_{k=1}^{K} P(y \mid \tilde{\mu}_k, \tilde{\Sigma}_k) = \frac{1}{K}\sum_{k=1}^{K} \mathcal{N}(y; \tilde{\mu}_k, \tilde{\Sigma}_k)$ Keyon Vafa Training and Inference for Deep GPs April 26, 2016 26 / 50
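A sketch of this procedure, in my own notation: each layer is represented here by a function that returns a predictive mean and a diagonal variance, samples are propagated layer by layer, and the K Gaussian likelihoods of y are averaged on the log scale. The `layers` interface is an assumption made for illustration.

```python
import numpy as np
from scipy.stats import norm

def mc_log_marginal_likelihood(X, y, layers, n_samples=20):
    """`layers` is a list of functions mapping inputs to (mean, variance) arrays."""
    log_liks = []
    for _ in range(n_samples):
        values = X
        for predict in layers[:-1]:
            mean, var = predict(values)                       # predictive distribution
            values = mean + np.sqrt(var) * np.random.randn(*mean.shape)
        mean, var = layers[-1](values)                        # final-layer prediction
        log_liks.append(np.sum(norm.logpdf(y, mean, np.sqrt(var))))
    # log of the average likelihood over the K samples, computed stably
    return -np.log(n_samples) + np.logaddexp.reduce(log_liks)
```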
  • 60. Implementation FITC for Deep GPs To make fitting more scalable, we replace every GP in the model with a FITC GP Keyon Vafa Training and Inference for Deep GPs April 26, 2016 27 / 50
  • 61. Implementation FITC for Deep GPs To make fitting more scalable, we replace every GP in the model with a FITC GP For each GP, corresponding to hidden unit d in layer l, we introduce pseudo inputs $\bar{X}^{(l)}_d$ and corresponding pseudo outputs $\bar{y}^{(l)}_d$. Keyon Vafa Training and Inference for Deep GPs April 26, 2016 27 / 50
  • 62. Implementation FITC for Deep GPs To make fitting more scalable, we replace every GP in the model with a FITC GP For each GP, corresponding to hidden unit d in layer l, we introduce pseudo inputs $\bar{X}^{(l)}_d$ and corresponding pseudo outputs $\bar{y}^{(l)}_d$. With the addition of the pseudo data, we are required to learn the following set of parameters: $\Theta = \left\{ \left\{ \bar{X}^{(l)}_d, \bar{y}^{(l)}_d, \theta^{(l)}_d \right\}_{d=1}^{D^{(l)}} \right\}_{l=1}^{L}$. Keyon Vafa Training and Inference for Deep GPs April 26, 2016 27 / 50
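One possible (hypothetical) way to organize Θ in code is a nested structure with one entry per hidden unit per layer; the function and field names below are illustrative only and not taken from the thesis.

```python
import numpy as np

def init_params(layer_dims, n_pseudo, input_dim, seed=0):
    """Nested parameter container: params[l][d] holds unit d of layer l."""
    rng = np.random.RandomState(seed)
    params, d_in = [], input_dim
    for d_out in layer_dims:                                  # one entry per layer
        layer = [{
            "pseudo_inputs": rng.randn(n_pseudo, d_in),       # Xbar_d^(l), M x d_in
            "pseudo_outputs": rng.randn(n_pseudo),            # ybar_d^(l), length M
            "log_lengthscale": np.zeros(d_in),                # theta_d^(l)
        } for _ in range(d_out)]                              # one dict per hidden unit
        params.append(layer)
        d_in = d_out
    return params

params = init_params(layer_dims=[1, 1], n_pseudo=10, input_dim=1)  # two layers, one unit each
```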
  • 63. Implementation Example: DGPS Algorithm on 2 Layers [Graphical model: $X_n \to H_n \to y_n$, with $f$ parameterized by $(\bar{X}^{(1)}, \bar{y}^{(1)}, \theta^{(1)})$ and $g$ by $(\bar{X}^{(2)}, \bar{y}^{(2)}, \theta^{(2)})$.] Our goal is to learn Keyon Vafa Training and Inference for Deep GPs April 26, 2016 28 / 50
  • 64. Implementation Example: DGPS Algorithm on 2 Layers [Graphical model: $X_n \to H_n \to y_n$ via $f$ and $g$.] Our goal is to learn $\{(\bar{X}^{(l)}, \bar{y}^{(l)})\}_{l=1}^{2}$, the pseudo data for each layer Keyon Vafa Training and Inference for Deep GPs April 26, 2016 28 / 50
  • 65. Implementation Example: DGPS Algorithm on 2 Layers [Graphical model: $X_n \to H_n \to y_n$ via $f$ and $g$.] Our goal is to learn $\{(\bar{X}^{(l)}, \bar{y}^{(l)})\}_{l=1}^{2}$, the pseudo data for each layer $\theta^{(1)}$ and $\theta^{(2)}$, the kernel parameters for f and g Keyon Vafa Training and Inference for Deep GPs April 26, 2016 28 / 50
  • 66. Implementation Example: DGPS Algorithm on 2 Layers [Graphical model: $X_n \to H_n \to y_n$ via $f$ and $g$.] To sample values H from the hidden layer, we use the FITC approximation and assume $P(H \mid X, \bar{X}^{(1)}, \bar{y}^{(1)}) = \mathcal{N}(\mu^{(1)}, \Sigma^{(1)})$ Keyon Vafa Training and Inference for Deep GPs April 26, 2016 29 / 50
  • 67. Implementation Example: DGPS Algorithm on 2 Layers [Graphical model: $X_n \to H_n \to y_n$ via $f$ and $g$.] To sample values H from the hidden layer, we use the FITC approximation and assume $P(H \mid X, \bar{X}^{(1)}, \bar{y}^{(1)}) = \mathcal{N}(\mu^{(1)}, \Sigma^{(1)})$ where $\mu^{(1)} = K_{X\bar{X}^{(1)}} K_{\bar{X}^{(1)}\bar{X}^{(1)}}^{-1} \bar{y}^{(1)}$ and $\Sigma^{(1)} = \operatorname{diag}\!\left( K_{XX} - K_{X\bar{X}^{(1)}} K_{\bar{X}^{(1)}\bar{X}^{(1)}}^{-1} K_{\bar{X}^{(1)}X} \right)$. Keyon Vafa Training and Inference for Deep GPs April 26, 2016 29 / 50
  • 68. Implementation Example: DGPS Algorithm on 2 Layers [Graphical model: $X_n \to H_n \to y_n$ via $f$ and $g$.] To sample values H from the hidden layer, we use the FITC approximation and assume $P(H \mid X, \bar{X}^{(1)}, \bar{y}^{(1)}) = \mathcal{N}(\mu^{(1)}, \Sigma^{(1)})$ where $\mu^{(1)} = K_{X\bar{X}^{(1)}} K_{\bar{X}^{(1)}\bar{X}^{(1)}}^{-1} \bar{y}^{(1)}$ and $\Sigma^{(1)} = \operatorname{diag}\!\left( K_{XX} - K_{X\bar{X}^{(1)}} K_{\bar{X}^{(1)}\bar{X}^{(1)}}^{-1} K_{\bar{X}^{(1)}X} \right)$. We obtain K samples $\{\tilde{H}_k\}_{k=1}^{K}$ from the above distribution. Keyon Vafa Training and Inference for Deep GPs April 26, 2016 29 / 50
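A sketch of this sampling step in my own notation: given $\mu^{(1)}$ and the diagonal of $\Sigma^{(1)}$, the K hidden-value vectors can be drawn via the reparameterization H = mu + sqrt(var) * eps with standard normal eps, one natural choice because it keeps the samples differentiable with respect to the parameters.

```python
import numpy as np

def sample_hidden(mu1, sigma1_diag, n_samples, rng=np.random):
    """Draw K reparameterized samples of the hidden values (shape K x N)."""
    eps = rng.randn(n_samples, len(mu1))                        # K x N standard normal draws
    return mu1[None, :] + np.sqrt(sigma1_diag)[None, :] * eps   # K samples of H
```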
  • 69. Implementation Example: DGPS Algorithm on 2 Layers [Graphical model: $X_n \to H_n \to y_n$ via $f$ and $g$.] For each sample $\tilde{H}_k$, we can approximate $P(y \mid \tilde{H}_k, \bar{X}^{(2)}, \bar{y}^{(2)}) \approx \mathcal{N}(\tilde{\mu}^{(2)}, \tilde{\Sigma}^{(2)})$ Keyon Vafa Training and Inference for Deep GPs April 26, 2016 30 / 50
  • 70. Implementation Example: DGPS Algorithm on 2 Layers [Graphical model: $X_n \to H_n \to y_n$ via $f$ and $g$.] For each sample $\tilde{H}_k$, we can approximate $P(y \mid \tilde{H}_k, \bar{X}^{(2)}, \bar{y}^{(2)}) \approx \mathcal{N}(\tilde{\mu}^{(2)}, \tilde{\Sigma}^{(2)})$ where $\tilde{\mu}^{(2)} = K_{\tilde{H}_k\bar{X}^{(2)}} K_{\bar{X}^{(2)}\bar{X}^{(2)}}^{-1} \bar{y}^{(2)}$ and $\tilde{\Sigma}^{(2)} = \operatorname{diag}\!\left( K_{\tilde{H}_k\tilde{H}_k} - K_{\tilde{H}_k\bar{X}^{(2)}} K_{\bar{X}^{(2)}\bar{X}^{(2)}}^{-1} K_{\bar{X}^{(2)}\tilde{H}_k} \right)$. Keyon Vafa Training and Inference for Deep GPs April 26, 2016 30 / 50
  • 71. Implementation Example: DGPS Algorithm on 2 Layers Thus, we can approximate the marginal likelihood with our samples: $P(y \mid X, \Theta) \approx \frac{1}{K} \sum_{k=1}^{K} P(y \mid \tilde{H}_k, \bar{X}^{(2)}, \bar{y}^{(2)})$. Keyon Vafa Training and Inference for Deep GPs April 26, 2016 31 / 50
  • 72. Implementation Example: DGPS Algorithm on 2 Layers Thus, we can approximate the marginal likelihood with our samples: $P(y \mid X, \Theta) \approx \frac{1}{K} \sum_{k=1}^{K} P(y \mid \tilde{H}_k, \bar{X}^{(2)}, \bar{y}^{(2)})$. Incorporating the prior over the pseudo outputs into our objective, we have: $\mathcal{L}(y \mid X, \Theta) = \log P(y \mid X, \Theta) + \sum_{l=1}^{L} \sum_{d=1}^{D^{(l)}} \log P\!\left( \bar{y}^{(l)}_d \mid \bar{X}^{(l)}_d \right)$. Keyon Vafa Training and Inference for Deep GPs April 26, 2016 31 / 50
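The extra prior terms can be sketched as follows (my own notation, reusing the nested parameter structure from the earlier sketch; the kernel function is supplied by the caller): for each layer l and hidden unit d we add $\log \mathcal{N}(\bar{y}^{(l)}_d; 0, K_{\bar{X}^{(l)}_d\bar{X}^{(l)}_d})$ to the sampled log marginal likelihood.

```python
import numpy as np
from scipy.stats import multivariate_normal as mvn

def log_pseudo_output_prior(params, kernel):
    """Sum of log N(ybar_d^(l); 0, K_XbarXbar) over all layers and hidden units."""
    total = 0.0
    for layer in params:
        for unit in layer:
            Kmm = kernel(unit["pseudo_inputs"], unit["pseudo_inputs"])
            Kmm = Kmm + 1e-6 * np.eye(len(Kmm))          # jitter for numerical stability
            total += mvn.logpdf(unit["pseudo_outputs"], np.zeros(len(Kmm)), Kmm)
    return total

# Full objective ~ sampled log marginal likelihood + log_pseudo_output_prior(params, kernel).
```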
  • 73. Experiments and Analysis Table of Contents 1 Gaussian Processes 2 Deep Gaussian Processes 3 Implementation 4 Experiments and Analysis 5 Conclusion Keyon Vafa Training and Inference for Deep GPs April 26, 2016 32 / 50
  • 74. Experiments and Analysis Step Function We test on a step function with noise: $X \in [-2, 2]$, $y_i = \operatorname{sign}(x_i) + \epsilon_i$, where $\epsilon_i \sim \mathcal{N}(0, 0.01)$. Keyon Vafa Training and Inference for Deep GPs April 26, 2016 33 / 50
  • 75. Experiments and Analysis Step Function We test on a step function with noise: $X \in [-2, 2]$, $y_i = \operatorname{sign}(x_i) + \epsilon_i$, where $\epsilon_i \sim \mathcal{N}(0, 0.01)$. The non-stationarity of a step function is appealing from a deep GP perspective. Keyon Vafa Training and Inference for Deep GPs April 26, 2016 33 / 50
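Generating this data set is straightforward; the sketch below assumes uniformly drawn inputs (an assumption on my part) and uses the 80/20 train/test split mentioned later in the experiments.

```python
import numpy as np

rng = np.random.RandomState(0)
N = 100
x = rng.uniform(-2, 2, size=N)                    # inputs on [-2, 2]
y = np.sign(x) + np.sqrt(0.01) * rng.randn(N)     # y_i = sign(x_i) + eps_i, eps_i ~ N(0, 0.01)

# 80/20 train/test split, as in the experiments
n_train = int(0.8 * N)
x_train, x_test = x[:n_train], x[n_train:]
y_train, y_test = y[:n_train], y[n_train:]
```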
  • 76. Experiments and Analysis Step Function Samples from a Single-Layer GP Figure: Functions sampled from a single-layer GP. Evidently, the predictive draws do not fully capture the shape of the step function. Keyon Vafa Training and Inference for Deep GPs April 26, 2016 34 / 50
  • 77. Experiments and Analysis Step Function Figure: Predictive draws from a single-layer GP and a two-layer deep GP. Keyon Vafa Training and Inference for Deep GPs April 26, 2016 35 / 50
  • 78. Experiments and Analysis Step Function Figure: Predictive draws from a three-layer deep GP. Keyon Vafa Training and Inference for Deep GPs April 26, 2016 36 / 50
  • 79. Experiments and Analysis Step Function [Figure: Impact of Parameter Initializations on Predictive Draws; panels compare a random initialization and a smart initialization, each showing the predictive draws f(x) and the learned hidden values.] Keyon Vafa Training and Inference for Deep GPs April 26, 2016 37 / 50
  • 80. Experiments and Analysis Step Function [Figure panels: test mean squared error and test log-likelihood per data point vs. number of layers (1, 2, 3), for 50, 100, and 200 data points.] Figure: Experimental results measuring the test log-likelihood per data point and test mean squared error on the noisy step function. We vary the number of layers used in the model, along with the number of data points used in the original step function (which is divided 80/20 into train/test). We run 10 trials at each combination. Keyon Vafa Training and Inference for Deep GPs April 26, 2016 38 / 50
  • 81. Experiments and Analysis Step Function Occasionally, models with deeper architectures outperform shallower ones, yet they also show the widest spread across trials and the single worst results. Figure: Test-set log-likelihoods per data point and mean squared errors plotted against their training-set counterparts for the step function experiment. Overfitting does not appear to be a problem. Keyon Vafa Training and Inference for Deep GPs April 26, 2016 39 / 50
  • 82. Experiments and Analysis Step Function Overfitting does not appear to be a problem. Keyon Vafa Training and Inference for Deep GPs April 26, 2016 40 / 50
  • 83. Experiments and Analysis Step Function Overfitting does not appear to be a problem. If we can successfully optimize our objective, deeper architectures are better suited to learning the noisy step function than shallower ones. Keyon Vafa Training and Inference for Deep GPs April 26, 2016 40 / 50
  • 84. Experiments and Analysis Step Function Overfitting does not appear to be a problem. If we can successfully optimize our objective, deeper architectures are better suited to learning the noisy step function than shallower ones. However, training and optimization become more difficult as the number of layers grows and the number of parameters increases. Keyon Vafa Training and Inference for Deep GPs April 26, 2016 40 / 50
  • 85. Experiments and Analysis Step Function [Figure panels: Random Seed 66 and Random Seed 0, each showing the predictive draws f(x) together with the two layers of learned hidden values.] Figure: Predictive draws from two identical three-layer models, albeit with different random parameter initializations. Keyon Vafa Training and Inference for Deep GPs April 26, 2016 41 / 50
  • 86. Experiments and Analysis Step Function Ways to combat optimization challenges: Keyon Vafa Training and Inference for Deep GPs April 26, 2016 42 / 50
  • 87. Experiments and Analysis Step Function Ways to combat optimization challenges: Using random restarts Keyon Vafa Training and Inference for Deep GPs April 26, 2016 42 / 50
  • 88. Experiments and Analysis Step Function Ways to combat optimization challenges: Using random restarts Decreasing the number of model parameters Keyon Vafa Training and Inference for Deep GPs April 26, 2016 42 / 50
  • 89. Experiments and Analysis Step Function Ways to combat optimization challenges: Using random restarts Decreasing the number of model parameters Trying different optimization methods Keyon Vafa Training and Inference for Deep GPs April 26, 2016 42 / 50
  • 90. Experiments and Analysis Step Function Ways to combat optimization challenges: Using random restarts (see the sketch after this list) Decreasing the number of model parameters Trying different optimization methods Experimenting with more diverse architectures, e.g. increasing the dimension of the hidden layer Keyon Vafa Training and Inference for Deep GPs April 26, 2016 42 / 50
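A minimal sketch of the random-restart strategy referenced above; `train_fn` and `objective_fn` are hypothetical placeholders for training a deep GP from a given random state and for scoring the learned parameters.

```python
import numpy as np

def train_with_restarts(train_fn, objective_fn, n_restarts=5, seed=0):
    """Train from several random initializations and keep the best result."""
    best_params, best_obj = None, -np.inf
    for r in range(n_restarts):
        params = train_fn(np.random.RandomState(seed + r))   # fresh initialization
        obj = objective_fn(params)                           # e.g. training objective
        if obj > best_obj:
            best_params, best_obj = params, obj
    return best_params
```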
  • 91. Experiments and Analysis Toy Non-Stationary Data We create toy non-stationary data to evaluate a deep GP’s ability to learn a non-stationary function. Keyon Vafa Training and Inference for Deep GPs April 26, 2016 43 / 50
  • 92. Experiments and Analysis Toy Non-Stationary Data We create toy non-stationary data to evaluate a deep GP’s ability to learn a non-stationary function. We divide the input space into three regions: X1 ∈ [−4, −3], X2 ∈ [−1, 1] and X3 ∈ [2, 4], each of which consists of 40 data points. Keyon Vafa Training and Inference for Deep GPs April 26, 2016 43 / 50
  • 93. Experiments and Analysis Toy Non-Stationary Data We create toy non-stationary data to evaluate a deep GP’s ability to learn a non-stationary function. We divide the input space into three regions: X1 ∈ [−4, −3], X2 ∈ [−1, 1] and X3 ∈ [2, 4], each of which consists of 40 data points. We sample from a GP with length-scale ℓ = 0.25 for regions X1 and X3, and with ℓ = 2 for region X2. Keyon Vafa Training and Inference for Deep GPs April 26, 2016 43 / 50
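A sketch of this data-generation procedure; the uniform placement of the 40 inputs per region and the random seed are my assumptions.

```python
import numpy as np

def se_kernel(a, b, ell):
    return np.exp(-0.5 * (a[:, None] - b[None, :]) ** 2 / ell ** 2)

def gp_sample(x, ell, rng, jitter=1e-6):
    """One draw from a zero-mean GP prior with length-scale ell at inputs x."""
    K = se_kernel(x, x, ell) + jitter * np.eye(len(x))
    return np.linalg.cholesky(K) @ rng.randn(len(x))

rng = np.random.RandomState(0)
regions = [(-4, -3, 0.25), (-1, 1, 2.0), (2, 4, 0.25)]   # (low, high, length-scale)
xs, ys = [], []
for lo, hi, ell in regions:
    xr = np.sort(rng.uniform(lo, hi, 40))                # 40 points per region
    xs.append(xr)
    ys.append(gp_sample(xr, ell, rng))
x, y = np.concatenate(xs), np.concatenate(ys)
```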
  • 94. Experiments and Analysis Toy Non-Stationary Data [Figure: Predictive Draws for Toy Non-Stationary Data; panels show the data, the 1-layer GP, and the 2-layer deep GP together with its hidden values.] Figure: Predictive draws from the single-layer and 2-layer models for toy non-stationary data with squared exponential kernels. Keyon Vafa Training and Inference for Deep GPs April 26, 2016 44 / 50
  • 95. Experiments and Analysis Toy Non-Stationary Data Non-Stationary Data: 3-Layer Deep GP Figure: The optimization for a 3-layer model can get stuck in a local optimum, and although the predictive draws are non-stationary, our predictions are poor at the tails. Keyon Vafa Training and Inference for Deep GPs April 26, 2016 45 / 50
  • 96. Experiments and Analysis Motorcycle Data 94 data points, where the inputs are times in milliseconds since the impact of a motorcycle accident and the outputs are the corresponding helmet accelerations. Keyon Vafa Training and Inference for Deep GPs April 26, 2016 46 / 50
  • 97. Experiments and Analysis Motorcycle Data 94 data points, where the inputs are times in milliseconds since the impact of a motorcycle accident and the outputs are the corresponding helmet accelerations. The dataset is somewhat non-stationary: the accelerations are roughly constant early on but become more variable after a certain time. Keyon Vafa Training and Inference for Deep GPs April 26, 2016 46 / 50
  • 98. Experiments and Analysis Motorcycle Data [Figure: Predictive Draws for Motorcycle Data; panels show the data, the 1-layer GP, and the 2-layer deep GP together with its hidden values (time vs. acceleration).] Figure: Predictive draws from the single-layer and 2-layer models trained on motorcycle data with squared exponential kernels. Keyon Vafa Training and Inference for Deep GPs April 26, 2016 47 / 50
  • 99. Conclusion Table of Contents 1 Gaussian Processes 2 Deep Gaussian Processes 3 Implementation 4 Experiments and Analysis 5 Conclusion Keyon Vafa Training and Inference for Deep GPs April 26, 2016 48 / 50
  • 100. Conclusion Future Directions Natural extensions include Keyon Vafa Training and Inference for Deep GPs April 26, 2016 49 / 50
  • 101. Conclusion Future Directions Natural extensions include Trying different optimization methods to avoid getting stuck in local optima Keyon Vafa Training and Inference for Deep GPs April 26, 2016 49 / 50
  • 102. Conclusion Future Directions Natural extensions include Trying different optimization methods to avoid getting stuck in local optima Introducing variational parameters so we do not have to learn pseudo outputs Keyon Vafa Training and Inference for Deep GPs April 26, 2016 49 / 50
  • 103. Conclusion Future Directions Natural extensions include Trying different optimization methods to avoid getting stuck in local optima Introducing variational parameters so we do not have to learn pseudo outputs Extending model to classification Keyon Vafa Training and Inference for Deep GPs April 26, 2016 49 / 50
  • 104. Conclusion Future Directions Natural extensions include Trying different optimization methods to avoid getting stuck in local optima Introducing variational parameters so we do not have to learn pseudo outputs Extending model to classification Exploring properties of more complex architectures and evaluating the model likelihood to choose the optimal configuration Keyon Vafa Training and Inference for Deep GPs April 26, 2016 49 / 50
  • 105. Conclusion Acknowledgments A huge thank-you to Sasha Rush, Finale Doshi-Velez, David Duvenaud, and Miguel Hernández-Lobato. This thesis would not have been possible without your help and support! Keyon Vafa Training and Inference for Deep GPs April 26, 2016 50 / 50