Reducing the Dimensionality of Data
with Neural Networks
@St_Hakky
Geoffrey E. Hinton; R. R. Salakhutdinov (2006-07-28). “Reducing the
Dimensionality of Data with Neural Networks”. Science 313 (5786)
Dimensionality Reduction
• Dimensionality Reduction facilitates…
• Classification
• Visualization
• Communication
• Storage of high-dimensional data
Principal Components Analysis
• PCA (Principal Components Analysis)
• A simple and widely used method
• Finds the directions of greatest variance in the data set
• Represents each data point by its coordinates along each
of these directions
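A minimal NumPy sketch of this idea (the data matrix X and the number of components k are illustrative assumptions):

```python
import numpy as np

def pca(X, k):
    """Project each row of X onto the k directions of greatest variance."""
    X_centered = X - X.mean(axis=0)
    # Rows of Vt are unit-length principal directions, ordered by variance.
    U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)
    components = Vt[:k]                    # (k, n_features)
    codes = X_centered @ components.T      # coordinates along each direction
    reconstruction = codes @ components + X.mean(axis=0)
    return codes, reconstruction
```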
“Encoder” and “Decoder” Network
• This paper describes a nonlinear generalization of PCA: the autoencoder
• It uses an adaptive, multilayer “encoder” network to transform the high-dimensional data into a low-dimensional code
• and a similar “decoder” network to recover the data from the code
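As an illustration only (not the paper's exact architecture), a small encoder/decoder pair might look like the following PyTorch sketch; the layer sizes are assumptions:

```python
import torch
import torch.nn as nn

class AutoEncoder(nn.Module):
    def __init__(self, n_input=784, n_hidden=256, n_code=6):
        super().__init__()
        # Encoder: high-dimensional input -> low-dimensional code.
        self.encoder = nn.Sequential(
            nn.Linear(n_input, n_hidden), nn.Sigmoid(),
            nn.Linear(n_hidden, n_code),        # linear code layer
        )
        # Decoder: code -> reconstruction of the input.
        self.decoder = nn.Sequential(
            nn.Linear(n_code, n_hidden), nn.Sigmoid(),
            nn.Linear(n_hidden, n_input), nn.Sigmoid(),
        )

    def forward(self, x):
        code = self.encoder(x)
        return self.decoder(code), code
```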
AutoEncoder
[Figure: Input → Encoder → Code → Decoder → Output]
AutoEncoder
[Figure: input layer (input data) → hidden layers performing the dimensionality reduction → output layer (reconstructed data)]
How to train the AutoEncoder
・ Start with random weights in the two networks.
[Figure: the same encoder-decoder network as on the previous slide]
・ The networks are trained by minimizing the discrepancy between the original data and its reconstruction.
・ Gradients are obtained by using the chain rule to back-propagate the error from the decoder network to the encoder network.
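A minimal training-loop sketch of this procedure, assuming the AutoEncoder class sketched earlier and a data loader `loader` yielding batches of flattened images in [0, 1]:

```python
model = AutoEncoder()                     # weights start at random values
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
loss_fn = nn.MSELoss()                    # discrepancy between data and reconstruction

for epoch in range(10):
    for x in loader:                      # x: (batch_size, 784)
        recon, _ = model(x)
        loss = loss_fn(recon, x)
        optimizer.zero_grad()
        loss.backward()                   # chain rule: error back-propagated
        optimizer.step()                  #   from the decoder to the encoder
```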
It is difficult to optimize multilayer autoencoders
• It is difficult to optimize the weights in nonlinear autoencoders that have multiple hidden layers (2-4).
• With large initial weights:
• autoencoders typically find poor local minima
• With small initial weights:
• the gradients in the early layers are tiny, making it infeasible to
train autoencoders with many hidden layers
• If the initial weights are close to a good solution, gradient descent works well. However, finding such initial weights is very difficult.
Pretraining
• This paper introduces a “pretraining” procedure for binary data, generalizes it to real-valued data, and shows that it works well for a variety of data sets.
Restricted Boltzmann Machine (RBM)
[Figure: an RBM with visible units 𝑣𝑖, hidden units ℎ𝑗, biases 𝑏𝑖, 𝑏𝑗, and weights 𝑤𝑖𝑗]
The input data correspond to the “visible” units of the RBM and the feature detectors correspond to the “hidden” units.
A joint configuration (𝑣, ℎ) of the visible and hidden units has an energy given by equation (1).
The network assigns a probability to every possible data vector via this energy function.
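For reference, the energy of equation (1) in the paper, and the probability the network assigns through it (up to normalization), take the form:

```latex
E(v, h) = -\sum_{i} b_i v_i - \sum_{j} b_j h_j - \sum_{i,j} v_i h_j w_{ij},
\qquad
p(v) \propto \sum_{h} e^{-E(v, h)}
```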
Pretraining consists of learning a stack of RBMs
・ The first layer of feature detectors then becomes the visible units for learning the next RBM.
・ This layer-by-layer learning can
be repeated as many times as
desired.
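A rough Python/PyTorch sketch of one contrastive-divergence (CD-1) weight update for a single RBM layer, whose hidden activities then serve as the visible data for the next RBM; the learning rate and tensor shapes are illustrative assumptions, not the paper's settings:

```python
import torch

def cd1_step(v0, W, b_vis, b_hid, lr=0.1):
    """One CD-1 update; v0: (batch, n_visible), W: (n_visible, n_hidden)."""
    # Up pass: hidden probabilities and a binary sample given the data.
    p_h0 = torch.sigmoid(v0 @ W + b_hid)
    h0 = torch.bernoulli(p_h0)
    # Down and up again: a one-step reconstruction of the data.
    p_v1 = torch.sigmoid(h0 @ W.t() + b_vis)
    p_h1 = torch.sigmoid(p_v1 @ W + b_hid)
    # Raise the probability of the data, lower that of the reconstruction.
    batch = v0.shape[0]
    W     += lr * (v0.t() @ p_h0 - p_v1.t() @ p_h1) / batch
    b_vis += lr * (v0 - p_v1).mean(dim=0)
    b_hid += lr * (p_h0 - p_h1).mean(dim=0)
    return p_h0        # hidden activities: training data for the next RBM
```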
Experiment(2-A)
Used AutoEncoder’s Network
Encoder: 784 (28 × 28) → 400 → 200 → 100 → 50 → 25 → 6
Decoder: 6 → 25 → 50 → 100 → 200 → 400 → 784 (28 × 28)
The function of each layer: the six units in the code layer were linear and all the other units were logistic.
Data: the network was trained on 20,000 images and tested on 10,000 new images.
Observed Results: the autoencoder discovered how to convert each 784-pixel image into six real numbers that allow almost perfect reconstruction.
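A sketch of this 784-400-200-100-50-25-6 network in PyTorch (logistic units everywhere except the linear 6-unit code layer, with a mirrored decoder); the layer-wise RBM pretraining and fine-tuning details are omitted:

```python
import torch.nn as nn

sizes = [784, 400, 200, 100, 50, 25, 6]

encoder, decoder = [], []
for n_in, n_out in zip(sizes[:-1], sizes[1:]):
    encoder.append(nn.Linear(n_in, n_out))
    if n_out != sizes[-1]:
        encoder.append(nn.Sigmoid())       # logistic units; code layer stays linear
for n_in, n_out in zip(sizes[::-1][:-1], sizes[::-1][1:]):
    decoder += [nn.Linear(n_in, n_out), nn.Sigmoid()]

autoencoder_2a = nn.Sequential(*encoder, *decoder)
```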
Experiment(2-A)
[Figure, rows from top to bottom:]
(1) Random samples of curves from the test data set
(2) Reconstructions produced by the six-dimensional deep autoencoder
(3) Reconstructions by logistic PCA using six components
(4) Reconstructions by logistic PCA using 18 components
(5) Reconstructions by standard PCA using 18 components
The average squared error per image for the last four rows is 1.44, 7.64, 2.45, and 5.90.
Experiment(2-B)
Used AutoEncoder’s Network
Encoder: 784 → 1000 → 500 → 250 → 30
Decoder: 30 → 250 → 500 → 1000 → 784
The function of each layer: the 30 units in the code layer were linear and all the other units were logistic.
Data: the network was trained on 60,000 images and tested on 10,000 new images.
Experiment(2-B): MNIST
[Figure, rows from top to bottom:]
(1) A random test image from each class
(2) Reconstructions by the 30-dimensional autoencoder
(3) Reconstructions by 30-dimensional logistic PCA
(4) Reconstructions by standard PCA
The average squared errors for the last three rows are 3.00, 8.01, and 13.87.
Experiment(2-B)
A two-dimensional autoencoder produced a better visualization of the data than
did the first two principal components.
(A) The two-dimensional codes for 500
digits of each class produced by taking
the first two principal components of
all 60,000 training images.
(B) The two-dimensional codes found by a 784-1000-500-250-2 autoencoder.
Experiment(2-C)
Used AutoEncoder’s Network
Encoder: 625 (25 × 25) → 2000 → 1000 → 500 → 30
Decoder: 30 → 500 → 1000 → 2000 → 625
The function of each layer: the 30 units in the code layer were linear and all the other units were logistic.
Data: Olivetti face data set
Observed Results: the autoencoder clearly outperformed PCA.
Experiment(2-C)
[Figure, rows from top to bottom:]
(1) Random samples from the test data set
(2) Reconstructions by the 30-dimensional autoencoder
(3) Reconstructions by 30-dimensional PCA
The average squared errors are 126 and 135.
Conclusion
• It has been obvious since the 1980s that backpropagation through deep autoencoders would be very effective for nonlinear dimensionality reduction, provided that…
• Computers were fast enough
• Data sets were big enough
• The initial weights were close enough to a good solution.
Conclusion
• Autoencoders give mappings in both directions
between the data and code spaces.
• They can be applied to very large data sets, because both the pretraining and the fine-tuning scale linearly in time and space with the number of training cases.