Masayuki Tanaka
Aug. 17, 2015
Back-Propagation Algorithm for
Deep Neural Networks and
Contrastive Divergence Learning for
Restricted Boltzmann Machine
Outline
1. Examples of Deep Learning
2. RBM to Deep NN
3. Deep Neural Network (Deep NN)
– Back-Propagation (Supervised Learning)
4. Restricted Boltzmann Machine (RBM)
– Mathematics, Probabilistic Model and Inference Model
– Pre-training by Contrastive Divergence Learning
(Unsupervised Learning)
5. Inference Model with Distribution
1
http://bit.ly/dnnicpr2014
Deep learning
2
– MNIST (handwritten digits benchmark)
MNIST
 Top performance in character recognition
Deep learning
3
CIFAR10
– CIFAR (image classification benchmark)
 Top performance in image classification
Deep learning
4
GoogLeNet, ILSVRC2014 (layer legend: Convolution, Pooling, Softmax, Other)
ImageNet Large Scale Visual Recognition Challenge (ILSVRC)
 Top performance in visual recognition
Deep learning
5
http://static.googleusercontent.com/external_content/untrusted_dlcp/research.google.com/en//archive/unsupervised_icml2012.pdf
Automatic learning from YouTube videos: a neuron for human faces and a neuron for cats.
10,000,000 training samples
Three days of learning with 1,000 computers
 "Cat neuron"
Deep??
6
Input
layer
Output
layer
(Shallow) NN
Input
layer
Output
layer
Deep NN
NN: Neural network
Pros and Cons of Deep NN
7
Input
layer
Output
layer
Deep NN
Until a few years ago…
1. Tends to overfit
2. The learning signal does not reach the lower layers
・Pre-training with RBM
・Big data
ImageNet: more than 1.5 M labeled images
http://www.image-net.org/
Labeled Faces in the Wild: more than 10,000 face images
http://vis-www.cs.umass.edu/lfw/
High-performance network
Outline
1. Examples of Deep NNs
2. RBM to Deep NN
3. Deep Neural Network (Deep NN)
– Back-Propagation (Supervised Learning)
4. Restricted Boltzmann Machine (RBM)
– Mathematics, Probabilistic Model and Inference Model
– Pre-training by Contrastive Divergence Learning
(Unsupervised Learning)
5. Inference Model with Distribution
8
http://bit.ly/dnnicpr2014
Single Layer Neural Network

 Single node output
Input layer $v_1, v_2, v_3$ with weights $w_1, w_2, w_3$; output node $h$:
$h = \sigma\left( \sum_i w_i v_i + b \right)$
Sigmoid function: $\sigma(x) = \dfrac{1}{1 + e^{-x}}$
(Plot: sigmoid function, rising from 0 to 1 over $x \in [-6, 6]$)

 Multiple nodes output (Single Layer NN)
Input layer $\mathbf{v}$, output layer $\mathbf{h}$:
$h_j = \sigma\left( \sum_i w_{ij} v_i + b_j \right)$
Vector representation of the single layer NN: $\mathbf{h} = \sigma\left( \mathbf{W}^T \mathbf{v} + \mathbf{b} \right)$
It is equivalent to the inference model of the RBM.
Weighted sum and Activation functions

Input layer $v_1, v_2, v_3$ with weights $w_1, w_2, w_3$; output $h$:
$h = f\left( \sum_i w_i v_i + b \right)$
Separating the weighted sum $n$ and the activation function $f$:
$n = \sum_i w_i v_i + b, \qquad h = f(n)$

Sigmoid function: $f(x) = \sigma(x) = \dfrac{1}{1 + e^{-x}}$
(Plot: sigmoid over $x \in [-6, 6]$)

Rectified linear unit: $f(x) = \mathrm{ReLU}(x) = \begin{cases} 0 & (x < 0) \\ x & (x \ge 0) \end{cases}$
(Plot: ReLU over $x \in [-6, 6]$)
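The single-layer inference above maps directly to a few lines of code. The following NumPy sketch is an illustration only; the helper names (`sigmoid`, `relu`, `single_layer`) and the array shapes are assumptions, not taken from the slides.

```python
# Minimal sketch of the single-layer inference h = f(W^T v + b)
# with the two activation functions described above.
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def relu(x):
    return np.maximum(0.0, x)

def single_layer(v, W, b, f=sigmoid):
    """Weighted sum n = W^T v + b followed by the activation f."""
    n = W.T @ v + b
    return f(n)

# Usage example with random parameters.
rng = np.random.default_rng(0)
v = rng.random(3)                    # input layer (3 nodes)
W = rng.standard_normal((3, 4))      # weights, shape (inputs, outputs)
b = np.zeros(4)                      # biases
print(single_layer(v, W, b))         # sigmoid outputs
print(single_layer(v, W, b, relu))   # ReLU outputs
```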
Single layer NN to Deep NN
11
The deep NN is built up by stacking single layer NNs.
1st NN
2nd NN
k-th NN
Output data
Input NN
The output of the single layer NN
will be the input of the next single
layer NN.
The output data of the deep NN
is inferred by iterating the
process.
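The stacking described above is just an iterated single-layer inference. A minimal sketch, assuming the `sigmoid` helper and random shapes as illustrations (not from the slides):

```python
# Stacked (deep) inference: the output of each single-layer NN becomes
# the input of the next one.
import numpy as np

def deep_forward(v, weights, biases, f):
    """Iterate the single-layer inference over all layers."""
    h = v
    for W, b in zip(weights, biases):
        h = f(W.T @ h + b)
    return h

sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))
rng = np.random.default_rng(0)
sizes = [3, 5, 4, 2]   # input layer, two hidden layers, output layer
weights = [rng.standard_normal((m, n)) * 0.1 for m, n in zip(sizes[:-1], sizes[1:])]
biases = [np.zeros(n) for n in sizes[1:]]
print(deep_forward(rng.random(3), weights, biases, sigmoid))
```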
Parameter estimation for the deep NN

Parameters are estimated by a gradient descent algorithm that minimizes the
difference between the output data and the teacher data.
(Illustration: gradient descent on a 1-D objective $y(x)$, iterates $x_0 \to x_1 \to x_2$)
The deep NN is built up by stacking single layer NNs.
1st NN
2nd NN
k-th NN
Teach data
Input NN
Parameter estimation for the deep NN

Parameters are estimated by a gradient descent algorithm that minimizes the
difference between the output data and the teacher data.
The deep NN is built up by stacking single layer NNs.
1st NN
2nd NN
k-th NN
Teacher data
Input NN
Back-propagation:
The gradients can be calculated by propagating the information backward.
Why is the pre-training necessary?

1st NN
2nd NN
k-th NN
Teacher data
Input NN
The back-propagation calculates the gradients from the output layer toward the input layer.
The back-propagated information can hardly reach the deep layers.
The deep layers (1st layer, 2nd layer, …) are therefore better learned by unsupervised learning:
pre-training with the RBMs.
Pre-training with RBMs
15
1st NN
2nd NN
k-th NN
Input data
Single layer NN
RBM
Data
The inference of the single layer NN is mathematically equivalent to the inference of the RBM.
The RBM parameters are estimated by a maximum-likelihood algorithm from the given training data.
Pre-training and fine-tuning
Training data
Output data
Pre-training for
1st layer RBM
Training data
Output data
Pre-training for
2nd layer RBM
Input data
Teach data Back
propagation
copy
copy
copy
Fine-tuning of deep NN
Pre-training with RBMs
Feature vector extraction
Training data
Output data
Pre-training for
1st layer RBM
Training data
Output data
Pre-training for
2nd layer RBM
Input data
Feature
copy
copy
copy
Pre-training with RBMs
Outline
1. Examples of Deep NNs
2. RBM to Deep NN
3. Deep Neural Network (Deep NN)
– Back-Propagation (Supervised Learning)
4. Restricted Boltzmann Machine (RBM)
– Mathematics, Probabilistic Model and Inference Model
– Pre-training by Contrastive Divergence Learning
(Unsupervised Learning)
5. Inference Model with Distribution
18
http://bit.ly/dnnicpr2014
Back-Propagation Algorithm

Input data → deep NN → output data; teacher data; back propagation.
Vector representation of the single layer NN: $\mathbf{h} = \sigma\left( \mathbf{W}^T \mathbf{v} + \mathbf{b} \right)$

The goal of learning:
the weights W and biases b of each layer are estimated so that the differences between
the output data and the teacher data are minimized.

Objective function: $I = \dfrac{1}{2} \sum_k \left( h_k^{(L)} - t_k \right)^2$

Efficient calculation of the gradient $\dfrac{\partial I}{\partial \mathbf{W}^{(\ell)}}$ is important.
The back-propagation algorithm is an efficient algorithm to calculate these gradients.
Back-Propagation: Gradient of the sigmoid function

 Sigmoid function: $\sigma(x) = \dfrac{1}{1 + e^{-x}}$
 Gradient of the sigmoid function: $\dfrac{\partial \sigma}{\partial x} = \left( 1 - \sigma(x) \right) \sigma(x)$

 Derivation of the gradient of the sigmoid function:
$\dfrac{\partial \sigma}{\partial x}
 = \dfrac{\partial}{\partial x} \dfrac{1}{1 + e^{-x}}
 = -\dfrac{1}{\left( 1 + e^{-x} \right)^2} \times \left( -e^{-x} \right)
 = \dfrac{e^{-x}}{\left( 1 + e^{-x} \right)^2}
 = \dfrac{e^{-x}}{1 + e^{-x}} \times \dfrac{1}{1 + e^{-x}}
 = \left( 1 - \dfrac{1}{1 + e^{-x}} \right) \dfrac{1}{1 + e^{-x}}
 = \left( 1 - \sigma(x) \right) \sigma(x)$
Back-Propagation: Simplification

 Single layer NN: input layer $\mathbf{v}$, output layer $\mathbf{h}$
$h_j = \sigma\left( \sum_i w_{ij} v_i + b_j \right), \qquad \mathbf{h} = \sigma\left( \mathbf{W}^T \mathbf{v} + \mathbf{b} \right)$

With the augmented weight and input
$\mathbf{W}' = \begin{pmatrix} \mathbf{W} \\ \mathbf{b}^T \end{pmatrix}, \qquad
 \mathbf{v}' = \begin{pmatrix} \mathbf{v} \\ 1 \end{pmatrix}$,
the bias is absorbed into the weight:
$\mathbf{h} = \sigma\left( \mathbf{W}^T \mathbf{v} + \mathbf{b} \right) = \sigma\left( \mathbf{W}'^{\,T} \mathbf{v}' \right)$.

Here and after, let us consider only the weight W:
vector representation of the single layer NN: $\mathbf{h} = \sigma\left( \mathbf{W}^T \mathbf{v} \right)$.
Two-layer NN

Separate the weighted sum and the activation function:
$n = \sum_i w_i v_i, \qquad h = \sigma(n)$

1st layer: $n_j^{(1)} = \sum_i w_{ij}^{(1)} v_i, \qquad h_j^{(1)} = \sigma\left( n_j^{(1)} \right)$
2nd layer: $n_k^{(2)} = \sum_j w_{jk}^{(2)} h_j^{(1)}, \qquad h_k^{(2)} = \sigma\left( n_k^{(2)} \right)$
Teacher data: $t_1, t_2, \dots, t_k$
Back-Propagation of two-layer NN

1st layer: $n_j^{(1)} = \sum_i w_{ij}^{(1)} v_i, \qquad h_j^{(1)} = \sigma\left( n_j^{(1)} \right)$
2nd layer: $n_k^{(2)} = \sum_j w_{jk}^{(2)} h_j^{(1)}, \qquad h_k^{(2)} = \sigma\left( n_k^{(2)} \right)$
Teacher data: $t_k$

Objective function: $I = \dfrac{1}{2} \sum_k \left( h_k^{(2)} - t_k \right)^2$

2nd-layer gradient:
$\dfrac{\partial I}{\partial w_{jk}^{(2)}}
 = \dfrac{\partial I}{\partial n_k^{(2)}} \dfrac{\partial n_k^{(2)}}{\partial w_{jk}^{(2)}}
 = \delta_k^{(2)} h_j^{(1)}$,
$\delta_k^{(2)} = \dfrac{\partial I}{\partial n_k^{(2)}}
 = \dfrac{\partial I}{\partial h_k^{(2)}} \dfrac{\partial h_k^{(2)}}{\partial n_k^{(2)}}
 = \left( h_k^{(2)} - t_k \right) h_k^{(2)} \left( 1 - h_k^{(2)} \right)$

1st-layer gradient:
$\dfrac{\partial I}{\partial w_{ij}^{(1)}}
 = \dfrac{\partial I}{\partial n_j^{(1)}} \dfrac{\partial n_j^{(1)}}{\partial w_{ij}^{(1)}}
 = \delta_j^{(1)} v_i$,
$\delta_j^{(1)} = \dfrac{\partial I}{\partial n_j^{(1)}}
 = \sum_k \dfrac{\partial I}{\partial n_k^{(2)}} \dfrac{\partial n_k^{(2)}}{\partial h_j^{(1)}} \dfrac{\partial h_j^{(1)}}{\partial n_j^{(1)}}
 = \sum_k \delta_k^{(2)} w_{jk}^{(2)} \, h_j^{(1)} \left( 1 - h_j^{(1)} \right)$

Back-propagation: the error $\delta^{(2)}$ of the 2nd layer propagates backward to give $\delta^{(1)}$.
Back-Propagation of an arbitrary layer

Forward: $\mathbf{n}^{(\ell)} = \mathbf{W}^{(\ell)\,T} \mathbf{h}^{(\ell-1)}, \qquad
 \mathbf{h}^{(\ell)} = \mathbf{f}^{(\ell)}\!\left( \mathbf{n}^{(\ell)} \right)$

Backward ($\otimes$: elementwise product):
$\boldsymbol{\delta}^{(\ell)} = \left( \mathbf{W}^{(\ell+1)} \boldsymbol{\delta}^{(\ell+1)} \right) \otimes \dfrac{\partial f^{(\ell)}}{\partial \mathbf{n}^{(\ell)}}, \qquad
 \dfrac{\partial I}{\partial \mathbf{W}^{(\ell)}} = \boldsymbol{\delta}^{(\ell)} \mathbf{h}^{(\ell-1)\,T}$

The same recursion applies at every layer, e.g.
$\boldsymbol{\delta}^{(\ell+1)} = \left( \mathbf{W}^{(\ell+2)} \boldsymbol{\delta}^{(\ell+2)} \right) \otimes \dfrac{\partial f^{(\ell+1)}}{\partial \mathbf{n}^{(\ell+1)}}, \qquad
 \dfrac{\partial I}{\partial \mathbf{W}^{(\ell+1)}} = \boldsymbol{\delta}^{(\ell+1)} \mathbf{h}^{(\ell)\,T}$
$\boldsymbol{\delta}^{(\ell-1)} = \left( \mathbf{W}^{(\ell)} \boldsymbol{\delta}^{(\ell)} \right) \otimes \dfrac{\partial f^{(\ell-1)}}{\partial \mathbf{n}^{(\ell-1)}}, \qquad
 \dfrac{\partial I}{\partial \mathbf{W}^{(\ell-1)}} = \boldsymbol{\delta}^{(\ell-1)} \mathbf{h}^{(\ell-2)\,T}$
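A minimal NumPy sketch of this recursion for a sigmoid network with the squared-error objective $I = \tfrac{1}{2}\sum_k (h_k^{(L)} - t_k)^2$. Biases are omitted (absorbed into W as in the simplification slide), and variable names are illustrative assumptions; because the code stores W with shape (inputs, outputs) and uses $h = \sigma(\mathbf{W}^T\mathbf{h}_{prev})$, the gradient is stored as the transpose $\mathbf{h}^{(\ell-1)}\boldsymbol{\delta}^{(\ell)T}$ of the slide's expression.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def backprop(v, t, weights):
    """Forward pass, then the backward delta recursion for every layer."""
    hs = [v]                                  # h^(0) = v
    for W in weights:
        hs.append(sigmoid(W.T @ hs[-1]))      # h^(l) = sigma(W^(l)^T h^(l-1))
    # Output-layer delta: (h^(L) - t) * sigma'(n^(L)) with sigma' = h(1-h).
    delta = (hs[-1] - t) * hs[-1] * (1.0 - hs[-1])
    grads = [None] * len(weights)
    for l in reversed(range(len(weights))):
        grads[l] = np.outer(hs[l], delta)     # dI/dW^(l+1), shape matches weights[l]
        if l > 0:
            # delta^(l) = (W^(l+1) delta^(l+1)) elementwise* sigma'(n^(l))
            delta = (weights[l] @ delta) * hs[l] * (1.0 - hs[l])
    return grads, hs[-1]

# Usage example with random parameters and data.
rng = np.random.default_rng(0)
weights = [rng.standard_normal((3, 4)) * 0.1, rng.standard_normal((4, 2)) * 0.1]
grads, output = backprop(rng.random(3), rng.random(2), weights)
print([g.shape for g in grads], output)
```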
Tip for debugging the gradient calculation

Objective function: $I(\boldsymbol{\theta}) = \dfrac{1}{2} \sum_k \left( h_k^{(L)}(\mathbf{v}; \boldsymbol{\theta}) - t_k \right)^2$

Gradient calculated by the back-propagation: $\dfrac{\partial I}{\partial \theta_i}$
— computationally efficient, but the implementation is difficult (error-prone).

Definition of the gradient:
$\dfrac{\partial I}{\partial \theta_i} = \lim_{\varepsilon \to 0} \dfrac{I(\boldsymbol{\theta} + \varepsilon \mathbf{1}_i) - I(\boldsymbol{\theta})}{\varepsilon}$,
where $\mathbf{1}_i$ is the vector whose i-th element is 1 and whose other elements are 0.

Finite-difference approximation:
$\Delta_i I = \dfrac{I(\boldsymbol{\theta} + \varepsilon \mathbf{1}_i) - I(\boldsymbol{\theta})}{\varepsilon}$
— computationally inefficient, but easy to implement.

For a small $\varepsilon$, $\Delta_i I \approx \dfrac{\partial I}{\partial \theta_i}$, so the two values can be compared to debug the back-propagation code.
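A minimal self-contained sketch of this gradient check for a single sigmoid layer; the analytic gradient comes from the back-propagation formulas above, and all shapes and values are illustrative assumptions.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def objective(W, v, t):
    """I(theta) = 1/2 * sum_k (h_k - t_k)^2 for a single sigmoid layer."""
    return 0.5 * np.sum((sigmoid(W.T @ v) - t) ** 2)

rng = np.random.default_rng(0)
W = rng.standard_normal((3, 2)) * 0.1
v, t = rng.random(3), rng.random(2)

# Analytic gradient (back-propagation of the single layer): dI/dw_ij = v_i * delta_j
h = sigmoid(W.T @ v)
delta = (h - t) * h * (1.0 - h)
grad_bp = np.outer(v, delta)

# Finite-difference approximation Delta_i I for every element of W.
eps = 1e-5
grad_fd = np.zeros_like(W)
for idx in np.ndindex(W.shape):
    W[idx] += eps
    I_plus = objective(W, v, t)
    W[idx] -= eps
    grad_fd[idx] = (I_plus - objective(W, v, t)) / eps

print(np.max(np.abs(grad_bp - grad_fd)))   # small (~1e-5) if the analytic gradient is correct
```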
Stochastic Gradient Descent (Mini-batch learning)

Whole training data: $\{ (\mathbf{v}, \mathbf{t})_1, (\mathbf{v}, \mathbf{t})_2, \cdots, (\mathbf{v}, \mathbf{t})_n, \cdots, (\mathbf{v}, \mathbf{t})_N \}$
The parameters $\boldsymbol{\theta}$ are learned from these samples.
$I_n(\boldsymbol{\theta})$ is the objective function associated with $(\mathbf{v}, \mathbf{t})_n$.

A mini-batch is sampled from the whole training data, and the parameters are updated
with each mini-batch:
$\boldsymbol{\theta} \leftarrow \boldsymbol{\theta} - \eta \dfrac{\partial I}{\partial \boldsymbol{\theta}}$,
where $I$ is the objective summed over the current mini-batch.

Updating with each mini-batch helps to avoid overfitting.
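A minimal sketch of the mini-batch SGD loop described above. The gradient function, the toy data, and all hyperparameter values are illustrative assumptions.

```python
import numpy as np

def sgd(theta, data, grad_fn, eta=0.1, batch_size=10, epochs=5, seed=0):
    """Sample mini-batches, average the per-sample gradients, update theta."""
    rng = np.random.default_rng(seed)
    n = len(data)
    for _ in range(epochs):
        order = rng.permutation(n)
        for start in range(0, n, batch_size):
            batch = [data[i] for i in order[start:start + batch_size]]
            grad = sum(grad_fn(theta, v, t) for v, t in batch) / len(batch)
            theta = theta - eta * grad        # theta <- theta - eta * dI/dtheta
    return theta

# Toy usage: fit theta so that t ~ theta . v with a squared-error objective.
rng = np.random.default_rng(1)
true_theta = np.array([1.0, -2.0, 0.5])
data = [(v, float(v @ true_theta)) for v in rng.random((200, 3))]
grad_fn = lambda th, v, t: (th @ v - t) * v   # dI_n/dtheta for I_n = 1/2 (th.v - t)^2
print(sgd(np.zeros(3), data, grad_fn))        # approaches true_theta
```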
Practical update of parameters

G. Hinton, A Practical Guide to Training Restricted Boltzmann Machines, 2010.

Size of mini-batch: 10 – 100
Learning rate $\eta$: empirically determined
Weight decay rate $\lambda$: 0.01 – 0.00001
Momentum rate $\nu$: 0.9 (initially 0.5)

Update rule:
$\theta^{(t+1)} = \theta^{(t)} + \Delta\theta^{(t)}, \qquad
 \Delta\theta^{(t)} = -\eta \dfrac{\partial I}{\partial \theta} - \lambda \theta^{(t)} + \nu \Delta\theta^{(t-1)}$
(gradient term, weight decay term, momentum term)

Weight decay avoids unnecessary divergence of the weights (especially for the sigmoid function).
Momentum avoids unnecessary oscillation of the update amount; an effect similar to the
conjugate gradient algorithm is expected.
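A minimal sketch of this update rule combining gradient, weight decay, and momentum. The hyperparameter values follow the ranges quoted above; the toy objective is an illustrative assumption.

```python
import numpy as np

def update(theta, delta_prev, grad, eta=0.1, lam=0.0001, nu=0.9):
    """delta = -eta * dI/dtheta - lambda * theta + nu * delta_prev."""
    delta = -eta * grad - lam * theta + nu * delta_prev
    return theta + delta, delta

theta = np.zeros(5)
delta = np.zeros(5)
for step in range(100):
    grad = 2.0 * (theta - 1.0)     # toy objective I = |theta - 1|^2
    theta, delta = update(theta, delta, grad)
print(theta)                        # close to 1 (slightly below, due to weight decay)
```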
Outline
1. Examples of Deep NNs
2. RBM to Deep NN
3. Deep Neural Network (Deep NN)
– Back-Propagation (Supervised Learning)
4. Restricted Boltzmann Machine (RBM)
– Mathematics, Probabilistic Model and Inference Model
– Pre-training by Contrastive Divergence Learning
(Unsupervised Learning)
5. Inference Model with Distribution
28
http://bit.ly/dnnicpr2014
Restricted Boltzmann Machines

 Boltzmann Machines
A Boltzmann machine is a probabilistic model represented by an undirected graph (nodes and edges).
Here, the binary state {0,1} is considered as the state of each node.

 Unrestricted and restricted Boltzmann machines
• (Unrestricted) Boltzmann machine: v: visible layer, h: hidden layer.
  Every node is connected to every other node.
• Restricted Boltzmann machine (RBM): v: visible layer, h: hidden layer.
  There is no edge within the same layer, which makes the analysis easier.
RBM: Probabilistic and energy models

v: visible layer {0,1}; h: hidden layer {0,1}
RBM parameters: θ = (W, b, c); weight: W; biases: b, c

Probabilistic model: $P(\mathbf{v}, \mathbf{h}; \boldsymbol{\theta}) = \dfrac{1}{Z(\boldsymbol{\theta})} \exp\left( -E(\mathbf{v}, \mathbf{h}; \boldsymbol{\theta}) \right)$

Energy model: $E(\mathbf{v}, \mathbf{h}; \boldsymbol{\theta})
 = -\sum_{i,j} v_i w_{ij} h_j - \sum_j b_j h_j - \sum_i c_i v_i
 = -\mathbf{v}^T \mathbf{W} \mathbf{h} - \mathbf{b}^T \mathbf{h} - \mathbf{c}^T \mathbf{v}$

Partition function: $Z(\boldsymbol{\theta}) = \sum_{\mathbf{v}, \mathbf{h} \in \{0,1\}} \exp\left( -E(\mathbf{v}, \mathbf{h}; \boldsymbol{\theta}) \right)$
RBM: Conditional probability model (Inference model)

v: visible layer {0,1}; h: hidden layer {0,1}
(with $P(\mathbf{v}, \mathbf{h}; \boldsymbol{\theta}) = \frac{1}{Z(\boldsymbol{\theta})} \exp(-E(\mathbf{v}, \mathbf{h}; \boldsymbol{\theta}))$ and
$E(\mathbf{v}, \mathbf{h}; \boldsymbol{\theta}) = -\mathbf{v}^T \mathbf{W} \mathbf{h} - \mathbf{b}^T \mathbf{h} - \mathbf{c}^T \mathbf{v}$ as before)

$P(\mathbf{h} \mid \mathbf{v}; \boldsymbol{\theta})
 = \dfrac{P(\mathbf{v}, \mathbf{h}; \boldsymbol{\theta})}{\sum_{\mathbf{h} \in \{0,1\}} P(\mathbf{v}, \mathbf{h}; \boldsymbol{\theta})}
 = \dfrac{\exp\left( \sum_{i,j} v_i w_{ij} h_j + \sum_j b_j h_j + \sum_i c_i v_i \right)}
         {\sum_{\mathbf{h} \in \{0,1\}} \exp\left( \sum_{i,j} v_i w_{ij} h_j + \sum_j b_j h_j + \sum_i c_i v_i \right)}$
$ = \dfrac{\prod_i \exp(c_i v_i) \prod_j \exp\left( \sum_i v_i w_{ij} h_j + b_j h_j \right)}
          {\prod_i \exp(c_i v_i) \prod_j \sum_{h_j \in \{0,1\}} \exp\left( \sum_i v_i w_{ij} h_j + b_j h_j \right)}
 = \prod_j \dfrac{\exp\left( \sum_i v_i w_{ij} h_j + b_j h_j \right)}
                 {\sum_{h_j \in \{0,1\}} \exp\left( \sum_i v_i w_{ij} h_j + b_j h_j \right)}
 = \prod_j P(h_j \mid \mathbf{v}; \boldsymbol{\theta})$

The conditional probabilities of the hidden nodes are independent:
$P(h_j \mid \mathbf{v}; \boldsymbol{\theta})
 = \dfrac{\exp\left( \sum_i v_i w_{ij} h_j + b_j h_j \right)}
         {\sum_{h_j \in \{0,1\}} \exp\left( \sum_i v_i w_{ij} h_j + b_j h_j \right)}$
RBM: Conditional probability model (Inference model), continued

v: visible layer {0,1}; h: hidden layer {0,1}

$P(h_j = 1 \mid \mathbf{v}; \boldsymbol{\theta})
 = \dfrac{\exp\left( \sum_i v_i w_{ij} \times 1 + b_j \times 1 \right)}
         {\exp\left( \sum_i v_i w_{ij} \times 0 + b_j \times 0 \right) + \exp\left( \sum_i v_i w_{ij} \times 1 + b_j \times 1 \right)}$
$ = \dfrac{\exp\left( \sum_i v_i w_{ij} + b_j \right)}{\exp(0) + \exp\left( \sum_i v_i w_{ij} + b_j \right)}
 = \dfrac{1}{1 + \exp\left( -\left( \sum_i v_i w_{ij} + b_j \right) \right)}
 = \sigma\left( \sum_i v_i w_{ij} + b_j \right)$, with $\sigma(x) = \dfrac{1}{1 + e^{-x}}$.

In vector form:
$P(\mathbf{h} = \mathbf{1} \mid \mathbf{v}; \boldsymbol{\theta}) = \sigma\left( \mathbf{W}^T \mathbf{v} + \mathbf{b} \right), \qquad
 P(\mathbf{v} = \mathbf{1} \mid \mathbf{h}; \boldsymbol{\theta}) = \sigma\left( \mathbf{W} \mathbf{h} + \mathbf{c} \right)$
This is identical to the vector representation of the single layer NN, $\mathbf{h} = \sigma\left( \mathbf{W}^T \mathbf{v} + \mathbf{b} \right)$.
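A minimal sketch of the RBM inference (conditional probabilities), which is the same computation as the single-layer NN forward pass. Shapes and parameter values are illustrative assumptions.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def p_h_given_v(v, W, b):
    """P(h_j = 1 | v) = sigma(W^T v + b)_j."""
    return sigmoid(W.T @ v + b)

def p_v_given_h(h, W, c):
    """P(v_i = 1 | h) = sigma(W h + c)_i."""
    return sigmoid(W @ h + c)

rng = np.random.default_rng(0)
W = rng.standard_normal((6, 4)) * 0.1     # visible x hidden
b, c = np.zeros(4), np.zeros(6)
v = rng.integers(0, 2, size=6).astype(float)
h_prob = p_h_given_v(v, W, b)
print(h_prob)
print(p_v_given_h(h_prob > 0.5, W, c))    # reconstruct from thresholded hidden states
```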
Gaussian-Bernoulli RBM

Bernoulli-Bernoulli RBM: v: visible layer {0,1}; h: hidden layer {0,1}
Gaussian-Bernoulli RBM: v: visible layer with Gaussian distribution $N(v; \mu, s^2)$; h: hidden layer {0,1}

Probabilistic model: $P(\mathbf{v}, \mathbf{h}) = \dfrac{1}{Z(\boldsymbol{\theta})} \exp\left( -E(\mathbf{v}, \mathbf{h}) \right)$

Energy model: $E(\mathbf{v}, \mathbf{h})
 = \dfrac{1}{2 s^2} \sum_i \left( v_i - c_i \right)^2 - \dfrac{1}{s} \sum_{i,j} v_i w_{ij} h_j - \sum_j b_j h_j$

Inference model (conditional probabilities):
$P(\mathbf{h} = \mathbf{1} \mid \mathbf{v}) = \sigma\left( \dfrac{1}{s} \mathbf{W}^T \mathbf{v} + \mathbf{b} \right), \qquad
 P(\mathbf{v} \mid \mathbf{h}) = N\left( \mathbf{v};\ s \mathbf{W} \mathbf{h} + \mathbf{c},\ s^2 \mathbf{I} \right)$
Outline
1. Examples of Deep NNs
2. RBM to Deep NN
3. Deep Neural Network (Deep NN)
– Back-Propagation (Supervised Learning)
4. Restricted Boltzmann Machine (RBM)
– Mathematics, Probabilistic Model and Inference Model
– Pre-training by Contrastive Divergence Learning
(Unsupervised Learning)
5. Inference Model with Distribution
34
http://bit.ly/dnnicpr2014
RBM: Contrastive Divergence Learning

Iterative process of the CD learning:
$\mathbf{h}(0) = \sigma\left( \mathbf{W}^T \mathbf{v}(0) + \mathbf{b} \right)$
$\mathbf{v}(1) = \sigma\left( \mathbf{W} \mathbf{h}(0) + \mathbf{c} \right)$
$\mathbf{h}(1) = \sigma\left( \mathbf{W}^T \mathbf{v}(1) + \mathbf{b} \right)$

$\mathbf{W} \leftarrow \mathbf{W} - \varepsilon\, \Delta\mathbf{W}, \qquad
 \Delta\mathbf{W} = \dfrac{1}{N} \sum_n \left( \mathbf{v}_n(0)^{\,T} \mathbf{h}_n(0) - \mathbf{v}_n(1)^{\,T} \mathbf{h}_n(1) \right)$

The CD learning can be considered a maximum-likelihood estimation with the given training data.
Momentum and weight decay are also applied.
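A minimal sketch of one CD-1 update for a mini-batch V0 (rows are the training vectors $\mathbf{v}_n(0)$). Probabilities are used in place of sampled binary states (as recommended later in the deck), and the products are written as $\mathbf{v}\mathbf{h}^T$ outer products so that $\Delta\mathbf{W}$ has the shape of W. The slide writes $\mathbf{W} \leftarrow \mathbf{W} - \varepsilon\Delta\mathbf{W}$; with $\Delta\mathbf{W}$ defined as data term minus model term, the likelihood-ascent update adds $\varepsilon\Delta\mathbf{W}$, which is what this sketch (and standard CD implementations) does. The bias updates, shapes, and learning rate are illustrative assumptions not shown on the slide.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cd1_step(V0, W, b, c, eps=0.1):
    """One contrastive-divergence update with a single Gibbs step (CD-1)."""
    H0 = sigmoid(V0 @ W + b)           # h(0) = sigma(W^T v(0) + b), batched
    V1 = sigmoid(H0 @ W.T + c)         # v(1) = sigma(W h(0) + c)
    H1 = sigmoid(V1 @ W + b)           # h(1) = sigma(W^T v(1) + b)
    N = V0.shape[0]
    dW = (V0.T @ H0 - V1.T @ H1) / N   # mean of v(0) h(0)^T - v(1) h(1)^T
    W = W + eps * dW
    b = b + eps * (H0 - H1).mean(axis=0)
    c = c + eps * (V0 - V1).mean(axis=0)
    return W, b, c

rng = np.random.default_rng(0)
V0 = rng.integers(0, 2, size=(20, 6)).astype(float)   # mini-batch of binary data
W = rng.standard_normal((6, 4)) * 0.01
W, b, c = cd1_step(V0, W, np.zeros(4), np.zeros(6))
```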
RBM: Outline of the CD learning

v: visible layer {0,1}; h: hidden layer {0,1}
Parameters θ: weight W, biases b, c
$P(\mathbf{v}, \mathbf{h}; \boldsymbol{\theta}) = \dfrac{1}{Z(\boldsymbol{\theta})} \exp\left( -E(\mathbf{v}, \mathbf{h}; \boldsymbol{\theta}) \right)$
$E(\mathbf{v}, \mathbf{h}; \boldsymbol{\theta}) = -\mathbf{v}^T \mathbf{W} \mathbf{h} - \mathbf{b}^T \mathbf{h} - \mathbf{c}^T \mathbf{v}$

Outline of the Contrastive Divergence learning:
• Maximum-likelihood estimation with the given training data {v_n}.
• An (approximated) EM algorithm is applied to handle the unobserved hidden data.
• Gibbs sampling is applied to evaluate the partition-function (model) term.
• The Gibbs sampling is approximated by a single sampling step.
CD learning: Maximum likelihood

v: visible layer {0,1}; h: hidden layer {0,1}
$P(\mathbf{v}, \mathbf{h}; \boldsymbol{\theta}) = \dfrac{1}{Z(\boldsymbol{\theta})} \exp\left( -E(\mathbf{v}, \mathbf{h}; \boldsymbol{\theta}) \right)$,
$E(\mathbf{v}, \mathbf{h}; \boldsymbol{\theta}) = -\mathbf{v}^T \mathbf{W} \mathbf{h} - \mathbf{b}^T \mathbf{h} - \mathbf{c}^T \mathbf{v}$

The RBM is a probabilistic model, so maximum likelihood gives the parameters for the given
training data. The visible data are given; the hidden data are not given, so the hidden data
are integrated out.

Log likelihood for the training data {v_n}:
$L(\boldsymbol{\theta}) = \sum_n L_n(\boldsymbol{\theta}) = \sum_n \log P(\mathbf{v}_n; \boldsymbol{\theta})
 = \sum_n \log E_{\mathbf{h} \sim P(\mathbf{h} \mid \mathbf{v}_n; \boldsymbol{\theta})}\!\left[ P(\mathbf{v}_n, \mathbf{h}; \boldsymbol{\theta}) \right]$

The optimization of $L_n(\boldsymbol{\theta}) = \log E_{\mathbf{h} \sim P(\mathbf{h} \mid \mathbf{v}_n; \boldsymbol{\theta})}\!\left[ P(\mathbf{v}_n, \mathbf{h}; \boldsymbol{\theta}) \right]$ is performed by the EM algorithm.
EM Algorithm

Reference: これなら分かる最適化数学 (Optimization Mathematics That You Can Understand), Kenichi Kanatani

Log likelihood for the training data {v_n}:
$L_n(\boldsymbol{\theta}) = \log P(\mathbf{v}_n; \boldsymbol{\theta}) = \log E_{\mathbf{h} \sim P(\mathbf{h} \mid \mathbf{v}_n; \boldsymbol{\theta})}\!\left[ P(\mathbf{v}_n, \mathbf{h}; \boldsymbol{\theta}) \right]$

The EM algorithm monotonically increases the log likelihood.

EM algorithm:
1. Initialize the parameter $\boldsymbol{\theta}$ with $\boldsymbol{\theta}_0$. Set $\tau = 0$.
2. Evaluate the following function (E-step):
   $Q_\tau(\boldsymbol{\theta}) = E_{\mathbf{h} \sim P(\mathbf{h} \mid \mathbf{v}_n; \boldsymbol{\theta}_\tau)}\!\left[ \log P(\mathbf{v}_n, \mathbf{h}; \boldsymbol{\theta}) \right]$
3. Find the $\boldsymbol{\theta}$ that maximizes $Q_\tau(\boldsymbol{\theta})$ (M-step).
4. Set $\tau \leftarrow \tau + 1$, then go to step 2. Iterate until convergence.

※ In the CD learning, the M-step is approximated.
Evaluation function and derivatives

v: visible layer {0,1}; h: hidden layer {0,1}
$P(\mathbf{v}, \mathbf{h}; \boldsymbol{\theta}) = \dfrac{1}{Z(\boldsymbol{\theta})} \exp\left( -E(\mathbf{v}, \mathbf{h}; \boldsymbol{\theta}) \right), \qquad
 Z(\boldsymbol{\theta}) = \sum_{\mathbf{v}, \mathbf{h} \in \{0,1\}} \exp\left( -E(\mathbf{v}, \mathbf{h}; \boldsymbol{\theta}) \right)$

Evaluation function:
$Q_\tau(\boldsymbol{\theta}) = E_{\mathbf{h} \sim P(\mathbf{h} \mid \mathbf{v}_n; \boldsymbol{\theta}_\tau)}\!\left[ \log P(\mathbf{v}_n, \mathbf{h}; \boldsymbol{\theta}) \right]
 = E_{\mathbf{h} \sim P(\mathbf{h} \mid \mathbf{v}_n; \boldsymbol{\theta}_\tau)}\!\left[ \log \dfrac{\exp\left( -E(\mathbf{v}_n, \mathbf{h}; \boldsymbol{\theta}) \right)}{Z(\boldsymbol{\theta})} \right]
 = E_{\mathbf{h} \sim P(\mathbf{h} \mid \mathbf{v}_n; \boldsymbol{\theta}_\tau)}\!\left[ -E(\mathbf{v}_n, \mathbf{h}; \boldsymbol{\theta}) - \log Z(\boldsymbol{\theta}) \right]$

Derivative:
$\dfrac{\partial Q_\tau(\boldsymbol{\theta})}{\partial \boldsymbol{\theta}}
 = E_{\mathbf{h} \sim P(\mathbf{h} \mid \mathbf{v}_n; \boldsymbol{\theta}_\tau)}\!\left[ -\dfrac{\partial}{\partial \boldsymbol{\theta}} E(\mathbf{v}_n, \mathbf{h}; \boldsymbol{\theta}) \right] - \dfrac{\partial}{\partial \boldsymbol{\theta}} \log Z(\boldsymbol{\theta})
 = \underbrace{E_{\mathbf{h} \sim P(\mathbf{h} \mid \mathbf{v}_n; \boldsymbol{\theta}_\tau)}\!\left[ -\dfrac{\partial}{\partial \boldsymbol{\theta}} E(\mathbf{v}_n, \mathbf{h}; \boldsymbol{\theta}) \right]}_{\text{Data term}}
 - \underbrace{E_{\mathbf{v}, \mathbf{h} \sim P(\mathbf{v}, \mathbf{h}; \boldsymbol{\theta})}\!\left[ -\dfrac{\partial}{\partial \boldsymbol{\theta}} E(\mathbf{v}, \mathbf{h}; \boldsymbol{\theta}) \right]}_{\text{Model term}}$
Derivative of the partition function (Model term)

Probability function: $P(\mathbf{x}; \boldsymbol{\theta}) = \dfrac{1}{Z(\boldsymbol{\theta})} f(\mathbf{x}; \boldsymbol{\theta})$;
partition function: $Z(\boldsymbol{\theta}) = \int f(\mathbf{x}; \boldsymbol{\theta})\, d\mathbf{x}$;
arbitrary function: $f(\mathbf{x}; \boldsymbol{\theta})$.

$\dfrac{\partial \log Z(\boldsymbol{\theta})}{\partial \boldsymbol{\theta}}
 = \dfrac{1}{Z(\boldsymbol{\theta})} \dfrac{\partial Z(\boldsymbol{\theta})}{\partial \boldsymbol{\theta}}
 = \dfrac{1}{Z(\boldsymbol{\theta})} \dfrac{\partial}{\partial \boldsymbol{\theta}} \int f(\mathbf{x}; \boldsymbol{\theta})\, d\mathbf{x}$  (derivative of the log function)
$ = \dfrac{1}{Z(\boldsymbol{\theta})} \int \dfrac{\partial f(\mathbf{x}; \boldsymbol{\theta})}{\partial \boldsymbol{\theta}}\, d\mathbf{x}$  (derivative operator moved into the integral)
$ = \dfrac{1}{Z(\boldsymbol{\theta})} \int f(\mathbf{x}; \boldsymbol{\theta}) \dfrac{\partial \log f(\mathbf{x}; \boldsymbol{\theta})}{\partial \boldsymbol{\theta}}\, d\mathbf{x}$  (derivative of the log function)
$ = \int P(\mathbf{x}; \boldsymbol{\theta}) \dfrac{\partial \log f(\mathbf{x}; \boldsymbol{\theta})}{\partial \boldsymbol{\theta}}\, d\mathbf{x}
 = E_{\mathbf{x} \sim P(\mathbf{x}; \boldsymbol{\theta})}\!\left[ \dfrac{\partial \log f(\mathbf{x}; \boldsymbol{\theta})}{\partial \boldsymbol{\theta}} \right]$  (definition of expectation)
Evaluation function and derivatives (continued)

v: visible layer {0,1}; h: hidden layer {0,1}
Evaluation function: $Q_\tau(\boldsymbol{\theta}) = E_{\mathbf{h} \sim P(\mathbf{h} \mid \mathbf{v}_n; \boldsymbol{\theta}_\tau)}\!\left[ \log P(\mathbf{v}_n, \mathbf{h}; \boldsymbol{\theta}) \right]$, with
$P(\mathbf{v}, \mathbf{h}; \boldsymbol{\theta}) = \dfrac{1}{Z(\boldsymbol{\theta})} \exp\left( -E(\mathbf{v}, \mathbf{h}; \boldsymbol{\theta}) \right)$,
$E(\mathbf{v}, \mathbf{h}; \boldsymbol{\theta}) = -\mathbf{v}^T \mathbf{W} \mathbf{h} - \mathbf{b}^T \mathbf{h} - \mathbf{c}^T \mathbf{v}$.

$\dfrac{\partial Q_\tau(\boldsymbol{\theta}_\tau)}{\partial \boldsymbol{\theta}}
 = \underbrace{E_{\mathbf{h} \sim P(\mathbf{h} \mid \mathbf{v}_n; \boldsymbol{\theta}_\tau)}\!\left[ -\dfrac{\partial}{\partial \boldsymbol{\theta}} E(\mathbf{v}_n, \mathbf{h}; \boldsymbol{\theta}) \right]}_{\text{Data term (feasible)}}
 - \underbrace{E_{\mathbf{v}, \mathbf{h} \sim P(\mathbf{v}, \mathbf{h}; \boldsymbol{\theta}_\tau)}\!\left[ -\dfrac{\partial}{\partial \boldsymbol{\theta}} E(\mathbf{v}, \mathbf{h}; \boldsymbol{\theta}) \right]}_{\text{Model term (infeasible; approximated by Gibbs sampling)}}$

With $\dfrac{\partial}{\partial \mathbf{W}} E(\mathbf{v}, \mathbf{h}; \boldsymbol{\theta}) = -\mathbf{v}^T \mathbf{h}$, the derivative with respect to W becomes
$\dfrac{\partial Q_\tau(\boldsymbol{\theta}_\tau)}{\partial \mathbf{W}}
 = E_{\mathbf{h} \sim P(\mathbf{h} \mid \mathbf{v}_n; \boldsymbol{\theta}_\tau)}\!\left[ \mathbf{v}_n^T \mathbf{h} \right]
 - E_{\mathbf{v}, \mathbf{h} \sim P(\mathbf{v}, \mathbf{h}; \boldsymbol{\theta}_\tau)}\!\left[ \mathbf{v}^T \mathbf{h} \right]$
Derivatives of the evaluation function (Data term)

$\dfrac{\partial Q_\tau(\boldsymbol{\theta}_\tau)}{\partial \mathbf{W}}
 = \underbrace{E_{\mathbf{h} \sim P(\mathbf{h} \mid \mathbf{v}_n; \boldsymbol{\theta}_\tau)}\!\left[ \mathbf{v}_n^T \mathbf{h} \right]}_{\text{Data term (feasible)}}
 - \underbrace{E_{\mathbf{v}, \mathbf{h} \sim P(\mathbf{v}, \mathbf{h}; \boldsymbol{\theta}_\tau)}\!\left[ \mathbf{v}^T \mathbf{h} \right]}_{\text{Model term (infeasible; approximated by Gibbs sampling)}}$

Inference of the RBM: $P(\mathbf{h} = \mathbf{1} \mid \mathbf{v}; \boldsymbol{\theta}) = \sigma\left( \mathbf{W}^T \mathbf{v} + \mathbf{b} \right)$

Data term:
$E_{\mathbf{h} \sim P(\mathbf{h} \mid \mathbf{v}_n; \boldsymbol{\theta}_\tau)}\!\left[ \mathbf{v}_n^T \mathbf{h} \right]
 = \mathbf{v}_n^T\, E_{\mathbf{h} \sim P(\mathbf{h} \mid \mathbf{v}_n; \boldsymbol{\theta}_\tau)}\!\left[ \mathbf{h} \right]
 = \mathbf{v}_n^T \left( \mathbf{0} \cdot P(\mathbf{h} = \mathbf{0} \mid \mathbf{v}; \boldsymbol{\theta}) + \mathbf{1} \cdot P(\mathbf{h} = \mathbf{1} \mid \mathbf{v}; \boldsymbol{\theta}) \right)
 = \mathbf{v}_n^T\, \sigma\left( \mathbf{W}^T \mathbf{v}_n + \mathbf{b} \right)$
Expectation Approximation by the Monte-Carlo Method (Model term)

$\dfrac{\partial Q_\tau(\boldsymbol{\theta}_\tau)}{\partial \mathbf{W}}
 = \underbrace{E_{\mathbf{h} \sim P(\mathbf{h} \mid \mathbf{v}_n; \boldsymbol{\theta}_\tau)}\!\left[ \mathbf{v}_n^T \mathbf{h} \right]}_{\text{Data term (feasible)}}
 - \underbrace{E_{\mathbf{v}, \mathbf{h} \sim P(\mathbf{v}, \mathbf{h}; \boldsymbol{\theta}_\tau)}\!\left[ \mathbf{v}^T \mathbf{h} \right]}_{\text{Model term (infeasible; approximated by Gibbs sampling)}}$

Monte-Carlo method with independent samples $\left( \mathbf{v}(i), \mathbf{h}(i) \right) \sim P(\mathbf{v}, \mathbf{h}; \boldsymbol{\theta}_\tau)$:
$E_{\mathbf{v}, \mathbf{h} \sim P(\mathbf{v}, \mathbf{h}; \boldsymbol{\theta}_\tau)}\!\left[ \mathbf{v}^T \mathbf{h} \right]
 \approx \dfrac{1}{N} \sum_i \mathbf{v}(i)^T \mathbf{h}(i)$

How can we get the samples? Gibbs sampling.
Gibbs sampling

Inference of the RBM:
$P(\mathbf{h} = \mathbf{1} \mid \mathbf{v}; \boldsymbol{\theta}) = \sigma\left( \mathbf{W}^T \mathbf{v} + \mathbf{b} \right), \qquad
 P(\mathbf{v} = \mathbf{1} \mid \mathbf{h}; \boldsymbol{\theta}) = \sigma\left( \mathbf{W} \mathbf{h} + \mathbf{c} \right)$

Gibbs sampling:
1. Initialize $\mathbf{v}$.
2. Sample $\mathbf{h}$ from $P(\mathbf{h} \mid \mathbf{v})$.
3. Sample $\mathbf{v}$ from $P(\mathbf{v} \mid \mathbf{h})$.
4. Iterate steps 2 and 3: $(\mathbf{v}(0), \mathbf{h}(0)) \to (\mathbf{v}(1), \mathbf{h}(1)) \to \cdots \to (\mathbf{v}(\infty), \mathbf{h}(\infty))$.

The samples are used for the Monte-Carlo approximation of the model term:
$E_{\mathbf{v}, \mathbf{h} \sim P(\mathbf{v}, \mathbf{h}; \boldsymbol{\theta}_\tau)}\!\left[ \mathbf{v}^T \mathbf{h} \right]
 \approx \dfrac{1}{N} \sum_i \mathbf{v}(i)^T \mathbf{h}(i)$
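A minimal sketch of the Gibbs sampling procedure listed above for a Bernoulli-Bernoulli RBM; binary states are actually sampled here. Parameter shapes, the step count, and the helper names are illustrative assumptions.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gibbs_chain(W, b, c, steps=1000, seed=0):
    rng = np.random.default_rng(seed)
    v = rng.integers(0, 2, size=W.shape[0]).astype(float)                   # 1. initialize v
    for _ in range(steps):
        h = (rng.random(W.shape[1]) < sigmoid(W.T @ v + b)).astype(float)   # 2. h ~ P(h|v)
        v = (rng.random(W.shape[0]) < sigmoid(W @ h + c)).astype(float)     # 3. v ~ P(v|h)
    return v, h                                                             # 4. iterate, return last sample

rng = np.random.default_rng(1)
W = rng.standard_normal((6, 4)) * 0.1
v, h = gibbs_chain(W, np.zeros(4), np.zeros(6))
print(np.outer(v, h))    # one Monte-Carlo sample of v h^T for the model term
```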
Approximated Evaluation of the Model Term

Stop after just one Gibbs step:
$\mathbf{h}(0) = \sigma\left( \mathbf{W}^T \mathbf{v}(0) + \mathbf{b} \right), \quad
 \mathbf{v}(1) = \sigma\left( \mathbf{W} \mathbf{h}(0) + \mathbf{c} \right), \quad
 \mathbf{h}(1) = \sigma\left( \mathbf{W}^T \mathbf{v}(1) + \mathbf{b} \right)$

Data term: $E_{\mathbf{h} \sim P(\mathbf{h} \mid \mathbf{v}_n; \boldsymbol{\theta}_\tau)}\!\left[ \mathbf{v}_n^T \mathbf{h} \right] \approx \mathbf{v}(0)^T \mathbf{h}(0)$
Model term: $E_{\mathbf{v}, \mathbf{h} \sim P(\mathbf{v}, \mathbf{h}; \boldsymbol{\theta}_\tau)}\!\left[ \mathbf{v}^T \mathbf{h} \right] \approx \mathbf{v}(1)^T \mathbf{h}(1)$

(Inference of the RBM: $P(\mathbf{h} = \mathbf{1} \mid \mathbf{v}; \boldsymbol{\theta}) = \sigma\left( \mathbf{W}^T \mathbf{v} + \mathbf{b} \right)$,
 $P(\mathbf{v} = \mathbf{1} \mid \mathbf{h}; \boldsymbol{\theta}) = \sigma\left( \mathbf{W} \mathbf{h} + \mathbf{c} \right)$)
Probabilities or states?

One Gibbs step:
$\mathbf{h}(0) = \sigma\left( \mathbf{W}^T \mathbf{v}(0) + \mathbf{b} \right), \quad
 \mathbf{v}(1) = \sigma\left( \mathbf{W} \mathbf{h}(0) + \mathbf{c} \right), \quad
 \mathbf{h}(1) = \sigma\left( \mathbf{W}^T \mathbf{v}(1) + \mathbf{b} \right)$

The inference of the RBM gives probabilities.
In the Gibbs sampling, should we sample binary states with these probabilities,
or should we simply use the probabilities themselves?
Hinton recommends using the probabilities.

G. Hinton, A Practical Guide to Training Restricted Boltzmann Machines, 2010.
RBM: Contrastive Divergence Learning

Iterative process of the CD learning:
$\mathbf{h}(0) = \sigma\left( \mathbf{W}^T \mathbf{v}(0) + \mathbf{b} \right)$
$\mathbf{v}(1) = \sigma\left( \mathbf{W} \mathbf{h}(0) + \mathbf{c} \right)$
$\mathbf{h}(1) = \sigma\left( \mathbf{W}^T \mathbf{v}(1) + \mathbf{b} \right)$

$\mathbf{W} \leftarrow \mathbf{W} - \varepsilon\, \Delta\mathbf{W}, \qquad
 \Delta\mathbf{W} = \dfrac{1}{N} \sum_n \left( \mathbf{v}_n(0)^{\,T} \mathbf{h}_n(0) - \mathbf{v}_n(1)^{\,T} \mathbf{h}_n(1) \right)$

The CD learning can be considered a maximum-likelihood estimation with the given training data.
Momentum and weight decay are also applied.
Pre-training for the stacked RBMs
Training data
Output data
Pre-training for
1st layer RBM
Training data
Output data
Pre-training for
2nd layer RBM
Input data
copy
copy
copy
Pre-training for the RBMs
Outline
1. Examples of Deep NNs
2. RBM to Deep NN
3. Deep Neural Network (Deep NN)
– Back-Propagation (Supervised Learning)
4. Restricted Boltzmann Machine (RBM)
– Mathematics, Probabilistic Model and Inference Model
– Pre-training by Contrastive Divergence Learning
(Unsupervised Learning)
5. Inference Model with Distribution
49
http://bit.ly/dnnicpr2014
Drop-out

Drop-out: nodes are randomly dropped out for each mini-batch; the output of a
dropped node is zero. A 50% drop-out rate is recommended.

The drop-out is expected to behave similarly to ensemble learning.
It is effective for avoiding overfitting.

G. Hinton, N. Srivastava, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov,
"Improving neural networks by preventing co-adaptation of feature detectors,"
arXiv preprint arXiv:1207.0580, 2012.
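A minimal sketch of a drop-out forward pass: for each mini-batch a random binary mask zeroes out roughly 50% of the input nodes. The helper names and shapes are illustrative assumptions.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def dropout_forward(v, W, b, rate=0.5, rng=None):
    """Single-layer forward pass with a Bernoulli drop-out mask on the inputs."""
    rng = rng or np.random.default_rng()
    mask = (rng.random(v.shape) >= rate).astype(float)   # xi_i in {0, 1}
    return sigmoid(W.T @ (mask * v) + b), mask

rng = np.random.default_rng(0)
v = rng.random(3)
W, b = rng.standard_normal((3, 4)) * 0.1, np.zeros(4)
h, mask = dropout_forward(v, W, b, rng=rng)
print(mask, h)
```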
Ensemble learning and Drop-out

Ensemble learning: an integration of multiple weak learners outperforms a single learner,
$\mathbf{h}(\mathbf{v}) = \dfrac{1}{K} \left( \mathbf{h}_1(\mathbf{v}) + \mathbf{h}_2(\mathbf{v}) + \mathbf{h}_3(\mathbf{v}) + \cdots + \mathbf{h}_K(\mathbf{v}) \right)$.

The drop-out is expected to have an effect similar to ensemble learning.
Fast drop-out learning

S.I. Wang and C.D. Manning, Fast dropout training, ICML 2013.

Inference of the standard NN:
$n = \sum_i w_i v_i + b, \qquad h = \sigma(n)$

Inference of the drop-out NN:
$h = E\!\left[ \sigma(\chi) \right], \qquad \chi = \sum_i \xi_i w_i v_i + b$,
where $\chi$ and $\xi_i$ are stochastic variables with
$\xi_i \in \{0, 1\}, \quad P(\xi_i = 0) = 0.5, \quad P(\xi_i = 1) = 0.5$,
and $P(\chi)$ is the induced distribution over the weighted sum.
Fast drop-out learning (continued)

S.I. Wang and C.D. Manning, Fast dropout training, ICML 2013.

Approximate $P(\chi)$ by a Gaussian $N(\chi; \mu, s^2)$ with mean $\mu$ and variance $s^2$:
$h = E\!\left[ \sigma(\chi) \right]
 \approx \int_{-\infty}^{\infty} N(\chi; \mu, s^2)\, \sigma(\chi)\, d\chi
 \approx \sigma\!\left( \dfrac{\mu}{\sqrt{1 + \pi s^2 / 8}} \right)$

This closed-form approximation allows fast calculation without sampling.

(Drop-out inference as before: $h = E[\sigma(\chi)]$, $\chi = \sum_i \xi_i w_i v_i + b$ with
 $\xi_i \in \{0,1\}$, $P(\xi_i = 0) = P(\xi_i = 1) = 0.5$.)
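A minimal sketch of the closed-form approximation above: propagate the mean and variance of $\chi = \sum_i \xi_i w_i v_i + b$ under a 50% drop-out rate and squash with $\sigma(\mu / \sqrt{1 + \pi s^2/8})$. The moment formulas for the Bernoulli mask and the square root in the denominator are assumptions consistent with the Gaussian approximation in the Wang & Manning paper, not copied from the slide.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def fast_dropout_unit(v, w, b, p_keep=0.5):
    """Closed-form drop-out inference h ~ sigma(mu / sqrt(1 + pi*s^2/8))."""
    a = w * v
    mu = p_keep * np.sum(a) + b                      # E[chi]
    s2 = p_keep * (1.0 - p_keep) * np.sum(a ** 2)    # Var[chi] from independent xi_i
    return sigmoid(mu / np.sqrt(1.0 + np.pi * s2 / 8.0))

rng = np.random.default_rng(0)
v, w = rng.random(3), rng.standard_normal(3)
print(fast_dropout_unit(v, w, b=0.0))

# Monte-Carlo check of h = E[sigma(chi)] by actually sampling the mask:
xi = rng.integers(0, 2, size=(100000, 3))
print(sigmoid((xi * (w * v)).sum(axis=1)).mean())
```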
Inference based on the RBM model

M. Tanaka and M. Okutomi, A Novel Inference of a Restricted Boltzmann Machine, ICPR 2014.

Inference of the standard NN:
$n = \sum_i w_i v_i + b, \qquad h = \sigma(n)$

Inference based on the RBM model:
the inputs $h_i$ are themselves stochastic variables
(RBM inference: $P(\mathbf{h} = \mathbf{1} \mid \mathbf{v}; \boldsymbol{\theta}) = \sigma\left( \mathbf{W}^T \mathbf{v} + \mathbf{b} \right)$),
so the weighted sum $\chi = \sum_i w_i h_i + b$ is a stochastic variable, and the output is
$H = E\!\left[ \sigma(\chi) \right]$.

Approximate $P(\chi)$ by a Gaussian $N(\chi; \mu, s^2)$ with mean $\mu$ and variance $s^2$:
$H = E\!\left[ \sigma(\chi) \right]
 \approx \int_{-\infty}^{\infty} N(\chi; \mu, s^2)\, \sigma(\chi)\, d\chi
 \approx \sigma\!\left( \dfrac{\mu}{\sqrt{1 + \pi s^2 / 8}} \right)$
Inference based on the RBM model: results

M. Tanaka and M. Okutomi, A Novel Inference of a Restricted Boltzmann Machine, ICPR 2014.
The RBM-based inference improves the performance.
(Results figures)
Outline
1. Examples of Deep NNs
2. RBM to Deep NN
3. Deep Neural Network (Deep NN)
– Back-Propagation (Supervised Learning)
4. Restricted Boltzmann Machine (RBM)
– Mathematics, Probabilistic Model and Inference Model
– Pre-training by Contrastive Divergence Learning
(Unsupervised Learning)
5. Inference Model with Distribution
57
http://bit.ly/dnnicpr2014