Masayuki Tanaka
Aug. 17, 2015
Back-Propagation Algorithm for
Deep Neural Networks and
Contrastive Divergence Learning for
Restricted Boltzmann Machine
Outline
1. Examples of Deep Learning
2. RBM to Deep NN
3. Deep Neural Network (Deep NN)
– Back-Propagation (Supervised Learning)
4. Restricted Boltzmann Machine (RBM)
– Mathematics, Probabilistic Model and Inference Model
– Pre-training by Contrastive Divergence Learning
(Unsupervised Learning)
5. Inference Model with Distribution
1
http://bit.ly/dnnicpr2014
Deep learning
2
– MNIST (handwritten digits benchmark)
MNIST
 Top performance in character recognition
Deep learning
3
CIFAR10
– CIFAR (image classification benchmark)
 Top performance in image classification
Deep learning
4
GoogLeNet, ILSVRC2014 (layer legend: Convolution, Pooling, Softmax, Other)
ImageNet Large Scale Visual Recognition Challenge (ILSVRC)
 Top performance in visual recognition
Deep learning
5
http://static.googleusercontent.com/external_content/untrusted_dlcp/research.google.com/en//archive/unsupervised_icml2012.pdf
Automatic learning from YouTube videos: a neuron for human faces and a neuron for cats.
10,000,000 training samples
Three days of learning with 1,000 computers
 "Cat neuron"
Deep??
6
Input
layer
Output
layer
(Shallow) NN
Input
layer
Output
layer
Deep NN
NN: Neural network
Pros and Cons of Deep NN
7
Input
layer
Output
layer
Deep NN
Until a few years ago…
1. Tends to overfit
2. The learning signal does not reach the lower layers
・Pre-training with RBM
・Big data
ImageNet: more than 1.5 M labeled images
http://www.image-net.org/
Labeled Faces in the Wild: more than 10,000 face images
http://vis-www.cs.umass.edu/lfw/
High-performance network
Outline
1. Examples of Deep NNs
2. RBM to Deep NN
3. Deep Neural Network (Deep NN)
– Back-Propagation (Supervised Learning)
4. Restricted Boltzmann Machine (RBM)
– Mathematics, Probabilistic Model and Inference Model
– Pre-training by Contrastive Divergence Learning
(Unsupervised Learning)
5. Inference Model with Distribution
8
http://bit.ly/dnnicpr2014
Single Layer Neural Network

 Single node output
Input layer $v_1, v_2, v_3$ with weights $w_1, w_2, w_3$; output node $h$:
$h = \sigma\left( \sum_i w_i v_i + b \right)$
Sigmoid function: $\sigma(x) = \dfrac{1}{1 + e^{-x}}$
(Plot: sigmoid function, rising from 0 to 1 over $x \in [-6, 6]$)

 Multiple nodes output (Single Layer NN)
Input layer $\mathbf{v}$, output layer $\mathbf{h}$:
$h_j = \sigma\left( \sum_i w_{ij} v_i + b_j \right)$
Vector representation of the single layer NN: $\mathbf{h} = \sigma\left( \mathbf{W}^T \mathbf{v} + \mathbf{b} \right)$
It is equivalent to the inference model of the RBM.
Weighted sum and Activation functions

Input layer $v_1, v_2, v_3$ with weights $w_1, w_2, w_3$; output $h$:
$h = f\left( \sum_i w_i v_i + b \right)$
Separating the weighted sum $n$ and the activation function $f$:
$n = \sum_i w_i v_i + b, \qquad h = f(n)$

Sigmoid function: $f(x) = \sigma(x) = \dfrac{1}{1 + e^{-x}}$
(Plot: sigmoid over $x \in [-6, 6]$)

Rectified linear unit: $f(x) = \mathrm{ReLU}(x) = \begin{cases} 0 & (x < 0) \\ x & (x \ge 0) \end{cases}$
(Plot: ReLU over $x \in [-6, 6]$)
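The single-layer inference above maps directly to a few lines of code. The following NumPy sketch is an illustration only; the helper names (`sigmoid`, `relu`, `single_layer`) and the array shapes are assumptions, not taken from the slides.

```python
# Minimal sketch of the single-layer inference h = f(W^T v + b)
# with the two activation functions described above.
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def relu(x):
    return np.maximum(0.0, x)

def single_layer(v, W, b, f=sigmoid):
    """Weighted sum n = W^T v + b followed by the activation f."""
    n = W.T @ v + b
    return f(n)

# Usage example with random parameters.
rng = np.random.default_rng(0)
v = rng.random(3)                    # input layer (3 nodes)
W = rng.standard_normal((3, 4))      # weights, shape (inputs, outputs)
b = np.zeros(4)                      # biases
print(single_layer(v, W, b))         # sigmoid outputs
print(single_layer(v, W, b, relu))   # ReLU outputs
```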
Single layer NN to Deep NN
11
The deep NN is built up by stacking single layer NNs.
1st NN
2nd NN
k-th NN
Output data
Input NN
The output of the single layer NN
will be the input of the next single
layer NN.
The output data of the deep NN
is inferred by iterating the
process.
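The stacking described above is just an iterated single-layer inference. A minimal sketch, assuming the `sigmoid` helper and random shapes as illustrations (not from the slides):

```python
# Stacked (deep) inference: the output of each single-layer NN becomes
# the input of the next one.
import numpy as np

def deep_forward(v, weights, biases, f):
    """Iterate the single-layer inference over all layers."""
    h = v
    for W, b in zip(weights, biases):
        h = f(W.T @ h + b)
    return h

sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))
rng = np.random.default_rng(0)
sizes = [3, 5, 4, 2]   # input layer, two hidden layers, output layer
weights = [rng.standard_normal((m, n)) * 0.1 for m, n in zip(sizes[:-1], sizes[1:])]
biases = [np.zeros(n) for n in sizes[1:]]
print(deep_forward(rng.random(3), weights, biases, sigmoid))
```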
Parameter estimation for the deep NN

Parameters are estimated by a gradient descent algorithm that minimizes the
difference between the output data and the teacher data.
(Illustration: gradient descent on a 1-D objective $y(x)$, iterates $x_0 \to x_1 \to x_2$)
The deep NN is built up by stacking single layer NNs.
1st NN
2nd NN
k-th NN
Teach data
Input NN
Parameter estimation for the deep NN

Parameters are estimated by a gradient descent algorithm that minimizes the
difference between the output data and the teacher data.
The deep NN is built up by stacking single layer NNs.
1st NN
2nd NN
k-th NN
Teacher data
Input NN
Back-propagation:
The gradients can be calculated by propagating the information backward.
Why is the pre-training necessary?

1st NN
2nd NN
k-th NN
Teacher data
Input NN
The back-propagation calculates the gradients from the output layer toward the input layer.
The back-propagated information can hardly reach the deep layers.
The deep layers (1st layer, 2nd layer, …) are therefore better learned by unsupervised learning:
pre-training with the RBMs.
Pre-training with RBMs
15
1st NN
2nd NN
k-th NN
Input data
Single layer NN
RBM
Data
The inference of the single layer NN is mathematically equivalent to the inference of the RBM.
The RBM parameters are estimated by a maximum-likelihood algorithm from the given training data.
Pre-training and fine-tuning
Training data
Output data
Pre-training for
1st layer RBM
Training data
Output data
Pre-training for
2nd layer RBM
Input data
Teach data Back
propagation
copy
copy
copy
Fine-tuning of deep NN
Pre-training with RBMs
Feature vector extraction
Training data
Output data
Pre-training for
1st layer RBM
Training data
Output data
Pre-training for
2nd layer RBM
Input data
Feature
copy
copy
copy
Pre-training with RBMs
Outline
1. Examples of Deep NNs
2. RBM to Deep NN
3. Deep Neural Network (Deep NN)
– Back-Propagation (Supervised Learning)
4. Restricted Boltzmann Machine (RBM)
– Mathematics, Probabilistic Model and Inference Model
– Pre-training by Contrastive Divergence Learning
(Unsupervised Learning)
5. Inference Model with Distribution
18
http://bit.ly/dnnicpr2014
Back-Propagation Algorithm

Input data → deep NN → output data; teacher data; back propagation.
Vector representation of the single layer NN: $\mathbf{h} = \sigma\left( \mathbf{W}^T \mathbf{v} + \mathbf{b} \right)$

The goal of learning:
the weights W and biases b of each layer are estimated so that the differences between
the output data and the teacher data are minimized.

Objective function: $I = \dfrac{1}{2} \sum_k \left( h_k^{(L)} - t_k \right)^2$

Efficient calculation of the gradient $\dfrac{\partial I}{\partial \mathbf{W}^{(\ell)}}$ is important.
The back-propagation algorithm is an efficient algorithm to calculate these gradients.
Back-Propagation: Gradient of the sigmoid function

 Sigmoid function: $\sigma(x) = \dfrac{1}{1 + e^{-x}}$
 Gradient of the sigmoid function: $\dfrac{\partial \sigma}{\partial x} = \left( 1 - \sigma(x) \right) \sigma(x)$

 Derivation of the gradient of the sigmoid function:
$\dfrac{\partial \sigma}{\partial x}
 = \dfrac{\partial}{\partial x} \dfrac{1}{1 + e^{-x}}
 = -\dfrac{1}{\left( 1 + e^{-x} \right)^2} \times \left( -e^{-x} \right)
 = \dfrac{e^{-x}}{\left( 1 + e^{-x} \right)^2}
 = \dfrac{e^{-x}}{1 + e^{-x}} \times \dfrac{1}{1 + e^{-x}}
 = \left( 1 - \dfrac{1}{1 + e^{-x}} \right) \dfrac{1}{1 + e^{-x}}
 = \left( 1 - \sigma(x) \right) \sigma(x)$
Back-Propagation: Simplification

 Single layer NN: input layer $\mathbf{v}$, output layer $\mathbf{h}$
$h_j = \sigma\left( \sum_i w_{ij} v_i + b_j \right), \qquad \mathbf{h} = \sigma\left( \mathbf{W}^T \mathbf{v} + \mathbf{b} \right)$

With the augmented weight and input
$\mathbf{W}' = \begin{pmatrix} \mathbf{W} \\ \mathbf{b}^T \end{pmatrix}, \qquad
 \mathbf{v}' = \begin{pmatrix} \mathbf{v} \\ 1 \end{pmatrix}$,
the bias is absorbed into the weight:
$\mathbf{h} = \sigma\left( \mathbf{W}^T \mathbf{v} + \mathbf{b} \right) = \sigma\left( \mathbf{W}'^{\,T} \mathbf{v}' \right)$.

Here and after, let us consider only the weight W:
vector representation of the single layer NN: $\mathbf{h} = \sigma\left( \mathbf{W}^T \mathbf{v} \right)$.
Two-layer NN

Separate the weighted sum and the activation function:
$n = \sum_i w_i v_i, \qquad h = \sigma(n)$

1st layer: $n_j^{(1)} = \sum_i w_{ij}^{(1)} v_i, \qquad h_j^{(1)} = \sigma\left( n_j^{(1)} \right)$
2nd layer: $n_k^{(2)} = \sum_j w_{jk}^{(2)} h_j^{(1)}, \qquad h_k^{(2)} = \sigma\left( n_k^{(2)} \right)$
Teacher data: $t_1, t_2, \dots, t_k$
Back-Propagation of two-layer NN

1st layer: $n_j^{(1)} = \sum_i w_{ij}^{(1)} v_i, \qquad h_j^{(1)} = \sigma\left( n_j^{(1)} \right)$
2nd layer: $n_k^{(2)} = \sum_j w_{jk}^{(2)} h_j^{(1)}, \qquad h_k^{(2)} = \sigma\left( n_k^{(2)} \right)$
Teacher data: $t_k$

Objective function: $I = \dfrac{1}{2} \sum_k \left( h_k^{(2)} - t_k \right)^2$

2nd-layer gradient:
$\dfrac{\partial I}{\partial w_{jk}^{(2)}}
 = \dfrac{\partial I}{\partial n_k^{(2)}} \dfrac{\partial n_k^{(2)}}{\partial w_{jk}^{(2)}}
 = \delta_k^{(2)} h_j^{(1)}$,
$\delta_k^{(2)} = \dfrac{\partial I}{\partial n_k^{(2)}}
 = \dfrac{\partial I}{\partial h_k^{(2)}} \dfrac{\partial h_k^{(2)}}{\partial n_k^{(2)}}
 = \left( h_k^{(2)} - t_k \right) h_k^{(2)} \left( 1 - h_k^{(2)} \right)$

1st-layer gradient:
$\dfrac{\partial I}{\partial w_{ij}^{(1)}}
 = \dfrac{\partial I}{\partial n_j^{(1)}} \dfrac{\partial n_j^{(1)}}{\partial w_{ij}^{(1)}}
 = \delta_j^{(1)} v_i$,
$\delta_j^{(1)} = \dfrac{\partial I}{\partial n_j^{(1)}}
 = \sum_k \dfrac{\partial I}{\partial n_k^{(2)}} \dfrac{\partial n_k^{(2)}}{\partial h_j^{(1)}} \dfrac{\partial h_j^{(1)}}{\partial n_j^{(1)}}
 = \sum_k \delta_k^{(2)} w_{jk}^{(2)} \, h_j^{(1)} \left( 1 - h_j^{(1)} \right)$

Back-propagation: the error $\delta^{(2)}$ of the 2nd layer propagates backward to give $\delta^{(1)}$.
Back-Propagation of an arbitrary layer

Forward: $\mathbf{n}^{(\ell)} = \mathbf{W}^{(\ell)\,T} \mathbf{h}^{(\ell-1)}, \qquad
 \mathbf{h}^{(\ell)} = \mathbf{f}^{(\ell)}\!\left( \mathbf{n}^{(\ell)} \right)$

Backward ($\otimes$: elementwise product):
$\boldsymbol{\delta}^{(\ell)} = \left( \mathbf{W}^{(\ell+1)} \boldsymbol{\delta}^{(\ell+1)} \right) \otimes \dfrac{\partial f^{(\ell)}}{\partial \mathbf{n}^{(\ell)}}, \qquad
 \dfrac{\partial I}{\partial \mathbf{W}^{(\ell)}} = \boldsymbol{\delta}^{(\ell)} \mathbf{h}^{(\ell-1)\,T}$

The same recursion applies at every layer, e.g.
$\boldsymbol{\delta}^{(\ell+1)} = \left( \mathbf{W}^{(\ell+2)} \boldsymbol{\delta}^{(\ell+2)} \right) \otimes \dfrac{\partial f^{(\ell+1)}}{\partial \mathbf{n}^{(\ell+1)}}, \qquad
 \dfrac{\partial I}{\partial \mathbf{W}^{(\ell+1)}} = \boldsymbol{\delta}^{(\ell+1)} \mathbf{h}^{(\ell)\,T}$
$\boldsymbol{\delta}^{(\ell-1)} = \left( \mathbf{W}^{(\ell)} \boldsymbol{\delta}^{(\ell)} \right) \otimes \dfrac{\partial f^{(\ell-1)}}{\partial \mathbf{n}^{(\ell-1)}}, \qquad
 \dfrac{\partial I}{\partial \mathbf{W}^{(\ell-1)}} = \boldsymbol{\delta}^{(\ell-1)} \mathbf{h}^{(\ell-2)\,T}$
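A minimal NumPy sketch of this recursion for a sigmoid network with the squared-error objective $I = \tfrac{1}{2}\sum_k (h_k^{(L)} - t_k)^2$. Biases are omitted (absorbed into W as in the simplification slide), and variable names are illustrative assumptions; because the code stores W with shape (inputs, outputs) and uses $h = \sigma(\mathbf{W}^T\mathbf{h}_{prev})$, the gradient is stored as the transpose $\mathbf{h}^{(\ell-1)}\boldsymbol{\delta}^{(\ell)T}$ of the slide's expression.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def backprop(v, t, weights):
    """Forward pass, then the backward delta recursion for every layer."""
    hs = [v]                                  # h^(0) = v
    for W in weights:
        hs.append(sigmoid(W.T @ hs[-1]))      # h^(l) = sigma(W^(l)^T h^(l-1))
    # Output-layer delta: (h^(L) - t) * sigma'(n^(L)) with sigma' = h(1-h).
    delta = (hs[-1] - t) * hs[-1] * (1.0 - hs[-1])
    grads = [None] * len(weights)
    for l in reversed(range(len(weights))):
        grads[l] = np.outer(hs[l], delta)     # dI/dW^(l+1), shape matches weights[l]
        if l > 0:
            # delta^(l) = (W^(l+1) delta^(l+1)) elementwise* sigma'(n^(l))
            delta = (weights[l] @ delta) * hs[l] * (1.0 - hs[l])
    return grads, hs[-1]

# Usage example with random parameters and data.
rng = np.random.default_rng(0)
weights = [rng.standard_normal((3, 4)) * 0.1, rng.standard_normal((4, 2)) * 0.1]
grads, output = backprop(rng.random(3), rng.random(2), weights)
print([g.shape for g in grads], output)
```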
Tip for debugging the gradient calculation

Objective function: $I(\boldsymbol{\theta}) = \dfrac{1}{2} \sum_k \left( h_k^{(L)}(\mathbf{v}; \boldsymbol{\theta}) - t_k \right)^2$

Gradient calculated by the back-propagation: $\dfrac{\partial I}{\partial \theta_i}$
— computationally efficient, but the implementation is difficult (error-prone).

Definition of the gradient:
$\dfrac{\partial I}{\partial \theta_i} = \lim_{\varepsilon \to 0} \dfrac{I(\boldsymbol{\theta} + \varepsilon \mathbf{1}_i) - I(\boldsymbol{\theta})}{\varepsilon}$,
where $\mathbf{1}_i$ is the vector whose i-th element is 1 and whose other elements are 0.

Finite-difference approximation:
$\Delta_i I = \dfrac{I(\boldsymbol{\theta} + \varepsilon \mathbf{1}_i) - I(\boldsymbol{\theta})}{\varepsilon}$
— computationally inefficient, but easy to implement.

For a small $\varepsilon$, $\Delta_i I \approx \dfrac{\partial I}{\partial \theta_i}$, so the two values can be compared to debug the back-propagation code.
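A minimal self-contained sketch of this gradient check for a single sigmoid layer; the analytic gradient comes from the back-propagation formulas above, and all shapes and values are illustrative assumptions.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def objective(W, v, t):
    """I(theta) = 1/2 * sum_k (h_k - t_k)^2 for a single sigmoid layer."""
    return 0.5 * np.sum((sigmoid(W.T @ v) - t) ** 2)

rng = np.random.default_rng(0)
W = rng.standard_normal((3, 2)) * 0.1
v, t = rng.random(3), rng.random(2)

# Analytic gradient (back-propagation of the single layer): dI/dw_ij = v_i * delta_j
h = sigmoid(W.T @ v)
delta = (h - t) * h * (1.0 - h)
grad_bp = np.outer(v, delta)

# Finite-difference approximation Delta_i I for every element of W.
eps = 1e-5
grad_fd = np.zeros_like(W)
for idx in np.ndindex(W.shape):
    W[idx] += eps
    I_plus = objective(W, v, t)
    W[idx] -= eps
    grad_fd[idx] = (I_plus - objective(W, v, t)) / eps

print(np.max(np.abs(grad_bp - grad_fd)))   # small (~1e-5) if the analytic gradient is correct
```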
Stochastic Gradient Descent (Mini-batch learning)

Whole training data: $\{ (\mathbf{v}, \mathbf{t})_1, (\mathbf{v}, \mathbf{t})_2, \cdots, (\mathbf{v}, \mathbf{t})_n, \cdots, (\mathbf{v}, \mathbf{t})_N \}$
The parameters $\boldsymbol{\theta}$ are learned from these samples.
$I_n(\boldsymbol{\theta})$ is the objective function associated with $(\mathbf{v}, \mathbf{t})_n$.

A mini-batch is sampled from the whole training data, and the parameters are updated
with each mini-batch:
$\boldsymbol{\theta} \leftarrow \boldsymbol{\theta} - \eta \dfrac{\partial I}{\partial \boldsymbol{\theta}}$,
where $I$ is the objective summed over the current mini-batch.

Updating with each mini-batch helps to avoid overfitting.
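A minimal sketch of the mini-batch SGD loop described above. The gradient function, the toy data, and all hyperparameter values are illustrative assumptions.

```python
import numpy as np

def sgd(theta, data, grad_fn, eta=0.1, batch_size=10, epochs=5, seed=0):
    """Sample mini-batches, average the per-sample gradients, update theta."""
    rng = np.random.default_rng(seed)
    n = len(data)
    for _ in range(epochs):
        order = rng.permutation(n)
        for start in range(0, n, batch_size):
            batch = [data[i] for i in order[start:start + batch_size]]
            grad = sum(grad_fn(theta, v, t) for v, t in batch) / len(batch)
            theta = theta - eta * grad        # theta <- theta - eta * dI/dtheta
    return theta

# Toy usage: fit theta so that t ~ theta . v with a squared-error objective.
rng = np.random.default_rng(1)
true_theta = np.array([1.0, -2.0, 0.5])
data = [(v, float(v @ true_theta)) for v in rng.random((200, 3))]
grad_fn = lambda th, v, t: (th @ v - t) * v   # dI_n/dtheta for I_n = 1/2 (th.v - t)^2
print(sgd(np.zeros(3), data, grad_fn))        # approaches true_theta
```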
Practical update of parameters

G. Hinton, A Practical Guide to Training Restricted Boltzmann Machines, 2010.

Size of mini-batch: 10 – 100
Learning rate $\eta$: empirically determined
Weight decay rate $\lambda$: 0.01 – 0.00001
Momentum rate $\nu$: 0.9 (initially 0.5)

Update rule:
$\theta^{(t+1)} = \theta^{(t)} + \Delta\theta^{(t)}, \qquad
 \Delta\theta^{(t)} = -\eta \dfrac{\partial I}{\partial \theta} - \lambda \theta^{(t)} + \nu \Delta\theta^{(t-1)}$
(gradient term, weight decay term, momentum term)

Weight decay avoids unnecessary divergence of the weights (especially for the sigmoid function).
Momentum avoids unnecessary oscillation of the update amount; an effect similar to the
conjugate gradient algorithm is expected.
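A minimal sketch of this update rule combining gradient, weight decay, and momentum. The hyperparameter values follow the ranges quoted above; the toy objective is an illustrative assumption.

```python
import numpy as np

def update(theta, delta_prev, grad, eta=0.1, lam=0.0001, nu=0.9):
    """delta = -eta * dI/dtheta - lambda * theta + nu * delta_prev."""
    delta = -eta * grad - lam * theta + nu * delta_prev
    return theta + delta, delta

theta = np.zeros(5)
delta = np.zeros(5)
for step in range(100):
    grad = 2.0 * (theta - 1.0)     # toy objective I = |theta - 1|^2
    theta, delta = update(theta, delta, grad)
print(theta)                        # close to 1 (slightly below, due to weight decay)
```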
Outline
1. Examples of Deep NNs
2. RBM to Deep NN
3. Deep Neural Network (Deep NN)
– Back-Propagation (Supervised Learning)
4. Restricted Boltzmann Machine (RBM)
– Mathematics, Probabilistic Model and Inference Model
– Pre-training by Contrastive Divergence Learning
(Unsupervised Learning)
5. Inference Model with Distribution
28
http://bit.ly/dnnicpr2014
Restricted Boltzmann Machines

 Boltzmann Machines
A Boltzmann machine is a probabilistic model represented by an undirected graph (nodes and edges).
Here, the binary state {0,1} is considered as the state of each node.

 Unrestricted and restricted Boltzmann machines
• (Unrestricted) Boltzmann machine: v: visible layer, h: hidden layer.
  Every node is connected to every other node.
• Restricted Boltzmann machine (RBM): v: visible layer, h: hidden layer.
  There is no edge within the same layer, which makes the analysis easier.
RBM: Probabilistic and energy models

v: visible layer {0,1}; h: hidden layer {0,1}
RBM parameters: θ = (W, b, c); weight: W; biases: b, c

Probabilistic model: $P(\mathbf{v}, \mathbf{h}; \boldsymbol{\theta}) = \dfrac{1}{Z(\boldsymbol{\theta})} \exp\left( -E(\mathbf{v}, \mathbf{h}; \boldsymbol{\theta}) \right)$

Energy model: $E(\mathbf{v}, \mathbf{h}; \boldsymbol{\theta})
 = -\sum_{i,j} v_i w_{ij} h_j - \sum_j b_j h_j - \sum_i c_i v_i
 = -\mathbf{v}^T \mathbf{W} \mathbf{h} - \mathbf{b}^T \mathbf{h} - \mathbf{c}^T \mathbf{v}$

Partition function: $Z(\boldsymbol{\theta}) = \sum_{\mathbf{v}, \mathbf{h} \in \{0,1\}} \exp\left( -E(\mathbf{v}, \mathbf{h}; \boldsymbol{\theta}) \right)$
RBM: Conditional probability model (Inference model)

v: visible layer {0,1}; h: hidden layer {0,1}
(with $P(\mathbf{v}, \mathbf{h}; \boldsymbol{\theta}) = \frac{1}{Z(\boldsymbol{\theta})} \exp(-E(\mathbf{v}, \mathbf{h}; \boldsymbol{\theta}))$ and
$E(\mathbf{v}, \mathbf{h}; \boldsymbol{\theta}) = -\mathbf{v}^T \mathbf{W} \mathbf{h} - \mathbf{b}^T \mathbf{h} - \mathbf{c}^T \mathbf{v}$ as before)

$P(\mathbf{h} \mid \mathbf{v}; \boldsymbol{\theta})
 = \dfrac{P(\mathbf{v}, \mathbf{h}; \boldsymbol{\theta})}{\sum_{\mathbf{h} \in \{0,1\}} P(\mathbf{v}, \mathbf{h}; \boldsymbol{\theta})}
 = \dfrac{\exp\left( \sum_{i,j} v_i w_{ij} h_j + \sum_j b_j h_j + \sum_i c_i v_i \right)}
         {\sum_{\mathbf{h} \in \{0,1\}} \exp\left( \sum_{i,j} v_i w_{ij} h_j + \sum_j b_j h_j + \sum_i c_i v_i \right)}$
$ = \dfrac{\prod_i \exp(c_i v_i) \prod_j \exp\left( \sum_i v_i w_{ij} h_j + b_j h_j \right)}
          {\prod_i \exp(c_i v_i) \prod_j \sum_{h_j \in \{0,1\}} \exp\left( \sum_i v_i w_{ij} h_j + b_j h_j \right)}
 = \prod_j \dfrac{\exp\left( \sum_i v_i w_{ij} h_j + b_j h_j \right)}
                 {\sum_{h_j \in \{0,1\}} \exp\left( \sum_i v_i w_{ij} h_j + b_j h_j \right)}
 = \prod_j P(h_j \mid \mathbf{v}; \boldsymbol{\theta})$

The conditional probabilities of the hidden nodes are independent:
$P(h_j \mid \mathbf{v}; \boldsymbol{\theta})
 = \dfrac{\exp\left( \sum_i v_i w_{ij} h_j + b_j h_j \right)}
         {\sum_{h_j \in \{0,1\}} \exp\left( \sum_i v_i w_{ij} h_j + b_j h_j \right)}$
RBM: Conditional probability model (Inference model), continued

v: visible layer {0,1}; h: hidden layer {0,1}

$P(h_j = 1 \mid \mathbf{v}; \boldsymbol{\theta})
 = \dfrac{\exp\left( \sum_i v_i w_{ij} \times 1 + b_j \times 1 \right)}
         {\exp\left( \sum_i v_i w_{ij} \times 0 + b_j \times 0 \right) + \exp\left( \sum_i v_i w_{ij} \times 1 + b_j \times 1 \right)}$
$ = \dfrac{\exp\left( \sum_i v_i w_{ij} + b_j \right)}{\exp(0) + \exp\left( \sum_i v_i w_{ij} + b_j \right)}
 = \dfrac{1}{1 + \exp\left( -\left( \sum_i v_i w_{ij} + b_j \right) \right)}
 = \sigma\left( \sum_i v_i w_{ij} + b_j \right)$, with $\sigma(x) = \dfrac{1}{1 + e^{-x}}$.

In vector form:
$P(\mathbf{h} = \mathbf{1} \mid \mathbf{v}; \boldsymbol{\theta}) = \sigma\left( \mathbf{W}^T \mathbf{v} + \mathbf{b} \right), \qquad
 P(\mathbf{v} = \mathbf{1} \mid \mathbf{h}; \boldsymbol{\theta}) = \sigma\left( \mathbf{W} \mathbf{h} + \mathbf{c} \right)$
This is identical to the vector representation of the single layer NN, $\mathbf{h} = \sigma\left( \mathbf{W}^T \mathbf{v} + \mathbf{b} \right)$.
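A minimal sketch of the RBM inference (conditional probabilities), which is the same computation as the single-layer NN forward pass. Shapes and parameter values are illustrative assumptions.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def p_h_given_v(v, W, b):
    """P(h_j = 1 | v) = sigma(W^T v + b)_j."""
    return sigmoid(W.T @ v + b)

def p_v_given_h(h, W, c):
    """P(v_i = 1 | h) = sigma(W h + c)_i."""
    return sigmoid(W @ h + c)

rng = np.random.default_rng(0)
W = rng.standard_normal((6, 4)) * 0.1     # visible x hidden
b, c = np.zeros(4), np.zeros(6)
v = rng.integers(0, 2, size=6).astype(float)
h_prob = p_h_given_v(v, W, b)
print(h_prob)
print(p_v_given_h(h_prob > 0.5, W, c))    # reconstruct from thresholded hidden states
```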
Gaussian-Bernoulli RBM

Bernoulli-Bernoulli RBM: v: visible layer {0,1}; h: hidden layer {0,1}
Gaussian-Bernoulli RBM: v: visible layer with Gaussian distribution $N(v; \mu, s^2)$; h: hidden layer {0,1}

Probabilistic model: $P(\mathbf{v}, \mathbf{h}) = \dfrac{1}{Z(\boldsymbol{\theta})} \exp\left( -E(\mathbf{v}, \mathbf{h}) \right)$

Energy model: $E(\mathbf{v}, \mathbf{h})
 = \dfrac{1}{2 s^2} \sum_i \left( v_i - c_i \right)^2 - \dfrac{1}{s} \sum_{i,j} v_i w_{ij} h_j - \sum_j b_j h_j$

Inference model (conditional probabilities):
$P(\mathbf{h} = \mathbf{1} \mid \mathbf{v}) = \sigma\left( \dfrac{1}{s} \mathbf{W}^T \mathbf{v} + \mathbf{b} \right), \qquad
 P(\mathbf{v} \mid \mathbf{h}) = N\left( \mathbf{v};\ s \mathbf{W} \mathbf{h} + \mathbf{c},\ s^2 \mathbf{I} \right)$
Outline
1. Examples of Deep NNs
2. RBM to Deep NN
3. Deep Neural Network (Deep NN)
– Back-Propagation (Supervised Learning)
4. Restricted Boltzmann Machine (RBM)
– Mathematics, Probabilistic Model and Inference Model
– Pre-training by Contrastive Divergence Learning
(Unsupervised Learning)
5. Inference Model with Distribution
34
http://bit.ly/dnnicpr2014
RBM: Contrastive Divergence Learning

Iterative process of the CD learning:
$\mathbf{h}(0) = \sigma\left( \mathbf{W}^T \mathbf{v}(0) + \mathbf{b} \right)$
$\mathbf{v}(1) = \sigma\left( \mathbf{W} \mathbf{h}(0) + \mathbf{c} \right)$
$\mathbf{h}(1) = \sigma\left( \mathbf{W}^T \mathbf{v}(1) + \mathbf{b} \right)$

$\mathbf{W} \leftarrow \mathbf{W} - \varepsilon\, \Delta\mathbf{W}, \qquad
 \Delta\mathbf{W} = \dfrac{1}{N} \sum_n \left( \mathbf{v}_n(0)^{\,T} \mathbf{h}_n(0) - \mathbf{v}_n(1)^{\,T} \mathbf{h}_n(1) \right)$

The CD learning can be considered a maximum-likelihood estimation with the given training data.
Momentum and weight decay are also applied.
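A minimal sketch of one CD-1 update for a mini-batch V0 (rows are the training vectors $\mathbf{v}_n(0)$). Probabilities are used in place of sampled binary states (as recommended later in the deck), and the products are written as $\mathbf{v}\mathbf{h}^T$ outer products so that $\Delta\mathbf{W}$ has the shape of W. The slide writes $\mathbf{W} \leftarrow \mathbf{W} - \varepsilon\Delta\mathbf{W}$; with $\Delta\mathbf{W}$ defined as data term minus model term, the likelihood-ascent update adds $\varepsilon\Delta\mathbf{W}$, which is what this sketch (and standard CD implementations) does. The bias updates, shapes, and learning rate are illustrative assumptions not shown on the slide.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cd1_step(V0, W, b, c, eps=0.1):
    """One contrastive-divergence update with a single Gibbs step (CD-1)."""
    H0 = sigmoid(V0 @ W + b)           # h(0) = sigma(W^T v(0) + b), batched
    V1 = sigmoid(H0 @ W.T + c)         # v(1) = sigma(W h(0) + c)
    H1 = sigmoid(V1 @ W + b)           # h(1) = sigma(W^T v(1) + b)
    N = V0.shape[0]
    dW = (V0.T @ H0 - V1.T @ H1) / N   # mean of v(0) h(0)^T - v(1) h(1)^T
    W = W + eps * dW
    b = b + eps * (H0 - H1).mean(axis=0)
    c = c + eps * (V0 - V1).mean(axis=0)
    return W, b, c

rng = np.random.default_rng(0)
V0 = rng.integers(0, 2, size=(20, 6)).astype(float)   # mini-batch of binary data
W = rng.standard_normal((6, 4)) * 0.01
W, b, c = cd1_step(V0, W, np.zeros(4), np.zeros(6))
```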
RBM: Outline of the CD learning

v: visible layer {0,1}; h: hidden layer {0,1}
Parameters θ: weight W, biases b, c
$P(\mathbf{v}, \mathbf{h}; \boldsymbol{\theta}) = \dfrac{1}{Z(\boldsymbol{\theta})} \exp\left( -E(\mathbf{v}, \mathbf{h}; \boldsymbol{\theta}) \right)$
$E(\mathbf{v}, \mathbf{h}; \boldsymbol{\theta}) = -\mathbf{v}^T \mathbf{W} \mathbf{h} - \mathbf{b}^T \mathbf{h} - \mathbf{c}^T \mathbf{v}$

Outline of the Contrastive Divergence learning:
• Maximum-likelihood estimation with the given training data {v_n}.
• An (approximated) EM algorithm is applied to handle the unobserved hidden data.
• Gibbs sampling is applied to evaluate the partition-function (model) term.
• The Gibbs sampling is approximated by a single sampling step.
CD learning: Maximum likelihood

v: visible layer {0,1}; h: hidden layer {0,1}
$P(\mathbf{v}, \mathbf{h}; \boldsymbol{\theta}) = \dfrac{1}{Z(\boldsymbol{\theta})} \exp\left( -E(\mathbf{v}, \mathbf{h}; \boldsymbol{\theta}) \right)$,
$E(\mathbf{v}, \mathbf{h}; \boldsymbol{\theta}) = -\mathbf{v}^T \mathbf{W} \mathbf{h} - \mathbf{b}^T \mathbf{h} - \mathbf{c}^T \mathbf{v}$

The RBM is a probabilistic model, so maximum likelihood gives the parameters for the given
training data. The visible data are given; the hidden data are not given, so the hidden data
are integrated out.

Log likelihood for the training data {v_n}:
$L(\boldsymbol{\theta}) = \sum_n L_n(\boldsymbol{\theta}) = \sum_n \log P(\mathbf{v}_n; \boldsymbol{\theta})
 = \sum_n \log E_{\mathbf{h} \sim P(\mathbf{h} \mid \mathbf{v}_n; \boldsymbol{\theta})}\!\left[ P(\mathbf{v}_n, \mathbf{h}; \boldsymbol{\theta}) \right]$

The optimization of $L_n(\boldsymbol{\theta}) = \log E_{\mathbf{h} \sim P(\mathbf{h} \mid \mathbf{v}_n; \boldsymbol{\theta})}\!\left[ P(\mathbf{v}_n, \mathbf{h}; \boldsymbol{\theta}) \right]$ is performed by the EM algorithm.
EM Algorithm

Reference: これなら分かる最適化数学 (Optimization Mathematics That You Can Understand), Kenichi Kanatani

Log likelihood for the training data {v_n}:
$L_n(\boldsymbol{\theta}) = \log P(\mathbf{v}_n; \boldsymbol{\theta}) = \log E_{\mathbf{h} \sim P(\mathbf{h} \mid \mathbf{v}_n; \boldsymbol{\theta})}\!\left[ P(\mathbf{v}_n, \mathbf{h}; \boldsymbol{\theta}) \right]$

The EM algorithm monotonically increases the log likelihood.

EM algorithm:
1. Initialize the parameter $\boldsymbol{\theta}$ with $\boldsymbol{\theta}_0$. Set $\tau = 0$.
2. Evaluate the following function (E-step):
   $Q_\tau(\boldsymbol{\theta}) = E_{\mathbf{h} \sim P(\mathbf{h} \mid \mathbf{v}_n; \boldsymbol{\theta}_\tau)}\!\left[ \log P(\mathbf{v}_n, \mathbf{h}; \boldsymbol{\theta}) \right]$
3. Find the $\boldsymbol{\theta}$ that maximizes $Q_\tau(\boldsymbol{\theta})$ (M-step).
4. Set $\tau \leftarrow \tau + 1$, then go to step 2. Iterate until convergence.

※ In the CD learning, the M-step is approximated.
Evaluation function and derivatives

v: visible layer {0,1}; h: hidden layer {0,1}
$P(\mathbf{v}, \mathbf{h}; \boldsymbol{\theta}) = \dfrac{1}{Z(\boldsymbol{\theta})} \exp\left( -E(\mathbf{v}, \mathbf{h}; \boldsymbol{\theta}) \right), \qquad
 Z(\boldsymbol{\theta}) = \sum_{\mathbf{v}, \mathbf{h} \in \{0,1\}} \exp\left( -E(\mathbf{v}, \mathbf{h}; \boldsymbol{\theta}) \right)$

Evaluation function:
$Q_\tau(\boldsymbol{\theta}) = E_{\mathbf{h} \sim P(\mathbf{h} \mid \mathbf{v}_n; \boldsymbol{\theta}_\tau)}\!\left[ \log P(\mathbf{v}_n, \mathbf{h}; \boldsymbol{\theta}) \right]
 = E_{\mathbf{h} \sim P(\mathbf{h} \mid \mathbf{v}_n; \boldsymbol{\theta}_\tau)}\!\left[ \log \dfrac{\exp\left( -E(\mathbf{v}_n, \mathbf{h}; \boldsymbol{\theta}) \right)}{Z(\boldsymbol{\theta})} \right]
 = E_{\mathbf{h} \sim P(\mathbf{h} \mid \mathbf{v}_n; \boldsymbol{\theta}_\tau)}\!\left[ -E(\mathbf{v}_n, \mathbf{h}; \boldsymbol{\theta}) - \log Z(\boldsymbol{\theta}) \right]$

Derivative:
$\dfrac{\partial Q_\tau(\boldsymbol{\theta})}{\partial \boldsymbol{\theta}}
 = E_{\mathbf{h} \sim P(\mathbf{h} \mid \mathbf{v}_n; \boldsymbol{\theta}_\tau)}\!\left[ -\dfrac{\partial}{\partial \boldsymbol{\theta}} E(\mathbf{v}_n, \mathbf{h}; \boldsymbol{\theta}) \right] - \dfrac{\partial}{\partial \boldsymbol{\theta}} \log Z(\boldsymbol{\theta})
 = \underbrace{E_{\mathbf{h} \sim P(\mathbf{h} \mid \mathbf{v}_n; \boldsymbol{\theta}_\tau)}\!\left[ -\dfrac{\partial}{\partial \boldsymbol{\theta}} E(\mathbf{v}_n, \mathbf{h}; \boldsymbol{\theta}) \right]}_{\text{Data term}}
 - \underbrace{E_{\mathbf{v}, \mathbf{h} \sim P(\mathbf{v}, \mathbf{h}; \boldsymbol{\theta})}\!\left[ -\dfrac{\partial}{\partial \boldsymbol{\theta}} E(\mathbf{v}, \mathbf{h}; \boldsymbol{\theta}) \right]}_{\text{Model term}}$
Derivative of the partition function (Model term)

Probability function: $P(\mathbf{x}; \boldsymbol{\theta}) = \dfrac{1}{Z(\boldsymbol{\theta})} f(\mathbf{x}; \boldsymbol{\theta})$;
partition function: $Z(\boldsymbol{\theta}) = \int f(\mathbf{x}; \boldsymbol{\theta})\, d\mathbf{x}$;
arbitrary function: $f(\mathbf{x}; \boldsymbol{\theta})$.

$\dfrac{\partial \log Z(\boldsymbol{\theta})}{\partial \boldsymbol{\theta}}
 = \dfrac{1}{Z(\boldsymbol{\theta})} \dfrac{\partial Z(\boldsymbol{\theta})}{\partial \boldsymbol{\theta}}
 = \dfrac{1}{Z(\boldsymbol{\theta})} \dfrac{\partial}{\partial \boldsymbol{\theta}} \int f(\mathbf{x}; \boldsymbol{\theta})\, d\mathbf{x}$  (derivative of the log function)
$ = \dfrac{1}{Z(\boldsymbol{\theta})} \int \dfrac{\partial f(\mathbf{x}; \boldsymbol{\theta})}{\partial \boldsymbol{\theta}}\, d\mathbf{x}$  (derivative operator moved into the integral)
$ = \dfrac{1}{Z(\boldsymbol{\theta})} \int f(\mathbf{x}; \boldsymbol{\theta}) \dfrac{\partial \log f(\mathbf{x}; \boldsymbol{\theta})}{\partial \boldsymbol{\theta}}\, d\mathbf{x}$  (derivative of the log function)
$ = \int P(\mathbf{x}; \boldsymbol{\theta}) \dfrac{\partial \log f(\mathbf{x}; \boldsymbol{\theta})}{\partial \boldsymbol{\theta}}\, d\mathbf{x}
 = E_{\mathbf{x} \sim P(\mathbf{x}; \boldsymbol{\theta})}\!\left[ \dfrac{\partial \log f(\mathbf{x}; \boldsymbol{\theta})}{\partial \boldsymbol{\theta}} \right]$  (definition of expectation)
Evaluation function and derivatives (continued)

v: visible layer {0,1}; h: hidden layer {0,1}
Evaluation function: $Q_\tau(\boldsymbol{\theta}) = E_{\mathbf{h} \sim P(\mathbf{h} \mid \mathbf{v}_n; \boldsymbol{\theta}_\tau)}\!\left[ \log P(\mathbf{v}_n, \mathbf{h}; \boldsymbol{\theta}) \right]$, with
$P(\mathbf{v}, \mathbf{h}; \boldsymbol{\theta}) = \dfrac{1}{Z(\boldsymbol{\theta})} \exp\left( -E(\mathbf{v}, \mathbf{h}; \boldsymbol{\theta}) \right)$,
$E(\mathbf{v}, \mathbf{h}; \boldsymbol{\theta}) = -\mathbf{v}^T \mathbf{W} \mathbf{h} - \mathbf{b}^T \mathbf{h} - \mathbf{c}^T \mathbf{v}$.

$\dfrac{\partial Q_\tau(\boldsymbol{\theta}_\tau)}{\partial \boldsymbol{\theta}}
 = \underbrace{E_{\mathbf{h} \sim P(\mathbf{h} \mid \mathbf{v}_n; \boldsymbol{\theta}_\tau)}\!\left[ -\dfrac{\partial}{\partial \boldsymbol{\theta}} E(\mathbf{v}_n, \mathbf{h}; \boldsymbol{\theta}) \right]}_{\text{Data term (feasible)}}
 - \underbrace{E_{\mathbf{v}, \mathbf{h} \sim P(\mathbf{v}, \mathbf{h}; \boldsymbol{\theta}_\tau)}\!\left[ -\dfrac{\partial}{\partial \boldsymbol{\theta}} E(\mathbf{v}, \mathbf{h}; \boldsymbol{\theta}) \right]}_{\text{Model term (infeasible; approximated by Gibbs sampling)}}$

With $\dfrac{\partial}{\partial \mathbf{W}} E(\mathbf{v}, \mathbf{h}; \boldsymbol{\theta}) = -\mathbf{v}^T \mathbf{h}$, the derivative with respect to W becomes
$\dfrac{\partial Q_\tau(\boldsymbol{\theta}_\tau)}{\partial \mathbf{W}}
 = E_{\mathbf{h} \sim P(\mathbf{h} \mid \mathbf{v}_n; \boldsymbol{\theta}_\tau)}\!\left[ \mathbf{v}_n^T \mathbf{h} \right]
 - E_{\mathbf{v}, \mathbf{h} \sim P(\mathbf{v}, \mathbf{h}; \boldsymbol{\theta}_\tau)}\!\left[ \mathbf{v}^T \mathbf{h} \right]$
Derivatives of the evaluation function (Data term)

$\dfrac{\partial Q_\tau(\boldsymbol{\theta}_\tau)}{\partial \mathbf{W}}
 = \underbrace{E_{\mathbf{h} \sim P(\mathbf{h} \mid \mathbf{v}_n; \boldsymbol{\theta}_\tau)}\!\left[ \mathbf{v}_n^T \mathbf{h} \right]}_{\text{Data term (feasible)}}
 - \underbrace{E_{\mathbf{v}, \mathbf{h} \sim P(\mathbf{v}, \mathbf{h}; \boldsymbol{\theta}_\tau)}\!\left[ \mathbf{v}^T \mathbf{h} \right]}_{\text{Model term (infeasible; approximated by Gibbs sampling)}}$

Inference of the RBM: $P(\mathbf{h} = \mathbf{1} \mid \mathbf{v}; \boldsymbol{\theta}) = \sigma\left( \mathbf{W}^T \mathbf{v} + \mathbf{b} \right)$

Data term:
$E_{\mathbf{h} \sim P(\mathbf{h} \mid \mathbf{v}_n; \boldsymbol{\theta}_\tau)}\!\left[ \mathbf{v}_n^T \mathbf{h} \right]
 = \mathbf{v}_n^T\, E_{\mathbf{h} \sim P(\mathbf{h} \mid \mathbf{v}_n; \boldsymbol{\theta}_\tau)}\!\left[ \mathbf{h} \right]
 = \mathbf{v}_n^T \left( \mathbf{0} \cdot P(\mathbf{h} = \mathbf{0} \mid \mathbf{v}; \boldsymbol{\theta}) + \mathbf{1} \cdot P(\mathbf{h} = \mathbf{1} \mid \mathbf{v}; \boldsymbol{\theta}) \right)
 = \mathbf{v}_n^T\, \sigma\left( \mathbf{W}^T \mathbf{v}_n + \mathbf{b} \right)$
Expectation Approximation by the Monte-Carlo Method (Model term)

$\dfrac{\partial Q_\tau(\boldsymbol{\theta}_\tau)}{\partial \mathbf{W}}
 = \underbrace{E_{\mathbf{h} \sim P(\mathbf{h} \mid \mathbf{v}_n; \boldsymbol{\theta}_\tau)}\!\left[ \mathbf{v}_n^T \mathbf{h} \right]}_{\text{Data term (feasible)}}
 - \underbrace{E_{\mathbf{v}, \mathbf{h} \sim P(\mathbf{v}, \mathbf{h}; \boldsymbol{\theta}_\tau)}\!\left[ \mathbf{v}^T \mathbf{h} \right]}_{\text{Model term (infeasible; approximated by Gibbs sampling)}}$

Monte-Carlo method with independent samples $\left( \mathbf{v}(i), \mathbf{h}(i) \right) \sim P(\mathbf{v}, \mathbf{h}; \boldsymbol{\theta}_\tau)$:
$E_{\mathbf{v}, \mathbf{h} \sim P(\mathbf{v}, \mathbf{h}; \boldsymbol{\theta}_\tau)}\!\left[ \mathbf{v}^T \mathbf{h} \right]
 \approx \dfrac{1}{N} \sum_i \mathbf{v}(i)^T \mathbf{h}(i)$

How can we get the samples? Gibbs sampling.
Gibbs sampling

Inference of the RBM:
$P(\mathbf{h} = \mathbf{1} \mid \mathbf{v}; \boldsymbol{\theta}) = \sigma\left( \mathbf{W}^T \mathbf{v} + \mathbf{b} \right), \qquad
 P(\mathbf{v} = \mathbf{1} \mid \mathbf{h}; \boldsymbol{\theta}) = \sigma\left( \mathbf{W} \mathbf{h} + \mathbf{c} \right)$

Gibbs sampling:
1. Initialize $\mathbf{v}$.
2. Sample $\mathbf{h}$ from $P(\mathbf{h} \mid \mathbf{v})$.
3. Sample $\mathbf{v}$ from $P(\mathbf{v} \mid \mathbf{h})$.
4. Iterate steps 2 and 3: $(\mathbf{v}(0), \mathbf{h}(0)) \to (\mathbf{v}(1), \mathbf{h}(1)) \to \cdots \to (\mathbf{v}(\infty), \mathbf{h}(\infty))$.

The samples are used for the Monte-Carlo approximation of the model term:
$E_{\mathbf{v}, \mathbf{h} \sim P(\mathbf{v}, \mathbf{h}; \boldsymbol{\theta}_\tau)}\!\left[ \mathbf{v}^T \mathbf{h} \right]
 \approx \dfrac{1}{N} \sum_i \mathbf{v}(i)^T \mathbf{h}(i)$
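A minimal sketch of the Gibbs sampling procedure listed above for a Bernoulli-Bernoulli RBM; binary states are actually sampled here. Parameter shapes, the step count, and the helper names are illustrative assumptions.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gibbs_chain(W, b, c, steps=1000, seed=0):
    rng = np.random.default_rng(seed)
    v = rng.integers(0, 2, size=W.shape[0]).astype(float)                   # 1. initialize v
    for _ in range(steps):
        h = (rng.random(W.shape[1]) < sigmoid(W.T @ v + b)).astype(float)   # 2. h ~ P(h|v)
        v = (rng.random(W.shape[0]) < sigmoid(W @ h + c)).astype(float)     # 3. v ~ P(v|h)
    return v, h                                                             # 4. iterate, return last sample

rng = np.random.default_rng(1)
W = rng.standard_normal((6, 4)) * 0.1
v, h = gibbs_chain(W, np.zeros(4), np.zeros(6))
print(np.outer(v, h))    # one Monte-Carlo sample of v h^T for the model term
```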
Approximated Evaluation of the Model Term

Stop after just one Gibbs step:
$\mathbf{h}(0) = \sigma\left( \mathbf{W}^T \mathbf{v}(0) + \mathbf{b} \right), \quad
 \mathbf{v}(1) = \sigma\left( \mathbf{W} \mathbf{h}(0) + \mathbf{c} \right), \quad
 \mathbf{h}(1) = \sigma\left( \mathbf{W}^T \mathbf{v}(1) + \mathbf{b} \right)$

Data term: $E_{\mathbf{h} \sim P(\mathbf{h} \mid \mathbf{v}_n; \boldsymbol{\theta}_\tau)}\!\left[ \mathbf{v}_n^T \mathbf{h} \right] \approx \mathbf{v}(0)^T \mathbf{h}(0)$
Model term: $E_{\mathbf{v}, \mathbf{h} \sim P(\mathbf{v}, \mathbf{h}; \boldsymbol{\theta}_\tau)}\!\left[ \mathbf{v}^T \mathbf{h} \right] \approx \mathbf{v}(1)^T \mathbf{h}(1)$

(Inference of the RBM: $P(\mathbf{h} = \mathbf{1} \mid \mathbf{v}; \boldsymbol{\theta}) = \sigma\left( \mathbf{W}^T \mathbf{v} + \mathbf{b} \right)$,
 $P(\mathbf{v} = \mathbf{1} \mid \mathbf{h}; \boldsymbol{\theta}) = \sigma\left( \mathbf{W} \mathbf{h} + \mathbf{c} \right)$)
Probabilities or states?

One Gibbs step:
$\mathbf{h}(0) = \sigma\left( \mathbf{W}^T \mathbf{v}(0) + \mathbf{b} \right), \quad
 \mathbf{v}(1) = \sigma\left( \mathbf{W} \mathbf{h}(0) + \mathbf{c} \right), \quad
 \mathbf{h}(1) = \sigma\left( \mathbf{W}^T \mathbf{v}(1) + \mathbf{b} \right)$

The inference of the RBM gives probabilities.
In the Gibbs sampling, should we sample binary states with these probabilities,
or should we simply use the probabilities themselves?
Hinton recommends using the probabilities.

G. Hinton, A Practical Guide to Training Restricted Boltzmann Machines, 2010.
RBM: Contrastive Divergence Learning

Iterative process of the CD learning:
$\mathbf{h}(0) = \sigma\left( \mathbf{W}^T \mathbf{v}(0) + \mathbf{b} \right)$
$\mathbf{v}(1) = \sigma\left( \mathbf{W} \mathbf{h}(0) + \mathbf{c} \right)$
$\mathbf{h}(1) = \sigma\left( \mathbf{W}^T \mathbf{v}(1) + \mathbf{b} \right)$

$\mathbf{W} \leftarrow \mathbf{W} - \varepsilon\, \Delta\mathbf{W}, \qquad
 \Delta\mathbf{W} = \dfrac{1}{N} \sum_n \left( \mathbf{v}_n(0)^{\,T} \mathbf{h}_n(0) - \mathbf{v}_n(1)^{\,T} \mathbf{h}_n(1) \right)$

The CD learning can be considered a maximum-likelihood estimation with the given training data.
Momentum and weight decay are also applied.
Pre-training for the stacked RBMs
Training data
Output data
Pre-training for
1st layer RBM
Training data
Output data
Pre-training for
2nd layer RBM
Input data
copy
copy
copy
Pre-training for the RBMs
Outline
1. Examples of Deep NNs
2. RBM to Deep NN
3. Deep Neural Network (Deep NN)
– Back-Propagation (Supervised Learning)
4. Restricted Boltzmann Machine (RBM)
– Mathematics, Probabilistic Model and Inference Model
– Pre-training by Contrastive Divergence Learning
(Unsupervised Learning)
5. Inference Model with Distribution
49
http://bit.ly/dnnicpr2014
Drop-out

Drop-out: nodes are randomly dropped out for each mini-batch; the output of a
dropped node is zero. A 50% drop-out rate is recommended.

The drop-out is expected to behave similarly to ensemble learning.
It is effective for avoiding overfitting.

G. Hinton, N. Srivastava, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov,
"Improving neural networks by preventing co-adaptation of feature detectors,"
arXiv preprint arXiv:1207.0580, 2012.
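A minimal sketch of a drop-out forward pass: for each mini-batch a random binary mask zeroes out roughly 50% of the input nodes. The helper names and shapes are illustrative assumptions.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def dropout_forward(v, W, b, rate=0.5, rng=None):
    """Single-layer forward pass with a Bernoulli drop-out mask on the inputs."""
    rng = rng or np.random.default_rng()
    mask = (rng.random(v.shape) >= rate).astype(float)   # xi_i in {0, 1}
    return sigmoid(W.T @ (mask * v) + b), mask

rng = np.random.default_rng(0)
v = rng.random(3)
W, b = rng.standard_normal((3, 4)) * 0.1, np.zeros(4)
h, mask = dropout_forward(v, W, b, rng=rng)
print(mask, h)
```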
Ensemble learning and Drop-out

Ensemble learning: an integration of multiple weak learners outperforms a single learner,
$\mathbf{h}(\mathbf{v}) = \dfrac{1}{K} \left( \mathbf{h}_1(\mathbf{v}) + \mathbf{h}_2(\mathbf{v}) + \mathbf{h}_3(\mathbf{v}) + \cdots + \mathbf{h}_K(\mathbf{v}) \right)$.

The drop-out is expected to have an effect similar to ensemble learning.
Fast drop-out learning

S.I. Wang and C.D. Manning, Fast dropout training, ICML 2013.

Inference of the standard NN:
$n = \sum_i w_i v_i + b, \qquad h = \sigma(n)$

Inference of the drop-out NN:
$h = E\!\left[ \sigma(\chi) \right], \qquad \chi = \sum_i \xi_i w_i v_i + b$,
where $\chi$ and $\xi_i$ are stochastic variables with
$\xi_i \in \{0, 1\}, \quad P(\xi_i = 0) = 0.5, \quad P(\xi_i = 1) = 0.5$,
and $P(\chi)$ is the induced distribution over the weighted sum.
Fast drop-out learning (continued)

S.I. Wang and C.D. Manning, Fast dropout training, ICML 2013.

Approximate $P(\chi)$ by a Gaussian $N(\chi; \mu, s^2)$ with mean $\mu$ and variance $s^2$:
$h = E\!\left[ \sigma(\chi) \right]
 \approx \int_{-\infty}^{\infty} N(\chi; \mu, s^2)\, \sigma(\chi)\, d\chi
 \approx \sigma\!\left( \dfrac{\mu}{\sqrt{1 + \pi s^2 / 8}} \right)$

This closed-form approximation allows fast calculation without sampling.

(Drop-out inference as before: $h = E[\sigma(\chi)]$, $\chi = \sum_i \xi_i w_i v_i + b$ with
 $\xi_i \in \{0,1\}$, $P(\xi_i = 0) = P(\xi_i = 1) = 0.5$.)
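A minimal sketch of the closed-form approximation above: propagate the mean and variance of $\chi = \sum_i \xi_i w_i v_i + b$ under a 50% drop-out rate and squash with $\sigma(\mu / \sqrt{1 + \pi s^2/8})$. The moment formulas for the Bernoulli mask and the square root in the denominator are assumptions consistent with the Gaussian approximation in the Wang & Manning paper, not copied from the slide.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def fast_dropout_unit(v, w, b, p_keep=0.5):
    """Closed-form drop-out inference h ~ sigma(mu / sqrt(1 + pi*s^2/8))."""
    a = w * v
    mu = p_keep * np.sum(a) + b                      # E[chi]
    s2 = p_keep * (1.0 - p_keep) * np.sum(a ** 2)    # Var[chi] from independent xi_i
    return sigmoid(mu / np.sqrt(1.0 + np.pi * s2 / 8.0))

rng = np.random.default_rng(0)
v, w = rng.random(3), rng.standard_normal(3)
print(fast_dropout_unit(v, w, b=0.0))

# Monte-Carlo check of h = E[sigma(chi)] by actually sampling the mask:
xi = rng.integers(0, 2, size=(100000, 3))
print(sigmoid((xi * (w * v)).sum(axis=1)).mean())
```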
Inference based on the RBM model

M. Tanaka and M. Okutomi, A Novel Inference of a Restricted Boltzmann Machine, ICPR 2014.

Inference of the standard NN:
$n = \sum_i w_i v_i + b, \qquad h = \sigma(n)$

Inference based on the RBM model:
the inputs $h_i$ are themselves stochastic variables
(RBM inference: $P(\mathbf{h} = \mathbf{1} \mid \mathbf{v}; \boldsymbol{\theta}) = \sigma\left( \mathbf{W}^T \mathbf{v} + \mathbf{b} \right)$),
so the weighted sum $\chi = \sum_i w_i h_i + b$ is a stochastic variable, and the output is
$H = E\!\left[ \sigma(\chi) \right]$.

Approximate $P(\chi)$ by a Gaussian $N(\chi; \mu, s^2)$ with mean $\mu$ and variance $s^2$:
$H = E\!\left[ \sigma(\chi) \right]
 \approx \int_{-\infty}^{\infty} N(\chi; \mu, s^2)\, \sigma(\chi)\, d\chi
 \approx \sigma\!\left( \dfrac{\mu}{\sqrt{1 + \pi s^2 / 8}} \right)$
Inference based on the RBM model: results

M. Tanaka and M. Okutomi, A Novel Inference of a Restricted Boltzmann Machine, ICPR 2014.
The RBM-based inference improves the performance.
(Results figures)
Outline
1. Examples of Deep NNs
2. RBM to Deep NN
3. Deep Neural Network (Deep NN)
– Back-Propagation (Supervised Learning)
4. Restricted Boltzmann Machine (RBM)
– Mathematics, Probabilistic Model and Inference Model
– Pre-training by Contrastive Divergence Learning
(Unsupervised Learning)
5. Inference Model with Distribution
57
http://bit.ly/dnnicpr2014