Neural Networks II
Sang Jun Lee
Ph.D. candidate, POSTECH
Email: lsj4u0208@postech.ac.kr
EECE695J Advanced Topics in Electrical Engineering J (Fundamentals of Deep Learning and Applications to Steel Processes) – LECTURE 5 (2017. 9. 28)
2
▣ Lecture 4: Neural Network I
1-page Review
Perceptron → Multilayer perceptron (MLP): input layer, hidden layer, output layer
Backpropagation: parameter gradients are computed as products of local gradients
Vanishing gradient: in a deep neural network, these products drive the parameter gradients toward 0 (parameter gradient ≅ 0)
XOR example
3
Vanishing Gradient
The hidden layer is set to 20 neurons.
The number of nodes in the output layer is determined by the number of classes to be classified.
2-layer network
4
Vanishing Gradient
2-layer network
(1 hidden layer + 1 output layer)
Observe that the loss decreases as training progresses.
6-layer network
5
Vanishing Gradient
6-layer network
6
Vanishing Gradient
!? (with the 6-layer network, training stalls: the loss barely decreases due to vanishing gradients)
Training of a Neural Network
Activation functions
Data preprocessing
Regularization
Tips for training a neural network
7
Contents
Sigmoid function
- Saturated neurons “kill” the gradient (for inputs x with very small or very large values, the gradient ≅ 0)
- Sigmoid outputs are always positive
8
Activation Functions
d/dx σ(x) = σ(x) · (1 − σ(x)) ≤ 1/4
• As the local gradients of each layer are multiplied together, the gradient with respect to the parameters shrinks
• The input data ends up having almost no effect on learning
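As a rough numerical illustration of this effect (my own sketch, not from the slides; the 20-unit width and 6 layers simply mirror the toy experiment above, and the random weight scale is arbitrary):

    import numpy as np

    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    def sigmoid_grad(x):
        s = sigmoid(x)
        return s * (1.0 - s)                  # never larger than 1/4

    np.random.seed(0)
    x = np.random.randn(20)
    grad_scale = 1.0
    for layer in range(6):                    # stack 6 sigmoid layers
        pre_act = np.random.randn(20, 20).dot(x)
        grad_scale *= np.mean(sigmoid_grad(pre_act))
        x = sigmoid(pre_act)
    print(grad_scale)                         # shrinks toward 0 as layers are stacked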
Sigmoid function
Sigmoid outputs are always positive
9
Activation Functions
[Figure: a single neuron with inputs x₀, x₁, …, x_d, sigmoid output σ(wᵀx), target y, and squared-error loss L(x, y)]
L(x, y) = (y − σ(wᵀx))²
∇_w L(x, y) = −2 (y − σ(wᵀx)) · σ(wᵀx) (1 − σ(wᵀx)) · x
If all inputs xᵢ are positive, every component of the parameter gradient (vector) has the same sign (all + or all −).
Therefore, zero-centered data with a proper mix of +/− signs is preferable.
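A quick numerical check of this claim (my own sketch; the dimensions and random values are arbitrary): with an all-positive input x, every component of ∇_w L shares the same sign.

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    np.random.seed(1)
    x = np.abs(np.random.randn(5))                  # all-positive input (e.g. raw pixel values)
    w = np.random.randn(5)
    y = 1.0
    s = sigmoid(w.dot(x))
    grad_w = -2.0 * (y - s) * s * (1.0 - s) * x     # gradient of (y - sigmoid(w^T x))^2
    print(np.sign(grad_w))                          # all components have the same sign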
tanh
- Maps inputs to the range [−1, 1]
- Zero-centered
- Saturated neurons kill the gradients
10
Activation Functions
ReLU
- Computationally efficient
- Does not saturate (in the + region)
- Always non-negative output
- A dead neuron (output always 0) will never activate and is never updated
(slightly positive biases are commonly used)
11
Activation Functions
[Figure: a neuron with ReLU activation applied to inputs x₀, x₁, …, x_d]
Leaky ReLU
- f(x) = max(αx, x)
- Depending on the sign of x, a local gradient of either 1 or α is passed back during backpropagation
Image classification performance by activation function on CIFAR-10
(* VLReLU: Very Leaky ReLU, Mishkin et al. 2015)
12
Activation Functions
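A minimal sketch of these activation functions and their local gradients (my own code; α = 0.01 is just a typical choice for the leaky slope, not a value from the slides):

    import numpy as np

    def relu(x):
        return np.maximum(0.0, x)

    def relu_grad(x):
        return (x > 0).astype(float)            # 0 in the negative region ("dead" neurons)

    def leaky_relu(x, alpha=0.01):
        return np.maximum(alpha * x, x)         # f(x) = max(alpha*x, x)

    def leaky_relu_grad(x, alpha=0.01):
        return np.where(x > 0, 1.0, alpha)      # local gradient is 1 or alpha

    x = np.linspace(-3.0, 3.0, 7)
    print(relu(x), relu_grad(x))
    print(leaky_relu(x), leaky_relu_grad(x))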
Mean subtraction
- If the data are all positive, the components of the parameter gradient (vector) all share the same sign (all + or all −)
- Zero-centered data:
X̂ = X − μ_X
- Caveat: compute μ_X from the training data only, and reuse the same μ_X when preprocessing the validation and test data
13
Data Preprocessing
Normalization
X̂ = (X − μ_X) / σ_X
X̂ = 2(X − X_min) / (X_max − X_min) − 1 ∈ [−1, +1]
- Note: for image data, zero-centering is typically the only preprocessing applied
14
Data Preprocessing
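A small sketch of the preprocessing above (my own example with random placeholder data): the statistics come from the training set only and are reused for the test set.

    import numpy as np

    np.random.seed(0)
    X_train = np.random.rand(1000, 32 * 32 * 3) * 255.0   # placeholder "image" data
    X_test  = np.random.rand(200, 32 * 32 * 3) * 255.0

    mu    = X_train.mean(axis=0)                # computed on training data only
    sigma = X_train.std(axis=0) + 1e-8

    X_train_zc = X_train - mu                   # zero-centering (typical for images)
    X_test_zc  = X_test - mu                    # reuse the training-set mean

    X_train_norm = (X_train - mu) / sigma       # full normalization
    X_test_norm  = (X_test - mu) / sigma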
RBM (Restricted Boltzmann Machine)
A bipartite graph with no connections within a layer
15
Weight Initialization
DBN (Deep Belief Network)
Unsupervised learning on two adjacent layers as a pre-training step (weight initialization)
16
Weight Initialization
(Slides 17-21 repeat the same caption with step-by-step figures of the layer-wise pre-training.)
DBN (Deep Belief Network)
Minimize the KL divergence between the input and the reconstructed input
22
Weight Initialization
DBN (Deep Belief Network)
Pre-training
23
Weight Initialization
(Slides 24-25 continue the same pre-training illustration.)
DBN (Deep Belief Network)
Fine tuning
26
Weight Initialization
No need to use a complicated RBM for weight initialization
Simple methods for weight initialization
Make sure the weights are ‘just right’ (not too small & not too big)
- Small random numbers (e.g. Gaussian with zero mean and 10⁻² standard deviation): W ~ N(0, σ²)
- Xavier initialization (X. Glorot and Y. Bengio, “Understanding the difficulty of training deep feedforward neural networks,” in International Conference on Artificial Intelligence and Statistics, 2010): W ~ N(0, σ²) / √n
- He initialization (K. He, X. Zhang, S. Ren, and J. Sun, “Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification,” 2015): W ~ N(0, σ²) / √(n/2), i.e. variance 2/n for unit σ
27
Weight Initialization
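The three initialization schemes above as a short NumPy sketch (my own code; the layer sizes are hypothetical, and n is taken to be the fan-in of the layer):

    import numpy as np

    n_in, n_out = 512, 256                      # hypothetical layer sizes

    # Small random numbers: N(0, 0.01^2); fine for shallow nets,
    # but activations shrink toward zero in deep networks.
    W_small = 0.01 * np.random.randn(n_in, n_out)

    # Xavier initialization: scale a unit Gaussian by 1/sqrt(n_in)  ->  Var(w) = 1/n_in
    W_xavier = np.random.randn(n_in, n_out) / np.sqrt(n_in)

    # He initialization (for ReLU): Var(w) = 2/n_in
    W_he = np.random.randn(n_in, n_out) * np.sqrt(2.0 / n_in)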
Xavier initialization: W ~ N(0, σ²) / √n
- s = Σᵢ wᵢxᵢ (assume wᵢ and xᵢ are zero-mean, i.i.d. random variables)
- Var(s) = Var(Σᵢ wᵢxᵢ) = Σᵢ Var(wᵢxᵢ) = Σᵢ Var(wᵢ) · Var(xᵢ) = n · Var(w) · Var(x)
- To keep Var(s) equal to Var(x), choose Var(w) = 1/n, i.e. scale the weights by 1/√n
28
Weight Initialization
[Figure: a neuron computing s = Σᵢ wᵢxᵢ from inputs x₀, x₁, …, x_d]
29
Optimization
[Figure: a neuron with ReLU activation and inputs x₀, x₁, …, x_d]
Stochastic gradient descent (SGD)
What if the loss changes quickly in one direction and slowly in another?
Very slow progress along the shallow dimension, jitter along the steep direction
A local minimum or saddle point → zero gradient
30
Optimization
SGD with momentum
Build up “velocity” as a running mean of gradients
𝜌𝜌 gives “friction” (typically 𝜌𝜌 = 0.9 or 0.99)
31
Optimization
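A minimal update-rule sketch for SGD with momentum (my own code; the learning rate is an arbitrary example value):

    import numpy as np

    def sgd_momentum_step(w, v, grad, lr=1e-2, rho=0.9):
        """SGD with momentum: v is a running mean ("velocity") of the gradients."""
        v = rho * v + grad          # rho acts as "friction" (typically 0.9 or 0.99)
        w = w - lr * v
        return w, v

    w = np.zeros(10)
    v = np.zeros_like(w)
    grad = np.random.randn(10)      # gradient of the loss from a minibatch (placeholder)
    w, v = sgd_momentum_step(w, v, grad)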
AdaGrad
Adaptive gradient algorithm: a modified stochastic gradient descent with per-parameter learning rate
Element-wise scaling of the gradient based on historical sum of squares in each dimension
32
Optimization
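A corresponding sketch for AdaGrad (my own code): the squared gradients are accumulated, giving each parameter its own effective learning rate.

    import numpy as np

    def adagrad_step(w, cache, grad, lr=1e-2, eps=1e-8):
        """AdaGrad: per-parameter scaling by the historical sum of squared gradients."""
        cache = cache + grad ** 2
        w = w - lr * grad / (np.sqrt(cache) + eps)
        return w, cache

    w, cache = np.zeros(10), np.zeros(10)
    w, cache = adagrad_step(w, cache, np.random.randn(10))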
RMSProp
Root Mean Square Propagation
AdaGrad + running average
33
Optimization
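RMSProp changes only the cache update: a decayed running average instead of an unbounded sum (a sketch; the decay rate 0.9 is a common default, not a value from the slide).

    import numpy as np

    def rmsprop_step(w, cache, grad, lr=1e-3, decay=0.9, eps=1e-8):
        """RMSProp: AdaGrad with an exponentially decayed average of squared gradients."""
        cache = decay * cache + (1.0 - decay) * grad ** 2
        w = w - lr * grad / (np.sqrt(cache) + eps)
        return w, cache

    w, cache = np.zeros(10), np.zeros(10)
    w, cache = rmsprop_step(w, cache, np.random.randn(10))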
Adam
Adaptive Moment Estimation
Typically, β₁ = 0.9, β₂ = 0.999, and η = 10⁻³ are used
34
Optimization
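A sketch of the Adam update with the default values above (my own code; bias correction included, with t the step count starting at 1):

    import numpy as np

    def adam_step(w, m, v, grad, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
        """Adam: first/second moment estimates with bias correction."""
        m = beta1 * m + (1.0 - beta1) * grad
        v = beta2 * v + (1.0 - beta2) * grad ** 2
        m_hat = m / (1.0 - beta1 ** t)
        v_hat = v / (1.0 - beta2 ** t)
        w = w - lr * m_hat / (np.sqrt(v_hat) + eps)
        return w, m, v

    w, m, v = np.zeros(10), np.zeros(10), np.zeros(10)
    w, m, v = adam_step(w, m, v, np.random.randn(10), t=1)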
In TensorFlow...
35
Optimization
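The slide presumably shows TensorFlow code; as a rough sketch, the 2017-era TF 1.x tf.train API exposes these optimizers as follows (the toy linear model here is only a placeholder for whatever loss the lecture defines):

    import tensorflow as tf                     # TF 1.x-style API

    x = tf.placeholder(tf.float32, [None, 10])
    y = tf.placeholder(tf.float32, [None, 1])
    w = tf.Variable(tf.random_normal([10, 1], stddev=0.01))
    loss = tf.reduce_mean(tf.square(y - tf.matmul(x, w)))

    train_op = tf.train.AdamOptimizer(learning_rate=1e-3, beta1=0.9, beta2=0.999).minimize(loss)
    # Alternatives: tf.train.GradientDescentOptimizer, tf.train.MomentumOptimizer,
    #               tf.train.AdagradOptimizer, tf.train.RMSPropOptimizer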
The problem of overfitting
Basic idea:
- Add randomness during training
- Marginalize over the noise at test time
36
Regularization
[Figure: training accuracy vs. test accuracy curves; the growing gap indicates overfitting]
Model ensemble
- Train multiple independent models
- Average their results at test time
37
Regularization
Reference: http://www.slideshare.net/sasasiapacific/ipb-improving-the-models-predictive-power-with-ensemble-approaches
Dropout
- In the training step, randomly set some neurons to zero (hyper-parameter: drop probability)
- A kind of model ensemble
38
Regularization
Dropout (test time)
Consider a single neuron
In a standard neural network: a = w₁x + w₂y
We want the expectation over the dropout mask z: f(x) = E_z[f(x, z)] = ∫ p(z) f(x, z) dz
Without dropout we have: E[a] = w₁x + w₂y
Applying dropout with a drop probability of 0.5:
E[a] = ¼(w₁x + w₂y) + ¼(w₁x + 0·y) + ¼(0·x + w₂y) + ¼(0·x + 0·y) = ½(w₁x + w₂y)
At test time, multiply the activations by the keep probability (1 − drop probability; here 0.5)
39
Regularization
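A minimal NumPy sketch of the two phases described above (my own code; the activations are random placeholders):

    import numpy as np

    p_drop = 0.5                                # hyper-parameter: drop probability
    h = np.random.randn(4, 100)                 # activations of some hidden layer (placeholder)

    # Training: randomly zero out neurons.
    mask = np.random.rand(*h.shape) >= p_drop
    h_train = h * mask

    # Test: keep every neuron but scale by the keep probability (1 - p_drop),
    # so the expected activation matches training, as derived above.
    h_test = h * (1.0 - p_drop)

    # In practice, "inverted dropout" divides by (1 - p_drop) at training time instead,
    # leaving the test-time forward pass unchanged.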
DropConnect
In the training step, randomly set some weights (connections) to zero
40
Regularization
DropConnect
In the training step, randomly set some weights (connections) to zero
41
Regularization
Stochastic Depth
In the training step, randomly drop entire layers
42
Regularization
Data augmentation
Crops / scales
- Original image: 256x480
- Sample random 224x224 patches
Randomize contrast and brightness
43
Regularization
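A sketch of this augmentation in NumPy (my own code; the jitter ranges are arbitrary example values, and only the 256x480 input and 224x224 crop sizes follow the slide):

    import numpy as np

    def augment(img, crop_h=224, crop_w=224):
        """Random crop plus brightness/contrast jitter."""
        H, W, _ = img.shape
        top  = np.random.randint(0, H - crop_h + 1)
        left = np.random.randint(0, W - crop_w + 1)
        patch = img[top:top + crop_h, left:left + crop_w, :]

        contrast   = np.random.uniform(0.8, 1.2)
        brightness = np.random.uniform(-20.0, 20.0)
        return np.clip(patch * contrast + brightness, 0.0, 255.0)

    img = np.random.randint(0, 256, size=(256, 480, 3)).astype(np.float32)
    out = augment(img)                          # 224x224x3 training patch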
44
ReLU, Xavier, Dropout
45
ReLU, Xavier, Dropout
46
ReLU, Xavier, Dropout
47
ReLU, Xavier, Dropout
Learning rate
48
Practical Tips for training a Neural Network
Transfer learning
49
Practical Tips for training a Neural Network
Weight initialization
- ReLU
- Leaky ReLU
Optimization
- Adam optimizer
- ...
Regularization
- Dropout or batch normalization is generally sufficient
50
Practical Tips for training a Neural Network
Activation functions
Sigmoid, tanh, ReLU, Leaky ReLU
Data preprocessing
Mean subtraction, normalization
Regularization
Model ensemble, dropout, data augmentation, ...
Tips for training a neural network
Learning rate, transfer learning
51
Summary
Computer Vision
Understanding image data
Convolutional Neural Network
Why CNNs are effective for images
52
Preview (Lecture 6)