Neural Networks II
Sang Jun Lee
Ph.D. candidate, POSTECH
Email: lsj4u0208@postech.ac.kr
EECE695J Advanced Topics in Electrical Engineering J (Fundamentals of Deep Learning and Applications to Steel Processes) – LECTURE 5 (2017. 9. 28)
2
▣ Lecture 4: Neural Network I
1-page Review
Perceptron → Multilayer perceptron (MLP): input layer, hidden layer, output layer
Backpropagation: parameter gradients are computed as products of local gradients
Vanishing gradient: in a deep neural network, these products drive the parameter gradients toward 0 (parameter gradient ≅ 0)
XOR example
3
Vanishing Gradient
The hidden layer is set to 20 neurons.
The number of nodes in the output layer is determined by the number of classes to be classified.
2-layer network
4
Vanishing Gradient
2-layer network
(1 hidden layer + 1 output layer)
Observe that the loss decreases as training progresses.
6-layer network
5
Vanishing Gradient
6-layer network
6
Vanishing Gradient
!? (with the 6-layer network, training stalls: the loss barely decreases due to vanishing gradients)
Training of a Neural Network
Activation functions
Data preprocessing
Regularization
Tips for training a neural network
7
Contents
Sigmoid function
- Saturated neurons “kill” the gradient (for inputs x with very small or very large values, the gradient ≅ 0)
- Sigmoid outputs are always positive
8
Activation Functions
d/dx σ(x) = σ(x) · (1 − σ(x)) ≤ 1/4
• As the local gradients of each layer are multiplied together, the gradient with respect to the parameters shrinks
• The input data ends up having almost no effect on learning
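As a rough numerical illustration of this effect (my own sketch, not from the slides; the 20-unit width and 6 layers simply mirror the toy experiment above, and the random weight scale is arbitrary):

    import numpy as np

    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    def sigmoid_grad(x):
        s = sigmoid(x)
        return s * (1.0 - s)                  # never larger than 1/4

    np.random.seed(0)
    x = np.random.randn(20)
    grad_scale = 1.0
    for layer in range(6):                    # stack 6 sigmoid layers
        pre_act = np.random.randn(20, 20).dot(x)
        grad_scale *= np.mean(sigmoid_grad(pre_act))
        x = sigmoid(pre_act)
    print(grad_scale)                         # shrinks toward 0 as layers are stacked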
Sigmoid function
Sigmoid outputs are always positive
9
Activation Functions
[Figure: a single neuron with inputs x₀, x₁, …, x_d, sigmoid output σ(wᵀx), target y, and squared-error loss L(x, y)]
L(x, y) = (y − σ(wᵀx))²
∇_w L(x, y) = −2 (y − σ(wᵀx)) · σ(wᵀx) (1 − σ(wᵀx)) · x
If all inputs xᵢ are positive, every component of the parameter gradient (vector) has the same sign (all + or all −).
Therefore, zero-centered data with a proper mix of +/− signs is preferable.
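A quick numerical check of this claim (my own sketch; the dimensions and random values are arbitrary): with an all-positive input x, every component of ∇_w L shares the same sign.

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    np.random.seed(1)
    x = np.abs(np.random.randn(5))                  # all-positive input (e.g. raw pixel values)
    w = np.random.randn(5)
    y = 1.0
    s = sigmoid(w.dot(x))
    grad_w = -2.0 * (y - s) * s * (1.0 - s) * x     # gradient of (y - sigmoid(w^T x))^2
    print(np.sign(grad_w))                          # all components have the same sign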
tanh
- Maps inputs to the range [−1, 1]
- Zero-centered
- Saturated neurons kill the gradients
10
Activation Functions
ReLU
- Computationally efficient
- Does not saturate (in the + region)
- Always non-negative output
- A dead neuron (output always 0) will never activate and is never updated
(slightly positive biases are commonly used)
11
Activation Functions
[Figure: a neuron with ReLU activation applied to inputs x₀, x₁, …, x_d]
Leaky ReLU
- f(x) = max(αx, x)
- Depending on the sign of x, a local gradient of either 1 or α is passed back during backpropagation
Image classification performance by activation function on CIFAR-10
(* VLReLU: Very Leaky ReLU, Mishkin et al. 2015)
12
Activation Functions
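A minimal sketch of these activation functions and their local gradients (my own code; α = 0.01 is just a typical choice for the leaky slope, not a value from the slides):

    import numpy as np

    def relu(x):
        return np.maximum(0.0, x)

    def relu_grad(x):
        return (x > 0).astype(float)            # 0 in the negative region ("dead" neurons)

    def leaky_relu(x, alpha=0.01):
        return np.maximum(alpha * x, x)         # f(x) = max(alpha*x, x)

    def leaky_relu_grad(x, alpha=0.01):
        return np.where(x > 0, 1.0, alpha)      # local gradient is 1 or alpha

    x = np.linspace(-3.0, 3.0, 7)
    print(relu(x), relu_grad(x))
    print(leaky_relu(x), leaky_relu_grad(x))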
Mean subtraction
- If the data are all positive, the components of the parameter gradient (vector) all share the same sign (all + or all −)
- Zero-centered data:
X̂ = X − μ_X
- Caveat: compute μ_X from the training data only, and reuse the same μ_X when preprocessing the validation and test data
13
Data Preprocessing
Normalization
X̂ = (X − μ_X) / σ_X
X̂ = 2(X − X_min) / (X_max − X_min) − 1 ∈ [−1, +1]
- Note: for image data, zero-centering is typically the only preprocessing applied
14
Data Preprocessing
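A small sketch of the preprocessing above (my own example with random placeholder data): the statistics come from the training set only and are reused for the test set.

    import numpy as np

    np.random.seed(0)
    X_train = np.random.rand(1000, 32 * 32 * 3) * 255.0   # placeholder "image" data
    X_test  = np.random.rand(200, 32 * 32 * 3) * 255.0

    mu    = X_train.mean(axis=0)                # computed on training data only
    sigma = X_train.std(axis=0) + 1e-8

    X_train_zc = X_train - mu                   # zero-centering (typical for images)
    X_test_zc  = X_test - mu                    # reuse the training-set mean

    X_train_norm = (X_train - mu) / sigma       # full normalization
    X_test_norm  = (X_test - mu) / sigma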
RBM (Restricted Boltzmann Machine)
A bipartite graph with no connections within a layer
15
Weight Initialization
DBN (Deep Belief Network)
Unsupervised learning on two adjacent layers as a pre-training step (weight initialization)
16
Weight Initialization
(Slides 17-21 repeat the same caption with step-by-step figures of the layer-wise pre-training.)
DBN (Deep Belief Network)
Minimize the KL divergence between the input and the reconstructed input
22
Weight Initialization
DBN (Deep Belief Network)
Pre-training
23
Weight Initialization
(Slides 24-25 continue the same pre-training illustration.)
DBN (Deep Belief Network)
Fine tuning
26
Weight Initialization
No need to use a complicated RBM for weight initialization
Simple methods for weight initialization
Make sure the weights are ‘just right’ (not too small & not too big)
- Small random numbers (e.g. Gaussian with zero mean and 10⁻² standard deviation): W ~ N(0, σ²)
- Xavier initialization (X. Glorot and Y. Bengio, “Understanding the difficulty of training deep feedforward neural networks,” in International Conference on Artificial Intelligence and Statistics, 2010): W ~ N(0, σ²) / √n
- He initialization (K. He, X. Zhang, S. Ren, and J. Sun, “Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification,” 2015): W ~ N(0, σ²) / √(n/2), i.e. variance 2/n for unit σ
27
Weight Initialization
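The three initialization schemes above as a short NumPy sketch (my own code; the layer sizes are hypothetical, and n is taken to be the fan-in of the layer):

    import numpy as np

    n_in, n_out = 512, 256                      # hypothetical layer sizes

    # Small random numbers: N(0, 0.01^2); fine for shallow nets,
    # but activations shrink toward zero in deep networks.
    W_small = 0.01 * np.random.randn(n_in, n_out)

    # Xavier initialization: scale a unit Gaussian by 1/sqrt(n_in)  ->  Var(w) = 1/n_in
    W_xavier = np.random.randn(n_in, n_out) / np.sqrt(n_in)

    # He initialization (for ReLU): Var(w) = 2/n_in
    W_he = np.random.randn(n_in, n_out) * np.sqrt(2.0 / n_in)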
Xavier initialization: W ~ N(0, σ²) / √n
- s = Σᵢ wᵢxᵢ (assume wᵢ and xᵢ are zero-mean, i.i.d. random variables)
- Var(s) = Var(Σᵢ wᵢxᵢ) = Σᵢ Var(wᵢxᵢ) = Σᵢ Var(wᵢ) · Var(xᵢ) = n · Var(w) · Var(x)
- To keep Var(s) equal to Var(x), choose Var(w) = 1/n, i.e. scale the weights by 1/√n
28
Weight Initialization
[Figure: a neuron computing s = Σᵢ wᵢxᵢ from inputs x₀, x₁, …, x_d]
29
Optimization
[Figure: a neuron with ReLU activation and inputs x₀, x₁, …, x_d]
Stochastic gradient descent (SGD)
What if the loss changes quickly in one direction and slowly in another?
Very slow progress along the shallow dimension, jitter along the steep direction
A local minimum or saddle point → zero gradient
30
Optimization
SGD with momentum
Build up “velocity” as a running mean of gradients
𝜌𝜌 gives “friction” (typically 𝜌𝜌 = 0.9 or 0.99)
31
Optimization
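A minimal update-rule sketch for SGD with momentum (my own code; the learning rate is an arbitrary example value):

    import numpy as np

    def sgd_momentum_step(w, v, grad, lr=1e-2, rho=0.9):
        """SGD with momentum: v is a running mean ("velocity") of the gradients."""
        v = rho * v + grad          # rho acts as "friction" (typically 0.9 or 0.99)
        w = w - lr * v
        return w, v

    w = np.zeros(10)
    v = np.zeros_like(w)
    grad = np.random.randn(10)      # gradient of the loss from a minibatch (placeholder)
    w, v = sgd_momentum_step(w, v, grad)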
AdaGrad
Adaptive gradient algorithm: a modified stochastic gradient descent with per-parameter learning rate
Element-wise scaling of the gradient based on historical sum of squares in each dimension
32
Optimization
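A corresponding sketch for AdaGrad (my own code): the squared gradients are accumulated, giving each parameter its own effective learning rate.

    import numpy as np

    def adagrad_step(w, cache, grad, lr=1e-2, eps=1e-8):
        """AdaGrad: per-parameter scaling by the historical sum of squared gradients."""
        cache = cache + grad ** 2
        w = w - lr * grad / (np.sqrt(cache) + eps)
        return w, cache

    w, cache = np.zeros(10), np.zeros(10)
    w, cache = adagrad_step(w, cache, np.random.randn(10))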
RMSProp
Root Mean Square Propagation
AdaGrad + running average
33
Optimization
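RMSProp changes only the cache update: a decayed running average instead of an unbounded sum (a sketch; the decay rate 0.9 is a common default, not a value from the slide).

    import numpy as np

    def rmsprop_step(w, cache, grad, lr=1e-3, decay=0.9, eps=1e-8):
        """RMSProp: AdaGrad with an exponentially decayed average of squared gradients."""
        cache = decay * cache + (1.0 - decay) * grad ** 2
        w = w - lr * grad / (np.sqrt(cache) + eps)
        return w, cache

    w, cache = np.zeros(10), np.zeros(10)
    w, cache = rmsprop_step(w, cache, np.random.randn(10))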
Adam
Adaptive Moment Estimation
Typically, β₁ = 0.9, β₂ = 0.999, and η = 10⁻³ are used
34
Optimization
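A sketch of the Adam update with the default values above (my own code; bias correction included, with t the step count starting at 1):

    import numpy as np

    def adam_step(w, m, v, grad, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
        """Adam: first/second moment estimates with bias correction."""
        m = beta1 * m + (1.0 - beta1) * grad
        v = beta2 * v + (1.0 - beta2) * grad ** 2
        m_hat = m / (1.0 - beta1 ** t)
        v_hat = v / (1.0 - beta2 ** t)
        w = w - lr * m_hat / (np.sqrt(v_hat) + eps)
        return w, m, v

    w, m, v = np.zeros(10), np.zeros(10), np.zeros(10)
    w, m, v = adam_step(w, m, v, np.random.randn(10), t=1)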
In TensorFlow...
35
Optimization
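The slide presumably shows TensorFlow code; as a rough sketch, the 2017-era TF 1.x tf.train API exposes these optimizers as follows (the toy linear model here is only a placeholder for whatever loss the lecture defines):

    import tensorflow as tf                     # TF 1.x-style API

    x = tf.placeholder(tf.float32, [None, 10])
    y = tf.placeholder(tf.float32, [None, 1])
    w = tf.Variable(tf.random_normal([10, 1], stddev=0.01))
    loss = tf.reduce_mean(tf.square(y - tf.matmul(x, w)))

    train_op = tf.train.AdamOptimizer(learning_rate=1e-3, beta1=0.9, beta2=0.999).minimize(loss)
    # Alternatives: tf.train.GradientDescentOptimizer, tf.train.MomentumOptimizer,
    #               tf.train.AdagradOptimizer, tf.train.RMSPropOptimizer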
The problem of overfitting
Basic idea:
- Add randomness during training
- Marginalize over the noise at test time
36
Regularization
[Figure: training accuracy vs. test accuracy curves; the growing gap indicates overfitting]
Model ensemble
- Train multiple independent models
- Average their results at test time
37
Regularization
Reference: http://www.slideshare.net/sasasiapacific/ipb-improving-the-models-predictive-power-with-ensemble-approaches
Dropout
- In the training step, randomly set some neurons to zero (hyper-parameter: drop probability)
- A kind of model ensemble
38
Regularization
Dropout (test time)
Consider a single neuron
In a standard neural network: a = w₁x + w₂y
We want the expectation over the dropout mask z: f(x) = E_z[f(x, z)] = ∫ p(z) f(x, z) dz
Without dropout we have: E[a] = w₁x + w₂y
Applying dropout with a drop probability of 0.5:
E[a] = ¼(w₁x + w₂y) + ¼(w₁x + 0·y) + ¼(0·x + w₂y) + ¼(0·x + 0·y) = ½(w₁x + w₂y)
At test time, multiply the activations by the keep probability (1 − drop probability; here 0.5)
39
Regularization
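A minimal NumPy sketch of the two phases described above (my own code; the activations are random placeholders):

    import numpy as np

    p_drop = 0.5                                # hyper-parameter: drop probability
    h = np.random.randn(4, 100)                 # activations of some hidden layer (placeholder)

    # Training: randomly zero out neurons.
    mask = np.random.rand(*h.shape) >= p_drop
    h_train = h * mask

    # Test: keep every neuron but scale by the keep probability (1 - p_drop),
    # so the expected activation matches training, as derived above.
    h_test = h * (1.0 - p_drop)

    # In practice, "inverted dropout" divides by (1 - p_drop) at training time instead,
    # leaving the test-time forward pass unchanged.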
DropConnect
In the training step, randomly set some weights (connections) to zero
40
Regularization
DropConnect
In the training step, randomly set some weights (connections) to zero
41
Regularization
Stochastic Depth
In the training step, randomly drop entire layers
42
Regularization
Data augmentation
Crops / scales
- Original image: 256x480
- Sample random 224x224 patches
Randomize contrast and brightness
43
Regularization
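A sketch of this augmentation in NumPy (my own code; the jitter ranges are arbitrary example values, and only the 256x480 input and 224x224 crop sizes follow the slide):

    import numpy as np

    def augment(img, crop_h=224, crop_w=224):
        """Random crop plus brightness/contrast jitter."""
        H, W, _ = img.shape
        top  = np.random.randint(0, H - crop_h + 1)
        left = np.random.randint(0, W - crop_w + 1)
        patch = img[top:top + crop_h, left:left + crop_w, :]

        contrast   = np.random.uniform(0.8, 1.2)
        brightness = np.random.uniform(-20.0, 20.0)
        return np.clip(patch * contrast + brightness, 0.0, 255.0)

    img = np.random.randint(0, 256, size=(256, 480, 3)).astype(np.float32)
    out = augment(img)                          # 224x224x3 training patch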
44
ReLU, Xavier, Dropout
45
ReLU, Xavier, Dropout
46
ReLU, Xavier, Dropout
47
ReLU, Xavier, Dropout
Learning rate
48
Practical Tips for training a Neural Network
Transfer learning
49
Practical Tips for training a Neural Network
Weight initialization
- ReLU
- Leaky ReLU
Optimization
- Adam optimizer
- ...
Regularization
- Dropout or batch normalization is generally sufficient
50
Practical Tips for training a Neural Network
Activation functions
Sigmoid, tanh, ReLU, Leaky ReLU
Data preprocessing
Mean subtraction, normalization
Regularization
Model ensemble, dropout, data augmentation, ...
Tips for training a neural network
Learning rate, transfer learning
51
Summary
Computer Vision
Understanding image data
Convolutional Neural Network
Why CNNs are effective for images
52
Preview (Lecture 6)