This document discusses several normalization techniques used in deep learning models, including batch normalization, layer normalization, recurrent batch normalization, and group normalization. It provides an overview of how each technique works along with its advantages and disadvantages. For example, it explains that batch normalization stabilizes the distribution of inputs to hidden layers during training, which permits higher learning rates and faster training. Later sections discuss research investigating whether batch normalization's effectiveness is truly due to reducing internal covariate shift during training or instead because it smooths the optimization landscape.
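As a rough illustration of the normalization step batch normalization applies to a layer's inputs, here is a minimal NumPy sketch of the forward pass in training mode. The function name, shapes, and parameter names are illustrative assumptions, not drawn from the document; it omits the running statistics used at inference time.

```python
import numpy as np

def batch_norm_forward(x, gamma, beta, eps=1e-5):
    """Sketch of a batch normalization forward pass (training mode).

    x:     (batch_size, num_features) activations entering a hidden layer
    gamma: (num_features,) learned scale
    beta:  (num_features,) learned shift
    """
    # Statistics are computed per feature, across the mini-batch.
    mean = x.mean(axis=0)
    var = x.var(axis=0)

    # Normalize each feature to roughly zero mean and unit variance...
    x_hat = (x - mean) / np.sqrt(var + eps)

    # ...then restore representational capacity with a learned affine map.
    return gamma * x_hat + beta

# Example: a mini-batch of 4 samples with 3 features each.
x = np.random.randn(4, 3) * 5.0 + 2.0
out = batch_norm_forward(x, gamma=np.ones(3), beta=np.zeros(3))
print(out.mean(axis=0), out.var(axis=0))  # ~0 mean, ~1 variance per feature
```

The stabilized per-feature distribution of the outputs is what the summary refers to: because downstream layers see inputs with a consistent scale, larger learning rates become usable and training tends to converge faster.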