Recurrent Neural Networks
Alex Kalinin alex@alexkalinin.com
Content
1. Example of Vanilla RNN
2. RNN Forward pass
3. RNN Backward pass
4. LSTM design
RNN Training problem

Feed-forward ("vanilla") network: a fixed-size input vector X (the slide shows [1, 0, 0, 1, 0]) is mapped straight to an output y; nothing is remembered between inputs.

RNN: in addition to the input X, the network keeps a hidden state h. X feeds into h through W_xh, h feeds back into itself through W_hh, and the output y is read out from h through W_hy.
Vanilla recurrent network
1) h_t = tanh(W_hh h_{t-1} + W_xh x_t + b_h)
2) y_t = W_hy h_t + b_y
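As a minimal sketch (not from the slides), one forward step of this vanilla RNN in NumPy, with the names Wxh, Whh, Why, bh, by mirroring the symbols above:

```python
import numpy as np

def rnn_step(x, h_prev, Wxh, Whh, Why, bh, by):
    """One step of a vanilla RNN: update the hidden state, then read out y."""
    h = np.tanh(Whh @ h_prev + Wxh @ x + bh)   # 1) new hidden state
    y = Why @ h + by                           # 2) output (unnormalized scores)
    return h, y
```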
Example: character-level language processing

Training sequence: "hello"
Vocabulary: [e, h, l, o]

Each character enters the RNN (X -> h -> y, with weights W_xh, W_hh, W_hy) as a one-hot vector over the vocabulary:
"h" = [0, 1, 0, 0], "e" = [1, 0, 0, 0], "l" = [0, 0, 1, 0], "o" = [0, 0, 0, 1]
hX Y
𝑊ℎℎ = 4.1
𝑊𝑥ℎ = [3.6 −4.8 0.35 −0.26]
𝑊ℎ𝑦 =
−12.
−0.67
−0.85
14.
P
𝑏ℎ = 0.41
𝑏 𝑦 =
−0.2
−2.9
6.1
−3.4
Feeding "hello" through this RNN, starting from h_0 = 0 and applying
h_t = tanh(W_hh h_{t-1} + W_xh x + b_h), y = W_hy h_t + b_y, p = softmax(y) at every step:

Step 1, input "h" = [0, 1, 0, 0], previous h = 0:
h = -0.99, y = [11., -2.2, 6.9, -17], p = [0.99, 0, 0.01, 0] -> predicts "e" (target "e")

Step 2, input "e" = [1, 0, 0, 0], previous h = -0.99:
h = -0.09, y = [0.86, -2.8, 6.2, -4.6], p = [0, 0, 0.99, 0] -> predicts "l" (target "l")

Step 3, input "l" = [0, 0, 1, 0], previous h = -0.09:
h = 0.38, y = [-4.7, -3.2, 5.8, 1.9], p = [0, 0, 0.98, 0.02] -> predicts "l" (target "l")

Step 4, input "l" = [0, 0, 1, 0], previous h = 0.38:
h = 0.98, y = [-12., -3.6, 5.3, 10.], p = [0, 0, 0.01, 0.99] -> predicts "o" (target "o")

Summary: "h" (h_0 = 0) -> "e", "e" (h_1 = -0.99) -> "l", "l" (h_2 = -0.09) -> "l", "l" (h_3 = 0.38) -> "o": the network spells out "hello".
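As a small sketch (assuming P is the softmax of Y, which matches the probabilities above), the whole walkthrough can be reproduced in a few lines of NumPy using the slide's weights:

```python
import numpy as np

# Weights from the slides (single hidden unit, vocabulary [e, h, l, o]).
Whh, bh = 4.1, 0.41
Wxh = np.array([3.6, -4.8, 0.35, -0.26])
Why = np.array([-12., -0.67, -0.85, 14.])
by  = np.array([-0.2, -2.9, 6.1, -3.4])

vocab = ['e', 'h', 'l', 'o']
one_hot = {c: np.eye(4)[i] for i, c in enumerate(vocab)}

h = 0.0
for c in "hell":                              # "o" has no next character, so stop at "l"
    x = one_hot[c]
    h = np.tanh(Whh * h + Wxh @ x + bh)       # new hidden state (a scalar here)
    y = Why * h + by                          # unnormalized scores
    p = np.exp(y) / np.exp(y).sum()           # softmax probabilities
    print(c, "->", vocab[int(np.argmax(p))], round(float(h), 2), np.round(p, 2))
```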
After training, the network reproduces its training strings character by character:

"hello" -> "hello"
"hello ben" -> "hello ben"
"hello world" -> "hello world"

"it was" -> "it was"
"it was the" -> "it was the"
"it was the best" -> "it was the best"

Training text: "It was the best of times, it was the worst of times, it was the age of wisdom, it was the age of foolishness…", A Tale of Two Cities, Charles Dickens.

Training iterations and loss:
50,000
300,000 (loss = 1.6066)
1,000,000 (loss = 1.8197)
2,000,000 (loss = 4.0844): "it was the best of" -> "it wes the best of"
…
epoch 500000, loss: 6.447782290456328
…
epoch 1000000, loss: 5.290576956983398
…
epoch 1800000, loss: 4.267105168323299
epoch 1900000, loss: 4.175163586546514
epoch 2000000, loss: 4.0844739848413285
Vanilla recurrent network (recap)
1) h_t = tanh(W_hh h_{t-1} + W_xh x_t + b_h)
2) y_t = W_hy h_t + b_y

Training pairs are simply the text and the text shifted by one character:

Input:  "it was t"
Target: "t was th"
(each target character is the next character of the input)
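A tiny sketch of building these (input, target) character pairs from a text (the helper name is mine, not from the slides):

```python
def make_pairs(text):
    """Each character's training target is simply the next character in the text."""
    inputs, targets = text[:-1], text[1:]
    return list(zip(inputs, targets))

print(make_pairs("it was t"))
# -> [('i', 't'), ('t', ' '), (' ', 'w'), ('w', 'a'), ('a', 's'), ('s', ' '), (' ', 't')]
```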
RNNs for Different Problems
Vanilla neural network: one fixed-size input -> one fixed-size output
Image captioning: image -> sequence of words
Sentiment analysis: sequence of words -> class
Translation: sequence of words -> sequence of words
Toy training example: an RNN unrolled over three steps with inputs x_0 = 1, x_1 = 1, x_2 = 2, hidden states h_0, h_1, h_2, and target output 3. The loss depends on all three weights, L = f(W_xh, W_hh, W_hy), with current values

W_xh = 0.078, W_hh = 0.024, W_hy = 0.051.

Each weight is updated by gradient descent with learning rate 0.01:

w_xh := w_xh - 0.01 · ∂L/∂w_xh
w_hh := w_hh - 0.01 · ∂L/∂w_hh
w_hy := w_hy - 0.01 · ∂L/∂w_hy
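As a sketch of this update loop (my own illustration, not the slide's code), the gradients can be estimated numerically with finite differences. Note these are the full gradients, summed over every path through the unrolled network, whereas the hand-worked example below traces a single path:

```python
import numpy as np

def loss(Wxh, Whh, Why, xs=(1.0, 1.0, 2.0), target=3.0):
    """The three-step example above: L = f(Wxh, Whh, Why)."""
    h = 0.0
    for x in xs:
        h = np.tanh(Whh * h + Wxh * x)
    return (Why * h - target) ** 2

w = {"Wxh": 0.078, "Whh": 0.024, "Why": 0.051}
eps, lr = 1e-6, 0.01

grads = {}
for name in w:
    bumped_up, bumped_down = dict(w), dict(w)
    bumped_up[name] += eps
    bumped_down[name] -= eps
    grads[name] = (loss(**bumped_up) - loss(**bumped_down)) / (2 * eps)  # dL/dw

for name in w:
    w[name] -= lr * grads[name]   # w := w - 0.01 * dL/dw

print(grads)
print(w)
```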
Training is hard with vanilla RNNs

To train, we need the full gradient

∇L = [∂L/∂w_xh, ∂L/∂w_hh, ∂L/∂w_hy]

Forward pass: run the network and compute L. Backward pass: propagate derivatives from L back to W_xh, W_hh, W_hy.
Unrolled over the three steps, the forward pass is

h_0 = tanh(W_xh x_0)
h_1 = tanh(W_hh h_0 + W_xh x_1)
h_2 = tanh(W_hh h_1 + W_xh x_2)
y = W_hy h_2
L = (y - 3)²

What are L and ∂L/∂w_hh? Substituting the recurrences shows that the loss is a deeply nested function of W_hh:

L = (W_hy tanh(W_hh tanh(W_hh tanh(W_xh x_0) + W_xh x_1) + W_xh x_2) - 3)²

Compute gradient: recursive application of the chain rule. For a nested composition
L = f(g(h(k(l(m(n(w))))))), with f = f(g), g = g(h), h = h(k), …,

∂L/∂w = (∂f/∂g) · (∂g/∂h) · (∂h/∂k) · (∂k/∂l) · (∂l/∂m) · (∂m/∂n) · (∂n/∂w)
Gradient by hand

Example values: W_xh = 0.078, W_hh = 0.024, W_hy = 0.051; inputs x_0 = 1, x_1 = 1, x_2 = 2; target 3.

Forward Pass
h_0 = tanh(W_xh x_0)
h_1 = tanh(W_hh h_0 + W_xh x_1)
h_2 = tanh(W_hh h_1 + W_xh x_2)
y = W_hy h_2
L = (y - 3)²

Evaluating the computation graph node by node:
W_xh · x_0 = 0.078 · 1 = 0.078,  h_0 = tanh(0.078) = 0.0778
W_hh · h_0 = 0.00187,  W_xh · x_1 = 0.078,  sum = 0.07987,  h_1 = tanh(0.07987) = 0.0797
W_hh · h_1 = 0.0019,   W_xh · x_2 = 0.156,  sum = 0.1579,   h_2 = tanh(0.1579) = 0.1566
y = W_hy · h_2 = 0.051 · 0.1566 = 0.0080
y - 3 = -2.99,  L = (-2.99)² = 8.95
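A quick NumPy sketch of the same forward pass, reproducing the numbers above:

```python
import numpy as np

# Toy example from the slides: three time steps, scalar weights.
Wxh, Whh, Why = 0.078, 0.024, 0.051
xs, target = [1.0, 1.0, 2.0], 3.0

h, hs = 0.0, []
for x in xs:
    h = np.tanh(Whh * h + Wxh * x)    # hidden state at each step
    hs.append(h)

y = Why * h                            # output after the last step
L = (y - target) ** 2                  # squared-error loss
print([round(float(v), 4) for v in hs], round(float(y), 4), round(float(L), 2))
# -> [0.0778, 0.0797, 0.1566] 0.008 8.95
```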
Backward Pass

Apply the chain rule factor by factor, walking back through the same graph and multiplying the gradient flowing in from above by the local derivative of each node:

∂L/∂L = 1
f = g²:        ∂f/∂g = 2g = 2 · (-2.99) = -5.98
g = h - 3:     ∂g/∂h = 1, so the gradient at y is -5.98
h = W_hy · k:  ∂h/∂k = W_hy = 0.051, so the gradient at h_2 is -5.98 · 0.051 ≈ -0.304
               (also ∂h/∂W_hy = k = 0.1566, so ∂L/∂W_hy ≈ -5.98 · 0.1566 ≈ -0.936)
k = tanh(l):   ∂k/∂l = 1 - k² = 1 - 0.1566² = 0.975, gradient -0.304 · 0.975 ≈ -0.297
through W_hh:  gradient at h_1 is -0.297 · 0.024 ≈ -0.0071
tanh at h_1:   1 - 0.0797² = 0.993, gradient stays ≈ -0.0071
through W_hh again, then tanh at h_0 (1 - 0.0778² ≈ 0.994): gradient ≈ -0.00017

Following this longest path through the unrolled graph gives
∂L/∂w_xh ≈ -0.00017 (the term coming through x_0, since ∂(W_xh x_0)/∂w_xh = x_0 = 1)
∂L/∂w_hh ≈ -0.0071 · h_0 = -0.0071 · 0.0778 ≈ -0.0005 (the term coming through h_0)
(the shorter paths through x_1, x_2 and h_1 contribute further terms to the full gradients)

Weight update, w_a := w_a - 0.01 · ∂L/∂w_a:
w_xh := 0.078 - 0.01 · (-0.00017) = 0.0780017
w_hh := 0.024 - 0.01 · (-0.0005) = 0.024005
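The same hand computation as a sketch (again my own illustration), tracing the single deepest path through the unrolled graph:

```python
import numpy as np

Wxh, Whh, Why = 0.078, 0.024, 0.051
x0, x1, x2, target = 1.0, 1.0, 2.0, 3.0

# Forward pass (same as before), keeping the intermediate states.
h0 = np.tanh(Wxh * x0)
h1 = np.tanh(Whh * h0 + Wxh * x1)
h2 = np.tanh(Whh * h1 + Wxh * x2)
y = Why * h2

# Backward pass along the longest path, as traced on the slides.
g = 2 * (y - target)          # dL/dy                 = -5.98
g = g * Why                   # dL/dh2                = -0.304
g = g * (1 - h2 ** 2)         # through tanh at h2    = -0.297
g = g * Whh                   # dL/dh1                = -0.0071
g = g * (1 - h1 ** 2)         # through tanh at h1    = -0.0071
dWhh_via_h0 = g * h0          # slide's dL/dWhh term  = -0.0005
g = g * Whh                   # dL/dh0                = -0.00017
g = g * (1 - h0 ** 2)         # through tanh at h0
dWxh_via_x0 = g * x0          # slide's dL/dWxh term  = -0.00017

print(round(float(dWxh_via_x0), 5), round(float(dWhh_via_h0), 5))  # -> -0.00017 -0.00055
```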
Notice what backpropagation through time does to the gradient: every step back in time multiplies it by w_hh (and by a tanh derivative), so over n steps

∂L/∂x = w_hh · … · w_hh · … · w_hh = (w_hh)ⁿ · C(w)

With W_hh = 0.024, the factor (w_hh)ⁿ vanishes almost immediately:
1. 0.024
2. 0.000576
3. 1.382e-05
4. 3.318e-07
5. 7.963e-09
6. 1.911e-10
7. 4.586e-12
8. 1.101e-13
9. 2.642e-15
10. 6.340e-17

and each of the n tanh nonlinearities on the path contributes a further factor of at most 1.
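The table of powers above is just a one-line check (not from the slides):

```python
whh = 0.024
print([f"{whh ** n:.3e}" for n in range(1, 11)])
# -> ['2.400e-02', '5.760e-04', '1.382e-05', ..., '6.340e-17']
```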
Source: https://imgur.com/gallery/vaNahKE
Long Short-Term Memory (LSTM)

Vanilla RNN update: h_t = tanh( W · [x; h_{t-1}] )

LSTM: a single weight matrix W of size 4n × 2n acts on the concatenation of the input x and the previous hidden state h_{t-1} (each of size n) and produces four n-dimensional vectors, three squashed by a sigmoid and one by tanh:

[i; f; o; g] = [sigm; sigm; sigm; tanh] ( W · [x; h_{t-1}] )

c_t = f ∙ c_{t-1} + i ∙ g
h_t = o ∙ tanh(c_t)

f: forget gate (≈ 0 or 1), how much of the previous cell state c_{t-1} to keep
i: input gate (≈ 0 or 1), how much of the incoming candidate g to write into the cell
o: output gate, how much of tanh(c_t) to expose as the new hidden state h_t
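A minimal NumPy sketch of one LSTM step under these equations (biases omitted, as in the formula above; the ordering of the four gate blocks inside W is my assumption):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h_prev, c_prev, W):
    """One LSTM step: W has shape (4n, 2n) and acts on [x; h_prev]."""
    n = h_prev.shape[0]
    z = W @ np.concatenate([x, h_prev])   # four stacked gate pre-activations, shape (4n,)
    i = sigmoid(z[0 * n:1 * n])           # input gate
    f = sigmoid(z[1 * n:2 * n])           # forget gate
    o = sigmoid(z[2 * n:3 * n])           # output gate
    g = np.tanh(z[3 * n:4 * n])           # candidate cell content
    c = f * c_prev + i * g                # c_t = f * c_{t-1} + i * g
    h = o * np.tanh(c)                    # h_t = o * tanh(c_t)
    return h, c
```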
Flow of gradient

RNN: going back one step in time means passing through a tanh and multiplying by w_hh, so after n steps the gradient carries a factor (w_hh)ⁿ · C(w) and vanishes (or, for |w_hh| > 1, explodes).

LSTM: the cell state is updated additively, c_t = f ∙ c_{t-1} + i ∙ g, so the gradient can flow backwards along the chain of cell states from t + 1 to t to t - 1 through sums rather than through repeated multiplications.
Source: https://imgur.com/gallery/vaNahKE

Long Short-Term Memory (LSTM)
Source: https://colah.github.io/posts/2015-08-Understanding-LSTMs/
Reference
1. Long Short-Term Memory (Hochreiter & Schmidhuber, 1997), http://deeplearning.cs.cmu.edu/pdfs/Hochreiter97_lstm.pdf
2. Learning Long-Term Dependencies with Gradient Descent is Difficult (Bengio et al., 1994), http://www.dsi.unifi.it/~paolo/ps/tnn-94-gradient.pdf
3. http://neuralnetworksanddeeplearning.com/chap5.html
4. Deep Learning, Ian Goodfellow et al., The MIT Press
5. Recurrent Neural Networks, LSTM, Andrej Karpathy, Stanford lectures, https://www.youtube.com/watch?v=iX5V1WpxxkY
Alex Kalinin alex@alexkalinin.com
Editor's Notes
  • #5: First, we calculate new hidden state. We use both the previous hidden state and the input. Using the previous hidden states provides “memory”. Then, we use new hidden state to calculate new output, y. This is a forward pass. All operations are differentiable, so we can use vanilla back-propagation to train our network.
  • #37: First, we calculate new hidden state. We use both the previous hidden state and the input. Using the previous hidden states provides “memory”. Then, we use new hidden state to calculate new output, y. This is a forward pass. All operations are differentiable, so we can use vanilla back-propagation to train our network.
  • #43: We need to train our network. Calculating L is a forward pass. Updating weights is a back-propagation.
  • #44: How to calculate the derivative of a composite function.