RNN, Seq2Seq Learning
and Image Captioning
Dongang Wang
07 July 2017
Contents
• From RNN to LSTM
• Backpropagation through time (BPTT)
• Usage of LSTM
• Seq2Seq Learning
• Attention Mechanism
• Image Captioning
Recurrent Neural Network
Recurrent Neural Network
 The following update equations:
 with the cross entropy loss:
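 The equations themselves are images in this export; a reconstruction in the notation used by the rest of the derivation (matching Goodfellow et al., 2016) is:
$$a^{(t)} = b + W h^{(t-1)} + U x^{(t)}, \quad h^{(t)} = \tanh\!\left(a^{(t)}\right), \quad o^{(t)} = c + V h^{(t)}, \quad \hat{y}^{(t)} = \mathrm{softmax}\!\left(o^{(t)}\right)$$
$$L = \sum_t L^{(t)}, \qquad L^{(t)} = -\left(y^{(t)}\right)^{\mathsf T} \log \hat{y}^{(t)}$$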
Backpropagation Through Time
 Using the previous equations, we can find the derivatives of the loss with respect to the parameters $U, V, W, b, c$, whose sizes are
$$U \in \mathbb{R}^{s_a \times s_x}, \quad V \in \mathbb{R}^{s_o \times s_h}, \quad W \in \mathbb{R}^{s_a \times s_h}, \quad b \in \mathbb{R}^{s_a \times 1}, \quad c \in \mathbb{R}^{s_o \times 1}$$
 The $s$ stands for the size of each part. In particular, $s_a = s_h$, which means the function
$$\tanh(x) = \frac{e^{x} - e^{-x}}{e^{x} + e^{-x}} = \frac{1 - e^{-2x}}{1 + e^{-2x}}$$
 does not change the dimensions.
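 As a concrete illustration, a minimal NumPy sketch of one forward step under these equations (the sizes and random initialization are illustrative assumptions, not values from the slides):

import numpy as np

def softmax(o):
    e = np.exp(o - o.max())            # shift for numerical stability
    return e / e.sum()

s_x, s_h, s_o = 5, 8, 3                # illustrative sizes (s_a = s_h)
rng = np.random.default_rng(0)
U = rng.normal(size=(s_h, s_x)) * 0.1  # input-to-hidden,  s_a x s_x
W = rng.normal(size=(s_h, s_h)) * 0.1  # hidden-to-hidden, s_a x s_h
V = rng.normal(size=(s_o, s_h)) * 0.1  # hidden-to-output, s_o x s_h
b, c = np.zeros(s_h), np.zeros(s_o)

def step(x_t, h_prev):
    a = b + W @ h_prev + U @ x_t       # pre-activation a^(t)
    h = np.tanh(a)                     # h^(t); tanh keeps the dimension
    o = c + V @ h                      # logits o^(t)
    return h, softmax(o)               # hidden state and y_hat^(t)

h, y_hat = step(rng.normal(size=s_x), np.zeros(s_h))
loss = -np.log(y_hat[1])               # cross-entropy loss if the true class is m = 1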
Backpropagation Through Time
 Step 1: understand the loss
 The total loss is the sum of the per-step losses, $L = \sum_t L^{(t)}$, so we have
$$\frac{\partial L}{\partial L^{(t)}} = 1$$
 which means that for any parameter $M$:
$$\frac{\partial L}{\partial M} = \sum_t \frac{\partial L}{\partial L^{(t)}} \cdot \frac{\partial L^{(t)}}{\partial M} = \sum_t \frac{\partial L^{(t)}}{\partial M}$$
 In that case, we only need to deal with one time step at a time and sum over $t$.
Backpropagation Through Time
 Step 1: understand loss
 We assume that the labels are one-hot, so the loss for each time step should be
$$L^{(t)} = -\left(y^{(t)}\right)^{\mathsf T} \log \hat{y}^{(t)} = -y^{(t)}_m \log \hat{y}^{(t)}_m = -\log \hat{y}^{(t)}_m$$
 Only the $m$-th element (the true class) is left, and we will have
$$\frac{\partial L^{(t)}}{\partial \hat{y}^{(t)}} = \left(0, 0, \ldots, \frac{-1}{\hat{y}^{(t)}_m}, \ldots, 0, 0\right)^{\mathsf T}$$
Backpropagation Through Time
 Step 2: derivatives of V and c
 Straightforward:
$$\frac{\partial L^{(t)}}{\partial V} = \frac{\partial L^{(t)}}{\partial \hat{y}^{(t)}} \cdot \frac{\partial \hat{y}^{(t)}}{\partial o^{(t)}} \cdot \frac{\partial o^{(t)}}{\partial V}, \qquad \frac{\partial L^{(t)}}{\partial c} = \frac{\partial L^{(t)}}{\partial \hat{y}^{(t)}} \cdot \frac{\partial \hat{y}^{(t)}}{\partial o^{(t)}} \cdot \frac{\partial o^{(t)}}{\partial c}$$
 where
$$\frac{\partial o^{(t)}}{\partial V} = \left(h^{(t)}\right)^{\mathsf T}, \quad \text{and} \quad \frac{\partial o^{(t)}}{\partial c} = 1$$
Backpropagation Through Time
 Step 2: derivatives of V and c
 The derivative of the softmax $\hat{y}_i = e^{x_i} / \sum_j e^{x_j}$ is the Jacobian
$$\frac{\partial \hat{y}^{(t)}}{\partial o^{(t)}} = \begin{bmatrix} \dfrac{\partial \hat{y}^{(t)}_1}{\partial o^{(t)}_1} & \cdots & \dfrac{\partial \hat{y}^{(t)}_1}{\partial o^{(t)}_d} \\ \vdots & \ddots & \vdots \\ \dfrac{\partial \hat{y}^{(t)}_d}{\partial o^{(t)}_1} & \cdots & \dfrac{\partial \hat{y}^{(t)}_d}{\partial o^{(t)}_d} \end{bmatrix}$$
 with entries
$$\frac{\partial \hat{y}^{(t)}_i}{\partial o^{(t)}_j} = \begin{cases} \hat{y}^{(t)}_i \left(1 - \hat{y}^{(t)}_i\right), & \text{if } i = j \\ -\hat{y}^{(t)}_i \hat{y}^{(t)}_j, & \text{if } i \neq j \end{cases}$$
Backpropagation Through Time
 Step 2: derivatives of V and c
 Since we have
$$\frac{\partial L^{(t)}}{\partial \hat{y}^{(t)}} = \left(0, 0, \ldots, \frac{-1}{\hat{y}^{(t)}_m}, \ldots, 0, 0\right)^{\mathsf T}$$
 which picks out the $m$-th row of the Jacobian above (equivalently the $m$-th column, since the Jacobian is symmetric), the result will be
$$\frac{\partial L^{(t)}}{\partial o^{(t)}} = \left(0, 0, \ldots, \frac{-1}{\hat{y}^{(t)}_m}, \ldots, 0, 0\right) \frac{\partial \hat{y}^{(t)}}{\partial o^{(t)}} = \left(\hat{y}^{(t)}_1, \hat{y}^{(t)}_2, \ldots, \hat{y}^{(t)}_m - 1, \ldots, \hat{y}^{(t)}_d\right)^{\mathsf T} = \hat{y}^{(t)} - y^{(t)}$$
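 As a sanity check, a small NumPy snippet (an illustration, not from the slides) comparing the derived gradient $\hat{y}^{(t)} - y^{(t)}$ with a finite-difference estimate:

import numpy as np

def softmax(o):
    e = np.exp(o - o.max())                    # subtract max for numerical stability
    return e / e.sum()

rng = np.random.default_rng(0)
o = rng.normal(size=6)                         # logits o^(t)
m = 2                                          # index of the true (one-hot) class
y = np.zeros(6); y[m] = 1.0
loss = lambda logits: -np.log(softmax(logits)[m])

eps = 1e-6
numeric = np.zeros_like(o)
for j in range(o.size):                        # central finite differences
    d = np.zeros_like(o); d[j] = eps
    numeric[j] = (loss(o + d) - loss(o - d)) / (2 * eps)

analytic = softmax(o) - y                      # the derived gradient: y_hat - y
print(np.max(np.abs(numeric - analytic)))      # should be on the order of 1e-9 or smaller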
Backpropagation Through Time
 Step 2: derivatives of V and c
 We have already got
$$\frac{\partial o^{(t)}}{\partial V} = \left(h^{(t)}\right)^{\mathsf T}, \quad \frac{\partial o^{(t)}}{\partial c} = 1, \quad \text{and} \quad \frac{\partial L^{(t)}}{\partial o^{(t)}} = \frac{\partial L^{(t)}}{\partial \hat{y}^{(t)}} \cdot \frac{\partial \hat{y}^{(t)}}{\partial o^{(t)}} = \hat{y}^{(t)} - y^{(t)}$$
 so:
$$\frac{\partial L}{\partial V} = \sum_t \frac{\partial L^{(t)}}{\partial \hat{y}^{(t)}} \cdot \frac{\partial \hat{y}^{(t)}}{\partial o^{(t)}} \cdot \frac{\partial o^{(t)}}{\partial V} = \sum_t \left(\hat{y}^{(t)} - y^{(t)}\right) \left(h^{(t)}\right)^{\mathsf T}$$
$$\frac{\partial L}{\partial c} = \sum_t \frac{\partial L^{(t)}}{\partial \hat{y}^{(t)}} \cdot \frac{\partial \hat{y}^{(t)}}{\partial o^{(t)}} \cdot \frac{\partial o^{(t)}}{\partial c} = \sum_t \left(\hat{y}^{(t)} - y^{(t)}\right)$$
Backpropagation Through Time
 Step 3: derivatives of h
 The hidden state $h^{(t)}$ receives gradient both from the current output and from the next time step:
$$\frac{\partial L}{\partial h^{(t)}} = \frac{\partial L}{\partial o^{(t)}} \cdot \frac{\partial o^{(t)}}{\partial h^{(t)}} + \frac{\partial L}{\partial h^{(t+1)}} \cdot \frac{\partial h^{(t+1)}}{\partial a^{(t+1)}} \cdot \frac{\partial a^{(t+1)}}{\partial h^{(t)}}$$
 We already have
$$\frac{\partial L^{(t)}}{\partial o^{(t)}} = \frac{\partial L^{(t)}}{\partial \hat{y}^{(t)}} \cdot \frac{\partial \hat{y}^{(t)}}{\partial o^{(t)}} = \hat{y}^{(t)} - y^{(t)}$$
 and
$$\frac{\partial o^{(t)}}{\partial h^{(t)}} = V^{\mathsf T}, \quad \text{and} \quad \frac{\partial a^{(t+1)}}{\partial h^{(t)}} = W^{\mathsf T}$$
Backpropagation Through Time
 Step 3: derivatives of h
 The derivative of tanh:
$$\frac{d \tanh(x)}{dx} = \frac{4}{e^{2x} + e^{-2x} + 2}$$
 We have observed that
$$\tanh^2(x) = \frac{e^{2x} + e^{-2x} - 2}{e^{2x} + e^{-2x} + 2} = 1 - \frac{4}{e^{2x} + e^{-2x} + 2}$$
 so
$$\frac{d \tanh(x)}{dx} = 1 - \tanh^2(x)$$
Backpropagation Through Time
 Step 3: derivatives of h
 Combining all of the above gives a backward recursion for $\partial L / \partial h^{(t)}$ that starts from the last time step; the slide's formulas are reconstructed below.
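 A reconstruction from Steps 2-3 (consistent with Goodfellow et al., 2016), using $\mathrm{diag}\!\left(1 - (h^{(t+1)})^2\right)$ from the tanh derivative:
$$\frac{\partial L}{\partial h^{(t)}} = V^{\mathsf T} \left(\hat{y}^{(t)} - y^{(t)}\right) + W^{\mathsf T} \, \mathrm{diag}\!\left(1 - \left(h^{(t+1)}\right)^2\right) \frac{\partial L}{\partial h^{(t+1)}}$$
 At the last time step $\tau$ there is no future term, so the recursion starts from
$$\frac{\partial L}{\partial h^{(\tau)}} = V^{\mathsf T} \left(\hat{y}^{(\tau)} - y^{(\tau)}\right)$$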
Backpropagation Through Time
 Step 4: derivatives of U, W and b
 We can write:
$$\frac{\partial L^{(t)}}{\partial b} = \frac{\partial L^{(t)}}{\partial h^{(t)}} \cdot \frac{\partial h^{(t)}}{\partial a^{(t)}} \cdot \frac{\partial a^{(t)}}{\partial b}, \qquad \frac{\partial L^{(t)}}{\partial U} = \frac{\partial L^{(t)}}{\partial h^{(t)}} \cdot \frac{\partial h^{(t)}}{\partial a^{(t)}} \cdot \frac{\partial a^{(t)}}{\partial U}, \qquad \frac{\partial L^{(t)}}{\partial W} = \frac{\partial L^{(t)}}{\partial h^{(t)}} \cdot \frac{\partial h^{(t)}}{\partial a^{(t)}} \cdot \frac{\partial a^{(t)}}{\partial W}$$
 and we have
$$\frac{\partial a^{(t)}}{\partial b} = 1, \quad \frac{\partial a^{(t)}}{\partial W} = \left(h^{(t-1)}\right)^{\mathsf T}, \quad \text{and} \quad \frac{\partial a^{(t)}}{\partial U} = \left(x^{(t)}\right)^{\mathsf T}$$
Backpropagation Through Time
 Summary:
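 The summary formulas on these slides are images in the export; collecting the results of Steps 1-4 (consistent with Goodfellow et al., 2016), with $\partial L / \partial h^{(t)}$ computed by the backward recursion of Step 3:
$$\frac{\partial L}{\partial V} = \sum_t \left(\hat{y}^{(t)} - y^{(t)}\right) \left(h^{(t)}\right)^{\mathsf T}, \qquad \frac{\partial L}{\partial c} = \sum_t \left(\hat{y}^{(t)} - y^{(t)}\right)$$
$$\frac{\partial L}{\partial W} = \sum_t \mathrm{diag}\!\left(1 - \left(h^{(t)}\right)^2\right) \frac{\partial L}{\partial h^{(t)}} \left(h^{(t-1)}\right)^{\mathsf T}, \qquad \frac{\partial L}{\partial U} = \sum_t \mathrm{diag}\!\left(1 - \left(h^{(t)}\right)^2\right) \frac{\partial L}{\partial h^{(t)}} \left(x^{(t)}\right)^{\mathsf T}$$
$$\frac{\partial L}{\partial b} = \sum_t \mathrm{diag}\!\left(1 - \left(h^{(t)}\right)^2\right) \frac{\partial L}{\partial h^{(t)}}$$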
Long-term Dependency
 This is an inherent problem, similar to the gradient vanishing problem in CNNs.
 Let's focus on the hidden state. Ignoring the input and the nonlinearity, the recurrence essentially acts as repeated matrix multiplication, which lets us write the state at the last step in terms of the first:
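 The equations here are images in the export; in the simplified linear view the slide describes (following Goodfellow et al., 2016), they are approximately
$$h^{(t)} \approx W^{\mathsf T} h^{(t-1)}, \qquad h^{(t)} \approx \left(W^{t}\right)^{\mathsf T} h^{(0)}$$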
Long-term Dependency
 If we take the eigendecomposition of the
above equation:
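 Assuming $W$ admits the eigendecomposition $W = Q \Lambda Q^{\mathsf T}$ with orthogonal $Q$ (a reconstruction of the missing formula), the unrolled state becomes
$$h^{(t)} \approx Q^{\mathsf T} \Lambda^{t} Q \, h^{(0)}$$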
 This means the eigenvalues (the diagonal of $\Lambda$) should not be too large or too small; otherwise the hidden states will vanish or blow up after several steps. The ideal magnitude is ≈ 1.
Long Short Term Memory
 LSTM was proposed in 1997. It is designed to add gating variations to the basic recurrent cell, so that the network can control how information flows through time.
 σ  sigmoid
 tanh  hyperbolic tangent
Long Short Term Memory
 i  input gate, f  forget gate,
 o  output gate, g  input modulation,
 z  output, h  state, c  memory cell
Usage of RNN(LSTM)
 One input, many outputs
• Image Captioning
• Language Translation
 Many inputs, one output
• Video Classification
• Language Classification
 Many inputs, many outputs
• Language Translation
• Video Captioning
Seq2Seq
 One successful application of LSTMs is sequence-to-sequence (Seq2Seq) learning. It was first introduced to solve machine translation problems (a kind of transfer learning).
– Sequence to Sequence Learning with Neural Networks (2014)
– Learning Phrase Representations using RNN Encoder-Decoder for Statistical Machine Translation (2014)
Seq2Seq
 NLP foundations:
• Word embedding
There are two ways to represent a word. One is to use a dictionary, where each word becomes a one-hot vector. The other is to use word embedding tools such as word2vec.
• Beam search
At each output step the model produces a vector of softmax probabilities. Greedily choosing the word with the largest probability at every step is not optimal. Instead, we keep the k most probable partial sequences at each step, where k is the beam size (a minimal sketch follows below).
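 A minimal Python sketch of beam search over per-step softmax outputs (the step_probs interface, the tokens, and the defaults are hypothetical placeholders, not from the slides):

import heapq
import math

def beam_search(step_probs, start_token, end_token, beam_size=3, max_len=20):
    # step_probs(prefix) -> dict {next_token: probability}; hypothetical model interface.
    beams = [(0.0, [start_token])]            # (cumulative negative log-prob, sequence)
    finished = []
    for _ in range(max_len):
        candidates = []
        for nll, seq in beams:
            if seq[-1] == end_token:          # already complete: keep as is
                finished.append((nll, seq))
                continue
            for token, p in step_probs(seq).items():
                candidates.append((nll - math.log(p), seq + [token]))
        if not candidates:
            break
        # keep only the k most probable partial sequences (k = beam size)
        beams = heapq.nsmallest(beam_size, candidates, key=lambda c: c[0])
    finished.extend(beams)
    return min(finished, key=lambda c: c[0])[1]

 With beam_size = 1 this reduces to greedy decoding, i.e. always taking the single most probable word.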
Seq2Seq
 Model:
– Two LSTMs: encoder and decoder
– The sentence is encoded into a fixed-length vector
– The vector acts as the first input to the decoder, and the output of each time step is fed in as the input of the next time step (a sketch follows below).
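 A minimal PyTorch sketch of a two-LSTM encoder-decoder (an illustration under assumed layer sizes and greedy decoding; this common variant passes the encoder's final state to the decoder, rather than being the exact architecture on the slide):

import torch
import torch.nn as nn

class Seq2Seq(nn.Module):
    """Encoder LSTM compresses the source into a fixed-length state;
    the decoder LSTM starts from that state and feeds each output back in."""
    def __init__(self, src_vocab, tgt_vocab, emb=128, hid=256):
        super().__init__()
        self.src_emb = nn.Embedding(src_vocab, emb)
        self.tgt_emb = nn.Embedding(tgt_vocab, emb)
        self.encoder = nn.LSTM(emb, hid, batch_first=True)
        self.decoder = nn.LSTM(emb, hid, batch_first=True)
        self.out = nn.Linear(hid, tgt_vocab)

    def forward(self, src, bos_id, max_len=20):
        _, state = self.encoder(self.src_emb(src))        # fixed-length (h, c)
        token = torch.full((src.size(0), 1), bos_id, dtype=torch.long)
        outputs = []
        for _ in range(max_len):
            dec_out, state = self.decoder(self.tgt_emb(token), state)
            logits = self.out(dec_out[:, -1])
            token = logits.argmax(dim=-1, keepdim=True)   # feed the prediction back in
            outputs.append(token)
        return torch.cat(outputs, dim=1)

# Usage with hypothetical vocabulary ids:
# model = Seq2Seq(src_vocab=1000, tgt_vocab=1000)
# ys = model(torch.randint(0, 1000, (2, 7)), bos_id=1)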
Seq2Seq
 Tricks:
• Deep LSTM with four layers.
The output of each layer is the input of the next layer, and the final output at each time step is fed in as the input of the next time step. (Other choices are possible.)
• Reverse the order of the words in the input.
The stated reason is that this makes the minimal time lag between corresponding source and target words smaller than with the normal order. However, I think the real reason is that it is more important for the decoder to start from a more precise beginning.
Attention
 Problems with the previous method:
– Only the output from the last time step of the encoder is used, so sequence information is lost
– The encoded feature vector has a fixed length regardless of the input length
– Not robust for long sentences in translation
 The attention mechanism was proposed to address these issues.
– Neural Machine Translation by Jointly Learning to Align and Translate (2015)
Attention
 Model:
– Encoder: bidirectional LSTM
– Decoder: takes the label and state from the previous time step, together with a weighted combination of all the encoder features.
Attention
 How the attention weights are decided:
 The weights α are the softmax probabilities of the energies e; they indicate where the model attends.
 The energy e is computed by a feedforward neural network a that is learned jointly. The energies change for every sentence, so the attention weights change as well.
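 The formulas are images in the export; in the notation of Bahdanau et al. (2014), with encoder annotations $h_j$ and previous decoder state $s_{i-1}$, they read
$$e_{ij} = a\!\left(s_{i-1}, h_j\right), \qquad \alpha_{ij} = \frac{\exp(e_{ij})}{\sum_k \exp(e_{ik})}, \qquad c_i = \sum_j \alpha_{ij} h_j$$
 where the alignment model $a$ is a feedforward network trained jointly with the rest of the system, and $c_i$ is the context vector fed to the decoder at step $i$.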
Image Captioning
 If we replace the encoder with a CNN to deal with images, the structure will transfer information from an image to language, which is the idea of image captioning.
– Show, Attend and Tell: Neural Image Caption Generation with Visual Attention (2015)
– Boosting Image Captioning with Attributes (2017)
– Describing Videos by Exploiting Temporal
Structure (2015)
Image Captioning
 Encoder:
– A Convolutional Neural Network extracts the features
– Features are taken from a convolutional layer instead of a fully-connected layer. If the feature map has L spatial locations and D channels, the output is L vectors a, each of dimension D; each vector corresponds to (focuses on) one part of the image (attention).
– The vectors a are combined into one context vector ẑ.
Image Captioning
 Decoder:
 Compared with the basic LSTM, one more input, the context vector ẑ, is taken into consideration.
Image Captioning
 Attention:
 Similar to the attention mechanism in translation, this part models the relationship between the feature vectors a and the context vector ẑ.
Image Captioning
 Attention:
– Hard attention: choose one part to attend to by sampling according to the attention probabilities
– Soft attention: similar to the attention used in translation, take the weighted average of the features.
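 In the notation of Xu et al. (2015), with annotation vectors $a_i$ and previous hidden state $h_{t-1}$ (a reconstruction of the missing formulas):
$$e_{ti} = f_{\text{att}}\!\left(a_i, h_{t-1}\right), \qquad \alpha_{ti} = \frac{\exp(e_{ti})}{\sum_k \exp(e_{tk})}$$
 Soft attention takes the expected context vector $\hat{z}_t = \sum_i \alpha_{ti} a_i$, while hard attention samples one location $s_t$ from the distribution $\alpha_t$ and sets $\hat{z}_t = a_{s_t}$.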
Look Ahead
 Variants of RNNs:
– Hierarchical LSTM: also known as stacked LSTM,
or Deep Recurrent Neural Network
– Bidirectional LSTM: information in two directions
 Alternatives of RNN:
– Convolutional Seq2Seq
– Attention-only Seq2Seq and one model for all
References
[Goodfellow, 2016] Goodfellow, I., Bengio, Y., & Courville, A. (2016). Deep Learning. MIT Press.
[Sutskever, 2014] Sutskever, I., Vinyals, O., & Le, Q. V. (2014). Sequence to Sequence Learning with Neural Networks. In Advances in Neural Information Processing Systems (pp. 3104-3112).
[Bahdanau, 2014] Bahdanau, D., Cho, K., & Bengio, Y. (2014). Neural Machine Translation by Jointly Learning to Align and Translate. arXiv preprint arXiv:1409.0473. ICLR 2015.
[Yao, 2016] Yao, T., Pan, Y., Li, Y., Qiu, Z., & Mei, T. (2016). Boosting Image Captioning with Attributes. ICLR 2017.
[Xu, 2015] Xu, K., Ba, J., Kiros, R., Cho, K., Courville, A., Salakhutdinov, R., et al. (2015). Show, Attend and Tell: Neural Image Caption Generation with Visual Attention. Computer Science, 2048-2057.
[Yao, 2015] Yao, L., Torabi, A., Cho, K., et al. (2015). Describing Videos by Exploiting Temporal Structure. Eprint arXiv, 53:199-211.
References
[Chollet, 2016] Chollet, F. (2016). Xception: Deep Learning with Depthwise Separable Convolutions. arXiv preprint arXiv:1610.02357.
[Gehring, 2017] Gehring, J., Auli, M., Grangier, D., Yarats, D., & Dauphin, Y. N. (2017). Convolutional Sequence to Sequence Learning. arXiv preprint arXiv:1705.03122.
[Vaswani, 2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., ... & Polosukhin, I. (2017). Attention Is All You Need. arXiv preprint arXiv:1706.03762.
[Kaiser, 2017] Kaiser, L., Gomez, A. N., Shazeer, N., Vaswani, A., Parmar, N., Jones, L., & Uszkoreit, J. (2017). One Model To Learn Them All. arXiv preprint arXiv:1706.05137.