Vanishing/Exploding Gradient Problem
• Backpropagated errors multiply at each layer, resulting in exponential decay (if the derivatives are small) or exponential growth (if the derivatives are large).
• This makes it very difficult to train deep networks, or simple recurrent networks (SRNs) over many time steps.
1
Long Distance Dependencies
• It is very difficult to train SRNs to retain information over many time steps.
• This makes it very difficult for SRNs to learn long-distance dependencies, such as subject-verb agreement.
2
Long Short Term Memory
• LSTM networks add additional gating units to each memory cell:
– Forget gate
– Input gate
– Output gate
• These gates prevent the vanishing/exploding gradient problem and allow the network to retain state information over longer periods of time.
3
LSTM Network Architecture
4
Cell State
• Maintains a vector Ct that has the same dimensionality as the hidden state, ht.
• Information can be added to or deleted from this state vector via the forget and input gates.
5
Cell State Example
• Want to remember the person & number of a subject noun so that it can be checked for agreement with the person & number of the verb when the verb is eventually encountered.
• The forget gate will remove existing information about a prior subject when a new one is encountered.
• The input gate "adds" in the information for the new subject.
6
Forget Gate
• The forget gate computes a 0-1 value using a logistic sigmoid output function of the input, xt, and the previous hidden state, ht-1 (see the equation below).
• This value is multiplicatively combined with the cell state, "forgetting" information where the gate outputs something close to 0.
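The equation on this slide did not survive extraction; a standard formulation of the forget gate, using the Wf weights named on the training slide (the bias term b_f is an added assumption), is:

f_t = \sigma\left( W_f \cdot [h_{t-1}, x_t] + b_f \right)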
7
Hyperbolic Tangent Units
• Tanh can be used as an alternative
nonlinear function to the sigmoid logistic
(0-1) output function.
• Used to produce thresholded output
between –1 and 1.
8
Input Gate
• First, determine which entries in the cell state to update by computing a 0-1 sigmoid output.
• Then determine what amount to add/subtract from these entries by computing a tanh function (valued –1 to 1) of the input and previous hidden state (see the equations below).
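The equations for this slide were likewise lost in extraction; in the standard formulation (Wi and WC are named on the training slide, the biases b_i and b_C are assumed), the input gate i_t and candidate values C̃_t are:

i_t = \sigma\left( W_i \cdot [h_{t-1}, x_t] + b_i \right)
\tilde{C}_t = \tanh\left( W_C \cdot [h_{t-1}, x_t] + b_C \right)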
9
Updating the Cell State
• The cell state is updated using component-wise vector multiplication to "forget" and vector addition to "input" new information (see the equation below).
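In standard notation (⊙ denotes component-wise multiplication), the update is:

C_t = f_t \odot C_{t-1} + i_t \odot \tilde{C}_t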
10
Output Gate
• The hidden state is updated based on a "filtered" version of the cell state, scaled to –1 to 1 using tanh.
• The output gate computes a sigmoid function of the input and previous hidden state to determine which elements of the cell state to "output" (see the equations below).
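In standard notation (Wo is named on the training slide, the bias b_o is assumed), the output gate and new hidden state are:

o_t = \sigma\left( W_o \cdot [h_{t-1}, x_t] + b_o \right)
h_t = o_t \odot \tanh(C_t)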
11
Overall Network Architecture
• Single or multilayer networks can compute
LSTM inputs from problem inputs and
problem outputs from LSTM outputs.
12
[Figure: input It (e.g. a word as a “one hot” vector) → word “embedding” with reduced dimensionality → LSTM → output Ot (e.g. a POS tag as a “one hot” vector)]
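Since the deck's title mentions TensorFlow and Python, here is a minimal, hedged tf.keras sketch of the pipeline in the figure; a POS-tagging setup is assumed, and vocab_size, num_tags, and the layer sizes are illustrative rather than taken from the slides:

import tensorflow as tf

vocab_size, embed_dim, hidden_dim, num_tags = 10000, 128, 256, 45

model = tf.keras.Sequential([
    tf.keras.layers.Embedding(vocab_size, embed_dim),         # word index -> "embedding" with reduced dimensionality
    tf.keras.layers.LSTM(hidden_dim, return_sequences=True),  # hidden state ht at every time step
    tf.keras.layers.Dense(num_tags, activation="softmax"),    # ht -> distribution over POS tags (Ot)
])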
LSTM Training
• Trainable with backpropagation using gradient-based optimizers (see the training sketch below), such as:
– Stochastic gradient descent (randomize the order of examples in each epoch) with momentum (bias weight changes to continue in the same direction as the last update).
– Adam optimizer (Kingma & Ba, 2015)
• Each cell has many parameters (Wf, Wi, WC, Wo)
– Generally requires lots of training data.
– Requires lots of compute time, typically exploiting GPU clusters.
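A hedged training sketch in tf.keras that rebuilds the tagging model from the previous sketch; X and Y are randomly generated stand-ins for padded word-index and tag-index arrays, and all sizes are illustrative:

import numpy as np
import tensorflow as tf

vocab_size, num_tags, max_len = 10000, 45, 20
X = np.random.randint(0, vocab_size, size=(1000, max_len))  # stand-in word indices
Y = np.random.randint(0, num_tags, size=(1000, max_len))    # stand-in tag indices

model = tf.keras.Sequential([
    tf.keras.layers.Embedding(vocab_size, 128),
    tf.keras.layers.LSTM(256, return_sequences=True),
    tf.keras.layers.Dense(num_tags, activation="softmax"),
])

# Either optimizer from the slide works; Adam is used below.
sgd = tf.keras.optimizers.SGD(learning_rate=0.01, momentum=0.9)  # SGD with momentum
adam = tf.keras.optimizers.Adam(learning_rate=1e-3)              # Adam (Kingma & Ba, 2015)

model.compile(optimizer=adam, loss="sparse_categorical_crossentropy")
model.fit(X, Y, batch_size=32, epochs=3, shuffle=True)  # shuffle randomizes example order each epoch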
13
General Problems Solved with LSTMs
• Sequence labeling
– Train with supervised output at each time step
computed using a single or multilayer network
that maps the hidden state (ht) to an output
vector (Ot).
• Language modeling
– Train to predict the next input (Ot = It+1)
• Sequence (e.g. text) classification
– Train a single or multilayer network that maps the
final hidden state (hn) to an output vector (O).
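For sequence classification, the only change from the tagging sketch above is that the LSTM returns just its final hidden state hn, which a dense layer maps to a single output vector; a hedged tf.keras sketch with illustrative sizes:

import tensorflow as tf

vocab_size, num_classes = 10000, 2
classifier = tf.keras.Sequential([
    tf.keras.layers.Embedding(vocab_size, 128),
    tf.keras.layers.LSTM(256),                                  # return_sequences=False: only hn is returned
    tf.keras.layers.Dense(num_classes, activation="softmax"),   # hn -> output vector O
])

# For language modeling, keep return_sequences=True and train the network to
# predict the next token, i.e. the target at step t is the input at step t+1.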
14
Sequence to Sequence
Transduction (Mapping)
• An Encoder/Decoder framework: an encoder LSTM maps the input sequence to a "deep vector", then a decoder LSTM maps this vector to an output sequence.
15
[Figure: Encoder LSTM maps the input sequence I1, I2,…,In to a final hidden vector hn; Decoder LSTM maps hn to the output sequence O1, O2,…,Om]
• Train model "end to end" on I/O pairs of
sequences.
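A hedged tf.keras sketch of the encoder/decoder idea; vocabulary sizes and dimensions are illustrative, and teacher forcing (feeding the gold output sequence to the decoder during training) is assumed:

import tensorflow as tf

src_vocab, tgt_vocab, embed_dim, hidden_dim = 8000, 8000, 128, 256

# Encoder: consume I1..In and keep only the final LSTM states (the "deep vector").
enc_inputs = tf.keras.Input(shape=(None,))
enc_emb = tf.keras.layers.Embedding(src_vocab, embed_dim)(enc_inputs)
_, state_h, state_c = tf.keras.layers.LSTM(hidden_dim, return_state=True)(enc_emb)

# Decoder: generate O1..Om, initialized with the encoder's final states.
dec_inputs = tf.keras.Input(shape=(None,))
dec_emb = tf.keras.layers.Embedding(tgt_vocab, embed_dim)(dec_inputs)
dec_out = tf.keras.layers.LSTM(hidden_dim, return_sequences=True)(
    dec_emb, initial_state=[state_h, state_c])
dec_probs = tf.keras.layers.Dense(tgt_vocab, activation="softmax")(dec_out)

seq2seq = tf.keras.Model([enc_inputs, dec_inputs], dec_probs)
seq2seq.compile(optimizer="adam", loss="sparse_categorical_crossentropy")

At inference time the decoder would instead be run one step at a time, feeding each predicted token back in as its next input.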
Summary of
LSTM Application Architectures
16
[Figure: application architectures for Image Captioning, Video Activity Recognition, Text Classification, Video Captioning, Machine Translation, POS Tagging, and Language Modeling]
Successful Applications of LSTMs
• Speech recognition: Language and acoustic modeling
• Sequence labeling
– POS Tagging
https://www.aclweb.org/aclwiki/index.php?title=POS_Tagging_(State_of_the_art)
– NER
– Phrase Chunking
• Neural syntactic and semantic parsing
• Image captioning: CNN output vector to sequence
• Sequence to Sequence
– Machine Translation (Sutskever, Vinyals, & Le, 2014)
– Video Captioning (input sequence of CNN frame outputs)
17
Bi-directional LSTM (Bi-LSTM)
• Separate LSTMs process the sequence forward and backward, and their hidden states at each time step are concatenated to form the cell output.
18
[Figure: Bi-LSTM unrolled over inputs xt-1, xt, xt+1, with the forward and backward hidden states concatenated into ht-1, ht, ht+1]
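A hedged tf.keras sketch of a Bi-LSTM layer; the Bidirectional wrapper concatenates the forward and backward hidden states at each time step by default, and the sizes are illustrative:

import tensorflow as tf

bi_tagger = tf.keras.Sequential([
    tf.keras.layers.Embedding(10000, 128),
    tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(256, return_sequences=True)),  # concat of forward & backward ht
    tf.keras.layers.Dense(45, activation="softmax"),  # e.g. per-token tag output
])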
Gated Recurrent Unit
(GRU)
• An alternative RNN unit to the LSTM that uses fewer gates (Cho et al., 2014):
– Combines the forget and input gates into an “update” gate.
– Eliminates the separate cell state vector.
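In tf.keras, swapping the LSTM layer for a GRU is a one-line change; a hedged sketch with illustrative sizes:

import tensorflow as tf

gru_tagger = tf.keras.Sequential([
    tf.keras.layers.Embedding(10000, 128),
    tf.keras.layers.GRU(256, return_sequences=True),  # update/reset gates, no separate cell state
    tf.keras.layers.Dense(45, activation="softmax"),
])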
19
GRU vs. LSTM
• GRU has significantly fewer parameters and
trains faster.
• Experimental results comparing the two are still inconclusive: on many problems they perform about the same, but each works better than the other on some problems.
20
Attention
• For many applications, it helps to add “attention” to
RNNs.
• Allows the network to learn to attend to different parts of the input at different time steps, shifting its focus to different aspects as processing proceeds.
• Used in image captioning to focus on different parts of
an image when generating different parts of the output
sentence.
• In MT, allows focusing attention on different parts of
the source sentence when generating different parts of
the translation.
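A hedged NumPy sketch of one simple form of attention (dot-product attention over a sequence of encoder hidden states); the shapes and random values are purely illustrative:

import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()

encoder_states = np.random.randn(6, 256)  # h1..h6 for a 6-token source sentence
decoder_state = np.random.randn(256)      # current decoder hidden state

scores = encoder_states @ decoder_state   # one relevance score per source position
weights = softmax(scores)                 # attention distribution over the source
context = weights @ encoder_states        # weighted sum of the states the model "attends" to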
21
Attention for Image Captioning
(Xu, et al. 2015)
22
Conclusions
• By adding “gates” to an RNN, we can prevent
the vanishing/exploding gradient problem.
• Trained LSTMs/GRUs can retain state
information longer and handle long-distance
dependencies.
• Recent impressive results on a range of
challenging NLP problems.
23