RECURRENT NEURAL NETWORKS
PART 1: THEORY
ANDRII GAKHOV
10.12.2015
Techtalk @ ferret
FEEDFORWARD NEURAL NETWORKS
NEURAL NETWORKS: INTUITION
▸ A neural network is a computational graph whose nodes are computing units and whose directed edges transmit numerical information from node to node.
▸ Each computing unit (neuron) is capable of evaluating a single
primitive function (activation function) of its input.
▸ In fact, the network represents a chain of function compositions that transforms an input vector into an output vector.
[Diagram: a feedforward network with input layer (x1, x2, x3), hidden layer (h1, h2, h3, h4) and output layer (y1, y2); W holds the input-to-hidden weights and V the hidden-to-output weights]
NEURAL NETWORKS: FNN
▸ The feedforward neural network (FNN) is the most basic and most widely used artificial neural network. It consists of computational units arranged into a sequence of layers.
[Diagram: input x → hidden layer h (weights W) → output y (weights V)]
h = g(Wx)
y = f(Vh)
Unlike the hidden layer units, the output layer units most commonly use one of the following activation functions:
‣ the linear (identity) function (for regression problems)
‣ softmax (for classification problems).
▸ g - activation function for the hidden layer units
▸ f - activation function for the output layer units
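As an illustration, a single forward pass of such a one-hidden-layer network can be sketched in a few lines of numpy (the layer sizes and the tanh/softmax choices below are illustrative assumptions, not part of the slides):

import numpy as np

def softmax(z):
    e = np.exp(z - z.max())          # subtract the max for numerical stability
    return e / e.sum()

rng = np.random.default_rng(0)
W = rng.normal(size=(4, 3))          # input -> hidden weights (4 hidden units, 3 inputs)
V = rng.normal(size=(2, 4))          # hidden -> output weights (2 output classes)

x = np.array([0.5, -1.0, 2.0])       # a single input vector
h = np.tanh(W @ x)                   # h = g(Wx), here g = tanh
y = softmax(V @ h)                   # y = f(Vh), softmax for a classification problem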
NEURAL NETWORKS: POPULAR ACTIVATION FUNCTIONS
▸ Nonlinear “squashing” functions
▸ Easy to find the derivative (see the example below the formulas)
▸ Strengthen weak signals and don't pay too much attention to already-strong signals
σ(x) = 1 / (1 + e^(−x)),  σ: ℝ → (0, 1)
tanh(x) = (e^x − e^(−x)) / (e^x + e^(−x)),  tanh: ℝ → (−1, 1)
[Plots: the sigmoid and tanh curves]
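Both derivatives can be expressed through the function values themselves, which is what makes them cheap to compute during training; a small numpy illustration (not part of the slides):

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

x = np.linspace(-5.0, 5.0, 11)
s, t = sigmoid(x), np.tanh(x)
ds = s * (1.0 - s)    # d/dx sigmoid(x) = sigmoid(x) * (1 - sigmoid(x))
dt = 1.0 - t ** 2     # d/dx tanh(x)    = 1 - tanh(x)^2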
NEURAL NETWORKS: HOW TO TRAIN
▸ The main problems for neural network training:
▸ billions of model parameters
▸ a multi-objective problem
▸ the need for a high level of parallelism
▸ the need to find a wide domain where all minimised functions are close to their minima
6
Image: Wikipedia
In general, training a neural network is an error minimization problem.
NEURAL NETWORKS: BP
▸ A feedforward neural network can be trained with the backpropagation (BP) algorithm: a gradient descent algorithm in which gradients are propagated backward through the network, leading to very efficient computation of the weight changes in the higher layers.
▸ Backpropagation algorithm consists of 2 phases:
▸ Propagation. Inputs are propagated forward through the neural network to generate the output values; the output errors are then propagated backward through the network to compute the error at each layer.
▸ Weight update. Gradients are calculated and the weights are corrected proportionally to the negative gradient of the cost function (both phases are sketched below).
7
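A minimal numpy sketch of both phases for a one-hidden-layer network, assuming tanh hidden units, a linear output and a squared-error cost (all of which are illustrative choices):

import numpy as np

def bp_step(x, target, W, V, lr=0.1):
    # Propagation: forward pass through the network ...
    h = np.tanh(W @ x)                 # hidden activations
    y = V @ h                          # linear output units (regression setting)
    # ... and backward pass: error signal at each layer
    dy = y - target                    # gradient of the cost 0.5*||y - target||^2 w.r.t. y
    dh = (V.T @ dy) * (1.0 - h ** 2)   # propagate the error through V and the tanh units
    # Weight update: step against the gradient of the cost
    V -= lr * np.outer(dy, h)
    W -= lr * np.outer(dh, x)
    return W, V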
NEURAL NETWORKS: LIMITATIONS
▸ Feedforward neural network has several limitations due to
its architecture:
▸ accepts a fixed-sized vector as input (e.g. an image)
▸ produces a fixed-sized vector as output (e.g.
probabilities of different classes)
▸ performs such an input-output mapping using a fixed number of computational steps (e.g. the number of layers).
▸ These limitations make it really hard to model time series
problems when input and output are real-valued
sequences
8
RECURRENT NEURAL NETWORKS
RECURRENT NEURAL NETWORKS: INTUITION
▸ The recurrent neural network (RNN) is a neural network model proposed in the 1980s for modelling time series.
▸ The structure of the network is similar to that of a feedforward neural network, with the distinction that it allows a recurrent hidden state whose activation at each timestep depends on that of the previous timestep (a cycle).
[Diagram: an RNN with input x, hidden state h and output y, with weight matrices W (input-to-hidden), U (hidden-to-hidden) and V (hidden-to-output), unrolled over timesteps 0, 1, …, T: inputs x0, x1, …, xt, hidden states h0, h1, …, ht and outputs y0, y1, …, yt]
RECURRENT NEURAL NETWORKS: SIMPLE RNN
▸ The time recurrence is introduced by relating the hidden layer activity ht to its past hidden layer activity ht−1.
▸ This dependence is nonlinear because a logistic (sigmoid) function is used.
ht = σ(W xt + U ht−1)
yt = f(V ht)
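A minimal numpy sketch of this recurrence unrolled over an input sequence (the identity output function f is an illustrative choice):

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def rnn_forward(xs, W, U, V):
    """xs is a sequence of input vectors; returns the output at every timestep."""
    h = np.zeros(W.shape[0])              # initial hidden state h_{-1} = 0
    ys = []
    for x_t in xs:
        h = sigmoid(W @ x_t + U @ h)      # ht = sigma(W xt + U ht-1)
        ys.append(V @ h)                  # yt = f(V ht), here f = identity
    return ys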
RECURRENT NEURAL NETWORKS: BPTT
The unfolded recurrent neural network can be seen as a deep
neural network, except that the recurrent weights are tied. To
train it we can use a modification of the BP algorithm that
works on sequences in time - backpropagation through time
(BPTT).
‣ For each training epoch: start by training on shorter sequences, and then train on progressively longer sequences until the maximum sequence length is reached (1, 2, …, N−1, N).
‣ For each length of sequence k: unfold the network into a
normal feedforward network that has k hidden layers.
‣ Proceed with a standard BP algorithm.
12
RECURRENT NEURAL NETWORKS: TRUNCATED BPTT
Read more: http://www.cs.utoronto.ca/~ilya/pubs/ilya_sutskever_phd_thesis.pdf
One of the main problems of BPTT is the high cost of a single
parameter update, which makes it impossible to use a large
number of iterations.
‣ For instance, the gradient of an RNN on sequences of length 1000
costs the equivalent of a forward and a backward pass in a neural
network that has 1000 layers.
Truncated BPTT processes the sequence one timestep at a
time, and every T1 timesteps it runs BPTT for T2 timesteps, so
a parameter update can be cheap if T2 is small.
Truncated backpropagation is arguably the most practical
method for training RNNs
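As a rough illustration of this schedule (not the exact procedure from the thesis), here is a numpy sketch for a simple tanh RNN with a squared-error loss evaluated at every k1-th step; k1, k2, the loss and the learning rate are all illustrative assumptions:

import numpy as np

def truncated_bptt(X, Y, W, U, V, k1=4, k2=8, lr=0.01):
    """One pass over a sequence: every k1 steps, backpropagate through at most k2 steps."""
    h = np.zeros(W.shape[0])
    hs, xs = [h], []                              # cache states and inputs for the backward pass
    for t in range(X.shape[0]):
        h = np.tanh(W @ X[t] + U @ h)
        hs.append(h)
        xs.append(X[t])
        if (t + 1) % k1 == 0:                     # time for a (cheap) parameter update
            dy = V @ h - Y[t]                     # squared-error gradient at the current step
            dV = np.outer(dy, h)
            dW, dU = np.zeros_like(W), np.zeros_like(U)
            dh = V.T @ dy
            for s in range(t, max(t - k2, -1), -1):   # walk back at most k2 timesteps
                dz = dh * (1.0 - hs[s + 1] ** 2)      # backprop through tanh
                dW += np.outer(dz, xs[s])
                dU += np.outer(dz, hs[s])
                dh = U.T @ dz
            W -= lr * dW
            U -= lr * dU
            V -= lr * dV
    return W, U, V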
RECURRENT NEURAL NETWORKS: PROBLEMS
While in principle the recurrent network is a simple and powerful
model, in practice, it is unfortunately hard to train properly.
▸ PROBLEM: In the gradient back-propagation phase, the gradient signal is multiplied a large number of times by the weight matrix associated with the recurrent connection.
▸ If |λ1| < 1 (the recurrent weights are small) => Vanishing gradient problem
▸ If |λ1| > 1 (the recurrent weights are big) => Exploding gradient problem
▸ SOLUTION:
▸ Exploding gradient problem => clip the norm of the exploding gradients when it becomes too large (see the sketch below)
▸ Vanishing gradient problem => relax the nonlinear dependency to a linear dependency => LSTM, GRU, etc.
Read more: Razvan Pascanu et al. http://arxiv.org/pdf/1211.5063.pdf
14
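A minimal sketch of the clipping remedy, assuming the gradients are available as a list of numpy arrays (the threshold value is illustrative):

import numpy as np

def clip_by_norm(grads, threshold=5.0):
    """Rescale a list of gradient arrays if their global norm exceeds the threshold."""
    norm = np.sqrt(sum(np.sum(g ** 2) for g in grads))
    if norm > threshold:
        grads = [g * (threshold / norm) for g in grads]
    return grads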
LONG SHORT-TERM MEMORY
LONG SHORT-TERM MEMORY (LSTM)
▸ Proposed by Hochreiter & Schmidhuber (1997) and since then
has been modified by many researchers.
▸ The LSTM architecture consists of a set of recurrently
connected subnets, known as memory blocks.
▸ Each memory block consists of:
▸ memory cell - stores the state
▸ input gate - controls what to learn
▸ forget gate - controls what to forget
▸ output gate - controls how much of the memory content to expose
▸ Unlike the traditional recurrent unit which overwrites its
content each timestep, the LSTM unit is able to decide
whether to keep the existing memory via the introduced gates
16
LSTM
[Diagram: the network unrolled over timesteps 0, 1, …, t, with each hidden unit replaced by an LSTM memory block]
▸ Model parameters:
▸ xt is the input at time t
▸ Weight matrices: Wi, Wf, Wc, Wo, Ui, Uf, Uc, Uo, Vo
▸ Bias vectors: bi, bf, bc, bo
17
The basic unit in the hidden layer of an LSTM is the memory
block that replaces the hidden units in a “traditional” RNN
LSTM MEMORY BLOCK
[Diagram: an LSTM memory block that takes xt, the previous hidden state ht−1 and the previous cell state Ct−1, and produces yt, ht and Ct]
▸ The memory block is a subnet that allows an LSTM unit to adaptively forget, memorise and expose its memory content.
it = σ(Wi xt + Ui ht−1 + bi)
C̅t = tanh(Wc xt + Uc ht−1 + bc)
ft = σ(Wf xt + Uf ht−1 + bf)
Ct = ft ⋅ Ct−1 + it ⋅ C̅t
ot = σ(Wo xt + Uo ht−1 + Vo Ct)
ht = ot ⋅ tanh(Ct)
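Taken together, one timestep of the memory block is a direct transcription of these formulas; below is a minimal numpy sketch (the dictionary-of-parameters layout is an assumption for readability, and the output-gate bias bo from the parameter list is included even though the slide's formula for ot omits it):

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x_t, h_prev, C_prev, p):
    """p is a dict holding the weight matrices W*, U*, Vo and bias vectors b*."""
    i_t = sigmoid(p["Wi"] @ x_t + p["Ui"] @ h_prev + p["bi"])          # input gate
    C_bar = np.tanh(p["Wc"] @ x_t + p["Uc"] @ h_prev + p["bc"])        # candidate memory content
    f_t = sigmoid(p["Wf"] @ x_t + p["Uf"] @ h_prev + p["bf"])          # forget gate
    C_t = f_t * C_prev + i_t * C_bar                                   # new memory cell state
    o_t = sigmoid(p["Wo"] @ x_t + p["Uo"] @ h_prev
                  + p["Vo"] @ C_t + p["bo"])                           # output gate (bo assumed)
    h_t = o_t * np.tanh(C_t)                                           # new hidden state
    return h_t, C_t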
LSTM MEMORY BLOCK: INPUT GATE
it = σ(Wi xt + Ui ht−1 + bi)
▸ The input gate it controls the degree to which the new
memory content is added to the memory cell
19
LSTM MEMORY BLOCK: CANDIDATES
C̅t = tanh(Wc xt + Uc ht−1 + bc)
▸ The values C̅t are candidates for the state of the memory cell (they may later be filtered by the input gate's decision)
20
LSTM MEMORY BLOCK: FORGET GATE
ft = σ(Wf xt + Uf ht−1 + bf)
▸ If the detected feature seems important, the forget gate ft lets the memory cell carry information about it across many timesteps; otherwise it can reset the memory content.
21
LSTM MEMORY BLOCK: FORGET GATE
▸ Sometimes it’s good to forget.
If you’re analyzing a text corpus and come to the end of a
document you may have no reason to believe that the next
document has any relationship to it whatsoever, and
therefore the memory cell should be reset before the
network gets the first element of the next document.
▸ In many cases, by reset we mean not only immediately setting it to 0, but also gradual resets corresponding to slowly fading cell states
22
LSTM MEMORY BLOCK: MEMORY CELL
Ct = ft ⋅ Ct−1 + it ⋅ C̅t
▸ The new state of the memory cell Ct is calculated by partially forgetting the existing memory content Ct−1 and adding the new memory content C̅t
23
LSTM MEMORY BLOCK: OUTPUT GATE
▸ The output gate ot controls the amount of the memory
content to yield to the next hidden state
ot = σ(Wo xt + Uo ht−1 + Vo Ct)
24
LSTM MEMORY BLOCK: HIDDEN STATE
ht = ot ⋅ tanh(Ct)
25
LSTM MEMORY BLOCK: ALL TOGETHER
[Diagram: the complete memory block, combining the input gate it, forget gate ft, output gate ot, candidate values C̅t, memory cell state Ct and hidden state ht]
26
yt = f(Vy ht)
GATED RECURRENT UNIT
GATED RECURRENT UNIT (GRU)
▸ Proposed by Cho et al. [2014].
▸ It is similar to LSTM in using gating functions, but differs
from LSTM in that it doesn’t have a memory cell.
▸ Each GRU consists of:
▸ update gate
▸ reset gate
28
▸ Model parameters:
▸ xt is the input at time t
▸ Weight matrices: Wz, Wr, WH, Uz, Ur, UH
GRU
[Diagram: a GRU that takes xt and the previous hidden state ht−1 and produces ht and yt]
ht = (1 − zt) ⋅ ht−1 + zt ⋅ Ht
zt = σ(Wz xt + Uz ht−1)
Ht = tanh(WH xt + UH (rt ⋅ ht−1))
rt = σ(Wr xt + Ur ht−1)
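As with the LSTM block, one GRU timestep is a direct transcription of these four formulas; a minimal numpy sketch (the parameter-dictionary layout is an assumption for readability):

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_step(x_t, h_prev, p):
    """p is a dict holding the weight matrices Wz, Wr, WH, Uz, Ur, UH."""
    z_t = sigmoid(p["Wz"] @ x_t + p["Uz"] @ h_prev)            # update gate
    r_t = sigmoid(p["Wr"] @ x_t + p["Ur"] @ h_prev)            # reset gate
    H_t = np.tanh(p["WH"] @ x_t + p["UH"] @ (r_t * h_prev))    # candidate activation
    h_t = (1.0 - z_t) * h_prev + z_t * H_t                     # new hidden state
    return h_t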
GRU: UPDATE GATE
zt = σ(Wz xt + Uz ht−1)
30
▸ The update gate zt decides how much the unit updates its activation (content)
GRU: RESET GATE
▸ When rt is close to 0 (gate off), the unit acts as if it were reading the first symbol of the input sequence, allowing it to forget previously computed states
rt = σ(Wr xt + Ur ht−1)
31
GRU: CANDIDATE ACTIVATION
▸ The candidate activation Ht is computed similarly to the activation of a traditional recurrent unit, but with the previous state ht−1 modulated by the reset gate rt.
Ht = tanh(WH xt + UH (rt ⋅ ht−1))
32
GRU: HIDDEN STATE
▸ The activation at time t is a linear interpolation between the previous activation ht−1 and the candidate activation Ht
ht = (1 − zt) ⋅ ht−1 + zt ⋅ Ht
33
GRU: ALL TOGETHER
[Diagram: the complete GRU, combining the update gate zt, reset gate rt, candidate activation Ht and hidden state ht]
READ MORE
▸ Supervised Sequence Labelling with Recurrent Neural Networks: http://www.cs.toronto.edu/~graves/preprint.pdf
▸ On the difficulty of training Recurrent Neural Networks: http://arxiv.org/pdf/1211.5063.pdf
▸ Empirical Evaluation of Gated Recurrent Neural Networks on Sequence Modeling: http://arxiv.org/pdf/1412.3555v1.pdf
▸ Learning Phrase Representations using RNN Encoder–Decoder for Statistical Machine Translation: http://arxiv.org/pdf/1406.1078v3.pdf
▸ Understanding LSTM Networks: http://colah.github.io/posts/2015-08-Understanding-LSTMs/
▸ General Sequence Learning using Recurrent Neural Networks: https://www.youtube.com/watch?v=VINCQghQRuM
35
END OF PART 1
▸ @gakhov
▸ linkedin.com/in/gakhov
▸ www.datacrucis.com
36
THANK YOU
