Recurrent Neural Networks
Viacheslav Khomenko, Ph.D.
Contents
• Recap: feed-forward artificial neural network
• Temporal dependencies
• Recurrent neural network architectures
• RNN training
• New RNN architectures
• Practical considerations
• Neural models for locomotion
• Application of RNNs
RECAP: FEED-FORWARD
ARTIFICIAL NEURAL
NETWORK
Feed-forward network
W. McCulloch and W. Pitts, 1940s: abstract mathematical model of a brain cell
F. Rosenblatt, 1958: perceptron for classification
P. Werbos, 1975: multi-layer artificial neural network
[Figure: feed-forward network for Iris flower classification; input features (petals, sepal, yellow patch, veins) feed the input layer, then hidden layer(s), then an output layer whose decisions are Iris / ¬Iris]
Feed-forward network
Decisions are based on current inputs:
• No memory about the past
• No future scope
[Figure: simplified representation; input x → hidden layer(s) with activation A → decision output ŷ]
Vector of input features: x
Vector of predicted values: ŷ
Neural activation:
A : some activation function (tanh, etc.)
w, b : network parameters
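To make the recap concrete, here is a minimal NumPy sketch of a single feed-forward pass of the form y = A(w·x + b) implied by the notation above; the layer sizes, the tanh choice and the random weights are illustrative assumptions, not values from the slides.

```python
import numpy as np

def feed_forward(x, W_h, b_h, W_o, b_o, A=np.tanh):
    """One input -> hidden -> output pass; A is the activation function."""
    h = A(W_h @ x + b_h)          # hidden layer activation
    y = A(W_o @ h + b_o)          # output layer activation (decision scores)
    return y

# Illustrative sizes: 4 input features (petals, sepal, yellow patch, veins),
# 3 hidden units, 2 outputs (Iris / not-Iris).
rng = np.random.default_rng(0)
x = rng.normal(size=4)
W_h, b_h = rng.normal(size=(3, 4)), np.zeros(3)
W_o, b_o = rng.normal(size=(2, 3)), np.zeros(2)
print(feed_forward(x, W_h, b_h, W_o, b_o))
```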
TEMPORAL
DEPENDENCIES
Temporal dependencies
Analyzing temporal dependencies
Frame 0: Stem seen, Petals hidden → P(Iris) = 0.10, P(¬Iris) = 0.90
Frame 1: Stem seen, Petals hidden → P(Iris) = 0.11, P(¬Iris) = 0.89
Frame 2: Stem seen, Petals partial → P(Iris) = 0.20, P(¬Iris) = 0.80
Frame 3: Stem partial, Petals partial → P(Iris) = 0.45, P(¬Iris) = 0.55
Frame 4: Stem hidden, Petals seen → P(Iris) = 0.90, P(¬Iris) = 0.10
Decision on the sequence of observations → improved decisions
Reber Grammar
A synthetic problem that cannot be solved without memory.
For each state, learn to predict the next possible edges.
Transitions have equal probabilities:
P(1→2) = P(1→3) = 0.5
[Figure: Reber Grammar graph with states (nodes) and transitions (edges)]
One-hot encoding of a sample word (B P T T T T T V P X T T T T V V E):

Word | Step | Current node (Begin 1 2 3 4 5 6) | Possible paths (1 2 3 4 5 6 End)
B    | 0    | 1 0 0 0 0 0 0                    | 1 0 0 0 0 0 0
P    | 1    | 0 1 0 0 0 0 0                    | 0 1 1 0 0 0 0
T    | 2    | 0 0 0 1 0 0 0                    | 0 0 1 0 1 0 0
T    | 3    | 0 0 0 1 0 0 0                    | 0 0 1 0 1 0 0
T    | 4    | 0 0 0 1 0 0 0                    | 0 0 1 0 1 0 0
T    | 5    | 0 0 0 1 0 0 0                    | 0 0 1 0 1 0 0
T    | 6    | 0 0 0 1 0 0 0                    | 0 0 1 0 1 0 0
V    | 7    | 0 0 0 1 0 0 0                    | 0 0 1 0 1 0 0
P    | 8    | 0 0 0 0 0 1 0                    | 0 0 0 1 0 1 0
X    | 9    | 0 0 0 0 1 0 0                    | 0 0 1 0 0 1 0
T    | 10   | 0 0 0 1 0 0 0                    | 0 0 1 0 1 0 0
T    | 11   | 0 0 0 1 0 0 0                    | 0 0 1 0 1 0 0
T    | 12   | 0 0 0 1 0 0 0                    | 0 0 1 0 1 0 0
T    | 13   | 0 0 0 1 0 0 0                    | 0 0 1 0 1 0 0
V    | 14   | 0 0 0 1 0 0 0                    | 0 0 1 0 1 0 0
V    | 15   | 0 0 0 0 0 1 0                    | 0 0 0 1 0 1 0
E    | 16   | 0 0 0 0 0 0 1                    | 0 0 0 0 0 0 1
For example, at time t = 2 the current-node vector is the network input x and the possible-paths vector is the target output y.
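A minimal sketch of how one table row could be turned into a training pair: the input is the one-hot current-node vector and the target is the multi-hot vector of possible next nodes. The small transition map below is a hypothetical placeholder for illustration; the actual graph is the one drawn on the slide.

```python
import numpy as np

# Hypothetical transition map: node -> set of possible next nodes.
# Placeholder only; the real Reber graph is defined by the slide's figure.
NODES = ["Begin", "1", "2", "3", "4", "5", "6", "End"]
TRANSITIONS = {"Begin": {"1"}, "1": {"2", "3"}}   # remaining nodes omitted

def encode_state(current_node):
    """Return (one-hot input over nodes, multi-hot target of possible next nodes)."""
    x = np.zeros(len(NODES))
    x[NODES.index(current_node)] = 1.0
    y = np.zeros(len(NODES))
    for nxt in TRANSITIONS.get(current_node, ()):
        y[NODES.index(nxt)] = 1.0
    return x, y

x, y = encode_state("1")
print(x)   # current node "1"
print(y)   # possible next nodes "2" and "3", as in the table row for step 1
```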
Memory is important → Reasoning relies on
experience
Time-delay neural network
• An FFNN with delayed inputs
• No internal state
Pro: captures dependencies between features at different timestamps.
Cons:
• Limited input history (< 10 timestamps)
• Delay values must be set explicitly
• Not general; cannot solve complex tasks (such as the Reber Grammar)
[Figure: time-delay neural network; delayed inputs x(t), x(t−1), x(t−2), x(t−3) feed a hidden layer that produces y(t)]
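A minimal sketch of the time-delay idea, assuming a window of four delayed inputs x(t) … x(t−3) that are concatenated and fed to an ordinary feed-forward layer; sizes and weights are illustrative.

```python
import numpy as np

def tdnn_step(x_history, W, b, A=np.tanh):
    """x_history: list [x(t), x(t-1), x(t-2), x(t-3)] of equal-length vectors."""
    x_window = np.concatenate(x_history)   # stack delayed inputs into one vector
    return A(W @ x_window + b)             # plain feed-forward layer on the window

rng = np.random.default_rng(0)
dim, delays, hidden = 3, 4, 5
xs = [rng.normal(size=dim) for _ in range(delays)]       # x(t) ... x(t-3)
W, b = rng.normal(size=(hidden, dim * delays)), np.zeros(hidden)
print(tdnn_step(xs, W, b))
```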
RECURRENT NEURAL
NETWORK
ARCHITECTURES
Introducing recurrence
Naïve attempt: simple recurrence, feeding the output back to the input through a 1-step delay.
But… it does not work, because it is not stable: there is no control over the feedback loop, and the obtained output ŷ diverges from the expected one.
[Figure: input x(t) → hidden layer → output y(t), with the past output state fed back to the input through a 1-step delay; plots of expected vs. obtained ŷ]
Jordan recurrent network (M.I. Jordan, 1986)
Output-to-hidden connections: the past output state is stored in a context layer through a 1-step delay and fed back into the hidden layer.
[Figure: input layer → hidden layer → output layer y(t), with a context layer holding the delayed output]
Pro: Fast to train, because it can be parallelized in time.
Cons:
• The output transforms the hidden state → nonlinear effects, information is distorted
• The output dimension may be too small → information in the hidden states is truncated
• Limited short-term memory
Elman recurrent network (J.L. Elman, 1990)
Often referenced as the basic RNN structure and called the “vanilla” RNN.
Hidden-to-hidden connections make the system Turing-complete.
[Figure: input layer → hidden layer → output layer y(t), with a context layer holding the delayed hidden state (1-step delay)]
• Must see the complete sequence to be trained
• Cannot be parallelized across timestamps
• Has some important training difficulties…
W_ih : weight matrix from input to hidden
W_o : weight matrix from hidden to output
x_t : input (feature) vector at time t
y_t : network output vector at time t
h_t : network internal (hidden) state vector at time t
U : weight matrix from hidden to hidden
b : bias parameter vector
h_t = σ(W_ih · x_t + U · h_{t-1} + b)
y_t = σ(W_o · h_t)
Vanilla RNN
Unfolding the network in time
Vanilla RNN
h_t = σ(W_ih · x_t + U · h_{t-1} + b)
y_t = σ(W_o · h_t)
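The two equations translate directly into a short forward pass. A minimal NumPy sketch, assuming a logistic sigmoid for σ and small random weights; shapes are illustrative.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def rnn_forward(xs, W_ih, U, b, W_o):
    """Unfold the vanilla RNN over the sequence xs; return hidden states and outputs."""
    h = np.zeros(U.shape[0])
    hs, ys = [], []
    for x_t in xs:                                   # one step per timestamp
        h = sigmoid(W_ih @ x_t + U @ h + b)          # h_t = sigma(W_ih x_t + U h_{t-1} + b)
        y = sigmoid(W_o @ h)                         # y_t = sigma(W_o h_t)
        hs.append(h)
        ys.append(y)
    return hs, ys

rng = np.random.default_rng(0)
n_in, n_hid, n_out, T = 4, 8, 2, 5
xs = [rng.normal(size=n_in) for _ in range(T)]
W_ih = rng.normal(scale=0.1, size=(n_hid, n_in))
U    = rng.normal(scale=0.1, size=(n_hid, n_hid))
b    = np.zeros(n_hid)
W_o  = rng.normal(scale=0.1, size=(n_out, n_hid))
hs, ys = rnn_forward(xs, W_ih, U, b, W_o)
print(ys[-1])
```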
RNN TRAINING
Target: obtain the network parameters that optimize the cost function.
Cost functions: log loss, root mean squared error, etc.
Tasks:
• For each timestamp of the input sequence x, predict an output y (synchronously)
• For the input sequence x, predict a scalar value y (e.g., at the end of the sequence)
• For an input sequence x of length Lx, generate an output sequence y of a different length Ly
Methods:
• Back-propagation: reliable and controlled convergence; supported by most ML frameworks
• Research alternatives: evolutionary methods, expectation maximization, non-parametric methods, particle swarm optimization
RNN training
1. Unfold the network.
2. Repeat over the training data:
   1. Take an input sequence x.
   2. For t in 0 … N−1:
      1. Initialize the hidden state to its past value h_{t-1}.
      2. Forward-propagate and compute the next hidden state h_t.
   3. Obtain the output sequence ŷ.
   4. Calculate the error E(y, ŷ) between the targets y and the outputs ŷ.
   5. Back-propagate the error across the unfolded network.
   6. Average the weight updates.
h_t = σ(W_ih · x_t + U · h_{t-1} + b)
y_t = σ(W_o · h_t)
E.g., cross-entropy loss: E(y, ŷ) = − Σ_t y_t · log(ŷ_t)
Back-propagation through time
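A compact sketch of back-propagation through time for the vanilla RNN above, assuming a tanh hidden activation, a softmax output and the cross-entropy loss (a common choice, slightly different from the σ output written on the slide); gradients for W_ih, U, b and W_o are accumulated while walking the unfolded network backwards.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def bptt(xs, targets, W_ih, U, b, W_o):
    """Forward unroll, then back-propagate through time. targets: class index per step."""
    T, n_hid = len(xs), U.shape[0]
    hs, ys = [np.zeros(n_hid)], []
    for x_t in xs:                                   # forward pass over the sequence
        h = np.tanh(W_ih @ x_t + U @ hs[-1] + b)
        hs.append(h)
        ys.append(softmax(W_o @ h))
    loss = -sum(np.log(ys[t][targets[t]]) for t in range(T))

    dW_ih, dU, db, dW_o = [np.zeros_like(m) for m in (W_ih, U, b, W_o)]
    dh_next = np.zeros(n_hid)
    for t in reversed(range(T)):                     # backward pass through time
        dy = ys[t].copy(); dy[targets[t]] -= 1.0     # dE/d(logits_t) for cross-entropy
        dW_o += np.outer(dy, hs[t + 1])
        dh = W_o.T @ dy + dh_next                    # gradient flowing into h_t
        dz = (1.0 - hs[t + 1] ** 2) * dh             # through the tanh nonlinearity
        dW_ih += np.outer(dz, xs[t])
        dU += np.outer(dz, hs[t])
        db += dz
        dh_next = U.T @ dz                           # pass gradient on to h_{t-1}
    return loss, (dW_ih, dU, db, dW_o)
```

In practice these gradients would be averaged over a mini-batch and applied with a gradient-descent update; gradient clipping (discussed below) guards against exploding gradients.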
Apply chain rule:
Back-propagation through time
For time t = 2, with θ denoting the network parameters:
∂E_2/∂θ = Σ_{k=0..2} (∂E_2/∂ŷ_2) · (∂ŷ_2/∂h_2) · (∂h_2/∂h_k) · (∂h_k/∂θ)
where, for example,
∂h_2/∂h_0 = (∂h_2/∂h_1) · (∂h_1/∂h_0)
[Figure: saturating activation function; in the saturated regions the gradient is close to 0]
Saturated neurons have gradients → 0, which drives the gradients of previous layers toward 0 (especially for far time-stamps).
• Smaller weight parameters lead to faster gradient vanishing.
• Very big initial parameters make gradient descent diverge fast (explode).
This is a known problem for deep feed-forward networks. For recurrent networks (even shallow ones), it makes learning long-term dependencies impossible!
∂h_t/∂h_0 = (∂h_t/∂h_{t-1}) · … · (∂h_3/∂h_2) · (∂h_2/∂h_1) · (∂h_1/∂h_0)
• The product decays exponentially
• The network stops learning and cannot update
• It becomes impossible to learn correlations between temporally distant events
Problem: vanishing gradients
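A small numerical illustration of the product above: for a tanh RNN, ∂h_t/∂h_{t−1} = diag(1 − h_t²)·U, and chaining these Jacobians over many steps shrinks their norm roughly exponentially when the recurrent weights are small. The sizes and weight scales below are arbitrary assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
n_hid, T = 16, 50
U = rng.normal(scale=0.1, size=(n_hid, n_hid))       # small recurrent weights
W_ih = rng.normal(scale=0.1, size=(n_hid, n_hid))

h = np.zeros(n_hid)
J = np.eye(n_hid)                                    # accumulated d h_t / d h_0
for t in range(T):
    x_t = rng.normal(size=n_hid)
    h = np.tanh(W_ih @ x_t + U @ h)
    J = (np.diag(1.0 - h ** 2) @ U) @ J              # chain one more Jacobian factor
    if t % 10 == 0:
        print(f"t={t:2d}  ||d h_t / d h_0|| = {np.linalg.norm(J):.2e}")
```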
The network cannot converge and the weight parameters do not stabilize.
Diagnostics: NaNs; large fluctuations of the cost function; a large increase in the norm of the gradient during training.
Pascanu R. et al., On the difficulty of training recurrent neural networks. arXiv (2012)
Problem: exploding gradients
Solutions:
• Use gradient clipping
• Try reducing the learning rate
• Change the loss function by setting constraints on the weights (L1/L2 norms)
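A minimal sketch of gradient clipping by global norm, one of the remedies listed above; the threshold is an arbitrary assumption.

```python
import numpy as np

def clip_by_global_norm(grads, max_norm=5.0):
    """Rescale a list of gradient arrays so their joint L2 norm is at most max_norm."""
    total_norm = np.sqrt(sum(np.sum(g ** 2) for g in grads))
    if total_norm > max_norm:
        scale = max_norm / (total_norm + 1e-12)
        grads = [g * scale for g in grads]
    return grads

# Example: clip the gradients returned by a BPTT step before the weight update.
grads = [np.random.default_rng(0).normal(scale=10.0, size=(4, 4)) for _ in range(3)]
clipped = clip_by_global_norm(grads, max_norm=5.0)
print(np.sqrt(sum(np.sum(g ** 2) for g in clipped)))   # <= 5.0
```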
Deep network training difficulties:
• Vanishing gradients
• Exploding gradients
Possible solutions:
• One of the previously proposed remedies, or
• Unsupervised pre-training → difficult to implement, and the unsupervised solution sometimes differs greatly from the supervised one, or
• Improve the network architecture!
Fundamental deep learning problem
NEW RNN ARCHITECTURES
Echo State Network (Herbert Jaeger, 2001)
Only the readout neurons are trained!
In practice:
• Easy to over-fit (the model learns by heart) and then gives good results on the training data only
• Optimization of the reservoir hyper-parameters is not straightforward
Reservoir computing
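A minimal echo-state-network sketch: a fixed random reservoir is driven by the input and only a linear readout is fitted, here with ridge regression on a toy next-value prediction task. The reservoir size, spectral-radius scaling and ridge penalty are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
n_in, n_res, T = 1, 100, 500

# Fixed random input and reservoir weights (never trained).
W_in = rng.uniform(-0.5, 0.5, size=(n_res, n_in))
W_res = rng.normal(size=(n_res, n_res))
W_res *= 0.9 / max(abs(np.linalg.eigvals(W_res)))    # scale spectral radius below 1

# Toy task: predict the next value of a sine wave.
u = np.sin(np.linspace(0, 20 * np.pi, T + 1))
states = np.zeros((T, n_res))
x = np.zeros(n_res)
for t in range(T):
    x = np.tanh(W_in @ u[t:t + 1] + W_res @ x)       # drive the reservoir
    states[t] = x

# Fit only the readout with ridge regression: w_out = (S^T S + lam*I)^-1 S^T y
y_target = u[1:T + 1]
lam = 1e-6
W_out = np.linalg.solve(states.T @ states + lam * np.eye(n_res), states.T @ y_target)
pred = states @ W_out
print("train MSE:", np.mean((pred - y_target) ** 2))
```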
Liquid state machine
Similar to the ESN, but uses more biologically plausible neuron models → spiking (dynamic) neurons.
In practice:
• Still mostly a research area
• Requires special hardware to be computationally efficient
[Image credits: Daniel Brunner; Tal Dahan and Astar Sade]
Reservoir computing
Long short-term memory (S. Hochreiter & J. Schmidhuber, 1997)
Due to its gating (routing) mechanism, it can be efficiently trained to learn LONG-TERM dependencies.
Variants:
• No Input Gate
• No Forget Gate
• No Output Gate
• No Input Activation Function
• No Output Activation Function
• No Peepholes
• Coupled Input and Forget Gate
• Full Gate Recurrence
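For reference, a minimal single-step sketch of the standard LSTM cell (the full variant with input, forget and output gates, no peepholes); packing the gate weights into one matrix is an implementation choice, not something prescribed by the slides.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, b):
    """One LSTM step. W has shape (4*n_hid, n_in + n_hid): rows for i, f, o, g."""
    n_hid = h_prev.shape[0]
    z = W @ np.concatenate([x_t, h_prev]) + b
    i = sigmoid(z[0 * n_hid:1 * n_hid])        # input gate
    f = sigmoid(z[1 * n_hid:2 * n_hid])        # forget gate
    o = sigmoid(z[2 * n_hid:3 * n_hid])        # output gate
    g = np.tanh(z[3 * n_hid:4 * n_hid])        # input activation (cell candidate)
    c = f * c_prev + i * g                     # new cell state
    h = o * np.tanh(c)                         # new hidden state
    return h, c

rng = np.random.default_rng(0)
n_in, n_hid = 3, 5
W = rng.normal(scale=0.1, size=(4 * n_hid, n_in + n_hid))
b = np.zeros(4 * n_hid)
h, c = lstm_step(rng.normal(size=n_in), np.zeros(n_hid), np.zeros(n_hid), W, b)
print(h)
```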
Has context in both directions, at any timestamp
Bidirectional RNN
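A minimal sketch of the bidirectional idea: one RNN runs forward in time, a second one runs over the reversed sequence, and their hidden states are concatenated at every timestamp.

```python
import numpy as np

def run_rnn(xs, W, U, b):
    """Simple tanh RNN over a sequence; returns the list of hidden states."""
    h, hs = np.zeros(U.shape[0]), []
    for x_t in xs:
        h = np.tanh(W @ x_t + U @ h + b)
        hs.append(h)
    return hs

def bidirectional(xs, fwd_params, bwd_params):
    """Concatenate forward-in-time and backward-in-time hidden states per timestamp."""
    hs_f = run_rnn(xs, *fwd_params)
    hs_b = run_rnn(xs[::-1], *bwd_params)[::-1]      # reverse back to original order
    return [np.concatenate([hf, hb]) for hf, hb in zip(hs_f, hs_b)]

rng = np.random.default_rng(0)
n_in, n_hid, T = 3, 4, 6
make = lambda: (rng.normal(scale=0.1, size=(n_hid, n_in)),
                rng.normal(scale=0.1, size=(n_hid, n_hid)),
                np.zeros(n_hid))
states = bidirectional([rng.normal(size=n_in) for _ in range(T)], make(), make())
print(states[0].shape)   # each timestamp sees context from both directions
```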
Testing the capacity to maintain long-term dependencies: the (last−1) symbol must equal the (first+1) symbol, e.g.
BPXXXXXPE
BTXXXXXXXXTE
Correct cases: BT ….. TE, BP ….. PE
Incorrect cases: BT ….. PE, BP ….. TE
The system must be able to learn to compare the (first+1) symbol with the (last−1) symbol.
Embedded Reber Grammar
PRACTICAL
CONSIDERATIONS
Masking the input (output)
Inputs (outputs) within a data batch have variable lengths.
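A minimal sketch of how variable-length sequences might be padded into a batch together with a boolean mask, so that padded positions can later be excluded from the loss; shapes and the padding value are assumptions.

```python
import numpy as np

def pad_and_mask(sequences, pad_value=0.0):
    """Pad a list of (length_i, dim) arrays to a (batch, max_len, dim) tensor plus mask."""
    max_len = max(len(s) for s in sequences)
    dim = sequences[0].shape[1]
    batch = np.full((len(sequences), max_len, dim), pad_value)
    mask = np.zeros((len(sequences), max_len), dtype=bool)
    for i, s in enumerate(sequences):
        batch[i, :len(s)] = s
        mask[i, :len(s)] = True          # True marks real (non-padded) timestamps
    return batch, mask

rng = np.random.default_rng(0)
seqs = [rng.normal(size=(L, 3)) for L in (5, 2, 4)]
batch, mask = pad_and_mask(seqs)
print(batch.shape, mask.sum(axis=1))     # (3, 5, 3), real lengths [5 2 4]
```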
Length of input ≠ length of output
• CTC loss function
• Encoder-decoder architecture
CTC transforms the network outputs into a conditional probability distribution over label sequences.
Example labelling with blanks: - C - A - T -   (where '-' is the BLANK symbol)
Result decoding
Raw output: -----CCCC---AA-TTTT---
1) Remove repeating symbols: -C-A-T-
2) Remove blanks: CAT
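The two decoding steps above correspond to greedy (best-path) CTC decoding; a minimal sketch, assuming '-' stands for the BLANK symbol:

```python
from itertools import groupby

def ctc_greedy_decode(raw, blank="-"):
    """Collapse repeated symbols, then drop blanks (best-path CTC decoding)."""
    collapsed = [symbol for symbol, _ in groupby(raw)]   # 1) remove repeating symbols
    return "".join(s for s in collapsed if s != blank)   # 2) remove blanks

print(ctc_greedy_decode("-----CCCC---AA-TTTT---"))       # -> "CAT"
```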
NEURAL MODELS FOR
LOCOMOTION
Locomotion principles in nature
[S.Roland et al., 2004]
Locomotion: movement or
the ability to move from
one place to another
Manipulation ≠ Locomotion
[Figure: aperiodic series of motions (stable) vs. periodic motion gaits (quasi-stable); A. Ijspeert et al., 2007]
[Figure: wheeled locomotion on soft ground; S. Roland et al., 2004]
Locomotion efficiency
Nature: no “pure” wheeled locomotion
Reason: variety of surfaces, rough terrain, adaptation is necessary
Biological locomotion exploits patterns
The number of legs influences
• Mechanical complexity
• Control complexity
• Generated patterns (for k = 6 legs: N = (2k−1)! = 11! = 39 916 800)
[S.Roland 2004]
Locomotion efficiency
• Gait control is on “automatic pilot”
• Automatic gait is energy efficient
• Perturbation introduces modification
Not fully nature's way (weak adaptation, no decisions)
How does nature deal with locomotion?
- Initiate motion by injecting energy
- Passive stage
- Generate
- Control for stability
- Repeat
- Brain?
- Nervous system?
- Spinal cord?
Inconceivable automation
Complexity of the phenomena involved in motor control
Central Nervous System, Motor Nervous System, Neuromuscular Junction
• Models of the musculoskeletal system …
• Models of the Motor Nervous System
Excerpts: Univ. du Québec (ÉTS Montréal) course; Collège de France (L. Damn)
Excerpt: Univ. Paris 8, Licence course L.612
Spinal cord
[P. Hénaff 2013]
Biological motor control
Motor unit
A motor unit (MU) aggregates the muscular fibers innervated by a common motor neuron; contraction of these fibers is thus simultaneous.
[Figure: spinal reflex pathways: sensory nerve, motor nerve, dorsal root, posterior horn, anterior horn, ventral root, neuromuscular fiber]
Reflexes: pathways
Muscle contraction as a response to its own elongation
Muscle contraction as a response to external stimuli
[P. Hénaff 2013]
Central Pattern Generator
• Automatic activity is controlled by spinal centers
• The CPG (Central Pattern Generator) is a network of synaptic connections that generates rhythmic motions
• The spinal pattern-generating networks do not require sensory input, but are nevertheless strongly regulated by input from limb proprioceptors
Sensory-motor architecture for locomotion
[McCrea 2006]
Biological sensory-motor architecture
models
Muscular contraction patterns are put in place during embryonic life or after birth:
• Insects can walk immediately upon birth
• Most mammals require several
minutes to stand
• Humans require more than a
year to walk on two legs
How learning occurs
[ejjack2]
Mathematical modeling of CPG
[J. Nassour et al., 2010]  [P.F. Rowat, A.I. Selverston, 1997]
Mathematical modeling of CPG
[Figure panels: CPG approximation; limit-cycle behavior; gait matrix; coupling of different CPGs; sensory feedback; Hopf oscillator]
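As an illustration of limit-cycle behavior, a minimal Euler integration of a standard Hopf oscillator; the exact CPG formulations and parameters used in the cited works may differ, so this is only a generic sketch.

```python
import numpy as np

def hopf_step(x, y, mu=1.0, omega=2.0 * np.pi, dt=1e-3):
    """One Euler step of the standard Hopf oscillator (limit cycle of radius sqrt(mu))."""
    r2 = x * x + y * y
    dx = (mu - r2) * x - omega * y
    dy = (mu - r2) * y + omega * x
    return x + dt * dx, y + dt * dy

x, y = 0.1, 0.0                                  # start near the unstable fixed point
for step in range(1, 5001):
    x, y = hopf_step(x, y)
    if step % 1000 == 0:
        print(f"t = {step * 1e-3:.1f} s, radius = {np.hypot(x, y):.3f}")  # -> ~1.0
```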
Neural controllers
Neural-network-based CPG controller for biped locomotion [Taga 1995]:
• CPG of the trunk, with ipsilateral and contralateral connections (Matsuoka model)
• 1 CPG per joint
• 2 coupled neurons per CPG
• Contralateral and ipsilateral inhibitions
• Sensory-motor integration
Excerpt from Taga 1995 (Biol. Cyb.)
[Figure: internal coupling of the network; articular sensory inputs (speeds, forces, ground contact); model of neuron i (Matsuoka 1985)]
[P. Hénaff 2013]
Compensation of articulation defects (ROBIAN biped, LISV, UVSQ)
Temporal evolution of the frequency components of the sagittal acceleration of the robot's pelvis:
• Automatically determines the robot's natural frequencies
• Continuously adapts to the evolution of defects
[Figure: phase portraits of the oscillator, without coupling vs. with coupling (learning, synchronous)]
[V. Khomenko, 2013, LISV, UVSQ, France]
APPLICATION OF
RECURRENT NEURAL
NETWORKS
• Human-computer interaction
– Speech and handwriting recognition
– Music composition
– Activity recognition
• Identification and control
– Identification and control of dynamic systems by learning
– Biologically inspired robotics for adaptive locomotion
– Study of the formation and evaluation of biological pattern structures
Application of RNNs