Presented by Dr. Hung Le
Memory-based
Reinforcement Learning
1
Background
2
What is Reinforcement Learning (RL)?
● Agent interacts with environment
● (S, A) => (S’, R) transitions (MDP)
● The transition can be stochastic or
deterministic
● Find a policy π(S) → A to maximize
expected return E(∑R) from the
environment
3
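To make the loop above concrete, here is a minimal sketch of the agent-environment interaction (not from the slides): a toy environment with a Gym-style reset/step interface and a placeholder random policy.

```python
import random

class ToyGridEnv:
    """Hypothetical 1-D grid stand-in for the MDP above (illustration only)."""
    def reset(self):
        self.pos = 0
        return self.pos                              # initial state S

    def step(self, action):
        # Deterministic transition: action 0 = left, 1 = right, on 6 cells.
        self.pos = max(0, min(5, self.pos + (1 if action == 1 else -1)))
        reward = 3.0 if self.pos == 5 else 0.0       # "3 cheese" at the last cell
        done = self.pos == 5
        return self.pos, reward, done                # S', R, terminal flag

def run_episode(env, policy):
    """Roll out one episode and return the (undiscounted) return sum(R)."""
    s, total, done = env.reset(), 0.0, False
    while not done:
        a = policy(s)                                # policy pi(S) -> A
        s, r, done = env.step(a)                     # environment returns S', R
        total += r
    return total

random_policy = lambda s: random.choice([0, 1])      # placeholder policy
print(run_episode(ToyGridEnv(), random_policy))
```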
A grid-world example
● The state space is discrete. We have 6 states
corresponding to 6 locations in the map
● The action space is discrete. We have 4 actions
corresponding to 4 movements
● The reward can be “nothing”, “poison”, “1
cheese” or “3 cheese”. We can convert to
scalars: 0, -1,1,3
● The transition in this case is deterministic,
corresponding to the outcome of movements.
○ It can be stochastic in other cases
○ E.g. at (0,0), moving left may result in
(0,1) or (1,1) with equal probability
4
https://huggingface.co/blog/deep-rl-q-part2
Classic RL algorithms: Value learning
5
Q-learning
(temporal difference-TD)
Watkins, Christopher JCH, and Peter Dayan. "Q-learning." Machine
learning 8, no. 3 (1992): 279-292.
Williams, Ronald J. "Simple statistical gradient-following algorithms
for connectionist reinforcement learning." Machine learning 8, no. 3
(1992): 229-256.
● Basic idea: before finding optimal
policy, we find the value function
● Learn (action) value function:
○ V(s)
○ Q(s,a)
● V(s)=E(∑R from s)
● Q(s,a)=E(∑R from s,a)
● Given Q(s,a)
→ choose action that maximizes the
value (ε-greedy policy)
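A minimal tabular Q-learning sketch of the TD update and ε-greedy policy described above; the environment interface follows the earlier toy sketch, and the learning rate, discount factor and exploration rate are illustrative values, not ones from the slides.

```python
import random
from collections import defaultdict

def q_learning(env, episodes=500, n_actions=2, alpha=0.1, gamma=0.99, eps=0.1):
    """Tabular Q-learning with an epsilon-greedy behaviour policy."""
    Q = defaultdict(float)                           # Q[(s, a)] ~ E(sum R from s, a)
    for _ in range(episodes):
        s, done = env.reset(), False
        while not done:
            # epsilon-greedy: explore with probability eps, otherwise act greedily on Q
            if random.random() < eps:
                a = random.randrange(n_actions)
            else:
                a = max(range(n_actions), key=lambda act: Q[(s, act)])
            s_next, r, done = env.step(a)
            # TD target: r + gamma * max_a' Q(s', a'), with no bootstrap at terminal states
            best_next = 0.0 if done else max(Q[(s_next, act)] for act in range(n_actions))
            Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])   # TD update
            s = s_next
    return Q
```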
Classic RL algorithms: Policy gradient
● Basic idea: directly optimise the policy as
a function of states
● Need to estimate the gradient of the
objective function E(∑R) w.r.t the
parameters of the policy
● Focus on optimisation techniques
● No memory
6
REINFORCE
(policy gradient)
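A hedged sketch of REINFORCE: sample a trajectory, compute the discounted returns G_t, and ascend the gradient of E(∑R) via −∑_t log π(a_t|s_t)·G_t. The network architecture and PyTorch usage here are illustrative choices; an optimiser such as torch.optim.Adam(policy.parameters(), lr=1e-3) would be passed in.

```python
import torch
import torch.nn as nn

class PolicyNet(nn.Module):
    """Small softmax policy over discrete actions (sizes are illustrative)."""
    def __init__(self, state_dim, n_actions):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(state_dim, 64), nn.Tanh(),
                                 nn.Linear(64, n_actions))

    def forward(self, state):
        return torch.distributions.Categorical(logits=self.net(state))

def reinforce_update(policy, optimizer, trajectory, gamma=0.99):
    """One REINFORCE step from a sampled trajectory of (state, action, reward) tuples."""
    returns, G = [], 0.0
    for _, _, r in reversed(trajectory):             # discounted return G_t at each step
        G = r + gamma * G
        returns.insert(0, G)
    loss = torch.tensor(0.0)
    for (s, a, _), G in zip(trajectory, returns):
        dist = policy(torch.as_tensor(s, dtype=torch.float32))
        loss = loss - dist.log_prob(torch.tensor(a)) * G    # -log pi(a|s) * G_t
    optimizer.zero_grad()
    loss.backward()                                  # stochastic gradient of -E(sum R)
    optimizer.step()
```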
General RL algorithms
7
Do we have memory in value learning?
● Q-table in value learning can be considered as a memory
● It remembers “how good a state-action pair is on average”
● The memory is very basic, non-smooth and redundant
8
Challenges in RL: the optimal policy can be
complex
● Task:
○ Agent searches for the key
○ Agent picks up the key
○ Agent opens the door to access the
room
○ Agent finds the box in the room
● Reward:
○ If the agent reaches the box, it gets a +1
reward
9
https://github.com/maximecb/gym-minigrid
→ How can we learn such complicated
policies from such a simple reward?
Short Answer: just learn from many trials (data)!
Chess
Self-driving
car
Video
games
Robotics
10
Deep RL: Value/Policy are neural networks
11
Example: RL agent plays video game
12
Game
simulation
DQN
Limitation of training with big data
● High cost
○ Training time (days to months)
○ GPU (dozens to hundreds)
● Require simulators
○ Agents are trained in simulation (millions to billions of steps)
○ The cost for one step of simulation can be high
○ Some real environments simply don’t have a simulator
● Trained agents are unlike humans
○ Unsafe exploration
○ Weird behaviors
○ Fail to generalize
13
Human vs RL Agents in Atari games
● Human:
○ A few hours of practice to reach
moderate performance
○ Don’t forget how to play old games
while learning new ones
○ Can play any game
● RL Agents (DQN-based):
○ 21 trillion hours of training to beat
humans (AlphaZero), equivalent to
11,500 years of human practice
○ Catastrophic forgetting if games are
learned sequentially
○ Despite endless training, some
games still end in failure
14
What is missing?
15
Taxonomy of memories
16
What is memory?
● Memory is the ability to efficiently store,
retain and recall information
● Brain memory stores items, events and
high-level structures
● Computer memory stores data,
programs and temporary variables
17
Memory in neural networks
18
Long-term
memory
Short-term
memory
Functional
memory
● Semantic memory:
storing data in the neural
network weights
● Episodic memory: storing
episodic events in matrix
memory
● Associative memory: key-value
binding as in Hopfield Network
or Transformer layer
● Working memory: matrix
memory in memory augmented
neural network
● Memory stores
programs
● Memory of models,
mixture of experts ..
Semantic memory
● A feed-forward neural network can be
viewed as a semantic memory
○ Data is stored in the weights of the
network via backpropagation
○ Data is read via forwarding the
input
○ It can be associative memory as
well
● A table that stores the statistics of data
can also be a semantic memory (value
table)
19
y=Wx
Working Memory
● Recurrent neural networks
contain working memory (hidden
state)
○ The hidden state captures
past inputs
○ The prediction is made
based on the hidden state
● Advanced versions of RNN
○ GRU/LSTM
○ MANN
20
Episodic Memory
● Often implemented as a matrix or
table
● Can be key-value memory
● Access via attention or analogy
search
● Support neural networks in
making predictions
21
Properties of memories
22
Memory | Lifespan | Plasticity | Example
1 Working memory | Short-term | Quick | If one episode is one day, it lasts for one day; memory is built instantly
2 Episodic memory | Long-term | Quick | Persists across the agent’s lifetime; lasts for several years; memory is built instantly
3 Semantic memory | Long-term | Slow | Persists across the agent’s lifetime; lasts for several years; takes time to build
How can it help RL?
23
Memory as experiences
24
Semantic Memory in RL
● The human brain implements RL
● Dopamine neurons reflect a reward
prediction error (TD learning)
● What is the memory in brain that
stores V?
○ Value table is not scalable
○ It may be a value model → DQN
(semantic memory)
25
DQN: Replay buffer is an episodic memory
• Store experiences: (s,a,r,s’) tuple
• Read memory via replay sampling
• Memory content is used to train
Action-Value Network Q
26
Mnih, V., Kavukcuoglu, K., Silver, D. et al. Human-level control through deep reinforcement
learning. Nature 518, 529–533 (2015). https://doi.org/10.1038/nature14236
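A minimal sketch of such a replay buffer as an episodic memory of (s, a, r, s’) tuples; the capacity and batch size are arbitrary placeholders.

```python
import random
from collections import deque

class ReplayBuffer:
    """Fixed-capacity episodic memory of transitions, sampled uniformly for training Q."""
    def __init__(self, capacity=100_000):
        self.buffer = deque(maxlen=capacity)         # oldest transitions are evicted first

    def write(self, s, a, r, s_next, done):
        self.buffer.append((s, a, r, s_next, done))  # store the experience tuple

    def read(self, batch_size=32):
        # Replay sampling: the returned batch is used to update the action-value network.
        return random.sample(self.buffer, batch_size)
```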
DQN’s memories are better than Q-table, but …
● Inefficiency:
○ Learning semantic memory (Q network)
is slow (gradient descent)
○ Optimise many parameters
● Bootstrap noise:
○ The target is the network’s output
○ If network is not well trained, the target
is noisy
● Replay buffer:
○ Raw observations
○ Need many sampling iterations
27
Alternative: episodic control paradigm
[Diagram: the current experience, e.g. (St, At), …, is used to read from a memory that binds experiences to final returns; the retrieved value drives the policy acting in the environment, and new experiences are written back to the memory]
28
● Episodic memory is a key-value memory
● Directly binds experiences to returns → refers to
experiences that have high returns when making
decisions
Model-free episodic control: K-nearest neighbors
29
Blundell, Charles, Benigno Uria, Alexander Pritzel, Yazhe Li, Avraham Ruderman, Joel Z. Leibo, Jack Rae, Daan
Wierstra, and Demis Hassabis. "Model-free episodic control." NeurIPS (2016).
Fixed-size memory
First-in-first-out
● No need to learn parameters
(pretrained 𝜙)
● Quick value estimation
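A rough sketch of the episodic-control table: each action keeps a table mapping an embedded state φ(s) to an observed return, and unseen states are valued by averaging the returns of the k nearest stored neighbours. The embedding, capacity and k are placeholders, and a full implementation would also take the max with an existing entry for a matching key.

```python
import numpy as np

class EpisodicControlTable:
    """Per-action table: embedded state phi(s) -> observed return (fixed size, FIFO)."""
    def __init__(self, capacity=10_000, k=11):
        self.keys, self.values = [], []
        self.capacity, self.k = capacity, k

    def write(self, key, episodic_return):
        self.keys.append(np.asarray(key, dtype=float))
        self.values.append(float(episodic_return))
        if len(self.keys) > self.capacity:           # first-in-first-out eviction
            self.keys.pop(0); self.values.pop(0)

    def estimate(self, key):
        # Q_EC(s, a): average return of the k nearest stored keys (no gradient descent needed).
        key = np.asarray(key, dtype=float)
        dists = [np.linalg.norm(key - stored) for stored in self.keys]
        nearest = np.argsort(dists)[: self.k]
        return float(np.mean([self.values[i] for i in nearest]))
```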
Hybrid: episodic memory + semantic memory (DQN)
30
Lin, Zichuan, Tianqi Zhao, Guangwen Yang, and Lintao Zhang. "Episodic Memory Deep Q-Networks." IJCAI (2018).
Episodic
TD learning
Limitation of model-free episodic memory
● Near-deterministic assumption
○ Assumes a clean environment
○ Stores the best return
● Sample inefficiency:
○ Stores state-action values, which demands
experiencing every action to gain experience
● Fixed combination between episodic and parametric
values
○ The episodic contribution weight is unchanged
across observations
○ Requires manual tuning of the weight
31
What if the state is partially
observable and the number of
actions is large?
Model-based episodic memory
● Learn a model of trajectories using
self-supervised training
○ Model=LSTM
○ Learn to reconstruct past state-action pairs
given the current trajectory and a query
● The trained LSTM is used to generate
trajectory representations
→ counterfactual trajectories
→ imagined actions
32
Le, Hung, Thommen Karimpanal George, Majid Abdolshah, Truyen Tran, and Svetha Venkatesh. "Model-Based Episodic
Memory Induces Dynamic Hybrid Controls." NeurIPS (2021).
Discrete-action environment: Atari benchmark
33
Evaluation Metrics:
Normalised score = (model’s score − random-play
score) / (human score − random-play score).
Mnih, V., Kavukcuoglu, K., Silver, D. et al. Human-level control through deep reinforcement
learning. Nature 518, 529–533 (2015). https://doi.org/10.1038/nature14236
~60 games
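The normalised score restated as code (a one-line paraphrase of the metric above):

```python
def normalised_score(model_score, random_score, human_score):
    """0 = random play, 1 = human-level performance."""
    return (model_score - random_score) / (human_score - random_score)
```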
Sample efficiency test on Atari games
34
Model-free (10M)
Hybrid (40M)
Model-based (10M)
DQN (200M)
Memory to build context
35
When the state is not enough …
● Partially Observable Environments:
○ States do not contain all required
information for optimal action
○ E.g. state=position, does not contain
velocity
● Ways to improve:
○ Build richer state representations
○ Memory of all past
observations/actions
● Policy gradient
36
Full map
Observed
state
RNN hidden state
RNN as policy model
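A sketch of a recurrent policy for partial observability: the hidden state acts as working memory that summarises past observations, and actions are sampled from a head on top of it. The sizes and the GRU choice are illustrative.

```python
import torch
import torch.nn as nn

class RecurrentPolicy(nn.Module):
    """Working-memory policy: the GRU hidden state summarises the observation history."""
    def __init__(self, obs_dim, n_actions, hidden_dim=128):
        super().__init__()
        self.rnn = nn.GRUCell(obs_dim, hidden_dim)
        self.head = nn.Linear(hidden_dim, n_actions)

    def forward(self, obs, h=None):
        h = self.rnn(obs, h)                                   # update the working memory
        dist = torch.distributions.Categorical(logits=self.head(h))
        return dist, h                                         # act from memory, carry h forward
```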
Building better working memory
for a better context
37
● External memory: longer-term,
stores more
● Unsupervised training to learn
read-write operations
Wayne, Greg, Chia-Chun Hung, David Amos, Mehdi Mirza, Arun Ahuja,
Agnieszka Grabska-Barwinska, Jack Rae et al. "Unsupervised predictive
memory in a goal-directed agent." arXiv preprint arXiv:1803.10760 (2018).
Unsupervised training on the memory
38
It is useful for memory-based decision processes
39
Benchmark: Navigation with Distraction
40
Hung, Chia-Chun, Timothy Lillicrap, Josh Abramson, Yan Wu, Mehdi
Mirza, Federico Carnevale, Arun Ahuja, and Greg Wayne. "Optimizing
agent behavior over long time scales by transporting value." Nature
communications 10, no. 1 (2019): 1-12.
Memory is critical for distracting observations
41
Memory with attention is beneficial
42
Break and QA
43
Memory for exploration
44
Exploration issue in RL
● Rewards can be very sparse
○ RL agents cannot learn anything until
they collect the first reward
○ Explore forever?
● Sometimes the reward function in a
complicated real-world problem is
unknown
○ No simulator available
○ Exploring freely in the real world is unsafe
→ Sample inefficiency
→ Need for efficient exploration
45
Need exploration mechanisms
to enable sample efficiency!!!
46
Aubret, A., L. Matignon, and S. Hassas. "A survey on intrinsic motivation in reinforcement learning."
In the biological world, agents cope with this
problem very well
● Animals can travel long
distances until they find food
● Humans can navigate to an
address in a strange city
○ intrinsic motivation
○ curiosity, hunch
○ intrinsic reward
47
https://www.beepods.com/5-fascinating-ways-bees-and-flowers-find-each-other/
Agents should be motivated towards
“interesting” consequences
● C: actor vs M: world model
● M predicts consequences of C
● As a result:
▪ If C’s actions result in repeated and
boring consequences → M predicts
well
▪ C must explore novel
consequences
● Memory:
▪ To learn the world model
▪ To know whether something is novel or old
48
https://people.idsia.ch/~juergen/artificial-curiosity-since-1990.html
M: Forward model learns dynamics
(semantic memory)
49
Stadie, Levine, Abbeel: Incentivizing Exploration in Reinforcement Learning with Deep Predictive Models. 2015
Novelty if prediction
error is high (intrinsic
reward)
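A sketch of this curiosity signal under simple assumptions: a learned forward model predicts the next state from (s, a), and its prediction error is returned as the intrinsic reward; the architecture and reward scaling are placeholders.

```python
import torch
import torch.nn as nn

class ForwardModel(nn.Module):
    """Semantic memory of the dynamics: predicts s_{t+1} from (s_t, a_t)."""
    def __init__(self, state_dim, action_dim):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(state_dim + action_dim, 128), nn.ReLU(),
                                 nn.Linear(128, state_dim))

    def forward(self, state, action):
        return self.net(torch.cat([state, action], dim=-1))

def intrinsic_reward(model, state, action, next_state):
    """High prediction error -> novel consequence -> high intrinsic reward."""
    with torch.no_grad():
        return ((model(state, action) - next_state) ** 2).mean().item()
```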
When novelty as prediction error is useless
● The prediction target is stochastic
● Information necessary for the prediction
is missing
→ Both the totally predictable and the
fundamentally unpredictable will get boring
→ Solution: Remember all experiences
● “Store” all observations, including
stochastic ones in working, semantic or
episodic memory
● Instead of predicting, try recalling from
the memory
50
https://openai.com/blog/reinforcement-learning-with-prediction-based-rewards/
Working memory: Store visited in-episode states
● Novelty through reachability:
▪ Boring if reachable from states in memory in
fewer than k steps
● Learn to classify: reachable or unreachable
○ Collect two states from a trajectory
○ Create a label indicating whether one is reachable from
the other
51
Savinov et al. "Episodic Curiosity through Reachability." In ICLR 2018.
High if unreachable
Exploration with working memory is better
52
No intrinsic
reward
Intrinsic
reward via
dynamic
prediction
Intrinsic
reward via
WM
DeepMind’s Maze benchmark
53
Bad behavior
Good behavior
Semantic memory:
distillation to neural networks’ weight
● Target network: randomly
transforms the state
● Predictor network: tries to
remember the transformed
state
○ A global memory
○ A random TV is not a problem
■ It remembers all noisy
channels
54
https://medium.com/data-from-the-trenches/curiosity-driven-learning-through-random-network-distillation-488ffd8e5938
Burda et al. Random Network Distillation: a new take on Curiosity-Driven Learning, In ICLR 2019
High if
cannot
distill
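A sketch of random network distillation as described above: a frozen, randomly initialised target network transforms the state, a predictor is trained to reproduce that transform, and the remaining error is the exploration bonus. Network sizes and the learning rate are illustrative.

```python
import torch
import torch.nn as nn

def _small_net(obs_dim, out_dim=64):
    return nn.Sequential(nn.Linear(obs_dim, 128), nn.ReLU(), nn.Linear(128, out_dim))

class RNDBonus:
    """Semantic-memory curiosity: the bonus is high for states the predictor cannot yet distill."""
    def __init__(self, obs_dim, lr=1e-4):
        self.target = _small_net(obs_dim)            # fixed random transform of the state
        for p in self.target.parameters():
            p.requires_grad_(False)
        self.predictor = _small_net(obs_dim)         # trained to reproduce the target
        self.opt = torch.optim.Adam(self.predictor.parameters(), lr=lr)

    def bonus_and_update(self, obs):
        error = ((self.predictor(obs) - self.target(obs)) ** 2).mean()
        self.opt.zero_grad()
        error.backward()                             # "remember" the state by distilling it
        self.opt.step()
        return error.item()                          # intrinsic reward
```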
Episodic memory: explore from
stored good states
● Archive: a memory of good states
(state, score) → sample one
● Purely random exploration from this
state → collect more states
● Update the archive
And many other tricks: imitation learning,
goal-based policy, …
55
Adrian Ecoffet et al.: First return, then explore. Nature 2021
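A rough sketch of the archive idea, with a hypothetical cell key and uniform sampling standing in for the paper’s more elaborate cell representation and selection heuristics:

```python
import random

class Archive:
    """Episodic memory of good states: cell -> (state snapshot, best score seen so far)."""
    def __init__(self):
        self.cells = {}

    def update(self, cell, state, score):
        # Keep the highest-scoring representative of each cell.
        if cell not in self.cells or score > self.cells[cell][1]:
            self.cells[cell] = (state, score)

    def sample_state(self):
        # The full method biases sampling towards promising / rarely visited cells;
        # uniform sampling keeps this sketch short.
        return random.choice(list(self.cells.values()))[0]
```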
Atari game: Superhuman performance
56
Episodic
Semantic
Working
Montezuma Revenge benchmark
Memory for optimisation
57
Episodic memory for hyperparameter optimisation
● RL is very sensitive to hyperparameters
● SOTA performance is achieved with extensive
hyperparameter tuning
Islam, R., Henderson, P., Gomrokchi, M., and Precup, D. (2017). Reproducibility of benchmarked deep reinforcement learning tasks for continuous control. arXiv preprint
arXiv:1708.04133.
58
DQN
The number of hyperparameters
is enormous!
Limitation of memory-less optimiser
● The optimiser has no context of the training in its optimization process
● Hyperparameter tuning is treated as a stateless bandit or greedy optimization
○ Ignoring the context prevents the use of episodic experiences that can be
critical in optimization and planning
○ E.g. the hyperparameters that helped overcome a past local optimum in
the loss surface can be reused when the learning algorithm falls into a
similar local optimum
59
How to build the context (the key in
key-value memory)?
Optimising hyperparameter as episodic RL
● At each policy update, the hyper-agent:
○ Observes the training context – hyper-state
○ Configures the RL algorithm with suitable hyperparameters ψ – hyper-action
○ Trains the RL agent with ψ and observes the learning progress – hyper-reward
● The goal of the Hyper-RL is the same as the main RL’s: to maximize the return of the RL agent
○ At a hyper-state, find the hyper-action that maximizes the accumulated hyper-reward
(hyper-return)
60
KEY | VALUE
Experience (hyper-state, hyper-action) | Outcome (hyper-returns)
Le, Hung, Majid Abdolshah, Thommen K. George, Kien Do, Dung
Nguyen, and Svetha Venkatesh. "Episodic Policy Gradient Training."
AAAI (2022).
Hyper-state representation learning
● Compress the parameters/gradients to a vector hyper-state s
● VAE learns to reconstruct s
● The latent vector is the hyper-state representation
61
Continuous-action environment:
Mujoco benchmark
62
Metric: a positive reward is allocated based on the
distance moved forward and a negative reward is
allocated for moving backward.
63
Policy gradient optimisation
64
Issues with naïve Policy Gradient
● High variance and instability
● The gradient may not accurately reflect the
policy gain when the policy changes
substantially
Trust-region optimization is a solution
● The new policy should be inside a small
trust region around the last sampling policy
(old policy)
● Bound KL(π_θ_old ‖ π_θ_new) (TRPO,
PPO, …)
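For concreteness, a sketch of the PPO-style clipped surrogate, which keeps the new policy close to the old sampling policy (TRPO instead constrains KL(π_θ_old ‖ π_θ_new) explicitly); ε here is an illustrative clipping range.

```python
import torch

def ppo_clip_loss(log_prob_new, log_prob_old, advantage, eps=0.2):
    """Clipped surrogate: updates are discouraged from leaving the trust region around pi_old."""
    ratio = torch.exp(log_prob_new - log_prob_old)           # pi_new(a|s) / pi_old(a|s)
    unclipped = ratio * advantage
    clipped = torch.clamp(ratio, 1.0 - eps, 1.0 + eps) * advantage
    return -torch.min(unclipped, clipped).mean()             # minimise the negative objective
```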
65
What is wrong with these trust-region methods?
66
When the old policy is bad
● Bounding keeps the new policy stuck in the same
local optimum as the old policy
● Relying on one old policy is not enough
→ Need to store many past policies and rely on all of
them
Le, Hung, Thommen Karimpanal George, Majid Abdolshah, Dung Nguyen, Kien Do, Sunil Gupta, and Svetha
Venkatesh. "Memory-Constrained Policy Optimization." NeurIPS (2022).
67
Use two trust regions instead of one
Backup Trust Region from Virtual Policy
PG Objective
68
Memory of policy networks
- Build a memory of past policies. Choose 𝜓 from the policy memory via attention
- fϕ is a neural network parameterized by ϕ which outputs softmax attention
weights over the M past policies
- v is a “context” vector capturing different relations among θ, θold, ψ.
69
Final performance on Atari and Mujoco
Conclusion
70
In summary
● Memory assists RL agents in many forms:
○ Semantic
○ Working
○ Episodic
● And in many tasks:
○ Store experiences
○ Exploration
○ State representation
○ Optimisation (hyperparameters, policy)
71
What’s next? Life-long memory
● So far, the memory lifespan is restricted to an episode (working memory) or a
task (episodic or semantic memory)
● A real memory will span across tasks and domains:
○ Playing 60 Atari games in a row
○ Learn Mujoco then learn Atari
● It requires a new kind of memory that supports different representations from
different scenarios
● The amount of events and information is large
○ Efficient memory access mechanism
○ Effective memory selection
72
What’s next? Dynamic memory
● Current memory is fixed size (table, matrix, neural network)
○ It is not enough when the observations are dense
○ It is redundant when the observations are sparse
● Can we build a dynamic memory that automatically grows and shrinks depending
on context?
○ Memory read and write will be more precise
○ No noise stored in the memory
73
What’s next? Hierarchical memory
● Current memory models are generally flat, supporting single-step access
● To remember details, several steps of recall are needed:
○ Coarse-grained chunk of steps
○ A specific step in the chunk
● Remember different timescales
○ Events from recent timesteps
○ Events from a far episode
74
What’s next? Abstract memory
● Current memory models store specific events, states, actions or
representations of them
● To excel in diverse tasks, it is critical to capture abstract concepts:
○ Goals (e.g. use the red key to open the red door)
○ Relationships (e.g. climbing the ladder and picking up the key are required to
pass the level)
○ High-level objects (e.g. anything that blocks the door is an obstacle)
● It is unclear how artificial memory can store these complex concepts
75
What’s next? Complementary learning system
● A system of multiple memory kinds
● The memories communicate and transfer knowledge:
○ Episodic memory distills events into semantic knowledge
○ Working memory distills temporary information into long-term memory
● How to design an efficient and biologically plausible system of memory is an
open problem.
76
What’s next? Other testbeds for memory
● Continual RL
● Meta-RL
● Few-shot-RL
77
Demo and QA
https://github.com/thaihungle/AJCAI22-Tutorial
78
Our team at A2I2 is hiring!
Contact thai.le@deakin.edu.au for PhD scholarships.
79