Discovering Reinforcement Learning
Algorithms
Oh et al. in <NeurIPS 2020>
발표자 : 윤지상
Graduate School of Information. Yonsei Univ.
Machine Learning & Computational Finance Lab.
1. Introduction
2. LPG
3. Details in LPG Architecture
4. Experiments
INDEX
1 Introduction
1. Introduction
In RL, meta-learning ("learning to learn") research has shown that, given a particular value function, the policy update rule itself can be learned and then applied to unseen tasks.
Can an agent discover, from scratch and on its own, a way to optimize RL learning?
1. Introduction
This study contributes:
1. It shows the feasibility of a model discovering on its own how to learn an agent's policy and a semantic prediction vector, while achieving good performance.
2. It places no assumptions on the semantic prediction vector, minimizing user-specified settings and moving closer to true meta-learning.
3. The RL learning algorithm discovered from simple tasks shows meaningful performance even on complex tasks.
2 LPG
2. LPG
Learned Policy Gradient (LPG)
1. After a few actions, it learns the jump timing for particular situations.
2. Over many decisions, it learns game strategies such as monster speed and shortcuts.
3. When the game ends, it thinks about how to learn more game strategies so as to raise the score.
4. It applies the know-how for acquiring game strategies quickly to other games.
2. LPG
Learned Policy Gradient (LPG)
2. LPG
Learned Policy Gradient (LPG)
LPG parameterized by 𝜂 (a backward LSTM)
Agent parameterized by 𝜃
There are TWO learnable models.
Final goal: find the optimized 𝜂
2. LPG
Learned Policy Gradient (LPG)
① The agent uses its parameters 𝜃 to output two values:
1. The policy 𝜋𝜃, the distribution from which actions are sampled
2. The prediction 𝑦𝜃, which estimates information useful for selecting actions
(Analogy: the agent's choices and criteria for actions during the game)
2. LPG
Learned Policy Gradient (LPG)
② The agent takes actions for 𝑇 time-steps to form a trajectory, and 𝜃 is updated toward the targets 𝜋̂ and 𝑦̂ produced by the LPG to guide the agent's learning.
(Analogy: picking up several strategies through many actions)
2. LPG
Learned Policy Gradient (LPG)
③ Across multiple environments, each agent is trained every 𝑇 time-steps; once all environments' lifetimes end, the LPG's 𝜂 is updated so that the total reward is maximized.
(Analogy: learning the game know-how that raises the score further)
3 Details in LPG Architecture
3. Details in LPG Architecture
1) LPG Architecture
2) Agent Update (𝜃)
3) LPG Update (𝜂)
4) Balancing Agent Hyperparameters for Stabilisation (𝛼)
𝑝(ℰ) : distribution over environments ℰ
𝑝(𝜃0) : distribution over initial agent parameters 𝜃0
𝐺 : sum of rewards over the entire lifetime
Objective:
$$\eta^{*} = \arg\max_{\eta}\; \mathbb{E}_{\mathcal{E}\sim p(\mathcal{E})}\, \mathbb{E}_{\theta_{0}\sim p(\theta_{0})}\left[\,G\,\right]$$
1) LPG Architecture
The LPG is a backward LSTM. At each time-step it takes as input
$$x_t = [\,r_t,\; d_t,\; \pi_\theta(a_t \mid s_t),\; \varphi(y_\theta(s_t)),\; \varphi(y_\theta(s_{t+1}))\,]$$
and outputs $\hat{\pi} \in \mathbb{R}$ and $\hat{y} \in [0,1]^m$, where
- 𝑟𝑡 : reward
- 𝑑𝑡 : whether the episode has terminated (binary value)
- 𝜋𝜃(𝑎𝑡|𝑠𝑡) : probability of the taken action under the agent's policy
- 𝑦𝜃 ∈ [0,1]^𝑚 : 𝑚-dimensional categorical prediction vector (𝑚 = 30 is used)
- 𝜑 : shared neural network (dim 16 → dim 1)
Because the LPG receives the probability of the action given the state, rather than the action itself, it can be applied to diverse environments.
3. Details in LPG Architecture
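To make the data flow above concrete, below is a minimal sketch of such a backward-LSTM update network, assuming PyTorch. The class name `LPGNetwork`, the hidden size, and the exact layer shapes are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

class LPGNetwork(nn.Module):
    """Sketch of the LPG: a backward LSTM mapping per-step inputs to targets (pi_hat, y_hat)."""
    def __init__(self, m=30, phi_dim=16, hidden_size=256):
        super().__init__()
        # phi: shared embedding applied to y_theta(s_t) and y_theta(s_{t+1}) (dim 16 -> dim 1)
        self.phi = nn.Sequential(nn.Linear(m, phi_dim), nn.ReLU(), nn.Linear(phi_dim, 1))
        # per-step input: [r_t, d_t, pi_theta(a_t|s_t), phi(y_theta(s_t)), phi(y_theta(s_{t+1}))]
        self.lstm = nn.LSTM(input_size=5, hidden_size=hidden_size)
        self.pi_hat_head = nn.Linear(hidden_size, 1)   # scalar target pi_hat per step
        self.y_hat_head = nn.Linear(hidden_size, m)    # m-dimensional target y_hat per step

    def forward(self, r, d, pi_a, y_t, y_tp1):
        # r, d, pi_a: [T];  y_t, y_tp1: [T, m]
        x = torch.stack([r, d, pi_a,
                         self.phi(y_t).squeeze(-1),
                         self.phi(y_tp1).squeeze(-1)], dim=-1)   # [T, 5]
        x = torch.flip(x, dims=[0]).unsqueeze(1)     # "backward" LSTM: process time in reverse
        h, _ = self.lstm(x)
        h = torch.flip(h.squeeze(1), dims=[0])       # restore original time order
        pi_hat = self.pi_hat_head(h).squeeze(-1)     # [T], unbounded real values
        y_hat = torch.sigmoid(self.y_hat_head(h))    # [T, m], values in (0, 1)
        return pi_hat, y_hat
```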
2) Agent Update (𝜽)
The target 𝜋̂ updates 𝜃 directly, pushing the agent's policy toward 𝜋̂.
The target 𝑦̂ updates 𝜃 indirectly, so that the prediction represents the state semantically, like a value function.
𝜃 is updated after a trajectory of 𝑇 time-steps has been formed (𝑇 = 20 is used).
$$\Delta\theta \;\propto\; \mathbb{E}_{\pi_\theta}\!\left[\,\nabla_\theta \log \pi_\theta(a \mid s)\,\hat{\pi} \;-\; \alpha_y \nabla_\theta D_{\mathrm{KL}}\!\left(y_\theta(s)\,\|\,\hat{y}\right)\right]$$
(first term: a categorical cross-entropy-style policy term toward 𝜋̂; second term: the KL-divergence between the agent's prediction and the target 𝑦̂)
3. Details in LPG Architecture
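As a concrete reading of this update direction, here is a minimal sketch of the corresponding agent loss, assuming PyTorch; the function name `agent_loss`, the tensor shapes, and the default KL weight are illustrative (the KL weight is one of the hyperparameters balanced per environment in 4) below), not the authors' code.

```python
import torch

def agent_loss(log_pi_a, y_pred, pi_hat, y_hat, alpha_y=0.5):
    """log_pi_a: [T] log pi_theta(a_t|s_t);  y_pred: [T, m] agent prediction y_theta(s_t);
    pi_hat: [T] scalar targets and y_hat: [T, m] categorical targets, both from the LPG."""
    # Policy term: push log pi_theta(a|s) in the direction indicated by pi_hat
    pg_term = (log_pi_a * pi_hat.detach()).mean()
    # Prediction term: KL(y_theta(s) || y_hat), pulling the agent's prediction toward the target
    kl_term = (y_pred * (torch.log(y_pred + 1e-8)
                         - torch.log(y_hat.detach() + 1e-8))).sum(-1).mean()
    # Minimizing this loss ascends the Delta-theta direction above.
    # (Targets are detached for the inner update; in meta-training their dependence on eta is kept.)
    return -pg_term + alpha_y * kl_term
```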
3) LPG Update (𝜼)
Ideally ∆𝜂 would be computed only after the agent has been trained all the way from 𝜃0 to 𝜃𝑁, but because of memory constraints ∆𝜂 is computed after every 𝐾 agent updates, with 𝐾 < 𝑁 (𝐾 = 5 is used).
(e.g., with 𝑇 = 20 and 𝐾 = 5: 𝜃𝑛 → 𝜃𝑛+1 is updated every 20 time-steps; after 20 × 5 = 100 time-steps, once 𝜃𝑛+5 has been reached, 𝐺 is computed and then ∆𝜂; this repeats until the environment lifetime ends.)
Objective:
$$\eta^{*} = \arg\max_{\eta}\; \mathbb{E}_{\mathcal{E}\sim p(\mathcal{E})}\, \mathbb{E}_{\theta_{0}\sim p(\theta_{0})}\left[\,G\,\right]$$
Gradient:
$$\Delta\eta \;\propto\; \mathbb{E}_{\mathcal{E}}\,\mathbb{E}_{\theta}\!\left[\nabla_\eta \log \pi_{\theta_N}(a \mid s)\, G\right]$$
3. Details in LPG Architecture
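A tiny runnable illustration of this schedule (𝑇 = 20 and 𝐾 = 5 as on the slide; the lifetime length of 400 steps is an arbitrary choice for the demo):

```python
# theta is updated every T time-steps; a Delta-eta estimate is computed every
# K theta-updates, i.e. every T*K = 100 time-steps of the lifetime.
T, K, lifetime_steps = 20, 5, 400

for t in range(1, lifetime_steps + 1):
    if t % T == 0:
        print(f"step {t:3d}: update theta            (agent update #{t // T})")
    if t % (T * K) == 0:
        print(f"step {t:3d}: compute G and Delta-eta (after {K} more agent updates)")
```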
3) LPG Update (𝜼)
To stabilize training, regularization terms are added to the plain gradient
$$\Delta\eta \;\propto\; \mathbb{E}_{\mathcal{E}}\,\mathbb{E}_{\theta}\!\left[\nabla_\eta \log \pi_{\theta_N}(a \mid s)\, G\right],$$
giving the regularized update
$$\Delta\eta \;\propto\; \mathbb{E}_{\mathcal{E}}\,\mathbb{E}_{\theta}\!\left[\nabla_\eta \log \pi_{\theta_N}(a \mid s)\, G \;+\; \beta_0 \nabla_\eta \mathcal{H}\!\left(\pi_{\theta_N}\right) \;+\; \beta_1 \nabla_\eta \mathcal{H}\!\left(y_{\theta_N}\right) \;-\; \beta_2 \nabla_\eta \lVert \hat{\pi} \rVert_2^2 \;-\; \beta_3 \nabla_\eta \lVert \hat{y} \rVert_2^2\right]$$
(policy entropy, prediction entropy, and L2 penalties on the LPG targets).
3. Details in LPG Architecture
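A minimal sketch of this regularized meta-objective written as a loss on 𝜂, assuming PyTorch. The β values below are placeholders rather than the paper's settings, and the gradient is assumed to flow back to 𝜂 through the agent updates.

```python
import torch

def lpg_meta_loss(log_pi_a, G, pi_probs, y_pred, pi_hat, y_hat,
                  beta0=0.01, beta1=0.001, beta2=0.01, beta3=0.001):
    """log_pi_a: [T] log pi_{theta_N}(a|s);  G: scalar lifetime return;
    pi_probs: [T, A] action probabilities of pi_{theta_N};  y_pred: [T, m] prediction y_{theta_N};
    pi_hat: [T] and y_hat: [T, m] are the LPG outputs."""
    pg    = (log_pi_a * G).mean()                                     # REINFORCE term
    h_pi  = -(pi_probs * torch.log(pi_probs + 1e-8)).sum(-1).mean()   # policy entropy H(pi)
    h_y   = -(y_pred * torch.log(y_pred + 1e-8)).sum(-1).mean()       # prediction entropy H(y)
    l2_pi = (pi_hat ** 2).mean()                                      # ||pi_hat||_2^2 penalty
    l2_y  = (y_hat ** 2).sum(-1).mean()                               # ||y_hat||_2^2 penalty
    # Delta-eta ascends: pg + b0*H(pi) + b1*H(y) - b2*||pi_hat||^2 - b3*||y_hat||^2
    return -(pg + beta0 * h_pi + beta1 * h_y - beta2 * l2_pi - beta3 * l2_y)
```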
4) Balancing Agent Hyperparameters for Stabilisation (𝜶)
Many environments are trained at once, and applying identical hyperparameters (e.g., learning rate) to all of them makes training unstable, so the hyperparameters are set dynamically per environment:
$$\eta^{*} = \arg\max_{\eta}\; \mathbb{E}_{\mathcal{E}\sim p(\mathcal{E})}\, \max_{\alpha}\, \mathbb{E}_{\theta_{0}\sim p(\Theta)}\left[\,G\,\right], \qquad \alpha \sim p(\alpha \mid \mathcal{E})$$
For each given environment ℰ, the probability of sampling hyperparameters that increase 𝐺 is raised (𝛼 = learning rate and KL-divergence weight are used).
3. Details in LPG Architecture
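One way to picture 𝛼 ~ 𝑝(𝛼|ℰ) is a per-environment categorical distribution over a small grid of (learning rate, KL weight) candidates whose probabilities shift toward settings that yielded higher lifetime returns. The sketch below uses a bandit-style softmax-score update as an illustrative assumption, not the authors' exact scheme; the class name `HyperparamSampler` and the candidate grid are hypothetical.

```python
import numpy as np

class HyperparamSampler:
    """Per-environment distribution p(alpha | E) over candidate hyperparameter settings."""
    def __init__(self, grid):
        self.grid = grid                   # e.g. [(learning_rate, kl_weight), ...]
        self.scores = np.zeros(len(grid))  # running preference per candidate

    def sample(self):
        p = np.exp(self.scores - self.scores.max())   # softmax over preferences
        p /= p.sum()
        idx = np.random.choice(len(self.grid), p=p)
        return idx, self.grid[idx]

    def update(self, idx, G, step_size=0.1):
        # Raise the probability of candidates that achieved a higher lifetime return G
        self.scores[idx] += step_size * G

# Usage: one sampler per training environment E
sampler = HyperparamSampler(grid=[(1e-3, 0.5), (1e-3, 1.0), (5e-4, 0.5), (5e-4, 1.0)])
idx, (lr, kl_weight) = sampler.sample()
# ... run a lifetime with (lr, kl_weight), observe return G, then:
# sampler.update(idx, G)
```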
3. Details in LPG Architecture
Ablation Study Result
3. Details in LPG Architecture
[Diagram: meta-training loop, repeated over lifetimes of N time-steps]
- SAMPLE ℰ ~ 𝑝(ℰ), 𝜃 ~ 𝑝(𝜃), 𝛼 ~ 𝑝(𝛼|ℰ)  (Environment: 940, Agent: 64)
- Interact with the environment: 𝑥1 → ⋯ → 𝑥20 → ⋯ → 𝑥100 → ⋯ → 𝑥𝑁
- UPDATE agent parameter 𝜃 (every 𝑇 = 20 time-steps)
- COMPUTE & SAVE the gradient for LPG parameter 𝜂 (every 𝐾 = 5 agent updates)
- UPDATE 𝑝(𝛼|ℰ)
- UPDATE LPG parameter 𝜂 using the averaged gradient across lifetimes
4 Experiments
4. Experiments
4. Experiments
Setting
- Baseline
1. A2C
2. LPG-V (only learns 𝜋̂, with 𝑦 given as a value function trained by TD(𝜆))
- Training Environments
1. Tabular grid worlds
2. Random grid worlds
3. Delayed chain MDP
4. Experiments
Specialising in Training Environments
4. Experiments
What does the prediction (y) look like?
4. Experiments
Does the prediction (y) capture true values and beyond?
Does the prediction (y) converge?
4. Experiments
Ablation Study
4. Experiments
Generalising from Toy Environments to Atari Games
Selected results