[DL Reading Group] Data-Efficient Reinforcement Learning with Self-Predictive Representations

DEEP LEARNING JP
[DL Papers] Data-Efficient Reinforcement Learning
with Self-Predictive Representations
Xin Zhang, Matsuo Lab
http://deeplearning.jp

Contents
1. Bibliographic Information
2. Introduction
3. Self-Predictive Representation
4. Related Works
5. Experiment Evaluation
6. Discussion

Bibliographic Information
● Title:
○ Data-Efficient Reinforcement Learning with Self-Predictive Representations
● Authors:
○ Max Schwarzer, Ankesh Anand, Rishab Goel, R Devon Hjelm, Aaron Courville,
Philip Bachman
● Affiliations: Mila, Université de Montréal, Microsoft Research
● Submitted: 2020/7/12 (arXiv); ICLR 2021 Spotlight (7776)
● Summary
○ To improve the sample efficiency of reinforcement learning, representations are learned in a self-supervised way.
○ A state-prediction dynamics model is trained to predict the state k steps ahead.
○ Predictions are made in the latent space of states, which reduces complexity.

Introduction
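The sample-efficiency problem in reinforcement learning
- Atari games: 10~50 years. OpenAI Five: 45000 years of experience.
- This is not acceptable in the real world, so sample efficiency must be improved!
- In CV and NLP, self-supervised representation learning is effective and has delivered strong results.
- Representation learning is also effective in reinforcement learning and has been studied for some time.
- State Representation Learning for Reinforcement Learning (Matsushima, DL reading group)
- Can we learn a state representation from which future states can be predicted?
- with self-supervision..
- while making use of data augmentation..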

State representations trained so that the representations K steps ahead can be predicted
Self-Predictive Representations (SPR)
1. Online encoder and target encoder
2. Transition Model
3. Projection Heads
4. Prediction Loss

Target encoder, using EMA of online encoder.
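
A minimal sketch of the EMA update, to make the role of the coefficient concrete. It assumes the usual form target ← tau * target + (1 - tau) * online, which is consistent with the Discussion slide (T = 0.99 keeps the target almost fixed, T = 0 makes it follow the online encoder). The function name `ema_update`, the name `tau` for the slides' T, and the flat list of parameters are illustrative assumptions, not the authors' code.

```python
def ema_update(target_params, online_params, tau):
    """Return new target parameters: tau * target + (1 - tau) * online.

    Hypothetical helper: parameters are modelled as flat lists of floats.
    """
    return [tau * t + (1.0 - tau) * o for t, o in zip(target_params, online_params)]


online = [1.0, 2.0, 3.0]
target = [0.0, 0.0, 0.0]

# tau = 0: the target becomes an exact copy of the online parameters.
copied = ema_update(target, online, tau=0.0)

# tau = 0.99: the target moves only 1% of the way toward the online parameters.
slow = ema_update(target, online, tau=0.99)
```

With a large tau the target changes slowly, so the prediction targets stay stable between updates.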

The transition model predicts the state representations for K steps, one step at a time.
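
The step-by-step rollout can be sketched as below. This is a toy illustration under stated assumptions: the transition model is assumed to take the current latent and an action, and `toy_transition` is a hypothetical stand-in for a learned network, not the model used in the paper.

```python
def rollout(z0, actions, transition):
    """Apply the transition model one step at a time, feeding each prediction back in."""
    predictions = []
    z = z0
    for a in actions:
        z = transition(z, a)      # predicted latent for the next step
        predictions.append(z)
    return predictions


def toy_transition(z, a):
    # Hypothetical stand-in for a learned model: shift every latent coordinate by the action.
    return [x + a for x in z]


# K = 5 steps, matching the value reported on the Discussion slide.
preds = rollout([0.0, 1.0], actions=[1, 0, 2, 1, 1], transition=toy_transition)
```

Only the first latent comes from the encoder; every later prediction is computed from the previous prediction, so errors can accumulate over the K steps.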

The projection head compresses the representation to a lower dimension; a prediction head then makes a further prediction on top of it.

A cosine similarity loss is computed at each step.
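
A minimal sketch of a per-step cosine similarity loss, assuming the loss is the negative cosine similarity between each predicted projection and its target, summed over the steps. The function names and the toy vectors are illustrative assumptions; zero-norm vectors are not handled.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def cosine_prediction_loss(predicted, targets):
    """Negative cosine similarity, summed over the prediction steps."""
    return -sum(cosine_similarity(p, t) for p, t in zip(predicted, targets))


# Two steps: the first prediction is perfectly aligned, the second is orthogonal.
predicted = [[1.0, 0.0], [0.0, 2.0]]
targets = [[2.0, 0.0], [1.0, 0.0]]
loss = cosine_prediction_loss(predicted, targets)
```

Because cosine similarity ignores vector length, the loss depends only on direction, which is what makes it a normalized loss.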

Self-Predictive Representations.

Related Works
● Data-Efficient RL
○ SimPLe: pixel-level transition model.
○ Data-Efficient Rainbow (DER) and OTRainbow
○ Latent-space models learned with a reconstruction loss
○ DrQ, RAD: with image augmentation, they outperform many model-based methods
○ Data augmentation also helps generalization in multi-task and transfer learning
SPR's approach can make even more effective use of data augmentation.

Related Works
● Representation Learning in RL:
○ CURL: image augmentation + contrastive loss.
■ Is the image augmentation what actually helps? (per RAD)
○ CPC, ST-DIM, DRIML: temporal contrastive losses.
○ DeepMDP: trains a transition model with an L2 loss.
■ Uses the online encoder to produce prediction targets, which is prone to representational collapse.
■ Adds an observation reconstruction objective.
○ PBL: directly predicts representations of future states.
■ Uses two target networks, focuses on multi-task generalization, and uses 100 times as much data as SPR.
SPR is self-supervised, trained in latent space, and uses a normalized loss,
with a target encoder and augmentations.

Experiments. Atari
Human-normalized scores: an evaluation metric that scales scores so that human performance is 1.0.
SPR is SOTA even without data augmentation.
(* denotes data augmentation. 100k steps, or 400k frames, per game.)
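
For reference, a small sketch of how a human-normalized score is commonly computed in Atari evaluations: the agent's score is rescaled so that a random policy maps to 0.0 and the human reference maps to 1.0. The random-policy baseline is the usual convention rather than something stated on the slide, and the numbers below are made-up illustrative values, not results from the paper.

```python
def human_normalized_score(agent, random, human):
    """Rescale a raw game score so that random play is 0.0 and human play is 1.0."""
    return (agent - random) / (human - random)


# Hypothetical raw scores for a single game.
score = human_normalized_score(agent=550.0, random=100.0, human=1000.0)
at_human_level = human_normalized_score(agent=1000.0, random=100.0, human=1000.0)
```

A value above 1.0 means the agent beats the human reference on that game.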

Experiments
SimPLe also looks good, but the picture is clearer when looking at the distribution of results: SPR is SOTA.

Experiments
Dynamics modeling consistently improves performance.

Discussion
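Observations
- The target encoder is important.
- With data augmentation, T = 0: the target encoder follows the online encoder exactly, so the two encoders are trained in parallel.
- Without augmentation, T = 0.99, so the target encoder is almost fixed.
- Dynamics modeling is important; K = 5.
- It performs better than the currently popular contrastive losses.
Future directions
- Judging from CV and NLP, RL may also move toward pre-training on large-scale datasets followed by fine-tuning.
- Use the model learned with SPR for model-based learning.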

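Impressions
- I read this paper because I found the approach of learning a model with self-supervised learning to tackle the sample-efficiency problem interesting.
- There is more prior work than I expected, so how does one establish novelty?
- I think the combination of self-supervised learning and model-based RL holds the most promise.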

References
- https://zhuanlan.zhihu.com/p/164842371
- https://arxiv.org/pdf/2006.07733.pdf
