Learning to Navigate in Complex Environments (paper reading group)

Learning to Navigate in
Complex Environments
2017.11.01
B4 Tatsuya Matsushima

Outline
Information
0. Abstract
1. Introduction
2. Approach
3. Related Work
4. Experiments
5. Analysis
6. Conclusion
Impressions

Information
● Authors
○ P. Mirowski, R. Pascanu, et al. (DeepMind)
● Submitted on 11 Nov 2016 (v1), last revised 13 Jan 2017 (this version, v3)
● Accepted to the ICLR 2017 Conference Track (poster session)
● arXiv
○ https://arxiv.org/abs/1611.03673
● OpenReview
○ https://openreview.net/forum?id=SJMGPrcle&noteId=SJMGPrcle

0. Abstract
● Shows that data efficiency and task performance in navigation can be
dramatically improved by training with additional auxiliary tasks
○ auxiliary tasks: depth prediction and loop closure classification
● With this approach, an agent learns to navigate from raw sensory input in
complicated 3D mazes, approaching human-level performance
○ even when the goal location changes frequently
● https://www.youtube.com/watch?v=JL8F82qUG-Q&feature=youtu.be

1. Introduction
● The ability to navigate efficiently is a fundamental intelligent behavior
○ conventional robotics methods (e.g. SLAM) tackle navigation through an explicit focus on
position inference and mapping
● The paper proposes that navigation abilities can emerge as a by-product of an
agent learning a policy that maximizes reward

1. Introduction
● Uses auxiliary tasks that provide denser training signals
○ these support navigation-relevant representation learning
○ 1. reconstruction of a low-dimensional depth map
■ expected to aid obstacle avoidance and short-term trajectory planning
○ 2. loop closure prediction
■ the agent is trained to predict whether the current location has been visited before

2. Approach
● End-to-end RL framework incorporating multiple objectives
○ maximize cumulative reward using an actor-critic method (A3C)
○ minimize the auxiliary losses for depth prediction and loop closure prediction
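As a rough illustration of the multi-objective setup above, the sketch below adds weighted auxiliary losses to the A3C loss. The function name `total_loss`, the weights `beta_d` and `beta_l`, and all numeric values are assumptions made for this example; the slides do not state how the losses are weighted.

```python
def total_loss(a3c_loss, depth_loss, loop_loss, beta_d=1.0, beta_l=1.0):
    """Combine the RL loss with the two auxiliary losses.

    beta_d and beta_l are hypothetical weighting coefficients; they are
    not taken from the slides.
    """
    return a3c_loss + beta_d * depth_loss + beta_l * loop_loss


# Illustrative numbers only: 0.5 + 0.1 * 2.0 + 2.0 * 0.25
loss = total_loss(a3c_loss=0.5, depth_loss=2.0, loop_loss=0.25,
                  beta_d=0.1, beta_l=2.0)
```

Because every term is minimized by the same gradient step, the auxiliary tasks shape the shared representation even at steps where no reward is received.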

2. Approach
● Depth prediction
○ bootstraps learning by building features that are useful for RL
○ using depth as the target of an additional loss is more valuable than feeding it directly as input (Appendix C)
○ predicting discretized depth (classification) works better than predicting continuous depth (regression)
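To make "depth as a classification task" concrete, here is a minimal sketch: each depth value is quantized into a class index and the prediction is scored with cross-entropy. The uniform binning, the default of 8 bins, and the function names are simplifying assumptions for illustration, not details taken from the slides.

```python
import math


def quantize_depth(depth, num_bins=8, max_depth=1.0):
    """Map a depth value in [0, max_depth] to a class index.

    Uniform bins are an assumption made for this sketch.
    """
    clipped = min(max(depth, 0.0), max_depth)
    return min(int(clipped / max_depth * num_bins), num_bins - 1)


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def depth_classification_loss(logits_per_pixel, depth_per_pixel, num_bins=8):
    """Mean cross-entropy between predicted class logits and quantized depth."""
    total = 0.0
    for logits, depth in zip(logits_per_pixel, depth_per_pixel):
        target = quantize_depth(depth, num_bins)
        total -= math.log(softmax(logits)[target])
    return total / len(depth_per_pixel)
```

A regression variant would instead penalize the squared difference between predicted and true depth; the slides report that the classification form works better.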

2. Approach
● Loop closure prediction
○ the loop closure label is 1 if the agent's current position is within a threshold distance of its past trajectory
○ position is mathematically the integral of velocity, so this task is thought to encourage implicit velocity integration (and thereby speed up position estimation)
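One possible way to compute such a label is sketched below. The slide only says the label is 1 when the agent is close to its past trajectory. The two thresholds `eta1` and `eta2`, their values, and the requirement that the agent first move away before a revisit counts (to avoid trivially matching the immediately preceding positions) are assumptions of this sketch.

```python
import math


def loop_closure_label(trajectory, eta1=1.0, eta2=2.0):
    """Return 1 if the latest position revisits an earlier one, else 0.

    A revisit needs an earlier point within eta1 of the current position,
    reached before the agent moved farther than eta2 away from it.
    eta1 and eta2 are hypothetical thresholds.
    """
    current = trajectory[-1]
    left_vicinity = False
    # Walk backwards in time, starting from the most recent earlier position.
    for past in reversed(trajectory[:-1]):
        distance = math.dist(current, past)
        if distance > eta2:
            left_vicinity = True
        elif left_vicinity and distance <= eta1:
            return 1
    return 0


# The agent walks out along one corridor and returns near its start.
out_and_back = [(0, 0), (1, 0), (2, 0), (3, 0), (3, 1),
                (2, 1), (1, 1), (0, 1), (0, 0.5)]
# The agent only moves forward and never comes back.
straight = [(0, 0), (0.5, 0), (1, 0)]
```

Here `loop_closure_label(out_and_back)` is 1 and `loop_closure_label(straight)` is 0.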

3. Related Work
● Playing FPS Games with Deep Reinforcement Learning
○ Guillaume Lample, Devendra Singh Chaplot (2016)
○ the performance of a DQN agent in the ViZDoom deathmatch environment can be enhanced by
adding a supervised auxiliary task
■ enemy detection task
○ https://arxiv.org/abs/1609.05521

3. Related Work
● Contribution of this paper
○ addresses how to learn an intrinsic representation of space, geometry, and movement while
simultaneously maximizing rewards through reinforcement learning
○ the method is validated in challenging maze domains with random start and goal locations

4. Experiments (needs editing)
● DeepMind Lab
○ extended so that additional observations are available (inertial information and local depth information)
● Task setting
○ 5 mazes
■ I-maze
■ 2 static mazes
● goal and fruit locations are fixed; only the agent's start location changes
■ 2 random goal mazes
● goal and fruits are randomly placed in every episode

4. Experiments
● Available inputs
○ camera image
○ depth map
○ velocity of the agent
● Actions
○ 8 discrete actions
■ forward, backward, sideways, rotation in small increments, rotational acceleration
● Rewards
○ goal reward
○ 'fruit' reward to encourage exploration

4. Experiments
● Models compared
○ FF A3C
○ LSTM A3C
○ Nav A3C
■ stacked LSTM with velocity, previous action, and reward as additional inputs
○ Nav A3C+D1
■ Nav A3C with depth prediction from the convolutional layer
○ Nav A3C+D2
■ Nav A3C with depth prediction from the last LSTM layer
○ Nav A3C+L
■ Nav A3C with loop closure prediction
○ Nav A3C+D1D2L
■ all of the above auxiliary losses combined

4. Experiments
● Compare models

4. Experiments
● Architecture

4. Experiments
● Results of Nav A3C*+D1L
○ I-maze
■ https://www.youtube.com/watch?v=PS4iJ7Hk_BU&feature=youtu.be
○ static maze
■ https://www.youtube.com/watch?v=-HsjQoIou_c&feature=youtu.be
■ https://www.youtube.com/watch?v=kH1AvRAYkbI&feature=youtu.be
○ random goal maze
■ https://www.youtube.com/watch?v=5IBT2UADJY0&feature=youtu.be
■ https://www.youtube.com/watch?v=e10mXgBG9yo&feature=youtu.be
○ the asterisk (*) denotes a model trained with reward clipping
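Reward clipping, as marked by the asterisk above, bounds each reward before it is used for learning. A minimal sketch follows; the range [-1, 1] and the function name are assumptions, since the slides do not give the clipping bounds.

```python
def clip_reward(reward, low=-1.0, high=1.0):
    """Clamp a raw environment reward into [low, high].

    The default bounds are assumed for illustration.
    """
    return max(low, min(high, reward))


raw_rewards = [10.0, 1.0, 0.0, -3.0]
clipped = [clip_reward(r) for r in raw_rewards]
```

With these bounds, a large goal reward and a small fruit reward both contribute at most 1 per step, which keeps the gradient scale comparable across mazes.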

4. Experiments
● Models are evaluated by training on the five mazes
○ Nav A3C+D2 agents reach human-level performance (the human baseline being a professional gamer) on Static 1 and 2, and attain about 91%
and 59% of human scores on Random Goal 1 and 2
■ better than models without auxiliary tasks
○ models that predict depth as a classification task score higher than those that treat it as regression
○ adding the auxiliary prediction targets of depth and loop closure (Nav A3C+D1D2L) speeds up
learning dramatically on most of the mazes (increases the AUC metric)
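The AUC metric here is the area under the learning curve (reward against training steps), so an agent that learns faster scores higher even if both agents end at the same final reward. The sketch below uses the trapezoidal rule; the function name and the example curves are invented for illustration and are not results from the paper.

```python
def learning_curve_auc(steps, rewards):
    """Area under a learning curve, computed with the trapezoidal rule."""
    if len(steps) != len(rewards) or len(steps) < 2:
        raise ValueError("need at least two (step, reward) points")
    area = 0.0
    for i in range(1, len(steps)):
        width = steps[i] - steps[i - 1]
        area += 0.5 * (rewards[i] + rewards[i - 1]) * width
    return area


# Made-up curves: both reach reward 100, but one gets there sooner.
steps = [0, 1, 2, 3]
fast_learner = learning_curve_auc(steps, [0, 80, 100, 100])
slow_learner = learning_curve_auc(steps, [0, 20, 60, 100])
```

Both example agents finish at the same reward, yet the faster learner has the larger area, which is the effect the slide attributes to the auxiliary tasks.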

5. Analysis
● Position decoding
○ train a position decoder that takes the agent's representation of location (the hidden units of the LSTM, or the
features in the last layer of the FF A3C agent) as input
○ the decoder is a linear classifier that outputs a multinomial probability distribution over discretized maze locations
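The decoder described above can be sketched as a single linear layer followed by a softmax. The function name, the toy weights, and the three-location maze are assumptions for illustration; in practice the weights would be fitted on recorded hidden states and ground-truth positions.

```python
import math


def decode_position(hidden, weights, biases):
    """Linear classifier: softmax(W h + b) over discretized maze locations."""
    logits = [sum(w * h for w, h in zip(row, hidden)) + b
              for row, b in zip(weights, biases)]
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Toy example: a 2-unit hidden state and 3 candidate locations.
toy_weights = [[2.0, 0.0], [0.0, 2.0], [0.0, 0.0]]
toy_biases = [0.0, 0.0, 0.0]
probs = decode_position([1.0, 0.0], toy_weights, toy_biases)
most_likely = probs.index(max(probs))
```

Keeping the decoder linear means that high decoding accuracy reflects position information already present in the agent's representation rather than computation done by the decoder itself.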

5. Analysis
● Position decoding
○ initial uncertainty about position decreases as the agent acquires more observations
○ position entropy spikes after a respawn, then decreases once the agent becomes certain of
its location (i.e. once it recognizes a place it has already explored)
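The position entropy mentioned above is the Shannon entropy of the decoder's distribution over maze locations. A minimal sketch, using natural logarithms and made-up distributions:

```python
import math


def position_entropy(probs):
    """Shannon entropy (in nats) of a distribution over maze locations."""
    return -sum(p * math.log(p) for p in probs if p > 0)


# Right after a respawn: no idea where the agent is (uniform over 4 cells).
after_respawn = position_entropy([0.25, 0.25, 0.25, 0.25])
# After recognizing a familiar place: nearly all mass on one cell.
after_localizing = position_entropy([0.97, 0.01, 0.01, 0.01])
```

A uniform distribution gives the maximum entropy, log of the number of cells, and the value falls toward zero as the decoder concentrates on a single location.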

5. Analysis
● Stacked LSTM goal analysis
○ visualize the agent's policy (LSTM activations) using t-SNE dimensionality reduction
○ 4 clear clusters appear for the LSTM A3C agent, but only 2 for the Nav A3C agent
■ the policy-dictating LSTM of Nav A3C appears to maintain an efficient representation of 2
sub-policies, with critical information about the currently relevant goal provided by the
additional (upstream) LSTM

5. Analysis
● Stacked LSTM goal analysis

5. Analysis
● Different combinations of auxiliary tasks
○ compare reward prediction against depth prediction, using a model that predicts reward instead of depth
■ reward prediction improves upon the plain stacked LSTM, but not as much as depth
prediction from the policy LSTM (Nav A3C+D2)

6. Conclusion
● Proposes a deep RL method, augmented with memory (LSTM) and auxiliary
learning targets, for training agents to navigate within large and visually
rich environments
○ highlights the utility of un/self-supervised auxiliary objectives in providing richer training
signals that bootstrap learning and enhance data efficiency
○ unlike UNREAL, the auxiliary task losses are computed online

6. Conclusion
● In the future it will be important to combine visually complex environments
with architectures that make use of external memory, in order to enhance the
navigational abilities of agents

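Impressions
● There is apparently a sequel to this work (StreetLearn)
○ Nikkei Robotics article
■ http://techon.nikkeibp.co.jp/atcl/mag/15/00140/00019/?ST=print
○ apparently not yet published as a paper (reportedly, only Nikkei Robotics has covered it)
● It performs navigation tasks using Street View images
○ the auxiliary tasks are "estimating the agent's own absolute heading (16 directions)" and "predicting the next node"
○ neither velocity nor depth maps are needed (a difference from this paper)
■ plain RGB images alone are enough
○ uses 3 LSTMs
■ the LSTM that handles relative processing is apparently fairly general-purpose and can be treated as a module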

Impressions
● There is apparently a sequel to this work (StreetLearn)

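Impressions
● If the approach could be formulated for tasks other than navigation, it would likely have considerable impact
○ the auxiliary tasks, "depth and loop closure prediction", are designed quite deliberately for navigation
■ e.g. what would they be for robot arm manipulation?
○ this was also pointed out in the Open Review
■ Methodology does seem a bit ad-hoc, it would be nice to see if some of the auxiliary
task mechanisms could be formalized beyond simple "this is what worked for this
domain"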

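Impressions
○ Experiments are reported in Appendix C3
■ Seek-Avoid Arena
● picking up an apple gives +1 reward; picking up a lemon gives -1
■ Stairway to Melon
● picking up an apple gives +1 reward, but the agent dies if it keeps going that way; picking up the lemon first gives -1, but
continuing on leads to the melon, worth +10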

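Impressions
○ Experiments are reported in Appendix C3
■ agents with auxiliary tasks reach a high Reward AUC faster
■ performance also improves on tasks other than navigation
● this suggests the approach may apply more broadly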

Impressions
● StreetLearnの補助タスクもどう設計すれば良い...?
（無理やり解釈)
NYCがどういう世界
であるかを大雑把に
学習している？？？
33
