Multi-agent Reinforcement Learning in Sequential Social Dilemmas

MULTI-AGENT REINFORCEMENT
LEARNING IN SEQUENTIAL SOCIAL
DILEMMAS
AI Lab, Kenshi Abe
2019/05/10

Summary
■ Joel Z. Leibo, Vinicius Zambaldi, Marc Lanctot, Janusz Marecki, Thore Graepel
– DeepMind
■ AAMAS 2017
■ Contribution
– Investigates how agents interact in multi-agent RL
– Proposes the Sequential Social Dilemma (SSD), which captures features of real-world social dilemmas
■ DeepMind blog post: https://deepmind.com/blog/understanding-agent-cooperation/

Matrix Game
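■ Used as a model for explaining social dilemmas
■ Typical setting
– A game between two agents
– Agent actions: cooperate or defect
– Rewards
■ 𝑅: both agents cooperate (reward)
■ 𝑃: both agents defect (punishment)
■ 𝑆: you cooperated and the other agent defected (sucker)
■ 𝑇: you defected and the other agent cooperated (temptation)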

Matrix Game
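■ Conditions for a matrix game to be a social dilemma
– 𝑅 > 𝑃: mutual cooperation is preferred to mutual defection
– 𝑅 > 𝑆: mutual cooperation is preferred to being exploited
– 2𝑅 > 𝑇 + 𝑆: mutual cooperation is the best outcome socially
– Either 𝑇 > 𝑅 (greed): exploiting a cooperator is preferred to mutual cooperation,
or 𝑃 > 𝑆 (fear): mutual defection is preferred to being exploited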
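The inequalities above can be checked mechanically. The following is a minimal sketch; the function name `is_social_dilemma` and the example payoff values are illustrative assumptions, not taken from the slides.

```python
def is_social_dilemma(R: float, P: float, S: float, T: float) -> bool:
    """Return True if the payoffs satisfy the matrix-game social dilemma conditions."""
    mutual_cooperation_preferred = R > P and R > S
    cooperation_socially_best = 2 * R > T + S
    greed = T > R  # temptation to exploit a cooperating partner
    fear = P > S   # fear of being exploited by a defecting partner
    return mutual_cooperation_preferred and cooperation_socially_best and (greed or fear)


# Hypothetical payoffs with both greed and fear.
print(is_social_dilemma(R=3, P=1, S=0, T=4))
# Hypothetical payoffs with neither greed nor fear.
print(is_social_dilemma(R=4, P=1, S=2, T=3))
```

The first call satisfies every condition; the second fails only the final "greed or fear" clause, so cooperation there carries no dilemma.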

Matrix Game Social Dilemmas
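■ Chicken: if the other agent cooperates, defect alone (greed)
■ Stag Hunt: if the other agent defects, defect as well (fear)
■ Prisoner's Dilemma: defect regardless of the other agent's strategy (greed and fear)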
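The three games differ only in which of the greed and fear inequalities hold. The sketch below classifies a payoff tuple accordingly; `classify_game` and the numeric payoffs are illustrative assumptions chosen to satisfy the respective inequalities.

```python
def classify_game(R: float, P: float, S: float, T: float) -> str:
    """Name the social dilemma type implied by the payoffs (R, P, S, T)."""
    greed = T > R
    fear = P > S
    is_dilemma = R > P and R > S and 2 * R > T + S and (greed or fear)
    if not is_dilemma:
        return "not a social dilemma"
    if greed and fear:
        return "Prisoner's Dilemma"
    return "Chicken" if greed else "Stag Hunt"


# Hypothetical example payoffs, given as (R, P, S, T).
for payoffs in [(3, 0, 1, 4), (4, 1, 0, 3), (3, 1, 0, 4)]:
    print(payoffs, "->", classify_game(*payoffs))
```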

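Matrix Game Social Dilemmas and the real world
■ Matrix game social dilemmas (MGSDs) ignore several aspects of real-world social dilemmas
– Real dilemmas unfold over time
– Cooperativeness is a graded quantity, not a binary choice
– Players make decisions with only partial information about the state and about other players
→ The Sequential Social Dilemma (SSD) is proposed to capture these aspects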

Markov Games
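■ Two-player partially observable Markov game: $M$
– State space: $S$
– Observation function of each agent: $O : S \times \{1, 2\} \to \mathbb{R}^d$
– Action spaces: $A_1, A_2$
– State transition function: $\tau : S \times A_1 \times A_2 \to \Delta(S)$
– Reward function: $r_i : S \times A_1 \times A_2 \to \mathbb{R}$
– Policy: $\pi_i : O_i \to \Delta(A_i)$
$\Delta(X)$: the set of probability distributions over $X$
$O_i = \{o_i \mid s \in S,\ o_i = O(s, i)\}$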

Markov Games
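■ Payoffs of a cooperative policy $\pi^C$ and a defecting policy $\pi^D$
– $R(s) := V_1^{\pi^C, \pi^C}(s) = V_2^{\pi^C, \pi^C}(s)$
– $P(s) := V_1^{\pi^D, \pi^D}(s) = V_2^{\pi^D, \pi^D}(s)$
– $S(s) := V_1^{\pi^C, \pi^D}(s) = V_2^{\pi^D, \pi^C}(s)$
– $T(s) := V_1^{\pi^D, \pi^C}(s) = V_2^{\pi^C, \pi^D}(s)$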
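The payoffs above are written in terms of a value function $V_i$ that this slide does not restate. In the paper (Eq. 5) it is the expected discounted return of player $i$ when the two players follow $\pi_1$ and $\pi_2$ from state $s_0$. Written in the notation of the previous slide, with the discount factor $\gamma$ and the per-player action indexing $a_{i,t}$ as notational assumptions:

```latex
V_i^{\pi_1, \pi_2}(s_0)
  = \mathbb{E}\left[ \sum_{t=0}^{\infty} \gamma^t \, r_i(s_t, a_{1,t}, a_{2,t}) \right],
\qquad
a_{i,t} \sim \pi_i\big(O(s_t, i)\big), \quad
s_{t+1} \sim \tau(s_t, a_{1,t}, a_{2,t})
```

In each payoff definition, the first superscript is player 1's policy and the second is player 2's, which is why $S$ and $T$ swap the superscripts between $V_1$ and $V_2$.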

Sequential Social Dilemma
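■ SSD: $(M, \Pi^C, \Pi^D)$
■ Payoffs: $(R, P, S, T) := (R(s_0), P(s_0), S(s_0), T(s_0))$
■ Cooperative policies $\pi^C$ and defecting policies $\pi^D$
– $\Pi^C$ and $\Pi^D$ are separated by thresholds
– Example: $\alpha(\pi) < \alpha_c \iff \pi \in \Pi^C$ and $\alpha(\pi) > \alpha_d \iff \pi \in \Pi^D$
$\Pi^C, \Pi^D$: the sets of cooperative and defecting policies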
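The threshold rule can be sketched as follows, assuming the policy property α is a single scalar such as a measured beam-use rate. The function name `policy_set`, the requirement that the cooperative threshold not exceed the defecting one, and the example numbers are assumptions made for illustration.

```python
from typing import Optional


def policy_set(alpha: float, alpha_c: float, alpha_d: float) -> Optional[str]:
    """Assign a policy to the cooperative or defecting set by its measured property alpha.

    Returns "C" if alpha < alpha_c, "D" if alpha > alpha_d, and None for
    policies that fall between the two thresholds.
    """
    if alpha_c > alpha_d:
        # Assumed sanity check: otherwise a policy could belong to both sets.
        raise ValueError("alpha_c must not exceed alpha_d")
    if alpha < alpha_c:
        return "C"
    if alpha > alpha_d:
        return "D"
    return None


# Hypothetical thresholds on a beam-use rate.
print(policy_set(0.05, alpha_c=0.1, alpha_d=0.3))
print(policy_set(0.50, alpha_c=0.1, alpha_d=0.3))
print(policy_set(0.20, alpha_c=0.1, alpha_d=0.3))
```

Policies between the two thresholds belong to neither set, which is what lets cooperativeness be treated as graded and not binary.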

Relationship between Markov games, MGSDs, and SSDs

Setting
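■ Gathering, Wolfpack
– Observation: RGB values of a 30×10 grid
– Actions: step forward, backward, left, or right; rotate left or right; stand still; fire the beam (8 in total)
■ Trained with Deep Q-Networks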
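Each agent learns independently with Q-learning and treats the other agent as part of the environment. The sketch below shows the underlying update, Q(s, a) ← Q(s, a) + α [r + γ max Q(s', a') − Q(s, a)], in tabular form. This is a simplified stand-in: the paper trains a neural network on batches of stored transitions, and the values of `alpha` and `gamma`, the state labels, and the action index are placeholders.

```python
from collections import defaultdict

N_ACTIONS = 8  # number of actions stated on the slide


def q_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.99):
    """Apply one Q-learning update in place and return the TD error."""
    best_next = max(Q[s_next])                   # max over a' of Q(s', a')
    td_error = r + gamma * best_next - Q[s][a]
    Q[s][a] += alpha * td_error
    return td_error


# Q-table with all values initialised to zero (tabular stand-in for the network).
Q = defaultdict(lambda: [0.0] * N_ACTIONS)
q_update(Q, s="s0", a=6, r=1.0, s_next="s1")
print(Q["s0"][6])
```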
[Figure 3 from the paper. Left: Gathering. Right: Wolfpack.]

Gathering
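■ A game about collecting green apples
■ A player hit by the beam twice is removed from the game for $N_{\text{tagged}}$ frames
■ Rewards
– +1 for collecting an apple
– A collected apple respawns after $N_{\text{apple}}$ frames
– No reward for hitting another player with the beam or for being hit
■ Analyzes how the agents' aggressiveness (beam-use frequency) changes as $N_{\text{apple}}$ and $N_{\text{tagged}}$ are varied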
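The tagging rule can be sketched as a small piece of per-player state. This is not the actual environment code: the class name `TagState`, its fields, and the choice to reset the hit counter after a removal are assumptions made for illustration, and apple respawning is not modelled.

```python
from dataclasses import dataclass


@dataclass
class TagState:
    """Tracks beam hits and time-outs for one Gathering player (illustrative only)."""
    n_tagged: int        # frames a player stays removed after the second hit
    hits: int = 0        # hits taken since the last removal
    frames_out: int = 0  # remaining frames of removal

    def hit_by_beam(self) -> None:
        if self.frames_out > 0:
            return  # already removed; further hits have no effect (assumption)
        self.hits += 1
        if self.hits >= 2:
            self.hits = 0  # assumed: the counter resets once the player is removed
            self.frames_out = self.n_tagged

    def step(self) -> None:
        """Advance one frame."""
        if self.frames_out > 0:
            self.frames_out -= 1

    @property
    def active(self) -> bool:
        return self.frames_out == 0


player = TagState(n_tagged=3)
player.hit_by_beam()
player.hit_by_beam()
print(player.active)  # removed after the second hit
```

While a player is removed, the other player can collect apples uncontested, which is the only payoff from using the beam, since tagging itself earns no reward.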

Gathering
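■ Scarce apples or a high cost of being tagged
→ policies become more aggressive (defecting)
■ When resources are scarce, agents are prone to conflict
■ When resources are abundant, agents are less prone to conflict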

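Analysis of payoff matrices
1. Sample $(\pi_1^C, \pi_1^D)$ and $(\pi_2^C, \pi_2^D)$ from $\Pi^C$ and $\Pi^D$
2. Run episodes for each of the pairings $(\pi_1^C, \pi_2^C)$, $(\pi_1^C, \pi_2^D)$, $(\pi_1^D, \pi_2^C)$, $(\pi_1^D, \pi_2^D)$
3. Estimate 𝑅, 𝑃, 𝑆, 𝑇 from the rewards obtained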
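The three steps above can be sketched as a Monte Carlo estimate. Everything named here is hypothetical: `run_episode(p1, p2)` stands for whatever simulator returns the two players' episode returns, and `toy_episode` is a stub that looks up fixed payoffs so the example runs without an environment.

```python
import random
from statistics import mean


def estimate_payoffs(coop_policies, defect_policies, run_episode, n_samples=100, seed=0):
    """Estimate (R, P, S, T) by sampling policy pairs and averaging episode returns."""
    rng = random.Random(seed)
    samples = {"R": [], "P": [], "S": [], "T": []}
    for _ in range(n_samples):
        # Step 1: sample one cooperative and one defecting policy per player.
        c1, c2 = rng.choice(coop_policies), rng.choice(coop_policies)
        d1, d2 = rng.choice(defect_policies), rng.choice(defect_policies)
        # Step 2: run one episode for each of the four pairings.
        cc, dd = run_episode(c1, c2), run_episode(d1, d2)
        cd, dc = run_episode(c1, d2), run_episode(d1, c2)
        # Step 3: collect returns, using both players' perspectives.
        samples["R"] += [cc[0], cc[1]]
        samples["P"] += [dd[0], dd[1]]
        samples["S"] += [cd[0], dc[1]]  # return of the cooperator facing a defector
        samples["T"] += [dc[0], cd[1]]  # return of the defector facing a cooperator
    return {name: mean(values) for name, values in samples.items()}


def toy_episode(p1, p2):
    """Stub simulator: fixed hypothetical payoffs for policies labelled "C" and "D"."""
    table = {("C", "C"): (3, 3), ("C", "D"): (0, 4), ("D", "C"): (4, 0), ("D", "D"): (1, 1)}
    return table[(p1, p2)]


print(estimate_payoffs(["C"], ["D"], toy_episode, n_samples=5))
```

Averaging over both players relies on the symmetry built into the payoff definitions, for example that the cooperator's return against a defector is the same quantity for either player.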

Gathering
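■ $\Pi^C, \Pi^D$: sets of policies trained in environments where $N_{\text{apple}} / N_{\text{tagged}}$ is high and low
■ Where a social dilemma arose, it was almost always a Prisoner's Dilemma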
[Figure 6 from the paper: Summary of matrix games discovered within Gathering (left) and Wolfpack (right) through extracting empirical payoff matrices. The games are classified by social dilemma type, indicated by color (caption truncated). Axis labels: 𝑇 − 𝑅 and 𝑃 − 𝑆.]

Wolfpack
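■ A game in which two players chase a prey
■ Rewards
– When either player captures the prey, every player within a given range ($r_{\text{radius}}$) receives a reward
– One player in range: $r_{\text{lone}}$
– Two players in range: $r_{\text{team}}$
■ Analyzes how the agents' degree of cooperation (the number of players receiving reward per capture) changes as $r_{\text{team}} / r_{\text{lone}}$ and $r_{\text{radius}}$ are varied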
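The reward rule can be sketched as a function evaluated at the moment of capture. The function name, the use of a plain list of distances, the inclusive `<=` comparison, and the example values are all assumptions made for illustration.

```python
def wolfpack_rewards(dist_to_prey, r_radius, r_lone, r_team):
    """Return the reward for each player when the prey is captured.

    dist_to_prey holds each player's distance to the prey at capture time;
    the capturing player is assumed to be at distance 0.
    """
    in_range = [d <= r_radius for d in dist_to_prey]
    payout = r_team if sum(in_range) >= 2 else r_lone
    return [payout if inside else 0.0 for inside in in_range]


# Hypothetical values: the team capture pays more than the lone capture.
print(wolfpack_rewards([0, 2], r_radius=3, r_lone=1.0, r_team=5.0))
print(wolfpack_rewards([0, 7], r_radius=3, r_lone=1.0, r_team=5.0))
```

Raising `r_team` relative to `r_lone`, or enlarging `r_radius`, makes it more worthwhile to stay near the partner, which is the lever the analysis varies.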

Wolfpack
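■ A large team reward or a large capture radius
→ policies become more cooperative
■ Two distinct cooperative policies emerged
– Find each other first → move together and capture the prey
– Find the prey first → wait for the partner to arrive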
[Figure 4 from the paper: Social outcomes are influenced by environment parameters. Top: Gathering; beam-use rate (aggressiveness) as a function of the respawn time of apples $N_{\text{apple}}$ (abundance) and the respawn time of agents $N_{\text{tagged}}$ (conflict cost).]
Wolfpack
■ $\Pi^C, \Pi^D$: sets of policies trained in environments where $r_{\text{team}} \cdot r_{\text{radius}}$ is high and low
■ Chicken, Stag Hunt, and Prisoner's Dilemma all occurred
[Figure 6 from the paper, Wolfpack panel (right). Axis labels: 𝑇 − 𝑅 and 𝑃 − 𝑆.]

Agent parameters influencing the
emergence of defection
■ Discount factor
– A larger value makes agents more prone to defect
– Gathering: removing the other player makes it easier to collect rewards later

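■ Batch size
– A larger value makes agents more cooperative
– Reason: a larger batch contains more experience about the other agent
■ Gathering: the beam becomes easier to avoid
■ Wolfpack: more opportunities to chase the prey together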

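■ Number of hidden units
– Gathering: a larger network makes agents more prone to defect
→ Defecting behavior (firing the beam) requires aiming at the opponent, so it is the more complex behavior
– Wolfpack: a larger network makes agents more cooperative
→ Cooperative behavior (hunting the prey as a pair) requires the players to coordinate their actions, so it is the more complex behavior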

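Differences between Gathering and Wolfpack
■ Both Gathering and Wolfpack have Prisoner's Dilemma-like properties
■ When modeled as MGSDs, the two games show no difference
■ SSDs capture the sequential structure, so the difference becomes visible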

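Aspects of learning that MGSDs cannot capture
■ Agents must learn how to implement a cooperative or defecting strategy at the same time as deciding which one to adopt
■ Learning how to implement a cooperative or defecting strategy may itself be difficult
■ Implementing a cooperative or defecting strategy may require fine-grained coordination (though this is not guaranteed).
Nor is there any guarantee that cooperative and defecting strategies require the same degree of coordination.
■ Multiple distinct cooperative or defecting strategies may exist
■ Learning to implement a cooperative strategy and learning to implement a defecting strategy may not be equally complex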

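Summary and impressions
■ Whether a strategy is cooperative or defecting, one has to consider how easy it is to implement in the first place
■ Rather than the SSD framework itself revealing the difference between the two games, it may simply be that the learned outcomes turned out to differ when inspected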
