Maxmin qlearning controlling the estimation bias of qlearning

Maxmin Q-learning: Controlling the Estimation Bias of Q-learning
Published as a conference paper at ICLR 2020, Qingfeng Lan et al.

Maxmin Q-Learning
유한한 MDP 상황에서 에이전트에게 특정 상태에서 어떤 액션(action)이 좋은지 알려주는 최선의 정책(Q-policy)을 학습하는 모델 프리(model-free) 강화 학습 알고리즘이다.
1. Q-Learning의 정의
Q-러닝에서의 액션은 특정 상태에서의 가능한 액션들에 대한 Q 값 중 최댓값을 취하며, 다음 상태에 대한 가능한 액션들의 (Q 값의) 최댓값을 바탕으로 그 값이 업데이트된다.
2. Q-Learning의 절차
Optimal
Policy
Q E=*(s,a) R R S, , A 는 각각 t시점의 보상, 상태, 액션max *Q (S a s,) |,t+1 S =t a ,A =t t t ta A∈+t+1[ ]
다음 상태와 특정 액션에 대한 Q의 최댓값을 바탕
으로 현재 상태와 액션에 대한 Q값을 업데이트 함.
epsilon-greedy 방법으로 액션을 산출
임의의 값으로 Q-Table을 초기화

Maxmin Q-Learning
Q-러닝의 문제점 중의 하나는 액션을 추정할 때 과추정(overestimation)이 발생되기 쉽다는 것이며, 이것은 Q-러닝의 업데이트 과정에서의 최댓값 추정에 기인한다.
3. Q-Learning의 문제점
Q (S a ) ← +α,t t Q Y -( ( for =
def
S a ) ),t t
Q
Yt r γ+t+1
Q
t Q (S a ),t t max Q(s a ),t+1a A∈
QEven unbiased estimates for ,∀ ∵(s a a) a ),t+1 Q (s a ),t+1 Q*= +(s e ( stochasticity)a,t+1
Q∴ is the optimal estimation for Q,E ([ ]s a ),t+1 QQ* = E ([ ]s a ),t+1
Qmax (s aa A∈ ),t+1 Q≥ (s a ),t+1Q (s a ),t+1 Q= (s a ) ,,t+1
QmaxE ([ ]s aa A∈ ),t+1 Qmax =≥ E ([ ]s aa A∈ ),t+1 Q*max (s aa A∈ ),t+1

Maxmin Q-Learning
기존의 Q 러닝에서 발생하였던 과추정(overestimation)의 문제를 해결하기 위해, 서로 다른 두 개의 Q함수를 바탕으로 액션을 선택 및 업데이트 하는 알고리즘을 말한다.
1. Double Q-Learning의 정의
Q러닝에서의 액션(action)은 특정 상태에서의 가능한 액션들에 대한 Q값 중 최댓값을 취하며, 다음 상태에 대한 가능한 액션들의 (Q값의) 최댓값을 바탕으로 그 값이 업데이트 된다.
2. Double Q-Learning의 구조
Algorithm 1 Double Q-learning
1: Initialize QA
,QB
,s
2: repeat
3: Choose a, based on QA
(s, ·) and QB
(s, ·), observe r, s
4: Choose (e.g. random) either UPDATE(A) or UPDATE(B)
5: if UPDATE(A) then
6: Define a∗
= arg maxa QA
(s , a)
7: QA
(s, a) ← QA
(s, a) + α(s, a) r + γQB
(s , a∗
) − QA
(s, a)
8: else if UPDATE(B) then
9: Define b∗
= arg maxa QB
(s , a)
10: QB
(s, a) ← QB
(s, a) + α(s, a)(r + γQA
(s , b∗
) − QB
(s, a))
11: end if
12: s ← s
13: until end
Lemma 2. Consider a stochastic process (ζt, ∆t, Ft), t ≥ 0, where ζt, ∆t, Ft : X → R satisfy the
equations:
∆t+1(xt) = (1 − ζt(xt))∆t(xt) + ζt(xt)Ft(xt) , (8)
where xt ∈ X and t = 0, 1, 2, . . .. Let Pt be a sequence of increasing σ-fields such that ζ0 and
∆0 are P0-measurable and ζt, ∆t and Ft−1 are Pt-measurable, t = 1, 2, . . . . Assume that the
following hold: 1) The set X is finite. 2) ζt(xt) ∈ [0, 1] , t ζt(xt) = ∞ , t(ζt(xt))2
< ∞ w.p.1
and ∀x = xt : ζt(x) = 0. 3) ||E{Ft|Pt}|| ≤ κ||∆t|| + ct, where κ ∈ [0, 1) and ct converges to
2
만약 Q(A)를 업데이트 시킨다면, 다음 상태에 대한 Q(B)값과의 차를
서로 다른 두 개의 Q 함수(또는 네트워크)를 생성 및 초기화.
두 개의 Q를 바탕으로 액션을 선택, 보상과 다음 상태를 획득(operations
for the estimation: sum, average, expectation etc)
만약 Q(B)를 업데이트 시킨다면, 다음 상태에 대한 Q(A)값과의 차를
바탕으로 업데이트를 진행함(대상은 임의 선택 등의 방법으로 선택).
Q-Learning
Double
Q-Learning과추정(Overestimation) 문제
과소추정(Underestimation)으로
과추정 문제를 해결하고자 함.

Maxmin Q-Learning
Q-함수를 바탕으로 한 추정(estimation)에 있어서 과추정 편향(overestimation bias)은 항상 문제가 되거나 이익이 되는 것이 아닌, 그 과제 환경의 특성에 따라 때로는 이익이
되기도 때로는 손해가 되기도 한다.
3. Overestimation Bias Helps and Hurts
B Left
8 actions
Right
Simple Episodic MDP (Sutton&Barto, 2018)
두 개의 종료(Terminal) 상태가 있음.
B에서의 8개의 액션에 따른 보상(reward)은 μ+U(-1,1) 분포를 따름.
A에서 시작하며 μ＞0 일 때 Left, μ＜0 일 때 Right 가 각각 최적의 액션(optimal action).
r=0
r ~ μ+U(-1, 1)
r=0
A
실험
환경 Published as a conference paper at ICLR 2020
(a) µ = +0.1 (overestimation helps) (b) µ = −0.1 (underestimation helps)
Figure 2: Comparison of three algorithms using the simple MDP in Figure 1 with different values
of µ, and thus different expected rewards. For µ = +0.1, shown in (a), the optimal -greedy policy
is to take the Left action with 95% probability. For µ = −0.1, shown in in (b), the optimal policy is
to take the Left action with 5% probability. The reported distance is the absolute difference between
the probability of taking the Left action under the learned policy compared to the optimal -greedy
policy. All results were averaged over 5, 000 runs.
to maintain N estimates of the action values, Qi
, and use the minimum of these estimates in the
Q-learning target: maxa mini∈{1,...,N} Qi
(s , a ). For N = 1, the update is simply Q-learning, and
so likely has overestimation bias. As N increase, the overestimation decreases; for some N > 1,
this maxmin estimator switches from an overestimate, in expectation, to an underestimate. We
실험
결과
(a) μ＞0 일 때, 최적의 정책(optimal policy)은 ‘Left’를 95%의 확률로 취하는 것(∵ε-greedy).
(b) μ＜0 일 때, 최적의 정책은 ‘Left’를 5%의 확률로 취하는 것(∵ε-greedy).
그래프 y 축에서의 거리(distance)는 최적의 정책과 각 모델의 Left 확률 값의 절대 차를 의미.
(a) 일 때에 모델의 정책이 과추정할수록 최적의 정책과 가까워짐(overestimation helps).
(b) 일 때에 모델의 정책이 과소추정할수록 최적의 정책과 가까워짐(underestimation helps).
Q-Learning은 과추정, Double Q-Learning은 과소추정하는 경향이 있다 > 둘 다 치우쳐(biased) 있다.
Published as a conference paper at ICLR 2020

Maxmin Q-Learning
We use this lemma to prove convergence of Double Q-learning under similar conditions as Q-
learning. Our theorem is as follows:
Theorem 1. Assume the conditions below are fulfilled. Then, in a given ergodic MDP, both QA
and
QB
as updated by Double Q-learning as described in Algorithm 1 will converge to the optimal value
function Q∗
as given in the Bellman optimality equation (2) with probability one if an infinite number
of experiences in the form of rewards and state transitions for each state action pair are given by
a proper learning policy. The additional conditions are: 1) The MDP is finite, i.e. |S × A| < ∞.
2) γ ∈ [0, 1). 3) The Q values are stored in a lookup table. 4) Both QA
and QB
receive an
infinite number of updates. 5) αt(s, a) ∈ [0, 1], t αt(s, a) = ∞, t(αt(s, a))2
< ∞ w.p.1, and
∀(s, a) = (st, at) : αt(s, a) = 0. 6) ∀s, a, s : Var{Rs
sa} < ∞.
A ‘proper’ learning policy ensures that each state action pair is visited an infinite number of times.
For instance, in a communicating MDP proper policies include a random policy.
Sketch of the proof. We sketch how to apply Lemma 2 to prove Theorem 1 without going into full
technical detail. Because of the symmetry in the updates on the functions QA
and QB
it suffices
to show convergence for either of these. We will apply Lemma 2 with Pt = {QA
0 , QB
0 , s0, a0, α0,
r1, s1, . . ., st, at}, X = S × A, ∆t = QA
t − Q∗
, ζ = α and Ft(st, at) = rt + γQB
t (st+1, a∗
) −
Q∗
t (st, at), where a∗
= arg maxa QA
(st+1, a). It is straightforward to show the first two conditions
of the lemma hold. The fourth condition of the lemma holds as a consequence of the boundedness
condition on the variance of the rewards in the theorem.
This leaves to show that the third condition on the expected contraction of Ft holds. We can write
Ft(st, at) = FQ
t (st, at) + γ QB
t (st+1, a∗
) − QA
t (st+1, a∗
) ,
where FQ
t = rt + γQA
t (st+1, a∗
) − Q∗
t (st, at) is the value of Ft if normal Q-learning would be
under consideration. It is well-known that E{FQ
t |Pt} ≤ γ||∆t||, so to apply the lemma we identify
ct = γQB
t (st+1, a∗
) − γQA
t (st+1, a∗
) and it suffices to show that ∆BA
t = QB
t − QA
t converges to
zero. Depending on whether QB
or QA
is updated, the update of ∆BA
t at time t is either
∆BA
t+1(st, at) = ∆BA
t (st, at) + αt(st, at)FB
t (st, at) , or
∆BA
t+1(st, at) = ∆BA
t (st, at) − αt(st, at)FA
t (st, at) ,
5
MaxMin Q-Learning은 기존의 Q-Learning의 과추정 편향과 Double Q-Learning의 과소추정 편향의 성질을 개선하기 위해, 여러 Q 함수를 구성하고 이 중 최소의 값을 가진
Q 함수를 바탕으로 임의 개의 Q를 업데이트하는 Q-러닝 모델이다.
1. Maxmin Q-Learning의 정의
Maxmin Q-Learning 에서 액션 추정(estimation)은 각 액션에 대한 (Q값의) 최솟값을 바탕으로 구성된 Q(min Q)의 최댓값을 바탕으로 결정되며, min Q를 업데이트 시에 타깃을
구성하는데에 사용한다.
2. Maxmin Q-Learning의 구조
N개의 Q-value 함수를 만들고 초기화함.
상태 s와 a에 대한 Q값들의 최솟값을 바탕으로 Q-table을 구성하고, 이라 함(reduce_min).이를 Qmin
Q min
Q i
집합 {1,...,N}에서 임의의 값(들)을 추출하여 부분 집합 S를 구성.
타깃 Y를 을 사용하여 구하고(max), 의 Q-value 값을 타깃값을
이용하여 업데이트.
Algorithm 1: Maxmin Q-learning
Input: step-size α, exploration parameter > 0, number of action-value functions N
Initialize N action-value functions {Q1
, . . . , QN
} randomly
Initialize empty replay buffer D
Observe initial state s
while Agent is interacting with the Environment do
Qmin
(s, a) ← mink∈{1,...,N} Qk
(s, a), ∀a ∈ A
Choose action a by -greedy based on Qmin
Take action a, observe r, s
Store transition (s, a, r, s ) in D
Select a subset S from {1, . . . , N} (e.g., randomly select one i to update)
for i ∈ S do
Sample random mini-batch of transitions (sD, aD, rD, sD) from D
Get update target: Y MQ
← rD + γ maxa ∈A Qmin
(sD, a )
Update action-value Qi
: Qi
(sD, aD) ← Qi
(sD, aD) + α[Y MQ
− Qi
(sD, aD)]
end
s ← s
end

Maxmin Q-Learning
learning; Maxmin Q-learning with N = 2 is not Double Q-learning.
The full algorithm is summarized in Algorithm 1, and is a simple modification of Q-learning with
experience replay. We use random subsamples of the observed data for each of the N estimators, to
make them nearly independent. To do this training online, we keep a replay buffer. On each step, a
random estimator i is chosen and updated using a mini-batch from the buffer. Multiple such updates
can be performed on each step, just like in experience replay, meaning multiple estimators can be
updated per step using different random mini-batches. In our experiments, to better match DQN, we
simply do one update per step. Finally, it is also straightforward to incorporate target networks to
get Maxmin DQN, by maintaining a target network for each estimator.
We now characterize the relation between the number of action-value functions used in Maxmin
Q-learning and the estimation bias of action values. For compactness, we write Qi
sa instead of
Qi
(s, a). Each Qi
sa has random approximation error ei
sa
Qi
sa = Q∗
sa + ei
sa.
We assume that ei
sa is a uniform random variable U(−τ, τ) for some τ > 0. The uniform random
assumption was used by Thrun & Schwartz (1993) to demonstrate bias in Q-learning, and reflects
that non-negligible positive and negative ei
sa are possible. Notice that for N estimators with nsa
samples, the τ will be proportional to some function of nsa/N, because the data will be shared
amongst the N estimators. For the general theorem, we use a generic τ, and in the following
corollary provide a specific form for τ in terms of N and nsa.
Recall that M is the number of actions applicable at state s . Define the estimation bias ZMN for
transition s, a, r, s to be
ZMN
def
= (r + γ max
a
Qmin
s a ) − (r + γ max
a
Q∗
s a )
= γ(max
a
Qmin
s a − max
a
Q∗
s a )
4
Maxmin Q-Learning은 다른 Q-러닝 모델에 비해 덜 편향적이며, 부분적으로 편향(bias)을 제어(control) 할 수 있다. 또한, 보상의 분산에 강건하다(robust)한 모습을 보인다.
3. Maxmin Q-Learning 의 특징
1) MaxMin Q-Learning은 Q-Learing 모델과 Double Q-Learning 모델에 비해 덜 편향적이다.
(a) µ = +0.1 (overestimation helps) (b) µ = −0.1 (underestimation helps)
Figure 2: Comparison of three algorithms using the simple MDP in Figure 1 with different values
of µ, and thus different expected rewards. For µ = +0.1, shown in (a), the optimal -greedy policy
is to take the Left action with 95% probability. For µ = −0.1, shown in in (b), the optimal policy is
to take the Left action with 5% probability. The reported distance is the absolute difference between
the probability of taking the Left action under the learned policy compared to the optimal -greedy
policy. All results were averaged over 5, 000 runs.
to maintain N estimates of the action values, Qi
, and use the minimum of these estimates in the
Q-learning target: maxa mini∈{1,...,N} Qi
(s , a ). For N = 1, the update is simply Q-learning, and
so likely has overestimation bias. As N increase, the overestimation decreases; for some N > 1,
this maxmin estimator switches from an overestimate, in expectation, to an underestimate. We
characterize the relationship between N and the expected estimation bias below in Theorem 1. Note
that Maxmin Q-learning uses a different mechanism to reduce overestimation bias than Double Q-
learning; Maxmin Q-learning with N = 2 is not Double Q-learning.
The full algorithm is summarized in Algorithm 1, and is a simple modification of Q-learning with
experience replay. We use random subsamples of the observed data for each of the N estimators, to
make them nearly independent. To do this training online, we keep a replay buffer. On each step, a
random estimator i is chosen and updated using a mini-batch from the buffer. Multiple such updates
can be performed on each step, just like in experience replay, meaning multiple estimators can be
updated per step using different random mini-batches. In our experiments, to better match DQN, we
simply do one update per step. Finally, it is also straightforward to incorporate target networks to
과추정이 도움이 되는 상황(overestimation helps)에서는 Q-Value 함수의 개수(N)에 따라 그 정도(편향의
정도)를 제어할 수 있는 것으로 보임.
과소추정이 도움이 되는 상황(underestimation helps)에서는 N에 따른 차이는 보이지 않지만, Q-Learning,
Double Q-learning에 비해 덜 편향적인 모습을 보임.
2) 보상의 분산(variance)이 증가하는 것에 강건(Robust)하다.
(a) Robustness under increasing reward variance (b) Reward ∼ N(−1, 10)
Figure 3: Comparison of four algorithms on Mountain Car under different reward variances. The
lines in (a) show the average number of steps taken in the last episode with one standard error. The
lines in (b) show the number of steps to reach the goal position during training when the reward
variance σ2
= 10. All results were averaged across 100 runs, with standard errors. Additional
환경: Mountain Car (Gym)
(a)는 보상의 분산이 점차 증가하였을 때 마지막 에피소드일 때의 평균 스텝 수를 나타낸 것이다.
(b)는 보상의 평균이 -1/분산이 10일 때, 에피소드 증가에 따른 목표까지의 스텝 수를 나타낸 것이다.
(a) Catcher (b) Lunar
(d) Breakout (e) Pixelc

Maxmin Q-Learning
4. 다양한 환경에서의 실험 결과
(a) Catcher (b) Lunarlander (c) Space Invaders
(d) Breakout (e) Pixelcopter (f) Asterix
(a) Catcher (b) Lunarlander (c) S
(d) Breakout (e) Pixelcopter
(g) Seaquest (h) Pixelcopter with varying N (i) Aster
Figure 4: Learning curves on the seven benchmark environments. The depict
over the last 100 episodes, and the curves are smoothed using an exponenti
previous reported results (Young & Tian, 2019). The results were averaged o
shaded area representing one standard error. Plots (h) and (i) show the perf
DQN on Pixelcopter and Asterix, with different N, highlighting that larger
slower early learning but better ﬁnal performance in both environments.
(a) Catcher (b) Lunarlander (c) Space Invaders
(d) Breakout (e) Pixelcopter (f) Asterix
(g) Seaquest (h) Pixelcopter with varying N (i) Asterix with varying N
Figure 4: Learning curves on the seven benchmark environments. The depicted return is averaged
over the last 100 episodes, and the curves are smoothed using an exponential average, to match
previous reported results (Young & Tian, 2019). The results were averaged over 20 runs, with the
(c) Space Invaders
(f) Asterix

Maxmin Q-Learning
5. Proofs for Maxmin Q-Learning
We now characterize the relation between the number of action-value functions used in Maxmin
Q-learning and the estimation bias of action values. For compactness, we write Qi
sa instead of
Qi
(s, a). Each Qi
sa has random approximation error ei
sa
Qi
sa = Q∗
sa + ei
sa.
We assume that ei
sa is a uniform random variable U(−τ, τ) for some τ > 0. The uniform random
assumption was used by Thrun & Schwartz (1993) to demonstrate bias in Q-learning, and reflects
that non-negligible positive and negative ei
sa are possible. Notice that for N estimators with nsa
samples, the τ will be proportional to some function of nsa/N, because the data will be shared
amongst the N estimators. For the general theorem, we use a generic τ, and in the following
corollary provide a specific form for τ in terms of N and nsa.
Recall that M is the number of actions applicable at state s . Define the estimation bias ZMN for
transition s, a, r, s to be
ZMN
def
= (r + γ max
a
Qmin
s a ) − (r + γ max
a
Q∗
s a )
= γ(max
a
Qmin
s a − max
a
Q∗
s a )
4
nsa samples for a single estimate,
V ar[Qmin
sa ] =
12N2
(N + 1)2(N + 2)
V ar[Qsa].
Under this uniform random noise assumption, for N ≥ 8, V ar[Qmin
sa ] < V ar[Qsa].
5 EXPERIMENTS
In this section, we first investigate robustness to reward variance, in a simple environment (Mountain
Car) in which we can perform more exhaustive experiments. Then, we investigate performance in
seven benchmark environments.
Robustness under increasing reward variance in Mountain Car Mountain Car (Sutton &
Barto, 2018) is a classic testbed in Reinforcement Learning, where the agent receives a reward
of −1 per step with γ = 1, until the car reaches the goal position and the episode ends. In our
experiment, we modify the rewards to be stochastic with the same mean value: the reward signal is
sampled from a Gaussian distribution N(−1, σ2
) on each time step. An agent should learn to reach
the goal position in as few steps as possible.
The experimental setup is as follows. We trained each algorithm with 1, 000 episodes. The number
of steps to reach the goal position in the last training episode was used as the performance measure.
The fewer steps, the better performance. All experimental results were averaged over 100 runs.
The key algorithm settings included the function approximator, step-sizes, exploration parameter
and replay buffer size. All algorithm used -greedy with = 0.1 and a buffer size of 100. For
each algorithm, the best step-size was chosen from {0.005, 0.01, 0.02, 0.04, 0.08}, separately for
each reward setting. Tile-coding was used to approximate the action-value function, where we used
8 tilings with each tile covering 1/8th of the bounded distance in each dimension. For Maxmin
Q-learning, we randomly chose one action-value function to update at each step.
As shown in Figure 3, when the reward variance is small, the performance of Q-learning, Double Q-
learning, Averaged Q-learning, and Maxmin Q-learning are comparable. However, as the variance
increases, Q-learning, Double Q-learning, and Averaged Q-learning became much less stable than
Maxmin Q-learning. In fact, when the variance was very high (σ = 50, see Appendix C.2), Q-
learning and Averaged Q-learning failed to reach the goal position in 5, 000 steps, and Double Q-
learning produced runs > 400 steps, even after many episodes.
6
(g) Seaquest (h) Pixelcopter w
Figure 4: Learning curves on the seven benchmark
over the last 100 episodes, and the curves are sm
previous reported results (Young & Tian, 2019). T
shaded area representing one standard error. Plots
DQN on Pixelcopter and Asterix, with different N
slower early learning but better final performance in
8
Theorem1.
Corollary 1.
samples for (s,a) Qand, for the estimator that uses all samples for a single estimate, Var =nsasa
Assuming the samples are evenly allocated the N estimators, then τ σ= where is the variance of²3 N nsa/²σnsa
ZIf ,MN maxr +γ( ) -Q
min
a s a maxr +γ( ) =Q*
a s a maxγ( -Q
min
a s a max )Q*
a s a
def
Z 1 where =
( ) … 1MM 1-
M 1+
M 1-
( )M + N
1
( )1… + N
1
( )M-1 + N
1
²N12
( )N 1+ ( N )² +2
2t-E MN MN tMNτγ[ [ ]]
Qsa
min
[ ] Var .Qsa[ ]
Z N
1
∝, ,E MN[ ] Z =γτ γτE M, N=1[ ] Zand E M, N→∞[ ] = -
def
MaxMin Q-Learning은 Q-Learing에 비해 과추정의 경향이 적다.
Lemma 1.
( ) and[ ]- E≤ ≤μ μi
f(x) F(x) Denote ,Xi[ ]E
def
μ .X Set+∞i X min1:N X andi{ }i 1, … , N∈
2
[ ]Var ＜
def def
X Denote the PDF and CDF ofmax
(N-1)σ
N:N X as f (x)1:N 1:N Fand , respectively. We then have,(x)1:NX .i
X1:N [ ]E ≤X1:N+1 [ ] .E X1:N
( )ii F (x) 1 - 1
N
-( ) .=1:N F(x) f f(x) (x) 1
N-1
-( ) .= N1:N F(x)
( )iii F (x) =
N
)( .N:N F(x) f f(x) (x)
N-1
( ) .= NN:N F(x)
{ }i 1, … , N∈
def
σ.and cumulative distribution function
Let X ,1 X …,2 , X be N i.i.d. R.V. from absolutely continuous distribution with probablity density functionN
( )iv If X …,1 X ~ U( ττ- , ) we have and., N X( )Var =1:N X( )Var ＜1:N+1 X( )Var ≤1:N X( )Var = , N 1, 2, …{ }.∈∀²σ1:1
4Nτ²
(N+1)²(N+2)
2n-1
(N-1)σ

Maxmin qlearning controlling the estimation bias of qlearning

More Related Content

What's hot (20)

Similar to Maxmin qlearning controlling the estimation bias of qlearning (20)

More from HyunKyu Jeon (20)

Recently uploaded (20)

Maxmin qlearning controlling the estimation bias of qlearning