Mastering the game of Go
A deep neural networks and tree search approach
Alessandro Cudazzo
Department of Computer Science
University of Pisa
alessandro@cudazzo.com
ISPR Midterm IV, 2020
Back in time: 2015
We are in 2015 and you may be wondering why mastering the ancient Chinese game of
Go is an important challenge for researchers in the artificial intelligence field.
First, let’s start from the beginning of the story:
A 19x19 board game with approximately b^d possible sequences of moves (b ≈ 250, d ≈ 150).
It is a game of perfect information and can be formulated as a zero-sum game.
Figure: a Go board state.
Exhaustive search is infeasible: the optimal value function v∗(s), which determines the outcome of
the game from any state s under perfect play by both players, cannot be computed.
Depth reduction with an approximate value function: v(s) ≈ v∗(s)
Breadth reduction by sampling actions from a policy function: P(a|s)
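For scale, a quick back-of-envelope calculation (not from the slides, just to make the numbers concrete) shows why exhaustive search is hopeless:

```python
import math

# Rough game-tree size for Go: breadth b ~ 250 legal moves, depth d ~ 150 moves per game.
b, d = 250, 150
digits = int(d * math.log10(b)) + 1  # number of decimal digits in b**d
print(f"b^d has about {digits} decimal digits")  # ~360 digits, far beyond any exhaustive search
```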
At that time, the strongest Go program was based on MCTS: Pachi [1], ranked at 2
amateur dan on KGS. Experts agreed that the major stumbling block to creating
stronger-than-amateur Go programs was the lack of a good position evaluation function [2].
Supervised learning of policy networks
So, DeepMind had a clear view: they had to find better policy and value functions with
deep learning and efficiently combine both with the MCTS heuristic search algorithm.
Supervised learning approach ⇒ SL policy network pσ(a|s):
It takes a 19x19x48 stack of input feature planes to represent the board.
A 13-layer deep convolutional NN with a softmax output.
Trained on a dataset of 30M state-action pairs (s, a) by stochastic gradient ascent to
maximize the likelihood of the human move a selected in state s:
∆σ ∝ ∂ log pσ(a|s) / ∂σ
One issue: it takes 3 ms to evaluate a position, too slow for rollouts!
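As a rough illustration, here is a PyTorch-style sketch of such a 13-layer convolutional policy network; the filter count, kernel sizes and learning rate are assumptions rather than the paper's exact configuration, and minimizing cross-entropy below is the same as maximizing the log-likelihood above.

```python
import torch
import torch.nn as nn

class PolicyNet(nn.Module):
    """Convolutional policy network: 19x19x48 input planes -> logits over the 361 board points."""
    def __init__(self, planes=48, filters=192, hidden_layers=11):
        super().__init__()
        layers = [nn.Conv2d(planes, filters, 5, padding=2), nn.ReLU()]
        for _ in range(hidden_layers):                        # intermediate 3x3 convolutions
            layers += [nn.Conv2d(filters, filters, 3, padding=1), nn.ReLU()]
        layers += [nn.Conv2d(filters, 1, 1)]                  # final 1x1 convolution: one logit per point
        self.body = nn.Sequential(*layers)

    def forward(self, x):                                     # x: (batch, 48, 19, 19)
        return self.body(x).flatten(1)                        # (batch, 361); softmax is applied in the loss

policy = PolicyNet()
opt = torch.optim.SGD(policy.parameters(), lr=3e-3)
states = torch.randn(16, 48, 19, 19)                          # dummy mini-batch of (s, a) pairs
moves = torch.randint(0, 361, (16,))
loss = nn.functional.cross_entropy(policy(states), moves)     # negative log-likelihood of the human moves
opt.zero_grad(); loss.backward(); opt.step()                  # one stochastic-gradient step on log p_sigma(a|s)
```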
So a faster but less accurate rollout policy pπ(a|s), a linear softmax of small pattern
features, was also trained; it reaches 24.2% accuracy and takes only 2 µs to select a move.
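For intuition only, a linear softmax over binary pattern features could be sketched as below; the feature extraction itself is stubbed out and is an assumption.

```python
import numpy as np

def rollout_policy(pattern_features, weights):
    """Linear softmax of small pattern features: cheap enough for fast rollouts.

    pattern_features: (num_legal_moves, num_features) binary matrix, one row per candidate move
    weights:          (num_features,) learned weight vector
    """
    scores = pattern_features @ weights
    probs = np.exp(scores - scores.max())   # numerically stable softmax
    return probs / probs.sum()
```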
The network pσ surpassed the state of the art in terms of accuracy: 57% vs 44.4%.
Small improvements in accuracy led to large improvements in playing strength.
Reinforcement learning policy and value networks
By considering the best current player, who uses a certain policy p, we can approximate its value function v^p(s).
Initialize an RL policy network with ρ = σ and improve it in order to find the best player:
1 Play the current pρ against a randomly selected previous iteration of the policy
network, picked from a pool of opponents to prevent overfitting to a single one.
2 Improve it by policy-gradient reinforcement learning:
∆ρ ∝ ∂ log pρ(at|st) / ∂ρ · zt,  with zt = ±r(sT) the terminal reward from the current player's
perspective: r(st) = 0 at non-terminal time steps t < T, and r(sT) = ±1 for winning/losing.
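A minimal REINFORCE-style sketch of one such update, reusing the PolicyNet sketch above; the opponent pool, baseline subtraction and mini-batching over games are omitted here.

```python
import torch

def reinforce_update(policy, optimizer, states, actions, z):
    """One policy-gradient step: scale log-probabilities of the played moves by the game outcome.

    states:  (T, 48, 19, 19) positions faced by the learner in one self-play game
    actions: (T,) moves it actually played
    z:       +1.0 if the learner won the game, -1.0 otherwise
    """
    log_probs = torch.log_softmax(policy(states), dim=1)
    chosen = log_probs[torch.arange(len(actions)), actions]
    loss = -(z * chosen).mean()                  # minimizing this ascends z * d(log p_rho)/d(rho)
    optimizer.zero_grad(); loss.backward(); optimizer.step()
```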
Since v∗(s) is infeasible to compute, we train a value network vθ(s) ≈ v^pρ(s) ≈ v∗(s):
A regression NN on state-outcome pairs, with an architecture similar to the policy
network but with a single output, trained with SGD and MSE as the loss.
It uses a self-play dataset of 30M distinct positions, each sampled from a separate
game; each game was played by pρ against itself until the end.
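A hedged sketch of the value-network regression: the exact head (fully connected sizes, tanh output) is an assumption, while the single scalar output and the SGD/MSE training match the slide.

```python
import torch
import torch.nn as nn

class ValueNet(nn.Module):
    """Same convolutional idea as the policy network, but a single scalar output in [-1, 1]."""
    def __init__(self, planes=48, filters=192):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(planes, filters, 5, padding=2), nn.ReLU(),
            nn.Conv2d(filters, 1, 1), nn.ReLU(), nn.Flatten(),
            nn.Linear(19 * 19, 256), nn.ReLU(),
            nn.Linear(256, 1), nn.Tanh(),
        )

    def forward(self, x):                                     # x: (batch, 48, 19, 19)
        return self.body(x).squeeze(1)                        # predicted outcome v_theta(s)

value_net = ValueNet()
opt = torch.optim.SGD(value_net.parameters(), lr=1e-2)
states = torch.randn(32, 48, 19, 19)                          # dummy self-play positions
outcomes = torch.randint(0, 2, (32,)).float() * 2 - 1         # game outcomes z in {-1, +1}
loss = nn.functional.mse_loss(value_net(states), outcomes)    # SGD on MSE, as in the slide
opt.zero_grad(); loss.backward(); opt.step()
```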
MCTS: searching with policy and value networks
An MCTS algorithm selects actions by lookahead search combined with the policy and value networks.
Each edge (s, a) stores an action value Q(s, a), a visit count N(s, a) and a prior probability P(s, a).
Iterate for n simulations (a minimal sketch of one simulation is given after this list):
a Selection: traverse the tree from the root by selecting the edge/action with maximal Q + u,
until a leaf node sL is reached at step L; the exploration/exploitation trade-off is controlled by u:
at = argmax_a (Q(st, a) + u(st, a)),  u(s, a) ∝ P(s, a) / (1 + N(s, a)),  P(s, a) = pσ(a|s)
b Expansion: sL may be expanded and processed by pσ(a|s) ⇒ store the prior for each legal action a.
c Evaluation: compute vθ(sL), and the rollout outcome zL with the fast rollout policy pπ by
sampling actions until the game ends; then compute the leaf evaluation V(sL) = (1 − λ)vθ(sL) + λzL.
d Backup: update the action values and visit counts of all traversed edges:
N(s, a) = Σ_{i=1..n} 1(s, a, i),  Q(s, a) = (1 / N(s, a)) Σ_{i=1..n} 1(s, a, i) V(s_L^i),
where 1(s, a, i) indicates whether edge (s, a) was traversed in the i-th simulation and s_L^i is the leaf reached in that simulation.
Once the search is complete, choose the most visited move from the root position.
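A minimal sketch of the full search loop under stated assumptions: `state.play(a)` returning a new state, `sl_policy(state)` returning a dict of move priors, and `value_net(state)` / `rollout(state)` returning outcomes in [-1, 1] are hypothetical interfaces; sign handling for the player to move, virtual losses and the asynchronous GPU evaluation are omitted.

```python
class Node:
    """One search-tree node; each node stores the (P, N, W) statistics of the edge leading to it."""
    def __init__(self, state, prior=1.0):
        self.state = state
        self.P, self.N, self.W = prior, 0, 0.0     # prior probability, visit count, total value
        self.children = {}                          # action -> Node

    def Q(self):
        return self.W / self.N if self.N else 0.0   # mean action value

    def u(self, c_puct=5.0):
        return c_puct * self.P / (1 + self.N)       # exploration bonus, u ~ P / (1 + N)

def simulate(root, sl_policy, value_net, rollout, lam=0.5):
    """One MCTS simulation: selection, expansion, evaluation, backup."""
    node, path = root, [root]
    while node.children:                            # a) selection: follow argmax(Q + u) to a leaf
        node = max(node.children.values(), key=lambda c: c.Q() + c.u())
        path.append(node)
    for a, p in sl_policy(node.state).items():      # b) expansion: SL policy gives the priors P(s, a)
        node.children[a] = Node(node.state.play(a), prior=p)
    v = (1 - lam) * value_net(node.state) + lam * rollout(node.state)   # c) evaluation: V(s_L)
    for n in path:                                  # d) backup along the traversed path
        n.N += 1
        n.W += v                                    # Q = W / N is updated implicitly

def best_move(root, n_simulations, sl_policy, value_net, rollout):
    for _ in range(n_simulations):
        simulate(root, sl_policy, value_net, rollout)
    return max(root.children.items(), key=lambda kv: kv[1].N)[0]   # most visited move at the root
```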
Conclusion
MCTS is asynchronous: it uses a multi-threaded search that executes simulations on
CPUs and evaluates the policy and value networks in parallel on GPUs.
The RL policy network won 85% of games against Pachi just by sampling the next
move from pρ(·|st), whereas the SL policy network won only 11%.
Initially, training on a dataset of complete games led the value network to overfit: successive
states are strongly correlated and the regression target is shared by the entire game.
In MCTS, the SL policy network performed better than the stronger RL policy network
for computing P(s, a), presumably because humans select a diverse beam of
promising moves, whereas RL optimizes for the single best move.
Setting λ = 0 shows that the value network provides a viable alternative to
Monte Carlo evaluation in Go; in practice λ = 0.5 performed best.
Further details can be found in the original paper [3].
References:
[1] P. Baudiš and J.-L. Gailly, ‘Pachi: State of the art open source Go program,’ vol. 7168, Jan. 2012.
[2] M. Müller, ‘Computer Go,’ Artificial Intelligence, vol. 134, no. 1, pp. 145–179, 2002, ISSN: 0004-3702.
[3] D. Silver, A. Huang, C. J. Maddison, A. Guez, L. Sifre, G. van den Driessche, J. Schrittwieser, I. Antonoglou,
V. Panneershelvam, M. Lanctot, S. Dieleman, D. Grewe, J. Nham, N. Kalchbrenner, I. Sutskever, T. Lillicrap,
M. Leach, K. Kavukcuoglu, T. Graepel, and D. Hassabis, ‘Mastering the game of Go with deep neural networks and tree
search,’ Nature, vol. 529, pp. 484–489, 2016.