Mastering the game of Go
A deep neural networks and tree search approach
Alessandro Cudazzo
Department of Computer Science
University of Pisa
alessandro@cudazzo.com
ISPR Midterm IV, 2020
Back in time: 2015
We are in 2015 and you may be wondering why mastering the ancient Chinese game of
Go is an important challenge for researchers in the artificial intelligence field.
First, let’s start from the beginning of the story:
A 19x19 board game with approximately b^d possible sequences of moves (b ≈ 250, d ≈ 150).
It is a game of perfect information and can be formulated as a zero-sum game.
Figure: a Go board state.
Exhaustive search is infeasible: the optimal value function v∗(s), which determines the outcome of
the game from any state s under perfect play by both players, cannot be computed.
Depth reduction with an approximate value function: v(s) ≈ v∗(s)
Breadth reduction by sampling actions from a policy function: P(a|s)
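For scale, a quick back-of-envelope calculation (not from the slides, just to make the numbers concrete) shows why exhaustive search is hopeless:

```python
import math

# Rough game-tree size for Go: breadth b ~ 250 legal moves, depth d ~ 150 moves per game.
b, d = 250, 150
digits = int(d * math.log10(b)) + 1  # number of decimal digits in b**d
print(f"b^d has about {digits} decimal digits")  # ~360 digits, far beyond any exhaustive search
```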
At that time, the strongest Go program was based on MCTS: Pachi [1], ranked at 2
amateur dan on KGS. Experts agreed that the major stumbling block to creating
stronger-than-amateur Go programs was the lack of a good position evaluation function [2].
Supervised learning of policy networks
So, DeepMind had a clear view: they had to find better policy and value functions with
deep learning and efficiently combine both with the MCTS heuristic search algorithm.
Supervised learning approach ⇒ SL policy network pσ(a|s):
It takes a 19x19x48 stack of input feature planes to represent the board.
A 13-layer deep convolutional NN with a softmax output.
Trained on a dataset of 30M state-action pairs (s, a) by stochastic gradient ascent to
maximize the likelihood of the human move a selected in state s:
∆σ ∝ ∂ log pσ(a|s) / ∂σ
One issue: it takes 3 ms to evaluate a position, too slow for rollouts!
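As a rough illustration, here is a PyTorch-style sketch of such a 13-layer convolutional policy network; the filter count, kernel sizes and learning rate are assumptions rather than the paper's exact configuration, and minimizing cross-entropy below is the same as maximizing the log-likelihood above.

```python
import torch
import torch.nn as nn

class PolicyNet(nn.Module):
    """Convolutional policy network: 19x19x48 input planes -> logits over the 361 board points."""
    def __init__(self, planes=48, filters=192, hidden_layers=11):
        super().__init__()
        layers = [nn.Conv2d(planes, filters, 5, padding=2), nn.ReLU()]
        for _ in range(hidden_layers):                        # intermediate 3x3 convolutions
            layers += [nn.Conv2d(filters, filters, 3, padding=1), nn.ReLU()]
        layers += [nn.Conv2d(filters, 1, 1)]                  # final 1x1 convolution: one logit per point
        self.body = nn.Sequential(*layers)

    def forward(self, x):                                     # x: (batch, 48, 19, 19)
        return self.body(x).flatten(1)                        # (batch, 361); softmax is applied in the loss

policy = PolicyNet()
opt = torch.optim.SGD(policy.parameters(), lr=3e-3)
states = torch.randn(16, 48, 19, 19)                          # dummy mini-batch of (s, a) pairs
moves = torch.randint(0, 361, (16,))
loss = nn.functional.cross_entropy(policy(states), moves)     # negative log-likelihood of the human moves
opt.zero_grad(); loss.backward(); opt.step()                  # one stochastic-gradient step on log p_sigma(a|s)
```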
So a faster but less accurate rollout policy pπ(a|s), a linear softmax of small pattern
features, was also trained; it reaches 24.2% accuracy and takes only 2 µs to select a move.
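For intuition only, a linear softmax over binary pattern features could be sketched as below; the feature extraction itself is stubbed out and is an assumption.

```python
import numpy as np

def rollout_policy(pattern_features, weights):
    """Linear softmax of small pattern features: cheap enough for fast rollouts.

    pattern_features: (num_legal_moves, num_features) binary matrix, one row per candidate move
    weights:          (num_features,) learned weight vector
    """
    scores = pattern_features @ weights
    probs = np.exp(scores - scores.max())   # numerically stable softmax
    return probs / probs.sum()
```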
The network pσ surpassed the state of the art in terms of accuracy: 57% vs 44.4%.
Small improvements in accuracy led to large improvements in playing strength.
Reinforcement learning policy and value networks
By considering the best current player, who uses a certain policy p, we can approximate its value function v^p(s).
Initialize an RL policy network with ρ = σ and improve it in order to find the best player:
1 Play the current pρ against a randomly selected previous iteration of the policy
network, picked from a pool of opponents to prevent overfitting to a single one.
2 Improve it by policy-gradient reinforcement learning:
∆ρ ∝ ∂ log pρ(at|st) / ∂ρ · zt,  with zt = ±r(sT) the terminal reward from the current player's
perspective: r(st) = 0 at non-terminal time steps t < T, and r(sT) = ±1 for winning/losing.
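A minimal REINFORCE-style sketch of one such update, reusing the PolicyNet sketch above; the opponent pool, baseline subtraction and mini-batching over games are omitted here.

```python
import torch

def reinforce_update(policy, optimizer, states, actions, z):
    """One policy-gradient step: scale log-probabilities of the played moves by the game outcome.

    states:  (T, 48, 19, 19) positions faced by the learner in one self-play game
    actions: (T,) moves it actually played
    z:       +1.0 if the learner won the game, -1.0 otherwise
    """
    log_probs = torch.log_softmax(policy(states), dim=1)
    chosen = log_probs[torch.arange(len(actions)), actions]
    loss = -(z * chosen).mean()                  # minimizing this ascends z * d(log p_rho)/d(rho)
    optimizer.zero_grad(); loss.backward(); optimizer.step()
```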
Since v∗(s) is infeasible to compute, we train a value network vθ(s) ≈ v^pρ(s) ≈ v∗(s):
A regression NN on state-outcome pairs, with an architecture similar to the policy
network but with a single output, trained with SGD and MSE as the loss.
It uses a self-play dataset of 30M distinct positions, each sampled from a separate
game; each game was played by pρ against itself until the end.
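A hedged sketch of the value-network regression: the exact head (fully connected sizes, tanh output) is an assumption, while the single scalar output and the SGD/MSE training match the slide.

```python
import torch
import torch.nn as nn

class ValueNet(nn.Module):
    """Same convolutional idea as the policy network, but a single scalar output in [-1, 1]."""
    def __init__(self, planes=48, filters=192):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(planes, filters, 5, padding=2), nn.ReLU(),
            nn.Conv2d(filters, 1, 1), nn.ReLU(), nn.Flatten(),
            nn.Linear(19 * 19, 256), nn.ReLU(),
            nn.Linear(256, 1), nn.Tanh(),
        )

    def forward(self, x):                                     # x: (batch, 48, 19, 19)
        return self.body(x).squeeze(1)                        # predicted outcome v_theta(s)

value_net = ValueNet()
opt = torch.optim.SGD(value_net.parameters(), lr=1e-2)
states = torch.randn(32, 48, 19, 19)                          # dummy self-play positions
outcomes = torch.randint(0, 2, (32,)).float() * 2 - 1         # game outcomes z in {-1, +1}
loss = nn.functional.mse_loss(value_net(states), outcomes)    # SGD on MSE, as in the slide
opt.zero_grad(); loss.backward(); opt.step()
```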
MCTS: searching with policy and value networks
An MCTS algorithm selects actions by lookahead search combined with the policy and value networks.
Each edge (s, a) stores an action value Q(s, a), a visit count N(s, a) and a prior probability P(s, a).
Iterate for n simulations (a minimal sketch of one simulation is given after this list):
a Selection: traverse the tree from the root by selecting the edge/action with maximal Q + u,
until a leaf node sL is reached at step L; the exploration/exploitation trade-off is controlled by u:
at = argmax_a (Q(st, a) + u(st, a)),  u(s, a) ∝ P(s, a) / (1 + N(s, a)),  P(s, a) = pσ(a|s)
b Expansion: sL may be expanded and processed by pσ(a|s) ⇒ store the prior for each legal action a.
c Evaluation: compute vθ(sL), and the rollout outcome zL with the fast rollout policy pπ by
sampling actions until the game ends; then compute the leaf evaluation V(sL) = (1 − λ)vθ(sL) + λzL.
d Backup: update the action values and visit counts of all traversed edges:
N(s, a) = Σ_{i=1..n} 1(s, a, i),  Q(s, a) = (1 / N(s, a)) Σ_{i=1..n} 1(s, a, i) V(s_L^i),
where 1(s, a, i) indicates whether edge (s, a) was traversed in the i-th simulation and s_L^i is the leaf reached in that simulation.
Once the search is complete, choose the most visited move from the root position.
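A minimal sketch of the full search loop under stated assumptions: `state.play(a)` returning a new state, `sl_policy(state)` returning a dict of move priors, and `value_net(state)` / `rollout(state)` returning outcomes in [-1, 1] are hypothetical interfaces; sign handling for the player to move, virtual losses and the asynchronous GPU evaluation are omitted.

```python
class Node:
    """One search-tree node; each node stores the (P, N, W) statistics of the edge leading to it."""
    def __init__(self, state, prior=1.0):
        self.state = state
        self.P, self.N, self.W = prior, 0, 0.0     # prior probability, visit count, total value
        self.children = {}                          # action -> Node

    def Q(self):
        return self.W / self.N if self.N else 0.0   # mean action value

    def u(self, c_puct=5.0):
        return c_puct * self.P / (1 + self.N)       # exploration bonus, u ~ P / (1 + N)

def simulate(root, sl_policy, value_net, rollout, lam=0.5):
    """One MCTS simulation: selection, expansion, evaluation, backup."""
    node, path = root, [root]
    while node.children:                            # a) selection: follow argmax(Q + u) to a leaf
        node = max(node.children.values(), key=lambda c: c.Q() + c.u())
        path.append(node)
    for a, p in sl_policy(node.state).items():      # b) expansion: SL policy gives the priors P(s, a)
        node.children[a] = Node(node.state.play(a), prior=p)
    v = (1 - lam) * value_net(node.state) + lam * rollout(node.state)   # c) evaluation: V(s_L)
    for n in path:                                  # d) backup along the traversed path
        n.N += 1
        n.W += v                                    # Q = W / N is updated implicitly

def best_move(root, n_simulations, sl_policy, value_net, rollout):
    for _ in range(n_simulations):
        simulate(root, sl_policy, value_net, rollout)
    return max(root.children.items(), key=lambda kv: kv[1].N)[0]   # most visited move at the root
```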
Conclusion
MCTS is asynchronous: it uses a multi-threaded search that executes simulations on
CPUs and evaluates the policy and value networks in parallel on GPUs.
The RL policy network won 85% of games against Pachi just by sampling the next
move from pρ(·|st), whereas the SL policy network won only 11%.
Initially, training on a dataset of complete games led the value network to overfit: successive
states are strongly correlated and the regression target is shared by the entire game.
In MCTS, the SL policy network performed better than the stronger RL policy network
for computing P(s, a), presumably because humans select a diverse beam of
promising moves, whereas RL optimizes for the single best move.
Setting λ = 0 shows that the value network provides a viable alternative to
Monte Carlo evaluation in Go; in practice λ = 0.5 performed best.
Further details can be found in the original paper [3].
References:
[1] P. Baudiš and J.-L. Gailly, ‘Pachi: State of the art open source Go program,’ vol. 7168, Jan. 2012.
[2] M. Müller, ‘Computer Go,’ Artificial Intelligence, vol. 134, no. 1, pp. 145–179, 2002, ISSN: 0004-3702.
[3] D. Silver, A. Huang, C. J. Maddison, A. Guez, L. Sifre, G. van den Driessche, J. Schrittwieser, I. Antonoglou,
V. Panneershelvam, M. Lanctot, S. Dieleman, D. Grewe, J. Nham, N. Kalchbrenner, I. Sutskever, T. Lillicrap,
M. Leach, K. Kavukcuoglu, T. Graepel, and D. Hassabis, ‘Mastering the game of Go with deep neural networks and tree
search,’ Nature, vol. 529, pp. 484–489, 2016.