Mastering the Game of Go with Deep Neural Networks and Tree Search
Korea University, Department of Computer Science & Radio Communication Engineering
2016010646 Bumsoo Kim
Massive Data Management, Professor Jaewoo Kang
Contents
01. Why Go is Hard To Conquer
    1-1. Search space
    1-2. Comparison
    1-3. Other methods
02. Monte-Carlo Tree Search
    2-1. MCTS (Monte-Carlo Tree Search)
    2-2. What is Different?
    2-3. Policy Network
    2-4. Value Network
03. Policy Networks
    3-1. SL policy network
    3-2. RL policy network
    3-3. Rollout policy network
04. Value Networks
    4-1. Value Networks
    4-2. Reinforcement Learning
05. Playing Go
    5-1. Searching with Networks
    5-2. How does it work?
    5-3. Performance
Why Go is Hard To Conquer
1-1. Search space
1-2. Comparison
1-3. Other methods
1-1. Search space 1. Why Go is Hard To Conquer
A Go board has 19 × 19 = 361 points (着點).
Each point can be black, white, or empty, so the number of board configurations is at most 3^361 ≈ 10^172.
Average branching factor ≈ 250
Average game depth ≈ 150
∴ Game-tree complexity ≈ 250^150 ≈ 10^360
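As a quick sanity check on these magnitudes, the exponents can be computed directly (a minimal Python sketch; 250 and 150 are the rough averages quoted above, not exact game statistics):

```python
import math

# State-space upper bound: each of the 361 points is black, white, or empty.
print(361 * math.log10(3))     # ≈ 172.2  ->  3^361 ≈ 10^172

# Game-tree complexity: (branching factor) ^ (game depth).
print(150 * math.log10(250))   # ≈ 359.7  ->  250^150 ≈ 10^360
```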
1-2. Comparison 1. Why Go is Hard To Conquer
Chess vs. Go
                         | Chess     | Go
state-space complexity   | 10^50     | 3^361 ≈ 10^172
game breadth (b)         | 35        | 250
game depth (d)           | 80        | 150
game-tree complexity     | ≈ 10^123  | ≈ 10^360
1-2. Comparison 1. Why Go is Hard To Conquer
| How large is it actually?
Go search space = 250^150 ≈ (number of atoms in the universe, 10^80) × 10^280
1-3. Other methods 1. Why Go is Hard To Conquer
Surreal Numbers*
Artificial Neural Networks
Evolutionary Computation
Reinforcement Learning
Mediocre Algorithm
* John H. Conway, Game Theory (1969)
Due to Go's large search space and complexity, the results of all these approaches were very poor.
Monte-Carlo Tree Search
2-1. MCTS (Monte-Carlo Tree Search)
2-2. What is Different?
2-3. Policy Network
2-4. Value Network
2-1. MCTS 2. Monte-Carlo Tree Searching
Four phases: Selection → Expansion → Simulation → Backpropagation
A heuristic search algorithm
Analyses the most promising moves by growing the search tree based on random sampling
Useful for problems whose exact solution is mathematically intractable
2-1. MCTS 2. Monte-Carlo Tree Searching
Phase 1: Selection
Start from root R.
Recursively select the child node that maximizes the UCT value*
    w_i / n_i + c · √(ln t / n_i)
until a leaf node L is reached.
w_i : number of wins after the i-th move
n_i : number of simulations after the i-th move
c : exploration parameter, theoretically √2
t : total number of simulations, = Σ n_i
L : leaf node
* Kocsis and Szepesvári
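A minimal sketch of this selection rule in Python (the child-statistics representation is a hypothetical stand-in; only the UCT formula itself comes from the slide above):

```python
import math

def uct_score(wins, visits, parent_visits, c=math.sqrt(2)):
    """UCT value of one child: win rate plus exploration bonus."""
    if visits == 0:
        return float("inf")            # unvisited children are tried first
    return wins / visits + c * math.sqrt(math.log(parent_visits) / visits)

def select_child(children):
    """children: list of (wins, visits) pairs for the current node."""
    t = sum(n for _, n in children)    # total simulations through this node
    scores = [uct_score(w, n, t) for w, n in children]
    return scores.index(max(scores))   # index of the child to descend into
```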
2-1. MCTS 2. Monte-Carlo Tree Searching
Phase 2: Expansion
Unless L ends the game, create or choose a child node C of L.
2-1. MCTS 2. Monte-Carlo Tree Searching
Phase 3: Simulation
Play a random playout from node C to the end of the game.
2-1. MCTS 2. Monte-Carlo Tree Searching
Phase 4: Backpropagation
Propagate the result of the playout back up the path from C to the root R,
updating each node's number of visits and simulation score.
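Putting the four phases together, a heavily simplified single-threaded MCTS loop might look like the sketch below (the Node class and the game-state methods is_terminal, random_successor and random_playout are hypothetical placeholders; only the four-phase structure comes from the slides):

```python
import math

class Node:
    def __init__(self, state, parent=None):
        self.state, self.parent = state, parent
        self.children, self.wins, self.visits = [], 0.0, 0

def uct(child, parent_visits, c=math.sqrt(2)):
    if child.visits == 0:
        return float("inf")
    return child.wins / child.visits + c * math.sqrt(math.log(parent_visits) / child.visits)

def mcts(root, n_simulations):
    for _ in range(n_simulations):
        # 1. Selection: descend from the root R by UCT until a leaf L
        node = root
        while node.children:
            node = max(node.children, key=lambda ch: uct(ch, node.visits))
        # 2. Expansion: unless L ends the game, add a child node C
        if not node.state.is_terminal():
            child = Node(node.state.random_successor(), parent=node)
            node.children.append(child)
            node = child
        # 3. Simulation: random playout from C to the end of the game
        result = node.state.random_playout()        # 1.0 = win, 0.0 = loss
        # 4. Backpropagation: update visit counts and scores from C back to R
        while node is not None:
            node.visits += 1
            node.wins += result
            node = node.parent
    return max(root.children, key=lambda ch: ch.visits)  # most-visited move
```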
2-1. MCTS 2. Monte-Carlo Tree Searching
MCTS has been widely used in attempts to conquer Go.
However, other MCTS programs failed to beat professional dan-level players.
What makes AlphaGo so special?
2-2. What is Different? 2. Monte-Carlo Tree Searching
Selection / Expansion / Simulation / Backpropagation: the search itself stays the same, but AlphaGo reduces its effective breadth and depth.
Reduce breadth = Policy Network (sampling actions): the probability p(a|s) of move a in position s
Reduce depth = Value Network (position evaluation): an approximation of the optimal value function v*(s)
2-3. Policy Network 2. Monte-Carlo Tree Searching
SL Policy Network p_σ
- Convolutional neural network, weights σ
- Supervised learning on 30 million positions from previous KGS Go game records
- Fast, efficient learning of expert moves
RL Policy Network p_ρ
- Reinforcement learning (policy gradient learning)
- Improves the SL policy network by adjusting the policy towards the correct goal: winning the game
(p_π is a third, fast rollout policy, covered in Section 3-3.)
2-4. Value Network 2. Monte-Carlo Tree Searching
Self-play → value network v_θ
SL policy network p_σ: trained on the KGS position data-set to predict human expert moves
(* risks merely memorizing the KGS moves)
RL policy network p_ρ plays against a previous version of itself (p_ρ′): self-play
→ the self-play games are used to train the value network v_θ, which predicts the winner of the game, i.e. evaluates positions
Training on self-play data prevents overfitting.
Policy Networks
3-1. SL policy network
3-2. RL policy network
3-3. Rollout policy network
3-1. SL Policy Network 3. Policy Networks
Convolutional Neural Network (the architecture used in image recognition)
Each board point is described by 11 types of features, giving 48 input feature planes in total:
48 = 3 + 1 + 8 + 8 + 8 + 8 + 8 + 1 + 1 + 1 + 1
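For reference, this breakdown matches the feature planes listed in the AlphaGo paper: stone colour (3), a constant plane of ones (1), turns since each move (8), liberties (8), capture size (8), self-atari size (8), liberties after the move (8), ladder capture (1), ladder escape (1), sensibleness (1), and a constant plane of zeros (1). A minimal sketch of the resulting input tensor, with the plane layout assumed:

```python
import numpy as np

# Plane counts follow the 3+1+8+8+8+8+8+1+1+1+1 breakdown above.
PLANES = {
    "stone_colour": 3, "ones": 1, "turns_since": 8, "liberties": 8,
    "capture_size": 8, "self_atari_size": 8, "liberties_after_move": 8,
    "ladder_capture": 1, "ladder_escape": 1, "sensibleness": 1, "zeros": 1,
}
assert sum(PLANES.values()) == 48

def empty_input():
    """One 19 x 19 x 48 input tensor, to be filled in per board position."""
    return np.zeros((19, 19, sum(PLANES.values())), dtype=np.float32)
```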
3-1. SL Policy Network 3. Policy Networks
Input data dimension = 19 × 19 × 48
Zero padding → 23 × 23 × 48 before the first convolution
Convolutional kernels = n × n × k
In AlphaGo, n = 5 (first layer), 3 (hidden layers), 1 (final layer), with k = 192 filters
3-1. SL Policy Network 3. Policy Networks
Why zero padding?
Padding the board with a border of zeros (+2 points on each side for the 5 × 5 kernel) lets the kernel be applied at the edges of the board, so the output keeps the 19 × 19 size: (19 - 5 + 2·2) / 1 + 1 = 19.
3-1. SL Policy Network 3. Policy Networks
Feed-forward convolutional network with 13 layers
Final layer = softmax* layer
* The softmax layer outputs a probability distribution over all legal moves a from state s
= p_σ(a|s)
(figure source: https://brunch.co.kr/@justinleeanac/2)
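A minimal PyTorch sketch of such a 13-layer policy network (kernel sizes 5/3/1 and 192 filters follow the slides; other details, such as the plain ReLU activations, are assumptions):

```python
import torch
import torch.nn as nn

class SLPolicyNet(nn.Module):
    """Sketch of the 13-layer convolutional policy network p_sigma(a|s)."""
    def __init__(self, in_planes=48, filters=192):
        super().__init__()
        layers = [nn.Conv2d(in_planes, filters, kernel_size=5, padding=2), nn.ReLU()]
        for _ in range(11):                                   # layers 2-12: 3x3 convs
            layers += [nn.Conv2d(filters, filters, kernel_size=3, padding=1), nn.ReLU()]
        layers += [nn.Conv2d(filters, 1, kernel_size=1)]      # layer 13: 1x1, one output plane
        self.body = nn.Sequential(*layers)

    def forward(self, x):                     # x: (batch, 48, 19, 19) feature planes
        logits = self.body(x).flatten(1)      # (batch, 361) logits, one per point
        return torch.softmax(logits, dim=1)   # probability distribution p_sigma(a|s)
```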
3-1. SL Policy Network 3. Policy Networks
Supervised Learning Results
3-1. SL Policy Network 3. Policy Networks
With the convolutional neural network, AlphaGo learns to predict where human experts will play next.
(Figure: improvements in SL policy accuracy as the number of filters grows.)
How else can the policy be improved?
3-2. RL Policy Network 3. Policy Networks
Improving the SL policy network by reinforcement learning
Policy Gradient Reinforcement Learning
p_ρ has a structure identical to p_σ and is initialized with ρ = σ
The network improves by self-play against earlier iterations of its own policy network.
r(s) : reward function, zero for all non-terminal steps t < T
T : terminal (final) step; s_T : final state
r(s_T) : +1 for a win, -1 for a loss
z_t = r(s_T) : the outcome of the game
3-2. RL Policy Network 3. Policy Networks
Why policy gradient?
Learn p_ρ(a_t|s_t) to maximize the expected outcome z_t.
The policy network cares about the move probability distribution p_ρ itself, not about a value function, so policy gradient methods are the natural choice.
The weights ρ are updated at each time step t by stochastic gradient ascent in the direction that increases the expected outcome: Δρ ∝ (∂ log p_ρ(a_t|s_t) / ∂ρ) · z_t
The resulting RL policy network wins 80% of its games against the SL policy network.
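A minimal PyTorch sketch of that update for one finished self-play game (the surrounding game loop, the policy network and the optimizer are assumed to exist; only the gradient-ascent step follows the formula above):

```python
import torch

def reinforce_update(optimizer, log_probs, z):
    """One policy-gradient step over a completed self-play game.

    log_probs: list of log p_rho(a_t|s_t) tensors recorded while playing
    z:         +1.0 if this player won the game, -1.0 if it lost
    """
    # Ascent on z * sum_t log p(a_t|s_t) == descent on its negation.
    loss = -z * torch.stack(log_probs).sum()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```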
3-3. Rollout Policy 3. Policy Networks
p_π : a fast policy for predicting human expert moves
3-3. Rollout Policy 3. Policy Networks
Used in the tree search when running a rollout to the end of the game
3-3. Rollout Policy 3. Policy Networks
Trained as a linear softmax over small, local pattern features (weights π)
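A sketch of what such a linear-softmax rollout policy looks like (the pattern-feature extractor is a hypothetical placeholder for the hand-crafted local features described in the paper):

```python
import numpy as np

def rollout_policy(legal_moves, feature_fn, pi):
    """Linear softmax over pattern features.

    legal_moves: list of candidate moves in the current position
    feature_fn:  move -> binary feature vector (hypothetical extractor)
    pi:          weight vector, one weight per pattern feature
    """
    scores = np.array([pi @ feature_fn(m) for m in legal_moves])
    scores -= scores.max()                       # numerical stability
    probs = np.exp(scores) / np.exp(scores).sum()
    return probs                                 # probability for each legal move
```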
Value Networks
4-1. Value Networks
4-2. Reinforcement Learning
4-1. Value Networks 4. Value Networks
           | SL Policy Network     | RL Policy Network     | Value Network
Goal       | Prediction            | Prediction            | Evaluation
Type       | Classification        | Policy gradient       | Regression
Addresses  | Expert moves          | Self-play moves       | Winning rate
Structure  | (all three share the same convolutional architecture)
Output     | Probability p_σ(a|s)  | Probability p_ρ(a|s)  | Scalar value v_θ(s′)
4-1. Value Networks 4. Value Networks
Policy Network
- Input : current board position s
- Parameters : σ or ρ
- Output : a probability map over moves, p_σ(a|s) or p_ρ(a|s)
Value Network
- Input : predicted position s′
- Parameters : θ
- Output : a single scalar value v_θ(s′)
(Both consist of convolutional layers.)
4-2. Reinforcement Learning 4. Value Networks
v_p(s) : predicts the outcome from position s when both players follow policy p
v*(s) : the optimal value function under perfect play
v_{p_ρ}(s) : predicts the outcome when both players follow the RL policy p_ρ
The value network is trained so that v_θ(s) ≈ v_{p_ρ}(s) ≈ v*(s)
4-2. Reinforcement Learning 4. Value Networks
Learn θ to minimize the gap between the predicted value v_θ(s) and the actual outcome z.
The weights θ are updated on state-outcome pairs (s, z) by stochastic gradient descent, minimizing the mean squared error (z - v_θ(s))^2.
Naïve approach (training on successive positions from complete KGS games) → overfitting occurs:
- the network only memorizes game outcomes and cannot generalize to new positions
- training-set min MSE = 0.19
- test-set min MSE = 0.37
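A minimal PyTorch sketch of this regression step (value_net is assumed to be a convolutional network ending in a single scalar output; data loading and optimizer setup are omitted):

```python
import torch.nn.functional as F

def value_update(value_net, optimizer, states, outcomes):
    """One SGD step minimizing the MSE between v_theta(s) and the outcome z.

    states:   batch of board feature tensors
    outcomes: tensor of game results z in {-1.0, +1.0}, one per state
    """
    v = value_net(states).squeeze(-1)   # predicted values, one scalar per position
    loss = F.mse_loss(v, outcomes)      # mean of (z - v_theta(s))^2 over the batch
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```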
4-2. Reinforcement Learning 4. Value Networks
Improvements
- new self-play data set of 30 million distinct positions, each sampled from a separate game
- games were played by the RL policy network against itself
- training-set min MSE = 0.226
- test-set min MSE = 0.234
The value network is consistently more accurate than the rollout policy p_π, and a single evaluation
eventually approaches the accuracy of rollouts with the RL policy p_ρ while using 1/15,000 of the computation.
Playing Go
5-1. Searching with Networks
5-2. How does it work?
5-3. Performance
5-1. Searching with Networks 5. Playing Go
Term    | Definition
Q(s,a)  | action value
P(s,a)  | prior probability
N(s,a)  | visit count
u(s,a)  | exploration bonus: proportional to the prior probability, decays with repeated visits to encourage exploration
5-1. Searching with Networks 5. Playing Go
Selection: at each step of a simulation, select the edge with the maximum Q(s,a) + u(s,a), where the bonus u(s,a) is proportional to P(s,a) / (1 + N(s,a)).
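A sketch of that selection rule (the per-edge statistics are stored here in hypothetical dictionaries keyed by action, and c_puct is an exploration constant whose value in this sketch is arbitrary):

```python
def select_action(Q, P, N, c_puct=5.0):
    """Pick a = argmax_a [ Q(s,a) + u(s,a) ] with u proportional to P / (1 + N).

    Q, P, N: dicts mapping each legal action of the current node to its
             action value, prior probability, and visit count.
    """
    total_visits = sum(N.values())

    def score(a):
        u = c_puct * P[a] * (total_visits ** 0.5) / (1 + N[a])
        return Q[a] + u

    return max(Q, key=score)
```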
5-1. Searching with Networks 5. Playing Go
Expansion: a new leaf node s_L is processed once by the SL policy network p_σ.
Its output probabilities p_σ(a|s_L) are stored as the prior P(s,a) of each legal action.
Expansion only happens once the leaf's visit count reaches a threshold (N_L ≥ 40).
Why the SL policy?
Inside the search, the SL policy network p_σ performs better than the RL policy network p_ρ
(∵ humans select a diverse set of promising moves, while RL optimizes for the single best move).
5-1. Searching with Networks 5. Playing Go
Evaluation: a leaf s_L is evaluated in two ways.
p_π = the fast rollout policy plays the game out to the end and computes the winner → outcome z_L
v_θ(s_L) = the value network (derived from the RL policy network p_ρ, so that v_θ(s) ≈ v_{p_ρ}(s) ≈ v*(s)) → a scalar value
Leaf evaluation: V(s_L) = (1 - λ) · v_θ(s_L) + λ · z_L
5-1. Searching with Networks 5. Playing Go
Backup: after each simulation, update the statistics of every edge (s,a) on the traversed path: increment the visit count N(s,a) and set Q(s,a) to the mean of the leaf evaluations V(s_L) of all simulations that passed through that edge.
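A sketch of the evaluation and backup steps together (lambda_mix, the rollout and value-network callables, and the edge objects are placeholders; the mixing formula and the running-mean update follow the two slides above):

```python
def evaluate_leaf(value_net, rollout, leaf_state, lambda_mix=0.5):
    """Mixed leaf evaluation V(s_L) = (1 - lambda) * v_theta(s_L) + lambda * z_L."""
    v = value_net(leaf_state)      # scalar prediction from the value network
    z = rollout(leaf_state)        # +1 / -1 outcome of one fast rollout with p_pi
    return (1 - lambda_mix) * v + lambda_mix * z

def backup(path, leaf_value):
    """Update every edge (s,a) traversed in this simulation.

    path: list of edge objects, each with a visit count N and a mean value Q.
    """
    for edge in path:
        edge.N += 1
        edge.Q += (leaf_value - edge.Q) / edge.N   # running mean of V(s_L)
```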
5-1. Searching with Networks 5. Playing Go
Repeat & update: simulations are repeated until the search budget is exhausted.
The move actually played is the one whose edge from the root has been visited most often
(selecting by maximum Q instead causes overfitting to a few simulations).
5-2. How Does It Work? 5. Playing Go
Evaluation of all successors s′ of the root position s with the value network v_θ(s′).
The move labelled 54 has the highest estimated winning percentage.
5-2. How Does It Work? 5. Playing Go
Q(s,a) = action values for each (s,a), averaged over value-network evaluations only (λ = 0).
5-2. How Does It Work? 5. Playing Go
Q(s,a) = action values for each (s,a), averaged over rollout evaluations only (λ = 1).
5-2. How Does It Work? 5. Playing Go
Move probabilities taken directly from the SL policy network p_σ(a|s);
expansion occurs once a node's visit count is over 40.
5-2. How Does It Work? 5. Playing Go
Percentage frequency with which actions were selected from the root during the simulations.
5-2. How Does It Work? 5. Playing Go
The principal variation: the path with the maximum visit count.
Moves are presented in numbered sequence; AlphaGo selects the move indicated by the red circle.
5-3. Performance 5. Playing Go
Performance of AlphaGo on a single machine with various combinations of its components (rollouts, value network, policy network).
5-3. Performance 5. Playing Go
Performance comparison between the single-machine and distributed versions of AlphaGo.
5-3. Performance 5. Playing Go
Recent matches proved that AlphaGo can finally beat the highest-ranked professional 9-dan players.
Thank you for your attention!