Mastering the Game of Go with Deep Neural Networks and Tree Search
Korea University, Department of Computer Science & Radio Communication Engineering
2016010646 Bumsoo Kim
Massive Data Management, Professor Jaewoo Kang
Contents
01. Why Go is Hard To Conquer
    1-1. Search space
    1-2. Comparison
    1-3. Other methods
02. Monte-Carlo Tree Search
    2-1. MCTS (Monte-Carlo Tree Search)
    2-2. What is Different?
    2-3. Policy Network
    2-4. Value Network
03. Policy Networks
    3-1. SL policy network
    3-2. RL policy network
    3-3. Rollout policy network
04. Value Networks
    4-1. Value Networks
    4-2. Reinforcement Learning
05. Playing Go
    5-1. Searching with Networks
    5-2. How does it work?
    5-3. Performance
Why Go is Hard To Conquer
1-1. Search space
1-2. Comparison
1-3. Other methods
1-1. Search space 1. Why Go is Hard To Conquer
A Go board has 19 × 19 = 361 points (着點).
Each point can be black, white, or empty, so the number of board configurations is at most 3^361 ≈ 10^172.
Average branching factor ≈ 250
Average game depth ≈ 150
∴ Game-tree complexity ≈ 250^150 ≈ 10^360
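As a quick sanity check on these magnitudes, the exponents can be computed directly (a minimal Python sketch; 250 and 150 are the rough averages quoted above, not exact game statistics):

```python
import math

# State-space upper bound: each of the 361 points is black, white, or empty.
print(361 * math.log10(3))     # ≈ 172.2  ->  3^361 ≈ 10^172

# Game-tree complexity: (branching factor) ^ (game depth).
print(150 * math.log10(250))   # ≈ 359.7  ->  250^150 ≈ 10^360
```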
1-2. Comparison 1. Why Go is Hard To Conquer
Chess vs. Go
                         | Chess     | Go
state-space complexity   | 10^50     | 3^361 ≈ 10^172
game breadth (b)         | 35        | 250
game depth (d)           | 80        | 150
game-tree complexity     | ≈ 10^123  | ≈ 10^360
1-2. Comparison 1. Why Go is Hard To Conquer
| How large is it actually?
Go search space = 250^150 ≈ (number of atoms in the universe, 10^80) × 10^280
1-3. Other methods 1. Why Go is Hard To Conquer
Surreal Numbers*
Artificial Neural Networks
Evolutionary Computation
Reinforcement Learning
Mediocre Algorithm
* John H. Conway, Game Theory (1969)
Due to Go's large search space and complexity, the results of all these approaches were very poor.
Monte-Carlo Tree Search
2-1. MCTS (Monte-Carlo Tree Search)
2-2. What is Different?
2-3. Policy Network
2-4. Value Network
2-1. MCTS 2. Monte-Carlo Tree Searching
Four phases: Selection → Expansion → Simulation → Backpropagation
A heuristic search algorithm
Analyses the most promising moves by growing the search tree based on random sampling
Useful for problems whose exact solution is mathematically intractable
2-1. MCTS 2. Monte-Carlo Tree Searching
Phase 1: Selection
Start from root R.
Recursively select the child node that maximizes the UCT value*
    w_i / n_i + c · √(ln t / n_i)
until a leaf node L is reached.
w_i : number of wins after the i-th move
n_i : number of simulations after the i-th move
c : exploration parameter, theoretically √2
t : total number of simulations, = Σ n_i
L : leaf node
* Kocsis and Szepesvári
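A minimal sketch of this selection rule in Python (the child-statistics representation is a hypothetical stand-in; only the UCT formula itself comes from the slide above):

```python
import math

def uct_score(wins, visits, parent_visits, c=math.sqrt(2)):
    """UCT value of one child: win rate plus exploration bonus."""
    if visits == 0:
        return float("inf")            # unvisited children are tried first
    return wins / visits + c * math.sqrt(math.log(parent_visits) / visits)

def select_child(children):
    """children: list of (wins, visits) pairs for the current node."""
    t = sum(n for _, n in children)    # total simulations through this node
    scores = [uct_score(w, n, t) for w, n in children]
    return scores.index(max(scores))   # index of the child to descend into
```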
2-1. MCTS 2. Monte-Carlo Tree Searching
Phase 2: Expansion
Unless L ends the game, create or choose a child node C of L.
2-1. MCTS 2. Monte-Carlo Tree Searching
Phase 3: Simulation
Play a random playout from node C to the end of the game.
2-1. MCTS 2. Monte-Carlo Tree Searching
Phase 4: Backpropagation
Propagate the result of the playout back up the path from C to the root R,
updating each node's number of visits and simulation score.
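Putting the four phases together, a heavily simplified single-threaded MCTS loop might look like the sketch below (the Node class and the game-state methods is_terminal, random_successor and random_playout are hypothetical placeholders; only the four-phase structure comes from the slides):

```python
import math

class Node:
    def __init__(self, state, parent=None):
        self.state, self.parent = state, parent
        self.children, self.wins, self.visits = [], 0.0, 0

def uct(child, parent_visits, c=math.sqrt(2)):
    if child.visits == 0:
        return float("inf")
    return child.wins / child.visits + c * math.sqrt(math.log(parent_visits) / child.visits)

def mcts(root, n_simulations):
    for _ in range(n_simulations):
        # 1. Selection: descend from the root R by UCT until a leaf L
        node = root
        while node.children:
            node = max(node.children, key=lambda ch: uct(ch, node.visits))
        # 2. Expansion: unless L ends the game, add a child node C
        if not node.state.is_terminal():
            child = Node(node.state.random_successor(), parent=node)
            node.children.append(child)
            node = child
        # 3. Simulation: random playout from C to the end of the game
        result = node.state.random_playout()        # 1.0 = win, 0.0 = loss
        # 4. Backpropagation: update visit counts and scores from C back to R
        while node is not None:
            node.visits += 1
            node.wins += result
            node = node.parent
    return max(root.children, key=lambda ch: ch.visits)  # most-visited move
```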
2-1. MCTS 2. Monte-Carlo Tree Searching
MCTS has been widely used in attempts to conquer Go.
However, other MCTS programs failed to beat professional dan-level players.
What makes AlphaGo so special?
2-2. What is Different? 2. Monte-Carlo Tree Searching
Selection / Expansion / Simulation / Backpropagation: the search itself stays the same, but AlphaGo reduces its effective breadth and depth.
Reduce breadth = Policy Network (sampling actions): the probability p(a|s) of move a in position s
Reduce depth = Value Network (position evaluation): an approximation of the optimal value function v*(s)
2-3. Policy Network 2. Monte-Carlo Tree Searching
SL Policy Network p_σ
- Convolutional neural network, weights σ
- Supervised learning on 30 million positions from previous KGS Go game records
- Fast, efficient learning of expert moves
RL Policy Network p_ρ
- Reinforcement learning (policy gradient learning)
- Improves the SL policy network by adjusting the policy towards the correct goal: winning the game
(p_π is a third, fast rollout policy, covered in Section 3-3.)
2-4. Value Network 2. Monte-Carlo Tree Searching
Self-play → value network v_θ
SL policy network p_σ: trained on the KGS position data-set to predict human expert moves
(* risks merely memorizing the KGS moves)
RL policy network p_ρ plays against a previous version of itself (p_ρ′): self-play
→ the self-play games are used to train the value network v_θ, which predicts the winner of the game, i.e. evaluates positions
Training on self-play data prevents overfitting.
Policy Networks
3-1. SL policy network
3-2. RL policy network
3-3. Rollout policy network
3-1. SL Policy Network 3. Policy Networks
Convolutional Neural Network (the architecture used in image recognition)
Each board point is described by 11 types of features, giving 48 input feature planes in total:
48 = 3 + 1 + 8 + 8 + 8 + 8 + 8 + 1 + 1 + 1 + 1
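For reference, this breakdown matches the feature planes listed in the AlphaGo paper: stone colour (3), a constant plane of ones (1), turns since each move (8), liberties (8), capture size (8), self-atari size (8), liberties after the move (8), ladder capture (1), ladder escape (1), sensibleness (1), and a constant plane of zeros (1). A minimal sketch of the resulting input tensor, with the plane layout assumed:

```python
import numpy as np

# Plane counts follow the 3+1+8+8+8+8+8+1+1+1+1 breakdown above.
PLANES = {
    "stone_colour": 3, "ones": 1, "turns_since": 8, "liberties": 8,
    "capture_size": 8, "self_atari_size": 8, "liberties_after_move": 8,
    "ladder_capture": 1, "ladder_escape": 1, "sensibleness": 1, "zeros": 1,
}
assert sum(PLANES.values()) == 48

def empty_input():
    """One 19 x 19 x 48 input tensor, to be filled in per board position."""
    return np.zeros((19, 19, sum(PLANES.values())), dtype=np.float32)
```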
3-1. SL Policy Network 3. Policy Networks
Input data dimension = 19 × 19 × 48
Zero padding → 23 × 23 × 48 before the first convolution
Convolutional kernels = n × n × k
In AlphaGo, n = 5 (first layer), 3 (hidden layers), 1 (final layer), with k = 192 filters
3-1. SL Policy Network 3. Policy Networks
Why zero padding?
Padding the board with a border of zeros (+2 points on each side for the 5 × 5 kernel) lets the kernel be applied at the edges of the board, so the output keeps the 19 × 19 size: (19 - 5 + 2·2) / 1 + 1 = 19.
3-1. SL Policy Network 3. Policy Networks
Feed-forward convolutional network with 13 layers
Final layer = softmax* layer
* The softmax layer outputs a probability distribution over all legal moves a from state s
= p_σ(a|s)
(figure source: https://brunch.co.kr/@justinleeanac/2)
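A minimal PyTorch sketch of such a 13-layer policy network (kernel sizes 5/3/1 and 192 filters follow the slides; other details, such as the plain ReLU activations, are assumptions):

```python
import torch
import torch.nn as nn

class SLPolicyNet(nn.Module):
    """Sketch of the 13-layer convolutional policy network p_sigma(a|s)."""
    def __init__(self, in_planes=48, filters=192):
        super().__init__()
        layers = [nn.Conv2d(in_planes, filters, kernel_size=5, padding=2), nn.ReLU()]
        for _ in range(11):                                   # layers 2-12: 3x3 convs
            layers += [nn.Conv2d(filters, filters, kernel_size=3, padding=1), nn.ReLU()]
        layers += [nn.Conv2d(filters, 1, kernel_size=1)]      # layer 13: 1x1, one output plane
        self.body = nn.Sequential(*layers)

    def forward(self, x):                     # x: (batch, 48, 19, 19) feature planes
        logits = self.body(x).flatten(1)      # (batch, 361) logits, one per point
        return torch.softmax(logits, dim=1)   # probability distribution p_sigma(a|s)
```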
3-1. SL Policy Network 3. Policy Networks
Supervised Learning Results
3-1. SL Policy Network 3. Policy Networks
With the convolutional neural network, AlphaGo learns to predict where human experts will play next.
(Figure: improvements in SL policy accuracy as the number of filters grows.)
How else can the policy be improved?
3-2. RL Policy Network 3. Policy Networks
Improving the SL policy network by reinforcement learning
Policy Gradient Reinforcement Learning
p_ρ has a structure identical to p_σ and is initialized with ρ = σ
The network improves by self-play against earlier iterations of its own policy network.
r(s) : reward function, zero for all non-terminal steps t < T
T : terminal (final) step; s_T : final state
r(s_T) : +1 for a win, -1 for a loss
z_t = r(s_T) : the outcome of the game
3-2. RL Policy Network 3. Policy Networks
Why policy gradient?
Learn p_ρ(a_t|s_t) to maximize the expected outcome z_t.
The policy network cares about the move probability distribution p_ρ itself, not about a value function, so policy gradient methods are the natural choice.
The weights ρ are updated at each time step t by stochastic gradient ascent in the direction that increases the expected outcome: Δρ ∝ (∂ log p_ρ(a_t|s_t) / ∂ρ) · z_t
The resulting RL policy network wins 80% of its games against the SL policy network.
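A minimal PyTorch sketch of that update for one finished self-play game (the surrounding game loop, the policy network and the optimizer are assumed to exist; only the gradient-ascent step follows the formula above):

```python
import torch

def reinforce_update(optimizer, log_probs, z):
    """One policy-gradient step over a completed self-play game.

    log_probs: list of log p_rho(a_t|s_t) tensors recorded while playing
    z:         +1.0 if this player won the game, -1.0 if it lost
    """
    # Ascent on z * sum_t log p(a_t|s_t) == descent on its negation.
    loss = -z * torch.stack(log_probs).sum()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```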
3-3. Rollout Policy 3. Policy Networks
p_π : a fast policy for predicting human expert moves
3-3. Rollout Policy 3. Policy Networks
Used in the tree search when running a rollout to the end of the game
3-3. Rollout Policy 3. Policy Networks
Trained as a linear softmax over small, local pattern features (weights π)
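A sketch of what such a linear-softmax rollout policy looks like (the pattern-feature extractor is a hypothetical placeholder for the hand-crafted local features described in the paper):

```python
import numpy as np

def rollout_policy(legal_moves, feature_fn, pi):
    """Linear softmax over pattern features.

    legal_moves: list of candidate moves in the current position
    feature_fn:  move -> binary feature vector (hypothetical extractor)
    pi:          weight vector, one weight per pattern feature
    """
    scores = np.array([pi @ feature_fn(m) for m in legal_moves])
    scores -= scores.max()                       # numerical stability
    probs = np.exp(scores) / np.exp(scores).sum()
    return probs                                 # probability for each legal move
```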
Value Networks
4-1. Value Networks
4-2. Reinforcement Learning
4-1. Value Networks 4. Value Networks
           | SL Policy Network     | RL Policy Network     | Value Network
Goal       | Prediction            | Prediction            | Evaluation
Type       | Classification        | Policy gradient       | Regression
Addresses  | Expert moves          | Self-play moves       | Winning rate
Structure  | (all three share the same convolutional architecture)
Output     | Probability p_σ(a|s)  | Probability p_ρ(a|s)  | Scalar value v_θ(s′)
4-1. Value Networks 4. Value Networks
Policy Network
- Input : current board position s
- Parameters : σ or ρ
- Output : a probability map over moves, p_σ(a|s) or p_ρ(a|s)
Value Network
- Input : predicted position s′
- Parameters : θ
- Output : a single scalar value v_θ(s′)
(Both consist of convolutional layers.)
4-2. Reinforcement Learning 4. Value Networks
v_p(s) : predicts the outcome from position s when both players follow policy p
v*(s) : the optimal value function under perfect play
v_{p_ρ}(s) : predicts the outcome when both players follow the RL policy p_ρ
The value network is trained so that v_θ(s) ≈ v_{p_ρ}(s) ≈ v*(s)
4-2. Reinforcement Learning 4. Value Networks
Learn θ to minimize the gap between the predicted value v_θ(s) and the actual outcome z.
The weights θ are updated on state-outcome pairs (s, z) by stochastic gradient descent, minimizing the mean squared error (z - v_θ(s))^2.
Naïve approach (training on successive positions from complete KGS games) → overfitting occurs:
- the network only memorizes game outcomes and cannot generalize to new positions
- training-set min MSE = 0.19
- test-set min MSE = 0.37
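A minimal PyTorch sketch of this regression step (value_net is assumed to be a convolutional network ending in a single scalar output; data loading and optimizer setup are omitted):

```python
import torch.nn.functional as F

def value_update(value_net, optimizer, states, outcomes):
    """One SGD step minimizing the MSE between v_theta(s) and the outcome z.

    states:   batch of board feature tensors
    outcomes: tensor of game results z in {-1.0, +1.0}, one per state
    """
    v = value_net(states).squeeze(-1)   # predicted values, one scalar per position
    loss = F.mse_loss(v, outcomes)      # mean of (z - v_theta(s))^2 over the batch
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```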
4-2. Reinforcement Learning 4. Value Networks
Improvements
- new self-play data set of 30 million distinct positions, each sampled from a separate game
- games were played by the RL policy network against itself
- training-set min MSE = 0.226
- test-set min MSE = 0.234
The value network is consistently more accurate than the rollout policy p_π, and a single evaluation
eventually approaches the accuracy of rollouts with the RL policy p_ρ while using 1/15,000 of the computation.
Playing Go
5-1. Searching with Networks
5-2. How does it work?
5-3. Performance
5-1. Searching with Networks 5. Playing Go
Term    | Definition
Q(s,a)  | action value
P(s,a)  | prior probability
N(s,a)  | visit count
u(s,a)  | exploration bonus: proportional to the prior probability, decays with repeated visits to encourage exploration
5-1. Searching with Networks 5. Playing Go
Selection: at each step of a simulation, select the edge with the maximum Q(s,a) + u(s,a), where the bonus u(s,a) is proportional to P(s,a) / (1 + N(s,a)).
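A sketch of that selection rule (the per-edge statistics are stored here in hypothetical dictionaries keyed by action, and c_puct is an exploration constant whose value in this sketch is arbitrary):

```python
def select_action(Q, P, N, c_puct=5.0):
    """Pick a = argmax_a [ Q(s,a) + u(s,a) ] with u proportional to P / (1 + N).

    Q, P, N: dicts mapping each legal action of the current node to its
             action value, prior probability, and visit count.
    """
    total_visits = sum(N.values())

    def score(a):
        u = c_puct * P[a] * (total_visits ** 0.5) / (1 + N[a])
        return Q[a] + u

    return max(Q, key=score)
```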
5-1. Searching with Networks 5. Playing Go
Expansion: a new leaf node s_L is processed once by the SL policy network p_σ.
Its output probabilities p_σ(a|s_L) are stored as the prior P(s,a) of each legal action.
Expansion only happens once the leaf's visit count reaches a threshold (N_L ≥ 40).
Why the SL policy?
Inside the search, the SL policy network p_σ performs better than the RL policy network p_ρ
(∵ humans select a diverse set of promising moves, while RL optimizes for the single best move).
5-1. Searching with Networks 5. Playing Go
Evaluation: a leaf s_L is evaluated in two ways.
p_π = the fast rollout policy plays the game out to the end and computes the winner → outcome z_L
v_θ(s_L) = the value network (derived from the RL policy network p_ρ, so that v_θ(s) ≈ v_{p_ρ}(s) ≈ v*(s)) → a scalar value
Leaf evaluation: V(s_L) = (1 - λ) · v_θ(s_L) + λ · z_L
5-1. Searching with Networks 5. Playing Go
Backup: after each simulation, update the statistics of every edge (s,a) on the traversed path: increment the visit count N(s,a) and set Q(s,a) to the mean of the leaf evaluations V(s_L) of all simulations that passed through that edge.
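A sketch of the evaluation and backup steps together (lambda_mix, the rollout and value-network callables, and the edge objects are placeholders; the mixing formula and the running-mean update follow the two slides above):

```python
def evaluate_leaf(value_net, rollout, leaf_state, lambda_mix=0.5):
    """Mixed leaf evaluation V(s_L) = (1 - lambda) * v_theta(s_L) + lambda * z_L."""
    v = value_net(leaf_state)      # scalar prediction from the value network
    z = rollout(leaf_state)        # +1 / -1 outcome of one fast rollout with p_pi
    return (1 - lambda_mix) * v + lambda_mix * z

def backup(path, leaf_value):
    """Update every edge (s,a) traversed in this simulation.

    path: list of edge objects, each with a visit count N and a mean value Q.
    """
    for edge in path:
        edge.N += 1
        edge.Q += (leaf_value - edge.Q) / edge.N   # running mean of V(s_L)
```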
5-1. Searching with Networks 5. Playing Go
Repeat & update: simulations are repeated until the search budget is exhausted.
The move actually played is the one whose edge from the root has been visited most often
(selecting by maximum Q instead causes overfitting to a few simulations).
5-2. How Does It Work? 5. Playing Go
Evaluation of all successors s′ of the root position s with the value network v_θ(s′).
The move labelled 54 has the highest estimated winning percentage.
5-2. How Does It Work? 5. Playing Go
Q(s,a) = action values for each (s,a), averaged over value-network evaluations only (λ = 0).
5-2. How Does It Work? 5. Playing Go
Q(s,a) = action values for each (s,a), averaged over rollout evaluations only (λ = 1).
5-2. How Does It Work? 5. Playing Go
Move probabilities taken directly from the SL policy network p_σ(a|s);
expansion occurs once a node's visit count is over 40.
5-2. How Does It Work? 5. Playing Go
Percentage frequency with which actions were selected from the root during the simulations.
5-2. How Does It Work? 5. Playing Go
The principal variation: the path with the maximum visit count.
Moves are presented in numbered sequence; AlphaGo selects the move indicated by the red circle.
5-3. Performance 5. Playing Go
Performance of AlphaGo on a single machine with various combinations of its components (rollouts, value network, policy network).
5-3. Performance 5. Playing Go
Performance comparison between the single-machine and distributed versions of AlphaGo.
5-3. Performance 5. Playing Go
Recent matches proved that AlphaGo can finally beat the highest-ranked professional 9-dan players.
Thank you for your attention!