Dynamic Programming and
Reinforcement Learning
applied to Tetris game
Suelen Goularte Carvalho
Inteligência Artificial
2015
Tetris
Tetris
✓ Board 20 × 10
✓ 7 types of tetrominoes (pieces)
✓ Move down, left or right
✓ Rotate pieces
Tetris One-Piece Controller
Player knows:
✓ board
✓ current piece.
Tetris Two-Piece Controller
Player knows:
✓ board
✓ current piece
✓ next piece
Tetris Evaluation
One-Piece Controller
Two-Piece Controller
How many possibilities do we have just here?
Tetris indeed contains a huge
number of board configurations.
Finding the strategy that maximizes
the average score is an
NP-Complete problem!
— Building Controllers for Tetris, 2009
7.0 × 2^199 ≃ 5.6 × 10^59
Tetris Complexity
Tetris is a problem of sequential
decision making under uncertainty.
In the context of dynamic programming
and stochastic control, the most
important object is the cost-to-go
function, which evaluates the expected
future cost from the current state.
— Feature-Based Methods for Large Scale Dynamic Programming
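For reference, the cost-to-go function the quote refers to can be written as the expected discounted sum of future costs (a standard textbook form, not spelled out on the slide):

    J(s) = E[ Σ_{t ≥ 0} γ^t · c(s_t, a_t) | s_0 = s ],  with 0 < γ ≤ 1

and it satisfies the Bellman equation J(s) = min_a E[ c(s, a) + γ · J(s′) ]. In the Tetris slides the same object appears with rewards instead of costs, so the min becomes a max.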
[Diagram: from state Si, candidate moves offer immediate rewards of 1000, 2500, 3000, 4000, 5000 and 7000. The best immediate reward is 7000, but the move that pays 5000 now leads to the best future reward (13000 versus 9000): immediate reward vs. future reward.]
7.0 × 2^199 ≃ 5.6 × 10^59
Essentially impossible to
compute, or even store, the value
of the cost-to-go function at every
possible state.
— Feature-Based Methods for Large Scale Dynamic Programming
A compact representation alleviates
the computational time and space
requirements of dynamic programming,
which employs an exhaustive look-up
table, storing one value per state.
— Feature-Based Methods for Large Scale Dynamic Programming
S = {s1, s2, …, sn} → V = {v1, v2, …, vm}
where m < n
For example, if the state i represents the
number of customers in a queueing
system, a possible and often interesting
feature f is defined by f(0) = 0 and f(i) = 1
if i > 0. Such a feature focuses on whether
a queue is empty or not.
— Feature-Based Methods for Large Scale Dynamic Programming
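A minimal sketch of that queueing feature as code (the function name is mine, not from the paper):

    def queue_nonempty_feature(i):
        # i = number of customers in the queue (the state)
        # f(0) = 0 and f(i) = 1 for i > 0: the feature only records
        # whether the queue is empty or not.
        return 0 if i == 0 else 1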
— Feature-Based Methods for Large Scale Dynamic Programming
Feature-based method
S = {s1, s2, …, sn} → V = {v1, v2, …, vm}
where m < n
— Feature-Based Methods for Large Scale Dynamic Programming
Features:
★ Height of the current wall: H = {0, ..., 20}
★ Number of holes: L = {0, ..., 200}
Feature extraction F : S → H × L (for the 10 × 20 board)
Feature-based method
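A rough sketch of that feature extraction, assuming the board is stored as 20 rows of 10 cells with 1 for filled and 0 for empty (the representation and function name are mine, not from the paper):

    def tetris_features(board):
        # board: 20 rows x 10 columns, board[0] is the top row, 1 = filled cell.
        rows, cols = len(board), len(board[0])
        heights = []
        holes = 0
        for c in range(cols):
            column = [board[r][c] for r in range(rows)]
            if 1 in column:
                top = column.index(1)           # first filled cell from the top
                heights.append(rows - top)      # height of this column
                holes += column[top:].count(0)  # empty cells below a filled cell
            else:
                heights.append(0)
        H = max(heights)   # height of the current wall, in {0, ..., 20}
        L = holes          # number of holes, in {0, ..., 200}
        return H, L

    # Example: an empty board has H = 0 and no holes.
    empty_board = [[0] * 10 for _ in range(20)]
    print(tetris_features(empty_board))   # (0, 0)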
Using a feature-based
evaluation function works better
than just choosing the move
that realizes the highest
immediate reward.
— Building Controllers for Tetris, 2009
Example of features
— Building Controllers for Tetris, 2009
...The problem of building a Tetris
controller comes down to building a
good evaluation function. Ideally,
this function should return high
values for the good decisions and
low values for the bad ones.
— Building Controllers for Tetris, 2009
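In practice such an evaluation function is often a weighted sum of the features; a minimal sketch (the weights, feature values and move names below are illustrative, not from the paper):

    def evaluate(feature_vector, weights):
        # Evaluation function: a weighted sum of the board features.
        # Higher value = a more desirable resulting board.
        return sum(w * f for w, f in zip(weights, feature_vector))

    def best_move(candidate_moves, weights):
        # candidate_moves: {move: feature vector of the board that move produces}
        # (generating those boards is game logic, omitted here).
        return max(candidate_moves, key=lambda m: evaluate(candidate_moves[m], weights))

    # Two hand-picked features (wall height, number of holes) with negative
    # weights, since both should be kept small:
    weights = (-1.0, -4.0)
    candidate_moves = {"drop left": (5, 1), "drop right": (7, 0)}
    print(best_move(candidate_moves, weights))   # "drop right"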
Reinforcement Learning
In this context, algorithms aim at
tuning the weights such that
the evaluation function
approximates well the
optimal expected future
score from each state.
— Building Controllers for Tetris, 2009
Reinforcement Learning
Reinforcement Learning by
The Big Bang Theory
https://www.youtube.com/watch?v=tV7Zp2B_mt8&list=PLAF3D35931B692F5C
Reinforcement Learning
Imagine playing a new game whose
rules you do not know; after roughly
a hundred moves, your opponent
announces: “You lost!”. In a nutshell,
that is reinforcement learning.
Supervised Learning
input:  1 2 3 4 5 6 7 8 …
output: 1 4 9 16 25 36 49 64 …
y = f(x) → function approximation
https://www.youtube.com/watch?v=Ki2iHgKxRBo&list=PLAwxTw4SYaPl0N6-e1GvyLp5-MUMUjOKo
Map inputs to outputs: f(x) = x²
labels score well
Unsupervised Learning
[Scatter plot: points of two kinds, x's and o's, grouped into two separate clusters]
f(x) → clusters description
clusters score well
Reinforcement Learning
[Diagram: the Agent takes an Action in the Environment; the Environment returns a Reward and the next State to the Agent]
behaviors score well
Reinforcement Learning
✓ Agents take actions in an environment and
receive rewards
✓ Goal is to find the policy π that maximizes
rewards
✓ Inspired by research into psychology and
animal learning
Reinforcement Learning Model
Given:
S, a set of states,
A, a set of actions,
T(s, a, s′) ~ P(s′ | s, a), the transition model,
R, the reward function
[Diagram: the same immediate-reward vs. future-reward tree from state Si as before]
Find:
π(s): a policy that maximizes the expected future reward.
Needs a lot of computation, processing and memory.
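To make the computation and memory point concrete, here is a minimal value-iteration sketch over an explicit, tiny state space (the states, transitions and rewards are toy stand-ins, not from the slides). It stores one value per state in a look-up table, which is exactly what becomes infeasible for Tetris:

    # Minimal value iteration: S states, A actions, T[s][a] = [(next_state, prob)],
    # R[s][a] = immediate reward, gamma = discount factor.
    S = ["s0", "s1", "s2"]
    A = ["left", "right"]
    T = {
        "s0": {"left": [("s1", 1.0)], "right": [("s2", 1.0)]},
        "s1": {"left": [("s1", 1.0)], "right": [("s2", 1.0)]},
        "s2": {"left": [("s0", 1.0)], "right": [("s2", 1.0)]},
    }
    R = {
        "s0": {"left": 0.0, "right": 1.0},
        "s1": {"left": 0.0, "right": 5.0},
        "s2": {"left": 2.0, "right": 0.0},
    }
    gamma = 0.9

    V = {s: 0.0 for s in S}        # one stored value per state (the look-up table)
    for _ in range(100):           # repeated Bellman optimality updates
        V = {s: max(R[s][a] + gamma * sum(p * V[s2] for s2, p in T[s][a])
                    for a in A)
             for s in S}

    # Greedy policy with respect to the converged values.
    pi = {s: max(A, key=lambda a: R[s][a] + gamma * sum(p * V[s2] for s2, p in T[s][a]))
          for s in S}
    print(V)
    print(pi)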
Dynamic Programming
Dynamic Programming
Solving a problem by breaking it down
into simpler subproblems, solving
each subproblem just once, and
storing their solutions.
https://en.wikipedia.org/wiki/Dynamic_programming
[Diagram: the optimal path from A to G passes through B; it is composed of an optimal path from A to B followed by an optimal path from B to G]
Supporting Property: Optimal Substructure
Fibonacci Sequence
0 1 1 2 3 5 8 13 21
Each number is the sum of the two numbers
before it.
Recursive formula: f(n) = f(n-1) + f(n-2)
n = 0 1 2 3 4 5 6 7 8
v = 0 1 1 2 3 5 8 13 21
Fibonacci Sequence
Fibonacci
n = 0 1 2 3 4 5 6 7 8
v = 0 1 1 2 3 5 8 13 21
f(6) = f(6-1) + f(6-2)
f(6) = f(5) + f(4)
f(6) = 5 + 3
f(6) = 8
Fibonacci Sequence - Normal computation
f(n) = f(n-1) + f(n-2)
[Recursion tree for f(6): 6 branches into 5 and 4, those into 4, 3 and 3, 2, and so on down to the base cases 1 and 0]
Fibonacci Sequence - Normal computation
[The same recursion tree for f(6) as above]
Running time: O(2^n) (exponential — the tree of calls roughly doubles at each level)
18 of 25 Nodes Are
Repeated Calculations!
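A quick way to see that repeated work is to count the calls made by the naive recursion (illustrative snippet, not from the slides; with 7 distinct subproblems f(0)…f(6), the remaining 18 of the 25 calls are repeats):

    calls = 0

    def naive_fib(n):
        # Plain recursion: recomputes the same subproblems over and over.
        global calls
        calls += 1
        return n if n < 2 else naive_fib(n - 1) + naive_fib(n - 2)

    print(naive_fib(6), calls)   # 8 is computed correctly, but it takes 25 calls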
Dictionary m
m[0] = 0, m[1] = 1
integer fib(n):
    if m[n] == null:
        m[n] = fib(n-1) + fib(n-2)
    return m[n]
Fibonacci Sequence - Dynamic Programming
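The same idea as runnable Python, using the dictionary-based memoization from the slide (variable names are mine):

    memo = {0: 0, 1: 1}              # base cases stored up front

    def fib(n):
        # Each subproblem is solved once; later calls reuse the stored value.
        if n not in memo:
            memo[n] = fib(n - 1) + fib(n - 2)
        return memo[n]

    print(fib(6))   # 8, matching the worked example above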
Fibonacci Sequence - Dynamic Programming
[Recursion tree for f(5); the table below is filled bottom-up, one entry per subproblem, each computed once:]
index: 0  1  2        3        4        5
value: 0  1  1 (=1+0) 2 (=1+1) 3 (=2+1) 5 (=3+2)
O(n) running time, O(1) memory (the bottom-up version only needs the last two values at any step)
Some scores over time…
Tsitsiklis and van Roy (1996): 31 (average over 100 games played)
Bertsekas and Tsitsiklis (1996): 3,200 (average over 100 games played)
Kakade (2001): 6,800 (how many game scores are averaged is not specified, though)
Farias and van Roy (2006): 4,700 (average over 90 games played)
— Building Controllers for Tetris, 2009
Dellacherie (Fahey, 2003): 660,000 — one-piece controller, tuned by hand, 56 games played.
Dellacherie (Fahey, 2003): 7,200,000 — current best! Two-piece controller with some original features whose weights were tuned by hand; only 1 game was played, and it took a week.
— Building Controllers for Tetris, 2009
Experiment…
Experiment
— Feature-Based Methods for Large Scale Dynamic Programming
An experienced human
Tetris player would take
about 3 minutes to
eliminate 30 rows.
20 players. 3 games each. 3 minutes per game.
Experiment cont.
Average score obtained: 24
Player 7 (me), game 1
1,000 points ≈ 1 row
Experiment cont.
• Average of 24 points every 3 minutes.
• That is, 5,760 points for every 12 hours of continuous play.
• A human player starts to approach the performance of the
algorithms (after some optimizations) after roughly 8 hours
of continuous play.
Experiment cont.
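For reference, the arithmetic behind the 5,760 figure: 24 points per 3-minute game = 8 points per minute, and 8 × 60 × 12 = 5,760 points in 12 hours of continuous play.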
Conclusion…
Dynamic Programming: optimizes the use of computational power.
Reinforcement Learning: optimizes the weights applied to the features.
Tetris: uses a feature-based approach to maximize the score.
Questions?
Suelen Goularte Carvalho
Inteligência Artificial
2015