Dynamic Programming and
Reinforcement Learning
applied to Tetris game
Suelen Goularte Carvalho
Inteligência Artificial
2015
Tetris
Tetris
✓ Board 20 × 10
✓ 7 types of tetrominoes (pieces)
✓ Move down, left or right
✓ Rotate pieces
Tetris One-Piece Controller
Player knows:
✓ board
✓ current piece.
Tetris Two-Piece Controller
Player knows:
✓ board
✓ current piece
✓ next piece
Tetris Evaluation
One-Piece Controller
Two-Piece Controller
How many possibilities do we have just here?
Tetris indeed contains a huge
number of board configurations.
Finding the strategy that maximizes
the average score is an
NP-Complete problem!
— Building Controllers for Tetris, 2009
7.0 × 2^199 ≃ 5.6 × 10^59
Tetris Complexity
Tetris is a problem of sequential
decision making under uncertainty.
In the context of dynamic programming
and stochastic control, the most
important object is the cost-to-go
function, which evaluates the expected
future cost from the current state.
— Feature-Based Methods for Large Scale Dynamic Programming
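For reference, the cost-to-go function the quote refers to can be written as the expected discounted sum of future costs (a standard textbook form, not spelled out on the slide):

    J(s) = E[ Σ_{t ≥ 0} γ^t · c(s_t, a_t) | s_0 = s ],  with 0 < γ ≤ 1

and it satisfies the Bellman equation J(s) = min_a E[ c(s, a) + γ · J(s′) ]. In the Tetris slides the same object appears with rewards instead of costs, so the min becomes a max.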
[Diagram: from state Si, candidate moves offer immediate rewards of 1000, 2500, 3000, 4000, 5000 and 7000. The best immediate reward is 7000, but the move that pays 5000 now leads to the best future reward (13000 versus 9000): immediate reward vs. future reward.]
7.0 × 2^199 ≃ 5.6 × 10^59
Essentially impossible to
compute, or even store, the value
of the cost-to-go function at every
possible state.
— Feature-Based Methods for Large Scale Dynamic Programming
A compact representation alleviates
the computational time and space
requirements of dynamic programming,
which employs an exhaustive look-up
table, storing one value per state.
— Feature-Based Methods for Large Scale Dynamic Programming
S = {s1, s2, …, sn} → V = {v1, v2, …, vm}
where m < n
For example, if the state i represents the
number of customers in a queueing
system, a possible and often interesting
feature f is defined by f(0) = 0 and f(i) = 1
if i > 0. Such a feature focuses on whether
a queue is empty or not.
— Feature-Based Methods for Large Scale Dynamic Programming
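A minimal sketch of that queueing feature as code (the function name is mine, not from the paper):

    def queue_nonempty_feature(i):
        # i = number of customers in the queue (the state)
        # f(0) = 0 and f(i) = 1 for i > 0: the feature only records
        # whether the queue is empty or not.
        return 0 if i == 0 else 1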
— Feature-Based Methods for Large Scale Dynamic Programming
Feature-based method
S = {s1, s2, …, sn} → V = {v1, v2, …, vm}
where m < n
— Feature-Based Methods for Large Scale Dynamic Programming
Features:
★ Height of the current wall: H = {0, ..., 20}
★ Number of holes: L = {0, ..., 200}
Feature extraction F : S → H × L (for the 10 × 20 board)
Feature-based method
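A rough sketch of that feature extraction, assuming the board is stored as 20 rows of 10 cells with 1 for filled and 0 for empty (the representation and function name are mine, not from the paper):

    def tetris_features(board):
        # board: 20 rows x 10 columns, board[0] is the top row, 1 = filled cell.
        rows, cols = len(board), len(board[0])
        heights = []
        holes = 0
        for c in range(cols):
            column = [board[r][c] for r in range(rows)]
            if 1 in column:
                top = column.index(1)           # first filled cell from the top
                heights.append(rows - top)      # height of this column
                holes += column[top:].count(0)  # empty cells below a filled cell
            else:
                heights.append(0)
        H = max(heights)   # height of the current wall, in {0, ..., 20}
        L = holes          # number of holes, in {0, ..., 200}
        return H, L

    # Example: an empty board has H = 0 and no holes.
    empty_board = [[0] * 10 for _ in range(20)]
    print(tetris_features(empty_board))   # (0, 0)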
Using a feature-based
evaluation function works better
than just choosing the move
that realizes the highest
immediate reward.
— Building Controllers for Tetris, 2009
Example of features
— Building Controllers for Tetris, 2009
...The problem of building a Tetris
controller comes down to building a
good evaluation function. Ideally,
this function should return high
values for the good decisions and
low values for the bad ones.
— Building Controllers for Tetris, 2009
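In practice such an evaluation function is often a weighted sum of the features; a minimal sketch (the weights, feature values and move names below are illustrative, not from the paper):

    def evaluate(feature_vector, weights):
        # Evaluation function: a weighted sum of the board features.
        # Higher value = a more desirable resulting board.
        return sum(w * f for w, f in zip(weights, feature_vector))

    def best_move(candidate_moves, weights):
        # candidate_moves: {move: feature vector of the board that move produces}
        # (generating those boards is game logic, omitted here).
        return max(candidate_moves, key=lambda m: evaluate(candidate_moves[m], weights))

    # Two hand-picked features (wall height, number of holes) with negative
    # weights, since both should be kept small:
    weights = (-1.0, -4.0)
    candidate_moves = {"drop left": (5, 1), "drop right": (7, 0)}
    print(best_move(candidate_moves, weights))   # "drop right"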
Reinforcement Learning
In this context, algorithms aim at
tuning the weights such that
the evaluation function
approximates well the
optimal expected future
score from each state.
— Building Controllers for Tetris, 2009
Reinforcement Learning
Reinforcement Learning by
The Big Bang Theory
https://www.youtube.com/watch?v=tV7Zp2B_mt8&list=PLAF3D35931B692F5C
Reinforcement Learning
Imagine playing a new game whose
rules you do not know; after roughly
a hundred moves, your opponent
announces: “You lost!”. In a nutshell,
that is reinforcement learning.
Supervised Learning
input:  1 2 3 4 5 6 7 8 …
output: 1 4 9 16 25 36 49 64 …
y = f(x) → function approximation
https://www.youtube.com/watch?v=Ki2iHgKxRBo&list=PLAwxTw4SYaPl0N6-e1GvyLp5-MUMUjOKo
Map inputs to outputs: f(x) = x²
labels score well
Unsupervised Learning
[Scatter plot: points of two kinds, x's and o's, grouped into two separate clusters]
f(x) → clusters description
clusters score well
Reinforcement Learning
[Diagram: the Agent takes an Action in the Environment; the Environment returns a Reward and the next State to the Agent]
behaviors score well
Reinforcement Learning
✓ Agents take actions in an environment and
receive rewards
✓ Goal is to find the policy π that maximizes
rewards
✓ Inspired by research into psychology and
animal learning
Reinforcement Learning Model
Given:
S, a set of states,
A, a set of actions,
T(s, a, s′) ~ P(s′ | s, a), the transition model,
R, the reward function
[Diagram: the same immediate-reward vs. future-reward tree from state Si as before]
Find:
π(s): a policy that maximizes the expected future reward.
Needs a lot of computation, processing and memory.
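To make the computation and memory point concrete, here is a minimal value-iteration sketch over an explicit, tiny state space (the states, transitions and rewards are toy stand-ins, not from the slides). It stores one value per state in a look-up table, which is exactly what becomes infeasible for Tetris:

    # Minimal value iteration: S states, A actions, T[s][a] = [(next_state, prob)],
    # R[s][a] = immediate reward, gamma = discount factor.
    S = ["s0", "s1", "s2"]
    A = ["left", "right"]
    T = {
        "s0": {"left": [("s1", 1.0)], "right": [("s2", 1.0)]},
        "s1": {"left": [("s1", 1.0)], "right": [("s2", 1.0)]},
        "s2": {"left": [("s0", 1.0)], "right": [("s2", 1.0)]},
    }
    R = {
        "s0": {"left": 0.0, "right": 1.0},
        "s1": {"left": 0.0, "right": 5.0},
        "s2": {"left": 2.0, "right": 0.0},
    }
    gamma = 0.9

    V = {s: 0.0 for s in S}        # one stored value per state (the look-up table)
    for _ in range(100):           # repeated Bellman optimality updates
        V = {s: max(R[s][a] + gamma * sum(p * V[s2] for s2, p in T[s][a])
                    for a in A)
             for s in S}

    # Greedy policy with respect to the converged values.
    pi = {s: max(A, key=lambda a: R[s][a] + gamma * sum(p * V[s2] for s2, p in T[s][a]))
          for s in S}
    print(V)
    print(pi)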
Dynamic Programming
Dynamic Programming
Solving a problem by breaking it down
into simpler subproblems, solving
each subproblem just once, and
storing their solutions.
https://en.wikipedia.org/wiki/Dynamic_programming
[Diagram: the optimal path from A to G passes through B; it is composed of an optimal path from A to B followed by an optimal path from B to G]
Supporting Property: Optimal Substructure
Fibonacci Sequence
0 1 1 2 3 5 8 13 21
Each number is the sum of the two numbers
before it.
Recursive formula: f(n) = f(n-1) + f(n-2)
n = 0 1 2 3 4 5 6 7 8
v = 0 1 1 2 3 5 8 13 21
Fibonacci Sequence
Fibonacci
n = 0 1 2 3 4 5 6 7 8
v = 0 1 1 2 3 5 8 13 21
f(6) = f(6-1) + f(6-2)
f(6) = f(5) + f(4)
f(6) = 5 + 3
f(6) = 8
Fibonacci Sequence - Normal computation
f(n) = f(n-1) + f(n-2)
[Recursion tree for f(6): 6 branches into 5 and 4, those into 4, 3 and 3, 2, and so on down to the base cases 1 and 0]
Fibonacci Sequence - Normal computation
[The same recursion tree for f(6) as above]
Running time: O(2^n) (exponential — the tree of calls roughly doubles at each level)
18 of 25 Nodes Are
Repeated Calculations!
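A quick way to see that repeated work is to count the calls made by the naive recursion (illustrative snippet, not from the slides; with 7 distinct subproblems f(0)…f(6), the remaining 18 of the 25 calls are repeats):

    calls = 0

    def naive_fib(n):
        # Plain recursion: recomputes the same subproblems over and over.
        global calls
        calls += 1
        return n if n < 2 else naive_fib(n - 1) + naive_fib(n - 2)

    print(naive_fib(6), calls)   # 8 is computed correctly, but it takes 25 calls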
Dictionary m
m[0] = 0, m[1] = 1
integer fib(n):
    if m[n] == null:
        m[n] = fib(n-1) + fib(n-2)
    return m[n]
Fibonacci Sequence - Dynamic Programming
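The same idea as runnable Python, using the dictionary-based memoization from the slide (variable names are mine):

    memo = {0: 0, 1: 1}              # base cases stored up front

    def fib(n):
        # Each subproblem is solved once; later calls reuse the stored value.
        if n not in memo:
            memo[n] = fib(n - 1) + fib(n - 2)
        return memo[n]

    print(fib(6))   # 8, matching the worked example above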
Fibonacci Sequence - Dynamic Programming
[Recursion tree for f(5); the table below is filled bottom-up, one entry per subproblem, each computed once:]
index: 0  1  2        3        4        5
value: 0  1  1 (=1+0) 2 (=1+1) 3 (=2+1) 5 (=3+2)
O(n) running time, O(1) memory (the bottom-up version only needs the last two values at any step)
Some scores over time…
Tsitsiklis and van Roy (1996): 31 (average over 100 games played)
Bertsekas and Tsitsiklis (1996): 3,200 (average over 100 games played)
Kakade (2001): 6,800 (how many game scores are averaged is not specified, though)
Farias and van Roy (2006): 4,700 (average over 90 games played)
— Building Controllers for Tetris, 2009
Dellacherie (Fahey, 2003): 660,000 — one-piece controller, tuned by hand, 56 games played.
Dellacherie (Fahey, 2003): 7,200,000 — current best! Two-piece controller with some original features whose weights were tuned by hand; only 1 game was played, and it took a week.
— Building Controllers for Tetris, 2009
Experiment…
Experiment
— Feature-Based Methods for Large Scale Dynamic Programming
An experienced human
Tetris player would take
about 3 minutes to
eliminate 30 rows.
20 players. 3 games each. 3 minutes per game.
Experiment cont.
Average score obtained: 24
Player 7 (me), game 1
1,000 points ≈ 1 row
Experiment cont.
• Average of 24 points every 3 minutes.
• That is, 5,760 points for every 12 hours of continuous play.
• A human player starts to approach the performance of the
algorithms (after some optimizations) after roughly 8 hours
of continuous play.
Experiment cont.
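For reference, the arithmetic behind the 5,760 figure: 24 points per 3-minute game = 8 points per minute, and 8 × 60 × 12 = 5,760 points in 12 hours of continuous play.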
Conclusion…
Dynamic Programming: optimizes the use of computational power.
Reinforcement Learning: optimizes the weights applied to the features.
Tetris: uses a feature-based approach to maximize the score.
Questions?
Suelen Goularte Carvalho
Inteligência Artificial
2015