Proceedings of the European Computing Conference 
Generic Reinforcement Schemes and Their Optimization 
DANA SIMIAN, FLORIN STOICA 
Department of Informatics 
“Lucian Blaga” University of Sibiu 
Str. Dr. Ion Ratiu 5-7, 550012, Sibiu 
ROMANIA 
dana.simian@ulbsibiu.ro, florin.stoica@ulbsibiu.ro 
Abstract: - The aim of this paper is to introduce a generic absolutely expedient reinforcement scheme depending on two parameters and to present a method for optimizing its learning parameters. Using a Breeder genetic algorithm, we optimize several schemes derived from our generic one in order to reach the best performance. Furthermore, we compare our results in terms of speed and efficiency.
Key-Words: - Reinforcement Learning, Breeder genetic algorithm, Optimization 
1 Introduction 
Reinforcement schemes are the algorithms that implement the learning process of stochastic learning automata. Stochastic learning automata adapt to changes in their environment as a result of a reinforcement learning process. Given a set of possible actions, a stochastic learning automaton must choose the optimal one, based on the environment response and its past actions. Initially, equal probabilities are assigned to all possible actions; one action is selected at random and the action probabilities are updated based on the environment response. A detailed characterization of reinforcement learning can be found in [14]. As underlined in [7], the major advantage of reinforcement learning is that it needs information about the environment only through the reinforcement signal.
Reinforcement learning has several applications in autonomous robotics, the design of multi-agent systems, intelligent vehicle control, etc. ([2], [3], [5], [15], [16]). In [11] we designed a simulator of an intelligent vehicle control system. The system was based on two learning automata.
In other articles ([9], [13]), we defined new reinforcement schemes in order to reach the best performance of our system. Usually, reinforcement schemes depend on many parameters, as we can see in section 3. An important problem is how to choose the optimal values of the scheme's parameters.
The aim of this paper is to introduce a generic reinforcement scheme from which many other reinforcement schemes can be obtained, and to present a method for optimizing these schemes with respect to their learning parameters. We also optimize this new scheme and those that we introduced in [9], [12]. We evaluate and compare our schemes using two criteria: the speed of the optimization process and the efficiency of the optimized scheme.
The remainder of this paper is organized as follows. In section 2 we briefly present the mathematical background of stochastic learning automata with variable structure. Section 3 presents our generic absolutely expedient reinforcement scheme, together with particular schemes derived from it. In section 4 we present our optimization method for the learning parameters of reinforcement schemes and analyze the results obtained. Conclusions and further directions of study are presented in section 5.
2 Mathematical background of stochastic automata
A stochastic automaton supposes the existence of a set of actions, which define the input of the environment, and a response set. The range of the response values depends on the model we choose. There are three different models for representing the response values: the P-model, the S-model and the Q-model. The P-model uses a set of binary values, 0 or 1. In the S-model the response values are continuous in the range (0, 1). In the Q-model the response set is a finite set of discrete values in the range (0, 1). In this paper we use the P-model for our reinforcement schemes.
A stochastic automaton selects one action at random, observes the response from the environment and updates the action probabilities based on that response. An action can be rewarded or punished using a set of penalty probabilities.
The mathematical model of a stochastic automaton is defined by a triple {α, c, β} corresponding to the elements presented before:
a) α = {α1, α2, ..., αr} - the input actions of the environment;
b) β - the response set. In the case of the P-model, β = {β1, β2} is a binary set:
β = 0 is a favourable outcome and β = 1 is an unfavourable outcome. To refer to the time instant n, the notation α(n), β(n) is used.
c) c = {c1, c2, ..., cr} - the set of penalty probabilities. The element ci is the probability that action αi will result in an unfavourable response:
ci = P(β(n) = 1 | α(n) = αi), i = 1, 2, ..., r
The evolution in time of penalty probabilities defines 
two types of environments: stationary (the penalty 
probabilities are constant over time) and nonstationary 
(the penalties change over time). 
In the following we consider only stationary random 
environments. 
The action probability vector at time moment n+1 is updated using a mapping T and the current probabilities pi(n) = P(α(n) = αi), i = 1, ..., r:
p(n+1) = T[p(n), α(n), β(n)]
Reinforcement schemes are named linear if p(n +1) is 
a linear function of p(n) , and nonlinear otherwise. 
The performance of a learning automaton is evaluated using a quantitative norm of behavior ([17]), represented by the average penalty for a given action probability vector, M(n):
$$M(n) = P(\beta(n) = 1 \mid p(n)) = \sum_{i=1}^{r} P(\beta(n) = 1 \mid \alpha(n) = \alpha_i)\, P(\alpha(n) = \alpha_i) = \sum_{i=1}^{r} c_i \, p_i(n)$$
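As a small numerical illustration (the figures below are hypothetical, not taken from the paper): for $r = 2$ actions with penalty probabilities $c = (0.2, 0.8)$ and action probabilities $p(n) = (0.5, 0.5)$, the average penalty is $M(n) = 0.5 \cdot 0.2 + 0.5 \cdot 0.8 = 0.5$; shifting probability towards the better action, e.g. $p(n) = (0.9, 0.1)$, lowers it to $M(n) = 0.9 \cdot 0.2 + 0.1 \cdot 0.8 = 0.26$. Absolute expediency, defined below, asks for exactly this kind of monotone decrease of $M(n)$.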
The only class of reinforcement schemes for which 
necessary and sufficient conditions of design are 
available is represented by absolutely expedient learning 
schemes, defined in [7]. An automaton is absolutely 
expedient if M(n +1) < M(n) for all n ([7]). 
The general solution for absolutely expedient schemes 
was found by Lakshmivarahan and Thathachar in [4]. 
Other studies about expedient learning algorithms can be 
found in [8]. 
In [17], a nonlinear absolutely expedient reinforcement scheme is presented for a stationary N-teacher P-model environment. In the N-teacher model, if the automaton produced the action $\alpha_i$ and the responses from the environments (or "teachers") are denoted by $\beta_i^j$, $j = 1, \ldots, N$, then the updating rules are:
$$p_i(n+1) = p_i(n) + \left[\frac{1}{N}\sum_{k=1}^{N}\beta_i^k\right]\sum_{\substack{j=1 \\ j \neq i}}^{r}\phi_j(p(n)) - \left[1 - \frac{1}{N}\sum_{k=1}^{N}\beta_i^k\right]\sum_{\substack{j=1 \\ j \neq i}}^{r}\psi_j(p(n)) \qquad (1)$$

$$p_j(n+1) = p_j(n) - \left[\frac{1}{N}\sum_{k=1}^{N}\beta_i^k\right]\phi_j(p(n)) + \left[1 - \frac{1}{N}\sum_{k=1}^{N}\beta_i^k\right]\psi_j(p(n)), \quad \forall j \neq i \qquad (2)$$
The functions $\phi_i$ and $\psi_i$ satisfy the following conditions:

$$\frac{\phi_1(p(n))}{p_1(n)} = \cdots = \frac{\phi_r(p(n))}{p_r(n)} = \lambda(p(n)) \le 0 \qquad (3)$$

$$\frac{\psi_1(p(n))}{p_1(n)} = \cdots = \frac{\psi_r(p(n))}{p_r(n)} = \mu(p(n)) \le 0 \qquad (4)$$

$$p_i(n) + \sum_{\substack{j=1 \\ j \neq i}}^{r}\phi_j(p(n)) > 0 \qquad (5)$$

$$p_i(n) - \sum_{\substack{j=1 \\ j \neq i}}^{r}\psi_j(p(n)) < 1 \qquad (6)$$

$$p_j(n) + \psi_j(p(n)) > 0 \qquad (7)$$

$$p_j(n) - \phi_j(p(n)) < 1 \qquad (8)$$

for all $j \in \{1, \ldots, r\} \setminus \{i\}$.
In [1] and [15] it is proved that the automaton with the reinforcement scheme given in (1)-(2) is absolutely expedient in a stationary environment if the functions $\lambda(p(n))$ and $\mu(p(n))$ satisfy the following conditions:

$$\lambda(p(n)) \le 0, \quad \mu(p(n)) \le 0, \quad \lambda(p(n)) + \mu(p(n)) < 0 \qquad (9)$$
3 Generic absolutely expedient 
reinforcement scheme 
In the following we present a generic two-parameter-dependent reinforcement scheme and prove that this scheme is absolutely expedient in a stationary environment. We start from the scheme given in (1)-(2). This scheme is also valid for a single-teacher model; in this case we define a single environment response, denoted by f.
Thus, the updating rules become: 
$$p_i(n+1) = p_i(n) - \gamma_1\, f\, H(n)\,[1 - p_i(n)] + \gamma_2\,(1 - f)\,[1 - p_i(n)]$$

$$p_j(n+1) = p_j(n) + \gamma_1\, f\, H(n)\, p_j(n) - \gamma_2\,(1 - f)\, p_j(n), \quad \forall j \neq i \qquad (10)$$

i.e.:

$$\phi_k(p(n)) = -\gamma_1\, H(n)\, p_k(n), \qquad \psi_k(p(n)) = -\gamma_2\, p_k(n)$$
where the learning parameters $\gamma_1$ and $\gamma_2$ are real values:

$$\gamma_1, \gamma_2 \in (0, 1) \qquad (11)$$
The function H is defined as:

$$H(n) = \min\left\{1;\; \max\left\{\min_{\substack{j = 1, \ldots, r \\ j \neq i}}\left\{\frac{p_i(n)}{\gamma_1\,(1 - p_i(n))} - \varepsilon;\; \frac{1 - p_j(n)}{\gamma_1\, p_j(n)} - \varepsilon\right\};\; 0\right\}\right\}$$
Parameter ε is an arbitrarily small positive real number. 
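For illustration only, the following is a minimal Python sketch of one update step of the generic scheme, i.e. equations (10) together with the function H(n) above, written in the form obtained by substituting φ and ψ into (1)-(2). The function names, the list representation of p(n) and the concrete value of ε are our own choices, and f follows the paper's P-model convention (f = 0 favourable, f = 1 unfavourable).

```python
def H(p, i, gamma1, eps=1e-6):
    """Bounding function H(n): keeps the penalty step inside (0, 1).
    H(n) = min{1, max{min_{j != i}{ p_i/(gamma1*(1-p_i)) - eps,
                                    (1-p_j)/(gamma1*p_j) - eps }, 0}}."""
    terms = [p[i] / (gamma1 * (1.0 - p[i])) - eps]
    terms += [(1.0 - p[j]) / (gamma1 * p[j]) - eps
              for j in range(len(p)) if j != i]
    return min(1.0, max(min(terms), 0.0))

def update(p, i, f, gamma1, gamma2):
    """One step of scheme (10): action i was selected, the environment
    answered f (0 = favourable, 1 = unfavourable)."""
    h = H(p, i, gamma1)
    q = list(p)
    q[i] = p[i] - gamma1 * f * h * (1.0 - p[i]) + gamma2 * (1 - f) * (1.0 - p[i])
    for j in range(len(p)):
        if j != i:
            q[j] = p[j] + gamma1 * f * h * p[j] - gamma2 * (1 - f) * p[j]
    return q
```

Note that the updated probabilities still sum to one, and the min/max structure of H(n) is exactly what keeps every component inside (0, 1), as required by conditions (5')-(8') derived below.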
Our reinforcement scheme differs from the schemes given in [15]-[17] by the definition of H and $\phi_k$. We will show that all the conditions of the reinforcement scheme (1)-(2) are satisfied.
From (3), (4) we have: 
$$\frac{\phi_k(p(n))}{p_k(n)} = \frac{-\gamma_1\, H(n)\, p_k(n)}{p_k(n)} = -\gamma_1\, H(n) = \lambda(p(n)) \qquad (3')$$

$$\frac{\psi_k(p(n))}{p_k(n)} = \frac{-\gamma_2\, p_k(n)}{p_k(n)} = -\gamma_2 = \mu(p(n)) \qquad (4')$$
The conditions (5)-(8) become:

$$p_i(n) - \gamma_1\, H(n)\,(1 - p_i(n)) > 0 \;\Longleftrightarrow\; H(n) < \frac{p_i(n)}{\gamma_1\,(1 - p_i(n))} \qquad (5')$$

Condition (5') is satisfied by the definition of the function H(n).

$$p_i(n) + \gamma_2\,(1 - p_i(n)) < 1 \qquad (6')$$

But $p_i(n) + \gamma_2\,(1 - p_i(n)) < p_i(n) + 1 - p_i(n) = 1$, since $0 < \gamma_2 < 1$.

$$p_j(n) - \gamma_2\, p_j(n) > 0, \quad \forall j \in \{1, \ldots, r\} \setminus \{i\} \qquad (7')$$

But $p_j(n) - \gamma_2\, p_j(n) = p_j(n)\,(1 - \gamma_2) > 0$, since $0 < \gamma_2 < 1$ and $0 < p_j(n) < 1$ for all $j \in \{1, \ldots, r\} \setminus \{i\}$.

$$p_j(n) + \gamma_1\, H(n)\, p_j(n) < 1 \;\Longleftrightarrow\; H(n) < \frac{1 - p_j(n)}{\gamma_1\, p_j(n)}, \quad \forall j \in \{1, \ldots, r\} \setminus \{i\} \qquad (8')$$

This condition is satisfied by the definition of the function H(n).
Therefore our reinforcement scheme is a candidate for 
absolute expediency. 
Furthermore, the functions λ and μ of our nonlinear scheme satisfy:

$$\lambda(p(n)) = -\gamma_1\, H(n) \le 0, \quad \mu(p(n)) = -\gamma_2 \le 0, \quad \lambda(p(n)) + \mu(p(n)) = -\gamma_1\, H(n) - \gamma_2 < 0$$
In conclusion, the algorithm given in equations (10) is absolutely expedient in a stationary environment. This algorithm defines a two-parameter-dependent generic absolutely expedient reinforcement scheme. We denote this scheme by $R^{\gamma_1}_{\gamma_2}$. Choosing different expressions for the parameters, such that (11) holds, we obtain several absolutely expedient reinforcement schemes.
In [9] we introduced and studied the scheme $R^{\theta(1-\delta)}_{(1-\theta)\delta}$, with $0 < \theta < 1$ and $0 < \delta < 1$. Obviously $0 < \theta(1-\delta) < 1$ and $0 < (1-\theta)\delta < 1$, therefore this is an absolutely expedient reinforcement scheme.
In [12] we introduced the scheme $R^{\theta}_{\theta\delta}$, with $0 < \theta < 1$ and $0 < \theta\delta < 1$.
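Expressed in code (a trivial illustration with our own naming, assuming as above that the superscript of R is $\gamma_1$ and the subscript is $\gamma_2$), the concrete schemes differ only in how a pair $(\delta, \theta)$ is mapped to the parameters of the generic scheme:

```python
# Mapping of a parameter pair (delta, theta) to (gamma1, gamma2);
# each component stays in (0, 1) whenever 0 < delta, theta < 1, so (11) holds.
SCHEMES = {
    "R^theta_delta":                         lambda d, t: (t, d),
    "R^theta_{theta*delta}":                 lambda d, t: (t, t * d),
    "R^{theta*(1-delta)}_{(1-theta)*delta}": lambda d, t: (t * (1 - d), (1 - t) * d),
}
```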
4 Optimization of two-parameter reinforcement schemes
A very important problem is to find the optimal values of the learning parameters in the scheme $R^{\gamma_1}_{\gamma_2}$ in order to reach the best performance. In [13] we first introduced the idea of optimizing the learning parameters of a reinforcement scheme using genetic algorithms. We develop this idea here and use a Breeder genetic algorithm to provide the optimal learning parameters for the generic scheme $R^{\gamma_1}_{\gamma_2}$. We also apply the method to the particular schemes presented in section 3. Furthermore, we compare our results in terms of speed and efficiency. For simplicity of notation, we use in our comparisons the scheme $R^{\theta}_{\delta}$, with $\delta, \theta \in (0, 1)$, instead of $R^{\gamma_1}_{\gamma_2}$.
The aim is to find optimal values for the learning parameters δ and θ in the schemes $R^{\theta}_{\delta}$, $R^{\theta(1-\delta)}_{(1-\theta)\delta}$ and $R^{\theta}_{\theta\delta}$.
Because the parameters are real values, we use the Breeder genetic algorithm, proposed by Mühlenbein and Schlierkamp-Voosen in [6], which represents solutions (chromosomes) as vectors of real numbers. This algorithm is closer to reality than standard genetic algorithms, which use a discrete representation of solutions. The skeleton of the Breeder genetic algorithm can be found in [13]. Selection is performed randomly from the T% best elements of the current population, where T is a constant of the algorithm (usually, T = 40 provides the best results). Thus, within each generation, two elements selected from the T% best chromosomes are subject to the crossover operation. The mutation operator is then applied to the new child obtained from mating the parents.
The process is repeated until N-1 new individuals are obtained, where N represents the size of the initial population. The best chromosome (evaluated through the fitness function) is inserted into the new population (1-elitism). Thus, the new population also has N elements.
Let $x = \{x_i\}_{i=1,\ldots,n}$ and $y = \{y_i\}_{i=1,\ldots,n}$ be two chromosomes. The Breeder crossover operator gives a new chromosome z whose genes are $z_i = x_i + \alpha_i\,(y_i - x_i)$, $i = 1, \ldots, n$, with $\alpha_i$ a random variable uniformly distributed on $[-\varepsilon, 1+\varepsilon]$; $\varepsilon$ depends on the problem to be solved and typically lies in the interval $[0, 0.5]$.
The mutation operator gives $x_i = x_i + s_i \cdot r_i \cdot a_i$, $i = 1, \ldots, n$, with $s_i \in \{-1, 1\}$ uniform at random, $r_i = r \cdot \mathrm{domain}_{x_i}$, $r \in [0.1, 0.5]$ (typically 0.1), $a_i = 2^{-k\alpha}$ with $\alpha \in [0, 1]$ uniform at random, and k the number of bytes used to represent a number on the machine on which the Breeder algorithm is executed (mutation precision).
The probability of mutation is typically chosen as 1/n.
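The sketch below illustrates one possible Python implementation of the operators and of one Breeder generation as described above; the function names, the domain clamping and the exact way the per-gene mutation probability 1/n is applied are our own assumptions, not prescriptions from [6] or [13].

```python
import random

def breeder_crossover(x, y, eps=0.0):
    """z_i = x_i + alpha_i * (y_i - x_i), alpha_i uniform in [-eps, 1 + eps]."""
    return [xi + random.uniform(-eps, 1.0 + eps) * (yi - xi)
            for xi, yi in zip(x, y)]

def breeder_mutation(x, domains, p_mut, r=0.1, k=8):
    """Mutate each gene with probability p_mut:
    x_i <- x_i + s_i * r_i * a_i, with s_i in {-1, 1}, r_i = r * range_i,
    a_i = 2^(-k * alpha), alpha uniform in [0, 1] (k = mutation precision)."""
    out = []
    for xi, (lo, hi) in zip(x, domains):
        if random.random() < p_mut:
            s = random.choice((-1.0, 1.0))
            a = 2.0 ** (-k * random.random())
            xi = min(hi, max(lo, xi + s * r * (hi - lo) * a))  # clamp (our safeguard)
        out.append(xi)
    return out

def breeder_generation(pop, fitness, domains, T=40):
    """One generation: truncation selection from the best T%, Breeder
    crossover, per-gene mutation with probability 1/n, and 1-elitism."""
    n = len(pop[0])
    ranked = sorted(pop, key=fitness)            # smaller fitness = better here
    pool = ranked[:max(2, len(pop) * T // 100)]
    new_pop = [ranked[0]]                        # the best chromosome survives
    while len(new_pop) < len(pop):
        a, b = random.sample(pool, 2)
        child = breeder_crossover(a, b)
        new_pop.append(breeder_mutation(child, domains, 1.0 / n))
    return new_pop
```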
In order to find the best values of the learning parameters δ and θ of our reinforcement schemes and to compare the results, we consider the same example we used in [9], [13]. We used our reinforcement schemes for robot navigation in the grid world presented in Fig. 1. The current position of the robot is marked by a circle. Navigation is done using four actions α = {N, S, E, W}, corresponding to the four possible movements along the coordinate directions.
Fig. 1. Grid world for robot navigation 
We have a single optimal action (movement to S). In the 
learning process, only this action receives reward. 
Initially, we choose for the optimal action a small 
probability value (0.0005). We stop the execution when 
the probability of the optimal action, popt, reaches a 
certain value (popt=0.9999). 
We evaluate the performance of our schemes using the "number of steps" of the learning algorithm until the stop condition is reached.
Using the Breeder genetic algorithm, we can provide the 
optimal learning parameters for our schemes, in order to 
reach the best performance. 
Each chromosome contains two genes, representing the real values δ and θ. The fitness function used to evaluate a chromosome is the number of steps needed by the learning process to reach the value 0.9999 for the probability of the optimal action.
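To make the fitness function concrete, a possible evaluation routine is sketched below; it reuses the update() function sketched in section 3, maps the chromosome $(\delta, \theta)$ to $(\gamma_1, \gamma_2) = (\theta, \delta)$ for the scheme $R^{\theta}_{\delta}$ (our reading of the notation), rewards only the optimal action as in the experiment description, and averages the step count over several runs; the number of runs is our own assumption.

```python
import random

def steps_to_converge(chromosome, n_runs=20, p_target=0.9999):
    """Fitness of a chromosome (delta, theta): average number of learning
    steps until the probability of the optimal action reaches p_target."""
    delta, theta = chromosome
    gamma1, gamma2 = theta, delta          # scheme R^theta_delta (our mapping)
    opt = 1                                # index of the optimal action (S)
    total = 0
    for _ in range(n_runs):
        p = [0.9995 / 3.0] * 4             # four actions N, S, E, W
        p[opt] = 0.0005                    # initial probability of the optimal action
        steps = 0
        while p[opt] < p_target:
            i = random.choices(range(4), weights=p)[0]   # select an action
            f = 0 if i == opt else 1       # only the optimal action is rewarded
            p = update(p, i, f, gamma1, gamma2)
            steps += 1
        total += steps
    return total / n_runs
```

Since fewer steps means faster learning, this quantity is minimized by the Breeder genetic algorithm.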
The parameters of the Breeder algorithm are assigned the following values: ε = 0, r = 0.1, k = 8. The initial population has 400 chromosomes and the algorithm is stopped after 1000 generations.
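Putting the pieces together, a driver loop with the settings above might look as follows (again only a sketch under our assumptions, reusing the breeder_generation() and steps_to_converge() functions sketched earlier):

```python
import random

DOMAINS = [(0.01, 0.99), (0.01, 0.99)]     # delta and theta kept strictly inside (0, 1), our safeguard

# initial population of 400 random chromosomes (delta, theta)
population = [[random.uniform(0.01, 0.99), random.uniform(0.01, 0.99)]
              for _ in range(400)]

for generation in range(1000):
    population = breeder_generation(population, steps_to_converge, DOMAINS, T=40)

best = min(population, key=steps_to_converge)
print("optimal (delta, theta):", best)
```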
The results provided by the Breeder genetic algorithm 
are presented in Table 1. 
Optimal values for the learning parameters provided by the Breeder algorithm
(4 actions, with p_opt(0) = 0.0005 and p_i(0) = 0.9995/3 for i ≠ opt)

                                      Scheme 4.1    Scheme 4.2     Scheme 4.3
                                      R^θ_δ         R^θ_{θδ}       R^{θ(1-δ)}_{(1-θ)δ}
  δ                                   0.5866        0.7036         0.5741
  θ                                   0.9469        0.8983         0.3640
  Average number of steps
  to reach p_opt = 0.9999             16.95         16.98          43.70

Table 1. Optimal values for the learning parameters provided by the Breeder genetic algorithm
Fig. 2. Schema optimization vs. time passed 
Figure 2 presents the optimization process for the reinforcement schemes analyzed in Table 1, using two dimensions of the data: the elapsed time vs. the performance of the optimized scheme (the number of steps necessary to reach the stop condition of the learning process).
Figure 3 presents the optimization process using as dimensions of the data the number of generations of the Breeder algorithm vs. the performance of the optimized scheme.
Fig. 3. Schema optimization vs. number of generations 
in Breeder algorithm 
From the results in Table 1, we can conclude that the Breeder genetic algorithm is capable of providing the best values for the learning parameters, and thus our schemes were optimized for best performance. The results obtained by our optimized nonlinear schemes are significantly better than those obtained in [10], [12], [17].
5 Conclusions 
Using a Breeder genetic algorithm, we automatically found the optimal values of the learning parameters of several reinforcement schemes, in order to reach the best performance, measured as the number of iterations of the learning process ("number of steps").
From the graphical results of the optimization process shown in Fig. 2 and Fig. 3, we can conclude that scheme 4.3, $R^{\theta(1-\delta)}_{(1-\theta)\delta}$, is more adequate for applications with little time allocated to scheme optimization, and that scheme 4.2, $R^{\theta}_{\theta\delta}$, is very efficient if we allocate enough time to the optimization. However, the new generic scheme $R^{\theta}_{\delta}$, introduced in section 3, outperforms the other schemes in terms of speed and qualitative results in the learning process.
There are many possibilities for choosing the form of the parameters in the generic scheme $R^{\gamma_1}_{\gamma_2}$ such that condition (11) is satisfied. The Breeder genetic algorithm presented in section 4 can be used to optimize the parameter values regardless of the choice of $\gamma_1, \gamma_2$. The graphical results obtained suggest that taking $\gamma_1$ and $\gamma_2$ as two independent parameters θ and δ, with $0 < \delta, \theta < 1$ (the scheme $R^{\theta}_{\delta}$), gives better results than other, more complicated choices. As a further direction of study we want to rigorously prove or invalidate this conjecture.
References: 
[1] N. Baba, New Topics in Learning Automata: Theory 
and Applications, Lecture Notes in Control and 
Information Sciences, Berlin, Germany: Springer- 
Verlag, pp.750-758, 1984. 
[2] O. Buffet, A. Dutech, and F. Charpillet, Incremental 
reinforcement learning for designing multi-agent 
systems, In J. P. Müller, E. Andre, S. Sen, and C. 
Frasson, editors, Proceedings of the Fifth International 
Conference on Autonomous Agents, Montreal, Canada, 
ACM Press, pp. 31–38, 2001. 
[3] M. Dorigo, Introduction to the Special Issue on 
Learning Autonomous Robots, IEEE Trans. on Systems, 
Man and Cybernetics - part B, Vol. 26, No. 3, pp. 361- 
364, 1996. 
[4] S. Lakshmivarahan, M.A.L. Thathachar, Absolutely 
Expedient Learning Algorithms for Stochastic 
Automata, IEEE Transactions on Systems, Man and 
Cybernetics, vol. SMC-6, pp. 281-286, 1973. 
[5] J. Moody, Y. Liu, M. Saffell, and K. Youn. 
Stochastic direct reinforcement: Application to simple 
games with recurrence, In Proceedings of Artificial 
Multiagent Learning. Papers from the 2004 AAAI Fall 
Symposium, Technical Report FS-04-02. 
[6] H. Mühlenbein, D. Schlierkamp-Voosen, The science 
of breeding and its application to the breeder genetic 
algorithm, Evolutionary Computation, vol. 1, pp. 335- 
360, 1994. 
[7] K. S. Narendra, M. A. L. Thathachar, Learning 
Automata: an introduction, Prentice-Hall, 1989. 
[8] C. Rivero, Characterization of the absolutely 
expedient learning algorithms for stochastic automata in 
a non-discrete space of actions, ESANN'2003 
proceedings - European Symposium on Artificial Neural 
Networks Bruges (Belgium), pp. 307-312, 2003. 
[9] D. Simian, F. Stoica, A New Nonlinear 
Reinforcement Scheme for Stochastic Learning 
Automata, Proceedings of the 12th WSEAS International 
Conference on Automatic Control, Modelling & 
Simulation, Catania, Italy, pp. 450-454, 2010. 
[10] F. Stoica, E. M. Popa, An Absolutely Expedient 
Learning Algorithm for Stochastic Automata, WSEAS 
Transactions on Computers, Issue 2, Volume 6, pp. 229- 
235, 2007. 
[11] F. Stoica, D. Simian, Automatic control based on 
Wasp Behavioral Model and Stochastic Learning 
Automata. Mathematics and Computers in Science and 
Engineering Series, Proceedings of 10th WSEAS 
Conference on Mathematical Methods, Computational 
ISBN: 978-960-474-297-4 336
Proceedings of the European Computing Conference 
Techniques and Intelligent Systems (MAMECTIS '08), 
Corfu 2008, WSEAS Press, pp. 289-295, 2008. 
[12] F. Stoica, E. M. Popa, I. Pah, A new reinforcement 
scheme for stochastic learning automata – Application to 
Automatic Control, Proceedings of the International 
Conference on e-Business, Porto, Portugal, pp. 45-50, 
2008. 
[13] F. Stoica, D. Simian, Optimizing a New Nonlinear 
Reinforcement Scheme with Breeder genetic algorithm, 
Proceedings of the 11th WSEAS International 
Conference on Evolutionary Computing (EC'10), Iaşi, 
Romania, pp. 273-278, 2010. 
[14] R. Sutton, A. Barto, Reinforcement learning: An 
introduction, MIT-press, Cambridge, MA, 1998. 
[15] C. Ünsal, P. Kachroo, J. S. Bay, Simulation Study 
of Learning Automata Games in Automated Highway 
Systems, 1st IEEE Conference on Intelligent 
Transportation Systems (ITSC’97), Boston, 
Massachusetts, 1997 
[16] C. Ünsal, P. Kachroo, J. S. Bay, Simulation Study 
of Multiple Intelligent Vehicle Control using Stochastic 
Learning Automata, IEEE Transactions on Systems, 
Man and Cybernetics – Part A, Systems and Human, 
pp.1-42, 1997. 
[17] C. Ünsal, Intelligent Navigation of Autonomous 
Vehicles in an Automated Highway System: Learning 
Methods and Interacting Vehicles Approach, dissertation 
thesis, Pittsburg University, Virginia, USA, 1997. 
