Proceedings of the European Computing Conference 
Generic Reinforcement Schemes and Their Optimization 
DANA SIMIAN, FLORIN STOICA 
Department of Informatics 
“Lucian Blaga” University of Sibiu 
Str. Dr. Ion Ratiu 5-7, 550012, Sibiu 
ROMANIA 
dana.simian@ulbsibiu.ro, florin.stoica@ulbsibiu.ro 
Abstract: - The aim of this paper is to introduce a generic absolutely expedient reinforcement scheme depending on two parameters and to present a method for optimizing its learning parameters. Using a Breeder genetic algorithm, we optimize several schemes derived from our generic one in order to reach the best performance. Furthermore, we compare our results in terms of speed and efficiency.
Key-Words: - Reinforcement Learning, Breeder genetic algorithm, Optimization 
1 Introduction 
Reinforcement schemes are the algorithms that implement the learning process of stochastic learning automata. Stochastic learning automata adapt to changes in their environment as a result of a reinforcement learning process. Given a set of possible actions, a stochastic learning automaton must choose the optimal one, based on the environment response and its past actions. Initially, equal probabilities are assigned to all possible actions; one action is selected at random and the action probabilities are updated based on the environment response. A detailed characterization of reinforcement learning can be found in [14]. As underlined in [7], the major advantage of reinforcement learning is that it needs information about the environment only through the reinforcement signal.
Reinforcement learning has several applications in autonomous robotics, the design of multi-agent systems, intelligent vehicle control, etc. ([2], [3], [5], [15], [16]). In [11] we designed a simulator of an intelligent vehicle control system. The system was based on two learning automata.
In other articles ([9], [13]), we defined new reinforcement schemes in order to reach the best performance of our system. Usually, reinforcement schemes depend on many parameters, as we can see in section 3. An important problem is how to choose the optimal values of the scheme's parameters.
The aim of this paper is to introduce a generic reinforcement scheme from which many other reinforcement schemes can be obtained, and to present a method for optimizing these schemes with respect to their learning parameters. We also optimize this new scheme and those that we introduced in [9], [12]. We evaluate and compare our schemes using two criteria: the speed of the optimization process and the efficiency of the optimized scheme.
The remainder of this paper is organized as follows. In section 2 we briefly present the mathematical background of stochastic learning automata with variable structure. Section 3 presents our generic absolutely expedient reinforcement scheme, together with particular schemes derived from it. In section 4 we present our optimization method for the learning parameters of reinforcement schemes and analyze the results obtained. Conclusions and further directions of study are presented in section 5.
2 Mathematical background of stochastic automata
A stochastic automaton supposes the existence of a set of actions, which define the input of the environment, and a response set. The range of the response values depends on the model we choose. There are three different models for representing the response values: the P-model, the S-model and the Q-model. The P-model uses a set of binary values, 0 or 1. In the S-model the response values are continuous in the range (0, 1). In the Q-model the response set is a finite set of discrete values in the range (0, 1). In this paper we use the P-model for our reinforcement schemes.
A stochastic automaton selects one action at random, observes the response from the environment and updates the action probabilities based on that response. An action can be rewarded or punished using a set of penalty probabilities.
The mathematical model of a stochastic automaton is defined by a triple {α, c, β} corresponding to the elements presented before:
a) α = {α1, α2, ..., αr} - the input actions of the environment;
b) β - the response set. In the case of the P-model, β = {β1, β2} is a binary set:
β = 0 is a favourable outcome and β = 1 is an unfavourable outcome. To refer to the time instant n, the notation α(n), β(n) is used.
c) c = {c1, c2, ..., cr} - the set of penalty probabilities. The element ci is the probability that action αi will result in an unfavourable response:
ci = P(β(n) = 1 | α(n) = αi), i = 1, 2, ..., r
The evolution in time of penalty probabilities defines 
two types of environments: stationary (the penalty 
probabilities are constant over time) and nonstationary 
(the penalties change over time). 
In the following we consider only stationary random 
environments. 
The action probability vector at time moment n+1 is updated using a mapping T and the current probabilities pi(n) = P(α(n) = αi), i = 1, ..., r:
p(n+1) = T[p(n), α(n), β(n)]
Reinforcement schemes are named linear if p(n +1) is 
a linear function of p(n) , and nonlinear otherwise. 
The performance of a learning automaton is evaluated using a quantitative norm of behavior ([17]), represented by the average penalty for a given action probability vector, M(n):
$$M(n) = P(\beta(n) = 1 \mid p(n)) = \sum_{i=1}^{r} P(\beta(n) = 1 \mid \alpha(n) = \alpha_i)\, P(\alpha(n) = \alpha_i) = \sum_{i=1}^{r} c_i \, p_i(n)$$
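As a small numerical illustration (the figures below are hypothetical, not taken from the paper): for $r = 2$ actions with penalty probabilities $c = (0.2, 0.8)$ and action probabilities $p(n) = (0.5, 0.5)$, the average penalty is $M(n) = 0.5 \cdot 0.2 + 0.5 \cdot 0.8 = 0.5$; shifting probability towards the better action, e.g. $p(n) = (0.9, 0.1)$, lowers it to $M(n) = 0.9 \cdot 0.2 + 0.1 \cdot 0.8 = 0.26$. Absolute expediency, defined below, asks for exactly this kind of monotone decrease of $M(n)$.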
The only class of reinforcement schemes for which 
necessary and sufficient conditions of design are 
available is represented by absolutely expedient learning 
schemes, defined in [7]. An automaton is absolutely 
expedient if M(n +1) < M(n) for all n ([7]). 
The general solution for absolutely expedient schemes 
was found by Lakshmivarahan and Thathachar in [4]. 
Other studies about expedient learning algorithms can be 
found in [8]. 
In [17], a nonlinear absolutely expedient reinforcement scheme is presented for a stationary N-teacher P-model environment. In the N-teacher model, if the automaton produced the action $\alpha_i$ and the responses from the environments (or "teachers") are denoted by $\beta_i^j$, $j = 1, \ldots, N$, then the updating rules are:
$$p_i(n+1) = p_i(n) + \left[\frac{1}{N}\sum_{k=1}^{N}\beta_i^k\right]\sum_{\substack{j=1 \\ j \neq i}}^{r}\phi_j(p(n)) - \left[1 - \frac{1}{N}\sum_{k=1}^{N}\beta_i^k\right]\sum_{\substack{j=1 \\ j \neq i}}^{r}\psi_j(p(n)) \qquad (1)$$

$$p_j(n+1) = p_j(n) - \left[\frac{1}{N}\sum_{k=1}^{N}\beta_i^k\right]\phi_j(p(n)) + \left[1 - \frac{1}{N}\sum_{k=1}^{N}\beta_i^k\right]\psi_j(p(n)), \quad \forall j \neq i \qquad (2)$$
The functions $\phi_i$ and $\psi_i$ satisfy the following conditions:

$$\frac{\phi_1(p(n))}{p_1(n)} = \cdots = \frac{\phi_r(p(n))}{p_r(n)} = \lambda(p(n)) \le 0 \qquad (3)$$

$$\frac{\psi_1(p(n))}{p_1(n)} = \cdots = \frac{\psi_r(p(n))}{p_r(n)} = \mu(p(n)) \le 0 \qquad (4)$$

$$p_i(n) + \sum_{\substack{j=1 \\ j \neq i}}^{r}\phi_j(p(n)) > 0 \qquad (5)$$

$$p_i(n) - \sum_{\substack{j=1 \\ j \neq i}}^{r}\psi_j(p(n)) < 1 \qquad (6)$$

$$p_j(n) + \psi_j(p(n)) > 0 \qquad (7)$$

$$p_j(n) - \phi_j(p(n)) < 1 \qquad (8)$$

for all $j \in \{1, \ldots, r\} \setminus \{i\}$.
In [1] and [15] it is proved that the automaton with the reinforcement scheme given in (1)-(2) is absolutely expedient in a stationary environment if the functions $\lambda(p(n))$ and $\mu(p(n))$ satisfy the following conditions:

$$\lambda(p(n)) \le 0, \quad \mu(p(n)) \le 0, \quad \lambda(p(n)) + \mu(p(n)) < 0 \qquad (9)$$
3 Generic absolutely expedient 
reinforcement scheme 
In the following we present a generic two-parameter-dependent reinforcement scheme and prove that this scheme is absolutely expedient in a stationary environment. We start from the scheme given in (1)-(2). This scheme is also valid for a single-teacher model; in this case we define a single environment response, denoted by f.
Thus, the updating rules become: 
$$p_i(n+1) = p_i(n) - \gamma_1\, f\, H(n)\,[1 - p_i(n)] + \gamma_2\,(1 - f)\,[1 - p_i(n)]$$

$$p_j(n+1) = p_j(n) + \gamma_1\, f\, H(n)\, p_j(n) - \gamma_2\,(1 - f)\, p_j(n), \quad \forall j \neq i \qquad (10)$$

i.e.:

$$\phi_k(p(n)) = -\gamma_1\, H(n)\, p_k(n), \qquad \psi_k(p(n)) = -\gamma_2\, p_k(n)$$
where the learning parameters $\gamma_1$ and $\gamma_2$ are real values:

$$\gamma_1, \gamma_2 \in (0, 1) \qquad (11)$$
The function H is defined as:

$$H(n) = \min\left\{1;\; \max\left\{\min_{\substack{j = 1, \ldots, r \\ j \neq i}}\left\{\frac{p_i(n)}{\gamma_1\,(1 - p_i(n))} - \varepsilon;\; \frac{1 - p_j(n)}{\gamma_1\, p_j(n)} - \varepsilon\right\};\; 0\right\}\right\}$$
Parameter ε is an arbitrarily small positive real number. 
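For illustration only, the following is a minimal Python sketch of one update step of the generic scheme, i.e. equations (10) together with the function H(n) above, written in the form obtained by substituting φ and ψ into (1)-(2). The function names, the list representation of p(n) and the concrete value of ε are our own choices, and f follows the paper's P-model convention (f = 0 favourable, f = 1 unfavourable).

```python
def H(p, i, gamma1, eps=1e-6):
    """Bounding function H(n): keeps the penalty step inside (0, 1).
    H(n) = min{1, max{min_{j != i}{ p_i/(gamma1*(1-p_i)) - eps,
                                    (1-p_j)/(gamma1*p_j) - eps }, 0}}."""
    terms = [p[i] / (gamma1 * (1.0 - p[i])) - eps]
    terms += [(1.0 - p[j]) / (gamma1 * p[j]) - eps
              for j in range(len(p)) if j != i]
    return min(1.0, max(min(terms), 0.0))

def update(p, i, f, gamma1, gamma2):
    """One step of scheme (10): action i was selected, the environment
    answered f (0 = favourable, 1 = unfavourable)."""
    h = H(p, i, gamma1)
    q = list(p)
    q[i] = p[i] - gamma1 * f * h * (1.0 - p[i]) + gamma2 * (1 - f) * (1.0 - p[i])
    for j in range(len(p)):
        if j != i:
            q[j] = p[j] + gamma1 * f * h * p[j] - gamma2 * (1 - f) * p[j]
    return q
```

Note that the updated probabilities still sum to one, and the min/max structure of H(n) is exactly what keeps every component inside (0, 1), as required by conditions (5')-(8') derived below.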
Our reinforcement scheme differs from the schemes given in [15]-[17] by the definition of H and $\phi_k$. We will show that all the conditions of the reinforcement scheme (1)-(2) are satisfied.
From (3), (4) we have: 
$$\frac{\phi_k(p(n))}{p_k(n)} = \frac{-\gamma_1\, H(n)\, p_k(n)}{p_k(n)} = -\gamma_1\, H(n) = \lambda(p(n)) \qquad (3')$$

$$\frac{\psi_k(p(n))}{p_k(n)} = \frac{-\gamma_2\, p_k(n)}{p_k(n)} = -\gamma_2 = \mu(p(n)) \qquad (4')$$
The conditions (5)-(8) become:

$$p_i(n) - \gamma_1\, H(n)\,(1 - p_i(n)) > 0 \;\Longleftrightarrow\; H(n) < \frac{p_i(n)}{\gamma_1\,(1 - p_i(n))} \qquad (5')$$

Condition (5') is satisfied by the definition of the function H(n).

$$p_i(n) + \gamma_2\,(1 - p_i(n)) < 1 \qquad (6')$$

But $p_i(n) + \gamma_2\,(1 - p_i(n)) < p_i(n) + 1 - p_i(n) = 1$, since $0 < \gamma_2 < 1$.

$$p_j(n) - \gamma_2\, p_j(n) > 0, \quad \forall j \in \{1, \ldots, r\} \setminus \{i\} \qquad (7')$$

But $p_j(n) - \gamma_2\, p_j(n) = p_j(n)\,(1 - \gamma_2) > 0$, since $0 < \gamma_2 < 1$ and $0 < p_j(n) < 1$ for all $j \in \{1, \ldots, r\} \setminus \{i\}$.

$$p_j(n) + \gamma_1\, H(n)\, p_j(n) < 1 \;\Longleftrightarrow\; H(n) < \frac{1 - p_j(n)}{\gamma_1\, p_j(n)}, \quad \forall j \in \{1, \ldots, r\} \setminus \{i\} \qquad (8')$$

This condition is satisfied by the definition of the function H(n).
Therefore our reinforcement scheme is a candidate for 
absolute expediency. 
Furthermore, the functions λ and μ of our nonlinear scheme satisfy:

$$\lambda(p(n)) = -\gamma_1\, H(n) \le 0, \quad \mu(p(n)) = -\gamma_2 \le 0, \quad \lambda(p(n)) + \mu(p(n)) = -\gamma_1\, H(n) - \gamma_2 < 0$$
In conclusion, the algorithm given in equations (10) is absolutely expedient in a stationary environment. This algorithm defines a two-parameter-dependent generic absolutely expedient reinforcement scheme. We denote this scheme by $R^{\gamma_1}_{\gamma_2}$. Choosing different expressions for the parameters, such that (11) holds, we obtain several absolutely expedient reinforcement schemes.
In [9] we introduced and studied the scheme $R^{\theta(1-\delta)}_{(1-\theta)\delta}$, with $0 < \theta < 1$ and $0 < \delta < 1$. Obviously $0 < \theta(1-\delta) < 1$ and $0 < (1-\theta)\delta < 1$, therefore this is an absolutely expedient reinforcement scheme.
In [12] we introduced the scheme $R^{\theta}_{\theta\delta}$, with $0 < \theta < 1$ and $0 < \theta\delta < 1$.
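Expressed in code (a trivial illustration with our own naming, assuming as above that the superscript of R is $\gamma_1$ and the subscript is $\gamma_2$), the concrete schemes differ only in how a pair $(\delta, \theta)$ is mapped to the parameters of the generic scheme:

```python
# Mapping of a parameter pair (delta, theta) to (gamma1, gamma2);
# each component stays in (0, 1) whenever 0 < delta, theta < 1, so (11) holds.
SCHEMES = {
    "R^theta_delta":                         lambda d, t: (t, d),
    "R^theta_{theta*delta}":                 lambda d, t: (t, t * d),
    "R^{theta*(1-delta)}_{(1-theta)*delta}": lambda d, t: (t * (1 - d), (1 - t) * d),
}
```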
4 Optimization of two-parameter reinforcement schemes
A very important problem is to find the optimal values of the learning parameters in the scheme $R^{\gamma_1}_{\gamma_2}$ in order to reach the best performance. In [13] we first introduced the idea of optimizing the learning parameters of a reinforcement scheme using genetic algorithms. We develop this idea here and use a Breeder genetic algorithm to provide the optimal learning parameters for the generic scheme $R^{\gamma_1}_{\gamma_2}$. We also apply the method to the particular schemes presented in section 3. Furthermore, we compare our results in terms of speed and efficiency. For simplicity of notation, we use in our comparisons the scheme $R^{\theta}_{\delta}$, with $\delta, \theta \in (0, 1)$, instead of $R^{\gamma_1}_{\gamma_2}$.
The aim is to find optimal values for the learning parameters δ and θ in the schemes $R^{\theta}_{\delta}$, $R^{\theta(1-\delta)}_{(1-\theta)\delta}$ and $R^{\theta}_{\theta\delta}$.
Because the parameters are real values, we use the Breeder genetic algorithm, proposed by Mühlenbein and Schlierkamp-Voosen in [6], which represents solutions (chromosomes) as vectors of real numbers. This algorithm is closer to reality than standard genetic algorithms, which use a discrete representation of solutions. The skeleton of the Breeder genetic algorithm can be found in [13]. Selection is performed randomly from the T% best elements of the current population, where T is a constant of the algorithm (usually, T = 40 provides the best results). Thus, within each generation, two elements selected from the T% best chromosomes are subject to the crossover operation. The mutation operator is then applied to the new child obtained from mating the parents.
The process is repeated until N-1 new individuals are obtained, where N represents the size of the initial population. The best chromosome (evaluated through the fitness function) is inserted into the new population (1-elitism). Thus, the new population also has N elements.
Let $x = \{x_i\}_{i=1,\ldots,n}$ and $y = \{y_i\}_{i=1,\ldots,n}$ be two chromosomes. The Breeder crossover operator gives a new chromosome z whose genes are $z_i = x_i + \alpha_i\,(y_i - x_i)$, $i = 1, \ldots, n$, with $\alpha_i$ a random variable uniformly distributed on $[-\varepsilon, 1+\varepsilon]$; $\varepsilon$ depends on the problem to be solved and typically lies in the interval $[0, 0.5]$.
The mutation operator gives $x_i = x_i + s_i \cdot r_i \cdot a_i$, $i = 1, \ldots, n$, with $s_i \in \{-1, 1\}$ uniform at random, $r_i = r \cdot \mathrm{domain}_{x_i}$, $r \in [0.1, 0.5]$ (typically 0.1), $a_i = 2^{-k\alpha}$ with $\alpha \in [0, 1]$ uniform at random, and k the number of bytes used to represent a number on the machine on which the Breeder algorithm is executed (mutation precision).
The probability of mutation is typically chosen as 1/n.
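The sketch below illustrates one possible Python implementation of the operators and of one Breeder generation as described above; the function names, the domain clamping and the exact way the per-gene mutation probability 1/n is applied are our own assumptions, not prescriptions from [6] or [13].

```python
import random

def breeder_crossover(x, y, eps=0.0):
    """z_i = x_i + alpha_i * (y_i - x_i), alpha_i uniform in [-eps, 1 + eps]."""
    return [xi + random.uniform(-eps, 1.0 + eps) * (yi - xi)
            for xi, yi in zip(x, y)]

def breeder_mutation(x, domains, p_mut, r=0.1, k=8):
    """Mutate each gene with probability p_mut:
    x_i <- x_i + s_i * r_i * a_i, with s_i in {-1, 1}, r_i = r * range_i,
    a_i = 2^(-k * alpha), alpha uniform in [0, 1] (k = mutation precision)."""
    out = []
    for xi, (lo, hi) in zip(x, domains):
        if random.random() < p_mut:
            s = random.choice((-1.0, 1.0))
            a = 2.0 ** (-k * random.random())
            xi = min(hi, max(lo, xi + s * r * (hi - lo) * a))  # clamp (our safeguard)
        out.append(xi)
    return out

def breeder_generation(pop, fitness, domains, T=40):
    """One generation: truncation selection from the best T%, Breeder
    crossover, per-gene mutation with probability 1/n, and 1-elitism."""
    n = len(pop[0])
    ranked = sorted(pop, key=fitness)            # smaller fitness = better here
    pool = ranked[:max(2, len(pop) * T // 100)]
    new_pop = [ranked[0]]                        # the best chromosome survives
    while len(new_pop) < len(pop):
        a, b = random.sample(pool, 2)
        child = breeder_crossover(a, b)
        new_pop.append(breeder_mutation(child, domains, 1.0 / n))
    return new_pop
```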
In order to find the best values of the learning parameters δ and θ of our reinforcement schemes and to compare the results, we consider the same example we used in [9], [13]. We used our reinforcement schemes for robot navigation in the grid world presented in Fig. 1. The current position of the robot is marked by a circle. Navigation is done using four actions α = {N, S, E, W}, corresponding to the four possible movements along the coordinate directions.
Fig. 1. Grid world for robot navigation 
We have a single optimal action (movement to S). In the 
learning process, only this action receives reward. 
Initially, we choose for the optimal action a small 
probability value (0.0005). We stop the execution when 
the probability of the optimal action, popt, reaches a 
certain value (popt=0.9999). 
We evaluate the performance of our schemes using the "number of steps" of the learning algorithm until the stop condition is reached.
Using the Breeder genetic algorithm, we can provide the 
optimal learning parameters for our schemes, in order to 
reach the best performance. 
Each chromosome contains two genes, representing the real values δ and θ. The fitness function used to evaluate a chromosome is the number of steps needed by the learning process to reach the value 0.9999 for the probability of the optimal action.
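To make the fitness function concrete, a possible evaluation routine is sketched below; it reuses the update() function sketched in section 3, maps the chromosome $(\delta, \theta)$ to $(\gamma_1, \gamma_2) = (\theta, \delta)$ for the scheme $R^{\theta}_{\delta}$ (our reading of the notation), rewards only the optimal action as in the experiment description, and averages the step count over several runs; the number of runs is our own assumption.

```python
import random

def steps_to_converge(chromosome, n_runs=20, p_target=0.9999):
    """Fitness of a chromosome (delta, theta): average number of learning
    steps until the probability of the optimal action reaches p_target."""
    delta, theta = chromosome
    gamma1, gamma2 = theta, delta          # scheme R^theta_delta (our mapping)
    opt = 1                                # index of the optimal action (S)
    total = 0
    for _ in range(n_runs):
        p = [0.9995 / 3.0] * 4             # four actions N, S, E, W
        p[opt] = 0.0005                    # initial probability of the optimal action
        steps = 0
        while p[opt] < p_target:
            i = random.choices(range(4), weights=p)[0]   # select an action
            f = 0 if i == opt else 1       # only the optimal action is rewarded
            p = update(p, i, f, gamma1, gamma2)
            steps += 1
        total += steps
    return total / n_runs
```

Since fewer steps means faster learning, this quantity is minimized by the Breeder genetic algorithm.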
The parameters of the Breeder algorithm are assigned the following values: ε = 0, r = 0.1, k = 8. The initial population has 400 chromosomes and the algorithm is stopped after 1000 generations.
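Putting the pieces together, a driver loop with the settings above might look as follows (again only a sketch under our assumptions, reusing the breeder_generation() and steps_to_converge() functions sketched earlier):

```python
import random

DOMAINS = [(0.01, 0.99), (0.01, 0.99)]     # delta and theta kept strictly inside (0, 1), our safeguard

# initial population of 400 random chromosomes (delta, theta)
population = [[random.uniform(0.01, 0.99), random.uniform(0.01, 0.99)]
              for _ in range(400)]

for generation in range(1000):
    population = breeder_generation(population, steps_to_converge, DOMAINS, T=40)

best = min(population, key=steps_to_converge)
print("optimal (delta, theta):", best)
```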
The results provided by the Breeder genetic algorithm 
are presented in Table 1. 
Optimal values for the learning parameters provided by the Breeder algorithm
(4 actions, with p_opt(0) = 0.0005 and p_i(0) = 0.9995/3 for i ≠ opt)

                                      Scheme 4.1    Scheme 4.2     Scheme 4.3
                                      R^θ_δ         R^θ_{θδ}       R^{θ(1-δ)}_{(1-θ)δ}
  δ                                   0.5866        0.7036         0.5741
  θ                                   0.9469        0.8983         0.3640
  Average number of steps
  to reach p_opt = 0.9999             16.95         16.98          43.70

Table 1. Optimal values for the learning parameters provided by the Breeder genetic algorithm
Fig. 2. Schema optimization vs. time passed 
Figure 2 presents the optimization process for the reinforcement schemes analyzed in Table 1, using two dimensions of the data: the elapsed time vs. the performance of the optimized scheme (the number of steps necessary to reach the stop condition of the learning process).
Figure 3 presents the optimization process using as dimensions of the data the number of generations of the Breeder algorithm vs. the performance of the optimized scheme.
Fig. 3. Schema optimization vs. number of generations 
in Breeder algorithm 
From the results in Table 1, we can conclude that the Breeder genetic algorithm is capable of providing the best values for the learning parameters, and thus our schemes were optimized for best performance. The results obtained by our optimized nonlinear schemes are significantly better than those obtained in [10], [12], [17].
5 Conclusions 
Using a Breeder genetic algorithm, we automatically found the optimal values of the learning parameters of several reinforcement schemes, in order to reach the best performance, measured as the number of iterations of the learning process ("number of steps").
From the graphical results of the optimization process shown in Fig. 2 and Fig. 3, we can conclude that scheme 4.3, $R^{\theta(1-\delta)}_{(1-\theta)\delta}$, is more adequate for applications with little time allocated to scheme optimization, and that scheme 4.2, $R^{\theta}_{\theta\delta}$, is very efficient if we allocate enough time to the optimization. However, the new generic scheme $R^{\theta}_{\delta}$, introduced in section 3, outperforms the other schemes in terms of speed and qualitative results in the learning process.
There are many possibilities for choosing the form of the parameters in the generic scheme $R^{\gamma_1}_{\gamma_2}$ such that condition (11) is satisfied. The Breeder genetic algorithm presented in section 4 can be used to optimize the parameter values regardless of the choice of $\gamma_1, \gamma_2$. The graphical results obtained suggest that taking $\gamma_1$ and $\gamma_2$ as two independent parameters θ and δ, with $0 < \delta, \theta < 1$ (the scheme $R^{\theta}_{\delta}$), gives better results than other, more complicated choices. As a further direction of study we want to rigorously prove or invalidate this conjecture.
References: 
[1] N. Baba, New Topics in Learning Automata: Theory 
and Applications, Lecture Notes in Control and 
Information Sciences, Berlin, Germany: Springer- 
Verlag, pp.750-758, 1984. 
[2] O. Buffet, A. Dutech, and F. Charpillet, Incremental 
reinforcement learning for designing multi-agent 
systems, In J. P. Müller, E. Andre, S. Sen, and C. 
Frasson, editors, Proceedings of the Fifth International 
Conference on Autonomous Agents, Montreal, Canada, 
ACM Press, pp. 31–38, 2001. 
[3] M. Dorigo, Introduction to the Special Issue on 
Learning Autonomous Robots, IEEE Trans. on Systems, 
Man and Cybernetics - part B, Vol. 26, No. 3, pp. 361- 
364, 1996. 
[4] S. Lakshmivarahan, M.A.L. Thathachar, Absolutely 
Expedient Learning Algorithms for Stochastic 
Automata, IEEE Transactions on Systems, Man and 
Cybernetics, vol. SMC-6, pp. 281-286, 1973. 
[5] J. Moody, Y. Liu, M. Saffell, and K. Youn. 
Stochastic direct reinforcement: Application to simple 
games with recurrence, In Proceedings of Artificial 
Multiagent Learning. Papers from the 2004 AAAI Fall 
Symposium, Technical Report FS-04-02. 
[6] H. Mühlenbein, D. Schlierkamp-Voosen, The science 
of breeding and its application to the breeder genetic 
algorithm, Evolutionary Computation, vol. 1, pp. 335- 
360, 1994. 
[7] K. S. Narendra, M. A. L. Thathachar, Learning 
Automata: an introduction, Prentice-Hall, 1989. 
[8] C. Rivero, Characterization of the absolutely 
expedient learning algorithms for stochastic automata in 
a non-discrete space of actions, ESANN'2003 
proceedings - European Symposium on Artificial Neural 
Networks Bruges (Belgium), pp. 307-312, 2003. 
[9] D. Simian, F. Stoica, A New Nonlinear 
Reinforcement Scheme for Stochastic Learning 
Automata, Proceedings of the 12th WSEAS International 
Conference on Automatic Control, Modelling & 
Simulation, Catania, Italy, pp. 450-454, 2010. 
[10] F. Stoica, E. M. Popa, An Absolutely Expedient 
Learning Algorithm for Stochastic Automata, WSEAS 
Transactions on Computers, Issue 2, Volume 6, pp. 229- 
235, 2007. 
[11] F. Stoica, D. Simian, Automatic control based on 
Wasp Behavioral Model and Stochastic Learning 
Automata. Mathematics and Computers in Science and 
Engineering Series, Proceedings of 10th WSEAS 
Conference on Mathematical Methods, Computational 
ISBN: 978-960-474-297-4 336
Proceedings of the European Computing Conference 
Techniques and Intelligent Systems (MAMECTIS '08), 
Corfu 2008, WSEAS Press, pp. 289-295, 2008. 
[12] F. Stoica, E. M. Popa, I. Pah, A new reinforcement 
scheme for stochastic learning automata – Application to 
Automatic Control, Proceedings of the International 
Conference on e-Business, Porto, Portugal, pp. 45-50, 
2008. 
[13] F. Stoica, D. Simian, Optimizing a New Nonlinear 
Reinforcement Scheme with Breeder genetic algorithm, 
Proceedings of the 11th WSEAS International 
Conference on Evolutionary Computing (EC'10), Iaşi, 
Romania, pp. 273-278, 2010. 
[14] R. Sutton, A. Barto, Reinforcement learning: An 
introduction, MIT-press, Cambridge, MA, 1998. 
[15] C. Ünsal, P. Kachroo, J. S. Bay, Simulation Study 
of Learning Automata Games in Automated Highway 
Systems, 1st IEEE Conference on Intelligent 
Transportation Systems (ITSC’97), Boston, 
Massachusetts, 1997 
[16] C. Ünsal, P. Kachroo, J. S. Bay, Simulation Study 
of Multiple Intelligent Vehicle Control using Stochastic 
Learning Automata, IEEE Transactions on Systems, 
Man and Cybernetics – Part A, Systems and Human, 
pp.1-42, 1997. 
[17] C. Ünsal, Intelligent Navigation of Autonomous 
Vehicles in an Automated Highway System: Learning 
Methods and Interacting Vehicles Approach, dissertation 
thesis, Pittsburg University, Virginia, USA, 1997. 
