International Journal of Systems Science, 1996, volume 27, number 12, pages 1467-1472

Learning automata with continuous input and changing number of actions

A. S. POZNYAK† and K. NAJIM‡

† CINVESTAV-IPN, Departamento de Ingenieria Electrica, Seccion de Control Automatico, A.P. 14-740, 07000 Mexico D.F., Mexico. Fax: +52 5 747 7002; e-mail: apoznyak@ctrl.cinvestav.mx.
‡ E.N.S.I.G.C., Chemin de la loge, 31078 Toulouse Cedex, France. Tel: +33 62 25 23 69; Fax: +33 62 25 23 18.

Received 15 April 1996. Accepted 19 June 1996.

The behaviour of a stochastic automaton operating in an S-model environment is described. The environment response takes an arbitrary value in the closed segment [0, 1] (continuous response). The learning automaton uses a reinforcement scheme to update its action probabilities on the basis of the reaction of the environment. The complete set of actions is divided into a collection of non-empty subsets. The action set is changing from instant to instant. Each action set is selected according to a given probability distribution. Convergence and convergence rate results are presented. These results have been derived using quasimartingale theory.

1. Introduction
Learning automata have attracted considerable interest
due to their potential usefulness in a variety of engineering
problems which are characterized by nonlinearity and a
high level of uncertainty (Najim and Oppenheim 1991).
Learning automata were initially used to model
biological systems (Wiener 1948, Walter 1953, Tsetlin
1973). A learning system is connected in a feedback loop
to the environment or random medium (the input to one
is the output of the other) where it operates. The
environment is the system which communicates with the
learning automaton and supplies it with information. The
environment in which the automaton operates offers the
latter a finite set of actions. The automaton is constrained
to choose one of these actions on the basis of a probability
distribution. The outputs of the automaton form the
input to the environment and the reactions or responses
from the environment form the input to the automaton.
In adaptive control (Najim and M'Saad 1991), the behaviour of the system is slightly improved at every sampling period by estimating in real time the parameters (model or control law parameters) to attain the desired control objective. In learning automata (Narendra and Thathachar 1989), the probability distribution $p_n$ is recursively updated to optimize some performance index.


An adaptive control system is a system that contains only
a simple feedback loop and an adaptive loop. A learning
system contains all three loops: a simple feedback loop,
an adaptive loop and a learning loop (Sklansky 1966).
The heart of a learning automaton is the reinforcement
scheme, which is the mechanism used to adapt the
probability distribution. Based on the environment
response and the action selected by the automaton at
time n, it generates $p_{n+1}$ from $p_n$. It has been shown
(Poznyak 1975) that several reinforcement schemes can
be associated with the minimization of some functional.
Learning automata should, by collecting and processing current information regarding the environment,
be capable of changing their structures and parameters
as time evolves, to achieve the desired goal or the optimal
performance (in some sense).
A learning system is a sequential machine characterized
by a set of actions, a probability distribution and a
reinforcement scheme. An extensive literature has been
dedicated to the behaviour of learning automata with
fixed action sets (Najim and Oppenheim 1991). The
behaviour of automata where the number of actions
available at each instant is time-varying has
been studied by Ramakrishnan and Thathachar (1982)
and Thathachar and Harita (1987). The latter stated
convergence results for binary environment responses
(P-model environment). An important aspect of convergence that has not been considered is the rate of
convergence, which concerns the speed of operation of
the automaton (Thathachar and Harita 1987). Learning
automata with changing numbers of actions are relevant


in the modelling of several problems (CPU job scheduling, optimal path in stochastic networks, etc.).
This paper deals with the study of a learning automaton
with continuous input (S-model environment) where the
number of automaton actions is changed in real time.
Learning automata with continuous inputs have been
used for optimization purposes (stochastic optimization
on finite sets) (Najim and Poznyak 1994). Theoretical
results concerning the convergence and the convergence
rate of this learning automaton are presented.
This paper is organized as follows. In the next section the learning stochastic automaton is described. The analysis and the convergence properties of this learning system are given in Section 3. Some conclusions end this paper.

2. Learning stochastic automaton

An automaton is a sequential machine (Narendra and Thathachar 1989) described by the set

$$\{\Xi,\ V,\ (\Omega, \mathcal{F}, P),\ \{\xi_n\},\ \{u_n\},\ \{p_n\},\ T\},$$

where $\Xi$ is the automaton input set; $V$ denotes the set $\{u(1), u(2), u(3), \ldots, u(N)\}$ of all actions of the automaton, $2 \leq N < \infty$. This action set is divided into $(2^N - 1)$ subsets. $V(j)$ represents the $j$th action subset. The index $j$ is assigned by ordering these subsets in a lexicographical manner, beginning with the single-action subsets, then the double-action subsets, etc., and ending with the complete set of actions $V$. For example, if $N = 3$, the subsets are ordered as $V(1) = \{u(1)\}$, $V(2) = \{u(2)\}$, $V(3) = \{u(3)\}$, $V(4) = \{u(1), u(2)\}$, $V(5) = \{u(1), u(3)\}$, $V(6) = \{u(2), u(3)\}$, $V(7) = V$. $(\Omega, \mathcal{F}, P)$ is a probability space with an increasing family of $\sigma$-fields $\{\mathcal{F}_n\}$. $\xi_n \in [0, 1]$ is adapted to the sequence of increasing $\sigma$-algebras $\{\mathcal{F}_n\}$ and represents the sequence of automaton inputs (environment responses, $\xi_n \in \Xi$) provided by the environment in a continuous form (some transformation of the learning goal) or in a discrete form (binary: $\xi_n = 0$, called a reward or non-penalty, and $\xi_n = 1$, called a penalty). $\{u_n\}$ is a sequence of automaton outputs (actions), $u_n \in V$. $p_n = [p_n(1), p_n(2), p_n(3), \ldots, p_n(N)]^T$ is a sequence of conditional finite distributions, where

$$p_n(i) = \operatorname{prob}[u_n = u(i)\,|\,\mathcal{F}_n] \quad\text{and}\quad \sum_{i=1}^{N} p_n(i) = 1, \qquad \forall n.$$

$T$ represents the reinforcement scheme (updating scheme) which changes the probability vector $p_n$ to $p_{n+1}$:

$$p_{n+1} = T(p_n, \xi_n, u_n, V_n).$$

This is the heart of the learning system.

[Figure. Feedback connection of automaton and environment.]

A learning system is a stochastic automaton connected in a feedback loop with a random environment, as shown in the Figure. The probability distribution $q_n$ defined over all the possible action subsets is

$$q_n = \{q_n(1), q_n(2), \ldots, q_n(W)\}, \qquad W = 2^N - 1,$$

$$q_n(j) = \operatorname{prob}[V_n = V(j)\,|\,\mathcal{F}_{n-1}], \qquad j = \overline{1, W},$$

where $V_n$ is the subset selected at time $n$ and $\mathcal{F}_{n-1}$ is the $\sigma$-algebra generated by the corresponding events. This probability distribution $q_n$ is a priori known for every $n$, $n = 1, 2, \ldots$

The learning automaton operates as follows. Let $V_n = V(j)$ and $u_n = u(s)$ be, respectively, the action subset and the action selected at time $n$. The probability distribution $p_n$ is scaled and adjusted according to the following stages (Thathachar and Harita 1987, Najim and Poznyak 1994, Bush and Mosteller 1958):

$$p^{*}_{n}(i/j) \equiv \frac{p_n(i)}{K_n(j)}, \qquad K_n(j) \equiv \sum_{i:\,u(i)\in V(j)} p_n(i) > 0, \qquad\qquad (1)$$

$$p^{*}_{n+1}(i/j) = p^{*}_{n}(i/j) + \gamma_n\left[\delta_{is} - p^{*}_{n}(i/j) + \frac{\xi_n}{N(j)-1}\bigl(1 - N(j)\delta_{is}\bigr)\right], \qquad u(i) \in V(j), \qquad\qquad (2)$$

where $p_n(i)$, $i = 1, \ldots, N$, is the action probability vector defined over the set $V$ of all actions; $p^{*}_{n}(i/j)$ is the scaled probability of the action $u(i)$, $u(i) \in V(j)$; $K_n(j)$ is the sum of the probabilities of the actions of the subset $V(j)$ at the instant $n$; $N(j)$ denotes the number of actions of the subset $V(j)$; $\gamma_n$ is the positive scalar correction factor; $\delta_{is}$ is the Kronecker symbol ($\delta_{is} = 1$ if $i = s$, $\delta_{is} = 0$ if $i \neq s$); $\xi_n$ represents the environment response (automaton input).
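As a quick consistency check of the updating stage (2) as written above: summing the bracketed increment over the actions of the chosen subset gives

$$\sum_{i:\,u(i)\in V(j)}\left[\delta_{is} - p^{*}_{n}(i/j) + \frac{\xi_n}{N(j)-1}\bigl(1 - N(j)\delta_{is}\bigr)\right] = 1 - 1 + \frac{\xi_n}{N(j)-1}\bigl(N(j) - N(j)\bigr) = 0,$$

so the scaled probabilities produced by (2) still sum to one, whatever the value of the response $\xi_n \in [0, 1]$.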

The action probabilities of the actions of the chosen subset $V(j)$ are then rescaled:

$$p_{n+1}(i) = p^{*}_{n+1}(i/j)\,K_n(j), \qquad\qquad (3)$$

for all $i$ such that $u(i) \in V(j)$.
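To make the three stages concrete, the following minimal Python sketch performs one iteration of (1)-(3) in the form given above (the function name reinforcement_step and the array layout are illustrative choices of this presentation, not notation from the paper):

```python
import numpy as np

def reinforcement_step(p, V_j, s, xi, gamma):
    """One iteration of stages (1)-(3).

    p     : length-N array of current action probabilities p_n(i)
    V_j   : list of action indices forming the selected subset V(j)
    s     : index of the action u(s) selected inside V(j)
    xi    : environment response xi_n in [0, 1] (S-model)
    gamma : positive scalar correction factor gamma_n (0 < gamma <= 1)
    """
    p = np.asarray(p, dtype=float).copy()
    Nj = len(V_j)
    if Nj < 2:
        # single-action subset: the scaled probability is identically 1,
        # so there is nothing to update
        return p
    K = p[V_j].sum()                      # (1) K_n(j): probability mass of V(j)
    p_star = p[V_j] / K                   # (1) scaled probabilities p*_n(i/j)
    delta = np.array([1.0 if i == s else 0.0 for i in V_j])   # Kronecker delta_{is}
    # (2) update of the scaled probabilities of the actions in V(j)
    p_star = p_star + gamma * (delta - p_star + xi * (1.0 - Nj * delta) / (Nj - 1))
    p[V_j] = p_star * K                   # (3) rescale; actions outside V(j) are unchanged
    return p
```

A usage example with the step-size schedule of Corollary 1 is given after that corollary below.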

The loss function associated with the learning automaton is given by the following expression:

$$\Phi_n = \frac{1}{n}\sum_{t=1}^{n}\xi_t. \qquad\qquad (4)$$

Some theoretical results concerning the convergence and the convergence rate of this learning automaton (reinforcement scheme (1)-(3)) are presented in the next section.

3. Convergence and convergence rate analysis

In this section the asymptotic properties of the probability distribution are examined (Najim and Poznyak 1994).

Assumption 1: $p_1(i) > 0$, $\forall i = \overline{1, N}$, i.e. any action can be selected as the initial one with non-zero probability.

Assumption 2: There exists an action $u(\alpha)$ such that

$$n^{\mu}\,E\{\xi_n(u(\alpha), \omega)\,|\,\mathcal{F}_{n-1}\} \xrightarrow[n\to\infty]{\text{a.s.}} 0, \qquad \mu > 0, \qquad\qquad (5)$$

and

$$\min_{i \neq \alpha}\,\liminf_{n\to\infty} E\{\xi_n(u(i), \omega)\,|\,\mathcal{F}_{n-1}\} \geq c > 0,$$

i.e. for the action $u_n = u(\alpha)$ (the optimal action), the conditional expectation of the automaton input $\xi_n$ tends to zero with probability 1 and with a rate greater than $1/n^{\mu}$. In the case of non-optimal actions, this average is greater than zero.

Assumption 3: $\gamma_n \in (0, 1]$ and

$$\sum_{n=1}^{\infty}\gamma_n \min_j\left\{\frac{q_n(j)}{N(j)-1}\right\} = \infty.$$

The main result of this study is as follows.

Theorem 1: If Assumptions 1-3 are fulfilled, then the sequence $\{p_n\}$ generated by the reinforcement scheme (1)-(3) converges to the optimal strategy

$$e_{\alpha} = (0, \ldots, 0, 1, 0, \ldots, 0)^T$$

(the unit vector whose $\alpha$th component is equal to 1), i.e.

$$p_n(\alpha) \xrightarrow[n\to\infty]{\text{a.s.}} 1.$$

This theorem shows that this learning automaton generates asymptotically an optimal adaptive strategy.

Remark 1: If $\xi_n \notin [0, 1]$ and does not tend to zero in the average sense (see Assumption 2), the preceding learning algorithm can also be used by introducing the following 'normalization' procedure:

$$\tilde{\xi}_n = \frac{\bigl[s_n(i) - \min_j s_{n-1}(j)\bigr]^{+}}{1 + \bigl[s_n(i) - \min_j s_{n-1}(j)\bigr]^{+}}, \qquad u_n = u(i),$$

where

$$s_n(i) := \frac{\displaystyle\sum_{t=1}^{n}\xi_t\,\chi(u_t = u(i))}{\displaystyle\sum_{t=1}^{n}\chi(u_t = u(i))}, \qquad i = 1, \ldots, N,$$

$$[x]^{+} := \begin{cases} x, & \text{if } x \geq 0,\\ 0, & \text{if } x < 0,\end{cases} \qquad\qquad \chi(u_n = u(i)) := \begin{cases} 1, & \text{if } u_n = u(i),\\ 0, & \text{if } u_n \neq u(i).\end{cases} \qquad\qquad (6)$$

Note that $\tilde{\xi}_n \in [0, 1]$ and

$$\lim_{n\to\infty} E\{\tilde{\xi}_n\,|\,u_n = u(\alpha)\} = 0.$$

This normalization procedure shows that Assumption 2 is not restrictive if $\xi_n$ is replaced by $\tilde{\xi}_n$ in the algorithm (2).
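A short Python sketch of this normalization, maintaining the running averages $s_n(i)$ incrementally (the closed form used for $\tilde{\xi}_n$ is the one written above and should be read as an assumption of this presentation, as is the class name):

```python
import numpy as np

class ResponseNormalizer:
    """Maps raw (possibly unbounded) environment responses into [0, 1).

    Implements Remark 1 in the form assumed above:
        xi_tilde = x / (1 + x),  x = [s_n(i) - min_j s_{n-1}(j)]^+,
    where s_n(i) is the empirical mean response of action u(i).
    """

    def __init__(self, n_actions):
        self.sums = np.zeros(n_actions)    # running sums of raw responses per action
        self.counts = np.zeros(n_actions)  # how many times each action was applied
        self.prev_min = np.inf             # min_j s_{n-1}(j); no history yet

    def normalize(self, i, raw_response):
        self.sums[i] += raw_response
        self.counts[i] += 1
        used = self.counts > 0
        s = np.where(used, self.sums / np.maximum(self.counts, 1), np.inf)
        x = max(s[i] - self.prev_min, 0.0)     # the [.]^+ part; 0 on the very first call
        self.prev_min = float(s[used].min())   # becomes min_j s_n(j) for the next step
        return x / (1.0 + x)                   # normalized response in [0, 1)
```

The learning algorithm (2) would then be driven by the returned value in place of the raw response.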

For the proof of the above theorem we need two lemmas about quasimartingales, which are stated in the Appendix.

Proof: Let $V(j)$ and $u(s)$ be, respectively, the subset and the action selected at time $n$, i.e.

$$u_n = u(s) \in V_n = V(j).$$

From the relation (2), we obtain

$$p^{*}_{n+1}(\alpha/j) = p^{*}_{n}(\alpha/j) + \gamma_n\,\chi(u(\alpha) \in V_n)\left\{\delta_{s\alpha} - p^{*}_{n}(\alpha/j) + \frac{\xi_n}{N(j)-1}\bigl[1 - N(j)\delta_{s\alpha}\bigr]\right\}. \qquad\qquad (7)$$

Taking into account (3), it follows that

$$p_{n+1}(\alpha) = p_n(\alpha) + \gamma_n\,\chi(u(\alpha) \in V_n)\left\{K_n(j)\delta_{s\alpha} - p_n(\alpha) + \xi_n K_n(j)\,\frac{1 - N(j)\delta_{s\alpha}}{N(j)-1}\right\}. \qquad\qquad (8)$$

Hence

$$E\{p_{n+1}(\alpha)\,|\,\mathcal{F}_{n-1} \wedge (u_n = u(s) \in V_n = V(j))\} = p_n(\alpha) + \gamma_n\,\chi(u(\alpha) \in V(j))\left\{K_n(j)\delta_{s\alpha} - p_n(\alpha) + c_n(s)\,K_n(j)\,\frac{1 - N(j)\delta_{s\alpha}}{N(j)-1}\right\}, \qquad\qquad (9)$$

where

$$c_n(s) := E[\xi_n(u(s), \omega)\,|\,\mathcal{F}_{n-1}], \qquad s = \overline{1, N}.$$

Averaging (9) over the actions of the subset $V(j)$ (with the scaled probabilities $p^{*}_{n}(s/j)$) and observing that

$$\sum_{s:\,u(s)\in V(j)} p^{*}_{n}(s/j)\bigl[K_n(j)\delta_{s\alpha} - p_n(\alpha)\bigr] = K_n(j)\,p^{*}_{n}(\alpha/j) - p_n(\alpha) = 0$$

whenever $u(\alpha) \in V(j)$, the preceding equality leads to

$$E\{p_{n+1}(\alpha)\,|\,\mathcal{F}_{n-1} \wedge V_n = V(j)\} = p_n(\alpha) + \gamma_n\,\chi(u(\alpha) \in V_n)\left[\frac{1}{N(j)-1}\sum_{s \neq \alpha} c_n(s)p_n(s) - c_n(\alpha)p_n(\alpha)\right], \qquad\qquad (10)$$

the sum being taken over the actions $u(s) \in V(j)$, $s \neq \alpha$. By averaging over the probability distribution $q_n$ and the index $j$, and taking into account that $p_n$ is $\mathcal{F}_{n-1}$-measurable ($p_n = T(p_{n-1}, \xi_{n-1}, u_{n-1}, V_{n-1})$), we obtain

$$E\{1 - p_{n+1}(\alpha)\,|\,\mathcal{F}_{n-1}\} \stackrel{\text{a.s.}}{=} 1 - p_n(\alpha) - \gamma_n\sum_{j=1}^{W} q_n(j)\,\chi(u(\alpha) \in V(j))\left[\frac{1}{N(j)-1}\sum_{s \neq \alpha} c_n(s)p_n(s) - c_n(\alpha)p_n(\alpha)\right].$$

It turns out that, under Assumption 2, $c_n(\alpha) \stackrel{\text{a.s.}}{=} o(n^{-\mu})$, while $c_n(s) \geq c > 0$ for every $s \neq \alpha$ and $n$ sufficiently large. Taking into account that

$$\sum_{j=1}^{W}\chi(u(\alpha)\in V(j))\bigl[K_n(j) - p_n(\alpha)\bigr] = \sum_{s\neq\alpha} p_n(s)\,r_{N-2}(s) = 2^{N-2}\bigl(1 - p_n(\alpha)\bigr),$$

where

$$r_{N-2}(s) = \sum_{j=1}^{W}\chi(u(\alpha)\in V(j))\,\chi(u(\alpha)\neq u(s)\in V(j)) = \sum_{j=0}^{N-2} C^{j}_{N-2} = 2^{N-2} \qquad (\text{for } W = 2^N - 1,\ \forall s = 1, \ldots, N,\ s \neq \alpha),$$

and

$$C^{j}_{N-2} = \frac{(N-2)!}{j!\,(N-2-j)!} \qquad\qquad (11)$$

represents the combinatorial function, the preceding equality leads to

$$E\{1 - p_{n+1}(\alpha)\,|\,\mathcal{F}_{n-1}\} \stackrel{\text{a.s.}}{\leq} 1 - p_n(\alpha) - \gamma_n c \min_j\left\{\frac{q_n(j)}{N(j)-1}\right\}\sum_{j=1}^{W}\chi(u(\alpha)\in V(j))\bigl[K_n(j) - p_n(\alpha)\bigr] + \gamma_n\,o(n^{-\mu}) \qquad\qquad (12)$$

$$= 1 - p_n(\alpha) - \gamma_n c\,2^{N-2}\min_j\left\{\frac{q_n(j)}{N(j)-1}\right\}\bigl(1 - p_n(\alpha)\bigr) + \gamma_n\,o(n^{-\mu}) \qquad\qquad (13)$$

$$\leq \bigl(1 - p_n(\alpha)\bigr)\left(1 - \gamma_n c\min_j\left\{\frac{q_n(j)}{N(j)-1}\right\}\right) + \gamma_n\,o(n^{-\mu}). \qquad\qquad (14)$$

Then the proof of this theorem follows directly from Assumption 3 and Lemma 1 (see the Appendix) (Najim and Poznyak 1994, Nazin and Poznyak 1986). □
Results concerning the convergence rate are summarized in the following corollary.

Corollary 1: If

$$q_n(j) = q(j), \qquad \gamma_n = \frac{\gamma}{n+a}, \qquad \gamma > 0, \qquad a > 0, \qquad \frac{\gamma c}{n+a} < 1,$$

then the order of the convergence rate of the previous reinforcement scheme, $(1 - p_n(\alpha))$, is equal to $\rho$, i.e.

$$n^{\rho}\bigl(1 - p_n(\alpha)\bigr) \xrightarrow[n\to\infty]{\text{a.s.}} 0, \qquad 0 < \rho \leq \mu < \gamma c;$$

also, the order of the convergence rate, to zero, of the loss function is equal to $\nu$, i.e.

$$n^{\nu}\,\Phi_n \xrightarrow[n\to\infty]{\text{a.s.}} 0, \qquad 0 < \nu \leq \rho;$$

and the best convergence rate $1/n^{\nu}$ ($\nu = \rho = \mu$) can be reached for $\gamma = \gamma^{*}$ such that $\gamma^{*} c > \mu$.

Proof: The proof of this corollary follows from (14) and Lemma 2 (see the Appendix) (Najim and Poznyak 1994, Nazin and Poznyak 1986) for

$$u_n := 1 - p_n(\alpha), \qquad v_n = n^{\rho}, \qquad \alpha_n = \gamma_n c, \qquad \beta_n = \gamma_n\,o(n^{-\mu}), \qquad 0 < \rho \leq \mu < \gamma c < 1. \qquad \Box$$
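As a usage illustration (our own example, reusing the reinforcement_step sketch given after (3)): a small simulation with the step size $\gamma_n = \gamma/(n+a)$ of Corollary 1, a fixed subset distribution $q(j)$, and an S-model environment whose conditional mean response vanishes for one action (so that Assumption 2 is loosely mimicked); the probability of that action should approach 1, in line with Theorem 1. All numerical values are arbitrary example choices.

```python
import numpy as np
from itertools import combinations

rng = np.random.default_rng(0)

N = 4
# the W = 2^N - 1 subsets V(j), ordered by size and then lexicographically
subsets = [list(c) for r in range(1, N + 1) for c in combinations(range(N), r)]
q = np.full(len(subsets), 1.0 / len(subsets))   # fixed distribution q(j) over subsets

gamma0, a = 2.0, 5.0                            # gamma and a of Corollary 1 (example values)
p = np.full(N, 1.0 / N)                         # uniform initial distribution (Assumption 1)

for n in range(1, 20001):
    gamma = gamma0 / (n + a)                    # gamma_n = gamma / (n + a)
    j = rng.choice(len(subsets), p=q)           # draw the subset V_n = V(j)
    V_j = subsets[j]
    p_star = p[V_j] / p[V_j].sum()              # scaled probabilities inside V(j)
    s = V_j[rng.choice(len(V_j), p=p_star)]     # select the action u(s)
    # S-model response in [0, 1]; action 0 plays the role of u(alpha):
    # its conditional mean decays to zero, the others stay bounded away from zero
    means = np.array([0.5 / np.sqrt(n), 0.6, 0.7, 0.8])
    xi = float(np.clip(rng.normal(means[s], 0.05), 0.0, 1.0))
    p = reinforcement_step(p, V_j, s, xi, gamma)

print(p)    # p[0] should dominate the distribution
```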

4. Conclusion

A learning stochastic automaton with continuous input (S-model environment) and a changing number of actions has been presented. The analysis of this learning stochastic automaton has been given and some properties have been stated. These properties concern the convergence and the convergence rate. Further studies will be concerned with the application of this learning system to multimodal function optimization (the realizations of the function to be optimized will be associated with the environment responses) and with the extension of the previous results to hierarchical structures of learning automata.

Appendix

This Appendix deals with the two lemmas about quasimartingales which have been used in Section 3 of this paper. Lemmas 1 and 2 concern the convergence (with probability one) and the estimation of the convergence rate, respectively.

Lemma 1: Let $\{\mathcal{F}_n\}$ be a sequence of $\sigma$-algebras and let $\eta_n$, $\theta_n$, $\lambda_n$ and $\nu_n$ be $\mathcal{F}_n$-measurable non-negative random variables such that

$$E(\eta_{n+1}\,|\,\mathcal{F}_n) \stackrel{\text{a.s.}}{\leq} (1 - \lambda_{n+1} + \nu_{n+1})\,\eta_n + \theta_n \qquad (\{\eta_n\}\ \text{is a quasimartingale}),$$

$$\sum_{n=1}^{\infty} E(\theta_n) < \infty, \qquad \sum_{n=1}^{\infty}\lambda_n \stackrel{\text{a.s.}}{=} \infty, \qquad \sum_{n=1}^{\infty}\nu_n \stackrel{\text{a.s.}}{<} \infty.$$

Then $\eta_n \xrightarrow[n\to\infty]{\text{a.s.}} 0$.

Proof: In view of these assumptions and the Robbins-Siegmund theorem (Robbins and Siegmund 1971), it follows that $\eta_n \xrightarrow{\text{a.s.}} \eta^{*}$ and $\sum_{n=1}^{\infty}\lambda_{n+1}\eta_n \stackrel{\text{a.s.}}{<} \infty$. Hence, since $\sum_{n=1}^{\infty}\lambda_n \stackrel{\text{a.s.}}{=} \infty$, a subsequence $\eta_{n_k}$ which tends to zero with probability 1 exists, and therefore $\eta_n \xrightarrow{\text{a.s.}} 0$. □

Lemma 2: Let $\{u_n\}$ be a sequence of non-negative random variables, $u_n$ measurable with respect to the $\sigma$-algebra $\mathcal{F}_n$ for all $n = 1, 2, \ldots$ If

(a) $E(u_{n+1}\,|\,\mathcal{F}_n)$ exists for all $n = 1, 2, \ldots$;

(b) $\{u_n\}$ is a quasimartingale such that

$$E(u_{n+1}\,|\,\mathcal{F}_n) \leq u_n(1 - \alpha_n) + \beta_n,$$

where $\{\alpha_n\}$ and $\{\beta_n\}$ are sequences of non-random variables such that

$$\alpha_n \in (0, 1], \qquad \beta_n \geq 0, \qquad \sum_{n=1}^{\infty}\alpha_n = \infty, \qquad \sum_{n=1}^{\infty}\beta_n v_n < \infty,$$

for some non-negative deterministic sequence $\{v_n\}$ ($v_n > 0$, $n = 1, 2, \ldots$);

(c) the limit

$$\lim_{n\to\infty}\frac{v_{n+1} - v_n}{\alpha_n v_n} := \Pi < 1$$

exists; then

$$u_n = o_{\omega}\!\left(\frac{1}{v_n}\right)$$

with probability 1 when $v_n \to \infty$, i.e. $u_n v_n \xrightarrow{\text{a.s.}} 0$.

Proof: Let $\bar{u}_n$ be the sequence defined as $\bar{u}_n := u_n v_n$. Then, using condition (b), we obtain

$$E(\bar{u}_{n+1}\,|\,\mathcal{F}_n) \stackrel{\text{a.s.}}{\leq} u_n(1 - \alpha_n)\,v_{n+1} + v_{n+1}\beta_n = \bar{u}_n(1 - \alpha_n)\left(1 + \frac{v_{n+1} - v_n}{v_n}\right) + v_{n+1}\beta_n.$$

By taking into account condition (c), we derive

$$E(\bar{u}_{n+1}\,|\,\mathcal{F}_n) \stackrel{\text{a.s.}}{\leq} \bar{u}_n\bigl[1 - \alpha_n(1 - \Pi + o(1))\bigr] + v_{n+1}\beta_n.$$

From this inequality and the Robbins-Siegmund theorem (Robbins and Siegmund 1971), we obtain $\bar{u}_n \xrightarrow{\text{a.s.}} 0$, which is equivalent to

$$u_n v_n \xrightarrow[n\to\infty]{\text{a.s.}} 0. \qquad \Box$$
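As a worked instance of condition (c): with the sequences used in the proof of Corollary 1, $v_n = n^{\rho}$ and $\alpha_n = \gamma_n c = \gamma c/(n + a)$,

$$\frac{v_{n+1} - v_n}{\alpha_n v_n} = \frac{(n+1)^{\rho} - n^{\rho}}{n^{\rho}}\cdot\frac{n + a}{\gamma c} \xrightarrow[n\to\infty]{} \frac{\rho}{\gamma c} =: \Pi,$$

so condition (c) requires $\rho < \gamma c$, which is exactly the restriction $0 < \rho \leq \mu < \gamma c$ appearing in Corollary 1.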
References

Bush, R. R., and Mosteller, F., 1958, Stochastic Models for Learning (New York: Wiley).
Najim, K., and M'Saad, M., 1991, Adaptive control: theory and practical aspects. Journal of Process Control, 1, 84-95.
Najim, K., and Oppenheim, G., 1991, Learning systems: theory and application. Proceedings of the Institution of Electrical Engineers, Pt E, 138, 183.
Najim, K., and Poznyak, A. S., 1994, Learning Automata: Theory and Applications (Oxford, U.K.: Pergamon Press).
Narendra, K. S., and Thathachar, M. A. L., 1989, Learning Automata: an Introduction (Englewood Cliffs, New Jersey, U.S.A.: Prentice Hall).
Nazin, A. V., and Poznyak, A. S., 1986, Adaptive Choice of Variants (Moscow: Nauka), in Russian.
Poznyak, A. S., 1975, Investigation of the convergence of algorithms for the functioning of learning stochastic automata. Automation and Remote Control, 36, 77-91.
Ramakrishnan, K. R., and Thathachar, M. A. L., 1982, A learning scheduler for job processing. Seminar on Large Scale Systems and Signal Processing, School of Automation, Indian Institute of Science, Bangalore.
Robbins, H., and Siegmund, D., 1971, A convergence theorem for nonnegative almost supermartingales and some applications. Optimizing Methods in Statistics, edited by J. S. Rustagi (New York: Academic Press).
Sklansky, J., 1966, Learning systems for automatic control. IEEE Transactions on Automatic Control, 11, 6-19.
Thathachar, M. A. L., and Harita, R. R., 1987, Learning automata with changing number of actions. IEEE Transactions on Systems, Man and Cybernetics, 17, 1095-1100.
Tsetlin, M. L., 1973, Automaton Theory and Modeling of Biological Systems (New York: Academic Press).
Walter, W. G., 1953, The Living Brain (New York: Norton).
Wiener, N., 1948, Cybernetics (New York: The Technology Press/Wiley).
