\[
\dim_B^+ K = \limsup_{\epsilon \to 0} \frac{\log N_\epsilon(K)}{-\log \epsilon}
\qquad \text{and} \qquad
\dim_B^- K = \liminf_{\epsilon \to 0} \frac{\log N_\epsilon(K)}{-\log \epsilon}, \tag{1}
\]
respectively.
Definition 2.2 Let $s > 0$. For $\delta > 0$, define
\[
H^s_\delta(K) = \inf \Big\{ \sum_{B \in \mathcal{C}} (\operatorname{diam} B)^s \;\Big|\; \mathcal{C} \in \mathcal{C}_\delta(K) \Big\}, \tag{2}
\]
where the infimum is taken over the set $\mathcal{C}_\delta(K)$ of countable covers of $K$ by sets of diameter at most $\delta$.
The Hausdorff dimension of the set $K$ is
\[
\dim_H K = \sup\{s \mid H^s(K) = \infty\} = \inf\{s \mid H^s(K) = 0\}, \tag{3}
\]
where $H^s(K) = \lim_{\delta \to 0} H^s_\delta(K)$.
The Hausdorff dimension is more subtle than the box-counting dimensions in the sense that the former can capture details not detectable by the latter. In a way, box-counting dimensions can be thought of as indicating the efficiency with which a set may be covered by small sets of equal size. In contrast, Hausdorff dimension involves coverings by sets of small but perhaps widely varying size (Falconer, 1990). It is well known that
\[
\dim_H K \;\le\; \dim_B^- K \;\le\; \dim_B^+ K. \tag{4}
\]
3 Finite automata and automata-directed IFS

We now introduce some terminology related to iterated function systems (IFS), with an emphasis on IFS driven by sequences obtained by traversing finite automata. First, we briefly recall basic notions related to sequences over finite alphabets.

3.1 Symbolic sequences over a finite alphabet
Consider a finite alphabet $\mathcal{A} = \{1, 2, ..., A\}$. The sets of all finite (non-empty) and infinite sequences over $\mathcal{A}$ are denoted by $\mathcal{A}^+$ and $\mathcal{A}^\omega$, respectively. The set of finite sequences, including the empty string $\varepsilon$, is $\mathcal{A}^* = \mathcal{A}^+ \cup \{\varepsilon\}$. The set of all sequences consisting of a finite or an infinite number of symbols from $\mathcal{A}$ is then $\mathcal{A}^\infty = \mathcal{A}^+ \cup \mathcal{A}^\omega$. The set of all sequences over $\mathcal{A}$ with exactly $n$ symbols ($n$-blocks) is denoted by $\mathcal{A}^n$. The number of symbols in a sequence $w$ is denoted by $|w|$.
Let $w = s_1 s_2 ... \in \mathcal{A}^\infty$ and $i \le j$. By $w_i^j$ we denote the string $s_i s_{i+1} ... s_j$, with $w_i^i = s_i$. For $i > j$, $w_i^j = \varepsilon$. If $|w| = n \ge 1$, then $w^- = w_1^{n-1}$ is the string obtained by omitting the last symbol of $w$.
A partial order $\le$ may be defined on $\mathcal{A}^\infty$ as follows: write $w_1 \le w_2$ if and only if $w_1$ is a prefix of $w_2$, i.e. there is some $w_3 \in \mathcal{A}^\infty$ such that $w_2 = w_1 w_3$. Two strings $w_1, w_2$ are incomparable if neither $w_1 \le w_2$, nor $w_2 \le w_1$. For each finite string $c \in \mathcal{A}^+$, the cylinder $[c]$ is the set of infinite sequences with prefix $c$, $[c] = \{w \in \mathcal{A}^\omega \mid c \le w\}$.
3.2 Iterated Function Systems

Let $X$ be a complete metric space. A (contractive) iterated function system (IFS) (Barnsley, 1988) consists of $A$ contractions $f_a : X \to X$, $a \in \{1, 2, ..., A\} = \mathcal{A}$, operating on $X$. In the basic setting, at each time step all the maps $f_a$ are used to transform $X$ in an iterative manner:
\[
X_0 = X, \tag{5}
\]
\[
X_{t+1} = \bigcup_{a \in \mathcal{A}} f_a(X_t), \quad t \ge 0. \tag{6}
\]
There is a unique compact invariant set $K \subseteq X$, $K = \bigcup_{a \in \mathcal{A}} f_a(K)$, called the attractor of the IFS. Actually, it can be shown that the sequence $\{X_t\}_{t=0}^{\infty}$ converges² to $K$. For any $\omega$-word $w = s_1 s_2 ... \in \mathcal{A}^\omega$, the compositions of IFS maps
\[
(f_{s_1}(f_{s_2}(...(f_{s_n}(X))...))) = f_{w_1^n}(X)
\]
converge to an element of $K$ as $n \to \infty$.

² Under a Hausdorff distance on the power set of $X$.
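The convergence of longer and longer map compositions to a point of the attractor can be seen in a few lines of code. The example below is our own illustration (not from the paper): the middle-third Cantor IFS on $X = [0, 1]$ with two contractions $f_1(x) = x/3$ and $f_2(x) = x/3 + 2/3$.

```python
# Middle-third Cantor IFS on X = [0, 1] (illustrative toy example).
f = {1: lambda x: x / 3.0,
     2: lambda x: x / 3.0 + 2.0 / 3.0}

def compose_along(word, x):
    """Apply f_{s_1}(f_{s_2}(...(f_{s_n}(x))...)) for word = s_1 s_2 ... s_n."""
    for s in reversed(word):
        x = f[s](x)
    return x

# The image of the whole space under ever longer prefixes of an omega-word
# shrinks geometrically: here [lo, hi] has diameter 3**(-len(word)).
word = [1, 2, 2, 1] * 5
lo, hi = compose_along(word, 0.0), compose_along(word, 1.0)
```

Regardless of the starting point, the composed image contracts at rate $1/3$ per symbol, so the limit depends only on the $\omega$-word, as claimed above.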
More complicated IFSs can be devised, e.g. by constraining the sequences of maps that can be applied to $X$ in the process of defining the attractive invariant set $K$. One possibility is to demand that the words $w$ corresponding to allowed compositions of IFS maps belong to a language over $\mathcal{A}$ (Prusinkiewicz & Hammel, 1992). A specific example of this are recurrent IFS (Barnsley, Elton & Hardin, 1989), where the allowed words $w$ are specified by a topological Markov chain. In this paper we concentrate on IFSs with map compositions restricted to sequences that can be read out by traversing labeled state-transition graphs (see e.g. (Culik & Dube, 1993)).
3.3 IFS associated with state-transition graphs

Consider a labeled directed multigraph $G = (V, E, \lambda)$, where the elements $v \in V$ and $e \in E$ are the vertices (or nodes) and edges of the multigraph $G$, respectively. Each edge $e \in E$ is labeled by a symbol $\lambda(e) \in \mathcal{A}$ from the input alphabet $\mathcal{A} = \{1, 2, ..., A\}$, as prescribed by the map $\lambda : E \to \mathcal{A}$. The map $\lambda$ can be naturally extended to operate on subsets $E' \subseteq E$,
\[
\lambda(E') = \bigcup_{e \in E'} \lambda(e), \tag{7}
\]
and on paths: for $\pi = e_1 e_2 ...$,
\[
\lambda(\pi) = \lambda(e_1)\lambda(e_2)... \in \mathcal{A}^\infty. \tag{8}
\]
The set $E_{uv} \subseteq E$ contains all edges from $u \in V$ to $v \in V$. $E_u = \bigcup_{v \in V} E_{uv}$ is the set of all edges starting at the vertex $u$.
A path in the graph is a sequence $\pi = e_1 e_2 ...$ of edges such that the terminal vertex of each edge $e_i$ is the initial vertex of the next edge $e_{i+1}$. The initial vertex of $e_1$ is the initial vertex $\mathrm{ini}(\pi)$ of $\pi$. For a finite path $\pi = e_1 e_2 ... e_n$ of length $n$, the terminal vertex of $e_n$ is the terminal vertex $\mathrm{term}(\pi)$ of $\pi$. The sequence of labels read out along the path $\pi$ is $\lambda(\pi)$. As in the case of sequences over $\mathcal{A}$, for $i \le j \le n$, $\pi_i^j$ denotes the path $e_i e_{i+1} ... e_j$, and $\pi^-$ denotes the path without the last edge $e_n$, i.e. $\pi^- = \pi_1^{n-1}$.
We write $E_{uv}^{(n)}$ for the set of all paths of length $n$ with initial vertex $u$ and terminal vertex $v$. The set of all paths of length $n$ with initial vertex $u$ is denoted by $E_u^{(n)}$. Analogously, $E_{uv}^{(*)}$ denotes the set of all finite paths from $u$ to $v$; $E_u^{(*)}$ and $E_u^{(\omega)}$ denote the sets of all finite and infinite paths, respectively, starting at vertex $u$.
Definition 3.1 A strictly contracting state-transition graph (SCSTG) is a labeled directed multigraph $G = (V, E, \lambda)$ together with a number $0 < r_a < 1$ for each symbol $a$ from the label alphabet $\mathcal{A}$.

We will assume that the state-transition graph $G$ is both deterministic, i.e. for each vertex $u$ there are no two distinct paths initiated at $u$ with the same label read-out, and strongly connected, i.e. there is a path from any vertex to any other.
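Both standing assumptions are mechanically checkable. A small sketch (our own; the edge-list representation with `(u, v, label)` triples is an assumption, not the paper's notation). Note that forbidding two equally-labeled out-edges at any single vertex is equivalent to the stronger path-level determinism stated above:

```python
from collections import defaultdict

def is_deterministic(edges):
    """No two out-edges of the same vertex carry the same label; for such
    graphs no two distinct paths from a vertex share a label read-out."""
    seen = set()
    for u, _, label in edges:
        if (u, label) in seen:
            return False
        seen.add((u, label))
    return True

def is_strongly_connected(edges, n_vertices):
    """Vertex 0 reaches every vertex in both the graph and its reversal."""
    def reachable(adj):
        stack, seen = [0], {0}
        while stack:
            for w in adj[stack.pop()]:
                if w not in seen:
                    seen.add(w)
                    stack.append(w)
        return seen
    fwd, bwd = defaultdict(list), defaultdict(list)
    for u, v, _ in edges:
        fwd[u].append(v)
        bwd[v].append(u)
    return len(reachable(fwd)) == n_vertices and len(reachable(bwd)) == n_vertices
```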
Definition 3.2 An iterated function system (IFS) associated with a SCSTG $(V, E, \lambda, \{r_a\}_{a \in \mathcal{A}})$ consists of a complete metric space $(X, \|\cdot\|)$³ and a set of $A$ contractions $f_a : X \to X$, one for each input symbol $a \in \mathcal{A}$, such that for all $x, y \in X$,
\[
\|f_a(x) - f_a(y)\| \le r_a \|x - y\|, \quad a \in \mathcal{A}. \tag{9}
\]

³ The metric is expressed through a norm.
The allowed compositions of maps $f_a$ are controlled by the sequences of labels associated with the paths in $G = (V, E, \lambda)$. The IFS image of a point $x \in X$ under a (finite) path $\pi = e_1 e_2 ... e_n$ in $G$ is
\[
\pi(x) = (f_{\lambda(e_1)}(f_{\lambda(e_2)}(...(f_{\lambda(e_n)}(x))...))) = (f_{\lambda(e_1)} \circ f_{\lambda(e_2)} \circ ... \circ f_{\lambda(e_n)})(x). \tag{10}
\]
The contraction factor $r(\pi)$ of the path $\pi$ is calculated as
\[
r(\pi) = \prod_{i=1}^{n} r_{\lambda(e_i)}. \tag{11}
\]
For the empty path $\varepsilon$, $\varepsilon(x) = x$ and $r(\varepsilon) = 1$. For an infinite path $\pi = e_1 e_2 ...$, the corresponding image of a point $x \in X$ is
\[
\pi(x) = \lim_{n \to \infty} (f_{\lambda(e_1)} \circ f_{\lambda(e_2)} \circ ... \circ f_{\lambda(e_n)})(x). \tag{12}
\]
The maps $\pi(\cdot)$ are extended to subsets $X'$ of $X$ as follows: $\pi(X') = \{\pi(x) \mid x \in X'\}$. For each vertex $u \in V$, the set $K_u = \{\pi(X) \mid \pi \in E_u^{(\omega)}\}$ collects the points obtainable along infinite paths starting at $u$; these sets satisfy the invariance $K_u = \bigcup_{e \in E_u} f_{\lambda(e)}(K_{\mathrm{term}(e)})$, and $K = \bigcup_{u \in V} K_u$.

Definition 4.1 Let $(V, E, \lambda, \{r_a\}_{a \in \mathcal{A}})$ be a SCSTG. For $s \ge 0$, define the $|V| \times |V|$ matrix⁴ $M(s)$ with entries
\[
M_{uv}(s) = \sum_{e \in E_{uv}} \left(r_{\lambda(e)}\right)^s. \tag{15}
\]
The unique⁵ number $s_1$ such that the spectral radius of $M(s_1)$ is equal to 1 is called the dimension of the SCSTG $(V, E, \lambda, \{r_a\}_{a \in \mathcal{A}})$.
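The dimension $s_1$ is defined only implicitly, but since every $r_a < 1$ makes $\rho(M(s))$ strictly decreasing in $s$, it is easy to locate numerically by bisection. A minimal sketch (our own illustration, not from the paper; it assumes NumPy and a toy edge list of `(u, v, r)` triples):

```python
import numpy as np

def scstg_dimension(edges, n_vertices, tol=1e-10):
    """Find s with spectral radius rho(M(s)) = 1, where
    M_uv(s) = sum over edges u -> v of r_lambda(e)**s."""
    def rho(s):
        M = np.zeros((n_vertices, n_vertices))
        for u, v, r in edges:
            M[u, v] += r ** s
        return max(abs(np.linalg.eigvals(M)))
    lo, hi = 0.0, 1.0
    while rho(hi) > 1.0:     # rho is strictly decreasing in s since each r < 1
        hi *= 2.0
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if rho(mid) > 1.0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

# One vertex, two self-loops with rate 1/3 (the middle-third Cantor set):
s1 = scstg_dimension([(0, 0, 1/3), (0, 0, 1/3)], 1)
```

For the Cantor example the result agrees with the classical value $\log 2 / \log 3 \approx 0.6309$.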
Theorem 4.2 Let $G = (V, E, \lambda)$ be a labeled directed multigraph and $(X, \{f_a\}_{a \in \mathcal{A}})$ an IFS associated with the SCSTG $(G, \{r_a\}_{a \in \mathcal{A}})$. Let $s_1$ be the dimension of the SCSTG $(G, \{r_a\}_{a \in \mathcal{A}})$. Then for all $u \in V$, $\dim_B^+ K_u \le s_1$, and also $\dim_B^+ K \le s_1$.
⁴ We slightly abuse mathematical notation by identifying the vertices of $G$ with the integers $1, 2, ..., |V|$.

⁵ Uniqueness follows from the Perron-Frobenius theory (see e.g. (Minc, 1988)).
Proof: By the Perron-Frobenius theory of nonnegative irreducible⁶ matrices (Minc, 1988), the spectral radius (equal to 1) of the matrix $M(s_1)$ is actually an eigenvalue of $M(s_1)$ with an associated positive eigenvector $\{\alpha_u\}_{u \in V}$ satisfying
\[
\alpha_u = \sum_{v \in V} \sum_{e \in E_{uv}} \left(r_{\lambda(e)}\right)^{s_1} \alpha_v, \qquad \sum_{u \in V} \alpha_u = 1, \qquad \alpha_u > 0. \tag{16}
\]
Given a vertex $u \in V$, it is possible to define a measure $\mu$ on the set of cylinders
\[
C_u = \left\{ [w] \mid w \in \lambda\!\left(E_u^{(*)}\right) \right\} \tag{17}
\]
by postulating:
\[
\text{for } w = \lambda(\pi),\ \pi \in E_{uv}^{(*)}: \quad \mu([w]) = \alpha_v \, r(\pi)^{s_1}. \tag{18}
\]
Indeed, for $\pi \in E_{uv}^{(*)}$, $w = \lambda(\pi)$, we have, by (16),
\[
\mu([w]) = \sum_{e \in E_v} \mu([w \lambda(e)]),
\]
so $\mu$ is consistently defined on the cylinders. The measure extends to a Borel measure on $\lambda(E_u^{(\omega)})$ (Edgar & Golds, 1999).
Consider a vertex u V . Fix a positive number . We dene a (cross-cut)
set
T
= {| E
()
u
, r() < r(
)}. (19)
Note that T
E
()
u
there
is exactly one prex length n such that (
)
n
1
T
. The sets T
and (T
)
partition the sets E
()
u
and (E
()
u
), respectively. The collection
C
u
= {(K
term()
)| T
} (20)
⁶ The matrix $M(s_1)$ is irreducible because the underlying graph $(V, E, \lambda)$ is strongly connected.
covers the set $K_u$ with sets of diameter less than $\epsilon = \delta \cdot \mathrm{diam}(X)$.
In order to estimate $N_\epsilon(K_u)$, which is upper bounded by the cardinality $\mathrm{card}(T_\delta)$ of $T_\delta$, we write
\[
\alpha_u = \mu\!\left(\lambda\!\left(E_u^{(\omega)}\right)\right) = \sum_{\pi \in T_\delta} \alpha_{\mathrm{term}(\pi)} \left(r(\pi)\right)^{s_1}. \tag{21}
\]
By (19), we have for each $\pi = e_1 e_2 ... e_n \in T_\delta$,
\[
r(\pi) = r(\pi^-)\, r_{\lambda(e_n)} \ge \delta\, r_{\min},
\]
and so from (21) we obtain
\[
\alpha_{\max} \;\ge\; \alpha_u \;\ge\; \mathrm{card}(T_\delta)\, \alpha_{\min} \left(\delta\, r_{\min}\right)^{s_1}, \tag{22}
\]
where $\mathrm{card}(T_\delta)$ is the cardinality of $T_\delta$ and
\[
\alpha_{\max} = \max_{v \in V} \alpha_v, \qquad \alpha_{\min} = \min_{v \in V} \alpha_v \qquad \text{and} \qquad r_{\min} = \min_{a \in \mathcal{A}} r_a. \tag{23}
\]
Hence,
\[
N_\epsilon(K_u) \;\le\; \mathrm{card}(T_\delta) \;\le\; \frac{\alpha_{\max}}{\alpha_{\min}} \left(\delta\, r_{\min}\right)^{-s_1} \;=\; \frac{\alpha_{\max}}{\alpha_{\min}} \left(\frac{r_{\min}}{\mathrm{diam}(X)}\right)^{-s_1} \epsilon^{-s_1}.
\]
It follows that
\[
\log N_\epsilon(K_u) \le -s_1 \log \epsilon + \mathrm{Const}, \tag{24}
\]
where Const is a constant term with respect to $\epsilon$. Dividing (24) by $-\log \epsilon > 0$, and letting the diameter of the covering sets diminish, we obtain
\[
\dim_B^+ K_u = \limsup_{\epsilon \to 0} \frac{\log N_\epsilon(K_u)}{-\log \epsilon} \le s_1. \tag{25}
\]
If we denote the number of vertices in $V$ by $\mathrm{card}(V)$, then the number of sets with diameter less than $\epsilon$ needed to cover the set $K$ can be upper bounded by
\[
N_\epsilon(K) \;\le\; \sum_{u \in V} N_\epsilon(K_u) \;\le\; \mathrm{card}(V)\, \frac{\alpha_{\max}}{\alpha_{\min}} \left(\frac{r_{\min}}{\mathrm{diam}(X)}\right)^{-s_1} \epsilon^{-s_1},
\]
and we again obtain
\[
\log N_\epsilon(K) \le -s_1 \log \epsilon + \mathrm{Const}', \tag{26}
\]
where $\mathrm{Const}'$ is a constant term with respect to $\epsilon$. Hence $\dim_B^+ K \le s_1$ as well. $\Box$
We now transform the result of Theorem 4.2 into a weaker, but more accessible upper bound.

Theorem 4.3 Under the assumptions of Theorem 4.2, for all $u \in V$,
\[
\dim_B^+ K_u \le s_1 \le \frac{\log \rho(G)}{-\log r_{\max}},
\]
where $\rho(G)$ is the spectral radius of the adjacency matrix $G$ of the multigraph and $r_{\max} = \max_{a \in \mathcal{A}} r_a$.

Proof: Consider a SCSTG $(G, r_a = r_{\max},\ a \in \mathcal{A})$. The matrix (15) for this SCSTG is $N(s) = r_{\max}^s\, G$ with $\rho(N(s)) = r_{\max}^s\, \rho(G)$, and the dimension associated with it is
\[
s_{\max} = \frac{\log \rho(G)}{-\log r_{\max}}.
\]
Define $\Delta = s_{\max} - s_1$, so that $N(s_{\max}) = r_{\max}^{\Delta} N(s_1)$.

Since for all $a \in \mathcal{A}$, $r_{\max} \ge r_a$, we have $0 < M(s_1) \le N(s_1)$, where $M \le N$ for two positive matrices $M, N$ of the same size means that $\le$ applies element-wise, i.e. $0 \le M_{uv} \le N_{uv}$. Hence, $\rho(M(s_1)) \le \rho(N(s_1))$.

Observe that
\[
\rho\!\left(N(s_{\max})\right) = 1 = \rho\!\left(r_{\max}^{\Delta} N(s_1)\right) = r_{\max}^{\Delta}\, \rho\!\left(N(s_1)\right).
\]
But $\rho(N(s_1)) \ge \rho(M(s_1)) = 1$, so $r_{\max}^{\Delta} \le 1$ must hold. Since $r_{\max} < 1$, we have that $\Delta \ge 0$, which means $s_{\max} \ge s_1$. $\Box$
4.2 Lower bounds

Imagine that besides the Lipschitz bounds (9) for the IFS maps $\{f_a\}_{a \in \mathcal{A}}$, we also have lower bounds: for all $x, y \in X$,
\[
\|f_a(x) - f_a(y)\| \ge \underline{r}_a \|x - y\|, \quad \underline{r}_a > 0, \quad a \in \mathcal{A}. \tag{29}
\]
We will show that, under certain conditions, the bounds (29) induce lower bounds on the Hausdorff dimension of the sets $K_u$. Roughly speaking, to estimate lower bounds on the dimension of the $K_u$'s, we need to make sure that our lower bounds on covering set size are not invalidated by trivial overlaps between the IFS basis sets.
Definition 4.4 Let $G = (V, E, \lambda)$ be a labeled directed multigraph. We say that an IFS $(X, \{f_a\}_{a \in \mathcal{A}})$ associated with the SCSTG $(G, \{r_a\}_{a \in \mathcal{A}})$ falls into the disjoint case if, for any $u, v, v' \in V$, $e \in E_{uv}$, $e' \in E_{uv'}$, $e \ne e'$, we have⁸
\[
f_{\lambda(e)}(K_v) \cap f_{\lambda(e')}(K_{v'}) = \emptyset.
\]
Theorem 4.5 Consider a labeled directed multigraph $G = (V, E, \lambda)$ and an IFS $(X, \{f_a\}_{a \in \mathcal{A}})$ associated with the SCSTG $(G, \{r_a\}_{a \in \mathcal{A}})$. Suppose the IFS falls into the disjoint case. Let $s_2$ be the dimension of the SCSTG $(G, \{\underline{r}_a\}_{a \in \mathcal{A}})$. Then for each vertex $u \in V$, $\dim_H K_u \ge s_2$.
Proof: First, define a distance between subsets of $X$ as
\[
d(B_1, B_2) = \inf_{x \in B_1} \inf_{y \in B_2} \|x - y\|, \quad B_1, B_2 \subseteq X.
\]
Because the IFS falls into the disjoint case and the sets $K_u$, $u \in V$, are compact, there is some $\theta > 0$ such that for all $u, v, v' \in V$, $e \in E_{uv}$, $e' \in E_{uv'}$, $e \ne e'$, we have
\[
d\!\left(f_{\lambda(e)}(K_v),\; f_{\lambda(e')}(K_{v'})\right) > \theta. \tag{30}
\]
Fix a vertex $u \in V$ and consider two paths $\pi_1, \pi_2$ from $E_u^{(*)}$ such that the corresponding label sequences $\lambda(\pi_1)$ and $\lambda(\pi_2)$ are incomparable. Let $w$ be the longest common prefix of $\lambda(\pi_1), \lambda(\pi_2)$, and let $\rho$ be the initial subpath of $\pi_1$ that supports the label sequence $w$, i.e. $\rho = (\pi_1)_1^{|w|}$. We have $|w| < |\lambda(\pi_1)|$, and so $w \le \lambda(\pi_1^-)$, which means $\underline{r}(\rho) \ge \underline{r}(\pi_1^-)$. By determinism of $G$, $\rho$ is also the initial subpath of $\pi_2$, so the paths $\pi_1$ and $\pi_2$ can be written as $\pi_1 = \rho\, e\, \tilde{\pi}_1$ and $\pi_2 = \rho\, e'\, \tilde{\pi}_2$, where $e \ne e'$ are single edges and the paths $\tilde{\pi}_1, \tilde{\pi}_2$ may happen to be empty.
In any case,
\[
\tilde{\pi}_1\!\left(K_{\mathrm{term}(\pi_1)}\right) \subseteq K_{\mathrm{term}(e)}, \qquad \tilde{\pi}_2\!\left(K_{\mathrm{term}(\pi_2)}\right) \subseteq K_{\mathrm{term}(e')},
\]
and we have
\[
d\!\left(e\tilde{\pi}_1(K_{\mathrm{term}(\pi_1)}),\; e'\tilde{\pi}_2(K_{\mathrm{term}(\pi_2)})\right) \;\ge\; d\!\left(e(K_{\mathrm{term}(e)}),\; e'(K_{\mathrm{term}(e')})\right) \;>\; \theta.
\]

⁸ By determinicity of $G$, $\lambda(e) \ne \lambda(e')$.
This leads to
\begin{align}
d\!\left(\pi_1(K_{\mathrm{term}(\pi_1)}),\, \pi_2(K_{\mathrm{term}(\pi_2)})\right)
&= d\!\left(\rho e\tilde{\pi}_1(K_{\mathrm{term}(\pi_1)}),\, \rho e'\tilde{\pi}_2(K_{\mathrm{term}(\pi_2)})\right) \tag{31}\\
&\ge \underline{r}(\rho)\; d\!\left(e\tilde{\pi}_1(K_{\mathrm{term}(\pi_1)}),\, e'\tilde{\pi}_2(K_{\mathrm{term}(\pi_2)})\right) \tag{32}\\
&> \underline{r}(\rho)\,\theta \tag{33}\\
&\ge \underline{r}(\pi_1^-)\,\theta. \tag{34}
\end{align}
Now, as in the upper bound case (Section 4.1), we construct a matrix $\underline{M}(s)$,
\[
\underline{M}_{uv}(s) = \sum_{e \in E_{uv}} \left(\underline{r}_{\lambda(e)}\right)^s. \tag{35}
\]
Let $s_2$ be the unique number such that the spectral radius of $\underline{M}(s_2)$ is equal to 1. Again, 1 is an eigenvalue of $\underline{M}(s_2)$ with an associated positive eigenvector $\{\underline{\alpha}_u\}_{u \in V}$ satisfying
\[
\underline{\alpha}_u = \sum_{v \in V} \sum_{e \in E_{uv}} \left(\underline{r}_{\lambda(e)}\right)^{s_2} \underline{\alpha}_v, \qquad \sum_{u \in V} \underline{\alpha}_u = 1, \qquad \underline{\alpha}_u > 0. \tag{36}
\]
As before, we define a measure
\[
\underline{\mu}([w]) = \underline{\alpha}_v\, \underline{r}(\pi)^{s_2}, \qquad \text{for } w = \lambda(\pi),\ \pi \in E_{uv}^{(*)}. \tag{37}
\]
The measure $\underline{\mu}$ extends to a Borel measure on $\lambda(E_u^{(\omega)})$. Given a Borel set $B$ with $0 < \mathrm{diam}(B) < \theta$, set
\[
\delta = \frac{\mathrm{diam}(B)}{\theta}. \tag{38}
\]
The corresponding cross-cut set is
\[
\underline{T}_\delta = \left\{ \pi \mid \pi \in E_u^{(*)},\ \underline{r}(\pi) < \delta \le \underline{r}(\pi^-) \right\}. \tag{39}
\]
From (38) and the definition of $\underline{T}_\delta$, we have
\[
\text{for each } \pi \in \underline{T}_\delta: \quad \mathrm{diam}(B) = \delta\theta \le \underline{r}(\pi^-)\,\theta. \tag{40}
\]
The label strings $\lambda(\pi)$ associated with the paths in $\underline{T}_\delta$ are incomparable, implying that there is at most one path $\pi \in \underline{T}_\delta$ such that $\pi(K_{\mathrm{term}(\pi)}) \cap B \ne \emptyset$. To see this, note that by (31)-(34) and (40), for any two paths $\pi_1, \pi_2 \in \underline{T}_\delta$, we have
\[
d\!\left(\pi_1(K_{\mathrm{term}(\pi_1)}),\, \pi_2(K_{\mathrm{term}(\pi_2)})\right) \;>\; \underline{r}(\pi_1^-)\,\theta \;\ge\; \mathrm{diam}(B).
\]
Denote by $\tilde{B}$ the set of all $\omega$-sequences associated with infinite paths starting in the vertex $u$ that map $X$ into a point in $B$, i.e.
\[
\tilde{B} = \left\{ \lambda(\pi) \mid \pi \in E_u^{(\omega)},\ \pi(X) \in B \right\}. \tag{41}
\]
Since every infinite path from $u$ has exactly one prefix in the cross-cut set $\underline{T}_\delta$, if $\tilde{B}$ is non-empty there is a path $\pi_B \in \underline{T}_\delta$ such that $\pi_B(K_{\mathrm{term}(\pi_B)})$ meets $B \cap K_u$. In fact, as argued above, it is the only such path from $\underline{T}_\delta$, and so $\tilde{B} \subseteq [\lambda(\pi_B)]$. In terms of measure (see Eq. (37)),
\[
\underline{\mu}(\tilde{B}) \;\le\; \underline{\mu}\!\left([\lambda(\pi_B)]\right) \;\le\; \underline{\alpha}_{\max}\, \underline{r}(\pi_B)^{s_2}, \tag{42}
\]
where $\underline{\alpha}_{\max} = \max_{v \in V} \underline{\alpha}_v$. From the definition of $\underline{T}_\delta$ we have
\[
\mathrm{diam}(B) = \delta\theta > \underline{r}(\pi_B)\,\theta.
\]
Using this inequality in (42) we obtain
\[
\underline{\mu}(\tilde{B}) \;\le\; \underline{\alpha}_{\max} \left(\frac{\mathrm{diam}(B)}{\theta}\right)^{s_2}. \tag{43}
\]
Now, for any countable cover $\mathcal{C}_u$ of $K_u$ by Borel sets we have
\[
\sum_{B \in \mathcal{C}_u} (\mathrm{diam}(B))^{s_2}
\;\ge\; \frac{\theta^{s_2}}{\underline{\alpha}_{\max}} \sum_{B \in \mathcal{C}_u} \underline{\mu}(\tilde{B})
\;=\; \frac{\theta^{s_2}}{\underline{\alpha}_{\max}} \sum_{B \in \mathcal{C}_u} \nu(B)
\;\ge\; \frac{\theta^{s_2}}{\underline{\alpha}_{\max}}\, \nu(K_u), \tag{44}
\]
where $\nu$ is the pushed-forward measure on $K_u$ induced by the measure $\underline{\mu}$. Note that this is a correct construction, since the IFS falls into the disjoint case.
Consulting Definition 2.2 and noting that it is always the case that
\[
H^{s_2}(K_u) \;\ge\; \frac{\theta^{s_2}}{\underline{\alpha}_{\max}}\, \nu(K_u) \;>\; 0,
\]
we conclude that $\dim_H K_u \ge s_2$. $\Box$
As in the case of upper bounds, we transform the results of the previous theorem into a weaker, but more accessible lower bound.

Theorem 4.6 Under the assumptions of Theorem 4.5, for each vertex $u \in V$,
\[
\dim_H K_u \;\ge\; s_2 \;\ge\; \frac{\log \rho(G)}{-\log \underline{r}_{\min}},
\]
where $\underline{r}_{\min} = \min_{a \in \mathcal{A}} \underline{r}_a$.
Proof: As in the proof of Theorem 4.3, consider a SCSTG $(G, r_a = \underline{r}_{\min},\ a \in \mathcal{A})$. The matrix (15) for this SCSTG is $N(s) = \underline{r}_{\min}^s\, G$ with $\rho(N(s)) = \underline{r}_{\min}^s\, \rho(G)$. The dimension $s_{\min}$ associated with such a SCSTG is
\[
s_{\min} = \frac{\log \rho(G)}{-\log \underline{r}_{\min}}. \tag{45}
\]
Define $\Delta = s_{\min} - s_2$. We have $N(s_{\min}) = \underline{r}_{\min}^{\Delta} N(s_2)$. Since $\underline{r}_{\min} \le \underline{r}_a$ for all $a \in \mathcal{A}$, it holds that $0 < N(s_2) \le \underline{M}(s_2)$, and so $\rho(N(s_2)) \le \rho(\underline{M}(s_2))$. Note that
\[
\rho\!\left(N(s_{\min})\right) = 1 = \underline{r}_{\min}^{\Delta}\, \rho\!\left(N(s_2)\right)
\]
and $\rho(N(s_2)) \le \rho(\underline{M}(s_2)) = 1$. It follows that $\underline{r}_{\min}^{\Delta} \ge 1$ must hold. Since $\underline{r}_{\min} < 1$, we have $\Delta \le 0$, which means that $s_{\min} \le s_2$. $\Box$
4.3 Discussion of the quantities involved in the bounds

The numbers $s_1$ and $s_2$ are well-known quantities in the literature on fractal sets generated by constructions employing maps with different contraction rates (see e.g. (Fernau & Staiger, 1994; Fernau & Staiger, 2001; Edgar & Golds, 1999)). Since there is no closed-form analytical solution for $s_1$ as introduced in Definition 4.1, the dimensions $s_1, s_2$ are defined only implicitly. Loosely speaking, the spectral radius of $M(s)$ is related to the variation of the cover function $H^s_\delta(K)$ (see Definition 2.2), and the only way of getting a finite change-over point $s$ in Eq. (3) is to set the spectral radius to 1.

The situation changes when we confine ourselves to a constant contraction rate across all IFS maps. As seen in the proof of Theorem 4.3, in this case there is a closed-form solution for $s$: a scaled logarithm of the spectral radius of the adjacency matrix, which determines the growth rate of the number of allowed label strings as the sequence length increases. The major difficulty with having different contraction rates in the IFS maps is that the areas on the fractal support that need to be covered by sets of decreasing diameters cannot be simply estimated by counting the number of different label sequences of a certain length that can be obtained by traversing the underlying multigraph.
5 Recurrent networks as IFS

In this paper we concentrate on first-order (discrete-time) recurrent neural networks (RNNs) of $N$ recurrent neurons with dynamics
\[
x_n(t) = g\!\left( \sum_{j=1}^{D} v_{nj}\, i_j(t) + \sum_{j=1}^{N} w_{nj}\, x_j(t-1) + d_n \right), \tag{46}
\]
where $x_j(t)$ and $i_j(t)$ are elements, at time $t$, of the state and input vectors, $x(t) = (x_1(t), ..., x_N(t))^T$ and $i(t) = (i_1(t), ..., i_D(t))^T$, respectively, $w_{nj}$ and $v_{nj}$ are recurrent and input connection weights, respectively, the $d_n$'s constitute the bias vector $d = (d_1, ..., d_N)^T$, and $g(\cdot)$ is an injective, non-constant, differentiable activation function of bounded derivative from $\mathbb{R}$ to a bounded interval $\Omega \subset \mathbb{R}$ of length $|\Omega|$. Denoting by $V$ and $W$ the $N \times D$ and $N \times N$ weight matrices $(v_{nj})$ and $(w_{nj})$, respectively, we rewrite (46) in matrix form:
\[
x(t) = G(V i(t) + W x(t-1) + d), \tag{47}
\]
where $G : \mathbb{R}^N \to \Omega^N$ is the element-wise application of $g$.
Assume the RNN is processing strings over the finite alphabet $\mathcal{A} = \{1, 2, ..., A\}$. Symbols $a \in \mathcal{A}$ are presented at the network input as unique $D$-dimensional codes $c_a \in \mathbb{R}^D$, one for each symbol $a \in \mathcal{A}$. The RNN can be viewed as a non-linear IFS consisting of a collection of $A$ maps $\{f_a\}_{a=1}^{A}$ acting on $\Omega^N$,
\[
f_a(x) = (G \circ T_a)(x), \tag{48}
\]
where
\[
T_a(x) = W x + d_a \tag{49}
\]
and
\[
d_a = V c_a + d. \tag{50}
\]
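The maps (48)-(50) can be sketched directly. This is our own toy illustration (the weights, codes, and $g = \tanh$ are made-up assumptions, chosen small so that every $f_a$ is a contraction):

```python
import numpy as np

def make_ifs_maps(W, V, d, codes, g=np.tanh):
    """One map f_a(x) = g(W x + d_a), d_a = V c_a + d, per symbol (Eqs. 48-50)."""
    return {a: (lambda x, d_a=V @ c + d: g(W @ x + d_a))
            for a, c in codes.items()}

# Toy weights, small so that every f_a is contractive:
W = 0.1 * np.eye(2)
V = np.eye(2)
d = np.zeros(2)
codes = {1: np.array([0.5, 0.5]), 2: np.array([-0.5, -0.5])}
f = make_ifs_maps(W, V, d, codes)

# Driving the network with the same symbol from two different start states:
# the trajectories merge, illustrating the contractive fixed-point dynamics.
x, y = np.zeros(2), 0.9 * np.ones(2)
for _ in range(50):
    x, y = f[1](x), f[1](y)
```

With small recurrent weights each symbol pulls the state towards a symbol-specific region, which is exactly the IFS picture used in the bounds below.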
For a set $B \subseteq \mathbb{R}^N$, we denote by $[B]_i$ the slice of $B$ defined as
\[
[B]_i = \{x_i \mid x = (x_1, ..., x_N)^T \in B\}, \quad i = 1, 2, ..., N. \tag{51}
\]
When symbol $a \in \mathcal{A}$ is at the network input, the range of possible net-in activations on the recurrent units is the set
\[
R_a = \bigcup_{1 \le i \le N} \left[T_a(\Omega^N)\right]_i. \tag{52}
\]
Recall that the singular values $\sigma_1, ..., \sigma_N$ of the matrix $W$ are the positive square roots of the eigenvalues of $W W^T$. The singular values are the lengths of the (mutually perpendicular) principal semiaxes of the image of the unit ball under the linear map defined by the matrix $W$. We assume that $W$ is non-singular and adopt the convention that $\sigma_1 \ge \sigma_2 \ge ... \ge \sigma_N > 0$. Denote by $\sigma_{\max}(W)$ and $\sigma_{\min}(W)$ the largest and the smallest singular values of $W$, respectively, i.e. $\sigma_{\max}(W) = \sigma_1$ and $\sigma_{\min}(W) = \sigma_N$. The next lemma gives us conditions under which the maps $f_a$ are contractive.
Lemma 5.1 Define
\[
r^{\min}_a = \sigma_{\min}(W) \inf_{z \in R_a} |g'(z)| \tag{53}
\]
and
\[
r^{\max}_a = \sigma_{\max}(W) \sup_{z \in R_a} |g'(z)|. \tag{54}
\]
Then for all $x, y \in \Omega^N$,
\[
r^{\min}_a \|x - y\| \;\le\; \|f_a(x) - f_a(y)\| \;\le\; r^{\max}_a \|x - y\|.
\]
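For $g = \tanh$ (so $\Omega = (-1, 1)$), the quantities of Lemma 5.1 are easy to evaluate numerically. The sketch below is our own illustration: it brackets each slice of $R_a$ by an interval using the $\ell_1$ norms of the rows of $W$, and uses the fact that $\tanh'(z) = 1 - \tanh(z)^2$ peaks at $z = 0$ and decays with $|z|$.

```python
import numpy as np

def contraction_bounds(W, V, d, codes):
    """Numeric r_a^min, r_a^max of Lemma 5.1 for g = tanh on (-1, 1)^N."""
    sigma = np.linalg.svd(W, compute_uv=False)
    row_l1 = np.abs(W).sum(axis=1)            # spread of (W x)_i over (-1, 1)^N
    r_min, r_max = {}, {}
    for a, c in codes.items():
        d_a = V @ c + d
        lo, hi = d_a - row_l1, d_a + row_l1   # interval bounds on each slice of R_a
        near = np.where((lo <= 0) & (hi >= 0), 0.0,
                        np.minimum(np.abs(lo), np.abs(hi)))
        far = np.maximum(np.abs(lo), np.abs(hi))
        r_max[a] = sigma[0] * (1 - np.tanh(near) ** 2).max()   # sup of |g'| on R_a
        r_min[a] = sigma[-1] * (1 - np.tanh(far) ** 2).min()   # inf of |g'| on R_a
    return r_min, r_max
```

Symbols whose codes push the net-in activations far from zero saturate $g$ and therefore get smaller contraction coefficients.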
6 RNNs driven by finite automata

Major research effort has been devoted to the study of learning and implementation of regular grammars and finite-state transducers using RNNs (Casey, 1996; Cleeremans, Servan-Schreiber & McClelland, 1989; Elman, 1990; Forcada & Carrasco, 1995; Frasconi et al., 1996; Giles et al., 1992; Manolios & Fanelli, 1994; Tiňo & Šajda, 1995; Watrous & Kuhn, 1992). In most cases, the input sequences presented at the input of RNNs were obtained by traversing the underlying finite-state automaton/machine. Of particular interest to us are approaches that did not reset the RNN state after the presentation of each input string, but learned the appropriate resetting behavior during the training process, e.g. (Forcada & Carrasco, 1995; Tiňo & Šajda, 1995). In such cases RNNs are fed by a concatenation of input strings over some input alphabet, separated by a special end-of-string symbol $\#$. We can think of such sequences as finite substrings of infinite sequences generated by traversing the underlying state-transition graph extended by adding edges labeled by $\#$ that terminate in the initial state. The added edges start at all the states in the case of transducers, or at the final states in the case of acceptors.
Let $G = (V, E, \lambda)$ represent such an extended state-transition graph. Define
\[
\Delta = \sup_{z \in \bigcup_{a \in \mathcal{A}} R_a} |g'(z)| = \max_{a \in \mathcal{A}} \sup_{z \in R_a} |g'(z)| \tag{55}
\]
and
\[
q = \min_{a, a' \in \mathcal{A},\ a \ne a'} \left\| V(c_a - c_{a'}) \right\|. \tag{56}
\]
Theorem 6.1 Let the RNN (47) be driven with sequences generated by traversing a strongly connected state-transition graph $G = (V, E, \lambda)$ with adjacency matrix $G$. Suppose $\sigma_{\max}(W) < \frac{1}{\Delta}$. Let $s_1$ be the dimension of the SCSTG $(G, \{r^{\max}_a\}_{a \in \mathcal{A}})$. Then the (upper box-counting) fractal dimensions of the attractors $K_u$, $u \in V$, approximated by the RNN activation vectors $x(t)$, are upper-bounded by
\[
\forall u \in V: \quad \dim_B^+ K_u \;\le\; s_1 \;\le\; \frac{\log \rho(G)}{-\log r_{\max}}, \tag{57}
\]
and also
\[
\dim_B^+ K \;\le\; s_1 \;\le\; \frac{\log \rho(G)}{-\log r_{\max}},
\]
where $r_{\max} = \max_{a \in \mathcal{A}} r^{\max}_a$.
Proof: Since $\sigma_{\max}(W) < \frac{1}{\Delta}$, by (55) and Lemma 5.1, each IFS map $f_a$, $a \in \mathcal{A}$, is a contraction with contraction coefficient $r^{\max}_a < 1$.
One is now tempted to invoke Theorems 4.2 and 4.3. Note, however, that in the theory developed in Sections 3 and 4, for allowed paths $\pi$ in $G$, the compositions of IFS maps $\pi(\cdot)$ are applied in the reversed manner⁹ (see Eq. (10)), i.e. if $\lambda(\pi) = w = s_1 s_2 ... s_n \in \mathcal{A}^+$, then $\pi(x) = (f_{s_1} \circ f_{s_2} \circ ... \circ f_{s_n})(x)$. Actually, this does not pose any difficulty, since we can work with a new state-transition graph $G^R = (V, E^R, \lambda^R)$ associated with $G = (V, E, \lambda)$. $G^R$ is completely the same as $G$, up to the orientation of the edges. For every edge $e \in E_{uv}$ with label $\lambda(e)$ there is an associated edge $e^R \in E^R_{vu}$ labeled by $\lambda^R(e^R) = \lambda(e)$.

The matrices $M(s)$ (15) corresponding to IFSs based on $G$ are just transposed versions of the matrices $M^R(s)$ corresponding to IFSs based on $G^R$. However, the spectral radius is invariant with respect to the transpose operation, so the results of Section 4 can be directly employed. $\Box$
We now turn to lower bounds.

Theorem 6.2 Consider an RNN (47) driven with sequences generated by traversing a strongly connected state-transition graph $G = (V, E, \lambda)$ with adjacency matrix $G$. Suppose
\[
\sigma_{\max}(W) < \min\left\{ \frac{1}{\Delta},\; \frac{q}{\sqrt{N}\, |\Omega|} \right\}. \tag{58}
\]
Let $s_2$ be the dimension of the SCSTG $(G, \{r^{\min}_a\}_{a \in \mathcal{A}})$. Then the (Hausdorff) dimensions of the attractors $K_u$, $u \in V$, approximated by the RNN activation vectors $x(t)$, are lower-bounded by
\[
\forall u \in V: \quad \dim_H K_u \;\ge\; s_2 \;\ge\; \frac{\log \rho(G)}{-\log r_{\min}}, \tag{59}
\]
where $r_{\min} = \min_{a \in \mathcal{A}} r^{\min}_a$.

⁹ This approach, usual in the IFS literature, is convenient because of the convergence properties of the maps $w(X)$, $w \in \mathcal{A}^\omega$.
Proof: Note that $\mathrm{diam}(\Omega^N) = \sqrt{N}\,|\Omega|$. It follows from (58) that
\[
q = \min_{a \ne a'} \|V(c_a - c_{a'})\| = \min_{a \ne a'} \|d_a - d_{a'}\| > \sigma_{\max}(W)\,\mathrm{diam}(\Omega^N),
\]
and so, since for $x, y \in \Omega^N$ we have $\|T_a(x) - T_{a'}(y)\| \ge \|d_a - d_{a'}\| - \|W(x - y)\| > 0$,
\[
\text{for all } a, a' \in \mathcal{A},\ a \ne a': \quad T_a(\Omega^N) \cap T_{a'}(\Omega^N) = \emptyset.
\]
Since $G$ is injective, we also have
\[
a, a' \in \mathcal{A},\ a \ne a': \quad (G \circ T_a)(\Omega^N) \cap (G \circ T_{a'})(\Omega^N) = \emptyset,
\]
which, by (48), implies $f_a(\Omega^N) \cap f_{a'}(\Omega^N) = \emptyset$ for all $a, a' \in \mathcal{A}$, $a \ne a'$. Hence, the IFS $\{f_a\}_{a \in \mathcal{A}}$ falls into the disjoint case. We can now invoke Theorems 4.5 and 4.6. $\Box$
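The separation part of condition (58) is straightforward to verify numerically. A sketch (our own; it checks only the second term $q / (\sqrt{N}\,|\Omega|)$, not the contraction term $1/\Delta$, and assumes $g = \tanh$ so that $|\Omega| = 2$):

```python
import numpy as np

def disjoint_case_holds(W, V, codes, omega_len=2.0):
    """Check the separation part of (58): sigma_max(W) < q / (sqrt(N)*|Omega|),
    with q = min over distinct symbol pairs of ||V (c_a - c_a')||.
    For g = tanh the state interval is Omega = (-1, 1), so |Omega| = 2."""
    N = W.shape[0]
    sigma_max = np.linalg.svd(W, compute_uv=False)[0]
    cs = list(codes.values())
    q = min(np.linalg.norm(V @ (cs[i] - cs[j]))
            for i in range(len(cs)) for j in range(len(cs)) if i != j)
    return sigma_max < q / (np.sqrt(N) * omega_len)
```

Small recurrent weights together with well-separated input codes make the condition easy to satisfy, which is why the disjoint case is the typical regime for randomly initialized networks.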
Using (4), we summarize the bounds in the following corollary.

Corollary 6.3 Under the assumptions of Theorems 6.1 and 6.2, for all $u \in V$,
\[
\frac{\log \rho(G)}{-\log r_{\min}} \;\le\; s_2 \;\le\; \dim_H K_u \;\le\; \dim_B^- K_u \;\le\; \dim_B^+ K_u \;\le\; s_1 \;\le\; \frac{\log \rho(G)}{-\log r_{\max}}. \tag{60}
\]
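The outer bracket of (60) depends only on the adjacency matrix and the extreme contraction rates, so it can be evaluated directly. A small numeric sketch (our own illustration; the two-state graph and the rates are made-up values):

```python
import numpy as np

def dimension_bracket(G_adj, r_min, r_max):
    """Outer bracket of (60): log rho(G)/(-log r_min) and log rho(G)/(-log r_max)."""
    rho = max(abs(np.linalg.eigvals(G_adj)))
    return np.log(rho) / -np.log(r_min), np.log(rho) / -np.log(r_max)

# Two-state graph in which every vertex emits two edges, so rho(G) = 2:
G_adj = np.array([[1.0, 1.0],
                  [1.0, 1.0]])
lo, hi = dimension_bracket(G_adj, r_min=0.2, r_max=0.4)
```

Smaller contraction rates tighten the state clusters and drive both ends of the bracket down, in line with the discussion in Section 4.3.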
7 Discussion

Recently, we have extended the work of Kolen and others (e.g. (Christiansen & Chater, 1999; Kolen, 1994; Kolen, 1994a; Manolios & Fanelli, 1994)) by pointing out that in dynamical tasks of a symbolic nature, when initialized with small weights, recurrent neural networks (RNNs), with e.g. sigmoid activation functions, form contractive IFS. In particular, each of the dynamical systems (48) is driven by a single attractive fixed point, one for each $a \in \mathcal{A}$ (Tiňo, Horne & Giles, 2001; Tiňo et al., 1998). Actually, this type of simple dynamics is a reason for starting to train recurrent networks from small weights. Unless one has strong prior knowledge about the network dynamics (Giles & Omlin, 1993), the sequence of bifurcations leading to a desired network behavior may be hard to achieve when starting from an arbitrarily complicated network dynamics (Doya, 1992).

The tendency of RNNs to form clusters in the state space before training was first reported by Servan-Schreiber, Cleeremans and McClelland (1989), and later by Kolen (1994; 1994a) and Christiansen & Chater (1999). We have shown that RNNs randomly initiated with small weights are inherently biased towards Markov models, i.e. even prior to any training, RNN dynamics can be readily used to extract finite memory machines (Hammer & Tiňo, 2002; Tiňo, Čerňanský & Beňušková, 2002; Tiňo, Čerňanský & Beňušková, 2002a). In other words, even prior to training, the recurrent activation clusters are perfectly reasonable and are biased towards finite-memory computations.

This paper further extends our work by showing that in such cases a rigorous analysis of fractal encodings in the RNN state space can be performed. We have derived (lower and upper) bounds on several types of fractal dimensions, such as the Hausdorff and (upper and lower) box-counting dimensions, of activation patterns occurring in the RNN state space when the network is driven by an underlying finite-state automaton, as has been the case in many studies reported in the literature (Casey, 1996; Cleeremans, Servan-Schreiber & McClelland, 1989; Elman, 1990; Forcada & Carrasco, 1995; Frasconi et al., 1996; Giles et al., 1992; Manolios & Fanelli, 1994; Tiňo & Šajda, 1995; Watrous & Kuhn, 1992).

Our results have a nice information-theoretic interpretation. The entropy of a language $L \subseteq \mathcal{A}$