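$\underline{\begin{bmatrix} A \\ B \\ \vdots \\ C \end{bmatrix}, Y} = \underline{A, Y} \cap \underline{B, Y} \cap \dots \cap \underline{C, Y}$
$\underline{(A)a\text{-}b, Y} = \underline{A^a, Y} \cup \underline{A^{a+1}, Y} \cup \dots \cup \underline{A^b, Y}$
$\underline{(A)a, Y} = \underline{A^a, Y} \cup \underline{A^{a+1}, Y} \cup \dots$
$\underline{(A), Y} = \underline{Y} \cup \underline{A, Y}$
where X is (strct)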
3. 7 Properties of the semi-compositional formalism
Table 3.VIII: The syntax of patterns.
Syntax:
(patt)    ::= ∅ | (ne-patt)
(ne-patt) ::= (strct) | (strct) , (ne-patt)
(strct)   ::= (prim) | { (ne-patt) ... (ne-patt) } | [ (ne-patt) ... (ne-patt) ] |
              ¬(strct) | ((ne-patt))a-b | ((ne-patt))a | ((ne-patt))
(prim)    ::= a | b | ... | z
Table 3.IX: The universe of patterns.
Universe:
$(A, B)^u = A^u \frown B^u$
( { nu "uB"u uC"
uc"
$(\neg A)^u = A^u$
$((A)a\text{-}b)^u = (A^a)^u \cup (A^{a+1})^u \cup \dots \cup (A^b)^u$
$((A)a)^u = (A^a)^u \cup (A^{a+1})^u \cup \dots$
$((A))^u = \{\varepsilon\} \cup A^u$
0
where x is (prim)
where A is (strct)
and B is (ne-patt)
Chapter 3 Extending regular expressions
If the succeeding structure (f) were not concatenated to the individual
arguments before the meaning is determined, as would be the case in a strict
compositional definition, the string 'bff...' would match this pattern. This
is not the kind of thing one wants to happen when 'bff' is excluded from
pattern (3.26):
(3.26)
With the given definition (3.24), however, it can be seen fairly easily that
(3.25) will exclude at least the same strings as (3.26):
[
(cons)l-2] { b }
,f,0e=(cons)l-2,f,0n c,d ,f,GJ
(3.27)
The second term of the right-hand side of (3.27) equals pattern (3.26).
3.7.2 Relation to the compositional formalism
The given definition is thus consistent with the philosophy of excluding
explicit nofits. There is, however, a complication with simultaneity which does
not occur with complementation and alternation. Consider, for instance,
pattern (3.28):
(3.28)
According to the definition, (3.24), this pattern denotes all strings starting
with 'att':
(3.29)
This is an example of the property that simultaneity distributes over
concatenation. This is not so surprising since it is defined that way:
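$\underline{\begin{bmatrix} A \\ B \\ \vdots \\ C \end{bmatrix}, Y} = \underline{\begin{bmatrix} A, Y \\ B, Y \\ \vdots \\ C, Y \end{bmatrix}}$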
However, the point is that, as opposed to alternation, simultaneity does not
distribute over concatenation in the compositional case. In the compositional
case concatenation and simultaneity are defined as follows^4:
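$\underline{\underline{A, B}} = \underline{\underline{A}} \frown \underline{\underline{B}}$
$\underline{\underline{\begin{bmatrix} A \\ B \\ \vdots \\ C \end{bmatrix}}} = \underline{\underline{A}} \cap \underline{\underline{B}} \cap \dots \cap \underline{\underline{C}}$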
Thus:
a, e t,e t,e
[
{
t,e} l
}na,t,{
while;
(3.30)
Therefore, besides the inconsistent patterns, this is a second case for which
the proposed semantics differs from a compositional one. In this case, however,
it is not so clear that (3.29) is a more expected or desirable interpretation
than (3.30). On the other hand, it may be argued that patterns like these,
which can be characterized by the fact that simultaneity operates on paths of
different length, have very little intuitive meaning to begin with.
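The difference between the two readings can be illustrated on small finite string sets. The following is a minimal sketch, not the thesis's definition: it assumes finite sets in place of the (generally infinite) sets denoted by patterns, and the names `concat`, `A`, `B` and `Y` are chosen here for illustration only.

```python
def concat(xs, ys):
    """String-set concatenation: every x followed by every y."""
    return {x + y for x in xs for y in ys}

A = {"a"}          # first argument of the simultaneous structure
B = {"at"}         # second argument, a path of different length
Y = {"t", "tt"}    # finite stand-in for the succeeding pattern

# Compositional reading: intersect the arguments first, then concatenate.
compositional = concat(A & B, Y)

# Semi-compositional reading: concatenate each argument to Y first,
# then intersect the results.
semi_compositional = concat(A, Y) & concat(B, Y)

print(compositional)        # set()
print(semi_compositional)   # {'att'}
```

Because the two arguments have different lengths, their intersection is empty, whereas concatenating first lets both paths meet at 'att'.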
3.7.3 De Morgan's laws
Since the formalism now features an 'or', an 'and' and a 'not', one may wonder
whether an equivalent of de Morgan's laws is valid. From the field of logic we
know that ¬(a ∨ b) = ¬a ∧ ¬b and ¬(a ∧ b) = ¬a ∨ ¬b. Is something equivalent
valid in the semi-compositional formalism, in other words, are equations (3.31)
and (3.32) valid?
4
The double underlining serves to distinguish this definition from the
formalism.
(3.31)
(3.32)
It turns out that these are not valid in general. This can be shown,
for instance, as follows:
"" ..,{ a },t,0=
o,u
{
a } u 0\ { a 0 = a-'
2
to- \ (at.o U outa) =}
o,u -- o,u --
(3.33)
-.a, t ,CHI-.[o, u], t ,0 = (ato \ata*) n (a-a-to- \ outo-*) =:--
(3.34)
Here, $\sigma^{1,2}$ denotes all strings of length 1 or 2.
Thus, (3.33) and (3.34) disprove (3.31). The other law, (3.32), can be disproved
in a similar way.
The equivalent of de Morgan's laws is thus not valid in general. It should
be noted, however, that in the compositional formalism the two equations are
not valid either. This can be verified by taking A = a and B = o, u.
5 Only in a compositional formalism where the universe is defined as $(\,)^u = \sigma^*$
will de Morgan's laws be valid:
{
A } [ '...,AB ] = u' \ = (u' (,y"
The other law is proved analogously.
On the other hand, it is not so that the equations are always invalid. If,
for instance, the lengths of all paths in the complemented structure are equal,
it can be shown that the equations are valid. Thus, for instance:
(3.35)
3.7.4 Complementing simultaneity
A final pattern which deserves some discussion is the use of complementation
in co-ordination with simultaneity. Consider, for instance, pattern (3.36):
[a: t J u "-'t '0 \ 0 :::: 0'1,2ta* (3-36)
Here, the result is somewhat unexpected. One could expect the strings
'att...' to be excluded from this pattern, since they match pattern (3.29),
which only differs from (3.36) in the absence of a complementation sign.
The reason for this phenomenon is that explicit nofits are defined as
$\underline{A} \frown \underline{B}$ rather than $\underline{A, B}$. Then, when
$\underline{A} = \emptyset$ the explicit nofits are empty, too. This
does not have to be the case for $\underline{A, B}$, as exemplified in (3.29).
This phenomenon of the explicit nofits being empty did not occur previously
in simple complemented structures (no nested complementation) since there
is no pattern for which $\underline{A} = \emptyset$. So (3.36) is a simple
complemented structure of which one may feel that the semi-compositional
formalism does not behave in the expected manner.
3.8 Discussion
In general, the phenomenon that not all strings are excluded which one would
expect to be, occurs when $\underline{A} \frown \underline{B} \neq \underline{A, B}$.
Apart from simultaneous structures with paths of unequal lengths, this can
only occur when A contains a complemented structure (also) containing paths
of different lengths. So only when complementation is nested to a level of
two or more does this phenomenon occur. This is exactly the reason why double
complementation may not always be annihilated.
In this light it would be more consistent to define explicit nofits as
$\underline{A, B}$. In fact, this is the most logical choice since
$\underline{A, B}$ denotes precisely those structures which match ¬A, B if
the '¬' sign is omitted. This would resolve the above particular case (3.36),
but not all of the problems attached to this matter. Appendix 3.D discusses
this option with regard to double complementation. The following is probably
an even stronger argument against this option.
One of the main consequences of this choice is that complex patterns
must be evaluated as a whole. When for instance a pattern is specified with
complementation nested to a certain level, the full pattern is important to
determine the meaning of the deeply nested complementation, something
which also holds for all the other complementations. This tends to get so
complicated that the only way to evaluate patterns in which complementation
is nested to level two or more is to write a computer program which patiently
executes the definition. This is not conducive to quick design of complex
patterns.
This is, in fact, a plea for compositionality. In a compositional system, the
meaning of any pattern, however complex, is determined by the composing
parts. The meaning of a part is not changed by altering something outside
that part. The meaning of a pattern is determined from small to large rather
than having to consider the pattern as a whole to start with.
The current definition, with the explicit nofits defined as
$\underline{A} \frown \underline{B}$, can be seen as a compromise between the
two extremes. It resolves the objection to the compositional system which
does not exclude explicit nofits. On the other hand, within complementation
the succeeding structure is 'hidden', i.e., in determining the meaning of a
complemented pattern one can separately determine $A^u$, $\underline{A}$ and
$\underline{B}$, and combine these afterwards. When complementation is
nested, the evaluation does not extend outside the complementation
which is nested one level less deep. As we have seen, double complementation
may for this reason not be annihilated. On the other hand, all patterns
can, with some practice, now be evaluated by hand, since the patterns can be
decomposed into smaller parts which can be evaluated separately.
On theoretical grounds the semi-compositional formalism is thus not
completely satisfactory. For practical purposes, however, it might be
satisfactory, because the cases which are not resolved satisfactorily, such as
double complementation, deeply nested complemented structures or complementing
simultaneity which contains paths of different lengths, are highly unlikely to
be used in practice. The semi-compositional definition excludes explicit
nofits in the practical cases and has a semi-compositional behaviour for
complex patterns. It thus seems a practical compromise between the conflicting
requirements of strict compositionality on the one hand and interpretation
according to expectation on the other.
Based on these practical grounds it was decided to implement the
semi-compositional formalism in TooLiP to investigate its merits. In the next
chapter the implementation of the formalism will be discussed, and in the
concluding chapter (among other things) the merits of the semi-compositional
formalism will be considered.
3.9 Conclusion
In this chapter a formalism is proposed in which a complementation opera-
tor (the 'not') and a simultaneous operator (the 'and') are included in the
formalism of regular expressions.
Regular expressions are defined in a compositional manner. This is an
important theoretical and practical property, since the meaning of any
expression, however complex, can be determined in parts, from small to large.
The inclusion of the new operators in regular expressions in a composi-
tional manner has the practical objection that for a certain class of patterns,
the inconsistent patterns, the formal interpretation does not corre-
spond to the impression the patterns create.
The main motivation to try and resolve this objection is of an ergonomic
nature. The main users of the system in which the formalism must be
implemented cannot in general be expected to be familiar with the details of
the behaviour of such extended regular expressions.
The attempt to resolve this objection has resulted in a slightly different
formalism, which can best be characterized as a semi-compositional formalism.
Although the pattern succeeding a specific structure must be included
in determining the meaning of that structure, it remains possible to
decompose complex patterns into smaller parts, determine the meaning of each
of those parts, and determine the meaning of the whole pattern from the
meaning of the individual parts. The extent, however, to which the patterns
may be decomposed is smaller than in the compositional case. In the
semi-compositional case patterns can be decomposed only with respect to
complementation, whereas in the compositional case patterns can be decomposed
with respect to all structures.
In defining the semi-compositional formalism three essential choices have
been made. The first one is the choice of universe. Rather than taking
the set of all strings, $\sigma^*$, as universe, a character counting
mechanism has been defined. As a consequence, the user does not explicitly
have to specify the desired universe for each complemented structure that is
used. On the other hand, the validity of the equivalent of de Morgan's laws is
lost. Both the compositional and the semi-compositional formalism described in
this chapter feature the character counting universe.
The second choice is to exclude explicit nofits. This is the main difference
between the semi-compositional and the compositional formalism. As a
consequence strict compositionality is lost, but the patterns behave more
according to expectation.
The third choice is the definition of explicit nofits. In the
semi-compositional formalism which is proposed here this is defined as
$\underline{A} \frown \underline{B}$, whereas the more logical choice would be
$\underline{A, B}$. As a consequence the semi-compositional formalism still
does not always behave as expected, i.e., not always are the 'real' explicit
nofits ($\underline{A, B}$) excluded. On the other hand, compositionality is
not lost completely.
From a theoretical point of view the semi-compositional formalism is thus
not completely satisfactory. For practical purposes, however, it might be
satisfactory, because the cases for which it does not behave satisfactorily
are highly unlikely to be used in practice. The semi-compositional definition
excludes explicit nofits in the practical cases and has a semi-compositional
behaviour for complex patterns. It seems a practical compromise between the
conflicting requirements of practical needs and theoretical elegance. Based on
these practical grounds it has been decided to implement the semi-compositional
formalism in TooLiP.
Appendix 3.A
Distributivity of patterns
In this appendix it is proved that, given the semantics of Table 3.II or 3.V
and the definition of string concatenation, alternation is distributive over
concatenation for patterns.
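Propositions (A-1) and (A-2) hold in general for sets of strings:
$(A \cup B) \frown C = (A \frown C) \cup (B \frown C)$   (A-1)
$A \frown (B \cup C) = (A \frown B) \cup (A \frown C)$   (A-2)
Proof:
$(A \cup B) \frown C = \{ d_1 d_2 \mid d_1 \in (A \cup B) \wedge d_2 \in C \}$
$= \{ d_1 d_2 \mid (d_1 \in A \wedge d_2 \in C) \vee (d_1 \in B \wedge d_2 \in C) \}$
$= \{ d_1 d_2 \mid d_1 \in A \wedge d_2 \in C \} \cup \{ d_1 d_2 \mid d_1 \in B \wedge d_2 \in C \}$
$= (A \frown C) \cup (B \frown C)$
(A-2) is proved analogously.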
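Propositions (A-1) and (A-2) can be spot-checked on finite string sets. This is only a sanity check under the assumption that sets of Python strings stand in for the string sets of the proof; the sample sets and the helper `concat` are illustrative and not taken from the text.

```python
from itertools import product

def concat(xs, ys):
    """String-set concatenation: {d1 d2 | d1 in xs and d2 in ys}."""
    return {d1 + d2 for d1, d2 in product(xs, ys)}

# Arbitrary sample sets, including the empty string.
A, B, C = {"a", "ab"}, {"b", ""}, {"c", "bc"}

# (A-1): (A u B) ^ C = (A ^ C) u (B ^ C)
a1_left = concat(A | B, C)
a1_right = concat(A, C) | concat(B, C)

# (A-2): A ^ (B u C) = (A ^ B) u (A ^ C)
a2_left = concat(A, B | C)
a2_right = concat(A, B) | concat(A, C)

print(a1_left == a1_right, a2_left == a2_right)  # True True
```

A check on sample sets does not replace the proof; it merely makes the two identities concrete.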
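From these general properties of string sets the distributivity of concatenation
over alternation for patterns follows directly:
$\underline{\left\{\begin{array}{c} X \\ Y \\ \vdots \\ Z \end{array}\right\}, B} = (\underline{X} \cup \underline{Y} \cup \dots \cup \underline{Z}) \frown \underline{B}$
$\stackrel{(A\text{-}1)}{=} \underline{X} \frown \underline{B} \cup \underline{Y} \frown \underline{B} \cup \dots \cup \underline{Z} \frown \underline{B}$
$= \underline{X, B} \cup \underline{Y, B} \cup \dots \cup \underline{Z, B} = \underline{\left\{\begin{array}{c} X, B \\ Y, B \\ \vdots \\ Z, B \end{array}\right\}}$   (A-3)
Similarly, the corresponding result for a preceding pattern follows from (A-2).   (A-4)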
Appendix 3.B
Simplification of complementation
In this appendix it is proved that the definition of complementation given
in (3.15) can be simplified to (3.16):
Proof:
X
Y B)
:ld1, d2 I d = d, d2, d, E P, dr !f_ A, E B
d'f_
:ldt,d2 I d = dtd2,dl E E B
d'f-
dEY
{
:Jdl I d2 I d = dt d2 I dl E PI d2 E B
r1 'f-
{
::lrl"d2l r1 = rlrd2,d1 E P,d2 E B
=?- 'Vd1, = d1d2 1\ dt E A 1\ d2 E B)
d g A"'B
{
?Jd],d2 I d = (hd2,dt E P,dt It A,dz En
=:> d 'f- A'""B
dE X
(B-1)
Appendix 3.C
Equivalence of semantics
In this appendix it is proved that for consistent patterns the semantics given
by Table 3.V and Table 3.VI are equivalent. For this purpose, the semantics
of Table 3.V is referred to by a single underlining, the semantics of Table 3.VI
by a double underlining.
First, the formal definition of consistent and inconsistent patterns is given.
Next, given the consistency of a pattern, the equivalence of the two
semantics is shown. Finally, two characteristics of appearance, which guarantee
consistency, are discussed.
Definition: A pattern X is consistent if Cons(X) = true, and inconsistent
if Cons(X) = false.
Cons(X) is given by:
Cons(0) = true
Cons(&)= true
Cons( X) = Cons( X, 0)
Cons( X, Y) = if X = 0
if X"" (prim)
if X= -.A
then Cons(Y)
then
then Cons(A
1
Y) A Cons(B
1
Y)
A ... A Cons(C, Y)
then Cons( A) /1. Con$(Y)
= 0]
For consistent patterns the semantics of Table 3.V and the semantics of
Table 3.VI yield the same results. This can be shown as follows:
0 = (T. = 0
=
{f}
A,B""
if
if
if
AU B U - - -U f .,. A , 0 U B , 0 u ... u C, 0
A,0uB,0U ... UC,0= {i}
A=0
--+ !!__ = A,B
A=x
----> i>;;!. =A ,B
A={:}
_, (X ... u
(A_:_1) X-vBU y,._,BU .. . UZ"'D
X ,BUY ,BU ... uz ,B
in d.
X ,BUY ,BU ... uz ,B
A,n --
if A=X
---->
(Xu\ (Xu\
(Xu\
(H_:_l) =----,X ,B
A,B
Characteristics of appearance
If a pattern consists exclusively of positive structures, it follows directly
from the definition that it is consistent.
If a pattern contains a complemented structure which contains exclusively
patterns of a certain specific length, it can be shown that this pattern is
consistent:
1)
a)
b)
dis a. candidate for ---,X, Y => :3 d1d2 I d1 E Xu\X A d2 E Y
did; = d1 dz = d --+ -
ldi I =I= I rill ., .. X '*
no rea..'Ml for E
ld;i = ld1l => di = d1 => FiX=>
no reason for dj_ EX ,.,y -
2) d is an explicit nofit for ---,X, Y => :3 d1d2 I d1 E X Adz E Y =>
d1 X=> d1d2 xu\x .... - -
Appendix 3.D
Alternative formalisms
In this appendix some alternatives for defining complementation are discussed.
A drawback of the semantics of Table 3.VI is that double negation may not
unconditionally be annihilated. The reason for this is that complementation
on the inner level loses track of the succeeding structure of the outer level.
Therefore, the inner complementation cannot fully compensate for the explicit
nofits of the outer complementation. For instance, consider pattern (D-1):
$\neg\neg\left\{\begin{array}{l} a \\ o, u \end{array}\right\}, t$   (D-1)
The inner complementation does not see the succeeding structure t.
Consequently, the string 'at' will not be excluded from it. Then, 'at'
followed by 't...' is an explicit nofit of the outer complementation, and thus
not an element of the set of strings denoted by (D-1).
A solution to this could be to redefine complementation so that the inner
complementation can also see the succeeding structure of the outer
complementation. This can be done by defining complementation as follows:
(D-2)
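Compared with the definition (3.16), the two only differ in the set
which is subtracted from the set $A^u \frown \underline{B}$. It is as if the
explicit nofits have been redefined. With this definition of complementation
one can prove that a double complementation may always be annihilated:
$\underline{\neg\neg A, B} = (\neg A)^u \frown \underline{B} \setminus \underline{\neg A, B} = A^u \frown \underline{B} \setminus (A^u \frown \underline{B} \setminus \underline{A, B})$
$= \underline{A, B} \cap A^u \frown \underline{B} = \underline{A, B}$   (D-3)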
However, this definition of complementation may lead to unexpected
results. Consider the pattern ¬A, B, where
(D-4)
This leads to pattern (D-5):
(D-5)
Applying (D-2) to interpret this pattern, (D-5) will include for instance the
strings 'att...'. However, if we examine the candidates of ¬A, B, we find that
there are none; the universe of the structure ($A^u$) is equal to the meaning
of the structure ($\underline{A}$), the difference ($A^u \setminus \underline{A}$)
being empty. It seems awkward that the new complementation includes strings
which are not candidates in the first place.
The reason for this is that the simplification of (3.15) into (3.16) is not
valid if explicit nofits are defined differently. In other words, the generating
set of (3.16) as compared with the generating set of (3.15) includes strings
which are not excluded by $\underline{A, B}$.
A logical way to overcome this objection is to redefine complementation
as follows:
(D-6)
Here, we start by selecting the candidates, rather than the strings which
satisfy the length condition as required by the universe, so the objection to
the previous attempt is automatically removed.
Unfortunately, (D-6) does not satisfy the double complementation
requirement. This can be seen as follows:
----,----,A,B (D,;
6
)
\
=
(D-7)
Let A=
Then:
atta E ArvB
attu (/. (Au\ A),..,fl
<t.t.to- \t A , D
}
atta E -.-.A , B
atto- 9! A ,n
Thus, when complementation is defined as in (D-6), double complementation
may not always be annihilated. It must be noted, however, that in the
semi-compositional formalism, where explicit nofits are defined as
$\underline{A} \frown \underline{B}$, this effect is more frequent than for
the last definition. In the semi-compositional case this occurs for all
inconsistent patterns, i.e., when $\underline{A, B} \neq \underline{A} \frown \underline{B}$.
When definition (D-6) is applied this only occurs when $A^u = \underline{A}$
or $\underline{A} = \emptyset$. These last two cases will probably never occur
in practical situations.
As argued in the main section of the chapter, (D-6) is the most consistent
in the philosophy of excluding explicit nofits. A further discussion
on this matter is included in section 3.8. For this appendix it suffices to
conclude that the two attempts illustrate the dilemma: either unexpected
results are yielded for certain patterns, or the double complementation may
not always be annihilated.
Other attempts do not seem very promising. In the philosophy
of excluding explicit nofits no other options seem available. Thus either one
must try and find a solution in a totally different direction or accept that
the property of being able to annihilate double complementation is lost.
Since double complementation will probably not occur very often in practical
situations (the simple positive statement is more transparent), and other
attempts will probably violate the philosophy of excluding explicit nofits, it
was decided to accept the loss of this property.
Chapter 4
Some aspects of the implementation of
TooLiP
Abstract
In this chapter those aspects of the implementation of TooLiP are
described which concern the process of matching patterns to the
input, where input should be understood in the general sense of
synchronized buffers.
For this purpose first the internal representation of patterns is
discussed. The user-specified patterns are transformed into a dynamic
data structure which is accessible for the matching routine. The
dynamic structure codes the structure of the patterns, but some simple
adjustments have been made also, which facilitate pattern matching
during run-time.
Next, the algorithms which perform the pattern matching are
presented. First the situation of a single input buffer is considered.
In view of this input situation most of the functions for matching a
particular structure in a pattern are given. Special attention is paid
to the function for matching the complementation operator, since its
definition gives rise to some additional computational complexity.
Then the more general situation of synchronized buffers is
considered. The algorithm for matching primitives is somewhat altered in
this situation. Since the synchronization mechanism is important for
this routine, two possible synchronization mechanisms are discussed
and compared. The more general one is chosen to be implemented
and the buffer switching algorithm is given.
On the whole, with respect to the processing of patterns, TooLiP
can be viewed as a compiler/interpreter. The user-defined patterns
are compiled from high-level source code to an internal
representation. The internal representation is then interpreted by the
functions for pattern matching.
4.1 Introduction
The previous two chapters served to give a more or less complete functional
description of TooLiP. In chapter 2 the overall architecture is described,
together with the mechanisms available to define an arbitrary conversion
scheme. In chapter 3 the kernel of the linguistic rule, the pattern, is
discussed, and its meaning is formalized in Table 3.VII. Together, the two
chapters fully describe TooLiP's behaviour.
However, not only has TooLiP's behaviour been defined, TooLiP has also
been implemented and has been operational in evolving versions since 1986.
A chapter on the implementation may therefore not be omitted from its
description. This chapter deals with those implementation aspects. However,
no attempt has been made to cover all aspects of the implementation
completely, since that would be a rather technical and lengthy account. Only the
important parts are dealt with.
To be precise: this chapter deals with those aspects of the implementation
which have to do with matching a pattern to the input. Here, 'input' should be
understood in a general sense: it is not only the input given by the user.
It can also be an intermediate result of a module, or the synchronized results
of several modules. The function which matches patterns to the input is the
kernel of the system: a linguistic rule is evaluated by examining the patterns
of the left and right contexts; the result of a module is obtained
by repetitive application of rules; the overall result, in turn,
is obtained by successive application of the modules.
The implementation of this structured repetition will not be discussed.
The exact nature of this structure has been described in detail in chapter 2 and
its implementation is straightforward; each function (such as evaluating
a rule) is embedded in the function that directly needs it (such as executing
the module). The algorithm closely resembles the functional specification
given in Appendix 2.A. Since it does not seem very useful to repeat this in
detail once again, that part of the implementation is not included.
With respect to pattern matching, TooLiP contains functions which are similar
to those found in compilers. The patterns, which are specified in a certain
user-friendly notation, often called source code, are parsed and represented
internally, before actual pattern matching takes place. However, some functions
are not implemented according to standard compiler techniques, such as given
by Aho, Sethi & Ullman (1986). The functionality of the system is, of course,
of primary interest. Moreover, TooLiP features extensions to the standard
theory, the extended regular expressions, in respect of which it is not obvious
how automata can be constructed to evaluate them. Throughout the chapter,
however, the relation to the standard theory and techniques will be indicated
where possible.
Thus, the main topic of this chapter is how patterns are matched against
input (in the general sense). For this purpose it is first discussed how the
patterns specified by the user in the linguistic source files are represented
internally (section 4.2). Then, the strategy of matching these (internally
represented) patterns to the input is formulated explicitly in algorithmic-like
structures (section 4.3). The process is initially derived for the
special case of a single input buffer (for instance an intermediate module
result). Derivational history cannot yet be accessed. The general case of
synchronized input buffers, which provide this information, is the topic of
section 4.4. A general discussion on the characteristics of TooLiP, as far as
the implementation is concerned, is included in section 4.5. Finally, the most
important conclusions of this chapter are summarized in the last section.
4.2 The internal representation of patterns
In almost all computer applications there is a conversion phase of the
instructions a user has specified in source code to a computer internal code
(object code). For instance, programs written in Pascal should be compiled
first before they can be executed. In TooLiP there also exists a high-level
source code, the linguistic rules. These are coded in a format which adheres
as closely as possible to the linguist's wishes. Just like source code in
other applications, the linguistic rules are also converted to an internal
format. This contains the same information but is more efficiently processed
by the computer.
Fig. 4.1 depicts this process. The high-level linguistic rules are compiled
into an internal representation, which is input to the pattern matcher. In this
case the internal representation is a dynamic data structure rather than a
sequence of machine instructions. The data structure reflects the organization
of the linguistic input in a way which is accessible more quickly for a computer
program.
In this section it is discussed what the internal representation of patterns
looks like. For this purpose first an informal strategy for matching patterns
is formulated. With the informal strategy in mind we will then consider the
internal representation of patterns. The construction of the compiler which
translates the source code into this representation is not discussed, as it
closely resembles the parsing phase of a compiler (see for instance Aho et al.
(1986)).
Figure 4.1: The linguistic input is compiled into an internal representation,
which in turn is input to the pattern matcher.
4.2.1 An informal strategy
The semantics, which is derived in chapter 3 and given in Table 3.VII,
defines the set of strings which are denoted by the pattern. It is the kind of
thing a user is interested in, since when he specifies a pattern, he must be
able to determine which strings match that pattern. What we shall be
considering throughout this chapter is an algorithm which determines whether
or not an input string matches an arbitrary pattern, that is, whether or not
the input is part of the set denoted by the pattern.
In general, the set denoted by a pattern is not finite; the pattern describes
a set of strings which start in a certain manner; in other words, the head of
the string must satisfy certain requirements. It is not a very practical
approach to try and determine the full set of strings and then determine
whether the input is part of that set. Instead, it seems more sensible to
start with the actual input string and try to establish whether or not it is
one of the possible heads.
From the syntax, Table 3.VIII, it can be seen that a non-empty pattern is a
concatenation of structures. In the semantics the concatenation is defined
differently for each kind of structure. Five possible structures can be
distinguished if the trivial case of empty patterns is excluded. They are
discussed in relation to the matching routine.
In the case of a primitive, a matching segment should be found at the
appropriate place in the input, and it should be followed by a string
denoted by the succeeding pattern (this is the definition of concatenation).
Therefore, if such a segment is not found, we can stop the matching
procedure, since the current input string can never belong to the head
of the strings denoted by the pattern. Only when the segment matches the
primitive do we have to continue matching.
If an alternative structure is specified, the individual arguments of the
structure should be concatenated to the succeeding pattern before the mean-
ing of the thus formed patterns is determined, as explained above. For the
matching routine this means that we can test for these patterns one by one,
terminating with a positive result if one of the patterns is found, since they
are ordered in an 'or' relation.
The simultaneous structure is similar to the alternative structure, save for
the fact that the current input string should be present in all concatenated
sub-patterns. Therefore, the same strategy as above can be followed, only in
this case the matching fails if the string does not match one of the
sub-patterns.
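The strategy for the positive structures described so far can be sketched as a small recursive routine. This is an illustrative sketch only, not the implementation discussed in this chapter: the representation (a Python list of structures, with primitives as strings and alternative or simultaneous structures as hypothetical `("alt", [...])` and `("sim", [...])` tuples) and the function name `matches` are assumptions made here.

```python
def matches(pattern, text, pos=0):
    """True if text[pos:] starts with a string denoted by pattern."""
    if not pattern:                    # empty pattern: every string matches
        return True
    head, rest = pattern[0], pattern[1:]
    if isinstance(head, str):          # primitive: stop at the first mismatch
        return (text.startswith(head, pos)
                and matches(rest, text, pos + len(head)))
    kind, args = head
    # Each argument is concatenated to the succeeding pattern before testing.
    results = (matches(arg + rest, text, pos) for arg in args)
    if kind == "alt":                  # 'or': succeed as soon as one path matches
        return any(results)
    if kind == "sim":                  # 'and': fail as soon as one path mismatches
        return all(results)
    raise ValueError(f"unknown structure: {kind}")

alt_pattern = [("alt", [["a"], ["o", "u"]]), "t"]
sim_pattern = [("sim", [["a"], ["a", "t"]]), "t"]
print(matches(alt_pattern, "out"), matches(alt_pattern, "ot"))   # True False
print(matches(sim_pattern, "att"), matches(sim_pattern, "at"))   # True False
```

Complementation and optionality are deliberately left out, since, as explained next, they call for a different strategy.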
The strategy which is to be applied inside the complemented structure
differs from the strategy described above. So far, when a primitive does not
match, we can conclude that the path of concatenated primitives which we
are considering does not match. We can start matching the next path in the
case of alternatives, or conclude that the whole pattern does not match in
the case of simultaneity. Inside complementation, however, a mismatch of a
primitive will lead to the conclusion that that part of the complementation
matches, which may, but does not necessarily have to, lead to a positive
result value of the pattern as a whole. On the other hand, if a primitive
matches inside a complemented structure, this may but does not necessarily
have to lead to a negative result value of the pattern. Therefore, in both cases
further processing is necessary as no direct conclusion can be drawn, and so
the matching may not terminate in either case.
The optional structure is actually a shorthand notation for an alternative
structure. It saves coding time for the users and improves legibility of
the rules. Since the optional structure is only a notational shorthand for a
certain alternative structure, the optional structure (for normal, positive
patterns) can internally be represented as an alternative structure. This
includes patterns denoting infinite repetition. Inside complemented patterns,
however, this is not appropriate. As explained above, the matching routine must
fully consider all cases inside complementation, which in the case of infinity
would lead to non-termination. Therefore, a different internal representation
is needed in the case where optionality is used inside complementation.
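For positive patterns, the rewriting of optional structures into alternative structures might look as follows. This is a sketch under stated assumptions: the tuple tags `("opt", sub)`, `("rep", sub, lo, hi)` and `("alt", [...])` and the function `expand` are hypothetical names introduced here, and only optionality and bounded repetition are covered; infinite repetition and the separate representation needed inside complementation are not shown.

```python
def expand(pattern):
    """Rewrite ('opt', sub) and ('rep', sub, lo, hi) into ('alt', [...])."""
    out = []
    for s in pattern:
        if isinstance(s, str):             # primitive: keep as is
            out.append(s)
        elif s[0] == "opt":                # (A) becomes: empty pattern or A
            out.append(("alt", [[], expand(s[1])]))
        elif s[0] == "rep":                # (A)lo-hi becomes: A^lo or ... or A^hi
            _, sub, lo, hi = s
            sub = expand(sub)
            out.append(("alt", [sub * n for n in range(lo, hi + 1)]))
        else:                              # other structures: expand the arguments
            out.append((s[0], [expand(arg) for arg in s[1]]))
    return out

print(expand([("opt", ["s"]), "t"]))
# [('alt', [[], ['s']]), 't']
print(expand([("rep", ["a"], 1, 3)]))
# [('alt', [['a'], ['a', 'a'], ['a', 'a', 'a']])]
```

Doing this rewriting once, before matching starts, means the matching routine for positive patterns needs no separate case for optionality.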
Thus, in positive structures, we may terminate the matching of the current
path when a primitive does not match the input. Inside complemented
structures this is not the case. Amongst other things a separate internal
representation is then needed for optionality. With these characteristics in mind
we can now take a closer look at the internal representation of patterns.
Chapter 4 Some aspects of the implementation of TooLiP
4.2.2 The representation
The internal representation of patterns is a modified syntax tree. A syntax
tree is a hierarchical structure which accounts for the relationship of the
elements in an expression (in this case a pattern). The modification involves
some simple adjustments which can take place at compile time so as to speed
up the run-time performance.
Two basic data structure concepts are used to represent patterns internally.
These are the record type, in which one can join elements of arbitrary
types into a compound type, and the linked list, in which one can store a list
(an arbitrary number) of elements (for instance a record) and access them
sequentially. As illustrated below, the record type is portrayed by a rectangular
box, possibly sub-divided, while the linked list is represented by an open
circle with an arrow (the link) pointing at an element, a rectangular box, which
contains a link to the next element. The end of the list is indicated by a
filled circle.
As follows from the syntax, a pattern is a concatenation of structures.
The number of concatenations is free, and varies significantly in practical
situations. Therefore, a linked list of structures is an elegant way to represent
patterns, rather than a fixed array. Each link represents a concatenation, each
element of the list represents a structure:
Note that the filled circle, which indicates the end of the list, can be
interpreted as the ∅-pattern, to which all strings match. So as soon as this point
is reached one may conclude that the pattern which is searched for is present.
A structure can be any of the given five types. The kind of information a
type represents differs per type. As only one type at a time can be used at
a place, the different types of linguistic data (graphemes, phonemes,
features, etc.) can be stored at the same place, provided an additional
indication of which type is used. In Pascal terms this is called a variant record.
The record which represents a structure then contains three fields: a type
indication, the linguistic data, and a link field:
[Figure: a record with three fields: type, data and link.]
The internal representation will now be discussed for all possible
structures. This includes the four types of primitives: graphemes, phonemes,
grapheme features and phoneme features, and the four 'real' structures:
alternation, simultaneity, optionality and complementation.
Graphemes
Graphemes refer to orthography, and can be any printable character. For
instance, the pattern o, u, t (throughout the chapter all example patterns
are printed in bold face) is internally represented as follows:
[Figure: linked list of three records: (gra, o), (gra, u), (gra, t), followed by the end of the list.]
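The record with its three fields and the linked list of structures can be sketched in Python. This is only an illustration under assumptions, not the Pascal variant record used in the implementation: the names `Node`, `make_pattern` and `to_list` are hypothetical, and `None` stands in for the filled circle that ends the list.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Node:
    """One structure of a pattern: type indication, data and link field."""
    kind: str                       # type indication, e.g. "gra" or "pho"
    data: object                    # linguistic data; its shape depends on `kind`
    next: Optional["Node"] = None   # link field; None plays the role of the filled circle


def make_pattern(kind, items):
    """Build the linked list for a concatenation of primitives of one kind."""
    head = None
    for item in reversed(items):
        head = Node(kind, item, head)
    return head


def to_list(patt):
    """Walk the links and collect (type, data) pairs, for inspection."""
    out = []
    while patt is not None:
        out.append((patt.kind, patt.data))
        patt = patt.next
    return out


pattern_out = make_pattern("gra", ["o", "u", "t"])   # the pattern o, u, t
print(to_list(pattern_out))
```

Because the `data` field is untyped here, the same record shape can hold a grapheme, a phoneme or a feature list, which is the role the variant record plays in the text.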
Phonemes
Phonemes refer to pronunciation. They are defined by the user and consist
of a limited set. They are represented in a way similar to graphemes. For
instance, the pattern SJ , 00 is represented as follows:
Grapheme features
Grapheme features are user-defined, and are used to describe common
properties of graphemes. A feature specification can consist of any positive (nonzero)
number of features, each of which has a value ('+' or '-') and a name ('cons',
'son', etc.). For example, the feature specification <+cons, -son> denotes all
consonantal graphemes which are not sonorant. Since the number of features
used in a specification varies per rule, a linked list of feature elements is
appropriate to represent them. The representation, for instance for the above
specification, is as follows:
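Used as part of a pattern, the linked list representing the feature
specification forms the data part of the structure, as illustrated below:
[Figure: a structure record whose data field points to the feature list (+, cons), (-, son) and whose link field points to the succeeding pattern.]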
Phoneme features
Phoneme features describe common properties of phonemes, analogous to
grapheme features. These, too, follow from the definition. Their
representation is equal to that of grapheme features, save for the type indication. So
for instance, the pattern L, <+CONS>, which refers to a sequence of the
phoneme /L/ followed by a consonantal phoneme, is represented as follows:
Alternative structures
An alternative structure can have any number of arguments, each of which
can be a non-empty pattern. In accordance with the earlier solutions, this
can be elegantly represented by a linked list. In this case, the elements of the
linked list are patterns, i.e., linked lists of structures. Thus, we see the
same recursion in the internal representation as is present in the syntax.
The true syntax tree of the pattern j t (presented in the way patterns
are represented) is as follows:
However, the semantics prescribe that the succeeding pattern should be
concatenated to each alternative to form new patterns, whose meaning then
should be determined. This transformation can be performed explicitly in
the compiling phase by linking the end of each alternative to the succeeding
structure. Thus, the above representation is adjusted to the following:
The original link, from the element marked as 'alt' to the succeeding structure,
is now spurious, but on the other hand it does no harm, so it may as well
remain present. To achieve this representation, the compiler routine which
constructs the internal representation must perform some additional
computations, of course. Before parsing the arguments of the alternative structure,
first the continuation entry for the succeeding structure should be computed,
so it can be patched at the end of each argument. Per rule, this only needs
to be done once. Moreover, constructing the internal representation is done
off-line, as a preparation phase for the pattern matcher, so this does not affect
the run-time performance.
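The compile-time adjustment described above, linking the end of each alternative to the succeeding pattern, might look as follows in Python. This is a sketch under assumptions: the record is reduced to a hypothetical `Node` class, the arguments of the alternative are kept in a plain Python list rather than a linked list, and the example pattern (an alternative of a and o, u followed by t) is chosen only for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Node:
    kind: str
    data: object
    next: Optional["Node"] = None


def seq(text):
    """Linked list of grapheme primitives for the characters of `text`."""
    head = None
    for ch in reversed(text):
        head = Node("gra", ch, head)
    return head


def patch_alternatives(alt, continuation):
    """Link the end of every argument of `alt` to the succeeding pattern."""
    for arg in alt.data:                # alt.data holds the head of each argument
        last = arg
        while last.next is not None:
            last = last.next
        last.next = continuation
    alt.next = continuation             # the original link stays: spurious but harmless


def spell(patt):
    """Concatenate the primitives reachable by following the links."""
    out = ""
    while patt is not None:
        out += patt.data
        patt = patt.next
    return out


succeeding = seq("t")
alt = Node("alt", [seq("a"), seq("ou")])   # alternative with arguments a and o, u
patch_alternatives(alt, succeeding)
patched = [spell(arg) for arg in alt.data]
print(patched)
```

After patching, every argument can be followed straight into the succeeding pattern, and all arguments share the same continuation record, so no copy of the succeeding pattern is made.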
Simultaneous structures
The simultaneous structure is similar to the alternative structure, be it that
the interpretation differs. As this is the task of the matching routine, the
internal representation for simultaneity only differs in the type indication
from the alternative structure:
Optional structures
As previously explained in chapter 3, an optional structure is generally
characterized by (A)a-b, where the parentheses denote the optional structure, A
denotes an arbitrary pattern, and a and b denote the minimum and maximum
number of times the pattern is required. Here, a ≤ b, and a and b may be
any non-negative number. One may, however, use a notational shorthand and
omit either b or both a and b. In the first case, the pattern expresses infinite
repetition, that is, it should be present at least a times but may be present
any higher number of times. In the second case, the structure is interpreted
in the true optional sense: it may be present or not, so it is as if a = 0 and
b = 1.
As mentioned, in normal positive structures it can be represented by an
alternative structure. This saves the need to code a separate routine for
positive optional structures. If b is limited, the pattern (A)a-b is represented
as follows:
Typically, a and b are small, so explicitly representing the structure as
a sequence of similar elements does not burden the computer memory too
heavily. If b is omitted, and thus codes infinite repetition, the above scheme
cannot be used, since one cannot go on infinitely creating new alternatives.
However, the infinite repetition can be coded by a self-reference pointing back,
so that for the matching routine it seems as if the pattern continues endlessly:
If the minimum number of times the pattern must be present is zero (a = 0),
this undergoes a slight change:
[Figure: representation of infinite repetition with a = 0, linked to the succeeding pattern.]
These representations effectively code infinity, and therefore a terminating
mechanism should be provided for when the patterns are matched outside the
range of the input. On the other hand, in practical situations this mechanism
will seldom be used; either at some point the succeeding pattern matches
(leading to a positive result and terminating the matching process) or the
optional structure A does not match (leading to a negative result and also
terminating the matching process).
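For the finite case, the translation of an optional structure into alternatives can be illustrated with a small Python sketch. The original figure is not reproduced here, so the exact shape is an assumption: `a` mandatory copies of the argument, followed by `b - a` nested two-way alternatives that each offer "one more copy" or "nothing". Patterns are modelled as plain Python lists and an alternative as the tuple `("alt", [...])`; both conventions, and the helper `strings`, are hypothetical.

```python
def expand_optional(arg, a, b):
    """Rewrite (arg)a-b as `a` copies of arg plus b - a nested two-way alternatives."""
    tail = []                                       # the empty pattern
    for _ in range(b - a):
        # Either one more copy of arg followed by what was built so far, or nothing.
        tail = [("alt", [list(arg) + tail, []])]
    return list(arg) * a + tail


def strings(patt):
    """Enumerate every string denoted by a pattern of single-character primitives."""
    if not patt:
        return {""}
    head, rest = patt[0], patt[1:]
    tails = strings(rest)
    if isinstance(head, tuple):                     # ("alt", [pattern, pattern, ...])
        return {s + t for p in head[1] for s in strings(p) for t in tails}
    return {head + t for t in tails}


one_to_three = expand_optional(["a"], 1, 3)
print(sorted(strings(one_to_three)))
```

The number of records grows linearly with b, which is why the text can afford the explicit expansion when a and b are small, and why it breaks down when b is omitted.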
However, inside complemented structures such a safety mechanism cannot
be used. As argued in the previous section, inside complemented structures all
alternatives must be considered. If no further provisions are taken the
matching routine will always encounter a new alternative which it will investigate, as
it might add new information. These provisions could be taken, of course, but
that would burden the matching routine with extra processing, which then
would also be executed when finite structures are used. This would
deteriorate run-time performance, and therefore a separate internal representation
is used in the circumstances that optionality is used inside complementation:
Here, the type field indicates that an optional structure is used, which
will cause the matching routine to select the part with the necessary extra
processing. The data field contains a pointer to a data structure which provides
information on the minimum and maximum number of times the optional
structure should and may be present. A negative number for the maximum
codes infinity. This way of explicitly coding infinity provides the matching
routine with additional information, which enables it to determine when the
matching process can be terminated.
Complemented structures
The argument of a complemented structure is an arbitrary non-empty pattern.
This is the first structure which follows the complementation sign. If one
wants to complement a sequence of structures, these should be enclosed in
square brackets, ¬[ ... ], thus indicating the range of the complementation
(the normal parentheses are already used for optional structures). Confusion
with the simultaneous structure does not occur, since in this case the structure (which
is being complemented) only consists of one argument.
The non-empty pattern which is complemented is represented in the same
way as it would be, and forms the data field of the complemented structure.
For instance, the pattern ¬[o, u], t is represented as follows:
With this, all structures and their internal representation have been discussed.
It was stated that the internal representation was a modified form of the
syntax tree. The goal of the internal representation is to organize the data
in such a manner that the matching routine can process them reasonably
well and reasonably fast. For that purpose, for instance, the arguments of an
alternative structure are linked to the succeeding structure, and the optional
structure has been translated to an alternative structure in positive patterns.
On the other hand, computations which are difficult and laborious to do
off-line, such as determining which strings possibly match a complemented
structure, are postponed to the run-time part of pattern matching, where
they are computationally less complex since only a particular input string has
to be matched to the pattern.
The internal representation of patterns is thus a hybrid kind of
representation somewhere between a syntax tree and a non-deterministic finite
automaton with ε-transitions (NFA) (see for instance Hopcroft & Ullman,
1979). The representation for complementation and optionality typically
contains structural information of a syntax tree, whereas an alternative structure
and concatenations can be viewed as an NFA.¹
¹ The open circles with arrows can be considered [...]; the filled circle is the accepting
state. The data of the record, the primitives, are labelled transitions, and the list of
structures in the alternative representation are [...]
From the viewpoint of compiler design, this is probably not the most
elegant type of representation. The choice for this representation rests mainly on
practical grounds. Translating the optional structure to an alternative saves
the need to code a separate routine for matching positive optional structures.
Linking the succeeding structure to the arguments of an alternative or
simultaneous structure saves administrative effort during run-time. On the other
hand, computing during run-time whether or not a string matches a
complemented structure is probably less complex than transforming the syntax tree
of the complementation into some kind of finite automaton. These
considerations have resulted in the internal representation of patterns as
described above.
4.3 The algorithm for pattern matching
We can now consider the matching strategy in more detail. As depicted in
Fig. 4.1, the internal representation is one input to the pattern matcher, the
user-provided input the other. The pattern matcher is an interpreter which
is driven by the organization of the linguistic rules. At the basis, patterns are
interpreted by the pattern matcher and matched to the input. This is the
kernel of the system. In this section the algorithm for matching patterns will
be given.
The general task of the matching routine is to determine whether or not
a certain pattern (internally represented by the linked list of structures) is
present in the input buffers. For the present, it is assumed that there is
only one input buffer, containing either graphemes or phonemes. In the next
section the situation of synchronized buffers is dealt with.
The task of the matching routine is the following. Suppose the rule (4.1) is
specified to deal with the pronunciation of words like 'container' and 'trailer'.
(4.1)
Suppose the word 'container' is typed, and suppose the focus pattern has been
tested and found. Then the internal status will be as follows:
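[Figure (4.2): internal status for the input 'container'. The input buffer holds c o n t a i n e r, with an arrow marking the position directly to the right of the focus pattern; the output buffer so far holds K O N T.]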
The arrow points at the position where the matching of the right
context will commence, directly to the right of the focus pattern.
The output buffer has partly been filled, as a result of rules
included in the same module, which have already been applied.
From the fact that the output buffer is filled to the left of the
focus it can be concluded that the module scans its input from
left to right.
The task of the matching routine is to match the right context of (4.1) to the
internal status of (4.2), and return a boolean-valued result: true (it matches)
or false (it does not match).
Thus, the matching routine can be seen as a function which returns a
boolean value and is fed by two parameters: the pattern which to match
against the buffer contents, and the position at which to start matching the
pattern. The last parameter can actually be subdivided into two parts: the
buffer in which to match (momentarily a trivial choice) and the position in the
buffer at which to start matching. The combination of buffer and position is
called internal position.
    function Match(patt : pattern;
                   int_pos : internal_position) : boolean;
Here, pattern is the data structure of the linked list of
structures, and internal_position a record which contains buffer and
position information. Customarily, algorithms will be presented
in the above manner; functions and procedures (both starting
with a capital letter), variables and types are printed in italics
and keywords are printed in bold face.
The functionality of the matching routine can be formalized as follows:
Match(patt, int_pos) is true iff (if and only if) the pattern patt matches a
string starting at position Pos(int_pos) in the buffer Buf(int_pos). Pos and
Buf respectively select the position and buffer of int_pos. Inside this
function, we may expect the same differentiation between structures as is present
in the syntax and the semantics. In this respect, however, it is relevant to
divide the structures into two groups: primitives versus 'real' structures. The
primitives are characterized by the fact that they refer to exactly one segment
in a buffer, whereas the structures do not refer directly to the input buffer
but indicate how to combine sub-patterns.
Structures can be concatenated in any order. However, when a 'real' structure
is used, the semantics prescribes that the full succeeding pattern (the pattern
succeeding the structure concerned) must be included in the determination of
the meaning of the structure. Therefore, if function Match encounters such
a structure, control will be given to a routine which handles these structures,
including their succeeding patterns.
On the other hand, if a primitive is encountered, the matching value of
that specific primitive can be determined. If a negative value results, the
matching process can be terminated (until a complementation sign is
encountered, we are matching a positive structure). If a positive value results, it
should continue. Then, both the starting position at which to match and the
pattern with which to match must be updated.
Such a recipe can be formulated compactly and explicitly in an algorithm.
For this purpose I use a pseudo-Pascal code; Pascal since the program is
implemented in that language, and a pseudo-code to be able to leave out
irrelevant detail. Thus, the function Match can be formalized as follows:
    function Match(patt : pattern;
                   int_pos : internal_position) : boolean;
    begin
      result := true;
      while result and Type(patt) = primitive
      do
        result := Match_primitive(patt, int_pos);
        int_pos := Update(int_pos);
        patt := Select_next(patt)
      od;
      if result and Type(patt) = structure
      then result := Match_structure(patt, int_pos) fi;
      Match := result
    end;
Type is a function that returns the type field of the current
pattern. Update updates the internal position. In the current case
of one input buffer this consists of shifting the arrow in (4.2) one
position to the right or the left, depending on the matching
direction. In this case, this is to the right since the right context is
being matched, but for left contexts, for instance, this would be
to the left. Select_next updates the pattern, i.e., selects the next
element in the linked list of structures. In true Pascal code this
is equivalent to the statement: patt := patt↑.next, but to abstract
from the implementation this is presented as a function.
Note that as soon as result turns false the matching process terminates with
negative result. On the other hand, if the end of the linked list is reached and
result is still true, the process terminates with positive result; both tests,
Type(patt) = primitive and Type(patt) = structure, fail (it is assumed that
Type recognizes the empty list and returns a unique code for that case).
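The behaviour of Match on a run of primitives can be tried out with a minimal Python transcription. It rests on several assumptions: a single buffer of graphemes held as a Python string, an integer index as internal position, a hypothetical `step` argument for the matching direction, and a bounds check standing in for whatever the real routine does at the edge of the buffer. `match_structure` is left as a stub because structures are handled elsewhere.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Node:
    kind: str                       # "gra" for a grapheme primitive, else a structure type
    data: object
    next: Optional["Node"] = None


def seq(text):
    """Linked list of grapheme primitives for the characters of `text`."""
    head = None
    for ch in reversed(text):
        head = Node("gra", ch, head)
    return head


def match_structure(patt, buf, pos):
    raise NotImplementedError("structures are outside the scope of this sketch")


def match(patt, buf, pos, step=1):
    """True iff `patt` matches the buffer from `pos`, moving by `step` per primitive."""
    result = True
    while result and patt is not None and patt.kind == "gra":
        result = 0 <= pos < len(buf) and buf[pos] == patt.data   # Match_primitive
        pos += step                                              # Update
        patt = patt.next                                         # Select_next
    if result and patt is not None:     # a 'real' structure heads the remainder
        result = match_structure(patt, buf, pos)
    return result


right_context_found = match(seq("ner"), "container", 6)    # right of the 'ai' in 'container'
left_context_found = match(seq("tno"), "container", 3, step=-1)
print(right_context_found, left_context_found)
```

Reaching the end of the list (`patt is None`) with `result` still true yields a positive result without any further test, which mirrors the remark that both type tests fail on the empty list.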
The division between the primitives and structures manifests itself in a
call to two different functions, Match_primitive and Match_structure. These
form the topics of the next sections, 4.3.1 and 4.3.2. Amongst other things,
Match_structure deals with complementation. As will be explained, matching
inside complementation differs from matching outside, which is the normal
mode. In section 4.3.3 the matching strategy inside complementation is
discussed. A brief discussion in section 4.3.4 concludes this section on how
patterns are matched against a single input buffer.
4.3.1 Matching primitives
Match_primitive is a boolean function that matches a single primitive
to a single segment in a certain buffer. Its functionality is given by:
Match_primitive(prim, int_pos) is true iff the primitive pattern prim
matches the segment at the internal position int_pos.
Four types can be used: grapheme, phoneme, grapheme features
and phoneme features. When a grapheme or phoneme is specified in the rules,
that specific grapheme or phoneme must be found in the input buffer. The
matching value can be determined with a single statement:
    result := Data(patt) = Segment(int_pos);
Data returns the contents of the data field, Segment selects the
segment at the internal position.
When features are specified in the rules, the segment in the input buffer must
satisfy all feature specifications. One by one, these specifications are verified,
a process that terminates if one fails, or when all specifications
have been dealt with:
    result := true;
    feat := Data(patt);
    segm := Segment(int_pos);
    while result and Present(feat)
    do
      result := (segm in Feature_set(feat)) = Value(feat);
      feat := Select_next(feat)
    od;
Present is a boolean function that returns true if there are still
elements in the linked list to be checked. In this phase it is
equivalent to the Pascal statement feat ≠ nil. Feature_set is a
function that returns the set of all segments which are described
by the given feature. Value is a boolean function which returns
true if the specified feature value in the rule is '+', false if this
is '-'.
The test whether the input-buffer segment is in correspondence with the
feature specification consists of two parts. The first test determines whether
the current segment is an element of the set denoted by the feature:
segm in Feature_set(feat). The second test checks if that was intended:
"element of feature-set" = Value(feat). Then, of course, after testing one feature
specification, the next one must be selected, which is performed by Select_next.
If all features have been dealt with, this will cause the test Present to fail.
Thus, the result will only be positive if the input segment satisfies all the
feature specifications. As soon as one of the requirements is not met, the
matching process terminates.
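The two-part feature test can be made concrete with a short Python sketch. The feature sets below are toy values invented for the example, not the user definitions of the system, and the feature specification is held in a Python list of (value, name) pairs instead of a linked list.

```python
# Hypothetical feature definitions: which graphemes each feature describes.
FEATURE_SETS = {
    "cons": set("bcdfghjklmnpqrstvwxz"),
    "son": set("lmnr"),
}


def match_features(spec, segment, feature_sets=FEATURE_SETS):
    """spec: list of (value, name) pairs; <+cons, -son> is [(True, "cons"), (False, "son")]."""
    result = True
    i = 0
    while result and i < len(spec):                         # Present(feat)
        value, name = spec[i]
        # Membership in the feature set must agree with the '+' or '-' in the rule.
        result = (segment in feature_sets[name]) == value
        i += 1                                              # Select_next(feat)
    return result


NON_SONORANT_CONSONANT = [(True, "cons"), (False, "son")]   # <+cons, -son>
print(match_features(NON_SONORANT_CONSONANT, "t"))
```

Comparing the membership test with the feature value in a single equality handles '+' and '-' uniformly, so no separate branch is needed for negative feature values.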
The one-input buffer situation which is assumed here does not yet
discriminate between graphemes and phonemes. In section 4.4, which deals with
the synchronized buffers situation, the distinction will lead to a slightly
different algorithm, for which reason the full algorithm for Match_primitive will
be given there.
4.3.2 Matching structures
As argued in chapter 3, when any of the structures is encountered, the pattern
succeeding that structure is necessary to determine the meaning of the
structure. For this reason control is transferred to the function Match_structure
when a structure is encountered. The task of this function is to match the
remainder of the input to the remainder of the pattern. The first element of
this pattern is a structure.
Basically, the matching strategy differs for each structure. Therefore, the
only thing the function needs to do is to differentiate between the possible
types and transfer control to a function which is specialized to deal with the
encountered structure. Since optional structures are coded as alternatives
in positive (non-complemented) patterns, these cannot be encountered; only
the alternative, simultaneous and complementation structures must be dealt
with:
    function Match_structure(patt : pattern;
                             int_pos : internal_position) : boolean;
    begin
      case Type(patt) of
        alternative     : result := Match_alt(patt, int_pos);
        simultaneous    : result := Match_sim(patt, int_pos);
        complementation : result := Match_cmp(patt, int_pos)
      end;
      Match_structure := result
    end;
For each of the three possible types a different function is called. These will
be discussed in this order.
Alternative structures
The alternative structure is represented by a list of patterns. The head of each
such pattern consists of one of the arguments of the alternative structure; the
tail of each argument consists of the succeeding pattern:
[Figure: list of alternatives arg. 1, arg. 2, ..., arg. n, each linked to the succeeding pattern.]
Each pattern in the list is exactly like the patterns we encountered at the
highest level: they are a linked list of structures. This means that we can use
the function we already 'have', Match, to determine the matching values of
the patterns in the list. Actually, this recursive call follows directly from the
semantics, which prescribes how to rearrange the pattern and apply the same
function, "the meaning of", to the new patterns.
With this, the strategy for matching an alternative structure becomes
simple: match the patterns of the list one by one until one of them matches,
and terminate matching with negative result if all patterns fail. Note that
each time a new pattern is tested, it is matched starting at the same internal
position as the first time.
    function Match_alt(patt : pattern;
                       int_pos : internal_position) : boolean;
    begin
      result := false;
      alternative := Data(patt);
      while not result and Present(alternative)
      do
        result := Match(Pattern(alternative), int_pos);
        alternative := Select_next(alternative)
      od;
      Match_alt := result
    end;
Pattern selects the following pattern from the list of alternatives.
The number of patterns which are represented in the linked list is finite.
Either it directly follows from the number of arguments the user has specified,
or the alternative structure consists of two elements when it is derived from
an optional structure. The routine specified above will therefore always
terminate, provided that the internally used routines terminate.
Simultaneous structures
The simultaneous structure can be treated analogously to the alternative
structure, with appropriate adjustment of combining the results. Apart from
the type indication, the internal representation is the same. Therefore, the
simultaneous structure can be treated with the following algorithm:
    function Match_sim(patt : pattern;
                       int_pos : internal_position) : boolean;
    begin
      result := true;
      simultan := Data(patt);
      while result and Present(simultan)
      do
        result := Match(Pattern(simultan), int_pos);
        simultan := Select_next(simultan)
      od;
      Match_sim := result
    end;
This routine only terminates with positive result if all of the patterns are
found. If one of them fails, the result is negative.
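How Match, Match_alt and Match_sim call each other recursively can be seen in the following Python sketch. It is an illustration under assumptions rather than the original implementation: the buffer is a plain string of graphemes, positions are integers, the arguments of a structure are kept in a Python list, and the helper `seq` as well as the two example patterns are hypothetical. As in the internal representation, every argument is already linked to the succeeding pattern.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Node:
    kind: str                       # "gra", "alt" or "sim"
    data: object                    # a character, or a list of argument patterns
    next: Optional["Node"] = None


def seq(text, tail=None):
    """Grapheme primitives for `text`, with the last one linked to `tail`."""
    for ch in reversed(text):
        tail = Node("gra", ch, tail)
    return tail


def match(patt, buf, pos):
    result = True
    while result and patt is not None and patt.kind == "gra":
        result = pos < len(buf) and buf[pos] == patt.data
        pos += 1
        patt = patt.next
    if result and patt is not None:
        handler = match_alt if patt.kind == "alt" else match_sim
        result = handler(patt, buf, pos)
    return result


def match_alt(patt, buf, pos):
    result = False
    for arg in patt.data:           # every argument restarts at the same position
        result = match(arg, buf, pos)
        if result:                  # one matching argument is enough
            break
    return result


def match_sim(patt, buf, pos):
    result = True
    for arg in patt.data:
        result = match(arg, buf, pos)
        if not result:              # one failing argument is decisive
            break
    return result


succ = seq("t")
alt = Node("alt", [seq("a", succ), seq("ou", succ)], succ)      # a or o, u; then t
first = Node("alt", [seq("a", succ), seq("o", succ)], succ)
second = Node("alt", [seq("o", succ), seq("u", succ)], succ)
sim = Node("sim", [first, second], succ)                         # both must fit; then t
print(match(alt, "out", 0), match(sim, "ot", 0))
```

The two handlers differ only in the starting value of `result` and in the condition that ends the loop, which matches the observation that the internal representations differ only in the type indication.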
Complemented structures
The complementation structure is represented as a pattern embedded within
complementation marks:
[Figure: the complemented pattern A, enclosed in complementation marks, followed by the succeeding pattern B.]
To determine whether a string belongs to the pattern, the semantics
prescribes that it should be an element of $A^{U} \frown B$ but not at the same time be an
element of $A \frown B$, where A is the complemented pattern and B the succeeding
pattern. If this is implemented straightforwardly, the membership of the
input must be established for both sets. There is no simple shortcut, since
in chapter 3 it has been shown that in general
$(A^{U} \frown B) \setminus (A \frown B) \neq (A^{U} \setminus A) \frown B$
(the lefthand side is the semi-compositional definition, the righthand side the
compositional one). So this would mean more or less a redoubling of the
amount of work, compared to positive structures. Therefore, a more
efficient implementation has been searched for.
The definition of complementation is derived from a formulation which
originates from chapter 3 (page 55):
(4.3)
Thus, complementation can be interpreted as follows. ¬A, B denotes
strings which are candidates, but which at the same time are not explicit
nofits. Candidates are defined as the strings that can be fitted to the pattern
such that the segments fitted to the complemented part (A) do not match and
those to the succeeding pattern (B) do (= $(A^{U} \setminus A) \frown B$). Explicit nofits are
the strings that can be fitted such that their segments both match
the complemented part and the succeeding pattern (= $A \frown B$). The string 'att',
for instance, is both a candidate and an explicit nofit for pattern (4.4) (and
should, for that reason, not be included by the set of strings denoted by that
pattern).
(4.4)
In patterns we can distinguish paths of concatenated primitives, or briefly:
paths. For instance, pattern (4.4), the internal representation of which is
depicted below, contains four paths: ¬a, t; ¬a, r; ¬[o, u], t and ¬[o, u], r.
A certain string can be a candidate for a specific path, it can be an
explicit nofit, or it can be neither, but it cannot be both at the same time for
the same path. If a string is both a candidate and an explicit nofit for the
pattern, this is due to different paths. Now we can separate the 'matching
value' of a string for a certain path into two parts: one for the complemented
part, and one for the succeeding pattern. In this way we have a tuple of two
boolean values, and thus four possible combinations: (false-false), (false-
true), (true-false) and (true-true). Here, the value is not yet adjusted
for whether or not being complemented, so (true-true) means the first part
matches the complemented structure and the second part matches the
succeeding pattern.
The definition of complementation, as given in (4.3), can now be
interpreted as follows: the input string must result in a (false-true) tuple for a
certain path (it must be a candidate), but it may not result in a (true-true)
tuple for any other path (it may not be an explicit nofit).
Therefore, the general strategy could be: compute and consider all tuples.
If there is a (true-true) tuple, the pattern as a whole does not fit, as this
is an explicit nofit. If there is no (true-true) tuple, then look for a (false-
true) tuple. If such a tuple is found a positive result must be returned, as a
candidate is found which is not an explicit nofit. If such a (false-true) tuple
is not found, the routine must return a negative result; the string is neither
an explicit nofit nor a candidate.
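The general strategy of computing all tuples can be written down directly. The Python sketch below is an illustration under assumptions: pattern (4.4) is rebuilt from the four paths listed in the text (complemented part a or o, u; succeeding part t or r), each path is reduced to a pair of strings, and the function names are hypothetical.

```python
# One (complemented part, succeeding part) pair per path, after the four paths in the text.
PATHS = [("a", "t"), ("a", "r"), ("ou", "t"), ("ou", "r")]


def tuples_for(text, paths=PATHS):
    """Return one (complemented part matches, succeeding part matches) tuple per path."""
    out = []
    for comp, succ in paths:
        comp_matches = text.startswith(comp)
        # The succeeding part is fitted right after the segments given to the complement.
        succ_matches = text.startswith(succ, len(comp))
        out.append((comp_matches, succ_matches))
    return out


def fits(text, paths=PATHS):
    tuples = tuples_for(text, paths)
    if any(c and s for c, s in tuples):         # (true-true): explicit nofit
        return False
    return any(s and not c for c, s in tuples)  # (false-true): candidate


print(tuples_for("att"), fits("att"))
```

For 'att' the first path gives (true-true) and the third gives (false-true), so the string is both an explicit nofit and a candidate and is rejected, as stated in the text.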
In this scheme, however, the matching value of the succeeding pattern must
be determined for each path in the complemented pattern. If the position at
which to start matching is different for each path, this has to be done anyhow.
For instance, in pattern (4.4), the alternative of t and r (the succeeding
structure) must be matched
to the second character (relative to the starting position of the whole pattern)
for the first path, a, and to the third character for the second path, [o, u].
On the other hand, if certain paths of a complemented pattern are of equal
length, such as in pattern ¬{...}, t, the starting position for the succeeding
pattern is the same, and therefore the matching values of those cases will all
be similar. So in these cases some increase in efficiency can be gained.
This can be done by combining the results of those complemented paths
which have equal lengths. For paths within an alternative the
combining operator is the logical or: if one of them matches, this leads to an
explicit nofit if the succeeding pattern also matches. Analogously, for paths
within simultaneity this is the logical and.
In fact, it is not the length of the complemented pattern that is essential,
but the position where the matching of the succeeding pattern should
commence. Therefore, the results of all paths which lead to the same internal
position are combined. Thus, we obtain a list of matching values for the
complemented part, of which each element is associated with a different internal
position. Now, for each list element the matching value of the succeeding
pattern is determined, and the same strategy for combining result-tuples as
before can be applied to determine the matching value for the pattern as a
whole.
Yet, there is still some efficiency to be gained. If the list of matching values
for the complemented part is ordered in such a way that first for all true-
values the succeeding pattern is matched, and next for all false-values, we
can terminate matching the succeeding pattern as soon as a positive result is
returned. If the accompanying complemented path is true the string appears
to be an explicit nofit, and thus a negative result can be returned directly.
If the accompanying complemented path is false, the string appears to be a
candidate. But since true-values in the result list precede the false values,
explicit nofits can no longer be encountered. Therefore, a positive result
can be returned directly. If the succeeding structure fails to match for all
list elements, again a negative result for the whole pattern must be returned.
Note that only in this case does the succeeding pattern have to be matched for
all list elements. This strategy can be formalized in the following algorithm:
function Match_cmp(patt : pattern;
                   int_pos : internal_position) : boolean;
begin
  res_list := Exh_match(Data(patt), int_pos);
  result := false;
  succ_patt := Select_next(patt);
  while not result and Present(res_list)
  do
    neg_res := Res_value(res_list);
    result := Match(succ_patt, Int_pos(res_list));
    res_list := Select_next(res_list)
  od;
  Match_cmp := result and not neg_res
end;
Res_value selects the matching value of the current element of a
result list, Int_pos selects the internal position from that element.
Exh_match determines for each path in the complemented pattern its matching
value and the resulting internal position. It combines the results for paths
with the same resulting internal position, and it orders the results in a list
such that those paths that match are included first, and returns this list. In
the next section (4.3.3), this function is discussed in more detail. While no
definite decision can be made, i.e., when result = false, which means neither
an explicit nofit nor a candidate has been found, the succeeding pattern is
matched at the internal position of the next element of the result list. Storing
the accompanying result of each result list element in neg_res enables a quick
computation of the final result. This can be done with the single statement:
  Match_cmp := result and not neg_res
If the succeeding pattern matches for a certain path, then the while-loop
terminates with result = true. The final result now becomes the inverted
value of the accompanying complemented part, which is stored in neg_res: if
neg_res = true then an explicit nofit has been found, and thus the final result
is negative; if neg_res = false this indicates a candidate, and thus the final
result is positive. On the other hand, if the succeeding pattern did not match
for any of the paths (no candidates are found), the loop eventually terminates
with result = false, in which case a negative result is returned.
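The control flow of Match_cmp can be sketched in Python as follows. This is only an illustration under stated assumptions, not the original implementation: the result list is assumed to be a list of (matching value, internal position) pairs that is already ordered with the true-values first, and match_succ is a hypothetical stand-in for Match applied to the succeeding pattern.

```python
from typing import Callable, List, Tuple

# One entry per internal position: (matching value of the complemented
# paths ending there, internal position). Assumed sorted True-first.
ResultList = List[Tuple[bool, int]]


def match_cmp(res_list: ResultList,
              match_succ: Callable[[int], bool]) -> bool:
    """Sketch of Match_cmp; match_succ(pos) is a hypothetical stand-in
    for Match(succ_patt, pos)."""
    result = False
    neg_res = False
    for neg_res, pos in res_list:
        result = match_succ(pos)
        if result:
            # Either an explicit nofit (neg_res is True) or a candidate
            # (neg_res is False) has been found: stop searching.
            break
    return result and not neg_res


def succ_is_f(text: str) -> Callable[[int], bool]:
    # Toy succeeding pattern: the single primitive 'f'.
    return lambda pos: text[pos:pos + 1] == "f"
```

With a complemented primitive 'b' followed by 'f', the input 'bf' yields the result list [(True, 1)] and is rejected as an explicit nofit, whereas 'af' yields [(False, 1)] and is accepted as a candidate.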
4.3.3 Exhaustive matching
The function Exh_match deserves some additional discussion. It performs the
task inside complemented structures that is performed by Match outside (in
positive structures). Compared to Match, it differs in some aspects. First
of all, it does not return the matching value of a pattern, but it returns a
list of matching values for all paths inside the complementation. Second, for
each matching value it includes information on where to start matching the
succeeding pattern. Third, it only terminates matching at the end of the
complemented part, rather than as soon as result turns false.
Despite these differences, there is quite some similarity in task. We will
find the same differentiation in type of structure as we saw in Match. Also,
more or less similar algorithms are encountered. Again, these algorithms will
differ from the previously presented ones in the respects given above. Since
for each path an indication is needed on where to commence matching the
succeeding pattern, we cannot simply terminate matching a path and return
a negative result when a certain primitive appears not to be present. One
solution for this is to continue matching until the end of the complemented
pattern is encountered. This only affects the termination criterion for the
routine. Another solution is to only count the remaining elements of the path,
but this has more consequences for the algorithm; more or less a redoubling
of source code is necessary, since the structures that can be encountered are
the same, only the matching of the primitives differs.
As in practical situations complementation is not used with high frequency,
and, moreover, the patterns being complemented are fairly simple (mostly
consisting of one or two paths of one or two characters), one may not expect a
very spectacular increase of run-time performance when the second solution is
chosen in favour of the first. Therefore, in the implementation I chose the first
solution. For this reason the routine is called Exh_match, as it exhaustively
matches each path to its end.
With this, the important differences of Exh_match with its positive
counterpart Match have been discussed, except the fact that optionality must also
be dealt with. This structure has not yet been discussed, as it does not occur
in positive structures. Therefore, how to deal with this structure and where
this is done are discussed in the next section. The algorithms for matching
the other structures (alternation, simultaneity and complementation) inside
complementation are given in Appendix 4.A.
Optional Structures
The only case in which a separate internal representation for optional
structures can occur is when optionality is used inside complementation. The
general appearance of such a pattern is as follows:
  X, ¬[Y, (A)min-max, B], Z                                          (4.5)
Here, A, B, X, Y and Z are arbitrary patterns, all of which may be absent,
except A. A is also called the optional pattern and B is the pattern succeeding
the optional pattern within the complementation. Within the remainder of
this section (4.3.3) B will also be called in short the succeeding pattern, in
contrast to the general convention where it means the pattern succeeding the
complementation (which is Z, here). A and B are of interest for the routine
dealing with optionality; X, Y and Z are dealt with by other routines.
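The internal representation of pattern (4.5) is as follows:
[Diagram: linked internal representation of pattern (4.5), running from X
through a complementation node (cmpl) and an optional node (opt) to Z, with
the labels 'optional pattern A' and 'succeeding pattern B'.]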
The sub-pattern matches if the optional pattern A is present at least min times
and up to at most max times before the succeeding pattern B. Of course, in
the end the matching result should be inverted, since the structure is present in
a complementation. However, the interpretation of the complemented
structure as a whole is dealt with by the routine dealing with complementation,
Match_cmp. Here, we are concerned with determining the matching values
of the paths inside the complemented structure, that is, determining the first
values of the result-tuples described in the previous section (4.3.2). Therefore,
the task of this routine, which will be called Exh_match_opt, and which deals
with optionality inside complementation, is to determine whether the optional
structure (A) followed by the succeeding pattern (B) is present or not.
A straightforward strategy to determine this is as follows. First match the
optional pattern A min times and then match the succeeding pattern B. This
gives the first matching result to be put in the result list. Just before B
is matched, the current position is saved, so that the next option,
A repeated min+1 times, can be checked without having to match the first min
occurrences once again. Then, after matching A once more, B is matched again
and the result is added to the result list. Each time a next occurrence of the
optional pattern has been matched the internal position is saved. This
continues until the optional structure has been matched up to max times. The
following algorithm does the job:
  for nr := 1 to min do "match optional pattern A" od;
  save := int_pos;
  "match succeeding pattern B";
  res_list := result;
  for nr := min + 1 to max
  do
    int_pos := save;
    "match optional pattern A";
    save := int_pos;
    "match succeeding pattern B";
    res_list := Unite(res_list, result)
  od;
  return res_list;
Unite is a function which combines and simplifies two result lists such that
each internal position only occurs once.
As to this algorithm, there are, however, two complicating factors:
  • for each matching operation, a list of result values may result, as
    matching takes place inside complementation. All elements of the list
    should be dealt with.
  • max may be infinite, which cannot be dealt with by the routine above.
As to the first factor, we may assume that when we encounter the optional
structure, we are investigating a specific path. There can of course be more
paths before an optional structure, but for each path the optional structure
will be investigated individually. So we start matching the optional structure
at a certain internal position, transported to the routine by means of the
variable int_pos.
It is possible that the optional structure contains an alternative structure,
which contains paths of different lengths. The result of matching structure A
may therefore be a list of result values. If, in succession, we must once more
match A or match the succeeding pattern B, we do this for each list
element. So, generally, on the one hand a list of result values is returned by
the matching process (the result list), and on the other hand a list of positions
at which to commence matching is presented to the process (the state
list). Only the first time we start matching do we know we have to deal with
exactly one state. This can be seen as a state list containing one element.
The result list and the state list are essentially of the same type: each
element of a list contains information on where to continue or commence
matching (by means of the internal position) and the resulting or initial matching
value. The matching value of an arbitrary pattern inside complementation
can then be seen as a function of the pattern and the list of commencing
values, which results in a list of result values:
  result_list := List_match(pattern, comm_list)
Inside the function List_match the pattern concerned should be matched
for each list element. Each list element contains an internal position and
an initial matching value. The internal position is selected by Int_pos, the
initial matching value by Res_value. The initial matching value is needed
for a correct result; if the first n - 1 occurrences of A do not match, then,
despite the fact that the nth occurrence may be present at that particular
position, the sequence of n occurrences of A still does not match. This is
taken care of by the function Adjust. Each matching process produces a result
list, and all the result lists are combined as before, that is, simplified for
paths ending at the same internal position by the function Unite. This is
expressed below:
function List_match(patt : pattern;
                    comm_list : result_list) : result_list;
begin
  res_list := empty;
  while Present(comm_list)
  do
    int_pos := Int_pos(comm_list);
    prv_val := Res_value(comm_list);
    tmp_list := Exh_match(patt, int_pos);
    tmp_list := Adjust(prv_val, tmp_list);
    res_list := Unite(tmp_list, res_list);
    comm_list := Select_next(comm_list)
  od;
  List_match := res_list
end;
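A minimal Python sketch of List_match may make the roles of Adjust and Unite more concrete. It rests on several assumptions that are not taken from the original implementation: a list element is a (matching value, internal position) pair, Adjust combines the previous value with each new value by a logical and, Unite combines entries ending at the same position by a logical or (the operator described earlier for paths within an alternative), and exh_match is passed in as a hypothetical callable for one fixed pattern.

```python
from typing import Callable, Dict, List, Tuple

Entry = Tuple[bool, int]          # (matching value, internal position)


def adjust(prv_val: bool, entries: List[Entry]) -> List[Entry]:
    # A path can only match if everything before it matched as well.
    return [(val and prv_val, pos) for val, pos in entries]


def unite(a: List[Entry], b: List[Entry]) -> List[Entry]:
    # Assumption: entries ending at the same position are or-ed together,
    # so that each internal position occurs only once.
    merged: Dict[int, bool] = {}
    for val, pos in a + b:
        merged[pos] = merged.get(pos, False) or val
    return [(val, pos) for pos, val in sorted(merged.items())]


def list_match(exh_match: Callable[[int], List[Entry]],
               comm_list: List[Entry]) -> List[Entry]:
    res_list: List[Entry] = []
    for prv_val, int_pos in comm_list:
        tmp_list = adjust(prv_val, exh_match(int_pos))
        res_list = unite(tmp_list, res_list)
    return res_list


def alt_a_or_ab(text: str) -> Callable[[int], List[Entry]]:
    # Toy exhaustive matcher for the alternative { a | a,b }: two paths
    # of different lengths, each reported with its end position.
    def matcher(pos: int) -> List[Entry]:
        return [(text[pos:pos + 1] == "a", pos + 1),
                (text[pos:pos + 2] == "ab", pos + 2)]
    return matcher
```

Applying list_match twice to the toy alternative on the input 'abab' shows how one commencing state fans out into several: the first call returns two states, and the second call returns three, because two of the four resulting paths end at the same position and are merged by unite.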
Now, the routine for matching optional structures, given above in an
informal style, can be made more explicit. For non-infinite matching of optional
patterns in complemented structures we can give the following routine:
function Exh_match_opt(patt : pattern;
                       comm_list : result_list) : result_list;
begin
  opt_patt := Data(patt);
  succ_patt := Select_next(patt);
  for nr := 1 to min
  do comm_list := List_match(opt_patt, comm_list) od;
  res_list := List_match(succ_patt, comm_list);
  nr := min;
  while nr ≠ max
  do
    nr := nr + 1;
    comm_list := List_match(opt_patt, comm_list);
    res_list := Unite(List_match(succ_patt, comm_list), res_list)
  od;
  Exh_match_opt := res_list
end;
Saving the internal positions is now done implicitly in the result lists,
which serve as state lists for the next matching process. Note that also the
internal position where matching is first to commence is transferred to the
function by means of a state list. That state list contains, apart
from the internal position, also the initial matching value, the result of (one
of the paths of) pattern Y.
The last for-loop is replaced by a while-loop to be able to make the
last step: to make the routine suited to deal with 'infinite matching'. This
is the case when max = -1, and the routine as given would therefore not
terminate. In a practical system the physical string is always finite. This
means that the end of that string will always be reached by repetitive matching
of the optional structure. This can be detected by Exh_match_opt by means of a
special function, Out_of_bounds. This function returns true when all of the
position pointers are outside of the input range. This can be added to the
termination criterion:
  while nr ≠ max and not Out_of_bounds(comm_list) do ...
Thus the relevant repetitions are dealt with, since the paths of the
optional structure which fall outside of the input range will cause the pattern
succeeding the complementation (Z) to fall outside of the input range as
well, and therefore a candidate can never result for those paths. Therefore,
Exh_match_opt can be terminated as soon as repetitions reach beyond the
input range.
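The routine with the extended termination criterion can be sketched in Python as below. The sketch is self-contained and makes the following assumptions, none of which come from the original source: list elements are (matching value, internal position) pairs, positions are 0-based indices into the input string, a position counts as out of bounds when it is greater than or equal to the input length, Unite or-s entries with equal positions, and the optional and succeeding patterns are passed as hypothetical callables standing in for Exh_match.

```python
from typing import Callable, Dict, List, Tuple

Entry = Tuple[bool, int]                  # (matching value, position)
Matcher = Callable[[int], List[Entry]]    # stand-in for Exh_match


def unite(a: List[Entry], b: List[Entry]) -> List[Entry]:
    merged: Dict[int, bool] = {}
    for val, pos in a + b:
        merged[pos] = merged.get(pos, False) or val   # assumed: logical or
    return [(val, pos) for pos, val in sorted(merged.items())]


def list_match(matcher: Matcher, comm_list: List[Entry]) -> List[Entry]:
    res: List[Entry] = []
    for prv_val, pos in comm_list:
        res = unite([(v and prv_val, p) for v, p in matcher(pos)], res)
    return res


def out_of_bounds(comm_list: List[Entry], length: int) -> bool:
    # True when every position pointer lies outside the input range.
    return all(pos >= length for _, pos in comm_list)


def exh_match_opt(opt: Matcher, succ: Matcher, comm_list: List[Entry],
                  lo: int, hi: int, length: int) -> List[Entry]:
    """lo and hi play the roles of min and max; hi == -1 encodes
    'infinite' repetition, as in the text."""
    for _ in range(lo):
        comm_list = list_match(opt, comm_list)
    res_list = list_match(succ, comm_list)
    nr = lo
    while nr != hi and not out_of_bounds(comm_list, length):
        nr += 1
        comm_list = list_match(opt, comm_list)
        res_list = unite(list_match(succ, comm_list), res_list)
    return res_list


def primitive(text: str, char: str) -> Matcher:
    # Exhaustive matching of one primitive: the position always advances,
    # whether or not the primitive is present.
    return lambda pos: [(text[pos:pos + 1] == char, pos + 1)]
```

Because the toy primitive matcher always advances the position, the while-loop is guaranteed to terminate for hi = -1 once all positions have run past the end of the input.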
The algorithm for exhaustive matching
Now we can return to the general routine which deals with matching inside
complementation, the function Exh_match. As stated above, this function
resembles the function Match, but differs in three respects: it returns a list of
matching values rather than a single result, it also returns the accompanying
internal positions, and it only terminates at the end of a pattern (which is the
end of the complementation). Thus, the following algorithm results:
function Exh_match(patt : pattern;
                   int_pos : internal_position) : result_list;
begin
  result := true;
  while Type(patt) = primitive
  do
    result := Match_primitive(patt, int_pos) and result;
    int_pos := Update(int_pos);
    patt := Select_next(patt)
  od;
  res_list := Create_list(result, int_pos);
  case Type(patt) of
    alternative     : res_list := Exh_match_alt(patt, res_list);
    simultaneity    : res_list := Exh_match_sim(patt, res_list);
    complementation : res_list := Exh_match_cmp(patt, res_list);
    optionality     : res_list := Exh_match_opt(patt, res_list)
  esac;
  Exh_match := Sort(res_list)
end;
The first part is the same as function Match, save for the fact that the
routine does not terminate when result turns false. When all initial
primitives have been matched, an initial state list is created by Create_list.
This list serves either as the starting point for the routines which deal with
structures, or as a result list when no structures are used and the end of the
complementation or pattern has been reached. If a structure is encountered the
structure-specific routines are called. The last, Exh_match_opt, is discussed
above. The first three, Exh_match_alt, Exh_match_sim and Exh_match_cmp,
are the exhaustive matching counterparts of the 'positive' routines discussed in
section 4.3.2, and are given in Appendix 4.A. When the remainder of the
complemented pattern has thus been handled, the last task of the routine
is to rearrange the result list such that those paths that match are in first
position, so that the routine Match_cmp can terminate as soon as an explicit
nofit or a candidate has been found. This is done by the function Sort.
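The reordering performed by Sort can be illustrated with a short Python sketch. It assumes, as an illustration only, that a result list is a list of (matching value, internal position) pairs; the function name sort_results is hypothetical. A stable sort on the negated matching value moves all true-values to the front while keeping the relative order within each group.

```python
from typing import List, Tuple

Entry = Tuple[bool, int]   # (matching value, internal position)


def sort_results(res_list: List[Entry]) -> List[Entry]:
    # sorted() is stable: True-entries come first (key False sorts before
    # key True), and entries with equal keys keep their original order.
    return sorted(res_list, key=lambda entry: not entry[0])
```

For example, sort_results([(False, 1), (True, 2), (False, 3), (True, 4)]) places the entries for positions 2 and 4 before those for positions 1 and 3.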
With this, the important characteristics of the matching routine have been
discussed, when the input consists of a single buffer containing segments. The
matching routine is part of an interpreter, which interprets the linguistic rules
one by one. The kernel of the interpreter is the function Match, which matches
an arbitrary pattern to arbitrary input.
The function Match is syntax-directed, that is, the same structure which
is found in the syntax of patterns is found in Match. Match is
called recursively, just as non-empty patterns can be used recursively.
However, due to the implementation of complementation (which avoids redoubling
of computational effort), inside complementation a slightly different
matching strategy is applied. Therefore, when complementation is used, Match is
not called recursively, but the function Exh_match is called. This function
exhaustively matches the complemented pattern, and returns a list of matching
values and internal positions, essentially one pair for each path. Inside
Exh_match all recursions of non-empty patterns call on Exh_match, so here
recursion is restored.
The situation of synchronized buffers, as opposed to a single input buffer,
only affects the routine for matching primitives. The routines for matching
structures remain unaltered. In fact, the only extra task Match_primitive
must perform is to determine in which of the synchronized buffers the specified
primitive is to be looked for. This is the topic of the next section.
4.4 Synchronized buffers
Synchronized buffers are needed to synchronize the input with the output.
The term synchronization is used in imitation of Susan Hertz, who
introduced the notion in her Delta system (Hertz, Kadin & Karplus, 1985).
Although synchronization is generally used to indicate time alignment between
processes, here it will be used to indicate segment alignment between buffers.
In TooLiP synchronization is needed for two purposes. On the one hand it is
needed to be able to use information on orthography and pronunciation
simultaneously in the rules. This will be called internal synchronization, since
it does not necessarily manifest itself on the level of original input and final
output. On the other hand it is needed to determine overall input-to-output
relations, that is, how the characters of the original input correspond to those
of the final output. This will be called overall synchronization. Both overall
synchronization and the possibility to use information on orthography and
pronunciation are special features which were required in the design of TooLiP
(chapter 2).
Synchronization of buffers, or, more in general, synchronization of two or
more levels means that each segment or sequence of segments at one level can
be associated with a (sequence of) segments at another level. For instance,
the orthography and pronunciation of the word 'cadeau' /KAADOO/ (present)
are associated as follows:
  input:   | c | a  | d | e a u |
  output:  | K | AA | D | OO    |                                    (4.6)
In words: the letter 'c' is pronounced as a /K/, the 'a' as an
/AA/, the 'd' as a /D/ and the sequence 'eau' as an /OO/.
For retaining the input-to-output relations each unit which converts input to
output (in TooLiP these are modules) must have separate buffers for input
and output. During the conversion process these buffers are synchronized
according to the rules that apply. For instance, in (4.6) apparently a rule of
the form "c → K / some context" has been applied. Apart from adding the
'K' to the output, the system will also synchronize the 'K' in the output with
the 'c' in the input.
[Figure 4.2: Buffer architecture of TooLiP. The diagram shows the buffers
original input, grapheme input, grapheme output, phoneme input and phoneme
output, connected by the grapheme-to-grapheme, grapheme-to-phoneme and
phoneme-to-phoneme modules; 'copy' arrows accompany the grapheme-to-grapheme
and phoneme-to-phoneme modules.]
To achieve overall synchronization, that is, between the input and the
result of the consecutive modules, one can create a buffer for the initial input
and a separate buffer for the output of each module, which simultaneously
serves as input buffer for the next module. Synchronization between each two
subsequent buffers then directly determines the overall synchronization. Thus
one needs n + 1 buffers where n is the number of modules.
There is a way, however, to make the number of buffers fixed and
independent of the number of modules. As explained in chapter 2, there are three
types of modules in TooLiP: one which manipulates graphemes, one which
transforms graphemes into phonemes, and one which manipulates phonemes.
Instead of creating an output buffer for each module, it is also possible to
create an output buffer for each type of module. Then, four buffers are needed
for the conversion process: one for the input of the first type of module, and
three for the output of each type of module. In this scheme the contents of the
output buffer are copied to the input buffer in the case of successive modules
of the same type. For overall synchronization one extra input buffer is needed
to preserve the original input (see Fig. 4.2); for if two or more
grapheme-to-grapheme modules are used in the four-buffer scheme, the original
input is overwritten by the output of the first module.
[Figure 4.3: Selecting buffers in GTG modules. Scanning direction: left to
right; the example rule has the focus 'n, k'. Internal situation: the left
context is matched in the output buffer ('g e - d e'), the focus ('n k') and
the right context ('d a g') in the input buffer. User view: a single grapheme
buffer 'g e - d e n k d a g'.]
In TooLiP the scheme of five data buffers is implemented. In section 4.4.1
it is discussed how the situation of five synchronized buffers affects the
routine Match_primitive. In section 4.4.2 two possible mechanisms to implement
synchronization are discussed, which are compared in section 4.4.3. The
implementation of the one that is to be preferred is discussed in section 4.4.4.
4.4.1 Matching primitives to
In section 4.3.1 a strategy for matching primitives is presented for the case of
a single input buffer. There, it is assumed by pre-condition that the buffer in
which to match and the position at which to match are known. Each time a
primitive is matched, afterwards the new matching position is computed, thus
satisfying the pre-condition for the next time a primitive will be matched.
The actual situation with synchronized buffers is slightly different. There
are three situations to be distinguished, corresponding to the three types
of modules. In the grapheme-manipulating modules so-called
grapheme-to-grapheme (GTG) rules are used. One may only refer to graphemes,
since at that point no phonemes are yet available. When graphemes are referred
to, either the grapheme input buffer or the grapheme output buffer is consulted.
This depends on the pattern which is being matched and the direction in
which the module's input is scanned. If the input string is scanned from
left to right (→) then the left context will be matched against the output
buffer, and the right context against the input buffer (see Fig. 4.3).
Conversely, if the input string is scanned from right to left (←), the buffers
against which the patterns are matched are exchanged accordingly. The focus
pattern, of course, is always matched against the input buffer. In this way, it
seems to the user as if there is only one buffer, where all transformations are
executed immediately.
[Figure 4.4: Selecting buffers in GTP modules. Scanning direction: left to
right; rule: a, u → OO / SJ _. Input (graphemes): 'c h a u f f e u r', with
grapheme left context, focus and right context. Output (phonemes): 'SJ',
available as phoneme left context.]
[Figure 4.5: Selecting buffers in PTP modules. Scanning direction: right to
left. Graphemes: 'e n v e l o p j e'; phoneme input: 'E N V C L O P J C';
phoneme output: 'PJ C'. Both the grapheme level and the phoneme level offer a
left context, a focus and a right context.]
In the second type of module, grapheme-to-phoneme (GTP) rules are used.
These transform graphemes in the input buffer to phonemes in the output
buffer. Therefore, when graphemes are referred to, these should always be
searched for in the input buffer. Phonemes, on the other hand, are searched for
in the output buffer (see Fig. 4.4). However, one may only refer to phonemes
in contexts where they are present, so for left-to-right scanning this is the
left context and for right-to-left scanning the right context. If one refers to a
phoneme in the other context, a compile-time error occurs.
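The buffer selection rules for GTG and GTP modules described above can be summarized in a small Python sketch. The function name, the string labels and the use of ValueError to stand in for the compile-time error are all hypothetical choices made for illustration; they do not reflect the original implementation.

```python
def select_buffer(module: str, primitive: str,
                  context: str, direction: str) -> str:
    """Return 'input' or 'output' for the module's own buffer pair.

    module:    'GTG' or 'GTP'
    primitive: 'grapheme' or 'phoneme'
    context:   'left', 'focus' or 'right'
    direction: 'ltr' (left to right) or 'rtl' (right to left)
    """
    # The context that has already been processed, and therefore already
    # exists in the output buffer, depends on the scanning direction.
    processed = "left" if direction == "ltr" else "right"

    if module == "GTG":
        if primitive != "grapheme":
            raise ValueError("GTG rules may only refer to graphemes")
        # Processed context: output buffer; focus and the other context:
        # input buffer.
        return "output" if context == processed else "input"

    if module == "GTP":
        if primitive == "grapheme":
            return "input"            # graphemes always come from the input
        if context == processed:
            return "output"           # phonemes exist only where produced
        raise ValueError("phoneme referred to where none is present yet")

    raise ValueError("unknown module type: " + module)
```

For left-to-right scanning in a GTG module, select_buffer("GTG", "grapheme", "left", "ltr") gives 'output', whereas the right context and the focus give 'input'; for right-to-left scanning the two contexts swap roles.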
In the phoneme-manipulating modules phoneme-to-phoneme (PTP) rules
are used. Since derivational history is being retained, one may refer to
graphemes as well as phonemes, or require that a phoneme is derived from
a (sequence of) graphemes. Reference to phonemes in these modules works
the same as reference to graphemes in the GTG modules, that is, phonemes
are searched for in the output buffer if that already contains information, and
otherwise in the input buffer. However, when graphemes are referred to in
PTP rules, these must be searched for in the grapheme level. In this case the
last grapheme level preceding the phoneme level is used, the grapheme output
buffer, in which the results of the last GTG module are stored (Fig. 4.5).
Particularly in the last two types of modules, when graphemes and
phonemes may both be referred to, the buffer to which the primitive should
be matched depends upon which of the two is used. This means that the
pre-condition of the original algorithm, having selected the internal position,
cannot be satisfied without knowledge of the type of the primitive that is
used.
Therefore, a different strategy is more appropriate: select and adjust the
internal position just before matching the current primitive, rather than
prepare it afterwards for the next matching action. In other words, the new
pre-condition is that the internal position indicates the position where the
previous primitive has been matched. This new condition on the one hand
means that the internal position should be initialized properly before the first
matching action takes place, but on the other hand, it is not prepared in vain
after the last matching action.
For the algorithms presented thus far, this has consequences. The
selection of the internal position, previously done in Match, will now be part
of Match_primitive, to be combined with the selection of the buffer. As this
depends on the type of primitive (grapheme or phoneme), inside this routine
a distinction between these cases must be made. Therefore, as to matching
primitives, the following routines perform the tasks.
function Match(patt : pattern;
               int_pos : internal_position) : boolean;
begin
  result := true;
  while result and Type(patt) = primitive
  do
    (result, int_pos) := Match_primitive(patt, int_pos);
    patt := Select_next(patt)
  od;
  if result and Type(patt) = structure
  then result := Match_structure(patt, int_pos) fi;
  Match := result
end;

function Match_primitive(patt : pattern;
                         int_pos : internal_position)
                         : (boolean, internal_position);
begin {pre: buffer and pos pointing at last matched segment}
  (int_pos, fail) := Select_int_pos(Type(patt), int_pos);
  if fail then result := false
  else
    segm := Segment(int_pos);
    case Type(patt) of
      grapheme, phoneme:
        result := Data(patt) = segm;
      g_feat, p_feat: begin
        result := true;
        feat := Data(patt);
        while result and Present(feat)
        do
          result := (segm in Feature_set(feat)) = Value(feat);
          feat := Select_next(feat)
        od end
    esac
  fi;
  Match_primitive := (result, int_pos)
end;
The major part of the routines is simply copied from the previous
algorithms. New in Match_primitive is the function Select_int_pos, which
determines the new internal position. The first parameter indicates the type of
the pattern to be matched, the second is the current internal position, the
value of which has to be adjusted. The variable fail indicates whether the
returned values are valid; the position parameter might have been moved
outside the buffer boundaries, or at the current position there might not be
synchronization with the desired buffer. Note that this construction serves as
a termination criterion when infinite repetition is used.
4.4.2 Synchronization mechanisms
The task of routine Select_int_pos is to determine the new internal position. If
this happens to be in the same buffer this is easy: just increment or decrement
pos by one. If, on the other hand, it turns out to be in a different buffer, one
must first find the position in that buffer which is aligned with the original
position in the original buffer. For this it is important to know how the
synchronization mechanism works.
One can probably think of several mechanisms to synchronize buffers. Two
of them are prompted directly by the two different purposes for which
synchronization is needed in TooLiP. In this section I will discuss these two
mechanisms, how they relate to each other, how, by means of synchronization, one
can switch from one buffer to another, and what is needed to have
synchronization operate correctly.
The first stems from the need to select an adjacent buffer if both
graphemes and phonemes are referred to in one pattern. The idea is to switch
from the current position in the current buffer directly to the correct
position in the adjacent buffer by means of a direct synchronization between the
segments of each buffer. This synchronization mechanism will therefore be
called direct buffer switching (DBS). An integer is attached to each segment
in a buffer, which points to the segment in the adjacent buffer with which it
is synchronized. In this way, individual segments can be synchronized with
values that fall within the buffer range.
Apart from synchronization of individual segments, it must also be possible
to synchronize a sequence of segments as an inseparable unit to a segment or
another sequence. For instance, the 'eau' sequence of the word 'cadeau' is
pronounced as a single phoneme /OO/, which one would like to represent
in the system as the grapheme sequence 'eau' being synchronized with the
phoneme segment 'OO'. Also, it must be possible to represent insertions or
deletions, that is, when a segment in one buffer is not associated with any
segment in another buffer. These two cases of alignment can be represented by
pointer values that fall outside the buffer range. For synchronization purposes
insertions and deletions are the same, and can therefore be represented by a
single value. This value will be denoted by a '"' sign. Inseparable sequences
will be represented by a different value, which will be denoted by a '-' sign.
To synchronize two buffers in the DBS mechanism, the leftmost segment
of an inseparable sequence (which can consist of a single segment) receives a
pointer value. This is either a normal value (inside the buffer range), or the
insertion value '"' to indicate an insertion or deletion. The other segments of
the sequence receive the 'inseparable' value. For instance, the synchronization
of the orthography of 'cadeau' with the pronunciation 'KAADOO' is as follows
(compare with (4.6)):
  gra:   c   a   d   e   a   u
  next:  1   2   3   4   -   -
  prev:  1   2   3   4
  phon:  K   AA  D   OO
The 'c' is associated with the first element of the next buffer, the 'K', which
in turn is associated with the first element of the previous buffer. Similarly,
the 'a' is synchronized with the 'AA', the 'd' with the 'D' and the sequence
'eau' with the 'OO'.
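The DBS pointers for 'cadeau' can be written out as a small Python sketch. The data layout (parallel lists, 1-based pointer values, the string '-' for the inseparable value) and the helper names switch and is_consistent are assumptions made for illustration; the insertion value is left out because it does not occur in this example.

```python
INSEP = "-"   # non-leftmost member of an inseparable sequence

gra = ["c", "a", "d", "e", "a", "u"]
nxt = [1, 2, 3, 4, INSEP, INSEP]      # pointers into phon (1-based)
phon = ["K", "AA", "D", "OO"]
prev = [1, 2, 3, 4]                   # pointers back into gra (1-based)


def switch(pos: int, pointers: list) -> int:
    """From 1-based position pos, find the aligned 1-based position in the
    adjacent buffer. Inside an inseparable sequence, first walk left to
    the segment that carries the pointer."""
    while pointers[pos - 1] == INSEP:
        pos -= 1
    return pointers[pos - 1]


def is_consistent(forward: list, backward: list) -> bool:
    # Every normal pointer must be answered by a pointer pointing back.
    return all(backward[target - 1] == index + 1
               for index, target in enumerate(forward)
               if target != INSEP)
```

Switching from the final 'u' (position 6) to the phoneme buffer thus arrives at position 4, the 'OO', because the pointer is stored at the leftmost segment of the sequence 'eau'.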
The second mechanism stems from the need to determine overall
input-to-output relations, and is inspired by the synchronization method used by
Susan Hertz et al. (1985) in her Delta System. This mechanism is called
overall synchronization (OS). Rather than directly synchronizing segments
with each other, the positions between the segments are synchronized. Before
and behind each segment one or more synchronization marks are placed, and
buffers are synchronized at a certain place if the values of the synchronization
marks are the same. Thus, synchronization is defined between any number
of buffers, rather than between two adjacent buffers. This mechanism can
for instance be implemented by attaching a (list of) synchronization marks
(sync marks) to each segment, which represents the sync marks behind the
segment. In front of the first segment a similar list is placed.
'Normal' one-to-one synchronization of segments is characterized by the
fact that sync marks enclosing the segments have the same value. Inseparable
sequences are characterized by the fact that certain sync marks are present in
one buffer but absent in another. Insertions and deletions are characterized by
the fact that between certain segments more than one sync mark is present.
With this mechanism the synchronization for 'cadeau', for instance, is as
follows:
  gra:     c   a   d   e   a   u
  sync:  0   3   7   6   8   1   5
  phon:    K   AA  D   OO
  sync:  0   3   7   6   5
The values of the sync marks are arbitrary; of importance is only whether
the values are the same. Here, the grapheme 'c' and the phoneme 'K' are
aligned as they are 'enclosed' between sync marks with the same values
(0 and 3). The same holds for a-AA and d-D. Since there are no sync marks
with the values 8 and 1 in the phoneme sync buffer, the grapheme sequence
'eau' is associated with the 'OO'.
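The way this alignment follows from the sync marks can be sketched in a few
lines of Python. This is only an illustration and not part of the system
described here: the names spans and align and the list-based buffer layout
are assumptions, and the sketch assumes exactly one sync mark at every
position (so it does not cover insertions and deletions).

```python
def spans(segments, marks):
    """Pair each segment with the sync mark directly behind it."""
    # marks[i] precedes segments[i]; marks[i + 1] follows it.
    return [(seg, marks[i + 1]) for i, seg in enumerate(segments)]


def align(upper, upper_marks, lower, lower_marks):
    """Group the segments of two buffers that lie between shared marks."""
    shared = set(upper_marks) & set(lower_marks)
    low = iter(spans(lower, lower_marks))
    result, up_group = [], []
    for seg, right in spans(upper, upper_marks):
        up_group.append(seg)
        if right not in shared:
            continue  # still inside an inseparable sequence
        low_group = []
        for low_seg, low_right in low:
            low_group.append(low_seg)
            if low_right == right:
                break
        result.append((up_group, low_group))
        up_group = []
    return result


pairs = align(list("cadeau"), [0, 3, 7, 6, 8, 1, 5],
              ["K", "AA", "D", "OO"], [0, 3, 7, 6, 5])
for graphemes, phonemes in pairs:
    print("".join(graphemes), "-", " ".join(phonemes))
```

Only the equality of mark values is used, in line with the remark that the
values themselves are arbitrary.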
Equivalence
The two synchronization mechanisms are equivalent in the sense that they
both describe synchronization fully and can be transformed into each other.
Taking a broader viewpoint than only the scheme of grapheme and phoneme
buffers, both synchronization mechanisms can be transformed into the already
implicitly introduced representation of vertical lines (see for instance (4.6))
for an arbitrary number of buffers. The OS mechanism is a direct implementation
of this representation: the sync mark values can be placed between the
segments and the marks of equal values can be connected with ragged lines,
which, so to speak, are then pulled straight:
buf 1:    a b c d
sync 1:   0 1 2 3 4        0 a 1 b 2 c 3 d 4          a | b | c | d
buf 2:    a c d
sync 2:   0 1-2 3 4        0 a 1-2   c 3 d 4    →     a |   | c | d
buf 3:    a d
sync 3:   0 1-2-3 4        0 a 1-2-3     d 4          a |   |   | d
The correspondence of the DBS method to the vertical line representation
is more complex. A normal value, that is, a value which falls within the
buffer, should be interpreted as a sync mark to the left of the character
to which it belongs; the segment which is pointed to has an associated pointer
pointing back. This is called consistency of DBS synchronization. Thus, the
two adjacent buffers are synchronized between those two segments. Such
synchronization can be extended to other buffers if they, too, at that
position have such a synchronization:
buf 1:    x   x   x   a
next 1:               3              x x x | a
prev 2:           4                        |
buf 2:    x   x   a          →         x x | a
next 2:           2                        |
prev 3:       3                          x | a
buf 3:    x   a
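The consistency requirement can be stated as a small check. The following
Python sketch is an illustration only: the name is_consistent, the 1-based
pointer lists and the use of None for the 'inseparable' value are assumptions
made for this example.

```python
def is_consistent(next_ptrs, prev_ptrs):
    """Check that every pointer is answered by a pointer pointing back.

    next_ptrs[i] is the position (1-based) in the next buffer that segment
    i + 1 points to; prev_ptrs is the mirror image for the next buffer.
    None stands for a segment without a pointer (e.g. 'inseparable').
    """
    def answered(source, target):
        for pos, ptr in enumerate(source, start=1):
            if ptr is None:
                continue
            if not 1 <= ptr <= len(target) or target[ptr - 1] != pos:
                return False
        return True

    return answered(next_ptrs, prev_ptrs) and answered(prev_ptrs, next_ptrs)


# 'cadeau' / 'K AA D OO': the 'a' and 'u' of 'eau' carry no pointer.
cadeau_next = [1, 2, 3, 4, None, None]
kadoo_prev = [1, 2, 3, 4]
print(is_consistent(cadeau_next, kadoo_prev))
```

A pair such as next = [1, 2] and prev = [1, 1] is rejected, because the second
segment of the first buffer points to a segment that points back elsewhere.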
Inseparable sequences are characterized by the fact that they do not have
left-synchronization to one of the adjacent buffers (but perhaps do to the
other one):
buf 1:    a   b   c   d
next 1:   1   2   3   4
prev 2:   1   2   3   4              | a | b | c | d
buf 2:    a   b   c   d
next 2:   1   -   -   3        →     | a | b | c | d
prev 3:   1   -   4
buf 3:    x   y   d                  | x   |   y | d
Deletions and insertions can only be detected in the buffer where they
occur, since in the relevant adjacent buffer there is no character to be
associated with them. Here the synchronization lines have to "propagate"
through the buffers; deletions propagate forwards and insertions backwards.
Deletion and insertion lines stick, so to speak, to the closest
synchronization line to their right.
Deletion:

buf 1:    a   b   c   d
next 1:   1   =   2   3              a | b | c | d
prev 2:   1   3   4
buf 2:    a   c   d            →     a |   | c | d
next 2:   1   =   2
prev 3:   1   3                      a |   |   | d
buf 3:    a   d

Insertion:

[Figure: the corresponding insertion example in the DBS pointer and vertical
line representations.]
Via the vertical line representation the two synchronization mechanisms
can be transformed into each other and hence they are equivalent. The vertical
line representation is the most natural way to represent synchronization, and
in TooLiP it is the actual user interface. On the one hand, when
synchronization information is given, this is done in the vertical line
representation. On the other hand, the concatenation comma and the square
brackets the user uses in the rules can be mapped directly on synchronization
marks.
Buffer switching
Synchronization between buffers is defined by means of the vertical line
representation: one may switch to another buffer along the synchronization
marks. So for instance, when the internal status of the system is as depicted
below in the vertical line representation, starting at the 'a' in the third
buffer and moving one position to the right (this is typically needed when
patterns are matched; after testing one segment an adjacent one must be
tested), one can shift to the first buffer and reach the 'b', or shift to the
second buffer and reach the 'c', or stay in the third buffer and reach the 'd':
buf 1:    a | b | c | d
buf 2:    a |   | c | d
buf 3:    a |   |   | d
To state it in general: between each two segments a list of sync marks
is present. The list always contains at least one element. The first element
that is encountered is the synchronization point (that is, the leftmost sync
mark when scanning →, or the rightmost sync mark when scanning ←); all
buffers which contain the same sync mark (that is, with equal sync values)
are synchronized at that position. Thus, in principle, one can switch to any
of those buffers. In that respect, starting at the 'a' in the third buffer once
again and moving →, one cannot end up in the 'c' in the first buffer by taking
the second sync mark between the 'a' and the 'd' in the third buffer.
Also, inside an inseparable sequence one cannot shift to the adjacent buffer
which 'caused' the sequence to be inseparable. For instance, moving to the
right, one cannot shift from the 'e' in the grapheme buffer to the 'OO' in the
phoneme buffer:

gra:     | c | a  | d | e | a | u |
phon:    | K | AA | D |    OO     |

What is possible is to move from the grapheme 'd' to the phoneme 'OO',
or from the grapheme 'u' to the phoneme following the 'OO', or to require
alignment of 'eau' with 'OO'.
Installing synchronization
Since synchronization is introduced for two purposes, it must meet the two
requirements of being able to switch from one buffer to another and to
determine the overall input-to-output relationships. For both mechanisms we
will now look at how this influences the installation of synchronization, that
is, which actions must be taken to fulfil the two requirements.
When a rule applies, that is, when all three patterns of focus, left and right
context match, the structural change is added to the output and synchronized
with the focus part of the input. Thus, one or more segments in one buffer
are synchronized with zero or more segments in another buffer. The case
of zero segments corresponds with an insertion or deletion, the case of one
segment with a 'normal' synchronization, and the case of more than one
segment with an inseparable sequence.
In the OS method the structural change which is added to the output is
synchronized with the input segment or sequence at the outer ends: the sync
marks to the left and right of the structural change are synchronized with
(given values equal to) the sync marks of the input buffer, which are defined
in a previous stage. If the structural change consists of a single segment, no
further action is necessary. Between the segments of an inseparable sequence
new sync marks are generated in the output buffer.
If a rule deletes one or more characters, no character is added to the output,
but at that place the output buffer is synchronized with the input buffer at
two places, thus representing the deletion. Thus, a list of sync marks results
in the output, the head of which is (the list of) sync marks to the left of the
deleted segment(s), and the tail of which is the synchronization to the right
of the segment(s). Such a list will propagate forwards through the buffers
automatically:
buf 1:    a   b   c   d        b → ∅
buf 2:    a       c   d        c → ∅
buf 3:    a           d        d → e
buf 4:    a           e
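How the sync marks of a structural change could be installed in the OS method
is sketched below in Python. The sketch is an illustration under assumptions:
the name apply_change, the representation of a buffer as a list of segments
plus a list of gaps (each gap a list of sync marks, gaps[i] lying in front of
segment i), and the counter that supplies fresh mark values are all
hypothetical. Insertions are not covered.

```python
import itertools


def apply_change(segments, gaps, index, change, fresh):
    """Rewrite segments[index] into `change` and install the sync marks."""
    left, right = gaps[index], gaps[index + 1]
    if not change:
        # Deletion: one list, head = marks to the left, tail = marks to the right.
        middle = [left + right]
    else:
        # New marks are generated between the segments of an inseparable sequence.
        inner = [[next(fresh)] for _ in change[1:]]
        middle = [left] + inner + [right]
    new_segments = segments[:index] + list(change) + segments[index + 1:]
    new_gaps = gaps[:index] + middle + gaps[index + 2:]
    return new_segments, new_gaps


fresh = itertools.count(100)
buf1 = (list("abcd"), [[0], [1], [2], [3], [4]])
buf2 = apply_change(*buf1, 1, [], fresh)     # b -> (nothing)
buf3 = apply_change(*buf2, 1, [], fresh)     # c -> (nothing)
buf4 = apply_change(*buf3, 1, ["e"], fresh)  # d -> e
print(buf4)
```

The list of marks between 'a' and the following segment grows from [1] to
[1, 2] to [1, 2, 3] and is simply carried along by later modules, which is the
forward propagation referred to above.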
Insertions, however, are a little different, for in the input buffer left and
right sync marks are not both available; only one side (depending on the
scanning direction) is available. For instance, when the rule

∅ → c / a _ b                                                        (4.7)

is specified, and the input buffer consists of:

a b

the character 'c', which is added to the output, only has the sync mark
between the 'a' and the 'b' to refer to. Therefore, a new sync mark is
introduced at the outer side of the insertion, that is, at the right side in
forward scanning, and at the left in backward scanning. This sync mark must
propagate backwards through the existing buffers to distinguish the case from
an inseparable sequence. Compare, for instance:
insertion:  ∅ → c / a _ b            inseparable sequence:  b → c,b

   a |   | b                            a | b
   a |   | b                            a | b
   a | c | b                            a | c | b
         ^ new                              ^ new
The difference between these two cases is that in the insertion case one can
switch to a previous buffer along the new sync mark, and in the inseparable
sequence case one cannot.
With insertions propagating backwards, the OS mechanism ensures both
correct buffer switching and overall synchronization, irrespective of the
number of buffers which are actually implemented. The only demand is that
insertions propagate back in previous buffers as far as they can.
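One possible reading of this backward propagation is sketched below in
Python. It is an interpretation and not the implementation discussed in the
text: the name propagate_insertion, the gap lists and the choice to place the
new mark directly behind the existing mark (forward scanning) are assumptions.

```python
def propagate_insertion(previous_gaps, anchor, new_mark):
    """Add new_mark behind anchor in earlier buffers, most recent first."""
    for gaps in reversed(previous_gaps):
        hit = next((marks for marks in gaps if anchor in marks), None)
        if hit is None:
            break  # the anchor is unknown here, so propagation stops
        hit.insert(hit.index(anchor) + 1, new_mark)


# Two existing buffers, both containing 'a b' with marks 0, 1 and 2.
buf1_gaps = [[0], [1], [2]]
buf2_gaps = [[0], [1], [2]]
# Rule (4.7) adds 'c' to the output: a [1] c [9] b, with 9 the new mark.
output_gaps = [[0], [1], [9], [2]]
propagate_insertion([buf1_gaps, buf2_gaps], anchor=1, new_mark=9)
print(buf1_gaps, buf2_gaps)
```

After propagation the earlier buffers hold two marks between 'a' and 'b', so
one can switch along the new mark. For the inseparable sequence b → c,b the
output would carry the same marks, but the earlier buffers would be left
untouched.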
Since the DBS mechanism is equivalent to the OS mechanism, DBS
automatically also meets these two requirements. However, the assumption
underlying this equivalence is that the basic synchronization of adjacent
buffers is consistent, i.e., that pointers are always mutually pointing at each
other. If each module has its own output buffer, this condition is
automatically satisfied, for when output is added to the output buffer it is
synchronized consistently, and no changes to this information are made further
along the derivation. However, if the five-buffer scheme is used (see
Fig. 4.2), this is not automatically the case. Suppose, for instance, that a
GTG module has been applied and that the output is synchronized with the input.
The synchronization of the original input with the output of this module is
only defined via the grapheme input buffer, and hence one may not delete the
synchronization information by overwriting it with the information of the
output buffer.
Therefore, both the pointers of the original input pointing at the input
buffer and the pointers of the input buffer pointing backwards will have to be
recomputed so as to make the synchronization consistent. Rather than going
into the details of this algorithm, I will give an example of how these buffers
are adjusted.
Consider the situation depicted below:
org in:    a   c
next 1:    1   3                   org in:    a   c
prev 2:    1   -   2               next 1:    1   =
gra in:    a   b   d        →      prev 2:    1   -   -
next 2:    1   2   =               gra in:    a   b   e
prev 3:    1   2   -
gra out:   a   b   e
Apparently 'a' in the original input has expanded to the sequence 'ab'
in previous modules. The 'c' has been changed into a 'd'. In the current
module, the 'b' expands to 'be', and the 'd' is deleted. Considering the
overall synchronization, which at that moment is the synchronization of
original input and grapheme output, it is as if the 'a' has expanded via 'ab'
to 'abe', and the 'c' has been deleted (via 'd'). When the grapheme output is
copied into the grapheme input buffer, the synchronization pointers should be
adjusted such that they code the cumulative effects of all previous modules, as
is depicted above.
Summarizing, it can be stated that both mechanisms need some special
attention during the conversion process to ensure that they are reliable. For
the OS method insertions must propagate backwards through existing buffers,
for the DBS method resynchronization is needed when an output buffer is
copied to the input buffer.
4.4.3 Comparison of the two mechanisms
To be able to use synchronization for the two intended purposes, four functions
must be implemented in the system. The first one installs the synchronization,
that is, how the segments in the output are associated with those in the input.
The second function is the mechanism to switch from one buffer to another.
A third function is to keep the synchronization consistent when the contents
of the output buffer are shifted to the input buffer when the next module
is to be consulted. The last function determines the overall synchronization
from input to output, that is, from the original input grapheme to the final
phoneme output.
In particular the last two functions are sensitive to the mechanism used.
In the OS method, they are more or less trivial, since the sync marks directly
code the desired information. In the DBS method, considerable
computing effort is needed, since the information is coded indirectly, that is,
only between the adjacent buffers synchronization information is available,
which will have to be made more global, in a manner described above.
On the other hand, one may expect the DBS method to be more appropriate
for the second function, since the information to switch between two
adjacent buffers is coded directly by means of the pointers, except for the rare
case of an insertion or deletion. Most commonly it is desired to switch
between two adjacent buffers, and otherwise there is only one intermediate
buffer, in which case a second buffer switch is needed. In the OS method the
sync mark will have to be searched for, a process one may expect to be
slower than direct switching.
The first function is more or less of the same complexity for both
mechanisms. In both cases the sync marks have to be installed according to the
mechanism's scheme, which in both cases comprises comparable actions.
The frequency with which these functions are used varies greatly.
The first function, installation of sync marks, is needed in each module in
the order of magnitude of the number of segments to be transcribed by the
module. The need for buffer switching (the second function) depends greatly
on the kind of rules being used. In PTP modules this may be a number of
times within one rule, but it may also be absent. The third function, keeping
the database consistent, is needed less than the number of modules per overall
transcription, and the fourth function is needed only if one is interested in the
derivational history, and in that case once per transcription.
It is hard to tell on theoretical grounds how the balance will dip. Therefore,
the two have been implemented as alternatives, and have been tested on
a practical situation. The pronunciation of Dutch words was determined
by some 8 modules, where buffer switching was used frequently in only one
module which contained some 80 rules. It turned out that the two mechanisms
were more or less in balance, that is, the DBS method was some 5% faster
than the OS method.
However, I favour the OS method above the DBS method. The main reason
for this is that it directly relates to the vertical line representation. And
in spite of the fact that the DBS method originated from the idea of directly
switching to adjacent buffers, synchronization and buffer switching are
defined in terms of the vertical line representation. In the common situation
of one-to-one synchronization of input-to-output segments the DBS method
is undoubtedly faster, but when inseparable sequences and in particular
insertions and deletions occur, in one way or another the DBS representation
must locally be transformed to the vertical line representation before a
correct buffer switch can be made. This also means that the design of a buffer
switching algorithm is simpler for the OS method than for the DBS method.
For a complex system in development, simplicity of such a mechanism, which
influences several parts of the system, was thought to compensate a 5% loss
of run-time efficiency. Moreover, when the system needs to be extended in
the sense that extra layers of information must be synchronized with existing
ones, the OS method is superior to the DBS method. With the OS method
the extra layers are the same as any other buffer; with the DBS method the
original buffers have to be extended with extra pointer buffers, and some of
the algorithms have to be extended, too. For comparison both mechanisms
have been implemented, but for the above reasons the latter mechanism, the
OS method, is preferable and has therefore been chosen to serve as the
synchronization mechanism both for TooLiP in the released versions and TooLiP
in future development.
4.4.4 The algorithm for buffer switching
At this point we can return to the algorithm and see how it works out, in the
function Select_int_pos. The task of this routine is to determine the new
internal position based on the position where the previous primitive was
matched. The routine will be presented for the OS mechanism.
In general terms, the adjustment of the internal position takes place in
three phases. First the appropriate buffer must be determined, then the
position in this buffer which is synchronized with the old one in the old buffer
must be determined, and finally the new position must be determined by
shifting the position to the left or to the right. In the following algorithm
these three phases can be distinguished, although they are not fully separated:
function Select_int_pos(prim : prim_type;
                        int_pos : internal_position)
                        : (internal_position, boolean);
begin {pre : lower_boundary < pos < upper_boundary}
   desbuf := Desired_buffer(patt_type, mod_type, prim);
   (buffer, pos) := int_pos;
   if (match_dir = ←)
      then pos := pos - 1;
           fail := (pos < lower_boundary) fi;
   if desbuf ≠ buffer and not fail
      then
         sync_list := Synchronization(buffer, pos);
         if (match_dir = →)
            then sync_val := Leftmost(sync_list)
            else sync_val := Rightmost(sync_list) fi;
         (pos, fail) := Search_sync(desbuf, sync_val)
   fi;
   if (match_dir = →) and not fail
      then pos := pos + 1;
           fail := (pos > upper_boundary) fi;
   int_pos := (desbuf, pos);
   Select_int_pos := (int_pos, fail)
end;
The first phase of determining the appropriate buffer depends on the
pattern (focus, left or right context), the scanning direction (← or →), the
type of module (GTG, GTP, PTP) and the type of primitive (grapheme or phoneme),
as described in section 4.4.1. Only the type of primitive can vary during the
process of matching a pattern, and therefore only the type information is
transferred to Select_int_pos by means of parameter prim. The other
information is constant during the process of matching a pattern, and is
therefore available inside the procedure by means of global variables, which
are adjusted by higher level routines only when necessary. To emphasise the
dependence they are included in the parameter list for the function
Desired_buffer. This routine directly determines the appropriate buffer from
these parameters, and is a straightforward implementation of the rules given in
section 4.4.1.
If the thus determined buffer differs from the original buffer, the position
in that buffer which is aligned with the original position in the original buffer
must be determined by means of the synchronization information. The sync
mark attached to a certain position is the sync mark behind the segment. If
the matching direction is backwards (←), which it is, for instance, when a
left context is being matched, the relevant sync marker is the one before the
current position, and can be selected by first decrementing pos. The relevant
sync marker can now be determined by selecting the first element of the sync
marker list at that position (see 4.4.2). Synchronization selects the list, and
Leftmost or Rightmost determines the first element, which again depends on
the matching direction. Then, the sync mark is searched for in the new buffer
by Search_sync. This routine searches for the value sync_val in buffer desbuf
and returns the position where it is found, pos. Since nothing special is known
of the sync marker, this is done by means of a linear search.
If the sync mark is not found, there is no synchronization between the
two buffers at the original position. This means that the path of the pattern
that is being matched is not present. This is transferred by the second value
which Search_sync returns, fail, which is true if the search fails, and false if
it succeeds.
Finally, the new internal position is determined. In the case of backwards
matching (←) no further action is needed, due to the fact that the sync marks
denote the synchronization points behind the segments. The decrementing
action has been done beforehand, before the buffer switch has been made. In
the case of forwards matching (→) the correct position is one position to the
right, just across the synchronization mark, so in this case pos is incremented
by one. As when decrementing, it should be checked whether the buffer
boundaries have been crossed.
The variable fail thus performs a double task: it either indicates that there
is no synchronization between the two buffers at the original internal position,
or it indicates that the new internal position falls outside of the buffer range.
In both cases the investigation of (that specific path of) a pattern can be
terminated, and since it is not important to know which of the two is the
case, this can be coded by means of one variable.
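For readers who prefer executable code, the routine can be approximated in
Python as follows. This is a sketch under assumptions, not the released
implementation: the destination buffer is passed in directly instead of being
computed by Desired_buffer, segments occupy positions 1..n, sync[p] holds the
marks behind segment p (sync[0] those in front of the first segment), and the
final guard for backward matching is an addition of this sketch.

```python
def search_sync(buffers, desbuf, sync_val):
    """Linear search for sync_val in desbuf; returns (position, fail)."""
    for pos, marks in enumerate(buffers[desbuf]["sync"]):
        if sync_val in marks:
            return pos, False
    return 0, True


def select_int_pos(buffers, desbuf, int_pos, forward):
    """Return ((buffer, pos), fail) for the next segment to be tested."""
    buffer, pos = int_pos
    fail = False
    if not forward:
        pos -= 1                      # the relevant mark lies before pos
        fail = pos < 1
    if desbuf != buffer and not fail:
        sync_list = buffers[buffer]["sync"][pos]
        sync_val = sync_list[0] if forward else sync_list[-1]
        pos, fail = search_sync(buffers, desbuf, sync_val)
    if forward and not fail:
        pos += 1                      # step across the sync mark
        fail = pos > len(buffers[desbuf]["seg"])
    elif not fail:
        fail = pos < 1                # extra guard, not in the routine above
    return (desbuf, pos), fail


buffers = {
    "buf 1": {"seg": list("abcd"), "sync": [[0], [1], [2], [3], [4]]},
    "buf 2": {"seg": list("acd"), "sync": [[0], [1, 2], [3], [4]]},
    "buf 3": {"seg": list("ad"), "sync": [[0], [1, 2, 3], [4]]},
}
# From the 'a' in buf 3, moving right into buf 1 reaches position 2 ('b').
print(select_int_pos(buffers, "buf 1", ("buf 3", 1), forward=True))
```

As in the routine above, a single flag reports both a missing sync mark and a
position outside the buffer.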
4.4.5 Summary
In this section the consequences of synchronized buffers for the matching
algorithm Match have been discussed. First, the architecture of the synchronized
buffers as is present in TooLiP has been shown and how this specific
architecture affects the algorithm for matching primitives. Then a more general
view on synchronization has been taken, that is, how an arbitrary number of
buffers can be synchronized, what the characteristics of synchronization are,
and how one can switch in general from one buffer to another. Next, two
alternative mechanisms to implement synchronization have been presented, one
prompted by the need for buffer switching, the other by the need for overall
synchronization. They are functionally equivalent, but the main difference
between the two is that the first mechanism is somewhat faster in run-time
performance, whereas the second is a more simple and elegant algorithm as it
directly codes the definition for buffer switching and overall synchronization.
For this reason the latter mechanism is preferred, and it is therefore
implemented in the released versions of TooLiP. For this synchronization
mechanism the algorithm which performs the task of buffer switching is given in
Select_int_pos, which is an important part of the routine for matching
primitives.
4.5 Discussion
In the foregoing, the important aspects of how patterns can be built, what they
mean, how they are represented internally, and how they are matched against
synchronized buffers have been dealt with. Looking at the way patterns are
processed, TooLiP can be viewed as a compiler/interpreter. The patterns,
specified by the user in a high-level language, are compiled into their internal
representations, which are then interpreted by Match. This is undoubtedly
slower than full compilation of patterns into machine code, but it has an
advantage that might be of importance to the future development of the
system. The internal representations of the various patterns are stored in
a dynamic structure in the order in which they are to be matched. Each
pattern is a separate and traceable entry in that structure. This means that
each such entry can in principle be replaced by a new one, for instance by a
compiled version of an adjusted pattern. This could be done on-line, during a
test session of the rules, which means that a linguist can interactively adjust
his rules. Such an extension of the system can be implemented relatively easily
in the compiler/interpreter architecture of TooLiP, in contrast to a full
compiler system. Only the rule which undergoes a change has to be recompiled,
which is typically so fast that it does not disturb the development session.
The new internal representation then must be patched over the old one, which
can be done instantaneously. Patching machine code, which would be the
equivalent in the full compiler scheme, is generally viewed as a cumbersome
enterprise, which requires detailed knowledge of the object code, is
error-prone, and is therefore not to be recommended. Therefore, when slight
adjustments must be made, generally in a full-compiler scheme the whole system
is to be recompiled, which generally takes orders of magnitude longer than the
incremental compilation in the compiler/interpreter scheme. Since TooLiP is
first and foremost a development tool, such a feature of being able to test
modifications quickly is interesting. The implementation of such a feature is
planned in the future.
On the other hand, the run-time performance of TooLiP is indeed that of
an interpreter, that is, some 10 times slower than comparable fully compiled
systems. It is difficult to give an exact figure of relative performance, since
there is no system available with the same functionality that is fully compiled.
The figure of 'some 10 times slower' is derived from a comparison with the
SPE rule compiler of Kerkhoff, Wester & Boves (1984), called Fonpars, run
on the same machine for the same input. In this comparison Fonpars differs
from TooLiP in at least five respects. First of all, Fonpars is a compiler:
the linguistic rules and modules are compiled into a Pascal program rather
than some kind of dynamic internal representation. The Pascal program is
then compiled into machine code in the usual way. The second difference is
that Fonpars does not support complementation in the way TooLiP does.
It does support a notion of complementation, but only for strings of the
same length. This restricts the use of this operator, but on the other hand
simplifies the algorithm which evaluates patterns. A third aspect is that
Fonpars does not keep track of the derivation, and therefore does not make use
of synchronization between buffers. A fourth point is that Fonpars does not
include extensive interactive debugging facilities which are present in TooLiP
(see 2.4.1 and 5.4.4). Finally, the rule sets which define the conversion differ
in structure and magnitude. The Fonpars system includes some 250 rules for
grapheme-to-phoneme conversion, which includes conversion of numbers,
acronyms, etc., and the assignment of word accent. TooLiP, on the other hand,
uses some 550 rules for the same purpose. The reason for this rather large
difference in number has not been investigated. It is important to mention,
however, that, should the Fonpars rules be more concise, this is not an
artefact of TooLiP. The formalisms supported by the two systems are comparable
in power of expression (see 2.6.3). The relative contribution of each of these
five factors is hard to measure exactly without considerable programming
effort, and presumably not very interesting since these are only two of many
systems processing linguistic rules.
Comparison with existing similar systems is often a suitable manner to
put into perspective one's own. As far as detailed implementation aspects
are concerned, such as those discussed above, this is difficult to do, since in
the literature there is no information available on this subject on other
systems. Generally, only general characteristics of a system are described,
such as how a user may formulate rules (the formalism), whether or not
synchronization is supported (not how this is implemented), whether it is a
compiler-like structure or not, and so on. Such a functional comparison has
been made in chapters 2 and 5.
4.6 Conclusion
In this chapter an algorithm is given which matches an arbitrary pattern
against arbitrary input, where 'input' should be viewed in the context
of synchronized buffers.
First, the internal representation is given into which the patterns, given by
the user in this high level language, are transformed. Since the transformation
closely resembles the parsing phase of ordinary compilers, this has
not been touched upon, so only the output of this process is given. The
internal representation is more than a parse tree, but less than an NFA
(non-deterministic finite automaton) with ε-transitions. This representation
has been chosen on practical grounds; those computations which can be done at
compile time are processed into the internal representation, those which are
simpler during run-time are done at run-time.
Next, the interpreter for these (internally represented) patterns is
presented, the function Match. This is done in two phases. First, the simplified
case of a single input buffer, to which the pattern is to be matched, is
discussed. Guided by the syntax and semantics of the patterns the various
subroutines in the matching function are presented. Special attention has been
given to the routine which deals with complementation. The recursion, which
is directly present in the syntax, is interrupted at the highest level due to an
efficient implementation of complementation. Once inside complementation,
recursion is restored.
The second phase is the generalization to synchronized buffers. This only
affects a small part of Match, the part which deals with the actual comparison
of the buffers and the requirements imposed by the pattern. Two possible
approaches to synchronization are presented and compared, one
which is somewhat faster but more complicated, the other which is slower
but simple and elegant. In a system which is in development and which for
instance could need to be extended in the respect of synchronizing
capabilities, run-time performance is not a top priority, whereas transparency
of algorithm is, and therefore the slower but elegant algorithm is preferable.
For this mechanism the algorithm to switch from one buffer to another has been
presented.
With respect to the processing of patterns TooLiP can be viewed as a
compiler/interpreter. User-defined rules are compiled from high-level source
code to an internal representation, which is further interpreted by Match. As
a consequence, TooLiP is not as slow as an interpreter which interprets directly
from source code, but not as fast as a compiler which compiles the source
code directly into machine code. Nevertheless, an important characteristic
of an interpreter has potentially been preserved, the possibility to modify
a rule interactively and directly see the effect of the modification. This is
an attractive feature of a development tool and is therefore planned to be
implemented in the near future.
Appendix 4.A
Matching inside complementation
In this appendix the algorithms are given for matching structures inside a
complemented structure. Of the four possible structures only for optionality
has the algorithm been given in the main text (section 4.3.3). Therefore, the
algorithms for alternation, simultaneity and complementation will be given
here.
Alternation
function Exh_match_alt(patt : pattern;
                       comm_list : result_list) : result_list;
begin
   int_pos := Int_pos(comm_list);
   prv_val := Res_value(comm_list);
   results := empty;
   alternative := Data(patt);
   while Present(alternative)
   do
      res_list := Exh_match(Pattern(alternative), int_pos);
      res_list := Adjust(prv_val, res_list);
      results := Unite(res_list, results);
      alternative := Select_next(alternative)
   od;
   Exh_match_alt := results;
end;
The structure of Exh_match_alt is similar to Match_alt. comm_list consists of
only one element (see page 106) and serves to carry int_pos and prv_val into
the routine. prv_val denotes the matching value of the path inside the
complemented structure before the alternative structure was encountered.
Compared to the non-complemented routine, the while-loop contains two extra
statements. One is to include prv_val in the list that the current alternative
has yielded. This is done by a routine called Adjust. This routine is not given,
but consists of intersecting each individual result of the result list with
prv_val. The second is the statement that prepares the result to be returned.
This is done by the function Unite, which is given below.
function Unite(new : result_list;
               res : result_list) : result_list;
begin
   while Present(new)
   do
      ip := Int_pos(new);
      Select_first(res);
      while Present(res) cand ip ≠ Int_pos(res)
         do res := Select_next(res) od;
      if Present(res)  {This means : ip = Int_pos(res)}
         then res[result] := Res_value(res) or Res_value(new)
         else Append(new, res) fi;
      new := Select_next(new)
   od;
   Unite := res;
end;
The purpose of Unite is to combine the various result lists. If the internal
positions of the various elements are not equal, they should both be
included in the resulting return list. This is dealt with by the statement:
Append(new, res). If the internal positions of two elements are the same they
are united: res[result] := Res_value(res) or Res_value(new). The surrounding
control structure serves to compare each element of the new list with the
existing list.
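The merging behaviour described above can be illustrated with a small Python
sketch. It is not the routine of the thesis itself: the linked result list is
replaced by a plain Python list, and the names Result, int_pos and value are
chosen here for illustration only.

```python
from dataclasses import dataclass


@dataclass
class Result:
    int_pos: int   # internal position reached in the input (assumed to be an integer)
    value: bool    # matching value of the path that ends at int_pos


def unite(new: list[Result], res: list[Result]) -> list[Result]:
    """Merge `new` into `res`: equal positions are or-ed, unknown positions appended."""
    for n in new:
        for r in res:
            if r.int_pos == n.int_pos:
                # Same internal position: unite the two matching values.
                r.value = r.value or n.value
                break
        else:
            # Position not yet present in res: include it unchanged.
            res.append(Result(n.int_pos, n.value))
    return res
```

Under these assumptions, uniting a list holding positions 3 and 7 with a list
holding positions 3 and 5 yields one entry for each of the positions 3, 5 and 7.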
Simultaneity
function Exh_match_sim(patt : pattern;
                       comm_list : result_list) : result_list;
begin
  int_pos := Int_pos(comm_list);
  prv_val := Res_value(comm_list);
  results := empty;
  simultan := Data(patt);
  while Present(simultan)
  do
    res_list := Exh_match(Pattern(simultan), int_pos);
    res_list := Adjust(prv_val, res_list);
    results := Intersect(res_list, results);
    simultan := Select_next(simultan)
  od;
  Exh_match_sim := results;
end;
The structure of Exh_match_sim is symmetrical to Exh_match_alt. Instead of
Unite the function Intersect is used.
function Intersect(new : result_list;
                   res : result_list) : result_list;
begin
  if not Present(res)
  then
    res := new {Initialize}
  else
    while Present(new)
    do
      ip := Int_pos(new);
      Select_first(res);
      while Present(res) cand ip ≠ Int_pos(res)
      do res := Select_next(res) od;
      if Present(res) {This means: ip = Int_pos(res)}
      then res[result] := Res_value(res) and Res_value(new)
      else res[new] := false;
           Append(new, res) fi;
      new := Select_next(new)
    od
  fi;
  Intersect := res;
end;
Intersect is the counterpart of Unite. There are two significant differences.
The main one is that each new internal position which is added to the resulting
list is initialized false. Intersecting result lists means that internal positions
must be equal and matching values must be true. If these are not both true,
this may not lead to an explicit nofit. For the matching values this is accounted
for by the statement: res[result] := Res_value(res) and Res_value(new). If
the internal positions are not equal this means that the new internal position
must be added to the list with a result value which is false. As a consequence,
res must be initialized the first time Intersect is called. In Unite no explicit
test is needed, since Unite simply adds new internal positions to the list
without changing the matching values.
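The difference with Unite can be made concrete with a Python sketch that
follows the pseudocode step by step. As before, a plain Python list stands in
for the linked result list, and the names Result, int_pos and value are
illustrative assumptions rather than names from the implementation.

```python
from dataclasses import dataclass


@dataclass
class Result:
    int_pos: int   # internal position reached in the input (assumed to be an integer)
    value: bool    # matching value of the path that ends at int_pos


def intersect(new: list[Result], res: list[Result]) -> list[Result]:
    """Combine `new` with `res`: equal positions are and-ed, new positions start false."""
    if not res:
        # First call: the result list is initialized with a copy of the new list.
        return [Result(n.int_pos, n.value) for n in new]
    for n in new:
        for r in res:
            if r.int_pos == n.int_pos:
                # Same internal position: both paths must match.
                r.value = r.value and n.value
                break
        else:
            # A position seen for the first time cannot satisfy all structures.
            res.append(Result(n.int_pos, False))
    return res
```

Like the pseudocode it mirrors, this sketch leaves entries of res untouched
when they have no counterpart in new.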
Complementation
function Exh_match_cmp(patt : pattern;
                       comm_list : result_list) : result_list;
begin
  prv_val := Res_value(comm_list);
  cmp_list := Exh_match(Data(patt), Int_pos(comm_list));
  succ_patt := Select_next(patt);
  results := empty;
  while Present(cmp_list)
  do
    res_list := Exh_match(succ_patt, Int_pos(cmp_list));
    res_list := Cmp_adjust(prv_val, Res_value(cmp_list), res_list);
    results := Unite(res_list, results);
    Select_next(cmp_list)
  od;
  Exh_match_cmp := Cmp_simplify(results);
end;
Exh_match_cmp is set up in the same manner as Exh_match_alt and
Exh_match_sim. There are some noticeable points. Cmp_adjust is similar
to Adjust except that a facility has been included to keep track of the explicit
nofits. In Match_cmp this is not necessary because of the sorted result list.
Here, this is not possible since all the results must be returned instead of
stopping when an explicit nofit has been detected. Therefore, the explicit
nofits must be distinguished from regular non-matching paths. For this purpose
expnof has been introduced as a third 'boolean' value. Only inside
Exh_match_cmp the explicit nofits have to be known explicitly. Therefore,
before returning the result list, the expnof values are simplified to false. As
a consequence however, Unite must now be able to handle expnof values.
Given their function this is defined as follows²:

  expnof or true = expnof
  expnof or false = expnof

Cmp_simplify and Cmp_adjust are given below.
function Cmp_simplify(res : result_list) : result_list;
begin
  while Present(res)
  do if Res_value(res) = expnof then res.result := false fi;
     Select_next(res) od;
  Cmp_simplify := res;
end;

² The behaviour of expnof with respect to 'and' is not necessary here, but can be defined analogously.
function Cmp_adjust(first_part, com_part : boolean;
                    res_list : result_list) : result_list;
begin
  while Present(res_list)
  do
    if first_part and com_part and Res_value(res_list)
    then res_list.result := expnof
    else res_list.result :=
           first_part and not com_part and Res_value(res_list) fi;
    res_list := Select_next(res_list)
  od;
  Cmp_adjust := res_list;
end;
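The role of expnof as a third 'boolean' value can be sketched in Python as
follows. The sketch rests on several assumptions: the enumeration Tri and the
function names are invented for the illustration, 'expnof or expnof' is taken
to be expnof, the first operand of the else-branch of Cmp_adjust is taken to
be first_part, and the result list is reduced to a list of matching values.

```python
from enum import Enum


class Tri(Enum):
    FALSE = "false"
    TRUE = "true"
    EXPNOF = "expnof"   # explicit nofit, only meaningful inside complementation


def tri_or(a: Tri, b: Tri) -> Tri:
    """'or' extended with expnof: an explicit nofit dominates both true and false."""
    if Tri.EXPNOF in (a, b):
        return Tri.EXPNOF
    return Tri.TRUE if Tri.TRUE in (a, b) else Tri.FALSE


def cmp_adjust(first_part: bool, com_part: bool, values: list[bool]) -> list[Tri]:
    """Mark paths on which the complemented structure matched as explicit nofits."""
    adjusted = []
    for v in values:
        if first_part and com_part and v:
            adjusted.append(Tri.EXPNOF)
        elif first_part and not com_part and v:
            adjusted.append(Tri.TRUE)
        else:
            adjusted.append(Tri.FALSE)
    return adjusted


def cmp_simplify(values: list[Tri]) -> list[Tri]:
    """Outside the complementation an explicit nofit is an ordinary false."""
    return [Tri.FALSE if v is Tri.EXPNOF else v for v in values]
```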
Chapter 5
Evaluation
Abstract
In this chapter the merits of TooLiP are evaluated, and some
recommendations for future development are made. Three sides of TooLiP
are evaluated: (a) the outside, i.e., how TooLiP is used in a practical
application, (b) the inside, i.e., how satisfactory was the
semi-compositional formalism in practice, and (c) the surroundings, i.e.,
how does TooLiP relate to other systems which have been designed
for similar purposes.
The main application for which TooLiP has been used is the design
of a grapheme-to-phoneme conversion system. Two aspects of
this application are discussed. The first is the spelling out of integer
numbers, which is part of a pre-processing phase, the second
concerns the linguistically slanted modules which perform the actual
grapheme-to-phoneme conversion.
The second side of the evaluation concerns the use of the
complementation operator. A semi-compositional formalism was devised to
overcome the problems which occurred with respect to complementation
in the compositional formalism. The usage of the complementation
operator in the above practical application will be reviewed
in the light of the choices which were made in chapter 3, and thus
the validity of these choices is evaluated.
The third side concerns some more general aspects. For a number
of features which can be seen to characterize development tools
for linguistic rules, TooLiP is compared to seven important existing
systems. This gives an overview of what is unique in TooLiP and
what is common practice in such systems.
Finally, for the future development of TooLiP some recommendations
are given which are concerned either with improving the system
or extending it along natural lines.
5.1 Introduction
In the previous three chapters a detailed inside view of TooLiP has been
given. In chapters 2 and 3 the functional specification has been given and
in chapter 4 the implementation. In this last chapter TooLiP will be viewed
from somewhat more distance once again, to put the system in some broader
perspective. For three topics its merits will be investigated.
In the first place the major application for which TooLiP has been used
will be discussed, viz. how it is applied as a grapheme-to-phoneme conversion
system (section 5.2). Secondly, the complementation operator will be
reviewed. Based on practical assumptions the semi-compositional formalism
was implemented. The validity of these assumptions will be investigated, and
with that that of the semi-compositional formalism (section 5.3).
Thirdly, TooLiP will be evaluated as a development tool. The system will be
reviewed in the light of some important features and is compared in those
respects to some important similar systems (section 5.4).
By way of conclusion some aspects which are not yet included in TooLiP
are discussed: some recommendations are made for extensions to TooLiP to
increase its power and user-friendliness (section 5.5).
5.2 Applications
As reported in chapter 2, the main application for which TooLiP has been used
is the development of a grapheme-to-phoneme conversion system for Dutch
(Berendsen, Langeweg & Van Leeuwen, 1986). This application will now be
discussed in greater detail than was provided in that chapter, without going
into too much linguistic detail, however.
The conversion scheme is depicted in Fig. 5.1. It is subdivided into three
major parts: text preprocessing, the actual grapheme-to-phoneme conversion,
and some postprocessing. The text preprocessing serves to normalize
the orthography, and concerns for instance the spelling out of numbers and
acronyms. No full text-normalizing component has been achieved as yet,
since this is quite an intricate problem. In principle, however, all input the
normalizing component can handle is put out in a spelled-out form.
This output is input to the grapheme-to-phoneme conversion, which consists
broadly of three parts. First a morphosyntactic analysis takes place,
which aims at locating the important morpheme boundaries. Next the
spelling-to-sound conversion takes place, and finally word stress is determined.
These processes are rule-driven. Exceptions are stored in a small exception
lexicon which is consulted first. The output of this grapheme-to-phoneme
conversion is then sent to the postprocessing modules, the task of which is
to deal with inter-word processes (such as assimilation) and to relocate
sentence accent. Some of these tasks, such as assigning word stress or spelling
out numbers, are divided into several physical modules. This is done because
some processes are 'crucially ordered', which will be elaborated on presently.

[Figure 5.1: Modular composition of the main application, grapheme-to-phoneme
conversion. Modules shown: ACRON, CAPITALS, NUMBER_1, NUMBER_2, NUMBER_3,
NUMBER_4, MORPH_1, MORPH_2, MORPH_3, GRAPHON, LEXICON, STRESS_1, STRESS_2,
STRESS_3, PUNCTUATION, ACCENT, REDUCTION, WORDS.]
The less linguistically slanted modules, such as the spelling out of integer
numbers and the relocation of accent, have been devised by the
author; the others have been devised by those for whom the system is
intended, linguists. The spelling out of numbers will be discussed in detail as
an example of how specific tasks can be achieved with TooLiP. The linguistic
modules will be viewed from somewhat more distance, for instance what
type of constructs have been used frequently and which not. Finally some
possible other applications are discussed.
5.2.1 The rules
The spelling out of integer numbers is pre-eminently to be achieved by
rule, since it is extremely regular. The automatic translation of numbers into
Dutch has already been solved (Brandt Corstius, 1965), but the implementation
into TooLiP was considered to be a good test case for the system.
To pronounce Dutch numbers correctly, two phenomena should be dealt
with. The first is characteristic for arabic numbers, viz. the value of a digit
depends on its position. This is not expressed explicitly in the orthography, '56',
but in the pronunciation it is: 'fifty six' instead of 'five six'. The relative value
of the digits will therefore have to be recovered. The second phenomenon
is characteristic for Germanic languages, viz. the fact that the digits of numbers
between 13 and 99 are pronounced in inverse order. Instead of 'fifty six' 'six
and fifty' is said. So after determining the relative values, the 'tens' and the
'ones' should be inverted.
These two processes are the kernel of the number grammar. Since knowledge
of the units is needed to invert some of the digits, this information must
be available before inversion can take place. This motivates a separate module
for each process. Beforehand some number normalization may take place,
such as deletion of leading zeros and spurious dots (in Dutch '1,000,000' is
written as '1.000.000'). Afterwards 'and' insertion (we say 'six and fifty' rather
than 'six fifty') and the spelling out must take place. Thus we have five
modules, which will be discussed in order:
NUMBER_1: for normalization
NUMBER_2: for unit determination
NUMBER_3: for digit inversion
NUMBER_4: for 'and' insertion
NUMBER_5: for spelling out
NUMBER_1: The normalization module currently consists of two simple rules.
The first, (5.1), deletes all leading zeros in a number (D stands for a digit,
# stands for a space). The second rule, (5.2), deletes all spurious dots. The
module scans the input from left to right (→). Note that this scheme only
functions satisfactorily for integer numbers, and that no account has been
taken of other types of numbers, such as telephone numbers or numbers with
decimals.
zero → 0 / # _ { }          (5.1)
     → 0 / D                (5.2)
A single '0' in the focus or change denotes an insertion or deletion. To
refer to the single character '0' one must specify 'zero', as exemplified in (5.1).
However, when the character '0' is part of a sequence there is no ambiguity
and it can be used as any other character, as will be exemplified in rules (5.10)
and (5.11).
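The effect of the normalization module can be approximated in a few lines of
Python. This is a sketch of the described behaviour only, not of the TooLiP
rules themselves: the function name is invented, and it is assumed that a dot
counts as spurious when it stands between two digits and that a zero counts as
leading when it opens the number and is followed by another digit.

```python
import re


def normalize_number(token: str) -> str:
    """Delete spurious dots and leading zeros from an integer written in digits."""
    # Assumption: a dot is spurious when it separates two digits ('1.000.000').
    without_dots = re.sub(r"(?<=\d)\.(?=\d)", "", token)
    # Assumption: zeros are leading when at least one more digit follows,
    # so that a single '0' is left untouched.
    return re.sub(r"^0+(?=\d)", "", without_dots)
```

With these assumptions '1.000.000' becomes '1000000' and '007' becomes '7',
while a single '0' is kept so that it can still be pronounced later on.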
NUMBER_2: As to the determination of units it can be observed that numbers
naturally fall apart in groups of three (which for that reason are often
separated by a comma in English and a dot in Dutch). The first group (seen
from the right) are the ones, the second group the thousands, the third group
the millions, and so on. Within each group of three, the leftmost digit denotes
the hundreds, the middle digit the tens and the rightmost digit the ones.
The division into groups of three takes place from right to left, since the
relative weight of a digit depends on the number of digits to its right.
Therefore, assigning the relative weights to the digits is preferably also done from
right to left.
For this purpose we define the module to scan backwards, i.e., from right
to left (←). The input of the module will be a normalized number, for instance
'1234567890'. The output must be a number where each digit is followed by
an indication of its relative weight, thus the previous number must be put
out as '1n2h3t4m5h6t7d8h9t0'. Here 'n' indicates milliard, 'm' million, 'd'
thousand, 'h' hundred and 't' ten.
The rules included in Table 5.I perform the task. First some often occurring
constructs are defined. D is an arbitrary digit, B is a boundary at
the righthand side of a trio, and DDD is a trio into which the hundreds and
tens are already inserted. Then the rules are stated. Each time, between
two digits the appropriate unit is inserted. Tens will be inserted before any
digit which has a trio boundary on its right side (the first rule). The first
time this will be a space, but the second time a 'd' (the thousands marker)
will be present. Similarly, the hundreds are only inserted when the tens have
just been inserted, the thousands only when a space (or other non-segment)
delimits the trio, and so on. Note that this scheme makes use of information
provided by a previously applied rule. Note also, because of the left contexts
of all the rules, that only between two digits will unit-markers be inserted,
and that only one rule at a time can operate.
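The output of this module can be reproduced with a short Python sketch. It
does not simulate the backward-scanning rules; instead it computes each marker
directly from the number of digits to the right, which is the quantity the
rules recover step by step. The function name and the list of group markers
are assumptions made for the illustration, and only numbers up to the
milliards are covered.

```python
GROUP_MARKERS = ["", "d", "m", "n"]   # ones, thousands, millions, milliards


def insert_unit_markers(digits: str) -> str:
    """Follow every digit (except the last) by the marker of its relative weight."""
    if not digits.isdigit() or len(digits) > 3 * len(GROUP_MARKERS):
        raise ValueError("expected a normalized integer of at most 12 digits")
    out = []
    for index, digit in enumerate(digits):
        out.append(digit)
        to_the_right = len(digits) - 1 - index   # number of digits to the right
        if to_the_right == 0:
            break                                # the last digit gets no marker
        if to_the_right % 3 == 2:
            out.append("h")                      # hundreds within a trio
        elif to_the_right % 3 == 1:
            out.append("t")                      # tens within a trio
        else:
            out.append(GROUP_MARKERS[to_the_right // 3])   # trio boundary
    return "".join(out)
```

For the example of the text, '1234567890' is turned into
'1n2h3t4m5h6t7d8h9t0', and '1234' into '1d2h3t4'.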
Table 5.I: Module NUMBER_2: inserting unit markers.

definitions
    #   = <-segm>
    DDD = D,h,D,t,D
insertions
    0 → t / D _ D,B
    0 → h / D _ D,t
    0 → d / D _ DDD,#
    0 → m / D _ DDD,d
    0 → n / D _ DDD,m
    0 → o / D _ DDD,n

NUMBER_3: With these unit-markers the task of inverting digits can be
tackled. As stated, normally inversion takes place between 13 and 99, that
is around any place where now a 't' is present. However, this is not the only
place, for numbers consisting of 4 digits also exhibit this phenomenon. In
English 'one thousand two hundred' is as correct as 'twelve hundred', but in
Dutch the latter is preferable. This means that 1234, which is represented
by 1d2h3t4 at this stage, should be transcribed to 2t1h4t3 in the next stage.
Note that apart from inversion the units have changed as well. This string
now codes twelve (2t1) hundred (h) four (4) and thirty (t3). Inversion around
the thousands-marker 'd' also occurs for numbers bigger than one million, the
digits of which for hundred-thousand and ten-thousand are zero: 1002300
is pronounced 'one million twenty three hundred'. If one of these digits is not
zero, the thousands digit clings to the thousands trio: 1032300 is pronounced
'one million, two and thirty thousand, three hundred', and thus only the
'normal' inversion (within a trio) occurs. The normal and special inversion
can be expressed by the following two rules:

<+dig>i , t , <+dig>j  →  <+dig>j , t , <+dig>i                             (5.3)

              [ <+dig>j ]                                 {    #    }
<+dig>i , d , [   ¬0    ]  →  <+dig>j , t , <+dig>i   /   { 0,h,0,t } _     (5.4)
The first rule deals with the normal cases and inverts all digits around a
't'. The second rule deals with the thousands case and replaces the thousands
marker by a tens marker. The exclusion of the zero in the focus ensures
that 'multiples' of thousands are not included. The left context ensures that
inversion only applies in the appropriate circumstances.
Since in TooLiP the i and j indices only select the unequal digits, one
additional rule is needed for the 1100, 2200, etc. cases:

              [ <+dig>i ]                                 {    #    }
<+dig>i , d , [   ¬0    ]  →  <+dig>i , t , <+dig>i   /   { 0,h,0,t } _     (5.5)

Note that for normal inversion such a rule is not necessary, since in the cases
of 11, 22, etc., nothing needs to be changed.
This module, too, must scan from right to left (←). If it were scanned
forwards, rule (5.3) would perform normal inversion inside the second trio of a
number like 1002345 (.002... → .0h0t2d... → .0h2t0d...) which is not the
intention, in this case. If the input string is scanned backwards, the desired
effect will be achieved, since now rule (5.4) applies at the appropriate place
and blocks rule (5.3) from working in the second trio. This is shown in the
derivation below. The internal situation, when the inversion module is called
for, is shown in (5.6).
input:   1 m 0 h 0 t 2 d 3 h 4 t 5
                                 ↑          (5.6)
output:

Now, rule (5.3) applies, plus an additional copy action, which is performed
since there are no rules which apply for the 'h':

input:   1 m 0 h 0 t 2 d 3 h 4 t 5
                         ↑                  (5.7)
output:                    h 5 t 4

Then, rule (5.4) applies:

input:   1 m 0 h 0 t 2 d 3 h 4 t 5
                   ↑                        (5.8)
output:              3 t 2 h 5 t 4

Rule (5.3) now cannot apply any more because the matching position has
been advanced too much to the left. Thus, the other digits and unit markers
will simply be copied.
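The backward scan of the derivation can be imitated with a small Python
sketch. It is a simplified stand-in for rules (5.3) to (5.5), written under
several assumptions: the function name is invented, the non-segment '#' is
taken to be the start of the string or white space, and equal digits are
simply swapped (which leaves the string unchanged), so that no separate
counterpart of rule (5.5) is needed for the thousands case.

```python
def invert_digits(marked: str) -> str:
    """Scan a unit-marked number from right to left and invert digit pairs."""
    pieces = []            # output pieces, collected from right to left
    pos = len(marked)      # everything to the right of pos has been handled
    while pos > 0:
        focus = marked[max(0, pos - 3):pos]
        left = marked[:max(0, pos - 3)]
        if len(focus) == 3 and focus[0].isdigit() and focus[2].isdigit():
            first, marker, second = focus
            # Thousands case: 'd' followed by a non-zero digit, preceded by a
            # non-segment or by zero hundred-thousands and ten-thousands.
            special = (
                marker == "d"
                and second != "0"
                and (left == "" or left[-1].isspace() or left.endswith("0h0t"))
            )
            if special or marker == "t":
                pieces.append(second + "t" + first)
                pos -= 3
                continue
        pieces.append(marked[pos - 1])   # no rule applies: copy one symbol
        pos -= 1
    return "".join(reversed(pieces))
```

Applied to '1m0h0t2d3h4t5' the sketch passes through the same stages as the
derivation: first '4t5' is inverted, then 'h' is copied, then '2d3' is replaced
by '3t2', after which the remaining symbols are copied.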
NUMBER_4: The and-insertion is governed by the next module. There is a
small point of attention. In Dutch we say 'six teen' to 16 but 'six and twenty'
to 26. Only numbers above 20 have an 'and'-insertion. Further, multiples
of 10 also lack the 'and'; we don't say 'zero and twenty', but just 'twenty'.
Nevertheless, one rule suffices, (5.9):

t → &,t /          (5.9)

At this stage, the digits have been inverted, thus the restriction of a
multiple of ten is imposed by the left context and the restriction of 20 and greater
by the right context. The additional restriction that it concerns digits is to
ensure that normal text containing a 't' will not receive an additional & (the
symbol coding the 'and').
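The two context restrictions can be illustrated with a regular expression in
Python. This is a sketch of the described behaviour, not a transcription of
rule (5.9): it is assumed here that the left context is a non-zero digit and
that the right context is a digit of 2 or more, and the function name is
invented.

```python
import re


def insert_and(inverted: str) -> str:
    """Insert '&' before a tens marker that stands between suitable digits."""
    # Left context (assumed): a non-zero digit, so multiples of ten are skipped.
    # Right context (assumed): a tens digit of 2 or more, so 13 to 19 are skipped.
    return re.sub(r"(?<=[1-9])t(?=[2-9])", "&t", inverted)
```

Under these assumptions '2t1h4t3' (1234) becomes '2t1h4&t3': no 'and' is
inserted in 'twelve', but one is inserted in 'four and thirty'.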
NUMBER_5: The module for spelling out the numbers has now become fairly
straightforward. The units are marked, digits are inverted, the 'and' marker
is inserted, and thus most of the symbols just code a sequence of letters. The
only part which needs some attention is the units whose value is zero or one.
Generally, if a digit is zero, the accompanying unit marker must not be
pronounced. This can be dealt with using the following rules:

0,h  → 0          (5.10)
t,0  → 0          (5.11)
zero → 0          (5.12)

When all three digits of a trio are zero, the unit marker accompanying the
trio must not be pronounced either. This is accounted for by rule (5.13):

0,h,0,t,0,{t} → 0          (5.13)

In contrast to all other zeros, a single zero must be pronounced ('nul' is
the Dutch pronunciation for 'zero'):

zero → n,u,l / # _ #          (5.14)

Finally, one extra rule is needed to deal with a special case. Numbers of
the type 1002345 have a slightly deviating form when they enter this module:
'1m0h0t3&t2h5&t4'. The '0' between the 'h' and the 't' denotes tens (of
thousands). This is in contrast to all other digits before a 't', which denote
the 'ones' due to the inversion. Only in this specific case, consisting of numbers
with two zeros on the fifth and sixth digit (...D00DDDD), no inversion has
been applied (in the trio of thousands) and therefore the zero and the 't'
should be deleted:

0,t → 0 / _ D,(&),t          (5.15)

The right context specifies the circumstances in which this should happen.
Only when a tens marker is followed by a digit and another tens marker is
this the case.
When a digit's value is one, this number is generally pronounced. An
exception to this is when it concerns hundreds and thousands. In Dutch we
say 'hundred three and twenty' to '123' rather than 'one hundred ...'. This is
expressed in rule (5.16):

              { h }
1 → 0  /  _   { d }          (5.16)

Here, too, one special case has to be dealt with. In numbers of the form
'...D0010DD' the '1' may not be pronounced as it denotes 'one thousand' just
like the previous case. The '1', however, has become separated from the 'd':
'...Dm0h1t0d0hDtD', since normal inversion has been applied. Therefore,
only if the '1' is enclosed by two zeros must it be deleted. However, by the
time we reach the '1' in the rule base, part of the string will already be spelled
out (the part to its left). In the left context we only have available the
spelled-out form. Thus, the spelled-out form of 'h' (= honderd) may not be present,
since when it is, apparently the digit of the hundreds is not zero. In the right
context the second digit, originally directly to its left, must also be zero. This
is expressed in (5.17):

1 → 0 / ¬[h,o,n,d,e,r,d] _ t,0,d          (5.17)

Now, spelling out the codes is trivial, and consists of some thirty rules of
the following type:

5   → v,i,j,f                (5.18)
t,5 → v,i,j,f,t,i,g          (5.19)

The full rule set of module NUMBER_5 is included in the appendix.
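How rules of this type turn codes into letters can be sketched in Python with
a lookup table and a longest-match strategy. Only the entries for '5' and
't5' come from rules (5.18) and (5.19); the entries for '6' and '&' (Dutch
'zes' and 'en') and all names in the code are added here for illustration, and
the longest-match strategy is an assumption about how the rules would be
ordered.

```python
SPELLING = {
    "t5": "vijftig",   # rule (5.19)
    "5": "vijf",       # rule (5.18)
    "6": "zes",        # assumed entry, not quoted in the text
    "&": "en",         # assumed entry for the 'and' marker
}


def spell_out(codes: str) -> str:
    """Replace codes by letters, trying the longest code first at each position."""
    keys = sorted(SPELLING, key=len, reverse=True)
    out = []
    pos = 0
    while pos < len(codes):
        for key in keys:
            if codes.startswith(key, pos):
                out.append(SPELLING[key])
                pos += len(key)
                break
        else:
            out.append(codes[pos])   # unknown symbol: copy it unchanged
            pos += 1
    return "".join(out)
```

With this table the code string '6&t5' for 56 is spelled out as
'zesenvijftig'.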
Looking at the functionality of the modules we can observe the following. The
task of the first two modules, normalizing and determining the units, needs
to be done in all languages using arabic numbers. Inversion of digits and
'and'-insertion, which is achieved by the next two modules, is necessary for
Germanic languages. The last module is language-dependent but for
another language it will mainly consist of translating the spelling of numbers.
So, for instance, German numbers can be handled by translating the last
module. The same goes for English numbers, only in this case the third and
fourth module can be omitted.
It would be too strong a conclusion to put forward that spelling out numbers
is solved for all languages using arabic numbers, since many languages
will probably have their own specific exceptions. In French, for instance, numbers
between 90 and 100 ('quatre vingt dix neuf') behave so differently from
the numbers between 20 and 90 that this cannot be handled elegantly in the
language dependent module. Probably this would motivate a separate
'90-exception' module. The scheme, however, of describing different phenomena
in separate modules is of course advantageous, and can probably be followed
for a large number of languages.
As to the implementation in TooLiP some further observations can be
made. In the first place the possibility of scanning the input from right to
left is advantageous. In spite of the fact that we read from left to right in
many languages, some phenomena can be dealt with better from right to left.
Integer numbers, which have their anchor at the right side, are an example of
this, but also stripping suffixes can be done best from right to left.
A second observation is that using the output buffer as a reference for
those contexts for which it already contains information (as explained in the
previous chapter, and which for the user has the functionality of working with
one string, in which the transformations are available directly) is generally
advantageous, too. The unit insertion module advantageously makes use of
this by referring to the last inserted unit marker. Thus the rules become
simpler, and therefore more transparent and faster. Only in one case, rule (5.17),
one might argue that the rule becomes more complicated than necessary.
A third observation is that metathesis, inversion of characters as
exemplified in rules (5.3) and (5.4), is advantageous for a number grammar. In
fact, it was first implemented when this application was developed, for in the
conversion of normal words into a sound representation this had not yet
appeared to be necessary (nor has it been used in the final set of rules). The
alternative for rule (5.3) would in this case be an enumeration of 90 special
case inversion rules of the type:

5,t,6 → 6,t,5          (5.20)

which is, of course, not very elegant.
5.2.2 Linguistic modules
Functionality
The central part of the conversion system consists of the linguistic modules
which perform the grapheme-to-phoneme conversion (see Fig. 5.1). This com-
prises a crude morphologic analysis, letter-to-sound transcription and stress
assignment. The great majority of the 550 rules which define the total con-
version are included in these seven modules.
The first three modules are a rule-based approach to the morphological
problem. By means of rules it is attempted to determine the important mor-
phological boundaries. No morpheme lexicon is utilized. Of course, in the
morphological sense errors will be made, but for grapheme-to-phoneme
conversion it does not seem necessary to have the full morphologic structure of
the word available in all cases. Some boundaries are crucial in the sense that
missing them will introduce a pronunciation error, other boundaries are not.
The first three modules aim at locating these important boundaries.
Since prefixes and suffixes form a relatively small and closed group they can
be identified reasonably elegantly by means of rules. Morpheme boundaries in
compound words (which occur frequently in Dutch) can often be determined
on morphosyntactic grounds. The sequence 'kp', for instance, does not occur
in Dutch morphemes, and therefore a morpheme boundary between these two
letters may be assumed. Combining the two, stripping affixes and applying
morphosyntactics, it appears (Berendsen, Lammens & Van Leeuwen, 1989)
that only few crucial boundaries are missed.
The first 'morphological' module is the largest of all modules and scans
the input from right to left (←). It strips the suffixes and inserts morpheme
boundaries. The boundaries which are inserted before the suffixes differ,
where needed, in coding in order to indicate whether or not the suffix is
stress-neutral. The second module scans its input
(this is the output of MORPH_1) from left to right (→) and strips the prefixes.
The last morphological module serves to delete some of the boundaries which
were inserted wrongly by the previous two modules, which apparently were
too difficult to exclude at the appropriate stage.
The fourth module, GRAPHON, deals with the second major task, the
letter-to-sound transcription. Since most of the morpheme boundaries are
placed correctly, this module consists to a considerable extent of relatively
simple rules without too many context restrictions. Quite some effort, however,
is put into controlling the pronunciation of loan words, which in Dutch
are extracted from several foreign languages.
The last task, assigning word stress, is separated into three modules, too.
Rather than assigning stress to a syllable, stress is associated with vowels in
this implementation, since accent-lending pitch movements are related to the
vowel in Dutch ('t Hart & Cohen, 1973; 't Hart & Collier, 1975). Stress,
in TooLiP, is implemented by means of labels rather than by inserting special
characters as is done with the morpheme boundaries.
The assignment of stress is more or less an implementation of the current
theories on stress assignment in Dutch (Nunn, 1989; Kager, Visch &
Zonneveld, 1987), expressed in the TooLiP formalism. Not every notion of
non-linear representation, such as binary feet, can be expressed in TooLiP,
but the effect of heavy and super-heavy syllables can be captured in the rules.
In principle, primary and secondary stress are distinguished. In the first module,
all vowels (except the schwa) are marked such that they can potentially
receive stress. Then, in the second module, rather than directly marking
primary stress, the appropriate vowels are marked with secondary stress, making
use of the previously inserted morpheme boundaries. In the last module, the
appropriate secondary stress labels are raised to primary stress and other
vowels (which have not received any stress at all) are reduced or shortened.
Discussion
Since these modules constitute the main application, it is interesting to study
the rules which constitute these modules, for they reveal which constructs are
often used and therefore useful in practice, and which are not.
Two of the observations made for the rule grammar can also be made here.
By having the applied transformations available directly, and by being able
to scan the input in inverse order, suffixes can be stripped off recursively. The
suffix boundary which is inserted by one rule triggers a next, similar rule.
Another observation is that macro patterns, which abbreviate frequently
used constructs, are used extensively. Not only the more common voc (vowel)
and cons (consonant) definitions, but also more complicated structures are
used to indicate well-formed consonant clusters, syllable boundaries and vowel
clusters.
Next, nearly all constructs which are available to the user are employed,
the operators presented in chapter 2 are used frequently throughout all
modules. In one of the stress modules, extensive use has been made of the
possibility to refer to graphemes and phonemes simultaneously. The
two-dimensional notation in which alternation and simultaneity is expressed has
been used to good advantage for devising patterns that would be utterly
unreadable in a one-dimensional representation.
5.2.3 Other possible applications
Virtually any rule-governed segmental conversion scheme can be defined in
TooLiP. Apart from the applications which have been discussed above, TooLiP
can therefore also be used for other purposes. In the first place it can be used
to describe or deal with other processes in a text-to-speech system, such as the
recognition and spelling out of abbreviations, other types of numbers
(telephone numbers, reals, etc.), or addresses. In the second place it can serve
as part of other modules in the text-to-speech system. For instance, TooLiP
could be used to perform morphological tasks, such as correcting for root
mutation due to affixation, or deriving the singular forms from the plural. Also
applications outside the text-to-speech range can be thought of. For example,
one could build a module which can identify syllables, with which one
could build an automatic hyphenation machine. Another possibility might be
to automatically determine inflectional forms. Or one can go the other way
around and define phoneme-to-grapheme conversion. Also, a braille translation
machine could be built, i.e., a program that translates text files into
ASCII representative grade 2 braille files. Finally, also other rule-based
symbolic transcriptions might possibly be expressed with TooLiP. For example,
Bliss-to-text or Dominolex-to-text might be possible. Although these last
applications are somewhat speculative, a Bliss-to-text system has been reported
(Carlson, Granstrom & Hunnicutt, 1981), implemented with functions which
are either present in TooLiP or can be simulated.
5.3 The complementation operator
In chapter 3 the introduction of the complementation operator in the
semi-compositional formalism has been discussed. The semi-compositional formalism
is a compromise between the practical needs of excluding the explicit
nofits of a complemented structure and the theoretical elegancy of
compositionality. It was decided to implement the semi-compositional formalism,
since on the one hand it would 'succeed' in excluding the explicit nofits where
the compositional formalism would 'fail', and on the other hand would fail
only on patterns which are highly unlikely to be specified in practice. The
patterns for which the compositional case would fail, however, are such that they
can be expected to be used in practice, and thus it was felt that TooLiP should
be able to handle these cases.
Now that the first major application has been completed, it is interesting to
see how frequently the added functionality has been used. Added functionality
means the functionality provided by the semi-compositional formalism which
is not present in the compositional formalism. It is also interesting to see
whether, despite their low probability, patterns are used for which the
semi-compositional formalism fails.
It turns out that complementation is used quite frequently. In a total of
some 700 rules the operator is used 214 times (for comparison: this is twice
as often as simultaneity, as often as optionality and one fifth as often
as alternation). Its use can be subdivided into three classes. The first class
is where complementation is used to exclude cases from a limited set, for
instance as in pattern (5.21)¹:

    [pattern not legible in the source]                              (5.21)

In other words, the universe to which the complementation applies is
provided by the user.
The second class is where complementation is used without explicitly
specifying the universe and where the resulting pattern is consistent, as in
pattern (5.22):

    { ¬a,a }
    { ¬o,o }                                                         (5.22)
    { ¬u,u }

¹ These three examples, (5.21)-(5.23), are directly taken from the existing
rulebase.
The third class is where complementation is used without explicitly
specifying the universe and where the resulting pattern is inconsistent, such
as pattern (5.23):

      { i,n,g         }
    ¬ { voc, <-segm>  }                                              (5.23)
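The difference between the second and the third class can be illustrated with a small sketch. It assumes, as a simplification, that a complemented structure is "consistent" when all of its alternatives cover the same number of character positions; the list-of-lists representation and the function names are hypothetical and not part of TooLiP.

```python
# Hypothetical representation: a complemented structure is a list of
# alternatives, and each alternative is a list of primitives (one per position).

def alternative_lengths(alternatives):
    """Return the set of character counts covered by the alternatives."""
    return {len(alternative) for alternative in alternatives}


def is_consistent(alternatives):
    """Treat a structure as consistent when all alternatives have the same
    character count (an assumed reading of 'consistent')."""
    return len(alternative_lengths(alternatives)) == 1


single_vowel = [["a"]]                                 # like the complemented 'a' in (5.22)
ing_or_vowel = [["i", "n", "g"], ["voc", "<-segm>"]]   # like the structure in (5.23)

print(is_consistent(single_vowel))   # True
print(is_consistent(ing_or_vowel))   # False
```

Under this reading the universe of a consistent structure can be derived from its single character count, whereas an inconsistent structure offers two candidate counts (here 2 and 3).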
All complemented structures fall into one of these three classes. The first
class comprises 50% of all cases, the second 42.5% and the third 7.5%. This
means that in 50% of the cases the complemented patterns are naturally
suited also for a compositional formalism where the universe is defined as
U = Σ*. If the universe were indeed defined as such, complementation could
only be used in practical situations if the universe is specified explicitly.
Thus it turns out that in 50% of the cases the user feels the need to do so
anyhow. On the other hand it probably also means that in the other 50% of the
cases he is glad that he does not have to do so.
In 92.5% of the cases the compositional formalism where the universe is
defined as in Table 3.IX is satisfactory, since in the additional 42.5% the
universe is not specified explicitly, and only consistent patterns result.
Only in 7.5% of the cases, 16 in total, does the functionality of the
semi-compositional formalism appear to be necessary. This is not a large
number, but nor is it negligible. No patterns have been specified of a complex
form (for instance, double complementation of inconsistent patterns) for which
the semi-compositional formalism is not adequate. This means that the
practical choice of implementing the compromise between practical needs and
compositionality has not been invalidated by this particular application.
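The figures quoted above can be cross-checked with a few lines of arithmetic. The sketch below only uses the numbers given in the text (214 uses, shares of 50%, 42.5% and 7.5%); the dictionary keys and the function name are illustrative labels.

```python
TOTAL_USES = 214          # complementation uses reported in the text
SHARES = {                # class shares reported in the text
    "explicit universe": 0.50,
    "implicit, consistent": 0.425,
    "implicit, inconsistent": 0.075,
}


def approximate_counts(total, shares):
    """Convert percentage shares into rounded absolute counts."""
    return {name: round(total * share) for name, share in shares.items()}


counts = approximate_counts(TOTAL_USES, SHARES)
# Cases that a compositional formalism (universe as in Table 3.IX) would handle.
compositional_ok = counts["explicit universe"] + counts["implicit, consistent"]

print(counts)
print(compositional_ok, "of", TOTAL_USES, "uses need no semi-compositional treatment")
```

The third class rounds to the 16 cases mentioned in the text, and the first two classes together account for the 92.5% figure.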
On the whole it can be concluded that the choice of defining the universe as
the character count of a pattern has been validated by practical use. Having to
define the universe for each complemented pattern explicitly would needlessly
burden the user and mystify the patterns. Furthermore, the choice to exclude
explicit nofits by means of the semi-compositional formalism has on the one
hand not been invalidated, but on the other hand has not fully been validated
either, since this functionality has been needed too infrequently. Given the
theoretical objections to the semi-compositional formalism (not fully
excluding explicit nofits, not being fully compositional, and its relative
complexity), it can be argued that these objections and the increased
computational complexity cannot be justified by the relatively small
ergonomic advantages for the user. With some rewriting these few cases can
probably be made suited for the compositional formalism.
5.4 TooLiP in relation to comparable systems
In this section TooLiP will be compared with a number of existing systems,
most of which are used as the grapheme-to-phoneme conversion component
of a text-to-speech system. The comparison will be guided by a number of
properties which characterize the functioning and possibilities of such
systems. First the systems will be introduced, and then each of them,
including TooLiP, will be viewed with regard to those properties.
The systems included in the comparison are all more or less high-level
rule-based systems. Systems which approach grapheme-to-phoneme conversion,
the application of TooLiP, from a different angle, such as a lexicon-based
(Lammens, 1989) or morpheme decomposition approach (Pounder & Kommenda,
1986), differ too much from TooLiP for there to be any point in comparing
them. Also, other systems which are rule-based, but in which the rules are
compiled by hand into a general-purpose programming language (Daelemans,
1987; Rühl, 1984) do not have enough in common with TooLiP for an
interesting comparison. Finally, some well-known systems, such as MITalk
(Allen, Hunnicutt & Klatt, 1987), cannot be included in the comparison as
the linguistic contents rather than the technique of implementation have
been described in the literature. In the following order, the systems to be
compared with TooLiP are: Rulsys, Fonpars, SPL, Depes, TWOL, Delta and
Parspat. The first four systems, like TooLiP, are based on SPE-based
notations borrowed from generative phonology, whereas the latter three each
have their own approach to rule-based conversion. The systems will now be
introduced briefly.
Rulsys is the grapheme-to-phoneme conversion component of the multi-lingual
    text-to-speech system devised by Carlson & Granström (1976) (see also
    Carlson, Granström & Hunnicutt, 1989). It is one of the earlier
    systems which provide an environment for the development of rules.
    The rules are expressed by the user in a format borrowed from
    generative phonology, to be compiled and executed by the computer.
Fonpars is the system developed in Nijmegen by Kerkhoff, Wester & Boves
    (1984), especially for the linguistic front end of a text-to-speech
    system. As in Rulsys, the rules are expressed in a phonological format.
    Here, these rules are compiled into a Pascal program, which, in turn,
    needs to be compiled to obtain an executable program.
SPL was developed in Copenhagen by Holtse & Olsen (1985) and closely
    resembles the previous two. As in the above two systems, each rule
    operates on the output of the previous rule.
Depes was developed by Van Coile (1989) in Ghent and also belongs to the
    SPE-based rule compilers. Inspired by the Delta System, instead of
    having only one buffer available to store input, intermediate results
    and output, the system supports multiple layers in which related
    information can be stored and synchronized, for instance the relation
    between graphemes, phonemes and pitch movements. Also, control
    structures can be used to control the conversion process more freely.
TWOL was developed by Koskenniemi (1983) at the University of Helsinki
    (see also Karttunen, Koskenniemi & Kaplan, 1987) and originated as
    a morphological processor. It differs from the SPE-based systems in
    that the rules are of a declarative nature rather than imperative. The
    rules are therefore not ordered and can lead to conflicts; for instance,
    two rules which both apply may want to change a character differently.
    Such conflicts can be detected by a computer but must be resolved by
    the user.
Delta is the system developed by Hertz in Ithaca, N.Y. (Hertz, Kadin &
    Karplus, 1985). It was the first system to introduce a whole new
    approach to the linguistic front end of a text-to-speech system. On the
    one hand the multi-layered, synchronized data structure, called a delta,
    was introduced, and on the other hand powerful control structures and
    functions to manipulate the delta became available. The customary
    notational conventions of generative phonology are lost, however, which
    is partly due to the multi-layered data structure.
Parspat is a system developed by Van der Steen (1987) in Amsterdam. Rather
    than being devised only for grapheme-to-phoneme conversion, the system
    is meant to generate programs for recognition, parsing and transduction.
    Grapheme-to-phoneme conversion is merely an application of this system.
    This system is interesting as it emerged from computer science
    rather than from the field of linguistics, and thus the solutions to
    some problems differ from those in other systems. As in TWOL, the
    transcription rules are declarative rather than imperative. An
    interesting aspect of the system is that a complementation operator has
    been introduced fully in the regular-expression-like formalism, be it in
    a different way than proposed in chapter 3.
These seven systems plus TooLiP will be discussed with regard to some
characterizing properties, some of which will be subdivided. Not all systems
can be viewed for all properties, since not every detail is reported for every
system in the literature, but on the whole a reasonable overview can be given.
Five main properties will be considered.
1. The first property concerns the formalism or language in which the
   linguistic knowledge must be expressed. This is subdivided into five
   characteristics: (a) whether the formalism is based on the formalism
   introduced by Chomsky & Halle (1968) in the Sound Pattern of English,
   which became very popular in generative phonology. This will be
   referred to as 'SPE-based'; (b) whether text specifications (patterns)
   are represented one-dimensionally or two-dimensionally; (c) whether and
   under what circumstances complementation can be used; (d) whether
   explicit control structures can be used to control the conversion
   process; and (e) whether numerical functions can be employed.
2. The second main property concerns the central data structure on which
   the formalism operates. This can be multi-layered or not. If it is
   multi-layered, the system provides information on different levels of
   representation, which are possibly co-ordinated. If the system is
   multi-layered, it will be discussed whether this induces the possibility
   of co-ordinated use of the different levels in the rules and whether it
   can be used to derive input-to-output correspondences.
3. The third property is the manner in which the linguistic rules are
   evaluated, i.e., whether the rules are evaluated in order, whether one
   can define the scanning direction, and what assignment strategy is
   applied.
4. The fourth property concerns whether or not a system provides support
   for the development of rules, and if so, which type of support is given.
5. The last property concerns some implementation features, viz. the
   computer language of implementation and the nature of the internal
   operator, i.e., whether it is a compiler or not.
5.4.1 The formalisms
As already mentioned, five of the systems are explicitly SPE-based, viz.
Rulsys, Fonpars, SPL, Depes and TooLiP. TWOL has a representation which
resembles SPE-type rules, but they are of a declarative nature. This means
that they describe a relation between one representation and another rather
than a description of how to derive the output from the input. In TWOL the
application concerns the relation between a lexical representation (how the
entries are stored in a morph lexicon) and a surface representation (how the
entries are written, for instance wolf-s ↔ wolves). Delta and Parspat have
more deviating rule formats. In Delta the rules are adapted to the
programming language, which makes them more powerful but somewhat less
readable. In Parspat the rules are expressed in the 'unifying formalism'
described in Van der Steen (1987), which resembles the declarative BNF
notation for Chomsky type-0 grammars. However, for both Delta and Parspat
SPE-type rules can be transformed into the local formalism to produce the
same effect. Except for TooLiP, none of the systems features a
two-dimensional representation of patterns. As argued in chapter 2, a
two-dimensional representation takes more space and (off-line) computational
effort, but results in more transparent rules; vertical positioning denotes
alternatives for the same position or co-ordination between two layers, and
horizontal positioning denotes juxtaposition. For comparison, in each
formalism the pronunciation rule for the 'c' which precedes an 'e' or an 'i'
has been expressed in Table 5.II.
Most systems provide more or less the same operators to specify patterns
as provided by TooLiP (see chapter 2). The complementation operator,
however, is interesting since its introduction gives rise to unexpected
problems, as discussed in chapter 3. Therefore, the presence and status of
this operator has been investigated for the systems being compared. Not all
reports in the literature are equally detailed, and in particular with
respect to the complementation operation information is scarce, so the
result will be presented with some reticence.
Most probably, SPL does not feature complementation as an operator. In
an internal report (Holtse & Olsen, 1985) the language's syntax is described,
which does not mention a complementation operator. Probably, the same goes
for Rulsys, Depes and TWOL, but here the available information was less
detailed, none of which contained any reference to complementation. In
Fonpars, Delta and Parspat complementation is available. In Fonpars a
restricted version is available in the sense that only constructions which
contain strings of equal length may be complemented. This is more restricted
than demanding patterns to be consistent, but syntactically less involved. In
Delta, inconsistent patterns may be built, but from the documentation (Hertz,
1989) it is unclear what the expressions mean. In Parspat complementation is
defined exactly in accordance with Table 3.V, in other words in the
compositional extension of regular expressions. As argued, this definition is
not satisfactory for inconsistent patterns, but it is unclear whether this
has been recognized in Parspat.

Table 5.II: The pronunciation rule for the 'c' which precedes an 'e' or an
'i', expressed in each formalism.

Formalism:   Paper & Pencil
Expression:  c -> s / _ { e }
                        { i }
Remark:      A 'c' is pronounced as an 's' if it is followed by an 'e' or
             an 'i'.

Formalism:   Rulsys
Expression:  crule c -> S / _ e
             crule c -> S / _ i
Remark:      Apply rule only once. 2 rules: the presence of an alternative
             operator could not be established in the literature.

Formalism:   Fonpars
Expression:  c -> S / --- {e/i}
Remark:      ---: focus mark.
             /: separates alternatives.

Formalism:   SPL
Expression:  replace c / _ e -> S
             replace c / _ i -> S
Remark:      replace: type indication of the rule. Structural change and
             context have been exchanged.

Formalism:   Depes
Expression:  |c| ---> |<2:s>| / _ {e,i}
Remark:      Focus and change field are enclosed by '|' markers.
             c ---> <2:s>: a 'c' in the default first layer (graphemes) will
             be associated with an 's' in the second layer (phonemes).

Formalism:   TWOL
Expression:  c:s <= _ e: | i: ;
Remark:      A colon (:) separates the two layers. Therefore the rule
             states: a 'c' in the input layer is always (<=) associated with
             an 's' in the second layer, if the 'c' is followed by an 'e' or
             an 'i' in the input layer.

Formalism:   Delta
Expression:  _^1 c !^2 {e|i} ->
             insert [s] ^1...^2
Remark:      _^1: the anchor of the expression, assumed to be positioned to
             the left of the 'c'.
             !^2: the sync mark to the right of 'c' is stored in ^2.

Formalism:   Parspat
Expression:  c,S,Rc1 :: c,Rc1.
             Rc1 :: e;i.
Remark:      BNF-like notation. The presence of the 'c' in the left-hand
             side of the first expression is not understood by the author.

Formalism:   TooLiP
Expression:  c -> S / _ {e}
                        {i}
Remark:      The 'or' braces have to be doubled to simulate the large
             enclosing braces.
A fourth aspect is the possibility to use control structures, i.e., the
possibility to explicitly control the conversion process. In generative
phonology these are not present, and only in Depes and Delta are they fully
available, due to the programming language character of the formalism. In
Rulsys and SPL some control is possible by indicating whether the rule should
be applied once or cyclically (Rulsys) or whether the rule is obligatory or
optional (SPL), but this can hardly be counted as full control structures. In
the other systems (except perhaps Parspat) flow of control is determined by
the system itself and cannot be manipulated by the user.
A final aspect which characterizes the formalisms is whether or not
numerical functions are available. For front-end linguistic processing this
generally is not necessary, but in more phonetic processing, such as control
of segmental duration, such functions are needed, and discrete segments are
still the main units. Therefore, several systems which originated for
linguistic purposes have evolved to be able to deal with more phonetically
oriented rules. Examples are Rulsys, SPL and Fonpars. A future version of
Delta has also been announced to support these facilities. The other systems
either do not mention such facilities, or do not support them (TooLiP).
5.4.2 Central data structure
A multi-layered data structure is a convenient mechanism to be able to access
derivational information in the rules and to be able to determine
input-to-output relations. For this purpose the system should ensure that
synchronization between buffers is reliable; this is the consistency
requirement of section 4.4.2. For instance, if a character is inserted in one
layer, some of the characters in the other layer must be re-associated with
the characters whose indices have changed.
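The consistency requirement can be illustrated with a small sketch. It is not TooLiP's internal data structure: it assumes, for illustration only, that the two layers are stored as a list of aligned (graphemes, phonemes) segments, so that an insertion shifts indices without breaking any association. All names are hypothetical.

```python
class TwoLayerBuffer:
    """Grapheme and phoneme layers stored as aligned segments."""

    def __init__(self):
        self.segments = []   # list of (graphemes: str, phonemes: list of str)

    def insert(self, position, graphemes, phonemes):
        """Insert an aligned segment; later segments keep their own pairing."""
        self.segments.insert(position, (graphemes, list(phonemes)))

    def append(self, graphemes, phonemes):
        self.insert(len(self.segments), graphemes, phonemes)

    def grapheme_layer(self):
        return "".join(graphemes for graphemes, _ in self.segments)

    def phoneme_layer(self):
        return [p for _, phonemes in self.segments for p in phonemes]

    def source_of(self, phoneme_index):
        """Return the graphemes from which the phoneme at this index derives."""
        if phoneme_index < 0:
            raise IndexError(phoneme_index)
        seen = 0
        for graphemes, phonemes in self.segments:
            if phoneme_index < seen + len(phonemes):
                return graphemes
            seen += len(phonemes)
        raise IndexError(phoneme_index)


buffer = TwoLayerBuffer()
for graphemes, phonemes in [("b", ["b"]), ("u", ["y"]), ("r", ["r"]), ("eau", ["o:"])]:
    buffer.append(graphemes, phonemes)

print(buffer.source_of(3))        # eau
buffer.insert(0, "", ["#"])       # a symbol inserted in the phoneme layer only
print(buffer.source_of(4))        # eau (index shifted, association intact)
```

Storing the association per segment rather than per index is one design choice among several; an index-based scheme would have to renumber the links after every insertion, which is the re-association the text refers to.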
Of the eight systems, Rulsys, Fonpars and SPL do not have a multi-layered
data structure. TWOL and TooLiP both have two layers and keep the
synchronization consistent. Depes and Delta leave it to the user to define
the number of layers. Parspat probably has a notion of multiple layers, that
is, it could not be found explicitly in the literature, but some examples
show that one can refer to graphemes as well as phonemes within one rule. All
systems with multiple layers support buffer synchronization. Depes and
Parspat are unclear as to whether co-ordinated information can be used, such
as referring to an /o:/ which is derived from 'eau'; the other three support
co-ordination. As to overall input-to-output relations, in Depes and Delta
one can write a program to compute these. TooLiP provides an option to file
these relationships. Parspat and TWOL are unclear as to this facility.
5.4.3 Inference
Two different inference mechanisms are used to evaluate the rules in the
systems which are included in the comparison. Declarative rules are evaluated
in parallel, imperative rules sequentially. Declarative rules describe a
relation between input and output, and can therefore often be used in two
directions. In TWOL, for instance, the same morphological rules can be used
to derive the surface structure from the lexical structure (wolf-s → wolves),
or to derive the possible set of lexical structures from the surface
structure (wolves → wolf-s/wolv-s/wolves). Declarative rules are not ordered,
that is, they are applied in parallel rather than sequentially. This has the
disadvantage that rules can easily be conflicting and therefore have to be
disjunct for the purpose of grapheme-to-phoneme conversion. For instance, the
character 'c' can give rise to quite some pronunciations: /s/, /k/, /ʃ/, /x/,
etc. In sequential systems one can first deal with one case (for instance,
when 'c' is followed by 'e' or 'i') and then consider the remainder of the
cases. This means one does not have to consider the previous cases, since
those have already been dealt with, and therefore the individual rules can
be simpler.
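The simplification that ordering brings can be sketched as follows. The rule format and the fallback pronunciation /k/ for the remaining cases are assumptions made for this example; they are not taken from the actual rule base.

```python
# Each rule: (focus character, set of allowed right neighbours or None, output).
# Rules are tried from top to bottom; the first one that applies wins.
ORDERED_RULES = [
    ("c", {"e", "i"}, "s"),   # 'c' followed by 'e' or 'i'
    ("c", None, "k"),         # every remaining 'c'; no "not before e/i" needed
]


def convert(word, rules):
    """Apply the first matching rule to every character of the word."""
    output = []
    for position, character in enumerate(word):
        following = word[position + 1] if position + 1 < len(word) else ""
        for focus, right_context, result in rules:
            if character == focus and (right_context is None or following in right_context):
                output.append(result)
                break
        else:
            output.append(character)   # no rule applies: copy the character
    return "".join(output)


print(convert("cent", ORDERED_RULES))   # sent
print(convert("cat", ORDERED_RULES))    # kat
```

In an unordered (declarative) setting the second rule would have to state explicitly that the 'c' is not followed by 'e' or 'i' in order to stay disjunct from the first.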
On the other hand, imperative mechanisms only describe the path from
input to output, not the other way around. Also they are much more
deterministic by nature and therefore less suited to deal with ambiguities.
Within the sequentially evaluated imperative systems there are two
strategies. One is the 'rule-by-rule' mechanism, where the input string is
completely dealt with by the first rule, before the output is dealt with by
the next rule. The second mechanism is the 'segment-by-segment' mechanism
where the first character consults all rules from top to bottom before the
second character is dealt with. Both strategies can be useful, depending on
the application. Finally, the possibility to determine the scanning direction
is of interest, since for certain applications, such as stripping suffixes,
scanning backwards is convenient.
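The two assignment strategies can give different results for the same rules, as the following sketch shows. It uses context-free single-character rules and assumes that in the segment-by-segment strategy the first applicable rule determines the output; both simplifications are made for this illustration only.

```python
def rule_by_rule(text, rules):
    """Each rule processes the whole string; its output feeds the next rule."""
    for focus, result in rules:
        text = text.replace(focus, result)
    return text


def segment_by_segment(text, rules):
    """Each character consults the rules top to bottom before the next one."""
    output = []
    for character in text:
        for focus, result in rules:
            if character == focus:
                output.append(result)
                break
        else:
            output.append(character)
    return "".join(output)


RULES = [("a", "b"), ("b", "c")]   # hypothetical rules: a -> b, then b -> c

print(rule_by_rule("ab", RULES))         # cc
print(segment_by_segment("ab", RULES))   # bc
```

In the rule-by-rule run the 'b' produced by the first rule is picked up by the second rule, whereas in the segment-by-segment run each input character is converted exactly once.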
TWOL and Parspat are of a declarative nature, evaluating the rules in
parallel. The other systems are of an imperative nature, and hence the rules
are inherently ordered. Most systems provide only a rule-by-rule strategy;
only Fonpars also provides a segment-by-segment strategy. TooLiP provides the
segment-by-segment strategy; modules can be utilized for crucial ordering. Of
the imperative systems most provide both forward and backward scanning;
only SPL and Fonpars do not.
As to dealing with ambiguities, TWOL and Parspat are well suited to
generate and handle them, due to their inference mechanism. SPL has a
mechanism in the imperative setup to express the possibility of ambiguities.
By means of defining a rule as 'optional', both the input and the output of
the rule will be input for the following rules.
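The effect of an 'optional' rule as described above can be sketched as follows. This is not SPL syntax: the rule tuples and the function are hypothetical, and the rules are context-free string replacements for brevity.

```python
def apply_rules(text, rules):
    """Apply rules in order; an optional rule passes on input and output."""
    candidates = [text]
    for focus, result, optional in rules:
        next_candidates = []
        for candidate in candidates:
            changed = candidate.replace(focus, result)
            if optional and changed != candidate:
                next_candidates.append(candidate)   # keep the untouched input too
            next_candidates.append(changed)
        # Remove duplicates while keeping the order of first appearance.
        candidates = list(dict.fromkeys(next_candidates))
    return candidates


RULES = [
    ("c", "s", True),    # optional rule: both 'c' and 's' variants continue
    ("a", "o", False),   # obligatory rule: applied to every candidate
]

print(apply_rules("ca", RULES))   # ['co', 'so']
```

Each optional rule can double the number of candidates, so the set of outputs grows with the number of optional rules that actually apply.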
As the support which the various systems offer for rule development is
hardly documented in the literature, comparison of this characteristic cannot
be complete. Most systems, however, will probably feature some kind of syntax
checking and accompanying error messages, since without it large sets of
rules can hardly be debugged even at the syntactic level. In TooLiP, when a
syntax error has been encountered, the file, place and nature of the error
are reported. In some cases, the error triggers some additional errors, but
these will only occur within the rule in which the first error occurred. The
other rules are parsed as if no error had occurred, so each first error of a
rule can be taken seriously, and generally most error messages are of
diagnostic value. In Delta, on the other hand, one syntax error generally
triggers a whole lot of others (in the user manual the user is warned against
this), so here generally only the first error message is diagnostic. This has
the effect that each real error will generally take an edit session and a
compiler run before the next one can be identified.
Once a program (set of rules) is syntactically correct one is set to the task
of making it semantically correct, that is, getting the program to do what
you want. Apart from the rare occasions on which this is achieved at once,
some effort is generally needed to reach it. The output of the program tells
you something is wrong, but often does not give a clue about why and where it
went wrong, so one will either have to reduce the program to the simplest
version that still contains the error (in which case the error often is
discovered) or one can advantageously use some debugging tools to analyse the
larger and erroneous program directly. Only when the latter fails does one
turn to the former strategy.
As to the debugging tools, two classes can be distinguished. One class is
that of the tracing facilities, the other is often called the debugger. A
tracer, when activated, reports the intermediate stages during conversion,
reports which are generated by the system rather than by the user. Debuggers
typically interrupt the program, whereupon the user can investigate the
internal status by examining variables. Tracers are typically used in
special-purpose programs, where flow of control is more or less the same for
all applications, and the number of central data structures is limited, so
that special display routines can be devised. Debuggers are typically used in
general-purpose programs, where both control flow and data structures are
user-determined. Tracers can often display the relevant information quickly
and transparently, whereas debuggers only interrupt the program, whereupon
the user must search the relevant information interactively. On the other
hand, tracers are pre-programmed; if for some reason there is no access to
sub-parts of the program in which the error happens to be located, the user
has no other option but to reduce the program. Debuggers do have access to
all parts, so generally the user never has to decide upon the reduction of
his program. Thus, on the whole, debuggers are less user-friendly but more
powerful.
Of the eight systems, Fonpars, Parspat and TWOL do not report any
debugging facilities. Depes and Delta feature a debugger where the
breakpoints can be set and examined at run-time, so no compilation is needed
for each new examination. Rulsys, SPL and TooLiP feature a trace facility.
The exact nature of the trace facilities in Rulsys and SPL is not reported,
but in both cases it is claimed that quite a detailed view of the actions
performed by the system may be obtained.
On the trace facilities of TooLiP chapter 2 was not very specific and
therefore some additional information will be given here. TooLiP
distinguishes normal input from commands. Commands start with a dot followed
by some mnemonic code indicating the type of command; '.h', for instance,
gives help on the mnemonic codes by producing a list of commands which are
available to modify the system's settings.
Locating a semantic error is done in three steps. First the module in
which the error occurs must be determined. For this purpose '.m' must be
typed, whereupon the intermediate results of each module are printed. The
module in which the error is located can then be identified. Next, the rule
which is erroneous can be identified by invoking the tracer with '.t'. The
user is prompted for the erroneous module. For that module a shorthand status
report of each rule is given. If a rule has been applied its ranking number
and its contribution to the output buffer is printed. If a rule has not been
applied the pattern which did not match is printed: 'f' if the focus pattern
failed, 'l' for left context, 'r' for right context.
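The shorthand status report of the first tracer level can be imitated with a short sketch. It is only an illustration of the idea: the `Rule` type, the literal string matching, and the order in which focus, left context and right context are checked are all assumptions, not TooLiP's actual matching procedure or output format.

```python
from typing import NamedTuple


class Rule(NamedTuple):
    left: str     # left context (literal string, may be empty)
    focus: str    # focus (literal string)
    right: str    # right context (literal string, may be empty)
    output: str   # contribution to the output buffer


def trace_position(text, position, rules):
    """Report, per consulted rule, whether it applied or which part failed."""
    report = []
    for number, rule in enumerate(rules, start=1):
        if not text.startswith(rule.focus, position):
            report.append((number, "failed", "f"))
        elif not text[:position].endswith(rule.left):
            report.append((number, "failed", "l"))
        elif not text.startswith(rule.right, position + len(rule.focus)):
            report.append((number, "failed", "r"))
        else:
            report.append((number, "applied", rule.output))
            break   # later rules are not consulted for this position
    return report


RULES = [
    Rule("", "k", "", "k"),
    Rule("s", "c", "", "x"),
    Rule("", "c", "a", "k"),
    Rule("", "c", "e", "s"),
]

for line in trace_position("cent", 0, RULES):
    print(line)
```

For the 'c' of "cent" the report shows a focus failure for rule 1, a left-context failure for rule 2, a right-context failure for rule 3, and the application of rule 4.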
A more detailed view of a rule's operation can be obtained by invoking
the second level of the tracer. Once again giving the command '.t' will
achieve this ('.t1' and '.t2' are the explicit commands; '.t' increments the
tracing one step to a maximum of two). Now, for a specific character and
for a sequence of rules (for which the user is prompted) a detailed report
on the rules' operation is given. This may lead to a respectable amount of
information (all primitive matching values are reported, as well as all
operator constructions and their conclusions), but generally the selection of
the rules to be reported can be so limited that the information flow is not
overwhelming, and on the other hand the report is minute enough to pinpoint
the error. In the experience of users this three-step strategy is an
efficient way to locate a semantic or conceptual error.
The final step in program development is, when it is semantically correct,
improving the efficiency of the program. Rulsys reports that statistics of
rule productivity can be gathered; the other systems do not report on tools
for this purpose. As touched upon in chapter 2, TooLiP features a rule
coverage analyser which reports on the frequency with which individual rules
are consulted. Like the trace facility, it can be invoked by a single command
'.lr' (log rule coverage). The frequencies of rule consultations are then
filed for all input cumulatively. Thus the infrequently consulted rules can
be identified, and rules can be rewritten and rearranged. With this tool the
efficiency of the existing grapheme-to-phoneme conversion rules has been
increased by some 20%.
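How coverage figures can guide a rearrangement of rules is sketched below. The rule names, the counters and the reordering criterion are assumptions for this example; the reordering is only safe here because the three rules have different focus characters and therefore never compete.

```python
from collections import Counter


def convert_with_coverage(text, rules, consulted, applied):
    """Segment-by-segment conversion that logs consultations per rule."""
    output = []
    for character in text:
        for name, focus, result in rules:
            consulted[name] += 1
            if character == focus:
                applied[name] += 1
                output.append(result)
                break
        else:
            output.append(character)
    return "".join(output)


RULES = [("R1", "x", "ks"), ("R2", "c", "k"), ("R3", "e", "E")]

consulted, applied = Counter(), Counter()
first_output = convert_with_coverage("eeec", RULES, consulted, applied)
first_total = sum(consulted.values())

# Put the rules that applied most often first (safe only for disjoint rules).
reordered = sorted(RULES, key=lambda rule: -applied[rule[0]])
second_consulted = Counter()
second_output = convert_with_coverage("eeec", reordered, second_consulted, Counter())
second_total = sum(second_consulted.values())

print(first_output, first_total)     # EEEk 11
print(second_output, second_total)   # EEEk 5
```

The output is unchanged while the number of rule consultations drops, which is the kind of gain a coverage log makes visible.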
5.4.5 Implementation
As to the implementation, not many details are provided, either. Often,
only a schematic and short section is included on the implementation, which
states the language of implementation and gives a crude sketch of the
architecture. Not much more than a list of facts can be given here. The
choice of programming language is often influenced by the local conditions
and the acquaintance of the programmer with programming languages. One
consideration which is often heard is that the program should be
transportable, but in this sense the generally known programming languages do
not differ greatly.
SPL and Rulsys do not specify in which programming language they are
implemented. TWOL is implemented in Lisp, Delta in C, and the other four
systems in Pascal. As for TooLiP, apart from acquaintance with the language,
a consideration has been that structures and recursion are needed
extensively, which is supported by Pascal. On the other hand it has been
suggested that the unifying capabilities of Prolog might be very well suited
for pattern matching. Apart from some uncertainty as to the execution speed,
there seems no real reason why it could not have been implemented in Prolog
as well.
All systems have been implemented as a compiler, except TooLiP which
can be described as a compiler/interpreter. This topic has been discussed in
the previous chapter, section 4.5. An advantage of an interpreter-like scheme
is that on-line editing of rules is easily implementable. The price to be paid
is often a slower execution at run-time. For the development of rules the
first argument is more important; for operation in an application such as a
text-to-speech system the latter may be of more importance.

Table 5.III: Overview of the characteristics of the systems included in the
comparison.

                      Rulsys  Fonpars  SPL     Depes   TWOL      Delta    Parspat   TooLiP
SPE-based             yes     yes      yes     yes     partly    no       no        yes
Two-dimensional       no      no       no      no      no        no       no        yes
Complementation       no      restr.   no      no      no        special  stfw.     full
Control structures    rudim.  no       no      yes     no        yes      no        no
Numerical functions   yes     yes      yes     no      no        future   no        no
Multi-layered         no      no       no      yes     yes       yes      prob.     yes
Co-ordination         -       -        -       ?       yes       yes      ?         yes
Derivational history  -       -        -       progr.  ?         progr.   ?         yes
Rule ordering         yes     yes      yes     yes     no        yes      no        yes
Scanning direction    ?       →        →       ↔       parallel  ↔        parallel  ↔
Assignment strategy   rbr     rbr/sbs  rbr     rbr     -         rbr      -         sbs/mbm
Ambiguities           no      no       partly  progr.  yes       progr.   yes       no
Syntax diagnosis      ?       no       ?       ?       fragile   yes      ?         yes
Trace facility        yes     no       yes     no      ?         no       ?         yes
Debugger              no      no       no      yes     ?         yes      ?         no
Efficiency support    yes     ?        ?       ?       ?         ?        ?         yes
Implemented in        ?       Pascal   ?       Pascal  Lisp      C        Pascal    Pascal
Compiler/interpreter  c       c        c       c       c         c        c         c/i

restr.  restricted            rbr  rule-by-rule
stfw.   straightforward       sbs  segment-by-segment
prob.   probably              mbm  module-by-module
rudim.  rudimentary           ?    could not be established from the literature
progr.  programmable
5.4.6 Conclusion
The important characteristics have now been reviewed. For an overview,
they are included in Table 5.III for all systems included in the comparison.
By way of conclusion the merits of TooLiP will now be enumerated. TooLiP
is a system which fits in the SPE tradition. The rules are expressed in a
formalism which is familiar to linguists, the main users of the system. Two
features of the formalism are characteristic of TooLiP and are not present
in other systems. One is the two-dimensional representation of 'and' and
'or' structures, and the other is the definition of complementation, the
'not' structure. Either the problem discussed in chapter 3 has not been
recognized or it has not been elaborated on, but the solutions encountered in
the various other systems are of a different nature than the solution in
TooLiP.
Another characteristic of TooLiP is the support of a two-level data
representation, one for graphemes, the other for phonemes. With it,
grapheme-to-phoneme relations are available both for usage in the rules and
for statistical purposes. In this respect two other systems might be
comparable, be it that for statistical purposes the user must probably write
his own program. Furthermore, TooLiP provides relatively extensive
development tools. Apart from clear diagnostic messages on the syntactic
level and a powerful trace facility for debugging, TooLiP also features a
rule coverage analyser to improve the efficiency of a rule set. Most of these
features are rarely touched upon by reports on the other systems.
Finally, TooLiP is restricted to deterministic symbol manipulation. It does
not feature control structures in its formalism, nor numerical functions to
specify for instance durational rules on the segmental level. From the input
exactly one output is generated and thus ambiguous pronunciations cannot be
generated. Furthermore, TooLiP does not provide any tools to represent
tree-like structures which are typically desired in syntactic analysis of
sentences. TooLiP, therefore, is suited for front-end linguistic processing
in a text-to-speech system, which does not need tools to perform a complicated
structural analysis. So typically TooLiP is suited for the applications it has
been used for: spelling out acronyms, numbers and abbreviations, inserting
boundaries on morphosyntactic grounds, performing grapheme-to-phoneme
conversion and assigning word stress. It is less suited to performing full
morphological or syntactical analysis in order to improve grapheme-to-phoneme
conversion or insert phrase boundaries and determine sentence accent.
5.5 Possible Extensions
This last section will discuss some possible future extensions of TooLiP. It
comprises five topics concerned either with improving the system or extending
it along natural lines.
5.5.1 Rule-by-rule assignment
As mentioned in chapter 2 and in the previous section, processes can be
crucially ordered, that is, one process must be finished before the other can
apply. For this purpose TooLiP features modules; the next module will apply
only when the current one has been fully applied. This is the only mechanism
in TooLiP to express crucial ordering. Despite the limited number of modules
needed in a full-grown application it might be attractive to implement a more
direct mechanism to express crucial ordering. In the current implementation
some modules consist of only one rule, for instance NUMBER_4, as this rule
must be applied after NUMBER_3 and before NUMBER_5. Also, a user who
is accustomed to specifying in a crucially ordered manner does not want to
create a module for each rule, which means a separate file for each module in
the current implementation. The user retains more overview when such crucially
ordered rules can be included in one file.
A possible solution to this is to allow the user to indicate for each group
of rules in a module how to process them, either in the rule-by-rule strategy
or the segment-by-segment strategy. Internally, the crucially ordered rules
can be interpreted as a separate module for each rule, which gives them the
desired functionality, but for external (debugging) purposes the rules must be
accessed via the module in which they are included. This will probably require
little effort for the functional implementation but somewhat more for the user
interface.
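The difference between the two strategies, and the idea of choosing one per
group of rules, can be sketched in Python as follows. This is an illustration
only, not TooLiP code: rules are reduced to context-free (focus, replacement)
string pairs, and the names `rule_by_rule`, `segment_by_segment` and
`apply_groups` are assumptions made for the example. The sketch assumes that
rule-by-rule means each rule scans the whole string before the next rule
starts, and that segment-by-segment means the first matching rule fires at
each position.

```python
from typing import Callable, List, Tuple

# Hypothetical stand-in for a linguistic rule: (focus, replacement), no context.
Rule = Tuple[str, str]
Strategy = Callable[[str, List[Rule]], str]


def rule_by_rule(text: str, rules: List[Rule]) -> str:
    """Crucial ordering: each rule sees the full output of the previous rule."""
    for focus, replacement in rules:
        text = text.replace(focus, replacement)
    return text


def segment_by_segment(text: str, rules: List[Rule]) -> str:
    """At each input position the first matching rule fires; output is not re-fed."""
    out: List[str] = []
    pos = 0
    while pos < len(text):
        for focus, replacement in rules:
            if focus and text.startswith(focus, pos):
                out.append(replacement)
                pos += len(focus)
                break
        else:
            # No rule matched: copy the segment unchanged.
            out.append(text[pos])
            pos += 1
    return "".join(out)


def apply_groups(text: str, groups: List[Tuple[Strategy, List[Rule]]]) -> str:
    """A module as a sequence of rule groups, each with its own strategy."""
    for strategy, rules in groups:
        text = strategy(text, rules)
    return text
```

With the rules a -> b and b -> c and the input "ab", the rule-by-rule strategy
yields "cc" (the second rule sees the output of the first), whereas the
segment-by-segment strategy yields "bc".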
5.5.2 Simultaneous operator
One of the functions of the simultaneous operator is to express co-ordination
between graphemes and phonemes, and it has been used rather frequently, for
instance for the purpose of stress assignment. While the operator is
satisfactory for co-ordination and exclusion (such as "any consonant but 'c'"),
it is not satisfactory for expressing non-synchronization, such as "the
vowel 'II' which is not derived from 'ie'". Currently, when a pattern like

\[
\left[\begin{array}{c} \mathrm{II} \\ \neg[\mathrm{i},\mathrm{e}] \end{array}\right] \tag{5.24}
\]

is expressed, this is interpreted as "the vowel 'II' which is synchronized with
a sequence of two characters, the sequence 'ie' excluded". If the 'II' happens
to be derived from a 'y' or a single 'i' this pattern will not match.
One possibility is to express this pattern positively (and 'II' can only be
derived from a limited number of characters), but in other situations this
may lead to patterns of undesirable complexity, and moreover, in the line of
complete availability of negation this does not seem elegant.
Another possibility is to allow the user to explicitly denote
non-synchronization, for instance in the following manner:

\[
\begin{array}{l} +[\,\mathrm{II}\,] \\ -[\,\mathrm{i},\mathrm{e}\,] \end{array} \tag{5.25}
\]

Here, the minus in front of the brackets denotes the fact that the 'II'
may be synchronized with anything but 'ie'.
The introduction of such a mechanism involves some additional theoretical
work. Currently, the simultaneous operator is defined symmetrically to
the alternation operator, that is, analogously with appropriate exchange of
disjunction and conjunction operators. This also implies that the pattern
succeeding the simultaneous operator is put into [...]. For the purpose
of synchronization [...] and this might not be [...].
Perhaps an explicit distinction on the level of user programming between
simultaneity used as an 'and'-operator and simultaneity used as a
synchronization operator is desirable. Detachment of the succeeding structure
from the synchronization operator, however, will have influence on the solution
for the introduction of complementation, the exact nature of which will have
to be studied closely before a satisfactory solution is found.
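The two readings discussed in this section can be contrasted in a small Python
sketch. It is not the TooLiP matcher: the `Segment` record, which stores for
each phoneme the graphemes it is synchronized with, and both function names
are assumptions made for illustration. The first predicate follows the current
interpretation of pattern (5.24); the second follows the proposed
non-synchronization reading of (5.25).

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Segment:
    """Hypothetical phoneme segment that remembers its grapheme source."""
    phoneme: str
    source: Tuple[str, ...]


def synced_with_other_sequence(seg: Segment, phoneme: str,
                               excluded: Tuple[str, ...]) -> bool:
    """Current reading: synchronized with a sequence of the same length
    as `excluded`, the sequence `excluded` itself excluded."""
    return (seg.phoneme == phoneme
            and len(seg.source) == len(excluded)
            and seg.source != excluded)


def not_synced_with(seg: Segment, phoneme: str,
                    excluded: Tuple[str, ...]) -> bool:
    """Proposed reading: synchronized with anything but `excluded`."""
    return seg.phoneme == phoneme and seg.source != excluded
```

For an 'II' derived from a single 'y', the first predicate fails (the source is
not two characters long) while the second succeeds, which is the behaviour the
explicit non-synchronization notation is meant to provide.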
5.5.3 Extension of layers
In chapter 4 the architecture of a two-layered central data structure has
been discussed, including the mechanism used to synchronize the layers. This
architecture has been devised with the main application in mind,
grapheme-to-phoneme conversion. The first layer contains graphemes, the second
phonemes. Both data-types are [...].
With this, the kernel of a multi-layered data structure has been achieved.
Extra layers can be added without too much effort if the OS synchronization
mechanism is used to synchronize them. However, if the DBS synchronization
mechanism is used, this will involve considerable effort.
The extension of layers is not restricted to segments; there is no reason
why larger units such as morphemes (prefix/root/suffix) or words
(lexical/number/abbreviation) should not be introduced. However, such an
extension will also affect the formalism, which is the user's only tool for
accessing the data structure. Currently, graphemes and phonemes are disjunct,
so reference to either does not need disambiguation. When extra layers are
defined, one possibility is that the elements they hold are disjunct to all
other layers; another possibility may be that a layer selection mechanism is
included in the formalism. The first solution may be undesirable in some
applications, the latter may obscure the rules to an undesirable extent. A
practicable solution might be a compromise of only having to disambiguate in
ambiguous cases, and encouraging the user to use as few as possible equal
sequences in different layers.
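The compromise of disambiguating only in ambiguous cases can be sketched as
follows in Python. The layer names, the symbol sets and the function
`resolve_layer` are hypothetical and serve only to illustrate the idea; they
are not part of the formalism.

```python
from typing import Dict, Optional, Set


def resolve_layer(symbol: str, layers: Dict[str, Set[str]],
                  layer: Optional[str] = None) -> str:
    """Return the layer a symbol refers to.

    An explicit `layer` is only required when the symbol occurs in more
    than one layer; otherwise the unique owning layer is inferred.
    """
    if layer is not None:
        if symbol not in layers[layer]:
            raise ValueError(f"{symbol!r} is not defined in layer {layer!r}")
        return layer
    candidates = [name for name, symbols in layers.items() if symbol in symbols]
    if len(candidates) == 1:
        return candidates[0]
    if not candidates:
        raise ValueError(f"{symbol!r} is not defined in any layer")
    raise ValueError(f"{symbol!r} is ambiguous between layers {candidates}")
```

A rule that refers to a symbol occurring in a single layer needs no layer
selector; only a symbol shared by, say, the grapheme and a morpheme layer has
to be qualified.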
5.5.4 On-line rule editing
One of the characteristics of TooLiP is that the patterns are interpreted at a
certain level. As argued in chapter 4, this property may well be capitalized
on by implementing on-line editing and testing of rules. This can probably be
done with little effort, since the infrastructure is present. The rule to be
edited, inserted or deleted can be selected in the same way as the rules to be
traced currently are. The source text is available, so this can be read into
the buffer of an editor. When the rule has been adjusted the system can compile
the rule into its internal data representation and patch it over the old one
so that it can be tested.
Of course, now the internal data differ from the original source. So a new
source must be produced. Several solutions for this are possible. A safe and
disk-space-friendly manner seems to be to store each change in a file with
an indication of which rule in which module it concerns. This serves as a
journal file for recovery. Then, when the session is finished, like an editor,
the system may ask whether the changes must be stored. For each module in
which a rule has changed the system can generate the new source by appending
all individual source texts.
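A minimal Python sketch of such a journal is given below. It covers only the
editing of existing rules (not insertion or deletion), and the JSON line
format, the zero-based rule index and all function names are assumptions made
for the example rather than a description of an existing implementation.

```python
import json
from typing import Dict, List


def journal_entry(module: str, rule_index: int, new_source: str) -> str:
    """Serialize one change as a single journal line."""
    return json.dumps({"module": module, "rule": rule_index, "source": new_source})


def replay(journal_lines: List[str],
           modules: Dict[str, List[str]]) -> Dict[str, List[str]]:
    """Apply journalled changes to a copy of the rule sources (recovery)."""
    patched = {name: list(rules) for name, rules in modules.items()}
    for line in journal_lines:
        entry = json.loads(line)
        patched[entry["module"]][entry["rule"]] = entry["source"]
    return patched


def module_source(rules: List[str]) -> str:
    """Regenerate a module's source by appending the individual rule texts."""
    return "\n".join(rules) + "\n"
```

Because every journal line names its module and rule, the journal can be
replayed after a crash, and at the end of a session only the modules that
actually changed need to be written out again.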
5.5.5 Compiler implementation
An opposite extension is also possible, not serving the development tool but
serving the run-time performance. When a rule set has been developed and
performs satisfactorily, there is no reason why all development support tools
should still be available. Therefore a full-compiler version of TooLiP is
attractive for implementation in the text-to-speech system. On the one hand
all development support tools can be omitted, but the major part of the
increase in run-time performance may be expected from the implementation of a
full compiler.
One possibility is to follow the scheme of Fonpars, to translate the
linguistic rules into a Pascal program, which in turn is compiled into machine
code. An informal test, where a pattern was translated by hand into a Pascal
program, indicated an increase of speed by a factor of 6, which seems a
reasonable indication of the possible gain in speed with this approach. Another
possibility is to translate the patterns into finite state machines as is done
in TWOL. However, this involves some theoretical work, since it is not clear
beforehand how complementation should be translated to a finite state machine.
An estimate of the possible gain in speed is hard to give, since the
theoretical work has not been done yet, but it seems reasonable to believe
that, since this approach is more direct, the increase in speed will be greater
than in the first approach. However, the amount of effort will presumably be
greater, too.
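To give an impression of what translation into a finite state machine amounts
to, the Python sketch below compiles a set of alternative literal sequences
into a transition table and then matches by table lookup only. It deliberately
covers just alternation and concatenation of primitives; complementation, the
open theoretical issue mentioned above, is not handled. The function names and
the table layout are assumptions for this example and are not taken from TWOL
or TooLiP.

```python
from typing import Dict, List, Set, Tuple

Machine = Tuple[List[Dict[str, int]], Set[int]]


def compile_alternatives(alternatives: List[str]) -> Machine:
    """Build a deterministic machine (a trie) accepting any of the alternatives."""
    transitions: List[Dict[str, int]] = [{}]  # state 0 is the start state
    accepting: Set[int] = set()
    for word in alternatives:
        state = 0
        for symbol in word:
            if symbol not in transitions[state]:
                transitions.append({})
                transitions[state][symbol] = len(transitions) - 1
            state = transitions[state][symbol]
        accepting.add(state)
    return transitions, accepting


def longest_match(machine: Machine, text: str, start: int = 0) -> int:
    """Length of the longest alternative matching at `start`, or -1 if none."""
    transitions, accepting = machine
    state = 0
    best = 0 if 0 in accepting else -1
    for offset, symbol in enumerate(text[start:], 1):
        next_state = transitions[state].get(symbol)
        if next_state is None:
            break
        state = next_state
        if state in accepting:
            best = offset
    return best
```

At run time the matcher performs one dictionary lookup per input segment and
never revisits the pattern structure, which is the source of the expected gain
in speed.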
Appendix 5.A
Module NUMBER_5
In this appendix the full rule set of module NUMBER_5 is given. It serves to
spell out the number when all important processes have been executed, viz.
unit insertion, inversion and 'and' insertion. The rules follow below.
definitions
# = <-segm>
D = <+cijf>
grapheme 0
o,h,o,t,o,{ f} ~ o
O,t --t 0 I - D,(&),t
O,h --+ 0
zero -->
zero -+
grapheme 1
n,u,l I # _ #
0
1 __. o 1 _ ~ }
1 --> 0 I -.[h, o, n, d, e, r, d] _ t, 0, d
1,t,1 → e,l,f
1 → e,e,n
grapheme 2
2,t,1 → t,w,a,a,l,f
2 → t,w,e,e
grapheme 3
3,t,1 → d,e,r,t,i,e,n
3 → d,r,i,e
grapheme 4
4,t,1 → v,e,e,r,t,i,e,n
4 → v,i,e,r
grapheme 5
5 → v,i,j,f
grapheme 6
6 → z,e,s
grapheme 7
7 → z,e,v,e,n
grapheme 8
8 → a,c,h,t
grapheme 9
9 → n,e,g,e,n
grapheme &
& → e,n
grapheme t
t,0 → 0
t,1 → t,i,e,n
t,2 → t,w,i,n,t,i,g
t,3 → d,e,r,t,i,g
t,4 → v,e,e,r,t,i,g
t,5 → v,i,j,f,t,i,g
t,6 → z,e,s,t,i,g
t,7 → z,e,v,e,n,t,i,g
t,8 → t,a,c,h,t,i,g
t,9 → n,e,g,e,n,t,i,g
grapheme h
h → h,o,n,d,e,r,d / D
grapheme d
d → d,u,i,z,e,n,d / D
grapheme m
m → m,i,l,j,o,e,n / D
grapheme n
n → m,i,l,j,a,r,d / D
grapheme o
o → b,i,l,j,o,e,n / D
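How these rules spell out a number can be illustrated with a small Python
sketch. It is a simplification under stated assumptions: only the context-free
rules for the digits 1-9, '&', 't', 'h' and 'd' are included, contexts such as
"/ D" and the treatment of zero are ignored, and the coded input form (for
example "3t1" for 13 and "2&t3" for 32, as left by the preceding modules) is
inferred from the rules themselves. The names `RULES` and `spell` are
hypothetical.

```python
from typing import List, Tuple

# Context-free subset of the NUMBER_5 rules; longer focuses come first so
# that "3t1" is tried before "3", mirroring the rule order per grapheme.
RULES: List[Tuple[str, str]] = [
    ("1t1", "elf"), ("2t1", "twaalf"), ("3t1", "dertien"), ("4t1", "veertien"),
    ("t1", "tien"), ("t2", "twintig"), ("t3", "dertig"), ("t4", "veertig"),
    ("t5", "vijftig"), ("t6", "zestig"), ("t7", "zeventig"), ("t8", "tachtig"),
    ("t9", "negentig"),
    ("1", "een"), ("2", "twee"), ("3", "drie"), ("4", "vier"), ("5", "vijf"),
    ("6", "zes"), ("7", "zeven"), ("8", "acht"), ("9", "negen"),
    ("&", "en"), ("h", "honderd"), ("d", "duizend"),
]


def spell(code: str) -> str:
    """Rewrite a coded number left to right; the first matching rule fires."""
    out: List[str] = []
    pos = 0
    while pos < len(code):
        for focus, replacement in RULES:
            if code.startswith(focus, pos):
                out.append(replacement)
                pos += len(focus)
                break
        else:
            raise ValueError(f"no rule applies at position {pos} of {code!r}")
    return "".join(out)
```

Under these assumptions `spell("3t1")` gives "dertien" and `spell("2&t3")`
gives "tweeendertig". Note that no special rule is needed for 15: "5t1" is
handled by the general rules for '5' and 't,1', giving "vijftien".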
References
Aho, A.V., Sethi, R. & Ullman, J_D_ (1986). Compilers; Principles, Tech-
and Tools_ Addison Wesley, Reading, Massa.chussetts.
Ainsworth, W.A. (1973)- A for converting English text into speech.
IEEE Transactions on Audio and ElectmatMt8tics, AU-21, 288-290-
Allen, J., Hunnicutt, S. & Klatt, D. (1987)- From TeCI;t to Speech: The
M!Talk system. Cambridge University Press, Cambridge.
Berendscn, E., Langcweg, 8. & Van Leeuwen, H.C_ (1986)- Computationi:l.!
Phonology; merged not mixed. Proceedings International Conference
on Compatational Linguistics 86, 612-614.
Bercndsen, E. & Don, J. (1987)- Morphology and stress in a rule-based
grapheme-to-phoneme conversion system for Dutch. Eum-
pean Conference on Speech Technology 81, 1, 239-242-
Berendsen, E., J.M_G_ & Van Leeuwen, H.C. (1989). Van tekst
naar fonecmrcprcsentatie, Jnformatie, To appear_
Brandt CMstins, H- (1965). Automatic translation of numbers into Dutch,
Foundations of Languagc, January 1965, 59-62.
Carlson, R & Granstrom, B. (1976). A text-to-speech system b;i.!)ed entirely
on rules_ Proceedings of ICASSP 76, 586 688.
Carlson, R., Granstrom, B. & Hunnicutt, S. (1981 ). BliS!> communication
with speech or text output. Quarterly Progress Status Report, 4, Royal
Institute Of Technology, Stockholm, Sweden, 29-38.
Carlson, R, Granstrom, B. & Hunnicutt, S. (1989)- Multilingual text-to-
speech development and applications. Internal Report, Royal Institute
Of Technology, Stockholm, Sweden.
Chomsky, N. & Halle, M. (1968)- The Sound Pattern of English. Harper &
Row, New York.
Daelemans, W. (1987). An Object-Oriented Computer Model of Mor-
Aspects of Dutch. Ph.D. Dissertation, Ka.tholieke Uni-
versiteit. LeuveJL
172
Elovitz, H.S., Johnson, R., McHugh, A. & Shore, J_E_ (1976). Letter-to-
sound rule:s for autornati.;; translation of English text to phonetics. IEEE
Trnn .. on Speech and Signal Processing, ASSP-24(6),
11ii 459.
't Hurt, J. & Cohen, A. (1973)- Intonation by mle: a perceptual quest-
Jrm.rnnl of 1, 309-327 _
't Hart, J. & Collier, R. (1975). Integrating different levels of intonation
analysis- Journal of 3, 235-255.
Hertz, S-R (l98l)- SRS rules; a three-level ruk strategy_
Proceedings of lCASSP 81, 102-105.
Hettz, S.R (1982). From text to speech with SRS- Joumlli of Uu: Acoustical
Society of Arnericn, 72(4), 1155--1170-
Hertz, S.R., Kadin, .l. & Karplus, K. (1985). The Della rul8 clev8lopment
system for speech synthesis from text. of the IEEE, 73(11),
1589--1601.
H(rb;, S-R- (1989)- The Delt:a System. User Manual.
Holtse, P. & Olsen, A. (1985). SPL: a speech synthesis programming lan-
guage. Annual Report of Phonetics, 19, University of Copen-
hagen, 1---42 _
Hopcroft, J.E- & Ullman, J_D. (1979). Introduction to Automat(l
Languages and Computation. Addison Wesley, Reading, Massachus-
setts.
Kr1ger, R-, Visch, E_ & Zonneveld, W. (1987). woordklemtoon
(Hoofdklemtoon, J3ijklexntoou, Reductie en Vo<C:tcn). Glot, 10.2, 197
220.
Ka.rttnnen, L-, Koskenniemi, K. & Kaplan, R.M. (1987), A r:ompiler for Two-
levd Phonological Rules. Internal Report, Xerox Palo Alto Research
Center, Stanford University.
Kerkhoff, J., Wester, J. & L. (1984). A compiler for impleu1enting
the linguisti.;; of a text-to-speech conversion Linguistics
in thfl Netherlrmds, 111-117.
Kmmnenda, M_ (1985). GRAPHON - cin System zur Sprachsynthesc bei
Tcxteingabc. In: Ostcrreichische A r-tifidal Intelligence- Taynng, (H.
J_ Rel:t:i eds.), Springer, Berlin.
Koskcnnicrni, K. (1983)- Two-level Morphology: A General Computational
Modd for Word-Form Recognition and Production- Ph_ D. Dissertation,
P-ublication no, 11, Department of GenenJ Linguistics, University of
Helsinky.
Kucera, H. & Ftancis, W_N. (1967). Computational Analysis of Present Day
American English, Brown University Press, Pfovidence, Rl.
References 173
Kulas, W. & Riihl, H.W. (1985). Syntcx-unrcstricted conversion of text-
to-speech for German. New Systems and for Automati1;
Speech Recognition and Synthesis, 517- 535.
Lammens, J.M.G. (1987). A Lexicon-based Grapheme-to-phoneme conver-
sion system. Proceedings European Conference on Speech Technology
87, 1, 281-284.
Lammens, .J.M.G. (1989). From text to speech via the Lexicon. Internal
Report, no. 7, SPIN, Utrecht.
S.G.C. & Kaye, G. (1986). Alignment of phonemes with their
corresponding orthography. Computer, Speech and Language 1, 153-
165.
Nm1r, P. (1963). Report on tJ1e Algorithmic Language ALGOL 60. Journal
ACM, 6, no. 1, 1-17.
Nunn, A.M. (1989). Hct optimaliseren van de rcgelset van GRAFON, Inter
nal Repmt, no. 697, Institute for Perception Research, IPO, Eindhoven.
Pounder, A. & Kommenda, M. (1986). Morphological Analysis for a Ger-
man Text-to-Speech System. International Conference on
Computational 86, 263-268.
Riihl, H.W. (1984). Sprachsynthese nach Regeln fiir unbeschrii.nkten
deutschen Text. Ph.D. Dissertation, Ruhr-Universitiit Bochum.
Van Coile, B.M.J. (1989). The Depes development system for text-to-speech
synthesis. of ICASSP 89, 250-253.
Van der Steen, G.J. (1987). A program generator for recognition, parsing
and transduction with syntactic patterns, Ph.D. Dissertation, Faculteit
der Letteren, University of Amsterdam.
Van Leeuwen, H.C., Berendsen, E. & Langewcg, S. (1986). Linguistics a.s an
input for a flexible grapheme-to-phoneme conversion system fm 0\1tch,
lEE [ntemational Conference on Speech Input/Output 86,
200-205.
Van Leeuwen, H.C. (1987). Complementation introduced in linguistic re-
write rules, Proceedings E1L't0pean Cf.mference on Speech Techn.ology 87,
1, 292-295.
Van Leeuwen, H.C. (1989). A development tool for linguistic rules. Com-
puter, Speech and Language, 3, 83-104.
174
Summary
For the purpose of automatically converting (printed) text into speech,
among other things grapheme-to-phoneme conversion is required, i.e., the
assignment of a pronunciation code to the orthography. Since many words in
a language are regular and the number of words in a language is in principle
not finite, a rule-based approach to this matter seems appropriate for at least
a major part of the task. Exceptions, then, can be stored in a small lexicon.
In this thesis a tool is described for the development of linguistic rules with
which one typically can define the transitions which are needed to derive the
pronunciation from the orthography. The development tool is called TooLiP,
which stands for "Tool for Linguistic Processing". And as the name suggests,
although grapheme-to-phoneme conversion as yet has been the only major
application for which TooLiP has been used, TooLiP is not restricted
to this application only. Probably any rule-based segmental transcription
process can be implemented in TooLiP.
A special characteristic of TooLiP is that input-to-output relations are
preserved. This means that one can make use of derivational information in
the linguistic rules, which can be advantageous, for instance, for stress
assignment rules. On the other hand it means that the system can be used to
gather statistics on input-to-output relations. Given the major application
for which TooLiP has been used, the system can be used as an analysis tool
for statistics on grapheme-to-phoneme relations.
In this thesis TooLiP is treated from several points of view. In chapter 2
a user's point of view is taken, and TooLiP is described as it presents itself
to the user. First the basic configuration is discussed. Linguistic rules are
the user's main tool to manipulate input characters. With them, one can select
and transcribe characters dependent on the context. The possibilities for
transcribing input characters and the facilities for defining contexts are then
described. Such rules can be grouped into a module, which thus provides
a mechanism to manipulate strings. Modules, in turn, can be concatenated
to form a conversion scheme, which performs the desired task. The chapter
concludes with a discussion on some extensions which are included to increase
its user-friendliness and applicability. Also, some characteristics of the
system are discussed and compared to those of some other systems.
In chapter 3 a mathematician's point of view is taken. It concerns a specific
aspect of TooLiP, which remained underexposed in chapter 2, i.e., the exact
meaning of patterns (the mechanism to denote sets of strings). In TooLiP
patterns are an extended form of regular expressions. The extension
consists of the addition of two operators to the standard regular expressions
for user convenience: complementation (the 'not') and simultaneity (the 'and').
The introduction of one operator, the complementation operator, specifically
gives rise to an unexpected problem. If the complementation operator is
introduced in a compositional manner, the formal interpretation of a certain
class of patterns differs from what one would expect them to mean. To be
precise: certain strings one would expect to be excluded, are not. This is
considered to be an undesirable property, as generally users simply start using
a system rather than first studying its exact nature. Therefore, an alternative
definition for complementation is proposed, which for the mentioned class of
patterns behaves in accordance with expectation. The essential difference
with the compositional formalism is that now the 'explicit nofits' are always
excluded. As a consequence, however, strict compositionality is lost, which
shows, for instance, in the fact that double complementation may not always
be annihilated. From a theoretical point of view the proposed formalism is
thus not completely satisfactory. It might be satisfactory, however, from a
practical point of view. Those patterns for which it behaves unsatisfactorily
are highly unlikely to be used in practice, and the proposed formalism can
be seen as a practical compromise between the practical needs and theoretical
elegance. On practical grounds it has therefore been decided to implement
the proposed (semi-compositional) formalism in TooLiP.
In chapter 4 a technical point of view is taken, and some aspects of TooLiP's
implementation are discussed. To be precise: those aspects of the
implementation are described which concern the process of matching patterns to
the input. Here, 'input' should be understood in the general sense of
synchronized buffers, i.e., buffers of which the segments are aligned such that
derivational information is available. For this purpose, first the internal
representation of patterns is discussed. The user-specified patterns are
transformed into a dynamic data structure which is accessible to the matching
routine. The dynamic structure codes the structure of the patterns, but some
simple adjustments are also made, which facilitate pattern matching during
run-time. Next, the algorithms which perform the pattern matching are
presented. First the situation of a single input buffer is considered. In view
of this input situation the functions for matching a particular structure in a
pattern are given. Special attention is paid to the function for matching the
complementation operator, since its definition gives rise to some additional
computational complexity. Then the more general situation of synchronized
buffers is considered. The algorithm for matching primitives is somewhat
altered in this situation. Since the synchronization mechanism is important for
this routine, two possible synchronization mechanisms are discussed and
compared. The more general one is chosen to be implemented and the buffer
switching algorithm is given. On the whole, with respect to the processing of
patterns, TooLiP can be viewed as a compiler/interpreter. The user-defined
patterns are compiled from high-level source code to an internal
representation. The internal representation is then interpreted by the
functions for pattern matching.
In the final chapter, chapter 5, the merits of TooLiP are evaluated. Three
sides of TooLiP are considered: (a) the outside, i.e., how TooLiP is used
in a practical application, (b) the inside, i.e., how satisfactory the
semi-compositional formalism is in practice, and (c) the surroundings, i.e.,
how TooLiP relates to other systems which have been designed for similar
purposes. The main application for which TooLiP has been used is the design of
a grapheme-to-phoneme conversion system. Two aspects of this application
are discussed. The first is the spelling out of integer numbers, which is part
of a pre-processing phase; the second concerns the linguistically slanted
modules which perform the actual grapheme-to-phoneme conversion. The second
side of the evaluation concerns the use of the complementation operator. The
usage of the operator in the above application is reviewed in the light of the
choices which were made in chapter 3, and the validity of these choices
is evaluated. The usage of the operator in the specific situation that the
compositional formalism would fail whereas the semi-compositional one succeeds
is relatively scarce. In the light of theoretical objections the choice for the
semi-compositional formalism does not seem fully justified in this particular
application. The third side concerns some more general aspects. For a number
of features which can be seen to characterize development tools for linguistic
rules, TooLiP is compared to seven important existing systems. This gives an
overview of what is unique in TooLiP and what is common practice in such
systems. The thesis is concluded with some recommendations for the future
development of TooLiP. These concern both the improvement of the system
and its extension along natural lines.
Samenvatting (Summary)
To get from printed text to speech automatically, one needs, among other
things, grapheme-to-phoneme conversion, that is, the assignment of a
pronunciation representation to the written form. Because many words in a
language are more or less regular in their pronunciation and the number of
words in a language is not finite, an approach based on rules seems suitable
for handling a large part of the words in a language. Exceptions can then be
stored in a small lexicon.
This thesis describes a tool intended for developing linguistic rules, with
which one can typically capture the kind of transitions needed to get from
spelling to pronunciation. The development tool is called 'TooLiP', which
stands for "Tool for Linguistic Processing". And, as the name indicates,
although grapheme-to-phoneme conversion has so far been the only major
application for which TooLiP has been used, the system is definitely not
restricted to this single application. Probably any segmental conversion that
can be described by rules can be implemented in TooLiP.
A special property of TooLiP is that the individual relations between the
input and the output are preserved. This means that the linguistic rules can
make use of the derivation so far, which can be very convenient, for instance,
in assigning word stress. It also means that the system is suited to collecting
statistical data on the relation between the input and the output. Given the
application for which the system has been used, TooLiP can serve as a tool for
determining which letters give rise to which sounds, and how often.
In this thesis TooLiP is examined from a number of sides. In chapter 2 a
user's point of view is taken, and TooLiP is described as it presents itself to
the user. First the basic configuration is discussed. The linguistic rules are
the means of manipulating characters; they are used to select and change
characters depending on the context. The possibilities for rewriting them and
for defining the context are then described. Rules of this kind can be grouped
into a module, thus forming a mechanism that can handle words or sentences.
Modules, in turn, can be placed in sequence in a conversion scheme, which thus
performs the desired task. The chapter concludes with a description of some
extensions to the basic configuration, intended to increase the
user-friendliness and applicability of the system. In addition, a number of
properties of the system are discussed and compared with those of similar
systems.
In chapter 3 a mathematical point of view is taken. It concerns a part that
remained underexposed in chapter 2, namely the precise meaning of patterns
(the mechanism for denoting sets of strings). In TooLiP, patterns are an
extended version of regular expressions. The extension consists of two
operators that have been added to the formalism for the sake of user
convenience, namely complementation (the 'not') and co-ordination (the 'and').
The introduction of complementation specifically gives rise to problems. If the
operator is added to the formalism in a compositional manner, the formal
meaning of a certain class of patterns deviates from what one would expect. To
be precise: certain strings that one would expect to be excluded are not. This
is considered undesirable, because users generally just start using a system
without first studying its exact workings, so that they are then confronted
with unexpected and perhaps unnecessary problems. Therefore an alternative
definition of complementation is proposed in this chapter, which does behave as
expected for the class of patterns in question. The essential difference with
the preceding formalism is that the 'explicit nofits', the strings that one
explicitly expects to be excluded, are now always excluded. The consequence,
however, is that strict compositionality is lost, which is illustrated, for
instance, by the fact that double complementation cannot always simply be
cancelled. From a theoretical point of view the proposed formalism is therefore
not entirely satisfactory. From a practical point of view, however, it may well
be. It is highly unlikely that the patterns for which the proposed formalism
does not behave satisfactorily will be used in practice, and the formalism can
thus be regarded as a compromise between practical wishes and theoretical
elegance. On the basis of these practical considerations it has been decided to
implement the proposed (semi-compositional) formalism in TooLiP.
In chapter 4 a technical point of view is taken and a number of aspects of the
implementation are discussed. To be precise: those aspects are considered that
have to do with evaluating patterns against the input. Here 'input' must be
seen in the general sense of synchronized buffers, that is, buffers whose
segments are aligned in such a way that the derivational information is
available. First the internal representation of patterns is discussed. The
patterns specified by the user are converted into a dynamic data structure that
is accessible to the evaluation routines. The dynamic data structure encodes
the structure of the patterns, to which a few small adjustments have been made
that facilitate evaluation. Next, the algorithms are given that evaluate the
patterns against the input. First the situation is considered in which there is
only a single input buffer. For that situation the functions for evaluating a
particular structure are given. The complementation operator receives special
attention because it gives rise to extra computational complexity. Then the
general situation of synchronized buffers is considered. The algorithm for
evaluating primitives changes somewhat as a result, but the algorithms for
structures do not. Because the synchronization mechanism is important for the
routine mentioned, two possible mechanisms are discussed and compared. It has
been decided to implement the more general mechanism in TooLiP. Finally, the
algorithm for switching buffers is given. On the whole, TooLiP can be seen as a
compiler/interpreter with respect to the evaluation of patterns. The patterns
specified by the user in a linguistic format are compiled into the internal
representation. That representation is then interpreted by the evaluation
functions.
In the final chapter, chapter 5, TooLiP is assessed on its merits. Three sides
of TooLiP are examined: (a) the outside, i.e., how TooLiP has been used in a
practical application, (b) the inside, i.e., how well the proposed formalism
performs in practice, and (c) the surroundings, i.e., how TooLiP relates to
other systems that have been designed for similar purposes. The main
application for which TooLiP has been used is the design of a
grapheme-to-phoneme conversion system. Two aspects of this application are
considered more closely. On the one hand there is a system for spelling out
integer numbers, which is part of the pre-processing phase, and on the other
hand there are the linguistically slanted modules that perform the actual
letter-to-sound conversion. The second aspect of the evaluation is the use of
the complementation operator. The use of this operator in the application
mentioned is reviewed in the light of the choices made in chapter 3, and in
this way the correctness of those choices is evaluated. The use of the operator
in the specific situation in which the compositional formalism fails but the
semi-compositional formalism succeeds is relatively rare. In view of the
theoretical objections, the choice for the semi-compositional formalism
therefore does not seem entirely justified in this application. The third side
concerns some more general aspects. For a number of properties that are
important for this kind of system, TooLiP is compared with a number of
important comparable systems. This yields an overview of what is unique in
TooLiP and what is customary in this kind of system. The thesis concludes with
some recommendations for further research. These concern possible improvements
to the system on the one hand and extensions along natural lines on the other.
Curriculum Vitae
1 February 1959        Born in Castricum.
Aug 1971 - June 1977   St. Stanislascollege, Delft.
                       Gymnasium β.
Sept 1977 - Nov 1984   Technische Hogeschool Delft, Electrical Engineering.
                       Specialization: Control Engineering.
                       Graduation project: Design and realization of an
                       interactive, robot-independent robot programming
                       language.
Jan 1985 - present     Research staff member employed by the Natuurkundig
                       Laboratorium of Philips, seconded to the Instituut
                       voor Perceptie Onderzoek (IPO).
Propositions
accompanying the thesis
TooLiP: A development tool for linguistic rules
by Hugo van Leeuwen
1. In many existing text-to-speech systems, different levels of
information are collapsed into a single information stream. It is more
advisable to represent these levels explicitly by means of a layered,
synchronized structure, as discussed for two layers in Section 4.4 of this
thesis.
2. When listening to speech, one usually notices very quickly whether a
speaker is speaking spontaneously or reading aloud. An important cause is
that a speaker who reads aloud starts speaking before he or she has fully
read and understood the sentence, and therefore places accents that may
conflict with the content of the text.
3. Thanks to the so-called PSOLA technique, first reported by Hamon
et al., which makes it possible to manipulate the pitch and temporal
structure of speech without much loss of speech quality, the quality of
artificial speech will improve considerably in the short term.
Hamon, C., Moulines, E. & Charpentier, F. (1989): A diphone synthesis
system based on time-domain prosodic modifications of speech,
ICASSP, 238-241.
4. In both the government sector and the business world, the budgets of a
department or working group for the coming year are commonly set on the
basis of the budget and expenditure of the current year according to the
formula:
    B_{i+1} = Min(U_i, B_i)
where B is the budget, U the expenditure, Min a function that determines
the minimum of two numbers, and the indices denote the years in question.
Contrary to its cost-cutting intention, this is a wasteful measure.
5. At present a suspect can refuse to have blood taken for identification
by means of the so-called DNA fingerprint by invoking the inviolability of
physical integrity. The suspect cannot refuse the taking of fingerprints.
Given the certainty with which the DNA fingerprint identifies, this
identification method should be given the same status as the fingerprint,
i.e. a suspect should be deprived of the right to refuse it.
6. It is common knowledge that specialists and residents in hospitals
regularly work irresponsibly long hours. As a result, serious but
unnecessary errors are sometimes made. As the specialists of the future,
residents bear a special responsibility to do something about this abuse.
They should unite in order to enforce responsible working hours.
7. The traffic-jam problem could be solved in part by making the transport
of bicycles on trains free of charge and by creating special facilities at
NS stations, so that one can quickly leave the station by bicycle straight
from the train, and vice versa.
8. It is questionable whether marking one's belongings offers adequate
protection against theft; experience shows that they usually disappear
unnoticed.
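
The budget rule of proposition 4 can be illustrated with a small
simulation. This is a sketch under assumptions that are not part of the
proposition: the function names `next_budget` and `simulate` are
hypothetical, amounts are whole numbers, and a department is assumed to
spend a fixed percentage of its budget every year.

```python
from typing import List


def next_budget(spent: int, budget: int) -> int:
    """Budget for the coming year: the minimum of this year's spending and budget."""
    return min(spent, budget)


def simulate(initial_budget: int, spend_percent: int, years: int) -> List[int]:
    """Yearly budgets when a fixed percentage of the budget is spent (assumption)."""
    budgets = [initial_budget]
    for _ in range(years):
        current = budgets[-1]
        spent = current * spend_percent // 100
        budgets.append(next_budget(spent, current))
    return budgets


print(simulate(1000, 90, 3))    # [1000, 900, 810, 729]
print(simulate(1000, 100, 3))   # [1000, 1000, 1000, 1000]
print(simulate(1000, 110, 3))   # [1000, 1000, 1000, 1000]
```

Under these assumptions a department that spends 90% of its budget sees
the budget shrink every year, while one that spends everything, or
overspends, keeps it. The rule therefore rewards spending the full budget.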