
School of Computer Science and Software Engineering

Monash University

Typogenetics
Thesis
for
Bachelor of Science (Computer Science) Honours (0088)
Clayton Campus

Andrew Snare
ID: 11915374
Supervisor: David Albrecht

November, 1999

Abstract
Typogenetics is a system, introduced in Hofstadter's Gödel, Escher, Bach, that was intended to encapsulate
knowledge of genetics at the time within a typographical system. Since its proposal, few direct studies
have been carried out on the system. This project develops an unambiguous specification for typogenetics
as well as simulation software. After describing the different behaviour that strands can exhibit under repeated self-application, methods for constructing special strands are presented. Systematic exploration
of typogenetics, using the simulation software, proved difficult due to the problem of devising an efficient
heuristic. The computational strength of typogenetics is also investigated. While some components of primitive recursion were present, neither full primitive recursion nor Turing-completeness in general could be
shown.

Contents

List of Tables
List of Figures

1 Introduction
    1.1 Previous Research
    1.2 Project Plan
2 System Description
    2.1 Typogenetics Strands
    2.2 Typogenetic Enzymes
    2.3 Translation
    2.4 Binding Preference
    2.5 Enzyme Application
    2.6 Strand Self-Application
3 Strand Classes
    3.1 Definitions
        3.1.1 Duds
        3.1.2 Self-Perpetuators
        3.1.3 Self-Replicators
    3.2 Known Example-Strands
    3.3 Constructing Strands
4 System Implementation
    4.1 Implementation Platform
    4.2 Design Considerations
    4.3 System Design
        4.3.1 Strand Class
        4.3.2 StrandFactory Class
        4.3.3 Enzyme Class
        4.3.4 Applicator Class
5 Searching for Strands
    5.1 Search Representation
    5.2 Search Algorithm
        5.2.1 Graph Generation
        5.2.2 Graph Traversal
    5.3 Search Results
6 Computation Using Typogenetics
    6.1 Machine Representation
    6.2 Primitive Recursion
    6.3 Turing Completeness
    6.4 Other Equivalences
        6.4.1 Post's Tag System
        6.4.2 Lambda Calculus
        6.4.3 Bracket-Matching
7 Conclusion
    7.1 Summary of Results
    7.2 Future Directions
References
Appendix A Source Code
    A.1 Typogenetics.py
    A.2 pqueuemodule.c

List of Tables
    2.1 Base-pair to amino-acid mapping
    2.2 Enzyme binding preference

List of Figures
    2.1 Base classes and complementary mapping
    2.2 Example enzyme secondary structure
    3.1 Strand classes
    3.2 Example self-replicators
    5.1 Performance of cycle-detection algorithm using DFS

Chapter 1

Introduction
Proposed by Hofstadter [3, pages 405–513], typogenetics is a formal system intended to capture the essence
of what was known of genetics at the time (1979). In the most general sense, typogenetics allows for manipulation of strings that represent DNA strands. Like biological genetics, typogenetics has DNA strands
and enzymes. DNA strands implicitly represent enzymes due to a process called translation; this process
allows for enzymes to be generated from a DNA strand.
Typogenetics starts to diverge from biological genetics at this point with the translation processes operating differently. Enzymes within typogenetics are also vastly simplified compared to biological enzymes.
In particular, the translation process in biological genetics operates on triplets of DNA bases whereas its
typogenetics counterpart operates on duplets. Another difference is that the amino-acids used to make
up enzymes in typogenetics are not related to real biological amino acids; to reflect this the amino-acids
within typogenetics have different names. Enzymes in typogenetics also differ from biological enzymes
in the determination of their binding preference. In biological genetics enzymes attach to molecules based
on their three-dimensional molecular structure. However enzymes within typogenetics are simplified by
determining the binding preference based on a two-dimensional structure.
Despite these differences, enzymes within typogenetics are similar to their biological counterparts in that
they can manipulate DNA strands: they act as operators on DNA strands to perform typographical manipulation. In computing terms, the DNA strand takes the role of data and the enzyme becomes a program
operating on the data. The amino-acids within the enzyme are analogous to functions strung together within a program in a particular order. While amino-acids may have a similar role in biological enzymes, their
role was not clearly understood when typogenetics was formulated. This means that the actions of amino-acids within typogenetics are artificial, even though Hofstadter intended them to be indicative of biological
amino-acids.
An interesting symbiosis exists due to the dual role that DNA strands play. Not only are they the data
that gets operated on by enzymes, but they also encode (via the translation process) the enzymes that are
operating. While unrelated enzymes and strands can interact, much of this project concerns the situation
where a strand produces enzymes that operate on itself. Of particular interest are strands that are able to
duplicate themselves. In biological terms these strands correspond to the ability to self-replicate, while in
philosophical terms these strands indicate ability to represent themselves. A joint interpretation of these
perspectives is the motivation behind Hofstadter's teaser problem (see Section 2.6). Here he invites the
reader to devise a strand that reproduces through the repeated process of translating a strand and then
applying the enzymes to the strand that generated them. In addition to the challenge of devising such
strands, efficiently searching for strands with specific properties such as these poses an interesting search
problem.
Also of interest is the computational power of typogenetics and what methods exist to perform generic computations. This essentially determines the scope with which strands can be manipulated. In this situation
it is not as essential for strands to produce the enzymes that operate on themselves; of more interest is the
way data can be represented in strands and the computational strength of the enzymes that operate on such
strands. This can be related to self-replicating strands since computational strength and self-representation
are linked [6, pages 262–268].

1.1 Previous Research


Little direct study has been carried out on the typogenetics system since its proposal. Studies have been
conducted by Morris[10, 11] and Varetto[16, 17], although neither addresses the system from the perspective
of computer science. While both implemented software to simulate typogenetics, this was not a major objective. Morris addressed the system from a philosophical perspective, while Varetto studied the system
in terms of biological behaviour and metrics. Both authors used systems that differ from the original: while they operate in the spirit of typogenetics, subtle differences cause enzymes to operate
differently.
In approaching typogenetics from a philosophical perspective, Morris[11] reasons about the types of behaviour that strands can have, and in some cases methods for producing strands with certain characteristics. In particular a method for devising self-reproducing strands is presented, along with an example
strand. He also argues that it is not possible for a strand to completely erase itself.
Research by Varetto has mainly been concerned with finding examples of self-reproducing strands and
then determining various biological statistics. In particular, population growth is modelled[16] and the
behaviour of strands with limited resources examined[17]. While Varetto does find some examples of self-reproducing strands[16], these are particular to his system and do not apply directly to typogenetics as
proposed by Hofstadter.

1.2 Project Plan


The first stage of this project involved developing a specification of typogenetics prior to implementation.
Presented in Chapter 2, this involved comparing the systems described by all three authors (Hofstadter,
Morris and Varetto) and resolving differences or ambiguities. With the system specified, the next stage of
this project involved classifying the types of behaviour strands can have within the typogenetics system, as
discussed in Chapter 3.
The third major stage, described in Chapter 4, concerned implementing software to simulate the typogenetics system. Closely related to this was devising and implementing algorithms to allow for searching for
specific types of strand, as outlined in Chapter 5. The particular goal was to identify strands that exhibit
self-replicating behaviour.
Finally, Chapter 6 analyzes typogenetics in terms of its computational power. This last stage of the project
was particularly aimed at determining whether the typogenetics system was Turing-complete.

Chapter 2

System Description
While appearing to be complete and straightforward, the original description of typogenetics was fairly
ambiguous. Many factors were considered when deciding what to do in ambiguous situations. Most important was preserving the spirit in which typogenetics was devised, i.e. a system capable of imitating
real genetics, albeit in a simplistic manner. A secondary consideration was observing the same interpretations made by others who have studied the system. It was also necessary to keep the system consistent.
The typogenetics system consists of several components. These include describing what strands and typo-enzymes are, how enzymes are obtained from strands, how to determine where typo-enzymes start operating
on strands and the functions of the amino-acids that make up typo-enzymes.

2.1 Typogenetics Strands


A strand is a string of symbols from the alphabet {A, C, G, T}. These symbols are bases and the positions
they occupy within a strand are units. An example strand is ACGGTTA with the base C occupying unit 2.
Bases are grouped into two classes called purines and pyrimidines. Each base also has a complementary base.
This classification and the complementary relationship are illustrated in Figure 2.1.

purines:      A, G
pyrimidines:  C, T
complements:  A <-> T,  C <-> G

Figure 2.1: Classification and complementary mapping for bases.

Strands are defined as continuous strings of bases. However during the process of applying an enzyme to
a strand, it is useful to temporarily consider a strand to be sparse. In this situation, the terms gap and blank
are used to represent no base in a unit position.
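
The base classes, the complementary mapping and the notion of a sparse strand can be sketched in a few lines of Python. This is only an illustrative sketch: the names PURINES, PYRIMIDINES, COMPLEMENT, GAP, complement and split_on_gaps are assumptions made for this example, as is the choice of a space character to mark a gap.

```python
# Illustrative sketch only; all names here are assumptions, not the thesis code.
PURINES = frozenset("AG")
PYRIMIDINES = frozenset("CT")
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}
GAP = " "  # assumed marker for a unit that holds no base


def complement(base):
    """Return the complementary base; a gap has no complement and stays a gap."""
    return COMPLEMENT.get(base, GAP)


def split_on_gaps(sparse):
    """Split a sparse strand into its continuous (gap-free) strands."""
    return sparse.split()
```

For example, split_on_gaps("ACC CA") yields the two continuous strands ACC and CA.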

2.2 Typogenetic Enzymes


Typogenetic enzymes consist of a string of amino-acids. The names of the amino-acids in typogenetics are
given in Table 2.1 with an example being copswimvr.¹

¹ Varetto[16, 17] gives the amino-acids different names to those listed but they are intended to perform the same operations.

2.3 Translation
Translation is the process of generating an enzyme from a strand. This process is considered to be non-destructive; the original strand remains intact after translation has completed. It is also considered to be a
one-way process: enzymes are always produced from strands, not vice versa. However there is nothing
preventing one from inferring the strand that would encode (translate to) a specific enzyme or amino-acid.
The translation process groups bases in a strand into pairs. If there is an odd number of bases, the last
one is ignored. Each possible base-pair corresponds to an amino-acid as shown in Table 2.1(a). Translation
generates an enzyme by looking up each base-pair in order to find the corresponding amino-acid.

(a) Original translation table presented by Hofstadter[3, page 510].

                        Second Base
    First Base    A        C        G        T
        A         (none)   cut_s    del_s    swi_r
        C         mvr_s    mvl_s    cop_r    off_l
        G         ina_s    inc_r    ing_r    int_l
        T         rpy_r    rpu_l    lpy_l    lpu_l

(b) Modified translation table used by Varetto[16, page 187].

                        Second Base
    First Base    A        C        G        T
        A         (none)   cut_s    del_s    swi_r
        C         mvr_s    mvl_s    cop_r    off_l
        G         inc_s    ing_r    int_r    ina_l
        T         rpy_r    rpu_l    lpy_l    lpu_l

Table 2.1: Mapping from base-pairs to amino-acids. The letter after each underscore is the subscripted kink (s, l or r).

The case of the AA base-pair is special since it has no corresponding amino-acid. It is considered to be a
punctuation marker and marks the end of one enzyme and the start of the next. Hence a strand can encode
several enzymes through the presence of AA base-pairs.
An example of translation is that the strand CGCTAATAAGT translates to the enzymes copoff and rpydel.
Note that the final T contributes nothing and the second occurrence of AA is not considered punctuation
since it does not occur on an even unit-boundary.
Sometimes translation can result in null enzymes. An example of this occurs when translating CGAAAAGCAAT
whereby no enzyme is emitted between the middle AA pairs or after the final AA pair. These null enzymes
are treated as if they never existed.
Translation is almost consistent between all three authors. The only difference is that Varetto has a different
third-row whereby the ina amino-acid is in the right-most column and the other three columns are shifted
left. This is shown in Table 2.1(b).
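
The translation rules described in this section can be illustrated with a short Python sketch. It assumes Hofstadter's table (Table 2.1(a)) and represents an enzyme as a list of amino-acid names; the names AMINO_ACIDS and translate are chosen for this example only and are not taken from the project's software.

```python
# Illustrative sketch of translation using Hofstadter's table (Table 2.1(a)).
# "AA" is punctuation, so it has no entry. Names are assumptions for this example.
AMINO_ACIDS = {
    "AC": "cut", "AG": "del", "AT": "swi",
    "CA": "mvr", "CC": "mvl", "CG": "cop", "CT": "off",
    "GA": "ina", "GC": "inc", "GG": "ing", "GT": "int",
    "TA": "rpy", "TC": "rpu", "TG": "lpy", "TT": "lpu",
}


def translate(strand):
    """Translate a strand into a list of enzymes (each a list of amino-acids)."""
    enzymes, current = [], []
    # Step through the strand two bases at a time; a trailing odd base is ignored.
    for i in range(0, len(strand) - 1, 2):
        pair = strand[i:i + 2]
        if pair == "AA":
            # Punctuation: close the current enzyme; null enzymes are dropped.
            if current:
                enzymes.append(current)
            current = []
        else:
            current.append(AMINO_ACIDS[pair])
    if current:
        enzymes.append(current)
    return enzymes
```

Applied to the two example strands above, translate("CGCTAATAAGT") gives the enzymes cop-off and rpy-del, and translate("CGAAAAGCAAT") gives cop and inc with the null enzymes discarded.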

2.4 Binding Preference


Enzymes operate on strands by attaching to them and performing operations. While the operations that are
carried out are determined by the amino-acids that make up the enzyme, the results differ depending on
where the enzyme starts operating. All enzymes have a binding preference. This property determines which
base the enzyme will attach to before starting.
The binding preference of an enzyme is a function of its secondary structure. The secondary structure is
determined by drawing the enzyme as a snake-like structure and giving every amino acid a kink. The
kinks for each amino-acid are given in Table 2.1 as the subscripted letter. The letter s corresponds to no
kink, while l and r correspond to kinks to the left and right respectively.

By convention, the first amino-acid is always drawn such that the next amino-acid is to its right. The
orientation of the link between the last two amino-acids then determines the binding preference as given
in Table 2.2. An enzyme that consists of only a single amino-acid has no links and so is not
covered by this rule; Hofstadter provides no guidance for this case. From the point of view of implementation,
it is easiest to assume a binding preference of A.
    Last Link    Binding Preference
    right        A
    up           C
    down         G
    left         T

Table 2.2: Binding preference of an enzyme, based on the relative orientation of the last two links. This assumes the first link goes from left to right.

As an example of determining binding preference, Figure 2.2(a) shows the secondary structure of an enzyme. Since the last link points towards the left, from Table 2.2 the binding preference of this enzyme is T.

Figure 2.2: Examples of the secondary structure for an enzyme using both Hofstadter's method (a) and the Varetto/Morris method (b).

Both Morris and Varetto use a different method for determining the binding preference of enzymes. Their
method takes into account the hanging link before and after the first and final amino-acids. The hanging
link prior to the first amino acid is oriented to the right, while the direction of the hanging link after the last
amino-acid is used to determine binding preference, again using Table 2.2. Using this method on the same
enzyme shown in the previous example results in the secondary structure shown in Figure 2.2(b) and a
binding preference of A. This method has the advantage of easier implementation, as well as handling the
case where an enzyme consists of a single amino-acid. However it explicitly contradicts an example given
by Hofstadter [3, page 511]. In this example rpyinarpumvrintmvlcutswicop has a binding preference
of C whereas the Morris/Varetto method results in a binding preference of G.
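
Both folding methods can be compared with a small Python sketch. It is a sketch under stated assumptions: headings are encoded as 0 (right), 1 (up), 2 (left) and 3 (down), the kinks are those of Hofstadter's table for both methods, a single amino-acid enzyme under Hofstadter's method is assumed to bind to A as discussed above, and the names KINK, TURN, PREFERENCE and binding_preference are invented for this example.

```python
# Illustrative sketch; names and the heading encoding are assumptions.
# Kinks from Hofstadter's table: s = straight, l = left, r = right.
KINK = {
    "cut": "s", "del": "s", "swi": "r",
    "mvr": "s", "mvl": "s", "cop": "r", "off": "l",
    "ina": "s", "inc": "r", "ing": "r", "int": "l",
    "rpy": "r", "rpu": "l", "lpy": "l", "lpu": "l",
}
TURN = {"s": 0, "l": 1, "r": -1}            # left = anticlockwise quarter turn
PREFERENCE = {0: "A", 1: "C", 2: "T", 3: "G"}  # right, up, left, down


def binding_preference(enzyme, method="hofstadter"):
    """Return the base an enzyme (list of amino-acid names) prefers to bind to."""
    if method == "hofstadter":
        if len(enzyme) < 2:
            return "A"  # assumption made in the text for single amino-acids
        # The first link points right; only interior amino-acids turn the chain.
        folding = enzyme[1:-1]
    elif method == "morris-varetto":
        # A hanging link enters the first amino-acid from the left, and the
        # hanging link after the last amino-acid decides the preference.
        folding = enzyme
    else:
        raise ValueError("unknown method: %r" % (method,))
    heading = 0
    for amino in folding:
        heading = (heading + TURN[KINK[amino]]) % 4
    return PREFERENCE[heading]
```

With the enzyme rpy-ina-rpu-mvr-int-mvl-cut-swi-cop this sketch returns C for method="hofstadter" and G for method="morris-varetto", in agreement with the two results quoted above.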
As originally presented, typogenetics is non-deterministic in many respects. A problem arises when an
enzyme has a choice of several bases to bind to since they all match the binding preference. In this
situation Hofstadter's system allows for any of these bases to be chosen arbitrarily. Morris maintains this
position, while Varetto makes the system deterministic by always binding to the right-most occurrence of
a base. To simplify the system this project makes the same assumption. This does not weaken the system
since any results obtained also apply to the non-deterministic system.
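
The deterministic choice of binding site can be written as a one-line lookup. The sketch below is illustrative only: the function name binding_unit is an assumption, units are numbered from 1 as in Section 2.1, and None is used to signal that the enzyme cannot bind.

```python
# Illustrative sketch; the function name and the use of None are assumptions.
def binding_unit(strand, preference):
    """Return the 1-indexed unit of the right-most base equal to the binding
    preference, or None if the base does not occur in the strand."""
    index = strand.rfind(preference)
    return index + 1 if index >= 0 else None
```

For the strand CCCCACAAAG and a binding preference of A this returns unit 9, the right-most A.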

2.5 Enzyme Application


When an enzyme acts on (or is applied to) a strand, each amino acid acts upon the currently bound unit.
As the amino-acids operate, the enzyme can move along the strand to different units. The location at which
the enzyme is attached is the enzyme's position or currently bound unit. The movement of the enzyme along
the strand is analogous to the movement of the head along the tape in a Turing machine. By convention, the
currently bound unit is displayed in lower-case. For example, CAGGCtA indicates the enzyme is currently
bound to the T.
As the enzyme operates, a mode of operation can take place called copy-mode. During this mode the complementary base (see Figure 2.1) is inserted above the currently bound unit as the enzyme moves along
the strand. The enzyme can switch to this complementary strand and operate on it instead of the original
strand. The upper strand(s) that get formed are unusual in that they are read from right-to-left. Bases in
either the original strand or the complementary one are only considered to be part of the same strand if not
separated by a gap. To illustrate this, after an enzyme has ceased operating the result could be:
    upper strands:  GCA, GG
    lower strand:   ACCATTGCA                                (2.1)

This is interpreted as three strands: ACCATTGCA, GCA and GG.

Just as reading the complementary strand(s) is reversed, the sense of left and right is reversed for amino-acids operating on it. Hence in the visual sense left is equivalent to clockwise while right is anticlockwise.
When an enzyme operates on a strand, it ceases to act after all amino-acids have been applied. It can cease
to act prematurely if the enzyme moves off the end of a strand. Generally speaking, the enzyme also ceases
to act when it tries to move into a gap between strands; however this rule is violated if the enzyme is in
copy-mode and the rpy, rpu, lpy and lpu amino-acids are applied (see below).
Each of the amino-acids listed in Table 2.1 performs a specific function. These functions are:
cop : This turns on copy-mode and inserts the complementary base above the currently bound unit. With
copy-mode activated, the complementary base (as described in Figure 2.1) is inserted wherever the
enzyme moves. The complementary base is inserted after the enzyme has moved, such that gaps still
cause the enzyme to cease action even when copy-mode is active rather than filling-in the gaps ahead
of the enzyme as it moves. If copy-mode is already activated, this operation has no effect.
An example of cop operating is:
    upper strands:  GT                      GT
    lower strand:   ACCAGTc    --cop-->     ACCAGTc          (2.2)

off : This turns off copy-mode. If copy-mode is already turned off, this operation has no effect. This
operation will never alter the strand(s) present.
swi : This switches the binding of the enzyme from a strand to its complement. If there is no complementary base to the currently bound unit, the enzyme ceases to operate.
An example of how swi operates is:
    upper strands:  GT                      GT
    lower strand:   ACCAGTC    --swi-->     ACCAGTc          (2.3)

Another example whereby the enzyme ceases to operate is:

    upper strands:  GT                      GT
    lower strand:   ACcAGTC    --swi-->     ACCAGTC          (2.4)

cut : This severs the strand (and its complement) to the right of the currently bound unit. The detached
strands remain inaccessible to the enzyme from that time. If there are no bases to the right of the
currently bound unit (either in the current strand or its complement), this operation has no effect.

An example of the cut operation is:

    upper strands:  GT                      GT
    lower strand:   ACCaGTC    --cut-->     ACCa             (2.5)

Here the strands G and GTC are detached. This is equivalent to the system being in the state:
    upper strands:  G, GT
    lower strands:  ACCa, GTC                                (2.6)

del : This removes the base at the currently bound unit, and moves the enzyme one unit to the right. If
there is no base to the right then the enzyme halts. This amino-acid does not affect the complementary
strand, although if copy-mode is active then the complementary base will be inserted after the enzyme
has moved one unit to the right. However it is unclear from the original specification as to what
happens after the base is removed. If the base is simply removed, a gap in the strand appears and
it has been split into two separate strands. An alternate interpretation would be to fill in the gap by
shifting all units on the right to the left. This is the interpretation made by Morris[11, page 376] which
is problematic as it results in misalignment of complementary bases between the two strands.
It is not explicitly stated that complementary bases must be lined up. However, in real genetics complementary bases appear as a result of chemical bonding. As a result of this complementary bases
should remain aligned. Hence the system description for this project assumes that bases are not shuffled to the left into the gap. Assuming copy-mode is active, an example of the del amino acid operating
is:
    upper strands:  TGGT                    ATGGT
    lower strand:   ACCaTCA    --del-->     ACC tCA          (2.7)

ina, inc, ing, int : These operations insert the bases A, C, G and T respectively to the right of the currently
bound unit. If copy-mode is active the complementary base is also inserted. If copy-mode is not active
then a gap is inserted into the complementary strand.
An example of ina operating with copy-mode enabled is:
    upper strands:  TA                      TAT
    lower strand:   ACCgTA     --ina-->     ACCgATA          (2.8)

A further example of ing operating with copy-mode disabled is:

    upper strands:  TATCGC                  TAT, CGC
    lower strand:   ACCgATA    --ing-->     ACCgGATA         (2.9)

mvl, mvr : These operations move the enzyme one unit to the left or right respectively. If copy-mode is
active, the complementary base (see Figure 2.1) is inserted above the newly bound unit. If the enzyme
moves off the end of a strand (or into a gap between strands) the enzyme ceases to operate. While this
behaviour is not explicitly stated by Hofstadter, it is inferred and consistent with the interpretation
made by Morris while Varetto does not explicitly address it.
An example of this operation is:
    upper strand:   TgG                     TGg
    lower strand:   ACCAGTCA   --mvr-->     ACCAGTCA         (2.10)

Another mvr operation would result in the enzyme moving off the end of the top strand and halting.
lpy, lpu : These operations search left from the currently bound unit for the nearest pyrimidine or purine
respectively (see Figure 2.1) and bind the enzyme to that location. Hofstadter does not specify whether these
amino-acids can traverse gaps or not. However Morris makes the interpretation that unlike mvr and mvl,
these searching amino-acids can traverse gaps in either of the strands, but not both (in which case
the enzyme ceases to operate). If copy-mode is active then gaps are filled in. In this situation, the
bases that are filled in are not considered candidates for terminating the search. An example of the
lpu operation is:

    ACGTCCTA    --lpu-->    ACGTCCTA                         (2.11)

rpy, rpu : These operate like the lpy and lpu amino-acids except they search right instead of left.
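
The four searching amino-acids can be sketched for the simplest case: a single gap-free strand with copy-mode off and no complementary strand. The sketch assumes that the search begins at the unit next to the currently bound one (so the bound unit itself is never a candidate), numbers units from 1, and returns None when the enzyme runs off the end of the strand. The function name search is an assumption made for this example.

```python
# Illustrative sketch for a single gap-free strand with copy-mode off.
# Assumption: the search starts at the neighbouring unit, not the bound unit.
PURINES = frozenset("AG")
PYRIMIDINES = frozenset("CT")


def search(strand, unit, amino):
    """Apply lpy, lpu, rpy or rpu from a 1-indexed unit.

    Returns the newly bound unit, or None if no matching base is found
    before the end of the strand (the enzyme then ceases to act).
    """
    if amino not in ("lpy", "lpu", "rpy", "rpu"):
        raise ValueError("not a searching amino-acid: %r" % (amino,))
    step = -1 if amino[0] == "l" else 1
    targets = PYRIMIDINES if amino.endswith("py") else PURINES
    position = unit + step
    while 1 <= position <= len(strand):
        if strand[position - 1] in targets:
            return position
        position += step
    return None
```

On the strand ACGTCCTA, an lpu applied at unit 8 skips the pyrimidines at units 7 to 4 and binds to the G at unit 3.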

2.6 Strand Self-Application


Much of this project involves strands that are applied to themselves. This is due to a puzzle that accompanied the original description of typogenetics by Hofstadter [3, pages 512–513]. This puzzle is stated as:

. . . it would be most interesting to devise a self-replicating strand. This would mean something along the following lines. A single strand is written down. [This is translated] to produce
any or all of the enzymes which are coded for in the strand. Then those enzymes are brought
into contact with the original strand, and allowed to work on it. This yields a set of daughter strands. The daughter strands themselves [are translated] to yield a second generation of
enzymes, which act on the daughter strands; and the cycle goes on and on. This can go on for
any number of stages; the hope is that eventually, among the strands which are present at some
point, there will be found two copies of the original strand (one of the copies may be, in fact, the
original strand).
While this describes how strands are applied to themselves (after being translated to enzymes), exact details are omitted. For example it is not clear what happens when a strand codes for several enzymes, in
particular the order in which they are applied (if at all). The most flexible interpretation is that the enzymes
can be applied in any order. However assuming a given order of enzyme application, it is also unclear
which strands subsequent enzymes are applied to when the application of an earlier enzyme yields several
strands. As an example of this dilemma, consider the strand CCCCACAAAG, which translates to mvlmvlcut
and del, both of which have a binding preference of A.
Application of the first enzyme results in:
    CCCCACAAaG    --mvlmvlcut-->    CCCCACa                  (2.12)


The strand AAG is produced as a side-effect. It is not clear which of the two strands the del enzyme should
operate on, a problem often compounded by the presence of additional complementary strands.
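
The application step in (2.12) can be reproduced with a deliberately restricted Python sketch. It handles only mvl, mvr and cut on a single gap-free strand with no complementary strand and copy-mode off; the function name apply_simple and the returned tuple layout are assumptions for this example.

```python
# Illustrative sketch: only mvl, mvr and cut, on one gap-free strand,
# with no complementary strand and copy-mode off.
def apply_simple(strand, unit, enzyme):
    """Apply an enzyme starting at a 1-indexed bound unit.

    Returns (remaining strand, bound unit or None, list of detached strands).
    """
    detached = []
    for amino in enzyme:
        if amino in ("mvl", "mvr"):
            unit += -1 if amino == "mvl" else 1
            if not 1 <= unit <= len(strand):
                unit = None  # moved off the end: the enzyme ceases to act
                break
        elif amino == "cut":
            # Sever the strand to the right of the bound unit, if anything is there.
            if unit < len(strand):
                detached.append(strand[unit:])
                strand = strand[:unit]
        else:
            raise NotImplementedError(amino)
    return strand, unit, detached
```

Starting from the right-most A of CCCCACAAAG (unit 9), apply_simple returns the strand CCCCACA with the enzyme bound at unit 7 and the detached strand AAG.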
In biological genetics such issues aren't important since there will always be a large number of strands
present. Even if operating non-deterministically, the large number of strands means that due to chance
every combination will effectively occur simultaneously.
However in terms of implementing a system for typogenetics, this is not easily realised due to time- and
space-complexity constraints. Hence this project assumes a more restricted interpretation:
1. Enzymes are applied in the order they were translated;
2. An enzyme is bound to the strand that the previous enzyme was last bound to when it ceased operating.
These restrictions simplify typogenetics without violating the original description. Since these rules are restrictions or restraints, not modifications, any resulting daughter strands could also result from the original
process described by Hofstadter.


Chapter 3

Strand Classes
During this project a fundamental process undertaken using typogenetics was strand self-application, as
described in Section 2.6. Through this process strands can be thought to have daughter strands, as well
as sibling and parent strands.
The daughter strands of a strand (called the parent) are the strands that result from applying a strand to itself.
Just as a strand can have several daughter strands, it can also have several parent strands. In the case where
a strand has several daughter strands, the daughters are sibling strands to each other.
The relationships between strands can be represented as a directed graph (digraph) with vertices representing strands and edges representing the daughter relationship.

3.1 Definitions
The digraph representation of the relationship between strands can be used to classify the behaviour of
strands during repeated self-application as described in Section 2.6. While Hofstadter's puzzle is concerned
with finding (and therefore classifying) self-replicating strands, various other classes of strand exist and can
be described in terms of graph-theory.

[Figure 3.1 panels: (a) selfM, (b) pro-selfM, (c) selfP, (d) pro-selfP, (e) selfR, (f) pro-selfR]

Figure 3.1: Strand classification (for the strand marked in each panel). Dotted lines may be replaced by arbitrary graph structures with at least one incoming
and outgoing edge.

3.1.1 Duds
The class of dud strands (a term introduced by Morris[11, page 379]) includes all strands that do not produce any daughter strands. This can occur
when the strand encodes an enzyme which cannot bind to the original strand. An example of a dud strand
is CGGC. This strand translates to cop-inc, which has a binding preference of A. Since A does not occur in
the original strand, this enzyme cannot bind to it. As a graph, a dud strand is represented by a vertex with
no outgoing edges.


3.1.2 Self-Perpetuators
This class includes all strands that are self-sustaining. This means that during repeated self-application
the strand will periodically be present, but there will never be more than one copy at the same time. In
terms of graph theory, this corresponds to a strand being a vertex that is part of a cycle.
Varetto[17] divides the class of self-perpetuators into two subclasses. The first of these subclasses, selfM, occurs when the cycle consists of a single edge. The second subclass, selfP, consists of all other self-perpetuators, where the cycle is made up of more than one edge. These classes of strand are illustrated in
Figures 3.1(a) and 3.1(c).
Varetto[17] also introduces the notion of classifying strands that have at least one self-perpetuator as a
descendant. Strands that lead to a selfM or selfP strand are of the pro-selfM and pro-selfP classes
respectively. Examples of these strand classes are shown in Figures 3.1(b) and 3.1(d). The class of selfM
includes strands that translate to enzymes containing only amino-acids that do not modify the strand they
are operating on. Such strands are called trivial by Morris[11, page 380].

3.1.3 Self-Replicators
The strands sought in Hofstadter's puzzle (Section 2.6) belong to the class of self-replicators. These strands
are more than self-sustaining since they result in additional copies of themselves being produced. In terms
of graph theory, these strands correspond to vertices that are common to more than one cycle. Similar to the
notion of strands that lead to self-perpetuators, strands that lead to self-replicators are of the pro-selfR class.
These classes are illustrated in Figures 3.1(e) and 3.1(f) respectively. Note that the examples illustrated for
each class of strand are simple ones; the definitions for each class also apply generally to cycles of longer
length and arbitrary complexity.

3.2 Known Example-Strands


Previous studies have presented examples of the various classes of strand. Morris presents several example
strands to illustrate the various classes of strand within his system. The strand TCAGCGATGTCTAGAGCT is
given as an example of a non-trivial self-perpetuator[11, page 380]. This strand translates to rpu-del-cop-swi-int-off-del-del-off and has a binding preference of T (using the Morris binding-method). This enzyme can
bind to any T, but for self-perpetuating behaviour to be observed it must bind to the second right-most T.
The enzyme then operates:
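\[
\begin{aligned}
&\texttt{TCAGCGATGTCtAGAGCT} \xrightarrow{\textit{rpu}} \texttt{TCAGCGATGTCTaGAGCT} \xrightarrow{\textit{del}} \texttt{TCAGCGATGTCTgAGCT} \\
&\xrightarrow{\textit{cop}} \texttt{TCAGCGATGTCTgAGCT} \xrightarrow{\textit{swi}} \texttt{TCAGCGATGTCTGAGCT} \xrightarrow{\textit{int}} \texttt{TCAGCGATGTCTAGAGCT} \\
&\xrightarrow{\textit{off-del}} \texttt{TCAGCGATGTCTAGAGCT} \xrightarrow{\textit{del}} \texttt{TCAGCGATGTCTAGAGCT}
\end{aligned}
\tag{3.1}
\]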

At this point the enzyme ceases to operate since it is no longer bound to a strand. This example relies on
Morris'[11] interpretation of del, which assumes the gap is filled. However, it also illustrates the previously-mentioned problem of base-alignment. Strand (3.1) shows a violation of how complementary bases should
always be aligned. This occurs because the del operation results in the upper T being aligned with a G.

As an example of a self-replicating strand, Morris[11, page 381] presents CGCGCGCGTAATATAACGATCGCGCGTATTAATTAATACGCGCGATCGTTATATTACGCGCGCG. This translates into four enzymes with binding
preferences C, G, C and A respectively. The first enzyme is intended to bind to the left-most C and uses
copy-mode to duplicate the first three bases. The second enzyme binds to the right-most G and inserts a
complementary C using copy-mode before switching to the upper strand. Once bound to the upper strand,
the enzyme then traverses the length of the entire strand with copy-mode enabled. At this point the upper
strand is a complementary copy of the entire lower strand. The remaining two enzymes do not alter the
strands. The upper and lower strands are then separated resulting in two identical strands. The strands
are identical because the original strand was crafted by Morris such that the second half of the strand is a
complementary mirror of the first half. Strands which have this property in biological genetics are called
inverted repeats[7, 16].
Varetto[16] claims the strands GC, GGC and GTGC are examples of self-replicators within his system. According to simulation results provided by Varetto[16, page 194], the daughter-tree of GC splits at the tenth
generation. In one tree the strand GC recurs in generation 28, while it occurs in the other tree in generation
43. Using our digraph representation, this corresponds to GC being a vertex that is common to at least two
cycles which are of length 28 and 43 respectively.
However these results were not verifiable. The strand GGC at generation 1 translates to int with a binding
preference of G. Using Varetto's rules this enzyme should bind to the right-most G and insert a T to the
right:
\[ \texttt{GgC} \xrightarrow{\textit{int}} \texttt{GgTC} \tag{3.2} \]

This disagrees with the simulation results presented by Varetto[16, page 194] which show a daughter of
GTGC (one of the example self-replicators given). This would seem to indicate that the simulation software
either bound initially to the left-most G or inserted the T to the left of the currently bound unit instead of
the right. Both of these behaviours contradict the rules of simulation presented earlier in the paper.
Another example, CGTTTTTTTG, is given as a self-replicating strand by J-Aro[4]. This strand achieves
self-replication by generating its complementary strand (CAAAAAAACG). However, instead of using our
(added) rule that enzymes can only attach to the strand that encoded for them, the enzyme from the original
strand is applied to this new complementary strand, which results in a new copy of the original strand being
generated: the complement's complement.

3.3 Constructing Strands


This section outlines some methods that can be used to generate or construct strands of various classes.
Dud Strands: These can be constructed by building a strand that consists of only three of the four bases and padding it in
such a way that the enzyme it encodes has a binding preference of the missing base. For example, we might
choose to build a strand that does not have an A base but translates to an enzyme with a binding preference
of A. This is a convenient choice because from Table 2.2 the final link in the enzyme's secondary structure
must point to the right (like the first link). While it is possible to ensure this by choosing amino-acids that
bend back and forth to eventually end up pointing to the right, it is simpler to only use amino-acids which
result in a straight link. From Table 2.1 this allows us to use the cut, del, mvr, mvl and ina amino-acids in
our enzyme. The list of available amino-acids is further reduced since we don't wish to use A in our strand,
which eliminates all the amino-acids except for mvl. This is coded for by CC, meaning we can assume that
any strand consisting of just the base C is a dud strand. The method described here can be extended to use
other bases by ensuring the strand translates to an enzyme with the correct binding preference.
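The binding-preference reasoning used above can be sketched in Python. This is a minimal sketch rather than the thesis implementation: the kink of each amino-acid and the mapping from final-link direction to preferred base are assumed to match Tables 2.1 and 2.2, the kinks of the first and last amino-acids are assumed not to affect the fold, and the names KINK, PREFERENCE, binding_preference() and cannot_bind() are hypothetical.

```python
# Kink of each amino-acid: "s" straight, "l" left, "r" right.
# ASSUMPTION: these values match Table 2.1 of the thesis.
KINK = {
    "cut": "s", "del": "s", "swi": "r",
    "mvr": "s", "mvl": "s", "cop": "r", "off": "l",
    "ina": "s", "inc": "r", "ing": "r", "int": "l",
    "rpy": "r", "rpu": "l", "lpy": "l", "lpu": "l",
}

# Direction of the final link, numbered clockwise from "right":
# 0 = right, 1 = down, 2 = left, 3 = up.
# ASSUMPTION: this mapping matches Table 2.2 of the thesis.
PREFERENCE = {0: "A", 1: "G", 2: "T", 3: "C"}


def binding_preference(enzyme):
    """Return the preferred base of an enzyme written like 'cop-inc'."""
    acids = enzyme.split("-")
    direction = 0  # the first link points to the right
    # ASSUMPTION: only the inner amino-acids bend the structure.
    for acid in acids[1:-1]:
        if KINK[acid] == "r":
            direction = (direction + 1) % 4
        elif KINK[acid] == "l":
            direction = (direction - 1) % 4
    return PREFERENCE[direction]


def cannot_bind(enzyme, strand):
    """True when the enzyme's preferred base does not occur in the strand."""
    return binding_preference(enzyme) not in strand


print(binding_preference("cop-inc"), cannot_bind("cop-inc", "CGGC"))
print(binding_preference("mvl-mvl"), cannot_bind("mvl-mvl", "CCCC"))
```

Under these assumptions both CGGC (cop-inc) and CCCC (mvl-mvl) encode an enzyme that prefers A, a base that neither strand contains.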
Self-Perpetuators: Of the self-perpetuators, selfM strands are the easiest to construct. This is because it is possible
to ensure that a strand translates to an enzyme that leaves the original strand intact when applied. One
method for constructing selfM strands involves ensuring the strand translates to only amino-acids which
do not change the strand they're bound to. Enzymes that consist purely of such harmless amino-acids
are called trivial enzymes. Such enzymes must be constructed exclusively from: swi, mvr, mvl, cop, off,
rpy, rpu, lpy and lpu. This allows considerable freedom in constructing strands. Since this is more than
half of the available amino-acids, it stands to reason that more than half of all strands are of this type.
An example of such a strand is TCCGCAATTT, which translates to rpu-cop-mvr-swi-lpu. This enzyme does
produce an additional strand when applied, but the original strand is left intact. The additional strand can
be prevented by ensuring the cop amino-acid is not coded for in the strand (as it was in this example).
Another method for constructing a selfM strand is to ensure that the enzyme(s) it encodes terminate before
reaching any amino-acids that modify the strand. An example of this technique is ensuring the strand begins with the AT base-pair, which translates to the swi amino-acid. This becomes the first operation when the
enzyme is applied; the enzyme attempts to switch to the non-existent complementary strand and immediately ceases operation before any further amino-acids can be applied. However, not every strand beginning
with AT exhibits this behaviour: it is possible for AA base-pairs to encode subsequent enzymes which will
be applied even if the first one terminates. To preserve this behaviour, any AA base-pairs on even boundaries
must be followed by AT as well. An example of this form is ATGTCAAGCGAAATCG.
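Both construction methods reduce to simple checks, sketched below in Python. The function names are hypothetical and the sketch only tests the stated conditions; it does not simulate enzyme application. The second check treats AA pairs on even boundaries as separators and requires every encoded enzyme to begin with AT (swi).

```python
# Amino-acids listed in the text as leaving the bound strand unmodified.
HARMLESS = {"swi", "mvr", "mvl", "cop", "off", "rpy", "rpu", "lpy", "lpu"}


def is_trivial(enzyme):
    """True if an enzyme written like 'rpu-cop-mvr' uses only harmless amino-acids."""
    return all(acid in HARMLESS for acid in enzyme.split("-"))


def every_enzyme_starts_with_swi(strand):
    """True if each enzyme encoded by the strand begins with the AT base-pair."""
    # Read base-pairs on even boundaries; a trailing unpaired base is ignored.
    pairs = [strand[i:i + 2] for i in range(0, len(strand) - 1, 2)]
    genes, current = [], []
    for pair in pairs:
        if pair == "AA":          # AA separates consecutive enzymes
            genes.append(current)
            current = []
        else:
            current.append(pair)
    genes.append(current)
    genes = [gene for gene in genes if gene]   # drop empty enzymes
    return bool(genes) and all(gene[0] == "AT" for gene in genes)


print(is_trivial("rpu-cop-mvr-swi-lpu"))
print(every_enzyme_starts_with_swi("ATGTCAAGCGAAATCG"))
```

Note that the AA inside CAAG in the example strand straddles two base-pairs (CA and AG), so it is not treated as a separator.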
Self-Replicators: Self-replicating strands are harder to construct than the previous classes. However, the method
used by Morris[11, pages 380-382] can be modified to allow for construction of self-replicating strands
within our system. This method works by devising a strand that constructs an inverted repeat of itself in
such a way that the inverted repeat is the same as the original strand. Examples of such inverted repeats are
the strands TA and CG. To be an inverted repeat, the strand must be palindromic in that the base in position i
must be complementary to the base in position (n - i + 1), where n is the length of the strand. The following
relation must hold true:
\[ \forall i,\quad b_i = \overline{b_{(n-i+1)}} \tag{3.3} \]

where $\overline{b}$ denotes the complement of b.
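The inverted-repeat condition can be checked directly. The following Python sketch uses 0-based indexing, so position i is compared with position n - 1 - i; the name is_inverted_repeat() is hypothetical.

```python
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def is_inverted_repeat(strand):
    """True if every base is the complement of its mirror-image base."""
    n = len(strand)
    return all(strand[i] == COMPLEMENT[strand[n - 1 - i]] for i in range(n))


for candidate in ("TA", "CG", "CGATTATAATCG", "GA"):
    print(candidate, is_inverted_repeat(candidate))
```

A strand of odd length can never satisfy the relation, since its middle base would have to be its own complement.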


While it is possible to make any number of strands that satisfy this requirement, a self-replicator must also
code for an enzyme that will generate the required complementary strand. One way to achieve this is to
assume an enzyme binds to the right-most base and turns on copy-mode (which inserts the complementary
base). The enzyme should then switch to the upper strand and traverse the entire length of the strand,
filling in the complementary bases as it goes. The amino-acids required to do this are cop-swi-rpy, where
the last could be rpu instead; the choice is arbitrary since either amino-acid would result in the enzyme
traversing the entire length of the lower strand. These amino-acids are coded for by base-pairs of CG, AT
and TA, meaning the start of our self-replicating strand would be CGATTA.
To satisfy the requirements for an inverted repeat, the right-half of the strand must be TAATCG, yielding a
minimal strand of CGATTATAATCG. While this strand translates to an enzyme with the correct operations,
the enzyme has a binding preference of A. This can be corrected by padding out the center of the strand
with an arbitrary inverted-repeat that causes the binding preference to be corrected. The amino-acids coded
for by this arbitrary strand are not important since they never get applied if the enzyme operates correctly:
the rpy operation causes the enzyme to move off the end of the strand and cease operating. Some padding
that corrects the binding is AT, which results in a self-replicator of CGATTAATTAATCG. The operation of this
self-replicator is illustrated in Figure 3.2(a).
As an alternative to padding in this manner, the rpu amino-acid can be tried instead of rpy since it causes
a different kink in the enzyme's secondary structure (left instead of right). This results in the left-half of
the strand being CGATTC, the right-half being GAATCG and an overall strand of CGATTCGAATCG. Unlike
the first self-replicator, this strand translates to an enzyme which has a binding preference of G, which will
bind initially to the right-most G. This self-replicator is illustrated in Figure 3.2(b). An arbitrary number
of self-replicators can be constructed using this strand as a wrapper around any other strand that is an
inverted-repeat.

Figure 3.2: Example self-replicating strands. (a) Initial attempt at constructing a self-replicator: strand CGATTAATTAATCG, enzyme cop-swi-rpy-swi-rpy-swi-cop. (b) Subsequent attempt at constructing a self-replicator: strand CGATTCGAATCG, enzyme cop-swi-rpu-ina-swi-cop.

In addition to the method of self-replication through constructing an inverted-repeat, Morris[11, pages 381-382]
describes another method with which a self-replicator could operate. This involves a strand extending
itself (in either direction) to a point where it consists of two copies of the original strand concatenated. At
this point the cut amino-acid would operate, resulting in two copies of the original strand. There are no
known examples of strands which operate in this manner. Indeed it is impossible for a strand to operate
in this way within a single generation because the only amino-acids capable of extending a strand in this
manner are the ones which insert bases. To encode an amino-acid which inserts a base requires a base-pair.
Hence a strand of length n can encode for at most n/2 base-insertions, whereas doubling the strand requires n. The matter is further complicated by
the need to encode the cut amino-acid and ensure that it eventually operates at the correct location.


Chapter 4

System Implementation
This chapter outlines the software developed to simulate typogenetics as well as details regarding the implementation. The source-code discussed is listed in Section A.1 starting on page 31.

4.1 Implementation Platform


The Python language was chosen for the implementation for reasons including speed of development and
author familiarity. Since Python is interpreted, the resulting implementation was not as fast as it would
be using a compiled language, but this was offset by the convenience with which the system was developed and modified. The interactive Python interpreter also provided a convenient environment in which
typogenetics could be simulated in an ad-hoc manner.

4.2 Design Considerations


Several factors were taken into account when designing the framework for implementation.
A clear model of typogenetics was important to aid understanding of the source code and to make it easy to
experiment with the implementation. In practice this was realised using object-oriented programming (OOP), with classes representing the
various components of the typogenetics system.
The clear modelling of typogenetics also made it easy to modify the system with a short development
cycle. While it was clear how typogenetics was intended to operate, many ambiguities were discovered
after initial implementation. The modelling of the system allowed for changes in the specification to be
incorporated with minimal effort. Changes tended to be localised which reduced the amount of code that
needed to be re-tested with each change.
The final design influence aimed to compensate for the slowdown associated with the interpreted and
highly dynamic nature of the language. This impacted more upon the implementation of the smaller components of the system.

4.3 System Design


Within the typogenetics module the classes developed were Strand, StrandFactory, Enzyme and Applicator.
In addition the helper functions translate() and mogrify() were implemented.


4.3.1 Strand Class


The Strand class was designed to encapsulate the behaviour of a strand within typogenetics. Internally the
class stored the strand information using the array module.
Arrays created by the array module aren't as flexible as built-in Python lists. However, they have the
advantage that they are intended to only store homogeneous data and are optimized accordingly. The
Strand class represents the strand internally as an array of byte values with a mapping of: A = 0, C =
1, G = 2, T = 3. Values stored within a strand are not forced to be within this range, allowing for gaps in
the strand to be represented using a value of 32.
In addition to exporting methods that allow the Strand class to mimic other sequences, methods are present
to increment the strand as well as return the positions of all the bases in the strand. The basepositions()
method is useful when determining where an enzyme should bind.
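The storage scheme described above can be illustrated with a minimal sketch. It is not the class listed in Appendix A: the name StrandSketch, the constructor taking a text string, and the signature of basepositions() (here taking the base to look for) are assumptions made for illustration only.

```python
from array import array

BASES = "ACGT"   # index in this string is the stored value: A=0, C=1, G=2, T=3
GAP = 32         # value the text gives for a gap in the strand


class StrandSketch:
    """Hypothetical stand-in for the thesis's Strand class."""

    def __init__(self, text):
        # Unsigned bytes ("B") hold the base codes; a space denotes a gap.
        self.units = array("B", (GAP if ch == " " else BASES.index(ch) for ch in text))

    def __len__(self):
        return len(self.units)

    def __getitem__(self, index):
        return self.units[index]

    def __str__(self):
        return "".join(" " if unit == GAP else BASES[unit] for unit in self.units)

    def basepositions(self, base):
        """Return the positions at which the given base occurs."""
        code = BASES.index(base)
        return [i for i, unit in enumerate(self.units) if unit == code]


strand = StrandSketch("TCAGCGATGTCTAGAGCT")
print(strand.basepositions("T"))   # candidate binding sites for an enzyme preferring T
```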

4.3.2 StrandFactory Class


This class was designed to allow for enumeration of all possible strands. Internally the StrandFactory class
uses a priority queue to allow for the strands to be returned in an unusual order if desired. In a search
strategy this is useful in allowing some strands to be looked at before other strands. The StrandFactory
class exports a mapping interface which allows a priority to be associated with a strand.

4.3.3 Enzyme Class


The Enzyme class encapsulates the behaviour of an enzyme within typogenetics. The class contains the
amino-acids that make up the enzyme and handles translation from base-pairs to amino-acids.
The translate() helper function simplifies the creation of Enzyme objects by performing the entire translation
process. It takes a strand as a parameter and returns the list of enzymes that result from translation.
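The translation step can be sketched as follows. This is an illustrative stand-in for translate(), not the listed source: it returns lists of amino-acid names rather than Enzyme objects, the name translate_sketch() is hypothetical, and the base-pair table is assumed to match Table 2.1 (with AA acting as the separator between enzymes).

```python
# Base-pair -> amino-acid table, built in the order AA, AC, AG, AT, CA, ...
# ASSUMPTION: this matches Table 2.1; "pun" marks the AA separator.
AMINO_ACIDS = dict(zip(
    (first + second for first in "ACGT" for second in "ACGT"),
    "pun cut del swi mvr mvl cop off ina inc ing int rpy rpu lpy lpu".split(),
))


def translate_sketch(strand):
    """Return the enzymes encoded by a strand, each as a list of amino-acid names."""
    enzymes, current = [], []
    # Read base-pairs left to right; a trailing unpaired base is ignored.
    for i in range(0, len(strand) - 1, 2):
        acid = AMINO_ACIDS[strand[i:i + 2]]
        if acid == "pun":
            if current:
                enzymes.append(current)
            current = []
        else:
            current.append(acid)
    if current:
        enzymes.append(current)
    return enzymes


print(translate_sketch("TCAGCGATGTCTAGAGCT"))
print(translate_sketch("ATGTCAAGCGAAATCG"))
```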

4.3.4 Applicator Class


The Applicator class creates a context within which a series of enzymes can be applied to a strand.
After an Applicator object is created from a strand, the apply() method can be used to apply an enzyme to the
strand. The extractstrands() method returns the strands that are present within the Applicator. Internally the
moveto() method is used to set the position at which the enzyme is bound. It in turn uses compcopyunit()
to create the complementary base if copy-mode is enabled.
The insert() method is used to insert a base at the currently-bound location. The predicate methods pyrimidine(),
purine() and isblank() return true depending on the base at the currently-bound location. These can be used
as a parameter for the search() method which seeks bases that match the predicate.
The amino-acids are implemented in self-named methods, with the exception that del is implemented in a
method called nuk() as del is a reserved word in Python. These methods are relatively simple to implement
given the available support methods. The execute() method is responsible for dispatching amino-acid
requests.

To simplify the process of applying an enzyme to a strand the mogrify() helper function can be used. This
takes a strand and a list of enzymes as parameters, returning a list of strands that result from application of
the enzymes. This can be used in the form of daughters = mogrify(parent, translate(parent)).


Chapter 5

Searching for Strands


The puzzle put forward by Hofstadter (see Section 2.6) involves finding strands that are self-replicators.
This can be generalised to classifying the behaviour of strands according to the classes described in Chapter 3. This chapter describes the method used for this project to explore typogenetics and classify strands.
This process of systematic exploration was treated as a search problem with the search-space being the set
of all strands.

5.1 Search Representation


The digraph representation for typogenetics provided a representation by which searching can be performed. Classifying strands became a two-step problem:
1. Building a graph structure containing the strands that need to be classified.
2. Identifying cycles in the graph structure and how many cycles each vertex is part of.
This approach identifies selfP and selfR strands. While it does not identify pro-selfP and pro-selfR strands,
the model for the problem is easily extended to allow this.

5.2 Search Algorithm


An efficient approach for this problem would have integrated both steps. An incremental algorithm allowing for cycle-detection with O(1) time-complexity after each strand insertion would have been ideal. The
graph would be constructed by inserting strands and new cycles detected after each insertion. However
such an incremental algorithm could not be found in the literature[2, 5, 13] or devised. Hence the best
approach was to build the graph and periodically check the entire graph-structure for cycles.
For simplicity (as well as a degree of scalability) these tasks were separated into two pieces of software.
The first built a graph structure containing strands and stored this structure in a file. The second piece of
software read this structure from disk and then identified and displayed cycles as well as strands that were
common to more than one cycle.


5.2.1 Graph Generation


To keep the graph structure as general as possible it was not always connected. This means that any strand could be present in the graph, but not necessarily related to any other strand. The algorithm used to
generate the graph was:
1. Create empty graph and initialize priority-queue of strands to expand.
2. Extract first strand from queue.
3. Determine all daughters of the strand.
4. Add vertices for each daughter to graph with edges from the parent strand.
5. Increase the priority1 with which daughters will be expanded.
6. If time to checkpoint, store the graph structure to a file.
7. Go back to step 2.
This algorithm employs the use of a priority-queue that implicitly contains all possible strands. The initial
priority associated with each strand as well as step 5 are discussed further below.
Priority-Queue
Used by the StrandFactory class listed in Appendix A, the priority-queue operations required were that of
inserting a strand with a priority, finding the strand with the minimum priority, extracting the strand with
the minimum priority and decreasing the priority associated with a strand.
The algorithm was expected to run for substantial periods of time such that many (thousands or more)
strands were expected to be in the queue at any given time. Hence it was important that the priority-queue
scale well. For this reason an efficient priority-queue was developed using Fibonacci heaps[2, pages 420-439].
This allowed insertion, finding the minimum and decreasing a priority to be implemented with O(1) amortized time-complexity, with extraction of the minimum taking O(log n) amortized time. The main code
for the priority-queue is listed in Appendix A starting on page 43.
Heuristics
Implicitly the priority-queue used by the algorithm contains the set of all possible strands which means that
eventually every possible strand will be expanded. However since there is an infinite number of possible
strands the order of the strands in the queue becomes important in determining how long it takes to find
cycles.
One possible ordering would be to order the strands from smallest to largest; this would be equivalent
to each strand having its length as its priority. Apart from proceeding very slowly, this approach treats
each strand as equally likely to be a member of a cycle. While this is a valid assumption a priori, during
generation of the graph it is reasonable to assume that a strand with a known parent is more likely to be
part of a cycle than other arbitrary strands.
At the other extreme it is not possible to expand all strands in the graph before any new (parentless) strands
are introduced: some strands increase forever and neither produce a dud nor result in a cycle. An example
of such a strand is GA, which translates to ina and has a binding preference of A. The daughter of GA is GAA,
which in turn has a daughter of GAAA and so on ad infinitum.
1 Higher priority is actually associated with a lower numerical value in the implementation.
Hence an approach is needed that allows all strands to be searched, but biases the algorithm to expand
strands that are more likely to be part of a cycle. Devising such a heuristic is extremely difficult due to the nature of
typogenetics. According to Pearl[12] one of the essential ingredients in formulating an effective heuristic is
knowledge of the rules that govern the transitions between states in the search-space. In our representation
of the search, states are represented by strands and transitions correspond to the daughter relationship
governed by strand self-application as described in Section 2.6.
Despite the rules for transition being well-known in our situation, there is no general way to determine
the result of self-application without actually simulating it. The earliest one can predict the result is after
translation where it is possible to ascertain if the enzyme(s) will create new strands, modify the parent
strand, or do nothing at all. This information is only revealed half-way through the self-application process.
This effectively means that there is no way to form a heuristic based on the contents of the strand.
The approach used by the StrandFactory class was a heuristic based on the length of the strand. Using this
method, a strand is initially assigned a priority equal to 10 times its length. Step 5 of the algorithm changes
the priority to the length of the strand. This means that the algorithm is biased towards expanding strands
already present in the graph but will still introduce new strands. This prevents ever-increasing strands
(such as GA) from halting the introduction of new strands into the graph structure while simultaneously
biasing the search towards strands with known parents.
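The generation loop of Section 5.2.1 combined with this length-based heuristic can be sketched in Python. This is a simplified illustration rather than the thesis software: it uses the standard heapq module with lazily discarded stale entries instead of a Fibonacci heap, takes an explicit list of seed strands instead of an implicit queue of all strands, omits checkpointing, and relies on a caller-supplied daughters_of() function. All names are hypothetical.

```python
import heapq
from itertools import count


def initial_priority(strand):
    return 10 * len(strand)   # new (parentless) strand


def known_priority(strand):
    return len(strand)        # strand with a known parent (step 5)


def build_graph(seeds, daughters_of, max_expansions):
    """Expand strands in priority order; lower numbers are expanded first."""
    graph = {}                # strand -> set of daughter strands
    order = []                # strands in the order they were expanded
    expanded = set()
    tie = count()             # tie-breaker so equal priorities pop first-in, first-out
    queue = [(initial_priority(s), next(tie), s) for s in seeds]
    heapq.heapify(queue)
    while queue and len(order) < max_expansions:
        _, _, parent = heapq.heappop(queue)
        if parent in expanded:
            continue          # stale entry left behind by a priority change
        expanded.add(parent)
        order.append(parent)
        daughters = set(daughters_of(parent))
        graph.setdefault(parent, set()).update(daughters)
        for daughter in daughters:
            graph.setdefault(daughter, set())
            if daughter not in expanded:
                heapq.heappush(queue, (known_priority(daughter), next(tie), daughter))
    return graph, order


# Toy daughter function: strands starting with GA grow by one A, others are duds.
def grow(strand):
    return [strand + "A"] if strand.startswith("GA") else []


graph, order = build_graph(["GA", "CC"], grow, 20)
print(order.index("CC"))
```

In this toy run the ever-growing descendants of GA are expanded until their length reaches 20, at which point the parentless strand CC (priority 20) is expanded, so the growing chain cannot starve new strands indefinitely.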

5.2.2 Graph Traversal


With a graph structure stored in a file, the second piece of software was developed to analyze the graph.
The purpose of the analysis was to detect cycles and count the number of cycles each vertex was part of.
Cycle-detection for directed graphs is a problem encountered during deadlock-prevention in operating
systems. Detecting all cycles can be performed with $O(n^2)$ time-complexity where n is the number of
vertices in the graph[14, page 235]. This is performed using a depth-first search (DFS) of the graph structure.
In practice this project enjoyed behaviour better than the expected $O(n^2)$ due to the disconnected nature
of the graph structure. This means that each traversal from a given node reached far fewer than n vertices.
The performance of the cycle-detection algorithm is shown in Figure 5.1, where it can be seen the behaviour
was more linear than quadratic.
When each cycle was detected, the vertices on the cycle were noted along with the length of the cycle.
Once all cycles had been detected, the number of cycles each vertex was part of was counted. This allowed for
classification of strands as belonging to one of the selfM, selfP or selfR classes.
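The analysis step can be illustrated with a small Python sketch. It is not the algorithm used by the project: it enumerates every simple cycle with a recursive depth-first search rooted at each cycle's smallest vertex, which is only practical for small graphs, and then applies the class definitions of Chapter 3. The function names and the vertices R and S are hypothetical.

```python
def simple_cycles(graph):
    """Return each simple cycle once, listed from its smallest vertex."""
    cycles = []

    def extend(start, path, on_path):
        for nxt in sorted(graph.get(path[-1], ())):
            if nxt == start:
                cycles.append(list(path))          # closed a cycle back to the root
            elif nxt > start and nxt not in on_path:
                # Only visit vertices larger than the root so that each
                # cycle is reported from its smallest vertex exactly once.
                path.append(nxt)
                on_path.add(nxt)
                extend(start, path, on_path)
                on_path.remove(nxt)
                path.pop()

    for start in sorted(graph):
        extend(start, [start], {start})
    return cycles


def classify(graph):
    """Map each vertex lying on a cycle to selfM, selfP or selfR."""
    cycles = simple_cycles(graph)
    classes = {}
    for vertex in graph:
        mine = [cycle for cycle in cycles if vertex in cycle]
        if len(mine) > 1:
            classes[vertex] = "selfR"
        elif len(mine) == 1:
            classes[vertex] = "selfM" if len(mine[0]) == 1 else "selfP"
    return cycles, classes


example = {
    "CCGGA": {"CCGGGA"}, "CCGGGA": {"CCGGAGGA"}, "CCGGAGGA": {"CCGGA"},
    "ATGTCAAGCGAAATCG": {"ATGTCAAGCGAAATCG"},   # selfM example from Section 3.3
    "CGGC": set(),                              # dud example from Section 3.1.1
    "R": {"R", "S"}, "S": {"R"},                # hypothetical vertices, not real strands
}
cycles, classes = classify(example)
print(classes)
```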

5.3 Search Results


The search results showed that self-perpetuators are very common within typogenetics. The largest graph
structure analyzed contained 90428 vertices and 102323 edges. In this structure a total of 25335 cycles were
found, but most were due to selfM strands: only 66 cycles were found with a length of more than 1.
The longest cycle found was 3 generations long. The two cycles of this length were CCGGA → CCGGGA →
CCGGAGGA → CCGGA and GGGGA → GGGGAGG → GGGGAG → GGGGA.


Figure 5.1: Performance of cycle-detection algorithm using DFS. The plot shows vertex visits against the number of vertices and against the number of edges.

No vertices were found that were common to more than one cycle, meaning no self-replicators were found
within the search-space explored. This does not necessarily mean there were no self-replicators present,
since it is possible that while parts of the cycles were present in the graph structure, the remaining vertices/edges hadn't been inserted.


Chapter 6

Computation Using Typogenetics


This stage of the project was concerned with determining the computations typogenetics can perform. The
aim was to determine whether the system is Turing-complete. Several approaches can be taken to show
this:
1. Showing an equivalence to a Turing machine directly;
2. Showing an equivalence to another system that is known to be Turing-complete.
To show such equivalences a representation was needed whereby one system could be translated to another.
This came down to how the components of a given system could be represented using typogenetics strands
and enzymes.

6.1 Machine Representation


Most formal systems have data and functions that act on data. In typogenetics these are naturally represented by strands and enzymes respectively. Functions that act on data are most conveniently represented by
any enzyme that can be devised to perform the required task. This is different to the case of self-application
used so far in this project since here the enzyme that operates on a strand doesn't have to be encoded by
that same strand.
However, choosing a representation for data isn't easy. A compact representation for natural numbers, for
example, could involve representing the number in base-4 with the bases representing the digits using A =
0, C = 1, G = 2, T = 3. Numbers of this form, however, would be hard to manipulate. Although Gödel
numbering[9, pages 259-260] could be used to represent arbitrary parameters with this single number, in
practice it would be difficult for enzymes to extract parameters. This problem could be alleviated by using
a base-3 representation with the digits represented as before and T reserved to separate numbers. This
would allow easier encoding of multiple numbers within a single strand, but manipulation would remain
difficult given the limited functionality provided by the amino-acids.
The simplest representation for natural numbers is to use a base-1 notation whereby the length of the strand denotes the number. For example, a convenient form for representing n could be $\mathtt{TA}^n$. This would
allow strands that represent numbers to be concatenated to form a strand that represents many numbers
unambiguously. Here the use of the bases T and A is arbitrary but this particular choice becomes useful
later on. As an example of this particular representation, the number 3 could be represented by the strand
TAAA.
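The base-1 representation is straightforward to encode and decode, as the following Python sketch shows. The function names encode() and decode() are hypothetical helpers and are not part of the typogenetics module.

```python
def encode(numbers):
    """Concatenate T followed by n copies of A for each number n."""
    return "".join("T" + "A" * n for n in numbers)


def decode(strand):
    """Recover the list of numbers from a strand built by encode()."""
    if set(strand) - {"T", "A"} or (strand and strand[0] != "T"):
        raise ValueError("not a valid base-1 strand: %r" % strand)
    # Every T starts a new number; the run of A bases after it is its value.
    return [len(run) for run in strand.split("T")[1:]]


print(encode([3]))
print(decode("TAAATAA"))
```

Because every number begins with T, concatenated numbers can be separated unambiguously, and zero is represented by a lone T.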


6.2 Primitive Recursion


Primitive recursion is a weaker form of computation than that of Turing-completeness. For primitive recursion to be possible, several initial functions must be present. These are the zero-function, $z() = 0$, the
successor function, $s(n) = n + 1$, and the projection functions, $p_i(\vec{n}) = n_i$. In addition the composition function, $c(\vec{n}) = g(h_1(\vec{n}), \ldots, h_l(\vec{n}))$, must be possible as well as primitive recursion where $f(\vec{n}, 0) = g(\vec{n})$ and
$f(\vec{n}, m + 1) = h(\vec{n}, m, f(\vec{n}, m))$. A computation is said to be primitive recursive if it is an initial function or
can be generated from the initial functions using composition and primitive recursion[8, pages 232-233].
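As a worked illustration of these definitions, independent of typogenetics, the following Python sketch builds addition and multiplication from the initial functions using only composition and primitive recursion. The helper names are hypothetical, and the recursion is unrolled into a loop for convenience.

```python
def zero(*args):
    return 0


def successor(n):
    return n + 1


def projection(i):
    """Return the function that selects its i-th argument (1-based)."""
    return lambda *args: args[i - 1]


def compose(g, *hs):
    """Return c with c(n...) = g(h1(n...), ..., hl(n...))."""
    return lambda *args: g(*(h(*args) for h in hs))


def primitive_recursion(g, h):
    """Return f with f(n..., 0) = g(n...) and f(n..., m + 1) = h(n..., m, f(n..., m))."""
    def f(*args):
        *n, m = args
        value = g(*n)
        for k in range(m):
            value = h(*n, k, value)
        return value
    return f


# add(n, 0) = n;  add(n, m + 1) = s(add(n, m))
add = primitive_recursion(projection(1), compose(successor, projection(3)))
# multiply(n, 0) = 0;  multiply(n, m + 1) = add(n, multiply(n, m))
multiply = primitive_recursion(zero, compose(add, projection(1), projection(3)))

print(add(5, 4), multiply(3, 4))
```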
The following sections describe how some of these functions can be represented within typogenetics and the
problems associated with representing others. Unless otherwise mentioned, the representation for integers
is the representation given in the previous section whereby $n \mapsto \mathtt{TA}^n$. The primitive recursive functions often
take several numbers as parameters. Where this is done, the parameters are represented by a single strand
that consists of the representations for each number all concatenated. For example, the parameters (5, 4)
would be represented by the strand TAAAAATAAAA.
Zero-function: The zero-function is relatively easy to represent within typogenetics if one assumes only a single
parameter is present. In this case, the enzyme lpy-cut is sufficient. This enzyme has a binding preference of
A, which for our number representation means the enzyme will bind to a part of the body of the number.
The first amino-acid, lpy, will search left for the T, after which the enzyme will cut the A bases off, resulting
in just the T being left (which is equal to zero). For example:
\[ \mathrm{TAAa} \xrightarrow{\text{lpy-cut}} \mathrm{T} \tag{6.1} \]

However this enzyme is not sufficient for the case of several parameters, since $z(5, 4)$ must also be 0. Applying it
would result in:
\[ \mathrm{TAAAAATAAA} \xrightarrow{\text{lpy-cut}} \mathrm{TAAAAAT} \tag{6.2} \]

It is possible to handle an arbitrary number of parameters by also cutting off all bases to the left of the
T. This can be done by enabling copy-mode and then switching to the top strand before performing a cut
operation. Hence a more robust version of the zero-function is the enzyme cut-cop-swi-cut-del. Here the
initial lpy operation can be omitted since this enzyme has a binding preference of T and thus starts in the
correct position on the strand. Ignoring the strands that were cut off (and assumed discarded), after this
enzyme operates all that remains is T. This is the desired output from the function:
\[ \mathrm{TAAAAAtAAA} \xrightarrow{\text{cut-cop-swi-cut-del}} \mathrm{T} \tag{6.3} \]

Successor-function: This function always takes a single input value and increments it by one. In this case the
enzyme ina is sufficient. With a binding preference of A, it simply inserts another A, which increases the
value represented by the strand by one:
\[ \mathrm{TAAAa} \xrightarrow{\text{ina}} \mathrm{TAAAaA} \tag{6.4} \]
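The behaviour of the single-strand enzymes lpy-cut and ina can be mimicked with a deliberately reduced model. The function `run` below is a hypothetical sketch, not the simulator from Appendix A: it supports only the lpy, cut and ina operations, ignores the second strand and copy-mode, and assumes the enzyme binds to the right-most base matching its preference, as in the examples shown.

```python
PYRIMIDINES = {"C", "T"}


def run(acids, strand, preference):
    """Apply amino-acids to one strand; return (kept strand, cut-off pieces)."""
    pos = strand.rfind(preference)      # assumption: bind to the right-most match
    if pos < 0:
        return strand, []               # nothing to bind to
    units = list(strand)
    cut_off = []
    for acid in acids:
        if acid == "lpy":
            # Search left for the nearest pyrimidine.
            pos -= 1
            while pos >= 0 and units[pos] not in PYRIMIDINES:
                pos -= 1
            if pos < 0:
                break                   # fell off the left end: enzyme stops
        elif acid == "cut":
            # Cut to the right of the bound unit; the enzyme stays on the left.
            if units[pos + 1:]:
                cut_off.append("".join(units[pos + 1:]))
            del units[pos + 1:]
        elif acid == "ina":
            # Insert an A to the right of the bound unit.
            units.insert(pos + 1, "A")
        else:
            raise ValueError("unsupported amino-acid: " + acid)
    return "".join(units), cut_off


print(run(["lpy", "cut"], "TAAA", "A"))         # ('T', ['AAA'])
print(run(["lpy", "cut"], "TAAAAATAAAA", "A"))  # ('TAAAAAT', ['AAAA'])
print(run(["ina"], "TAAAA", "A"))               # ('TAAAAA', [])
```

The second call shows why lpy-cut fails for several parameters: only the right-most number is reduced to zero, and the earlier parameter survives.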

Projection-functions: This is a family of functions such that $p^k_i()$ is the function that returns the $i$-th parameter
out of $k$ total parameters. An enzyme which performs this can be constructed based on the desired values
of $i$ and $k$. The operations required to do this can be broken down into two steps:
1. Remove parameters $n_{i+1}, \ldots, n_k$
2. Remove parameters $n_1, \ldots, n_{i-1}$
These two steps are most conveniently represented using two different enzymes, each of which has a
binding preference of T.
The enzyme corresponding to the first step is skipped when $k = i$. In the remaining situations (where $k > i$), an enzyme that produces the desired result can be constructed as off-(lpy-off-off-off)$^{k-i-1}$-mvl-cut-off-off-off.
While this may seem complicated, the off operations are present to ensure the enzyme
has the correct binding preference regardless of the value of $k - i - 1$. When this enzyme terminates, the
desired parameter $n_i$ is the right-most component of the strand. An example of this step of $p^3_2(2, 0, 3)$
operating is:
\[ \mathrm{TAATtAAA} \xrightarrow{\text{off-mvl-cut-off-off-off}} \mathrm{TAAt} \tag{6.5} \]

The second enzyme is similar to the second part of the zero-function and can be represented as
cop-swi-cut-del-swi-swi. The final two swi operations are present to ensure a binding preference of T; they never
get executed because the enzyme ceases to operate after the del operation, when it has nothing left to bind to.
Continuing from the previous example results in:
\[ \mathrm{TAAt} \xrightarrow{\text{cop-swi-cut-del-swi-swi}} \mathrm{T} \tag{6.6} \]

This is the parameter ($n_2 = 0$) that the projection function was supposed to return for this example. A
convenient way to represent a projection function, given that it consists of two enzymes, is by presenting a
strand that would translate to the required enzymes. For the projection function illustrated this would be
$p^3_2$ = CTCCACCTCTCTAACGATACAGATAT.
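The translation of this strand can be checked with a short sketch. The table below restates the typogenetic code used by the software in Appendix A (with the del operation written under its usual name), and the binding preference follows the original Hofstadter convention of ignoring the first and last amino-acids. The names `CODE`, `translate` and `binding` are simplified stand-ins for illustration, not the appendix implementation itself.

```python
# Base-pair -> (amino-acid, kink); kinks: straight = 0, left = -1, right = +1.
CODE = {
    "AA": (None, 0),   "AC": ("cut", 0),   "AG": ("del", 0),   "AT": ("swi", 1),
    "CA": ("mvr", 0),  "CC": ("mvl", 0),   "CG": ("cop", 1),   "CT": ("off", -1),
    "GA": ("ina", 0),  "GC": ("inc", 1),   "GG": ("ing", 1),   "GT": ("int", -1),
    "TA": ("rpy", 1),  "TC": ("rpu", -1),  "TG": ("lpy", -1),  "TT": ("lpu", -1),
}


def translate(strand):
    """Split a strand into enzymes, each a list of (amino-acid, kink) pairs."""
    enzymes, current = [], []
    for i in range(0, len(strand) - 1, 2):
        acid = CODE[strand[i:i + 2]]
        if acid[0] is None:             # AA is punctuation: start a new enzyme
            if current:
                enzymes.append(current)
            current = []
        else:
            current.append(acid)
    if current:
        enzymes.append(current)
    return enzymes


def binding(enzyme):
    """Binding preference, ignoring the first and last amino-acids."""
    turns = sum(kink for _, kink in enzyme[1:-1])
    return "AGTC"[turns % 4]


for enzyme in translate("CTCCACCTCTCTAACGATACAGATAT"):
    print(binding(enzyme), "-".join(name for name, _ in enzyme))
# T off-mvl-cut-off-off-off
# T cop-swi-cut-del-swi-swi
```

Both enzymes come out with a binding preference of T, which is the purpose of the otherwise inert trailing off and swi operations.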
Composition-function: It is not easy to show a method for composition of functions using typogenetics. This is
because the functions $h_1, \ldots, h_l$ must all be applied to the same initial data. This means that $l$ copies of the
input parameters to the composition function are needed, and that each function would need to know how
to operate on its own data and not on that intended for other functions.
Conceivably it might be possible to generate $l$ separate strands that are copies of the input parameters and
have each enzyme (function) operate on its own strand. However the results would then need to be reassembled into one strand before $g$ could operate. Unfortunately typogenetics provides no mechanism for
joining or concatenating strands in this manner.
An alternative approach would be to generate $l$ copies of the input strand, all concatenated together, and apply
$h_l$ first, then $h_{l-1}$, and so on. However generating copies of the input strand $s$ of the form $s^l$ is difficult. This process
is an extension of one of the methods by which replication can occur, as described by Morris[11, pages 381-382].
Unfortunately there are no known enzymes which can perform such an operation, as discussed in
Section 3.3 on page 15. Even assuming such an input strand could be constructed, composition would
still be likely to fail because some functions, such as the zero-function defined above, would erase parts of the
strand intended as parameters for other functions.
Primitive-Recursion-function: It was not possible to find a method for representing the primitive recursion
function in typogenetics. Even if the recursive part of the function could be realised, there is no easy way to
represent the base case of $f(\bar{n}, 0) = g(\bar{n})$. This would require testing the final parameter; there is no known
method for performing such a test within typogenetics.
The closest one can come is to test for the presence (or absence) of a specific base. This can be achieved
through the binding preference of the enzyme, whereby the enzyme effectively tests for the presence of the
base and only operates if it is present. While one could test for the presence of a non-zero number using an
enzyme with a binding preference of A, the data representation would mean that the enzyme could bind
to the A in any parameter, not just the one being tested. An alternative approach would be to have an
enzyme with a binding preference of T and mvr as the first operation. This would cause the enzyme to
cease operating if the right-most parameter was zero, since the enzyme would move off the end of the
strand. Unfortunately this corresponds to an "if non-zero" test where an "if zero" test is needed.
While the initial functions could be represented within the typogenetics system, the machine representation
used did not allow for composition or primitive-recursion.

6.3 Turing Completeness


A system that is primitive recursive can be extended to be $\mu$-recursive if unbounded minimisation is
present[8, page 249]. Furthermore, the $\mu$-recursive functions are exactly the functions that can be calculated using a Turing machine[8,
page 252]. This means that if one can represent the $\mu$-recursive functions with a system, the system is also
Turing-complete. Unfortunately a representation for the primitive recursive functions (and thus also the
$\mu$-recursive functions) could not be found within typogenetics.
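For reference, unbounded minimisation can be sketched in Python as follows. The name `minimise` is hypothetical; the operator returns the least $m$ for which the given function evaluates to zero, and unlike primitive recursion it has no bound on the number of steps, so it need not terminate.

```python
def minimise(f):
    """Return mu(n) = the least m such that f(n, m) == 0 (may loop forever)."""
    def mu(*ns):
        m = 0
        while f(*(ns + (m,))) != 0:
            m += 1
        return m
    return mu


# Example: the least m with m * m >= n, expressed as a zero test.
def not_yet_square(n, m):
    return 0 if m * m >= n else 1


ceil_sqrt = minimise(not_yet_square)
print(ceil_sqrt(10))    # 4
```

The loop condition is an "if zero" test on a computed value, which is exactly the kind of test that could not be realised with the enzymes described in Section 6.2.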
Despite this, typogenetics exhibits many characteristics that are similar to Turing machines. For example,
the notion of the enzyme binding to a specific unit in a strand and operating on that unit is similar to the
movement of the tape-head along the tape within a Turing machine. Another similarity is found in the
difficulty of predicting the result of self-application, as mentioned in Section 5.2.1. At face value this is
similar to the halting problem with Turing machines: it is not possible, in general, to determine the outcome of a
Turing machine without actually executing it.
Despite these apparent similarities, however, a representation for Turing machines in typogenetics is not
readily apparent. In particular, the method by which enzymes operate is not as powerful as the method by
which Turing machines operate. This is because an enzyme is effectively always in the same state (the same
amino-acids will operate every time) and because there is no way to choose different operations depending
on the base where the enzyme is currently bound.

6.4 Other Equivalences


Other approaches were attempted to show Turing-completeness during the course of this project. These
were aimed at showing a way for representing other systems using typogenetics that were themselves
known to be Turing-complete.

6.4.1 Post's Tag System


Initially it was thought the symbol manipulation systems of Post could provide a method for showing
Turing-completeness within typogenetics. Post's systems involve representing theorems using strings and
combining them (using productions) to form new theorems. This system has been shown to be Turing-complete[9, chapters 12-14].
However it was not possible to represent this system in typogenetics since enzymes have no mechanism for
operating on multiple strands at once. For an enzyme to represent a production operating on two theorems,
both theorems must be encoded within a single strand. At this point it was not possible to devise a general
scheme by which the enzyme could differentiate between the theorems encoded by a strand. This is a form
of the bracket-matching problem described in Section 6.4.3.

6.4.2 Lambda Calculus


The system of lambda calculus is another system that is known to be Turing-complete[15]. Deriving a
representation for lambda calculus using typogenetics would also tie in with the previous stages of this
project since there are known methods for self-representation in the lambda calculus[6, pages 262-268].
A canonical form for the lambda calculus is the SK system, whereby all functions within the lambda calculus can be represented using two functions:
\begin{align}
KPQ &\to P \tag{6.7} \\
SPQR &\to PR(QR) \tag{6.8}
\end{align}

For the K function to operate, it must be able to operate on a single strand and discard the Q component.
This is similar to the initial function $p^k_i$ described in Section 6.2. Unlike with primitive recursion, however,
the parameters P and Q aren't simple numbers; rather, they can be recursive expressions. Due to this
recursive nature of the parameters, there is no easy representation for the parameters that allows
an enzyme to extract just a single parameter. This is also a form of the bracket-matching problem
described in the next section.
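The two reduction rules can be sketched in Python using nested tuples, where an application of `f` to `x` is the pair `(f, x)`. The helpers `app`, `step` and `normalise` are hypothetical names for this illustration, and the leftmost reduction order is an assumption of the sketch.

```python
def app(*terms):
    """Left-associated application: app(a, b, c) builds ((a b) c)."""
    result = terms[0]
    for term in terms[1:]:
        result = (result, term)
    return result


def step(term):
    """Perform one leftmost reduction; return (new_term, changed)."""
    if not isinstance(term, tuple):
        return term, False
    f, x = term
    if isinstance(f, tuple) and f[0] == "K":
        return f[1], True                               # K P Q -> P
    if isinstance(f, tuple) and isinstance(f[0], tuple) and f[0][0] == "S":
        p, q, r = f[0][1], f[1], x
        return ((p, r), (q, r)), True                   # S P Q R -> P R (Q R)
    new_f, changed = step(f)
    if changed:
        return (new_f, x), True
    new_x, changed = step(x)
    return (f, new_x), changed


def normalise(term, limit=1000):
    """Reduce until no rule applies, giving up after `limit` steps."""
    for _ in range(limit):
        term, changed = step(term)
        if not changed:
            return term
    raise RuntimeError("no normal form found within the step limit")


print(normalise(app("S", "K", "K", "x")))                       # x
print(normalise(app("K", app("S", "a", "b"), app("K", "c"))))   # (('S', 'a'), 'b')
```

In the second example the discarded parameter Q is itself an expression. The tuple nesting records where each parameter starts and stops, which is the information a flat strand does not provide to an enzyme.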

6.4.3 Bracket-Matching
Many of the data representations within typogenetics were inappropriate for computation due to the problem of delimiting where parameters start and stop within a strand. For the initial functions this problem
was avoided because all parameters were numbers. However the parameters in both the Post and SK
systems are not simple numbers but can be recursive in nature. For example, the Q parameter for the K
function can be any expression within the lambda calculus, which in turn has its own parameters.
The problem of representing parameters in this situation is equivalent to matching or counting brackets
(where brackets refers to delimiters that mark the start and end of a parameter). This problem is best
illustrated by considering how one might encode a series of parameters of the form (a, (b, (c, d)), e) where
the variables a to e are the representations for the parameters. Using C and G as delimiters, this would
translate to CaCbCcdGGeG. Assuming an enzyme is operating on a strand of this form, it has no way of
determining how deep within this structure it is. To do this would require counting or matching the C and
G bases. This isn't possible within typogenetics since enzymes have no state that allows them to record
this.
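The counting that an enzyme would need to perform can be sketched in Python. The function `top_level_parameters` is a hypothetical illustration; it treats the lower-case letters a to e as opaque placeholder units, as in the example above, and uses a depth counter to find where each top-level parameter ends.

```python
def top_level_parameters(strand):
    """Split a strand of the form C...G into its top-level parameters."""
    if len(strand) < 2 or strand[0] != "C" or strand[-1] != "G":
        raise ValueError("expected a strand of the form C...G")
    body = strand[1:-1]
    params, depth, start = [], 0, 0
    for i, unit in enumerate(body):
        if unit == "C":
            if depth == 0:
                start = i               # a nested parameter opens here
            depth += 1
        elif unit == "G":
            depth -= 1
            if depth < 0:
                raise ValueError("unbalanced delimiters")
            if depth == 0:
                params.append(body[start:i + 1])
        elif depth == 0:
            params.append(unit)         # a simple parameter at the top level
    if depth != 0:
        raise ValueError("unbalanced delimiters")
    return params


print(top_level_parameters("CaCbCcdGGeG"))   # ['a', 'CbCcdGG', 'e']
```

The variable `depth` is the state in question: its value is unbounded and must persist from one unit to the next, whereas an enzyme carries no such record as it moves along the strand.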
In language terms, this form of bracket-matching can be represented using a context-free grammar. The
minimum machine needed to process such a grammar is a pushdown automaton[8, page 110], a class of
machine that is not as powerful as the Turing machine. Hence the difficulties in decoding parameters of this
form seem to indicate that typogenetics cannot represent pushdown automata, let alone Turing machines.


Chapter 7

Conclusion
7.1 Summary of Results
During this project a comprehensive specification for typogenetics was developed. This specification was
based on the original by Hofstadter[3, pages 405-513] as well as the interpretations and modifications presented in other studies[10, 11, 16, 17]. The resulting specification adhered to the spirit in which
typogenetics was originally proposed and was as self-consistent as possible.
With the specification of the system complete, definitions were presented for various kinds of behaviour
that strands can exhibit during repeated self-application. This allowed the strands sought by Hofstadter's
puzzle to be labelled self-replicators.
Software was also developed to simulate the typogenetics system, as well as to systematically explore strand-space looking for self-replicators. Due to the difficulty in using a heuristic to guide this search, the systematic exploration did not result in any self-replicating strands being identified. However the process did
identify many non-trivial self-perpetuating strands.
Despite the developed software not being able to find any self-replicators, techniques were described that
allowed an infinite number of such strands to be constructed, thus satisfying Hofstadter's original puzzle.
Like devising a heuristic to guide the search for self-replicators, showing the computational strength of
typogenetics also proved to be quite difficult. While it was possible to develop several of the functions
required for primitive recursion, not all of the components could be implemented. Similar attempts to show
Turing-completeness using other formal systems also failed. However this does not necessarily mean that
typogenetics is not Turing-complete. Rather, it leaves open the question of whether there exist alternative
representations to those considered that do allow for more powerful computation. Indeed it is not clear
whether attempts to show Turing-completeness failed due to inappropriate machine representation or due
to typogenetics simply not being powerful enough.

7.2 Future Directions


More efficient methods for traversing the graph structures in Section 5.2.2 would have enabled a larger
graph structure to be searched for self-replicating strands. It is possible that an algorithm for deadlock
prevention in distributed operating systems[1] could be adapted to allow for more efficient cycle-detection.
An ideal algorithm would also be incremental and allow for detection of new cycles as they are formed
when vertices are inserted. It may also be possible to devise heuristics that allow for the graph-generation
to be biased towards including strands that are more likely to be part of cycle structures.

Future study of computation using typogenetics should consider alternative machine-representations to
those used in this project. In particular, representations that consider several generations of enzyme application need to be examined. It is possible that this would allow for more powerful operations to be carried
out.


References
[1] Israel Cidon. An efficient distributed knot detection algorithm. IEEE Transactions on Software Engineering, 15(5):644-649, May 1989.
[2] Thomas H. Cormen, Charles E. Leiserson, and Ronald L. Rivest. Introduction to Algorithms. MIT Press,
Cambridge, Massachusetts, 1990.
[3] Douglas R. Hofstadter. Gödel, Escher, Bach: An Eternal Golden Braid. Basic Books, Inc., New York, 1979.
[4] Kai-Mikael Jää-Aro. Review of Chapters XVI and XVII with Dialogues in Gödel, Escher, Bach: An
Eternal Golden Braid. http://www.nada.kth.se/~kai/lectures/geb.html, February 1996.
[5] Donald E. Knuth. Fundamental Algorithms, volume 1 of The Art of Computer Programming. Addison-Wesley, Reading, Massachusetts, third edition, 1997.
[6] Dexter C. Kozen. Automata and Computability. Springer, New York, 1997.
[7] Benjamin Lewin. Genes IV. Cell Press, Cambridge, Massachusetts, fourth edition, 1990.
[8] Harry R. Lewis and Christos H. Papadimitriou. Elements of the Theory of Computation. Software Series.
Prentice-Hall, Englewood Cliffs, New Jersey, 1981.
[9] Marvin Lee Minsky. Computation: Finite and Infinite Machines. Series in Automatic Computation.
Prentice-Hall, Englewood Cliffs, New Jersey, 1967.
[10] Harold C. Morris. Typogenetics: A Logic of Artificial Propagating Entities. Ph.D. dissertation, University
of British Columbia, Vancouver, British Columbia, 1988.
[11] Harold C. Morris. Typogenetics: A logic for artificial life. In Christopher G. Langton, editor, Artificial
Life, pages 369-395, Redwood City, California, 1989. Addison-Wesley.
[12] Judea Pearl. Heuristics: Intelligent Search Strategies for Computer Problem Solving, chapter 1, page 16.
Addison-Wesley, 1984.
[13] Robert Sedgewick. Algorithms in C. Addison-Wesley, December 1990.
[14] Abraham Silberschatz and Peter B. Galvin. Operating System Concepts. Addison-Wesley, Reading, Massachusetts, fourth edition, 1994.
[15] Alan M. Turing. On computable numbers, with an application to the Entscheidungsproblem. Proceedings of the London Mathematical Society, 42:230-265, 1936.
[16] Louis Varetto. Typogenetics: An artificial genetic system. Journal of Theoretical Biology, 160:185-205,
1993.
[17] Louis Varetto. Studying artificial life with a molecular automaton. Journal of Theoretical Biology, 193:257-285, 1998.


Appendix A

Source Code
This appendix lists some of the source-code developed for this project.

A.1 Typogenetics.py
#!/usr/bin/env python
"""
This module exports routines useful in modelling the Typogenetics
system, as put forward by Hofstadter in his book "Gödel, Escher,
Bach".

Classes exported:
    Strand           -- models a DNA strand
    StrandFactory    -- enumerates all possible Strands
    Enzyme           -- models an enzyme
    Applicator       -- provides contexts for applying enzymes to strands
    PrintApplicator  -- noisy specialization of Applicator

Methods exported:
    translate        -- translates a DNA strand into an enzyme
    mogrify          -- applies an enzyme to a strand, returning results
"""

import array, types
import sys, string, time
import bisect
from pqueue import PQueue

# Constants to represent the available bases
A = 0           # was ord("A")
C = 1           # was ord("C")
G = 2           # was ord("G")
T = 3           # was ord("T")
BLANK = 32      # was ord(" ")

class Strand:
    "Concrete strand that models a DNA strand."

    # Mapping table from real letters to bases
    basemap = {'A':A, 'C':C, 'G':G, 'T':T,
               'a':A, 'c':C, 'g':G, 't':T}
    # Mapping for next base
    nextbase = [C,G,T,A]
    # Used to translate a strand into a human-readable string
    transbase = "ACGT"
    transbase = transbase + '?'*(256-len(transbase))

    def __init__(self,init=""):
        "Initializes the value of the strand."
        # We're using an array of bytes to represent a strand
        #print "type(self):",type(self)
        #print "type(init):",type(init)
        if type(init) == type(self):
            # Copy constructor
            self.array = array.array('b',init.array.tolist())
        elif type(init) == types.ListType or \
             type(init) == types.TupleType or \
             type(init) == types.SliceType:
            # Sequence constructor
            self.array = array.array('b',list(init))
        elif type(init) == array.ArrayType:
            # Array constructor
            self.array = array.array('b',init.tolist())
        elif type(init) == types.StringType:
            # String constructor
            self.array = array.array('b')
            for char in init:
                self.array.append(self.basemap[char])
        else:
            raise TypeError, \
                  'argument 1: expected array or sequence or string, %s found' %\
                  type(init).__name__
        # This is a cache of the positions of each base in the strand
        self._basepositions = None
        return

    def __repr__(self):
        "String representation of strand. This is as fast as I can make it."
        return "Strand(%s)" % str(self)

    def __str__(self):
        "Informal string representation of the strand."
        return string.translate(self.array.tostring(),self.transbase)

    def __len__(self):
        "Return the length of our strand, when necessary."
        return len(self.array)

    def __cmp__(self,other):
        "Compares the strand to something else."
        if type(self) == type(other):
            # Strand comparison
            return cmp(self.array,other.array)
        else:
            # Try to convert to a strand then compare
            try:
                other = Strand(other)
            except TypeError:
                # Meaningful conversion not possible. Doh.
                return -1
            # With the conversion done, recursively compare
            return cmp(self,other)
        # Should never reach here
        return

    def __getitem__(self,x):
        "Return a given unit in the strand."
        return self.array[x]

    def __setitem__(self,x,value):
        "Set (and return) a given unit in the strand."
        self.array[x] = value
        self._basepositions = None      # Throw out the base-position cache
        return

    def __delitem__(self,x):
        "Delete a given unit from the strand."
        del self.array[x]
        self._basepositions = None      # Throw out the base-position cache
        return

    def __getslice__(self,i,j):
        "Return a slice of the strand."
        return Strand(self.array[i:j].tolist())

    def __setslice__(self,i,j,sequence):
        "Sets a slice of the strand."
        self.array[i:j] = sequence
        self._basepositions = None      # Throw out the base-position cache
        return None

    def __delslice__(self,i,j):
        "Deletes a slice of the strand."
        del self.array[i:j]
        self._basepositions = None      # Throw out the base-position cache
        return

    def insert(self,index,base):
        "Inserts the given base at the given index."
        self._basepositions = None      # Throw out the base-position cache
        return self.array.insert(index,base)

    def reverse(self):
        "Reverses the strand."
        self._basepositions = None      # Throw out the base-position cache
        return self.array.reverse()

    def increment(self):
        "Increment the strand to the next position."
        unit = len(self.array)
        nocarry = 0                     # To force initial loop eval
        while not nocarry and unit:     # While there's carry and not at start
            unit = unit - 1
            self.array[unit] = nocarry = self.nextbase[self.array[unit]]
        else:
            if not nocarry:
                self.array.insert(0, A)
        self._basepositions = None      # Throw out base-position cache
        return

    def basepositions(self):
        "Calculate the positions of each base in the strand."
        # Generate a list of the positions for each base
        if not self._basepositions:
            positions = [[],[],[],[],[]]    # List of positions of each base
            for i in range(len(self)):
                try:
                    positions[self[i]].append(i)
                except IndexError:
                    # We hit a gap, remap to [4]
                    positions[4].append(i)
            self._basepositions = positions
        return self._basepositions

class StrandFactory:
    "Class that returns all strands one-by-one."

    def __init__(self,start=Strand('A')):
        "Initialize our enumeration state."
        self.strand = start
        self.pqueue = PQueue()
        return

    def pop(self):
        "Return a priority/strand tuple."
        if len(self.strand)*10 < self.pqueue.peek():
            retval = (len(self.strand), Strand(self.strand))
            self.strand.increment()
        else:
            retval = self.pqueue.pop()
        return retval

    def __getitem__(self,key):
        return self.pqueue[key]

    def __setitem__(self,key,value):
        self.pqueue[key] = value

class Enzyme:
    "Concrete class that models an enzyme."

    # This is the Typogenetic Code, which maps base-pairs to both
    # an operation and a kink direction for the secondary structure.

    # Direction constants, for readability only:
    S=0; L=-1; R=1

    # Enzyme hints, which can be ORed together.
    NOC = 0x00      # Instruction does not change strand
    NEW = 0x01      # Instruction will cause a new strand
    MOD = 0x02      # Instruction will modify strands

    # This is the code itself
    TypogeneticCode = \
    [[(None,  S,NOC), ('cut', S,NEW), ('nuk', S,MOD|NEW), ('swi', R,NOC)],
     [('mvr', S,NOC), ('mvl', S,NOC), ('cop', R,NEW    ), ('off', L,NOC)],
     [('ina', S,MOD), ('inc', R,MOD), ('ing', R,MOD    ), ('int', L,MOD)],
     [('rpy', R,NOC), ('rpu', L,NOC), ('lpy', L,NOC    ), ('lpu', L,NOC)]]

    # Reverse mapping of the typogenetic code, alas sometimes needed.
    # Generate this dynamically, for maintenance purposes
    TypogeneticOdec = {}
    for x1 in xrange(len(TypogeneticCode)):
        for x2 in xrange(len(TypogeneticCode[x1])):
            TypogeneticOdec[TypogeneticCode[x1][x2][0]] = (x1,x2)

    # And this maps which direction binds to which base preferentially
    BindPref = [A,G,T,C]

    def __init__(self,acids=[],quirk=0):
        """
        Initialize the enzyme. The quirk flag determines if we use the
        Morris/Varetto convention for determining binding preference
        instead of the original Hofstadter method.
        """
        # Initialize the data members
        self.acids=[]
        self.dirs=[]
        self.__bindcache = None
        self.hint=0
        self.quirk=quirk
        # Add the initializer amino-acids
        for acid in acids:
            self.addAminoAcid(self.TypogeneticOdec[acid])
        return

    def __repr__(self):
        "Returns a string representation of the enzyme."
        return "Enzyme(%s)" % repr(self.acids)

    def __str__(self):
        "Returns a pretty representation of the enzyme."
        return "%s: %s" % (Strand.transbase[self.binding()],
                           string.join(self.acids))

    def __cmp__(self,other):
        "Compares two enzymes for equality."
        return cmp(self.acids,other.acids)

    def addAminoAcid(self,basepair):
        "Adds new amino-acid to our enzyme. The basepair is a tuple of bases."
        # Retrieve the amino-acid and stuff
        (acid,direction,hint) = \
            self.TypogeneticCode[basepair[0]][basepair[1]]
        if not acid:
            # Return false to indicate punctuation
            return 0
        # Add the amino-acid, invalidate the binding cache, update the hint
        self.acids.append(acid)
        self.dirs.append(direction)
        self.__bindcache = None
        self.hint = self.hint | hint
        return 1

    def binding(self):
        "Returns the binding of the enzyme."
        if (self.__bindcache is None) or (self.__bindcache[0] != self.quirk):
            # The Varetto/Morris/quirky method uses the full amino-acid
            # list... the real way ignores the first and last amino-acids.
            if self.quirk:
                directions = self.dirs
            else:
                directions = self.dirs[1:-1]
            # This just sums the series
            binding = (reduce(lambda x,y:x+y,directions,0) % 4)
            self.__bindcache = (self.quirk, self.BindPref[binding])
        return self.__bindcache[1]

def translate(strand,quirk=0):
    """
    Translates a strand into a list of enzymes.
    The optional flag determines whether we produce quirky Enzymes or not.
    """
    enzymes = [Enzyme(quirk=quirk)]
    for x in xrange(0,len(strand)-1,2):
        # For each base-pair...
        basepair = (strand[x],strand[x+1])
        # Add the basepair, and create a new enzyme if it was punctuation
        if not enzymes[-1].addAminoAcid(basepair):
            enzymes.append(Enzyme(quirk=quirk))
    # Remove all empty enzymes and return the rest
    return filter(lambda e: e.acids,enzymes)

class Applicator:
    "Used to model the application of amino-acids to a DNA strand."

    # Mapping for complementary bases -- probably the wrong place
    # to have these things, OO-speaking.
    compbase = [T,G,C,A]
    purines = [1,0,1,0]         # Could use odd() but this is faster

    # Map for showing where we're bound (using the Morris convention)
    lowerbase = {'A':'a', 'C':'c', 'G':'g', 'T':'t', '?':'.'}

    def __init__(self,strand):
        """
        Initializes the application context, with a default starting
        position of 0.
        """
        # The two strands which will be manipulated (a copy of the
        # strand is made so that we won't modify it)
        self.strands = [strand[:],Strand([32]*len(strand))]
        # Reset our state
        self.reset()
        return

    def __str__(self):
        "Returns a convenient displayable form of the applicator state."
        strands = map(str,self.strands)
        # Use the Morris convention and make the bound base
        # lower-case (if possible)
        try:
            strands[self.bound] = strands[self.bound][:self.pos] + \
                self.lowerbase[strands[self.bound][self.pos]]+\
                strands[self.bound][self.pos+1:]
        except IndexError: pass
        # Flip them so that they're displayed in the right order
        strands.reverse()
        if self.copying:
            strands.append('[Copying: On]')
        else:
            strands.append('[Copying: Off]')
        return string.join(strands,'\n')

    def reset(self,position=0):
        "Reset the applicator state, prior to an enzyme being bound."
        # Applicator state:
        #   self.done    - flag that's set when enzyme should cease
        #   self.bound   - which strand we're bound to
        #   self.copying - flag for if we're in copy mode
        #   self.pos     - position we're bound to
        self.done = self.bound = self.copying = 0
        self.pos = position
        return

    def extractstrands(self):
        "Returns a list of strands represented by the current state."
        strands = []
        # These ones don't need to be flipped
        for substrand in string.split(self.strands[0].array.tostring()):
            # Simply convert to a Strand object, and add to list
            s = Strand()
            s.array.fromstring(substrand)
            strands.append(s)
        # These ones do need to be flipped
        for substrand in string.split(self.strands[1].array.tostring()):
            s = Strand()
            s.array.fromstring(substrand)
            s.reverse()
            strands.append(s)
        return strands

    def apply(self,enzyme,bindpos=-1):
        "Binds the enzyme to the first strand and applies each amino-acid."
        # Save the current strand we're bound to
        bound,position = self.bound, self.pos
        # Reset the applicator state
        self.reset()
        # Fix up bindpos for if on the top strand
        if bound:
            bindpos = -bindpos - 1
        # Find the locations of all the bases
        bp = self.strands[bound].basepositions()
        candidates = bp[enzyme.binding()]
        # This trickery restricts the bases we can match to the ones
        # in the strand we were last bound to.
        hi = bisect.bisect(bp[4],position)
        try:
            if hi:
                lo = bp[4][hi-1]
            else:
                lo = -1
            hi = bp[4][hi]
            candidates = filter(lambda y,x=lo,z=hi:x<y<z, candidates)
        except IndexError:
            # Either no gaps, or end gap
            if hi:
                lo = bp[4][hi-1]
                hi = len(self.strands[0])
                candidates = filter(lambda y,x=lo,z=hi:x<y<z, candidates)
        # Bind the enzyme to the correct location
        self.bound = bound
        try:
            self.moveto(candidates[bindpos])
        except IndexError:
            # Nothing to bind to, exit straight away
            return
        # Apply each amino-acid in the enzyme
        for acid in enzyme.acids:
            self.execute(acid)
            # Exit loop if we are done prematurely for some reason
            if self.done:
                break
        return

    def moveto(self,position,checkdouble=0):
        "Moves the current position we'll act on."
        self.pos = position
        self.done = not 0 <= position < len(self.strands[0]) or \
                    self.isblank(checkdouble)
        if self.copying and not self.done:
            self.compcopyunit()
        return

    def execute(self,op):
        "Causes the given operation to be executed if possible."
        if not self.done and hasattr(self,op):
            apply(getattr(self,op),())
        return

    def compcopyunit(self):
        "Performs a complementary copy of the current unit."
        # Note: we're assuming bases are always lined up. One of the
        # two strands can contain a blank (doesn't matter which) but
        # not both.
        if self.strands[0][self.pos] == BLANK:
            self.strands[0][self.pos] = \
                self.compbase[self.strands[1][self.pos]]
        else:
            self.strands[1][self.pos] = \
                self.compbase[self.strands[0][self.pos]]
        return

    def insert(self,base):
        "Inserts a base to the right of the bound unit."
        # position to insert the character at
        inspos = self.pos + (not self.bound)
        # Do the insert for the string we're bound to
        self.strands[self.bound].insert(inspos,base)
        # Update the unit we're bound to, as is appropriate
        self.moveto(self.pos + self.bound)
        # Work out whether to insert the complementary base or a blank
        # and do it.
        if self.copying:
            self.strands[not self.bound].insert(inspos,self.compbase[base])
        else:
            self.strands[not self.bound].insert(inspos,BLANK)
        return

    def purine(self):
        "Returns true if we're bound to a purine."
        try:
            return self.purines[self.strands[self.bound][self.pos]]
        except IndexError:
            return 0

    def pyrimidine(self):
        "Returns true if we're bound to a pyrimidine."
        try:
            return not self.purines[self.strands[self.bound][self.pos]]
        except IndexError:
            return 0

    def isblank(self,checkdouble=0):
        "Returns true if we're bound to a blank space (ie, nothing at all)."
        # Note: optimized in light of short-circuit analysis for the
        # default case of checkdouble=0.
        return (self.strands[self.bound][self.pos] == BLANK) and \
               ((not checkdouble or \
                 self.strands[not self.bound][self.pos] == BLANK))

    def search(self,mover,predicate):
        """
        Searches for a unit of the given type, moving using the supplied
        function, and terminating when predicate returns true.
        """
        # Ugh this is ugly. Basically the problem is that
        # complementary stuff is done when we move, but the
        # predicate needs to be checked after the move but before
        # the complementary copy is done. Kludge it by saving the copy
        # state, turning it off, and doing it manually at the right
        # time if need be. The only obvious alternate solution would
        # prevent us from searching across blanks.
        copysave = self.copying
        self.copying = 0
        found = 0
        while not found:
            mover()
            if self.done: break
            found = predicate()
            if copysave:
                self.compcopyunit()
        # Restore the original copying state
        self.copying = copysave
        return

def cut(self):
"""
Cut both strands to the right of the current position, leaving
us attached to the left-hand side.
"""
# We can simulate a cut by inserting a double-blank (which we
# cant ever move past). This is very similar to the above
# insert() method.
# Position to cut the strands at
# (Note: This works due to the way slicing boundaries work)
pos = self.pos + (not self.bound)
# Insert the blanks
self.strands[0].insert(pos,BLANK)
self.strands[1].insert(pos,BLANK)
# Update the unit were bound to, as is appropriate
self.moveto(self.pos + self.bound)
return
    # Since del is a reserved word, we use an allowed name here. Further
    # down below we force "del" into the class manually -- it's a bit
    # dodgy but it works
    def nuk(self):
        "Deletes the unit at the current position."
        # Proper behaviour of "del" isn't known. It's only meant to
        # delete the bound base, not the complementary one as
        # well. It's not clear if the base should be removed (and
        # everything slid along to fill the gap) or replaced by a
        # gap. Replacement by a gap keeps complementary bases lined up
        # always, so we'll do that.
        self.strands[self.bound][self.pos] = BLANK
        # Then move to the right
        self.mvr()
        return
    def swi(self):
        "Switches the enzyme to be bound to the other strand."
        self.bound = not self.bound
        # Swap the mvr and mvl functions
        (self.mvr,self.mvl) = (self.mvl,self.mvr)
        # Blanks block a switch
        if self.isblank():
            self.done = 1
        return
    def mvr(self,checkdouble=0):
        "Moves the enzyme one unit to the right."
        self.moveto(self.pos+1,checkdouble)
        return
    def mvl(self,checkdouble=0):
        "Moves the enzyme one unit to the left."
        self.moveto(self.pos-1,checkdouble)
        return
    def cop(self):
        "Turns on strand-copying mode."
        self.copying = 1
        self.compcopyunit()
        return
    def off(self):
        "Turns off strand-copying mode."
        self.copying = 0
        return
    def ina(self):
        "Inserts an A into the strand."
        self.insert(A)
        return
    def inc(self):
        "Inserts a C into the strand."
        self.insert(C)
        return
    def ing(self):
        "Inserts a G into the strand."
        self.insert(G)
        return
    def int(self):
        "Inserts a T into the strand."
        self.insert(T)
        return
    def rpy(self):
        "Search for the nearest pyrimidine to the right."
        self.search(lambda f=self.mvr: f(1),self.pyrimidine)
        return
    def rpu(self):
        "Search for the nearest purine to the right."
        self.search(lambda f=self.mvr: f(1),self.purine)
        return
    def lpy(self):
        "Search for the nearest pyrimidine to the left."
        self.search(lambda f=self.mvl: f(1),self.pyrimidine)
        return
    def lpu(self):
        "Search for the nearest purine to the left."
        self.search(lambda f=self.mvl: f(1),self.purine)
        return
# We can't do a "def del(self):" since del is a reserved word. This
# is a kludge to force del into the Applicator class method table.
# ie, Applicator.del is an alias for Applicator.nuk
Applicator.__dict__['del'] = Applicator.__dict__['nuk']
class PrintApplicator(Applicator):
    """
    This is a subclass of the existing Applicator. It overrides some
    methods to provide for debugging.
    """
    def __init__(self,strand,file=sys.stdout):
        """
        Initializes the Applicator. Identical to previous constructor,
        but we also take a file-like object that we print things to.
        """
        # Call the inherited constructor
        Applicator.__init__(self,strand)
        # Store the file object for later use.
        self._file = file
        return
    # Overridden methods are below. These simply call the inherited
    # methods, but also report status stuff out to the file.
    def execute(self,op):
        # Report the operator we're executing, and the state afterwards
        self._file.write("Executing operation: %s\n" % str(op))
        Applicator.execute(self,op)
        self._file.write(str(self)+'\n')
        return
    def apply(self,enzyme,bindpos=-1):
        # Report the applicator state before starting the enzyme
        self._file.write('--SOE--\n')
        Applicator.apply(self,enzyme,bindpos)
        self._file.write('--EOE--\n')
        return
    def moveto(self,position,checkdouble=0):
        # Report the applicator state after each move
        Applicator.moveto(self,position,checkdouble)
        self._file.write(str(self)+'\n')
        return
def mogrify(strand, enzymes, binding=-1, debug=0):
    "Applies a series of enzymes to a strand, and returns the results."
    if debug:
        a = PrintApplicator(strand)
    else:
        a = Applicator(strand)
    for enzyme in enzymes:
        a.apply(enzyme,binding)
    # Return the resulting strands
    return a.extractstrands()
# Self-test code, for when we're executed standalone instead of imported.
if __name__ == "__main__":
    for strand in ['TAGATCCAGTCCACATCGA', 'CGGATACTAAACCGA']:
        print strand, 'translates to:', translate(Strand(strand))
    # Test our strand-space generator
    strand = Strand()
    startc = time.clock()
    startt = time.time()
    try:
        for x in xrange(1000):
            print strand, translate(strand)
            strand.increment()
    except KeyboardInterrupt: pass
    print " CPU time elapsed:", time.clock()-startc, "seconds"
    print "Real time elapsed:", time.time()-startt, "seconds"
# End.
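
The aliasing of del onto nuk at the end of the Applicator class can be tried in isolation. The following is a minimal sketch, written for a current Python 3 interpreter rather than the interpreter the listing above targets. The class Toy, its attributes, and the BLANK value used here are hypothetical stand-ins and not part of the thesis code. On Python 3 a class __dict__ is a read-only mapping proxy, so the sketch uses setattr where the listing assigns into Applicator.__dict__ directly.

```python
BLANK = " "  # hypothetical blank marker, standing in for the listing's BLANK


class Toy:
    """Hypothetical single-strand stand-in for Applicator."""

    def __init__(self, strand):
        self.strand = list(strand)
        self.pos = 0

    def nuk(self):
        "Blank out the bound unit and step right, mirroring nuk() above."
        self.strand[self.pos] = BLANK
        self.pos += 1


# 'def del(self):' is a syntax error, but attribute names are ordinary
# strings, so the alias can be installed and looked up by name.
setattr(Toy, "del", Toy.nuk)

toy = Toy("ACGT")
getattr(toy, "del")()  # call the operation through its reserved-word name
result = "".join(toy.strand)
print(repr(result))
```

Because toy.del is not valid syntax either, callers have to go through getattr, which suits code that selects an operation by its name at run time.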

A.2 pqueuemodule.c
The following listing is most of an extension module that implements an efficient priority queue using
Fibonacci heaps [2, pages 420-439]. This module has been released in its entirety and is available at
ftp://ftp.python.org/pub/python/contrib/DataStructures/PQueue-0.1a.tar.gz.
/*
* pqueue - A priority-queue extension for Python using Fibonacci heaps.
* Copyright (C) 1999 Andrew Snare <ajs@cs.monash.edu.au>
*
* This library is free software; you can redistribute it and/or
* modify it under the terms of the GNU Library General Public
* License as published by the Free Software Foundation; either
* version 2 of the License, or (at your option) any later version.
*
* This library is distributed in the hope that it will be useful,
* but WITHOUT ANY WARRANTY; without even the implied warranty of
* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
* Library General Public License for more details.
*
* You should have received a copy of the GNU Library General Public
* License along with this library; if not, write to the
* Free Software Foundation, Inc., 59 Temple Place - Suite 330,
* Boston, MA 02111-1307, USA.
*/
/* Note to developers :
 *    Fibonacci heaps are nasty. Do _NOT_ attempt to debug this code
 *    unless you really really know what you're doing and have a copy
 *    of "Introduction to Algorithms" (Thomas Cormen et al) by your side.
 */

#include "pqueuemodule-config.h"
#include <signal.h>
#include <assert.h>
#include <Python.h>
#define MAXDEG  (64)    /* Maximum degree of any node */
/* NOTE: This allows us to handle ~ 2.3723E13 nodes */
#undef  DEBUG           /* Turns on debugging stuff */

struct heapnodeRec {
    struct heapnodeRec *p;      /* Pointer to the parent */
    struct heapnodeRec *child;  /* Pointer to any of the children */
    struct heapnodeRec *left;   /* Pointer to the left sibling */
    struct heapnodeRec *right;  /* Pointer to the right sibling */
    int degree;                 /* How many children we have */
    int mark;                   /* Lost child since last made child */
    PyObject *priority;         /* Priority associated with node */
    PyObject *data;             /* Data associated with the node */
};
typedef struct heapnodeRec heapnode;

struct heapnodetrampRec {
    heapnode *ptr;              /* Trampoline pointer */
    int refcount;               /* Reference counter */
};
typedef struct heapnodetrampRec heapnodetramp;
typedef struct {
    PyObject_HEAD
    heapnode *min;              /* Pointer to current minimum */
    int n;                      /* Number of nodes in the pq */
    PyObject *dict;             /* Dictionary of data */
} pqueueobject;

staticforward PyTypeObject PQueuetype;


/* PQueue -- some debugging routines */
#ifdef DEBUG
#define LONG(x) (PyInt_AS_LONG((x)->priority))
static void
display_children(child, level, parent)
heapnode *child;
int level;
heapnode *parent;
{
/* This is a debugging routine. It displays a child list recursively */
if (child) {
heapnode *w = child;
do {
PyObject* priority = PyObject_Repr(w->priority);
PyObject* data = PyObject_Repr(w->data);

printf("%*s%#x P(%#x) L(%#x) R(%#x) D(%d) M(%d) "
"(%s, %s) %#x",
level*4, "", w, w->p, w->left, w->right,
w->degree, w->mark,
PyString_AS_STRING(priority),
PyString_AS_STRING(data), w->child);
Py_DECREF(priority);
Py_DECREF(data);
if (w->child ) {
printf(" -> \n");
display_children(w->child, level+1, w);
} else
printf("\n");
#ifdef DEBUG
assert( w->p == parent );
#endif /* DEBUG */
w = w->right;
} while (child != w);
}
}
static void
display_pqueue(self)
pqueueobject *self;
{
/* This is a debugging routine. Attempt to display the heap as best
as we can. */
printf("PQueue at %#x with %d nodes.\n", self, self->n);
if (self->min != NULL)
{
heapnode *x;
printf("Min -> %#x\n", self->min);
display_children(self->min, 1, NULL);
}
}
static int
check_children(child,level)
heapnode *child;
int level;
{
heapnode *w = child;
heapnode *p = child->p;
int count = 0;
int degree = 0;
do {
int result;
printf("%*sNow checking: %#x-[%ld]\n", level*4, "", w,LONG(w));
/* Check the parent link is correct */
if (p != w->p)
printf("%*s%#x-[%ld]'s parent-link not intact -> %#x-[%ld]\n",
level*4, "", w, LONG(w), w->p, LONG(w->p));
/* Check the left link */
if (w->left->right != w)
printf("%*s-L-> %#x-[%ld] -R-> %#x-[%ld]\n", level*4,"",
w->left, LONG(w->left),
w->left->right, LONG(w->left->right));
/* Check the right link */
if (w->right->left != w)
printf("%*s-R-> %#x-[%ld] -L-> %#x-[%ld]\n", level*4,"",
w->right, LONG(w->right),
w->right->left, LONG(w->right->left));
/* Check the heap condition */
if (w->p != NULL) {
PyObject_Cmp(w->priority, w->p->priority, &result);
if (result < 0)
printf("%*sHeap-condition violated: %#x-[%ld] > %#x-[%ld]\n",
level*4,"", w->p,LONG(w->p), w,LONG(w));
}
/* Recur to children if they're present */
if (w->child) {
printf("%*s(should have %d children -> %#x-[%ld])\n",
level*4, "", w->degree,w->child,LONG(w->child));
if( w->child->p != w )
printf("%*s(doesn't point back -> %#x-[%ld])\n",
level*4,"",w->child->p,LONG(w->child->p));
count += check_children(w->child, level+1);
}
degree++;
w = w->right;
} while (w != child);
/* Assert the degree was correct */
if( w->p == NULL )
printf("%*s(no parent, degree information unchecked)\n",
level*4, "");
else if (degree != w->p->degree)
printf("%*s(%d children, %d expected from parent)\n",
level*4, "", degree, w->p->degree);
count += degree;
return count;
}
static void
check_heap(pqp)
pqueueobject *pqp;
{
if (pqp->min != NULL) {
int count = check_children(pqp->min,0);
if (count == pqp->n)
printf("Hmm... %d nodes expected and accounted for.\n",
pqp->n);
else
printf("Doh! %d nodes expected, only %d found.\n",
pqp->n, count);
}
}
#endif /* DEBUG */
/* PQueue -- C API -- Constructor */
#define is_pqueueobject(v) ((v)->ob_type == &PQueuetype)

static pqueueobject *
pqueue_new()
{
pqueueobject *pqp;
pqp = PyObject_NEW(pqueueobject, &PQueuetype);
if (pqp == NULL)
return NULL;
/* Allocate the dictionary */
pqp->dict = PyDict_New();
if (pqp->dict == NULL)
return NULL;
/* No minimum to start off with (and also no nodes) */
pqp->min = NULL;
pqp->n = 0;
return pqp;
}
/* PQueue methods */
static void
children_dealloc(child)
heapnode *child;
{
child->left->right = NULL;
do {
heapnode *x = child;
if (x->child != NULL) {
x->left->right = x->right;
x->right = x->child;
}
Py_DECREF(x->priority);
Py_DECREF(x->data);
child = child->right;
free(x);
} while( child != NULL );
}
static void
pqueue_dealloc(pqp)
pqueueobject *pqp;

{
Py_DECREF(pqp->dict);
if(pqp->min != NULL)
children_dealloc(pqp->min);
PyMem_DEL(pqp);
}
static PyObject *
pqueue_insert(self, args)
pqueueobject *self;
PyObject *args;
{
PyObject *priority, *data, *ptr;
heapnode *x;
heapnodetramp *tramp;
int newmin, newdata;
/* Check the parameters first */
if (!PyArg_ParseTuple(args, "OO:insert", &priority, &data))
return NULL;
/* Retrieve the data if it already exists */
ptr = PyDict_GetItem(self->dict, data);
if ((ptr == NULL) && (PyErr_Occurred() != NULL))
return NULL;
/* Do the comparison early on to detect errors early */
Py_INCREF(priority);
Py_INCREF(data);
if (self->min != NULL) {
if(PyObject_Cmp(self->min->priority, priority, &newmin) == -1){
PyErr_SetString(PyExc_ValueError,
"unable to compare priority");
Py_DECREF(priority);
Py_DECREF(data);
return NULL;
}
}
/* Then try and allocate the node */
if (NULL == (x = malloc(sizeof(heapnode)))) {
PyErr_NoMemory();
Py_DECREF(priority);
Py_DECREF(data);
return NULL;
}
/* Now make the CObject and put it in the dictionary (if need be) */
if (ptr == NULL) {
PyObject *cobject;
tramp = malloc(sizeof(heapnodetramp));
cobject = PyCObject_FromVoidPtr(tramp, free);
if ((tramp == NULL) || (cobject == NULL) ||
(PyDict_SetItem(self->dict, data, cobject) == -1)) {
/* If things failed, clean up and go home */
Py_XDECREF(cobject);
Py_DECREF(priority);

Py_DECREF(data);
free(x);
/* Once cobject exists, releasing it above has already freed tramp */
if ((tramp != NULL) && (cobject == NULL))
free(tramp);
return NULL;
}
Py_DECREF(cobject); /* Since PyDict_SetItem borrows */
tramp->ptr = x;
tramp->refcount = 1;
} else {
tramp = (heapnodetramp*)PyCObject_AsVoidPtr(ptr);
/* CObject already exists, increment the counter */
tramp->ptr = NULL;
tramp->refcount++;
}
/* Initialize the node structure */
x->degree = 0;
x->p = NULL;
x->child = NULL;
x->mark = 0;
x->priority = priority;
x->data = data;
/* Insert the node into the root list */
if (self->min == NULL) {
self->min = x->left = x->right = x;
} else {
x->right = self->min;
x->left = self->min->left;
self->min->left->right = x;
self->min->left = x;
if (newmin > 0) {
self->min = x;
}
}
self->n++;
/* We return None to indicate success */
Py_INCREF(Py_None);
return Py_None;
}
static PyObject *
pqueue_peek(self, args)
pqueueobject *self;
PyObject *args;
{
/* Check the parameters first */
if (!PyArg_ParseTuple(args, ":peek"))
return NULL;
/* If empty, return an error */
if (self->min == NULL) {
PyErr_SetString(PyExc_IndexError, "nothing in the queue");
return NULL;
}

/* Return a tuple of the current min node */
return Py_BuildValue("OO", self->min->priority, self->min->data);
}
static inline void
consolidate(self)
pqueueobject *self;
{
#ifdef DEBUG
printf("Starting consolidate()\n");
#endif /* DEBUG */
if (self->min != NULL)
{
int i;
heapnode *A[MAXDEG];
memset(A, 0, sizeof(heapnode*)*MAXDEG);
/* We break the link to detect when we've gone through all the
nodes. This can be done since we only remove nodes and
never insert them. */
self->min->left->right = NULL;
do {
heapnode *x = self->min;
int d = x->degree;
/* Advance to the next one while we can */
self->min = self->min->right;
#ifdef DEBUG
printf("Looking at %#x-[%ld] (d=%d).\n", x, LONG(x), d);
#endif /* DEBUG */
while( A[d] != NULL ) {
int cmpresult;
heapnode *y = A[d];
#ifdef DEBUG
printf("Doh! %#x-[%ld] already has d=%d.\n",
y, LONG(y), d);
#endif /* DEBUG */
/* This _should_ never fail? */
PyObject_Cmp(x->priority,
y->priority, &cmpresult);
if (cmpresult > 0) {
heapnode *t = x;
x = y;
y = t;
#ifdef DEBUG
printf("(and we're bigger)\n");
#endif /* DEBUG */
}
/* Make y a child of x */
#ifdef DEBUG
printf("Making %#x-[%ld] a child of %#x-[%ld].\n",
y, LONG(y), x, LONG(x));
#endif /* DEBUG */
y->p = x;
if (x->child == NULL)

x->child = y->left = y->right = y;
else {
y->right = x->child;
y->left = x->child->left;
x->child->left->right = y;
x->child->left = y;
}
x->degree++;
/* Mark y as having been made a child */
y->mark = 0;
A[d++] = NULL;
}
A[d] = x;
#ifdef DEBUG
printf("Storing %#x-[%ld] at d=%d.\n", x, LONG(x), d);
#endif /* DEBUG */
} while (self->min != NULL);
#ifdef DEBUG
printf("Time to build the root list....\n");
#endif /* DEBUG */
for(i=0; i<MAXDEG; i++)
if (A[i] != NULL) {
#ifdef DEBUG
printf("Oooh.. found %#x-[%ld] with d=%d.\n",
A[i], LONG(A[i]), i);
#endif /* DEBUG */
/* Insert the node into the root list */
if (self->min == NULL) {
self->min = A[i]->left =
A[i]->right = A[i];
} else {
int newmin;
A[i]->right = self->min;
A[i]->left = self->min->left;
self->min->left->right = A[i];
self->min->left = A[i];
/* Check to see if we have a new min */
PyObject_Cmp(self->min->priority,
A[i]->priority, &newmin);
if (newmin > 0) {
self->min = A[i];
}
}
}
}
}
static PyObject *
pqueue_pop(self, args)
pqueueobject *self;
PyObject *args;
{
PyObject *ret;
heapnode *min;

heapnode *child;
heapnodetramp *tramp;
/* Check the parameters first */
if (!PyArg_ParseTuple(args, ":pop"))
return NULL;
/* Enter the debugger... */
/* raise(SIGTRAP); */
#ifdef DEBUG
check_heap(self);
#endif /* DEBUG */
min = self->min;
/* If empty, return an error */
if (min == NULL) {
PyErr_SetString(PyExc_IndexError, "nothing in the queue");
return NULL;
}
#ifdef DEBUG
printf("Removing min from the root list...\n");
#endif /* DEBUG */
/* Remove the children of the min-node and put them on root list */
child = min->child;
if (child != NULL) {
heapnode *c = child;
/* Set each child to have no parent */
do {
c->p = NULL;
c = c->right;
} while(c != child);
/* Now put the child-list on the root list */
min->left->right = child;
child->left->right = min;
c = child->left;
child->left = min->left;
min->left = c;
}
/* Now pull the min-node out of the root list */
min->left->right = min->right;
min->right->left = min->left;
/* Check if we're the only node in the root list */
if (min == min->right)
self->min = NULL;
else {
self->min = min->right;
#ifdef DEBUG
check_heap(self);
#endif /* DEBUG */
consolidate(self);

}
self->n--;
/* Reduce the refcount for the data */
tramp = PyCObject_AsVoidPtr(PyDict_GetItem(self->dict, min->data));
if (--(tramp->refcount) == 0) {
PyDict_DelItem(self->dict, min->data);
}
/* Build the return tuple, and de-allocate the node */
ret = Py_BuildValue("OO", min->priority, min->data);
Py_DECREF(min->priority);
Py_DECREF(min->data);
free(min);
return ret;
}
#ifdef DEBUG
static PyObject *
pqueue_display(self, args)
pqueueobject *self;
PyObject *args;
{
/* Check the parameters first */
if (!PyArg_ParseTuple(args, ":display"))
return NULL;
display_pqueue(self);
Py_INCREF(Py_None);
return Py_None;
}
#endif /* DEBUG */
static PyMethodDef pqueue_methods[] = {
    {"insert",  (PyCFunction)pqueue_insert,  METH_VARARGS},
    {"peek",    (PyCFunction)pqueue_peek,    METH_VARARGS},
    {"pop",     (PyCFunction)pqueue_pop,     METH_VARARGS},
#ifdef DEBUG
    {"display", (PyCFunction)pqueue_display, METH_VARARGS},
#endif /* DEBUG */
    {NULL, NULL}                             /* Sentinel */
};
static int
pqueue_length(pqp)
pqueueobject *pqp;
{
return pqp->n;
}
static PyObject *
pqueue_subscript(pqp, key)
pqueueobject *pqp;
PyObject *key;
{
heapnode *hp;
heapnodetramp *tramp;
PyObject *cobject = PyDict_GetItem(pqp->dict, key);
if ((cobject == NULL) ||
((tramp = PyCObject_AsVoidPtr(cobject))->ptr == NULL)) {
PyErr_SetObject(PyExc_KeyError, key);
return NULL;
}
hp = tramp->ptr;
Py_INCREF(hp->priority);
return hp->priority;
}
static inline void
cut(pqp, x, y)
pqueueobject *pqp;
heapnode *x, *y;
{
#ifdef DEBUG
printf("Starting cut()\n");
#endif /* DEBUG */
/* Remove x from the child list of y */
if (x->right == x)
/* Only child */
y->child = NULL;
else {
if (y->child == x)
/* Is it worth doing this test? */
y->child = x->right;
x->right->left = x->left;
x->left->right = x->right;
}
y->degree--;
/* Put x on the root list */
x->left = pqp->min->left;
x->right = pqp->min;
pqp->min->left->right = x;
pqp->min->left = x;
x->p = NULL;
x->mark = 0;
}
static void
cascading_cut(pqp, y)
pqueueobject *pqp;
heapnode *y;
{
heapnode *z = y->p;
#ifdef DEBUG
printf("Starting cascading_cut()\n");
#endif /* DEBUG */
if (z != NULL) {
if (y->mark == 0)
y->mark = 1;
else {
cut(pqp, y, z);
cascading_cut(pqp, z);

}
}
}
static int
decrease_key(pqp, x, priority)
pqueueobject *pqp;
heapnode *x;
PyObject *priority;
{
/* Assume we've already checked that priority <= x->priority */
int result = -1;
heapnode *y = x->p;
#ifdef DEBUG
printf("Starting decrease_key()\n");
#endif /* DEBUG */
if (y != NULL) {
if ((priority != NULL) &&
(PyObject_Cmp(priority, y->priority, &result) == -1)) {
Py_DECREF(priority);
PyErr_SetString(PyExc_ValueError,
"unable to compare value");
return -1;
}
}
/* Throw away the old priority now */
Py_DECREF(x->priority);
x->priority = priority;
/* If we need to move the node up the tree, do it */
if ((y != NULL) && (result < 0)) {
cut(pqp,x,y);
cascading_cut(pqp,y);
}
/* If we have a new minimum, make note of it */
if (priority != NULL)
PyObject_Cmp(x->priority, pqp->min->priority, &result);
if (result < 0)
pqp->min = x;
return 0;
}
static inline int
delete_key(pqp, x)
pqueueobject *pqp;
heapnode *x;
{
PyObject *min;
/* Now do a reduce on it */
decrease_key(pqp, x, NULL);
/* NULL == -infinity */
/* Now put in the Py_None for the priority */
Py_INCREF(Py_None);
x->priority = Py_None;
/* Now do the extract-min to remove the same element */

min = pqueue_pop(pqp, PyTuple_New(0));
if (min == NULL)
return -1;
/* Throw away the result */
Py_DECREF(min);
return 0;
}
static int
pqueue_ass_sub(pqp, data, priority)
pqueueobject *pqp;
PyObject *data, *priority;
{
int result;
heapnode *hp;
heapnodetramp *tramp;
PyObject *cobject = PyDict_GetItem(pqp->dict, data);
/* Check we could find the node they're referring to */
if ((cobject == NULL) ||
((tramp = PyCObject_AsVoidPtr(cobject))->ptr == NULL)) {
if (priority == NULL) {
/* Deleting non-existent node */
PyErr_SetObject(PyExc_KeyError, data);
return -1;
} else {
/* Setting non-existent node */
/* Turn the set into an insert() */
PyObject *ret =
pqueue_insert(pqp,
Py_BuildValue("OO", priority,
data));
if (ret == NULL)
return -1;
Py_DECREF(ret);
return 0;
}
}
/* Find the node they're talking about */
hp = tramp->ptr;
/* Check if we're doing a deletion */
if (priority == NULL) {
return delete_key(pqp, hp);
}
/* Next check they're reducing the key */
if (PyObject_Cmp(priority, hp->priority, &result) == -1) {
PyErr_SetString(PyExc_ValueError, "unable to compare value");
return -1;
} else if (result > 0) {
/* New key is greater than old. Do a delete/insert. */
int ret = delete_key(pqp, hp);
if (ret != 0)
return ret;
else {
PyObject *ret =
pqueue_insert(pqp,

Py_BuildValue("OO", priority,
data));
if (ret == NULL)
return -1;
Py_DECREF(ret);
return 0;
}
}
#ifdef DEBUG
check_heap(pqp);
#endif /* DEBUG */
/* Take ownership of the new priority, and assign it to the node */
Py_INCREF(priority);
return decrease_key(pqp, hp, priority);
}
static PyMappingMethods pqueue_as_mapping = {
    (inquiry)pqueue_length,        /* mp_length */
    (binaryfunc)pqueue_subscript,  /* mp_subscript */
    (objobjargproc)pqueue_ass_sub, /* mp_ass_subscript */
};
static PyObject *
pqueue_getattr(pqp, name)
pqueueobject *pqp;
char *name;
{
return Py_FindMethod(pqueue_methods, (PyObject *)pqp, name);
}
static int
pqueue_setattr(pqp, name, v)
pqueueobject *pqp;
char *name;
PyObject *v;
{
PyErr_SetString(PyExc_AttributeError,
"can't modify pqueue attributes");
return -1;
}
static char PQueuetype_Type__doc__[] =
"Priority queues are used as a FIFO buffer, but with the difference that\n\
items in the queue have a priority associated with them. The item with the\n\
lowest priority is always extracted from the list first.\n";
static PyTypeObject PQueuetype = {
    PyObject_HEAD_INIT(&PyType_Type)
    0,                              /* ob_size */
    "pqueue",                       /* tp_name */
    sizeof(pqueueobject),           /* tp_basicsize */
    0,                              /* tp_itemsize */
    /* methods */
    (destructor)pqueue_dealloc,     /* tp_dealloc */
    0,                              /* tp_print */
    (getattrfunc)pqueue_getattr,    /* tp_getattr */
    (setattrfunc)pqueue_setattr,    /* tp_setattr */
    0,                              /* tp_compare */
    0,                              /* tp_repr */
    0,                              /* tp_as_number */
    0,                              /* tp_as_sequence */
    &pqueue_as_mapping,             /* tp_as_mapping */
    0,                              /* tp_hash */
    0,                              /* tp_call */
    0,                              /* tp_str */
    0L,0L,0L,0L,                    /* future expansion */
    PQueuetype_Type__doc__          /* Documentation string */
};
/* PQueue -- Python API -- Constructor */
static PyObject *
pqueue_PQueue(self, args)
PyObject *self;
PyObject *args;
{
if (!PyArg_ParseTuple(args, ""))
return NULL;
return (PyObject *)pqueue_new();
}
static PyMethodDef PQueueMethods[] = {
    {"PQueue", (PyCFunction)pqueue_PQueue, METH_VARARGS},
    {NULL, NULL}                    /* Sentinel */
};
void
initpqueue()
{
(void) Py_InitModule("pqueue", PQueueMethods);
}
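
Seen from Python, the module above exposes a PQueue() constructor whose objects support insert(priority, data), peek(), pop(), len(), and mapping-style access keyed on the data. The sketch below models that interface in pure Python 3 so the intended semantics can be tried without compiling the extension. It is a stand-in built on several assumptions: it uses the standard heapq binary heap with lazy deletion instead of a Fibonacci heap, it assumes each data item is queued at most once (the C code reference-counts repeated inserts of the same data), and the class name ToyPQueue is hypothetical.

```python
import heapq
import itertools


class ToyPQueue:
    """Pure-Python model of the pqueue interface (hypothetical name)."""

    def __init__(self):
        self._heap = []      # entries: [priority, serial, data, alive]
        self._entries = {}   # data -> its live heap entry
        self._serial = itertools.count()  # tie-breaker for equal priorities

    def __len__(self):
        return len(self._entries)

    def __getitem__(self, data):
        return self._entries[data][0]          # KeyError if absent

    def __setitem__(self, data, priority):
        # Re-prioritise by marking any old entry dead and pushing a new one.
        if data in self._entries:
            self._entries.pop(data)[3] = False
        entry = [priority, next(self._serial), data, True]
        self._entries[data] = entry
        heapq.heappush(self._heap, entry)

    def __delitem__(self, data):
        self._entries.pop(data)[3] = False     # KeyError if absent

    def insert(self, priority, data):
        self[data] = priority

    def _discard_dead(self):
        # Drop invalidated entries that have surfaced at the top of the heap.
        while self._heap and not self._heap[0][3]:
            heapq.heappop(self._heap)

    def peek(self):
        self._discard_dead()
        if not self._heap:
            raise IndexError("nothing in the queue")
        priority, _, data, _ = self._heap[0]
        return (priority, data)

    def pop(self):
        result = self.peek()
        heapq.heappop(self._heap)
        del self._entries[result[1]]
        return result


pq = ToyPQueue()
pq.insert(5, "e")
pq.insert(3, "c")
pq.insert(4, "d")
pq["e"] = 1      # decrease a key
pq["c"] = 9      # increase a key (a delete followed by an insert in the C code)
del pq["d"]      # remove an item without popping it
order = [pq.pop() for _ in range(len(pq))]
print(order)
```

The binary heap gives O(log n) insertion and re-prioritisation, whereas the Fibonacci heap in the listing offers constant amortised time for insert and decrease-key; the sketch is meant only to show the calling conventions.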
