Abbreviated and adapted by C. Lyon from the paper at the Link Grammar home page.
Primarily for use as a reference.
There are different types of connectors and those connectors may also point to the right or to the left.
Right-pointing connectors are labeled "+", left-pointing connectors are labeled "-". A left-pointing
connector connects with a right-pointing connector of the same type on another word. The two connectors
joined together form a "link". For example, for the sentence the cat chased a snake, the links look like
this:
Words have rules about how their connectors can be connected up, that is, rules about what would
constitute a valid use of that word. A valid sentence is one in which all the words present are used in a way
which is valid according to their rules, and which also satisfies certain global rules.
1.2. WORD RULES. A simple dictionary entry would look like this:
blah: A+;
This means that if the word "blah" is used in a sentence, it must form an "A" link with another word; that
is, there must be another word to the right of it with an "A-" connector. Otherwise the sentence is not valid.
The expression following the colon is the "linking requirement" for the word.
A word may have more than one connector that has to be connected. This would be notated as
blah: A+ & B+;
A word may have a rule that either one of two (or one of several) connectors can be used, but exactly one
must be used. In the dictionary, we notate this as
blah: A+ or B-;
This means that if the word can make either an "A" link to the right, or a "B" link to the left, its use in the
sentence is valid; but it must make one or the other, and it can not make both.
These rules can be combined. For example, consider the following notation:
blah: A+ or (B- & C+);
This means that the word must make either an "A" link to the right, or a "B" link to the left and a "C" link
to the right. No other combination will be valid.
Such expressions can be nested without limit, such as
blah: (A+ or B-) & ((C- & A+ & (D- or E-)) or F+);
Some connectors are optional; this is notated with curly brackets. For example:
blah: A+ & {B+};
This means the word must make an "A" link to the right, and it can make a "B" link to the right but
does not have to. Curly brackets can also be put around complex expressions, like
blah: (A+ or B+) & {C- & (D+ or E-)};
An equivalent way of writing an optional expression like "{X-}" is "(X- or ())". This can be useful,
since it allows a cost to be put on the no-link option (see Section 3.5).
A word can also make an indefinite number of links of the same type to other words. For this, we use
the "multi-connector" symbol "@". For instance, the word below could make any number of F links to
words to the right (but is not required to make any).
blah: (A+ or B+) & {C- & (D+ or E-)} & {@F+};
(If a word has "@A+", with no curly brackets, it is required to make at least one A+ link to the right;
any others are optional.)
The ordering of elements in the "connector expression" is important. What that dictates is the relative
closeness of the words that are being connected to. The further to the left the connector name, the
closer the connection must be. For example,
blah: A+ & B+;
This means that "blah" must make an "A" link to the right and a "B" link to the right, and the word it
makes the "A" link with must be closer than the word it makes the "B" link with.
This only pertains, however, to connections in the same direction. For connectors pointing in opposite
directions, the ordering is irrelevant. Therefore the following two expressions are equivalent:
blah: A+ & B-;
blah: B- & A+;
For "or" expressions, such as "A+ or B+", the ordering of the elements is irrelevant.
A dictionary entry thus consists of a word, followed by a colon, followed by a connector expression,
followed by a semi-colon. The dictionary consists of a series of such entries. Any number of words can
be put in a list, separated by spaces; they will then all possess the linking requirement that follows:
blah blee blay: A+;
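The entry format can be illustrated with a minimal Python sketch. It splits an entry into its word list and connector expression, then checks that each connector is one or more capital letters followed by "+" or "-", optionally preceded by "@". The function names parse_entry and connectors_in are hypothetical and not part of the parser; the sketch ignores quoted punctuation entries, lower-case connector subscripts, and comments.

```python
import re

# One or more capital letters followed by "+" or "-", optionally
# preceded by the multi-connector symbol "@".
CONNECTOR = re.compile(r"@?[A-Z]+[+-]")


def parse_entry(entry):
    """Split 'w1 w2: EXPR;' into (list of words, expression string)."""
    entry = entry.strip()
    if not entry.endswith(";"):
        raise ValueError("entry must end with a semi-colon")
    head, sep, expr = entry[:-1].partition(":")
    if not sep:
        raise ValueError("entry must contain a colon")
    words = head.split()
    if not words:
        raise ValueError("entry must list at least one word")
    return words, expr.strip()


def connectors_in(expr):
    """Return the connector tokens of an expression, rejecting malformed names."""
    # Blank out grouping symbols and "&", leaving connectors and "or".
    tokens = re.sub(r"[(){}\[\]&]", " ", expr).split()
    found = []
    for tok in tokens:
        if tok == "or":
            continue
        if not CONNECTOR.fullmatch(tok):
            raise ValueError(f"bad connector name: {tok!r}")
        found.append(tok)
    return found


words, expr = parse_entry("blah blee blay: A+ or (B- & {@C+});")
print(words)                # ['blah', 'blee', 'blay']
print(connectors_in(expr))  # ['A+', 'B-', '@C+']
```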
A connector name must consist of one or more capital letters (any number may be used), followed by
"+" or "-".
We should mention one concept here that plays an important role in the internal workings of the parser:
the "disjunct". A disjunct is a set of connector types that constitutes a legal use of a word. The
dictionary expression for any word can be represented as a set of disjuncts. If a word has the following
expression:
blah: {C-} & (A+ or B+);
then it has the following four disjuncts:
(C-, A+)
(A+)
(C-, B+)
(B+)
These disjuncts represent all the legal uses of the word "blah". Using C- and A+ is a legal use of the
word; using A+ and B+ is not.
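The expansion of a connector expression into disjuncts can be sketched in a few lines of Python. This is an illustrative reading of the rules in Section 1.2, not the parser's actual implementation. The nested-tuple representation and the function name disjuncts are assumptions made for the example, and multi-connectors ("@") and costs are left out.

```python
from itertools import product


def disjuncts(expr):
    """Expand a connector expression into its list of disjuncts.

    An expression is a connector string such as "A+", or a tuple
    ("and", e1, e2, ...), ("or", e1, e2, ...), or ("opt", e) for {e}.
    Each disjunct is a tuple of connectors, kept in expression order.
    """
    if isinstance(expr, str):
        return [(expr,)]
    op, *args = expr
    if op == "or":
        # Exactly one alternative is used.
        return [d for a in args for d in disjuncts(a)]
    if op == "opt":
        # {e} is equivalent to (e or ()): use e, or use nothing.
        return disjuncts(args[0]) + [()]
    if op == "and":
        # Take one disjunct from every operand and concatenate them.
        return [sum(combo, ()) for combo in product(*(disjuncts(a) for a in args))]
    raise ValueError(f"unknown operator: {op!r}")


# blah: {C-} & (A+ or B+);
blah = ("and", ("opt", "C-"), ("or", "A+", "B+"))
for d in disjuncts(blah):
    print(d)
```

For the expression above, the sketch yields the combinations (C-, A+), (C-, B+), (A+) and (B+); the combination of A+ with B+ never appears, because "or" admits exactly one alternative.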
1.3. GLOBAL RULES. As well as these "word rules", which are specified in the dictionary, there are
two other global rules which control how words can be connected.
First of all, links can not cross. For example, the following way of connecting these four words
(connecting "cat" to "dog" and "horse" to "fish") would be illegal. The parser simply will not find such
linkages.
        +-------------+
 +------|------+      |
 |      |      |      |
cat   horse   dog   fish
This is the "crossing-links" (or "planarity") rule. Secondly, all the words in a sentence must be
indirectly connected to each other. Therefore the following way of connecting these four words would
be illegal (if it was the entire linkage).
 +------+      +------+
 |      |      |      |
cat   horse   dog   fish
This is the "connectivity" rule. A valid sentence is therefore one which can be linked up in a way that
a) all the words are used in a way that satisfies their linking requirements; and b) the crossing-links
and connectivity rules are not violated.
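The two global rules can be checked mechanically once a linkage is written as a list of word-position pairs. The following Python sketch is only an illustration of the rules as stated here; the function names are hypothetical and the parser itself does not work by testing finished linkages this way.

```python
def links_cross(a, b):
    """True if links a and b (pairs of word positions) cross when drawn above the words."""
    (l1, r1), (l2, r2) = sorted(a), sorted(b)
    return l1 < l2 < r1 < r2 or l2 < l1 < r2 < r1


def is_planar(links):
    """Crossing-links rule: no two links may cross."""
    return not any(links_cross(a, b)
                   for i, a in enumerate(links) for b in links[i + 1:])


def is_connected(n_words, links):
    """Connectivity rule: every word is reachable from word 0 by following links."""
    neighbours = {i: set() for i in range(n_words)}
    for a, b in links:
        neighbours[a].add(b)
        neighbours[b].add(a)
    seen, stack = {0}, [0]
    while stack:
        for nxt in neighbours[stack.pop()] - seen:
            seen.add(nxt)
            stack.append(nxt)
    return len(seen) == n_words


# Word positions: 0=cat, 1=horse, 2=dog, 3=fish
crossing = [(0, 2), (1, 3)]        # cat-dog and horse-fish
islands = [(0, 1), (2, 3)]         # cat-horse and dog-fish
chain = [(0, 1), (1, 2), (2, 3)]   # a legal arrangement

print(is_planar(crossing), is_connected(4, crossing))  # False False
print(is_planar(islands), is_connected(4, islands))    # True False
print(is_planar(chain), is_connected(4, chain))        # True True
```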
1.4. LINK GRAMMAR IN RELATION TO OTHER SYSTEMS. The structure assigned to a
sentence by a link grammar is rather unlike any other grammatical system that we know of (although it
is certainly related to dependency grammar). Rather than thinking in terms of syntactic functions (like
subject or object) or constituents (like "verb phrase"), one must think in terms of relationships between
pairs of words. In the sentence below, for example, there is an "S" ("subject") relation between "dog"
and "has"; a "PP" (past-participle) relationship between "has" and "gone"; and a "D" (determiner)
relation between "the" and "dog". (Ignore the lower-case letters for the moment; they will be explained
below.)
 +-------Ds-------+
 |      +----A----+--Ss--+--PP--+
 |      |         |      |      |
the  black.a    dog.n   has    gone
Where possible, we try to give link-types names that have mnemonic significance in this way.
It may be seen, however, that parts of speech, syntactic functions, and constituents may be recovered
from a link structure rather easily. For example, whatever word is on the left end of an "S" link is the
subject of a clause (or the head word of the subject phrase); whatever is on the right end is the finite
verb; whatever is on the left end of a D link is a determiner; etc. Moreover, all nouns, verbs, and
adjectives in the dictionary are subscripted (as ".n", ".v", or ".a"--see section 3.4), so in these cases the
syntactic category of the word is made explicit.
With version 4.0, we have incorporated a system for deriving a traditional constituent representation of
a sentence from a linkage.
(In the following verb files, the final number indicates the verb form. ".1" is for infinitive-plural forms,
".2" is for singular forms, ".3" is for simple-past / past-participle forms, ".4" is for present participles,
".5" is for gerunds. On intransitive verbs, the present participle and gerund expression are combined
into a single dictionary entry.)
words.v.1.(1-4)     intransitive verbs
words.v.1.p         special two-word passives ("lied_to_", "paid_for")
words.v.2.(1-5)     optionally transitive verbs
words.v.4.(1-5)     transitive verbs
words.v.5.(1-4)     intransitive verbs that may form two-word verbs with particles like "up" and "out"
words.v.6.(1-5)     optionally transitive verbs that may form two-word verbs
words.v.8.(1-5)     transitive verbs that may form two-word verbs
words.v.10.(1-4)    verbs that may be used in quotation expressions, like "said" ("John is here, he said")
words.adj.1
words.adj.2
words.adj.3
words.adv.1
words.adv.2
words.adv.3
words.y
words.s
3.4. WORD SUBSCRIPTS. A single word can be given several different dictionary entries. To do this, the
entries must be distinguished by giving the words different subscripts. Words may be followed by a
subscript such as ".n". For example:
run.n: A+ or B+...
run.v: C+ or D+...
If a word is listed more than once with the same subscript, or if it is listed once with a subscript and once
without, the parser will generate a warning message and will ignore one of the entries.
The parser starts at the right end of every string of characters. Any sequence of letters to the right of the
right-most period in the string will be considered the subscript.
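The right-to-left rule for finding a subscript can be sketched as follows. The function name split_subscript is hypothetical, and the sketch assumes that a subscript counts only when it is a non-empty run of letters with something before the period, which is one reading of the rule above.

```python
def split_subscript(entry_word):
    """Split 'run.v' into ('run', 'v'); return (word, None) if there is no subscript."""
    # rpartition splits at the right-most period, as the rule requires.
    base, dot, suffix = entry_word.rpartition(".")
    if dot and base and suffix.isalpha():
        return base, suffix
    return entry_word, None


print(split_subscript("run.n"))   # ('run', 'n')
print(split_subscript("run"))     # ('run', None)
print(split_subscript("3.5"))     # ('3.5', None)
```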
In searching for linkages, the parser will consider each entry for the word as a different word, and will
generate all linkages found for all entries. The subscript is shown in the display, thus indicating which
entry the parser chose for a particular linkage.
The main word subscripts we use are ".n" for nouns, ".v" for verbs, and ".a" for adjectives. All nouns,
verbs, and adjectives are subscripted in this way. Certain other subscripts are used only when needed to
distinguish two forms of the same word: ".e" for adverb, ".p" for preposition, ".s" for singular, ".p" for
plural, ".t" for title.
3.5. THE COST SYSTEM. (Ignore this section initially.)
We have a system for assigning a cost to a linkage. This allows the parser to express preferences among
the linkages it finds. The cost system uses square brackets ("[" and "]"). If a connector, or a series of
connectors, is surrounded by square brackets, it is assigned a cost. The amount of cost is equal to the
number of square brackets on each side: [A+] will receive a cost of 1; [[A+]] will receive a cost of 2; etc.
The parser uses this cost as a criterion for deciding which linkage to output first; it outputs them in order of
cost (i.e., lowest cost first).
At the moment, connectors with a cost of 0, 1 or 2 are considered in normal parsing.
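The bracket-counting rule can be illustrated with a short Python sketch. It reports, for each connector in an expression, how many pairs of square brackets enclose it. The function name connector_costs is hypothetical. Note that a bracketed series of connectors carries its cost once as a unit; the sketch only reports nesting depth and does not model how the parser totals costs over a linkage.

```python
import re


def connector_costs(expr):
    """Pair each connector in expr with the number of square-bracket pairs enclosing it."""
    costs, depth = [], 0
    for tok in re.findall(r"\[|\]|@?[A-Z]+[+-]", expr):
        if tok == "[":
            depth += 1
        elif tok == "]":
            depth -= 1
            if depth < 0:
                raise ValueError("unbalanced ']'")
        else:
            costs.append((tok, depth))
    if depth != 0:
        raise ValueError("unbalanced '['")
    return costs


costs = connector_costs("A+ or [B-] or [[C+ & D-]] or [[[E+]]]")
print(costs)
# [('A+', 0), ('B-', 1), ('C+', 2), ('D-', 2), ('E+', 3)]

# Connectors with a cost of 0, 1 or 2 are considered in normal parsing.
print([name for name, cost in costs if cost <= 2])
# ['A+', 'B-', 'C+', 'D-']
```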
Given several linkages of the same cost level, the parser has certain heuristics for choosing the best parse,
i.e., the one to output first. It prefers the linkage in which the total length of the links is lowest; and in
sentences with conjunctions, it prefers a linkage where the lengths of the conjoined word-lists are similar
(see section 5). This information is indicated in the cost vector shown above the linkage:
Unique linkage, cost vector = (UNUSED=1 DIS=0 AND=0 LEN=1)
"DIS" is the connector cost or disjunct cost for the linkage (the "[]" system explained above); "AND" is the
difference in length between and-list elements; and "LEN" is the total length of all links in the sentence
(minus the number of words--since the total link length is never less than the number of words).
"UNUSED" indicates the number of null-links; see section 7.1.
Several different unknown word categories may be generated, labeled with different subscripts: for
example, corresponding to nouns, verbs, adjectives, and adverbs. (These are the four categories we use,
labeled .n, .v, .a, and .e, respectively.) The parser will search for all linkages that can be found using each
entry. If it only finds a linkage for the "noun" category, then the output will show the unknown word
labeled ".n": in effect, the parser is then guessing that the word is a noun.
Version 4.0 of the parser has a new feature for handling unknown words, known as "morpho-guessing".
This is a system for guessing the syntactic category of an unknown word (that is, a word not explicitly
listed in the dictionary) based on its spelling. Words that end in "-s" are assumed to be plural nouns or
singular verbs; these are assigned to a category listed as "S-WORDS" in the dictionary. Similarly, words
ending in "-ed" are assumed to be past-tense (or passive) verbs; those ending in "-ing", present participles; those
ending in "-ly", adverbs. This greatly improves the ability of the parser to handle sentences containing
multiple unknown words. Words that have been treated in this way are marked with a "[!]".
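Suffix-based guessing of this kind can be sketched in Python as below. Only the category name "S-WORDS" comes from the text; the names "ED-WORDS", "ING-WORDS", "LY-WORDS" and "UNKNOWN-WORD" are assumed here by analogy, and the function names are hypothetical. The real parser may apply further conditions that this sketch omits.

```python
# (suffix, dictionary category) pairs; only "S-WORDS" is named in the text,
# the other category names are assumptions made for this sketch.
SUFFIX_CATEGORIES = [
    ("ing", "ING-WORDS"),
    ("ed", "ED-WORDS"),
    ("ly", "LY-WORDS"),
    ("s", "S-WORDS"),
]


def guess_category(word, dictionary):
    """Return None for a known word, else a guessed category for an unknown one."""
    if word in dictionary:
        return None
    lower = word.lower()
    for suffix, category in SUFFIX_CATEGORIES:
        # Require at least one character before the suffix.
        if lower.endswith(suffix) and len(lower) > len(suffix):
            return category
    return "UNKNOWN-WORD"


def display(word, dictionary):
    """Mark morpho-guessed words with "[!]", as in the parser's output."""
    category = guess_category(word, dictionary)
    if category in (None, "UNKNOWN-WORD"):
        return word
    return f"{word}[!]"


known = {"the", "cat", "was"}
print(guess_category("blorfing", known))  # ING-WORDS
print(guess_category("snarps", known))    # S-WORDS
print(display("glimmed", known))          # glimmed[!]
```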
4.5. PUNCTUATION. The parser is capable of handling a variety of punctuation symbols. There are two
issues to be discussed here. One is the listing of symbols in the dictionary; the other is the way they are
"read" by the parser when they are used in sentences.
Punctuation symbols can be listed in the dictionary just like words, and given ordinary linkage
expressions. The same is true for strings containing multiple punctuation symbols or a mixture of letters
and punctuation. The problem here is that certain punctuation symbols are also used as the "syntax" of the
dictionary: colons, semi-colons, ampersands, etc. Our solution to this is as follows: when listing these
special characters, or a string containing them, one must put them in quotation marks:
";": A+ or B-;
"+": C+ or D-;
(The special characters that must be treated this way are precisely those which are used in the dictionary in
a "syntactic" way: "(", ")", "{", "}", "[", "]", "@", "%", "&", "*", "+", "-", "/", "<", ">".)
When punctuation symbols are used in sentences, they will be used in linkages according to the connector
expressions listed in the dictionary, in the normal way. There is a difference, however. It may be noted that
although many punctuation symbols are similar to words in the ways they are used, they are often not
separated from preceding or following words by spaces. In order for these symbols to be recognized as
separate units, then, they must be "stripped off": that is, a space must be inserted between the symbol and
the adjacent word. Details are in the paper accessed from the Link Grammar home page.
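The "stripping off" step can be pictured with a small Python sketch that separates leading and trailing punctuation from each whitespace-delimited token. The set of strippable symbols and the function names are assumptions made for illustration; the actual parser's rules are described in the full paper and may differ.

```python
# Assumed set of strippable symbols, for illustration only.
STRIPPABLE = set(",.;:!?()")


def strip_punctuation(token):
    """Split leading and trailing punctuation symbols off a token."""
    left, right = [], []
    while token and token[0] in STRIPPABLE:
        left.append(token[0])
        token = token[1:]
    while token and token[-1] in STRIPPABLE:
        right.insert(0, token[-1])
        token = token[:-1]
    return left + ([token] if token else []) + right


def tokenize(sentence):
    """Split on spaces, then strip punctuation off each piece."""
    return [piece for tok in sentence.split() for piece in strip_punctuation(tok)]


print(tokenize("The dog, (a black one) ran."))
# ['The', 'dog', ',', '(', 'a', 'black', 'one', ')', 'ran', '.']
```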
One exceptional case is quotation marks. Quotation marks may not be defined in the dictionary; and they
are simply ignored when they are used in sentences. This is sufficient to handle most uses of quotes;
generally, the presence of quotes does not affect the well-formedness of sentences, and it often only
subtly affects meaning. However, there are a few constructions, such as the pair of sentences below,
which seem to be only correct when quotes are included.
She said, "John is leaving".
?She said, John is leaving.
We are unable to control such usages at the moment.
4.6. THE WALL(S). It proved to be useful to imagine that there was a dummy word at the beginning of
every sentence. We call this "the wall". The wall has a linking requirement like any other word; it is listed
in the dictionary under "LEFT-WALL". If this entry is included in the dictionary, the wall will be
automatically inserted at the beginning of every sentence. Because of the connectivity rule, it is then
necessary for the wall to be linked to the rest of the sentence in order for the sentence to be valid.
There is also a "right-hand wall", which is similar to the original wall but at the other end of the sentence.
This is only needed for certain punctuation phenomena. In most sentences, we use a special "RW"
connector to simply connect the left hand wall to the right hand one. The right-wall's dictionary entry is
"RIGHT-WALL". (Since the left-wall is much more important than the right-wall, we often refer to the
left-wall simply as "the wall".)
In most sentences, the left-wall connects to the sentence with a "Wd" link, and the right-wall connects to
the left-wall with "RW". When only these connectors on the walls are being used, they are not displayed in
the linkage diagram. When other connectors on the walls are being used, instead or as well, the walls are
shown. (For example, the left-wall is shown in questions and imperatives.) To make it so that the walls are
_always_ shown, type "!walls".
4.7. IDIOMS. A string of words can be defined as a single dictionary entry. To do this, simply join the
words together with underbars:
a_la_mode: A+ or B-;
Most idioms can be interpreted either as a single "idiom" or as a string of words (for example, "in
question"). In this case, the parser will find all linkages with both interpretations.
In reading idiomatic strings from the dictionary, the parser breaks them up into individual words and
assigns them "dummy" link-types which simply link the words of the idiom together in series. These
link-types are assigned four-letter names of the form ID[X][Y], where X and Y are arbitrary letters.
Idioms cannot be given subscripts; if "a_la_mode.a" is included in the dictionary, this will not be accepted.
However, an idiom can be listed in the dictionary more than once, without subscripts.
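The way an idiom entry is broken into a chain of words joined by dummy link-types can be sketched as follows. The function names and the order in which ID names are handed out are assumptions for the example; the text says only that X and Y are arbitrary letters.

```python
from itertools import product
from string import ascii_uppercase


def id_link_names():
    """Yield IDAA, IDAB, ..., IDZZ: four-letter names of the form ID[X][Y]."""
    for x, y in product(ascii_uppercase, repeat=2):
        yield f"ID{x}{y}"


def expand_idiom(idiom, names):
    """Break 'a_la_mode' into words chained in series by fresh dummy link-types.

    Returns a list of (left_word, link_name, right_word) triples.
    """
    words = idiom.split("_")
    return [(left, next(names), right) for left, right in zip(words, words[1:])]


names = id_link_names()
print(expand_idiom("a_la_mode", names))
# [('a', 'IDAA', 'la'), ('la', 'IDAB', 'mode')]
print(expand_idiom("in_question", names))
# [('in', 'IDAC', 'question')]
```

Sharing one name generator across all idioms, as in the usage above, keeps the dummy link-types of different idioms distinct.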
5. Coordinating Conjunctions
Coordination constructions do not fit naturally into the framework of link grammars. We have devised a
method for automatically transforming the given link grammar into another one that captures the desired
phenomena. See the full introduction at the Link Grammar home page for details. Note that the problems
associated with conjunctions are not yet fully resolved.
Conjunctions are a frequent source of ambiguity. For example, in the sentence "Several big cats and dogs
with sharp teeth chased me", "several" may or may not apply to "dogs" (as a plural noun, "dogs" does not
require a determiner); "big" may or may not apply to dogs; and "with sharp teeth" may or may not apply to
cats. Linkages for all of these possibilities will of course be generated.
A few usages of coordinating conjunctions are handled using ordinary link logic. There is some overlap
between the special handling of conjunctions and the ordinary handling, so that some sentences receive
multiple parses. For example, ordinary clauses conjoined together will receive two parses: "John ran and
Fred walked". See the entries in the Guide-To-Links on "W" and "CC" for discussion of these ordinary
usages of conjunctions.
Another problem concerns the different kinds of conjunctions. Our discussion focuses on the word "and",
although the ideas apply to the use of "or", "but", "either-or", "neither-nor", "both-and", and "not
only-but". Right now, our system does not always distinguish between the various kinds of conjunctions
allowed. However, there appear to be different constraints on different conjunctions. This results in some
false positives.
6. Post-Processing
Besides conjunctions, there are certain phenomena in English which the parser is incapable of dealing with
in its basic form. To solve these problems, we developed a post-processing system, based on a concept we
call "domains". A domain is a subset of the links that make up a sentence. After a linkage has been found,
the post-processing mechanism goes through the linkage and divides the sentence up into domains based
on the kind of links that are present in the sentence. It then further divides the links into "groups": sets of
links which share a particular domain membership. It then applies rules which may declare the linkage
invalid based on the combinations of links present in a given group. See the full paper at the Link
Grammar home page for details.
In null-link parsing, the connectivity requirement is suspended (see Section 1.3). This means that
disconnected "islands" may form. However, each island represents one added null link. That is, if a
sentence can be parsed as three disconnected islands (but with all the words otherwise connected with
regular links), this linkage will be found at null-link stage 2.
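The relation between islands and null links stated above (three islands correspond to null-link stage 2) can be sketched by counting connected groups of words. The function names are hypothetical, and the sketch assumes a sentence of at least one word; it illustrates the counting only, not how the parser searches for such linkages.

```python
def count_islands(n_words, links):
    """Count groups of words that are connected to each other by links."""
    parent = list(range(n_words))

    def find(i):
        # Follow parent pointers to the group representative, halving paths.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for a, b in links:
        parent[find(a)] = find(b)
    return len({find(i) for i in range(n_words)})


def null_link_stage(n_words, links):
    """Each island beyond the first represents one added null link."""
    return count_islands(n_words, links) - 1


# Six words forming three disconnected islands: (0,1), (2,3), (4,5)
print(null_link_stage(6, [(0, 1), (2, 3), (4, 5)]))  # 2
# A fully connected four-word linkage needs no null links
print(null_link_stage(4, [(0, 1), (1, 2), (2, 3)]))  # 0
```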
The null-link system can be turned on or off by typing the command "!null". The default is that null-links
are on. If null-links are turned off, then, when the parser is unable to find a complete linkage for a
sentence, it will say "No complete linkages found", and prompt for the next sentence.
7.2. THE LINK-LENGTH LIMIT. In studying the parser's performance on very long sentences (on
which it was often very slow), we discovered that it was often considering extremely long links even for
link-types which are generally very short. For this reason, we installed a "link-length-limit": links are only
allowed to be a certain length, in terms of the number of words from end to end.
7.3. THE POST-PROCESSING LIMIT. Since post-processing proved to be a major source of the
slowness of the parser, we installed a "post-processing limit". This is simply a limit on the number of
linkages that will be considered by post-processing. If the limit is set at 100 (this is the default), then only
100 linkages will be considered by post-processing, even if many more than that are generated; the others
will just be discarded. This means, of course, that the "best" linkage (by the parser's heuristics, for
example) may be discarded. However, the linkages to be considered by post-processing are selected
randomly from all the generated ones, which means that at least one linkage is likely to be found which is
fairly similar to the correct one.
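The effect of the post-processing limit can be sketched with Python's standard random.sample. The function name and the rng parameter are hypothetical; the default of 100 is the one given in the text. How the parser actually draws its random selection is not specified here.

```python
import random


def select_for_post_processing(linkages, limit=100, rng=random):
    """Return at most `limit` linkages, chosen at random when there are more."""
    if len(linkages) <= limit:
        return list(linkages)
    # Sample without replacement; the unselected linkages are discarded.
    return rng.sample(linkages, limit)


generated = [f"linkage-{i}" for i in range(1000)]
chosen = select_for_post_processing(generated, rng=random.Random(0))
print(len(chosen))                     # 100
print("linkage-0" in chosen)           # may be False: the "best" linkage can be discarded
```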