
Data & Knowledge Engineering 42 (2002) 293–314

www.elsevier.com/locate/datak

Database Lexicography
Gary Coen
Boeing Phantom Works, Mathematics and Computing Technology, P.O. Box 3707,
MC 7L-43, 98124-2207 Seattle, WA, USA

Abstract

This paper introduces database lexicography, a metadata analysis discipline that applies lexical graph
theory to data design.¹ Database lexicography proposes a formal design criterion for data dependencies,
and it provides metrics to evaluate the conformance of designs to this criterion. It treats the data dictionary
as a first-class object encoding design concepts, and its benefits include identification of database dependency
architecture; quantification of interdependent data elements' sensitivity to change; categorization of
core and peripheral data elements; model integration; and figures of merit by which to fortify data architectures
to withstand design fossilization and guide their evolution amidst changing requirements. © 2002
Published by Elsevier Science B.V.

Keywords: Database lexicography; Lexical graph; Data dictionary; Model integration

1. Problem

The data assets of large enterprises tend to be compartmentalized. Line-of-business software
systems typically incorporate legacy data designed for single application contexts. Routinely, only
data and process owners comprehend the information content of these assets. Relentless disciplinary
specialization of knowledge compounds the problem, resulting in minimal re-use of data
across processes and organizations. Under these circumstances, data systems proliferate at the
expense of information integration.
Successful enterprises typically prevail by exploiting insights into the information content of
data. For example, discovery technologies like multidisciplinary optimization and data mining
can spot unexpected but useful correlations in heterogeneous data. Integration architectures like

E-mail address: gary.a.coen@boeing.com (G. Coen).


¹ This paper extends logical foundations presented in [1] to the domain of data design and maintenance.

0169-023X/02/$ - see front matter © 2002 Published by Elsevier Science B.V.
PII: S0169-023X(02)00052-6

CORBA enable data systems to exchange information without requiring client knowledge of
information resources. When applied in concert to enterprise data problems, frameworks such as
these can support unanticipated information requirements, and from time to time this constitutes
a competitive advantage. Nevertheless, this is not necessarily good news. Remedial services
provided by such frameworks merely patch the deficiencies of stovepipe data assets that cannot
keep up with evolving requirements. Since the data systems involved are often core assets of the
enterprise, their inability to keep pace with change does not bode well for the future.
The underlying problem is often data design. Typically, the design of legacy data systems is
more suitable to initial information requirements than, say, requirements at the mid-point of their
economic lifetimes. This deficiency usually reflects the structured methods by which they were
composed.² In particular, the conceptual model of a structured design expresses business rules as
system policy that is allocated to physical implementation interfaces either directly or incrementally.
Hence, structured methods oblige high-level policy to depend directly (or through
transitive dependency relations) on concrete elements of low-level physical design. Design change
intended to meet new requirements modifies this dependency architecture. Over time, changes that
appear to produce desired effects may inadvertently violate the coherence of the original design.
Change requests and maintenance costs mount as the dependency architecture of the original
design decomposes.
The end state of design decomposition is fossilization. A data design inevitably fossilizes if its
dependency architecture must be modified in order to accommodate change. Unsurprisingly, the
key indicator of design fossilization is evident when change to the semantics of one data element
propagates change to all semantically dependent data elements. This is characteristic of structured
design, where change to low-level details impacts the semantics of high-level policy. As the
original dependency architecture decomposes, the nature and extent of this relationship becomes
less predictable. When the propagation of semantic change cannot be reliably predicted, the cost
of design changes cannot be estimated. Eventually, data and process owners become reluctant to
authorize changes, with the result that change proposals are discouraged and deemed suspect.
Ultimately, owners freeze the change management process and the design fossilizes, initiating the
final stage in the economic lifetime of the data asset.
One diagnostic for fossilized data designs is that they characteristically propagate semantic
change imposed on low-level data elements to higher-level dependents. From the perspective of
data and process owners, a design change may give rise to problems that impact aspects of the
design that have no conceptual relationship with the changed element. The remedy for one
problem leads to other seemingly unrelated problems. The quality of data for certain aspects of
the design is thrown in doubt, and the potential for data reuse diminishes accordingly. In effect,
design fossilization erects a firewall of non-reusability around data assets. Certainly this is not the
intended effect.
This paper describes database lexicography, a methodology that can be used to forestall the
process of design fossilization. Focusing on the dependency architecture of designs, it presents
concepts and methods that facilitate the productive utilization and reuse of data assets. Using

² This particular shortcoming of structured design has been known for some time. As noted in an early influential
work in object-oriented design (see [4, pp. 267–268]), most changes in requirements are changes in function rather than
in the objects, so change can be disastrous to procedure-based design.

lexical graph theory as its foundation (see [1]), this paper illustrates how to expose undefined
semantics in data assets as well as how to isolate and measure data element interdependencies.
Within this framework, one useful metric identifies points of decomposition in the dependency
architecture. Others demonstrate how to quantify sensitivity to change of interdependent data
elements, deriving useful figures of merit for the management of enterprise data.
Database lexicography is based on the hypothesis that the information content of data de-
pends ultimately on lexicographic knowledge. In other words, interpretation of a data value
crucially depends on the meaning of the data type encoding that value. Furthermore, the data
dictionary records this meaning and other information so that data semantics may be shared con-
sistently throughout the enterprise. Curiously, formal methods for managing the shared meanings
that constitute lexicographic knowledge are largely absent from current practice. The approach
presented here closes this gap by exploiting the data dictionary as a metadata resource responsible
for publishing a controlled terminology as well as the information structure of instance data in a
database. For these reasons, the framework and its methodology are called database lexico-
graphy.

2. Background

The departure point for the description of database lexicography will be to establish some
common ground with respect to concepts and vocabulary. Thereafter, the methodology of the
discipline and its services will be presented in this context.
In general, an architecture is a fundamental and unifying computing infrastructure defined in
terms of system elements, interfaces, processes, constraints, and behaviors, including its subsystems
and their allocation to tasks and processors. Architectures often include a database, a
persistent repository of information encoded in formatted data and stored in an electronic file
system. A database management system (DBMS) is a computer program for managing databases.
One or more application programs may read from and write to the same database. When more
than one independent application is present in the system context, the database often operates
as a neutral medium facilitating communication between them. Hence, the data design (often
called the data model or schema) is crucial to the success of the architecture. Data model definition
is often the most expensive, difficult, and important task executed during software development.
The ANSI three-schema architecture is a standard architecture for software systems composed
of related database applications (see [5]). It is organized in three layered subsystems known as the
external, conceptual, and internal schemata. According to this perspective, an external schema
instantiates a data model meeting the requirements of an independent application. A conceptual
schema instantiates a data model representing a global, enterprise information structure. An
internal schema is encoded in DBMS instructions identifying keys, tables, indices, storage struc-
tures, and other physical interfaces of the architecture. All three schemata collaborate to embody
a three-schema architecture.
Each layer of the three-schema architecture plays a role in facilitating information integration
and interoperability. For instance, external schemata selectively abstract away from global aspects
of conceptual schema to select views suitable to the fulfillment of particular application

requirements. In practice, each external schema erects an interface to the conceptual schema that
embodies its selective view. This facet of the architecture permits applications to be insulated from
the effects of change in the conceptual schema as it evolves over time. Seen from an alternate
perspective, the conceptual schema serves to integrate related applications within the architecture.
The interfaces of the internal schema to the conceptual schema, on the other hand, serve to hide
implementation details of the underlying DBMS.
Each level of the three-schema architecture is independent in the sense that it embodies a
different organization of information. Physical data independence obtains when internal schema
can change without impacting external schema. (This kind of change is commonplace when an
application is ported to different environments or when underlying implementations are tuned
for performance.) Logical data independence obtains when external schema can change without
impacting internal schema. (This occurs when, for instance, another client application is entered
into the system context or when client application data requirements change.) Well-designed
interfaces between schemata are necessary to the achievement of physical and logical data in-
dependence, a condition that reduces maintenance costs and extends the economic lifetime of
software systems.
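As an informal illustration of physical data independence, the following Python sketch uses the standard sqlite3 module and lets a view stand in for an external schema. The table, column, and view names are hypothetical, and the choice of SQLite is an assumption of the example rather than a system discussed in this paper. The internal schema is reorganized and the view is redefined against it, while the application query is left untouched.

```python
import sqlite3

# The application only ever issues this query against its external view.
APPLICATION_QUERY = "SELECT name FROM employee_view ORDER BY name"

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE emp (emp_no INTEGER PRIMARY KEY, full_name TEXT NOT NULL);
INSERT INTO emp VALUES (1, 'Ada'), (2, 'Grace');
CREATE VIEW employee_view AS SELECT emp_no AS id, full_name AS name FROM emp;
""")
before = conn.execute(APPLICATION_QUERY).fetchall()

# Internal change: storage moves to a differently named table and column,
# and the view (the interface between schemata) is redefined to match.
conn.executescript("""
DROP VIEW employee_view;
CREATE TABLE emp_v2 (id INTEGER PRIMARY KEY, nm TEXT NOT NULL);
INSERT INTO emp_v2 SELECT emp_no, full_name FROM emp;
DROP TABLE emp;
CREATE VIEW employee_view AS SELECT id, nm AS name FROM emp_v2;
""")
after = conn.execute(APPLICATION_QUERY).fetchall()

# The application sees the same rows before and after the internal change.
print(before == after)
```

The point of the sketch is only that the change is absorbed at the interface; a production migration would involve considerably more than this.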
Let the practice of conceiving an architecture to achieve physical and logical data independence
be known as design-for-change.

2.1. Model specification

From a software engineering perspective, a data model results from applying a rigorous,
structured methodology to the creation and validation of a system representation. A logical data
model represents a single, integrated definition of enterprise data. It is unbiased toward any application
and makes no representations with respect to physical implementation details like data
storage or access. In architectures where multiple applications exist or are contemplated, the
conceptual schema often serves as a logical data model. Otherwise, one might construe the logical
data model as equivalent to the union of external and conceptual schemata. In either case, a
physical data model is instantiated by the internal schema of the architecture as described by the
physical storage details used to implement conceptual schema data requirements. Both forms of
data model constitute metadata, or information about the characteristics of data.
Hence, logical models specify conceptual and categorical elements of an architecture, plus their
associations. From this perspective, data attributes (for example, height, price, weight, salary, and
duration) may signify distinct meanings in a logical model, each independent from the notion of
number used to instantiate data values for these attributes. Although necessary to the interpre-
tation of such data values, the notion number need not be represented in the logical model at all.
Similarly, data entities (product, for example) can exist independently of their particular desig-
nators, which again need not be represented in the model. Thus the logical model factors infor-
mation elements apart from their values, just as things in the world exist independently of the
names we might give them.
In other words, logical models specify conceptual and categorical metadata, but they do not
associate this information with values found in database tables. More abstractly, logical models
do not map individuated concepts to domain elements. (In database design, a domain is a named
set of arbitrarily complex, quantitative or qualitative values associated with a discrete data type.)

Physical models, on the other hand, are responsible for identifying data structures associated with
storage requirements, among other things. Within SQL-based relational DBMSs, this is most
often accomplished by executing data definition language (DDL) instructions that create tables
composed of ⟨COLUMN_NAME, DOMAIN_NAME⟩ tuples. (Following the influential analysis of
the relational data model presented in [2] and elsewhere, a tuple is an ordered pair ⟨A, V⟩ where
A is an attribute name, and V a unique domain name corresponding to A.) Conventionally,
COLUMN_NAME in each tuple is a physical model entity or attribute, and DOMAIN_NAME is a
(user-defined or SQL primitive) data type that fixes a range of possible values for that column in
the resulting table. Of course, other DBMS techniques can be used to further constrain this range,
but this implementation detail does not detract from the coherence of the notion of data type
employed here. Hence, DOMAIN_NAME identifies the set of values permitted to instantiate the
physical model entity or attribute identified by COLUMN_NAME. In this way, physical models
associate their metadata with named sets of values found in database tables.
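To make the ⟨COLUMN_NAME, DOMAIN_NAME⟩ pairing concrete, the sketch below executes a small DDL statement with Python's standard sqlite3 module and reads the declared pairs back from the catalog. The table and its columns are hypothetical. SQLite is used only because it ships with Python; it records the declared type name but enforces it loosely, so the example shows where the pairs live, not how strictly a given DBMS constrains values.

```python
import sqlite3

# Hypothetical table and column names, chosen only for illustration.
DDL = """
CREATE TABLE product (
    product_id INTEGER NOT NULL PRIMARY KEY,
    name       VARCHAR(80) NOT NULL,
    price      NUMERIC
)
"""

def column_domain_pairs(conn, table):
    """Return the (COLUMN_NAME, DOMAIN_NAME) pairs declared for a table."""
    # PRAGMA table_info yields rows of (cid, name, type, notnull, dflt_value, pk).
    rows = conn.execute(f"PRAGMA table_info({table})").fetchall()
    return [(name, declared_type) for _cid, name, declared_type, *_rest in rows]

conn = sqlite3.connect(":memory:")
conn.execute(DDL)
pairs = column_domain_pairs(conn, "product")
for column_name, domain_name in pairs:
    print(column_name, domain_name)
```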
Two intermediate, but important, conclusions can thus be drawn. Logical models specify
conceptual and categorical metadata, but they do not associate those metadata with values found
in database tables. Physical models, on the other hand, do specify an association between metadata
and values appearing in database tables, albeit via a single level of indirection. (It is still
necessary to discover the mapping between physical model data types and the set of values associated
with them.) These observations identify two principal obstacles impeding the free navigation
of metadata from logical model concepts to physical model encodings of meaning: the
logical and physical metadata are isolated from one another,³ and the mapping from physical
data type to sets of permissible database values remains unexpressed.
These abstract considerations form the root cause of many real problems. Principal among
them is isolation of the logical model from the physical model, which permits the models to evolve
independently or to desynchronize altogether. When this occurs, data independence is lost and
fossilization sets in.

2.2. Data dictionary specification

Fortunately, there is another form of metadata suitable for the definition of logical and physical
data model elements and their interfaces: the data dictionary. The configuration of a data dictionary
suited to this service is decidedly different from those provided by conventional DBMSs
and computer-aided software engineering environments, which treat the data dictionary as an
artifact derived from a design. Within these frameworks, a data dictionary is little more than a
report generated by a tool whose primary purpose is the creation and editing of graphical models
or the creation and management of databases. Neither form of dictionary offers an integrated

³ The degree of isolation depends on implementation details. One technique to bridge this gap is to ensure type
transparency at the interface of the logical and physical models (i.e., the conceptual and internal schemata) such that (i)
the COLUMN_NAME of each physical model tuple encodes the name of a logical model metadata element and (ii) the
DOMAIN_NAME of that tuple identifies a set of database values consistent with the semantics of that logical element.
This technique has practical utility when the number of database applications within the system context is very small,
but it quickly becomes impractical as the number grows. Implementation details such as naming constraints imposed by
SQL and DBMS vendors further complicate matters, rendering it untenable as a general design solution.

view of the logical and physical models underlying a database. In particular, the former has little
to say about the physical model, and the latter knows nothing about the logical model.
Instead, a data dictionary suitable for the integration of logical and physical data models
contains exactly five kinds of descriptive metadata: logical and physical model entities; logical and
physical model attributes; and physical model domains. According to the methodology of database
lexicography, each entry in a data dictionary consists of (at least) a metadata term and its
definition, and each type of entry has specific information requirements. Each attribute must be
defined in terms of a corresponding domain or another attribute, for instance, and domain definitions
must exhaustively enumerate an appropriate set of values or specify them with respect to
primitive data types (e.g., VARCHAR, INTEGER, BLOB, etc.). Additionally, no metadata element
may be defined in language expressing relations or other associations not present in the logical or
physical models. In general, data dictionary entries should define metadata terminology as precisely
and succinctly as possible, and say no more.
Database lexicography requires four additional information elements for data dictionary definitions.
First, a metadata term must be defined in such a way as to lexically depend on its hypernyms
and holonyms if and only if they exist in the model.⁴ This captures the logic of inheritance and
aggregation expressed in the metadata. Additionally, where existence dependence can be inferred
between elements of a model, the dependent entity must be defined in such a manner as to lexically
depend on the other. (In database jargon, existence dependency occurs when a dependent entity
must migrate a key from its parent; it can be detected by reviewing data model relations or by
inspecting DDL to identify foreign keys that cannot bear the value NULL.)
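The DDL inspection mentioned above can be sketched as follows, again assuming SQLite through Python's standard sqlite3 module. The schema is hypothetical, and the function name existence_dependencies is introduced here for illustration only. A foreign key column declared NOT NULL is reported as an existence dependency of the child table on its parent; a nullable foreign key is not.

```python
import sqlite3

# Hypothetical schema: one mandatory and one optional foreign key.
SCHEMA = """
CREATE TABLE customer (
    customer_id INTEGER NOT NULL PRIMARY KEY
);
CREATE TABLE purchase_order (
    order_id    INTEGER NOT NULL PRIMARY KEY,
    customer_id INTEGER NOT NULL REFERENCES customer(customer_id),
    referrer_id INTEGER REFERENCES customer(customer_id)
);
"""

def existence_dependencies(conn):
    """Return sorted (dependent, parent) pairs implied by NOT NULL foreign keys."""
    found = set()
    tables = [row[0] for row in conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table'")]
    for table in tables:
        # table_info rows: (cid, name, type, notnull, dflt_value, pk)
        not_null = {row[1] for row in conn.execute(f"PRAGMA table_info({table})")
                    if row[3]}
        # foreign_key_list rows: (id, seq, table, from, to, on_update, on_delete, match)
        for fk in conn.execute(f"PRAGMA foreign_key_list({table})"):
            parent, child_column = fk[2], fk[3]
            if child_column in not_null:
                found.add((table, parent))
    return sorted(found)

conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)
deps = existence_dependencies(conn)
print(deps)
```

Under the requirement stated above, each reported pair would call for the dependent entity's dictionary definition to mention the parent entity.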
Next, each metadata element naming a relational variable must be defined so as to lexically
depend on the attributes identified in its heading. (Following Date, a relational variable corresponds
to the notion of an entity when that entity is realized as a database table. The permitted
values for a relational variable are relations, each of which consists of a heading and a body, where
the heading is a tuple and the body is a set of tuples, all having that same heading. The attributes
and corresponding domains identified in the tuple are the attributes and corresponding domains
of the relation.) These two requirements capture the logic of existential entailment expressed in the
metadata. Finally, where a data dictionary entry represents an element at the interface between
logical and physical models, the logical model element must be defined so that it lexically depends
on the physical model element. This final requirement integrates the representation of logical and
physical models in the data dictionary.
The following section presents a more thorough discussion of database lexicography and its
information requirements. Although the details are specified in a somewhat doctrinaire manner,
the goal is to impart the methodology of a novel discipline known to produce desirable results.
The narrative assumption is that the tone is tolerable if it effectively demonstrates how to distinguish
good designs from poor ones, as well as how to redirect the course of design fossilization.

⁴ Lexical dependency is established when the definition of one metadata element refers to another. When this occurs,
the meaning of the first metadata element depends on the second (v. Section 3.2.1). In the context of database
lexicography, a hypernym is a superordinate metadata term (i.e., one that is a single level more generic than the given
term). Canine, for instance, could serve as a hypernym of dog. Similarly, a holonym is a conceptual whole for which a
given item is a part. For instance, hat is a holonym for brim and crown.

3. Managing the metadata namespace

In order to avoid model isolation, the software engineering discipline is obliged to construe the
three-schema architecture as the embodiment of an integrated namespace in which all names are
local. This is a reasonable approach since, in general, the schemata of the architecture collectively
instantiate each name of this namespace by labeling the graphical elements of their models, and
designers and maintainers utilize these models to conceptualize the architecture. In database
lexicography, a principal responsibility of the data dictionary is to document this integrated
namespace, with particular attention to the global information structure and interfaces between
schemata. Database lexicography techniques enable the role of the data dictionary to be expanded
to support the validation of metadata creation and update. In this way, its methodology con-
stitutes decision support for the selection of data concept encodings and proposed changes to
them. This additional layer of namespace validation is unavailable elsewhere in the data man-
agement and software engineering industries, yet it is critical to the practice of design-for-change.

3.1. Managing lexical information

Database lexicography makes the strong claim that the information content of data depends on
lexicographic knowledge, hence data dictionary form and content is of crucial importance to the
discipline. As discussed in Section 2.2, a suitable data dictionary contains exactly five kinds of
descriptive metadata: logical and physical model entities; logical and physical model attributes;
and physical model domains. Each data dictionary entry consists of (at least) a metadata term and
its definition, which is formulated to satisfy two general information requirements:

(1) Each entry defines the identity of a single metadata element as precisely and succinctly as possible.
(2) No entry defines a lexical dependency that expresses associations or relations not present in
the logical or physical models.

The first requirement focuses on metadata element identity, which comprises characteristics individuating
a particular metadata type from other metadata. (Relations are excluded from considerations
of identity except in matters of inheritance, aggregation, and existential entailment,
when they are mandatory.) The second provision is necessary to ensure that the data dictionary
represents exactly the information encoded elsewhere in the metadata and no more.
Furthermore, each type of data dictionary entry has specific information requirements. Particular
requirements for attributes and domains are as follows:

(3) Each attribute is defined in terms of a corresponding domain, another attribute, a holonym,
or matters external to the architecture.
(4) Each domain is defined in terms of a primitive data type or an exhaustive enumeration of
some set of typed values.

The third provision permits logical model attributes to be dened so that they lexically depend on
physical model attributes (v. Section 3.2.1), thus structuring an interface between schemata.

Together, the third and fourth requirements ensure that all physical model attributes lexically
depend on user-defined or primitive data types.
Database lexicography identifies five more information requirements. Two are intended to
capture the logic of inheritance and aggregation, providing treatment for those data models that
classify similar entity instances together and relate them by generalization or aggregation:

(5) Each metadata element is defined to lexically depend on its hypernyms if and only if they are
present in the logical or physical models.
(6) Each metadata element is defined to lexically depend on its holonyms if and only if they are
present in the logical or physical models.

Two additional requirements capture the logic of existential entailment:

(7) Where existence dependence can be inferred between elements of a model, the dependent entity
is defined to lexically depend on the other.
(8) Each metadata element naming a relational variable is defined to lexically depend on the attributes
identified in its heading.

An omnibus final requirement integrates the representation of logical and physical models in the
data dictionary:

(9) Metadata elements at the interface between logical and physical models are defined so that the
logical model element lexically depends on the physical model element.

Hence, database lexicography information requirements can be specified in a small set of rules
that captures the logic of inheritance, aggregation, and existential entailment expressed in meta-
data. Furthermore, these nine rules guarantee that the resulting data dictionary presents an inte-
grated representation of the logical and physical models of a data system.
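One possible in-memory representation of such a data dictionary, together with a deliberately rough check of rule (4), is sketched below in Python. Everything named here (Kind, Entry, the enumerated_values field, the PRIMITIVE_TYPES set, and the sample entries) is an assumption made for illustration; the paper prescribes the rules, not a data structure. The check merely looks for a primitive type name in a domain's definition or for an explicit list of values, which is much weaker than a real review of the definition.

```python
import re
from dataclasses import dataclass
from enum import Enum

class Kind(Enum):
    """The five kinds of descriptive metadata admitted by the dictionary."""
    LOGICAL_ENTITY = "logical entity"
    PHYSICAL_ENTITY = "physical entity"
    LOGICAL_ATTRIBUTE = "logical attribute"
    PHYSICAL_ATTRIBUTE = "physical attribute"
    PHYSICAL_DOMAIN = "physical domain"

@dataclass(frozen=True)
class Entry:
    term: str
    kind: Kind
    definition: str
    enumerated_values: tuple = ()  # hypothetical field: explicit value list, if any

# Assumed subset of primitive type names; a real DBMS offers many more.
PRIMITIVE_TYPES = {"VARCHAR", "INTEGER", "BLOB"}

def violates_rule_4(entry):
    """True if a domain names no primitive type and enumerates no values."""
    if entry.kind is not Kind.PHYSICAL_DOMAIN:
        return False  # rule (4) constrains domains only
    words = set(re.findall(r"[A-Za-z_]+", entry.definition))
    return not (words & PRIMITIVE_TYPES) and not entry.enumerated_values

ENTRIES = [
    Entry("name_text", Kind.PHYSICAL_DOMAIN, "A VARCHAR of at most 80 characters."),
    Entry("order_state", Kind.PHYSICAL_DOMAIN, "One of the listed states.",
          ("OPEN", "CLOSED")),
    Entry("misc_value", Kind.PHYSICAL_DOMAIN, "Any value the application needs."),
]
offenders = [entry.term for entry in ENTRIES if violates_rule_4(entry)]
print(offenders)
```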

3.2. Managing information structure

Database lexicography is iconoclastic with respect to orthodox model integration techniques
since its information requirements demand no more than carefully sculpted natural language
definitions of metadata concept encodings. Once a data dictionary satisfies these requirements, the
metadata it describes can be navigated and their interrelationships quantified using the prescribed
methodology. Where such measurements indicate problematic relationships between metadata
elements, the architecture can be adjusted to correct the deficiency. This section indicates how to
accomplish this.
The first step in managing the metadata information structure is to identify and ameliorate any
undefined semantics that might exist in the data dictionary. Database lexicography prescribes a
method of detection for the undefined semantics of circular definitions, but this presupposes an
understanding of lexical graphs (see [1]). To ensure this understanding, the exposition that follows
reviews technical aspects of that logical foundation, beginning with a brief reprise of graph theory.
This material will be useful in demonstrating how the quantification of semantic dependency in a

data dictionary (v. Theorem 2) can serve to identify and remedy the dependency architecture of an
underlying data design.

3.2.1. Dictionaries and lexical graphs


A graph $G$ is an ordered pair $(V, E)$, where $V$ is a finite set of vertices and $E$ is a binary relation
on $V$ composing a set of edges. An edge is a pair $(u, v)$ with $u, v \in V$. If edge $e = (u, v)$ is in graph
$G$, then $u$ and $v$ are said to be the end vertices of $e$.
An edge is directed if its end vertices are an ordered pair. Suppose that $e$ is an outgoing edge of
$u$ and an incoming edge of $v$. A specialized edge like $e$ is an arc, and its specialized end vertices are
nodes. A directed graph is a graph in which all edges are arcs. If arcs $e = (u, v)$ and $e' = (u, w)$
exist in a directed graph, then $e$ and $e'$ are incident to node $u$. The degree of node $u$, $d(u)$, is the
total number of arcs incident to $u$. Moreover, the incoming degree of $u$, $d^{-}(u)$, is the total number
of incoming arcs incident to $u$, and the outgoing degree of $u$, $d^{+}(u)$, is the total number of outgoing
arcs incident to $u$.
In a directed graph $G = (V, E)$, a walk is a sequence between end nodes of one or more arcs,
and a path is a walk in which all nodes are distinct. Node $v$ of $G$ is reachable from node $u$ if $v = u$
or $G$ contains a path from $u$ to $v$. A subgraph $S$ of $G$ has all its arcs and nodes in $G$. A directed
subgraph $S$ of $G$ rooted at $v$ is a directed graph $S = (V', E')$, where $v \in V$, $V' \subseteq V$, $E' \subseteq E$, and $V'$ is
the set of nodes from which $v$ is reachable. Finally, a cycle is a walk in which the first and last
nodes are identical while all others are distinct, and a directed acyclic graph is a directed graph
without cycles. (A good introduction to graph theory is in [6].)
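The definitions of degree, reachability, and rooted subgraph can be exercised with a few lines of Python. The sketch below is an assumption-laden illustration, not part of the formal development: arcs are plain (u, v) tuples, the node names are hypothetical, and the function names are introduced here for convenience.

```python
from collections import defaultdict

def degrees(arcs):
    """Return (incoming, outgoing) degree tables for a set of arcs (u, v)."""
    incoming, outgoing = defaultdict(int), defaultdict(int)
    for u, v in arcs:
        outgoing[u] += 1  # the arc leaves u
        incoming[v] += 1  # the arc enters v
    return incoming, outgoing

def reachable_from(arcs, u):
    """Return every node reachable from u; u itself counts, as in the text."""
    successors = defaultdict(set)
    for a, b in arcs:
        successors[a].add(b)
    seen, stack = {u}, [u]
    while stack:
        node = stack.pop()
        for nxt in successors[node]:
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return seen

def rooted_subgraph_nodes(arcs, v):
    """Return the node set of the subgraph rooted at v: nodes from which v is reachable."""
    nodes = {n for arc in arcs for n in arc} | {v}
    return {n for n in nodes if v in reachable_from(arcs, n)}

# Hypothetical arcs: x and y each point to z, and z points to w.
ARCS = {("x", "z"), ("y", "z"), ("z", "w")}
incoming, outgoing = degrees(ARCS)
total_degree_z = incoming["z"] + outgoing["z"]  # degree = incoming + outgoing
print(total_degree_z, sorted(rooted_subgraph_nodes(ARCS, "z")))
```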
Let $D$ be a dictionary, a finite set of strings arranged as ordered pairs $(t, d)$ such that each $d$
identifies the essential meaning of its $t$ according to the lexical information requirements specified
in Section 3.1. A lexical graph is a directed acyclic graph $L = (D, E)$, where each pair in $D$ is
a node in $L$, and $E$ is a binary relation on $D$. Since the end nodes $u$, $v$ of each arc in $E$ are also
ordered pairs, a more precise notation for an arc in a lexical graph is needed. Let each arc in $E$
be annotated as $e = ((t, d), (t', d'))$. Then whenever $e \in E$ of $L$, $(t, d)$ can be said to depend on
$(t', d')$.

3.2.1.1. Lexical dependency. Applying the foregoing description of lexical graph $L$, suppose lexical
dependency exists between nodes $(t, d)$ and $(t', d')$ whenever $t'$ occurs as a substring of $d$ in the arcs
of $L$. (The substring relation suffices for expository purposes, but lexical dependency can be
formulated in terms of other relations as well.) Thus lexical dependency occurs whenever one
term's definition uses another term under definition in the dictionary. If a directed graph contains
a cyclical dependency, then it cannot be a lexical graph and the semantics of its dictionary are
undefined. Because of this, detection of cyclical dependency is a decisive operation in the methodology.
In database lexicography, cyclical dependency suffices as a general diagnostic of undefined
semantics.
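The substring formulation of lexical dependency and the cycle diagnostic can be sketched together in Python. The dictionaries below are hypothetical, and the depth-first search is one common way to detect a cycle, chosen here for brevity rather than taken from the lexical graph theory of [1]. Plain substring matching is crude (a term such as "order" would also match "ordered"), which is consistent with the remark above that other relations may be substituted.

```python
def lexical_arcs(dictionary):
    """Return arcs (t, t2) where term t2 occurs as a substring of t's definition."""
    return {
        (term, other)
        for term, definition in dictionary.items()
        for other in dictionary
        if other.lower() in definition.lower()
    }

def find_cycle(arcs):
    """Return one cyclical dependency as a list of terms, or None if acyclic."""
    successors = {}
    for u, v in arcs:
        successors.setdefault(u, []).append(v)
        successors.setdefault(v, [])
    WHITE, GREY, BLACK = 0, 1, 2  # unvisited, on the current walk, finished
    color = dict.fromkeys(successors, WHITE)

    def visit(node, walk):
        color[node] = GREY
        walk.append(node)
        for nxt in successors[node]:
            if color[nxt] == GREY:  # the walk has returned to one of its own nodes
                return walk[walk.index(nxt):] + [nxt]
            if color[nxt] == WHITE:
                cycle = visit(nxt, walk)
                if cycle:
                    return cycle
        walk.pop()
        color[node] = BLACK
        return None

    for node in sorted(successors):
        if color[node] == WHITE:
            cycle = visit(node, [])
            if cycle:
                return cycle
    return None

# Hypothetical dictionary with well-defined semantics.
GOOD = {
    "customer": "a party that places an order",
    "order": "a request for one or more units of a product",
    "product": "an item offered for sale",
}
# Redefining one term in terms of its own dependents introduces a cycle.
BAD = dict(GOOD, product="an item that a customer may order")

print(find_cycle(lexical_arcs(GOOD)))
print(find_cycle(lexical_arcs(BAD)) is not None)
```

Which cycle is reported for the second dictionary depends on traversal order; the diagnostic only needs to establish that at least one exists.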
Note the logical entailment involved in lexical dependency. Whenever one node lexically depends
on another, it is impossible to know the meaning of the dictionary term defined by the first
node without prior knowledge of the meaning of the term defined by the second. Mathematically,
a lexical dependency relation defined on $L$ imposes a topology on its dictionary. Each $(t, d)$ pair
in $D$ of $L$ identifies the information structure of a term under definition, and whenever an element
in $D$ lexically depends on another to ground its meaning, a corresponding arc exists in $E$.

Hence, lexical dependency intrinsically orders $D$: whenever $e = ((t, d), (t', d')) \in E$ of $L$, it is clear
that $(t', d')$ precedes $(t, d)$ in the topology of $D$. This observation permits the following induction:

Lemma 1. Knowledge of the meaning of an element in a dictionary topology cannot be guaranteed
without prior knowledge of the meaning of every preceding element in that topology.

A dictionary topology provides immediate value to the data management enterprise. On the
one hand, it represents a learning sequence for the concepts encoded in the metadata. On the other
hand, it orders physical model entities in table loading sequence, a calculation critical to the data
population process that follows schema definition.
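A dictionary topology of this kind can be computed with a topological sort. The sketch below uses TopologicalSorter from Python's standard graphlib module (Python 3.9 or later); the entity names and the function name dictionary_topology are hypothetical. Each term is mapped to the set of terms it lexically depends on, and the resulting order places every term after all of its prerequisites, which serves equally as a learning sequence or a table loading sequence.

```python
from graphlib import CycleError, TopologicalSorter

def dictionary_topology(depends_on):
    """Order terms so that each term follows every term it lexically depends on."""
    try:
        return list(TopologicalSorter(depends_on).static_order())
    except CycleError as error:
        # graphlib reports the offending cycle as the second element of args.
        raise ValueError(
            f"undefined semantics, cyclical dependency: {error.args[1]}"
        ) from error

# Hypothetical physical model entities and their lexical dependencies.
DEPENDS_ON = {
    "customer": set(),
    "product": set(),
    "purchase_order": {"customer"},
    "order_line": {"purchase_order", "product"},
}
order = dictionary_topology(DEPENDS_ON)
print(order)
```

Several valid orders usually exist; any of them respects the precedence relation described above.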

3.2.1.2. Lexical stability. Once metadata information structure is isolated in a lexical graph, it
is possible to evaluate the dependency architecture of the underlying data design. According to
the discipline of database lexicography, this evaluation proceeds by quantifying the stability of
each node in the lexical graph. Stability is generally construed as a measure of the effort required
to validate semantic change to the corresponding dictionary entry. The results are interpreted as
figures of merit for the data design. Alternatively, the evaluation may reveal local points of
decomposition in its dependency architecture. This information then serves to guide corrective
actions intended to fortify the design against fossilization.
Intuitively, the lexical stability property of a particular node in a lexical graph is a discrete, local
measure of the amount of work required to modify the semantics of its dictionary entry.⁵ In
lexical graph L, the stability of v, $v \in D$, expresses the work effort any modification to the
lexicographic information in v will require to validate the impact of that change on the immediate
dependents of v in L. To illustrate, consider (1) from Fig. 1, a simple lexical graph of three nodes
x, y, and z. Nodes x and y lexically depend on z, and z is lexically independent. Clearly, the
meaning of x and y can change without affecting the meaning of z. Moreover, no change in x will
affect the meaning of y, and vice-versa. However, a change to z may reformulate the meaning of
both x and y, transforming the structure and content of L to an extent measured by the stability of
z. Alternatively, a potential side effect of change to nodes x and y, which follow z in the dictionary
topology, is that they may no longer follow z in the topology afterward, depending on the nature
of the change to dictionary elements x and y.

Fig. 1. Four simple lexical graphs.

⁵ Robert C. Martin (personal communication) introduced the author to the study of stable interdependencies among
information units within the context of object-oriented design and package dependencies in the programming language
C++. Martin acknowledges Bertrand Meyer, the inventor of the computer language Eiffel, as the originator of the
concept (cf. [3], passim). For details of this problem and its treatment, see Martin's Engineering Notebook columns of
The C++ Report for the year 1996. The theory of lexical graphs, especially the definition of lexical stability, stems in
part from this inspiration, although any errors that remain are the author's.

3.2.2. Measuring lexical stability


The interdependencies between nodes in a lexical graph determine the global information
structure of the graph as well as the local information structure of each node. Recall from Section 3.2.1
that $d(v)$, the degree of node v, is the total number of arcs incident to v. Likewise, $d^-(v)$, the
incoming degree of v, summarizes the incoming arcs of v. The lexical stability of node v, $S(v)$, is
then the quotient of dividend $d^-(v)$ and divisor $d(v)$:

Theorem 1. Let v, $v \in D$, be a node in lexical graph $L = (D, E)$. Then $S(v) = \frac{d^-(v)}{d(v)}$.
Application of this measurement to lexical graph (1) of Fig. 1 indicates $S(z) = 1$, the maximal
value for lexical stability, while $S(x)$ and $S(y)$ both yield 0, the minimal value. As lexical stability
approaches the maximum, change to the information concept encoded in a dictionary node has
more pervasive effects on the structural configuration of the lexical graph. As a practical matter, a
high lexical stability value local to some node v implies difficulty of change for the dictionary entry
encoded at v, since it and each of its adjacent nodes must be reviewed for correctness subsequent
to the change. Conversely, a reduced burden of validation accompanies modification of a node
with lower lexical stability. Hence, low lexical stability implies ease of change.

Lemma 2. High lexical stability implies difficulty of change for a dictionary entry; low lexical
stability implies ease of change.

Consistent with this framework of evaluation, the lexical stability metric assigns maximum and
minimum stability values to the root and leaf nodes, respectively, of a lexical graph. This is an
intuitive result. By definition, orphans (i.e., isolated lexical graph nodes where $d(v) = 0$) possess
minimum lexical stability.
Next consider lexical graph (2) from Fig. 1, which extends (1) such that node w depends on y.
Again, the calculation of $S(v)$ identifies minimal stability for the leaf nodes of the new
configuration. Node w depends on y and y depends on z, and the calculation identifies a higher lexical
stability value for z than y: $S(z) = 1$ and $S(y) = 0.5$. Moreover, each lexical dependency added to
node y increments $S(y)$ appropriately, although $S(y)$ will never achieve the maximum value.
Hence, $S(z)$ will always exceed $S(y)$ in lexical stability, another intuitively appropriate result. Now
consider lexical graph (3), which extends (2) such that z, the former root node, has been made to
depend on u, a new root node. In this configuration, $S(z)$ is devalued to 0.66 and $S(u) = 1$. Nodes
w, x, and y are the dictionary entries easiest to change, and they retain their previous lexical
stability values.
Finally consider lexical graph (4), which extends (3) by adding a dependent node t to y and
making t dependent on an additional node s. The new s node, in turn, depends on a third
additional node v. As in lexical graphs (1)–(3), the two root nodes of (4) have maximal stability.
Nodes x and w remain leaf nodes, and thus retain minimum lexical stability. $S(y)$ has increased to
0.66, reflecting the additional lexical dependent t. Node z presents the same configuration as y,
hence $S(z) = 0.66$. Node s has identical incoming and outgoing degrees; hence, its lexical stability
is 0.5. Fig. 2 summarizes these measurements.

Fig. 2. Lexical stability in four simple lexical graphs.
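The measurements for the four graphs can be reproduced with a short script. The sketch below assumes a hypothetical representation in which each node maps to the set of nodes it lexically depends on (so arcs run from dependent to server, as in the definition of E); `lexical_stability` is an illustrative name, and exact fractions are used so that the 0.66 values quoted in the text appear as 2/3.

```python
from fractions import Fraction


def lexical_stability(arcs):
    """Return S(v) = d^-(v) / d(v) for every node; orphans (d(v) = 0) get 0."""
    nodes = set(arcs) | {s for servers in arcs.values() for s in servers}
    incoming = dict.fromkeys(nodes, 0)
    outgoing = dict.fromkeys(nodes, 0)
    for dependent, servers in arcs.items():
        for server in servers:
            outgoing[dependent] += 1  # the arc leaves the dependent node
            incoming[server] += 1     # and enters the node it depends on
    stability = {}
    for n in nodes:
        degree = incoming[n] + outgoing[n]
        stability[n] = Fraction(incoming[n], degree) if degree else Fraction(0)
    return stability


# The four graphs of Fig. 1, each extending the previous one as described in the text.
graph1 = {"x": {"z"}, "y": {"z"}}
graph2 = {**graph1, "w": {"y"}}
graph3 = {**graph2, "z": {"u"}}
graph4 = {**graph3, "t": {"y", "s"}, "s": {"v"}}

for name, graph in [("(1)", graph1), ("(2)", graph2), ("(3)", graph3), ("(4)", graph4)]:
    values = lexical_stability(graph)
    print(name, {n: str(values[n]) for n in sorted(values)})
```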

3.2.3. Lexical stability and information structure


Note that lexical stability is local to a node in a lexical graph. Should modification of a node
change its relationship to another node, the topology of the dictionary may change. Not every
change has structural consequences. A change in a leaf node or an orphan may be trivial, as the
minimum lexical stability value of such nodes attests. On the other hand, some changes may have
intricate consequences throughout the dictionary. Modification of a node with high incoming
degree positioned near a root, for instance, may have a significant impact on the information
structure of a dictionary.
Formally, modification of lexical graph L is accomplished by creating, updating, or deleting
one or more elements in D of L. Suppose two nodes $v = (t, d)$ and $w = (t', d')$ exist, where $v, w \in D$
and v lexically depends on w. Recall from Section 3.2.1.1 that whenever a node lexically depends
on another to ground its meaning, a corresponding arc $e = ((t, d), (t', d'))$ appears in E of L.
Should v change such that it no longer depends on w, the result is that $e \notin E$ of L. Should v change
such that it retains its dependency on w and adds a new dependency on $u = (s, d'')$, where $u \in D$,
then a new arc $e' = ((t, d), (s, d''))$ appears in E of L. Thus E contains the extension of a binary
relation on D, and this relation expresses the intrinsic information structure of D.
Lexical stability and information structure can be factored independently in the evaluation of a
dictionary. Consider lexical graph (4) from Figs. 1 and 2, presented here in two views, one with
node labels and one with lexical stability measurements. Suppose that node z were changed such
that it no longer depends on u. This modification changes the lexical stability values of nodes u
and z to 0 and 1, respectively. No other stability values change. However, since z no longer
depends on u, the transitive closure of lexical dependencies by which x, y, w, and t had previously
related to u is disrupted. Because of the change in z, these nodes no longer base their definienda in
whole or in part on u. Clearly, the information structures of x, y, w, and t have changed, although
their lexical stability measurements remain fixed. Since information structure can change
independently from lexical stability, information structure and lexical stability must be factored
independently in lexical graphs.

Fig. 3. Two views of a simple lexical graph.
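The independence of the two notions can be checked mechanically. The following sketch assumes the same node-to-servers representation of graph (4), with every node present as a key; `lexical_stability` and `grounds` are hypothetical helper names, the latter collecting everything a node depends on through the transitive closure of lexical dependency.

```python
def lexical_stability(arcs):
    """S(v) = incoming / (incoming + outgoing); every node must be a key of `arcs`."""
    incoming = {n: 0 for n in arcs}
    for servers in arcs.values():
        for server in servers:
            incoming[server] += 1
    stability = {}
    for n, servers in arcs.items():
        degree = incoming[n] + len(servers)
        stability[n] = incoming[n] / degree if degree else 0.0
    return stability


def grounds(arcs, node):
    """All nodes reachable from `node` by following lexical dependencies."""
    seen, stack = set(), [node]
    while stack:
        for server in arcs[stack.pop()]:
            if server not in seen:
                seen.add(server)
                stack.append(server)
    return seen


before = {"x": {"z"}, "y": {"z"}, "w": {"y"}, "z": {"u"},
          "t": {"y", "s"}, "s": {"v"}, "u": set(), "v": set()}
after = {**before, "z": set()}  # z no longer depends on u

s_before, s_after = lexical_stability(before), lexical_stability(after)
for n in "xywt":
    # Stability is unchanged, yet u has dropped out of what the node is grounded in.
    print(n, s_before[n] == s_after[n], "u" in grounds(before, n), "u" in grounds(after, n))
```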
Except for orphans in a lexical graph, the meaning of a dictionary element is partially
determined by its information structure. Thus a modification of v, $v \in D$ of lexical graph L, entails the
possibility of change in meaning for v as well as for any other node u, where u is an element of the
subgraph of L rooted at v. This is true because the information structure of u grounds its meaning
in the information structure of v. Should v be positioned strategically within the global
information structure of L (for instance, as the single root of L), then modification of v may
transform the meaning of every term under definition in the dictionary.

3.2.4. Discriminating between dictionary elements


For convenience, some device is required to express the foregoing notion of strategic posi-
tioning within a lexical graph. Given some node v, such a device will identify an aggregate stability
value for v and the nodes in L dependent upon the meaning of v.

3.2.4.1. Aggregate stability. Since lexical graph $L = (D, E)$ is a specialization of a directed acyclic
graph, it follows that the subgraph of L rooted at v is a lexical graph $L' = (D', E')$, where $v \in D$,
$D' \subseteq D$, $E' \subseteq E$, and $D'$ is the set of nodes from which v is reachable. As noted in Section 3.2.1,
node v of L is reachable from node u if $v = u$ or L contains a path from u to v. More formally, v is
reachable from u if and only if v is identical to u, v is adjacent to u, or there is some set of arcs
$E' = \{(u, x_i), (x_i, x_{i+1}), \ldots, (x_n, v)\}$, $E' \subseteq E$, where $x_i$ and $x_{i+1}$ are distinct and adjacent for
$i = 0, \ldots, n$. Thus for each node v in L, it is possible to isolate and quantify the contribution of the
subgraph rooted at v to the global information structure of L. Let this property be known as $\bar{S}(v)$,
the aggregate stability of v:

Theorem 2. Let $L' = (D', E')$ be a subgraph of lexical graph $L = (D, E)$ rooted at v. Then each x,
$x \in D'$, semantically depends on v and $\bar{S}(v) = \sum_{x \in D'} S(x)$.

Hence, the aggregate stability of v measures the transitive closure of lexical dependency relations
within the subgraph rooted at v. Thus the aggregate stability property can serve as a crude
comparator discriminating between nodes on the basis of their contribution to the global
information structure of the lexical graph. Where $\bar{S}(v) > \bar{S}(v')$, the global information structure of L is
more sensitive to modification of v than $v'$.
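A minimal sketch of the aggregate stability calculation follows, assuming the node-to-servers representation of graph (4) from Fig. 1. The function names are hypothetical, and the sum runs over the set of nodes from which v is reachable, v itself included, as in the definition of the subgraph rooted at v.

```python
from fractions import Fraction


def lexical_stability(arcs):
    """S(v) = d^-(v) / d(v); orphans get 0."""
    nodes = set(arcs) | {s for servers in arcs.values() for s in servers}
    incoming = dict.fromkeys(nodes, 0)
    outgoing = dict.fromkeys(nodes, 0)
    for dependent, servers in arcs.items():
        for server in servers:
            outgoing[dependent] += 1
            incoming[server] += 1
    return {
        n: Fraction(incoming[n], incoming[n] + outgoing[n]) if incoming[n] + outgoing[n] else Fraction(0)
        for n in nodes
    }


def aggregate_stability(arcs):
    """Sum S(x) over every node x from which v is reachable (v included)."""
    s = lexical_stability(arcs)
    dependents = {n: set() for n in s}  # reversed arcs: server -> immediate dependents
    for dependent, servers in arcs.items():
        for server in servers:
            dependents[server].add(dependent)
    aggregate = {}
    for v in s:
        rooted, stack = {v}, [v]
        while stack:
            for x in dependents[stack.pop()]:
                if x not in rooted:
                    rooted.add(x)
                    stack.append(x)
        aggregate[v] = sum(s[x] for x in rooted)
    return aggregate


graph4 = {"x": {"z"}, "y": {"z"}, "w": {"y"}, "z": {"u"}, "t": {"y", "s"}, "s": {"v"}}
agg = aggregate_stability(graph4)
print({n: str(agg[n]) for n in sorted(agg)})
```

Under these assumptions root u scores higher than root v, so the graph is more sensitive to modification of u than of v.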

Since multiple roots are common in lexical graphs, aggregate stability permits the definition of
a partially ordered set $(D, \bar{S}(v))$ that might be applied as a comparator between subgraphs, thus
identifying an aggregate stability value for each subgraph of a multiply rooted lexical graph.
Hence, the aggregate stability metric partitions the metadata dependency architecture into
component information structures where each substructure contributes a discrete amount to the
global information structure of the lexical graph.

3.2.4.2. Global stability. The global stability of a lexical graph, $GS(L)$, is the sum of all $S(v)$ in L:

Theorem 3. Let $L = (D, E)$ be a lexical graph. Then $GS(L) = \sum_{v \in D} S(v)$.

Global stability may, for instance, be employed to characterize lexical graphs on the basis of the
inherent integration of information concepts within their respective dictionaries. Where $|D|$ of two
lexical graphs is equivalent, $GS(L)$ is greater for the lexical graph with the greater interdependency
between dictionary entries. A similar comparator for arbitrarily selected lexical graphs is the
relative global stability of each graph L, $GR(L)$, which expresses the relative integration of
information concepts within any lexical graph:

Theorem 4. Let $GS(L)$ be the global stability of L. Then $GR(L) = \frac{GS(L)}{|D|}$.

Lemma 3. Where $GR(L) > GR(L')$, the information structure of L encodes a greater interdependence
of information concepts than that of $L'$.

As a practical matter, whenever $GR(L)$ is relatively high or low, dictionary D of L will
demonstrate a commensurately high or low level of conceptual integration.
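For illustration, both measures can be computed for graphs (1) and (4) of Fig. 1. The sketch assumes the node-to-servers representation used for those graphs; `global_stability` and `relative_global_stability` are hypothetical names for GS(L) and GR(L).

```python
from fractions import Fraction


def lexical_stability(arcs):
    """S(v) = d^-(v) / d(v); orphans get 0."""
    nodes = set(arcs) | {s for servers in arcs.values() for s in servers}
    incoming = dict.fromkeys(nodes, 0)
    outgoing = dict.fromkeys(nodes, 0)
    for dependent, servers in arcs.items():
        for server in servers:
            outgoing[dependent] += 1
            incoming[server] += 1
    return {
        n: Fraction(incoming[n], incoming[n] + outgoing[n]) if incoming[n] + outgoing[n] else Fraction(0)
        for n in nodes
    }


def global_stability(arcs):
    """GS(L): the sum of S(v) over all nodes of the graph."""
    return sum(lexical_stability(arcs).values())


def relative_global_stability(arcs):
    """GR(L) = GS(L) / |D|, comparable across graphs of different size."""
    stability = lexical_stability(arcs)
    return sum(stability.values()) / len(stability)


graph1 = {"x": {"z"}, "y": {"z"}}
graph4 = {"x": {"z"}, "y": {"z"}, "w": {"y"}, "z": {"u"}, "t": {"y", "s"}, "s": {"v"}}

print(global_stability(graph1), relative_global_stability(graph1))  # 1 1/3
print(global_stability(graph4), relative_global_stability(graph4))  # 23/6 23/48
```

Read through Lemma 3, the larger relative value for graph (4) indicates a greater interdependence of information concepts than in graph (1).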

3.2.4.3. Fractional stability. Summarizing briefly, Theorem 2 demonstrates that for each node v in
L it is possible to isolate the contribution of the subgraph rooted at v to the global stability of L.
Using this technique, an aggregate stability value can be identified for an arbitrary subgraph of L.
Theorem 3, on the other hand, encapsulates in a single property the local lexical stability values
distributed in a lexical graph. When construed in concert, these values can be used to express the
fraction of a lexical graph's global stability contributed by the information structure of some
discrete node. Let this property be known as the fractional stability of node v, $FS(v)$:

Theorem 5. Let $L' = (D', E')$ be a subgraph of lexical graph $L = (D, E)$ rooted at v. Then if $d(v) = 0$,
$FS(v) = 0$; otherwise, $FS(v) = \frac{\bar{S}(v)}{GS(L)}$.
Fractional stability provides an insightful discriminator for the information concepts encoded
in a dictionary. It imposes a weak partial order on the elements of D, identifying for each node a
value in the interval between 0 and 1. For illustration, consider the two views of a lexical graph
displayed in Fig. 4.
This is the graph from Fig. 3 with fractional stability values labeling its nodes. Fig. 4 indicates
how a greater $FS(v)$ value for a node correlates with a greater potential for transformation of
meaning of dictionary elements in the lexical graph. In this case, the fractional stability relation
defines a partially ordered set $(D, FS(v)) = \{t, w, x, s, y, z, v, u\}$. Meaningful change to a dictionary
element at the lower end of the scale has little or no ramification for the lexical graph or the
meaning of other dictionary elements. Conversely, modification of a dictionary element at the
high end of the scale has potentially extensive consequences for the configuration of the lexical
graph and the meaning of other dictionary elements.

Fig. 4. Fractional stability in a simple lexical graph.
The fractional stability relation exposes this lexical graph property to measurement, enabling a
scalar distribution of the information concepts encoded in a dictionary. Dictionary elements
arranged at the lower end of the scale have information structures with few or no dependents.
Meaningful change to their semantics can be introduced without effect on the remainder of the
dictionary. In this sense, elements at the lower end of the scale are peripheral information
concepts with respect to the global information structure of the dictionary. As for the information
concepts arranged at the higher end of the scale, meaningful change to their semantics will be
broadly propagated throughout the information structure of the dictionary. Hence, these elements
encode the core information concepts of the dictionary. The chart displayed here depicts the
elements of this partially ordered set for the simple lexical graph of Fig. 4, along with the aggregate
and fractional stability properties of each term under definition in the dictionary of that lexical
graph.
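The ordering from peripheral to core elements can be recomputed from the arcs alone. The sketch below assumes the node-to-servers representation of the graph in Fig. 4 (graph (4) of Fig. 1); `fractional_stability` is a hypothetical name, and ties among the zero-valued leaves are broken alphabetically purely for display.

```python
from fractions import Fraction


def fractional_stability(arcs):
    """FS(v) = 0 if d(v) = 0, otherwise aggregate stability of v divided by GS(L)."""
    nodes = set(arcs) | {s for servers in arcs.values() for s in servers}
    incoming = dict.fromkeys(nodes, 0)
    total = dict.fromkeys(nodes, 0)
    dependents = {n: set() for n in nodes}  # reversed arcs
    for dependent, servers in arcs.items():
        for server in servers:
            incoming[server] += 1
            total[server] += 1
            total[dependent] += 1
            dependents[server].add(dependent)
    s = {n: Fraction(incoming[n], total[n]) if total[n] else Fraction(0) for n in nodes}
    gs = sum(s.values())
    fs = {}
    for v in nodes:
        if total[v] == 0:  # orphan
            fs[v] = Fraction(0)
            continue
        rooted, stack = {v}, [v]
        while stack:
            for x in dependents[stack.pop()]:
                if x not in rooted:
                    rooted.add(x)
                    stack.append(x)
        fs[v] = sum(s[x] for x in rooted) / gs
    return fs


graph4 = {"x": {"z"}, "y": {"z"}, "w": {"y"}, "z": {"u"}, "t": {"y", "s"}, "s": {"v"}}
fs = fractional_stability(graph4)
ordering = sorted(fs, key=lambda n: (fs[n], n))
print(ordering)  # ['t', 'w', 'x', 's', 'y', 'z', 'v', 'u']
```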

3.3. Managing semantic dependency

Dictionaries with highly interdependent lexis tend to be rigid, narrowly domain specific, and
difficult to maintain. Even so, semantic interdependence is necessary if a dictionary is to be both
useful and coherent. Thus some forms of semantic dependency are desirable, and others
undesirable. Database lexicography provides a discrete method of distinguishing between desirable and
undesirable forms of semantic dependency.

3.3.1. Lexical dependency inversion


In metadata creation and maintenance, an undesirable semantic dependency propagates
semantic change imposed on a metadata element in unexpected ways throughout the dependency
architecture. As stated in Section 1, this condition is a diagnostic for fossilized data designs. When
it occurs, the dependency architecture characteristically exhibits a particular structural flaw: a
metadata element that is relatively stable with respect to the information structure of the lexical
graph semantically depends on a significantly less stable element. Consider the lexical graph in
Fig. 5, which illustrates a simple case with lexical and aggregate stability values labeling the nodes
in (1) and (2), respectively.
Note that a node in (1) with lexical stability value 0.8 semantically depends on another with value
0.5. According to Lemma 2, the semantics of the latter node will be relatively easy to change, and
this fact will not be lost on maintainers keen to satisfy changing requirements. Any change to this
volatile element will be propagated throughout the subgraph of its client dependencies, and this
may unintentionally transform the meaning of more stable metadata occurring elsewhere in the
lexical graph.
The modest design in Fig. 5 illustrates how change to low-level detail can impact the semantics
of high-level policy. In designs scaled to more realistic proportions, economic forces oblige
maintainers to prefer easy changes over more costly ones and to restrict change validation to local
unit testing whenever possible. As maintenance changes occur over time, unanticipated problems
arise when structural conditions obscure undesirable side effects. For example, unobtrusive
maintenance modifications to the domain semantics of low-level physical data (say, changing a REAL
data type to a DOUBLE) may produce round-off errors in data encoding high-level system policy.
From the perspective of owners and managers, this error may seem conceptually unrelated to any
prior change. In the absence of a comprehensive approach to the dependency architecture,
maintainers may impose a remedy that fails to address the root cause of the problem. Such changes
typically instigate others until change proliferates, and the dependency architecture decomposes.

Fig. 5. Lexical and aggregate stability in a lexical graph.


Fig. 6. Lexical and aggregate stability in a lexical graph.

An initial diagnosis might identify the deficiency in Fig. 5 as a mismatch between the flow of
lexical stability and the direction of semantic dependency. In particular, the lexical stability of the
two nodes at issue increases in a direction counter to the flow of semantic dependency. Intuitively,
this seems perverse. (Note that this observation does not apply to the view of aggregate stability in
(2), where semantic dependency and aggregate stability flow unidirectionally throughout the
lexical graph.) One might suspect that this condition conceals a design flaw, but that conclusion
does not hold up under analysis. Consider for instance Fig. 6, which adds a minor extension to the
structure in Fig. 5.
Here as in Fig. 5, (1) exhibits a mismatch between the flow of lexical stability and the direction of
semantic dependency. However, the node with lexical stability 0.666 has one additional local
client, and its aggregate stability value (displayed in (2)) is commensurately higher than the
corresponding node in Fig. 5. In this case, the local increase in semantic dependents coincides with an
increase in the aggregate stability of the node, identifying it as more difficult to change.
Maintainers wanting to change its semantics are more likely to respect its heightened stability because
relations local to a data element are evident in entity-relationship diagrams and DDL, the
conventional tools of the trade. Thus one might reasonably expect a more thorough testing of the
effects of change on the node in Fig. 6 than the corresponding node in Fig. 5.
Recall from Theorem 2 that lexical dependency is a specialization of semantic dependency.
Clearly the designs in Figs. 5 and 6 exhibit a mismatch between the flow of lexical stability and the
direction of lexical dependency, but only the former contains a relatively stable metadata element
that semantically depends on a significantly less stable one. Because of this, the design in Fig. 5 is
more prone to fossilization than the other, since its maintainers are more likely to create
inadvertent side effects when they change the semantics of the node at issue. This is a significant
distinction. Although lexical stability runs counter to lexical dependency in both designs, the
heightened stability of the node at issue in Fig. 6 is manifested locally, and this difference will be
obvious to maintainers and their technology, which work most effectively with local relations.
Consequently, one might reasonably conclude that semantic dependencies like the one at issue in
Fig. 6 are desirable, but those like the one in Fig. 5 are not.
Evidently, the analysis of semantic dependency has multiple dimensions. Locally, lexical
stability measures the work required to validate the impact of semantic change on the immediate
dependents of a changed metadata element. More expansively, aggregate stability extends this
measurement to partitions of the metadata dependency architecture, measuring the work required
to validate semantic change to a metadata element with respect to all of its clients, both immediate
and those mediated via transitive closure of lexical dependency relations. Both metrics
discriminate between stable and volatile metadata, but neither independently captures the distinction
between desirable and undesirable semantic dependency. Instead, both dimensions of analysis
must be combined to formally express the difference between desirable and undesirable semantic
dependencies. Let the undesirable form be known as a lexical dependency inversion:

Lemma 4. If v and d are nodes in a lexical graph such that d lexically depends on v and
$\bar{S}(v) - \bar{S}(d) \geq S(d)$, then the relation between v and d flows in the direction of stability; otherwise, the
relation is inverted.

This lemma identifies a lexical dependency inversion in Fig. 5 but not in Fig. 6, as desired. By
comparing the difference of the aggregate stability values of the two nodes at issue (i.e., $\bar{S}(v) = 1.3$
and $\bar{S}(d) = 0.8$) with the lexical stability of the dependent node (i.e., $S(d) = 0.8$), it exposes the fact
that $\bar{S}(v) - \bar{S}(d) = S(v) = 0.5$ in this case. Therefore $S(v) < S(d)$, and a lexical dependency
inversion is identified. As for the lexical graph in Fig. 6, $\bar{S}(v) = 1.966$, $\bar{S}(d) = 0.8$, and the difference
of these aggregate stability values is greater than the lexical stability of the dependent node (i.e.,
$S(d) = 0.8$). In this case, no lexical dependency inversion is found. Thus Lemma 4 leads to a novel
design criterion: lexical dependency should flow in the direction of stability.
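The test in Lemma 4 lends itself to a short script. Because the figures are not reproduced in the text, the two graphs below are hypothetical reconstructions chosen only to match the quoted values (lexical stability 0.8 and 0.5, aggregate stability 1.3 and 0.8 for Fig. 5; 0.666 and roughly 1.966 for Fig. 6); the node names and the helper functions `stabilities` and `inversions` are likewise assumptions.

```python
from fractions import Fraction


def stabilities(arcs):
    """Return (S, S_bar): lexical and aggregate stability for every node."""
    nodes = set(arcs) | {s for servers in arcs.values() for s in servers}
    incoming = dict.fromkeys(nodes, 0)
    total = dict.fromkeys(nodes, 0)
    dependents = {n: set() for n in nodes}
    for dependent, servers in arcs.items():
        for server in servers:
            incoming[server] += 1
            total[server] += 1
            total[dependent] += 1
            dependents[server].add(dependent)
    s = {n: Fraction(incoming[n], total[n]) if total[n] else Fraction(0) for n in nodes}
    s_bar = {}
    for v in nodes:
        rooted, stack = {v}, [v]
        while stack:
            for x in dependents[stack.pop()]:
                if x not in rooted:
                    rooted.add(x)
                    stack.append(x)
        s_bar[v] = sum(s[x] for x in rooted)
    return s, s_bar


def inversions(arcs):
    """Pairs (d, v) where d depends on v and S_bar(v) - S_bar(d) < S(d)."""
    s, s_bar = stabilities(arcs)
    return sorted(
        (d, v)
        for d, servers in arcs.items()
        for v in servers
        if s_bar[v] - s_bar[d] < s[d]
    )


# Assumed shape of Fig. 5: four leaf clients of d, d depends on v, v depends on root r.
fig5 = {"c1": {"d"}, "c2": {"d"}, "c3": {"d"}, "c4": {"d"}, "d": {"v"}, "v": {"r"}}
# Assumed shape of Fig. 6: v gains a client c5, which itself has one dependent c6.
fig6 = {**fig5, "c5": {"v"}, "c6": {"c5"}}

print(inversions(fig5))  # [('d', 'v')]
print(inversions(fig6))  # []
```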
In this way, database lexicography provides a general diagnostic for undesirable semantic
dependencies. Each instance of a lexical dependency inversion involves a metadata element
with relatively stable semantics that depends on a less stable one for meaning. Since volatile
metadata is easier to change, the discipline of database lexicography assumes that
maintenance practices will change it more readily than stable metadata. Furthermore, change to a
volatile metadata element within an inversion will induce semantic side effects in the more stable
element (and its semantic dependents). Under these circumstances, stable metadata may change
more frequently than maintainers might otherwise estimate. Hence, thorough validation of side
effects is less likely to occur when semantic change impacts inversions, and unexpected
consequences are inevitable over time. The inescapable conclusion is that lexical dependency inversions
single out positions with high potential for uncontrolled spread of semantic change elsewhere in
the design.

3.3.2. Remedying lexical dependency inversion


In theory, the practice of design-for-change promotes an ideal design that is open for extension
but closed for modification (see [3, pp. 57–58]). In practice, this ideal has proven difficult to
achieve, even within the framework of object-oriented design where it originates. In its application
to data assets, the practice must resist the influence of structured methods that infuse data with
inflexible dependency architectures. Inevitably, changing requirements compel modifications to
these architectures, and over time changes that appear to have produced desired effects may be
discovered to have inadvertently violated the coherence of the original design. Typically, the root
cause of this effect is the undesirable propagation of semantic change within the dependency
architecture.

Fig. 7. A semantic interface inserted into a lexical graph.
The first step toward a remedy for this condition is recognition that undesirable propagation of
change occurs only where the dependency architecture fails to contain its spread. Hence, the
problem is halfway solved once lexical dependency inversions are identified. The problem can be
completely rectified by applying a treatment to each inversion site such that it contains the spread
of any semantic change imposed there. The previous section described how to identify a lexical
dependency inversion in a lexical graph; this section describes how to treat the site so that it
controls the undesirable propagation of semantic change incident to it. Once this is accomplished
for all lexical dependency inversions, the data design will be open for extension but closed for
modification.
Throughout the dependency architecture, whenever a client lexically depends on a server,
changes to the server propagate to the client. Since lexical dependency is a form of semantic
dependency, a transitive relation, the changes are thereafter transmitted throughout the client
dependency subgraph. Propagation of change is not always desirable, thus a control mechanism is
needed to stem it when desired. One way to assert such control is to insert a design entity that
functions as a semantic interface between client and server at the site of an inversion. Consider the
lexical graphs in Fig. 7.
Lexical graph (1) is from Fig. 5, repeated here for convenience. Recall that this graph exhibited
a lexical dependency inversion in the relationship between the metadata elements labeled with
lexical stability values 0.8 and 0.5. Lexical graph (2) contains the same metadata as (1) plus an
additional design entity functioning as a semantic interface between the two nodes of the former
inversion. Introduction of this interface element remedies the inversion. Application of the
equation in Lemma 4 confirms this conclusion.
Let the difference between lexical graphs (1) and (2) in Fig. 7 serve as a model for application of
this technique. In (2), the design entity functioning as a semantic interface is introduced as a
dependent node. It semantically depends on the nodes that had formerly been client and server in
the lexical dependency inversion. This means that the data dictionary entry associated with this
newly introduced node semantically depends on the entries of both nodes formerly involved in the
inversion. Additionally, no nodes semantically depend on this semantic interface; it remains
semantically independent with minimal lexical stability. Finally, the dictionary entry of the client
no longer refers to the server. Under these conditions, no change to the server will propagate to
the client unless the behavior of the semantic interface mandates it. A kind of firewall for
containment of semantic change has thus been erected.
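At the level of the lexical graph, the treatment amounts to a small rewrite of the arcs. The sketch below is one possible rendering under stated assumptions: the graph is the hypothetical reconstruction of Fig. 5 used for illustration (clients c1 to c4 of d, d depending on v, v depending on root r), and `insert_semantic_interface` and the node name `i` are invented for the example.

```python
def insert_semantic_interface(arcs, client, server, interface):
    """Return a new arc map where `interface` depends on both nodes and
    `client` no longer refers to `server`. Every node must be a key of `arcs`."""
    if server not in arcs.get(client, set()):
        raise ValueError(f"{client!r} does not lexically depend on {server!r}")
    if interface in arcs:
        raise ValueError(f"{interface!r} already exists in the graph")
    rewritten = {node: set(servers) for node, servers in arcs.items()}
    rewritten[client].discard(server)        # the client entry drops its reference
    rewritten[interface] = {client, server}  # the interface depends on both nodes
    return rewritten


def lexical_stability(arcs):
    """S(v) = incoming / (incoming + outgoing); orphans get 0."""
    incoming = {n: 0 for n in arcs}
    for servers in arcs.values():
        for s in servers:
            incoming[s] += 1
    return {
        n: incoming[n] / (incoming[n] + len(arcs[n])) if incoming[n] + len(arcs[n]) else 0.0
        for n in arcs
    }


# Hypothetical inversion site: d (stability 0.8) depends on v (stability 0.5).
graph = {"c1": {"d"}, "c2": {"d"}, "c3": {"d"}, "c4": {"d"},
         "d": {"v"}, "v": {"r"}, "r": set()}
treated = insert_semantic_interface(graph, client="d", server="v", interface="i")

print(treated["d"], treated["i"])
print(lexical_stability(treated)["i"])  # 0.0: nothing depends on the interface
```

The function returns a copy rather than mutating its input, so the original design remains available for comparison.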
In effect, the introduction of a semantic interface defeats the coupling that had previously
existed between the two nodes at issue in the inversion. Nevertheless, the semantic interface must
respect the logic of inheritance, aggregation, and existential entailment encoded in the metadata
like any other element of the dependency architecture. It is distinctive solely in its treatment of
semantic dependency. Each semantic interface is responsible to encapsulate and control the details
of semantic dependency that had previously related its server and client metadata elements. With
respect to implementation details, the data dictionary defines a semantic interface in terms of an
association between its server and client metadata elements. This definition guarantees that
specific details of the former semantic relationship between server and client will be respected by
the semantic interface.
Because database lexicography requires the data dictionary to have integrity with respect to
data models (cf. item (2) in Section 3.1), dictionary entries for semantic interfaces correspond
directly with logical or physical data model representations. Like the corresponding dictionary
entries, these representations must guarantee the details of the former relationship between server
and client. At a minimum, each semantic interface in the data model maintains three resources: (i)
a representation of the properties of the former server; (ii) a representation of the properties of the
former client; and (iii) a correspondence function capable of signaling an exception when the
logical criteria in (i) and (ii) do not hold of its current lexical dependencies. Hence, each semantic
interface in the data model is empowered to guarantee continuity of the interface or signal an
exception.
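The three resources can be pictured as a small data structure. The following is only a sketch under assumptions: properties are modeled as a mapping from column name to type name, and the class `SemanticInterface`, the exception `InterfaceViolation`, the helper `violates`, and the example columns are all hypothetical.

```python
from dataclasses import dataclass
from typing import Mapping


class InterfaceViolation(Exception):
    """Signalled when the recorded properties no longer hold."""


@dataclass(frozen=True)
class SemanticInterface:
    server_properties: Mapping[str, str]  # (i) properties of the former server
    client_properties: Mapping[str, str]  # (ii) properties of the former client

    def check(self, current_server: Mapping[str, str],
              current_client: Mapping[str, str]) -> None:
        """(iii) Correspondence function: raise if either side has drifted."""
        if dict(current_server) != dict(self.server_properties):
            raise InterfaceViolation("server properties changed")
        if dict(current_client) != dict(self.client_properties):
            raise InterfaceViolation("client properties changed")


def violates(interface, server, client) -> bool:
    """Convenience wrapper that reports a violation as a boolean."""
    try:
        interface.check(server, client)
    except InterfaceViolation:
        return True
    return False


# Hypothetical headings recorded when the interface was introduced.
interface = SemanticInterface(
    server_properties={"unit_price": "REAL"},
    client_properties={"order_total": "REAL"},
)

print(violates(interface, {"unit_price": "REAL"}, {"order_total": "REAL"}))    # False
print(violates(interface, {"unit_price": "DOUBLE"}, {"order_total": "REAL"}))  # True
```

In a database setting the same check would more likely live in triggers or constraints than in application code; the class form is used here only to make the three resources explicit.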
In the particular case of semantic interfaces relating two elements of the logical model,
substitution of an interface entity for the relationship underlying a semantic dependency suffices to
ensure that services provided to clients are unaffected by inadvertent side effects of maintenance
actions. When modification of the interface is detected, an exception is signaled. However, when a
semantic interface relates at least one element of the physical model, there are special cases that
merit more detailed specification. The most notable of these are:

(1) When the former server is a domain, the server properties represented by the interface are the
server's data type or typed values.
(2) When the former server and client are relational variables, the properties represented by the
interface are the former server and client headings.
(3) Otherwise, when the former server is a relational variable, the server property represented by
the interface is the server's heading.

In general, physical model elements like (1) through (3) can be implemented as tables containing
associative entities. Judicious programming with triggers, check constraints, and exceptions will
ensure that domains, headings, and other properties do not change in ways that the original design
did not anticipate. Thus metadata elements that depend on them will not be inadvertently affected.
Using these methods, lexical dependency inversions can be completely rectified by containing
the spread of any semantic change incident to their location in a lexical graph. Once this is
accomplished for every inversion in the graph, the data design will be open for extension but closed
for modification.

4. Conclusion

This paper has introduced database lexicography, a metadata analysis discipline that applies
lexical graph theory to data design and maintenance. Database lexicography proposes a formal
design criterion for data dependencies, and it provides metrics to evaluate the conformance of
designs to this criterion. It treats the data dictionary as a first class object encoding design
concepts, and it benefits practitioners by identifying database dependency architecture; quantifying
interdependent data elements' sensitivity to change; categorizing core and peripheral data
elements; integrating models; and providing figures of merit by which to fortify data architectures to
withstand design fossilization and guide their evolution in the face of changing requirements.
Database lexicography methodology focuses on managing the metadata namespace and its
intrinsic dependency architecture. By managing lexical information in the data dictionary, prac-
titioners exert control over the information structure inherent in the metadata and, by association,
the information structure of the underlying database. The methodology commences by detecting
and eliminating cyclical dependencies in the metadata so that the metadata dependency archi-
tecture can be isolated in a lexical graph. Thereafter, it proceeds by quantifying the stability of
each node in the lexical graph, thereby assigning figures of merit to the metadata dependency
architecture. Should lexical dependency inversions be revealed in these measurements, they are
ameliorated by prescribed methods in order to fortify the design against fossilization.
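
The three steps just summarized (cycle detection, stability measurement, inversion detection) can be sketched in Python as follows. This is an illustration under stated assumptions, not the paper's own formulation: the lexical graph is represented as a mapping from each element to the set of elements it depends on, and stability is approximated by an instability ratio of fan-out to total fan-in plus fan-out, a measure borrowed from software package design that may differ from the metric defined earlier in the paper. All function names and the example data are hypothetical.

```python
from typing import Dict, List, Set, Tuple

Graph = Dict[str, Set[str]]  # element -> elements it depends on


def nodes_of(graph: Graph) -> Set[str]:
    """Collect every element that appears as a key or as a dependency."""
    nodes = set(graph)
    for deps in graph.values():
        nodes |= deps
    return nodes


def find_cycle(graph: Graph) -> List[str]:
    """Return one dependency cycle (first node repeated at the end), or []."""
    WHITE, GREY, BLACK = 0, 1, 2
    colour = {n: WHITE for n in nodes_of(graph)}
    path: List[str] = []

    def visit(node: str) -> List[str]:
        colour[node] = GREY
        path.append(node)
        for dep in sorted(graph.get(node, ())):
            if colour[dep] == GREY:  # back edge: dep is on the current path
                return path[path.index(dep):] + [dep]
            if colour[dep] == WHITE:
                found = visit(dep)
                if found:
                    return found
        path.pop()
        colour[node] = BLACK
        return []

    for start in sorted(colour):
        if colour[start] == WHITE:
            found = visit(start)
            if found:
                return found
    return []


def instability(graph: Graph) -> Dict[str, float]:
    """Assumed metric: fan_out / (fan_in + fan_out); 0.0 is most stable."""
    fan_in = {n: 0 for n in nodes_of(graph)}
    for deps in graph.values():
        for dep in deps:
            fan_in[dep] += 1
    scores = {}
    for node, incoming in fan_in.items():
        fan_out = len(graph.get(node, ()))
        total = incoming + fan_out
        scores[node] = fan_out / total if total else 0.0
    return scores


def inversions(graph: Graph) -> List[Tuple[str, str]]:
    """Edges where an element depends on a less stable element."""
    score = instability(graph)
    return sorted(
        (u, v) for u, deps in graph.items() for v in deps if score[v] > score[u]
    )


example: Graph = {
    "order": {"customer", "product"},
    "customer": {"name"},
    "product": {"name"},
    "name": {"display_format"},
    "display_format": {"locale", "theme"},
}

if __name__ == "__main__":
    print(find_cycle(example))  # [] because the example is acyclic
    print(inversions(example))  # [('name', 'display_format')]
```

In the example, name is relied on by two elements yet depends on display_format, which under the assumed metric is less stable; that edge is reported as an inversion and would be a candidate for the interface-based remedies of Section 3.
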
Database lexicography employs lexical graph theory to support the analysis of the form and
content of lexicographic information contained in a data dictionary. Focusing on the dependency
architecture of designs, it facilitates the productive utilization and reuse of data assets. Its
methodology enables practitioners to expose undefined semantics in data assets as well as to isolate and
measure data element interdependencies. Database lexicography provides decision support that is
unavailable elsewhere for critical design and maintenance issues, such as quantifying the impact of
adding a new element to a data model or changing the meaning of an element in a data model.
Ultimately, it reveals a novel design criterion: lexical dependency should flow in the direction of
stability.
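
One plausible reading of "quantifying the impact" of a change is to count the elements that depend, directly or transitively, on the changed element. The Python sketch below takes that reading as an assumption; the function name impacted_by and the example dictionary are hypothetical, and the paper's own measures may be defined differently.

```python
from collections import deque
from typing import Dict, Set


def impacted_by(graph: Dict[str, Set[str]], changed: str) -> Set[str]:
    """Return every element that directly or transitively depends on `changed`.

    `graph` maps each element to the set of elements it depends on.
    """
    # Invert the edges so we can walk from an element to its dependents.
    dependents: Dict[str, Set[str]] = {}
    for element, deps in graph.items():
        for dep in deps:
            dependents.setdefault(dep, set()).add(element)

    seen: Set[str] = set()
    queue = deque([changed])
    while queue:
        current = queue.popleft()
        for parent in dependents.get(current, ()):
            if parent not in seen:
                seen.add(parent)
                queue.append(parent)
    seen.discard(changed)  # only relevant if the graph contains a cycle
    return seen


if __name__ == "__main__":
    dictionary = {
        "order": {"customer", "product"},
        "customer": {"name"},
        "product": {"name"},
        "name": set(),
    }
    print(sorted(impacted_by(dictionary, "name")))   # ['customer', 'order', 'product']
    print(sorted(impacted_by(dictionary, "order")))  # []
```

Under this reading, changing the meaning of name touches three other elements, whereas changing order touches none, which matches the intuition that widely relied-upon elements should be the most stable.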

Acknowledgements

The author is indebted to Ping Xue for thoughtful discussions and constructive criticisms of the
issues treated in this paper.

References

[1] G. Coen, Dictionaries and Lexical Graphs, in: A. Moreno, R.P. van de Riet (Eds.), Applications of Natural
Language to Information Systems: Proceedings of NLDB'01, Gesellschaft für Informatik, Bonn, Germany, 2001.
[2] C.J. Date, An Introduction to Database Systems, sixth ed., Addison-Wesley, Reading, MA, 1995.
[3] B. Meyer, Object-Oriented Software Construction, second ed., Prentice-Hall, Englewood Cliffs, NJ, 2000.
[4] J. Rumbaugh, M. Blaha, W. Premerlani, F. Eddy, W. Lorensen, Object-Oriented Modeling and Design,
Prentice-Hall, Englewood Cliffs, NJ, 1991.
[5] D. Tsichritzis, A. Klug (Eds.), The ANSI/X3/SPARC DBMS Framework. Report of Study Group on Data Base
Management Systems, AFIPS Press, Montvale, NJ, 1977.
[6] R. Wall, Introduction to Mathematical Linguistics, Prentice-Hall, Englewood Cliffs, NJ, 1972.

Gary Coen After earning degrees in Philosophy, English, and Comparative Literature, Gary Coen received a
Ph.D. in Foreign Language Studies from The University of Texas at Austin, where he specialized in machine
translation of natural languages. He is currently an Associate Technical Fellow in The Boeing Company,
where he consults in machine translation technology and pursues research and development of natural
language processing applications. Dr. Coen's research has been published in the United States, Europe, and
Japan.
