
Describing Semistructured Data*

Luca Cardelli
Microsoft Research

* Database Principles Column. Column editor: Leonid Libkin, Department of Computer Science, University of Toronto, Toronto, Ontario M5S 3H5, Canada. E-mail: libkin@cs.toronto.edu.

Abstract

We introduce a rich language of descriptions for semistructured tree-like data, and we explain how such descriptions relate to the data they describe. Various query languages and data schemas can be based on such descriptions.

1 Introduction

1.1 Trees and their Descriptions

We consider data that is represented as labeled trees, and we ask: how can we describe the structure of such data? We use descriptions (or, more precisely, formulas in a special logic) to talk about properties of labeled trees. A description denotes the collection of trees that, by a precise definition, match the description.

A description can be used as a yes/no query against labeled trees: “Does the tree under consideration match the description?”. With some extensions, a description can be used as a query returning a complex result. Hence, description languages can be seen as kernels of query languages. Some special classes of descriptions can be used as path queries, or as flexible type systems (schemas) for the data.

We aim to find a very general class of descriptions, so we can accommodate a large class of actual or potential schema languages and query languages. Most of all, though, we aim to communicate an approach to formalizing descriptions that can be adapted to different contexts. The presentation here is introductory and not completely formal; we refer to other work for full details [11].

We consider only labeled trees, not labeled graphs. Labeled trees are closer to common practice in XML, while labeled graphs are the more general model used for semistructured data. While graphs are natural generalizations of trees, descriptions of graphs are much more complex than descriptions of trees. So, for the moment at least, we just restrict ourselves to trees.

We want both our data and our descriptions to be compositional: if 𝒜 is a description of a tree, and ℬ is a description of another tree, then a simple composition (e.g., root-merge) of the trees should correspond to a simple composition of 𝒜 and ℬ. Note that this means that we are not just interested in describing paths through a tree, but also in describing how trees branch out.

Our syntax for labeled trees, and a small but important fragment of our description language, are summarized below:

Syntax for Trees
P, Q ::=
    0         root
    n[P]      edge
    P | Q     composition

Basic Descriptions
𝒜, ℬ ::=
    T         there is anything
    0         there is only a root
    n[𝒜]      there is one edge n to a subtree
    𝒜 | ℬ     there are two joined trees

The description T describes any tree. The description 0 describes the empty tree (consisting of just a root node). The description n[𝒜] describes a tree consisting of a single edge labeled n off the root, leading to a subtree described by 𝒜. The description 𝒜 | ℬ describes any tree that can be seen as the root-merge of two trees that are described by 𝒜 and ℬ.

1.2 Historical Remarks

This work arose originally from the observation that the areas of semistructured databases [4] and mobile computation [9] have some surprising similarities at the technical level. These areas are inspired by the need to find better ways to describe, respectively, data and computation on the Internet. The technical similarities permit the transfer of some techniques between the two areas. More interestingly, if we can take advantage of the similarities and generalize them, we may obtain a broader model of data and computation on the Internet.

The ultimate source of similarities is the fact that both areas have to deal with extreme dynamicity of data and behavior. In semistructured databases, one cannot rely on uniformity of structure, because data may come from heterogeneous and uncoordinated sources. Still, it is necessary to perform searches based on whatever uniformity one can find in the data. In mobile computation, one cannot rely on uniformity of structure because agents, devices, and networks can dynamically connect, move around, become inaccessible, or crash. Still, it is necessary to perform computations based on whatever resources and connections are available on the network.

As examples of the potential convergence of these two areas, consider the following arguments. First, one can regard data structures stored inside network nodes as a natural extension of network structures, since on a large time/space scale both networks and data are semistructured and dynamic. Therefore, one can think of applying the same navigational and code mobility techniques uniformly to networks and data. Second, since networks and their resources are semistructured, one can think of applying semistructured database searches to the network structure. This is a well-known major problem in distributed computation, going under the name of resource discovery.

2 Labeled Trees

We begin with a simple syntax for semistructured data.

Syntax for Trees
P, Q ::=
    0         root
    n[P]      edge
    P | Q     composition

• 0 represents the tree consisting of a single root node.
• n[P] represents a tree consisting of a single edge labeled n off the root, leading to a subtree represented by P.
• P | Q represents the tree obtained by taking the trees represented by P and by Q, and by merging their roots.

For example, the following piece of data:

Cambridge[Eagle[chair[0] | chair[0]]]

represents: “in Cambridge there is (nothing but) a pub called the Eagle that contains (nothing but) two empty chairs”.

We consider here a commutative composition operation P | Q, for unordered trees. However, it is easy to consider a non-commutative operation, say P ; Q, for ordered trees, that can replace or be added to P | Q. This may be necessary, for example, to model certain XML trees more precisely and conveniently.

The description of trees in the syntax given above is not unique. For example, the expressions P | Q and Q | P represent the same (unordered) tree; similarly, the expressions 0 | P and P represent the same tree. We consider two expressions P and Q equivalent when they represent the same tree, and we write P ≡ Q. The relation P ≡ Q is an equivalence and a congruence (i.e., equals can be replaced by equals in any syntactic context). Moreover, the following simple properties hold:

    P | Q ≡ Q | P
    (P | Q) | R ≡ P | (Q | R)
    P | 0 ≡ P

3 Descriptions

As an example, here is a description asserting that there is exactly one edge labeled Cambridge, leading to at least one edge labeled Eagle, leading to at least one edge labeled chair, leading to nothing:

Cambridge[Eagle[chair[0] | T] | T]

This assertion happens to be true of the tree shown earlier. In general, our descriptions include both assertions about trees, such as the one above, and standard logical connectives for composing assertions.

The exact meaning of descriptions is given by a satisfaction relation relating a tree with a description. The term satisfaction comes from logic; for reasons that will become apparent shortly, we will also call this concept matching. The basic question we consider is: does this tree match this description?

The satisfaction/matching relation between a tree P (actually, an expression P representing a tree) and a description 𝒜 is written, for the purposes of this paper:

P matches 𝒜

Informally, the matching relation can be described as follows, where at the same time we introduce the syntax of descriptions and their meaning. It is important to realize that a description states a property that holds at a certain place in the tree: a top-level description talks about a tree from its root, and a sub-description may talk about a part of the whole tree.

• Invariance
    if P matches 𝒜 and P ≡ Q
    then Q matches 𝒜
• T: anything
    any P matches T
• ¬𝒜: negation
    if P does not match 𝒜
    then P matches ¬𝒜
• 𝒜 ∧ ℬ: conjunction
    if P matches 𝒜 and P matches ℬ
    then P matches 𝒜 ∧ ℬ
• 0: root
    0 (the tree expression) matches 0 (the description)
• n[𝒜]: edge
    if P matches 𝒜
    then n[P] matches n[𝒜]

• 𝒜 | ℬ: composition
    if P matches 𝒜 and Q matches ℬ
    then P | Q matches 𝒜 | ℬ
• ∀x.𝒜: universal quantification
    if, for all labels n, P matches 𝒜{x←n}
    (i.e., 𝒜 where x is replaced by n)
    then P matches ∀x.𝒜
• µX.𝒜: least fixpoint (with X occurring positively in 𝒜)
    if P is contained in the least fixpoint of the function λX.𝒜, taken over the collection of sets of labeled trees ordered by inclusion,
    then P matches µX.𝒜

Many useful derived connectives can be defined from the ones above. For example:

Derived Connectives
    F        ≜ ¬T                        false
    𝒜 ∨ ℬ    ≜ ¬(¬𝒜 ∧ ¬ℬ)                disjunction
    𝒜 ⇒ ℬ    ≜ ¬𝒜 ∨ ℬ                    implication
    𝒜 ⇔ ℬ    ≜ (𝒜 ⇒ ℬ) ∧ (ℬ ⇒ 𝒜)         logical equivalence
    ∃x.𝒜     ≜ ¬∀x.¬𝒜                    existential quantification
    𝒜 || ℬ   ≜ ¬(¬𝒜 | ¬ℬ)                decomposition
    𝒜^∀      ≜ 𝒜 || F                    every part matches 𝒜
    𝒜^∃      ≜ 𝒜 | T                     some part matches 𝒜
    ◇𝒜       ≜ µX. 𝒜 ∨ (∃x. x[X] | T)    somewhere 𝒜 holds
    □𝒜       ≜ ¬◇¬𝒜                      everywhere 𝒜 holds
    𝒜 |⇒ ℬ   ≜ ¬(𝒜 | ¬ℬ)                 parallel implication
    n[⇒𝒜]    ≜ ¬n[¬𝒜]                    nested implication
    νX.𝒜     ≜ ¬(µX.¬𝒜{X←¬X})            greatest fixpoint

• Many operators are derived as standard DeMorgan duals: disjunction, existential quantification, and the everywhere modality.
• Decomposition, 𝒜 || ℬ, is the DeMorgan dual of composition. A decomposition description 𝒜 || ℬ is satisfied if for every parallel decomposition of the tree in question, either one component satisfies 𝒜 or the other satisfies ℬ.
• Then, 𝒜^∀ means that in every decomposition either one component satisfies 𝒜 or the other satisfies F (≜ ¬T); since the latter is impossible, in every possible decomposition one component must satisfy 𝒜. For example: (n[T] ⇒ n[m[T]])^∀ means that every edge n that can be found off the root leads to a single edge m. The DeMorgan dual of 𝒜^∀ is 𝒜^∃, which means that it is possible to find a decomposition where one component satisfies 𝒜. For example, n[m[T]^∃]^∃ means that there is at least one edge n that leads to at least one edge m.
• Normal Implication: 𝒜 ⇒ ℬ ≜ ¬𝒜 ∨ ℬ. This is the standard definition of implication. Note that this means that P matches 𝒜 ⇒ ℬ if whenever P matches 𝒜 then the same P matches ℬ. As examples, consider Borders[T] ⇒ Borders[Starbucks[T] | T], stating that a Borders bookstore must contain a Starbucks shop, and (NonSmoker[T] | T) ⇒ (Smoker[T] | T), stating that if there is a non-smoker, there is also a smoker nearby (the tree P must be composed of both a smoker and a non-smoker).
• Parallel Implication: 𝒜 |⇒ ℬ ≜ ¬(𝒜 | ¬ℬ). This means, by definition, that it is not possible to split the root of the current tree in such a way that one part satisfies 𝒜 and the other does not satisfy ℬ. In other words, in whatever way we split the root of the current tree, if one part satisfies 𝒜, then the other part must satisfy ℬ. For example, NonSmoker[T] |⇒ (Smoker[T] | T) is a slightly more compact formulation of the property of nonsmokers given above.
• Nested Implication: n[⇒𝒜] ≜ ¬n[¬𝒜]. This means, by definition, that it is not possible that an edge n leads to a tree that does not satisfy 𝒜. In other words, if there is an edge n, it leads to a tree that satisfies 𝒜. For example: Borders[⇒ Starbucks[T] | T] is, again, a slightly more compact formulation of the property of Borders given above.
• Greatest Fixpoint: The dual of the least fixpoint operator µX.𝒜 is the greatest fixpoint operator νX.𝒜. For example µX.X is equivalent to F, while νX.X is equivalent to T. More interestingly, µX. 0 ∨ m[X] describes every tree of the form m[m[... m[0]]], and, on finite trees, it is equivalent to νX. 0 ∨ m[X]. However, if we consider infinite trees, the distinction between least and greatest fixpoint becomes more important. For example, the infinite tree m[m[...]] satisfies νX. 0 ∨ m[X], but does not satisfy µX. 0 ∨ m[X]. When we consider only finite trees, as we do here, the µ and ν operators are quite similar in practice, since most interesting descriptions have a single fixpoint.
• Somewhere. A tree P satisfies ◇𝒜 if there is a subtree Q of P that satisfies 𝒜. This is defined by a recursive description.
• Everywhere: □𝒜 ≜ ¬◇¬𝒜. What is true everywhere? Not much, unless we qualify a property by negation or implication. For example, □¬(n[T]^∃) means that there is no edge called n anywhere. Moreover, we can write □(𝒜 ⇒ ℬ) to mean that everywhere 𝒜 is true, ℬ is true as well. For example, □(NonSmoker[T] |⇒ (Smoker[T] | T)): everywhere there is a non-smoker there is also a smoker.

4 Equivalent Descriptions

A precise semantics of descriptions helps in deriving equivalences between descriptions (and, further, between queries) [11]. Many such equivalences can be derived; we list some of them here, just to give an idea of the rich collection of properties one can rely on. Equivalences can be used by a query optimizer; in particular, they can be used to push negation to the leaves of a description, by dualizing operators.
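As a rough illustration of how a precise semantics lets one test a proposed equivalence, the following Python sketch implements matching on finite trees for the fragment T, 0, n[𝒜], 𝒜 | ℬ, ¬𝒜 and 𝒜 ∧ ℬ. It is not the implementation used in the cited work: the tuple encoding and the names edge, par, splits and matches are assumptions made only for this example. It checks the Cambridge description of Section 3, and then compares n[𝒜] with n[T] ∧ ¬n[¬𝒜] on a few sample trees. Agreement on samples is evidence for an equivalence, not a proof of it.

```python
from itertools import combinations

# Hypothetical encoding (an assumption of this sketch): a finite tree is a
# sorted tuple of (label, subtree) edges, so P | Q and Q | P, or P | 0 and P,
# are represented by the same value.
ZERO = ()

def edge(n, p=ZERO):
    """The tree n[P]."""
    return ((n, p),)

def par(*trees):
    """The root-merge P | Q | ... of the given trees."""
    return tuple(sorted(e for t in trees for e in t))

# Descriptions are tagged tuples built with these helper constructors.
T = ("T",)
Z = ("0",)
def E(n, a): return ("edge", n, a)
def Par(a, b): return ("par", a, b)
def Not(a): return ("not", a)
def And(a, b): return ("and", a, b)

def splits(tree):
    """Yield every split of the root edges into two parts (exponential)."""
    idx = range(len(tree))
    for k in range(len(tree) + 1):
        for chosen in combinations(idx, k):
            left = tuple(tree[i] for i in chosen)
            right = tuple(tree[i] for i in idx if i not in chosen)
            yield left, right

def matches(tree, d):
    """Decide 'tree matches d' by direct recursion on the description."""
    tag = d[0]
    if tag == "T":
        return True
    if tag == "0":
        return tree == ZERO
    if tag == "edge":
        _, n, sub = d
        return len(tree) == 1 and tree[0][0] == n and matches(tree[0][1], sub)
    if tag == "par":
        return any(matches(p, d[1]) and matches(q, d[2])
                   for p, q in splits(tree))
    if tag == "not":
        return not matches(tree, d[1])
    if tag == "and":
        return matches(tree, d[1]) and matches(tree, d[2])
    raise ValueError(f"unknown description: {tag}")

# The example tree and description from Section 3.
data = edge("Cambridge", edge("Eagle", par(edge("chair"), edge("chair"))))
desc = E("Cambridge", Par(E("Eagle", Par(E("chair", Z), T)), T))
print(matches(data, desc))

# Compare n[A] with n[T] ∧ ¬n[¬A] on sample trees, taking A = m[0].
A = E("m", Z)
lhs = E("n", A)
rhs = And(E("n", T), Not(E("n", Not(A))))
samples = [
    ZERO,
    edge("n"),
    edge("n", edge("m")),
    par(edge("n", edge("m")), edge("k")),
    edge("k", edge("m")),
]
print(all(matches(t, lhs) == matches(t, rhs) for t in samples))
```

Both print statements output True. The composition case enumerates every split of the root edges, so this brute-force matcher is only suitable for small trees. Quantifiers and fixpoints are left out of the sketch.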

Equivalent Descriptions
    n[𝒜]           ⇔ n[T] ∧ n[⇒𝒜]
    n[⇒𝒜]          ⇔ n[T] ⇒ n[𝒜]
    n[F]           ⇔ F
    n[⇒T]          ⇔ T
    n[𝒜 ∧ ℬ]       ⇔ n[𝒜] ∧ n[ℬ]
    n[⇒(𝒜 ∨ ℬ)]    ⇔ n[⇒𝒜] ∨ n[⇒ℬ]
    n[𝒜 ∨ ℬ]       ⇔ n[𝒜] ∨ n[ℬ]
    n[⇒(𝒜 ∧ ℬ)]    ⇔ n[⇒𝒜] ∧ n[⇒ℬ]
    n[∃x.𝒜]        ⇔ ∃x.n[𝒜]      (x≠n)
    n[⇒∀x.𝒜]       ⇔ ∀x.n[⇒𝒜]     (x≠n)
    𝒜 | F          ⇔ F
    𝒜 || T         ⇔ T
    T | T          ⇔ T
    F || F         ⇔ F
    𝒜 | (ℬ ∨ 𝒞)    ⇔ (𝒜 | ℬ) ∨ (𝒜 | 𝒞)
    𝒜 || (ℬ ∧ 𝒞)   ⇔ (𝒜 || ℬ) ∧ (𝒜 || 𝒞)

5 From Descriptions to Queries

A satisfaction relation, such as the one defined in Section 3, is not always decidable. However, in some interesting cases, the problem of whether P matches 𝒜 becomes decidable [14]; some complexity results are also known [16]. A decision procedure for such a matching problem is also called a model-checking algorithm. Such an algorithm implements a matching procedure between a tree and a description, where the result of the match is just success or failure.

For example, the following match succeeds. The description can be read as stating that there is an empty chair at the Eagle pub; the matching process verifies that this fact holds starting from the root of the tree:

Eagle[chair[John[0]] | chair[Mary[0]] | chair[0]]
matches
Eagle[chair[0] | T]

More generally, we can imagine collecting information, during the matching process, about which parts of the tree match which parts of the description. Further, we can enrich descriptions with markers that are meant to be bound to parts of the tree during matching; the result of the matching algorithm is then either failure or an association of markers to the trees that match them.

We can thus extend descriptions with matching variables, 𝒳. For example, by running the matching computation for:

Eagle[chair[John[0]] | chair[Mary[0]] | chair[0]]
matches
Eagle[chair[𝒳] | T]

we obtain, bound to 𝒳, either somebody sitting at the Eagle, or the indication that there is an empty chair. Moreover, by matching:

Eagle[chair[John[0]] | chair[Mary[0]] | chair[0]]
matches
Eagle[chair[(¬0)∧𝒳] | T]

we obtain, bound to 𝒳, somebody (not 0) sitting at the Eagle. Here the answer could be either John[0] or Mary[0], since both bindings lead to a successful global match. Moreover, by using the same variable more than once we can express constraints: the description

Eagle[chair[(¬0)∧𝒳] | chair[𝒳] | T]

is successfully matched if there are two people with the same name (or any two equal structures) sitting at the Eagle.

These generalized descriptions that include matching variables can thus be seen as queries. The result of a successful matching can be seen as a possible answer to a query, and the collection of all possible successful matches as the collection of all answers.

For serious semistructured database applications, we also need sophisticated ways of matching labels (e.g. with wildcards and lexicographic orders) and of matching paths of labels. For the latter, though, we already have considerable flexibility within the existing logic; consider the following examples:

• Exact path. The description n[m[p[𝒳]] | T] means: match a path consisting of the labels n, m, p, and bind 𝒳 to what the path leads to. Note that, in this example, other paths may lead out of n, but there must be a unique path out of m and p.
• Dislocated path. The description n[◇(m[𝒳] | T)] means: match a path consisting of a label n, followed by an arbitrary path, followed by a label m; bind 𝒳 to what the path leads to.
• Disjunctive path. The description n[p[𝒳]] ∨ m[p[𝒳]] means: bind 𝒳 to the result of following either a path n,p, or a path m,p.
• Negative path. The description m[¬(p[T] | T) | q[𝒳]] means: bind 𝒳 to anything found somewhere under m, inside a q but not next to a p.
• Wildcard and restricted wildcard. m[∃y. y≠n ∧ y[𝒳]] means: match a path consisting of m and any label different from n, and bind 𝒳 to what the path leads to. (Inequality of labels can be easily added to the descriptions [11].)
• Kleene Star for paths. µX. 𝒜 ∨ (m[X] | T) means: match a path consisting of any number of m edges leading to a subtree that matches 𝒜.

Although we have a lot of power and flexibility in defining descriptions for paths, we may want to have a convenient syntax for such common situations; a syntax for paths that easily translates into our descriptions is defined in [11].

In related work [11], we use a rather traditional SQL-style select-from construct for constructing answers to queries, after the matching phase described above. The resulting query

language, TQL [3], is fairly similar to XML-QL [4], perhaps indicating a natural convergence of query mechanisms.

We should emphasize, though, that our composition operator is very powerful, and not very common in the query literature. It can be used, for example, for the following purposes:

• Composition makes it easy to describe record-like structures both partially ((b[T] | c[T] | T) means: contains b, c, and possibly more fields) and completely ((b[T] | c[T]) means: contains only b and c fields); complete descriptions are difficult in path-based approaches.
• Composition makes it possible to bind a variable to ‘the rest of the record’, as in “𝒳 is everything but the paper title”: paper[title[T] | 𝒳].
• Composition makes it possible to describe schemas, as shown next.

6 Schemas

Path-like descriptions explore the vertical structure of trees. Our descriptions can also easily explore horizontal structure, as is common in schemas for semistructured data. (E.g. in XML DTDs, XDuce [19] and XMLSchema [1]. However, our present formulation deals directly only with unordered structures.)

For example, we can extract from our description language the following regular-expression-like sublanguage, inspired by XDuce types. Every expression of this language denotes a set of trees:

    0                        the empty tree
    𝒜 | ℬ                    an 𝒜 next to a ℬ
    𝒜 ∨ ℬ                    either an 𝒜 or a ℬ
    n[𝒜]                     an edge n leading to an 𝒜
    𝒜* ≜ µX. 0 ∨ (𝒜 | X)     finite composition of zero or more 𝒜's
    𝒜+ ≜ 𝒜 | 𝒜*              finite composition of one or more 𝒜's
    𝒜? ≜ 0 ∨ 𝒜               optionally an 𝒜

In general, we believe that a number of proposals for describing the shape of semistructured data can be embedded in our description language, or in something closely related. Each such proposal usually comes with an efficient algorithm for checking membership or other properties. These efficient algorithms, of course, do not fall out automatically from a general framework. Still, a general framework such as ours can be used to compare different proposals.

7 Conclusions

Semistructured databases have developed flexible ways of querying data, even when the data is not rigidly structured according to schemas [4]. In relational database theory, query languages are nicely related to query algebras and to query logics. However, query algebras and query logics for semistructured databases are not yet well understood.

We believe we have provided at least an example of a query logic that is suitable for semistructured data. Moreover, in related work [11,12] we describe a table algebra for our query logic; this has the same function as relational algebra for relational databases, and can take advantage of a rich set of algebraic properties, such as the ones listed in section 4.

An implementation of a query language, TQL [3], based on these ideas is being carried out in Pisa by Giorgio Ghelli and co-workers. The current prototype can be used to query XML documents accessible through files or through web servers.

Acknowledgments

Giorgio Ghelli restarted my interest in databases, and in particular in semistructured data, and we have since coauthored works that are partially reflected in this paper.

References

[1] XML schema. Available from http://www.w3c.org, 2000.
[2] XML query. Available from http://www.w3c.org, 2001.
[3] TQL. Available from http://macbeth.di.unipi.it/TQL, 2001.
[4] S. Abiteboul, P. Buneman, and D. Suciu. Data on the Web. Morgan Kaufmann Publishers, San Francisco, 2000.
[5] S. Abiteboul, R. Hull, and V. Vianu. Foundations of Databases. Addison-Wesley, Reading, MA, 1995.
[6] S. Abiteboul, Dallan Quass, Jason McHugh, Jennifer Widom, and Janet L. Wiener. The Lorel query language for semistructured data. International Journal on Digital Libraries, 1(1):68-88, 1997.
[7] P. Buneman, S. B. Davidson, G. G. Hillebrand, and D. Suciu. A query language and optimization techniques for unstructured data. In Proc. of the 1996 ACM SIGMOD International Conference on Management of Data (SIGMOD), Montreal, Quebec, Canada, pages 505-516, 4-6 June 1996. SIGMOD Record 25(2), June 1996.
[8] P. Buneman and B. Pierce. Union Types for Semistructured Data. Proceedings of the International Database Programming Languages Workshop, 1999. Also available as University of Pennsylvania Dept. of CIS technical report MS-CIS-99-09.
[9] L. Cardelli. Abstractions for Mobile Computation. Jan Vitek and Christian Jensen, Editors. Secure Internet Programming: Security Issues for Mobile and Distributed Objects. LNCS 1603, pages 51-94. Springer, 1999.

[10] L. Cardelli. Semistructured computation. In Proc. of the Seventh Intl. Workshop on Data Base Programming Languages (DBPL), 1999.
[11] L. Cardelli and G. Ghelli. A Query Language Based on the Ambient Logic. In Proceedings ESOP’01, volume 2028 of LNCS, pages 1-22. Springer, 2001.
[12] L. Cardelli and G. Ghelli. Evaluation of TQL queries. Available from http://www.di.unipi.it/~ghelli/papers.html, 2001.
[13] L. Cardelli and A.D. Gordon. Mobile ambients. In Proceedings FoSSaCS'98, volume 1378 of LNCS, pages 140-155. Springer-Verlag, 1998. To appear in Theoretical Computer Science.
[14] L. Cardelli and A.D. Gordon. Anytime, Anywhere. Modal Logics for Mobile Ambients. Proceedings POPL’00, pages 365-377, 2000.
[15] D. Chamberlin, J. Robie, and D. Florescu. Quilt: An XML query language for heterogeneous data sources. In Proc. of Workshop on the Web and Data Bases (WebDB), 2000.
[16] W. Charatonik and J.-M. Talbot. The Decidability of Model Checking Mobile Ambients. Proceedings of the 15th Annual Conference of the European Association for Computer Science Logic. Springer LNCS, 2001 (to appear).
[17] A. Deutsch, D. Florescu, M. Fernandez, A. Levy, and D. Suciu. A query language for XML. In Proc. of the Eighth International World Wide Web Conference, 1999.
[18] M. Fernandez, D. Florescu, A. Levy, and D. Suciu. A query language and processor for a web-site management system. In Proc. of Workshop on Management of Semistructured Data, Tucson, 1997.
[19] M. Fernandez, Daniela Florescu, Jaewoo Kang, Alon Levy, and Dan Suciu. Catching the boat with Strudel: experiences with a web-site management system. In Proc. of ACM SIGMOD International Conference on Management of Data (SIGMOD), pages 414-425, 1998.
[20] G. Ghelli. TQL as an XML query language. Available from http://www.di.unipi.it/~ghelli/papers.html, 2001.
[21] R. Goldman, J. McHugh, and J. Widom. From semistructured data to XML: Migrating the Lore data model and query language. In Proc. of Workshop on the Web and Data Bases (WebDB), pages 25-30, 1999.
[22] H. Hosoya and B.C. Pierce. XDuce: A typed XML processing language (preliminary report). In Proc. of Workshop on the Web and Data Bases (WebDB), 2000.
[23] F. Neven and T. Schwentick. Expressive and efficient pattern languages for tree-structured data. In Proc. of the 19th Symposium on Principles of Database Systems (PODS), 2000.
[24] Y. Papakonstantinou, H.G. Molina, and J. Widom. Object exchange across heterogeneous information sources. Proc. of the eleventh IEEE Int. Conference on Data Engineering, Birmingham, England, pages 251-260, 1996.
