Ph.D. Dissertation
Graduate student: 陳鍾誠
Advisor: 項潔
July 2002
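Acknowledgements
This thesis is the main result of my work from 1997 to 2002. It sums up the research and experience I gained in life, at work, and at school. If any of it is worth consulting, the credit goes to my parents and teachers, who taught and helped me, and to my colleagues and friends.
First, I thank my advisor, Professor 項潔. Through everyday discussions and formal classes I learned research methods and the right attitude toward research. His patient guidance always benefited me greatly. Whether in building theory, in research intuition, or in observing real systems, his insight was always deep and thorough, and I admire him deeply.
Next, I thank Professor 高成炎, who was my master's advisor and who also gave me a great deal of guidance and support during my doctoral studies, both in research direction and in the field of bioinformatics.
I also thank Professor 許聞廉 of the Institute of Information Science, Academia Sinica, who in the first years of my doctoral research inspired my enthusiasm for and confidence in natural language research, and who gave me ample room to explore together with careful guidance.
Several other people gave me special help, including Professor 洪炯宗 of National Central University, Professor 楊進木 of National Chiao Tung University, and Director 振剛. Their help and guidance on my research papers are among the reasons this thesis could be completed smoothly.
Next, I thank my classmates: 林耀仁, 杜協昌, 黃光璿, 謝育平, 劉文俊, 潘家煜, 傅國長, 陳宏杰, 陳必衷, 余禎祥, 黃子葵, 賴勝華, 洪智瑋, 劉秉涵, 陳瑞呈, 徐代昕, 陳詩沛 and 陳耀將. I thank them for their help and care during my years of study, and I hope they all have a bright future.
Special thanks go to my classmate 謝育平, who offered many valuable suggestions in the final stage of this thesis. It took its present form through our many discussions, and I benefited greatly.
I also thank the successive research assistants, 胡純毓, 張慶瑞, 許玉霜, 鐘淑微 and 梁素瑜. Without their efforts, the laboratory's results could not have accumulated, and we would not have such an excellent research environment.
Finally, I thank my father and mother, who supported me firmly throughout my doctoral studies and helped me get through the ups and downs along the way. I feel deeply guilty that I could not fulfill my duty of caring for them during these years. I also thank my elder brother; these years have truly been hard on you.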
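Because computers cannot easily understand natural language documents, there is a semantic gap between humans and machines. For an XML retrieval system, the semantic gap can be divided into the gap on the query side and the gap on the document side. The query-side gap arises mainly because structured query languages are hard to write, while the document-side gap arises because computers cannot understand XML documents. To address the semantic gap, this thesis proposes a knowledge representation method centered on the slot-tree (Slot-Tree Ontology) and uses it to address the semantic gap in XML document retrieval systems.
Because constructing a slot-tree is not easy, we developed a data mining algorithm (Slot-Mining Algorithm) that extracts slot-trees automatically from a collection of XML documents. The method statistically analyzes the correlation coefficients between semantic tags and words in order to find characteristic words to fill into the slots, and thus builds the slot-tree automatically, which makes slot-tree construction easier.
We tested the XML retrieval system on two real cases, the Taiwan Butterfly Digital Museum and the Protein Information Resource. We found that the system retrieves XML documents more accurately and organizes the retrieval results for browsing. In addition, the automatic slot-tree construction program effectively fills characteristic words into the slots, although manual revision is still needed to improve the quality of the slot-tree.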
One obstacle to building smart computers is that computers cannot understand natural language as well as humans do. This is called the semantic gap between human and computer. For XML retrieval systems, the semantic gap lies on both the query side and the document side. The semantic gap on the query side is due to the difficulty humans have in writing structured queries. The semantic gap on the document side is due to the difficulty computers have in understanding XML documents. In order to reduce the semantic gap, we design an XML retrieval system based on a notion of slot-tree ontology.
Since constructing a slot-tree is not an easy job, we also develop a slot-mining algorithm to construct the slot-tree automatically. Our slot-mining algorithm is a statistical approach based on correlation analysis between tags and words. The highly correlated terms are filled into the slot-tree as values. This algorithm eases the construction of the slot-tree.
Two XML collections, one on butterflies and another on proteins, are used as test beds for our XML retrieval system. We found that our XML retrieval system is easy to use and performs well in retrieval effectiveness and quality of browsing. Furthermore, the slot-mining algorithm can fill important words into each slot. However, the mining results should be modified manually in order to improve the quality of the slot-tree.
Finally, we summarize our contributions to XML retrieval and compare our methods with some other methods. A qualitative analysis is given in the last chapter. We also suggest directions for further research.
XML Retrieval - A Slot-Filling Approach
Ph.D. Dissertation
23 July 2002
Contents
Part 1 : Tutorial of This Thesis
1 Introduction
1.1 Motivation
1.2 Research problems
1.3 Research approaches
1.4 Outline of this thesis
Part 4 : Conclusions
7 Conclusions and Contributions
7.1 Comparison
7.2 Contributions
7.3 Discussion and future work
References
1 Introduction
This thesis introduces an information retrieval (IR) method for XML. One big problem for information retrieval is that computers cannot understand documents as well as people do. This problem is called the semantic gap problem. Our goal is to build an information retrieval system that reduces the semantic gap between human and computer on XML. Our approach is to use an ontology to help the search process for XML, including querying, retrieval and browsing. This thesis opens with our motivation in section 1.1. Our research problems are stated in section 1.2. Our research approaches are described in section 1.3. An overview of this thesis is given in section 1.4.
1.1 Motivation
Extensible Markup Language (XML) [XML98] is a standard for encoding semi-structured documents. XML is useful for data representation, data exchange and data publishing on the web. Many people believe that XML will become a widespread standard in the future. For this reason, XML has gained much attention both in the information community and in the field of database research.
XML is a markup language with extensible tags. Anyone may define their own markup language based on XML. In fact, hundreds of specifications based on XML were proposed from 1997 to 2002. These specifications are designed to fulfill the needs of particular domains or applications. For example, Protein Information Resource (PIR) (http://pir.georgetown.edu/) is an XML collection designed to record data about proteins. UDDI [UDDI00] is an XML specification designed to record the profiles of business companies.
XML is designed to be easily understood by both human and computer. XML is encoded in text format so that humans can read and understand it easily. Tags in XML provide semantic background for computers to “understand” the content correctly. XML can thus serve as a bridge between human writing and computer understanding.
A smart computer program that understands XML documents would be useful. However, building a computer program to “understand” XML documents is still very difficult. In this thesis, we propose methods for computers to “understand” XML documents.
The natural language processing (NLP) community has focused on the processing and understanding of natural language documents for a long time [Grosz86]. However, understanding natural language documents is still very difficult for computer programs. No existing approach is powerful enough to solve the understanding problem. Building a smart computer program to understand natural language texts is very difficult because of the “semantic gap”. The semantic gap is described as follows.
The semantic gap causes difficulties for information retrieval systems. For example, an information retrieval system cannot understand our natural language queries, and so it retrieves many documents that are not semantically related to our queries.
There are two semantic gaps for an information retrieval system, one in understanding queries and another in understanding documents. These gaps are listed as follows.
In order to reduce the semantic gap, researchers in the NLP community have been trying hard to answer the following question.
“How can we make computers understand natural language?”
However, natural language is still too difficult for computer programs to understand. Although many people have devoted themselves to the problem for more than thirty years, designing a computer program to understand natural language remains an open research problem.
Computers do not understand natural language well. Why don't we design a structured language that is easy for computers to understand and easy for humans to write? If we can design such a language, then we have a common language between human and computer. People may write documents in this language, and we may then build computer programs to understand those documents.
XML is such a language in that it is easy for humans to write. However, we have no method for computers to understand XML documents easily. If we can design such a computer program, we may reduce the semantic gap for XML, so that XML may serve as a bridge between human and computer.
In this thesis, our goal is to reduce the semantic gap on XML. Our approach is to design methods for computers to understand XML documents. Our research problem is described in the next section.
1.2 Research problems
Our goal is to design an XML retrieval system that resolves the “human-computer dilemma of XML”. For an XML retrieval system, there are two semantic gaps between human and computer, one on the query side and another on the document side. Figure 1.2 shows these two gaps.
Figure 1.2 : Semantic gaps of XML
On the document side, an XML document may be easy for humans to write but not so easy for computers to understand. This is the case for an XML document that contains a lot of natural language text. Example 1.1 shows an XML document that contains natural language text in the “color” block and the “size” block. It is not so easy for computers to understand this XML document.
Example 1.1 : An XML document that is not easy for computer to understand
<butterfly name="kodairai">
<color>with black wing and white spots on it</color>
<size>middle size butterflies, from 50mm to 60mm</size>
</butterfly>
On the contrary, an XML document may be easy for computers to understand but not so easy for humans to read and write. This is the case for an XML document that marks up each word. Example 1.2 shows an XML document that is not easy for humans to read and write.
Example 1.2 : An XML document that is not easy for human to read and write
<butterfly name="kodairai">
<color><wing>black</wing><texture>white spot</texture></color>
<size>
<classification>middle size</classification>
<from>50mm</from><to>60mm</to>
</size>
</butterfly>
The same thing happens on the query side. An XML query may be easy for humans to write but not so easy for computers to understand. This is the case for an XML query written in natural language. Example 1.3 shows an XML query that is not so easy for computers to understand.
Example 1.3 An XML query that is not easy for computer to understand
On the contrary, an XML query may be easy for computers to understand but not easy for humans to read and write. This is the case for a structured XML query. Example 1.4 shows an XML query that is not so easy for humans to read and write.
Example 1.4 : An XML query that is not easy for human to read and write
for $b in //butterfly
where $b/color = "black" and $b/texture = "white spots"
return $b
Two approaches may be used to reduce the semantic gap between human and computer on XML. The first approach is to build computer programs that understand XML documents or queries. The second approach is to build tools that help humans write XML documents or queries.
We adopt the first approach on the document side and the second approach on the query side. That is, we build a computer program to understand roughly tagged XML documents, and we build a tool that lets humans write XML queries easily. The following section shows our approach.
1.3 Research approaches
On the document side, we build a computer program to understand XML documents. The “understanding” process is based on an ontology called the slot-tree. The slot-tree is a frame-like representation with embedded XPath [XPATH99] expressions. In order to make computers understand XML documents, we designed a slot-filling algorithm that maps XML documents into the slot-tree.
On the query side, we build a query interface that lets humans write queries easily. The interface is built by transforming the ontology into a web page. Users may write structural queries just by choosing or typing values into slots.
In our approach, the slot-tree ontology is a key component for both understanding documents and building queries. The slot-tree ontology mediates between queries and documents in the retrieval process, reducing the semantic gaps on both the query side and the document side.
However, it is not an easy job to build the slot-tree ontology. The ontology constructor needs tools to build it. The problem of constructing a slot-tree automatically from a set of XML documents is called the slot-mining problem. It is described as follows.
1.4 Outline of this thesis
Part 1 sets the stage for all the others. Chapter 1 outlines the research problems and approaches. Chapter 2 reviews the background literature for our research, “Designing an XML retrieval system to reduce the semantic gap problem”.
Part 2 is a detailed description of our methods. Our methods are based on a knowledge representation structure called the slot-tree. The slot-tree is used to capture the semantics of XML documents. It helps our XML retrieval system understand XML documents.
Chapter 3 presents the syntax and semantics of the slot-tree ontology, and a method called the slot-filling algorithm that uses the slot-tree to capture the semantics of XML documents. Chapter 4 outlines an XML information retrieval system based on the slot-tree. The slot-tree ontology and slot-filling algorithm are used to reduce the semantic gap of XML retrieval. Chapter 5 shows the process of constructing a slot-tree ontology. The steps of constructing a slot-tree are outlined. After that, a method that constructs the slot-tree automatically is proposed. The method is a statistical program called the slot-mining algorithm. The slot-mining algorithm mines slot-trees from XML documents based on correlation analysis between tags and terms. It helps people construct the slot-tree ontology for a given XML collection.
Part 3 presents test beds for the slot-tree based approach. Two cases are used to test it. Chapter 6 shows the first case, an XML collection about butterflies. The collection is a set of XML documents in Chinese about butterflies in Taiwan. Chapter 7 shows the second case, the Protein Information Resource (PIR). PIR is a large set of XML documents released by Georgetown University. The experiments on these two cases are used to analyze the strengths and weaknesses of the slot-tree based approach.
Part 4 is the conclusion. Chapter 8 analyzes the strengths of the slot-tree based approach. We compare the slot-tree based methods with some other XML retrieval methods, and point out our contributions, conclusions and future work.
2 Background – XML and Information Retrieval
In chapter 1, we introduced our motivation, goals and research approaches. Briefly, we would like to build an XML retrieval system that reduces the semantic gap between human and computer on XML. In this chapter, we survey related research in order to provide background knowledge for our work. Since our approach uses a slot-tree ontology to help the XML retrieval process, we survey the topics of XML, information retrieval and ontology.
In section 2.1, we focus on XML and survey the related specifications and technologies. In section 2.2, we survey information retrieval technologies. After that, we survey the current status and state of the art of XML retrieval in section 2.3. Finally, we outline the relationship between ontology and XML retrieval in section 2.4.
2.1 XML
We have to understand XML in order to build an XML retrieval system that reduces the semantic gap. In this section, we survey the XML-related specifications and technologies, especially the literature on knowledge representation and information retrieval.
The third line, with a “phonebooks” tag, is the root node of this XML document. An XML document has one and only one root node. In this line, xmlns=“http://www.ntu.edu.tw/phonebook” declares the default namespace of this XML document. Namespaces [XMLNS99] in XML are used to distinguish tags with the same name from each other, so that people can define their own tags and use others' tags without having to worry about the same tag name being used with different meanings.
A node in XML contains a tag, attributes and text. In the example above, “phonebooks”, “people”, “name” and “tel” are tags, “xmlns” and “id” are attributes, and “http://www.ntu.edu.tw/phonebook”, “Johnson Chen” and “02-34134345” are text parts.
XPath [XPATH99] is a specification used to locate nodes in XML documents. If we would like to locate all the “people” nodes, we may use the XPath expression “//people”. The “//” operator matches all descendant nodes. If we would like to locate the “people” node with id = “001”, we may use the XPath expression “//people[@id='001']”. The “@” symbol indicates that “id” is an attribute name. XPath is used in the slot-tree ontology that is discussed in chapter 3. We embed XPath expressions into the slot-tree to locate nodes in XML, and use them to map XML documents into the slot-tree ontology.
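The two XPath expressions can be tried with a minimal Python sketch using the standard `xml.etree.ElementTree` module, which supports a subset of XPath. The phonebook document below is hypothetical: it is rebuilt from the tag and attribute names mentioned in the text, the second entry is invented for illustration, and the default namespace is left out to keep the paths short.

```python
import xml.etree.ElementTree as ET

# Hypothetical phonebook document; only the first entry's values
# appear in the text, the second entry is made up for illustration.
PHONEBOOK = """
<phonebooks>
  <people id="001">
    <name>Johnson Chen</name>
    <tel>02-34134345</tel>
  </people>
  <people id="002">
    <name>Mary Lin</name>
    <tel>02-00000000</tel>
  </people>
</phonebooks>
"""

root = ET.fromstring(PHONEBOOK)

# ElementTree paths are relative to an element, so "//people" is written
# ".//people": all descendant "people" nodes of the root.
all_people = root.findall(".//people")

# The predicate [@id='001'] keeps only nodes whose "id" attribute is "001".
person_001 = root.findall(".//people[@id='001']")

names = [p.findtext("name") for p in all_people]
tel_001 = person_001[0].findtext("tel")
print(names, tel_001)
```

A slot that stores such a path could, in principle, be filled by evaluating the path against each document; how the slot-tree actually does this is the subject of chapter 3.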
Many XML-related specifications have been proposed since 1997. XML has become a widespread specification used in many domains and applications, such as “data exchange”, “data presentation”, “data querying” and “knowledge representation”. For data exchange, UDDI and ebXML are used to mediate the data exchange process between business enterprises. For data presentation, XSLT can be used to transform XML into HTML for presentation on the web. For data querying, XQL, XML-QL and X-Query are used to query data in XML documents. For knowledge representation, RDF/RDFS, DAML/DAMLS and XML topic maps have been proposed to represent knowledge in XML format. We survey specifications for data querying in section 2.3, which discusses XML query and retrieval, and specifications for knowledge representation in section 2.4, which discusses ontology.
The evolution of IR techniques is closely related to the structure of the target documents. Each time a new document structure is proposed, a new IR technique is developed. In 1970~1980, the Vector Space Model was developed to retrieve text documents. In 1990~1999, the Random Walk Model was developed to retrieve HTML documents. Today, XML documents are widespread, and many researchers are trying to develop new retrieval models for XML.
Text Retrieval
Text retrieval technology is almost as old as computer technology. There are many models for text retrieval. The best known is the Vector Space Model (VSM) [Salton75]. In this model, each document is represented by a k-dimensional vector of term weights. A plain text document d is expressed as follows.
$d = (d_{t_1}, d_{t_2}, \ldots, d_{t_k})$, where $d_{t_i}$ is the weight of term $t_i$ in document d
In the expression above, k equals the number of index terms in the collection. The order of words in the text is discarded.
A query is also represented by a k-dimensional vector of term weights. The query q may be represented as the following vector.
$q = (q_{t_1}, q_{t_2}, \ldots, q_{t_k})$, where $q_{t_i}$ is the weight of term $t_i$ in query q
The cosine coefficient is a popular measure of the similarity between a document and a query. It is defined as the cosine of the angle between the document vector d and the query vector q.
$$\mathrm{Similarity}(d, q) = \frac{d \cdot q}{|d| \, |q|} = \frac{\sum_{i=1}^{k} d_{t_i} q_{t_i}}{\sqrt{\sum_{i=1}^{k} d_{t_i}^{2}} \; \sqrt{\sum_{i=1}^{k} q_{t_i}^{2}}}$$
One question is how to set the weights $d_{t_i}$ and $q_{t_i}$ in the vector space model. “tfidf” is a simple and commonly used weighting function. The “tfidf” weight is defined as the product of term frequency (tf) and inverse document frequency (idf).
Term frequency (tf) : tf(t,d) : the number of occurrences of term t in document d.
Document frequency (df) : df(t) : the number of documents containing term t.
Inverse document frequency (idf) : idf(t) : the inverse of the document frequency df(t), so that a term occurring in fewer documents gets a higher weight.
The SMART system experiments led by Salton [Salton88] show that the “tfidf” term weighting function performed best among the 287 distinct combinations of term-weighting assignments tested. The “tfidf” weighting function has thus been shown empirically to be a good measure for the vector space model.
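The vector space model with “tfidf” weights and cosine similarity can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the text does not give a formula for idf, so the common form log(N / df(t)) is assumed here, and the three tiny documents and the query are invented.

```python
import math
from collections import Counter

def build_index(docs):
    """Collect the index terms and an idf value for each term."""
    terms = sorted({t for doc in docs for t in doc})
    n = len(docs)
    # df(t): number of documents that contain term t
    df = {t: sum(1 for doc in docs if t in doc) for t in terms}
    # Assumed idf form: log(N / df(t)); the text does not fix a formula.
    idf = {t: math.log(n / df[t]) for t in terms}
    return terms, idf

def vectorize(tokens, terms, idf):
    """Build the k-dimensional tf * idf vector of a document or query."""
    tf = Counter(tokens)  # tf(t, d): occurrences of t in d
    return [tf[t] * idf[t] for t in terms]

def cosine(d, q):
    """Cosine of the angle between two vectors (0.0 if either is all zero)."""
    dot = sum(x * y for x, y in zip(d, q))
    norm = math.sqrt(sum(x * x for x in d)) * math.sqrt(sum(y * y for y in q))
    return dot / norm if norm else 0.0

# Invented toy collection, already split into words.
docs = [
    ["black", "wing", "white", "spots"],
    ["yellow", "wing", "black", "border"],
    ["protein", "sequence", "alignment"],
]
query = ["black", "spots"]

terms, idf = build_index(docs)
doc_vectors = [vectorize(doc, terms, idf) for doc in docs]
query_vector = vectorize(query, terms, idf)
scores = [cosine(v, query_vector) for v in doc_vectors]
print(scores)
```

The first document shares both query words and scores highest, the second shares only the common word “black”, and the third shares nothing and scores zero.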
HTML Retrieval
The main issue of HTML retrieval is measuring the importance of a document. An HTML retrieval system retrieves the documents that match the query and then sorts them by importance. On the web, there are too many matching documents to read them all. The importance measure helps users decide what they should read.
Documents on the web differ from a text collection because of their hyperlink structure. The measure of HTML importance is based on hyperlink analysis. Historically, hyperlink analysis developed from citation analysis. A simple strategy to measure the importance of a web page is to count the number of hyperlinks that refer to it. A web page referenced by many other pages is considered important.
In 1998, a random walk model for weighting the importance of web pages was proposed [Brin98][Page98]. The random walk model was then used in the Google search engine. In the random walk model, a page is important if it is cited by many important pages. Formally, each web page p in the random walk model has a weight w(p). An iterative process recalculates w(p) in each iteration.
$$w(p) \leftarrow \sum_{q:(q,p)\in E} w(q)$$
Conceptually, the random walk model simulates a person clicking through web pages at random. The random walker chooses a web page at random as the start page. After that, he randomly clicks a link on the page and repeats the process on each page he reaches. In the random walk model, an important page is visited with high probability.
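The iteration can be sketched as follows. The code implements the recurrence exactly as printed above and adds one assumption of its own: after each step the weights are rescaled to sum to 1 so that they stay bounded. The published PageRank formulation [Page98] additionally divides each weight by the number of outgoing links of the citing page and adds a damping term; both are left out here. The three-page graph is invented.

```python
def random_walk_weights(pages, edges, iterations=50):
    """Iterate w(p) <- sum of w(q) over links (q, p), then rescale to sum 1."""
    w = {p: 1.0 / len(pages) for p in pages}
    for _ in range(iterations):
        new_w = {p: 0.0 for p in pages}
        for q, p in edges:          # link from page q to page p
            new_w[p] += w[q]
        total = sum(new_w.values())
        if total == 0:              # no links at all: keep the previous weights
            break
        # Rescaling is an assumption of this sketch, not part of the recurrence.
        w = {p: v / total for p, v in new_w.items()}
    return w

# Invented toy graph: a links to b and c, b links to c, c links back to a.
pages = ["a", "b", "c"]
edges = [("a", "b"), ("a", "c"), ("b", "c"), ("c", "a")]

weights = random_walk_weights(pages, edges)
ranking = sorted(pages, key=weights.get, reverse=True)
print(ranking)
```

Page c is cited by both other pages and ends up with the largest weight; page a follows because its only citation comes from the important page c.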
Kleinberg proposed a hub-authority model to weight the impact of a web page [Kleinberg98]. Web pages play two roles in this model, hub and authority. The hub-authority model is an iterative process. A hub page (h) is important if it points to many important authority pages. An authority page (a) is important if it is cited by many important hub pages.
Formally, each page d in the hub-authority model has two weights, the hub weight h(d) and the authority weight a(d). An iterative process recalculates h(d) and a(d) in each iteration. Figure 2.1 shows the concept of the hub-authority model.
A set of web pages (D) contains many hyperlinks (E). For each page d in D, h(d) is the hub weight of d, and a(d) is the authority weight of d. At first, we may set both h(d) and a(d) to 1/|D|, where |D| is the number of documents in D. After that, h(d) and a(d) are recalculated iteratively based on the following recurrence equations.
$$a(p) \leftarrow \sum_{q:(q,p)\in E} h(q)$$
$$h(p) \leftarrow \sum_{q:(p,q)\in E} a(q)$$
The hub-authority model is used to weight the importance of a web page and to decide whether a page is a hub or an authority. Besides weighting importance, it thus provides a mechanism to classify the type of a web page.
Both the hub-authority model and the random walk model use an iterative approach to decide the importance of a web page. Convergence analysis based on eigenvalues in linear algebra is used to analyze the behavior of the recurrence equations in these models. The papers of Kleinberg [Kleinberg98] and Page et al. [Page98] discuss the theory of these models further.
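The two hub-authority recurrences can be sketched in Python as below. The starting value 1/|D| follows the text; rescaling both weight vectors to unit length after every round is an assumption of this sketch that keeps the numbers bounded, and the four-page graph is invented.

```python
import math

def normalize(weights):
    """Scale a weight dictionary to unit Euclidean length."""
    norm = math.sqrt(sum(v * v for v in weights.values()))
    return {p: (v / norm if norm else 0.0) for p, v in weights.items()}

def hub_authority(pages, edges, iterations=50):
    """Return (hub, authority) weights after repeated application of the recurrences."""
    h = {p: 1.0 / len(pages) for p in pages}
    a = {p: 1.0 / len(pages) for p in pages}
    for _ in range(iterations):
        # a(p) <- sum of h(q) over links (q, p)
        new_a = {p: 0.0 for p in pages}
        for q, p in edges:
            new_a[p] += h[q]
        # h(p) <- sum of a(q) over links (p, q), using the fresh authority weights
        new_h = {p: 0.0 for p in pages}
        for p, q in edges:
            new_h[p] += new_a[q]
        a = normalize(new_a)
        h = normalize(new_h)
    return h, a

# Invented toy graph: h1 and h2 are link pages, x and y are content pages.
pages = ["h1", "h2", "x", "y"]
edges = [("h1", "x"), ("h1", "y"), ("h2", "x")]

hub, authority = hub_authority(pages, edges)
best_hub = max(pages, key=hub.get)
best_authority = max(pages, key=authority.get)
print(best_hub, best_authority)
```

Pages with a high hub weight and a near-zero authority weight behave as hubs, and the reverse holds for authorities, which is the classification mechanism mentioned in the text.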
2.3 XML Querying and Retrieval
In order to manage XML documents, the database community and the IR community have recently focused on research into storing, indexing, querying and retrieving XML documents. For storing, database management systems are extended to support XML documents. One way is to extend a relational database system to store XML documents; another is to store XML documents in an object-oriented database (OODB) system. For indexing, Patricia tries and inverted files are used to index XML documents. For querying, several XML query languages have been proposed to retrieve XML nodes. For searching, several systems have been designed to search XML documents. In this section, we focus on the survey of XML query languages and XML retrieval systems.
Querying an XML collection is like querying a database. In a relational database, we usually query tables with the “SQL” language. The following example shows a query that retrieves the names and birthdays of United States presidents.
An XML query language has to retrieve nodes in the tree of XML nodes. The following example shows an X-Query query that retrieves the names and birthdays of United States presidents.
for $p in //people
let $n := $p/name, $b := $p/birthday
where $p/job = "president" and $p/nation = "US"
return ($n, $b)
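To make the four clauses of the query concrete, the same selection can be written as a plain loop in Python with `xml.etree.ElementTree`. This is a rough equivalent for illustration, not an X-Query engine. The document layout (a root element holding “people” nodes with “name”, “birthday”, “job” and “nation” children) is assumed from the query, and the third entry is an invented placeholder.

```python
import xml.etree.ElementTree as ET

# Assumed document layout; "Jane Doe" is an invented placeholder entry.
PEOPLE = """
<directory>
  <people>
    <name>George Washington</name><birthday>1732-02-22</birthday>
    <job>president</job><nation>US</nation>
  </people>
  <people>
    <name>Abraham Lincoln</name><birthday>1809-02-12</birthday>
    <job>president</job><nation>US</nation>
  </people>
  <people>
    <name>Jane Doe</name><birthday>1970-01-01</birthday>
    <job>engineer</job><nation>US</nation>
  </people>
</directory>
"""

root = ET.fromstring(PEOPLE)
results = []
for p in root.iter("people"):                       # "for": every people node
    n = p.findtext("name")                          # "let": bind name
    b = p.findtext("birthday")                      # "let": bind birthday
    if p.findtext("job") == "president" and p.findtext("nation") == "US":  # "where"
        results.append((n, b))                      # "return"
print(results)
```

The loop makes visible what a user has to know in order to write the structured query: the tag names, their nesting, and the exact values stored in them.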
XML-GL is a graphical notation used to query XML documents. Figure 2.2 shows an example that retrieves orders that ship books with the title “Introduction to XML” to Los Angeles.
Figure 2.2 An example of XML-GL
Lore was a pioneering research project on XML retrieval at Stanford University. In this project, an object-oriented database was used to store XML documents, and the XML query language “Lorel” was developed. In addition, a query interface, “DataGuider”, was developed to query XML documents. Figure 2.3 is a screenshot of the DataGuider system.
Figure 2.3 The query interface of DataGuider system
XYZfind is a commercial system that splits the querying process into four steps. The following figures show the retrieval steps of the XYZfind retrieval system.
The natural language processing community has been trying to resolve the semantic gap problem for a long time. Natural language understanding is a field that focuses on building computer programs to “understand” natural language text [Grosz86] [Allen94]. However, the word “understanding” as used here is misleading. Computers do not really understand natural language text as humans do. Calculation and symbolic reasoning are what computers can do. Computers “understand” natural language text by mapping the text into an internal representation. The internal representation guides the computer in symbolic reasoning, so that it acts as if it knew the meaning of the text.
Alan Turing designed the Turing test [Turing50] to test whether a computer understands natural language text. For information retrieval, we adopt a definition similar to the Turing test. If a computer program retrieves what we want, discards what we do not want, and organizes the retrieval result into a form we like to browse, then we say the program understands the documents and our queries. A computer that does what we want it to do is a smart computer. A retrieval system that retrieves only what we want and organizes the result the way we like is a smart retrieval system.
A data structure called an ontology, which represents the concepts in the human mind, is used in the process of understanding. Generally speaking, understanding is the process of mapping natural language text into an ontology. After the mapping, the computer can act based on it. This is the style of computer “understanding”.
An ontology may be represented in different structures. The research topic that focuses on the structure of ontologies is called knowledge representation [Brachman85a]. Roughly speaking, there are two approaches to representing knowledge and ontologies, the logic-based approach and the object-based approach. We introduce and compare these two approaches here, as they are the basis of our slot-tree ontology that is discussed in chapter 3.
The logic-based approach encodes knowledge as logic statements for reasoning, using propositional logic, first-order logic, probabilistic logic and so on. Prolog is the best-known programming language based on logic.
is(butterfly, insect)
is(insect, animal)
∀x, y, z is(x,y) ∧ is(y,z) → is(x,z)
The power of first-order logic lies in its ability to support monotonic reasoning. “Monotonic reasoning” means that any conclusion made will never be retracted in the future. The 100% certainty of facts, rules and conclusions must be assured in the first-order logic reasoning process. The following example shows a reasoning process for the example above. The reasoning process infers that “butterfly is a kind of animal”.
∀x, y, z is(x,y) ∧ is(y,z) → is(x,z) (bind x to butterfly, y to insect, z to animal)
-----------------------------------------------------------------------------------------------
is(butterfly,insect) ∧ is(insect,animal) → is(butterfly,animal)
-----------------------------------------------------------------------------------------------
conclusion : is(butterfly, animal)
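The derivation can be reproduced mechanically. The following Python sketch stores each fact is(x, y) as a pair and applies the transitivity rule repeatedly until nothing new can be derived. It is a naive forward-chaining loop written for this illustration, not how a Prolog system evaluates rules.

```python
def infer_is(facts):
    """Apply is(x,y) and is(y,z) => is(x,z) until no new fact is derived."""
    known = set(facts)
    changed = True
    while changed:
        changed = False
        for (x, y) in list(known):
            for (y2, z) in list(known):
                if y == y2 and (x, z) not in known:
                    known.add((x, z))   # new conclusion; facts are never removed
                    changed = True
    return known

facts = {("butterfly", "insect"), ("insect", "animal")}
closure = infer_is(facts)
print(("butterfly", "animal") in closure)
```

Because the loop only ever adds pairs, every earlier conclusion remains valid, which is the monotonic property described above.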
One difficulty is that many uncertain situations are encountered in the natural language understanding process. The 100% certainty required by first-order logic cannot always be assured. Probabilistic logic and fuzzy logic were developed to handle uncertainty. However, the monotonic property is lost in the uncertain reasoning process.
Having reviewed the logic-based approach, we now introduce the object-based approach. The object-based approach comprises a set of representation methods, including frames, semantic networks and scripts. Generally speaking, frames are used to represent the internal structure of an object, semantic networks are used to represent the relations between objects, and scripts are used to describe an active scenario involving many objects.
The frame was proposed by Minsky in 1975 [Minsky75] in the seminal paper "A framework for representing knowledge". A frame is a representation method that organizes knowledge into chunks. However, Minsky did not formalize the frame concept into a mathematical model; he explicitly argued in favor of staying flexible and nonformal. Later, some AI systems were built on the frame representation, such as the KL-ONE system [Brachman85b] and the KRL language [Bobrow77].
Generally speaking, a frame is a structure that describes the internal structure of an object. Frames are composed of slots (attributes) for which fillers (scalar values, references to other frames, or procedures) have to be specified or computed. A slot can be expressed as a tuple of the form (object, slot, filler). It is easy to transform these tuples into logic predicates of the form slot(object, filler).
A frame that inherits from another frame is called a sub-frame. The inheritance property may be expressed as the “is” relation between frames, in the form is(object, object). Inheritance organizes frames into a hierarchy. The frame concept, which organizes statements into object-based structures, is easy for humans to read and write. It was later adopted by object-oriented programming languages to make programs easier to write. The following example shows a frame for “kodairai”, a species of butterfly.
<object name="kodairai">
<is>butterfly</is>
<texture>eyespots</texture>
</object>
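The correspondence between a frame, its (object, slot, filler) tuples and the predicates slot(object, filler) can be shown with a short Python sketch. The helper names `frame_to_tuples` and `to_predicate` are made up for this illustration.

```python
import xml.etree.ElementTree as ET

FRAME = """
<object name="kodairai">
  <is>butterfly</is>
  <texture>eyespots</texture>
</object>
"""

def frame_to_tuples(frame_xml):
    """Turn each child element of a frame into an (object, slot, filler) tuple."""
    frame = ET.fromstring(frame_xml)
    obj = frame.get("name")
    return [(obj, slot.tag, (slot.text or "").strip()) for slot in frame]

def to_predicate(triple):
    """Write an (object, slot, filler) tuple as the predicate slot(object, filler)."""
    obj, slot, filler = triple
    return f"{slot}({obj}, {filler})"

tuples = frame_to_tuples(FRAME)
predicates = [to_predicate(t) for t in tuples]
print(predicates)
```

The “is” slot becomes the inheritance fact is(kodairai, butterfly), and every other slot becomes an ordinary attribute predicate.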
Semantic networks concentrate on categories of objects and the relations between them [Quillian66] [Wood75]. The basic idea of a semantic network is to draw graphs that represent the relationships between objects. In these graphs, a link may be represented as a tuple of the form (object, relation, object). It is easy to transform these tuples into logic predicates of the form relation(object, object).
Scripts are used to describe a scenario involving many objects [Schank77]. Steps in the scenario are described as lattices. A step may be triggered when its preceding steps are finished. For example, the following script shows the process of making a cup of coffee.
In fact, we may translate object-based representations into logic rules. The difference between logic-based and object-based representation lies in the organizing principle. Logic-based representation encodes knowledge as logic expressions, while object-based representation organizes these expressions into frames, semantic networks and scripts.
Reasoning is not a standardized part of object-based systems [Ifikes85]. The information stored in
frames has often been treated as the “database” of the knowledge system, whereas the control of
reasoning has been left to other parts of the system. The most popular and effective reasoning
mechanism for frames is production rules [Stefik83] [Kehler84]. Production rules are rules of the
form pattern/action. They are a subset of predicate calculus with an added prescriptive component
indicating how the information in the rules is to be used during reasoning. Whenever a pattern is
matched, the production system triggers the corresponding frame, and the action is performed to
advance the “understanding” process. After the pattern/action process, some values are filled
into frames as the conclusion. The reasoning process in an object-based system that maps natural
language text into the slot-tree ontology is what we call “the slot-filling process”.
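The pattern/action cycle can be sketched in a few lines of Python. This is only an illustration under our own assumptions: the two rules, the regular expressions and the function name `fill_frame` are hypothetical and are not taken from any of the cited systems.

```python
import re

# Hypothetical pattern/action rules: each pattern is a regular expression,
# each action names the frame slot and the value to fill into it.
RULES = [
    (re.compile(r"\beye\s*spots?\b", re.I), ("texture", "eyespots")),
    (re.compile(r"\bbrown\b", re.I), ("color", "brown")),
]

def fill_frame(text, rules=RULES):
    """Fire every rule whose pattern matches and collect the filled values."""
    frame = {}
    for pattern, (slot, value) in rules:
        if pattern.search(text):
            frame.setdefault(slot, []).append(value)
    return frame

frame = fill_frame("Brown background color, eye spots in white color")
print(frame)
```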
Both logic-based representation and object-based representation may be used to represent an
ontology and to reason over it. Reasoning is helpful but not necessary for
computers to understand natural language. However, computers need a process that maps natural language
text into the ontology in order to understand it.
The mapping process for XML documents is easier than the mapping process for natural language
documents, because tags provide semantic contexts that make mapping easier. In chapter
3, we will propose a slot-filling algorithm that maps XML documents into the slot-tree ontology in order to
reduce the semantic gap between human and computer on XML.
2.5 Discussion
In this chapter, we reviewed the research background of XML, information retrieval and
ontology. The technology of XML retrieval is not yet good enough and needs further
research. In fact, researchers in the information retrieval community have recently been working
hard to develop methods for XML retrieval.
In the summary of the ACM SIGIR 2000 workshop on XML and information retrieval, Carmel et al.
[Carmel00] discuss several unsolved problems for XML retrieval. We list these problems as
follows.
1. Using an XML query language is likely to improve precision. However, XML query
languages are not easy for people to use. How can they be made easier to use?
2. A heterogeneous XML collection contains documents that come from different
sources, so the tag names and document structures may be different and idiosyncratic.
How to retrieve heterogeneous XML documents?
3. XML is specified using Unicode. The tag names coming from different sources may be
given in different languages. Since a word can have more than one translation or even no
translation, how to find or make the appropriate translation is an interesting issue for
multilingual information retrieval. How to retrieve multilingual XML documents?
4. Browsing XML retrieval results should be better than browsing text documents. How to
organize the retrieval results for browsing? Is it the entire document, a part of the XML
tree, or perhaps a graph?
In this thesis, we try to resolve these problems by developing an XML retrieval system. The
system is mainly designed to reduce the semantic gap between human and computer. In this
system, we develop programs that make it easier for the computer to understand XML documents
and for humans to write queries and browse query results. These methods are based on an ontology
representation called the slot-tree. We describe these methods in the next part. In chapter 3, we
show how to represent a slot-tree and map XML documents into it. In chapter 4, we
show how to use the slot-tree ontology to help the XML retrieval process. In chapter 5, we
design a method to build the slot-tree automatically.
Part 2 : Slot-Tree Based Methods for XML Retrieval
Part 2 contains three chapters. In chapter 3, we will describe the syntax, semantics and
usage of slot-tree. In chapter 4, we will use the slot-tree to reduce the semantic gap in the XML
retrieval process. In chapter 5, we will show how to construct the slot-tree ontology, and design
a mining algorithm to build the slot-tree ontology automatically.
This chapter contains four sections. In section 3.1, we outline the structure of the slot-tree
ontology and its usage in the process of understanding XML documents. In section 3.2, we
describe the syntax and semantics of the slot-tree ontology. In section 3.3, we design the
slot-filling algorithm that maps XML documents into the slot-tree ontology, which is the core of
the understanding process. Finally, we discuss the slot-tree ontology and the slot-filling
algorithm in section 3.4.
3.1 Introduction
In this chapter, we design an object-based representation called the slot-tree ontology, and then use the
slot-tree to “understand” XML documents. As we said in section 2.4, the word “understand” used here
means the process of mapping text in XML into the slot-tree. This enables a computer to trigger the
corresponding procedure to do what the user wants, such as answering questions or retrieving
the documents the user is looking for. In this section we outline the slot-tree ontology and the
slot-filling algorithm that maps XML documents into it; we describe the slot-tree in detail in
section 3.2 and the slot-filling algorithm in section 3.3.
The slot-tree representation is an object-based approach that, like a frame, represents the internal
structure of objects. We surveyed object-based approaches to knowledge representation, including frames,
semantic networks and scripts, in section 2.4. Generally speaking, a frame is used to represent the internal
structure of an object, a semantic network is used to represent relations between objects, and a script is used
to represent scenarios that involve many objects. The object-based approach is conceptually consistent
with our notion of the world, because we perceive the world as composed of many objects. The
difference between a slot-tree and a frame is that a slot in a slot-tree contains a set of paths to locate nodes in
XML documents. A path in a slot is in the XPath format described in section 2.1. For example,
“//butterfly//color” is used to locate “color” nodes in the block of “butterfly”.
In our XML retrieval system, a slot-tree is encoded in XML format like the following
example.
Based on the slot-tree ontology, we design a slot-filling algorithm that maps
XML documents into the slot-tree ontology in the process of understanding. In the slot-filling
algorithm, a path in a slot is used, like a hand, to catch a block in XML, and a matching process
maps the content of the block into the slot. After the matching process, words that
match any value in a slot are filled into the slot. The filled slot-tree
is then used as the semantic structure of the XML document. We will show the details of the
slot-tree ontology in section 3.2 and the details of the slot-filling algorithm in section 3.3.
Definition 3.1 : A slot-tree is a tree (T) in which each node contains a tuple $(s, P_s, V_s)$, where
$s$ is the name of the slot, $P_s$ is a set of paths, and $V_s$ is a set of values. The name of a slot is a label that
uniquely identifies the slot. A path ($p$) in $P_s$ is a string in XPath format that is used to locate nodes in
XML documents. A value ($v$) in $V_s$ is a term that contains a set of semantically identical words or
patterns.
Figure 3.1 shows the structure of a slot-tree; the {p} in each node represents a set of paths and the
{v} in each node represents a set of values. For a slot-tree that represents the internal structure of an
object, a slot in the tree may be used to represent a property of the object, such as “color”, “shape”,
“texture” or “size”. A value in the slot is a possible value of the property. For example, “black” is a
possible value in the “color” slot.
Figure 3.1 : The structure of a slot-tree, in which each node is labeled s {p} {v}
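Definition 3.1 maps naturally onto a small recursive data structure. The following Python sketch is one possible in-memory form, given under our own assumptions: the class name `Slot`, its field names and the helper `slot_names` are hypothetical, and the two slots are illustrative only.

```python
from dataclasses import dataclass, field

@dataclass
class Slot:
    name: str                                      # s: unique label of the slot
    paths: list = field(default_factory=list)      # P_s: XPath strings
    values: list = field(default_factory=list)     # V_s: possible values
    children: list = field(default_factory=list)   # sub-slots of this slot

# An illustrative two-node slot-tree.
color = Slot("color", ["//butterfly//color"], ["black", "white", "brown"])
butterfly = Slot("butterfly", ["//butterfly"], [], [color])

def slot_names(slot):
    """Return the slot labels of the tree in depth-first order."""
    names = [slot.name]
    for child in slot.children:
        names.extend(slot_names(child))
    return names

print(slot_names(butterfly))
```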
A slot-tree can be encoded as an XML document in which each slot is encoded as a node with tag “s”.
The attribute “slot” of the node is the label of the slot. The attribute “path” contains a set of paths in XPath
format that encodes the {p} part of the slot. Each node with tag “v” is a value that encodes the {v} part of
the slot. Example 3.2 shows a slot-tree for butterflies in XML format and figure 3.2 shows the graph
representation of the example.
Formally, the syntax of a slot-tree is defined by the grammar in figure 3.3. A slot (S) contains a label
(NAME), a set of paths (P*) and a set of values (V*). The slot may also contain a set of sub-slots (S*). A
value (V) contains a label (NAME), a set of keys (KEY*) and a set of matching rules (R*).
The symbol “P” used in figure 3.3 is a path in the format of the XML Path Language (XPath). XPath
is a specification proposed by the World Wide Web Consortium (W3C) for locating nodes in XML documents.
The symbol “/” is used to match child nodes, and the symbol “//” is used to match nodes at any depth
inside the current node. A name prefixed with the “@” symbol denotes an attribute. Example 3.3 shows
several examples of XPath.
The path of example 3.3.a locates “color” nodes that are children of an “adult” node, where
the “adult” node is a child of the “butterfly” node. The path of example 3.3.b locates any
“color” nodes that are in the block of an “insect” node. The path of example 3.3.c locates any
“color” nodes that are in the block of an “insect” node whose attribute “type” has the value ‘butterfly’.
For more about XPath, please see the XPath specification at the following web page
- http://www.w3.org/TR/xpath.
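The three kinds of path can be tried out with the limited XPath subset of Python's standard `xml.etree.ElementTree` module. This is a sketch under stated assumptions: the sample document is invented for the example, the path strings are our reading of the descriptions above, and ElementTree requires relative paths, hence the leading ".".

```python
import xml.etree.ElementTree as ET

# A made-up document that exercises all three kinds of path.
doc = ET.fromstring(
    "<collection>"
    "<butterfly><adult><color>brown</color></adult></butterfly>"
    "<insect type='butterfly'><wing><color>white</color></wing></insect>"
    "<insect type='moth'><color>gray</color></insect>"
    "</collection>"
)

# "/" steps to children only: color under adult under butterfly.
child_colors = [e.text for e in doc.findall("./butterfly/adult/color")]
# "//" reaches any depth: every color inside any insect block.
insect_colors = [e.text for e in doc.findall(".//insect//color")]
# "[@type='butterfly']" restricts the insect nodes by attribute value.
typed_colors = [e.text for e in doc.findall(".//insect[@type='butterfly']//color")]

print(child_colors, insect_colors, typed_colors)
```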
A rule in the slot-tree is used to match a string in XML. The syntax of a rule (R) is further
defined by the grammar in figure 3.4. A rule may contain the “&” operator, the “|” operator and
the “-” operator. The symbol “E” is an expression that is part of a rule. Each expression is either
a single literal “L” or a pattern of the form “L..L”.
R → (R & R)
R → (R | R)
R → E
R → -E
E → L {..L}
Figure 3.4 : The grammar of rules in slot tree
The “&” operator is a logical “and”: a rule “R1 & R2” is satisfied if and only if both
R1 and R2 are satisfied. The “|” operator is a logical “or”: a rule “R1 | R2” is satisfied if
and only if R1 or R2 is satisfied. The “..” symbol in the syntax of “E” means a far connection:
a rule “L1 .. L2” is satisfied if an L1 string is followed by an L2 string within one sentence.
The following example shows several rules.
The rule of example 3.4.a matches a sentence like “a butterfly that is mixed of black and white
color”, or “a butterfly with white wing and black head”. The rule of example 3.4.b matches a
sentence such as “a butterfly with brown lines on wings”, but cannot match the sentence “a butterfly
with brown lines and white spots on wings”. The rule of example 3.4.c matches a sentence such
as “a butterfly with black color on head”, but cannot match the sentence “a butterfly that has green
head and black wings”.
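A small evaluator makes the rule semantics concrete. The sketch below is not the system's implementation: it represents rules as nested tuples instead of parsing the textual syntax, it assumes that “-E” means that the expression must not occur, and the three sample rules are hypothetical rules chosen to behave like the descriptions above.

```python
def match_expr(expr, sentence):
    """E -> L {..L}: the literals must occur left to right in the sentence."""
    position = 0
    for literal in expr.split(".."):
        literal = literal.strip()
        found = sentence.find(literal, position)
        if found < 0:
            return False
        position = found + len(literal)
    return True

def match_rule(rule, sentence):
    """A rule is ("&", r1, r2), ("|", r1, r2), ("-", e), or a plain string E."""
    if isinstance(rule, str):
        return match_expr(rule, sentence)
    op = rule[0]
    if op == "&":
        return match_rule(rule[1], sentence) and match_rule(rule[2], sentence)
    if op == "|":
        return match_rule(rule[1], sentence) or match_rule(rule[2], sentence)
    if op == "-":  # assumption: "-E" holds when E does not occur
        return not match_expr(rule[1], sentence)
    raise ValueError(f"unknown operator: {op}")

# Hypothetical rules in the spirit of the examples discussed above.
both_colors = ("&", "black", "white")
lines_only = ("&", "line", ("-", "spot"))
black_head = "black..head"

print(match_rule(black_head, "a butterfly with black color on head"))
```

Under these assumptions, `black_head` accepts “a butterfly with black color on head” but rejects a sentence in which “head” appears only before “black”.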
3.3 Slot-Filling Algorithm
A slot in a slot-tree is a container that may contain several fillers. The filler can be a value of a
sub-slot. A slot-filling algorithm is a method for mapping fillers into slots. In this section, we
describe how to map an XML document into the slot-tree ontology.
One simple way to fill a value into the corresponding slot is by copying. A copy-slot is a slot with the
attribute (type=“copy”). The copy-slot is used to extract a value from a specified field. In the
slot-filling process of example 3.3, the value “Athyma_fortuna_kodairai” is filled into the “name” slot in
example 3.4 just by copying.
Another way to fill values into slots is by keyword matching. A value is filled into a slot if the
value matches a sentence in the target XML document. The following example shows the process of
matching the “spotted” value in the “texture” slot to the “color” nodes in an XML document.
Texture Slot :
<s slot="texture" path="//butterfly//texture">
  <v value="single color" keys="single, mono, uniform"/>
  <v value="spotted" keys="spot"/>
  <v value="lines" keys="line"/>
</s>
Algorithm Slot-Filling(d, T)
    SV = {}
    for each s in T
        d_s = {c | (s, p) ∈ M(T), (p, c) ∈ d}
        for each v in s
            if w(v, d_s) > ε then put (s, v : w(v, d_s)) into SV
        end for
    end for
    return SV
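The pseudocode can be made executable under a few simplifying assumptions, all of which are ours: the document d is a list of (path, content) pairs, the mapping M(T) is stored as a list of literal paths per slot (compared by equality rather than evaluated as XPath), the weight w(v, d_s) is the number of key occurrences, and ε defaults to 0. The function and variable names are hypothetical.

```python
def slot_filling(document, slot_tree, epsilon=0.0):
    """document: list of (path, content) pairs.
    slot_tree: {slot: {"paths": [...], "values": {value: [keys]}}}.
    Returns a list of (slot, value, weight) triples."""
    filled = []
    for slot, spec in slot_tree.items():
        # d_s: contents of the blocks reached through the paths of this slot
        contents = [c.lower() for p, c in document if p in spec["paths"]]
        for value, keys in spec["values"].items():
            # w(v, d_s): here simply the number of key occurrences (assumption)
            weight = sum(c.count(k) for c in contents for k in keys)
            if weight > epsilon:
                filled.append((slot, value, weight))
    return filled

texture_slot = {
    "texture": {
        "paths": ["butterfly/adult/texture"],
        "values": {
            "single color": ["single", "mono", "uniform"],
            "spotted": ["spot"],
            "lines": ["line"],
        },
    }
}
document = [
    ("butterfly/adult/texture", "There are some eye spots in each wing"),
    ("butterfly/adult/color", "Brown background color, Eye spots in white color"),
]
result = slot_filling(document, texture_slot)
print(result)
```

With this input only the “spotted” value is filled, because “spot” is the only key that occurs in the texture block.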
3.4 Discussion
In this chapter, we have described the slot-tree ontology in section 3.2 and the slot-filling
algorithm in section 3.3. The slot-filling algorithm is used to map XML documents into the
slot-tree ontology in the understanding process. In chapter 4, we will use the slot-tree and the
slot-filling algorithm to develop an ontology-based XML retrieval method, and use that method to
reduce the semantic gap between human and computer.
4 An Ontology-Based Approach for XML Querying, Retrieval and Browsing
In the previous chapter, we showed the slot-tree ontology and its usage. The slot-filling
algorithm builds a mapping between the slot-tree and XML documents. This
mapping helps our XML retrieval system reduce the semantic gap between human
and computer. In this chapter, we outline the relationship between our XML retrieval
system and the slot-tree ontology, and show the power of the slot-tree.
In section 4.1, we describe the process of our XML retrieval system and outline the
important components of the system. We describe how to represent an XML document
for retrieval in section 4.2, and describe the index structure in section 4.3. After that, the query
interface is described in section 4.4 and the ranking strategies are described in section 4.5. We
then show how to organize retrieval results for browsing in section 4.6. Finally, we
discuss our XML retrieval system in section 4.7.
4.1 Introduction
Two technologies are needed in the process of searching for documents: retrieving and browsing.
Retrieving is the process of fetching documents from a collection. After that, the retrieved documents
should be organized for browsing. Browsing is the process of reading and traversing the collection of
documents. We usually alternate between retrieving and browsing in a search process. A
model that integrates retrieving and browsing may be used to improve the quality of searching.
Our research focuses on using ontology to improve the XML retrieval and browsing process. We
will focus on the following questions in this chapter.
Figure 4.1 shows a scenario of our approach to retrieving XML documents. First, a user builds a
query by clicking on or typing into slots in the query interface, and then submits the query to the XML
retrieval system. The retrieval system retrieves XML documents, and then summarizes them for the user to browse.
Figure 4.1 : A scenario of our XML retrieval system
The ontology in figure 4.1 is the slot-tree ontology described in chapter 3. It is the core of our XML
retrieval system. The slot-tree ontology is used to build the query interface, retrieve documents and
summarize the retrieved documents for browsing. The XML queries, XML documents and query interface
are the important objects in our system. Retrieval and extraction are the important processes in our system.
We will introduce these objects and processes in this chapter.
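<butterfly about="Athyma_fortuna_kodairai.jpg">
  <adult>
    <texture>There are some eye spots in each wing</texture>
    <color>Brown background color, Eye spots in white color</color>
    <size>Middle size, 50-60mm</size>
  </adult>
  <geography>
    <taiwan>North-Taiwan, 1000-2000meters mountain area</taiwan>
    <global>Central China Area</global>
  </geography>
</butterfly>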
Figure 4.2 : An XML document of butterfly
For conceptual simplicity, the XML example above is expressed as a sequence of (path,
value) pairs that describe the object.
(butterfly, )
(butterfly@about, Athyma_fortuna_kodairai.jpg)
(butterfly\adult, )
(butterfly\adult\texture, There are some eye spots in each wing)
(butterfly\adult\color, Brown background color, Eye spots in white color)
(butterfly\adult\size, Middle size, 50-60mm)
(butterfly\adult, )
(butterfly\geography, )
(butterfly\geography\taiwan, North-Taiwan, 1000-2000meters mountain area)
(butterfly\geography\global, Central China Area)
(butterfly\geography, )
(butterfly, )
The (path, value) expression can be thought of as an object concept model. A “path” specifies a
property of an object, and a “value” specifies a value for that property. The object concept model above is a
binary relation that may be expressed as path(object, value); that is, a path represents a logical predicate with
two arguments. An object in this model is expressed as a set of (path, value) pairs.
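The flattening of an XML document into a (path, value) sequence can be sketched with Python's standard `xml.etree.ElementTree` module. The function name `flatten` is hypothetical, and the sketch ignores mixed content (text that sits next to child elements); it follows the convention of the listing above, in which a block contributes one entry when it opens and one when it closes.

```python
import xml.etree.ElementTree as ET

def flatten(element, prefix=""):
    """Return the (path, value) sequence of an element in document order."""
    path = prefix + element.tag
    text = (element.text or "").strip()
    if len(element) == 0 and not element.attrib:
        return [(path, text)]                       # leaf node: a single pair
    pairs = [(path, "")]                            # the block opens
    pairs += [(f"{path}@{k}", v) for k, v in element.attrib.items()]
    for child in element:
        pairs += flatten(child, path + "\\")
    pairs.append((path, ""))                        # the block closes
    return pairs

xml_text = """<butterfly about="Athyma_fortuna_kodairai.jpg">
<adult><texture>There are some eye spots in each wing</texture></adult>
<geography><global>Central China Area</global></geography>
</butterfly>"""
pairs = flatten(ET.fromstring(xml_text))
for pair in pairs:
    print(pair)
```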
Storing Structure
The (path, value) representation does not reflect the tree structure of an XML document. In order to
represent the tree structure, we use a pair of indices to mark the beginning and end of each block. In other
words, we extend each (path, value) pair with a (begin, end) pair that records the begin node and end
node of each block. The butterfly example above is expressed as the following structure.
1, 12 (butterfly, )
2, 2 (butterfly@about, Athyma_fortuna_kodairai.jpg)
3, 7 (butterfly\adult, )
4, 4 (butterfly\adult\texture, There are some eye spots in each wing)
5, 5 (butterfly\adult\color, Brown background color, Eye spots in white color)
6, 6 (butterfly\adult\size, Middle size, 50-60mm)
7, 7 (butterfly\adult, )
8, 11 (butterfly\geography, )
9, 9 (butterfly\geography\taiwan, North-Taiwan, 1000-2000 meters mountain area)
10, 10 (butterfly\geography\global, Central China Area)
11, 11 (butterfly\geography, )
12, 12 (butterfly, )
In the example above, each node is preceded by a (begin, end) pair. The begin index of a node is
always identical to the ID of the node. A block with (begin, end) covers all nodes
between the begin node and the end node. For example, the first block “1,12 (butterfly,)” covers nodes
from 1 to 12, and the third block “3,7 (butterfly\adult)” covers nodes from 3 to 7. In this way, the tree
structure of XML is expressed as the cover/covered relations between nodes.
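The (begin, end) pairs can be computed from the flat sequence with a stack. The following Python sketch reproduces the numbering of the listing above; the function name `add_block_indices` is hypothetical, and the sketch assumes that every leaf node carries non-empty text, so that an empty value always marks the opening or closing entry of a block.

```python
def add_block_indices(pairs):
    """Attach a 1-based (begin, end) pair to each (path, value) entry."""
    spans = [[i, i] for i in range(1, len(pairs) + 1)]
    open_blocks = []                        # stack of (path, index) of open blocks
    for i, (path, value) in enumerate(pairs, start=1):
        if value != "" or "@" in path:
            continue                        # leaf or attribute: begin == end
        if open_blocks and open_blocks[-1][0] == path:
            _, begin = open_blocks.pop()    # closing entry of the innermost block
            spans[begin - 1][1] = i
        else:
            open_blocks.append((path, i))   # opening entry of a new block
    return [(b, e, p, v) for (b, e), (p, v) in zip(spans, pairs)]

pairs = [
    ("butterfly", ""),
    ("butterfly@about", "Athyma_fortuna_kodairai.jpg"),
    ("butterfly\\adult", ""),
    ("butterfly\\adult\\texture", "There are some eye spots in each wing"),
    ("butterfly\\adult\\color", "Brown background color, Eye spots in white color"),
    ("butterfly\\adult\\size", "Middle size, 50-60mm"),
    ("butterfly\\adult", ""),
    ("butterfly\\geography", ""),
    ("butterfly\\geography\\taiwan", "North-Taiwan, 1000-2000meters mountain area"),
    ("butterfly\\geography\\global", "Central China Area"),
    ("butterfly\\geography", ""),
    ("butterfly", ""),
]
indexed = add_block_indices(pairs)
for begin, end, path, value in indexed:
    print(begin, end, path)
```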
The begin-end pair structure fully reflects the hierarchical structure of XML documents. In
our XML storage system, we store the (begin, end) pairs in a table instead of storing a tree.
The following example is a simple XML document. We will show how to index it
for both text fields and number fields.
Indexing Text: The following table shows our inverted-file structure for the example above. The
inverted file is currently stored in a relational database.
#path, #term                        #object list
…                                   …
#\butterfly\adult\color, #brown     …, #kodairai, #…
…                                   …
#\butterfly\adult\texture, #spot    …, #kodairai, …
…                                   …
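An inverted file keyed by (path, term) can be sketched in memory as follows. This is an assumption-laden illustration, not the relational schema of our system: the function name `build_inverted_file` and the second object are hypothetical, and the sketch only lowercases and splits words, so it indexes “spots” where a stemming indexer would store “spot”.

```python
import re
from collections import defaultdict

def build_inverted_file(objects):
    """objects: {object_id: [(path, text), ...]}.
    Returns {(path, term): set of object ids}."""
    index = defaultdict(set)
    for object_id, pairs in objects.items():
        for path, text in pairs:
            for term in re.findall(r"[a-z]+", text.lower()):
                index[(path, term)].add(object_id)
    return index

COLOR = "\\butterfly\\adult\\color"
TEXTURE = "\\butterfly\\adult\\texture"
objects = {
    "kodairai": [
        (COLOR, "Brown background color, Eye spots in white color"),
        (TEXTURE, "There are some eye spots in each wing"),
    ],
    "hypothetical_species": [(COLOR, "Brown and orange")],
}
index = build_inverted_file(objects)
print(sorted(index[(COLOR, "brown")]))
```

Because the key contains the path, a query for “brown” in the color field does not retrieve objects that mention “brown” only in another field.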
Indexing Number : Traditional full-text indexing technology does not index numbers. In our system,
number indexing is important for the browsing process, because it lets us sort the search results in a
specified order based on the number index. In the indexing process, we extract numbers from XML
documents and put them into a number table as follows.
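The extraction step can be sketched with a regular expression. The row layout (object, path, number) and the function name `number_rows` are assumptions made for this example and do not describe the exact columns of our number table.

```python
import re

def number_rows(object_id, pairs):
    """Return (object_id, path, number) rows for every number in the values."""
    rows = []
    for path, text in pairs:
        for token in re.findall(r"\d+(?:\.\d+)?", text):
            rows.append((object_id, path, float(token)))
    return rows

rows = number_rows("kodairai", [("butterfly\\adult\\size", "Middle size, 50-60mm")])
print(rows)
```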
In our system, we design a program that transforms a slot-tree into an HTML-based query interface. A
template in Extensible Stylesheet Language Transformations (XSLT) is used to do the transformation.
A user may select a slot with one click, and then select a value in the slot or type keywords into the
slot. The user may also specify a field for sorting. A query is built and submitted to the XML retrieval
system when the user presses the submit button.
A query in our system is a filled slot-tree. The following example shows a query “find all
butterflies with broken wing and brown color”.
Ranking by Field
In order to organize the retrieval results for browsing, a user may specify the ranking
strategy. A user may specify any field by which to sort the results, just as in a database. A
field can be sorted numerically or alphabetically as strings, in either
increasing or decreasing order. This variety of ranking strategies gives users a way to
organize the retrieval results into a list for browsing.
Ranking by Importance
In section 2.2, we introduced how to measure the importance of a web page based on hyperlinks.
Hyperlinks in XML may be used to decide the importance of an XML document, too. In our XML retrieval
system, ranking by importance is the default ranking strategy. A simple way to measure the
importance of an XML document is to count the references to it. We use this strategy
in our system for simplicity. In the future, we will try to incorporate the random-walk model and the
hub-authority model to measure the importance of XML documents in our XML retrieval system.
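Counting references is straightforward to sketch. In the following Python example the document names, the link list and the function name `rank_by_references` are all hypothetical; ties are broken alphabetically only to make the order deterministic.

```python
from collections import Counter

def rank_by_references(documents, links):
    """documents: list of document ids; links: (source, target) references.
    Returns the documents ordered by decreasing number of incoming references."""
    counts = Counter(target for _, target in links if target in documents)
    return sorted(documents, key=lambda doc: (-counts[doc], doc))

documents = ["a.xml", "b.xml", "c.xml"]
links = [("a.xml", "b.xml"), ("c.xml", "b.xml"), ("b.xml", "c.xml")]
ranking = rank_by_references(documents, links)
print(ranking)
```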
Ranking by Similarity
For text retrieval, a ranking strategy based on the vector space model (VSM) and the TFIDF weighting
function performs well. A brief survey of VSM and TFIDF was given in section 2.3. However, an
XML object is not only a sequence of words like a text, but also contains many tags. For XML, we
extend VSM by attaching a path to each term; we call the result the Path Vector Space Model (PVSM).
An XML document (d) can be expressed as the following vector v(d).
$v(d) = (d_{p_1,t_1}, \ldots, d_{p_1,t_k}, \ldots, d_{p_n,t_1}, \ldots, d_{p_n,t_k})$, where $d_{p_i,t_j}$ is the weight of the $(p_i, t_j)$ pair in document object $d$
When several paths have similar meanings, we may cluster them into a slot for retrieval. The model after
path clustering is called the Slot Vector Space Model (SVSM).
$v(d) = (d_{s_1,t_1}, \ldots, d_{s_1,t_k}, \ldots, d_{s_n,t_1}, \ldots, d_{s_n,t_k})$, where $d_{s_i,t_j}$ is the weight of the $(s_i, t_j)$ pair in document object $d$
We may use the cosine-coefficient to measure the similarity between queries and documents in SVSM
just like in VSM.
$\mathrm{Similarity}(d, q) = \dfrac{d \cdot q}{|d| \, |q|}$
However, we do not know what kind of weighting function is good for measuring the value $d_{s_i,t_j}$. Is TFIDF
good enough in the SVSM, or do we need another measure? In our system, we express $d_{s_i,t_j}$ as the
product of $w_{s_i,t_j}$ and $tf_{s_i,t_j}$, where $tf_{s_i,t_j}$ is the term frequency of the term $t_j$ in slot $s_i$, and $w_{s_i,t_j}$ is the
weighting coefficient.
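The slot vectors and the cosine coefficient can be sketched with sparse dictionaries keyed by (slot, term). The function names `slot_vector` and `similarity` are hypothetical, and because the weighting coefficients are left open above, the sketch assumes a coefficient of 1.0 wherever none is supplied.

```python
import math
from collections import Counter

def slot_vector(filled, weights=None):
    """filled: list of (slot, term) occurrences.
    Returns {(slot, term): w * tf}; w defaults to 1.0 (assumption)."""
    weights = weights or {}
    tf = Counter(filled)
    return {key: weights.get(key, 1.0) * count for key, count in tf.items()}

def similarity(d, q):
    """Cosine coefficient between two sparse slot vectors."""
    dot = sum(weight * q.get(key, 0.0) for key, weight in d.items())
    norm_d = math.sqrt(sum(w * w for w in d.values()))
    norm_q = math.sqrt(sum(w * w for w in q.values()))
    return dot / (norm_d * norm_q) if norm_d and norm_q else 0.0

d = slot_vector([("color", "brown"), ("texture", "spot")])
q_same_slot = slot_vector([("color", "brown")])
q_other_slot = slot_vector([("color", "spot")])
print(similarity(d, q_same_slot), similarity(d, q_other_slot))
```

The second query shares the term “spot” with the document but in a different slot, so its similarity is zero; this is the behavior that distinguishes SVSM from plain VSM.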
A difficulty for retrieval systems today is that too many documents are retrieved. When there are too many
retrieval results to browse, a ranking strategy is used to present users with what they want. A user
may like to see large butterflies, important butterflies or butterflies that are similar to a query. The
variety of ranking strategies in XML provides ways for users to retrieve only what they like to browse.
We may use the slot-filling algorithm to extract values from the following XML document.
<butterfly about="Athyma_fortuna_kodairai.jpg">
  <adult>
    <texture>There are some eye spots in each wing</texture>
    <color>Brown background color, Eye spots in white color</color>
  </adult>
  <geography>
    <taiwan>North-Taiwan, 1000-2000meters mountain area</taiwan>
    <global>Central China Area</global>
  </geography>
</butterfly>
The slot-filling algorithm will fill values into slot-tree. The following example shows the result of
filling.
<s slot="butterfly" values="Athyma_fortuna_kodairai">
  <s slot="adult">
    <s slot="texture" values="spot"/>
    <s slot="color" values="brown"/>
  </s>
  <s slot="geography">
    <s slot="Taiwan" values="North"/>
    <s slot="Global" values="China"/>
  </s>
</s>
The result of the slot-filling algorithm is a filled slot-tree. For a human, it is easier to browse filled
slot-trees than to browse the source documents. The filled slot-tree is a well-organized summary of the
XML document.
4.7 Discussion
In this chapter, we designed an XML retrieval system to reduce the semantic gap between human
and computer. The slot-tree ontology and the slot-filling algorithm are used in our XML
retrieval system to understand XML documents. Based on the slot-tree, we designed a query
interface to reduce the semantic gap on the query side. The interface helps people write XML
queries easily. Based on the slot-filling algorithm, we designed the slot vector space model
(SVSM) to retrieve XML documents. The SVSM helps the computer understand XML
documents. Besides that, the slot-filling algorithm also helps the computer extract summaries from
XML documents for browsing. Our goal of reducing the semantic gap between human and
computer is almost achieved by using the slot-tree as the core representation.
We will study two cases of our XML retrieval system in chapter 6 and chapter 7. In
chapter 6, we use the domain of butterflies as an example. In chapter 7, we use the domain of
proteins as an example. For each domain, we will show the slot-tree, the query interface, the
retrieved results and the summary.
5 The Construction of Slot-Tree Ontology
We introduced the slot-tree ontology in chapter 3, and then showed an XML retrieval
system based on the slot-tree ontology in chapter 4. However, building a slot-tree ontology is not
an easy job. In order to reduce the effort of building the slot-tree ontology, we have developed the
slot-mining algorithm. The slot-mining algorithm is a statistical approach that mines a slot-tree
from XML documents. The algorithm is used to learn the slot-tree from a collection of XML
documents.
5.1 Introduction
The goal of text mining is to find important patterns in a text collection and organize these patterns
into an ontology. In this thesis, we use the ontology to help XML retrieval and browsing. Mining
technology may be used to help us in the construction of the slot-tree ontology. In this section, we
focus on the text-mining problem for XML.
The semi-structured property of XML is what makes the mining program work. For a given XML
collection, the distribution of a term depends heavily on the tags. For example, the following terms
show up more frequently in the <color> block than in the other blocks.
The problem of mining the important values for each slot is called the Slot-Mining Problem. We will
propose a mining algorithm that is based on a simple observation: the distribution of terms depends on
the tag. A term that shows up more frequently in a tag is likely to be a key value for the corresponding slot.
5.2 Background
The goal of text mining is to discover regularities in text data. A text-mining program induces rules
from text or learns a grammar from a corpus; these rules are used in the process of natural language
understanding and information extraction.
For natural language processing, the inside-outside algorithm is a popular tool for learning a probabilistic
context-free grammar (PCFG) from a tree-bank corpus. However, a tree-bank corpus is not easy to build;
building a tree-bank by hand is a time-consuming job. Other text-learning methods have been
developed to learn from plain text corpora. For example, link grammar is a simple head-driven grammar
developed to parse natural language sentences, and a learning algorithm has been developed to learn a
link grammar from a text corpus. Besides that, transducer learning induces finite-state
automata from a given text corpus. Learning a transducer is easier than learning a context-free grammar.
For information extraction, a wrapper is an algorithm that learns a simple grammar from structured text,
such as web pages. A wrapper induces rules for wrapping the documents. For example, a simple
wrapper may learn the prefix and postfix of each field from a collection of program-generated web
pages. We may then extract fields from web pages based on these prefixes and postfixes. A transducer may
also be used to learn extraction rules from a collection of web pages.
However, these methods learn the grammar of the input text; they do not learn an ontology from a
given document collection. In section 5.4, we propose a learning algorithm that mines a slot-tree
ontology from a given XML collection. The algorithm is called the slot-mining algorithm.
This algorithm is a tool that helps the domain-knowledge designer design the slot-tree
ontology. Before we present the slot-mining algorithm, we show in section 5.3 how a human
builds a slot-tree, in order to observe what is needed in designing such an
algorithm.
Listing all tags in this collection : an XML tag usually has a strong semantic meaning. For example, the
<color> tag represents the color of a butterfly. We may list all tags to understand the semantics of each
tag. For the simple butterfly collection, we list all tags as follows.
Identifying slots for this collection : Fortunately, these tags are not ambiguous.
The semantics of the tags are clear and definite, so we may build a slot for each tag.
Mapping slots to tags (or XPath) : For the simple butterfly collection, we can map each tag to one slot
directly. The following example shows the schema of the slot-tree.
Identifying values for each slot : In order to identify values for each slot, we have to read the data for
each slot. For example, if we read the data in the <color> tag, we may find that “black”, “white”,
“brown”, “orange”, “yellow”, “green”, “blue”, “purple” and “gray” are key values for this slot. We may fill
them into the value list of the color slot. After we fill in the values for each slot, the slot-tree
building process is finished. The following XML document shows a slot-tree for the simple butterfly collection.
<s slot="butterfly">
  <s slot="adult">
    <s slot="color">
      <v value="black"/><v value="white"/><v value="brown"/>
      <v value="yellow"/><v value="orange"/><v value="green"/>
      <v value="blue"/><v value="purple"/><v value="gray"/>
    </s>
    <s slot="texture">
      <v value="single color" keys="single, mono, uniform"/>
      <v value="spotted" keys="spot"/>
      <v value="lines" keys="line"/>
    </s>
    <s slot="size">
      <v value="small"/><v value="middle"/><v value="large"/>
    </s>
  </s>
  <s slot="geography">
    <s slot="Taiwan">
      <v value="north"/><v value="center"/><v value="south"/><v value="east"/>
    </s>
    <s slot="Global">
      <v value="Europe"/><v value="China"/><v value="India"/>
      <v value="America"/><v value="Australia"/>
    </s>
  </s>
</s>
In the slot-tree example above, a <v> tag represents a value in a slot. The simplest value is a
keyword. We may also specify a set of keywords or rules for a value, as in the “single color” value in
the “texture” slot.
The last step, “Identifying values for each slot”, is the most labor-intensive step in the whole
slot-tree building process. In order to construct slot-trees automatically, we develop the slot-mining
algorithm, which mines a slot-tree from XML documents, in the next section.
5.4 Slot-Mining Algorithm
A slot-mining algorithm mines a slot-tree from XML documents. The first step is to extract the paths
in the XML documents to build a schema. The second step uses statistical correlation analysis
to find out which terms are important for these paths. After that, a slot-tree is built in which each slot
corresponds to a path in the XML documents. The following figure shows a concept model of the
slot-mining algorithm.
Before we describe the algorithm, we define some mathematical notation.
Definition : Slot-Vector
A slot-vector is a vector of (slot, term) pairs for a given collection of XML blocks (B).
Example
1. A slot-vector for a given collection (D) is represented as the following formula.
$v(D) = (D_{s_1,t_1}, \ldots, D_{s_1,t_k}, \ldots, D_{s_n,t_1}, \ldots, D_{s_n,t_k})$
2. A slot-vector for a specified slot (s) of collection (D) is represented as the following formula.
4. A slot-vector for a specified slot (s) of document (d) is represented as the following formula.
Slot-Mining Problem
Given an XML documents collection (D) and a set of slots (S), find the key values for each slot : v(s).
Slot-Mining Algorithm
The slot vector for D is $v(D) = (D_{s_1,t_1}, \ldots, D_{s_1,t_k}, \ldots, D_{s_n,t_1}, \ldots, D_{s_n,t_k})$
In our XML-mining system, we set the parameter (r = 2.0) to extract the key values for each slot.
P = {p | p is a path in D}
for each (p,t) in D
    |D_{p,t}| = |D_{p,t}| + 1
    |D_p| = |D_p| + 1
    |D_t| = |D_t| + 1
    |D| = |D| + 1
end for
for each (p,t) in PT
    p(t | p) = |D_{p,t}| / |D_p|
    p(t) = |D_t| / |D|
    if p(t | p) / p(t) > r then put (p,t) into SV
end for
return SV
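The counting and the ratio test translate directly into Python. The sketch below rests on our own assumptions: the collection D is given as a flat list of (path, term) occurrences, PT is taken to be the set of distinct pairs seen in D, and the toy collection and the function name `slot_mining` are invented for the example.

```python
from collections import Counter

def slot_mining(pairs, r=2.0):
    """pairs: (path, term) occurrences of the collection D.
    Returns the (path, term) pairs whose ratio p(t|p) / p(t) exceeds r."""
    pairs = list(pairs)
    count_pt = Counter(pairs)                    # |D_{p,t}|
    count_p = Counter(p for p, _ in pairs)       # |D_p|
    count_t = Counter(t for _, t in pairs)       # |D_t|
    total = len(pairs)                           # |D|
    mined = []
    for (p, t), n in count_pt.items():
        p_t_given_p = n / count_p[p]
        p_t = count_t[t] / total
        if p_t_given_p / p_t > r:
            mined.append((p, t))
    return mined

# A toy collection: "wing" occurs under every path, the other terms under one.
collection = (
    [("color", t) for t in ["brown", "brown", "white", "wing"]]
    + [("texture", t) for t in ["spot", "spot", "line", "wing"]]
    + [("geography", t) for t in ["north", "north", "china", "wing"]]
)
mined = slot_mining(collection)
print(mined)
```

In this toy collection every term except “wing” reaches a ratio of 3, whereas “wing” is spread evenly over the paths, reaches a ratio of 1, and is filtered out.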
The slot-mining algorithm mines values from the XML collection D. The mined values should be
revised and organized into a slot-tree to improve its quality. Let us look at a mining
example for the slot “color”.
Example :
<color> head, brown, yellow, body, white, wing, gray, blue, black, background, line, spot
</color>
In the mining result above, “brown, yellow, white, gray, blue, black” are what we want, but “head,
body, wing, background, line, spot” are noise words. So far, we cannot distinguish these two
groups by statistical methods alone, and we have to find a way to do so. One possible solution is
to use a dictionary such as WordNet to distinguish the two groups. We will try this solution in
the future.
5.5 Discussion
In order to help people construct a slot-tree ontology, we developed a slot-mining algorithm
that mines a slot-tree from XML documents. The slot-mining algorithm is used as an authoring tool
for constructing the slot-tree ontology.
The slot-mining algorithm mines slot-trees from a collection of XML documents. Our
approach is based on statistical correlation analysis between tags and terms. The correlation
analysis decides what terms are important for a given tag, and fills terms into the slot of this
tag.
Some modification of the automatically constructed slot-tree is needed in order to
improve its quality. First, we have to merge paths with the same meaning into one slot in order
to simplify the structure of the slot-tree. Second, we have to delete incorrectly mined values
and merge values with the same meaning in order to improve the quality of each slot.
The slot-mining algorithm is used to construct the ontology for butterflies in section 6.7
and the ontology for proteins in section 7.7. We show the full version of the
mined slot-tree in these sections.
Part 3 : Case Studies
In this chapter, an overview of MBT is given in section 6.1. A source XML document of
MBT is shown in section 6.2. A slot-tree for MBT is described in section 6.3. A query
interface based on the slot-tree is described in section 6.4. The slot-filling process for MBT is
described in section 6.5. The retrieval process for MBT is discussed in section 6.6. The mining
process used to build the slot-tree for MBT is discussed in section 6.7. A discussion of our approach on
MBT is given in section 6.8.
11.1 Introduction
The Digital Museum of Butterfly is a collection of documents on butterflies in Taiwan. Each document in this collection
describes one species of butterfly in Taiwan. The following table is a profile of this collection.
URL : http://www.nmns.edu.tw/
NCNU : National Chi-Nan University (暨南大學), Taiwan
URL : http://dlm.ncnu.edu.tw/butterfly/index.htm
NTU : National Taiwan University (台灣大學), Taiwan
URL : http://turing.csie.ntu.edu.tw/ncnudlm/
Size 356 species, 356 XML documents.
Language Tag in English, Content in Chinese
The Digital Museum for Butterfly in Taiwan contains XML documents for 356 species of butterfly in
Taiwan. Roughly speaking, the tags can be classified into the following groups.
Table 6.2 : XML tags for butterflies in Taiwan
Group Fields
Classification name, family, cfamily (Chinese family), genus, species, subspecies
Host Host plant, Honey plant
Geography Taiwan, global
Egg Color, shape, feature, characteristic, days of growth, enemy
Larva Color, shape, feature, characteristic, days of growth, enemy
Pupa Color, shape, feature, characteristic, days of growth, enemy
Adult Color, shape, texture, characteristic, life period, enemy
<cname>拉拉山三線蝶</cname>
- <classification>
<family>Nymphalidae</family>
<cfamily>蛺蝶科</cfamily>
<genus>Athyma</genus>
<species>fortuna</species>
<sub_species>kodairai</sub_species></classification>
<honeyplant>成蝶喜吸食腐熟水果汁液或樹幹流出汁液。</honeyplant>
<global>中國大陸中部有原名亞種分布。</global></geographic>
- <life_stage>
- <egg>
<feature>底部扁平之高饅頭形,表面有明顯六角形格狀花紋,於六角形頂點處,各著生一
細長刺毛…
<predator>各類卵寄生蜂、蜱等節肢動物。</predator>
- <larva>
<feature>終齡幼蟲體呈長圓筒狀,頭部密生硬棘,各體節背方及體側皆長有具星狀刺之突
起…
<color>終齡幼蟲頭部褐色,表面密生棘狀突起。體呈翠綠色,各體節背方及體側突起基部
為藍色,星狀刺為黃綠色。</color>
<days_of_growth>冬季以二齡幼蟲越冬,幼蟲期長達半年以上。</days_of_growth>
<defense>初齡幼蟲停棲於寄主葉脈,攝食葉脈兩側葉肉,二齡幼蟲會將寄主植物葉片咬成
小塊並吐絲將其此碎片及糞便黏於葉脈造一蟲巢,越冬幼蟲即躲藏於蟲巢當中,由於
幼蟲褐色之體色與蟲巢上乾枯之小葉片或糞便色澤相近,或許可混淆天敵耳目。</defense></larva>
- <pupa>
<feature>蛹體為垂蛹,中胸背方隆起,腹節末端有一柄狀懸絲器。頭部前端有一對大型明
顯之彎曲角狀突出物,腹節背方均有小型鋸齒狀脊起。</feature>
<color>蛹體底色呈黃褐色,中、後胸背方有銀色斑塊,體側氣門黑褐色。</color>
<size>蛹體長度約為 22-27mm。</size>
<predator>蛹寄生蜂、胡蜂、姬蜂及各種真菌等。</predator>
<defense>老熟幼蟲化蛹於隱蔽之植物叢間,藉以躲避天敵。</defense></pupa>
- <adult>
<feature>成蟲前翅外觀大致呈現三角形,翅形稍微橫長。後翅卵圓形,外觀接近三角形。雌
蝶翅型較為寬圓。</feature>
<color>雄蝶前、後翅表底色為黑色,前翅中室內有一枚長形白斑,各翅室中橫線部位有一
大型白色橢圓斑,前翅端有兩枚小型白斑。後翅有兩條明顯白色橫帶紋,前後翅緣皆
有不明顯小白紋。雌蟲翅表色澤花紋與雄蟲相似。</color>
<size>本種為中型蝶種,展翅約為 50-60mm。</size>
<characteristic>前翅中室內有一枚長形白斑。</characteristic>
<habitate>台灣中部以北山區均有分布。</habitate>
<predator>蜘蛛、螳螂、青蛙、蜻蜓、鳥類及蜥蜴等捕食性天敵。</predator>
<days_of_growth>前翅中室內有一枚長形白斑。</days_of_growth>
<defense>成蟲飛行快速,外觀與其他多種三線蝶類似,為莫氏擬態的一種。</defense>
<season>夏季較易見到成蟲活動。</season>
<behavior>成蝶喜吸食腐熟水果汁液或樹幹發酵流出之樹液,成蟲活動於開闊林道,常見
成蟲於開闊山徑兩旁樹上佔據地盤驅趕附近飛過蝴蝶,亦可見其活動於溪邊開闊處,
吸食腐果或潮濕地面水分。</behavior>
</adult>
</life_stage>
</butterfly>
<butterfly>
<classification/>
<Geography/>
<life-period>
<Egg/>
<Larva/>
<Pupa/>
<Adult/>
</life-period>
</butterfly>
Each object in the “life period” (egg, larva, pupa, adult) has a sub-schema that describes the object. The
sub-schema looks like the following tree.
<object>
<Color/>
<shape/>
<feature/>
<size/>
</object>
The consistency between the slot-tree and the documents eases our design process. Besides that, the
consistency also eliminates ambiguity in our retrieval and browsing processes. In contrast, a poorly
designed XML document structure makes the domain-knowledge design process difficult, and
makes it hard for the domain knowledge to help the retrieval and browsing processes.
A fragment of the slot-tree for butterflies is shown in the following figure. For the full slot-tree,
please see appendix 1.
- <butterfly>
</color>
</texture>
</adult>
</s></pupa>
keys="Square_Texture"/> …</s>
</egg>
/> …</s>
</larva>
</s>
</butterfly>
11.4 Query Interface
The query interface is built automatically by transforming the slot-tree into a web page. We use
XSLT to transform the slot-tree into HTML. The following figure shows a query interface for
butterflies.
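The idea of deriving a query form from the slot-tree can be sketched as follows. This is only an illustration under stated assumptions: the thesis uses XSLT, whereas this sketch swaps in plain Python with the standard `xml.etree.ElementTree` module, and the slot-tree layout (slots as nested elements, candidate values as `<value name="..."/>` children) and the function name `build_query_form` are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical slot-tree fragment; the real slot-tree is listed in appendix 1.
SLOT_TREE = """
<butterfly>
  <adult>
    <color><value name="black"/><value name="white"/></color>
    <shape><value name="triangle"/></shape>
  </adult>
</butterfly>
"""


def build_query_form(slot_tree_xml):
    """Turn every slot that lists values into an HTML <select> box."""
    root = ET.fromstring(slot_tree_xml)
    form = ET.Element("form", {"method": "get"})

    def walk(node, path):
        values = node.findall("value")
        if values:
            ET.SubElement(form, "label").text = path
            select = ET.SubElement(form, "select", {"name": path})
            ET.SubElement(select, "option", {"value": ""}).text = "(any)"
            for value in values:
                name = value.get("name")
                ET.SubElement(select, "option", {"value": name}).text = name
        for child in node:
            if child.tag != "value":
                walk(child, path + "/" + child.tag)

    walk(root, root.tag)
    return ET.tostring(form, encoding="unicode")


html = build_query_form(SLOT_TREE)
```

Each slot path becomes the name of a form field, so a submitted form can be mapped back to slots in the tree.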
After the interface submits the query to our XML retrieval system, the retrieval results are
shown. The query above specifies the query expression and the ranking strategy. The ranking strategy is
the size of the adult butterfly in decreasing order. Based on the query, the XML retrieval system
retrieves the butterfly objects and ranks them by the size of the butterfly. We show the query results in the
following section.
11.5 Slot-Filling Algorithm
We have to parse the XML objects before filling the documents into the slot-tree. For example, the following
XML document describes a butterfly called “maraho”.
<butterfly>
<cname>寬尾鳳蝶</cname>
<geographic>
</geographic>
<egg><feature>外觀呈圓球形</feature></egg>
<adult><color>成蟲翅表為黑色</color></adult>
<footnote>本種經行政院農業委員會公告為一級瀕臨滅 保育 ……</footnote>
</butterfly>
The example above is parsed into a sequence of (path, value) pairs as follows.
<value name="半球形"/>
….
卵的形狀 : 圓球形
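The parsing step described in this section, which turns an XML document into (path, value) pairs, can be sketched as follows. This is a minimal illustration that assumes the standard `xml.etree.ElementTree` parser and the backslash-separated path notation used for the protein example later in the thesis. The function name `path_value_pairs` is hypothetical, and the sample document is a shortened, well-formed version of the example above.

```python
import xml.etree.ElementTree as ET

DOC = (
    "<butterfly>"
    "<cname>寬尾鳳蝶</cname>"
    "<egg><feature>外觀呈圓球形</feature></egg>"
    "<adult><color>成蟲翅表為黑色</color></adult>"
    "</butterfly>"
)


def path_value_pairs(xml_text, sep="\\"):
    """Return (path, value) pairs for every element that carries text."""
    root = ET.fromstring(xml_text)
    pairs = []

    def walk(node, path):
        text = (node.text or "").strip()
        if text:
            pairs.append((path, text))
        # Text that follows a child element (tail text) is ignored here.
        for child in node:
            walk(child, path + sep + child.tag)

    walk(root, root.tag)
    return pairs


pairs = path_value_pairs(DOC)
for path, value in pairs:
    print("(%s,%s)" % (path, value))
```

The resulting pairs are the input that a slot-filling step would match against the values listed in each slot.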
11.6 XML Retrieval
After the user submits the query to the XML retrieval system, the system
retrieves the query results. Then an XML extraction algorithm extracts the values for each slot.
After that, a sorting function sorts the results by the size of the butterflies. The following figure
shows the query results.
Figure 6.4 A Query Result for Butterflies
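The sorting step can be illustrated with a short sketch. It assumes that each retrieved record carries the free-text content of its `<size>` element (for example “展翅約為 50-60mm。”) and that the midpoint of the stated range is used as the sort key. The record layout, the midpoint rule, and the names `size_key` and `rank_by_size` are assumptions for illustration, not the thesis implementation.

```python
import re

# Matches a range such as "50-60mm" or "1.1-1.3mm".
SIZE_RANGE = re.compile(r"(\d+(?:\.\d+)?)\s*-\s*(\d+(?:\.\d+)?)\s*mm")


def size_key(size_text):
    """Return the midpoint of the size range in millimetres."""
    match = SIZE_RANGE.search(size_text)
    if match is None:
        return float("-inf")  # records without a parsable size sort last
    low, high = float(match.group(1)), float(match.group(2))
    return (low + high) / 2


def rank_by_size(records):
    """Sort records by size in decreasing order."""
    return sorted(records, key=lambda rec: size_key(rec["size"]), reverse=True)


# Hypothetical records; only the 50-60mm value comes from the sample document.
records = [
    {"name": "A", "size": "展翅約為 50-60mm。"},
    {"name": "B", "size": "展翅約為 90-100mm。"},
    {"name": "C", "size": ""},
]
ranked = [rec["name"] for rec in rank_by_size(records)]
```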
11.7 Slot-Mining Algorithm
…雄紅三線蝶身上有…
…江崎三線蝶分布於…
…台灣三線蝶是一種…
…埔里三線蝶屬於小…
“三線蝶” left neighbor {紅,崎,灣,里} ,right neighbor {身,分,是,屬 }
“線蝶” left neighbor {三} ,right neighbor {身,分,是,屬}
“三線” left neighbor {紅,崎,灣,里} ,right neighbor {蝶}
For the string “三線蝶”, both the left side and the right side have four neighbors. But for “線蝶”, there is only
one left neighbor, and for “三線” there is only one right neighbor. A string with many neighbors on both
sides is very likely to be a “word”, so “三線蝶” is put into the learning dictionary for the
following XML text-mining step.
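The neighbor test illustrated above can be sketched as follows. This is a minimal illustration: the function names and the threshold `min_neighbors` are hypothetical, and the four text snippets are the ones quoted in the example with the ellipses removed.

```python
def neighbors(candidate, texts):
    """Collect the distinct characters immediately left and right of candidate."""
    left, right = set(), set()
    for text in texts:
        start = text.find(candidate)
        while start != -1:
            end = start + len(candidate)
            if start > 0:
                left.add(text[start - 1])
            if end < len(text):
                right.add(text[end])
            start = text.find(candidate, start + 1)
    return left, right


def is_word(candidate, texts, min_neighbors=2):
    """Treat a string as a word when both sides have enough distinct neighbors."""
    left, right = neighbors(candidate, texts)
    return len(left) >= min_neighbors and len(right) >= min_neighbors


texts = ["雄紅三線蝶身上有", "江崎三線蝶分布於", "台灣三線蝶是一種", "埔里三線蝶屬於小"]
left, right = neighbors("三線蝶", texts)
```

On these snippets, “三線蝶” has four distinct neighbors on each side, while “線蝶” and “三線” each have a single neighbor on one side and are rejected.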
Slot-Mining
After the word-learning step, the slot-mining algorithm described in section 3.5 is used to extract
the important words for each slot. The following table shows some results of the slot mining (part of the
slot-tree).
11.8 Discussion
In this chapter, we have studied our methods on the case of butterflies. We described the
following methods.
These methods reduce the semantic gap between human and computer in the domain of
butterflies. The query interface enables users to write queries easily. The slot-filling algorithm
helps the computer understand XML documents. Finally, the mining algorithm makes it easier for
us to construct the slot-tree ontology.
12 Case Study - Protein Information Resource
In the previous chapter, we tested our methods on the collection of the “Museum of
Butterflies in Taiwan (MBT)”. In this chapter, we test our methods on the collection of the
“Protein Information Resource (PIR)”. The PIR is a large collection of protein records
maintained by Georgetown University.
In this chapter, an overview of PIR is given in section 7.1. Some XML data of PIR are
shown in section 7.2. A slot-tree for PIR is described in section 7.3. A query interface for
proteins is described in section 7.4. The slot-filling process for PIR is described in section 7.5.
The retrieval process for PIR is discussed in section 7.6. The mining process used to build the slot-tree
for PIR is discussed in section 7.7. A discussion of our approach on PIR is given in section 7.8.
12.1 Introduction
The Protein Information Resource is a general collection of protein and gene records for living organisms, including
humans, animals, plants, viruses, bacteria, etc. Each document in this collection describes a gene or protein.
The following table is a profile of this collection.
URL : http://pir.georgetown.edu/
Size The PIR-PSD, Release 72.03, May 17, 2002, Contains 283174 Entries
Language English
The Protein Information Resource contains 283174 protein and gene entries. Roughly speaking, the tags can
be classified into the following groups.
<created_date>03-Feb-1994</created_date>
- <organism><source>zebra fish</source>…</organism>
<volume>12</volume><year>1993</year><pages>1403-1414</pages>
development…
- <xrefs><xref><db>MUID</db><uid>93223680</uid></xref></xrefs>
<xrefs><xref><db>EMBL</db><uid>X70299</uid></xref>…
</reference>
- <feature label="ERBA">
<feature-type>domain</feature-type>
<seq-spec>74-320</seq-spec>
</feature>…
- <summary><length>411</length><type>complete</type></summary>
<sequence>MAMVVSVWRDPQEDVAGGPPSGPNPAAQPAREQQQAASAAPHTPQTPSQPGPPSTPGTAGDK…
</ProteinEntry>
The ontology used in this section is based on the suggestions of the Gene Ontology. The Gene
Ontology Consortium proposed an ontology system with three dimensions:
“molecular function”, “biological process” and “cellular component”. In addition, we add
“protein structure”, “protein size”, “molecular type”, and some other information to our
slot-tree. A fragment of the slot-tree for proteins is shown in the following figure. For the full
slot-tree for proteins, please see appendix 2.
- <frame>
</frame>
12.4 Query Interface
The query interface is built automatically by transforming the slot-tree into a web page. We use
XSLT to transform the slot-tree into HTML. The following figure shows a query interface for
proteins.
12.5 Slot-Filling Algorithm
We have to parse the XML objects before the extraction. For example, the following XML document describes a
protein.
<ProteinEntry>
<protein>…</protein>
<summary><length>411</length>…</summary>
</ProteinEntry>
The example above is parsed into a sequence of (path, value) pairs as follows.
(ProteinEntry\summary\length,411)
(ProteinEntry\organism\name,zebra fish),
<value name="Plant"/>
<value name="Fish"/>
….
Organism: Fish
The following figure shows the extraction result for the example above.
<slot protein="S35333">
</slot>
12.6 XML Retrieval
After the user submits the query to the XML retrieval system, the system
retrieves the query results. Then an XML extraction algorithm extracts the values for each slot.
After that, a sorting function sorts the results by the size of the proteins. The following figure
shows the query results.
Figure 7.4 : A Query Result for Protein Information Resource
12.7 Slot-Mining Algorithm
The slot-mining algorithm described in section 3.5 is used to extract the important words for each slot. The
following table shows some results of the slot mining (part of the slot-tree).
transferase,transfer, transcription,transcript,topoisomerase,thioredoxin,tRNA,
synthase,sulfatase,ste,sea,rich,ribosomal,response,repressor,repeat,regulator,region,
reductase,receptor,rat,ras,ran,proteins,protein,probable,polyprotein,phosphate,phage,
permease,peptide,peptidase,oxidase,ornithine,nucleotide,nor,non,mol,min,mer,
membrane,man,long,line,ligase,lactaldehyde,kinesin,kinase,isomerase,inhibitor,inhibit,
immunoglobulin,hypothetical,hydrolyzing,hydrogenase,hydrogen,homology,
homolog,homeobox,glucose,globin,gene,gamma,form,factor,esterase,ester,erbA,
epimerase,enzyme,elegans,edu,domain,dehydrogenase,cytochrome,control,conserved,
coli,chr,cholinesterase,choline,chain,cell,cassette,carrier,binding,bind,beta,barley,
bacterium,antigen,anti,ant,alpha,alcohol,alanine,acid,RNA,NADH,NAD,Mycobacterium,
MTH,III,Escherichia,DNA,Caenorhabditis,Bacillus,ATPase,ATP,ADP
/ProteinEntry/comment ste,protein,phosphorylation,phosphorylated,phosphorylase,phospho,phosphate,non,
molecule,mol,interacts,inhibit,enzyme,covalent,cell,allosterically,allosteric,allo,This,Thi
/ProteinEntry/complex tet,phosphorylase,phospho,mer,homotetramer
/ProteinEntry/feature TMM,SIG,RRH,MAT,KIN,IMM,HOX,FOX,ERBA,ACP,ABC
/ProteinEntry/header/created_date Sep,Oct,Nov,May,Mar,Jun,Jul,Jan,Feb,Dec,Aug,Apr
/ProteinEntry/header/seq-rev_date Sep,Oct,Nov,May,Mar,Jun,Jul,Jan,Feb,Dec,Aug,Apr
/ProteinEntry/genetics/xrefs/xref/db SGD,OMIM,MIPS,MIP,GDB
/ProteinEntry/genetics/start-codon GTG
/ProteinEntry/genetics/map-position qter,pter,circular,chromosome,chr,REV
/ProteinEntry/genetics/gene/db SPDB,SGD,SCOEDB,GDB,CESP,ATSP
/ProteinEntry/function/description sulfate,ran,protein,phospho,phosphate,hydrogenase,hydrogen,glucose,formate,
form,catalyzes,alpha
/ProteinEntry/feature/status predicted,experimental,exp,atypical
/ProteinEntry/feature/feature-type site,region,product,modified,inhibitory,inhibitor,inhibit,domain,disulfide,bonds,binding,
bind,active
/ProteinEntry/keywords/keyword zinc,transmembrane,transferase,transfer,transcription,transcript,tet,ste,ribosome,
regulation,reductase,receptor,rat,ras,ran,pyridoxal,proteinase,protein,polyprotein,
photo,phosphoprotein,phospho,phosphate,oxygen,oxidoreductase,nucleus,nucleotide,
muscle,mol,mitochondrion,min,metalloprotein,metal,mer,membrane,magnesium,lyase,
loop,kinase,isomerase,iron,immunoglobulin,hydrolase,homotetramer,homeobox,
heterotetramer,heme,glycoprotein,finger,erythrocyte,end,edu,duplication,date,complex,
chromoprotein,chr,chloroplast,cell,carrier,carboxyl,carboxy,carbon,blood,biosynthesis,
binding,bind,aminoacyl,amino,amidated,allo,acid,acetylated,NAD,DNA,ATP
/ProteinEntry/feature/description zinc,trypsin,transmembrane,transforming,ste,signal,sequence,seq,response,repeat,
regulator,reductase,rat,ras,ran,pyridoxal,pter,proteinase,protein,potential,phosphorylase,
phospho,phosphate,peptide,oxidase,nucleotide,muscle,motif,molybdopterin,mol,
min,mer,membrane,mature,man,magnesium,low,loop,ligands,ligand,kinase,iron,
inhibitor,inhibit,immunoglobulin,hydrogenase,hydrogen,homology,homolog,homeobox,
heme,glycoprotein,fragment,form,finger,ferroxidase,factor,erbA,end,edu,domain,
dehydrogenase,date,cytochrome,covalent,chr,chain,cassette,carrier,carboxyl,carboxy,
carbohydrate,binding,bind,beta,axial,amino,amidated,alpha,allo,alcohol,acetylated,Thr,
Ser,Lys,Ile,His,Glu,GTP,Cys,Bowman,Birk,Asp,Asn,Arg,ATP,ADP
However, the schema of PIR is not consistent with our ontology described in section 5.2. The
inconsistency causes a mapping problem between slots and paths. The ontology designer has to spend a lot
of time adjusting the automatically generated slot-tree.
12.8 Discussion
In this chapter, we have studied our methods on the case of proteins. We described the following
methods.
These methods reduce the semantic gap between human and computer in the domain of
proteins. The query interface enables users to write queries easily. The slot-filling algorithm
helps the computer understand XML documents. Finally, the mining algorithm makes it easier for
us to construct the slot-tree ontology.
Part 4 : Conclusions
In this thesis, slot-trees are used to generate a query interface that lets users write queries easily. The
query interface reduces the semantic gap on the query side. On the other hand, a slot-filling algorithm is
designed to help the computer understand XML documents. The slot-filling algorithm reduces the
semantic gap on the document side.
In order to ease the process of building a slot-tree, we propose a slot-mining algorithm that mines a
slot-tree from XML documents. The mined slot-tree has to be modified by a domain expert to improve
its quality.
In this chapter, we compare our approach to other approaches in section 8.1. Our
contributions are described in section 8.2. Finally, we give our conclusions in section 8.3.
13.1 Comparison
We compare our approach to other approaches based on four measures. Each
measure corresponds to one of the questions listed below.
Natural language approach (NL) : Both documents and queries are written in natural
language. A typical text retrieval system adopts the natural language approach. Natural language is
easy for users to read and write. However, natural language is not easy for the computer to
understand.
Database approach (DB) : Documents are encoded as a set of tables in a relational database.
A database system is not easy for users to read and write directly, so a designer has to design a
user interface to help users read and write data. Database query languages like SQL are not easy
for end users to write. However, data in a database are easy for the computer to understand.
Logic based approach (Logic) : Both logic queries and data are very easy for the computer to
understand. However, people cannot write logic rules and queries easily. Besides, not all
documents can be represented as logic rules.
XML based approach (XML) : XML queries are easy for the computer to understand. However,
XML queries are not easy for humans to write, and the computer cannot yet understand XML
documents easily. In this thesis, we use the slot-tree ontology to help
the computer understand XML documents. We also use the slot-tree ontology to build the query
interface. The interface helps humans write XML queries easily. The slot-tree based methods
move the XML based approach toward the easy side in the figure below. The slot-tree ontology
reduces the gap between human and computer on XML.
b. A comparison of XML-based representations
Next, we compare three XML-based representations: XML, RDF and DAML. XML
has been described several times in this thesis. We now introduce RDF and DAML
before the comparison.
<rdf:RDF>
<rdf:Description about="Athyma_fortuna_kodairai">
<rdf:type resource="http://description.org/schema/butterfly"/>
<color>with brown wing and black head</color>
<texture>has white spots on wings</texture>
</rdf:Description>
</rdf:RDF>
We have to use the tags defined in the RDF specification to describe objects and inheritance
relations. People have to understand the RDF tags before writing RDF documents. However, RDF is
simple and easy to use.
DAML is a representation that encodes logic rules into frame-based XML documents.
DAML extends the tags in RDF to accommodate frequently used logic predicates, such as
“disjointWith”, “cardinality”, “intersectionOf”, etc. The following is an example
of DAML.
<rdfs:Class rdf:ID="Athyma_fortuna_kodairai">
<rdfs:subClassOf rdf:resource="#butterfly"/>
<daml:disjointWith rdf:resource="#Moth"/>
</rdfs:Class>
Writing a DAML document is not an easy job. People have to understand many DAML tags
and express the content as logic predicates. The following figure shows the comparison
between XML, RDF and DAML.
Figure 8.2 : A comparison of XML-based representations
Figure 8.3 shows the comparison of these approaches. Our slot-tree based system is
labeled as “slot” in the figure. XML-GL is labeled as “X-GL” in the figure. XYZfind is labeled
as “XYZ” in the figure. Lore is labeled as “Lore” in the figure.
We found that the slot-tree approach performs well on all questions. The slot-tree ontology
makes it easy for people to write queries. Our approach does not require people to write XML documents
with specified tags, so people can write documents easily. XML queries are always easy
for the computer to understand. The slot-filling algorithm helps the computer understand XML
documents.
Figure 8.3 : A comparison of XML retrieval systems
13.2 Contributions
Based on the analysis in section 9.1, we may briefly describe our contributions as follows.
“The slot-tree approach reduces the semantic gap between human and computer on XML”
1. “The slot-tree based query interface makes it easy for humans to write XML queries.”
2. “The slot-filling algorithm makes it easy for the computer to understand XML documents.”
3. “A retrieval system based on the slot-tree is built to reduce the semantic gap on XML.”
4. “The slot-mining algorithm makes it easy for people to construct a slot-tree ontology.”
However, the slot-tree based XML retrieval method we propose focuses on a single specific
domain. We have to construct a slot-tree for each domain before releasing the XML retrieval
system. The method is good at retrieving object-based XML documents such as those for butterflies and
proteins. However, we are not sure whether the method can be used to retrieve XML collections that are
not object-based. Besides, we have to extend the method to build an XML retrieval system that covers
more than one domain.
13.3 Discussion and Future Work
In this thesis, we use the slot-tree ontology and the slot-filling algorithm to reduce the semantic
gap of XML. The slot-tree is used to generate a query interface that reduces the semantic gap on the
query side. The slot-filling algorithm is used to map XML documents into the slot-tree ontology in
order to reduce the semantic gap on the document side. Our XML retrieval system works well on
object-based XML collections, such as the collection for butterflies in chapter 6 and the
collection for proteins in chapter 7.
However, not all XML documents describe objects. Some XML documents
encode categories, scripts and other structures. How to integrate these structures into an
XML retrieval system is a good question for our future research.
Another question is the integration of XML collections from several domains. For example,
how to integrate XML documents that describe genes, proteins and biological species into one
XML retrieval system is a good case to study. The integration of several domains needs
further research.
Finally, a scalable XML retrieval system would be useful on a web with many XML
documents. Such a system should be able to retrieve from a large collection of XML
documents in a variety of domains. We will try to build such an XML retrieval system in the
future.
References
[Aguilera00] Aguilera, V. and Cluet, S. and Veltri, P. and Vodislav, D. and Wattez,F. (2000) “Querying
XML Documents in Xyleme” in ACM SIGIR 2000 Workshop On XML and Information Retrieval.
http://www.haifa.il.ibm.com/sigir00-xml/final-papers/xyleme/XylemeQuery/XylemeQuery.html
[Albano00] Albano, A. and Colazzo, D. and Ghelli, G. and Manghi, P. and Sartiani, C. (2000) “A Type
System for Querying XML Documents” in ACM SIGIR 2000 Workshop On XML and Information
Retrieval.
http://www.haifa.il.ibm.com/sigir00-xml/final-papers/Sartiani/athens.html
[Allen94] Allen, J.F. “Natural Language Understanding,” Benjamin Cummings, 1987, Second Edition,
1994.
[Alshawi92] Hiyan Alshawi, editor. The Core Language Engine. MIT Press, Cambridge, Massachusetts,
1992.
[Baeza00] Baeza-Yates, R. and Navarro, G. (2000) “XQL and Proximal Nodes,” in ACM SIGIR 2000
Workshop On XML and Information Retrieval.
http://www.haifa.il.ibm.com/sigir00-xml/final-papers/RBaetza/att1.htm
[Bollacker98] Bollacker, K.D. and Lawrence, S. and Giles, C.L. (1998) “CiteSeer: An Autonomous
Web Agent for Automatic Retrieval and Identification of Interesting Publications”, 2nd International
ACM Conference on Autonomous Agents, pp. 116-123, ACM Press, May, 1998.
[Brachman85b] Brachman, R.J., and Schmolze, J.G. (1985) “An overview of the KL-ONE knowledge
representation system.” Cognitive Sci. 9.2 (Apr. 1985) 171-216.
[Brin98] Brin, S. and Page,L.(1998) "The Anatomy of a Large-Scale Hypertextual Web Search Engine"
in Proceedings of World-Wide Web '98 (WWW7), April 1998.
[Carmel00] Carmel, D. and Maarek, Y. and Soffer, A. (2000) “Workshop Summary of XML and
Information Retrieval: a SIGIR 2000 Workshop” IBM Research Lab in Haifa.
http://www.haifa.il.ibm.com/sigir00-xml/WorkshopSummary.html
[Chen00] Chen, B.C. (2000) “Content-Based Image Retrieval of Butterflies”, Master Thesis. NTU,
Taiwan, June, 2000.
[Chien97] Chien, L.F. (1997) "PAT-Tree Based Keyword Extraction for Chinese Information Retrieval"
ACM SIGIR 1997.
[Cooper01] Cooper, B.F. and Sample, N. and Franklin,M.J. and Hjaltason,G.R. and Shadmon, M.
(2001) “A Fast Index for Semistructured Data” Proc. of 27th Intl. Conf. on Very Large Data Bases,
August 2001. http://www.rightorder.com/technology/XML.pdf
[DC99] “Dublin Core Metadata Element Set, Version 1.1: Reference Description” –
http://dublincore.org/documents/dces/
[DeJong82] DeJong, G. (1982) “An Overview of the FRUMP System.” In Strategies for Natural
Language Processing, W.G. Lehnert & M.H. Ringle (Eds), Lawrence Erlbaum Associates, 1982,
149-176.
[Dyer83] Dyer, M.G. (1983) "In-Depth Understanding - A computer model of integrated processing for
Narrative Comprehension, " MIT press, 1983.
[Egnor00] Egnor,D. and Lord,R. (2000) “XYZfind: Searching in Context with XML” in ACM SIGIR
2000 Workshop On XML and Information Retrieval.
http://www.haifa.il.ibm.com/sigir00-xml/final-papers/Egnor/index.html
[Fuhr00] Fuhr, N. (2000) “XIRQL An Extension of XQL for Information Retrieval” in ACM SIGIR
2000 Workshop On XML and Information Retrieval.
http://www.haifa.il.ibm.com/sigir00-xml/final-papers/KaiGross/sigir00.html
[Goldman97] Goldman, R. and Widom, J. (1997) “DataGuides: Enabling query formulation and
optimization in semistructured databases.” In Proc. Intl. Conf. on Very Large Data Bases, 1997.
[Green63] Green, B.F., Wolf, A.K., Chomsky, C., and Laughery, K. (1963). “Baseball : An automatic
question answerer.” In Feigenbaum and Feldman (Eds.), Computer and Thought. McGraw-Hill, New
York, 207-233.
[Grosz86] Grosz, B.J., Sparck-Jones, K., and Webber, B.L., eds. (1986) "Readings in Natural Language
Processing", Morgan Kaufmann Publishers, Los Altos, CA, 1986
[Han01] Han, J. and Kamber, M. (2001) “Data Mining - Concepts and Techniques”, Morgan Kaufmann
Publisher. 2001.
[Hayashi00] Hayashi, Y. and Tomita, J. and Kikui,G. (2000) “Searching Text-rich XML Documents
with Relevance Ranking” in ACM SIGIR 2000 Workshop On XML and Information Retrieval.
http://www.haifa.il.ibm.com/sigir00-xml/final-papers/Hayashi/hayashi.html
[Heb00] Heb, M. and Monch, C. and Drobnik, O. (2000) "Quest - Querying Specialized Collections on
the Web", J. Borbinha and T.Baker (Eds.) : ECDL 2000, LNCS 1923, pp. 117-126, 2000.
[Hobbs96] Hobbs, J. and Appelt, D. and Bear, J. and Israel, D. and Kameyama, M. and Stickel, M. and
Tyson, M. (1996) “FASTUS: A Cascaded Finite-State Transducer for Extracting Information from
Natural-Language Text.” in Finite State Devices for Natural Language Processing, MIT Press, 1996
[Hsu98] Hsu, C.N. and Dung, M.T. (1998) “Generating finite-state transducers for semistructured data
extraction from the web,” Information Systems, 23(8):521-538, Special Issue on Semistructured
Data, 1998.
[Ide00] Ide, N. (2000) “Searching Annotated Language Resources in XML: A Statement of the
Problem” in ACM SIGIR 2000 Workshop On XML and Information Retrieval.
http://www.haifa.il.ibm.com/sigir00-xml/final-papers/Ide/SIGIR-XML.html
[Ifikes85] Ifikes, R. and Kehler, J. (1985) “The role of frame-based representation in reasoning.”
Communications of the ACM, Volume 28 Number 9, September 1985.
[Kehler84] T.P. Kehler and G.D. Clemenson. KEE: The Knowledge Engineering Environment for
Industry. Systems And Software, 3(1):212-224, January 1984.
[Loral97] S. Abiteboul, D. Quass, J. McHugh, J. Widom, and J. Wiener. “The Lorel Query Language for
Semistructured Data.” International Journal on Digital Libraries, 1(1):68-88, April 1997.
[Luk00] Luk,R. and Chan,A. and Dillon,T. and Leong, H.V. (2000) “A Survey of Search Engines for
XML Documents” in ACM SIGIR 2000 Workshop On XML and Information Retrieval.
http://www.haifa.il.ibm.com/sigir00-xml/final-papers/Luk/XMLSUR.htm
[McHugh97] McHugh, J. and Widom, J. and Wiener, J. and Abiteboul, S. and Quass, D. (1997) “The
Lorel Query Language for Semistructured Data, ” - International Journal on Digital Libraries,
1(1):68-88, 1997.
[Muslea99] Muslea, I. (1999) “Extraction Patterns for Information Tasks : A Survey, ” In AAAI-99
Workshop on Machine Learning for Information Extraction, 1999.
[OIL00] “An informal description of Standard OIL and Instance OIL 28 November 2000”
http://www.ontoknowledge.org/oil/downl/oil-whitepaper.pdf
[Page98] Page, L. and Brin, S. and Motwani, R. and Winograd, T. “The PageRank citation ranking:
Bringing order to the Web.” Unpublished manuscript, online at http://google.stanford.edu/~backrub/
pageranksub.ps, 1998.
[Quillian66] Quillian, R. "Semantic memory," Cambridge, Mass. : Bolt, Beranek and Newman, 1966.
[RDF99] Resource Description Framework (RDF) Model and Syntax Specification W3C
Recommendation 22 February 1999 http://www.w3.org/TR/1999/REC-rdf-syntax-19990222/
[RDFS00] Resource Description Framework (RDF) Schema Specification 1.0 W3C Candidate
Recommendation 27 March 2000 http://www.w3.org/TR/2000/CR-rdf-schema-20000327/
[Salton88] Salton, G. and Buckley, C. “Term-Weighting Approaches in Automatic Text Retrieval,”
Information Processing and Management, 24(5), 513-23, 1988.
[Schank74] Schank, R.C. and Rieger III, C.J. (1974) "Inference and the Computer Understanding of
Natural language," Artificial Intelligence 5(4), 1974, 373-412.
[Schank77] Schank, R.C. and Abelson, R. (1977). “Scripts, Plans, Goals, and Understanding.”
Hillsdale, NJ: Erlbaum Assoc.
[Schank80] Schank, R.C. and Kolodner, J.L. and DeJong, G. (1980) “Conceptual Information
Retrieval.” SIGIR 1980: 94-116.
[Schmidt00] Schmidt, A. et al. (2000) “Efficient Relational Storage and Retrieval of XML Documents”,
In proceedings of International Workshop on the Web and Databases (In conjunction with ACM
SIGMOD), pages 47-52, Dallas, TX, USA, May 2000.
http://citeseer.nj.nec.com/schmidt00efficient.html
[Schlieder00] Schlieder, T. and Meuss, H. (2000) “Result ranking for structured queries against XML
documents.” In DELOS Workshop on Information Seeking, Searching and Querying in Digital
Libraries, Zurich, Switzerland, December 2000.
[Schlieder01] Schlieder, T. (2001) “Similarity search in XML data using cost-based query
transformations.” In Proceedings of the Fourth International Workshop on the Web and Databases
(WebDB'01), Santa Barbara, USA, May 2001.
[Schlieder00] Schlieder, T. and Naumann ,F. (2000) “Approximate Tree Embedding for Querying XML
Data” in ACM SIGIR 2000 Workshop On XML and Information Retrieval.
http://www.haifa.il.ibm.com/sigir00-xml/final-papers/Approximate.htm
[Stefik83] Stefik, M., Bobrow, D. G., Mittal, S., and Conway, L. Knowledge Programming in Loops:
Report on an Experimental Course. AI Magazine, 4:3, pp. 3-13, Fall 1983. (Reprinted in Readings
From the AI Magazine, Volumes 1-5, 1980-1985, pp. 493-503, 1988.)
[Tu99] Tu, H. C. (1999) “Interactive Web IR: Focalization Model, Effectiveness Measures, and
Experiments”, Doctoral Dissertation, NTU, Taiwan, June, 1999.
[van Zwol2002] van Zwol, R. (2002). “Modelling and searching web-based document collections.”
PhD thesis, Centre for Telematics and Information Technology (CTIT), Enschede, the Netherlands.
ISBN: 90-365-1721-4; ISSN: 1381-3616 No. 02-40 (CTIT Ph.D. thesis series).
[Widom99] Widom, J. (1999) “Data Management for XML - Research Directions”, IEEE Data
Engineering Bulletin, Special Issue on XML, 22(3):44-52, September 1999.
http://www-db.stanford.edu/~widom/xml-whitepaper.htm
[Wood75] Woods, William A. “What's in a Link : Foundations for Semantic Networks” Available in
Readings in Knowledge Representation, Brachman, R.J. & Levesque, H.J., Eds. (1985), Morgan
Kaufman.
[XML-QL98] “XML-QL: A Query Language for XML,” Submission to the World Wide Web
Consortium 19-August-1998 http://www.w3.org/TR/1998/NOTE-xml-ql-19980819/
[XML-GL99] Stefano Ceri, Sara Comai, Ernesto Damiani , Piero Fraternali, Stefano Paraboschi,
Letizia Tanca “XML-GL: a Graphical Language for Querying and Restructuring XML Documents,”
in The Eighth International World Wide Web Conference (WWW8), Toronto Convention Centre,
Toronto, Canada May 11-14, 1999.
[XPATH99] “XML Path Language (XPath) Version 1.0”, W3C Recommendation 16 November 1999,
http://www.w3.org/TR/xpath
[XQuery01] “XQuery 1.0 and XPath 2.0 Data Model” W3C Working Draft 20 December 2001
http://www.w3.org/TR/2001/WD-query-datamodel-20011220/
Appendix 1 : Taiwan Butterfly Digital Museum
a. XML example
<butterfly>
  <cname>拉拉山三線蝶</cname>
  <nickname />
  <present_SN_record>
    <present_SN>Athyma_fortuna_kodairai</present_SN>
    <present_SN_author>Sonan</present_SN_author>
    <present_SN_year>1938</present_SN_year>
  </present_SN_record>
  <classification>
    <family>Nymphalidae</family>
    <cfamily>蛺蝶科</cfamily>
    <genus>Athyma</genus>
    <species>fortuna</species>
    <sub_species>kodairai</sub_species>
  </classification>
  <honeyplant>成蝶喜吸食腐熟水果汁液或樹幹流出汁液。</honeyplant>
  <geographic>
    <global>中國大陸中部有原名亞種分布。</global>
  </geographic>
  <life_stage>
    <egg>
      <feature>底部扁平之高饅頭形,表面有明顯六角形格狀花紋,於六角形頂點處,各著生一細長刺毛。</feature>
      <color>淡綠。</color>
      <size>直徑約為 1.1-1.3mm。</size>
      <characteristic />
      <habitate />
      <predator>各類卵寄生蜂、蜱等節肢動物。</predator>
    </egg>
    <larva>
      <feature>終齡幼蟲體呈長圓筒狀,頭部密生硬棘,各體節背方及體側皆長有具星狀刺之突起。</feature>
      <color>終齡幼蟲頭部褐色,表面密生棘狀突起。體呈翠綠色,各體節背方及體側突起基部為藍色,星狀刺為黃綠色。</color>
      <characteristic />
      <habitate />
      <predator>寄生蜂、寄生蠅、小繭蜂、椿象、蜥蜴及鳥類等。</predator>
      <days_of_growth>冬季以二齡幼蟲越冬,幼蟲期長達半年以上。</days_of_growth>
      <defense>初齡幼蟲停棲於寄主葉脈,攝食葉脈兩側葉肉,二齡幼蟲會將寄主植物葉片咬成小塊並吐絲將其此碎片及糞便黏於葉脈造一蟲巢,越冬幼蟲即躲藏於蟲巢當中,由於幼蟲褐色之體色與蟲巢上乾枯之小葉片或糞便色澤相近,或許可混淆天敵耳目。</defense>
    </larva>
    <pupa>
      <feature>蛹體為垂蛹,中胸背方隆起,腹節末端有一柄狀懸絲器。頭部前端有一對大型明顯之彎曲角狀突出物,腹節背方均有小型鋸齒狀脊起。</feature>
      <color>蛹體底色呈黃褐色,中、後胸背方有銀色斑塊,體側氣門黑褐色。</color>
      <size>蛹體長度約為 22-27mm。</size>
      <characteristic />
      <habitate />
      <predator>蛹寄生蜂、胡蜂、姬蜂及各種真菌等。</predator>
      <defense>老熟幼蟲化蛹於隱蔽之植物叢間,藉以躲避天敵。</defense>
    </pupa>
    <adult>
      <feature>成蟲前翅外觀大致呈現三角形,翅形稍微橫長。後翅卵圓形,外觀接近三角形。雌蝶翅型較為寬圓。</feature>
      <color>雄蝶前、後翅表底色為黑色,前翅中室內有一枚長形白斑,各翅室中橫線部位有一大型白色橢圓斑,前翅端有兩枚小型白斑。後翅有兩條明顯白色橫帶紋,前後翅緣皆有不明顯小白紋。雌蟲翅表色澤花紋與雄蟲相似。</color>
      <size>本種為中型蝶種,展翅約為 50-60mm。</size>
      <characteristic>前翅中室內有一枚長形白斑。</characteristic>
      <habitate>台灣中部以北山區均有分布。</habitate>
      <predator>蜘蛛、螳螂、青蛙、蜻蜓、鳥類及蜥蜴等捕食性天敵。</predator>
      <days_of_growth>前翅中室內有一枚長形白斑。</days_of_growth>
      <defense>成蟲飛行快速,外觀與其他多種三線蝶類似,為莫氏擬態的一種。</defense>
      <season>夏季較易見到成蟲活動。</season>
      <behavior>成蝶喜吸食腐熟水果汁液或樹幹發酵流出之樹液,成蟲活動於開闊林道,常見成蟲於開闊山徑兩旁樹上佔據地盤驅趕附近飛過蝴蝶,亦可見其活動於溪邊開闊處,吸食腐果或潮濕地面水分。</behavior>
    </adult>
  </life_stage>
  <update>2000/11/7</update>
  <footnote />
</butterfly>
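The record above repeats the same descriptive fields (feature, color, size, habitate, and so on) under each life stage, and leaves many of them empty. As a minimal sketch of how such a record can be addressed by path, the snippet below parses a trimmed copy of the record with Python's standard `xml.etree.ElementTree` module and collects the non-empty fields of one stage. The trimmed XML string and the helper name `stage_fields` are illustrative assumptions and are not part of the retrieval system described in this thesis.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the butterfly record above (only a few fields kept).
RECORD = """
<butterfly>
  <cname>拉拉山三線蝶</cname>
  <classification>
    <family>Nymphalidae</family>
    <genus>Athyma</genus>
  </classification>
  <life_stage>
    <egg>
      <color>淡綠。</color>
      <habitate />
    </egg>
    <adult>
      <size>本種為中型蝶種,展翅約為 50-60mm。</size>
      <habitate>台灣中部以北山區均有分布。</habitate>
    </adult>
  </life_stage>
</butterfly>
"""


def stage_fields(root, stage):
    """Return {tag: text} for the non-empty fields of one life stage.

    Hypothetical helper: a stage missing from the record yields {}.
    """
    node = root.find(f"./life_stage/{stage}")
    if node is None:
        return {}
    return {
        child.tag: child.text.strip()
        for child in node
        if child.text and child.text.strip()
    }


root = ET.fromstring(RECORD)
family = root.findtext("./classification/family")
egg = stage_fields(root, "egg")
adult = stage_fields(root, "adult")
print(family, sorted(egg), sorted(adult))
```

Empty elements such as `<habitate />` are skipped, so only fields that actually carry text are returned for each stage.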
b. Domain Knowledge
<frame language="big5" database="xir" showPath="//butterfly//cname//">
<butterfly>
</family>
</shape>
</color>
menu="yes">
</texture>
</adult>
</s>
</s>
</s>
</pupa>
<s slot="表面">
</s>
</s>
</s>
</s>
</egg>
<s slot="體側">
</s>
</s>
</s>
menu="yes">
</s>
</larva>
</s>
</s>
</s>
</s>
</s>
menu="yes">
</s>
</s>
</s>
/>
</butterfly>
</frame>
Appendix 2 : Protein Information Resource
a. XML example
<ProteinEntry id="S35333">
  <header>
    <uid>S35333</uid>
    <accession>S35333</accession>
    <created_date>03-Feb-1994</created_date>
    <seq-rev_date>03-Feb-1994</seq-rev_date>
    <txt-rev_date>24-Sep-1999</txt-rev_date>
  </header>
  <organism>
    <source>zebra fish</source>
    <common>zebra fish</common>
    <formal>Brachydanio rerio</formal>
  </organism>
  <reference>
    <refinfo refid="S35333">
      <authors>
        <author>Fjose, A.</author>
        <author>Nornes, S.</author>
        <author>Weber, U.</author>
        <author>Mlodzik, M.</author>
      </authors>
      <citation>EMBO J.</citation>
      <volume>12</volume>
      <year>1993</year>
      <pages>1403-1414</pages>
      development.</title>
      <xrefs><xref><db>MUID</db><uid>93223680</uid></xref></xrefs>
    </refinfo>
    <accinfo label="FJO">
      <accession>S35333</accession>
      <mol-type>mRNA</mol-type>
      <seq-spec>1-411</seq-spec>
      <xrefs>
        <xref><db>EMBL</db><uid>X70299</uid></xref>
        <xref><db>NID</db><uid>g296418</uid></xref>
        <xref><db>PIDN</db><uid>CAA49780.1</uid></xref>
        <xref><db>PID</db><uid>g296419</uid></xref>
      </xrefs>
    </accinfo>
  </reference>
  <genetics><gene><uid>svp44</uid></gene></genetics>
  <classification>
  </classification>
  <keywords>
    <keyword>DNA binding</keyword>
    <keyword>zinc finger</keyword>
  </keywords>
  <feature label="ERBA">
    <feature-type>domain</feature-type>
    <seq-spec>74-320</seq-spec>
  </feature>
  <feature>
    <feature-type>region</feature-type>
    <description>zinc finger</description>
    <seq-spec>76-96</seq-spec>
  </feature>
  <feature>
    <feature-type>region</feature-type>
    <description>zinc finger</description>
    <seq-spec>112-136</seq-spec>
  </feature>
  <summary><length>411</length><type>complete</type></summary>
  <sequence>MAMVVSVWRDPQEDVAGGPPSGPNPAAQPAREQQQAASAAPHTPQTPSQPGPPSTP
GTAGDKGSQNSGQSQQHIECVVCGDKSSGKHYGQFTCEGCKSFFKRSVRRNLTYTCRANRNCPI
DQHHRNQCQYCRLKKCLKVGMRREAVQRGRMPPTQPNPGQYALTNGDPLNGHCYLSGYISLLL
RAEPYPTSRYGSQCMQPNNIMGIENICELAARLLFSAVEWARNIPFFPDLQITDQVSLLRLTWSEL
FVLNAAQCSMPLHVAPLLAAAGLHASPMSADRVVAFMDHIRIFQEQVEKLKALHVDSAEYSCIK
AIVLFTSDACGLSDAAHIESLQEKSQCALEEYVRSQYPNQPSRFGKLLLRLPSLRTVSSSVIEQLFF
VRLVGKTPIETLIRDMLLSGSSFNWPYMSIQ</sequence>
</ProteinEntry>
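Each feature in the entry above carries a `seq-spec` range that refers back into the `sequence` element. The following sketch shows one way to resolve such a range, assuming that `seq-spec` holds a single `start-end` pair of 1-based, inclusive residue positions; that reading is an assumption made here for illustration. The trimmed entry keeps one feature and only the first 120 residues of the sequence, and the helper name `feature_regions` is hypothetical.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of entry S35333: one feature and only the first two
# printed lines of the sequence (120 of the 411 residues).
ENTRY = """
<ProteinEntry id="S35333">
  <feature>
    <feature-type>region</feature-type>
    <description>zinc finger</description>
    <seq-spec>76-96</seq-spec>
  </feature>
  <sequence>MAMVVSVWRDPQEDVAGGPPSGPNPAAQPAREQQQAASAAPHTPQTPSQPGPPSTP
GTAGDKGSQNSGQSQQHIECVVCGDKSSGKHYGQFTCEGCKSFFKRSVRRNLTYTCRANRNCPI</sequence>
</ProteinEntry>
"""


def feature_regions(entry):
    """Yield (description, residues) for each feature (hypothetical helper).

    Assumes seq-spec is a single 'start-end' range, 1-based and inclusive.
    """
    # The printed sequence is wrapped across lines, so drop all whitespace.
    seq = "".join(entry.findtext("sequence", default="").split())
    for feat in entry.findall("feature"):
        start, end = (int(n) for n in feat.findtext("seq-spec").split("-"))
        yield feat.findtext("description", default=""), seq[start - 1:end]


entry = ET.fromstring(ENTRY)
regions = list(feature_regions(entry))
print(regions)
```

Under this assumption the range 76-96 selects a 21-residue stretch that begins and ends with cysteine, which is at least consistent with the feature's "zinc finger" description.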
b. Domain Knowledge
<frame>
</s>
</structure>
</source_genus>
</body_component>
</cell_component>
</body_function>
</cell_function>
<v value="金屬">
</v>
</material>
</property>
</s>
</s>
</s>
</s>
</source_genus>
</frame>