
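National Taiwan University, Graduate Institute of Computer Science and Information Engineering

Doctoral Dissertation

An XML Document Retrieval Method Based on a Slot-Filling Mechanism

(With Case Studies on the Retrieval of Butterflies and Proteins)

XML Retrieval - A Slot Filling Approach

Graduate student: 陳鍾誠

Advisor: 項潔

July 2002 (Year 91 of the Republic of China)
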
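Acknowledgments
This dissertation is the main result of my work from 1997 to 2002. It sums up the research and experience I gained in life, at work, and at school. If any part of it is worth consulting, the credit belongs to my parents and teachers, and to the colleagues and friends who taught and helped me.

First, I thank my advisor, Professor 項潔. Through everyday discussions and formal classes I learned research methods and the right attitude toward research. His patient guidance always benefited me greatly. His insight into theory building, research intuition, and the observation of working systems was always penetrating and deeply admirable.

Next, I thank Professor 高成炎, who was my master's advisor and who also gave me a great deal of guidance and support during my doctoral studies, both in research direction and in the field of bioinformatics.

I also thank Professor 許聞廉 of the Institute of Information Science, Academia Sinica. In the first years of my doctoral research he inspired my enthusiasm for and confidence in natural language research, and he gave me considerable freedom along with careful guidance.

Several people gave me special help: Professor 洪炯宗 of National Central University, Professor 楊進木 of National Chiao Tung University, and Director 振剛. Their help and guidance with my research papers are among the reasons this dissertation could be completed smoothly.

I then thank my classmates, 林耀仁, 杜協昌, 黃光璿, 謝育平, 劉文俊, 潘家煜, 傅國長, 陳宏杰, 陳必衷, 余禎祥, 黃子葵, 賴勝華, 洪智瑋, 劉秉涵, 陳瑞呈, 徐代昕, 陳詩沛, and 陳耀將, for their help and care during these years of study. I hope all of them have a bright future.

Special thanks go to 謝育平, who offered many valuable suggestions in the final stage of this dissertation. Only through our many discussions did the dissertation take its present form, and I benefited greatly from them.

I also thank the successive research assistants, 胡純毓, 張慶瑞, 許玉霜, 鐘淑微, and 梁素瑜. Without their efforts the results of the laboratory could not have accumulated, and we would not have had such a fine research environment.

Finally, I thank my father and mother, who supported me steadfastly throughout my doctoral studies and helped me get safely through the ups and downs along the way. I feel deeply guilty that I could not fulfill my duty to take care of them during these years. I also thank my elder brother: these years have truly been hard on you.

There are really too many people to thank. I now truly understand what 陳之藩 wrote in his essay “謝天” (Thanking Heaven): “There are too many people to thank, and we cannot thank them one by one, so we can only thank Heaven to express our boundless gratitude!” In the days to come, I wish everyone health, happiness, and a wonderful day every day.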
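Abstract (translated from the Chinese)
Since the W3C proposed the Extensible Markup Language (XML) in 1998, it has been widely used for document exchange and knowledge representation. Because XML documents carry semantic markup and are semi-structured, XML retrieval has considerable potential. To take full advantage of these properties, this dissertation uses a specially designed knowledge representation to develop a retrieval mechanism for XML documents.

Because computers do not easily understand natural language documents, a semantic gap arises between humans and machines. For an XML retrieval system, the semantic gap can be divided into a gap on the query side and a gap on the document side. The query-side gap arises mainly because structured query languages are hard to write. The document-side gap arises because computers cannot understand XML documents. To address the semantic gap, this dissertation proposes a knowledge representation centered on the slot-tree (Slot-Tree Ontology) and uses it to address the semantic gap in XML retrieval systems.

A slot-tree is an object-based knowledge representation that is particularly suited to retrieving object-like XML documents. In this dissertation we first design slot-trees to represent background knowledge about objects. We then develop a slot-filling mechanism (Slot-Filling Algorithm) that maps XML documents into the slot-tree in order to capture their semantics. Using the slot-tree and the filling mechanism, we design a semantic retrieval method for XML documents. It comprises a multi-slot query interface, a retrieval model that makes full use of semantic tags, and a summarization technique, so that the system can retrieve XML documents precisely and dynamically extract a semantic tree for browsing.

Because building a slot-tree is not easy, we develop a data mining algorithm (Slot-Mining Algorithm) that automatically extracts a slot-tree from a collection of XML documents. The method statistically analyzes the correlation coefficients between semantic tags and terms in order to find characteristic terms to fill into the slots. It thereby constructs the slot-tree automatically and makes slot-tree construction easier.

We test the XML retrieval system on two real cases, the Digital Museum of Butterflies in Taiwan and the Protein Information Resource. We find that the system retrieves XML documents more accurately and organizes the results for browsing. In addition, the program that builds slot-trees automatically can effectively fill characteristic terms into slots, but manual revision is still needed to improve the quality of the slot-tree.

Finally, we summarize the contributions of this dissertation to XML retrieval, compare it qualitatively with some existing methods to show the strengths and weaknesses of our approach, and propose possible directions for future research.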
Abstract
Extensible Markup Language (XML) is widely used in data exchange and knowledge representation. A retrieval system that manages the content of XML documents is strongly desired. In order to improve the efficiency of XML retrieval systems, we design a set of methods based on an ontology called the slot-tree, and we use slot-trees to support the XML retrieval process.

One obstacle to building smart computers is that computers cannot understand natural language as well as humans can. This is called the semantic gap between humans and computers. For XML retrieval systems, the semantic gap lies on both the query side and the document side. The gap on the query side is due to the difficulty humans have in writing structured queries. The gap on the document side is due to the difficulty computers have in understanding XML documents. In order to reduce the semantic gap, we design an XML retrieval system based on the notion of a slot-tree ontology.

The slot-tree ontology is an object-based knowledge representation. In this thesis we develop the slot-tree ontology to represent the inner structure of an object. We then introduce a slot-filling algorithm that maps XML documents into the slot-tree ontology in order to capture their semantics. After that, we design an XML retrieval system based on the slot-tree ontology and the slot-filling algorithm. The system includes a slot-based query interface, a semantic retrieval model for XML, and a program that extracts summaries for browsing.

Since constructing a slot-tree is not an easy job, we also develop a slot-mining algorithm to construct the slot-tree automatically. Our slot-mining algorithm is a statistical approach based on correlation analysis between tags and words. Highly correlated terms are filled into the slot-tree as values. This algorithm eases the construction of the slot-tree.

Two XML collections, one on butterflies and another on proteins, are used as test beds for our XML retrieval system. We found that our XML retrieval system is easy to use and performs well in retrieval effectiveness and quality of browsing. Furthermore, the slot-mining algorithm can fill important words into each slot. However, the mining results should be modified manually in order to improve the quality of the slot-tree.

Finally, we summarize our contributions to XML retrieval and compare our methods with some other methods. A qualitative analysis is given in the last chapter. We also suggest directions for further research.
XML Retrieval - A Slot-Filling Approach

Ph.D. Dissertation

Chen, Chung Chen

Department of Computer Science and Information Engineering


National Taiwan University
Taipei, Taiwan
E-mail : johnson@turing.csie.ntu.edu.tw

Advisor : Jieh Hsiang

23 July 2002
Contents
Part 1 : Tutorial of This Thesis
1 Introduction
1.1 Motivation
1.2 Research problems
1.3 Research approaches
1.4 Outline of this thesis

2 Background – XML and Information Retrieval
2.1 XML
2.2 Information retrieval
2.3 XML querying and retrieval
2.4 Using ontology to help the XML retrieval process
2.5 Discussion

Part 2 : Slot-Tree Based Methods for XML Retrieval
3 Slot-Tree Ontology and Slot-Filling Algorithm
3.1 Introduction
3.2 Slot-tree ontology
3.3 Slot-filling algorithm
3.4 Discussion

4 An Ontology Based Approach for XML Querying, Retrieval and Browsing
4.1 Introduction
4.2 XML documents
4.3 Indexing structure
4.4 Query language and query interface
4.5 Ranking strategy
4.6 Browsing XML documents
4.7 Discussion

5 The Construction of Slot-Tree Ontology
5.1 Introduction
5.2 Background
5.3 The process of building a slot-tree
5.4 Slot-mining algorithm
5.5 Discussion

Part 3 : Case Studies
6 Case Study - A Digital Museum of Butterflies
6.1 Introduction
6.2 The representation of butterflies in XML
6.3 Slot-tree ontology for butterflies
6.4 Query interface
6.5 Slot-filling algorithm
6.6 XML retrieval
6.7 Slot-mining algorithm
6.8 Discussion

7 Case Study - Protein Information Resource
7.1 Introduction
7.2 The representation of proteins in XML
7.3 Slot-tree ontology for proteins
7.4 Query interface
7.5 Slot-filling algorithm
7.6 XML retrieval
7.7 Slot-mining algorithm
7.8 Discussion

Part 4 : Conclusions
8 Conclusions and Contributions
8.1 Comparison
8.2 Contributions
8.3 Discussion and future work

References

Appendix 1 : A Museum of Butterflies in Taiwan
Appendix 2 : Protein Information Resource
Part 1 : Tutorial of This Thesis

1 Introduction
This thesis introduces an information retrieval (IR) method for XML. A major problem in information retrieval is that computers cannot understand documents as well as people can. We call this the semantic gap problem. Our goal is to build an information retrieval system that reduces the semantic gap between humans and computers for XML. Our approach uses an ontology to support the search process for XML, including querying, retrieval and browsing. Section 1.1 presents our motivation, section 1.2 states our research problems, section 1.3 describes our research approaches, and section 1.4 outlines the rest of this thesis.

1.1 Motivation
Extensible Markup Language (XML) [XML98] is a standard for encoding semi-structured documents. XML is useful for data representation, data exchange and data publishing on the web. Many people believe that XML will become a widespread standard. For this reason, XML has gained much attention in both the information community and the field of database research.

XML is a markup language with extensible tags. Anyone may define a markup language based on XML. In fact, hundreds of XML-based specifications were proposed from 1997 to 2002. These specifications are designed to meet the needs of particular domains or applications. For example, the Protein Information Resource (PIR) (http://pir.georgetown.edu/) is an XML collection designed to record data about proteins, and UDDI [UDDI00] is an XML specification designed to record the profiles of business companies.

XML is designed to be easily understood by both humans and computers. It is encoded as text so that people can read and understand it easily, and its tags provide semantic context that helps computers “understand” the content correctly. XML can therefore serve as a bridge between human writing and computer understanding.

A smart computer program that understands XML documents would be useful. However, building a computer program to “understand” XML documents is still very difficult. In this thesis, we propose methods for computers to “understand” XML documents.

The natural language processing (NLP) community has long focused on processing and understanding natural language documents [Grosz86]. However, understanding natural language documents is still very difficult for computer programs, and no existing approach is powerful enough to solve the understanding problem. The difficulty of building a smart program that understands natural language text is what we call the “semantic gap”, which can be stated as follows.

“Computers cannot understand natural language as well as humans.”

The semantic gap causes difficulties for information retrieval systems. For example, an information retrieval system cannot understand our natural language queries, so it retrieves many documents that are not semantically related to them.

An information retrieval system faces two semantic gaps, one in understanding queries and another in understanding documents. These gaps are listed as follows.

Gap 1 : “Computers cannot understand queries as well as humans.”

Gap 2 : “Computers cannot understand documents as well as humans.”

Figure 1.1 : Semantic gaps of natural language

In order to reduce the semantic gap, researchers in the NLP community have been trying hard to answer the following question.
“How can we make computers understand natural language?”

However, natural language is currently too difficult for computer programs to understand. Although many people have devoted themselves to the problem for more than thirty years, designing a computer program that understands natural language is still an open research problem.

Computers do not understand natural language well. Why not design a structured language that is easy for computers to understand and easy for humans to write? If we could design such a language, humans and computers would share a common language. People could write documents in it, and we could build computer programs to understand those documents.

XML is such a language in that it is easy for humans to write. However, we have no method that lets computers understand XML documents easily. If we can design such a method, we can reduce the semantic gap for XML, so that XML can serve as a bridge between humans and computers.

In this thesis, our goal is to reduce the semantic gap for XML. Our approach is to design methods for computers to understand XML documents. Our research problems are described in the next section.

1.2 Research problems

XML is a markup language with extensible tags. People have to understand the tags before they can write XML documents. If there are too many tags to remember, a writer cannot write XML documents easily. If a writer has to mark up every word, writing is not easy either. On the other hand, if a writer marks up documents only roughly, the documents are difficult for computers to understand. We call this tradeoff between ease of human writing and ease of computer understanding the “human-computer dilemma of XML”.

Our goal is to design an XML retrieval system that resolves the “human-computer dilemma of XML”. For an XML retrieval system, there are two semantic gaps between humans and computers, one on the query side and another on the document side. Figure 1.2 shows these two gaps.
Figure 1.2 : Semantic gaps of XML

On the document side, an XML document may be easy for humans to write but not so easy for computers to understand, as is the case for a document that contains a lot of natural language text. Example 1.1 shows an XML document that contains natural language text in its “color” and “size” blocks, which makes it hard for computers to understand.

Example 1.1 : An XML document that is not easy for computer to understand

<butterfly name="kodairai">
  <color>with black wing and white spots on it</color>
  <size>middle size butterflies, from 50mm to 60mm</size>
</butterfly>

Conversely, an XML document may be easy for computers to understand but not so easy for humans to read and write, as is the case for a document that marks up every word. Example 1.2 shows such a document.

Example 1.2 An XML document that is not easy for human to read and write

<butterfly name="kodairai">
  <color><wing>black</wing><texture>white spot</texture></color>
  <size>
    <classification>middle size</classification>
    <from>50mm</from><to>60mm</to>
  </size>
</butterfly>

The same thing happens on the query side. An XML query may be easy for humans to write but not so easy for computers to understand, as is the case for a query written in natural language. Example 1.3 shows an XML query that is not so easy for computers to understand.

Example 1.3 An XML query that is not easy for computer to understand

<butterfly>in black color with white spots</butterfly>

Conversely, an XML query may be easy for computers to understand but not easy for humans to read and write, as is the case for a structured XML query. Example 1.4 shows such a query.

Example 1.4 An XML query that is not easy for human to read and write

for $b in //butterfly
where $b/color = "black" and $b/texture = "white spots"
return $b

Two approaches may be used to reduce the semantic gap between humans and computers for XML. The first is to build computer programs that understand XML documents or queries. The second is to build tools that help people write XML documents or queries.

We adopt the first approach on the document side and the second approach on the query side. That is, we build a computer program that understands roughly tagged XML documents, and we build a tool that lets people write XML queries easily. The following section describes our approach.

1.3 Research approaches

In this thesis, we build an XML retrieval system to reduce the semantic gap between humans and computers for XML. An ontology called the slot-tree is used to support the retrieval process. A user can write queries easily through the query interface, and the slot-tree ontology also helps the computer understand XML documents. Figure 1.3 shows a scenario of our XML retrieval system.
Figure 1.3 : A scenario of our XML retrieval system.

On the document side, we build a computer program to understand XML documents. The “understanding” process is based on an ontology called the slot-tree. The slot-tree is a frame-like representation with embedded XPath [XPATH99] expressions. In order to make computers understand XML documents, we designed a slot-filling algorithm that maps XML documents into the slot-tree.
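
The idea of mapping a roughly tagged document into slots can be illustrated with a small Python sketch. It is not the slot-filling algorithm of this thesis, which is defined in chapter 3. The sketch assumes, purely for illustration, that a slot is a path expression plus a list of candidate values, and that a slot is filled when a candidate value occurs as a word in the text found at that path. The names SLOT_TREE and fill_slots and the candidate values are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical slot-tree: each slot names where to look (a path in
# ElementTree's XPath subset) and which values it may take.
SLOT_TREE = {
    "wing_color": {"path": "./color", "values": ["black", "white", "yellow"]},
    "texture": {"path": "./color", "values": ["spots", "stripes"]},
    "size": {"path": "./size", "values": ["small", "middle", "large"]},
}

# The roughly tagged document of Example 1.1.
DOC = """<butterfly name="kodairai">
  <color>with black wing and white spots on it</color>
  <size>middle size butterflies, from 50mm to 60mm</size>
</butterfly>"""


def fill_slots(xml_text, slot_tree):
    """Fill each slot with the candidate values found at the slot's path."""
    root = ET.fromstring(xml_text)
    filled = {}
    for slot, spec in slot_tree.items():
        # Collect the text of every node selected by the slot's path.
        text = " ".join((node.text or "") for node in root.findall(spec["path"]))
        words = text.lower().replace(",", " ").split()
        filled[slot] = [value for value in spec["values"] if value in words]
    return filled


result = fill_slots(DOC, SLOT_TREE)
print(result)
```

In this toy run the word “white” also fills the wing color slot, although it describes the spots. Plain word matching cannot tell the two apart, which hints at why background knowledge about the object is needed.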

On the query side, we build a query interface that lets people write queries easily. The interface is built by transforming the ontology into a web page. A user builds a structural query simply by choosing or typing values into slots.

In our approach, the slot-tree ontology is a key component for both document understanding and query building. The slot-tree ontology mediates between queries and documents in the retrieval process to reduce the semantic gaps on both the query side and the document side.

However, building the slot-tree ontology is not an easy job, and the ontology constructor needs tools to help. The problem of constructing a slot-tree automatically from a set of XML documents is called the slot-mining problem. It is stated as follows.

“How can we mine the slot-tree ontology from a collection of XML documents?”

To handle the slot-mining problem, we developed a statistical method that builds the slot-tree automatically. The method, called the slot-mining algorithm, is based on correlation analysis between tags and terms in XML documents.

1.4 Outline of this thesis

This thesis is divided into four parts: the “tutorial part”, the “methods part”, the “case study part” and the “conclusion part”.

Part 1 sets the stage for all the others. Chapter 1 outlines the research problems and approaches. Chapter 2 reviews the background literature for our research, “Designing an XML retrieval system to reduce the semantic gap problem”.

Part 2 is a detailed description of our methods. Our methods are based on a knowledge representation structure called the slot-tree. The slot-tree is used to capture the semantics of XML documents, which helps our XML retrieval system understand them. Chapter 3 presents the syntax and semantics of the slot-tree ontology, together with the slot-filling algorithm, a method that uses the slot-tree to capture the semantics of XML documents. Chapter 4 outlines an XML information retrieval system based on the slot-tree, in which the slot-tree ontology and the slot-filling algorithm are used to reduce the semantic gap of XML retrieval. Chapter 5 describes the process of constructing a slot-tree ontology. The construction steps are outlined first, and then a method that constructs the slot-tree automatically is proposed. The method is a statistical program called the slot-mining algorithm, which mines slot-trees from XML documents based on correlation analysis between tags and terms. It helps people construct the slot-tree ontology for a given XML collection.

Part 3 presents test beds for the slot-tree based approach, which is examined on two cases. Chapter 6 presents the first case, a collection of XML documents in Chinese about butterflies in Taiwan. Chapter 7 presents the second case, the Protein Information Resource (PIR), a large set of XML documents released by Georgetown University. The experiments on these two cases are used to analyze the strengths and weaknesses of the slot-tree based approach.

Part 4 is the conclusion. Chapter 8 analyzes the strengths of the slot-tree based approach. We compare the slot-tree based methods with some other XML retrieval methods and present our contributions, conclusions and future work.
2 Background – XML and Information Retrieval
In chapter 1, we introduced our motivation, goals and research approaches. In brief, we would like to build an XML retrieval system that reduces the semantic gap between humans and computers for XML. In this chapter, we survey related research in order to provide background for our work. Since our approach uses the slot-tree ontology to support the XML retrieval process, we survey the topics of XML, information retrieval and ontology.

In section 2.1, we survey the specifications and technologies related to XML. In section 2.2, we survey information retrieval technologies. In section 2.3, we survey the current status and state of the art of XML retrieval. Finally, in section 2.4, we outline the relationship between ontology and XML retrieval.

2.1 XML
We have to understand XML in order to build an XML retrieval system that reduces the semantic gap. In this section, we survey XML-related specifications and technologies, especially the literature on knowledge representation and information retrieval.

XML was proposed by the World Wide Web Consortium (W3C) (http://www.w3c.org) in 1998. It is a tree-structured markup language with extensible tags. The following example is an XML document for a phonebook.

Example 2.1 An XML document

<?xml version="1.0"?>
<!DOCTYPE phonebooks SYSTEM "phonebook.dtd">
<phonebooks xmlns="http://www.ntu.edu.tw/phonebook">
  <people id="001">
    <name>Johnson Chen</name>
    <tel>02-34134345</tel>
  </people>
  <people id="002">
    <name>Fanny Chen</name>
    <tel>02-33451294</tel>
  </people>
</phonebooks>

In Example 2.1, the head part <?xml version="1.0"?> indicates that the document is an XML document. The second line is the document type definition (DTD) part of this XML document. A DTD is used to validate the syntax of XML documents. The DTD part is optional and can be removed to skip syntax validation.

The third line, with the “phonebooks” tag, is the root node of this XML document. An XML document has one and only one root node. In this line, xmlns="http://www.ntu.edu.tw/phonebook" declares the default namespace of the document. Namespaces [XMLNS99] are used in XML to distinguish tags that have the same name from each other, so that people can define their own tags and use the tags of others without worrying about the same tag name being used with different meanings.

A node in XML contains a tag, attributes and text. In the example above, “phonebooks”, “people”, “name” and “tel” are tags, “xmlns” and “id” are attributes, and “http://www.ntu.edu.tw/phonebook”, “Johnson Chen” and “02-34134345” are text parts.
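
The parts of a node can be inspected with a short Python sketch that parses the body of Example 2.1 using the standard xml.etree.ElementTree module. The XML declaration and the DTD line are left out, since the text above notes that the DTD part is optional. The helper name describe is only illustrative.

```python
import xml.etree.ElementTree as ET

PHONEBOOK = """<phonebooks xmlns="http://www.ntu.edu.tw/phonebook">
  <people id="001">
    <name>Johnson Chen</name>
    <tel>02-34134345</tel>
  </people>
  <people id="002">
    <name>Fanny Chen</name>
    <tel>02-33451294</tel>
  </people>
</phonebooks>"""

# ElementTree writes a namespaced tag as "{namespace}localname".
NS = "{http://www.ntu.edu.tw/phonebook}"

root = ET.fromstring(PHONEBOOK)


def describe(node):
    """Return (local tag name, attributes, stripped text) for one node."""
    local = node.tag[len(NS):] if node.tag.startswith(NS) else node.tag
    return local, dict(node.attrib), (node.text or "").strip()


# root.iter() walks the root and all of its descendants in document order.
parts = [describe(node) for node in root.iter()]
for part in parts:
    print(part)
```

Note that this parser treats the xmlns declaration as a namespace binding rather than an ordinary attribute, so it shows up in the qualified tag name and not in the attribute dictionary of the root node.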

XPath [XPATH99] is a specification for locating nodes in XML documents. To locate all the “people” nodes, we may use the XPath expression “//people”. The “//” operator matches descendant nodes at any depth. To locate the “people” node with id = “001”, we may use the XPath expression “//people[@id='001']”. The “@” symbol indicates that “id” is an attribute name. XPath is used in the slot-tree ontology discussed in chapter 3. We embed XPath expressions in the slot-tree to locate nodes in XML, and we use them to map XML documents into the slot-tree ontology.
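
The two XPath expressions can be tried with Python's standard xml.etree.ElementTree module, which supports a limited subset of XPath. This is a minimal sketch under two assumptions: the namespace declaration of Example 2.1 is dropped to keep the paths short, and “//people” is written “.//people” because ElementTree paths are relative to the element on which they are evaluated.

```python
import xml.etree.ElementTree as ET

# Example 2.1 without the namespace declaration (a simplification).
PHONEBOOK = """<phonebooks>
  <people id="001"><name>Johnson Chen</name><tel>02-34134345</tel></people>
  <people id="002"><name>Fanny Chen</name><tel>02-33451294</tel></people>
</phonebooks>"""

root = ET.fromstring(PHONEBOOK)

# "//people": every "people" node at any depth below the root.
all_people = root.findall(".//people")

# "//people[@id='001']": only the "people" node whose id attribute is "001".
first = root.findall(".//people[@id='001']")

names = [person.findtext("name") for person in all_people]
print(names)
print(first[0].findtext("tel"))
```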

Many XML-related specifications have been proposed since 1997. XML has become a widespread specification used in many domains and applications, such as “data exchange”, “data presentation”, “data querying” and “knowledge representation”. For data exchange, UDDI and ebXML are used to mediate the exchange of data between business enterprises. For data presentation, XSLT can be used to transform XML into HTML for presentation on the web. For data querying, XQL, XML-QL and X-Query are used to query data in XML documents. For knowledge representation, RDF/RDFS, DAML/DAMLS and XML Topic Maps have been proposed to represent knowledge in XML format. We survey specifications for data querying in section 2.3, which discusses XML querying and retrieval, and specifications for knowledge representation in section 2.4, which discusses ontology.

2.2 Information retrieval

In order to build an XML retrieval system that reduces the semantic gap, we have to understand information retrieval technologies and how natural language understanding technologies can be used to reduce the semantic gap of XML.

The evolution of IR techniques is closely related to the structure of the target documents. Each time a new document structure has been proposed, a new IR technique has been developed. Between 1970 and 1980, the Vector Space Model was developed to retrieve text documents. Between 1990 and 1999, the Random Walk Model was developed to retrieve HTML documents. Today, XML documents are becoming widespread, and many researchers are trying to develop new retrieval models for XML.

Text Retrieval
Text retrieval technology is almost as old as computer technology. There are many models for text retrieval. The best known is the Vector Space Model (VSM) [Salton75]. In this model, each document is represented by a k-dimensional vector of term weights, where k is the number of index terms in the collection. A plain text document d is expressed as follows.

\[ d = (d_{t_1}, d_{t_2}, \ldots, d_{t_k}) \]

Here $d_{t_i}$ is the weight of term $t_i$ in document d. The order of words in the text sequence is discarded.

A query is represented by a k-dimensional vector of term weights as well. The query q may be represented as the following vector.

\[ q = (q_{t_1}, q_{t_2}, \ldots, q_{t_k}) \]

Here $q_{t_i}$ is the weight of term $t_i$ in query q.

The cosine coefficient is a popular measure of the similarity between a document and a query. The cosine similarity is defined as the cosine of the angle between the document vector d and the query vector q.

\[ \mathrm{Similarity}(d, q) = \frac{d \cdot q}{|d| \, |q|} = \frac{\sum_{i=1}^{k} d_{t_i} \, q_{t_i}}{\sqrt{\sum_{i=1}^{k} d_{t_i}^{2}} \; \sqrt{\sum_{i=1}^{k} q_{t_i}^{2}}} \]
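
The cosine similarity can be computed directly from this definition. The Python sketch below assumes that the document and the query are already given as weight vectors of the same length k. Returning 0.0 for an all-zero vector is a convention chosen here, not something taken from the vector space model itself.

```python
import math


def cosine_similarity(d, q):
    """Cosine of the angle between weight vectors d and q of equal length."""
    if len(d) != len(q):
        raise ValueError("d and q must have the same number of terms")
    dot = sum(d_i * q_i for d_i, q_i in zip(d, q))
    norm_d = math.sqrt(sum(d_i * d_i for d_i in d))
    norm_q = math.sqrt(sum(q_i * q_i for q_i in q))
    if norm_d == 0.0 or norm_q == 0.0:
        return 0.0  # convention chosen here for empty vectors
    return dot / (norm_d * norm_q)


# Vectors that point in the same direction score 1, orthogonal ones score 0.
print(cosine_similarity([1.0, 2.0, 0.0], [2.0, 4.0, 0.0]))
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))
```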

One question is how to set the weights $d_{t_i}$ and $q_{t_i}$ in the vector space model. The “tfidf” function is a simple and commonly used weighting function. The “tfidf” weight is defined as the product of the term frequency (tf) and the inverse document frequency (idf).

Term frequency (tf) : tf(t, d) is the number of occurrences of term t in document d.
Document frequency (df) : $df_t$ is the number of documents that contain term t.
Inverse document frequency (idf) : idf(t) is the logarithm of the inverse of the fraction of documents in which term t occurs.

\[ \mathrm{idf}(t) = \log(N / df_t) \]

where N is the number of documents.

For a given document d,

\[ d_{t_i} = \mathrm{tfidf}(t_i, d) = \mathrm{tf}(t_i, d) \cdot \mathrm{idf}(t_i) \]

For a given query q,

\[ q_{t_i} = \mathrm{tfidf}(t_i, q) = \mathrm{tf}(t_i, q) \cdot \mathrm{idf}(t_i) \]

The SMART system experiments led by Salton [Salton88] show that the “tfidf” term weighting function performed best among the 287 distinct combinations of term-weighting assignments that were tested. The “tfidf” weighting function has thus been shown empirically to be a good measure for the vector space model.
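
The weighting scheme can be written out in a few lines of Python. The sketch below follows the definitions above with the natural logarithm. The three tiny documents are toy data made up for illustration, and the helper names are not taken from the SMART system.

```python
import math
from collections import Counter


def build_idf(docs):
    """Return (vocabulary, idf) with idf(t) = log(N / df_t)."""
    n = len(docs)
    vocab = sorted({term for doc in docs for term in doc})
    # df_t counts each document at most once per term.
    df = Counter(term for doc in docs for term in set(doc))
    idf = {term: math.log(n / df[term]) for term in vocab}
    return vocab, idf


def tfidf_vector(tokens, vocab, idf):
    """Weight vector of a document or query: tf(t, x) * idf(t) per term."""
    tf = Counter(tokens)
    return [tf[term] * idf[term] for term in vocab]


# Toy collection (illustrative data only).
docs = [
    ["black", "wing", "white", "spots"],
    ["white", "wing"],
    ["yellow", "wing", "black", "black"],
]
vocab, idf = build_idf(docs)
doc_vectors = [tfidf_vector(doc, vocab, idf) for doc in docs]
query_vector = tfidf_vector(["black", "spots"], vocab, idf)
print(vocab)
print(doc_vectors[2])
```

A term such as “wing” that occurs in every document gets idf = log(1) = 0, so it contributes nothing to any similarity score.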

HTML Retrieval
The main issue in HTML retrieval is measuring the importance of a document. An HTML retrieval system retrieves the documents that match the query and then sorts them by importance. On the web there are too many matching documents to read, so the importance measure helps users decide what to read.

Documents on the web differ from a plain text collection because of their hyperlink structure. The importance of an HTML page is measured by hyperlink analysis, which historically developed from citation analysis. A simple strategy is to measure the importance of a web page by counting the number of hyperlinks that refer to it: a web page referenced by many other pages is important.

In 1998, a random walk model for weighting the importance of web pages was proposed [Brin98][Page98]. The random walk model was then used in the Google search engine. In the random walk model, a page is important if it is cited by many important pages. Formally, each web page p in the random walk model has a weight w(p). An iterative process recalculates w(p) in each iteration, where E is the set of hyperlinks and (q, p) ∈ E means that page q links to page p.

\[ w(p) \leftarrow \sum_{q : (q, p) \in E} w(q) \]

Conceptually, the random walk model simulates a person clicking through web pages at random. The random walker chooses a web page at random as a start page. He then randomly clicks a link on that page and repeats the process on each page he reaches. In the random walk model, an important page is visited with high probability.
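
The iteration can be sketched in Python as follows. The sketch applies the simplified recurrence given above and rescales the weights to sum to 1 after every step. The rescaling is an assumption added here to keep the numbers bounded. The published PageRank method additionally divides each weight by the number of outgoing links of the citing page and uses a damping factor, both of which are omitted. The four-link graph is a made-up example.

```python
def random_walk_weights(pages, edges, iterations=50):
    """Iterate w(p) <- sum of w(q) over links (q, p), rescaled to sum to 1."""
    w = {p: 1.0 / len(pages) for p in pages}
    for _ in range(iterations):
        new_w = {p: 0.0 for p in pages}
        for q, p in edges:  # page q links to page p
            new_w[p] += w[q]
        total = sum(new_w.values())
        if total == 0.0:  # no links at all: keep the previous weights
            return w
        w = {p: value / total for p, value in new_w.items()}
    return w


# Toy web graph (illustrative): a -> b, a -> c, b -> c, c -> a.
pages = ["a", "b", "c"]
edges = [("a", "b"), ("a", "c"), ("b", "c"), ("c", "a")]
weights = random_walk_weights(pages, edges)
print(weights)
```

In this graph page c is cited by both a and b, so it ends up with the largest weight, while b, which is cited only by a, ends up with the smallest.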

Kleinberg proposed a hub-authority model to weight the impact of a web page [Kleinberg98]. Web pages are divided into two classes in this model, hub pages and authority pages. The hub-authority model is an iterative process. A hub page (h) is important if it points to many important authority pages. An authority page (a) is important if it is cited by many important hub pages. Formally, each page (d) in the hub-authority model has two weights, the hub weight h(d) and the authority weight a(d). An iterative process recalculates h(d) and a(d) in each iteration. Figure 2.1 shows the concept of the hub-authority model.

Figure 2.1 The hub-authority model

A set of web pages (D) contains many hyperlinks (E). For each page d in D, h(d) is the hub weight of d, and a(d) is the authority weight of d. Initially, we may set both h(d) and a(d) to 1/|D|, where |D| is the number of documents in D. After that, h(d) and a(d) are recalculated iteratively using the following recurrence equations.

\[ a(p) \leftarrow \sum_{q : (q, p) \in E} h(q) \]

\[ h(p) \leftarrow \sum_{q : (p, q) \in E} a(q) \]

The hub-authority model is used to weight the importance of a web page and to decide whether a page is a hub or an authority. Besides weighting importance, the hub-authority model thus provides a mechanism for classifying the type of a web page.
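
The two recurrence equations can be sketched in Python as follows. The sketch starts from h(d) = a(d) = 1/|D| as described above. Two details are assumptions made here: the hub update uses the authority weights computed in the same iteration, and both weight sets are rescaled to sum to 1 after each step, whereas Kleinberg's paper normalizes by the Euclidean norm. The graph is a made-up example.

```python
def hub_authority(pages, edges, iterations=50):
    """Iterate a(p) <- sum h(q) over (q, p) and h(p) <- sum a(q) over (p, q)."""
    hub = {p: 1.0 / len(pages) for p in pages}
    auth = {p: 1.0 / len(pages) for p in pages}
    for _ in range(iterations):
        new_auth = {p: 0.0 for p in pages}
        for q, p in edges:  # q links to p, so q's hub weight flows to p
            new_auth[p] += hub[q]
        new_hub = {p: 0.0 for p in pages}
        for p, q in edges:  # p links to q, so q's authority flows back to p
            new_hub[p] += new_auth[q]
        auth_total = sum(new_auth.values())
        hub_total = sum(new_hub.values())
        if auth_total == 0.0 or hub_total == 0.0:  # no links at all
            break
        auth = {p: value / auth_total for p, value in new_auth.items()}
        hub = {p: value / hub_total for p, value in new_hub.items()}
    return hub, auth


# Toy graph (illustrative): h1 links to a1 and a2, h2 links to a1 only.
pages = ["h1", "h2", "a1", "a2"]
edges = [("h1", "a1"), ("h1", "a2"), ("h2", "a1")]
hub, auth = hub_authority(pages, edges)
print(hub)
print(auth)
```

Pages h1 and h2 receive no links, so their authority weight is 0, and a1 and a2 have no outgoing links, so their hub weight is 0. This is how the two weights separate hubs from authorities.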

Both the hub-authority model and the random walk model use an iterative approach to decide the importance of a web page. Convergence analysis based on eigenvalues from linear algebra is used to analyze the behavior of the recurrence equations in these models. The papers of Kleinberg [Kleinberg98] and Page et al. [Page98] discuss the theory of these models further.

2.3 XML Querying and Retrieval
In order to manage XML documents, the database community and the IR community have recently focused on storing, indexing, querying and retrieving XML documents. For storing, database management systems have been extended to support XML documents. One way is to extend a relational database system to store XML documents. Another way is to store XML documents in an object-oriented database (OODB) system. For indexing, Patricia tries and inverted files are used to index XML documents. For querying, several XML query languages have been proposed to retrieve XML nodes. For searching, several systems have been designed to search XML documents. In this section, we focus on XML query languages and XML retrieval systems.

XML Query Language

Designing query languages is a hot research topic for XML. XML query languages are much more complex than the queries used in text retrieval and HTML retrieval, and they are more flexible than database query languages. Many XML query languages have been proposed in recent years, such as Lorel [Loral97], XML-QL [XML-QL98], XML-GL [XML-GL99] and X-Query [XQuery01].

Querying an XML collection is like querying a database. In a relational database, we usually query tables with the SQL language. The following example shows a query that retrieves the names and birthdays of United States presidents.

SELECT name, birthday FROM people WHERE nation = 'US' AND job = 'president'

An XML query language has to retrieve nodes from the tree of XML nodes. The following example shows an X-Query query that retrieves the names and birthdays of United States presidents.

for $p in //people
let $n := $p/name, $b := $p/birthday
where $p/job = "president" and $p/nation = "US"
return ($n, $b)
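
To make the meaning of the query concrete, the same selection can be written as an explicit loop over the XML tree. The Python sketch below uses the standard xml.etree.ElementTree module. The document and the three records in it are placeholder data invented for illustration, since the text does not show the queried document.

```python
import xml.etree.ElementTree as ET

# Placeholder data (invented for illustration).
PEOPLE_XML = """<directory>
  <people><name>Person A</name><birthday>1950-01-01</birthday>
    <job>president</job><nation>US</nation></people>
  <people><name>Person B</name><birthday>1960-02-02</birthday>
    <job>senator</job><nation>US</nation></people>
  <people><name>Person C</name><birthday>1970-03-03</birthday>
    <job>president</job><nation>FR</nation></people>
</directory>"""


def presidents(root):
    """Return (name, birthday) pairs for every US president in the tree."""
    results = []
    for p in root.iter("people"):  # for $p in //people
        name = p.findtext("name")  # let $n := $p/name
        birthday = p.findtext("birthday")  # let $b := $p/birthday
        # where $p/job = "president" and $p/nation = "US"
        if p.findtext("job") == "president" and p.findtext("nation") == "US":
            results.append((name, birthday))  # return ($n, $b)
    return results


found = presidents(ET.fromstring(PEOPLE_XML))
print(found)
```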

XML-GL is a graphical notation for querying XML documents. Figure 2.2 shows an example that retrieves orders that ship books with the title “Introduction to XML” to Los Angeles.
Figure 2.2 An example of XML-GL

XML retrieval systems

Several XML retrieval systems have been proposed in recent years. We survey these systems in this section.

Lore was a pioneering research project on XML retrieval at Stanford University. In this project, an object-oriented database was used to store XML documents, and the XML query language “Lorel” was developed. In addition, a query interface called “DataGuider” was developed for querying XML documents. Figure 2.3 is a screen capture of the DataGuider system.
Figure 2.3 The query interface of DataGuider system

XYZfind is a commercial system that splits the querying process into four steps. The following figure shows the retrieval steps of the XYZfind retrieval system.

Step 1 : The user types in a query to start the category searching process.
Step 2 : The XYZfind system finds several related categories. The user has to click the target application.
Step 3 : The user uses the query interface to build a query.
Step 4 : The XYZfind system retrieves XML documents and shows them in the browser.

Figure 2.4 Retrieval steps of the XYZfind system

2.4 Using Ontology to Help the XML Retrieval Process


In order to reduce the semantic gap, we have to survey the technologies that are used to make
computers understand natural language text. The design of XML does not eliminate the use of
natural language text in the content of XML documents; natural language text is frequently
embedded in XML documents. The natural language understanding technologies that are used to
reduce the semantic gap are therefore still needed in the process of understanding XML documents.
In this section, we focus on how to use natural language understanding technologies based on
ontology representation to understand XML documents.

The natural language processing community has been trying to resolve the semantic gap
problem for a long time. Natural language understanding is a field that focuses on building
computer programs that “understand” natural language text [Grosz86] [Allen94]. However, the
word “understanding” is misleading here. Computers do not really understand
natural language text as humans do. Calculation and symbolic reasoning are what computers can do.
Computers “understand” natural language text by mapping the text into an internal representation.
The internal representation guides the computer to perform symbolic reasoning and to act as if it knew
the meaning of the natural language text.

Alan Turing designed the Turing Test [Turing50] to test whether a computer understands
natural language text. For information retrieval, we adopt a definition similar to the Turing
Test. If a computer program retrieves what we want, discards what we do not want, and
organizes the retrieval results in a form that we like to browse, then we say that the program
understands the documents and our queries. A computer that does what we would like it to do is a smart
computer. A retrieval system that retrieves only what we want and organizes the results in a form
that we like is a smart retrieval system.

A data structure called an ontology, which represents the concepts in the human mind, is used in the
process of understanding. Generally speaking, understanding is the process of mapping natural
language text into an ontology. After the mapping, the computer can act based on the mapping.
This is the style of computer “understanding”.

An ontology may be represented in different structures. The research topic that focuses on the
structure of ontology is called knowledge representation [Brachman85a]. Roughly speaking, there are
two approaches to representing knowledge and ontology: the logic-based approach and the object-based
approach. We introduce and compare these two approaches here, because they form the basis of the
slot-tree ontology that is discussed in chapter 3.

The logic-based approach encodes knowledge into logic statements for reasoning,
including propositional logic, first-order logic, probabilistic logic, etc. Prolog is the best
known programming language based on logic.

Logic-based approaches encode knowledge into logic statements. Based on these logic
statements, a reasoning process is used to infer true statements that were not stated explicitly
in the predefined logic statements.

First-order logic is a powerful theory for representing knowledge and reasoning toward conclusions.
It is a monotonic logic system whose logic expressions contain predicates and quantifiers.
In first-order logic, we use logical statements to represent the ontology. The
following example shows the logic statements that describe the inheritance relationship
between butterfly, insect and animal.

is(butterfly, insect)
is(insect, animal)
∀x, y, z : is(x,y) ∧ is(y,z) → is(x,z)

The power of first-order logic lies in its ability to perform monotonic reasoning. “Monotonic
reasoning” means that a conclusion, once made, will never be retracted in the future. The first-order
logic reasoning process therefore requires 100% certainty of facts, rules and conclusions. The following
example shows a reasoning process for the example above. The reasoning process infers that “butterfly is
a kind of animal”.
∀x, y, z : is(x,y) ∧ is(y,z) → is(x,z) (bind x to butterfly, y to insect, z to animal)
-----------------------------------------------------------------------------------------------
is(butterfly,insect) ∧ is(insect,animal) → is(butterfly,animal)
-----------------------------------------------------------------------------------------------
conclusion : is(butterfly, animal)
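
The inference above can be sketched as a small forward-chaining loop. This is a minimal illustration that assumes the "is" facts are stored as (x, y) pairs; the function name transitive_closure is hypothetical and does not refer to any particular logic programming system.

```python
def transitive_closure(facts):
    """Apply is(x,y) and is(y,z) -> is(x,z) until no new fact can be derived."""
    derived = set(facts)
    changed = True
    while changed:
        changed = False
        for (x, y) in list(derived):
            for (y2, z) in list(derived):
                if y == y2 and (x, z) not in derived:
                    derived.add((x, z))      # facts are only added, never removed
                    changed = True
    return derived


facts = {("butterfly", "insect"), ("insect", "animal")}
closure = transitive_closure(facts)
print(("butterfly", "animal") in closure)
```

Because the set of derived facts only grows, the sketch also shows the monotonic property: no conclusion is ever withdrawn.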

A difficulty is that many uncertain situations are encountered in the natural language
understanding process, so the 100% certainty required by first-order logic cannot always be assured.
Probabilistic logic and fuzzy logic were developed to handle uncertainty. However, the monotonic
property is lost in the uncertain reasoning process.

After reviewing the logic-based approach, we introduce the object-based approach. The object-based
approach contains a set of representation methods, including frames, semantic networks and scripts.
Generally speaking, frames are used to represent the internal structure of an object, semantic networks are
used to represent the relations between objects, and scripts are used to describe an active scenario
involving many objects.

The frame was proposed by Minsky in 1975 [Minsky75] in the seminal paper "A framework for
representing knowledge". A frame is a method of representation that organizes knowledge into
chunks. However, Minsky did not formalize the frame concept into a mathematical model;
he explicitly argued in favor of staying flexible and nonformal. Later, some AI
systems were built on the frame representation, such as the KL-ONE system
[Brachman85b] and the KRL language [Bobrow77].

Generally speaking, a frame is a structure that describes the internal structure of an object.
Frames are composed of slots (attributes) for which fillers (scalar values, references to
other frames, or procedures) have to be specified or computed. A slot can be expressed as a
tuple in the form of (object, slot, filler). It is easy to transform these tuples into logic
predicates in the form of slot(object, filler).

A frame that inherits from another frame is called a sub-frame. The inheritance property may
be expressed as the “is” relation between frames in the form of is(object, object). The inheritance
property organizes frames into a hierarchy. The frame concept, which organizes statements into
object-based structures, is easy for humans to read and write. It was later adopted by
object-oriented programming languages so that people could write programs easily. The following example
shows a frame for “kodairai”, which is a species of butterfly.

<object name="kodairai">
<is>butterfly</is>
<texture>eyespots</texture>
</object>
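
The relation between frames, tuples and predicates can be sketched as follows. This is only an illustration: the dictionary layout, the function names, and the "wings" slot on the butterfly frame are assumptions added for the example, and the lookup assumes that the "is" links contain no cycles.

```python
# Hypothetical frames; the "wings" slot is an illustrative assumption.
FRAMES = {
    "butterfly": {"is": "insect", "wings": "four"},
    "kodairai": {"is": "butterfly", "texture": "eyespots"},
}


def to_tuples(frames):
    """Flatten frames into (object, slot, filler) tuples."""
    return [(obj, slot, filler)
            for obj, slots in frames.items()
            for slot, filler in slots.items()]


def to_predicates(frames):
    """Rewrite each tuple as a predicate string slot(object, filler)."""
    return [f"{slot}({obj}, {filler})" for obj, slot, filler in to_tuples(frames)]


def lookup(frames, obj, slot):
    """Return the filler of a slot, following "is" links to parent frames."""
    while obj in frames:
        if slot in frames[obj]:
            return frames[obj][slot]
        obj = frames[obj].get("is")
    return None


print(to_predicates(FRAMES))
print(lookup(FRAMES, "kodairai", "wings"))
```

The lookup function shows how a sub-frame inherits a filler that is only stored in its parent frame.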

Semantic networks concentrate on categories of objects and the relations between them
[Quillian66] [Wood75]. Drawing graphs to represent the relationships between objects is the
basic idea of a semantic network. In these graphs, a link may be represented as a tuple in the
form of (object, relation, object). It is easy to transform these tuples into logic predicates in the
form of relation(object, object).

Scripts are used to describe a scenario involving many objects [Schank77]. Steps in the
scenario are described as lattices. A step may be triggered when its preceding steps are
finished. For example, the following script shows the process of making a cup of coffee.

1. Put an empty cup on the table. → put_on(cup, table)
2. Put coffee powder into the cup. → put_into(coffee powder, cup)
3. Fill hot water into the cup. → fill(hot water, cup)
4. Mix the powder and the water with a spoon. → mix_by_spoon(powder, water)
5. Process finished.

In fact, we may translate object-based representations into logic rules. The difference between
logic-based representation and object-based representation lies in the organization principle.
Logic-based representation encodes knowledge into logic expressions, and object-based
representation organizes these expressions into frames, semantic networks and scripts.

Reasoning is not a standardized part of object-based systems [Ifikes85]. The information stored in
frames has often been treated as the “database” of the knowledge system, whereas the control of
reasoning has been left to other parts of the system. The most popular and effective reasoning
mechanism for frames is production rules [Stefik83] [Kehler84]. Production rules are rules in the
form of pattern/action. They are a subset of predicate calculus with an added prescriptive component
indicating how the information in the rules is to be used during reasoning. Whenever a pattern is
matched, the production system triggers the corresponding frame, and the action is performed to do
something that helps the “understanding” process. After the pattern/action process, some values are
filled into frames as the conclusion. The reasoning process in an object-based system that maps natural
language text into the slot-tree ontology is what we call “the slot-filling process”.

Both logic-based representation and object-based representation may be used to represent the
ontology and to reason based on the ontology. Reasoning is helpful, but it is not a necessary part of
computer understanding of natural language. However, computers need a process that maps natural
language text into an ontology in order to understand it.

The mapping process for XML documents is easier than the mapping process for natural language
documents, because tags provide semantic contexts that make the mapping easier. In chapter
3, we propose a slot-filling algorithm that maps XML documents into the slot-tree ontology in order to
reduce the semantic gap between human and computer for XML.

2.5 Discussion
In this chapter, we reviewed the research background of XML, information retrieval and
ontology. However, current XML retrieval technology is not yet good enough and needs further
research. In fact, researchers in the information retrieval community have recently been working
hard to develop methods for XML retrieval.

In the ACM SIGIR 2000 workshop on XML and information retrieval, Carmel et al.
[Carmel00] discussed several unsolved problems for XML retrieval in the workshop
summary. We list these problems as follows.

1. Using an XML query language is likely to improve precision. However, XML query
languages are not easy for people to use. How can they be made easier to use?
2. A heterogeneous XML collection contains documents that come from different
sources, and the tag names and document structures may be different and idiosyncratic.
How can heterogeneous XML documents be retrieved?
3. XML is specified using Unicode. The tag names coming from different sources may be
given in different languages. Since a word can have more than one translation, or even no
translation, how to find or make the appropriate translation is an interesting issue for
multilingual information retrieval. How can multilingual XML documents be retrieved?
4. Browsing XML retrieval results should be better than browsing text documents. How should
the retrieval results be organized for browsing? Should the unit be the entire document, a part of
the XML tree, or perhaps a graph?

In this thesis, we try to resolve these problems by developing an XML retrieval system. The
system is mainly designed to reduce the semantic gap between human and computer. In this
system, we develop programs that allow the computer to understand XML documents easily and
allow humans to write queries and browse query results easily. These methods are based on an ontology
representation called slot-tree. We describe these methods in the next part. In chapter 3, we
show how to represent a slot-tree and how to map XML documents into the slot-tree. In chapter 4, we
show how to use the slot-tree ontology to help the XML retrieval process. In chapter 5, we
design a method to build the slot-tree automatically.
Part 2 : Slot-Tree Based Methods for XML Retrieval

3 Slot-Tree Ontology and Slot-Filling Algorithm


In part 1, we introduced our motivation, goals and research approaches in chapter 1, and
reviewed the related research on XML, information retrieval and ontology in chapter 2. In part
2, we show our method for reducing the semantic gap of XML retrieval. In order to reduce
the semantic gap, an ontology called slot-tree is used to help the XML retrieval process in our
system. In this part, we focus on the usage of the slot-tree ontology in our XML retrieval system.

Part 2 contains three chapters. In chapter 3, we will describe the syntax, semantics and
usage of slot-tree. In chapter 4, we will use the slot-tree to reduce the semantic gap in the XML
retrieval process. In chapter 5, we will show how to construct the slot-tree ontology, and design
a mining algorithm to build the slot-tree ontology automatically.

This chapter contains four sections. In section 3.1, we outline the structure of the slot-tree
ontology and its usage in the process of understanding XML documents. In section 3.2, we
describe the syntax and semantics of the slot-tree ontology. In section 3.3, we design the
slot-filling algorithm that maps XML documents into the slot-tree ontology, which is the core of the
understanding process. Finally, we discuss the slot-tree ontology and the slot-filling
algorithm in section 3.4.

3.1 Introduction
In this chapter, we design an object-based representation called the slot-tree ontology, and then use the
slot-tree to “understand” XML documents. As we said in section 2.4, the word “understand” used here
means the process of mapping the text in XML into the slot-tree. This enables a computer to trigger the
corresponding procedure to do what the user would like it to do, such as answering questions or retrieving
the documents that the user wants. We outline the slot-tree ontology and the slot-filling algorithm that maps
XML documents in this section, and describe the details of the slot-tree in section 3.2 and of the slot-filling
algorithm in section 3.3.

The slot-tree representation is an object-based approach that, like a frame, represents the internal
structure of objects. We surveyed the object-based approaches to knowledge representation, including frames,
semantic networks and scripts, in section 2.4. Generally speaking, a frame is used to represent the internal
structure of objects, a semantic network is used to represent relations between objects, and a script is used
to represent scenarios that involve many objects. The object-based approach is conceptually consistent
with our notion of the world, because in our perception the world is composed of many objects. The
difference between a slot-tree and a frame is that a slot in a slot-tree contains a set of paths that locate
nodes in XML documents. A path in a slot is in the XPath format that was described in section 2.1. For example,
“//butterfly//color” is used to locate “color” nodes in the block of “butterfly”.

In our XML retrieval system, a slot-tree is encoded in XML format like the following
example.

Example 3.1 A simple slot-tree in XML format

<s slot="butterfly" path="//butterfly">
<s slot="color" path="//butterfly//adult//color">
<v value="brown"/>
<v value="white"/>
</s>
</s>

Based on the slot-tree ontology, we design a slot-filling algorithm that maps
XML documents into the slot-tree ontology in the process of understanding. In the slot-filling
algorithm, a path in a slot is used like a hand to catch a block in XML, and a matching process
is used to map the content of the block into the slot. After the matching process, the words that
matched any value in a slot are filled into the slot. The filled slot-tree is then used as the
semantic structure of the XML document. We show the details of the
slot-tree ontology in section 3.2 and the details of the slot-filling algorithm in section 3.3.

3.2 Slot-Tree Ontology


In this section, we propose an ontology representation called slot-tree. A slot-tree is an object-based
representation that, like a frame, describes the internal structure of an object. We described the frame
representation in section 2.4. We describe the syntax, semantics and examples of slot-trees in this
section.

Definition 3.1 : A slot-tree is a tree (T) in which each node contains a tuple (s, Ps, Vs), where
s is the name of the slot, Ps is a set of paths, and Vs is a set of values. The name of a slot is a label that
uniquely represents the slot. A path (p) in Ps is a string in XPath format that is used to locate nodes in
XML documents. A value (v) in Vs is a term that contains a set of semantically identical words or
patterns.

Figure 3.1 shows the structure of a slot-tree. The {p} in each node represents a set of paths and the
{v} in each node represents a set of values. For a slot-tree that represents the internal structure of an
object, a slot in the tree may be used to represent a property of the object, such as the “color”, “shape”,
“texture”, “size”, etc. A value in the slot is a possible value of the property. For example, “black” is a
possible value in the “color” slot.

(Tree diagram: each node of the slot-tree is labeled with s {p} {v}.)

Figure 3.1. The structure of a slot-tree

A slot-tree can be encoded as an XML document in which each slot is encoded as a node with the tag “s”.
The attribute “slot” of the node is the label of the slot. The attribute “path” contains a set of paths in XPath
format that encodes the {p} part of each slot. A node with the tag “v” is a value that encodes the {v} part of
each slot. Example 3.2 shows a slot-tree for butterflies in XML format and figure 3.2 shows the graph
representation of the example.

Example 3.2. A slot-tree for butterflies in XML format

<s slot="butterfly" path="//butterfly">
<s slot="name" path="//butterfly//name"/>
<s slot="adult" path="//butterfly//adult">
<s slot="color" path="//butterfly//adult//color">
<v value="black"/>
<v value="brown"/>
<v value="black&amp;white"/>
</s>
<s slot="texture" path="//butterfly//adult//color">
<v value="lines"/>
<v value="spots"/>
</s>
</s>
</s>
Figure 3.2 : The graph representation of slot-tree

Formally, the syntax of a slot-tree is defined by the grammar in figure 3.3. A slot (S) contains a label
(NAME), a set of paths (P*) and a set of values (V*). The slot may also contain a set of sub-slots (S*). A
value (V) contains a label (NAME), a set of keys (KEY*) and a set of matching rules (R*).

S → <s slot="NAME" path="P*"> V* S* </s>

V → <v value="NAME" keys="KEY*" match="R*"/>

NAME → Alphabetical String

KEY → Alphabetical String

where P is a path in XPath format and R is a rule.


Figure 3.3 : The grammar of slot-tree

The symbol “P” used in figure 3.3 is a path in the format of the XML Path Language (XPath). XPath
is a specification proposed by the World Wide Web Consortium (W3C) that is used to locate nodes in XML
documents. The symbol “/” is used to match child nodes, and the symbol “//” is used to match any nodes
inside the current node, at any depth. A name with the prefix “@” denotes an attribute. Example 3.3 shows
several examples of XPath.

Example 3.3 : Examples of XML path language (XPath)


a. /butterfly/adult/color
b. //insect//color
c. //insect[@type='butterfly']//color

The path of example 3.3.a is used to locate “color” nodes that are children of an “adult” node, where
the “adult” node is a child of the root “butterfly” node. The path of example 3.3.b is used to locate any
“color” nodes that are in the block of an “insect” node. The path of example 3.3.c is used to locate any
“color” nodes that are in the block of an “insect” node whose attribute “type” has the value ‘butterfly’.
For more about XPath, please see the XPath specification at the following web page
- http://www.w3.org/TR/xpath.
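
The three kinds of path can be tried out with the limited XPath subset of Python's standard xml.etree.ElementTree module. This is only a sketch: the sample collection is hypothetical, and because ElementTree evaluates paths relative to an element, the paths are written with a leading "." and the first path starts from the children of a wrapping collection node rather than from the document root.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample collection used only for this illustration.
COLLECTION = ET.fromstring("""
<collection>
  <butterfly><adult><color>brown</color></adult></butterfly>
  <insect type="butterfly"><adult><color>white</color></adult></insect>
  <insect type="moth"><adult><color>grey</color></adult></insect>
</collection>
""")


def texts(path):
    """Return the text of every node located by the path."""
    return [node.text for node in COLLECTION.findall(path)]


print(texts("./butterfly/adult/color"))                 # child steps, as in 3.3.a
print(texts(".//insect//color"))                        # descendant steps, as in 3.3.b
print(texts(".//insect[@type='butterfly']//color"))     # attribute test, as in 3.3.c
```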

A rule in the slot-tree is used to match a string in XML. The syntax of a rule (R) is further
defined by the grammar in figure 3.4. A rule may contain the “&” operator, the “|” operator and the “-”
operator. The symbol “E” is an expression that is part of a rule. Each expression contains only a
literal “L” or a pattern in the form of “L..L”.

R → (R & R)
R → (R | R)
R → E
R → -E
E → L {..L}
Figure 3.4 : The grammar of rules in slot tree

The “&” operator is a logical “and”: a rule “R1 & R2” is satisfied if and only if both
R1 and R2 are satisfied. The “|” operator is a logical “or”: a rule “R1 | R2” is satisfied if
and only if R1 or R2 is satisfied. The “-” operator is a logical “not”: a rule “-E” is satisfied if
and only if E is not matched. The “..” symbol in the syntax of “E” means a far connection: a rule
“L1 .. L2” is satisfied if an L1 string is followed by an L2 string within one sentence. The following
example shows several rules.

Example 3.4 : Matching rules in slot-tree


a. R = “white & black”
b. R = “lines & -spots”
c. R = “black .. head”

The rule of example 3.4.a matches a sentence such as “a butterfly that is mixed of black and white
color” or “a butterfly with white wing and black head”. The rule of example 3.4.b matches a
sentence such as “a butterfly with brown lines on wings”, but does not match the sentence “a butterfly
with brown lines and white spots on wings”. The rule of example 3.4.c matches a sentence such
as “a butterfly with black color on head”, but does not match the sentence “a butterfly with green
head and black wings”.
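
The behavior of the three rules in example 3.4 can be sketched with a small evaluator. The sketch makes several assumptions that are not stated in the grammar: rules are written directly as nested tuples instead of being parsed from strings, the "-" operator is treated as negation (as example 3.4.b suggests), and literals are matched as case-insensitive substrings of one sentence.

```python
def far_connect(sentence, literals):
    """True if the literals occur from left to right in the sentence (the ".." pattern)."""
    text = sentence.lower()
    position = 0
    for literal in literals:
        found = text.find(literal.lower(), position)
        if found < 0:
            return False
        position = found + len(literal)
    return True


def satisfied(rule, sentence):
    """Evaluate a rule tree against one sentence."""
    operator = rule[0]
    if operator == "and":
        return satisfied(rule[1], sentence) and satisfied(rule[2], sentence)
    if operator == "or":
        return satisfied(rule[1], sentence) or satisfied(rule[2], sentence)
    if operator == "not":
        return not satisfied(rule[1], sentence)
    if operator == "pattern":                 # ("pattern", "L1", "L2", ...)
        return far_connect(sentence, rule[1:])
    raise ValueError("unknown operator: " + str(operator))


# The three rules of example 3.4, written as hypothetical rule trees.
RULE_A = ("and", ("pattern", "white"), ("pattern", "black"))
RULE_B = ("and", ("pattern", "lines"), ("not", ("pattern", "spots")))
RULE_C = ("pattern", "black", "head")

print(satisfied(RULE_A, "a butterfly with white wing and black head"))
print(satisfied(RULE_B, "a butterfly with brown lines and white spots on wings"))
print(satisfied(RULE_C, "a butterfly with green head and black wings"))
```

A single literal is handled as a pattern with one element, so the same function covers both forms of the expression E.
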
3.3 Slot-Filling Algorithm
A slot in a slot-tree is a container that may contain several fillers. A filler can be a value or a
sub-slot. A slot-filling algorithm is a method for mapping fillers into slots. In this section, we
describe how to map an XML document into the slot-tree ontology.

Example 3.5 : An XML document for a butterfly

<butterfly about="Athyma_fortuna_kodairai.jpg">
<adult>
<texture>There are some eye spots in each wing</texture>
<color>Brown background color, Eye spots in white color</color>
<size>Middle size, 50-60mm</size>
</adult>
<geography>
<taiwan>North-Taiwan, 1000-2000meters mountain area </taiwan>
<global>Central China Area</global>
</geography>
</butterfly>

Example 3.6. A slot-tree for butterflies

<s slot="butterfly" path="//butterfly">
<s slot="name" path="//butterfly//name" type="copy"/>
<s slot="adult" path="//butterfly//adult">
<s slot="color" path="//butterfly//adult//color">
<v value="black"/>
<v value="brown"/>
<v value="black&amp;white"/>
</s>
<s slot="texture" path="//butterfly//adult//color">
<v value="lines"/>
<v value="spots"/>
</s>
</s>
</s>

One simple way to fill values into the corresponding slot is by copying. A copy-slot is a slot that has the
attribute (type=“copy”). The copy-slot is used to extract a value from a specified field. In the
slot-filling process of example 3.5, the value “Athyma_fortuna_kodairai” is filled into the “name” slot of
example 3.6 just by copying.

Another way to fill values into slots is by keyword matching. A value is filled into a slot if the
value matches a sentence in the target XML document. The following example shows the process of
matching the “spotted” value in the “texture” slot to the “color” nodes in the XML document.

Example 3.7 An example of filling a value into slot by keyword matching


Texture block :
<color> Brown background color, Eye spots in white color </color>

Texture Slot :
<s slot="texture" path="//butterfly//texture">
<v value="single color" keys="single, mono, uniform"/>
<v value="spotted" keys="spot"/>
<v value="lines" keys="line"/>
</s>

→ Matching result : <s slot="texture" values="spotted"/>

A slot-filling algorithm is designed to fill values into the slots of a slot-tree. In order to
“understand” an XML document, we use the slot-filling algorithm to fill the XML document
into the slot-tree. The output of our slot-filling algorithm is a filled slot-tree, in which each node
of the tree is filled with values. For a given XML document d, ds is the part of the document that is
covered by slot s. The output of the slot-filling algorithm is a set of slot-value (s,v) pairs.

Slot-Filling(d, T) = { (s, v) | s ∈ T, v ∈ Vs, w(v, ds) > ε }

The following figure shows the pseudo code of slot-filling algorithm.

Algorithm Slot-Filling(d, T)
    SV = {}
    for each s in T
        ds = { c | (s, p) ∈ M(T), (p, c) ∈ d }
        for each v in Vs
            if w(v, ds) > ε then put (s, v : w(v, ds)) into SV
        end for
    end for
    return SV

Figure 3.5 : The pseudo code of slot-filling algorithm


The time complexity of the slot-filling algorithm is O(∑s |ds| · |Vs|), where |ds| is the size of ds, and |Vs|
is the number of values in slot s.
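
The pseudo code can be turned into a small runnable sketch with Python's standard xml.etree.ElementTree module. Several points are assumptions made only for this illustration: the weight w(v, ds) is taken to be the number of occurrences of the keys of v in ds, the threshold ε defaults to 0, each slot carries a single path, and the texture slot points at the texture node. The document is wrapped in an extra node because ElementTree evaluates paths relative to an element.

```python
import xml.etree.ElementTree as ET

DOCUMENT = """
<butterfly about="Athyma_fortuna_kodairai.jpg">
  <adult>
    <texture>There are some eye spots in each wing</texture>
    <color>Brown background color, Eye spots in white color</color>
  </adult>
</butterfly>
"""

SLOT_TREE = """
<s slot="butterfly" path="//butterfly">
  <s slot="color" path="//butterfly//adult//color">
    <v value="black"/>
    <v value="brown"/>
  </s>
  <s slot="texture" path="//butterfly//adult//texture">
    <v value="spotted" keys="spot"/>
    <v value="lines" keys="line"/>
  </s>
</s>
"""


def covered_texts(document, path):
    """ds: the text of every node that the slot path locates in the document."""
    wrapper = ET.Element("wrapper")        # lets "//x" also match the root element
    wrapper.append(document)
    return ["".join(node.itertext()) for node in wrapper.findall("." + path)]


def weight(keys, texts):
    """Assumed w(v, ds): number of times any key of v occurs in ds."""
    return sum(text.lower().count(key.lower()) for text in texts for key in keys)


def slot_filling(document, slot_tree, epsilon=0):
    filled = []                                          # SV = {}
    for slot in slot_tree.iter("s"):                     # for each s in T
        texts = covered_texts(document, slot.get("path"))
        for value in slot.findall("v"):                  # for each v in Vs
            name = value.get("value")
            keys = [k.strip() for k in value.get("keys", name).split(",")]
            w = weight(keys, texts)
            if w > epsilon:
                filled.append((slot.get("slot"), name, w))
    return filled


print(slot_filling(ET.fromstring(DOCUMENT), ET.fromstring(SLOT_TREE)))
```

When a value has no "keys" attribute, the sketch falls back to the value name itself as the only key, which is again an assumption rather than a rule stated in the grammar.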

3.4 Discussion
In this chapter, we described the slot-tree ontology in section 3.2 and the slot-filling
algorithm in section 3.3. The slot-filling algorithm is used to map XML documents into the
slot-tree ontology in the understanding process. In chapter 4, we use the slot-tree and the
slot-filling algorithm to develop an ontology-based XML retrieval method, and use the method to
reduce the semantic gap between human and computer.
4 An Ontology-Based Approach for XML Querying, Retrieval and Browsing
In the previous chapter, we showed the slot-tree ontology and its usage. A mapping
between the slot-tree and XML documents is built by the slot-filling algorithm. The
mapping process helps our XML retrieval system to reduce the semantic gap between human
and computer. In this chapter, we outline the relationship between our XML retrieval
system and the slot-tree ontology, and show the power of the slot-tree.

In section 4.1, we describe the process of our XML retrieval system and outline the
important components of the system. We describe how to represent an XML document
for retrieval in section 4.2, and describe the index structure in section 4.3. After that, the query
interface is described in section 4.4 and the ranking strategies are described in section 4.5. We then
show how to organize retrieval results for browsing in section 4.6. Finally, we
discuss our XML retrieval system in section 4.7.

4.1 Introduction
Two technologies are needed in the process of searching for documents: retrieving and browsing.
Retrieving is the process of selecting documents from a collection. After that, the retrieved documents
should be organized for browsing. Browsing is the process of reading and traversing the collection of
documents. We usually use retrieving and browsing techniques alternately in a searching process. A
model that integrates retrieving and browsing may be used to improve the quality of searching.

Our research focuses on using ontology to improve the XML retrieval and browsing process. We
will focus on the following questions in this chapter.

1. How to encode XML documents for retrieval?
2. How to use slot-tree ontology to improve the efficiency of querying?
3. How to use slot-tree ontology to improve the efficiency of retrieval?
4. How to use slot-tree ontology to improve the efficiency of browsing?

Figure 4.1 shows a scenario of our approach to retrieving XML documents. First, a user builds a
query by clicking or typing on slots in the query interface, and then submits the query to the XML retrieval
system. The retrieval system retrieves XML documents, and then summarizes them for the user to browse.
Figure 4.1 : A scenario of our XML retrieval system

The ontology in figure 4.1 is the slot-tree ontology that was described in chapter 3. It is the core of our XML
retrieval system. The slot-tree ontology is used to build the query interface, to retrieve documents and
to summarize the retrieved documents for browsing. The XML queries, the XML documents and the query
interface are the important objects in our system. Retrieval and extraction are the important processes in
our system. We introduce these objects and processes in this chapter.

4.2 XML Documents


An XML document is encoded as tree-structured text. Figure 4.2 shows an XML document that
describes a butterfly.

<butterfly about="Athyma_fortuna_kodairai.jpg">
<adult>
<texture>There are some eye spots in each wing</texture>
<color>Brown background color, Eye spots in white color</color>
<size>Middle size, 50-60mm</size>
</adult>
<geography>
<taiwan>North-Taiwan, 1000-2000meters mountain area </taiwan>
<global>Central China Area</global>
</geography>
</butterfly>
Figure 4.2 : An XML document of butterfly
For conceptual simplicity, the XML example above is expressed as a sequence of (path,
value) pairs that describe the object.

(butterfly, )

(butterfly@about, Athyma_fortuna_kodairai.jpg)
(butterfly\adult, )
(butterfly\adult\texture, There are some eye spots in each wing)
(butterfly\adult\color, Brown background color, Eye spots in white color)
(butterfly\adult\size, Middle size, 50-60mm)
(butterfly\adult, )
(butterfly\geography, )
(butterfly\geography\taiwan, North-Taiwan, 1000-2000meters mountain area)
(butterfly\geography\global, Central China Area)
(butterfly\geography, )
(butterfly, )

Figure 4.3 : The (path, value) expression of an XML document

The (path, value) expression can be thought of as an object concept model. A “path” specifies a
property of an object, and a “value” specifies a value for the property. The object concept model above is a
binary relation that may be expressed as path(object, value); that is, a path represents a logical predicate
with two arguments. An object in this model is expressed as a set of (path, value) pairs.

Storing Structure
The (path, value) representation does not reflect the tree structure of an XML document. In order to
represent the tree structure, we use a pair of indexes to represent the beginning and the end of each block. In
other words, we extend each (path, value) pair with a (begin, end) pair that records the begin node and the end
node of each block. The butterfly example above is expressed as the following structure.

1, 12 (butterfly, )
2, 2 (butterfly@about, Athyma_fortuna_kodairai.jpg)
3, 7 (butterfly\adult, )
4, 4 (butterfly\adult\texture, There are some eye spots in each wing)
5, 5 (butterfly\adult\color, Brown background color, Eye spots in white color)
6, 6 (butterfly\adult\size, Middle size, 50-60mm)
7, 7 (butterfly\adult, )
8, 11 (butterfly\geography, )
9, 9 (butterfly\geography\taiwan, North-Taiwan, 1000-2000meters mountain area)
10, 10 (butterfly\geography\global, Central China Area)
11, 11 (butterfly\geography, )
12, 12 (butterfly, )

Figure 4.4 : The storing structure of an XML document

In the example above, each node is led by a (begin, end) pair. The begin index of a node is
always identical to the ID of the node. A block with (begin, end) covers all nodes
between the begin node and the end node. For example, the first block “1,12 (butterfly,)” covers nodes
1 to 12, and the third block “3,7 (butterfly\adult)” covers nodes 3 to 7. In this way, the tree
structure of XML is expressed as the cover/covered relations between nodes.

The begin-end pair structure fully reflects the hierarchical structure of XML documents. In
our XML storage system, we store the (begin, end) pairs in a table instead of storing a tree.
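
The numbering scheme can be sketched in Python as follows. The function names store and covers are hypothetical, and the treatment of cases that the butterfly example does not show (for instance a leaf element that carries attributes) is an assumption of this sketch rather than a description of our storage system.

```python
import xml.etree.ElementTree as ET

DOCUMENT = """
<butterfly about="Athyma_fortuna_kodairai.jpg">
  <adult>
    <texture>There are some eye spots in each wing</texture>
    <color>Brown background color, Eye spots in white color</color>
    <size>Middle size, 50-60mm</size>
  </adult>
  <geography>
    <taiwan>North-Taiwan, 1000-2000meters mountain area </taiwan>
    <global>Central China Area</global>
  </geography>
</butterfly>
"""


def store(element, prefix="", rows=None):
    """Append [begin, end, path, value] rows for the element and its subtree."""
    if rows is None:
        rows = []
    path = prefix + "\\" + element.tag if prefix else element.tag
    begin = len(rows) + 1
    row = [begin, begin, path, (element.text or "").strip()]
    rows.append(row)
    for name, value in element.attrib.items():           # attribute nodes
        number = len(rows) + 1
        rows.append([number, number, path + "@" + name, value])
    children = list(element)
    for child in children:
        store(child, path, rows)
    if children:                                          # closing node of a block
        rows.append([len(rows) + 1, len(rows) + 1, path, ""])
    row[1] = len(rows)            # the block ends at the last node it covers
    return rows


def covers(outer, inner):
    """True if the block of row `outer` covers the block of row `inner`."""
    return outer[0] <= inner[0] and inner[1] <= outer[1]


rows = store(ET.fromstring(DOCUMENT))
for begin, end, path, value in rows:
    print(f"{begin}, {end} ({path}, {value})")
```

With this table, an ancestor test between two nodes reduces to two integer comparisons, which is why the tree itself does not need to be stored.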

4.3 Indexing structure


Based on the PVSM, we index (p,t) pairs instead of (t) for an XML retrieval system. There are several
data-structures for full-text indexing, such as inverted-file, signature-file and Patricia-trie. We use
inverted-file as the index structure of our XML retrieval system for simplicity.

The following example is a simple XML document. We will show how to index the following
XML document, for both text field and number field.

Example 4.4 An XML document for butterfly

<butterfly about="kodairai">
<adult>
<color>brown</color>
<texture>spot</texture>
<size>50-60mm</size>
</adult>
</butterfly>

Indexing Text: The following figure shows our inverted-file structure for the example above. The
inverted file is currently stored in a relational database.

#path, #term                          #object list
…                                     …
#\butterfly\adult\color, #brown       …, #kodairai, …
…                                     …
#\butterfly\adult\texture, #spot      …, #kodairai, …
…                                     …

Figure 4.5 An example of text index in inverted-file format
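
An in-memory version of such an inverted file can be sketched as follows. The dictionary-based index, the function name add_to_index and the simple tokenizer that keeps only runs of letters are assumptions of this sketch; our system stores the inverted file in a relational database instead.

```python
import re
import xml.etree.ElementTree as ET
from collections import defaultdict

DOCUMENT = """
<butterfly about="kodairai">
  <adult>
    <color>brown</color>
    <texture>spot</texture>
    <size>50-60mm</size>
  </adult>
</butterfly>
"""


def add_to_index(index, object_id, element, prefix=""):
    """Post object_id under every (path, term) pair found below the element."""
    path = prefix + "\\" + element.tag
    for term in re.findall(r"[a-z]+", (element.text or "").lower()):
        if object_id not in index[(path, term)]:
            index[(path, term)].append(object_id)
    for child in element:
        add_to_index(index, object_id, child, path)


index = defaultdict(list)                  # (path, term) -> object list
root = ET.fromstring(DOCUMENT)
add_to_index(index, root.get("about"), root)
print(index[("\\butterfly\\adult\\color", "brown")])
```

Because the key is the (path, term) pair, the term "brown" in a color node and the same term in another node are posted under different entries.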

Indexing Number : Traditional full-text indexing technology does not index numbers. In our system,
number indexing is important for the browsing process, because we may sort the search results in a
specified order based on the number index. In the indexing process, we extract numbers from XML
documents and put them into a number table as follows.

#object, #path                        Number
…                                     …
#kodairai, #\butterfly\adult\size     50
#kodairai, #\butterfly\adult\size     60
…                                     …

Figure 4.6 An example of number index
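
The number extraction step can be sketched as below. This is a hedged illustration in Python: the regular expression and the function name `index_numbers` are assumptions, and unit handling (for example "mm") is ignored.

```python
import re

# Assumption: a number is a run of digits with an optional decimal part.
NUMBER = re.compile(r"\d+(?:\.\d+)?")


def index_numbers(obj, path, text):
    """Return one (object, path, number) row per number found in the text."""
    return [(obj, path, float(m)) for m in NUMBER.findall(text)]


rows = index_numbers("kodairai", "\\butterfly\\adult\\size", "50-60mm")
```

For the size field "50-60mm", the sketch yields two rows with the numbers 50 and 60, matching the number table above. The hyphen is treated as a range separator, not as a minus sign.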

9.4 Query Language and Query Interface


XML may be used to encode metadata instead of data. Metadata is data that describes other
data. We may use metadata to describe objects such as audio, video, people, etc. Based on metadata, we
may index images, video and audio in text format, so that we may query objects by number and text fields
in our XML retrieval system.

In our system, we designed a program to transform a slot-tree into an HTML-based query interface. A
template in Extensible Stylesheet Language Transformations (XSLT) is used to do the transformation.

In our query interface, a value can be expressed as a string, a range of numbers, or an
object. A user may specify the value for a slot simply by clicking a value or an icon in the slot. Our
retrieval system is used to retrieve not only text-based documents, but also
images and video. The following figure shows a query interface for butterflies.
Figure 4.7 : The Query Interface for Butterflies

A user may select a slot with one click, and then select a value in the slot or type keywords into the
slot. The user may also specify a field for sorting. A query is built and submitted to the XML retrieval
system when the user presses the submit button.

A query in our system is a filled slot-tree. The following example shows a query “find all
butterflies with broken wing and brown color”.

<s slot="butterfly" path="//butterfly">
  <s slot="color" path="//butterfly//adult//color" keys="brown"/>
  <s slot="shape" path="//butterfly//adult//shape" keys="broken"/>
</s>
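
A query of this form can be assembled from the slots that the user has filled in. The following Python sketch is illustrative only; the function `build_query` and the dictionary layout are assumptions and do not describe the actual interface code.

```python
import xml.etree.ElementTree as ET


def build_query(root_slot, root_path, filled):
    """Build a filled slot-tree; `filled` maps a slot name to (path, keys)."""
    query = ET.Element("s", {"slot": root_slot, "path": root_path})
    for slot, (path, keys) in filled.items():
        ET.SubElement(query, "s", {"slot": slot, "path": path, "keys": keys})
    return query


query = build_query("butterfly", "//butterfly", {
    "color": ("//butterfly//adult//color", "brown"),
    "shape": ("//butterfly//adult//shape", "broken"),
})
xml_text = ET.tostring(query, encoding="unicode")
```

The resulting `xml_text` is a serialized slot-tree with one child element per filled slot, equivalent to the query for brown butterflies with broken wings shown above.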

9.5 Ranking Strategy


The ranking strategy for XML retrieval is closer to that of a database than to that of text retrieval. We may rank the
retrieval result by any field in the XML documents. For example, we may sort the retrieval result by the
size of butterflies. We may also sort the retrieval result by the similarity between document and query
or by the importance of documents. In this section, we show the ranking strategies that are used to sort
the retrieval results.

Ranking by Field
To organize the retrieval result for browsing, a user may specify the ranking
strategy. Any field may be used to sort the result, just like in a database. A
field can be sorted as numbers by magnitude or as strings in alphabetical order, in either
increasing or decreasing order. The variety of ranking strategies gives users a way to
organize the retrieval result into a list for browsing.
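
Field-based ranking amounts to an ordinary sort with a configurable key. The sketch below is a minimal Python illustration; the function `rank_by_field`, the record layout, and the sizes of the two unnamed species are hypothetical.

```python
def rank_by_field(results, field, numeric=False, descending=False):
    """Sort retrieved objects by one field, as numbers or as strings."""
    if numeric:
        key = lambda r: float(r[field])
    else:
        key = lambda r: str(r[field])
    return sorted(results, key=key, reverse=descending)


# Illustrative records; only the 50 mm size of "kodairai" comes from the text.
results = [
    {"name": "kodairai", "size": 50},
    {"name": "species_b", "size": 9},
    {"name": "species_c", "size": 120},
]

largest_first = rank_by_field(results, "size", numeric=True, descending=True)
by_name = rank_by_field(results, "name")
```

Sorting the size field as numbers rather than strings matters here: as strings, "120" would sort before "50" and "9".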

Ranking by Importance
In section 2.2, we introduced how to measure the importance of a web page based on hyperlinks.
Hyperlinks in XML may be used to decide the importance of an XML document, too. In our XML retrieval
system, ranking by importance is the default ranking strategy. A simple way to measure the
importance of an XML document is to count the references to it. We use this strategy
in our system for simplicity. In the future, we will try to accommodate the random-walk model and the
hub-authority model to measure the importance of XML documents in our XML retrieval system.
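
Reference counting can be sketched in a few lines of Python. The function name `rank_by_importance`, the document IDs and the link list are hypothetical; ties are broken alphabetically, which is an assumption made only to keep the output deterministic.

```python
from collections import Counter


def rank_by_importance(docs, links):
    """Rank documents by the number of references that point to them.

    `links` is an iterable of (source, target) document IDs.
    """
    counts = Counter(target for _, target in links)
    # Documents that are never referenced get a count of zero.
    return sorted(docs, key=lambda d: (-counts[d], d))


docs = ["a", "b", "c"]
links = [("a", "b"), ("c", "b"), ("b", "a")]
ranking = rank_by_importance(docs, links)
```

In this example, document "b" is referenced twice, "a" once and "c" never, so the ranking is b, a, c.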

Ranking by Similarity
For text retrieval, a ranking strategy based on the vector space model (VSM) and the TFIDF weighting
function performs well. A brief survey of VSM and TFIDF was given in section 2.3. However, an
XML object is not only a sequence of words like a text, but also contains many tags. For XML, we
extend VSM by attaching a path to each term; the result is called the Path Vector Space Model (PVSM). An XML
document (d) can be expressed as the following vector v(d).

$v(d) = (d_{p_1,t_1}, \ldots, d_{p_1,t_k}, \ldots, d_{p_n,t_1}, \ldots, d_{p_n,t_k})$, where $d_{p_i,t_j}$ is the weight of the $(p_i, t_j)$ pair in document object d.

When several paths have similar meanings, we may cluster them into a slot for retrieval. The model after
path clustering is called the Slot Vector Space Model (SVSM).

$v(d) = (d_{s_1,t_1}, \ldots, d_{s_1,t_k}, \ldots, d_{s_n,t_1}, \ldots, d_{s_n,t_k})$, where $d_{s_i,t_j}$ is the weight of the $(s_i, t_j)$ pair in document object d.

We may use the cosine coefficient to measure the similarity between queries and documents in SVSM,
just as in VSM.

$\mathrm{Similarity}(d, q) = \dfrac{d \cdot q}{|d| \, |q|}$

However, we do not know what kind of weighting function is good for measuring the value $d_{s_i,t_j}$. Is TFIDF
good enough in the SVSM, or do we need another measure? In our system, we express $d_{s_i,t_j}$ as the
product of $w_{s_i,t_j}$ and $tf_{s_i,t_j}$, where $tf_{s_i,t_j}$ is the term frequency of the term $t_j$ in slot $s_i$, and $w_{s_i,t_j}$ is the
weighting coefficient.
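
The similarity computation can be sketched with sparse vectors keyed by (slot, term) pairs. This Python fragment is an illustration, not the retrieval code of our system; the function names and the default weighting coefficient of 1.0 are assumptions.

```python
import math


def weighted(tf, w):
    """Weight of each (slot, term) pair: w[slot, term] * tf[slot, term].

    Assumption: pairs without an explicit coefficient get the weight 1.0.
    """
    return {pair: w.get(pair, 1.0) * freq for pair, freq in tf.items()}


def cosine(d, q):
    """Cosine coefficient between two sparse slot-vectors."""
    dot = sum(value * q.get(pair, 0.0) for pair, value in d.items())
    norm_d = math.sqrt(sum(v * v for v in d.values()))
    norm_q = math.sqrt(sum(v * v for v in q.values()))
    if norm_d == 0.0 or norm_q == 0.0:
        return 0.0
    return dot / (norm_d * norm_q)


doc_vec = weighted({("color", "brown"): 2, ("texture", "spot"): 1}, {})
query_vec = {("color", "brown"): 1.0}
score = cosine(doc_vec, query_vec)
```

Because the vectors are keyed by (slot, term), the term "brown" in the color slot does not match "brown" in any other slot, which is the difference between SVSM and plain VSM.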

A difficulty for retrieval systems today is that too many documents are retrieved. When there are too many
retrieval results to browse, the ranking strategy is used to present first what users want. A user
may like to see large butterflies, important butterflies or butterflies that are similar to a query. The
variety of ranking strategies in XML retrieval lets users retrieve only what they would like to browse.

9.6 Browsing XML documents


For an information retrieval system, the retrieved documents should be summarized and
organized into a readable format for people to browse. In our XML retrieval system, the slot-filling
algorithm is used to map the retrieved documents into filled slot-trees for browsing. A filled
slot-tree is a well-organized summary of a document that is easy to browse. In this
section, we show an example of the slot-filling algorithm that fills an XML document into a
slot-tree. Before that, we have to show the XML document and the slot-tree used in the algorithm.
The following example shows a simple slot-tree for butterflies.

<s slot="butterfly" path="//butterfly">
  <s slot="name" path="//butterfly//name"/>
  <s slot="adult" path="//butterfly//adult">
    <s slot="color" path="//butterfly//adult//color">
      <v value="black"/>
      <v value="brown"/>
      <v value="black&white"/></s>
    <s slot="texture" path="//butterfly//adult//texture">
      <v value="lines"/>
      <v value="spots"/></s>
  </s>
</s>

We may use the slot-filling algorithm to extract values from the following XML document.

<butterfly about="Athyma_fortuna_kodairai.jpg">
  <adult>
    <texture>There are some eye spots in each wing</texture>
    <color>Brown background color, Eye spots in white color</color>
  </adult>
  <geography>
    <taiwan>North-Taiwan, 1000-2000meters mountain area </taiwan>
    <global>Central China Area</global>
  </geography>
</butterfly>

The slot-filling algorithm fills values into the slot-tree. The following example shows the result of
filling.

<s slot="butterfly" values="Athyma_fortuna_kodairai">
  <s slot="adult">
    <s slot="texture" values="spot"/>
    <s slot="color" values="brown"/></s>
  <s slot="geography">
    <s slot="Taiwan" values="North"/>
    <s slot="Global" values="China"/></s>
</s>

The result of the slot-filling algorithm is a filled slot-tree. For humans, it is easier to browse filled slot-trees
than to browse the source documents. The filled slot-tree is a well-organized summary of the XML
document.
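
The filling step can be approximated by keyword matching. The following Python sketch is a simplified illustration and not the slot-filling algorithm itself: the `SLOTS` table, its candidate values and keywords, and the case-insensitive substring test are all assumptions made for the example.

```python
import xml.etree.ElementTree as ET

# Hypothetical slot table: slot name -> (path below the root, {value: keywords}).
SLOTS = {
    "texture": ("adult/texture", {"spot": ["spot"], "line": ["line"]}),
    "color": ("adult/color", {"black": ["black"], "brown": ["brown"]}),
}


def fill_slots(xml_text, slots):
    """Fill each slot with the values whose keywords occur in its text."""
    root = ET.fromstring(xml_text)
    filled = {}
    for slot, (path, values) in slots.items():
        text = " ".join((e.text or "") for e in root.findall(path)).lower()
        filled[slot] = [value for value, keys in values.items()
                        if any(key in text for key in keys)]
    return filled


doc = """<butterfly about="Athyma_fortuna_kodairai.jpg">
  <adult>
    <texture>There are some eye spots in each wing</texture>
    <color>Brown background color, Eye spots in white color</color>
  </adult>
</butterfly>"""

filled = fill_slots(doc, SLOTS)
```

For this document, the texture slot is filled with "spot" and the color slot with "brown", as in the filled slot-tree above. Substring matching is crude (it would also match "spot" inside a longer word), so a real system would need more careful term matching.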

9.7 Discussion
In this chapter, we designed an XML retrieval system to reduce the semantic gap between human
and computer. The slot-tree ontology and the slot-filling algorithm are used in our XML
retrieval system to understand XML documents. Based on the slot-tree, we designed a query
interface to reduce the semantic gap on the query side. The interface helps people write XML
queries easily. Based on the slot-filling algorithm, we designed the slot vector space model
(SVSM) to retrieve XML documents. The SVSM helps the computer understand XML
documents. In addition, the slot-filling algorithm helps the computer extract summaries from
XML documents for browsing. Our goal of reducing the semantic gap between human and
computer is largely achieved by using the slot-tree as a core representation.

We will study two cases of our XML retrieval system in chapter 6 and chapter 7. In
chapter 6, we use the domain of butterflies as an example, and in chapter 7, the domain of
proteins. For each domain, we will show the slot-tree, query interface, retrieved results and
summary.
10 The Construction of Slot-Tree Ontology
We introduced the slot-tree ontology in chapter 3, and then showed an XML retrieval
system based on the slot-tree ontology in chapter 4. However, building a slot-tree ontology is not
an easy job. In order to reduce the effort of building the slot-tree ontology, we have developed the
slot-mining algorithm, a statistical approach that learns the slot-tree from a collection of XML
documents.

An overview of mining approaches is given in section 5.1. Section 5.2 provides
background on text-mining technology. Section 5.3 shows how to construct a slot-tree for a
given XML collection. Section 5.4 describes a method to mine a slot-tree from XML documents,
called the slot-mining algorithm. Finally, we discuss the building of slot-trees in
section 5.5.

10.1 Introduction
The goal of text mining is to find important patterns in a text collection and organize these patterns
into an ontology. In this thesis, we use the ontology to help XML retrieval and browsing. Mining
technology may be used to help us in the construction of the slot-tree ontology. In this section, we
focus on the text-mining problem for XML.

The slot-tree is an ontology representation method. Our mining approach is to build an XML-mining
program that induces values for each slot. In this section, we assume for simplicity that each value is represented by a
term (or a word). Based on this assumption, we developed a statistical program to mine
values for each slot.

The semi-structured property of XML is what makes the mining program work. For a given XML
collection, the distribution of a term depends strongly on the tags. For example, the following terms
show up more frequently in the <color> block than in other blocks.

<color> “black”, “white”, “yellow”, “blue”, “green” </color>

The problem of mining the important values for each slot is called the Slot-Mining Problem. We
propose a mining algorithm based on a simple observation: the distribution of terms depends on
the tag. A term that shows up more frequently within a tag than elsewhere is likely to be a key value for the corresponding slot.
10.2 Background
The goal of text mining is to discover regularities in text data. A text-mining program induces rules
from text or learns a grammar from a corpus; these rules are used in natural language
understanding and information extraction.

For natural language processing, the inside-outside algorithm is a popular tool for learning a probabilistic
context-free grammar (PCFG) from a tree-bank corpus. However, a tree-bank corpus is not easy to build;
building a tree-bank by hand is a time-consuming job. Other text-learning methods have been
developed to learn from a plain text corpus. For example, link grammar is a simple head-driven grammar that
was developed to parse natural language sentences, and a learning algorithm has been developed to learn the
link grammar from a text corpus. In addition, transducer learning induces finite-state
automata from a given text corpus. Learning a transducer is easier than learning a context-free grammar.

For information extraction, a wrapper learner is an algorithm that learns a simple grammar from structured text,
such as web pages. It induces rules for wrapping the document. For example, a simple
wrapper learner may learn the prefix and postfix of each field from a collection of program-generated web
pages. We may then extract fields from web pages based on these prefixes and postfixes. A transducer may also
be used to learn the extraction rules from a collection of web pages.

However, these methods learn the grammar of the input text; they do not learn an ontology from a
given document collection. In section 5.4, we propose a learning algorithm, called the slot-mining
algorithm, that mines a slot-tree ontology from a given XML collection.
This algorithm is a tool that helps the domain-knowledge designer design the slot-tree
ontology. Before presenting the slot-mining algorithm, we show in section 5.3 how a human
builds a slot-tree, in order to observe what is needed in designing such an
algorithm.

10.3 The process of building a slot-tree


In order to show the ontology design process, we trace the design steps of a simple slot-tree for
butterflies. There are six steps to design a slot-tree.

1. Browse XML data.
2. Identify object boundaries.
3. List all tags in this domain.
4. Identify slots for this domain.
5. Map each slot to tags (or xpath).
6. Identify values for each slot.

Browsing XML data : The first step in designing a slot-tree is to browse the data in order to understand it.
What is the structure of the XML collection? Can we identify the object boundaries in the XML documents?
What is the meaning of each tag? Does each tag correspond to a slot? What are the candidate values for a
slot? We have to answer these questions before constructing a slot-tree.

Identifying object boundaries : An object-block is an XML block that corresponds to an object. We
have to identify the boundaries of object-blocks to find out what objects the collection contains. For
example, in our butterfly collection, a <butterfly></butterfly> block is the boundary of a butterfly
object.

Listing all tags in this collection : An XML tag usually has a strong semantic meaning. For example, the
<color> tag represents the color of a butterfly. We may list all tags to understand the semantics of each
tag. For the simple butterfly collection, the tags are as follows.

Butterfly, adult, texture, color, size, geography, Taiwan, global

Identifying slots for this collection : Fortunately, these tags are not ambiguous.
The semantics of the tags are clear and definite, so we may build a slot for each tag.

Mapping slots to tags (or xpath) : For the simple butterfly collection, we can map each tag to one slot
directly. The following example shows the schema of the slot-tree.

<s slot="butterfly">
  <s slot="adult">
    <s slot="texture"/>
    <s slot="color"/>
    <s slot="size"/>
  </s>
  <s slot="geography">
    <s slot="Taiwan"/>
    <s slot="Global"/>
  </s>
</s>

Identifying values for each slot : In order to identify values for each slot, we have to read the data for
each slot. For example, if we read the data in the <color> tag, we may find that “black”, “white”,
“brown”, “orange”, “yellow”, “green”, “blue”, “purple” and “gray” are key values for this slot. We may fill
them into the value list of the color slot. After we fill in the values for each slot, the slot-tree
building process is finished. The following XML document shows a slot-tree for the simple butterfly collection.

<s slot="butterfly">
  <s slot="adult">
    <s slot="color">
      <v value="black"/><v value="white"/><v value="brown"/>
      <v value="yellow"/><v value="orange"/><v value="green"/>
      <v value="blue"/><v value="purple"/><v value="gray"/>
    </s>
    <s slot="texture">
      <v value="single color" keys="single, mono, uniform"/>
      <v value="spotted" keys="spot"/>
      <v value="lines" keys="line"/>
    </s>
    <s slot="size">
      <v value="small"/><v value="middle"/><v value="large"/>
    </s>
  </s>
  <s slot="geography">
    <s slot="Taiwan">
      <v value="north"/><v value="center"/><v value="south"/><v value="east"/>
    </s>
    <s slot="Global">
      <v value="Europe"/><v value="China"/><v value="India"/>
      <v value="America"/><v value="Australia"/>
    </s>
  </s>
</s>

In the slot-tree example above, a <v> tag represents a value in a slot. The simplest value is a
keyword. We may also specify a set of keywords or rules for a value, such as the “single color” value in
the “texture” slot.

The last step, “Identifying values for each slot”, is the most labor-intensive step in the whole
slot-tree building process. In order to construct the slot-tree automatically, we develop the slot-mining algorithm,
described in the next section, to mine the slot-tree from XML documents.
10.4 Slot-mining algorithm
A slot-mining algorithm mines a slot-tree from XML documents. The first step is to extract the paths
in the XML documents to build a schema. The second step uses statistical correlation analysis
to find out which terms are important for these paths. After that, a slot-tree is built in which each slot
corresponds to a path in the XML documents. The following figure shows a conceptual model of the
slot-mining algorithm.

Figure 5.1 The process of slot-mining algorithm

Before we describe the algorithm, we define some mathematical notation.

Definition : Slot-Vector
A slot-vector is a vector of weights of (slot, term) pairs for a given collection of XML blocks (B).

$v(B) = (B_{s_1,t_1}, \ldots, B_{s_1,t_k}, \ldots, B_{s_n,t_1}, \ldots, B_{s_n,t_k})$

$B_{s_i,t_j}$ is the weight of term $t_j$ in the blocks of B that belong to slot $s_i$.

$|B|$ is the abbreviation for $\sum_{s,t} B_{s,t}$
$|B_t|$ is the abbreviation for $\sum_{s} B_{s,t}$
$|B_s|$ is the abbreviation for $\sum_{t} B_{s,t}$

Definition : Slot-Vector Space Model (SVSM)
The model that represents XML documents by slot-vectors is called the Slot-Vector Space Model.

Example
1. A slot-vector for a given collection (D) is represented by the following formula.

$v(D) = (D_{s_1,t_1}, \ldots, D_{s_1,t_k}, \ldots, D_{s_n,t_1}, \ldots, D_{s_n,t_k})$

2. A slot-vector for a specified slot (s) of collection (D) is represented by the following formula.

$v(D_s) = (D_{s,t_1}, \ldots, D_{s,t_k})$

3. A slot-vector for a given document (d) is represented by the following formula.

$v(d) = (d_{s_1,t_1}, \ldots, d_{s_1,t_k}, \ldots, d_{s_n,t_1}, \ldots, d_{s_n,t_k})$

4. A slot-vector for a specified slot (s) of document (d) is represented by the following formula.

$v(d_s) = (d_{s,t_1}, \ldots, d_{s,t_k})$

Slot-Mining Problem
Given an XML document collection (D) and a set of slots (S), find the key values for each slot s : v(s).

Slot-Mining Algorithm
The slot vector for D is $v(D) = (D_{s_1,t_1}, \ldots, D_{s_1,t_k}, \ldots, D_{s_n,t_1}, \ldots, D_{s_n,t_k})$

Let $|D_t| = \sum_{i} D_{s_i,t}$

The slot vector for $D_s$ is $v(D_s) = (D_{s,t_1}, \ldots, D_{s,t_k})$

$v_{>r}(s) = \{\, t \mid D_{s,t} / |D_s| > r \cdot |D_t| / |D| \,\}$

$v_{>r}(s)$ is called the r-key-set for slot (s). It contains the terms whose relative frequency inside the slot is more than r times their relative frequency in the whole collection.

In our XML-mining system, we set the parameter r = 2.0 to extract the key values for each slot.

The following figure shows the pseudo code of the slot-mining algorithm.

Algorithm Slot-Mining (D, r)

    P  = {p | p is a path in D}
    PT = {(p,t) | term t occurs under path p in D}
    SV = {}
    for each occurrence of (p,t) in D
        |Dp,t| = |Dp,t| + 1
        |Dp| = |Dp| + 1
        |Dt| = |Dt| + 1
        |D| = |D| + 1
    end for
    for each (p,t) in PT
        p(t | p) = |Dp,t| / |Dp|
        p(t) = |Dt| / |D|
        if p(t | p) / p(t) > r then put (p,t) into SV
    end for
    return SV

Figure 5.2 The Pseudo Code of Slot-Mining Algorithm
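
The pseudo code can be rendered in Python roughly as follows. This is a sketch under stated assumptions rather than the program used in our experiments: the input is assumed to be a flat list of (path, term) occurrences, and the toy data are invented so that the ratio test can be checked by hand.

```python
from collections import Counter


def slot_mining(occurrences, r=2.0):
    """Return {path: set of key terms} using the test p(t|p) / p(t) > r."""
    occurrences = list(occurrences)
    n_pt = Counter(occurrences)               # |Dp,t|
    n_p = Counter(p for p, _ in occurrences)  # |Dp|
    n_t = Counter(t for _, t in occurrences)  # |Dt|
    total = len(occurrences)                  # |D|

    key_values = {}
    for (p, t), count in n_pt.items():
        p_t_given_p = count / n_p[p]
        p_t = n_t[t] / total
        if p_t_given_p / p_t > r:
            key_values.setdefault(p, set()).add(t)
    return key_values


# Invented toy collection: "wing" appears under every path, so it is not a key.
toy = (
    [("color", t) for t in ["brown", "brown", "white", "wing"]]
    + [("texture", t) for t in ["spot", "line", "spot", "wing"]]
    + [("size", t) for t in ["large", "small", "large", "wing"]]
)
mined = slot_mining(toy, r=2.0)
```

In the toy data, "brown" has p(t|p) = 2/4 under the color path and p(t) = 2/12 overall, a ratio of 3, so it is kept. The term "wing" has a ratio of (1/4) / (3/12) = 1 under every path and is rejected.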

The slot-mining algorithm mines values from the XML collection D. The mined values should then be
modified and organized into a slot-tree to improve the quality. Let us look at a mining
example for the slot “color”.

Example :
<color> head, brown, yellow, body, white, wing, gray, blue, black, background, line, spot
</color>

In the mining result above, “brown, yellow, white, gray, blue, black” are what we want, but “head,
body, wing, background, line, spot” are noise words. So far, we cannot distinguish these two
groups by statistical methods alone. One possible solution is
to incorporate a dictionary such as WordNet to distinguish them. We will try this solution in
the future.

10.5 Discussion
In order to help people construct the slot-tree ontology, we developed a slot-mining algorithm
that mines a slot-tree from XML documents. The slot-mining algorithm is used as an authoring tool
for constructing the slot-tree ontology.

The slot-mining algorithm mines slot-trees from a collection of XML documents. Our
approach is based on statistical correlation analysis between tags and terms. The correlation
analysis decides which terms are important for a given tag, and fills these terms into the slot of this
tag.

Some modification of the automatically constructed slot-tree is needed in order to
improve its quality. First, we have to merge paths with the same meaning into one slot in order
to simplify the structure of the slot-tree. Second, we have to delete incorrectly mined values
and merge values with the same meaning in order to improve the quality of each slot.

The slot-mining algorithm is used to construct the ontology for butterflies in section 6.7
and the ontology for proteins in section 7.7. We show the full mined slot-trees
in these sections.
Part 3 : Case Studies

11 Case Study - A Digital Museum of Butterflies


In part 2, we described our methods, including the slot-tree ontology, the slot-filling algorithm,
the slot vector space model and the slot-mining algorithm. These methods are used to build a semantic
retrieval system for XML. In this part, we use two XML collections to test our methods:
a collection for butterflies and a collection for proteins.

In chapter 6, we test our methods on the collection “A Museum of Butterflies in
Taiwan (MBT)”. In chapter 7, we test our methods on the collection “Protein
Information Resource (PIR)”. Both collections are encoded in XML format.

In this chapter, an overview of MBT is given in section 6.1. A source XML document of
MBT is shown in section 6.2. A slot-tree for MBT is described in section 6.3. A query
interface based on the slot-tree is described in section 6.4. The slot-filling process for MBT is
described in section 6.5. The retrieval process for MBT is discussed in section 6.6. The mining
process to build the slot-tree for MBT is discussed in section 6.7. A discussion of our approach on
MBT is given in section 6.8.

11.1 Introduction
The Digital Museum of Butterflies is a collection of documents about butterflies in Taiwan. Each document in this collection
describes one species of butterfly in Taiwan. The following table gives a profile of this collection.

Table 6.1 : A Museum of Butterflies in Taiwan

Collection      A Museum of Butterflies in Taiwan (台灣蝴蝶數位博物館)
Working Group   NMNS : National Museum of Natural Science (國立自然科學博物館), Taiwan
                URL : http://www.nmns.edu.tw/
                NCNU : National Chi-Nan University (暨南大學), Taiwan
                URL : http://dlm.ncnu.edu.tw/butterfly/index.htm
                NTU : National Taiwan University (台灣大學), Taiwan
                URL : http://turing.csie.ntu.edu.tw/ncnudlm/
Size            356 species, 356 XML documents.
Language        Tags in English, content in Chinese

The Digital Museum of Butterflies in Taiwan contains XML documents for 356 species of butterflies in
Taiwan. Roughly speaking, the tags may be classified into the following groups.

Table 6.2 : XML tags for butterflies in Taiwan

Group            Fields
Classification   name, family, cfamily (Chinese family), genus, species, subspecies
Host             Host plant, Honey plant
Geography        Taiwan, global
Egg              Color, shape, feature, characteristic, days of growth, enemy
Larva            Color, shape, feature, characteristic, days of growth, enemy
Pupa             Color, shape, feature, characteristic, days of growth, enemy
Adult            Color, shape, texture, characteristic, life period, enemy

11.2 The Representation of Butterflies in XML


The following figure shows an XML document for the butterfly “kodairai”.
<butterfly>
  <cname>拉拉山三線蝶</cname>
  <classification>
    <family>Nymphalidae</family>
    <cfamily>蛺蝶科</cfamily>
    <genus>Athyma</genus>
    <species>fortuna</species>
    <sub_species>kodairai</sub_species></classification>
  <hostplant>忍冬科 (Caprifoliaceae) 的松田氏紅子仔 (Viburnum luzonicum var. matsudai)。</hostplant>
  <honeyplant>成蝶喜吸食腐熟水果汁液或樹幹流出汁液。</honeyplant>
  <geographic><taiwan>分布於台灣中北部地區,海拔 1000-2000 公尺間山區均有分布。</taiwan>
    <global>中國大陸中部有原名亞種分布。</global></geographic>
  <life_stage>
    <egg>
      <feature>底部扁平之高饅頭形,表面有明顯六角形格狀花紋,於六角形頂點處,各著生一細長刺毛…
      <color>淡綠。</color> <size>直徑約為 1.1-1.3mm。</size>
      <predator>各類卵寄生蜂、蜱等節肢動物。</predator>
      <days_of_growth>卵期約為 5-6 天左右。</days_of_growth></egg>
    <larva>
      <feature>終齡幼蟲體呈長圓筒狀,頭部密生硬棘,各體節背方及體側皆長有具星狀刺之突起…
      <color>終齡幼蟲頭部褐色,表面密生棘狀突起。體呈翠綠色,各體節背方及體側突起基部為藍色,星狀刺為黃綠色。</color>
      <size>終齡幼蟲體長約為 33-41 mm。</size>
      <predator>寄生蜂、寄生蠅、小繭蜂、椿象、蜥蜴及鳥類等。</predator>
      <days_of_growth>冬季以二齡幼蟲越冬,幼蟲期長達半年以上。</days_of_growth>
      <defense>初齡幼蟲停棲於寄主葉脈,攝食葉脈兩側葉肉,二齡幼蟲會將寄主植物葉片咬成小塊並吐絲將其此碎片及糞便黏於葉脈造一蟲巢,越冬幼蟲即躲藏於蟲巢當中,由於幼蟲褐色之體色與蟲巢上乾枯之小葉片或糞便色澤相近,或許可混淆天敵耳目。</defense></larva>
    <pupa>
      <feature>蛹體為垂蛹,中胸背方隆起,腹節末端有一柄狀懸絲器。頭部前端有一對大型明顯之彎曲角狀突出物,腹節背方均有小型鋸齒狀脊起。</feature>
      <color>蛹體底色呈黃褐色,中、後胸背方有銀色斑塊,體側氣門黑褐色。</color>
      <size>蛹體長度約為 22-27mm。</size>
      <predator>蛹寄生蜂、胡蜂、姬蜂及各種真菌等。</predator>
      <days_of_growth>蛹期約為 15-20 天,視溫度而定。</days_of_growth>
      <defense>老熟幼蟲化蛹於隱蔽之植物叢間,藉以躲避天敵。</defense></pupa>
    <adult>
      <feature>成蟲前翅外觀大致呈現三角形,翅形稍微橫長。後翅卵圓形,外觀接近三角形。雌蝶翅型較為寬圓。</feature>
      <color>雄蝶前、後翅表底色為黑色,前翅中室內有一枚長形白斑,各翅室中橫線部位有一大型白色橢圓斑,前翅端有兩枚小型白斑。後翅有兩條明顯白色橫帶紋,前後翅緣皆有不明顯小白紋。雌蟲翅表色澤花紋與雄蟲相似。</color>
      <size>本種為中型蝶種,展翅約為 50-60mm。</size>
      <characteristic>前翅中室內有一枚長形白斑。</characteristic>
      <habitate>台灣中部以北山區均有分布。</habitate>
      <predator>蜘蛛、螳螂、青蛙、蜻蜓、鳥類及蜥蜴等捕食性天敵。</predator>
      <days_of_growth>前翅中室內有一枚長形白斑。</days_of_growth>
      <defense>成蟲飛行快速,外觀與其他多種三線蝶類似,為莫氏擬態的一種。</defense>
      <season>夏季較易見到成蟲活動。</season>
      <behavior>成蝶喜吸食腐熟水果汁液或樹幹發酵流出之樹液,成蟲活動於開闊林道,常見成蟲於開闊山徑兩旁樹上佔據地盤驅趕附近飛過蝴蝶,亦可見其活動於溪邊開闊處,吸食腐果或潮濕地面水分。</behavior>
    </adult>
  </life_stage>
</butterfly>

Figure 6.1 : An XML document for butterfly (Full List)

11.3 Slot-Tree Ontology for Butterflies


Our ontology is represented as a slot-tree in XML format. The slot-tree we designed for butterflies is
consistent with the target collection; both follow the schema below.

<butterfly>
<classification/>
<Geography/>
<life-period>
<Egg/>
<Larva/>
<Pupa/>
<Adult/>
</life-period>
</butterfly>

Each object in the “life period” (egg, larva, pupa, adult) has a sub-schema that describes the object. The
sub-schema looks like the following tree.

<object>
<Color/>
<shape/>
<feature/>
<size/>
</object>

The consistency between the slot-tree and the documents eases our design process. In addition, the
consistency eliminates ambiguity in our retrieval and browsing processes. In contrast, a poorly
designed XML document structure makes the domain-knowledge design process difficult, and
makes the domain knowledge less helpful for the retrieval and browsing processes.

A fragment of the slot-tree for butterflies is shown in the following figure. For the full slot-tree,
please see appendix 1.
<butterfly>
  <family slot="種類" path="//butterfly//cfamily//">
    <v value="弄蝶" keys="Hesperiidae" /><v value="小灰蝶" keys="Lycaenidae" /> ….</family>
  <adult slot="蝴蝶成蟲" keys="Adult" path="//butterfly//adult//">
    <shape slot="蝴蝶的形狀" keys="Adult:Shape" path="//butterfly//adult//shape//">
      <v value="類似燕尾" image="swallowtail.gif"/> <v value="翅緣波浪狀" …/>…</shape>
    <color slot="蝴蝶的顏色" keys="Adult:Color" path="//butterfly//adult//color//">
      <v value="黑色" keys="Black" />… <v value="黑白相間" keys="Black_White"/>…
    </color>
    <texture slot="蝴蝶的特徵" keys="Adult:Texture" path="//butterfly//adult//texture//">
      <v value="沒有花紋" image="mono.gif" /><v value="少數斑點" image="spot.gif" /> …
    </texture>
  </adult>
  <pupa slot="蝴蝶的蛹" keys="Pupa" path="//butterfly//pupa//">
    <s slot="蛹的形狀" path="//butterfly//pupa//"><v value="突起" keys="Skin_Stick" /> …</s>
    <s slot="蛹的顏色" keys="Pupa:Color" path="//butterfly//pupa//color//">
      <v value="翠綠色" keys="Green"/> <v value="褐色" keys="Wood" /> …</s>
    <s slot="蛹的特徵" keys="Pupa:Feature" path="//butterfly//pupa//feature//">
      <v value="帶蛹" keys="Laying_Pupa"/><v value="垂蛹" keys="Hanging_Pupa"/>
    </s></pupa>
  <egg slot="蝴蝶的卵" keys="Egg" path="//butterfly//egg//">
    <s slot="卵的形狀" keys="Egg:Shape" path="//butterfly//egg//feature//">
      <v value="圓球形" keys="Ball" image="egg_ball.jpg" />
      <v value="半球形" keys="饅頭形+Half_Ball" image="egg_half_ball.jpg" /> …</s>
    <s slot="卵的顏色" keys="Egg:Color" path="//butterfly//egg//color//">
      <v value="乳白" keys="Milk_White" /> …</s>
    <s slot="卵的特徵" keys="Egg:Texture" path="//butterfly//egg//feature//">
      <v value="表面光滑" keys="Smooth"/>…<v value="格狀花紋" keys="Square_Texture"/> …</s>
  </egg>
  <larva slot="蝴蝶的幼蟲" keys="Larva+毛毛蟲" path="//butterfly//larva//">
    <s slot="幼蟲的形狀" keys="Larva:shape" path="//butterfly//larva//feature//">
      <v value="紡棰形" keys="Like_Shuttle" /><v value="鳥糞狀" keys="Like_Bird's_Shit" /> …</s>
    <s slot="幼蟲的顏色" keys="Larva:Color" path="//butterfly//larva//color//">
      <v value="綠色" keys="Green" /><v value="褐色" keys="Brown" /> …</s>
    <s slot="幼蟲的特徵" keys="Larva:Texture" path="//butterfly/life_stage/larva/characteristic">
      <v value="短毛" keys="Short_Hair" /><v value="長毛" keys="Long_Hair" /> … </s>
  </larva>
  <s slot="台灣分布" keys="Taiwan" path="//butterfly//geographic//taiwan//">
    <v value="台灣北部" keys="North_Taiwan+北" /> …</s>
  <s slot="全球分布" path="//butterfly//geographic//global//">
    <v value="東南亞" keys="South_Asia" /><v value="中國大陸" keys="China" /> …. </s>
  <s slot="體型大小" keys="Size" path="//butterfly//adult//size//">
    <v value="小型" keys="Small_Size+小" /><v value="中型" keys="Middle_Size+中" /></s>
  <s slot="棲息地" keys="棲息地=Habitate" path="//butterfly//adult//habitate//">
    <v value="平地" keys="Ground" />…<v value="高海拔山區" keys="High_Mountain" /> …
  </s>
  <s slot="宿主植物" keys="Hostplant+寄主植物" path="//butterfly//hostplant//">
    <v value="豆科" keys="Leguminosae" /><v value="大戟科" keys="Euphorbiaceae" /> …</s>
  <s slot="飲食習慣" keys="Eat Food" path="//butterfly//adult//behavior//;//butterfly//honeyplant//">
    <v value="食花蜜" keys="Nectar" /><v value="食腐汁" keys="Juice" />…</s>
</butterfly>

Figure 6.2 : A slot-tree for butterflies

11.4 Query Interface
The query interface is built automatically by transforming the slot-tree into a web page. An
XSLT template transforms the slot-tree into an HTML document, which is then shown as a web page in
the browser. The following figure shows the query interface for the butterfly domain.
Figure 6.3 A Query Interface for Butterflies

The query interface above generates the following query.

<query sort_by="蝴蝶大小" path="/butterfly/adult/size$meter" order="-">
  <s slot="蝴蝶花紋" path="//butterfly//adult//texture" value="水平色帶"/>
  <s slot="台灣分布" path="//butterfly//geographic//Taiwan" value="恆春半島"/>
</query>

After the interface submits the query to our XML retrieval system, the retrieval results are shown.
The query above specifies both the query expression and the ranking strategy. The ranking strategy is by
the size of the adult butterfly in decreasing order. Based on the query, the XML retrieval system
retrieves the butterfly objects and ranks them by size. We will show the query results in the
following section.

11.5 Slot-Filling Algorithm
We have to parse XML objects before filling the documents into the slot-tree. For example, the following
XML document describes a butterfly called “maraho”.

<butterfly>
  <cname>寬尾鳳蝶</cname>
  <geographic>
    <taiwan>本種分布於海拔較高山區,台灣中北部海拔 1000-1500 公尺山區才可見 … </taiwan>
  </geographic>
  <egg><feature>外觀呈圓球形</feature></egg>
  <adult><color>成蟲翅表為黑色</color></adult>
  <footnote>本種經行政院農業委員會公告為一級瀕臨滅 保育 ……</footnote>
</butterfly>

The example above is parsed into a sequence of (path, value) pairs as follows.

(butterfly\ cname , 寬尾鳳蝶)

(butterfly\ geographic\ taiwan, 本種分布於海拔較高山區,台灣中北部海拔 1000-1500 公尺山區才可見 …)

(butterfly\ egg \ feature, 外觀呈圓球形)

(butterfly\ adult \ color, 成蟲翅表為黑色)
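
The parsing step can be sketched with a standard XML parser. The Python fragment below is a minimal illustration: the function name `path_value_pairs` is hypothetical, and only elements that directly contain text produce a pair.

```python
import xml.etree.ElementTree as ET


def path_value_pairs(xml_text):
    """Flatten an XML document into (path, value) pairs in document order."""
    root = ET.fromstring(xml_text)
    pairs = []

    def walk(node, prefix):
        path = node.tag if not prefix else prefix + "\\" + node.tag
        text = (node.text or "").strip()
        if text:
            pairs.append((path, text))
        for child in node:
            walk(child, path)

    walk(root, "")
    return pairs


doc = """<butterfly>
  <cname>寬尾鳳蝶</cname>
  <egg><feature>外觀呈圓球形</feature></egg>
  <adult><color>成蟲翅表為黑色</color></adult>
</butterfly>"""

pairs = path_value_pairs(doc)
```

For this shortened document, the sketch returns the pairs for `butterfly\cname`, `butterfly\egg\feature` and `butterfly\adult\color`, in that order.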

Then we may fill them into the corresponding slots as follows.

(butterfly\ egg \ feature, 外觀呈圓球形)

→ <slot name="卵的形狀" path="butterfly//egg//feature">
    <value name="圓球形"/>
    <value name="半球形"/>
    ….

卵的形狀 : 圓球形

11.6 XML Retrieval
After the user submits the query to the XML retrieval system, the XML retrieval system
retrieves the query results. Then an XML extraction algorithm extracts values for each slot.
After that, a sorting function sorts the result by the size of butterflies. The following figure
shows the query results.
Figure 6.4 A Query Result for Butterflies

11.7 Slot-Mining Algorithm

Chinese Word Learning


A problem specific to the Chinese language is word boundary detection. In English, the words in a
sentence are separated by spaces, but in Chinese there are no spaces between words. This causes some
difficulty for our XML text-mining task. One way to solve the problem is to use a dictionary to find
the words that appear in a sentence. The deficiency of this approach is that no dictionary contains all
words, and many unknown words are used in a specialized domain. We therefore have to learn words
dynamically. In our system, we adopt the keyword-learning algorithm proposed by L.F. Chien
[Chien97]. This algorithm is based on the following observation: “Both the right-hand side and the
left-hand side of a word should be ‘free’.” Here ‘free’ means that, statistically, the word can be
adjacent to many different neighbors. For example, we may extract the word ‘三線蝶’ from the
following sentences based on the statistical freedom of this word.

…雄紅三線蝶身上有…
…江崎三線蝶分布於…
…台灣三線蝶是一種…
…埔里三線蝶屬於小…
“三線蝶” left neighbor {紅,崎,灣,里} ,right neighbor {身,分,是,屬 }
“線蝶” left neighbor {三} ,right neighbor {身,分,是,屬}
“三線” left neighbor {紅,崎,灣,里} ,right neighbor {蝶}

The string “三線蝶” has four distinct neighbors on both its left and right sides. In contrast, “線蝶” has
only one left neighbor, and “三線” has only one right neighbor. A string with many neighbors on both
sides is very likely to be a word, so “三線蝶” is put into the learning dictionary for the subsequent
XML text-mining step.
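
The neighbor counting in this example can be reproduced with a few lines of Python. This is only a simplified sketch of the 'freedom' test: the `neighbors` and `is_free` helpers and the `min_neighbors` threshold are assumptions made for illustration, and it does not implement the PAT-tree method of [Chien97].

```python
def neighbors(candidate, sentences):
    """Collect the distinct characters seen just left and right of candidate."""
    left, right = set(), set()
    for s in sentences:
        start = s.find(candidate)
        while start != -1:
            end = start + len(candidate)
            if start > 0:
                left.add(s[start - 1])
            if end < len(s):
                right.add(s[end])
            start = s.find(candidate, start + 1)
    return left, right


def is_free(candidate, sentences, min_neighbors=2):
    """A candidate is 'free' if both sides have at least min_neighbors neighbors."""
    left, right = neighbors(candidate, sentences)
    return len(left) >= min_neighbors and len(right) >= min_neighbors


# The four sentence fragments from the example above.
sentences = ["雄紅三線蝶身上有", "江崎三線蝶分布於", "台灣三線蝶是一種", "埔里三線蝶屬於小"]

for candidate in ["三線蝶", "線蝶", "三線"]:
    left, right = neighbors(candidate, sentences)
    print(candidate, sorted(left), sorted(right), is_free(candidate, sentences))
```

With these four fragments, only “三線蝶” passes the test on both sides; “線蝶” fails on the left and “三線” fails on the right.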

Slot-Mining
After the word-learning step, the slot-mining algorithm described in section 3.5 is used to extract
important words for each slot. The following table shows some results of slot mining (part of the
slot-tree).

Table 6.3 : A Result of Slot-Mining Algorithm for Butterflies


Slot Value List
\butterfly\classification\cfamily 鳳蝶科, 蛺蝶科, 蛇目蝶科, 粉蝶科, 斑蝶科, 弄蝶科, 小灰蝶科
\butterfly\classification\family Satyridae, Pieridae, Papilionidae, Papilio, Nymphalidae, Lycaenidae,
Hesperiidae, Danaidae
\butterfly\cname 蝶 , 鳳蝶, 蛺蝶, 蔭蝶, 胡麻斑粉蝶, 胡麻, 粉蝶, 樺斑蝶, 斑蝶, 弄蝶,台灣
\butterfly\footnote 高冷蔬菜區, 非常, 開發, 開墾, 長達,近年來, 種經, 種族群, 破壞, 生活史, 生
存, 環境, 帶 果園, 帶 , 海拔山, 海拔, 植物, 棲息環境, 棲息,, 本種, 更使,
族群分布, 族群分, 族群, 拔山, 情形, 寄主植物, 寄主, 台灣中, 台灣, 分布, 再
加上農藥, 侷限同時, 侷限, 使用, 主植物
\butterfly\geographic\global 馬來半島,非洲,錫金 西部,蘇門達臘,蘇門答臘 蘇門,群島,美洲,緬甸北部,緬
甸,琉球群島,琉球,爪哇,熱帶,澳洲東部,澳洲,泰國,歐洲,東部,東亞,本種尚分
布,本種,朝鮮半島,朝鮮,有分,日本,新幾內亞,斯里蘭卡,廣泛分布,廣泛,幾內亞,
巴基斯坦,尼泊爾,尚分布,婆羅洲,地區皆,地區均,地區,喜馬拉,喀什米爾,印度,
南部,半島,區皆,區均,北部,利亞,分布,分 ,亞種分布,亞熱帶地區,亞熱帶,亞洲,
亞地區,中國大陸,中南半島,中亞,
\butterfly\honeyplant\ 馬櫻丹,馬利筋,馬利,金露花,野花,豐草,菊科野花,菊科, 菊科,花蜜,腐熟,繁星
花,繁星,紫花霍香薊,紫花,流出,汁液,水果汁,水果,樹液,樹幹,植物,果汁液,果
汁,成蟲,成蝶,小型,多種,咸豐草,咸豐,吸食花蜜,吸食腐,吸食,各種野花, 各種,
\butterfly\life_stage\adult\predator ,鳥類,青蛙,螳螂,蜻蜓,蜥蜴,蜘蛛,捕食性天敵,捕食,性天敵,天敵,
\butterfly\life_stage\egg\characteristic ,表面平滑,表面,縱脊,細微,精孔,突起,條縱脊,條細微縱脊,條細微縱脊,明顯縱
脊,明顯,數條,平滑,刻點,光澤,中央精孔,中央,
\butterfly\life_stage\egg\feature\ ,高饅頭形,饅頭,頂點,頂部微凸,頂部,角形,表面,著生,菱形,花紋,縱脊,細長刺
毛,細長,細小突起,細小,精孔,突起,稍微,砲彈形,砲彈,瓶形,球形,明顯縱脊,明
顯,扁平,扁圓形,扁圓,微凹,微凸,底部扁平,底部,小突起,圓球形,圓球,圓球,圓
形精孔,圓形,各著生,半圓球形,半圓,刺毛,凹陷,六角形, 滿,中央微凹,中央微
凸,中央,
\butterfly\life_stage\adult\color\ 黑褐色,黑褐,黑色細帶紋,黑色斑點,黑色斑紋,黑色性徵,黑色帶紋,黑色小斑,
黑色小圓斑,黑色外框,黑色圓斑,黑色,黑紋,黑眼紋,黑白,黑斑,黃色,鱗片,體型,
體呈,顯眼,面底色,靠近,青藍色,雌蟲,雌蝶色澤,雌蝶翅表,雌蝶外觀,雄蟲相似,
雄蟲相,雄蟲,雄蝶相似,雄蝶前,雄蝶,附近,長型白斑,金屬,部分,部位,角形,規
則,褐色帶紋,褐色帶,褐色,表無,表各,蟲翅,蝶翅,蝶前,藍色,花紋,色鱗,色細,色
紋,色澤,色斜,色斑點,色斑紋,色斑,色帶紋,色帶,色小,色寬,色外框,色外,色圓
斑,色圓,色區域,至亞,腹面,肛角部位,肛角,翅表色澤,翅表底色,翅表,翅腹面底
色,翅腹色澤,翅腹底色,翅腹,翅脈,翅緣,翅第,翅端,翅形,翅外緣部位,翅外,翅
基部位,翅基,翅中,縱貫,緣部,緣毛,線部位,細紋,細小,紫色,紋橫,紋分,紅色,端
角,突起,眼紋,眼狀,相間,相似,白點,白色鱗,白色細帶紋,白色斜帶紋,白色斑紋,
白色斑,白色帶紋,白色,白斑,狹長,狹長,狀細,狀紋,狀突起,狀突,無明顯差異,
灰黑色,灰黑,灰褐色,灰褐,灰白色,灰白,淺黃色,淺黃,深褐色,深褐,深色,淡紫
色,淡紫色,消失,波狀,橫線,橢圓,橙黃色,橙色外,橙色圓斑,橙色,橘黃色,枚白
斑,枚白,枚小白斑,枚小,暗藍色,明顯黑,明顯白色,明顯,斑點,斑紋,數枚白斑,
數枚白,數枚,散生,排列,成蟲,成蝶前,成蝶,性徵,後翅表底色,後翅表,後翅色澤,
後翅腹面,後翅肛角,後翅第,後翅外緣,後翅前緣,後翅中央,後翅,後緣,形黑,形
成,底色,帶金屬光澤,帶金屬,帶金,帶紋,帶狀,差異,差異,尾狀突起,小黑圓斑,
小黑,小部份,小白,小斑,小型白斑,小型,寬圓, 寬圓,室各,室亞外緣,大型,多數,
外觀,外緣部位,外緣,外橫線,外框,外圈,基部,型黑,型白,圓斑,圓形,呈黑,呈灰,
各翅室,及第,區域,前翅表底色,前翅端部,前翅端角,前翅端,前翅前緣,前翅中
室內,前翅中室,前翅中央,前翅中,前翅,前緣,分布,分 ,具性徵,其中,兩枚,光澤,
滿,亞外緣部位,亞外緣,中橫線附近,中橫線附近,中橫線,中室內,中室,中央
部位,中央部,中央,三角形
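
To give a feel for how a value list like the ones in this table could be produced, the Python sketch below keeps, for each path, the terms that occur mostly under that path. It is a simplified stand-in, not the correlation measure of section 3.5: the `mine_slots` helper, the `min_count` and `min_share` thresholds, and the tiny input are all assumptions made for illustration.

```python
from collections import Counter, defaultdict


def mine_slots(pairs, min_count=2, min_share=0.6):
    """pairs: iterable of (path, list_of_terms). Returns {path: sorted terms}.

    A term is kept for a path if it occurs at least min_count times there and
    at least min_share of all its occurrences fall under that path.
    """
    per_path = defaultdict(Counter)
    total = Counter()
    for path, terms in pairs:
        for term in terms:
            per_path[path][term] += 1
            total[term] += 1
    mined = {}
    for path, counts in per_path.items():
        mined[path] = sorted(
            term for term, count in counts.items()
            if count >= min_count and count / total[term] >= min_share
        )
    return mined


# Hypothetical segmented text, one entry per (path, value) pair.
pairs = [
    ("butterfly\\egg\\feature", ["圓球形", "表面"]),
    ("butterfly\\egg\\feature", ["圓球形", "縱脊"]),
    ("butterfly\\adult\\color", ["黑色", "表面"]),
    ("butterfly\\adult\\color", ["黑色", "白斑"]),
]
mined = mine_slots(pairs)
```

A term such as “表面”, which appears under several paths, is spread too thinly to be kept for any of them under these thresholds. The table above shows that the real output still contains such shared terms, which is one reason the mined slot-tree needs manual adjustment.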

11.8 Discussion
In this chapter, we have applied our methods to the case of butterflies. We described the
following methods.

1. Modeling XML documents of butterflies.
2. Constructing a slot-tree ontology for butterflies.
3. Using the slot-filling algorithm to map XML documents into the slot-tree of butterflies.
4. Using the slot-tree ontology to build a query interface for butterflies.
5. Using the slot-tree ontology to help XML retrieval for butterflies.
6. Mining the slot-tree ontology from XML documents of butterflies.

These methods reduce the semantic gap between human and computer in the domain of
butterflies. The query interface enables users to write queries easily. The slot-filling algorithm
makes it easier for the computer to understand XML documents. Finally, the mining algorithm
makes it easier for us to construct the slot-tree ontology.
12 Case Study - Protein Information Resource
In the previous chapter, we tested our methods on the collection of “A Museum of
Butterflies in Taiwan (MBT)”. In this chapter, we test our methods on the collection of the
“Protein Information Resource (PIR)”. The PIR is a large collection of protein data
maintained at Georgetown University.

In this chapter, an overview of PIR is given in section 7.1. Some XML data from PIR are
shown in section 7.2. A slot-tree for PIR is described in section 7.3. A query interface for
proteins is described in section 7.4. The slot-filling process for PIR is described in section 7.5.
The retrieval process for PIR is discussed in section 7.6. The mining process used to build a slot-tree
for PIR is discussed in section 7.7. A discussion of our approach on PIR is given in section 7.8.

12.1 Introduction
The Protein Information Resource is a general collection of protein and gene records for living organisms,
including humans, animals, plants, viruses, bacteria, etc. Each document in this collection describes a gene
or a protein. The following table gives a profile of this collection.

Table 7.1 : The Protein Information Resource

Collection       Protein Information Resource
Working Group    National Biomedical Research Foundation at Georgetown University
URL              http://pir.georgetown.edu/
Size             The PIR-PSD, Release 72.03, May 17, 2002, contains 283174 entries
Language         English

The Protein Information Resource contains 283174 protein and gene entries. Roughly speaking, its tags
may be classified into the following groups.

Table 7.2 : XML tags for Protein Information Resource

Group            Fields
Identification   ID, name
Characteristic   organism, function, classification, feature
Gene             sequence, length, type
Reference        keyword, reference (author, citation), access information
Date             create_date

12.2 The Representation of Proteins in XML
The following figure shows an XML document for the protein entry “S35333”.
- <ProteinEntry id="S35333">

<created_date>03-Feb-1994</created_date>

- <protein><name>steroid receptor protein svp44</name></protein>

- <organism><source>zebra fish</source>…</organism>

- <reference>…<author>Fjose, A.</author> …..<citation>EMBO J.</citation>

<volume>12</volume><year>1993</year><pages>1403-1414</pages>

<title>Functional conservation of vertebrate seven-up related genes in neurogenesis and eye

development…

- <xrefs><xref><db>MUID</db><uid>93223680</uid></xref></xrefs>

- <accinfo label="FJO">…<mol-type>mRNA</mol-type> <seq-spec>1-411</seq-spec> -

<xrefs><xref><db>EMBL</db><uid>X70299</uid></xref>…

</reference>

- <classification><superfamily>unassigned erbA-related proteins</superfamily> …

- <keywords>DNA binding, steroid hormone receptor, zinc finger</keywords>

- <feature label="ERBA">

<feature-type>domain</feature-type>

<description>erbA transforming protein homology</description>

<seq-spec>74-320</seq-spec>

</feature>…

- <summary><length>411</length><type>complete</type></summary>

<sequence>MAMVVSVWRDPQEDVAGGPPSGPNPAAQPAREQQQAASAAPHTPQTPSQPGPPSTPGTAGDK…

</ProteinEntry>

Figure 7.1 : An example of XML document in Protein Information Resource

12.3 Slot-Tree Ontology for Proteins

Our ontology is represented as a slot-tree in XML format. The slot-tree we designed for
proteins does not exactly follow the structure of the PIR collection. For example, the keyword
field in PIR contains any keyword that is important for a protein, whereas our ontology uses
several slots to represent a protein, including “protein structure”, “molecular function”,
“biological process” and “cellular component”.

The ontology used in this section is based on the suggestions of the Gene Ontology. The Gene
Ontology Consortium proposed an ontology system with three dimensions: “molecular
function”, “biological process” and “cellular component”. In addition, we add “protein
structure”, “protein size”, “molecular type”, and some other information to our slot-tree.
A fragment of the slot-tree for proteins is shown in the following figure. For a full list
of the slot-tree for proteins, please see appendix 2.
- <frame>

- <s slot="mol-type" path="/ProteinEntry/reference/accinfo/mol-type">

<v value="protein" /><v value="DNA" /><v value="RNA" /></s>

- <structure slot="mol-shape" path="//">

<v value="Alpha Helix"/><v value="Beta Sheet"/>…

- <source_genus slot="organism" path="//">

<v value="Animal" /><v value="Plants" /><v value="Bacteria"/>…

- <body_component slot="body_component" path="//" >

<v value="Heart" /><v value="Lung" /><v value="Liver" />…

- <cell_component slot="cell_component" path="//">

<v value="Nucleus" /><v value="Cytoplasm" />…<v value="Golgi_Bodies"/>…

- <body_function slot="body_function" path="//">

<v value="Digestion" /> <v value="Respiration" /> <v value="Motion" /> …

- <cell_function slot="cell_function" path="//">

<v value="Structural" /><v value="Metabolism" /><v value="Communication" /> …

- <material slot="material" path="protein/target">

<v value="Acid" /><v value="Base" />…<v value="Enzyme" />…

- <s slot="ref-db" path="//db">

<v value="SGD" />…<v value="GDB" /> …<v value="FlyBase" />

</frame>

Figure 7.2 : A Slot-Tree for Proteins

12.4 Query Interface
The query interface is generated automatically from the slot-tree. We use an XSLT template
to transform the slot-tree into an HTML document, which is then shown as a web page in
the browser. The following figure shows the query interface for the protein domain.
Figure 7.3 A Query Interface for Proteins
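
The idea of deriving a form from the slot-tree can be sketched as follows. The system described here uses an XSLT template; the Python code below is only a stand-in that performs a comparable transformation with the standard library. The `slot_tree_to_form` helper and the generated HTML layout are assumptions, not the actual template.

```python
import xml.etree.ElementTree as ET
from html import escape

# Two slots taken from the slot-tree fragment in Figure 7.2.
slot_tree = """<frame>
  <s slot="mol-type" path="/ProteinEntry/reference/accinfo/mol-type">
    <v value="protein"/><v value="DNA"/><v value="RNA"/>
  </s>
  <s slot="ref-db" path="//db">
    <v value="SGD"/><v value="GDB"/><v value="FlyBase"/>
  </s>
</frame>"""


def slot_tree_to_form(xml_text):
    """Render one <select> per slot, with one <option> per listed value."""
    root = ET.fromstring(xml_text)
    lines = ['<form method="get">']
    for slot in root:
        name = escape(slot.get("slot", slot.tag), quote=True)
        lines.append('  <label>%s <select name="%s">' % (name, name))
        lines.append('    <option value=""></option>')  # empty choice: slot not constrained
        for v in slot.findall("v"):
            value = escape(v.get("value", ""), quote=True)
            lines.append('    <option value="%s">%s</option>' % (value, value))
        lines.append("  </select></label>")
    lines.append("</form>")
    return "\n".join(lines)


form_html = slot_tree_to_form(slot_tree)
print(form_html)
```

Because the form is derived from the slot-tree, adding a slot or a value to the ontology changes the interface without any hand-written HTML.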

12.5 Slot-Filling Algorithm
We have to parse XML objects before the extraction. For example, the following XML document
describes a protein.
<ProteinEntry>

<protein><name> steroid receptor protein svp44</name>

</protein>

<organism><name> zebra fish </name></organism>

<keyword>DNA binding, steroid hormone receptor, zinc finger</keyword>

<summary><length>411</length>…</summary>

</ProteinEntry>

The example above is parsed into a sequence of (path, value) pairs as follows.

(ProteinEntry\protein\name, steroid receptor protein svp44)


(ProteinEntry\organism\name,zebra fish)

(ProteinEntry\keyword,DNA binding, steroid hormone receptor, zinc finger)

(ProteinEntry\summary\length,411)

Then we fill each pair into its corresponding slot, as follows.

(ProteinEntry\organism\name,zebra fish),

+ <slot name="Organism" path="ProteinEntry/Organism">


<value name="Human"/>

<value name="Plant"/>

<value name="Fish"/>

….
Organism: Fish

The following figure shows the extraction result for the example above.
<slot protein="S35333">

<slot name="Organism" values="Fish"/>

<slot name="Molecular Function" values="binding"/>

</slot>
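
The filling step for this example can be sketched in Python as below. The sketch assumes a simple case-insensitive substring match between slot values and the text of a pair, and it matches slots by exact path; the actual slot-filling algorithm may match differently. The `fill_slots` helper, the slot paths, and the value “transport” are hypothetical.

```python
def fill_slots(pairs, slots):
    """Return {slot name: [matched values]} for the given (path, text) pairs."""
    filled = {}
    for path, text in pairs:
        lowered = text.lower()
        for slot in slots:
            if path != slot["path"]:
                continue
            # Assumed matching rule: a value fills the slot if it occurs in the text.
            hits = [v for v in slot["values"] if v.lower() in lowered]
            if hits:
                filled.setdefault(slot["name"], []).extend(hits)
    return filled


# The (path, value) pairs of the protein example above.
pairs = [
    ("ProteinEntry\\protein\\name", "steroid receptor protein svp44"),
    ("ProteinEntry\\organism\\name", "zebra fish"),
    ("ProteinEntry\\keyword", "DNA binding, steroid hormone receptor, zinc finger"),
    ("ProteinEntry\\summary\\length", "411"),
]

# Hypothetical slot definitions for illustration.
slots = [
    {"name": "Organism", "path": "ProteinEntry\\organism\\name",
     "values": ["Human", "Plant", "Fish"]},
    {"name": "Molecular Function", "path": "ProteinEntry\\keyword",
     "values": ["binding", "transport"]},
]

filled = fill_slots(pairs, slots)
```

Plain substring matching is easy to follow but can over-match short values (for example, “ant” occurs inside “plant”), so a real implementation would probably need word-level matching.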

12.6 XML Retrieval
After the user submits the query to the XML retrieval system, the system retrieves the
matching results. Then an XML extraction algorithm extracts the values for each slot.
After that, a sorting function sorts the results according to the ranking strategy specified
in the query. The following figure shows the query results.
Figure 7.4 : A Query Result for Protein Information Resource

12.7 Slot-Mining Algorithm
The slot-mining algorithm described in section 3.5 is used to extract important words for each slot. The
following table shows some results of slot mining (part of the slot-tree).

Table 7.3 : A Result of Slot-Mining Algorithm for Proteins


Slot Value List
/ProteinEntry/classification/superfamily virus,unassigned,ubiquinone,tyrosine,type,tuberculosis,trypsin,translation,transforming,

transferase,transfer, transcription,transcript,topoisomerase,thioredoxin,tRNA,

synthase,sulfatase,ste,sea,rich,ribosomal,response,repressor,repeat,regulator,region,

reductase,receptor,rat,ras,ran,proteins,protein,probable,polyprotein,phosphate,phage,

permease,peptide,peptidase,oxidase,ornithine,nucleotide,nor,non,mol,min,mer,

membrane,man,long,line,ligase,lactaldehyde,kinesin,kinase,isomerase,inhibitor,inhibit,

immunoglobulin,hypothetical,hydrolyzing,hydrogenase,hydrogen,homology,

homolog,homeobox,glucose,globin,gene,gamma,form,factor,esterase,ester,erbA,

epimerase,enzyme,elegans,edu,domain,dehydrogenase,cytochrome,control,conserved,
coli,chr,cholinesterase,choline,chain,cell,cassette,carrier,binding,bind,beta,barley,

bacterium,antigen,anti,ant,alpha,alcohol,alanine,acid,RNA,NADH,NAD,Mycobacterium,

MTH,III,Escherichia,DNA,Caenorhabditis,Bacillus,ATPase,ATP,ADP
/ProteinEntry/comment ste,protein,phosphorylation,phosphorylated,phosphorylase,phospho,phosphate,non,

molecule,mol,interacts,inhibit,enzyme,covalent,cell,allosterically,allosteric,allo,This,Thi
/ProteinEntry/complex tet,phosphorylase,phospho,mer,homotetramer
/ProteinEntry/feature TMM,SIG,RRH,MAT,KIN,IMM,HOX,FOX,ERBA,ACP,ABC
/ProteinEntry/header/created_date Sep,Oct,Nov,May,Mar,Jun,Jul,Jan,Feb,Dec,Aug,Apr
/ProteinEntry/header/seq-rev_date Sep,Oct,Nov,May,Mar,Jun,Jul,Jan,Feb,Dec,Aug,Apr
/ProteinEntry/genetics/xrefs/xref/db SGD,OMIM,MIPS,MIP,GDB
/ProteinEntry/genetics/start-codon GTG
/ProteinEntry/genetics/map-position qter,pter,circular,chromosome,chr,REV
/ProteinEntry/genetics/gene/db SPDB,SGD,SCOEDB,GDB,CESP,ATSP
/ProteinEntry/function/description sulfate,ran,protein,phospho,phosphate,hydrogenase,hydrogen,glucose,formate,

form,catalyzes,alpha
/ProteinEntry/feature/status predicted,experimental,exp,atypical
/ProteinEntry/feature/feature-type site,region,product,modified,inhibitory,inhibitor,inhibit,domain,disulfide,bonds,binding,

bind,active
/ProteinEntry/keywords/keyword zinc,transmembrane,transferase,transfer,transcription,transcript,tet,ste,ribosome,

regulation,reductase,receptor,rat,ras,ran,pyridoxal,proteinase,protein,polyprotein,

photo,phosphoprotein,phospho,phosphate,oxygen,oxidoreductase,nucleus,nucleotide,

muscle,mol,mitochondrion,min,metalloprotein,metal,mer,membrane,magnesium,lyase,

loop,kinase,isomerase,iron,immunoglobulin,hydrolase,homotetramer,homeobox,

heterotetramer,heme,glycoprotein,finger,erythrocyte,end,edu,duplication,date,complex,

chromoprotein,chr,chloroplast,cell,carrier,carboxyl,carboxy,carbon,blood,biosynthesis,

binding,bind,aminoacyl,amino,amidated,allo,acid,acetylated,NAD,DNA,ATP
/ProteinEntry/feature/description zinc,trypsin,transmembrane,transforming,ste,signal,sequence,seq,response,repeat,
regulator,reductase,rat,ras,ran,pyridoxal,pter,proteinase,protein,potential,phosphorylase,
phospho,phosphate,peptide,oxidase,nucleotide,muscle,motif,molybdopterin,mol,

min,mer,membrane,mature,man,magnesium,low,loop,ligands,ligand,kinase,iron,

inhibitor,inhibit,immunoglobulin,hydrogenase,hydrogen,homology,homolog,homeobox,

heme,glycoprotein,fragment,form,finger,ferroxidase,factor,erbA,end,edu,domain,

dehydrogenase,date,cytochrome,covalent,chr,chain,cassette,carrier,carboxyl,carboxy,

carbohydrate,binding,bind,beta,axial,amino,amidated,alpha,allo,alcohol,acetylated,Thr,

Ser,Lys,Ile,His,Glu,GTP,Cys,Bowman,Birk,Asp,Asn,Arg,ATP,ADP

However, the schema of PIR is not consistent with our ontology described in section 5.2. The
inconsistency causes mapping problems between slots and paths, so the ontology designer has to spend
considerable time adjusting the automatically generated slot-tree.
12.8 Discussion
In this chapter, we have applied our methods to the case of proteins. We described the following
methods.

1. Modeling XML documents of proteins.
2. Constructing a slot-tree ontology for proteins.
3. Using the slot-filling algorithm to map XML documents into the slot-tree of proteins.
4. Using the slot-tree ontology to build a query interface for proteins.
5. Using the slot-tree ontology to help XML retrieval for proteins.
6. Mining the slot-tree ontology from XML documents of proteins.

These methods reduce the semantic gap between human and computer in the domain of
proteins. The query interface enables users to write queries easily. The slot-filling algorithm
makes it easier for the computer to understand XML documents. Finally, the mining algorithm
makes it easier for us to construct the slot-tree ontology.
Part 4 : Conclusions

13 Conclusions and Contributions


In this thesis, our goal is to design an XML retrieval system that reduces the semantic gap between human
and computer. We use the slot-tree to help the XML retrieval system achieve this goal. We have
shown that the slot-tree ontology can be used to reduce the semantic gap in XML retrieval.

In this thesis, slot-trees are used to generate a query interface that lets users write queries easily. The
query interface reduces the semantic gap on the query side. On the other hand, a slot-filling algorithm is
designed to help the computer understand XML documents. The slot-filling algorithm reduces the
semantic gap on the document side.

In order to ease the process of building a slot-tree, we propose a slot-mining algorithm that mines a
slot-tree from XML documents. The mined slot-tree still has to be modified by a domain expert to
improve its quality.

In this chapter, we will compare our approach to other approaches in section 8.1. Our
contributions are described in section 8.2. Finally, we have conclusions in section 8.3.

13.1 Comparison
We compare our approach to other approaches based on four measures. Each
measure corresponds to one of the questions listed below.

1. Can people write queries easily?
2. Can people write documents easily?
3. Can machines understand queries easily?
4. Can machines understand documents easily?

a. A comparison of knowledge representation approaches


First, we compare four knowledge representation approaches that try to resolve the
semantic gap problem: the natural language approach, the database approach, the logic based
approach and the XML based approach. Figure 8.1 shows the comparison of these approaches.
Figure 8.1 : A comparison of knowledge representation approaches

Natural language approach (NL) : Both documents and queries are written in natural
language. A typical text retrieval system adopts the natural language approach. Natural language
is easy for users to read and write. However, natural language is not easy for computers to
understand.

Database approach (DB) : Documents are encoded as a set of tables in a relational database.
It is not easy for users to read and write data in a database system directly, so a designer has to
build a user interface to help them. Database query languages such as SQL are also not easy
for end users to write. However, data in a database are easy for computers to understand.

Logic based approach (Logic) : Both logic queries and data are very easy for computers to
understand. However, people cannot write logic rules and queries easily. Besides, not all
documents can be represented as logic rules.

XML based approach (XML) : XML queries are easy for computers to understand. However,
XML queries are not easy for humans to write, and at present computers cannot easily understand
XML documents. In this thesis, we use the slot-tree ontology to help computers understand XML
documents. We also use the slot-tree ontology to build the query interface, which helps humans
write XML queries easily. The slot-tree based methods move the XML based approach toward the
easy side in Figure 8.1. The slot-tree ontology thus reduces the gap between human and
computer on XML.
b. A comparison of XML-based representations
Next, we compare three XML-based representations: XML, RDF and DAML. XML
has been described several times in this thesis. We now introduce RDF and DAML
before the comparison.

RDF is a recommendation of the W3C Semantic-Web project. It is an object-based
representation that encodes objects in XML format. Each object in RDF is called a resource
and has a unique URI. The following example shows an RDF document.

<rdf:RDF>
<rdf:Description about="Athyma_fortuna_kodairai">
<rdf:type resource="http://description.org/schema/butterfly"/>
<color>with brown wing and black head</color>
<texture>has white spots on wings</texture>
</rdf:Description>
</rdf:RDF>

We have to use the tags defined in the RDF specification to describe objects and their inheritance
relations. People have to understand the RDF tags before writing RDF documents. However, RDF is
simple and easy to use.

DAML is a representation that encodes logic rules into frame-based XML documents.
DAML extends the tags of RDF to accommodate frequently used logic predicates, such as
“disjointWith”, “cardinality”, “intersectionOf”, etc. The following shows an example
of DAML.

<rdfs:Class rdf:ID="Athyma_fortuna_kodairai">
<rdfs:subClassOf rdf:resource="#butterfly"/>
<daml:disjointWith rdf:resource="#Moth"/>
</rdfs:Class>

Writing a DAML document is not an easy job. People have to understand many DAML tags
and express the content as logic predicates. The following figure shows the comparison
between XML, RDF and DAML.
Figure 8.2 : A comparison of XML-based representations

c. A comparison of XML retrieval systems


Finally, we compare several XML retrieval systems, including XML-GL, XYZfind, Lore, and
our slot-tree system. We have described the XML-GL, XYZfind and Lore system in section
2.3. Briefly speaking, XML-GL is a graphical XML query language, XYZfind is a two level
XML search system and Lore is an XML retrieval system based on object-oriented database.

Figure 8.3 shows the comparison of these approaches. Our slot-tree based system is
labeled as “slot” in the figure. XML-GL is labeled as “X-GL” in the figure. XYZfind is labeled
as “XYZ” in the figure. Lore is labeled as “Lore” in the figure.

We found that the slot-tree approach performs well on all four questions. The slot-tree ontology
lets people write queries easily. Our approach does not require people to write XML documents
with specified tags, so people can write documents easily. XML queries are always easy
for computers to understand, and the slot-filling algorithm helps computers understand XML
documents easily.
Figure 8.3 : A comparison of XML retrieval systems

13.2 Contributions
Based on the analysis in section 8.1, we may briefly describe our contribution as follows.

“The slot-tree approach reduces the semantic gap between human and computer on XML.”

The contribution consists of the following parts.

1. “The slot-tree based query interface makes it easy for humans to write XML queries.”
2. “The slot-filling algorithm makes it easy for computers to understand XML documents.”
3. “A retrieval system based on the slot-tree is built to reduce the semantic gap on XML.”
4. “The slot-mining algorithm makes it easy for people to construct a slot-tree ontology.”

However, the slot-tree based XML retrieval method we proposed focuses on one specific
domain at a time. We have to construct a slot-tree for each domain before releasing the XML
retrieval system. The method is good at retrieving object-based XML documents such as those for
butterflies and proteins. However, we are not sure whether the method can be used to retrieve XML
collections that are not object-based. Besides, we have to extend the method to build an XML
retrieval system for more than one domain.
13.3 Discussion and Future Work
In this thesis, we use the slot-tree ontology and the slot-filling algorithm to reduce the semantic
gap of XML. The slot-tree is used to generate a query interface that reduces the semantic gap on
the query side. The slot-filling algorithm is used to map XML documents into the slot-tree ontology
in order to reduce the semantic gap on the document side. Our XML retrieval system works well on
object-based XML collections, such as the collection for butterflies in chapter 6 and the
collection for proteins in chapter 7.

However, not all XML documents are used to describe objects. Some XML documents are
used to encode categories, scripts and other structures. How to integrate these structures into an
XML retrieval system is a good question for our future research.

Another question is the integration of XML collections from several domains. For example,
how to integrate XML documents that describe genes, proteins and biological species into one
XML retrieval system is a good case to study. The integration of several domains needs
further research.

Finally, a scalable XML retrieval system should be useful on a web with many XML
documents. The XML retrieval system should be used to retrieve a large collection of XML
documents in a variety of domains. We will try to build such an XML retrieval system in the
future.
Reference

[Aguilera00] Aguilera, V. and Cluet, S. and Veltri, P. and Vodislav, D. and Wattez,F. (2000) “Querying
XML Documents in Xyleme” in ACM SIGIR 2000 Workshop On XML and Information Retrieval.
http://www.haifa.il.ibm.com/sigir00-xml/final-papers/xyleme/XylemeQuery/XylemeQuery.html

[Albano00] Albano, A. and Colazzo, D. and Ghelli, G. and Manghi, P. and Sartiani, C. (2000) “A Type
System for Querying XML Documents” in ACM SIGIR 2000 Workshop On XML and Information
Retrieval.
http://www.haifa.il.ibm.com/sigir00-xml/final-papers/Sartiani/athens.html

[Allen94] Allen, J.F. “Natural Language Understanding,” Benjamin Cummings, 1987, Second Edition,
1994.

[Alshawi92] Hiyan Alshawi, editor. The Core Language Engine. MIT Press, Cambridge, Massachusetts,
1992.

[Baeza00] Baeza-Yates, R. and Navarro, G. (2000) “XQL and Proximal Nodes,” in ACM SIGIR 2000
Workshop On XML and Information Retrieval.
http://www.haifa.il.ibm.com/sigir00-xml/final-papers/RBaetza/att1.htm

[Bobrow77] Bobrow, D. G. and Winograd, T. (1977). “An overview of KRL, a knowledge
representation language.” Cognitive Science, 1(1), 3-46.

[Bollacker98] Bollacker, K.D. and Lawrence, S. and Giles, C.L. (1998) “CiteSeer: An Autonomous
Web Agent for Automatic Retrieval and Identification of Interesting Publications”, 2nd International
ACM Conference on Autonomous Agents, pp. 116-123, ACM Press, May, 1998.

[Brachman85a] Brachman, R. and Levesque, H. (1985). “Readings in Knowledge Representation”,
Stanford: Morgan Kaufmann

[Brachman85b] Brachman, F.J., and Schmolze, J.G. (1985) “An overview of the KL-ONE knowledge
representation system.” Cognitive Sci. 9.2 (Apr. 1985) 171-216.

[Brin98] Brin, S. and Page,L.(1998) "The Anatomy of a Large-Scale Hypertextual Web Search Engine"
in Proceedings of World-Wide Web '98 (WWW7), April 1998.
[Carmel00] Carmel, D. and Maarek, Y. and Soffer, A. (2000) “Workshop Summary of XML and
Information Retrieval: a SIGIR 2000 Workshop” IBM Research Lab in Haifa.
http://www.haifa.il.ibm.com/sigir00-xml/WorkshopSummary.html

[Chen00] Chen, B.C. (2000) “Content-Based Image Retrieval of Butterflies”, Master Thesis. NTU,
Taiwan, June, 2000.

[Chien97] Chien, L.F. (1997) "PAT-Tree Based Keyword Extraction for Chinese Information Retrieval"
ACM SIGIR 1997.

[Cooper01] Cooper, B.F. and Sample, N. and Franklin,M.J. and Hjaltason,G.R. and Shadmon, M.
(2001) “A Fast Index for Semistructured Data” Proc. of 27th Intl. Conf. on Very Large Data Bases,
August 2001. http://www.rightorder.com/technology/XML.pdf

[DC99] “Dublin Core Metadata Element Set, Version 1.1: Reference Description” –
http://dublincore.org/documents/dces/

[DeJong82] DeJong; G.. (1982) “An Overview of the FRUMP System.” In Strategies for Natural
Language Processing, W.G.Lehnert & M.H.Ringle (Eds), Lawrence Erlbaum Associates, 1982, 149-
176.

[Dyer83] Dyer, M.G. (1983) "In-Depth Understanding - A computer model of integrated processing for
Narrative Comprehension, " MIT press, 1983.

[Egnor00] Egnor,D. and Lord,R. (2000) “XYZfind: Searching in Context with XML” in ACM SIGIR
2000 Workshop On XML and Information Retrieval.
http://www.haifa.il.ibm.com/sigir00-xml/final-papers/Egnor/index.html

[Fuhr00] Fuhr, N. (2000) “XIRQL An Extension of XQL for Information Retrieval” in ACM SIGIR
2000 Workshop On XML and Information Retrieval.
http://www.haifa.il.ibm.com/sigir00-xml/final-papers/KaiGross/sigir00.html

[Goldman97] Goldman, R. and Widom, J. (1997) “DataGuides: Enabling query formulation and
optimization in semistructured databases.” In Proc. Intl. Conf. on Very Large Data Bases, 1997.

[Green63] Green, B.F., Wolf, A.K., Chomsky, C., and Laughery, K. (1963). “Baseball : An automatic
question answerer.” In Feigenbaum and Feldman (Eds.), Computer and Thought. McGraw-Hill, New
York, 207-233.
[Grosz86] Grosz, B.J., Sparck-Jones, K., and Webber, B.L., eds. (1986) "Readings in Natural Language
Processing", Morgan Kaufmann Publishers, Los Altos, CA, 1986
[Han01] Han, J. and Kamber, M. (2001) “Data Mining - Concepts and Techniques”, Morgan Kaufmann
Publisher. 2001.

[Hayashi00] Hayashi, Y. and Tomita, J. and Kikui,G. (2000) “Searching Text-rich XML Documents
with Relevance Ranking” in ACM SIGIR 2000 Workshop On XML and Information Retrieval.
http://www.haifa.il.ibm.com/sigir00-xml/final-papers/Hayashi/hayashi.html

[Heb00] Heb, M. and Monch, C. and Drobnik, O. (2000) "Quest - Querying Specialized Collections on
the Web", J. Borbinha and T.Baker (Eds.) : ECDL 2000, LNCS 1923, pp. 117-126, 2000.

[Hobbs96] Hobbs, J. and Appelt, D. and Bear, J. and Israel, D. and Kameyama, M. and Stickel, M. and
Tyson, M. (1996) “FASTUS: A Cascaded Finite-State Transducer for Extracting Information from
Natural-Language Text.” in Finite State Devices for Natural Language Processing, MIT Press, 1996

[Hsu98] Hsu, C.N. and Dung, M.T. (1998) “Generating finite-state transducers for semistructured data
extraction from the web,” Information Systems, 23(8):521-538, Special Issue on Semistructured
Data, 1998.

[Ide00] Ide, N. (2000) “Searching Annotated Language Resources in XML: A Statement of the
Problem” in ACM SIGIR 2000 Workshop On XML and Information Retrieval.
http://www.haifa.il.ibm.com/sigir00-xml/final-papers/Ide/SIGIR-XML.html

[Ifikes85] Ifikes, R. and Kehler, J. (1985) “The role of frame-based representation in reasoning.”
Communications of the ACM, Volume 28 Number 9, September 1985.

[Kehler84] T.P. Kehler and G.D. Clemenson. KEE: The Knowledge Engineering Environment for
Industry. Systems And Software, 3(1):212-224, January 1984.

[Kleinberg98] Kleinberg, J.M. (1998) "Authoritative Sources in a Hyperlinked Environment" in
Proceedings of ACM-SIAM Symposium on Discrete Algorithms, 668-677, January 1998.
http://www.cs.cornell.edu/home/kleinber/auth.ps

[Kushmerick00] Kushmerick, N. (2000) “Wrapper induction: Efficiency and expressiveness” Artificial
Intelligence J. 118(1-2):15-68 (special issue on Intelligent Internet Systems).

[Lewin99] I. Lewin, R. Becket, J. Boye, D. Carter, M. Rayner, and M. Wiren. (1999)
Language processing for spoken dialogue systems: is shallow parsing enough? In Accessing
Information in Spoken Audio: Proceedings of ESCA ETRW Workshop, Cambridge, 19 & 20th April
1999, pages 37-42, 1999.

[Loral97] S. Abiteboul, D. Quass, J. McHugh, J. Widom, and J. Wiener. “The Lorel Query Language for
Semistructured Data.” International Journal on Digital Libraries, 1(1):68-88, April 1997.

[Luk00] Luk,R. and Chan,A. and Dillon,T. and Leong, H.V. (2000) “A Survey of Search Engines for
XML Documents” in ACM SIGIR 2000 Workshop On XML and Information Retrieval.
http://www.haifa.il.ibm.com/sigir00-xml/final-papers/Luk/XMLSUR.htm

[McHugh97] McHugh, J. and Widom, J. and Wiener, J. and Abiteboul, S. and Quass, D. (1997) “The
Lorel Query Language for Semistructured Data, ” - International Journal on Digital Libraries,
1(1):68-88, 1997.

[Minsky75] Minsky, M. (1975). “A framework for representing knowledge.” Available in Readings in
Knowledge Representation, Brachman, R.J. & Levesque, H.J., Eds. (1985), Morgan Kaufmann.

[Muslea99] Muslea, I. (1999) “Extraction Patterns for Information Extraction Tasks: A Survey,” In AAAI-99
Workshop on Machine Learning for Information Extraction, 1999.

[OIL00] “An informal description of Standard OIL and Instance OIL 28 November 2000”
http://www.ontoknowledge.org/oil/downl/oil-whitepaper.pdf

[Page98] Page, L. and Brin, S. and Motwani, R. and Winograd, T. “The PageRank citation ranking:
Bringing order to the Web.” Unpublished manuscript, online at http://google.stanford.edu/~backrub/
pageranksub.ps, 1998.

[Quillian66] Quillian, R. "Semantic memory," Cambridge, Mass. : Bolt, Beranek and Newman, 1966.

[RDF99] Resource Description Framework (RDF) Model and Syntax Specification W3C
Recommendation 22 February 1999 http://www.w3.org/TR/1999/REC-rdf-syntax-19990222/

[RDFS00] Resource Description Framework (RDF) Schema Specification 1.0 W3C Candidate
Recommendation 27 March 2000 http://www.w3.org/TR/2000/CR-rdf-schema-20000327/

[Salton88] Salton, G. and Buckley, C. “Term-Weighting Approaches in Automatic Text Retrieval,”
Information Processing and Management, 24(5), 513-23, 1988.

[Schank74] Schank, R.C. and Rieger III, C.J. (1974) “Inference and the Computer Understanding of
Natural Language,” Artificial Intelligence 5(4), 1974, 373-412.

[Schank77] Schank, R.C. and Abelson, R. (1977). “Scripts, Plans, Goals, and Understanding.”
Hillsdale, NJ: Erlbaum Assoc.

[Schank80] Schank, R.C. and Kolodner, J.L. and DeJong, G. (1980) “Conceptual Information
Retrieval.” SIGIR 1980: 94-116.

[Schmidt00] Schmidt, A. et al. (2000) “Efficient Relational Storage and Retrieval of XML Documents”,
In proceedings of International Workshop on the Web and Databases (In conjunction with ACM
SIGMOD), pages 47-52, Dallas, TX, USA, May 2000.
http://citeseer.nj.nec.com/schmidt00efficient.html

[Schlieder00] Schlieder, T. and Meuss, H. (2000) “Result ranking for structured queries against XML
documents.” In DELOS Workshop on Information Seeking, Searching and Querying in Digital
Libraries, Zurich, Switzerland, December 2000.

[Schlieder01] Schlieder, T. (2001) “Similarity search in XML data using cost-based query
transformations.” In Proceedings of the Fourth International Workshop on the Web and Databases
(WebDB'01), Santa Barbara, USA, May 2001.

[Schlieder00] Schlieder, T. and Naumann, F. (2000) “Approximate Tree Embedding for Querying XML
Data” in ACM SIGIR 2000 Workshop On XML and Information Retrieval.
http://www.haifa.il.ibm.com/sigir00-xml/final-papers/Approximate.htm

[Stefik79] Stefik, M.J. (1979) “An examination of a frame-structured representation system.” In
Proceedings of the 6th International Joint Conference on Artificial Intelligence (Tokyo, Japan, Aug.).
Kaufmann, Los Altos, Calif., 1979, pp. 845-852.

[Stefik83] Stefik, M., Bobrow, D. G., Mittal, S., and Conway, L. Knowledge Programming in Loops:
Report on an Experimental Course. AI Magazine, 4:3, pp. 3-13, Fall 1983. (Reprinted in Readings
From the AI Magazine, Volumes 1-5, 1980-1985, pp. 493-503, 1988.)

[Tu99] Tu, H. C. (1999) “Interactive Web IR: Focalization Model, Effectiveness Measures, and
Experiments”, Doctoral Dissertation, NTU, Taiwan, June, 1999.

[Turing50] Turing, A. M. (1950) “Computing machinery and intelligence.” Mind, 59:433-460.

[UDDI00] “UDDI Technical White Paper” September 6, 2000
http://www.uddi.org/pubs/Iru_UDDI_Technical_White_Paper.PDF

[van Zwol2002] van Zwol, R. (2002). “Modelling and searching web-based document collections.”
PhD thesis, Centre for Telematics and Information Technology (CTIT), Enschede, the Netherlands.
ISBN: 90-365-1721-4; ISSN: 1381-3616 No. 02-40 (CTIT Ph.D. thesis series).

[Weizenbaum66] Weizenbaum, J. (1966) “ELIZA,” Communications of the ACM 9:36-45.

[Widom99] Widom, J. (1999) “Data Management for XML - Research Directions”, IEEE Data
Engineering Bulletin, Special Issue on XML, 22(3):44-52, September 1999.
http://www-db.stanford.edu/~widom/xml-whitepaper.htm

[Wood75] Woods, William A. “What's in a Link: Foundations for Semantic Networks” Available in
Readings in Knowledge Representation, Brachman, R.J. & Levesque, H.J., Eds. (1985), Morgan
Kaufmann.

[XML98] “Extensible Markup Language (XML) 1.0” W3C Recommendation 10-February-1998
http://www.w3.org/TR/1998/REC-xml-19980210

[XML-QL98] “XML-QL: A Query Language for XML,” Submission to the World Wide Web
Consortium 19-August-1998 http://www.w3.org/TR/1998/NOTE-xml-ql-19980819/

[XML-GL99] Stefano Ceri, Sara Comai, Ernesto Damiani , Piero Fraternali, Stefano Paraboschi,
Letizia Tanca “XML-GL: a Graphical Language for Querying and Restructuring XML Documents,”
in The Eighth International World Wide Web Conference (WWW8), Toronto Convention Centre,
Toronto, Canada May 11-14, 1999.

[XMLNS99] “Namespaces in XML” World Wide Web Consortium 14-January-1999
http://www.w3.org/TR/REC-xml-names/

[XPATH99] “XML Path Language (XPath) Version 1.0”, W3C Recommendation 16 November 1999,
http://www.w3.org/TR/xpath

[XQuery01] “XQuery 1.0 and XPath 2.0 Data Model” W3C Working Draft 20 December 2001
http://www.w3.org/TR/2001/WD-query-datamodel-20011220/

[XTM01] “XML Topic Maps (XTM) 1.0” http://www.topicmaps.org/xtm/1.0/


Appendix 1 : A Museum of Butterflies in Taiwan

a. XML example
- <butterfly>

<cname>拉拉山三線蝶</cname>

<nickname />

- <present_SN_record>

<present_SN>Athyma_fortuna_kodairai</present_SN>

<present_SN_author>Sonan</present_SN_author>

<present_SN_year>1938</present_SN_year>

</present_SN_record>

- <classification>

<family>Nymphalidae</family>

<cfamily>蛺蝶科</cfamily>

<genus>Athyma</genus>

<species>fortuna</species>

<sub_species>kodairai</sub_species>

</classification>

<hostplant>忍冬科 (Caprifoliaceae) 的松田氏紅子仔 (Viburnum luzonicum var. matsudai)。</hostplant>

<honeyplant>成蝶喜吸食腐熟水果汁液或樹幹流出汁液。</honeyplant>

- <geographic>

<taiwan>分布於台灣中北部地區,海拔 1000-2000 公尺間山區均有分布。</taiwan>

<global>中國大陸中部有原名亞種分布。</global>

</geographic>

- <life_stage>

- <egg>

<feature>底部扁平之高饅頭形,表面有明顯六角形格狀花紋,於六角形頂點處,各著生一細長刺毛。</feature>

<color>淡綠。</color>

<size>直徑約為 1.1-1.3mm。</size>

<characteristic />

<habitate />

<predator>各類卵寄生蜂、蜱等節肢動物。</predator>

<days_of_growth>卵期約為 5-6 天左右。</days_of_growth>

</egg>

- <larva>

<feature>終齡幼蟲體呈長圓筒狀,頭部密生硬棘,各體節背方及體側皆長有具星狀刺之突起。</feature>

<color>終齡幼蟲頭部褐色,表面密生棘狀突起。體呈翠綠色,各體節背方及體側突起基部為藍色,星狀刺為黃綠色。</color>

<size>終齡幼蟲體長約為 33-41 mm。</size>

<characteristic />

<habitate />

<predator>寄生蜂、寄生蠅、小繭蜂、椿象、蜥蜴及鳥類等。</predator>

<days_of_growth>冬季以二齡幼蟲越冬,幼蟲期長達半年以上。</days_of_growth>

<defense>初齡幼蟲停棲於寄主葉脈,攝食葉脈兩側葉肉,二齡幼蟲會將寄主植物葉片咬成小塊並吐絲將其此碎片及糞便黏於葉脈造一蟲巢,越冬幼蟲即躲藏於蟲巢當中,由於幼蟲褐色之體色與蟲巢上乾枯之小葉片或糞便色澤相近,或許可混淆天敵耳目。</defense>

</larva>

- <pupa>

<feature>蛹體為垂蛹,中胸背方隆起,腹節末端有一柄狀懸絲器。頭部前端有一對大型明顯之彎曲角狀突出物,腹節背方均有小型鋸齒狀脊起。</feature>

<color>蛹體底色呈黃褐色,中、後胸背方有銀色斑塊,體側氣門黑褐色。</color>

<size>蛹體長度約為 22-27mm。</size>

<characteristic />

<habitate />

<predator>蛹寄生蜂、胡蜂、姬蜂及各種真菌等。</predator>

<days_of_growth>蛹期約為 15-20 天,視溫度而定。</days_of_growth>

<defense>老熟幼蟲化蛹於隱蔽之植物叢間,藉以躲避天敵。</defense>

</pupa>

- <adult>

<feature>成蟲前翅外觀大致呈現三角形,翅形稍微橫長。後翅卵圓形,外觀接近三角形。雌蝶翅型較為寬圓。</feature>

<color>雄蝶前、後翅表底色為黑色,前翅中室內有一枚長形白斑,各翅室中橫線部位有一大型白色橢圓斑,前翅端有兩枚小型白斑。後翅有兩條明顯白色橫帶紋,前後翅緣皆有不明顯小白紋。雌蟲翅表色澤花紋與雄蟲相似。</color>

<size>本種為中型蝶種,展翅約為 50-60mm。</size>

<characteristic>前翅中室內有一枚長形白斑。</characteristic>

<habitate>台灣中部以北山區均有分布。</habitate>

<predator>蜘蛛、螳螂、青蛙、蜻蜓、鳥類及蜥蜴等捕食性天敵。</predator>

<days_of_growth>前翅中室內有一枚長形白斑。</days_of_growth>

<defense>成蟲飛行快速,外觀與其他多種三線蝶類似,為莫氏擬態的一種。</defense>

<season>夏季較易見到成蟲活動。</season>

<behavior>成蝶喜吸食腐熟水果汁液或樹幹發酵流出之樹液,成蟲活動於開闊林道,常見成蟲於開闊山徑兩旁樹上佔據地盤驅趕附近飛過蝴蝶,亦可見其活動於溪邊開闊處,吸食腐果或潮濕地面水分。</behavior>

</adult>

</life_stage>

<update>2000/11/7</update>

<footnote />

</butterfly>
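The size and duration fields in the record above are free text, so any numeric ordering over them needs the numbers pulled out first. The following is a minimal sketch of one way to do that, not the procedure used by the system described in this thesis. The field labels in `FIELDS`, the `UNITS` table, and the function names are hypothetical and introduced only for illustration; the texts themselves are copied from the record.

```python
import re

# Field texts copied from the butterfly record above; the dictionary keys
# are hypothetical labels, not paths used by the original system.
FIELDS = {
    "egg/size": "直徑約為 1.1-1.3mm。",
    "larva/size": "終齡幼蟲體長約為 33-41 mm。",
    "adult/size": "本種為中型蝶種,展翅約為 50-60mm。",
    "pupa/days_of_growth": "蛹期約為 15-20 天,視溫度而定。",
}

# Hypothetical unit table: unit spelling -> factor to a base unit
# (millimetres for lengths, days for durations).
UNITS = {"mm": 1.0, "cm": 10.0, "天": 1.0}

# A range written as "<low>-<high><unit>", with optional spaces.
RANGE = re.compile(r"(\d+(?:\.\d+)?)\s*-\s*(\d+(?:\.\d+)?)\s*(mm|cm|天)")


def numeric_range(text):
    """Return (low, high) in base units, or None if no range is found."""
    match = RANGE.search(text)
    if match is None:
        return None
    low, high, unit = match.groups()
    factor = UNITS[unit]
    return (float(low) * factor, float(high) * factor)


def midpoint(text):
    """Sort key: the middle of the extracted range."""
    low, high = numeric_range(text)
    return (low + high) / 2


sizes = ["egg/size", "larva/size", "adult/size"]
ordered = sorted(sizes, key=lambda name: midpoint(FIELDS[name]), reverse=True)
print(ordered)
```

Sorting on the midpoint of the range is only one possible choice; sorting on the lower or upper bound would work equally well for records like this one.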

b. Domain Knowledge
- <frame language="big5" database="xir" showPath="//butterfly//cname//">

- <butterfly>

- <family slot="種類" path="//butterfly//cfamily//" menu="yes">

<v value="弄蝶" keys="弄蝶=Hesperiidae" />

<v value="小灰蝶" keys="小灰蝶=Lycaenidae" />

<v value="斑蝶" keys="斑蝶=Danaidae" />

<v value="粉蝶" keys="粉蝶=Pieridae" />

<v value="鳳蝶" keys="鳳蝶=Papilionidae" />

<v value="蛇目蝶" keys="蛇目蝶=Satyridae" />

<v value="蛺蝶" keys="蛺蝶=Nymphalidae" />

<v value="小灰蛺蝶" keys="小灰蛺蝶=Riodinidae" />

<v value="長鬚蝶" keys="長鬚蝶=Libytheidae" />

</family>

- <adult slot="蝴蝶成蟲" keys="Adult" path="//butterfly//adult//">

- <shape slot="蝴蝶的形狀" keys="Adult:Shape" path="//butterfly//adult//shape//" menu="yes">

<v value="類似燕尾" keys="Swallowtail+突出" image="swallowtail.gif"/>

<v value="細小尾突" keys="little_tail" image="little_tail.gif" />

<v value="翅緣破裂" keys="broken+破裂" image="broken.gif" />

<v value="翅緣波浪狀" keys="鋸齒狀+wave" image="wave.gif" />

<v value="似蛾狀" keys="Moth+蛾" image="moth.gif" />

<v value="似枯葉狀" keys="Leaf+枯葉" image="leaf.gif" />

</shape>

- <color slot="蝴蝶的顏色" keys="Adult:Color" path="//butterfly//adult//color//" menu="yes">

<v value="大致黑色" keys="Black" />

<v value="大致深棕色" keys="Dark_Wood" />

<v value="大致淺棕色" keys="Light_Wood" />

<v value="大致橘紅色" keys="Orange_Red" />

<v value="大致橘黃色" keys="Orange_Yellow" />

<v value="大致黃色" keys="Yellow" />


<v value="大致綠色" keys="Green" />

<v value="大致藍色" keys="Blue" />

<v value="大致紫色" keys="Purple" />

<v value="大致灰色" keys="Gray" />

<v value="大致白色" keys="White" />

<v value="黑白相間" keys="Black_White" />

<v value="黑黃相間" keys="Black_Yellow" />

<v value="黑橘相間" keys="Black_Orange" />

<v value="黑藍相間" keys="Black_Blue" />

<v value="黑紅相間" keys="Black_Red" />

<v value="棕白相間" keys="Wood_White" />

<v value="超過三種顏色" keys="many" />

</color>

- <texture slot="蝴蝶的特徵" keys="Adult:Texture" path="//butterfly//adult//texture//" menu="yes">

<v value="沒有花紋" keys="Mono+無..花紋" image="mono.gif" />

<v value="垂直色帶" keys="v_Band" image="v_band.gif" />

<v value="水平色帶" keys="h_Band" image="h_band.gif" />

<v value="一條細線" keys="1_Line" image="1_line.gif" />

<v value="多條細線" keys="lines" image="lines.gif" />

<v value="翅脈明顯" keys="Vein+翅脈" image="vein.gif" />

<v value="格子斑紋" keys="Grid+格狀" image="grid.gif" />

<v value="眼睛狀點" keys="Eyes+圓斑+眼" image="eyes.gif" />

<v value="少數斑點" keys="Spot" image="spot.gif" />

<v value="一些斑點" keys="Some_Spots" image="some_spots.gif" />

<v value="滿佈著斑點" keys="Spots" image="spots.gif" />

<v value="複雜木紋" keys="Complex_Wood" image="complex_wood_t.gif" />

<v value="翅緣有花紋" keys="Edge" image="edge.gif" />

<v value="有零星小點" keys="Stars" image="stars.gif" />

<v value="前翅前半異色" keys="Fore_Half" image="fore_half.gif" />

</texture>

</adult>

- <pupa slot="蝴蝶的蛹" keys="Pupa" path="//butterfly//pupa//">

- <s slot="蛹的形狀" path="//butterfly//pupa//" menu="yes">

<v value="突起" keys="Skin_Stick" />

<v value="環紋" keys="Ring_Texture" />

<v value="粗糙" keys="Rough_Skin" />

<v value="光滑" keys="Smooth_Skin" />


<v value="橢圓形" keys="Ellipse_shape" />

</s>

- <s slot="蛹的顏色" keys="Pupa:Color" path="//butterfly//pupa//color//" menu="yes">

<v value="翠綠色" keys="Green=翠綠色" />

<v value="黃綠色" keys="Light_Green=淡綠色" />

<v value="褐色" keys="Wood" />

<v value="灰色" keys="Gray" />

<v value="白色" keys="White" />

<v value="金黃色" keys="Gold" />

</s>

- <s slot="蛹的特徵" keys="Pupa:Feature" path="//butterfly//pupa//feature//" menu="yes">

<v value="帶蛹" keys="Laying_Pupa" image="pupa_bag.jpg" />

<v value="垂蛹" keys="Hanging_Pupa" image="pupa_hang.jpg" />

</s>

</pupa>

- <egg slot="蝴蝶的卵" keys="Egg" path="//butterfly//egg//">

- <s slot="底部"> <v value="扁平" /> </s>

- <s slot="表面">

<s slot="縱脊" />

<s slot="突出物" />

</s>

- <s slot="卵的形狀" keys="Egg:Shape" path="//butterfly//egg//feature//" menu="yes">

<v value="圓球形" keys="Ball" image="egg_ball.jpg" />

<v value="半球形" keys="饅頭形+Half_Ball" image="egg_half_ball.jpg" />

<v value="扁平盤狀" keys="Plate" image="egg_plate.jpg" />

<v value="梭子形" keys="酒瓶形+瓶形+Shuttle" image="egg_shuttle.jpg" />

<v value="砲彈形" keys="Bullet" image="egg_bullet.jpg" />

</s>

- <s slot="卵的顏色" keys="Egg:Color" path="//butterfly//egg//color//" menu="yes">

<v value="乳白" keys="Milk_White" />

<v value="淡黃" keys="Light_Yellow" />

<v value="棕褐色" keys="Wood+棕+褐" />

<v value="淡綠" keys="Light_Green" />

<v value="橙黃" keys="Yellow" />

<v value="光澤" keys="Shining" />

</s>

- <s slot="卵的特徵" keys="Egg:Texture" path="//butterfly//egg//feature//" menu="yes">

<v value="表面光滑" keys="Smooth+光滑" />


<v value="六角形花紋" keys="Hexagon_Texture+六角形" />

<v value="有縱脊" keys="Ridge+縱脊" />

<v value="菱形花紋" keys="Rhombus_Texture+菱形" />

<v value="格狀花紋" keys="Square_Texture" />

</s>

</egg>

- <larva slot="蝴蝶的幼蟲" keys="Larva+毛毛蟲" path="//butterfly//larva//">

<s slot="軀體" keys="蟲體" />

<s slot="頭部" />

<s slot="體節" />

<s slot="表面" />

- <s slot="體側">

<s slot="氣門" />

</s>

<s slot="肛板" />

<s slot="體長" />

- <s slot="幼蟲的形狀" keys="Larva:shape" path="//butterfly//larva//feature//" menu="yes">

<v value="細長" keys="Thin" />

<v value="扁平" keys="Like_Plate" />

<v value="紡棰形" keys="Like_Shuttle" />

<v value="鳥糞狀" keys="Like_Bird's_Shit" />

</s>

- <s slot="幼蟲的顏色" keys="Larva:Color" path="//butterfly//larva//color//" menu="yes">

<v value="翠綠色" keys="Green+綠色" />

<v value="黃綠色" keys="Yellow_Green" />

<v value="淡黃色" keys="Light_Yellow" />

<v value="灰色" keys="Gray" />

<v value="白色" keys="White" />

<v value="黑色" keys="Black" />

<v value="褐色" keys="Brown" />

</s>

- <s slot="幼蟲的特徵" keys="Larva:Texture" path="//butterfly/life_stage/larva/characteristic" menu="yes">

<v value="短毛" keys="Short_Hair" />

<v value="長毛" keys="Long_Hair" />

<v value="細毛" keys="Thin_Hair" />

<v value="肉突" keys="Skin_Stick+突起" />

<v value="橫紋" keys="Line_Texture" />


<v value="圈紋" keys="Ring_Texture+圈狀眼紋+環紋" />

</s>

</larva>

- <s slot="台灣分布" keys="Taiwan" path="//butterfly//geographic//taiwan//" menu="yes">

<v value="台灣全島" keys="Whole_Taiwan" />

<v value="台灣北部" keys="North_Taiwan+北" />

<v value="台灣東部" keys="East_Taiwan+東" />

<v value="台灣南部" keys="South_Taiwan+南" />

<v value="恆春半島" keys="HunChan" />

<v value="綠島" keys="GreenIsland" />

<v value="蘭嶼" keys="LanYu" />

</s>

- <s slot="全球分布" path="//butterfly//geographic//global//" menu="yes">

<v value="東亞" keys="East_Asia+朝鮮半島+韓國+日本" />

<v value="東南亞" keys="South_Asia+中南半島+印尼+泰國+馬來+緬甸+菲律賓+婆羅洲" />

<v value="中國大陸" keys="China" />

<v value="喜馬拉亞地區" keys="Himalayas+喜馬拉亞" />

<v value="中亞地區" keys="Middle_Asia+中亞" />

<v value="西伯利亞" keys="Siberia" />

<v value="新幾內亞" keys="New_Guinea" />

<v value="澳洲" keys="Australia" />

<v value="歐洲" keys="Europe+歐" />

<v value="美洲" keys="America+北美+中美+南美" />

<v value="非洲" keys="Africa" />

</s>

- <s slot="體型大小" keys="Size" path="//butterfly//adult//size//" menu="yes">

<v value="小型" keys="Small_Size+小" />

<v value="中型" keys="Middle_Size+中" />

<v value="大型" keys="Large_Size+大" />

</s>

- <s slot="棲息地" keys="棲息地=Habitate" path="//butterfly//adult//habitate//" menu="yes">

<v value="平地" keys="平地=Level_Ground" />

<v value="低海拔山區" keys="Low_Mountain+低海拔" />

<v value="中海拔山區" keys="Middle_Mountain+中海拔" />

<v value="高海拔山區" keys="High_Mountain+高海拔" />

</s>

- <s slot="宿主植物" keys="Hostplant+寄主植物" path="//butterfly//hostplant//" menu="yes">

<v value="豆科" keys="Leguminosae" />


<v value="大戟科" keys="Euphorbiaceae" />

<v value="白花菜科" keys="Capparidaceae" />

<v value="蘇鐵科" keys="Cycadaceae" />

<v value="蕁麻科" keys="Urticaceae" />

<v value="禾本科" keys="Gramineae" />

<v value="殼斗科" keys="Fagaceae" />

<v value="芸香科" keys="Rutaceae" />

<v value="榆科" keys="Ulmaceae" />

<v value="樟科" keys="Lauraceae" />

<v value="木犀科" keys="Oleaceae" />

<v value="桑寄生科" keys="Loranthaceae" />

<v value="肉食性" keys="Carnivore" />

</s>

- <s slot="飲食習慣" keys="Eat Food" path="//butterfly//adult//behavior//;//butterfly//honeyplant//" menu="yes">

<v value="食花蜜" keys="Nectar+蜜" />

<v value="食腐汁" keys="Juice+腐+果汁+汁液+液" />

</s>

- <s slot="飛行速度" keys="Fly Speed" path="//butterfly//adult//behavior//;//butterfly//adult//defense//" menu="yes">

<v value="飛行迅速" keys="速+快" />

<v value="飛行緩慢" keys="緩+慢" />

</s>

- <s slot="禦敵方式" path="//butterfly//defense//" menu="yes">

<v value="有毒" keys="毒" />

<v value="擬態+保護色" keys="擬態+欺騙+環境融合+混淆" />

<v value="有臭味" keys="臭" />

</s>

- <s slot="現存數量" path="//butterfly//footnote//" menu="yes">

<v value="已絕種" keys="已滅絕" />

<v value="瀕臨絕種" keys="瀕臨滅絕" />

<v value="罕見稀少" keys="稀少+罕見" />

<v value="普通常見" keys="常見" />

</s>

<s slot="棲息地高度" path="//butterfly//geographic//taiwan//text()$meter" sortable="yes" />

<s slot="蝴蝶的大小" path="//butterfly//life_stage//adult//size//text()$meter" sortable="yes" />

<s slot="蝴蝶的壽命" path="//butterfly//life_stage//adult//days_of_growth//text()$day" sortable="yes" />

</butterfly>

</frame>
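To make the relationship between the frame above and the XML record in part (a) concrete, the sketch below fills one menu slot from one document. It is an illustration under stated assumptions, not the Slot-Filling Algorithm of this thesis. It assumes that a path such as `//butterfly//egg//feature//` means "descendant steps through these tags", that `+` and `=` in `keys` both separate alternative keywords, and that a value is filled when any of its keywords occurs as a substring of the text. The helper names `path_tags`, `texts_at`, `keywords`, and `fill_slot` are hypothetical.

```python
import xml.etree.ElementTree as ET

# A trimmed record and one trimmed slot definition, copied from this appendix.
DOC = """<butterfly>
  <cname>拉拉山三線蝶</cname>
  <life_stage>
    <egg>
      <feature>底部扁平之高饅頭形,表面有明顯六角形格狀花紋。</feature>
      <color>淡綠。</color>
    </egg>
  </life_stage>
</butterfly>"""

SLOT = """<s slot="卵的形狀" keys="Egg:Shape" path="//butterfly//egg//feature//" menu="yes">
  <v value="圓球形" keys="Ball" />
  <v value="半球形" keys="饅頭形+Half_Ball" />
  <v value="扁平盤狀" keys="Plate" />
</s>"""


def path_tags(path):
    """Split a frame path such as '//butterfly//egg//feature//' into tag names."""
    return [part for part in path.split("/") if part]


def texts_at(root, tags):
    """Collect the text under every element reached by following `tags`
    as successive descendant steps (the assumed meaning of '//a//b//')."""
    if not tags or root.tag != tags[0]:
        return []
    nodes = [root]
    for tag in tags[1:]:
        nodes = [d for n in nodes for d in n.iter(tag) if d is not n]
    return ["".join(n.itertext()) for n in nodes]


def keywords(value_node):
    """Assumption: '+' and '=' both separate alternative keywords, and the
    displayed value itself also counts as a keyword."""
    raw = value_node.get("keys", "").replace("=", "+").split("+")
    return [k for k in [value_node.get("value")] + raw if k]


def fill_slot(doc_root, slot):
    """Return the menu values whose keywords occur in the text at the slot path."""
    text = " ".join(texts_at(doc_root, path_tags(slot.get("path"))))
    return [v.get("value") for v in slot.findall("v")
            if any(k in text for k in keywords(v))]


filled = fill_slot(ET.fromstring(DOC), ET.fromstring(SLOT))
print(filled)
```

Plain substring matching is the simplest possible matcher. It is used here only to show how a `<v>` entry links a menu value to words in the document text.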

Appendix 2 : Protein Information Resource

a. XML example
- <ProteinEntry id="S35333">

- <header>

<uid>S35333</uid>

<accession>S35333</accession>

<created_date>03-Feb-1994</created_date>

<seq-rev_date>03-Feb-1994</seq-rev_date>

<txt-rev_date>24-Sep-1999</txt-rev_date>

</header>

- <protein><name>steroid receptor protein svp44</name></protein>

- <organism>

<source>zebra fish</source>

<common>zebra fish</common>

<formal>Brachydanio rerio</formal>

</organism>

- <reference>

- <refinfo refid="S35333">

- <authors>

<author>Fjose, A.</author>

<author>Nornes, S.</author>

<author>Weber, U.</author>

<author>Mlodzik, M.</author>

</authors>

<citation>EMBO J.</citation>

<volume>12</volume>

<year>1993</year>

<pages>1403-1414</pages>

<title>Functional conservation of vertebrate seven-up related genes in neurogenesis and eye development.</title>

- <xrefs><xref><db>MUID</db><uid>93223680</uid></xref></xrefs>

</refinfo>

- <accinfo label="FJO">

<accession>S35333</accession>

<mol-type>mRNA</mol-type>

<seq-spec>1-411</seq-spec>

- <xrefs>
- <xref><db>EMBL</db><uid>X70299</uid></xref>

- <xref><db>NID</db><uid>g296418</uid></xref>

- <xref><db>PIDN</db><uid>CAA49780.1</uid></xref>

- <xref><db>PID</db><uid>g296419</uid></xref>

</xrefs>

</accinfo>

</reference>

- <genetics><gene><uid>svp44</uid></gene></genetics>

- <classification>

<superfamily>unassigned erbA-related proteins</superfamily>

<superfamily>erbA transforming protein homology</superfamily>

</classification>

- <keywords>

<keyword>DNA binding</keyword>

<keyword>steroid hormone receptor</keyword>

<keyword>zinc finger</keyword>

</keywords>

- <feature label="ERBA">

<feature-type>domain</feature-type>

<description>erbA transforming protein homology</description>

<seq-spec>74-320</seq-spec>

</feature>

- <feature>

<feature-type>region</feature-type>

<description>zinc finger</description>

<seq-spec>76-96</seq-spec>

</feature>

- <feature>

<feature-type>region</feature-type>

<description>zinc finger</description>

<seq-spec>112-136</seq-spec>

</feature>

- <summary><length>411</length><type>complete</type></summary>

<sequence>MAMVVSVWRDPQEDVAGGPPSGPNPAAQPAREQQQAASAAPHTPQTPSQPGPPSTPGTAGDKGSQNSGQSQQHIECVVCGDKSSGKHYGQFTCEGCKSFFKRSVRRNLTYTCRANRNCPIDQHHRNQCQYCRLKKCLKVGMRREAVQRGRMPPTQPNPGQYALTNGDPLNGHCYLSGYISLLLRAEPYPTSRYGSQCMQPNNIMGIENICELAARLLFSAVEWARNIPFFPDLQITDQVSLLRLTWSELFVLNAAQCSMPLHVAPLLAAAGLHASPMSADRVVAFMDHIRIFQEQVEKLKALHVDSAEYSCIKAIVLFTSDACGLSDAAHIESLQEKSQCALEEYVRSQYPNQPSRFGKLLLRLPSLRTVSSSVIEQLFFVRLVGKTPIETLIRDMLLSGSSFNWPYMSIQ</sequence>

</ProteinEntry>

b. Domain Knowledge
- <frame>

- <s slot="分子種類=mol-type" path="/ProteinEntry/reference/accinfo/mol-type" menu="yes">

<v value="protein" />

<v value="DNA" />

<v value="RNA" />

<v value="mRNA" />

<v value="genomic RNA" />

</s>

- <structure slot="分子形狀=mol-shape" path="//ProteinEntry" menu="yes">

<v value="螺旋=Alpha" keys="螺旋=Helix" image="motif/Alpha.gif" />

<v value="平板=Beta" keys="平板=Sheet" image="motif/Beta.gif" />

<v value="Alpha+Beta" />

<v value="Parallel-Beta" />

<v value="AntiParallel-Beta" />

</structure>

- <source_genus slot="分子來源=organism" path="//ProteinEntry//organism" menu="yes">

<v value="動物=Animal" />

<v value="植物=Plants" />

- <v value="細菌=Bacteria"><v value="大腸桿菌=E_coli"/></v>

- <v value="病毒=Virus"><v value="噬菌體=Bacteriophage" /></v>

<v value="昆蟲=Insects" />

<v value="酵母=Yeast" />

<v value="人=Human" />

<v value="牛=Cow" keys="牛=Ox" />

<v value="雞=Chicken" />

<v value="豬=Pig" />

<v value="兔=Rabbit" />

<v value="鼠=Mouse" keys="rat=鼠" />

<v value="魚=Fish" keys="Whale,Dolphin" />

<v value="鳥類=Bird" />

<v value="昆蟲=Insect" />

<v value="真菌=Fungi" />


<v value="線蟲=Nematodes" />

</source_genus>

- <body_component slot="身體部位=body_component" path="//ProteinEntry" menu="yes">

<v value="心臟=Heart" />

<v value="肺臟=Lung" />

<v value="肝臟=Liver" />

<v value="腎臟=Kidney" />

<v value="胰臟=Pancreas" />

<v value="脾臟=Spleen" />

<v value="腸道=Intestine" />

<v value="大腦=Brain" />

<v value="皮膚=Skin" keys="皮膚=skin" />

<v value="肌肉=Muscle" keys="肌肉=Myosin" />

<v value="毛髮=Hair" />

<v value="神經=Nerve_System" />

<v value="血液=Blood" />

<v value="骨骼=Bone" />

<v value="副甲狀腺=Parathyroid" />

<v value="荷爾蒙=pheromone" />

<v value="羽毛=Feather" />

<v value="植物的根=Root" />

<v value="植物的莖=Stem+Trunk" />

<v value="植物的葉=Leaf" />

</body_component>

- <cell_component slot="細胞部位=cell_component" path="//ProteinEntry" menu="yes">

<v value="細胞核=Nucleus" />

<v value="細胞質=Cytoplasm" />

<v value="細胞膜=Membrane" />

<v value="細胞壁=Cell_Wall" />

<v value="內質網=Endoplasmic_reticulum" />

<v value="高基氏體=Golgi_Bodies" />

<v value="溶小體=Lysosomes" />

<v value="粒腺體=Mitochondria" />

<v value="運輸系統=Transport" />

<v value="植物的質體=Plastids" />

</cell_component>

- <body_function slot="身體功能=body_function" path="//ProteinEntry" menu="yes">

<v value="消化=Digestion" />


<v value="呼吸=Respiration" />

<v value="運動=Motion" />

<v value="學習=Memory" />

<v value="感覺=Perception" />

<v value="幼兒=Larval" />

<v value="成長=Adult" />

<v value="懷孕=Pregnancy" />

<v value="交配=Mating" />

</body_function>

- <cell_function slot="細胞功能=cell_function" path="//ProteinEntry" menu="yes">

<v value="骨架=Structural" />

<v value="成長=Growth" />

<v value="吞噬=Phagocytosis" />

<v value="訊息=Communication" />

<v value="轉錄=Transcription" />

<v value="代謝=Metabolism" />

<v value="平衡=Ion_homeostasis" />

<v value="分解=Catabolism" />

<v value="調節=Regulation" />

<v value="催化=Enzyme" />

<v value="免疫=Immune" />

<v value="色素=Cytochrome" />

<v value="結合=Binding" />

<v value="水解=Hydrolase" />

<v value="循環=Circulation" />

<v value="毒素=Toxin" />

</cell_function>

- <material slot="相關元素=material" path="//ProteinEntry" menu="yes">

<v value="DNA" />

<v value="RNA" />

<v value="酸=Acid" />

<v value="鹼=Base" />

<v value="鹽=Salt" />

<v value="醣=Carbohydrate" />

<v value="脢=Enzyme" />

<v value="核酸=Nucleotides" />

<v value="脂肪=Lipid" />

<v value="維生素=vitamin" />


<v value="離子=Anion/Cation" />

<v value="碳=Carbon" />

<v value="磷=Phosphatase" />

<v value="能量=ATP" />

- <v value="金屬">

<v value="鈉" />

<v value="鉀=Potassium" />

<v value="鈣=calcium" />

<v value="鐵=iron" keys="ferric" />

<v value="銅=copper" />

<v value="鋁=aluminum" />

<v value="鎂=magnesium" />

<v value="重金屬=heavy_metal" />

</v>

</material>

- <property slot="特性=property" path="//ProteinEntry" menu="yes">

<v value="親水性=Hydrophilic" />

<v value="斥水性=Hydrophobic" />

<v value="帶正電=Positive_Charged" />

<v value="帶負電=Negative_Charged" />

</property>

- <s slot="蛋白質大小=size" path="//ProteinEntry//size" menu="yes" sortable="yes">

<v value="10-20R" />

<v value="20-50R" />

<v value="50-100R" />

<v value="100-500R" />

<v value="500-1000R" />

<v value="1000R-*" />

</s>

- <s slot="全部/片斷=whole/part" path="//ProteinEntry/summary/type" menu="yes">

<v value="fragment" />

<v value="complete" />

<v value="fragments" />

</s>

- <s slot="database" path="//db" menu="yes">

</s>

- <s slot="記錄資料庫=record-db" path="/ProteinEntry/reference/accinfo/xrefs/xref/db" menu="yes">


<!-- gene db -->

<v value="SGD" />

<v value="OMIM" />

<v value="MIPS" />

<v value="MIP" />

<v value="GDB" />

<v value="FlyBase" />

<!-- ref db -->

<v value="XFSC" />

<v value="UWGP" />

<v value="TIGR" />

<v value="SPDB" />

<v value="SCOEDB" />

<v value="PIDN" />

<v value="PID" />

<v value="PASP" />

<v value="NMASP" />

<v value="NMA" />

<v value="NID" />

<v value="MIPS" />

<v value="MIP" />

<v value="GSPDB" />

<v value="EMBL" />

<v value="DDBJ" />

<v value="CJSP" />

<v value="CESP" />

<v value="ATSP" />

<!-- ref db -->

<v value="PMID" />

<v value="MUID" />

</s>

- <source_genus slot="出版日期" path="//date" menu="yes" sortable="yes">

<v value="2002" />

<v value="2001" />

<v value="2000" />

<v value="1999" />

<v value="1998" />

<v value="1997" />


<v value="1996" />

<v value="1990-1995" />

<v value="1980-1990" />

<v value="1970-1980" />

<v value="before 1970" />

</source_genus>

</frame>
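As a rough illustration of how the protein frame above can be applied to the entry in part (a), the sketch below resolves two of the slot paths and keeps the menu values that equal the text found there. It is a simplified sketch under stated assumptions, not the retrieval system described in this thesis. It handles only root-anchored paths with plain child steps. It also assumes that the `R` in the size menu counts residues and that `*` marks an open upper bound. The names `find_texts`, `fill_menu_slots`, and `size_bucket` are hypothetical.

```python
import xml.etree.ElementTree as ET

# A trimmed entry and two trimmed slots, copied from this appendix.
ENTRY = """<ProteinEntry id="S35333">
  <reference><accinfo label="FJO"><mol-type>mRNA</mol-type></accinfo></reference>
  <summary><length>411</length><type>complete</type></summary>
</ProteinEntry>"""

FRAME = """<frame>
  <s slot="分子種類=mol-type" path="/ProteinEntry/reference/accinfo/mol-type" menu="yes">
    <v value="protein" /><v value="DNA" /><v value="RNA" /><v value="mRNA" />
  </s>
  <s slot="全部/片斷=whole/part" path="//ProteinEntry/summary/type" menu="yes">
    <v value="fragment" /><v value="complete" /><v value="fragments" />
  </s>
</frame>"""

SIZE_MENU = ["10-20R", "20-50R", "50-100R", "100-500R", "500-1000R", "1000R-*"]


def find_texts(root, path):
    """Evaluate a root-anchored path with ElementTree's limited XPath support."""
    steps = path.strip().lstrip("/").split("/", 1)
    if steps[0] != root.tag:
        return []
    if len(steps) == 1:
        return [(root.text or "").strip()]
    return [(e.text or "").strip() for e in root.findall(steps[1])]


def fill_menu_slots(entry, frame):
    """Map each slot name to the menu values that equal the text at its path."""
    result = {}
    for slot in frame.findall("s"):
        menu = {v.get("value") for v in slot.findall("v")}
        found = find_texts(entry, slot.get("path"))
        result[slot.get("slot")] = sorted(menu.intersection(found))
    return result


def size_bucket(length):
    """Return the first size label containing `length` (assumed semantics:
    'R' counts residues, '*' is an open upper bound)."""
    for label in SIZE_MENU:
        low, high = label.replace("R", "").split("-")
        if int(low) <= length and (high == "*" or length <= int(high)):
            return label
    return None


entry = ET.fromstring(ENTRY)
slots = fill_menu_slots(entry, ET.fromstring(FRAME))
bucket = size_bucket(int(entry.findtext("summary/length")))
print(slots, bucket)
```

The size slot in the frame points at `//ProteinEntry//size`, which the example entry does not contain. The sketch therefore reads `summary/length` instead, purely for demonstration.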