
2009 First International Workshop on Education Technology and Computer Science

Domain Ontology Component-based Semantic Information Integration


Hongmei Zhu1,2, Qijia Tian1, Yongquan Liang1, Shujuan Ji1, Wei Sun2

1 College of Information Science and Engineering, Shandong University of Science and Technology, Qingdao, China
zhuhongmei@163.com, tianqj@bnii.gov.cn, lyq@sdust.edu.cn

2 College of Information Science and Engineering, Shandong Agricultural University, Taian, China
jsjsuzie@sina.com, sw@sdau.edu.cn

Abstract: This paper studies an architecture for semantic information representation and integration based on domain ontology components. The domain ontology component is advocated as a loosely coupled approach to the use of ontology. As a case study, a prototype of an agricultural policy-oriented domain ontology component-based semantic information integration system (APODOCSIIS) is established. Ontology plays a key role in providing a shared terminology and in supporting the semantic representation and integration process. The architecture allows APODOCSIIS-based applications to perform automatic semantic information integration of agricultural policy text in greater depth: semantic matching of concepts between different ontology components; dynamic, domain ontology component-based semantic annotation of unstructured and semi-structured content; semantically enabled information extraction, indexing, retrieval and integration; and ontology management, such as querying and modifying the underlying ontology components. The main frame of this architecture has been implemented, and a concrete integration example is given.

Keywords: domain ontology; semantic information; integration; retrieval

I. INTRODUCTION

Today we are witnessing exponential growth of the data accumulated on the Web. How to discover and obtain valuable information from Web repositories is a meaningful and important research subject. Although Web data is semi-structured and distributed, each information source has to work together with the system that queries it. The problem of bringing together heterogeneous and distributed computer systems is known as the interoperability problem.

Agricultural policies are among the most fundamental policies of China. On the one hand, over time the state and government have established a number of agricultural policies covering all aspects of agriculture. On the other hand, hundreds of millions of farmers are eager to understand the provisions of these policies in order to solve their own problems, but they do not know how or where to get the information they need. These characteristics make agricultural policy-oriented information retrieval on the Web a challenging task.

Ontology was initially introduced as an "explicit specification of a conceptualization" [1]. Ontology can therefore be used in retrieval tasks to describe the semantics of the information sources and to make their contents explicit. With respect to the retrieval of data, it can be used to identify and associate semantically corresponding information concepts. This paper advocates the domain ontology component, which is a loosely coupled approach to the use of ontology. Interoperation across ontology components is achieved via terminological relationships represented between terms in different components.

Agricultural policies have characteristics that distinguish them from other information. Although they are unstructured or semi-structured, they are all official documents that abide by compilation rules and use standard tokens. These characteristics make automatic semantic representation and fine-grained, scalable retrieval more convenient.

This article introduces a prototype of an agricultural policy-oriented domain ontology component-based semantic information integration system (APODOCSIIS). With the help of ontology, APODOCSIIS provides a novel agricultural policy knowledge and information management infrastructure, together with services for automatic semantic annotation, indexing and retrieval of unstructured and semi-structured content. The most direct applications are knowledge management and enhancing the efficiency of existing indexing, retrieval, classification and filtering applications.

The rest of this paper is organized as follows. The second section reviews related work on information retrieval. The third section concerns the architecture of the policy-oriented ontology-based semantic information retrieval system and introduces the five function modules involved in the architecture. The fourth section describes the implementation of the prototype of the agricultural policy-oriented ontology-based semantic information retrieval system (APODOCSIIS). The fifth section gives the conclusion and future work.

II. RELATED WORKS

The use of ontology for the explication of implicit and hidden knowledge is a possible approach to overcoming the problem of semantic heterogeneity. Uschold and Gruninger mention interoperability as a key application of ontology, and many ontology-based approaches [2] to information integration have been developed in order to achieve interoperability.

978-0-7695-3557-9/09 $25.00 2009 IEEE DOI 10.1109/ETCS.2009.546


Many information retrieval solutions with a special focus on the use of ontology have been proposed in previous research. Intelligent information retrieval approaches in which ontologies play a role include SIMS, TSIMMIS, OBSERVER, CARNOT, Infosleuth, KRAFT, PICSEL, DWQ, Ontobroker, SHOE and others. Most of these systems use some notion of ontology. Although this research is interesting, several papers [3], [4] report that pre-existing dictionaries often do not meet users' needs for concepts of interest, and that an ontology such as WordNet [5] does not include proper nouns.

The past decade has seen many long-term, large-scale efforts to develop standard, common ontologies to support information and knowledge sharing. These efforts are undoubtedly important. Nevertheless, it is unrealistic to expect that all people and organizations developing information and knowledge application systems will use a common, shared ontology. In any mature field, the concepts, the properties of concepts, the relationships between properties and the relationships between concepts are highly complicated, and organizing all the concepts into a single ontology is extremely complex. Such an ontology is a highly coupled system that reflects a wide range of complex relationships between real-world concepts, but it contains so many concepts that even a subtle change to one concept may affect a large part of the ontology. Constructing, assessing, maintaining and updating such an ontology is extremely difficult and consumes much time and material. The work is often done by individuals or collaborating groups whose knowledge is usually limited to a certain area or range, so "islands of knowledge" often form. Moreover, when an ontology is reused, usually only a part of it is of interest rather than the whole. However, because of the high degree of coupling, the ontology has to be reused as a whole, which both increases cost and decreases efficiency. These limitations have restricted the wider use of ontology.

We feel that it is important for agricultural policy semantic information to be integrated in a timely manner. Therefore, we employ domain ontology components to provide sharable semantic foundations for the information semantic integration system.

III. ARCHITECTURE OF DOMAIN ONTOLOGY COMPONENT-BASED SEMANTIC INFORMATION INTEGRATION SYSTEM

The architecture of the domain ontology component-based semantic information integration system is shown in Figure 1. It includes five function modules: the Document Semantic pretreatment module (DocSemPreModule), the Query pretreatment module (QuePreModule), the Semantic-based information retrieval module (SemInfRetrModule), the Semantic-based information integration module (SemInfIntModule) and the Ontology components management module (OntComMngModule). They are introduced in detail below.

[Figure elements: user, query input, QuePreModule, DocSemPreModule, SemInfRetrModule, SemInfIntModule, OntComMngModule, document base, domain ontology component base]
Figure 1. Architecture of domain ontology component-based semantic information integration system

A. Query pretreatment module (QuePreModule)
The architecture of the query pretreatment module (QuePreModule) is shown in Figure 2.

[Figure elements: query input, word segment, synonym expansion, upstream expansion, downstream expansion, OntComMngModule]
Figure 2. Architecture of query pretreatment module (QuePreModule)

The main processing procedure of the query pretreatment module is as follows.

Step 1: Word segmentation is applied to the user input string, recognizing the words that correspond to entries in the ontology component glossaries. These words are stored in an eigenvector array in the order of their appearance in the user input string. If a word matches the glossaries of several ontology components, it is recorded in a separate eigenvector array for each of those components.

Step 2: Expansions based on semantic relations are made:
1. Synonym expansion: within each ontology component, all synonyms of every concept recognized in Step 1 are found. These tokens are stored in the synonym eigenvector array.
2. Hypernym expansion: within each ontology component, all concepts that subsume a concept recognized in Step 1 are found. These tokens are stored in the hypernym eigenvector array.
3. Hyponym expansion: within each ontology component, all concepts that are subsumed by a concept recognized in Step 1 are found. These tokens are stored in the hyponym eigenvector array.
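
The two steps above can be illustrated with a minimal Python sketch. It is not the APODOCSIIS implementation: the `OntologyComponent` class, the `segment` and `expand_query` functions, the forward maximum matching used for word segmentation, and the toy concept hierarchy are all assumptions made for illustration, and the "eigenvector arrays" are modelled as plain lists kept per ontology component.

```python
from dataclasses import dataclass, field


@dataclass
class OntologyComponent:
    """Hypothetical in-memory stand-in for one domain ontology component."""
    name: str
    synonyms: dict = field(default_factory=dict)  # concept -> set of synonyms
    parents: dict = field(default_factory=dict)   # concept -> set of direct hypernyms

    def glossary(self):
        """Every term the component knows about (all lower case)."""
        terms = set(self.synonyms) | set(self.parents)
        for group in (*self.synonyms.values(), *self.parents.values()):
            terms |= set(group)
        return terms

    def hypernyms(self, concept):
        """All concepts that subsume `concept`, following parents transitively."""
        found, stack = set(), [concept]
        while stack:
            for parent in self.parents.get(stack.pop(), ()):
                if parent not in found:
                    found.add(parent)
                    stack.append(parent)
        return found

    def hyponyms(self, concept):
        """All concepts that are subsumed by `concept`."""
        return {c for c in self.parents if concept in self.hypernyms(c)}


def segment(text, glossary):
    """Step 1 (assumed method): forward maximum matching against the glossary.

    Returns the recognized terms in their order of appearance.
    """
    text = text.lower()
    longest = max((len(term) for term in glossary), default=0)
    found, i = [], 0
    while i < len(text):
        for size in range(min(longest, len(text) - i), 0, -1):
            if text[i:i + size] in glossary:
                found.append(text[i:i + size])
                i += size
                break
        else:
            i += 1  # no glossary term starts here, move on one character
    return found


def expand_query(query, components):
    """Step 2: per component, the recognized concepts and their three expansions."""
    arrays = {}
    for comp in components:
        concepts = segment(query, comp.glossary())
        if not concepts:
            continue  # the query does not touch this component
        arrays[comp.name] = {
            "concepts": concepts,
            "synonyms": sorted({s for c in concepts for s in comp.synonyms.get(c, ())}),
            "hypernyms": sorted({h for c in concepts for h in comp.hypernyms(c)}),
            "hyponyms": sorted({h for c in concepts for h in comp.hyponyms(c)}),
        }
    return arrays


# Toy component; the relations below are illustrative, not taken from a real thesaurus.
agri = OntologyComponent(
    name="agri",
    synonyms={"milk": {"dairy products"}},
    parents={
        "milk": {"livestock products"},
        "livestock products": {"agricultural products"},
        "goat's milk": {"milk"},
    },
)
expanded = expand_query("Milk quality regulations", [agri])
print(expanded["agri"])
```

Keeping one result entry per component mirrors the rule in Step 1 that a word matching several glossaries is recorded separately for each component.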


B. Document Semantic pretreatment module (DocSemPreModule)
The architecture of the document semantic pretreatment module (DocSemPreModule) is shown in Figure 3.

[Figure elements: policy document, document pretreatment, fundamental structured document base, semantic annotation, semantic indexing, semantic-based clustering, document base, OntComMngModule, SemInfRetrModule]
Figure 3. Architecture of document semantic pretreatment module (DocSemPreModule)

The document semantic pretreatment module (DocSemPreModule) mainly includes five parts: document pretreatment, semantic annotation, semantic classification, semantic indexing and semantic-based clustering.

Document pretreatment prepares the content of an agricultural policy document for analysis and computation. It includes the following operations: (1) HTML tags are removed to obtain free text; (2) the fundamental structure of the agricultural policy is identified: the subsumption relations among the different levels of chapter, item and item content are identified and marked to build an XML document. Document pretreatment makes subsequent document processing more accurate and efficient.

Semantic annotation is performed next. Word segmentation against the ontology glossary recognizes tokens in the XML documents. Based on the result of semantic annotation, every item is classified according to the hierarchical concept tree of the ontology, so that a hierarchical organization corresponding to the ontology is logically established for every policy document.

For further processing, a semantic index is established for the classified items. First, for each concept in the ontology, a pointer to its instances is built. Second, the concepts that have instances are put into an index file and sorted in dictionary order.

The goal of semantic-based clustering is to cluster the different policy items. The semantic center of each item is determined. With semantic clustering based on the above procedure, the right cluster for a user query can be found quickly, which provides effective document management at the semantic level. At retrieval time, the user's query intention can then be located directly in the corresponding category, so the efficiency of semantic retrieval is enhanced.

C. Semantic-based information retrieval module (SemInfRetrModule)
The architecture of the semantic-based information retrieval module (SemInfRetrModule) is shown in Figure 4.

[Figure elements: query input, QuePreModule, DocSemPreModule, concepts similarity computation, properties similarity computation, semantic similarity computation, retrieval result, document base, OntComMngModule]
Figure 4. Architecture of semantic-based information retrieval module (SemInfRetrModule)
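
Retrieval in this architecture first matches the query against semantic centers and only then searches inside the best-matching cluster of policy items. The following Python sketch shows that two-stage idea under stated assumptions: the paper does not define the semantic center or the similarity measure, so here a center is assumed to be the summed concept counts of a cluster and similarity is assumed to be a simple count overlap. The function names, item identifiers and item texts are invented for illustration.

```python
from collections import Counter


def concepts_in(text, glossary):
    """Count glossary concepts in a text (substring matching stands in for annotation)."""
    lowered = text.lower()
    counts = Counter()
    for concept in sorted(glossary):
        occurrences = lowered.count(concept)
        if occurrences:
            counts[concept] = occurrences
    return counts


def semantic_center(texts, glossary):
    """Assumed definition: the summed concept counts over all items of a cluster."""
    center = Counter()
    for text in texts:
        center += concepts_in(text, glossary)
    return center


def overlap(query_tokens, counts):
    """Assumed similarity: how often the query tokens occur in a concept-count vector."""
    return sum(counts[token] for token in query_tokens)


def retrieve(query_tokens, clusters, glossary):
    """Stage 1: pick the cluster whose center matches best.
    Stage 2: rank only the items of that cluster."""
    centers = {
        name: semantic_center(items.values(), glossary)
        for name, items in clusters.items()
    }
    best = max(centers, key=lambda name: overlap(query_tokens, centers[name]))
    scored = [
        (overlap(query_tokens, concepts_in(text, glossary)), item_id)
        for item_id, text in clusters[best].items()
    ]
    hits = [item_id for score, item_id in sorted(scored, reverse=True) if score > 0]
    return best, hits


# Made-up clusters of policy items; identifiers and texts are placeholders.
clusters = {
    "dairy": {
        "reg/1": "Dairy products must pass quality inspection.",
        "reg/2": "Storage of feed.",
    },
    "crops": {
        "law/7": "Grain crops and agricultural products are inspected.",
    },
}
glossary = {"dairy products", "agricultural products", "grain crops"}

# Tokens as they might come out of query pretreatment for a query about milk.
result = retrieve(["milk", "dairy products"], clusters, glossary)
print(result)
```

Comparing against one center per cluster keeps the first stage proportional to the number of clusters rather than the number of items, which is the efficiency argument the text makes.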


The main function of the semantic-based information retrieval module (SemInfRetrModule) is to compute the semantic similarity between every token in the eigenvector arrays produced by the query pretreatment module (QuePreModule) and the concepts produced by the document semantic pretreatment module (DocSemPreModule). Notably, to speed up retrieval, every token in the eigenvector arrays is first compared with the semantic center of each item, so that the category of the user query can be determined promptly. Retrieval is then carried out within this semantic cluster.

D. Semantic-based information integration module (SemInfIntModule)
The main function of the semantic-based information integration module (SemInfIntModule) is to combine the corresponding data and give an answer to the user. The retrieved data are checked for format and value heterogeneity. The values of each answer (represented as a relation) are transformed into a unified format. The partial answers obtained from different ontology components can then be combined. A preprocessing step is needed before presenting the answer to the user if the user query is a projection of some roles of the objects satisfying particular constraints. Common objects retrieved from different ontology components must be identifiable: for intersection, only the common objects are taken into account; for union, duplicate objects are eliminated. The correlated result is returned to the user interface.

E. Ontology components management module (OntComMngModule)
The main function of the ontology components management module (OntComMngModule) is ontology management, such as automatically establishing ontology components in OWL from plain text structured by inheritance, after which the ontology can be edited and verified in an ontology editor, e.g. Protégé. One of its most important functions is to perform concept mapping between different ontology components. The mapping results are stored so that they can be used in the reasoning for retrieval and integration described for the preceding modules.

IV. THE IMPLEMENTATION OF PROTOTYPE FOR DOMAIN ONTOLOGY COMPONENT-BASED SEMANTIC INFORMATION INTEGRATION SYSTEM (APODOCSIIS)

In the prototype implementation of APODOCSIIS, users can freely set the expansions of concepts and properties, the range of policy documents to be retrieved and integrated, the number of results displayed per page, and the number of tokens for the different layers of expansion. After a user submits a query to the system, the correlated results are returned. Users can adjust these parameters according to their satisfaction with the retrieval result.

For example, suppose the user wants to query "Milk quality regulations". In fact, the word "milk" appears in almost no official documents, not even in the "Law of the People's Republic of China on Agricultural Product Quality Safety" or the "Quality control regulations for dairy products". To answer the query, our system takes the following steps:

Step 1: The user query is semantically expanded according to two ontology components: Keywords of official documents drafted by Administration of Secretary General Office of the CPC, and China Classification Thesaurus on agriculture. Synonym expansion (e.g. dairy products), hypernym expansion (e.g. farm products, farm product, agricultural products, agricultural commodity, agricultural product, livestock product, animal product, animal products, animal production, livestock products, animal by-products) and hyponym expansion (e.g. goat's milk) are done separately. The expansion tokens are stored in eigenvector arrays.

Step 2: Semantic-based information retrieval is carried out.
Semantic similarity is computed between every token in the eigenvector arrays produced in Step 1 and the concepts identified by the document semantic pretreatment module. Relevant passages in different laws and regulations are found.

Step 3: The corresponding data are combined and answers are given to the user. Each answer is transformed into a unified plain format. For intersection, only the common items are taken into account; for union, duplicate items are eliminated. 9 items are found in "Quality control regulations for dairy products" and 31 items are found in the "Law of the People's Republic of China on Agricultural Product Quality Safety". The correlated result is returned to the user interface.

V. CONCLUSION AND FUTURE WORKS

In this article, we studied the architecture of a domain ontology component-based semantic information integration system. As a case study, we established a prototype of an agricultural policy-oriented domain ontology component-based semantic information integration system (APODOCSIIS). The architecture allows APODOCSIIS-based applications to perform automatic semantic information retrieval and integration of policy text in greater depth: automatic and dynamic semantic annotation of unstructured and semi-structured content; semantically enabled information extraction, indexing and retrieval; and ontology components management. The proposed architecture has the potential to provide a formal and semantically rich representation, retrieval and integration of the items of agricultural policy.

However, some problems remain open. In our framework, the automatically constructed initial ontology can express only subsumption and synonym relations among concepts, and ontology components must be edited by humans. The next direction is to take role relations, e.g. role subsumption and role synonymy, into account. In addition, replacing human intervention in the construction of ontology components with an inductive learning technique or a text mining method still needs further discussion, since it strongly influences the retrieval and integration results.

REFERENCES
[1] Tom Gruber, "A translation approach to portable ontology specifications," Knowledge Acquisition, 5(2), 1993, pp. 199-220.
[2] M. Uschold and M. Gruninger, "Ontologies: Principles, methods and applications," Knowledge Engineering Review, 11(2), 1996, pp. 93-155.
[3] S. Loh, L. K. Wives, and J. P. M. de Oliveira, "Concept-Based Knowledge Discovery in Texts Extracted from the Web," SIGKDD Explorations, ACM SIGKDD, July 2000, Vol. 1, Issue 1, pp. 29-39.
[4] N. Guarino, "Formal Ontology and Information Systems," in N. Guarino (Ed.), Formal Ontology in Information Systems, Proc. of the 1st International Conference, Trento, Italy, June 1998, IOS Press, Amsterdam, pp. 3-15.
[5] G. A. Miller, "WORDNET: A Lexical Database for English," Communications of ACM (11), 1995, pp. 39-41.
[6] S. Liu, C. A. McMahon, M. J. Darlington, S. J. Culley, and P. J. Wild (2006), "A computational framework for retrieval of document fragments based on decomposition schemes in engineering information management," Advanced Engineering Informatics, Vol. 20, pp. 401-413.
[7] J. McKechnie, S. Shaaban, and L. Lockley (2001), "Computer Assisted Processing of Large Unstructured Document Sets: A Case Study in the Construction Industry," in Proceedings of 1st ACM Symposium on Document Engineering, Atlanta.
[8] T. R. Gruber, "A Translation Approach to Portable Ontology Specifications," Knowledge Acquisition, 1993, 5, pp. 199-220.
[9] K. Y. Lin and L. Soibelman (2005), "Knowledge Assisted Retrieval of Online Product Information in AEC," in Proceedings of Computing in Civil Engineering, July 12-15, 2005, Cancun, Mexico.
[10] Y. Rezgui (2006), "Ontology-centered Knowledge Management Using Information Retrieval Techniques," Journal of Computing in Civil Engineering, 20 (4), July/August 2006, pp. 261-270.
[11] E. M. Voorhees and D. Harman (1998), "The text retrieval conferences (TRECS)," Proceedings of a workshop on held at Baltimore, October 13-15, 1998, Baltimore, Maryland.
[12] Y. Amghar, D. Bahloul, and P. Maret (2004), "Ontology-Based Framework For Document Indexing," in Proceedings of the 6th International Conference on Enterprise Information Systems (ICEIS 2004), April 14-17, Porto, Portugal.
