Professional Documents
Culture Documents
net/publication/221596752
CITATIONS READS
2 125
4 authors:
Some of the authors of this publication are also working on these related projects:
The Development of Integrated Model and Intelligent Support System for Software Certification Process View project
All content following this page was uploaded by Rosmayati Mohemad on 22 July 2015.
1 Introduction
Nowadays, ontology plays an essential role in knowledge management in which it
shares common understanding of a domain. Since the last decade ontology has
E. Mugellini et al. (Eds.): Advances in Intelligent Web Mastering - 3, AISC 86, pp. 153–162.
springerlink.com © Springer-Verlag Berlin Heidelberg 2011
154 R. Mohemad et al.
2 Related Works
The importance of ontology is recognized and implemented in information extrac-
tion. Information extraction is defined as any method of analyzing large volumes
of unstructured texts, normally in the form of natural language and automatically
extracts salient information from these texts into pre-defined template [7-9]. On-
tology and information extraction are closely related in two main tasks either pre-
existing ontology is used to extract information or it is used to populate and
improving ontology. The first task is our focus in the research. The use of ontolo-
gy enables information extraction to have better access and coverage to relevant
information by providing domain knowledge specific application [10,11]. Output
from this process is useful for further processing in wide range of application such
as text mining, text classification, ontology learning, information retrieval, deci-
sion support systems and others.
Research that was done by Holzinger et al. [12] populated domain specific on-
tology with instances extracted from structural information on tabular data of Web
documents. Here, they come out with table ontology and fixed adjacent attribute-
value pairs to identify the instances. The goal of our paper shares the same interest
of instantiating domain ontology, yet expands to process non structural and semi
structural information. Table ontology is improved by identifying headers of tabu-
lar structure and overcome fixed adjacent attribute-value pairs approach. In
addition, non tabular concepts also included to give semantic knowledge for non
tabular data. Meanwhile, Shashirekha and Murali [13] improved similar frame-
work that had been proposed by Embly et al. [14] by identifying relevant informa-
tion written in short forms or abbreviations using domain dependent ontology and
populated the extracted information into relational database. Nevertheless, they
examined only non structural documents and the semantic of document structures
were not considered. In our approach, the semantic representations allow informa-
tion to be extracted directly from documents and can be queried directly from the
ontology.
WeDax is web-based data extraction tool where it restructures web documents
into XML schema representations and mapping with domain specific ontology
[15]. The tool however only managed to extract constantly changing data with a
fixed structure. Moreover, XML is more to syntactic language, thus lack supports
for efficient sharing between semantically defined concepts. Instead, our research
uses the classification capabilities of rule-based inference engine to identify mea-
ningful instances without having to depend on conventional mapping process.
Other information extraction application has been proposed by Biletskiy et al. [16]
is Course Outline Data Extractor. The tool used data integration to automatically
transform learning syllabus stored on HTML web pages into XML format. Then,
the relevant information is extracted by computing similarity between source and
target syllabus in XML using domain specific ontology. However, the main dif-
ferent of these researches is they do not consider enough semantic relations among
fundamental concepts of document structure and recognize instances depend on
heuristic mapping and matching algorithms.
156 R. Mohemad et al.
Document Preprocessing
Tender Documents Removing irrelevant tags and punctuations
in PDF
Text Tokenizing and Coordinating
Information Acquisition
Parsing Keywords and Expression Patterns
Content Mapping
hasTitle
Table String Value Non-Table
consists consists consists
Body, Keyword, Pattern and these concepts are associated with relationships.
Some concepts in the ontology are reused from table ontology as proposed by
Holzinger et al. [12]. The ontology populates instances according to the structure
found on the document. Each document and table is reflected by Document and
Table concepts respectively in which are differentiated by string title. Transitive
consists relation is modeled as OWL object property which indicates semantic
relationship between concepts of Document, Table, Row, Column, Cell, Non-Table
and Paragraph. Relevant data or value found is stored into either Cell or Para-
graph which contains string value modeled as OWL data type property. This value
is associated with matched keywords and pattern rules. Concept of Keyword can
be inherited by subclasses such as KeywordConcept, KeywordAttribute and etc.
Particular concepts and attributes are represented using predefined keywords and
are associated with hasAttribute relationship. Here, related keywords and regular
expression patterns are defined to guide the extraction. Meanwhile, complex con-
cepts can be derived from basic concepts definitions. All of these concepts and
relationships are encoded using the most recent standard ontology language of
OWL. The document model can represent any standard document that contains
three different type of information either non structured, semi structured or un-
structured. It is adequate in modeling construction tender document structure.
158 R. Mohemad et al.
StringName
has Keyword has Rule
Keyword Pattern
is a has Value
“Mr”
StringName=∀Cell ∩∃.hasKeywordKeywordName∩∃.hasRulePattern
using OWL operators. For example, one of the possible rules that can be inferred
by OWL Pellet reasoner is depicted in Fig. 3. Here, complex class of StringName
is derived by union of any Cell has KeywordName and has Pattern. Furthermore,
SPARQL query is executed to retrieve the extracted information. Concepts of
domain specific keywords identification and regular expression patterns are se-
mantically associated with cell and paragraph.
4 Experimental Setup
The purpose of this experiment is to evaluate the performance of the proposed
approach according to precision and recall measurements. Six copies of tender
(a)
(b)
Indicator
Document Title
Concept
Attribute
Value
(c)
The finding shows that the ontology is significantly capable in recognizing im-
portant of concepts, attributes and values and then represent them as instances into
machine readable format of ontology that can be used for further decision-making
process. The results indicate the approach is capable to detect information that
matched with the predetermined keywords and regular patterns. In addition, the
capability of ontology to allow rules reasoning improve the extraction process.
6 Conclusion
This study has proposed an approach of information extraction using domain de-
pendent ontology. The relevant information is recognized by matching with key-
words and regular expression patterns. Based on precision, recall and f-measure,
the extracted information has shown significant experimental results. However,
the implementation of keywords and regular expression patterns allows any
matched strings to be associated with them including two different strings that
contain similar word. There is a chance to produce redundant recognizable infor-
mation and ambiguous interpretation of the knowledge. In future, the meanings of
each keyword will be hierarchical expanded in the ontology in order to produce
more quality keywords. Furthermore, table recognizing which currently works
based on simple table assumption will be enhanced for complex table. We are
considering to include supporting documents as part of document sources and
proposed a model to semantically match between the content of compulsory and
supporting documents.
References
1. Kayed, A., Colomb, R.M.: Extracting Ontological Concepts for Tendering Conceptual
Structures. Data & Knowledge Engineering 40(1), 71–89 (2002a)
2. Du, T.C.: Building an Automatic e-Tendering System on the Semantic Web. Decision
Support Systems 14(1), 13–21 (2009), doi:10.1016/j.dss.2008.12.009
3. McCray, A.T.: An Upper-Level Ontology for the Biomedical Domain. Comparative
and Functional Genomics 4(1), 80–84 (2003)
4. Schulze-Kremer, S.: Ontologies for Molecular Biology and Bioinformatics. Silico Bi-
ol. 2(3), 179–193 (2002)
5. Soibelman, L., Wu, J., Caldas, C., Brilakis, I., Lin, K.-Y.: Management and Analysis
of Unstructured Construction Data Types. Advanced Engineering Informatics 22(1),
15–27 (2008)
6. Rosmayati, M., Abdul Razak, H., Zulaiha, A.O., Noor Maizura, M.N.: Ontological-
based for Supporting Multi Criteria Decision-Making. In: Wen, D., Zhou, J. (eds.) 2nd
IEEE International Conference on Information Management and Engineering, Cheng-
du, China, pp. 214–217. IEEE Press, Los Alamitos (2010)
7. Cowie, J., Lehnert, W.: Information Extraction. Communications of the ACM 39(1),
80–91 (1996)
8. Soderland, S.: Learning Information Extraction Rules for Semi-Structured and Free
Text. Machine Learning 34(1-3), 233–272 (1999)
162 R. Mohemad et al.
9. Grishman, R.: Information Extraction: Techniques and Challenges. In: Pazienza, M.T.
(ed.) SCIE 1997. LNCS, vol. 1299, pp. 10–27. Springer, Heidelberg (1997)
10. Maedche, A., Neumann, G., Staab, S.: Bootstrapping an Ontology-Based Information
Extraction System. In: Szczepaniak, P., Segovia, J., Kacprzyk, J., Zadeh, L. (eds.) Intel-
ligent Exploration of the Web. Studies In Fuzziness And Soft Computing, pp. 345–359.
Springer/Physica-Verlag, Heidelberg (2003)
11. Alani, H., Kim, S., Millard, D.E., Weal, M.J., Hall, W., Lewis, P.H., Shadbolt, N.R.:
Automatic Ontology-Based Knowledge Extraction from Web Documents. IEEE Intel-
ligent Systems 18(1), 14–21 (2003)
12. Holzinger, W., Krüpl, B., Herzog, M.: Using Ontologies for Extracting Product Features
from Web Pages. In: Cruz, I., Decker, S., Allemang, D., Preist, C., Schwabe, D.,
Mika, P., Uschold, M., Aroyo, L.M. (eds.) ISWC 2006. LNCS, vol. 4273, pp. 286–299.
Springer, Heidelberg (2006)
13. Shashirekha, H.L., Murali, S.: Ontology Based Structured Representation for Domain
Specific Unstructured Documents. In: Proceedings of International Conference on
Conference on Computational Intelligence and Multimedia Applications 2007, Tamil
Nadu, pp. 50–54 (2007)
14. Embley, D.W., Campbell, D.M., Smith, R.D.: Ontology-Based Extraction and Struc-
turing of Information from Data-Rich Unstructured Documents. In: Proceedings of the
Conference on Information and Knowledge Management (CIKM 1998), Washington
D.C, pp. 52–59 (1998)
15. Snoussi, H., Magnin, L., Nie, J.-Y.: Toward an Ontology-based Web Data Extraction.
In: Proceedings of the 15th Canadian Conference on Artificial Intelligence, Calgary,
Altas, Canada, pp. 1–8 (2002)
16. Biletskiy, Y., Brown, J.A., Ranganathan, G.: Information Extraction from Syllabi for
Academic e-Advising. Expert Systems with Applications 36(3), 4508–4516 (2009)