An Ontology-Based Pattern Mining System for Extracting Information from Biological Texts

Muhammad Abulaish (1) and Lipika Dey (2)

(1) Department of Mathematics, Jamia Millia Islamia (A Central University),
New Delhi-110025, India
mdabulaish@rediffmail.com
(2) Department of Mathematics, Indian Institute of Technology, Delhi,
Hauz Khas, New Delhi - 110 016, India
lipika@maths.iitd.ac.in

Abstract. Biological information embedded within large repositories of
unstructured or semi-structured text documents can be extracted more
efficiently through effective semantic analysis of the texts in collaboration
with structured domain knowledge. The GENIA corpus houses tagged
MEDLINE abstracts, manually annotated according to the GENIA ontology,
for this purpose. However, manual tagging of all texts is impossible, and
special-purpose storage and retrieval mechanisms are required to reduce
information overload for users. In this paper we have proposed an
ontology-based Biological Information Extraction and Query Answering
(BIEQA) system that has four components: an ontology-based tag analyzer
for analyzing tagged texts to extract biological and lexical patterns, an
ontology-based tagger for tagging new texts, a knowledge base enhancer
which enhances the ontology and incorporates new knowledge in the form of
biological entities and relationships into the knowledge base, and a query
processor for handling user queries.

Keywords: Ontology-based text mining, Biological information extraction, Automatic tagging.

1 Introduction

The collection of research articles in the field of Molecular Biology is growing
at such a tremendous rate that, without the aid of automated content analysis
systems designed for this domain, the assimilation of knowledge from this vast
repository is becoming practically impossible [10]. The core problem in designing
content analysis systems for text documents arises from the fact that these
documents are usually unstructured or semi-structured in nature. Recent efforts at
consolidating biological and clinical knowledge in the structured form of
ontologies, however, have raised hopes of realizing such systems. Since an ontology
specifies the key concepts in a domain and their inter-relationships to provide an
abstract view of an application domain [1], ontology-based Information Extraction (IE)
schemes can help in alleviating a wide variety of natural language ambiguities
present in a given domain.
D. Ślęzak et al. (Eds.): RSFDGrC 2005, LNAI 3642, pp. 420-429, 2005.
© Springer-Verlag Berlin Heidelberg 2005


Let us illustrate our point of view through an example. Suppose a user poses
a query like "What are the biological substances activated by CD28?". A
specific answer like "Differential regulation of proto-oncogenes c-jun and c-fos
in T lymphocytes activated through CD28" can then be extracted from the repository,
provided the system knows that proto-oncogenes, c-jun and c-fos are biological
substances. Kim et al. [7] suggest that such an answer can be produced if the
elements in the text are appropriately annotated. Though this scheme can solve
the information extraction problem very effectively, some of the key challenges
in designing such systems are: (i) recognition of the entities to be tagged,
(ii) automatic tagging of the identified entities, (iii) choosing a base ontology that
can be used to annotate the text effectively, and (iv) assimilation of new domain
knowledge.
In this paper we have proposed an ontology-based Biological Information
Extraction and Query Answering (BIEQA) System that tries to address the above
problems within a unified framework. The system exploits information about
domain ontology concepts and relations in conjunction with lexical patterns
obtained from biological text documents. It extracts and structures information
about biological entities and the relationships among them within a pre-defined
knowledge base model. The underlying ontology is enhanced using knowledge
about feasible relations discovered from the texts. The system employs efficient
data structures to store the mined knowledge in structured form so that
biological queries can be answered efficiently.
The rest of the paper is organized as follows. We review some related works on
biological information extraction in section 2. The architecture of the complete
system is presented in section 3. Sections 4 through 7 present functional details
of each module. The performance of the system is discussed in section 8. Finally,
we conclude and discuss future work in section 9.

2 Related Work

A general approach to information extraction from biological documents
is to annotate or tag relevant entities in the text, and reason with them. Most
of the existing systems focus on a single aspect of text information extraction.
The tagging is often done manually. A significant effort has gone towards
identifying biological entities in journal articles for tagging them. Su et al. [11] have
proposed a corpus-based approach for automatic compound extraction which
considers only bigrams and trigrams. Collier et al. [2] have proposed a Hidden
Markov Model (HMM) based approach to extract the names of genes and
gene products from Medline abstracts. Fukuda et al. [5] have proposed a method
called PROtein Proper-noun phrase Extracting Rules (PROPER) to extract
material names from sentences using surface clues on character strings in medical
and biological documents.
Reasoning about the contents of a text document, however, needs more than
identification of the entities present in it. Kim et al. [7] have proposed the GENIA
ontology of substances and sources (substance locations) as a base to fix the
class of molecular biological entities and the relationships among them. The GENIA
ontology is widely accepted as a baseline for categorizing biological entities
and reasoning with them to infer the contents of documents. There also
exists a GENIA corpus created by the same group, which contains 2000 manually
tagged MEDLINE abstracts, where the tags correspond to GENIA ontology
classes. Tags can be either simple or nested: simple tags are assigned
to single noun phrases, whereas nested tags are for more complex phrases. Tags
can help in the identification of documents containing information of interest to a
user, where interest can be expressed at various levels of generalization.
Extracting contents from text documents based on relations among entities is
another challenge for text-mining researchers. Craven and Kumlien [3] have
proposed identification of possible drug-interaction relations between proteins and
chemicals using a bag-of-words approach applied at the sentence level. Ono et al.
[8] report on extraction of protein-protein interactions based on a combination of
syntactic patterns. Rinaldi et al. [9] have proposed an approach to automatically
extract some relevant relations in the domain of Molecular Biology, based on a
complete syntactic analysis of an existing corpus. Friedman et al. [4] have proposed
GENIES, a natural language processing system for extracting information about
molecular pathways from texts.

3 Architecture of the BIEQA System

In contrast to the systems discussed earlier, BIEQA is conceived as a completely
automated information extraction system that can build a comprehensive
knowledge base of biological information to relieve users from information
overload. The system uses the GENIA ontology as its starting domain knowledge
repository.
Fig. 1 represents the complete architecture of the BIEQA system. Though
the primary role of the system is to answer queries efficiently, the system is
initially trained to tag new documents automatically, for ease of information
extraction. The system has four main components. The first component, Tag
Analyzer, has dual responsibilities. During system training, it operates on a set
of tagged texts to extract relationships between tags and other lexical patterns
within texts. Tag Analyzer also extracts key biological information from tagged
texts in the form of entities and their relations. This is stored in the knowledge
base to answer user queries efficiently. The second module, Tag Predictor, tags
new documents using the knowledge about tagging extracted by the analyzer.
This module uses a combination of table look-up and maximum-likelihood based
prediction techniques. The system is equipped with an entity recognizer
for identifying candidate phrases in new documents for both simple and nested
tagging. One of the unique aspects of our system is the integration of a third module,
called Knowledge Base Enhancer, which has the capability to enhance existing
ontological structures with new information that is extracted from documents.
Query Processor, the fourth module, extracts relevant portions of documents
from a local database, along with their MEDLINE references. User interaction
with the system is provided through a guided query interface. The interface
allows users to pose queries at various levels of specificity including combinations
of entities, tags and/or relations. Further details of each module are presented in
the following sections.

Fig. 1. System Architecture

4 Tag Analyzer

As already mentioned, the Tag Analyzer has two main functions. During training,
the analyzer helps in extracting statistical information about tag-word
co-occurrences from manually tagged documents. To start with, the Tag Analyzer
filters stop words from the corpus. We have considered most of the stop words
used by the PubMed database. After that, it assigns a document id, a name (Medline
number), and a line number to every sentence of the corpus. Then it extracts
the contents of a sentence and stores them in a tree structure, which is defined as
follows:
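struct tag {
    char *name;               /* tag name */
    struct tag *lchild;       /* left child in the tree */
    struct tag *rchild;       /* right child in the tree */
    struct tag *InnerTag;     /* nested tag, if any */
};
struct segment {
    char *non_tagged_text;    /* untagged text of the segment */
    struct tag *tags;         /* tags attached to the segment */
};
struct segment sentence[no_of_segments];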
This tree is used by Tag Predictor and Knowledge Base Enhancer. Fig. 2
shows a sample tree structure created by Tag Analyzer corresponding to a tagged
sentence picked up from MEDLINE:95197524.
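
The tree defined above can be exercised with a short sketch. It reuses the paper's struct tag, but the helpers new_tag, preorder and free_tags are hypothetical names introduced only for illustration, and the way lchild, rchild and InnerTag are linked here is an assumption rather than the authors' implementation.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Same shape as the paper's struct tag. */
struct tag {
    char *name;
    struct tag *lchild;
    struct tag *rchild;
    struct tag *InnerTag;
};

/* Hypothetical helper: allocate a node that owns a copy of name. */
struct tag *new_tag(const char *name)
{
    assert(name != NULL);
    struct tag *t = calloc(1, sizeof *t);
    if (t == NULL)
        return NULL;
    t->name = malloc(strlen(name) + 1);
    if (t->name == NULL) {
        free(t);
        return NULL;
    }
    strcpy(t->name, name);
    return t;
}

/* Preorder walk: the node itself, then its nested tag, then the left and
   right children. Appends "name " to out (total capacity cap, names that
   do not fit are skipped) and returns the number of nodes visited. */
size_t preorder(const struct tag *t, char *out, size_t cap)
{
    if (t == NULL)
        return 0;
    if (strlen(out) + strlen(t->name) + 2 <= cap) {
        strcat(out, t->name);
        strcat(out, " ");
    }
    /* Separate statements keep the visiting order well defined. */
    size_t n = 1;
    n += preorder(t->InnerTag, out, cap);
    n += preorder(t->lchild, out, cap);
    n += preorder(t->rchild, out, cap);
    return n;
}

/* Release a whole tree. */
void free_tags(struct tag *t)
{
    if (t == NULL)
        return;
    free_tags(t->InnerTag);
    free_tags(t->lchild);
    free_tags(t->rchild);
    free(t->name);
    free(t);
}
```

A caller would build nodes with new_tag, link them through the three pointers, and pass an empty, zero-terminated buffer to preorder to obtain the tag names in visiting order.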

Fig. 2. Tree structure generated with a sample sentence. [] indicates tags. Entities are
in boldface.

5 Biological Tag Predictor

The main function of this module is to locate entities in a biological document
and tag them according to the GENIA ontology. During the training phase of the
system, this module works on the tree structure generated by Tag Analyzer to
extract statistical information for tagging. After training, it applies the extracted
statistical knowledge to tag the biological entities appearing in a document.
To identify entities in documents, this module also implements a document
processor. Since tagging is based on neighboring words, it also includes
a neighborhood extractor which stores information about occurrences of terms
surrounding an entity. We outline the functions of each of these components in
the following subsections.
5.1 Document Processor for Entity Recognition

The main function of this module is to locate entities. Since entities are either
nouns or noun phrases, the document processor includes a Parts-Of-Speech
(POS) Tagger that assigns parts of speech to English words based on the context
in which they appear. We have used a web-based Tagger, developed by the
Specialized Information Services Division (SIS) of the National Library of
Medicine (NLM) [footnote 1], to locate and extract noun phrases that are present in
the text document. In addition to identification of nouns and noun phrases as named
entities, this module also contains a set of rules to handle special characters
which are very common in biological documents. Gavrilis et al. [6] have proposed
a rule set for pre-processing biological documents. We have modified this rule
set and added some new rules. These rules have been identified after analysis of
100 manually tagged documents picked at random from the GENIA corpus. The
modified rule set is shown in Fig. 3.

[footnote 1] http://tamas.nlm.nih.gov

Fig. 3. Entity recognition rules and a sample sentence with entities identified for tagging indicated by [ ]

These rules help in generating equivalent names from a number of different
appearances of biological names, and also in extracting different instances from
concatenations and commonly used abbreviations. For example, concatenated
instances like "B- and T-cell" or "gp350/220" are converted into "B-cell and
T-cell" and "gp350/gp220" respectively. The complete list of entities is identified
by applying the above-mentioned rules in conjunction with the identified noun
phrases. Our method is capable of identifying both simple and nested entities to
be tagged. All stop words are filtered from the documents and the processed
document is passed on to the Neighborhood Extractor for extraction of bigrams.
The pre-processor also assigns a document id and name (generally the Medline
number), breaks the document into sentences and assigns each sentence a unique
line number within the document, so that the combination of the document id and
line number forms a primary key.
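
The abbreviation rule mentioned above, which turns "gp350/220" into "gp350/gp220", can be sketched as follows. The exact rules of Fig. 3 are not reproduced in the text, so the condition used here (the part after the slash starts with a digit, and the part before it has a non-digit prefix) and the function name expand_slash are assumptions made for illustration.

```c
#include <assert.h>
#include <ctype.h>
#include <string.h>

/* Expand "gp350/220" into "gp350/gp220": when the part after '/' starts
   with a digit, prepend the leading non-digit prefix of the part before
   '/'. Returns 1 if an expansion was written to out, 0 if the input was
   copied unchanged, and -1 if out (capacity cap) is too small. */
int expand_slash(const char *in, char *out, size_t cap)
{
    assert(in != NULL && out != NULL);
    const char *slash = strchr(in, '/');
    size_t prefix = 0;

    /* Length of the leading non-digit prefix of the left-hand name. */
    while (in[prefix] != '\0' && in[prefix] != '/' &&
           !isdigit((unsigned char)in[prefix]))
        prefix++;

    if (slash == NULL || prefix == 0 || !isdigit((unsigned char)slash[1])) {
        if (strlen(in) + 1 > cap)
            return -1;
        strcpy(out, in);
        return 0;
    }

    size_t left = (size_t)(slash - in) + 1;   /* length of "gp350/" */
    size_t need = left + prefix + strlen(slash + 1) + 1;
    if (need > cap)
        return -1;
    memcpy(out, in, left);                    /* "gp350/" */
    memcpy(out + left, in, prefix);           /* "gp"     */
    strcpy(out + left + prefix, slash + 1);   /* "220"    */
    return 1;
}
```

Names without a slash, or with a full name on both sides such as "IL-2/IL-4", are copied through unchanged, so the function can be applied to every candidate phrase.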
Initially, each entity is assigned an arbitrary tag. This tag acts as a
placeholder which helps in identifying locations in the sentence where tags will be
inserted. The lower box in Fig. 3 shows a sample sentence along with its entities
identified and marked for tag prediction. The sentence, with expected tag
positions marked, is now converted into a tree structure similar to the one described in
section 4, though tag names are not yet known. This tree encodes all needed
information about an entity and its neighboring terms in a sentence, which is
used for predicting its tag.
5.2 Statistical Prediction of Tags for Entities

The GENIA ontology defines 36 tag classes, including the other_name tag. Tag
Predictor has been implemented to predict the most likely and most specific
tag from this collection. For all known entities, the tag is looked up from the
knowledge base. Otherwise, the tag prediction process uses a maximum
likelihood based model to predict an entity E's tag based on the information about
its neighboring words. The task is to maximize $\mathcal{L}(T \mid E_n)$, where $T$ stands for a
tag and $E_n$ for the neighbourhood of the entity. $\mathcal{L}(T \mid E_n)$ is computed using
bigrams of words and/or tags. We define $E_n$ as a 4-tuple (two left words and two
right words) that surrounds the entity E to be tagged. A 4-tuple denoted by
$\langle \mathrm{LHW}{-}1, \mathrm{LHW}, \mathrm{RHW}, \mathrm{RHW}{+}1 \rangle$ specifies the state of the system. Tag
Predictor generates the 4-tuple set for E through preorder traversal of the tree
mentioned earlier. The likelihood of tag $T_i$ is computed using the following equation:

$$\mathcal{L}(T_i \mid \mathrm{LHW}{-}1, \mathrm{LHW}, \mathrm{RHW}, \mathrm{RHW}{+}1) = P(T_i \mid \mathrm{LHW}{-}1) + P(T_i \mid \mathrm{LHW}) + P(T_i \mid \mathrm{RHW}) + P(T_i \mid \mathrm{RHW}{+}1)$$

where $P(\cdot \mid \cdot)$ is calculated from training data as follows:

$$P(T_i \mid t) = \frac{N(T_i, t)}{\sum_{j=1}^{36} N(T_j, t)}$$

where $N(T_k, t)$ represents the number of times a tag $T_k$ is observed in
combination with term $t$ in the given position, in training data.
For any entity, one or more of its neighboring words may themselves be entities
that need to be tagged. In this case we use the tag which has the maximum
probability of occurrence in the corresponding position, as obtained from
training data. The tag with the maximum likelihood value is assigned to the new
entity, provided this value is greater than a given threshold. If an entity
cannot be associated with any tag, it is tagged as other_name.
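
The scoring step above can be sketched in a few lines. This is an illustration under stated assumptions, not the authors' code: the flat layout of the count table, the names cond_prob and predict_tag, and the convention of returning -1 when no tag exceeds the threshold (so that the caller assigns other_name) are all hypothetical.

```c
#include <assert.h>
#include <stddef.h>

#define NUM_POSITIONS 4   /* LHW-1, LHW, RHW, RHW+1 */

/* P(T_i | t) = N(T_i, t) / sum_j N(T_j, t). counts holds N(T_j, t) for one
   neighbouring term t in one position; an unseen term contributes 0. */
static double cond_prob(const unsigned *counts, size_t num_tags, size_t i)
{
    unsigned long total = 0;
    for (size_t j = 0; j < num_tags; j++)
        total += counts[j];
    return total == 0 ? 0.0 : (double)counts[i] / (double)total;
}

/* counts[p * num_tags + j] is N(T_j, t_p) for the term t_p found at
   position p of the entity's 4-tuple. Returns the index of the tag with
   the largest summed probability if that sum exceeds threshold, and -1
   otherwise. The winning score is stored in *best_score when that
   pointer is not NULL. */
int predict_tag(const unsigned *counts, size_t num_tags,
                double threshold, double *best_score)
{
    assert(counts != NULL && num_tags > 0);
    int best = -1;
    double best_val = threshold;

    for (size_t i = 0; i < num_tags; i++) {
        double score = 0.0;
        for (size_t p = 0; p < NUM_POSITIONS; p++)
            score += cond_prob(counts + p * num_tags, num_tags, i);
        if (score > best_val) {
            best_val = score;
            best = (int)i;
        }
    }
    if (best_score != NULL)
        *best_score = (best >= 0) ? best_val : 0.0;
    return best;
}
```

Because the score is a sum of four conditional probabilities it ranges from 0 to 4, so a threshold would be chosen on that scale. With the 36 GENIA classes, num_tags would be 36.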

6 Knowledge Base Enhancer

Knowledge Base Enhancer conducts a postorder traversal of the tree to extract
information about entities occurring in documents. While an ontology provides
useful domain knowledge for extracting information from unstructured texts,
it also becomes necessary at times to enhance existing domain knowledge with
additional information extracted from the text sources.
The GENIA ontology specifies a concept taxonomy over molecular biological
substances and their locations only. Our system applies statistical analysis on the
tagged documents to extract information about other possible biological
relations among substances, among locations, or between substances and locations.
This is accomplished through extraction of relational verbs from documents and
analysis of their relationships with surrounding tagged entities. To identify
feasible relations to enhance the ontology structure, we have focused on
recognition of relational verbs in the neighborhood of the entities. After analyzing
the whole collection, those relational verbs which occur with maximal
frequencies surrounded by ontology tags are accepted for inclusion into the ontology.
Some example relations extracted for incorporation into the ontological
structure are: inhibitors_of(other_organic_compound, protein_molecule),
inducers_of(lipid, protein_complex), induces_in(mono_cell, protein_molecule),
activates(protein_molecule, protein_molecule), etc. It may be noted that prepositions
determine the role of a concept in a relation. Information about all known
entities and the documents in which they occur is stored separately in a trie
structure, and not in the ontology. For each entity in the collection, the trie stores the
corresponding tag and its locations in the corpus.
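
A trie of this kind can be sketched as below. The paper does not describe the node layout, so the 7-bit ASCII alphabet, the fixed cap on stored locations, and the names trie_new, trie_add and trie_find are assumptions made for illustration. The tag attached to CD28 in the usage note is likewise only an example.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define ALPHABET 128   /* 7-bit ASCII only; an assumption of this sketch */
#define MAX_LOCS 8     /* small fixed cap to keep the sketch short */

struct location { int doc_id; int line_no; };

struct trie_node {
    struct trie_node *child[ALPHABET];
    const char *tag;                 /* NULL unless an entity ends here */
    struct location locs[MAX_LOCS];  /* where the entity occurs */
    int num_locs;
};

/* Allocate an empty node (calloc zeroes the child pointers on the usual
   platforms where a null pointer is all-bits-zero). */
struct trie_node *trie_new(void)
{
    return calloc(1, sizeof(struct trie_node));
}

/* Record that entity, tagged with tag, occurs at doc_id/line_no.
   The caller keeps the tag string alive. Locations beyond MAX_LOCS are
   dropped. Returns 0 on success, -1 on allocation failure or non-ASCII
   input. */
int trie_add(struct trie_node *root, const char *entity, const char *tag,
             int doc_id, int line_no)
{
    assert(root != NULL && entity != NULL && tag != NULL);
    struct trie_node *n = root;
    for (const unsigned char *p = (const unsigned char *)entity; *p; p++) {
        if (*p >= ALPHABET)
            return -1;
        if (n->child[*p] == NULL && (n->child[*p] = trie_new()) == NULL)
            return -1;
        n = n->child[*p];
    }
    n->tag = tag;
    if (n->num_locs < MAX_LOCS) {
        n->locs[n->num_locs].doc_id = doc_id;
        n->locs[n->num_locs].line_no = line_no;
        n->num_locs++;
    }
    return 0;
}

/* Return the node for entity, or NULL if the entity is not stored. */
const struct trie_node *trie_find(const struct trie_node *root,
                                  const char *entity)
{
    const struct trie_node *n = root;
    for (const unsigned char *p = (const unsigned char *)entity; n && *p; p++)
        n = (*p < ALPHABET) ? n->child[*p] : NULL;
    return (n != NULL && n->tag != NULL) ? n : NULL;
}
```

For instance, after trie_add(kb, "CD28", "protein_molecule", 1, 3), a call to trie_find(kb, "CD28") returns a node carrying the tag and one location, while trie_find(kb, "CD2") returns NULL because no entity ends at that prefix. A look-up costs time proportional to the length of the entity name, which suits the table look-up step of tag prediction.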

7 Query Processor

Query processing is a two-step process: acceptance and analysis of the user query,
and then finding the relevant answer from the structured knowledge base. Query
Processor uses the enhanced ontology structure to allow the user to formulate
queries at multiple levels of specificity. For example, a user may specify an entity
name and ask for documents containing it. However, a user can also be more generic
and pose a query like the one shown earlier, "What are the biological substances
activated by CD28?", which is a combination of generic concepts like biological
substances, specific instances like CD28 and a specific relation "activated by",
which should relate these elements in the document.
Our system stores various permissible templates to enable users to formulate
feasible queries at multiple levels of specificity. As the user formulates a query,
the most appropriate template is selected by the system. Templates could be of
the form <*,relation,Entity>, or <Entity,*,Tag>, or <Tag,relation,Tag> etc.
A query is analyzed and processed by the Answer Generator module to extract
the relevant information from the knowledge base. The query presented above
is passed on to the system through the template <*,relation=activated,Entity=CD28>.
Two of the responses extracted from the local knowledge base are:
1. MEDLINE:95081587: Differential regulation of proto-oncogenes c-jun and
c-fos in T lymphocytes activated through CD28.
2. MEDLINE:96322738: The induction of T cell proliferation requires signals
from the TCR and a co-receptor molecule, such as CD28, that activate parallel
and partially cross-reactive signaling pathways.
Another query, "How are HIV-1 and cell type related?", instantiates the template
<Entity=HIV-1,relation=*,Tag=cell_type>. Some corresponding abstracts extracted
for this are:
1. MEDLINE:95187990: HIV-1 Nef leads to inhibition or activation of T cells
depending on its intracellular localization.
2. MEDLINE:94340810: Superantigens activate HIV-1 gene expression in
monocytic cells.
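
Template matching of the kind described in this section can be sketched as a wildcard comparison over stored relation instances. The record layout, the use of NULL for the "*" slot, and the names fact_matches and count_matches are hypothetical. The paper does not specify how the Answer Generator represents templates internally.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* One relation instance in the knowledge base (hypothetical layout):
   two arguments, each with an entity name and a tag, and a relation. */
struct fact {
    const char *arg1_entity, *arg1_tag;
    const char *relation;
    const char *arg2_entity, *arg2_tag;
};

/* A query template uses the same slots; NULL means '*' (match anything). */
struct query_template {
    const char *arg1_entity, *arg1_tag;
    const char *relation;
    const char *arg2_entity, *arg2_tag;
};

/* A slot matches when it is a wildcard or the strings are equal. */
static int slot_ok(const char *want, const char *have)
{
    return want == NULL || strcmp(want, have) == 0;
}

/* Return 1 when every non-wildcard slot of q equals the slot of f. */
int fact_matches(const struct query_template *q, const struct fact *f)
{
    assert(q != NULL && f != NULL);
    return slot_ok(q->arg1_entity, f->arg1_entity) &&
           slot_ok(q->arg1_tag, f->arg1_tag) &&
           slot_ok(q->relation, f->relation) &&
           slot_ok(q->arg2_entity, f->arg2_entity) &&
           slot_ok(q->arg2_tag, f->arg2_tag);
}

/* Count the facts among facts[0..n-1] that satisfy the template. */
size_t count_matches(const struct query_template *q,
                     const struct fact *facts, size_t n)
{
    size_t hits = 0;
    for (size_t i = 0; i < n; i++)
        hits += (size_t)fact_matches(q, &facts[i]);
    return hits;
}
```

Under this layout, the template <*,relation=activated,Entity=CD28> becomes a record whose relation is "activated", whose arg2_entity is "CD28", and whose remaining slots are NULL. A fuller version would return the matching sentences with their MEDLINE identifiers instead of a count.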

8 System Performance Analysis

To judge the performance of the system, we have randomly selected 10
documents from the GENIA corpus, which consist of 80 sentences and 430 tags. These
documents were cleaned by removing tags from them. Table 1 summarizes the
performance of the entity recognition process over this set. It is observed that
precision is somewhat low because there are many noun phrases that
are not valid entities. To evaluate the effectiveness of Tag Predictor, we have
blocked the table look-up process, because the test documents come from the
GENIA corpus and all their entities can therefore be found
in the knowledge base. Table 2 summarizes the performance of Tag Predictor in
the form of a misclassification matrix. We have considered the fact that, along
with the assignment of correct tags to relevant entities, the system should tag
those entities which are not biological as other_name. Computation of precision
takes this into account.
Table 1. Misclassification Matrix for Entity Recognition

Module             Relevant phrases  Nonrelevant phrases  Relevant phrases  Precision  Recall
                   extracted         extracted            missed            (%)        (%)
Entity Recognizer  417               50                   13                89.29      96.98
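
As a quick check, the figures in Table 1 can be reproduced from its counts, assuming the standard definitions of precision (relevant phrases extracted over all phrases extracted) and recall (relevant phrases extracted over all relevant phrases). The definitions are not spelled out in the text, but they match the reported values:

```latex
\[
\text{Precision} = \frac{417}{417 + 50} = \frac{417}{467} \approx 89.29\%,
\qquad
\text{Recall} = \frac{417}{417 + 13} = \frac{417}{430} \approx 96.98\%
\]
```

The recall denominator, 430, agrees with the 430 tags present in the ten test documents.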

Table 2. Misclassification Matrix for Tag Predictor

Module         Correct     Wrong       Right prediction  Wrong prediction  Precision  Recall
               prediction  prediction  of other_name     of other_name     (%)        (%)
Tag Predictor  389         27          9                 42                90.47      92.29

9 Conclusions and Future Work

In this paper we have presented an ontology-based text-mining system,
BIEQA, which employs deep text mining in association with information about
the likelihood of various entity-relation occurrences to extract information from
biological documents. User interaction with the system is provided through an
ontology-guided interface, which enables the user to pose queries at various levels
of specificity including combinations of specific entities, tags and/or relations.
One of the unique aspects of our system lies in its capability to enhance
existing ontological structures with new information that is extracted from
documents. Presently, we are working towards generating a fuzzy ontology structure,
in which relations can be stored along with their strengths. The strength of a relation
can thereby play a role in determining the relevance of a document to a user query.
Query Processor is also being enhanced to tackle a wide range of structured
natural language queries. Further work on implementing better entity recognition
rules is also under way.

Acknowledgements. The authors would like to thank the Ministry of Human Resources
and Development, Government of India, for providing financial support
for this work (#RP01537).

References
1. Broekstra, J., Klein, M., Decker, S., Fensel, D., van Harmelen, F., Horrocks, I.:
Enabling Knowledge Representation on the Web by Extending RDF Schema. In
Proceedings of the 10th Int. World Wide Web Conference, Hong Kong (2001) 467-478
2. Collier, N., Nobata, C., Tsujii, J.: Extracting the Names of Genes and Gene
Products with a Hidden Markov Model. In Proceedings of the 18th Int. Conference on
Computational Linguistics (COLING 2000), Saarbrücken, Germany (2000) 201-207
3. Craven, M., Kumlien, J.: Constructing Biological Knowledge Bases by Extracting
Information from Text Sources. In Proceedings of the 7th Int. Conference on
Intelligent Systems for Molecular Biology (ISMB 99), Heidelberg, Germany (1999)
77-86
4. Friedman, C., Kra, P., Yu, H., Krauthammer, M., Rzhetsky, A.: GENIES: A
Natural-Language Processing System for the Extraction of Molecular Pathways
from Biomedical Texts. Bioinformatics 17 (2001) 74-82
5. Fukuda, K., Tsunoda, T., Tamura, A., Takagi, T.: Toward Information
Extraction: Identifying Protein Names from Biological Papers. In Pacific Symposium on
Biocomputing, Maui, Hawaii (1998) 707-718
6. Gavrilis, D., Dermatas, E., Kokkinakis, G.: Automatic Extraction of Information
from Molecular Biology Scientific Abstracts. In Proceedings of the Int. Workshop
on SPEECH and COMPUTERS (SPECOM 03), Moscow, Russia (2003)
7. Kim, J.D., Ohta, T., Tateisi, Y., Tsujii, J.: GENIA corpus - A Semantically
Annotated Corpus for Bio-Textmining. Bioinformatics 19(1) (2003) 180-182
8. Ono, T., Hishigaki, H., Tanigami, A., Takagi, T.: Automated Extraction of
Information on Protein-Protein Interactions from the Biological Literature. Bioinformatics
17(2) (2001) 155-161
9. Rinaldi, F., Schneider, G., Andronis, C., Persidis, A., Konstani, O.: Mining Relations
in the GENIA Corpus. In Proceedings of the 2nd European Workshop on Data
Mining and Text Mining for Bioinformatics, Pisa, Italy (2004) 61-68
10. Stapley, B.J., Benoit, G.: Bibliometrics: Information Retrieval and Visualization
from Co-occurrence of Gene Names in MedLine Abstracts. In Proceedings of the
Pacific Symposium on Biocomputing, Oahu, Hawaii (2000) 529-540
11. Su, K., Wu, M., Chang, J.: A Corpus-Based Approach to Automatic Compound
Extraction. In Proceedings of the 32nd Annual Meeting of the Association for
Computational Linguistics (ACL 94), Las Cruces, New Mexico, USA (1994) 242-247
