Introduction
Let us illustrate our point of view through an example. Suppose a user poses
a query like "What are the biological substances activated by CD28?". A
specific answer like "Differential regulation of proto-oncogenes c-jun and c-fos
in T lymphocytes activated through CD28" can then be extracted from the repository,
provided the system knows that the proto-oncogenes c-jun and c-fos are biological
substances. Kim et al. [7] suggest that such an answer can be produced if the
elements in the text are appropriately annotated. Though this scheme can solve
the information extraction problem very effectively, some of the key challenges
in designing such systems are: (i) recognizing the entities to be tagged,
(ii) automatically tagging the identified entities, (iii) choosing a base ontology that can
be used to annotate the text effectively, and (iv) assimilating new domain knowledge.
In this paper we have proposed an ontology-based Biological Information Extraction and Query Answering (BIEQA) System that tries to address the above
problems within a unified framework. The system exploits information about
domain ontology concepts and relations in conjunction with lexical patterns
obtained from biological text documents. It extracts and structures information about biological entities and relationships among them within a pre-defined
knowledge base model. The underlying ontology is enhanced using knowledge
about feasible relations discovered from the texts. The system employs efficient
data structures to store the mined knowledge in structured form so that biological queries can be answered efficiently.
The rest of the paper is organized as follows. We review some related works on
biological information extraction in section 2. The architecture of the complete
system is presented in section 3. Sections 4 through 7 present functional details
of each module. The performance of the system is discussed in section 8. Finally,
we conclude and discuss future work in section 9.
Related Work
In contrast to the systems discussed earlier, BIEQA is conceived as a completely automated information extraction system that can build a comprehensive
knowledge base of biological information to relieve users from information overload. The system uses the GENIA ontology as its starting domain knowledge
repository.
Fig. 1 represents the complete architecture of the BIEQA system. Though
the primary role of the system is to answer queries efficiently, initially the system is trained to tag new documents automatically, for the ease of information
extraction. The system has four main components. The first component, Tag
Analyzer, has dual responsibilities. During system training, it operates on a set
of tagged texts to extract relationships between tags and other lexical patterns
within texts. Tag Analyzer also extracts key biological information from tagged
texts in the form of entities and their relations. This is stored in the knowledge
base to answer user queries efficiently. The second module, Tag Predictor, tags
new documents using the knowledge about tagging extracted by the analyzer.
This module uses a combination of table look-up and maximum-likelihood based
prediction techniques. The system is equipped with a powerful entity recognizer
for identifying candidate phrases in new documents for both simple and nested
tagging. One of the unique aspects of our system is the integration of a module
called Knowledge Base Enhancer which has the capability to enhance existing
ontological structures with new information that is extracted from documents.
Query Processor, the fourth module, extracts relevant portions of documents
from a local database, along with their MEDLINE references. User interaction
with the system is provided through a guided query interface. The interface al-
Tag Analyzer
As already mentioned, the Tag Analyzer has two main functions. During training, the analyzer helps in extracting statistical information about tag-word co-occurrences from manually tagged documents. To start with, the Tag Analyzer
filters stop words from the corpus. We have considered most of the stop words
used by the PubMed database. After that, it assigns a document id, a name (Medline
number), and a line number to every sentence of the corpus. Then it extracts
the contents of a sentence and stores them in a tree structure, which is defined as
follows:
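struct tag {
    char *name;               /* tag name */
    struct tag *lchild;
    struct tag *rchild;
    struct tag *InnerTag;     /* tag nested inside this one */
};
struct segment {
    char *non_tagged_text;    /* text outside any tag */
    struct tag *tags;         /* tags found in this segment */
};
struct segment sentence[no_of_segments];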
This tree is used by Tag Predictor and Knowledge Base Enhancer. Fig. 2
shows a sample tree structure created by Tag Analyzer corresponding to a tagged
sentence picked up from MEDLINE:95197524.
Fig. 2. Tree structure generated with a sample sentence. [] indicates tags. Entities are
in boldface.
Tag Predictor
The main function of the Tag Predictor is to locate entities in a biological document
and tag them according to the GENIA ontology. During the training phase of the
system, this module works on the tree structure generated by Tag Analyzer to
extract statistical information for tagging. After training, it applies the extracted
statistical knowledge to tag the biological entities appearing in a document.
To identify entities in documents, this module also implements a document
processor. Since tagging is based on neighboring words, it also includes
a neighborhood extractor, which stores information about occurrences of terms
surrounding an entity. We outline the functions of each of these components in
the following subsections.
5.1 Document Processor
The main function of this module is to locate entities. Since entities are either
nouns or noun phrases, the document processor includes a Parts-Of-Speech
(POS) tagger that assigns parts of speech to English words based on the context in which they appear. We have used a web-based tagger, developed by the
Specialized Information Services Division (SIS) of the National Library of Medicine (NLM) (http://tamas.nlm.nih.gov), to locate and extract noun phrases that are present in the text
document. In addition to identifying nouns and noun phrases as named
entities, this module also contains a set of rules to handle special characters,
which are very common in biological documents. Gavrilis et al. [6] have proposed
a rule set for pre-processing biological documents. We have modified this rule
set and added some new rules. These rules were identified after analysis of
100 manually tagged documents picked at random from the GENIA corpus. The
modified rule set is shown in Fig. 3. These rules help in generating equivalent
names from a number of different appearances of biological names, and in extracting different instances from concatenations and commonly used abbreviations.
For example, concatenated instances like B- and T-cell or gp350/220 are converted into B-cell and T-cell, and gp350/gp220 respectively. The complete list of
entities is identified by applying the above-mentioned rules in conjunction with
the identified noun phrases. Our method is capable of identifying both simple
and nested entities to be tagged. All stop words are filtered from the documents
and the processed document is passed on to the Neighborhood Extractor for
extraction of bigrams. The pre-processor also assigns a document id and name
(generally the Medline number), breaks the document into sentences, and assigns
each sentence a unique line number within the document, so that the combination of the
document id and line number forms a primary key.

Fig. 3. Entity recognition rules and a sample sentence with entities identified for tagging indicated by [ ]
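The full rule set is given only in Fig. 3, so the following is merely a sketch of what one such rule could look like, modelled on the gp350/220 example in the text. The function name expand_slash and the exact matching conditions (an alphabetic prefix followed by a slash and a digit) are assumptions made for illustration, not the authors' rule.

```c
#include <ctype.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical rule: copy the alphabetic prefix of the first name onto the
   part after the slash, e.g. "gp350/220" becomes "gp350/gp220".
   Returns 0 on success, -1 if the rule does not apply or 'out' is too small. */
static int expand_slash(const char *in, char *out, size_t out_size)
{
    const char *slash = strchr(in, '/');
    size_t prefix = 0;
    int n;

    /* The rule applies only when a digit directly follows the slash. */
    if (slash == NULL || !isdigit((unsigned char)slash[1]))
        return -1;

    /* Length of the leading alphabetic prefix ("gp" in "gp350"). */
    while (isalpha((unsigned char)in[prefix]))
        prefix++;
    if (prefix == 0)
        return -1;

    /* first name + "/" + prefix + remainder after the slash */
    n = snprintf(out, out_size, "%.*s/%.*s%s",
                 (int)(slash - in), in, (int)prefix, in, slash + 1);
    return (n >= 0 && (size_t)n < out_size) ? 0 : -1;
}
```

A caller would pass each candidate noun phrase through such rules before handing it to the Neighborhood Extractor; phrases that no rule matches are left unchanged.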
Initially, each entity is assigned an arbitrary tag. This tag acts as a placeholder that helps in identifying locations in the sentence where tags will be
inserted. The lower box in Fig. 3 shows a sample sentence along with its entities
identified and marked for tag prediction. The sentence, with expected tag positions marked, is now converted into a tree structure similar to the one described in
Section 4, though tag names are not yet known. This tree encodes all needed
The GENIA ontology defines 36 tag classes, including the other_name tag. Tag
Predictor has been implemented to predict the most likely and most specific
tag from this collection. For all known entities, the tag is looked up from the
knowledge base. Otherwise, the tag prediction process uses a maximum-likelihood based model to predict an entity $E$'s tag based on the information about
its neighboring words. The task is to maximize $L(T \mid E_n)$, where $T$ stands for a
tag and $E_n$ for the neighbourhood of the entity. $L(T \mid E_n)$ is computed using bigrams of words and/or tags. We define $E_n$ as a 4-tuple (two left words and two
right words) that surrounds the entity $E$ to be tagged. A 4-tuple denoted by $\langle \mathrm{LHW}{-}1, \mathrm{LHW}, \mathrm{RHW}, \mathrm{RHW}{+}1 \rangle$ specifies the state of the system. Tag Predictor generates the 4-tuple set for $E$ through preorder traversal of the tree mentioned earlier. The likelihood of tag $T_i$ is computed using the following equation:

$$L(T_i \mid \mathrm{LHW}{-}1, \mathrm{LHW}, \mathrm{RHW}, \mathrm{RHW}{+}1) = P(T_i \mid \mathrm{LHW}{-}1) + P(T_i \mid \mathrm{LHW}) + P(T_i \mid \mathrm{RHW}) + P(T_i \mid \mathrm{RHW}{+}1)$$

where $P(\cdot \mid \cdot)$ is calculated from training data as follows:

$$P(T_i \mid t) = \frac{N(T_i, t)}{\sum_{j=1}^{36} N(T_j, t)}$$

where $N(T_k, t)$ represents the number of times a tag $T_k$ is observed in combination with term $t$ in the given position in the training data.
For any entity, one or more of its neighboring words may be entities themselves, which also need to be tagged. In this case we use the tag that
has the maximum probability of occurrence in the corresponding position, as
obtained from training data. The tag with the maximum likelihood value is
assigned to the new entity, provided that value is greater than a given threshold. If an entity
cannot be associated with any tag, it is tagged as other_name.
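The prediction step described above can be sketched in C as follows. This is only an illustration under stated assumptions: the layout of the training counts (one array of per-tag counts for each of the four neighbour positions), the function names, and the convention of returning -1 for other_name are all hypothetical and not taken from the BIEQA implementation.

```c
#include <stddef.h>

#define N_POSITIONS 4   /* LHW-1, LHW, RHW, RHW+1 */

/* P(T_i | t) = N(T_i, t) / sum_j N(T_j, t), computed from one row of
   training counts. Returns 0 when the term was never seen in this position. */
static double cond_prob(const unsigned *counts, size_t n_tags, size_t i)
{
    unsigned long total = 0;
    for (size_t j = 0; j < n_tags; j++)
        total += counts[j];
    return total ? (double)counts[i] / (double)total : 0.0;
}

/* Sum of the four positional probabilities for tag i. A NULL entry means
   there is no neighbour in that position (assumed convention). */
static double tag_likelihood(const unsigned *const ctx[N_POSITIONS],
                             size_t n_tags, size_t i)
{
    double sum = 0.0;
    for (int p = 0; p < N_POSITIONS; p++)
        if (ctx[p] != NULL)
            sum += cond_prob(ctx[p], n_tags, i);
    return sum;
}

/* Index of the tag with the largest likelihood above the threshold,
   or -1 when no tag qualifies (the caller would then use other_name). */
static int predict_tag(const unsigned *const ctx[N_POSITIONS],
                       size_t n_tags, double threshold)
{
    int best = -1;
    double best_score = threshold;
    for (size_t i = 0; i < n_tags; i++) {
        double s = tag_likelihood(ctx, n_tags, i);
        if (s > best_score) {
            best_score = s;
            best = (int)i;
        }
    }
    return best;
}
```

With the GENIA tag set, n_tags would be 36. Because the four probabilities are added rather than multiplied, a neighbour that was never seen in training contributes zero instead of vetoing the tag outright.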
the whole collection, those relational verbs which occur with maximal frequencies surrounded by ontology tags are accepted for inclusion into the ontology.
Some example relations extracted for incorporation to enhance the ontological structure are: inhibitors_of(other_organic_compound, protein_molecule), inducers_of(lipid, protein_complex), induces_in(mono_cell, protein_molecule), activates(protein_molecule, protein_molecule), etc. It may be noted that prepositions
determine the role of a concept in a relation. Information about all known entities and the documents in which they occur is stored separately in a trie
structure, and not in the ontology. For each entity in the collection, the trie stores the
corresponding tag and its locations in the corpus.
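A trie of this kind might look like the following C sketch. The node layout, the 7-bit ASCII alphabet, the fixed cap on stored locations, and all names (trie_node, trie_insert, trie_find) are assumptions chosen to keep the example short; the paper does not describe its trie at this level of detail.

```c
#include <stdlib.h>

#define ALPHABET 128   /* assumption: entity names are 7-bit ASCII */
#define MAX_LOCS 8     /* assumption: small fixed cap, for brevity only */

struct location { int doc_id; int line_no; };

struct trie_node {
    struct trie_node *child[ALPHABET];
    const char *tag;                 /* non-NULL only where an entity ends */
    struct location locs[MAX_LOCS];  /* (document id, line number) pairs */
    int n_locs;
};

static struct trie_node *trie_new(void)
{
    return calloc(1, sizeof(struct trie_node));
}

/* Record one occurrence of an entity. The tag string is not copied, so it
   must outlive the trie. Returns 0 on success, -1 on failure. */
static int trie_insert(struct trie_node *root, const char *entity,
                       const char *tag, int doc_id, int line_no)
{
    struct trie_node *n = root;
    for (const unsigned char *p = (const unsigned char *)entity; *p; p++) {
        if (*p >= ALPHABET)
            return -1;
        if (n->child[*p] == NULL && (n->child[*p] = trie_new()) == NULL)
            return -1;
        n = n->child[*p];
    }
    n->tag = tag;
    if (n->n_locs < MAX_LOCS) {
        n->locs[n->n_locs].doc_id = doc_id;
        n->locs[n->n_locs].line_no = line_no;
        n->n_locs++;
    }
    return 0;
}

/* Return the node for a known entity, or NULL if it was never inserted. */
static const struct trie_node *trie_find(const struct trie_node *root,
                                         const char *entity)
{
    const struct trie_node *n = root;
    for (const unsigned char *p = (const unsigned char *)entity; *p && n; p++)
        n = (*p < ALPHABET) ? n->child[*p] : NULL;
    return (n != NULL && n->tag != NULL) ? n : NULL;
}
```

Lookup cost depends only on the length of the entity name, which suits the table look-up step of the Tag Predictor. Memory release is omitted here; a real implementation would also need a growable location list.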
Query Processor
Query processing is a two-step process: acceptance and analysis of the user query,
and then finding the relevant answer from the structured knowledge base. Query
Processor uses the enhanced ontology structure to allow the user to formulate
queries at multiple levels of specificity. For example, a user may specify an entity
name and ask for documents containing it. However, a user can be more generic
and post a query like the one shown earlier, "What are the biological substances
activated by CD28?", which is a combination of generic concepts like biological
substances, specific instances like CD28, and a specific relation, activated by,
which should relate these elements in the document.
Our system stores various permissible templates to enable users to formulate
feasible queries at multiple levels of specificity. As the user formulates a query,
the most appropriate template is selected by the system. Templates could be of
the form <*,relation,Entity>, or <Entity,*,Tag>, or <Tag,relation,Tag>, etc.
A query is analyzed and processed by the Answer Generator module to extract
the relevant information from the knowledge base. The query presented above
is passed on to the system through the template <*,relation=activated,Entity=CD28>.
Two of the responses extracted from the local knowledge base are:
1. MEDLINE:95081587: Differential regulation of proto-oncogenes c-jun and
c-fos in T lymphocytes activated through CD28.
2. MEDLINE:96322738: The induction of T cell proliferation requires signals
from the TCR and a co-receptor molecule, such as CD28, that activate parallel and partially cross-reactive signaling pathways.
Another query How are HIV-1 and cell type related? instantiates the template
<Entity=HIV-1,relation=*,Tag=Cell type>. Some corresponding abstracts extracted for this are:
1. MEDLINE:95187990: HIV-1 Nef leads to inhibition or activation of T cells
depending on its intracellular localization.
2. MEDLINE:94340810: Superantigens activate HIV-1 gene expression in
monocytic cells.
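Template instantiation and matching of this kind can be sketched in C as below. The representation is an assumption made for illustration: each template slot is a wildcard, an entity name, or a tag name; a NULL relation stands for "*"; and the knowledge base is taken to hold facts of the form (left argument, relation, right argument). None of the type or function names come from the paper.

```c
#include <string.h>

enum slot_kind { SLOT_ANY, SLOT_ENTITY, SLOT_TAG };

/* One side of a template: "*", Entity=value, or Tag=value. */
struct slot { enum slot_kind kind; const char *value; };

/* A query template such as <*,relation=activated,Entity=CD28>.
   relation == NULL plays the role of "*". */
struct query_template {
    struct slot left;
    const char *relation;
    struct slot right;
};

/* A stored relation instance; each argument carries its name and tag. */
struct arg  { const char *entity; const char *tag; };
struct fact { struct arg left; const char *relation; struct arg right; };

static int slot_matches(const struct slot *s, const struct arg *a)
{
    switch (s->kind) {
    case SLOT_ENTITY: return strcmp(s->value, a->entity) == 0;
    case SLOT_TAG:    return strcmp(s->value, a->tag) == 0;
    default:          return 1;   /* SLOT_ANY matches everything */
    }
}

/* Nonzero when the fact satisfies every constrained part of the template. */
static int template_matches(const struct query_template *q,
                            const struct fact *f)
{
    return slot_matches(&q->left, &f->left)
        && (q->relation == NULL || strcmp(q->relation, f->relation) == 0)
        && slot_matches(&q->right, &f->right);
}
```

Answering a query then amounts to scanning the stored facts with template_matches and returning the sentences and MEDLINE references attached to the matching ones. A production system would index facts by relation or entity rather than scan linearly.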
To judge the performance of the system, we have randomly selected 10 documents from the GENIA corpus that consist of 80 sentences and 430 tags. These
documents were cleaned by removing tags from them. Table 1 summarizes the
performance of the entity recognition process over this set. It is observed that precision is somewhat low due to the fact that there are many noun phrases that
are not valid entities. To evaluate the effectiveness of Tag Predictor, we have
blocked the table look-up process because in this case all entities can be found
in the knowledge base. Table 2 summarizes the performance of Tag Predictor in
the form of a misclassification matrix. We have considered the fact that along
with the assignment of correct tags to relevant entities, the system should tag
those entities which are not biological as other_name. Computation of precision
takes this into account.
Table 1. Misclassification Matrix for Entity Recognition

Module             Relevant phrases  Nonrelevant phrases  Relevant phrases  Precision  Recall
                   extracted         extracted            missed            (%)        (%)
Entity Recognizer  417               50                   13                89.29      96.98

Table 2. Misclassification Matrix for Tag Predictor

Module         Correct     Wrong       Right prediction  Wrong prediction  Precision  Recall
               prediction  prediction  of other_name     of other_name     (%)        (%)
Tag Predictor  389         27          9                 42                90.47      92.29
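As a consistency check on the Entity Recognizer figures, and assuming the usual definitions of precision and recall (the text does not spell them out), the reported values follow directly from the counts of relevant phrases extracted (417), nonrelevant phrases extracted (50), and relevant phrases missed (13):

```latex
\[
\text{Precision} = \frac{417}{417 + 50} = \frac{417}{467} \approx 89.29\%,
\qquad
\text{Recall} = \frac{417}{417 + 13} = \frac{417}{430} \approx 96.98\%
\]
```

The recall denominator, 430, agrees with the number of tags in the test set.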
Acknowledgements. The authors would like to thank the Ministry of Human Resources and Development, Government of India, for providing financial support
for this work (#RP01537).
References
1. Broekstra, J., Klein, M., Decker, S., Fensel, D., van Harmelen, F., Horrocks, I.:
Enabling Knowledge Representation on the Web by Extending RDF Schema. In
Proceedings of the 10th Int. World Wide Web Conference, Hong Kong (2001) 467-478
2. Collier, N., Nobata, C., Tsujii, J.: Extracting the Names of Genes and Gene Products with a Hidden Markov Model. In Proceedings of the 18th Int. Conference on
Computational Linguistics (COLING 2000), Saarbrücken, Germany (2000) 201-207
3. Craven, M., Kumlien, J.: Constructing Biological Knowledge Bases by Extracting
Information from Text Sources. In Proceedings of the 7th Int. Conference on Intelligent Systems for Molecular Biology (ISMB'99), Heidelberg, Germany (1999)
77-86
4. Friedman, C., Kra, P., Yu, H., Krauthammer, M., Rzhetsky, A.: GENIES: A
Natural-Language Processing System for the Extraction of Molecular Pathways
from Biomedical Texts. Bioinformatics 17 (2001) 74-82
5. Fukuda, K., Tsunoda, T., Tamura, A., Takagi, T.: Toward Information Extraction: Identifying Protein Names from Biological Papers. In Pacific Symposium on
Biocomputing, Maui, Hawaii (1998) 707-718
6. Gavrilis, D., Dermatas, E., Kokkinakis, G.: Automatic Extraction of Information
from Molecular Biology Scientific Abstracts. In Proceedings of the Int. Workshop
on SPEECH and COMPUTERS (SPECOM'03), Moscow, Russia (2003)
7. Kim, J.D., Ohta, T., Tateisi, Y., Tsujii, J.: GENIA corpus - A Semantically Annotated Corpus for Bio-Textmining. Bioinformatics 19(1) (2003) 180-182
8. Ono, T., Hishigaki, H., Tanigami, A., Takagi, T.: Automated Extraction of Information on Protein-Protein Interactions from the Biological Literature. Bioinformatics
17(2) (2001) 155-161
9. Rinaldi, F., Schneider, G., Andronis, C., Persidis, A., Konstanti, O.: Mining Relations
in the GENIA Corpus. In Proceedings of the 2nd European Workshop on Data
Mining and Text Mining for Bioinformatics, Pisa, Italy (2004) 61-68
10. Stapley, B.J., Benoit, G.: Biobibliometrics: Information Retrieval and Visualization
from Co-occurrence of Gene Names in MedLine Abstracts. In Proceedings of the
Pacific Symposium on Biocomputing, Oahu, Hawaii (2000) 529-540
11. Su, K., Wu, M., Chang, J.: A Corpus-Based Approach to Automatic Compound
Extraction. In Proceedings of the 32nd Annual Meeting of the Association for Computational Linguistics (ACL'94), Las Cruces, New Mexico, USA (1994) 242-247