
Integrating GO annotations with expression data to analyze microarray experiments


Pedro Carmona-Sáez 1, Mónica Chagoyen 1, Andrés Rodríguez 2, Oswaldo Trelles 2, José María Carazo 1 and Alberto Pascual-Montano 1*

1 BioComputing Unit, Centro Nacional de Biotecnología (CNB-CSIC), Cantoblanco, 28049 Madrid, Spain.
2 Computer Architecture Department, Universidad de Málaga, 29080 Málaga, Spain.
* To whom correspondence should be addressed: pascual@cnb.uam.es

Abstract. Using biological knowledge about genes and gene products in microarray data analysis is necessary when the aim is to discover the biological patterns hidden in gene expression profiles. In this paper we present a data mining approach that integrates different data sources in order to extract associations among gene characteristics and expression patterns. An implementation of this method is included in the Engene™ software package [1], which is freely accessible upon request at http://www.engene.cnb.uam.es.

1. INTRODUCTION

Microarray technology has become a very popular method for measuring the expression levels of thousands of genes, or of whole genomes, in a single experiment. As a result, large gene expression datasets containing expression data for thousands of genes across tens or hundreds of experimental conditions are being generated. A critical issue in the analysis of microarray data is to discover biological patterns from gene expression profiles. Several research groups have approached this problem using clustering techniques to group genes that share similar expression profiles. Co-clustered genes can then be examined in order to discover biological connections among them, for example common upstream sequence motifs [2]. Identifying the biological relationships that may exist among large sets of genes is not a trivial task. In this context, different bioinformatics tools have been developed to connect lists of genes to biological databases and extract the most relevant biological themes [3-7]. Nevertheless, researchers also frequently annotate co-clustered genes one by one using biological databases or literature searches [7]. In this way, expression data are analyzed independently and biological knowledge is introduced into the analysis as a subsequent step. Although this is a widely used and very often successful strategy, it has some drawbacks [8]. Genes that share similar expression levels and are grouped into the same cluster might not be involved in the same biological process and, conversely, genes with different expression profiles can be functionally related. In addition, grouping genes into separate, unrelated clusters does not take into account the fact that many gene products participate in more than one biological process. In general, the quality of clusters and their ability to explain biological function can vary greatly [9]. A challenge in this field is to develop methods able to integrate biological knowledge with expression data.

In this paper we show the application of an unsupervised data mining method, the Association Rules Discovery technique (ARD), to extract intrinsic associations among biological properties of genes and expression patterns. For this purpose, we have used the "biological process" category of Gene Ontology as gene attributes. The Gene Ontology (GO) database is one of the most widely used information sources for categorizing and annotating gene products. Indeed, in the last few years, different approaches have shown the great potential of GO annotations for analyzing gene expression datasets [5, 10-14]. The aim of this work is to show an implementation of ARD able to extract relevant associations among GO annotations and expression data. An interesting property of the method is that each gene can be annotated with several terms, and all of them will be taken into account to extract those associations that are frequent and highly related according to user-specified thresholds.

2. METHODS

2.1 Association Rules Discovery and transaction database for microarray data

Association Rules Discovery is a data mining technique oriented towards finding interesting associations or correlation relationships among a large set of data, for example by identifying sets of attribute values (items) that frequently occur together in the same transaction and then formulating rules that characterize these relationships. The problem can be stated formally as follows. Let I = {i_1, i_2, ..., i_n} be a set of literals called items. Let S be a set of transactions, where each transaction T is a set of items such that T ⊆ I. We say that a transaction T contains a set X of items in I if X ⊆ T. An association rule is an implication of the form X → Y, where X ⊆ I, Y ⊆ I and X ∩ Y = ∅. The rule X → Y has support s if s% of the transactions in S contain X ∪ Y. The left-hand side of the rule is called the antecedent and the right-hand side is called the consequent. Such rules are usually interpreted as follows: when X occurs, it is often the case that Y also occurs in the same transaction.

Different measures can be used to point out the relevance of a rule. The most important ones are support (previously defined), confidence and improvement. The confidence of the rule is defined as

$$\mathrm{confidence}(X \rightarrow Y) = \frac{P(X \cup Y)}{P(X)}$$

where P(Z) denotes the fraction of transactions in S that contain the itemset Z. Confidence is thus the probability that a transaction which contains X also contains Y. Support and confidence are the most common and, in many cases, the only parameters used to mine meaningful association rules. However, it is important to note that both of these measures can be high, suggesting a good association, while the rule is still not useful. This happens when the elements of the consequent are very frequent in the transaction database. For example, consider the following rule: {A → C, B} with support = 60% and confidence = 80%. This rule indicates that in 80% of the transactions containing A, B and C are also present. The above rule
looks like a good rule, but is it really a relevant association if B and C are present in 100% of the total transactions? Thus, a third measure of the quality of the rule is needed: the improvement. Improvement can be defined as

$$\mathrm{improvement}(X \rightarrow Y) = \frac{P(X \cup Y)}{P(X)\,P(Y)}$$

A rule with an improvement less than 1 does not indicate a real correlation between antecedent and consequent. On the contrary, when improvement is greater than 1, the antecedent predicts the consequent better than the overall frequency of the consequent in the database would.

Given a set of transactions S, the problem of mining association rules is to generate all associations that have support greater than the user-specified minimum support. The first and key step in the generation of association rules is the mining of frequent items in the database. Once the frequent items are located, the subsequent rules can be formed straightforwardly among them [15, 16].

To generate association rules effectively from all possible combinations of items, Agrawal proposed the Apriori algorithm [15]. The rationale of this method is to reduce the number of frequent candidate items used for creating the rules by eliminating those that do not satisfy a minimum frequency constraint. The Apriori algorithm and its many improved variants use a breadth-first approach to explore the frequent-pattern search space, which makes them inefficient when dealing with dense datasets, many frequent items, long patterns or very low support thresholds. On the other hand, recent depth-first approaches [17] show limitations in dealing with large sparse databases.

In addition to the exploration approach (breadth first or depth first), counting the frequency of items requires choosing a method and a data structure for projecting the transactions of the database onto the representation of the frequent-items search space. As can be gathered from the literature [18], no single combination of exploration approach and projection method is good for all situations, so the algorithm implementation has to be selected to maximize efficiency for the particular application case.

In this work, we have developed our own association rule discovery algorithm, a solution especially suitable for data collections in which elements appear clustered in sparse but strongly related groups, as is the case with gene expression data. The algorithm is efficient, in particular when searching for low-support or rare data patterns, for which the explosive growth of the search space can make its exploration extremely expensive in memory and CPU requirements.

Our method for mining frequent patterns is based on a breadth-first approach, but the breadth extension is limited to the items that are actually present in the database [19]. It directly uses the content of the database (the transactions) to drive the search procedure, performing an on-the-fly projection. This enables a memory- and CPU-efficient exploration of rare and infrequent association rules, which can have high levels of confidence and be of great interest to the researcher.

What exactly constitutes an item or a transaction depends on the application. To extract the type of rules that we propose in this work, a microarray dataset is transformed into a transaction database in which genes represent transactions and the set of experiments in which each gene is over- or under-expressed represents the itemset. In this transaction database gene attributes can also be included as items (see Table 1), and association rules of the form {Characteristic X → [+]Exp.A, [+]Exp.B, [-]Exp.C} can be extracted. These rules are read as "most of the genes annotated with characteristic X were over-expressed in experimental conditions A and B and under-expressed in experiment C".

To construct the transaction database from microarray data, the expression matrix has to be transformed into a Boolean matrix. For this purpose we can use statistical methods to detect differentially expressed genes or simply use a threshold value [16, 20]. In this particular application, we have used two expression thresholds. Genes with an expression value greater than the positive threshold were considered expressed; likewise, genes with an expression value lower than the negative threshold were considered inhibited. Values between the two thresholds were considered neither expressed nor inhibited.

Table 1 Example of transaction database used to extract association rules among gene attributes and expression patterns.

Transaction   Itemset
Gene 1        [+]Exp.A, [+]Exp.B, [-]Exp.C, characteristic 1, characteristic 2
Gene 2        [+]Exp.B, [-]Exp.C, characteristic 3, characteristic 4
Gene 3        [+]Exp.A, [+]Exp.B, [-]Exp.C, [+]Exp.D, characteristic 1, characteristic 3
Gene 4        [-]Exp.A, [-]Exp.B, [-]Exp.C, characteristic 6
Gene 5        [+]Exp.A, [+]Exp.B, [-]Exp.C, characteristic 1, characteristic 4
Gene 6        [+]Exp.A, [+]Exp.B, [-]Exp.C, characteristic 1, characteristic 7

2.2 Filtering rules

One of the main drawbacks of association rule extraction is the huge number of associations generated, and the fact that many of the extracted rules contain redundant information. We applied different filter options to process the obtained rules.
- Redundant consequent filter: We consider a rule to be redundant if its consequent is contained in the consequent of another rule with the same antecedent and with equal or higher values for support, confidence and improvement. For example, for the association rules X: {A → C, D, F}, Y: {A → D} and Z: {A → D, F}, Y and Z are redundant if their values for support, confidence and improvement are less than or equal to the corresponding values of X.
- GO filter: This filter eliminates those rules that have the same values for support, confidence and improvement and the same consequent as another rule, but whose antecedent contains parent terms in the GO hierarchy. For example, for the two rules X: {G2 phase of mitotic cell cycle → A} and Y: {cell cycle → A} with the same support, confidence and improvement, Y is filtered out because the GO annotation contained in X, "G2 phase of mitotic cell cycle", is more specific than the "cell cycle" term.
- Single antecedent filter: This option filters out all rules whose antecedent contains more than one item. It is applied to extract independent information about different biological processes.

2.3 Statistical significance of extracted rules

Although the previously described support and improvement values provide information about the association between the antecedent and consequent parts of the rule, they do not inform about its statistical significance [21]. The statistical significance of a rule was evaluated here in terms of the statistical dependency between the antecedent and the consequent of the rule. To calculate it, we used the χ² test for statistical independence [21, 22].

A p-value associated with each rule is computed under the assumption that the null hypothesis of the test is true (the antecedent and consequent parts of the rule are independent). For a p-value of 0.01 we can reject the null hypothesis with 99% confidence. In our case, all rules that passed the filter conditions are statistically significant according to this test.

2.4 Expression dataset and annotation of GO terms

Iyer et al. monitored mRNA levels of human fibroblasts in response to serum in a time course experiment. This serum stimulation dataset [23] is publicly available at http://genome-www.stanford.edu/serum/. Threshold values of 1 for over-expression and -1 for down-expression were selected to generate the transaction dataset. A value between -1 and 1 is neither up nor down. Experiment names represent the time at which the sample was harvested after serum exposure. Missing values were filled by k-
nearest neighbours approach with k = 10 [24], and mean expression levels were calculated for replicate samples.

Biological processes associated with each gene were annotated based on the controlled vocabulary of the Gene Ontology Consortium (GO) [25]. For each gene we included all annotations in the biological process category. Moreover, this was done at different levels of the GO hierarchy (levels 3, 4 and 5) using annotations obtained with the DAVID program [3].

3. RESULTS

To test the potential of ARD for extracting significant biological associations from gene expression data, we applied it to the serum stimulation dataset reported by Iyer et al. [23]. Genes from this dataset were annotated with their associated GO terms. We have focused our attention on finding rules among biological processes and expression patterns using the "biological process" category of the GO ontology. The Gene Ontology Consortium has developed a standardized and dynamic vocabulary about gene products in several organisms, organized in three categories: Molecular Function, Biological Process and Cellular Component [25]. The three GO categories are hierarchically structured. At low levels in the hierarchy the functional annotations are more specific but the number of annotated genes decreases. Moreover, one gene can be annotated with different GO terms at different levels of the GO hierarchy.

This hierarchical nature of the GO ontology, and the fact that the same gene can be annotated into different categories, poses a problem if only one GO term is considered for each gene [26]. For example, a particular gene could be annotated into the "fibroblast proliferation" category (GO:0048144) and another one into "cell proliferation" (GO:0008283), which is the immediate parent category. Obviously, these two genes are functionally related, but formally they have different annotations. Moreover, one gene can be annotated with different terms, such as p21(WAF1), which is involved in induction of apoptosis and regulation of cell cycle. Because each term is considered as a single item, ARD allows users to annotate each gene with several attributes, and all of them will be taken into account to discover associations. Association rules will be extracted for those terms that appear in a significant number of co-expressed genes and are highly associated with their respective expression pattern. We used all annotations included for each gene at different levels of the GO hierarchy using the DAVID program [3]. Of the 517 genes, 247 were associated with at least one GO term. Most of the genes were annotated with several terms of the "biological process" category; for example, ATP citrate lyase (ACLY) has more than 27 annotations in this category including all parental terms. We specified a minimum absolute support value of 4, a minimum confidence of 70% and a minimum improvement of 1. The ARD algorithm applied to this dataset was able to find more than 19000 association rules. To verify that the extracted rules reveal associations that are unlikely to have occurred by chance, gene expression profiles and experiment expression profiles were randomly permuted. Only two associations were extracted from the permuted data, each with only one element in the consequent and the lowest improvement value.

We applied the filtering criteria described in the methods section to eliminate those rules containing a redundant consequent (redundant consequent filter), more than one element in the antecedent (single antecedent filter) and some redundancy based on the GO hierarchy (GO filter). We selected only those rules with one item in the antecedent because we were mainly focused on showing information related to each one of the biological processes obtained from the GO ontology. Nevertheless, item combinations in the antecedent can provide information about co-occurrences of gene characteristics in sets of co-expressed genes, which can be very interesting in some cases. For example, when transcription factors that bind to promoter regions are used as gene attributes, information about their co-occurrences is very important because different transcription factors can co-operate to regulate gene expression. Twenty-nine associations remained after post-processing the obtained rules (Table 2).
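
The three post-processing filters named in the methods section can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the Engene implementation: the `Rule` container, the function names, the `ANCESTORS` map and the numeric measure values are all hypothetical, "contained" is read as proper subset, and the GO filter receives the hierarchy as a plain dictionary from each term to its ancestor terms.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rule:
    """Hypothetical container for one rule and its three quality measures."""
    antecedent: frozenset
    consequent: frozenset
    support: float
    confidence: float
    improvement: float

    def measures(self):
        return (self.support, self.confidence, self.improvement)


def redundant_consequent_filter(rules):
    """Drop a rule if another rule with the same antecedent has a strictly
    larger consequent and equal or higher support, confidence and improvement."""
    def is_redundant(rule):
        return any(
            other.antecedent == rule.antecedent
            and rule.consequent < other.consequent  # proper subset
            and all(a <= b for a, b in zip(rule.measures(), other.measures()))
            for other in rules
        )
    return [r for r in rules if not is_redundant(r)]


def go_filter(rules, ancestors):
    """Drop a rule if another rule has the same consequent and measures and an
    antecedent term of which this rule's antecedent term is an ancestor."""
    def has_more_specific_twin(rule):
        return any(
            other.consequent == rule.consequent
            and other.measures() == rule.measures()
            and any(
                term in ancestors.get(child, frozenset())
                for term in rule.antecedent
                for child in other.antecedent
            )
            for other in rules
        )
    return [r for r in rules if not has_more_specific_twin(r)]


def single_antecedent_filter(rules):
    """Keep only rules whose antecedent holds exactly one item."""
    return [r for r in rules if len(r.antecedent) == 1]


def make(antecedent, consequent, s=0.03, c=0.8, i=4.0):
    # Measure values are made up; only their equality matters in this example.
    return Rule(frozenset(antecedent), frozenset(consequent), s, c, i)


# The two examples given in section 2.2, plus one two-item antecedent.
x = make({"A"}, {"C", "D", "F"})
y = make({"A"}, {"D"})
z = make({"A"}, {"D", "F"})
g2 = make({"G2 phase of mitotic cell cycle"}, {"[+]Exp.A"})
cc = make({"cell cycle"}, {"[+]Exp.A"})
pair = make({"A", "B"}, {"C"})

# Toy ancestor map standing in for the GO hierarchy.
ANCESTORS = {"G2 phase of mitotic cell cycle": {"mitotic cell cycle", "cell cycle"}}

rules = [x, y, z, g2, cc, pair]
kept = single_antecedent_filter(
    go_filter(redundant_consequent_filter(rules), ANCESTORS)
)
for rule in kept:
    print(sorted(rule.antecedent), "->", sorted(rule.consequent))
```

With these toy rules, Y and Z are removed as redundant with respect to X, the "cell cycle" rule is removed in favour of the more specific "G2 phase of mitotic cell cycle" rule, and the rule with two antecedent items is removed by the single antecedent filter.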

Table 2 Association rules generated using the biological process category of Gene Ontology. The consequent element is graphically represented by coloured squares. Dark grey represents over-expression, light grey under-expression and empty squares represent neither over-expression nor under-expression.

[Consequent columns (time after serum exposure): 15m, 30m, 1h, 2h, 4h, 6h, 8h, 12h, 16h, 20h, 24h]

Antecedent                               Confidence (%)  Support (%)  Improvement  p-value
alcohol metabolism                        80.00           3.24         12.35        5.25E-22
angiogenesis                             100.00           2.02          5.74        8.63E-07
angiogenesis                              80.00           1.62          6.18        6.48E-06
angiogenesis                              80.00           1.62          5.34        3.86E-05
blood coagulation                         71.43           2.02          4.77        2.18E-05
chemotaxis                               100.00           2.43          9.15        1.46E-12
cholesterol metabolism                   100.00           2.43         15.44        4.41E-21
cholesterol metabolism                    83.33           2.02         15.83        4.32E-18
cytokinesis                               87.50           2.83          4.24        2.05E-06
cytoplasm organization and biogenesis     80.00           3.24          3.87        2.21E-06
DNA replication                           83.33           2.02          4.04        1.23E-04
DNA replication and chromosome cycle      90.91           4.05          4.01        3.21E-08
DNA replication and chromosome cycle      81.82           3.64          3.96        2.93E-07
electron transport                        75.00           2.43          2.41        6.52E-03
inflammatory response                     83.33           2.02          5.56        2.04E-06
lipid biosynthesis                        75.00           3.64          2.41        7.79E-04
M phase                                   75.00           4.86          3.31        2.35E-07
mitotic cell cycle                        70.37           7.69          3.10        3.56E-10
response to chemical substance            88.89           3.24          6.86        4.83E-12
response to chemical substance            77.78           2.83          7.12        5.86E-11
steroid biosynthesis                      87.50           2.83         13.51        2.93E-21
steroid biosynthesis                     100.00           3.24          3.21        1.93E-05
steroid metabolism                        72.73           3.24         11.23        6.68E-20
steroid metabolism                        81.82           3.64          7.48        1.27E-14
steroid metabolism                        90.91           4.05          2.92        1.21E-05
sterol biosynthesis                      100.00           2.83         15.44        2.01E-24
sterol biosynthesis                       71.43           2.02         13.57        1.82E-15
sterol metabolism                        100.00           3.24         15.44        8.71E-28
sterol metabolism                         75.00           2.43         14.25        2.71E-19
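
The four measures reported for each rule (confidence, support, improvement and p-value) can be recomputed from any transaction database. The Python sketch below is illustrative only. It uses the six toy transactions of Table 1 rather than the serum dataset, the function name `rule_measures` is hypothetical, and the p-value comes from a plain Pearson χ² test on a 2x2 contingency table with one degree of freedom and no continuity correction, which is an assumption because the text does not state which variant of the test was used.

```python
import math

# Toy transaction database copied from Table 1 (genes are transactions).
TRANSACTIONS = {
    "Gene 1": {"[+]Exp.A", "[+]Exp.B", "[-]Exp.C", "characteristic 1", "characteristic 2"},
    "Gene 2": {"[+]Exp.B", "[-]Exp.C", "characteristic 3", "characteristic 4"},
    "Gene 3": {"[+]Exp.A", "[+]Exp.B", "[-]Exp.C", "[+]Exp.D",
               "characteristic 1", "characteristic 3"},
    "Gene 4": {"[-]Exp.A", "[-]Exp.B", "[-]Exp.C", "characteristic 6"},
    "Gene 5": {"[+]Exp.A", "[+]Exp.B", "[-]Exp.C", "characteristic 1", "characteristic 4"},
    "Gene 6": {"[+]Exp.A", "[+]Exp.B", "[-]Exp.C", "characteristic 1", "characteristic 7"},
}


def rule_measures(transactions, antecedent, consequent):
    """Support, confidence, improvement and chi-square p-value of X -> Y."""
    x, y = set(antecedent), set(consequent)
    itemsets = list(transactions.values())
    n = len(itemsets)
    n_x = sum(x <= t for t in itemsets)          # transactions containing X
    n_y = sum(y <= t for t in itemsets)          # transactions containing Y
    n_xy = sum((x | y) <= t for t in itemsets)   # transactions containing X and Y

    support = n_xy / n
    confidence = n_xy / n_x if n_x else 0.0
    improvement = confidence / (n_y / n) if n_y else 0.0

    # 2x2 contingency table: rows = X present/absent, columns = Y present/absent.
    observed = [[n_xy, n_x - n_xy],
                [n_y - n_xy, n - n_x - n_y + n_xy]]
    row_totals = [n_x, n - n_x]
    col_totals = [n_y, n - n_y]
    chi2 = 0.0
    for i in range(2):
        for j in range(2):
            expected = row_totals[i] * col_totals[j] / n
            if expected > 0:
                chi2 += (observed[i][j] - expected) ** 2 / expected

    # Survival function of the chi-square distribution with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return {"support": support, "confidence": confidence,
            "improvement": improvement, "chi2": chi2, "p_value": p_value}


result = rule_measures(
    TRANSACTIONS,
    antecedent={"characteristic 1"},
    consequent={"[+]Exp.A", "[+]Exp.B", "[-]Exp.C"},
)
print(result)
```

In this toy database all four genes annotated with characteristic 1 show the pattern [+]Exp.A, [+]Exp.B, [-]Exp.C, so the confidence is 100% while the support is four of six transactions and the improvement is 1.5. With only six transactions the χ² approximation is rough; the example is meant to show the arithmetic, not to reproduce the values in Table 2.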
To evaluate the biological significance of the generated associations, we paid attention to the support and confidence values assigned to each rule. The support of a rule indicates the number of transactions (genes) in which a specific process coincides with the co-expression pattern. The confidence value represents the percentage of genes annotated into a given category (antecedent) which have the expression pattern that appears in the consequent of the rule. For this kind of rule, confidence is perhaps the most significant value from the biological point of view. For example, if only a small set of genes is annotated within a given process in the entire dataset, the support value of rules containing this process will be quite low but, if these rules have a high confidence value, they tell us that most of the genes annotated for this process show a similar expression pattern.

The automatically extracted rules confirm previous evidence of biological processes associated with the serum response [23, 27]. Looking at the rules in Table 2, we can see that most of the genes involved in inflammatory response, blood coagulation and angiogenesis were mainly over-expressed, as shown in the consequent element of the rules. These processes are well characterized as phenomena associated with the serum response, as is the induction of genes involved in cell proliferation and cell cycle. The extracted rules reveal that genes related to these last processes were mainly over-expressed at the last time points of the time course experiment. Moreover, these rules reveal that over-expressed genes are related to M-phase associated processes, which is in agreement with the observation of Iyer et al. that the onset of M phase was around 16 h after serum stimulation.

In the original work, Iyer et al. manually analyzed the genes associated with each cluster in order to discover common biological properties. They noted that genes involved in cholesterol biosynthesis were under-expressed after serum exposure because serum provided cholesterol to the cell. This fact is also reflected in the extracted rules. We can see that 100% of the genes specifically involved in cholesterol metabolism were under-expressed at 12, 16 and 20 h after serum exposure.

The results obtained with our approach show that this methodology can be a useful tool to analyze gene expression data by integrating GO annotations and expression patterns. It is important to remark that, although ARD has been previously applied to microarray data analysis [16, 20], the type of rules proposed and the purpose of the application were totally different from our approach. Previous works have used ARD to extract association rules among genes based only on expression data, of the form {[+]Gene A → [+]Gene B, [-]Gene C}, meaning that in a significant number of experiments Gene A and Gene B are over-expressed and Gene C is under-expressed, and also that when Gene A is over-expressed it is probable that Gene B is also over-expressed and Gene C under-expressed.

Information about the Gene Ontology hierarchy was compiled without selecting a unique level of the hierarchy, which is an interesting property of the method. Associations were extracted automatically without any pre-established assumption and revealed important biological features of the serum response based on GO annotations. Once genes were annotated with all their associated GO terms, extracting association rules was an easy and fast task. In this way, the researcher can focus on analyzing associations that are expected to be significant, without wasting time testing different gene attributes that are not highly associated with co-expression patterns.

ACKNOWLEDGMENTS

This work was supported in part by the "Comisión Interministerial de Ciencia y Tecnología" through grant CICYT (BIO2001-1237), by the "Comunidad de Madrid" through grant 07B/0032/2002 and by the N.I.H. through grant 1R01HL67465-01. This work has also been partly funded by the project TEMBLOR (The European Molecular Biology Linked Original Resources) through grant EU (QLRI-CT-2001-00015) and FISBIO2002-10855-E. P.C.S. is the recipient of a fellowship from Comunidad de Madrid (CAM).

REFERENCES

1. Garcia de la Nava, J., et al., Engene: the processing and exploratory analysis of gene expression data. Bioinformatics, 2003. 19(5): p. 657-8.
2. Brazma, A., et al., Predicting gene regulatory elements in silico on a genomic scale. Genome Res, 1998. 8(11): p. 1202-15.
3. Dennis, G., Jr., et al., DAVID: Database for Annotation, Visualization, and Integrated Discovery. Genome Biol, 2003. 4(5): p. P3.
4. Segal, E., H. Wang, and D. Koller, Discovering molecular pathways from protein interaction and gene expression data. Bioinformatics, 2003. 19 Suppl 1: p. i264-71.
5. Al-Shahrour, F., R. Diaz-Uriarte, and J. Dopazo, FatiGO: a web tool for finding significant associations of Gene Ontology terms with groups of genes. Bioinformatics, 2004. 20(4): p. 578-80.
6. Grosu, P., et al., Pathway Processor: a tool for integrating whole-genome expression results into metabolic networks. Genome Res, 2002. 12(7): p. 1121-6.
7. Hosack, D.A., et al., Identifying biological themes within lists of genes with EASE. Genome Biol, 2003. 4(10): p. R70.
8. Shatkay, H., et al., Genes, themes and microarrays: using information retrieval for large-scale gene analysis. Proc Int Conf Intell Syst Mol Biol, 2000. 8: p. 317-28.
9. Raychaudhuri, S., et al., The computational analysis of scientific literature to define and recognize gene expression clusters. Nucleic Acids Res, 2003. 31(15): p. 4553-60.
10. Lee, S.G., J.U. Hur, and Y.S. Kim, A graph-theoretic modeling on GO space for biological interpretation of gene clusters. Bioinformatics, 2004. 20(3): p. 381-8.
11. Zhang, B., et al., GOTree Machine (GOTM): a web-based platform for interpreting sets of interesting genes using Gene Ontology hierarchies. BMC Bioinformatics, 2004. 5(1): p. 16.
12. Robinson, P.N., et al., Ontologizing gene-expression microarray data: characterizing clusters with gene ontology. Bioinformatics, 2004.
13. Feng, W., et al., Development of gene ontology tool for biological interpretation of genomic and proteomic data. Proc AMIA Symp, 2003: p. 839.
14. Hvidsten, T.R., A. Laegreid, and J. Komorowski, Learning rule-based models of biological process from gene expression time profiles using gene ontology. Bioinformatics, 2003. 19(9): p. 1116-23.
15. Agrawal, R., T. Imielinski, and A. Swami, Mining Association Rules between Sets of Items in Large Databases. Proc. of the ACM SIGMOD 1993, 1993: p. 207-216.
16. Creighton, C. and S. Hanash, Mining gene expression databases for association rules. Bioinformatics, 2003. 19(1): p. 79-86.
17. Han, J., J. Pei, and Y. Yin, Mining Frequent Patterns without Candidate Generation. Proc. of ACM SIGMOD 2000, 2000: p. 1-12.
18. Liu, J., et al., Mining Frequent Item Sets by Opportunistic Projection. Proc. of ACM SIGKDD 2002, 2002: p. 229-238.
19. Rodríguez, A., J.M. Carazo, and O. Trelles, Mining Association Rules from Biological Databases. Bioinformatics special topics issue of the Journal of the American Society for Information Science and Technology, 2004: p. in press.
20. Kotala, P., et al., Gene expression profiling dna microarray
data using peano count tree (p-trees). Proc. of the First Virtual
Conference on Genomics and Bioinformatics., 2001.
21. Motwani, R., S. Brin, and C. Silverstein, Beyond Market
Baskets: Generalizing Association Rules to Correlations.
Proceedings ACM SIGMOD Conference, 1997: p. 265-276.
22. Geurts, K., et al., Profiling high-frequency accident locations
using association rules. Transportation Research Record, 2003.
1840: p. 123-130.
23. Iyer, V.R., et al., The transcriptional program in the response
of human fibroblasts to serum. Science, 1999. 283(5398): p.
83-7.
24. Troyanskaya, O., et al., Missing value estimation methods for
DNA microarrays. Bioinformatics, 2001. 17(6): p. 520-5.
25. Ashburner, M., et al., Gene ontology: tool for the unification of
biology. The Gene Ontology Consortium. Nat Genet, 2000.
25(1): p. 25-9.
26. Troyanskaya, O.G., et al., A Bayesian framework for
combining heterogeneous data sources for gene function
prediction (in Saccharomyces cerevisiae). Proc Natl Acad Sci
U S A, 2003. 100(14): p. 8348-53.
27. Chang, H.Y., et al., Gene Expression Signature of Fibroblast
Serum Response Predicts Human Cancer Progression:
Similarities between Tumors and Wounds. PLoS Biol, 2004.
2(2): p. E7.
