Abstract
Background: Metabolic networks are represented by sets of metabolic pathways. Metabolic pathways are series
of biochemical reactions, in which the product (output) of one reaction serves as the substrate (input) to another
reaction. Many pathways remain incompletely characterized. One of the major challenges of computational biology is
to obtain better models of metabolic pathways. Existing models depend on the annotation of the genes, which
propagates error accumulation when pathways are predicted from incorrectly annotated genes. Pairwise
classification methods are supervised learning methods used to classify new pairs of entities. Some of these
classification methods, e.g., Pairwise Support Vector Machines (SVMs), use pairwise kernels. Pairwise kernels describe
similarity measures between two pairs of entities. Using pairwise kernels to handle sequence data requires long
processing times and large storage. Rational kernels are kernels based on weighted finite-state transducers that
represent similarity measures between sequences or automata. They have been effectively used in problems that
handle large amounts of sequence information, such as protein essentiality, natural language processing and machine
translation.
Results: We create a new family of pairwise kernels using weighted finite-state transducers (called Pairwise Rational
Kernels (PRKs)) to predict metabolic pathways from a variety of biological data. PRKs take advantage of the simpler
representations and faster algorithms of transducers. Because raw sequence data can be used, the predictor model
avoids the errors introduced by incorrect gene annotations. We then developed several experiments with PRKs and
Pairwise SVMs to validate our methods using the metabolic network of Saccharomyces cerevisiae. As a result, when PRKs
are used, our method executes faster in comparison with other pairwise kernels. Also, when we use PRKs combined
with other simple kernels that include evolutionary information, the accuracy values are improved, while
maintaining lower construction and execution times.
Conclusions: The power of using kernels is that almost any sort of data can be represented by a kernel. Therefore,
completely disparate types of data can be combined to add power to kernel-based machine learning methods. When
we compared our proposal using PRKs with other similar kernels, the execution times were decreased, with no
compromise of accuracy. We also proved that by combining PRKs with other kernels that include evolutionary
information, the accuracy can also be improved. As our proposal can use any type of sequence data, genes do
not need to be properly annotated, avoiding error accumulation because of incorrect previous annotations.
Keywords: Metabolic network, Pairwise rational kernels, Supervised network inference, Finite-state transducers,
Pairwise support vector machine
*Correspondence: aroche@cs.umanitoba.ca
1 Department of Computer Science, University of Manitoba, Winnipeg,
Manitoba, Canada
Full list of author information is available at the end of the article
© 2014 Roche-Lima et al.; licensee BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative
Commons Attribution License (http://creativecommons.org/licenses/by/4.0), which permits unrestricted use, distribution, and
reproduction in any medium, provided the original work is properly credited. The Creative Commons Public Domain Dedication
waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise
stated.
Roche-Lima et al. BMC Bioinformatics 2014, 15:318 Page 2 of 13
http://www.biomedcentral.com/1471-2105/15/318
Figure 1 Weighted transducer and weighted automaton representing sequences over the alphabet Σ = {G, C}. (a) Weighted
Transducer T. (b) Weighted Automaton A (A is obtained by projecting the output of T).
There are several operations defined on automata and transducers, such as inverse and composition. Given any transducer T, the inverse T^-1 is the transducer obtained when the input and output labels are swapped for each transition. The composition operation of the transducers T1 and T2, with input and output alphabets both equal to Σ, is a weighted transducer, denoted by T1 ◦ T2, provided that the sum given by (T1 ◦ T2)(x, y) = Σ_{z ∈ Σ*} T1(x, z) T2(z, y) is well defined in R for all (x, y) ∈ Σ* × Σ*.

Rational kernels
In order to manipulate sequence data, FSTs provide a simple representation as well as efficient algorithms such as composition and shortest-distance [18]. Rational kernels, based on finite-state transducers, are effective for analyzing sequences with variable lengths [17].
As a formal definition, a function k : Σ* × Σ* → R is a rational kernel if there exists a WFST U such that k coincides with the function defined by U, i.e., k(x, y) = U(x, y) for all sequences x, y ∈ Σ* × Σ* [17]. From now on, we consider the input and output alphabets to have the same symbols (i.e., Σ = Δ), and only the terms Σ and Σ* will be used.
In order to compute the value of U(x, y) for a particular pair of sequences x, y ∈ Σ* × Σ*, the composition algorithm of weighted transducers is used [17]:

• First, Mx, My are considered as trivial weighted transducers representing x, y respectively, where Mx(x, x) = 1 and Mx(v, w) = 0 for v ≠ x or w ≠ x. Mx is obtained from the linear finite automaton representing x by augmenting each transition with an output label identical to the input label and by setting all transition, initial and final weights to one. My is obtained in a similar way by using y.
• Then, by definition of weighted transducer composition: (Mx ◦ U ◦ My)(x, y) = Mx(x, x) U(x, y) My(y, y). Considering Mx(x, x) = 1 and My(y, y) = 1, we obtain (Mx ◦ U ◦ My)(x, y) = k(x, y), i.e., the sum of the weights of all paths of Mx ◦ U ◦ My is exactly U(x, y) = k(x, y).

Based on this representation, a two-step algorithm is defined by Cortes et al. [17] to obtain k(x, y) = U(x, y).

Algorithm 1 Rational Kernel Computation
INPUT: pair of sequences (x, y) and a WFST U
(i) compute N using composition as N = Mx ◦ U ◦ My
(ii) compute the sum of all paths of N using the shortest-distance algorithm, which is equal to U(x, y)
RESULTS: value of k(x, y) = U(x, y)

Using Algorithm 1, the overall complexity to compute one value of the rational kernel is O(|U||Mx||My|), where |U| remains constant. In practice, this complexity is reduced to O(|U| + |Mx| + |My|) for many kernels which have been used in areas such as natural language processing and computational biology. For example, Algorithm 1 for the n-gram kernel has a linear complexity (see a detailed description of the n-gram kernel below).
Kernels used in training methods for discriminant classification algorithms (e.g., SVM) need to satisfy Mercer's condition or, equivalently, be Positive Definite and Symmetric (PDS) [18]. Cortes et al. [18] have proven a result that gives a general method to construct a PDS rational kernel from any WFST.

Theorem 1. ([18]). If T is an arbitrary weighted transducer, then U = T ◦ T^-1 defines a PDS rational kernel.

n-gram kernel as a rational kernel
Hofmann et al. [26] have defined a class of similarity measures between two biological sequences as a function of the number of equal subsequences that they have. An example of such measures is the spectrum kernel defined by Leslie et al. [27]. Similarity values are the results of summing all the products of the counts for the same subsequences. It is also referred to in computational biology as the k-mer or n-gram kernel. In the rest of this paper, we use the term n-gram to follow the notation of Hofmann et al. [26] and Cortes et al. [17].
The n-gram kernel is defined as kn(x, y) = Σ_{|z|=n} cx(z) cy(z) for a fixed integer n, which represents the length of the subsequences. Here, ca(b) is the number of times that the subsequence b appears in a. kn can be represented as a rational kernel using the weighted transducer Un = Tn ◦ Tn^-1, where the transducer Tn is defined as Tn(x, z) = cx(z), for all x, z ∈ Σ* with |z| = n [18]. For example, for n = 2, k2(x, y) = Σ_{|z|=2} cx(z) cy(z) is the rational kernel where z represents all the subsequences in Σ* of size 2 and T2(x, z) = cx(z) counts how many times z occurs in x.
Allauzen et al. [16] extended the construction of this kernel, kn, to measure the similarity between sequences represented by automata. Firstly, they define the count of a sequence z in a weighted automaton A as cA(z) = Σ_{u ∈ Σ*} cu(z) A(u), where u ranges over the set of sequences in Σ* which can be represented by the automaton A. This equation represents the sum, over each u, of how many times z occurs in u multiplied by the weight (or value) associated to the sequence u in the automaton A (as is computed in Example 2).
Then, the similarity measure between the weighted automata A1 and A2, according to the n-gram kernel kn, is defined as:
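As a cross-check of the definition, the value kn(x, y) = Σ_{|z|=n} cx(z) cy(z) can be computed directly from substring counts. The following is a minimal Python sketch, a naive counting implementation for illustration only; the transducer-based Algorithm 1 exists precisely to avoid this explicit enumeration on large inputs:

```python
from collections import Counter

def ngram_counts(s, n):
    # Count every (possibly overlapping) length-n substring of s.
    return Counter(s[i:i + n] for i in range(len(s) - n + 1))

def ngram_kernel(x, y, n):
    # k_n(x, y) = sum over all z with |z| = n of c_x(z) * c_y(z)
    cx, cy = ngram_counts(x, n), ngram_counts(y, n)
    return sum(count * cy[z] for z, count in cx.items())
```

For instance, ngram_kernel("GCCG", "CCGC", 2) sums the matched counts of the 2-grams GC, CC and CG, giving 3.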
kn(A1, A2) = Σ_{x,y ∈ X} (A1 ◦ Tn ◦ Tn^-1 ◦ A2)(x, y) = Σ_{|z|=n} cA1(z) cA2(z)    (1)

Based on this definition and using Algorithm 1, the n-gram rational kernel can be constructed in time O(|Un| + |Mx| + |My|), as described by Allauzen et al. [16] and Mohri et al. [28].
Yu et al. [29] have verified that n-gram sequence kernels alone are not good enough to predict protein interactions. We address their concerns in our experiments by combining n-gram with other kernels that include evolutionary information.

Pairwise kernels
We apply kernel methods to the problem of predicting relationships between two given entities, i.e., pairwise prediction. Models to solve this problem have two instances as input, and the output is the relationship between them. Kernels used in these models need to define similarities between two arbitrary pairs of entities. Typically, the construction of a pairwise kernel K is based on a simple kernel k, where k : X × X → R. In this paper four different pairwise kernels are investigated: the Direct Sum Learning Pairwise Kernel [21], the Tensor Learning Pairwise Kernel (or Kronecker Kernel) [22,30,31], the Metric Learning Pairwise Kernel [23] and the Cartesian Pairwise Kernel [10].
All these pairwise functions guarantee the symmetry of the pairwise kernel K, i.e., K((x1, y1), (x2, y2)) = K((x2, y2), (x1, y1)), where x1, x2, y1, y2 ∈ X. Also, if the simple kernel k is PDS (satisfies the Mercer condition), the resulting pairwise kernel K is also PDS, for each of the pairwise kernels defined above [10,32].

Pairwise support vector machine
The rationale for the preceding discussion on representing disparate types of data as kernels is to enable us to use them in machine learning formalisms such as Support Vector Machines (SVMs). SVMs are used for classification and regression analysis, defined as supervised models with associated learning algorithms [33]. In this research, we use SVMs for classification. An SVM represents the data as vectors in a vector space (i.e., the input or feature space). As a training set, several entities xi (vectors) classified in two categories are given. An SVM is trained to find a hyperplane that separates the vector space in two parts. Each part of the feature space groups the entities of the same category. Then, a new entity x can be classified depending on its location in the feature space relative to the hyperplane [33].
Pairwise Support Vector Machines, instead, classify pairs of entities (x, y) [32]. Let us formally define the binary Pairwise Support Vector Machine formulation, following Brunner et al. [32]: given training data ((xi, yj), di), where di has binary values (e.g., the pair (xi, yj) is classified as +1 or -1), i = 1, ..., n, j = 1, ..., n, and the mapping function Φ, the Pairwise SVM methods find the optimal hyperplane, w^T Φ(xi, yi) + b = 0, which separates the points in two categories. One of the solutions is based on the dual formalism of the optimization problem described in Cortes et al. [33]. In this case the decision function is:

f(x, y) = Σ_{i,j=1}^{n} αij K((xi, yj), (x, y)) + b,

where K is the pairwise kernel, (xi, yj) are the training examples, α is obtained from the Lagrange multipliers as a function of w (the normal vector) and b is the offset of the hyperplane (please see Cortes et al. [33] for more details). In this case, α and b are the "learned" parameters during the training process. Thus, f classifies the new pairs (x, y). For example, if f(x, y) >= 0, (x, y) is classified as +1; otherwise, (x, y) is classified as -1.

Metabolic networks
In this work, the metabolic network is represented as a graph, in which the vertices are the enzymes, and the edges are the enzyme-enzyme relations (two proteins are enzymes that catalyze successive reactions in known pathways). Figure 2 represents a graphical transition from a metabolic pathway to a graph.
In a traditional representation of a metabolic pathway, enzymes are vertices (nodes), and metabolites are edges (branches). Following Yamanishi [9], we represent it differently, where the interactions between pairs of enzymes are considered discrete data points. For example, in Figure 2(a), the enzyme numbered EC 5.3.1.9 can create D-fructose-6-phosphate as a product, which is in turn used as a substrate by the enzyme numbered EC 2.7.1.11. This means there is an enzyme-enzyme relation between EC 5.3.1.9 and EC 2.7.1.11. Then, we create a graph in which enzyme-enzyme relations become edges and enzymes are nodes, as shown in Figure 2(b). If there is a relation between two enzymes, such a relation is classified as +1 (i.e., an interacting pair). Enzyme-enzyme pairs for which no relation exists are classified as -1 (non-interacting pairs). Figure 2(c) describes these classifications, which are used as the training set in the SVM method.

Using pairwise kernel and SVM to predict metabolic networks
The input data, considered as the training example dataset ((xi, yi), di), is a set of known pairs of enzymes (or genes) classified in two categories (interacting or non-interacting pairs). Figure 3(a) shows an example of the input data, obtained from the metabolic network described in Figure 2(c). In Figure 3(a), enzymes are represented by EC number (top) and gene nomenclature (bottom).
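The shape of the decision function f described above can be sketched directly. The following hypothetical Python fragment (the toy kernel and the α, b values are invented, only to illustrate the evaluation of f and the sign-based classification):

```python
def decision_function(pair, train_pairs, alphas, b, K):
    # f(x, y) = sum_ij alpha_ij * K((x_i, y_j), (x, y)) + b
    return sum(a * K(tp, pair) for tp, a in zip(train_pairs, alphas)) + b

def classify(pair, train_pairs, alphas, b, K):
    # +1 (interacting) if f(x, y) >= 0, otherwise -1 (non-interacting)
    return 1 if decision_function(pair, train_pairs, alphas, b, K) >= 0 else -1
```

Here K is any pairwise kernel as described above, and the alphas and b would come from the training process.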
Figure 2 Conversion from a metabolic network to a graph representation. (a) Part of the Glycolysis Pathways, from the BioCyc Database [5,6].
(b) The resulting graph with the nodes (enzymes) and edges (enzyme-enzyme relations). (c) Table that represents known enzyme relations
(related EC numbers are classified as +1 and non-related as -1).
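The conversion in Figure 2 from known enzyme-enzyme relations to labeled training pairs can be sketched as follows (the third EC number is hypothetical, added only so that a non-interacting pair exists):

```python
from itertools import combinations

def label_pairs(enzymes, relations):
    # +1 for pairs with a known enzyme-enzyme relation, -1 otherwise.
    known = {frozenset(r) for r in relations}
    return {(a, b): (1 if frozenset((a, b)) in known else -1)
            for a, b in combinations(sorted(enzymes), 2)}

labels = label_pairs(
    ["EC 5.3.1.9", "EC 2.7.1.11", "EC 1.1.1.1"],   # EC 1.1.1.1 is hypothetical
    [("EC 5.3.1.9", "EC 2.7.1.11")])
```

The resulting dictionary plays the role of the table in Figure 2(c).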
Figure 3(b) represents an example of the pairwise kernel (K((x1, y1), (x2, y2))). Several state-of-the-art pairwise kernels were mentioned above. For example, if we consider the Tensor Product Pairwise Kernel K [22], then K((x1, y1), (x2, y2)) is computed using a simple kernel k (e.g., k could be the simple PFAM kernel described by Ben-Hur et al. [22]). The PFAM kernel (kpfam(x, y)) describes similarity measures, based on the PFAM database [34], between the gene x and the gene y. Thus, the Tensor Product Pairwise Kernel K, using the PFAM kernel kpfam as the simple kernel, is defined as:

K((x1, y1), (x2, y2)) = kpfam(x1, x2) * kpfam(y1, y2) + kpfam(x1, y2) * kpfam(y1, x2)

For example, in Figure 3(b)-bottom, if the genes are associated to the variables as follows: x1 = YAR071W, y1 = YAL002W, x2 = YDR127W, y2 = YAL038W, the Tensor Product Pairwise Kernel is:

A Pairwise SVM based on the dual formalism of the optimization problem is represented in Figure 3(c). The parameters αij and b are learned, using the pairwise kernel, K, and the training dataset, (xi, yi). Finally, new pairs of enzymes or genes (x, y) can be classified as interacting or non-interacting, depending on the evaluation of the decision function f (see an example representation in Figure 3(d)). By predicting the gene interactions of the other unseen examples, all the metabolic pathways can be predicted.
The pairwise kernel computation is one of the most expensive tasks during the prediction of metabolic networks, in both processing and storage. Using sequence data causes even longer execution times and larger storage needs. However, we have mentioned the advantages of using sequence data in order to avoid error accumulation because of genome annotation dependencies. As well, SVMs guarantee better accuracy values than other supervised learning methods along with sequence kernels for metabolic network inference [7]. Therefore, we focus on improving the pairwise kernel computations and representation by incorporating rational kernels to manipulate the sequence data. To accomplish this, we have proposed a new framework called Pairwise Rational Kernels.

Figure 3 Diagram of pairwise SVM applied to metabolic network prediction. (a) An example of the pairs in the training set using the EC numbers (top) or gene names (bottom). (b) The pairwise kernel as a matrix, where the numerical values in each cell correspond to a measure of similarity between two pairs of EC numbers (top) or two pairs of gene names (bottom). (c) A model is trained to estimate the parameters αij and b of the decision function f. (d) Given a new pair of EC numbers (left) or gene names (right), the decision function is evaluated and the pair is classified as interacting or non-interacting.

Methods
Pairwise rational kernels
In this section, we propose new pairwise kernels based on rational kernels, i.e., Pairwise Rational Kernels (PRKs). They are obtained using rational kernels as the simple kernels k. We have defined four PRKs, based on the notations and definitions in the Background Section above.

Definition 1. Given X ⊆ Σ* and a transducer U, then a function K : (X × X) × (X × X) → R is:

• a Direct Sum Pairwise Rational Kernel (KPRKDS) if
K((x1, y1), (x2, y2)) = U(x1, x2) + U(y1, y2) + U(y1, x2) + U(x1, y2)
• a Tensor Product Pairwise Rational Kernel (KPRKT) if
K((x1, y1), (x2, y2)) = U(x1, x2) * U(y1, y2) + U(x1, y2) * U(y1, x2)
• a Metric Learning Pairwise Rational Kernel (KPRKM) if
K((x1, y1), (x2, y2)) = (U(x1, x2) - U(x1, y2) - U(y1, x2) + U(y1, y2))^2
• a Cartesian Pairwise Rational Kernel (KPRKC) if
K((x1, y1), (x2, y2)) = U(x1, x2) * δ(y1 = y2) + δ(x1 = x2) * U(y1, y2) + U(x1, y2) * δ(y1 = x2) + δ(x1 = y2) * U(y1, x2),
where δ(x = y) = 1 if x = y and 0 otherwise, ∀x, y ∈ X.

Following Theorem 1, if we construct U using a weighted transducer T, such that U = T ◦ T^-1, then we guarantee that U is a Positive Definite and Symmetric (PDS) kernel. PDS is a necessary condition to use kernels in training classification algorithms. Since all the kernels defined above are results of PDS kernel operations, the PRK kernels are also PDS [35].
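Assuming a rational kernel U is available as a plain function of two sequences (e.g., an n-gram kernel), the four formulas of Definition 1 translate line-for-line into Python. This is only a sketch of the combination step, not of the transducer machinery behind U:

```python
def prk_direct_sum(U, x1, y1, x2, y2):
    return U(x1, x2) + U(y1, y2) + U(y1, x2) + U(x1, y2)

def prk_tensor_product(U, x1, y1, x2, y2):
    return U(x1, x2) * U(y1, y2) + U(x1, y2) * U(y1, x2)

def prk_metric_learning(U, x1, y1, x2, y2):
    return (U(x1, x2) - U(x1, y2) - U(y1, x2) + U(y1, y2)) ** 2

def prk_cartesian(U, x1, y1, x2, y2):
    delta = lambda a, b: 1 if a == b else 0   # delta(x = y)
    return (U(x1, x2) * delta(y1, y2) + delta(x1, x2) * U(y1, y2)
            + U(x1, y2) * delta(y1, x2) + delta(x1, y2) * U(y1, x2))
```

Each function takes the same four entities and differs only in how the simple-kernel values are combined, which is exactly the point of Definition 1.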
output values for the method described above, equivalent to Figure 3(b), but using sequence data.

Example 3. Given nucleotide sequences x1, y1, x2, y2, which represent abbreviated examples of known genes in the dataset,
x1 = GCTAAATTGGACAAATCTCAATGAAATTGTCTTGG
y1 = ATGTCCTCGTCTTCGTCTACCGGGTACAGAAAA
x2 = CATGACTAAAGAAACGATTCTGGTAGTTATTTGGCGG
y2 = ATCTACAAGCGAACCAGAGTCTTCTGCAGGCTTAGAT
the Tensor Product Pairwise Rational Kernel KPRKT((x1, y1), (x2, y2)) can be obtained using the 3-gram rational kernel, e.g., for z = TCT, the values are:

• cAx1(z) = 2, because TCT appears twice in x1: GCTAAATTGGACAAATCTCAATGAAATTGTCTTGG
• cAy1(z) = 2, because TCT appears twice in y1: ATGTCCTCGTCTTCGTCTACCGGGTACAGAAAA
• cAx2(z) = 1, because TCT appears once in x2: CATGACTAAAGAAACGATTCTGGTAGTTATTTGGCGG
• cAy2(z) = 3, because TCT appears three times in y2: ATCTACAAGCGAACCAGAGTCTTCTGCAGGCTTAGAT

With these results and the other values corresponding to the 3-gram rational kernel, the KPRKT is computed as: KPRKT((x1, y1), (x2, y2)) = 0.3, where 0.3 is a measure of similarity.

SVM and predicting process
To implement the pairwise SVM method, we use the sequential minimal optimization (SMO) technique from the LIBSVM package [40] in combination with the OpenKernel library [39]. During the training process, the decision function was obtained by estimating the parameters, as shown in Figure 3(c). Then, the prediction process allows classification of new pairs of nucleotide sequences as interacting or non-interacting by evaluating the decision function. Example 4 shows a description of the prediction process, similar to the process described in Figure 3(d), but using nucleotide sequences.

Example 4. This example describes the prediction process. Suppose we want to know whether
x = CTCAAAGTCTTAATGCTTGGACAAATTGAAATTGG, and
y = TCTACAGAGTCGTCCTTCGTCTACCGGGAAAAT,
which represent abbreviated nucleotide sequences, interact or do not interact. The decision function, f(x, y), was previously obtained during the training process (see the Pairwise support vector machine Section for more details). If the resulting value of evaluating the decision function f(x, y) is greater than 0, the pair (x, y) interacts; otherwise the pair (x, y) does not interact. Suppose that the evaluation is
f(x, y) = f(CTCAAAGTCTTAATGCTTGGACAAATTGAAATTGG..., TCTACAGAGTCGTCCTTCGTCTACCGGGAAAAT...) = +3.
Then, we predict that these nucleotide sequences (x, y) interact in the context of the metabolic network of the yeast Saccharomyces cerevisiae.

In this case, we used 755 genes during the training process, but the species has more than 6000 genes [41]. Then, the rest of the metabolic pathways can be predicted by classifying all other pairs of genes (or pairs of raw nucleotide sequences) as interacting or non-interacting, using the decision function f. Note that the decision function is obtained once during the training process, but can be used as often as needed during the prediction process.
The advantage of using sequence data is that nucleotide sequences can be used even if they are not annotated. Also, any other type of sequence data, e.g., from high-throughput analysis, can be considered and combined, using a similar implementation.

Experiment description and performance measures
We used pairwise SVM with PRKs for metabolic network prediction, using the data and algorithms described above. We ran experiments for twelve different kernels. Firstly, we used the four PRKs described in Definition 1 with the 3-gram rational kernel (i.e., KPRKDS-3gram, KPRKT-3gram, KPRKM-3gram and KPRKC-3gram). In addition, combinations of PRKs with other kernels were considered. We included the phylogenetic kernel (Kphy) described by Yamanishi [9] and the PFAM kernel (Kpfam) described by Ben-Hur et al. [22]. Then, a second set of experiments was developed combining PRKs with the phylogenetic kernel (i.e., KPRKDS-3gram + Kphy, KPRKT-3gram + Kphy, KPRKM-3gram + Kphy and KPRKC-3gram + Kphy). Finally, we combined PRKs with the PFAM kernel, obtaining the KPRKDS-3gram + Kpfam, KPRKT-3gram + Kpfam, KPRKM-3gram + Kpfam and KPRKC-3gram + Kpfam kernels. Considering that the phylogenetic and PFAM kernels are PDS, the resulting combinations are also PDS [35].
To compare the advantages of the PRK framework, we developed a new set of experiments with the same dataset, but without using finite-state transducers. We considered the pairwise (n-gram) kernel, i.e., KT-3gram. KT-3gram denotes the pairwise tensor product described
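The substring counts used in Example 3 can be reproduced with a direct overlapping-occurrence scan. A small sketch (the sequences are copied from the example; x2 follows the TCT-containing form shown in the count listing):

```python
def count_occurrences(seq, z):
    # Count possibly overlapping occurrences of z in seq.
    return sum(seq[i:i + len(z)] == z for i in range(len(seq) - len(z) + 1))

x1 = "GCTAAATTGGACAAATCTCAATGAAATTGTCTTGG"
y1 = "ATGTCCTCGTCTTCGTCTACCGGGTACAGAAAA"
x2 = "CATGACTAAAGAAACGATTCTGGTAGTTATTTGGCGG"   # form used in the count listing
y2 = "ATCTACAAGCGAACCAGAGTCTTCTGCAGGCTTAGAT"
```

Scanning for z = TCT yields the counts 2, 2, 1 and 3 stated in Example 3.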
in the Pairwise kernels Section. To be consistent with the previous experiments, we combined the KT-3gram kernel with the phylogenetic kernel (Kphy) and the PFAM kernel (Kpfam), i.e., the KT-3gram + Kphy and KT-3gram + Kpfam kernels, respectively. The pairwise SVM algorithm was used to predict the metabolic network using the same data set described above. Table 1 describes the groups created to compare these kernels with the equivalent PRKs.

Table 1 Groups for PRK and pairwise kernel comparison
Group    PRKs(1)                Pairwise Kernel(2)
N-GRAM   KPRKT-3gram            KT-3gram
PHY      KPRKT-3gram + Kphy     KT-3gram + Kphy
PFAM     KPRKT-3gram + Kpfam    KT-3gram + Kpfam
(1) Kernels were taken from Table 2.
(2) Computed with the Tensor Product Pairwise Kernel.

All the experiments were executed on a PC with an Intel i7 Core processor and 8 GB RAM. To validate the model, we used the 10-fold cross-validation method and measured the average Area Under the Curve of the Receiver Operating Characteristic (AUC ROC) score.
Cross-validation is a suitable approach to validate the performance of predictive models. In k-fold cross-validation, the original dataset is randomly partitioned into k equal-sized subsets. Then, the model is trained k times. Each time, one of the k subsets is reserved for testing and all the remaining k - 1 subsets are used for training. The final value is obtained as the average of the k results (see Kohavi et al. [42] for more details).
A Receiver Operating Characteristic (ROC) curve is a plot of the True Positive Rate (TPR) versus the False Positive Rate (FPR) for different possible cut-offs of a binary classifier system. A cut-off defines a level for discriminating positive and negative categories. ROC curve analysis is used to assess the overall discriminatory ability of the SVM binary classifiers. The area under the curve (average AUC score) has been used as a metric to evaluate the strength of the classification.
In addition, the 95% Confidence Intervals (CIs) have been computed, following the method described by Cortes and Mohri [43]. The authors provide a distribution-independent technique to compute confidence intervals for average AUC values. The variance depends on the number of positive and negative examples (2575 and 2574 in our case) and the number of classification errors (ranging between 889 and 1912 in our case).

Results and discussion
Table 2 shows the SVM performance, execution times and 95% CIs grouped by the kernels mentioned above. As we can see, the experiments using only the PRKs have the best execution times (Exp. I), as the transducer representations and algorithms speed up the processing. However, the accuracy is not comparable to Experiments II and III. Similar results were obtained by Yu et al. [29] with PPI networks. They stated that simple sequence-based kernels, such as n-gram, do not properly predict protein interactions. However, when Yu et al. [29] combined sequence kernels with other kernels that incorporate evolutionary information, the accuracy of the predictor model was improved. We obtained similar results applied to metabolic network prediction: when the PHY and PFAM kernels were included (Experiments II and III, respectively), accuracies were improved while maintaining adequate processing times. The best accuracy value was obtained by combining the PRK-Metric-3gram and PFAM kernels (average AUC = 0.844). Other papers have used similar kernel combinations to improve the prediction of biological networks, such as Ben-Hur et al. [22] and Yamanishi [9]. However, rational kernels have not been used in previous research.
Ben-Hur et al. [22] report an average AUC value of 0.78 for PFAM kernels, while Yamanishi [9] reports an average AUC of 0.77 for the PHY kernel for predicting Saccharomyces cerevisiae metabolic pathways. We have previously developed similar experiments but using SVM methods [7]. As a result, we obtained AUC values of 0.92 for the PFAM kernel and 0.80 for the PHY kernel, with execution times of 12060 and 7980 seconds, respectively. However, in all cases a random selection of negative and positive training data was used. As noted by Yu et al. [29], the average AUC values obtained by random selection of data for training machine learning tools result in a bias towards genes (or proteins) with large numbers of interactions. As such, the high AUC results in these previous works cannot be directly compared to the results in this paper. We have employed the balanced sampling techniques suggested by Yu et al. [29] to combat bias in the training set. Our results, with average AUC values in the range 0.5-0.844, are comparable to, and in some cases exceed, the results obtained by Yu et al. [29] with balanced sampling, which range from 0.5-0.75 across several different kernels for protein interaction problems. We have also obtained these results in execution times of 15-140 seconds. With the exception of the direct sum kernel, all of the confidence intervals are above the behaviour of a random classifier.
We developed one more experiment with the PFAM kernel as a simple kernel of the Pairwise Tensor Product (Kpfam), using balanced sampling as suggested by Yu et al. [29]. Note that it is not a PRK; it is a regular pairwise kernel using PFAM as a simple kernel, similar to the example in the Using pairwise kernel and SVM to predict metabolic networks Section. As a result, the average AUC was 0.61 and the execution time was 122 seconds. When we compare these values with the results in Table 2, Exp. I, we can see that the kernels KPRKM-3gram
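The 10-fold cross-validation scheme described above can be sketched as a generic partitioning helper (a sketch, not the actual experimental harness; the fold assignment and seed are illustrative):

```python
import random

def k_fold_splits(examples, k=10, seed=0):
    # Randomly partition the dataset into k near-equal folds; each fold
    # is used once for testing while the remaining k-1 folds train.
    items = list(examples)
    random.Random(seed).shuffle(items)
    folds = [items[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [e for j, fold in enumerate(folds) if j != i for e in fold]
        yield train, test
```

The model would be trained on each train split, evaluated on the matching test split, and the k AUC scores averaged.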
Table 2 Average AUC ROC scores and processing times for various PRKs
Exp   Type of kernels                      Kernel                                       Average AUC score   Runtime (sec)   Confidence intervals
I     Pairwise Rational Kernels (PRK)      PRK-Direct-Sum (KPRKDS-3gram)                0.499               15.0            [0.486, 0.512]
      (3-gram)                             PRK-Tensor-Product (KPRKT-3gram)             0.597               16.2            [0.589, 0.605]
                                           PRK-Metric-Learning (KPRKM-3gram)            0.641               17.4            [0.633, 0.648]
                                           PRK-Cartesian (KPRKC-3gram)                  0.640               15.0            [0.632, 0.647]
II    PRKs combined with phylogenetic      PRK-Direct-Sum+Phy (KPRKDS-3gram + Kphy)     0.425               136.2           [0.411, 0.438]
      data (Kphy, non-sequence kernel)     PRK-Tensor+Phy (KPRKT-3gram + Kphy)          0.733               135.6           [0.725, 0.741]
                                           PRK-Metric+Phy (KPRKM-3gram + Kphy)          0.761               139.2           [0.753, 0.768]
                                           PRK-Cartesian+Phy (KPRKC-3gram + Kphy)       0.742               132.6           [0.734, 0.749]
III   PRKs combined with PFAM data         PRK-D-Sum+PFAM (KPRKDS-3gram + Kpfam)        0.493               136.2           [0.480, 0.506]
      (Kpfam, sequence kernel)             PRK-Tensor+PFAM (KPRKT-3gram + Kpfam)        0.827               136.8           [0.819, 0.834]
                                           PRK-Metric+PFAM (KPRKM-3gram + Kpfam)        0.844               140.4           [0.837, 0.850]
                                           PRK-Cartesian+PFAM (KPRKC-3gram + Kpfam)     0.842               132.0           [0.835, 0.849]
and KPRKC-3gram have better average accuracy (i.e., 0.641 and 0.640, respectively) with lower average execution times (17.4 and 15.0 seconds, respectively). In addition, when the Pairwise Rational Kernel 3-gram was combined with the PFAM kernel in Exp. III (i.e., the Tensor Product Pairwise Rational Kernel, KPRKT-3gram + Kpfam), the average accuracy value (average AUC = 0.827) was better than that of the Pairwise Tensor Product (Kpfam), while the execution time increased by just 14.8 seconds (i.e., from 122 seconds, using Kpfam, to 134.8 seconds, using KPRKT-3gram + Kpfam).
In order to statistically compare these results, we applied McNemar's non-parametric statistical test [44]. McNemar's tests have been recently used by Bostanci et al. [45] to prove significant statistical differences between classification methods. McNemar's test defines a z score, calculated as:

z = (|Nsf - Nfs| - 1) / sqrt(Nsf + Nfs)    (2)

where Nfs is the number of times Algorithm A failed and Algorithm B succeeded, and Nsf is the number of times Algorithm A succeeded and Algorithm B failed. When z is equal to 0, the two algorithms have similar performance. Additionally, if Nfs is larger than Nsf, then Algorithm B performs better than Algorithm A, and vice versa. We computed the z scores considering Algorithm A as the SVM algorithm using the Pairwise Tensor Product (Kpfam) and three different Algorithm Bs, using SVM
Figure 4 Comparison of some pairwise rational kernels and pairwise kernels grouped by kernel types (N-GRAM group, PHY group and
PFAM group).
Roche-Lima et al. BMC Bioinformatics 2014, 15:318 Page 12 of 13
http://www.biomedcentral.com/1471-2105/15/318
with three different PRKs from Table 2 (i.e., KPRKM−3gram , used similar types of stochiometric data, which can be
KPRKC−3gram and KPRKT−3gram +Kpfam mentioned above). converted into kernels to be consider with PRKs.
In all cases, we obtained z scores greater than 0 (i.e.,
4.73, 4.54, 7.51), which mean the PRKs performed better. Conclusion
These z-score also proved that the difference was statis- In this paper, we introduced a new framework called
tically significant with a confidence level of 99% (based Pairwise Rational Kernels, where pairwise kernels are
on Two-tailed Prediction Confidence Levels described obtained based on transducer representations, i.e., ratio-
by [45]). nal kernels. We defined the framework, developed general
The Cartesian Kernel has not been widely used since algorithms and tested on the pairwise Support Vector
it was defined by Kashima et al. [10]. Kashima et al. [10] Machine method to predict metabolic networks.
used Expression, Localization, Chemical and Phylogenetic We used a dataset from the yeast Saccharomyces cere-
kernels to predict metabolic networks. Each of these visiae to validate and compare our proposal with similar
are non-sequence kernels. In the current experiments models using data from the same species. We obtained
we computed, for first time, the pairwise Cartesian ker- better execution times than the other models, while
nel with a rational kernel (sequence kernel) to repre- maintaining adequate accuracy values. Therefore, PRKs
sent sequence data for metabolic network prediction. improved the performance of the pairwise-SVM algo-
Cartesian kernels [10] have been defined as an alternative rithm used in the training process of the supervised
to improve the Tensor Product Pairwise Kernel [22] com- network inference methods.
putation performance. In the three experiments shown in In these methods, the learning process are executed
Table 2, we confirmed this definition, as we have obtained once to obtain the decision function. The decision func-
better accuracy and execution times when we used the tion can be used as many times as necessary to predict
Cartesian Pairwise Rational Kernel (KPRKC−3gram ) rather interaction between the other sequences in the species
than the Tensor Product Rational Kernel (KPRKT−3gram ). and predict the metabolic pathways.
Comparing our results with Kashima et al. [10], we The methods in this research used sequence data
obtained better average AUC values (i.e., 0.844 vs 0.79), (e.g., nucleotide sequences) to predict these interactions.
and approximately the same average of the execution Genes do not need to be correctly annotated as the raw
times (i.e., 93 seconds). Kashima et al. [10] used non- sequences can be used. Therefore, our methods were
sequence data and random selection of positive and nega- able to avoid the error accumulation due to wrong gene
tive data for training. annotations.
Figure 4 shows the results of the experiments compar- As future work, our proposal will be used to produce a
ing the PRK framework with other pairwise kernels. The set of candidate interactions of pathways from the same
three comparative groups described in Table 1 were used. and other species, that could be experimentally validated.
As can be seen, the execution times were better when the As well, other pairwise rational kernels may be developed
PRKs are used in the three groups. This proves that PRKs using other finite-state transducers operations.
compute faster because rational kernels use finite-state
transducer operations and representations, improving the Competing interests
The authors declare that they have no competing interests.
performance.
The power of using kernels is that almost any sort Authors’ contributions
of data can be represented using kernels. Therefore, ARL implemented the algorithms and developed the experiments. ARL, MD
and BF contributed equally to the drafting of this manuscript. All authors have
completely disparate types of data can be combined to reviewed and approved the final version of this manuscript.
add power to kernel-based machine learning methods
[8]. For example, coefficients describing relative amounts Acknowledgements
This work is funded by Natural Sciences and Engineering Research Council of
of metabolites involved in a biochemical reaction (i.e., Canada (NSERC) and Microbial Genomics for Biofuels and Co-Products from
stochiometric data) can also be represented as ker- Biorefining Processes (MGCB2 project).
nels and added to strength the predicting model. For
example, the reaction catalyzed by fructose-bisphosphate Author details
1 Department of Computer Science, University of Manitoba, Winnipeg,
aldolase [EC 4.1.2.13] splits 1 molecule of fructose Manitoba, Canada. 2 Department of Plant Science, University of Manitoba, R3T
1,6-bisphosphate into 2 molecules of glyceraldehyde 2N2 Winnipeg, Manitoba, Canada.
3-phosphate, where the relative amounts of substrate
Received: 11 April 2014 Accepted: 23 September 2014
and product are represented by the coefficients 1 and Published: 26 September 2014
2, respectively. A stoichiometric kernel therefore would
encode coefficients for all substrates and products, where References
1. Faust K, Helden J: Predicting metabolic pathways by sub-network
enzymes that do not interact would have stoichiometric extraction. In Bacterial Molecular Networks. Methods in Molecular Biology.
coefficients of 0. Other authors [46-48] have defined and Springer: New York; 2012:107–130.
2. Beurton-Aimar M, Nguyen TV-N, Colombié S: Metabolic network reconstruction and their topological analysis. In Plant Metabolic Flux Analysis. New York: Springer; 2014:19–38.
3. Osterman A, Overbeek R: Missing genes in metabolic pathways: a comparative genomics approach. Curr Opin Chem Biol 2003, 7(2):238–251.
4. Karp PD, Latendresse M, Caspi R: The Pathway Tools pathway prediction algorithm. Stand Genomic Sci 2011, 5(3):424–429.
5. Latendresse M, Paley S, Karp PD: Browsing metabolic and regulatory networks with BioCyc. In Bacterial Molecular Networks. New York: Springer; 2011:197–216.
6. Caspi R, Altman T, Dreher K, Fulcher CA, Subhraveti P, Keseler IM, Kothari A, Krummenacker M, Latendresse M, Mueller LA: The MetaCyc database of metabolic pathways and enzymes and the BioCyc collection of pathway/genome databases. Nucleic Acids Res 2012, 40(D1):742–753.
7. Roche-Lima A, Domaratzki M, Fristensky B: Supervised learning methods to infer metabolic network using sequence and non-sequence kernels. In Proceedings of the International Workshop on Machine Learning in Systems Biology, Conference ISMB/ECCB'13. Berlin, Germany; 2013.
8. Fu Y: Kernel methods and applications in bioinformatics. In Handbook of Bio-Neuroinformatics. Berlin-Heidelberg: Springer; 2014:275–285.
9. Yamanishi Y: Supervised inference of metabolic networks from the integration of genomic data and chemical information. In Elements of Computational Systems Biology. USA: Wiley; 2010:189–212.
10. Kashima H, Oyama S, Yamanishi Y, Tsuda K: Cartesian kernel: an efficient alternative to the pairwise kernel. IEICE Trans Inform Syst 2010, 93(10):2672–2679.
11. Kotera M, Yamanishi Y, Moriya Y, Kanehisa M, Goto S: GENIES: gene network inference engine based on supervised analysis. Nucleic Acids Res 2012, 40(W1):162–167.
12. Ben-Hur A, Ong CS, Sonnenburg S, Schölkopf B, Rätsch G: Support vector machines and kernels for computational biology. PLoS Comput Biol 2008, 4(10):e1000173.
13. Yamanishi Y, Vert JP, Kanehisa M: Protein network inference from multiple genomic data: a supervised approach. Bioinformatics 2004, 20(Suppl 1):363–370.
14. Yamanishi Y, Vert JP, Kanehisa M: Supervised enzyme network inference from the integration of genomic data and chemical information. Bioinformatics 2005, 21(Suppl 1):468–477.
15. Kato T, Tsuda K, Asai K: Selective integration of multiple biological data for supervised network inference. Bioinformatics 2005, 21(10):2488–2495.
16. Allauzen C, Mohri M, Talwalkar A: Sequence kernels for predicting protein essentiality. In Proceedings of the 25th International Conference on Machine Learning (ICML '08). New York, NY, USA: ACM; 2008:9–16.
17. Cortes C, Mohri M: Learning with weighted transducers. In Proceedings of the 2009 Conference on Finite-State Methods and Natural Language Processing: Post-proceedings of the 7th International Workshop FSMNLP 2008. Amsterdam, The Netherlands: IOS Press; 2009:14–22.
18. Cortes C, Haffner P, Mohri M: Rational kernels: theory and algorithms. J Mach Learn Res 2004, 5:1035–1062.
19. Mohri M: Finite-state transducers in language and speech processing. Comput Linguist 1997, 23(2):269–311.
20. Mohri M, Pereira F, Riley M: Weighted finite-state transducers in speech recognition. Comput Speech Lang 2002, 16(1):69–88.
21. Hertz T, Bar-Hillel A, Weinshall D: Boosting margin based distance functions for clustering. In Proceedings of the Twenty-first International Conference on Machine Learning. Helsinki, Finland: ACM; 2004:50.
22. Ben-Hur A, Noble WS: Kernel methods for predicting protein–protein interactions. Bioinformatics 2005, 21(Suppl 1):38–46.
23. Vert JP, Qiu J, Noble W: A new pairwise kernel for biological network inference with support vector machines. BMC Bioinformatics 2007, 8(Suppl 10):8.
24. Rabin MO, Scott D: Finite automata and their decision problems. IBM J Res Dev 1959, 3(2):114–125.
25. Albert J, Kari J: Digital image compression. In Handbook of Weighted Automata. EATCS Monographs on Theoretical Computer Science. New York: Springer; 2009.
26. Hofmann T, Schölkopf B, Smola AJ: Kernel methods in machine learning. Ann Stat 2008, 36(3):1171–1220.
27. Leslie CS, Eskin E, Cohen A, Weston J, Noble WS: Mismatch string kernels for discriminative protein classification. Bioinformatics 2004, 20(4):467–476.
28. Mohri M: Weighted automata algorithms. In Handbook of Weighted Automata. New York: Springer; 2009:213–254.
29. Yu J, Guo M, Needham CJ, Huang Y, Cai L, Westhead DR: Simple sequence-based kernels do not predict protein–protein interactions. Bioinformatics 2010, 26(20):2610–2614.
30. Basilico J, Hofmann T: Unifying collaborative and content-based filtering. In Proceedings of the Twenty-first International Conference on Machine Learning. Helsinki, Finland: ACM; 2004:9.
31. Oyama S, Manning CD: Using feature conjunctions across examples for learning pairwise classifiers. In Machine Learning: ECML 2004. New York: Springer; 2004:322–333.
32. Brunner C, Fischer A, Luig K, Thies T: Pairwise support vector machines and their application to large scale problems. J Mach Learn Res 2012, 13:2279–2292.
33. Cortes C, Vapnik V: Support-vector networks. Mach Learn 1995, 20(3):273–297.
34. Gomez SM, Noble WS, Rzhetsky A: Learning to predict protein–protein interactions from protein sequences. Bioinformatics 2003, 19(15):1875–1881.
35. Horn RA, Johnson CR: Matrix Analysis. Cambridge, United Kingdom: Cambridge University Press; 2012.
36. Sikorski RS, Hieter P: A system of shuttle vectors and yeast host strains designed for efficient manipulation of DNA in Saccharomyces cerevisiae. Genetics 1989, 122(1):19–27.
37. Kanehisa M, Araki M, Goto S, Hattori M, Hirakawa M, Itoh M, Katayama T, Kawashima S, Okuda S, Tokimatsu T: KEGG for linking genomes to life and the environment. Nucleic Acids Res 2008, 36(Suppl 1):480–484.
38. Allauzen C, Riley M, Schalkwyk J, Skut W, Mohri M: OpenFst: a general and efficient weighted finite-state transducer library. In Implementation and Application of Automata. New York: Springer; 2007:11–23.
39. Allauzen C, Mohri M: OpenKernel Library. http://www.openfst.org/twiki/bin/view/Kernel.
40. Chang C-C, Lin C-J: LIBSVM: a library for support vector machines. ACM Trans Intell Syst Technol 2011, 2:27:1–27:27. Software available at http://www.csie.ntu.edu.tw/~cjlin/libsvm.
41. Cliften PF, Hillier LW, Fulton L, Graves T, Miner T, Gish WR, Waterston RH, Johnston M: Surveying Saccharomyces genomes to identify functional elements by comparative DNA sequence analysis. Genome Res 2001, 11(7):1175–1186.
42. Kohavi R: A study of cross-validation and bootstrap for accuracy estimation and model selection. In International Joint Conference on Artificial Intelligence. Volume 14. Montreal, Canada; 1995:1137–1145.
43. Cortes C, Mohri M: Confidence intervals for the area under the ROC curve. Adv Neural Inform Process Syst 2005, 17:305.
44. McNemar Q: Note on the sampling error of the difference between correlated proportions or percentages. Psychometrika 1947, 12(2):153–157.
45. Bostanci B, Bostanci E: An evaluation of classification algorithms using McNemar's test. In Proceedings of the Seventh International Conference on Bio-Inspired Computing: Theories and Applications (BIC-TA 2012). Gwalior, India; 2012:15–26.
46. Mailier J, Remy M, Wouwer AV: Stoichiometric identification with maximum likelihood principal component analysis. J Math Biol 2013, 67(4):739–765.
47. Bernard O, Bastin G: On the estimation of the pseudo-stoichiometric matrix for macroscopic mass balance modelling of biotechnological processes. Math Biosci 2005, 193(1):51–77.
48. Aceves-Lara C-A, Latrille E, Bernet N, Buffière P, Steyer J-P: A pseudo-stoichiometric dynamic model of anaerobic hydrogen production from molasses. Water Res 2008, 42(10):2539–2550.

doi:10.1186/1471-2105-15-318
Cite this article as: Roche-Lima et al.: Metabolic network prediction through pairwise rational kernels. BMC Bioinformatics 2014, 15:318.