
Optimal Aggregation of Binary Classifiers for Multiclass Cancer Diagnosis Using Gene Expression Profiles

Naoto Yukinawa, Shigeyuki Oba, Kikuya Kato, and Shin Ishii
Abstract: Multiclass classification is one of the fundamental tasks in bioinformatics and typically arises in cancer diagnosis studies based on gene expression profiling. There have been many studies of aggregating binary classifiers to construct a multiclass classifier based on one-versus-the-rest (1R), one-versus-one (11), or other coding strategies, as well as some comparison studies between them. However, these studies found that the best coding depends on each situation. Therefore, a new problem, which we call the optimal coding problem, has arisen: how can we determine which coding is the optimal one in each situation? To approach this optimal coding problem, we propose a novel framework for constructing a multiclass classifier, in which each binary classifier to be aggregated has a weight value to be optimally tuned based on the observed data. Although there is no a priori answer to the optimal coding problem, our weight tuning method can provide a consistent answer to the problem. We apply this method to various classification problems, including a synthesized data set and some cancer diagnosis data sets from gene expression profiling. The results demonstrate that, in most situations, our method can improve classification accuracy over simple voting heuristics and is better than or comparable to state-of-the-art multiclass predictors.

Index Terms: Multiclass classification, error-correcting output coding, gene expression profiling, cancer diagnosis.

1 INTRODUCTION
DNA microarrays or alternative quantification techniques have enabled genome-wide expression analyses of
various biological phenomena. One important application
of this technique is cancer diagnosis, where the expression
level of thousands of genes can be used as a vast amount of
molecular biomarkers of specific phenotypes. This analysis
is expected to overcome the conventional problems of
histopathological cancer diagnosis such as variations in
diagnosis by individual pathologists or difficulties in
differentiating between malignant and benign tissues due
to their morphological similarities. For constructing diag-
nosis systems using high-dimensional gene expression data,
supervised learning theories are often applied, and several
studies have been successful in recent years. Representative
studies include classification of two kinds of acute leukemias
[1] by weighted voting algorithm, classification of four types
of small round blue cell tumors (SRBCTs) by artificial neural
networks [2], and the diagnosis of multiple (14 types)
common adult malignancies by a multiclass support vector
machine (SVM) [3]. These existing studies revealed that
tissues from different origins can be well classified by
supervised classification algorithms, mainly because the
gene expression profile of an origin is considerably different
from the others. On the contrary, classifying multiple types
of tissue from the same origin, for example, hereditary breast
cancer [4], is much more difficult due to the similarity in
gene expression patterns between phenotypic variants; there
is still no definitive method. When considering histopatho-
logical applications in the postgenomic era, however, we
must deal with such difficult situations, and sophisticated
multiclass prediction methods are required. In this paper,
we propose a novel supervised learning approach to
multiclass classification problems.
For classifying gene expression profiles, the SVM has been thought to be the most promising method in recent years, because a larger margin of the decision boundary between two classes improves its generalization capability for class separation, especially in the high-dimensional gene expression vector space. The SVM originally handles binary classification problems; in a multiclass problem, however, it needs some device to integrate the binary classification results into the final answer to the original multiclass ($K$-class) classification problem. For the integration process, the following simple voting heuristics have been frequently used: 1) prepare a set of $K$ binary classifiers, each of which separates one class from all the other classes (one-versus-the-rest: 1R); a single guess is then determined by voting over the outputs of the $K$ binary classifiers [5]; and 2) prepare a set of $K(K-1)/2$ binary classifiers, each of which separates one class from another (one-versus-one: 11); a single guess is then determined by a vote performed by them [6]. These integration processes are generalized
. N. Yukinawa and S. Oba are with the Graduate School of Information Science, Nara Institute of Science and Technology, 8916-5 Takayama-cho, Ikoma, Nara 630-0192, Japan. E-mail: {naoto-yu, shige-o}@is.naist.jp.
. K. Kato is with the Research Institute, Osaka Medical Center for Cancer and Cardiovascular Diseases, 1-3-2 Nakamichi, Higashinari-ku, Osaka 537-8511, Japan. E-mail: katou-k@mc.pref.osaka.jp.
. S. Ishii is with the Graduate School of Informatics, Kyoto University, Gokasho, Uji, Kyoto 611-0011, Japan, and the Graduate School of Information Science, Nara Institute of Science and Technology, 8916-5 Takayama-cho, Ikoma, Nara 630-0192, Japan. E-mail: ishii@i.kyoto-u.ac.jp.
Manuscript received 18 May 2006; revised 25 May 2007; accepted 29 June 2007; published online 31 July 2007.
For information on obtaining reprints of this article, please send e-mail to: tcbb@computer.org, and reference IEEECS Log Number TCBB-0113-0506.
Digital Object Identifier no. 10.1109/TCBB.2007.70239.
into the framework of error-correcting output coding
(ECOC) [7], which enables the use of a general set of binary
classifiers such as exhaustive coding [7] and random
coding [8]. In addition, arbitrary integration methods
rather than simple voting can be implemented in the
ECOC framework. For example, Hastie and Tibshirani [9]
proposed a probabilistic approach, which made it possible
to integrate probabilistic outputs from binary classifiers of
11 coding. Zadrozny [10] also presented a probabilistic
approach to integrate a general set of binary outputs.
There are also some comparison studies of these various
classification methods applied to multiclass cancer classifi-
cation problems. Li et al. [11] compared the performance of
several multiclass classification methods by applying them
to published data sets of gene expression profiles; they
evaluated SVMs including simple voting heuristics with 1R,
11, exhaustive, and random coding, as well as the Naive
Bayes method, KNN, and the J4.8 decision tree. They found
that SVMs showed overwhelming performance in most cases
and that choosing a set of binary classifiers, i.e., favorable
coding, was problem specific. Ramaswamy et al. [3] also
compared the performance of SVMs with 1R and 11 and
concluded that 1R showed better performance. Statnikov et
al. [12] exhaustively compared the performance of several
SVMs, KNN, and neural networks by using published gene
expression data sets, concluding that multiclass SVMs [13],
[14] and simple voting (1R) were the better classification
methods; however, the best SVMalgorithmamong themwas
again problem specific.
In this study, we propose a novel framework to obtain
problem-specific optimal coding. We first revisit the
probabilistic approach proposed in [9], leading to our
modification called the maximum a posteriori (MAP)
method. In order to deal with the optimal coding problem,
then, we introduce weights to the constituent binary
classifiers, which are optimized so as to maximize the
classification performance for the training data set; this is
called a weighted MAP (WMAP) algorithm. It can obtain a
better graded set of binary classifiers than the conven-
tional 1R and 11 by solving the optimal coding problem. We
show that the proposed method improves classification
performance over simple voting heuristics by binary
classifiers not only for a synthesized problem but also for
several difficult multiclass cancer classification problems.
2 ECOC AND OPTIMAL CODING PROBLEM
The primary objective of supervised multiclass prediction is to construct a predictor that predicts the class label $y_i \in C$ of the $i$th sample from its pattern vector $x_i$, where $C = \{1, \ldots, K\}$ is a set of $K \ge 3$ class labels. The predictor is constructed based on a training data set consisting of $N$ samples accompanied by their class labels, $D = \{(x_i, y_i)\}_{i=1,\ldots,N}$.
In the framework of ECOC [7], [15], each multiclass problem is decomposed into multiple binary prediction problems, which are denoted by a code matrix $B \in \{\text{"1"}, \text{"0"}, \text{"*"}\}^{J \times K}$, where $J$ represents the number of binary prediction problems (see Fig. 1). We call the configuration of a code matrix a "coding" or a "coding method." For example (Fig. 1a), when the $j$th row of the code matrix includes "1" as the first and second elements, "0" as the fourth element, and "*" as the third element, this row indicates that the $j$th binary predictor ideally outputs 1 and 0 for input sample patterns belonging to classes {1, 2} and {4}, respectively; the $j$th predictor does not care about sample patterns belonging to the third class. We call the pair of subsets corresponding to
Fig. 1. Overview of the ECOC method. (a) An example of a target for a four-class problem, representing the corresponding binary classification problem. A target is represented by a three-valued row vector of length 4 (the number of classes), where each column's index corresponds to a single class label. "1" (white square) and "0" (black square) indicate the positive and negative class labels for binary classification (in this case, {1, 2} versus {4}), respectively, and "*" (gray square) indicates unused class labels. (b) Typical code matrices for the four-class problem. A code matrix consists of arbitrary target vectors, i.e., the row and column indices correspond to a target and a class label, respectively. The 1R and 11 code matrices are traditional designs; 1R is the set of one-versus-the-rest targets ({1} versus {2, 3, 4}, {2} versus {1, 3, 4}, ...), and 11 is the set of one-versus-one targets ({1} versus {2}, {1} versus {3}, ...). AA consists of all possible targets, including those of 1R and 11. (c) Multiclass classification by the ECOC method. In this example, the 11 code matrix and the Hamming decoder are used. First, six binary classifiers, one for each 11 target, are trained on a training data set. Then, a test pattern is classified by the six classifiers, and consequently a binary (coded) pattern is obtained. The decoder searches for the nearest column vector (code word) of the designed 11 code matrix with respect to the Hamming distance and outputs the corresponding class label as the final guess.
the $j$th row of the code matrix, {1, 2} and {4} in this case, the $j$th target. For an input sample $x$ whose multiclass label is being predicted, the multiple outputs from all the binary predictors defined by the code matrix are aggregated and decoded into the multiclass output $y$ whose column of the code matrix, termed a code word, is most similar to the set of binary outputs (Fig. 1c). Although some binary predictors may make errors in actual cases, if the number of errors is not too large, an appropriate decoding procedure can correct the errors and restore the correct multiclass label. This is the basic idea of ECOC.
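To make the decoding step concrete, the following sketch (our illustration, not code from the paper) builds the 11 code matrix for a four-class problem and decodes a vector of binary outputs by Hamming distance, as in Fig. 1c; the function names and the NaN encoding of the "*" symbol are our own choices.

```python
import numpy as np
from itertools import combinations

def one_vs_one_code_matrix(K):
    """Build the 11 (one-versus-one) code matrix: J x K entries in
    {1, 0, NaN}, where NaN plays the role of the '*' (unused) symbol."""
    targets = list(combinations(range(K), 2))   # (0,1), (0,2), ...
    M = np.full((len(targets), K), np.nan)
    for j, (pos, neg) in enumerate(targets):
        M[j, pos], M[j, neg] = 1.0, 0.0
    return M

def hamming_decode(binary_outputs, M):
    """Return the class whose code word (column of M) is nearest to the
    binary outputs, counting only positions where the code word is not '*'."""
    dist = np.nansum(np.abs(M - binary_outputs[:, None]), axis=0)
    return int(np.argmin(dist))

M = one_vs_one_code_matrix(4)                   # 6 targets for K = 4
outputs = np.array([1, 1, 1, 0, 0, 0])          # hypothetical binary guesses
print(hamming_decode(outputs, M))               # -> class 0
```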
To design an effective classifier according to the ECOC
framework, selecting the appropriate code matrix and
decoding procedure is essential. Conventional proce-
dures of multiclass prediction based on the one-versus-the-
rest (1R) or one-versus-one (11) methods are understood as
practical examples of ECOC employing simple code
matrices representing 1R or 11 (Fig. 1b) and the simplest
Hamming decoding procedure. In the ECOC framework,
favorable coding can be selected from heuristic candidates such as 11, 1R, and all-possible-combinations (AA, see Fig. 1b). Although an optimal coding (or an
optimal code matrix), if it exists, is expected to enhance the
resultant multiclass prediction, which code matrix is the
optimal one has been found to depend on each situation
[16]. In our study, instead of looking directly for the optimal
coding, we intend to optimally weigh the binary classifiers
whose set is given arbitrarily as an initial code matrix. Since
this weight optimization is performed so as to exhibit the
best performance based on a given data set, we expect that
the optimal coding problem can be solved in a consistent
manner in each situation. The validity of this novel idea is
examined through experiments using a synthesized data set
and some difficult bioinformatics data sets.
3 COMBINING PROBABILISTIC GUESSES OF BINARY
CLASSIFIERS BY STATISTICAL ESTIMATION
Our framework employs probabilistic decoding, which was first proposed by Hastie and Tibshirani [9], in particular for 11 coding, and was later extended by Zadrozny [10] to general coding methods. It decodes a probabilistic guess for the multiclass problem from the aggregated probabilistic guesses for the binary problems.
For the $i$th sample with a sample pattern vector $x_i$, we assume a class membership probability vector $p^{(i)}$ whose components are the true but unobserved membership probabilities $p^{(i)}_y$ of each class label $y \in C$:

$$p^{(i)}_y \ge 0, \qquad \sum_{y \in C} p^{(i)}_y = 1. \qquad (1)$$

We attempt to estimate $p^{(i)}$ and call the estimate a probabilistic guess for the primary multiclass problem. Let $\theta^{(i)}_j = \hat{P}(y_i \in P_j \mid x_i,\, y_i \in P_j \cup N_j)$ be the probabilistic guess of the $j$th binary predictor for the $i$th sample, where $P_j \subseteq C$ and $N_j \subseteq C$ are the class subsets corresponding to the binary outputs 1 (positive) and 0 (negative) of the $j$th binary predictor, respectively.¹ Let $\Theta^{(i)} = \{\theta^{(i)}_j\}_{j \in B}$ denote the set of binary membership probabilities, where $B$ is the set of binary predictors defined by a code matrix. It is noted that the code matrix $B$ can be represented by an arbitrary set of code words (each of which corresponds to a class), not restricted to 1R or 11, according to our approach (Fig. 1b). Thus, the set of binary membership probabilities for the entire data set, $\{\Theta^{(i)}\}_{i=1,\ldots,N}$, is determined by the set of binary classifiers in $B$, based on the training data set $D$. In the following, we omit the superscript $(i)$ when that does not risk causing confusion.
Since our study aims at presenting a good methodology
to deal with the optimal coding problem, our task is, in
principle, free from the choice of binary classifiers. For
frequently used binary classifiers such as linear discrimi-
nant analysis and SVM, probabilistic outputs are not
available straightforwardly. In this study, we use SVM as
an individual binary classifier, to which we apply logistic
regression whose parameter is determined by cross valida-
tion with the training data set [17], in order to obtain a
probabilistic guess from the discriminant function value of
the SVM (for details, see Appendix A). The dependence on
individual binary classifiers will be briefly discussed in
Section 6.
Next, we proceed to an estimation procedure for the multiclass membership $p$ from the set of binary membership probabilities $\Theta$. Under the assumption of the true multiclass membership probability $p$, the true binary class probability with respect to the $j$th target, $\theta_j(x) = P(y \in P_j \mid x,\, y \in P_j \cup N_j)$, is given by

$$\theta_j(p) = \frac{p_{P_j}}{p_{P_j} + p_{N_j}}, \qquad (2)$$

where the membership probability $p_S$ of a subset of class labels $S \in 2^C$ is given by a simple summation of the class membership probabilities of the single classes, $p_S = \sum_{y \in S} p_y$. To obtain a $p$ that best fits the observed $\Theta$, a weighted Kullback-Leibler (KL) divergence between $\Theta$ and $\theta(p)$ is minimized with respect to $p$:

$$E(\Theta; p) = \sum_{j \in B} w_j \left\{ \theta_j \log \frac{\theta_j}{\theta_j(p)} + (1 - \theta_j) \log \frac{1 - \theta_j}{1 - \theta_j(p)} \right\}, \qquad (3)$$

where $w_j$ is a confidence weight variable corresponding to the $j$th target, which could be set to $w_j = 1$ in the simplest case. In the next section, we will consider how to determine the $w_j$ values appropriately, which corresponds to the optimal coding process. Since the natural distribution of $p$ is multinomial, we introduce a Dirichlet prior into (3) for regularization, and the problem is formulated as maximization of the following objective function:

$$\Psi(p) = \sum_{j \in B} w_j \left[ \theta_j \log p_{P_j} + (1 - \theta_j) \log p_{N_j} - \log\left(p_{P_j} + p_{N_j}\right) \right] + \sum_{y \in C} \alpha_0 \log p_y + \mathrm{const}, \qquad (4)$$
where $\alpha_0$ is a hyperparameter that controls the intensity of the Dirichlet prior, and the constant term is independent of $p$. The Dirichlet prior term encodes prior knowledge of the rate of random mislabels and contributes to stabilizing the optimization algorithm. We set $\alpha_0 = 0.001$ in this study, which leads to stability, whereas its variation did not affect the results very much. By maximizing the objective function $\Psi(p)$ with respect to $p$ under constraint (1), we obtain the probability estimate of class membership $\hat{p}$. This maximization can be performed by the steepest descent method with a Lagrange multiplier. In the simplest case where all the weight variables $\{w_j\}_{j \in B}$ are set to unity, this probabilistic estimation is similar to the existing probabilistic decoding [9] and is subsequently called the MAP method. The pseudocode for the MAP method is presented as Algorithm 1 in Appendix B.

1. In Fig. 1a, $P_j = \{1, 2\}$ and $N_j = \{4\}$. In Fig. 1b, $P_j$ and $N_j$ are denoted by white and black squares, respectively.
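As an illustration of this decoding step, here is a minimal sketch of maximizing (4) for a single sample (ours, not the authors' implementation). It assumes binary guesses theta[j] and Boolean masks P[j], N[j] over the K classes for each target's positive and negative subsets; like Algorithm 1 in Appendix B, it ascends the gradient along log p and renormalizes after each step, though a fixed step size replaces the line search.

```python
import numpy as np

def map_decode(theta, w, P, N, K, alpha0=0.001, n_steps=500, lr=0.05):
    """MAP decoding of one sample: gradient ascent on objective (4), taken
    along log p so that p stays positive, with renormalization after every
    step to enforce constraint (1). theta[j] is the binary guess of target
    j; P[j] and N[j] are Boolean masks giving its class subsets."""
    p = np.full(K, 1.0 / K)
    for _ in range(n_steps):
        g = alpha0 / p                             # Dirichlet prior term of (4)
        for j in range(len(theta)):
            pP, pN = p[P[j]].sum(), p[N[j]].sum()
            g[P[j]] += w[j] * (theta[j] / pP - 1.0 / (pP + pN))
            g[N[j]] += w[j] * ((1.0 - theta[j]) / pN - 1.0 / (pP + pN))
        p = p * np.exp(lr * p * g)                 # step along log p
        p /= p.sum()                               # renormalize, constraint (1)
    return p

# Example: three classes with 1R targets; the binary guesses favor the first class.
P = [np.array(m, bool) for m in ([1, 0, 0], [0, 1, 0], [0, 0, 1])]
N = [~m for m in P]
print(map_decode(np.array([0.9, 0.1, 0.2]), np.ones(3) / 3, P, N, K=3))
```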
4 OPTIMIZATION OF THE WEIGHTS OF
BINARY CLASSIFIERS
In this section, we propose a procedure to optimize the weight variables $w = \{w_j\}_{j \in B}$, which allows us to approach the optimal coding within the usage of the initial code matrix $B$. To optimize the weights $w$, we define a gain function $G$ that represents the concordance between the class membership probability estimates $p^{(i)}$ and the true class labels $y_i$:

$$G = G\left(\{p^{(i)}\}_{i=1,\ldots,N},\ \{t^{(i)}\}_{i=1,\ldots,N}\right) = \sum_{i=1}^{N} \sum_{y \in C} t^{(i)}_y\, \mathrm{smx}_y\!\left(p^{(i)}\right), \qquad (5)$$

where $t^{(i)} = (t^{(i)}_1, \ldots, t^{(i)}_K)$ is a $K$-dimensional binary vector that indicates a single class label; $t^{(i)}_y = 1$ if sample $i$ belongs to class $y$, and otherwise $t^{(i)}_y = 0$. $\mathrm{smx}_y(p^{(i)})$ is a soft-max function:

$$\mathrm{smx}_y(p) = \frac{\exp(\beta p_y)}{\sum_{y' \in C} \exp(\beta p_{y'})},$$

where $\beta$ is an inverse temperature parameter, which controls the sharpness of the soft-max function; as $\beta \to \infty$, $\mathrm{smx}_y(p)$ approaches 1 for $y = \arg\max_{y'} p_{y'}$, and 0 otherwise. Since the setting of this parameter barely affects the results, we set it to an appropriately large value.
The MAP solution does not depend on the linear scale of the KL divergence (3), i.e., on multiplication of every weight $w_j$, $j \in B$, by a constant. To remove this scale indeterminacy, we introduce a constraint:

$$w_j \ge 0, \qquad \sum_{j \in B} w_j = 1. \qquad (6)$$
The gain function $G$ is an implicit function of $w$; namely, $G$ depends on $p$, which is itself obtained by maximizing a function that depends on $w$. Therefore, the optimization of $G$ with respect to $w$ is to obtain the $\tilde{w}$ that satisfies

$$\tilde{w} = \arg\max_{w} G\left(\{\tilde{p}^{(i)}(w)\}_{i=1,\ldots,N},\ \{t^{(i)}\}_{i=1,\ldots,N}\right) \quad \text{under condition (6)}, \qquad (7)$$

$$\tilde{p}^{(i)} = \arg\max_{p^{(i)}} \Psi\left(p^{(i)} \mid w\right) \quad \text{under condition (1)}, \qquad (8)$$

for the given training data set $D = \{\Theta^{(i)}, t^{(i)}\}_{i=1,\ldots,N}$. This is a twofold optimization problem; the outer and inner optimizations are given by (7) and (8), respectively. The optimal $K$-class classifier is configured by optimizing $\tilde{w}$ over the entire data set $D$ in the outer optimization, and, using it, the class membership probability estimate $p^{(i)}$ of each pattern vector $x_i$ is given by the inner optimization. In other words, the outer and inner optimizations correspond to the optimal coding and decoding processes, respectively.
A solution to this optimization problem is shown in Appendix C. We call this algorithm the WMAP method. Note that we can utilize an arbitrary gain function in place of (5) if the gain function is differentiable with respect to $p^{(i)}$. Accordingly, our WMAP approach looks for the optimal graded coding, represented as the weight vector $w$, within the initial setting of the binary code matrix $B$. This optimization process is in principle free from the choice of individual binary classifiers (typically SVMs), the probabilistic transformation of their discriminant function values (typically logistic regression), and the original code matrix (typically AA). The pseudocode for this weight optimization procedure is presented as Algorithm 2 in Appendix B.
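To make the twofold structure concrete, here is a minimal sketch of the outer optimization (7), under simplifying assumptions of ours: it reuses map_decode from the previous sketch for the inner step (8), replaces the analytic gradient (12) of Appendix C with a finite-difference gradient of the gain (5), and projects back onto the simplex (6) by clipping and renormalizing.

```python
import numpy as np

def softmax_gain(P_hat, T, beta=2000.0):
    """Gain (5): soft-max concordance between membership estimates P_hat
    (N x K) and one-hot label vectors T (N x K)."""
    Z = np.exp(beta * (P_hat - P_hat.max(axis=1, keepdims=True)))
    return float((T * Z / Z.sum(axis=1, keepdims=True)).sum())

def wmap_train(Theta, T, P, N, K, n_iters=30, eps=1e-3, lr=0.5):
    """Outer loop (7): tune the weights w on the simplex by ascending a
    finite-difference gradient of the gain; the inner problem (8) is
    solved for every sample by map_decode (previous sketch)."""
    J, n = len(P), Theta.shape[0]
    w = np.full(J, 1.0 / J)

    def gain(w):
        P_hat = np.array([map_decode(Theta[i], w, P, N, K) for i in range(n)])
        return softmax_gain(P_hat, T)

    for _ in range(n_iters):
        g0, grad = gain(w), np.zeros(J)
        for j in range(J):                       # finite-difference gradient
            w_eps = w.copy(); w_eps[j] += eps
            grad[j] = (gain(w_eps / w_eps.sum()) - g0) / eps
        w = np.clip(w + lr * grad, 0.0, None)    # ascend, keep w_j >= 0
        w /= w.sum()                             # project onto constraint (6)
    return w
```

This costs J + 1 inner decodings per outer iteration, so it is only meant to exhibit the structure; the analytic gradient (12) avoids the finite differences.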
5 RESULTS
5.1 Experiment 1: Applications to Synthesized
Data Sets
We first examined the performance of the two methods, MAP and WMAP, by applying them to a synthesized data set. The aim of this experiment is to show the performance improvement by WMAP for each of three designs of the code matrix: 1R, 11, and AA. Assuming an underlying three-class structure of 2D data points, the data set was synthesized according to the following procedure. First, we generated each data point $x = (x_1, x_2)$ from a 2D uniform distribution within $[-2, 2] \times [-2, 2]$. Next, the class label of each data point was assigned as $\arg\min_{c} \left( \|x - x_c\|^2 - h_c \right)$ based on the distance between the data point and the centroids of the four clusters, $x_{c_1} = (\sqrt{2}, \sqrt{2})$, $x_{c'_1} = (-\sqrt{2}, -\sqrt{2})$, $x_{c_2} = (-\sqrt{2}, \sqrt{2})$, and $x_{c_3} = (\sqrt{2}, -\sqrt{2})$, where $h_{c_1} = 2 \log 0.35$, $h_{c'_1} = 2 \log 0.20$, $h_{c_2} = 2 \log 0.50$, and $h_{c_3} = 2 \log 0.75$. Note that $c_1$ and $c'_1$ represent the same class: that is, this class has two class centroids. We generated 400 points ($c_1$, $c'_1$, $c_2$, and $c_3$: 100 points each) as a training data set and 600 points ($c_1$, $c'_1$, $c_2$, and $c_3$: 150 points each) as a test data set and then merged $c_1$ and $c'_1$ into a single class $c_1$ in each data set (Fig. 2a shows the test data set). Because this data set produces a target, {1} versus {2, 3}, that is apparently inseparable by a simple classifier, poor classification performance would be expected when $B_{1R}$ is used as the ECOC code matrix. When employing $B_{AA}$, which includes $B_{1R}$ within its initial coding, the weight (confidence) optimized by our WMAP for such an unreliable target as {1} versus {2, 3} should shrink to a small value.
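The synthesis can be sketched as follows (ours, for illustration); because the extraction garbled the original coordinates, the specific corner signs of the four centroids are our assumption, chosen so that class 1 occupies one diagonal of the square, which makes {1} versus {2, 3} linearly inseparable, and the balancing to exactly 100 (or 150) points per cluster is omitted for brevity.

```python
import numpy as np

rng = np.random.default_rng(0)
r2 = np.sqrt(2.0)
# Four cluster centroids at the corners of the square; c1 and c1' (same
# class) sit on one diagonal, c2 and c3 on the other (our assumption).
centroids = np.array([[r2, r2], [-r2, -r2], [-r2, r2], [r2, -r2]])
h = 2.0 * np.log(np.array([0.35, 0.20, 0.50, 0.75]))   # h_c = 2 log(...)
labels = np.array([1, 1, 2, 3])          # c1 and c1' merge into class 1

def synthesize(n):
    X = rng.uniform(-2.0, 2.0, size=(n, 2))
    d = ((X[:, None, :] - centroids) ** 2).sum(axis=2) - h   # ||x - x_c||^2 - h_c
    return X, labels[d.argmin(axis=1)]

X_train, y_train = synthesize(400)
X_test, y_test = synthesize(600)
```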
We constructed a total of six combinations of the two multiclass classification algorithms, MAP and WMAP, and the three code matrices, 1R, 11, and AA. As an individual binary classifier, we used an SVM with a linear kernel $k(x, x') = x^{\mathrm{T}} x'$. We set $\alpha_0 = 2$ and $\beta = 2{,}000$ for the (hyper)parameters.
These combinations were evaluated by the means and standard deviations of the three-class classification accuracies for the training and test data sets over five different random data generations (Table 1). As expected, WMAP-AA achieved the best test accuracy among all combinations. The effect of the weight optimization of binary classifiers is remarkable; WMAP-1R and WMAP-AA showed higher test accuracies than MAP-1R and MAP-AA, which employ uniform weights. The reason why WMAP-11 had slightly lower test accuracy than MAP-11 could be overtraining. The performance of WMAP-1R was significantly improved compared to MAP-1R; the poor performance of MAP-1R was caused by a nuisance target in $B_{1R}$, {1} versus {2, 3}, which could not be discriminated well by a linear kernel SVM, and its weight successfully became almost zero in WMAP-1R.
The WMAP-AA result above can be seen as an example of how the weight optimization by WMAP worked. To construct an optimal decision boundary for the whole multiclass classifier, it is better to ignore unreliable binary classifiers and also to weigh binary classifiers appropriately so that they contribute to the final multiclass classification performance; this amounts to effectively seeking an appropriate graded coding starting from the original coding, in this case $B_{AA}$. This experiment demonstrated that our WMAP-AA automatically meets this requirement. Fig. 2 shows the decision boundary (Fig. 2a) and the weights of the binary classifiers (Fig. 2b) obtained by WMAP. Interestingly, the weights of the targets {1} versus {2, 3} and {2} versus {3} became 0 in this result. Since it is difficult to train the binary classifier for the target {1} versus {2, 3} (Fig. 2b, line 4), thus making it unreliable, the weight of {1} versus {2, 3} became approximately zero so as to ignore this classifier in the whole multiclass classifier. On the contrary, the weight of {2} versus {3} became approximately zero for another reason. The data points of classes $c_2$ and $c_3$ for the target {2} versus {3} (Fig. 2b, line 3) have easily separable distributions, and we obtained good binary classification performance for this target. If we put trust in this target, {2} versus {3}, however, the performance of the multiclass classification may degrade, because $c_1$ would be classified into $c_2$ or $c_3$ randomly, based on the decision boundary for this target. The entire multiclass classifier preferred to emphasize other binary classifiers to achieve higher accuracy for $c_1$. The decision for $c_2$ and $c_3$ was then compensated by votes from other classifiers such as {2} versus {1, 3} and {3} versus {1, 2}. As a consequence of grading each element in the code matrix $B_{AA}$ by optimizing the weights, the decision boundary of WMAP-AA came to have an expanded margin that minimizes the classification loss, in comparison to MAP-AA. The result of this simple artificial problem suggests that the optimal coding problem in ECOC can be solved by our weight optimization method (WMAP), which makes unnecessary targets in the initial code matrix shrink.
5.2 Experiment 2: Applications to Tumor
Classification Problems
Our method was next applied to four tumor classification
problems based on gene expression profiling. The informa-
tion of the data sets is summarized in Table 2, and the
details are described below.
5.2.1 Thyroid Cancer Data Set
The thyroid cancer data set is composed of original gene expression profiles from four tissue types of human thyroid origin; it contains 168 samples and 2,000 genes measured by the adaptor-tagged competitive PCR (ATAC-PCR) [18] method. The main diagnostic procedure for thyroid cancer is fine needle aspiration, but because the tissue structure is disrupted during the sampling process, differential diagnosis is extremely difficult [19], [20], [21]. Thus, diagnosis from gene expression profiles has been anticipated, though it is not an easy task. The composition of the samples is 58 (follicular adenoma: FA), 28 (follicular carcinoma: FC), 40 (normal: N), and 42 (papillary adenocarcinoma: PC).
5.2.2 Esophageal Cancer Data Set
This data set is also composed of original gene expression
profiles obtained from esophageal cancers of Japanese
patients by ATAC-PCR [22], [23]. It should be noted that esophageal cancers in Japan are mostly squamous cell carcinomas, while those in the US and Europe are adenocarcinomas, i.e., Barrett tumors. The task here is the differential diagnosis of three histological types: poorly differentiated (14 samples), moderately differentiated (97), and well differentiated (30).
Fig. 2. Application of WMAP-AA and MAP-AA to a synthesized data set. (a) The scatter plot of the data points and the decision boundaries estimated by WMAP-AA (solid lines) and by MAP-AA (dotted lines). The dashed lines represent the Bayes-optimal decision boundaries. (b) The weight optimization result by WMAP-AA. The matrix represents the AA coding of the three-class problem (the symbols are the same as those in Fig. 1). The two bars located to the right of each target vector represent the weight of the target (black; values normalized so that the maximum value becomes 1) and the training accuracy of the binary classifier for the target (gray).
TABLE 1
Classification Performance of Combinations of
Binary Classifiers for an Artificial Problem
TABLE 2
Gene Expression Data Sets of
Four Tumor Classification Problems
5.2.3 SRBCT Data Set [2]
Gene expression profiles of small round blue cell tumors (SRBCTs) of childhood, which contain 83 samples and 2,308 genes measured by cDNA microarrays, can be accessed at http://research.nhgri.nih.gov/microarray/Supplement/. SRBCTs, which include the Ewing family of tumors (EWS), rhabdomyosarcoma (RMS), Burkitt lymphoma (BL), and neuroblastoma (NB), are difficult to distinguish by histology alone due to their similar appearance. The composition of the samples is 29 (EWS), 25 (RMS), 11 (BL), and 18 (NB).
5.2.4 Leukemia Data Set [24]
Gene expression profiles of three types of leukemia, which contain 72 samples and 11,225 genes measured by Affymetrix oligonucleotide arrays, can be accessed at
http://www-genome.wi.mit.edu/cancer. The composition
of the samples is 28 (acute myeloid leukemia: AML),
24 (acute lymphoblastic leukemia: ALL), and 29 (MLL
translocation: MLL).
We prepared six ways of aggregating binary classifiers, identical to those used in Experiment 1. For
each binary classifier to be aggregated, we prepared an
SVM with a linear kernel using all genes without any
selection procedure. It should be noted that in many
classification problems based on gene expression profiling,
employing linear kernels in SVMs has exhibited better
performance than employing more complicated kernels;
since complicated kernels implicitly assume high-dimen-
sional feature spaces, they may overfit the relatively large
noise involved in gene expression data. We preset the (hyper)parameters of the MAP and WMAP methods at $\alpha_0 = 2$ and $\beta = 2{,}000$ for the thyroid cancer, esophageal cancer, and SRBCT data sets, and at $\alpha_0 = 2$ and $\beta = 1{,}500$ for the leukemia data set. We also prepared three state-of-the-art
multiclass classification algorithms: the nearest shrunken centroid algorithm² (NSC) [25] and two direct implementations of MC-SVM, by Weston and Watkins (WW) [13] and by Crammer and Singer (CS), which is a modification of the WW approach [14]. These methods cast the multiclass categorization problem as a constrained optimization problem with a quadratic objective function by introducing a generalized notion of the margin for multiclass problems.
In NSC, the shrinkage parameter was optimized by
searching from 0 to 6 at intervals of 0.25. In the two MC-
SVM variants, a linear kernel was also employed, because it
showed the best performance. The parameters for NSC and
MC-SVM were optimized based on just the training data
sets to avoid information leakage from the test data sets.
For each data set and each method, training and test accuracies were evaluated within a fivefold cross-validation framework, where for each split of the five folds, the ratios of all classes were maintained to be similar across folds. The mean and standard deviation of the results
are shown in Table 3. For the SRBCT data set, all classifiers
exhibited 100 percent accuracy at both training and test; so,
the results are not shown in the table.
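The evaluation protocol can be reproduced along the following lines; the use of scikit-learn's StratifiedKFold is our tooling choice, not the authors', and fit_predict stands for any of the compared classifiers.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

def stratified_cv_accuracy(X, y, fit_predict, n_splits=5, seed=0):
    """Fivefold CV keeping the class ratios similar across folds;
    fit_predict(X_tr, y_tr, X_te) must return predicted labels for X_te."""
    skf = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=seed)
    accs = [np.mean(fit_predict(X[tr], y[tr], X[te]) == y[te])
            for tr, te in skf.split(X, y)]
    return np.mean(accs), np.std(accs)
```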
Comparing the proposed six combinations, the $B_{AA}$ coding was often found to be better than the others; it was the best for the thyroid and esophageal data sets and comparable to the best for the SRBCT and leukemia data sets. For the three data sets other than the esophageal one, the training CV accuracy of all six combinations reached the upper limit of 1.0. The SRBCT data set may simply be easy enough to be classified perfectly even at test time, while for the thyroid and leukemia data sets, the training CV accuracies of 1.0 might come from overfitting, because the test CV accuracies did not reach 1.0.
Either of the two cases above can be a hazard to our WMAP procedure, because the training accuracy is so saturated that the room for the weight optimization is restricted. This is why the test CV accuracies were the same between WMAP and MAP in some cases. Even in such saturated cases, however, the optimization with respect to the soft-max accuracy can improve the aggregation of multiple binary classifiers, especially when there are many constituent binary classifiers, as can be seen in Section 6. Compared to the existing state-of-the-art multiclass classification methods, we found that our proposed methods, especially with weight optimization (WMAP), exhibited better or comparable performance.
5.3 Experiment 3: Applications to a Larger Class Problem
When the number of classes $K$ is large, the initial setting of the exhaustive coding (AA) becomes computationally intractable, because the number of targets $\#B_{AA}(K)$ in the AA coding increases exponentially with respect to $K$:

$$\#B_{AA}(K) = \sum_{i=1}^{\lceil K/2 \rceil} \sum_{j=i}^{K-i} \frac{1}{1 + \delta_{ij}} \binom{K}{i+j} \binom{i+j}{j}, \qquad (9)$$

where $\delta_{ij}$ is Kronecker's delta.

2. NSC is the method implemented in the PAM (Prediction Analysis for Microarrays) software package (http://www-stat.stanford.edu/~tibs/PAM/).

TABLE 3
Cross-Validation Accuracies for Three Real Gene Expression Data Sets
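As a quick check of (9), the snippet below evaluates the double sum and compares it with the closed form (3^K - 2^(K+1) + 1)/2 (our addition; it counts unordered pairs of disjoint nonempty subsets of C directly); both give the 2,375,101 targets quoted below for K = 14.

```python
from math import ceil, comb

def n_targets_AA(K):
    """Number of targets in the exhaustive (AA) coding, Eq. (9)."""
    total = 0.0
    for i in range(1, ceil(K / 2) + 1):
        for j in range(i, K - i + 1):
            total += comb(K, i + j) * comb(i + j, j) / (1 + (i == j))
    return round(total)

for K in (3, 4, 14):
    assert n_targets_AA(K) == (3**K - 2**(K + 1) + 1) // 2
print(n_targets_AA(14))   # -> 2375101
```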
In this section, we apply our code optimization based on the WMAP framework to a 14-class problem [3] on the global cancer map (GCM) data set, where the number of targets in the initial AA coding becomes as high as 2,375,101. Although the original GCM data set consists of expression profiles of 16,063 genes for 144 training samples and 54 test samples of 14 common cancer types, we merged the training and test samples to construct a data set of 190 samples (eight test samples of another cancer type were removed) in this experiment.
Although WMAP would in principle be able to find an optimal and possibly sparse (or graded) subset by starting from the whole AA code matrix, this is difficult in practice, partly because of the lack of a sufficient number of data and partly because of the computational intractability of a set whose number of elements grows exponentially. By applying sparse random coding (SR) to this 14-class problem, we obtained a set of 58 targets, where the number of targets was determined by the rough standard $\lceil 15 \log_2 K \rceil$ proposed in [15], and the 58 targets were selected from 10,000 random sets of 58 targets so that any two code words were the furthest apart from each other in the sense of the Hamming distance. The details are described in Appendix D. The objective of our WMAP method here is to obtain an appropriate graded code matrix starting from the initial set SR, and hence to optimize the weight values for this random but rigid code matrix.
We evaluated six methods, namely, the combinations of the three designs of the code matrix, 1R, 11, and SR, and the two procedures, MAP and WMAP, for the 14-class problem by means of fivefold cross validation. Each binary classifier was implemented as a linear kernel SVM employing all genes. The multiclass SVMs implemented as MC-SVM(CS) and MC-SVM(WW) and the NSC method were also compared under the same conditions as in Experiment 2. Table 4 shows the results.
Simple voting by linear kernel SVMs showed better performance with MAP-1R than with MAP-11. MAP-SR was also better than MAP-11 but did not exceed MAP-1R; these results are consistent with those of [3]. However, the results were different when the weight optimization was performed; WMAP-SR became better than MAP-1R. The performances of WMAP-1R and WMAP-11 were not improved by the weight optimization, probably because the training accuracy was saturated, suggesting that this 14-class classification problem contains rather few data in comparison to the complexity of the multiclass problem. Even in such a saturated condition, and when the initial code matrix includes a lot of targets, our weight optimization method works well, as confirmed by the improvement of WMAP-SR over MAP-SR. The other three state-of-the-art alternatives did not exceed the results of WMAP-SR. From this experiment, we can see that an appropriate rigid code matrix may improve performance, through appropriate decoding based on the MAP method, over the simple voting heuristics of 1R or 11, and that introducing the weight optimization to seek the optimal graded set from the rigid one can further improve the performance.
6 DISCUSSION AND CONCLUSIONS
The statistical model of the MAP decoder is an expanded version of the pairwise coupling method of binary probability estimates [9], which used only $B_{11}$, and is conceptually similar to the method in [10], which was also an expansion of [9], while the MAP method incorporates an additional term that naturally represents prior knowledge of the class distribution. In [10], $B_{1R}$, $B_{11}$, and random targets from $B_{AA}$ were dealt with in combination, but $B_{AA}$ itself was not considered, and the optimization of code matrices (the optimal coding problem) was left unsolved. Our weight optimization method used in WMAP successfully obtained a graded code matrix starting from any code matrix without any prior knowledge about the data, and this method could be one answer to the optimal coding problem. This feature is essential for practical tumor classification problems using gene expression profiling, because we often do not have much information on the data.
When $B_{AA}$, which contains all possible targets, was used as the code matrix to be weighted, the WMAP method often showed the best performance. Especially when $B_{AA}$ cannot be used, as in the 14-class GCM problem, various sets of targets can be considered by reducing $B_{AA}$, but determining which code matrix shows the best performance among them requires some heuristics. In the current study, we used the SR method [15], but the weight estimation method still worked well when applied to the reduced code matrix, implying that our method could solve, at least to some extent, the optimal coding problem by searching the analog coding space restricted within the initial binary coding. The current results suggest that the larger the code matrix to be optimized, the better the performance becomes, though the optimization of large code matrices requires heavy computation. Although it is important, in practice, to seek a better configuration of the initial code matrix than the exhaustive coding $B_{AA}$ or the SR coding $B_{SR}$ when the class number is not small, this problem is not the target of our current study but a future one, because our code optimization technique can in principle employ any initial setting of the code matrix.
In our code optimization method, overtraining of the weights may occur, especially when the number of targets is large. This problem could be avoided by using large training data sets, but many actual gene expression analyses require handling small data sets. One possible way of dealing with this problem is to use parameter optimization techniques such as the leave-two-out (LTO) method [26], which is a hierarchical cross-validation approach. When the training accuracy of the MAP decoder reaches its upper limit, i.e., saturation occurs, the room for taking advantage of the WMAP method is restricted. This is another aspect of overtraining, and this tendency is more apparent when the binary classification method is strong. One way of solving this overtraining problem is to split the training data set into two or more subsets and to train the binary classifiers and adjust the weights individually using different data subsets.

TABLE 4
Performance Comparison of MAP and WMAP Using a 14-Class GCM Data Set
The linear-kernel SVMs used in this study have been preferred in many bioinformatics studies [27], [28]. On the other hand, we can also tune the parameters of the kernel and/or use other kernels in the SVM, or use more sophisticated binary predictors such as AdaBoost [29]. The flexibility of our framework, which can employ any binary classification algorithm, could yield further performance gains and work especially well when a relatively large amount of data is available. For example, when we use a quadratic-kernel SVM as a unit classifier (the kernel function is $k(x, x') = (1 + x^{\mathrm{T}} x')^2$), the resultant cross-validation accuracy for the thyroid data set was 1 in training and 0.780 in test. Note that this improvement was due to the change in the unit classifier rather than to the weight optimization. If we employ highly adjustable binary classifiers and an appropriate code matrix, the classifiers are well adapted to the given data set, and the decoder can easily integrate them. In such a case, there is little room for improvement by WMAP's code optimization because of the saturation of the gain function. Accordingly, the performance improvement by WMAP depends on both the data set (and the class structure underlying the data) and the choice of binary classifiers. Still, it is important that our code optimization technique can be employed with any choice of binary classifiers.
From the viewpoint of bioinformatics, our WMAP method may also be valuable for feature extraction, alongside existing classifiers such as the NSC method [25]. The optimized weights can be interpreted as a numerical feature vector in the binary classifier space, each of whose elements characterizes the degree of the corresponding binary classifier's contribution to the multiclass classification result, i.e., disease (medical or phenotypic) information in cancer classification problems. In other words, the code matrix optimized by WMAP would enable us to observe the interclass relationships in a multiclass classification problem from the perspective of several binary classification problems. For example, the code optimized by our WMAP method can provide information on the geometrical relationship of multiple classes, as seen in the experiment using a synthesized data set (Experiment 1). On the other hand, the centroids of NSC can be construed as characteristic pattern vectors in the gene expression space, each of which represents the corresponding class label and each of whose elements indicates a gene's responsibility for each class label. While these are class-specific patterns, they do not represent interclass relationships directly. Consequently, WMAP and NSC both extract characteristic features from data sets, but these features contain different types of information. As stated above, the WMAP method itself does not select informative genes for cancer classification tasks. However, it is possible to obtain some evidence about gene contributions by using binary classification algorithms that incorporate gene selection processes, such as the weighted voting method [1] and SVM-RFE [30], as the unit classifiers of WMAP: the higher ranked or surviving genes in a binary classifier that has a larger optimized weight are supposed to have substantial influence on the multiclass classification result. To do this, we must not only optimize the code matrix but also tune the gene selection parameters of the binary classifiers; this parameter tuning, however, is an issue independent of our approach. We expect that our approach will help elucidate biological meaning through linkages between the optimized code words and the class labels in gene expression analyses.
The novel approaches introduced in this study show promise as a means to differentiate similar tumor types of the same origin, such as thyroid and esophageal cancers. Before a final determination of their efficacy, a number of confirmatory experiments are necessary. Nevertheless, we believe that our algorithms based on ECOC coding/decoding will contribute to providing advanced tools for the pathological diagnosis of cancer in the near future.
APPENDIX A
PROBABILITY ESTIMATION FROM DECISION VALUES
OF BINARY CLASSIFIERS
In order to convert a discriminant function value from a binary predictor (in this study, a binary SVM) into a probability estimate, we employed a regression-based method proposed by Platt [17]. Let $d_j(x) \in \mathbb{R}$ be the discriminant function value of the $j$th predictor constructed from the partial training data $D_j = \{(x_i, y_i)\}_{i \in I_j}$, where $I_j$ is the index set of samples used to make this predictor. The logistic regression model assumes that the probabilistic guess $\theta^{(i)}_j = \hat{P}(y_i \in P_j \mid x_i,\, y_i \in P_j \cup N_j)$ is given by a parametric sigmoidal function of $d_j(x)$:

$$\theta^{(i)}_j = \frac{1}{1 + \exp\left(a_j d_j(x_i) + b_j\right)},$$

where $a_j$ and $b_j$ are the model parameters specific to the $j$th target. These parameters were estimated by maximizing the log likelihood of the transformed training data $D'_j = \{(d_j(x_i), y_i)\}_{i \in I_j}$:

$$\max \sum_{i \in I_j} \left\{ r^{(i)}_j \log \theta^{(i)}_j + \left(1 - r^{(i)}_j\right) \log\left(1 - \theta^{(i)}_j\right) \right\}, \qquad (10)$$

where $r^{(i)}_j$ is the target probability defined as

$$r^{(i)}_j = \begin{cases} r^{+}_j & \text{if } y_i \in P_j, \\ r^{-}_j & \text{if } y_i \in N_j, \end{cases}$$

where $r^{+}_j$ and $r^{-}_j$ are explained below. We used a gradient method to maximize (10) with respect to $a_j$ and $b_j$.
Platt's method incorporates the following two techniques to avoid overfitting to the training data. First, the estimated target probabilities $r^{+}_j = (N^{+}_j + 1)/(N^{+}_j + 2)$ and $r^{-}_j = 1/(N^{-}_j + 2)$ were used instead of the typical choices $r^{+}_j = 1$ and $r^{-}_j = 0$, where $N^{+}_j$ and $N^{-}_j$ are the numbers of samples belonging to $P_j$ and to $N_j$, respectively. This setting is especially effective when dealing with unbalanced training data sets. Second, cross validation was used for generating unbiased training data for the sigmoidal fitting. It should be noted that a naively made transformed data set $D'_j$ could be biased by the effects of the optimization of the $j$th predictor. We used fivefold cross validation in this study. Namely, the training data set $D_j$ was divided into five blocks, five binary predictors were each trained using four out of the five blocks, and then $d_j(x)$ was evaluated on the remaining block for each of the five predictors. The concatenation of the evaluations over the five disjoint blocks was used as the unbiased transformed training data set $D'_j$.
APPENDIX B
PSEUDOCODES FOR MAP AND WMAP
To provide the procedures of MAP and WMAP in a step-by-step manner, we present their pseudocodes here. Algorithm 1 is the pseudocode for MAP, which estimates the multiclass membership probability from the set of binary membership probabilities.
Algorithm 1. MAP class membership probability estimation
1: procedure ESTIMATEP($\Theta$, $p^0$, $w$, $C$, $B$)
2:   if $p^0$ is NULL then
3:     for all $y \in C$ do   ▷ Initialize the class membership probability if $p^0$ is not specified
4:       $p^0_y \leftarrow 1/\#C$
5:     end for
6:   end if
7:   $\Psi^0 \leftarrow \Psi(p^0 \mid \Theta, w, B)$   ▷ Calculate the initial objective function value according to (4)
8:   $T \leftarrow 0$
9:   repeat
10:    $T \leftarrow T + 1$
11:    ▷ Update $p$ by the steepest descent method
12:    for all $y \in C$ do
13:      $\log p^T_y \leftarrow \log p^{T-1}_y + \gamma\, p^{T-1}_y\, \partial \Psi(p \mid \Theta, w, B)/\partial p_y \big|_{p = p^{T-1}}$   ▷ The step size $\gamma$ is determined by a line search algorithm
14:    end for
15:    Normalize $p^T$ so that $\sum_{y \in C} p^T_y = 1$
16:    $\Psi^T \leftarrow \Psi(p^T \mid \Theta, w, B)$   ▷ Update the objective function value
17:  until $T > \mathrm{MaxIter}_p$ or $\Psi^T - \Psi^{T-1} < \mathrm{Threshold}_p$   ▷ $\mathrm{MaxIter}_p$ and $\mathrm{Threshold}_p$ are arbitrary constants
18:  $\tilde{p} \leftarrow p^T$
19:  return $\tilde{p}$
20: end procedure
In this code, the procedure ESTIMATEP takes a set of binary membership probabilities $\Theta = \{\theta_j\}_{j \in B}$, an arbitrary initial value of the multiclass membership $p^0 = \{p^0_y\}_{y \in C}$, a fixed weight vector $w = \{w_j\}_{j \in B}$, the set of class labels $C = \{1, \ldots, K\}$, and the code matrix $B$ as inputs, and outputs the estimated multiclass membership $\tilde{p}$ according to the steepest descent method. Since $p_y > 0$, we used the gradient along $\log p$, which stabilizes the optimization (Algorithm 1, lines 12-14). If we do not specify $p^0$, the elements of $p^0$ are automatically set to the uniform probability $1/\#C$ (Algorithm 1, lines 2-6).
Algorithm 2 is the pseudocode for WMAP, which optimizes the weights for the given code matrix $B$.
Algorithm 2. WMAP weight optimization
1: procedure TRAINW($\Theta = \{\Theta^{(i)}\}_{i=1,\ldots,N}$, $T = \{t^{(i)}\}_{i=1,\ldots,N}$, $C$, $B$)
2:   for all $j \in B$ do   ▷ Initialize the weights $w = \{w_j\}_{j \in B}$
3:     $w^0_j \leftarrow 1/\#B$
4:   end for
5:   $P^0 \leftarrow$ ESTIMATEPALL($\Theta$, NULL, $w^0$, $C$, $B$)
6:   $G^0 \leftarrow G(P^0, T)$   ▷ Calculate the initial objective function value according to (5)
7:   $\tau \leftarrow 0$
8:   repeat
9:     $\tau \leftarrow \tau + 1$
10:    $w^\tau \leftarrow \arg\max_w G(P^{\tau-1}, T)$ under $\sum_{j \in B} w_j = 1$   ▷ Update $w$ by a gradient ascent method based on the gradient (12)
11:    $P^\tau \leftarrow$ ESTIMATEPALL($\Theta$, $P^{\tau-1}$, $w^\tau$, $C$, $B$)   ▷ Update $P$ with the new weights
12:    $G^\tau \leftarrow G(P^\tau, T)$   ▷ Update the objective function value
13:  until $\tau > \mathrm{MaxIter}_w$ or $G^\tau - G^{\tau-1} < \mathrm{Threshold}_w$   ▷ $\mathrm{MaxIter}_w$ and $\mathrm{Threshold}_w$ are arbitrary constants
14:  $\tilde{w} \leftarrow w^\tau$
15:  return $\tilde{w}$
16: end procedure
17: procedure ESTIMATEPALL($\{\Theta^{(i)}\}_{i=1,\ldots,N}$, $\{p^{(i)}\}_{i=1,\ldots,N}$, $w$, $C$, $B$)
18:  for $i = 1$ to $N$ do
19:    $p'^{(i)} \leftarrow$ ESTIMATEP($\Theta^{(i)}$, $p^{(i)}$, $w$, $C$, $B$)
20:  end for
21:  return $\{p'^{(i)}\}_{i=1,\ldots,N}$
22: end procedure
This code consists of two procedures: a main procedure TRAINW and an auxiliary procedure ESTIMATEPALL, which is a wrapper of the ESTIMATEP procedure of Algorithm 1. TRAINW takes the set of binary memberships $\Theta = \{\Theta^{(i)}\}_{i=1,\ldots,N}$ of the $N$ samples, the corresponding true class label vectors $T = \{t^{(i)}\}_{i=1,\ldots,N}$, $C$, and $B$, and outputs the optimized weight vector $\tilde{w}$. ESTIMATEPALL is used for the inner optimization given by (8), which updates the multiclass membership probabilities of all samples, $P = \{p^{(i)}\}_{i=1,\ldots,N}$, for the current weight estimate $w^\tau$ (Algorithm 2, lines 8-13). TRAINW performs the outer optimization given by (7).
APPENDIX C
DERIVATION OF WMAP METHOD
The optimization of (8) can be executed simply by the MAP method, but we need a technique to optimize (7), because $G$ depends on $w$ indirectly through $\tilde{p} = \{\tilde{p}^{(i)}\}$. We define a function $f(w, p)$ of $w$ and $p = \{p^{(i)}\}$:

$$f(w, p) = \frac{\partial}{\partial p} \tilde{\Psi}(p \mid w),$$

where $\tilde{\Psi}$ is the sum of $\Psi$ and the Lagrange multiplier term. The stationary condition of $\tilde{\Psi}$ with respect to $p$,

$$f(w, \tilde{p}) = 0,$$

provides the $\tilde{p}(w)$ that satisfies (8). Then,

$$f(w + dw,\, \tilde{p} + dp) = f(w, \tilde{p}) + \frac{\partial}{\partial w} f(w + \epsilon\, dw,\, \tilde{p} + \epsilon\, dp)\, dw + \frac{\partial}{\partial \tilde{p}} f(w + \epsilon\, dw,\, \tilde{p} + \epsilon\, dp)\, dp = \sum_{j \in B} \frac{\partial f}{\partial w_j}\, dw_j + \sum_{y \in C} \sum_{i=1}^{N} \frac{\partial f}{\partial p^{(i)}_y}\, dp^{(i)}_y = 0 \qquad (11)$$

gives another solution of (8), $\tilde{p} + dp$, when $w$ has an infinitesimal change $dw$, where $0 < \epsilon < 1$.
For descriptive simplicity, we introduce a matrix $A = \{a(j, \mu)\}$, where the indices $j$ and $\mu$ correspond to a target $j \in B$ and an element $p^{(i)}_y$ of $p$, respectively. Each element of the matrix $A$ is defined by

$$a(j, (i, y)) = \frac{\partial^2 \tilde{\Psi}}{\partial w_j\, \partial p^{(i)}_y}.$$

We also introduce a square matrix $H = \{h(\mu, \mu')\}$ whose elements are defined by

$$h((i, y), (i', y')) = \frac{\partial^2 \tilde{\Psi}}{\partial p^{(i)}_y\, \partial p^{(i')}_{y'}},$$

where $\mu$ and $\mu'$ index elements $p^{(i)}_y$ of $p$. Using these notations, solution condition (11) is expressed as the implicit relation

$$A\, dw + H\, dp = 0,$$

and when $dw \to 0$,

$$\left\{ \frac{dp^{(i)}_y}{dw_j} \right\} = \frac{dp}{dw} = -A H^{-1}.$$

Using this derivative,

$$\frac{\partial G}{\partial w} = \frac{\partial \tilde{p}}{\partial w} \frac{\partial G}{\partial \tilde{p}} = -A H^{-1} \frac{\partial G}{\partial p}. \qquad (12)$$

Each element of $\partial G / \partial p$ is written as

$$\frac{\partial G}{\partial p^{(i)}_y} = \left[ 1 - \frac{\exp\left(\beta p^{(i)}_y\right)}{\sum_{y' \in C} \exp\left(\beta p^{(i)}_{y'}\right)} \right] \frac{\exp\left(\beta p^{(i)}_y\right)}{\sum_{y' \in C} \exp\left(\beta p^{(i)}_{y'}\right)}\, \beta\, t^{(i)}_y, \qquad (13)$$

and then (7) can be optimized by a gradient ascent method based on the gradient (12).
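For concreteness, a small sketch of (13) for a single sample follows (ours); beta and the one-hot vector t follow Section 4, and the function simply evaluates the stated expression with a numerically stabilized soft-max.

```python
import numpy as np

def dG_dp(p, t, beta=2000.0):
    """Eq. (13): elementwise gradient of the soft-max gain for one sample;
    p is the membership estimate (length K), t the one-hot label vector."""
    z = np.exp(beta * (p - p.max()))   # shift for numerical stability
    s = z / z.sum()                    # smx_y(p)
    return (1.0 - s) * s * beta * t
```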
APPENDIX D
SPARSE RANDOM CODING
The SR method, which enables us to design efficient initial code matrices for large-class problems, was proposed by Allwein et al. [15]. Code matrices of SR for $K$-class problems consist of $J = \lceil 15 \log_2 K \rceil$ nonoverlapping targets (row vectors). Each element is assigned a value from {"1", "0", "*"} with certain probabilities: "1" or "0" each with probability 1/4, and "*" with probability 1/2. To obtain good error-correcting properties, the minimum and average distances between each pair of code words (columns of the code matrix) should be large. For calculating the distance between a pair of code words $u, v \in \{\text{"1"}, \text{"0"}, \text{"*"}\}^{J \times 1}$, we used a generalized Hamming distance:

$$\Delta = \sum_{i=1}^{J} \delta_i, \qquad \delta_i = \begin{cases} 0 & \text{if } u_i = v_i \ \wedge\ u_i \ne \text{"*"} \ \wedge\ v_i \ne \text{"*"}, \\ 1 & \text{if } u_i \ne v_i \ \wedge\ u_i \ne \text{"*"} \ \wedge\ v_i \ne \text{"*"}, \\ 0.5 & \text{if } u_i = \text{"*"} \ \vee\ v_i = \text{"*"}. \end{cases}$$

After generating 10,000 code matrices according to the above process, we selected the optimal code matrix, i.e., the one whose minimum $\Delta$ value was maximal among them, checking that no column or row contained only "*".
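The selection procedure can be sketched as follows, under assumptions of ours: "*" is encoded as -1, duplicate rows are not explicitly filtered (the original requires nonoverlapping targets), and ties in the minimum distance are broken by first occurrence rather than by the average distance; n_trials can be lowered for a quick run.

```python
import numpy as np

rng = np.random.default_rng(0)

def gen_hamming(u, v):
    """Generalized Hamming distance over {1, 0, *} with '*' encoded as -1:
    a match costs 0, a mismatch costs 1, and any '*' position costs 0.5."""
    star = (u == -1) | (v == -1)
    return float(np.sum(np.where(star, 0.5, u != v)))

def sparse_random_code(K, n_trials=10000):
    """SR coding [15]: J = ceil(15*log2(K)) rows with P('1') = P('0') = 1/4
    and P('*') = 1/2; keep the matrix maximizing the minimum pairwise
    code-word (column) distance, rejecting all-'*' rows or columns."""
    J = int(np.ceil(15 * np.log2(K)))
    best, best_min = None, -1.0
    for _ in range(n_trials):
        M = rng.choice([1, 0, -1], size=(J, K), p=[0.25, 0.25, 0.5])
        if (M == -1).all(axis=0).any() or (M == -1).all(axis=1).any():
            continue
        dmin = min(gen_hamming(M[:, a], M[:, b])
                   for a in range(K) for b in range(a + 1, K))
        if dmin > best_min:
            best, best_min = M, dmin
    return best

print(sparse_random_code(14).shape)   # -> (58, 14), as used in Section 5.3
```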
ACKNOWLEDGMENTS
The authors would like to thank the anonymous reviewers
for their helpful comments and suggestions. This work was
partly supported by the 21st Century COE Research
Program: Exploiting New Frontiers in Bioscience and by a
Grant-in-Aid for Scientific Research on Priority Areas:
Deepening and Expansion of Statistical Mechanical Infor-
matics, both from the Ministry of Education, Culture,
Sports, Science, and Technology, Japan.
REFERENCES
[1] T.R. Golub, D.K. Slonim, P. Tamayo, C. Huard, M. Gaasenbeek,
J.P. Mesirov, H. Coller, M.L. Loh, J.R. Downing, M.A. Caligiuri,
C.D. Bloomfield, and E.S. Lander, Molecular Classification of
Cancer: Class Discovery and Class Prediction by Gene Expression
Monitoring, Science, vol. 286, no. 5439, pp. 531-537, Oct. 1999.
[2] J. Khan, J.S. Wei, M. Ringner, L.H. Saal, M. Ladanyi, F.
Westermann, F. Berthold, M. Schwab, C.R. Antonescu, C.
Peterson, and P.S. Meltzer, Classification and Diagnostic Predic-
tion of Cancers Using Gene Expression Profiling and Artificial
Neural Networks, Nature Medicine, vol. 7, no. 6, pp. 673-679, June
2001.
[3] S. Ramaswamy, P. Tamayo, R. Rifkin, S. Mukherjee, C. Yeang, M.
Angelo, C. Ladd, M. Reich, E. Latulippe, J.P. Mesirov, T. Poggio,
W. Gerald, M. Loda, E.S. Lander, and T.R. Golub, Multiclass
Cancer Diagnosis Using Tumor Gene Expression Signatures,
Proc. Nat'l Academy of Sciences USA, vol. 98, no. 26, pp. 15149-15154,
Dec. 2001.
[4] I. Hedenfalk, M. Ringner, A. Ben-Dor, Z. Yakhini, Y. Chen, G.
Chebil, R. Ach, N. Loman, H. Olsson, P. Meltzer, A. Borg, and J.
Trent, Molecular Classification of Familial non-BRCA1/BRCA2
Breast Cancer, Proc. Nat'l Academy of Sciences USA, vol. 100, no. 5,
pp. 2532-2537, Mar. 2003.
[5] B. Schoelkopf, C. Burges, and V. Vapnik, Extracting Support Data
for a Given Task, Proc. First Int'l Conf. Knowledge Discovery and
Data Mining, pp. 252-257, 1995.
[6] B. Schoelkopf, C. Burges, and A. Smola, Advances in Kernel Methods: Support Vector Learning. MIT Press, 1999.
[7] T.G. Dietterich and G. Bakiri, Solving Multiclass Learning
Problems via Error-Correcting Output Codes, J. Artificial Intelli-
gence Research, vol. 2, pp. 263-286, 1995.
[8] E.L. Allwein, R.E. Schapire, and Y. Singer, Reducing Multiclass
to Binary: A Unifying Approach for Margin Classifiers, Proc. 17th
Int'l Conf. Machine Learning, pp. 9-16, 2000.
[9] T. Hastie and R. Tibshirani, Classification by Pairwise Coupling,
Advances in Neural Information Processing Systems, vol. 10, pp. 507-
513, 1998.
[10] B. Zadrozny, Reducing Multiclass to Binary by Coupling
Probability Estimates, Advances in Neural Information Processing
Systems, vol. 14, pp. 1041-1048, 2001.
[11] T. Li, C. Zhang, and M. Ogihara, A Comparative Study of Feature
Selection and Multiclass Classification Methods for Tissue
Classification Based on Gene Expression, Bioinformatics, vol. 20,
no. 15, pp. 2429-2437, Oct. 2004.
[12] A. Statnikov, C.F. Aliferis, I. Tsamardinos, D. Hardin, and S. Levy,
A Comprehensive Evaluation of Multicategory Classification
Methods for Microarray Gene Expression Cancer Diagnosis,
Bioinformatics, vol. 21, no. 5, pp. 631-643, 2005.
[13] J. Weston and C. Watkins, Multi-Class Support Vector Machine,
technical report, Univ. of London, 1998.
[14] K. Crammer and Y. Singer, On the Algorithmic Implementation
of Multiclass Kernel-Based Vector Machines, J. Machine Learning
Research, vol. 2, pp. 265-292, 2001.
[15] E.L. Allwein, R.E. Schapire, and Y. Singer, Reducing Multiclass
to Binary: A Unifying Approach for Margin Classifiers,
J. Machine Learning Research, vol. 1, pp. 113-141, 2001.
[16] L. Shen and E.C. Tan, Reducing Multiclass Cancer Classification
to Binary by Output Coding and SVM, Computational Biology and
Chemistry, vol. 30, no. 1, pp. 63-71, Feb. 2006.
[17] J. Platt, Probabilistic Outputs for Support Vector Machines and
Comparison to Regularized Likelihood Methods, Advances in
Large Margin Classifiers, A.J. Smola, P. Bartlett, B. Schoelkopf, and
D. Schuurmans, eds., pp. 61-74, 2000.
[18] K. Kato, Adaptor-Tagged Competitive PCR: A Novel Method for
Measuring Relative Gene Expression, Nucleic Acids Research,
vol. 25, no. 22, pp. 4694-4696, Nov. 1997.
[19] E. Saxen, K. Franssila, O. Bjarnason, T. Normann, and N. Ringertz,
Observer Variation in Histologic Classification of Thyroid
Cancer, Acta Pathologica et Microbiologica Scandinavica A,
vol. 86A, no. 6, pp. 483-486, Nov. 1978.
[20] A.S. Fassina, M.C. Montesco, V. Ninfo, P. Denti, and G. Masarotto,
Histological Evaluation of Thyroid Carcinomas: Reproducibility
of the WHO Classification, Tumori, vol. 79, no. 5, pp. 314-320, Oct.
1993.
[21] Z.W. Baloch, S. Fleisher, V.A. LiVolsi, and P.K. Gupta, Diagnosis
of Follicular Neoplasm: A Gray Zone in Thyroid Fine-Needle
Aspiration Cytology, Diagnostic Cytophathology, vol. 26, no. 1,
pp. 41-44, Jan. 2002.
[22] K. Kato, R. Yamashita, R. Matoba, M. Monden, S. Noguchi, T.
Takagi, and K. Nakai, Cancer Gene Expression Database
(CGED): A Database for Gene Expression Profiling and Accom-
panying Clinical Information of Human Cancer Tissues, Nucleic
Acids Research, vol. 33, pp. D533-D536, 2005.
[23] K. Taniguchi, T. Takano, A. Miyauchi, K. Koizumi, Y. Ito, Y.
Takamura, M. Ishitobi, Y. Miyoshi, T. Taguchi, Y. Tamaki, K. Kato,
and S. Noguchi, Differentiation of Follicular Thyroid Adenoma
from Carcinoma by Gene Expression Profiling with Adapter-
Tagged Competitive Polymerase Chain Reaction, Oncology,
vol. 69, pp. 428-435, 2005.
[24] S.A. Armstrong, J.E. Staunton, L.B. Silverman, R. Pieters, M.L. den
Boer, M.D. Minden, S.E. Sallan, E.S. Lander, T.R. Golub, and S.J.
Korsmeyer, MLL Translocations Specify a Distinct Gene Expres-
sion Profile that Distinguishes a Unique Leukemia, Nature
Genetics, vol. 30, no. 1, pp. 41-47, Jan. 2002.
[25] R. Tibshirani, T. Hastie, B. Narasimhan, and G. Chu, Diagnosis of
Multiple Cancer Types by Shrunken Centroids of Gene Expres-
sion, Proc. Nat'l Academy of Sciences USA, vol. 99, no. 10, pp. 6567-
6572, May 2002.
[26] M. Ohira, S. Oba, Y. Nakamura, E. Isogai, S. Kaneko, A.
Nakagawa, T. Hirata, H. Kubo, T. Goto, S. Yamada, Y. Yoshida,
M. Fuchioka, S. Ishii, and A. Nakagawara, Expression Profiling
Using a Tumor-Specific cDNA Microarray Predicts the Prognosis
of Intermediate Risk Neuroblastomas, Cancer Cell, vol. 7, no. 4,
pp. 337-350, Apr. 2005.
[27] T.S. Furey, N. Cristianini, N. Duffy, D.W. Bednarski, M.
Schummer, and D. Haussler, Support Vector Machine Classifica-
tion and Validation of Cancer Tissue Samples Using Microarray
Expression Data, Bioinformatics, vol. 16, no. 10, pp. 906-914, Oct. 2000.
[28] S. Dudoit, J. Fridlyand, and T.P. Speed, Comparison of
Discrimination Methods for the Classification of Tumors Using
Gene Expression Data, J. Am. Statistical Assoc., vol. 97, pp. 77-87,
2002.
[29] Y. Freund and R. Schapire, Experiments with a New Boosting
Algorithm, Proc. Int'l Conf. Machine Learning (ICML '96), pp. 148-
156, 1996.
[30] I. Guyon, J. Weston, S.M.D. Barnhill, and V. Vapnik, Gene
Selection for Cancer Classification Using Support Vector Ma-
chines, Machine Learning, vol. 46, pp. 389-422, 2002.
Naoto Yukinawa received the BS degree in
bioscience from the Tokyo Institute of Technol-
ogy in 2001 and the PhD degree from the Nara
Institute of Science and Technology in 2006,
where he studied statistical machine learning approaches for system identification and classification of gene expression profiles. Since 2006,
he has been working as a researcher in the
group of Dr. Shin Ishii at the Nara Institute of
Science and Technology. His current research
interests include machine learning and its applications to transcriptomic and proteomic analyses.
Shigeyuki Oba received the MS degree in
geophysics from Kyoto University in 1998 and
the MS and PhD degrees in information
science from the Graduate School of Informa-
tion Science, Nara Institute of Science and
Technology in 2001 and 2002, respectively. He
has been with the Nara Institute of Science and Technology since 2002 as a researcher and since 2003 as an assistant professor. His current research interests are machine learning and its application to bioinformatics.
Kikuya Kato received the MD degree and the
PhD degree in molecular genetics and biochem-
istry from Osaka University Medical School in
1980 and 1984, respectively. He is currently the
director of the Research Institute, Osaka Med-
ical Center for Cancer and Cardiovascular
Diseases. His current interest is molecular
genetics and transcriptome analysis of human
cancer tissues.
Shin Ishii received the BE, ME, and PhD
degrees from the University of Tokyo in 1986,
1988, and 1997, respectively. He is currently a
professor at the Graduate School of Informatics,
Kyoto University, after a 10-year career at the
Graduate School of Information Science, Nara
Institute of Science and Technology. He has
been interested in statistical bioinformatics and
systems neurobiology and has approached these
areas from both basic and practical viewpoints.